Implement a rollback strategy for Amazon Aurora PostgreSQL upgrades using Amazon RDS Blue/Green deployments

Amazon Aurora PostgreSQL-Compatible Edition is a fully managed relational database engine that offers high performance and availability. It supports managed blue/green deployments, which are designed to minimize downtime and reduce risk during updates such as engine version upgrades or system patches. A blue/green deployment creates a staging environment that is kept in sync through logical replication, so production changes can be deployed and tested safely. The blue environment is the current production database, and the green environment carries the updates. Once the changes are validated, the green environment can be promoted to production without changing the application endpoint.

Despite meticulous planning and testing in non-production environments, unforeseen issues may arise after a version upgrade. For instance, a schema change that performs flawlessly in staging might fail in production because of differences in real-world data patterns or application queries that were not exercised during testing. Performance can also degrade under actual traffic and workloads. In such scenarios, a rollback plan is crucial for swiftly restoring service stability. Although the managed Blue/Green deployment feature does not currently offer built-in rollback functionality, alternative solutions for version management can be implemented.

This article outlines a method for manually establishing a rollback cluster using self-managed logical replication, which maintains synchronization with the newer version following an Amazon RDS Blue/Green deployment switchover. The rollback cluster serves as a backup option, allowing for a reversion to the original version if necessary.

Solution overview

The following diagram illustrates the high-level workflow of this solution.

Before the switchover, two clusters exist:

  • Blue cluster – The existing production database cluster
  • Green cluster – The mirrored and synchronized staging environment derived from the blue cluster

After the switchover, three clusters are present:

  • Old blue cluster – The original production cluster (previously the blue cluster)
  • New blue cluster – The new version of the production cluster, where the workload will operate (previously the green cluster)
  • Blue prime (rollback) cluster – A clone of the old blue cluster, synchronized with the new blue cluster data (designated as the rollback cluster)

The workflow steps are as follows:

  1. Create a blue/green deployment
  2. Cease traffic on the blue cluster and perform switchover to the green cluster
  3. Delete the blue/green deployment
  4. Clone the old blue cluster to create the blue prime (rollback) cluster
  5. Establish logical replication from the new blue cluster to the blue prime (rollback) cluster
  6. Resume traffic to the new blue cluster

In this post, we simulate an Amazon Aurora PostgreSQL-Compatible Edition major version upgrade from version 15.10 to 16.6.

Limitations

  1. Aurora managed Blue/Green deployments do not replicate DDL statements, sequence changes, materialized view refreshes, or the creation or modification of large objects, and they do not replicate updates or deletes on tables without a primary key. For further details, refer to Limitations and considerations for blue/green deployments.
  2. While Aurora managed Blue/Green deployments automatically manage the primary cluster endpoint after switchover, you must handle endpoint changes at the application or DNS level if a rollback to a previous version is necessary.
  3. Setting up the rollback cluster incurs additional downtime.
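
Because updates and deletes are not replicated for tables without a primary key, it can help to list such tables before starting. The following is a minimal sketch, assuming psql is installed and that the connection string you pass in (a hypothetical value such as "host=... dbname=... user=...") points at the blue cluster. The function name list_tables_without_pk is illustrative, and the query covers ordinary tables only.

```shell
# Query listing ordinary tables (relkind 'r') that have no primary key.
# System schemas are excluded; partitioned parents (relkind 'p') are not covered.
NO_PK_SQL="
SELECT n.nspname AS schema_name, c.relname AS table_name
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
    SELECT 1 FROM pg_index i
    WHERE i.indrelid = c.oid AND i.indisprimary
  )
ORDER BY 1, 2;"

# Hypothetical helper: $1 is a libpq connection string for the cluster to inspect.
# Prints one schema.table per line; no output means every table has a primary key.
list_tables_without_pk() {
  psql "$1" -X -At -F '.' -c "$NO_PK_SQL"
}

# Example (connection string is a placeholder):
#   list_tables_without_pk "host={BLUE_ENDPOINT} dbname={DATABASE} user={ADMIN_USER}"
```

Run it once per database in the cluster, since the catalog query only sees the database it is connected to.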

Prerequisites

To implement the solution, the following components are required:

Note: Enabling the logical replication parameter necessitates a reboot of the writer instance. For more information, see Using logical replication with Aurora PostgreSQL DB clusters.

  • A cluster parameter group for the new version database: To facilitate logical replication from the newer version to the older version, it is essential to ensure that the new version (Aurora PostgreSQL 16) has logical replication enabled. The following AWS CLI commands will create a cluster parameter group and enable the logical replication parameter.
    aws rds create-db-cluster-parameter-group \
    --db-cluster-parameter-group-name pg16-blue-green \
    --db-parameter-group-family aurora-postgresql16 \
    --description "Parameter group that contains logical replication settings for Aurora PG 16"
    
    aws rds modify-db-cluster-parameter-group \
    --db-cluster-parameter-group-name pg16-blue-green \
    --parameters "ParameterName='rds.logical_replication',ParameterValue=1,ApplyMethod=pending-reboot"
  • Familiarity with the Aurora cloning feature is necessary; see Cloning a volume for an Amazon Aurora DB cluster.
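
Once a cluster is running with rds.logical_replication enabled and the writer has been rebooted, the setting can be confirmed from a SQL session by checking that wal_level reports logical. The following is a small sketch under that assumption. The function name and the connection string are hypothetical placeholders.

```shell
# Hypothetical helper: $1 is a libpq connection string for the cluster to check.
# Succeeds (exit status 0) only when wal_level is 'logical', which is the
# value expected after rds.logical_replication=1 has taken effect.
logical_replication_enabled() {
  [ "$(psql "$1" -X -At -c 'SHOW wal_level;')" = "logical" ]
}

# Example (connection string is a placeholder):
#   if logical_replication_enabled "host={CLUSTER_ENDPOINT} dbname={DATABASE} user={ADMIN_USER}"; then
#     echo "logical replication is enabled"
#   else
#     echo "wal_level is not logical; check the parameter group and reboot" >&2
#   fi
```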

Create a Blue/Green deployment

The Amazon RDS Blue/Green deployment is managed by AWS, which creates and mirrors resources from the Blue environment to the Green environment while replicating DML changes from Blue to Green using native logical replication. The AWS Command Line Interface (AWS CLI) can be utilized to create an RDS blue/green deployment with the following command, where the source is the Amazon Resource Name (ARN) of the source production database. The following RDS Console screenshot illustrates an existing cluster with logical replication enabled (Blue cluster).

Use the following command to create a Blue/Green deployment with a Green cluster on Amazon Aurora PostgreSQL version 16. The Green cluster must be associated with a cluster parameter group that already has logical replication enabled, such as the one created in the prerequisites.

aws rds create-blue-green-deployment \
   --blue-green-deployment-name my-blue-green-deployment \
   --source arn:aws:rds:{REGION}:{ACCOUNT_NUMBER}:cluster:{CLUSTER_ID} \
   --target-engine-version 16.6 \
   --target-db-cluster-parameter-group-name pg16-blue-green

Once all instances are available, you will have a blue/green deployment featuring both the blue and green clusters.
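
Provisioning the green environment takes time, so a script usually needs to wait before moving on. The following is one possible sketch that polls the deployment status with aws rds describe-blue-green-deployments until it reaches a wanted value. The function names, polling interval, and attempt limit are illustrative assumptions, not part of the original procedure.

```shell
# Print the current status of a blue/green deployment.
# $1: blue/green deployment identifier
bg_status() {
  aws rds describe-blue-green-deployments \
    --blue-green-deployment-identifier "$1" \
    --query 'BlueGreenDeployments[0].Status' \
    --output text
}

# Hypothetical helper: poll until the deployment reaches the wanted status.
# $1: deployment identifier      $2: wanted status (for example AVAILABLE)
# $3: seconds between polls (default 30)   $4: maximum attempts (default 120)
wait_for_bg_status() {
  attempts=0
  while [ "$attempts" -lt "${4:-120}" ]; do
    status=$(bg_status "$1") || return 1
    [ "$status" = "$2" ] && return 0
    case "$status" in
      INVALID_CONFIGURATION|SWITCHOVER_FAILED)
        echo "deployment $1 is in status $status" >&2
        return 1
        ;;
    esac
    attempts=$((attempts + 1))
    sleep "${3:-30}"
  done
  echo "timed out waiting for $1 to reach $2" >&2
  return 1
}

# Example: wait_for_bg_status {BG_RESOURCE_ID} AVAILABLE
```

The same helper can be reused after the switchover by waiting for SWITCHOVER_COMPLETED instead of AVAILABLE.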

Stop traffic and perform the switchover

To promote the green cluster, it is essential to initiate a switchover action. Prior to initiating the switchover, halt database traffic on the blue cluster to ensure data consistency during the creation of the blue prime. VPC Security Groups can be utilized to block inbound and outbound database traffic. After completing the switchover, verify the updated labels on your RDS blue/green deployment.

aws rds switchover-blue-green-deployment \
--blue-green-deployment-identifier {BG_RESOURCE_ID} \
--switchover-timeout "300"
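
Before running the switchover, it may be worth confirming that application traffic has actually stopped on the blue cluster. The following is a rough sketch that counts remaining client sessions from an administrative host that is still allowed to connect. The function name and connection string are hypothetical, and excluding the rdsadmin user is an assumption about which internal sessions to ignore.

```shell
# Count client sessions other than this one. Replication senders and other
# background processes are excluded by the backend_type filter.
ACTIVE_CLIENTS_SQL="
SELECT count(*)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND pid <> pg_backend_pid()
  AND usename IS DISTINCT FROM 'rdsadmin';"

# Hypothetical helper: $1 is a libpq connection string for the blue cluster.
# Prints the number of remaining client sessions.
active_client_count() {
  psql "$1" -X -At -c "$ACTIVE_CLIENTS_SQL"
}

# Example (connection string is a placeholder):
#   [ "$(active_client_count "host={BLUE_ENDPOINT} dbname={DATABASE} user={ADMIN_USER}")" = "0" ] \
#     || echo "client sessions are still connected" >&2
```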

Delete the blue/green deployment

Before configuring the Blue Prime (rollback) cluster, it is necessary to delete the Blue/Green deployment. This action releases the clusters from the managed environment and cleans up objects such as replication slots, publications, subscriptions, and logical replication components generated by Amazon RDS Blue/Green deployment.

aws rds delete-blue-green-deployment \
--blue-green-deployment-identifier {BG_RESOURCE_ID} \
--no-delete-target

As depicted in the following screenshot, two independent clusters now exist: apg-blue-green-demo (v16.6) and apg-blue-green-demo-old1 (v15.10).

Clone the old Blue cluster to create the Blue Prime (rollback) cluster

Retaining the original Blue cluster may be necessary for compliance and auditing purposes. To establish a rollback cluster:

  1. Clone the original Blue cluster to create a self-managed Blue Prime (rollback) cluster.
  2. Once available, verify cluster and data accessibility by executing simple read-only queries on the cloned cluster.
  3. Document the Blue Prime (rollback) cluster endpoint for potential DNS or application endpoint updates should a rollback become necessary.

aws rds restore-db-cluster-to-point-in-time \
--db-cluster-identifier "apg15-blue-prime" \
--restore-type copy-on-write \
--use-latest-restorable-time \
--source-db-cluster-identifier "apg-blue-green-demo-old1" \
--db-subnet-group-name "{DB_SUBNET}" \
--vpc-security-group-ids "{VPC_SECURITY_GROUP}" \
--db-cluster-parameter-group-name "pg15-blue-green"

aws rds create-db-instance \
--db-instance-identifier "apg15-blue-prime" \
--db-instance-class "db.r6g.large" \
--db-cluster-identifier "apg15-blue-prime" \
--engine "aurora-postgresql" \
--engine-version "15.10"
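
The clone is usable only after its writer instance becomes available. The following is a minimal sketch that waits for the instance with the AWS CLI waiter and then runs a simple read-only query, in line with step 2 above. The function name and connection string are hypothetical placeholders.

```shell
# Hypothetical helper: wait for the cloned instance, then run a read-only check.
# $1: DB instance identifier (for example apg15-blue-prime)
# $2: libpq connection string for the Blue Prime cluster endpoint
verify_clone() {
  # Blocks until the instance reports the "available" state, or fails on timeout.
  aws rds wait db-instance-available --db-instance-identifier "$1" || return 1
  # Read-only smoke test; prints the engine version string of the clone.
  psql "$2" -X -At -c 'SELECT version();'
}

# Example (connection string is a placeholder):
#   verify_clone apg15-blue-prime "host={BLUE_PRIME_ENDPOINT} dbname={DATABASE} user={ADMIN_USER}"
```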

The following screenshot illustrates the newly created restored cluster ‘apg15-blue-prime’, which will serve as the rollback target.

Set up the Blue Prime (rollback) cluster

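Upon completion of the Blue Prime clone, configure self-managed logical replication from the new Blue cluster (publisher) to the Blue Prime cluster (subscriber). Keep application writes and schema changes blocked on the new Blue cluster until replication is running. The subscription is created without an initial data copy, so any change made between the clone and the creation of the replication slot would not reach the Blue Prime cluster.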

  1. On the new Blue cluster (publisher), connect to the database using the cluster endpoint and create a new publication:
    CREATE PUBLICATION publication_name FOR ALL TABLES;

    Important: Ensure each table has a replica identity (such as a primary key or unique key). If multiple databases exist within the same cluster, repeat these steps for each database in the newly promoted production cluster (new Blue).

  2. On the new Blue cluster (publisher), connect to the cluster endpoint and execute the following command to create a replication slot using the ‘pgoutput’ plugin:
    SELECT pg_create_logical_replication_slot('replication_slot_name', 'pgoutput');
  3. On the Blue Prime cluster (subscriber), use the following command to create a new subscription without copying data or creating a new slot:
    CREATE SUBSCRIPTION subscription_name
    CONNECTION 'postgres://admin_user_name:admin_user_password@source_instance_URL/database'
    PUBLICATION publication_name
    WITH (copy_data = false, create_slot = false, enabled = false, connect = true, slot_name = 'replication_slot_name');

    The code requires the following parameters:

    1. subscription_name – The name of the subscription.
    2. admin_user_name – The name of an administrative user with rds_superuser permissions.
    3. admin_user_password – The password associated with the administrative user.
    4. source_instance_URL – The endpoint of the publisher (new Blue) cluster.
    5. database – The database on the publisher that the subscriber connects to.
    6. publication_name – The name of the publication created in step 1.
    7. replication_slot_name – The name of the replication slot created in step 2.

    Important: Repeat this step on the Blue Prime cluster for every publication created in step 1.

  4. On the Blue Prime cluster, execute the following command to enable the subscription:
    ALTER SUBSCRIPTION subscription_name ENABLE;

    Important: Repeat this step for every subscription on the Blue Prime cluster.
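
After the subscription is enabled, the data flow can be checked from both sides. The following is a sketch, assuming psql access to both clusters: it reads pg_replication_slots on the publisher and pg_stat_subscription on the subscriber. The function names and connection strings are hypothetical, and the lag figure is only a rough indicator of how far the subscriber is behind.

```shell
# Publisher side (new Blue): slot state and approximate lag in bytes.
PUBLISHER_SQL="
SELECT slot_name, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';"

# Subscriber side (Blue Prime): last positions received and reported back.
SUBSCRIBER_SQL="
SELECT subname, received_lsn, latest_end_lsn, latest_end_time
FROM pg_stat_subscription;"

# Hypothetical helpers: $1 is a libpq connection string for the given cluster.
publisher_slot_lag() {
  psql "$1" -X -At -F ' | ' -c "$PUBLISHER_SQL"
}

subscriber_status() {
  psql "$1" -X -At -F ' | ' -c "$SUBSCRIBER_SQL"
}

# Example (connection strings are placeholders):
#   publisher_slot_lag "host={NEW_BLUE_ENDPOINT} dbname={DATABASE} user={ADMIN_USER}"
#   subscriber_status  "host={BLUE_PRIME_ENDPOINT} dbname={DATABASE} user={ADMIN_USER}"
```

An active slot with a small, stable lag_bytes value, together with a recent latest_end_time on the subscriber, suggests that changes are flowing. As with the setup steps, run the checks in every database that has a publication or subscription.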

After completing the logical replication setup and verifying the data flow from the new Blue to the Blue Prime cluster, traffic to the new Blue cluster can be resumed using the existing cluster endpoint. Because the managed Blue/Green switchover already moved the cluster endpoint to the new Blue cluster, your application can continue to use the same endpoint.

Rollback to the Blue Prime cluster

Should a rollback to the Blue Prime cluster (original version) be necessary, follow these steps:

  1. Cease application traffic to maintain data integrity during the transition (Amazon Aurora VPC Security Groups can be used to block incoming traffic).
  2. Update your application or DNS records to point to the Blue Prime cluster endpoint.
  3. Drop the subscription on the Blue Prime cluster.
  4. If applicable, manually update sequence values.
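
Because logical replication does not carry sequence values, step 4 needs some way to copy them across. The following is one possible sketch, assuming the new Blue cluster is still readable at rollback time: it generates setval statements from pg_sequences on the new Blue cluster and applies them on the Blue Prime cluster. The function name and connection strings are hypothetical, and you may prefer to add a safety margin to the values.

```shell
# Generates one setval statement per sequence that has been used at least once.
SETVAL_SQL="
SELECT format('SELECT setval(%L, %s, true);',
              quote_ident(schemaname) || '.' || quote_ident(sequencename),
              last_value)
FROM pg_sequences
WHERE last_value IS NOT NULL;"

# Hypothetical helper:
# $1: connection string for the new Blue cluster (source of sequence values)
# $2: connection string for the Blue Prime cluster (rollback target)
sync_sequences() {
  psql "$1" -X -At -c "$SETVAL_SQL" | psql "$2" -X -v ON_ERROR_STOP=1 -f -
}

# Example (connection strings are placeholders):
#   sync_sequences "host={NEW_BLUE_ENDPOINT} dbname={DATABASE} user={ADMIN_USER}" \
#                  "host={BLUE_PRIME_ENDPOINT} dbname={DATABASE} user={ADMIN_USER}"
```

Run it only after application traffic has stopped, and once per database, so that the copied values cannot fall behind new inserts.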

This transition is not automatic, because the Blue Prime cluster is not part of a managed Blue/Green deployment. It is advisable to create a runbook or automation script for the rollback activities to minimize errors during execution. While this strategy provides a rollback option, it adds downtime while the Blue Prime cluster is being set up. Weigh this trade-off carefully before adopting the approach, and test it thoroughly in staging environments before production deployment.

Clean up

In a production environment, it is prudent to retain the Blue Prime (rollback) cluster until you have confirmed that all applications run correctly on the new version. Keeping both environments running ensures that a rollback can be executed if inconsistencies or unexpected behaviors arise. The old blue cluster can be backed up for compliance purposes and then deleted to reduce costs. Charges continue to accrue for all clusters until they are deleted.

If these resources were created for testing purposes, it is advisable to delete all clusters (old blue, new blue, and blue prime) to avoid incurring additional charges. The following steps outline the cleanup process for each database cluster:

  1. Delete each read replica instance, if any.
  2. Delete the primary instance.
  3. Delete the database cluster.
Tech Optimizer