Fluent Commerce’s approach to near-zero downtime Amazon Aurora PostgreSQL upgrade at 32 TB scale using snapshots and AWS DMS ongoing replication

This is a guest post by Adrian Cook, DevOps Engineer at Fluent Commerce, in partnership with AWS.

Fluent Commerce, an omnichannel commerce platform, provides order management solutions that enable seamless shopping experiences across channels. To process intricate customer search queries efficiently, Fluent uses Amazon Aurora PostgreSQL-Compatible Edition as its high-performance online transaction processing (OLTP) database engine, which handles many complex database queries concurrently while remaining consistent, reliable, and scalable. These capabilities allow Fluent to meet the demanding performance needs of its customers.

As Fluent Commerce embarked on a journey of global expansion and the continuous onboarding of new customers, it became clear that a strategic shift was necessary to support its ever-growing geographic footprint. A pivotal factor in the success of this ambitious endeavor was the enhancement of cost-efficiency across the organization.

The transformation was significantly influenced by the integration of Amazon Relational Database Service (Amazon RDS) and Amazon Aurora. The introduction of the AWS Graviton instance type, which promises up to 20% performance improvement, served as a catalyst for change. Recognizing its potential for streamlining operations globally, Fluent Commerce made the strategic decision to migrate all its instances to Graviton, thereby enhancing resource allocation and performance while maintaining efficient tenant density across instances.

The migration to the Graviton2 instance family was a straightforward process, albeit requiring a major version upgrade from Aurora PostgreSQL-Compatible 10.14 to 12.4 and higher. Upgrading databases can often be disruptive, particularly for Fluent Commerce, which supports some of the world’s largest ecommerce platforms. Managing upgrades with near-zero downtime posed challenges, especially given the immense size of their databases, which can reach up to 32 TB. Fortunately, AWS provides several technical approaches for database upgrades, including in-place upgrades, native replication, Amazon RDS blue/green deployment, and AWS Database Migration Service (AWS DMS). Fluent Commerce strategically combined these AWS-based upgrade approaches—utilizing snapshot restores and AWS DMS ongoing replication—to upgrade their 32 TB Aurora PostgreSQL databases with minimal downtime.

Migration overview

Aurora PostgreSQL-Compatible offers various upgrade paths, including in-place upgrades and blue/green deployments. Given Fluent’s specific requirements, they implemented a combined approach utilizing native logical replication and AWS DMS. This method facilitated the upgrade of over 350 production Aurora PostgreSQL databases, some as large as 32 TB, serving top global ecommerce clients with stringent downtime requirements. AWS DMS emerged as the preferred choice for both database upgrades and Graviton migration in production due to several advantages:

  • Minimized data transfer costs, making it a cost-effective alternative compared to third-party services.
  • Seamless integration with native AWS services, aligning with existing DevOps and data pipelines, including AWS IAM Identity Center, AWS CloudFormation, and CI/CD processes.
  • Change data capture (CDC) capabilities, allowing for ongoing data replication during cutover periods with minimal or near-zero downtime. AWS DMS also includes comprehensive monitoring features, utilizing Amazon CloudWatch and AWS CloudTrail logs for real-time metrics and notifications, enabling proactive management and troubleshooting of upgrade and migration tasks.

Solution overview

Fluent Commerce adopted the following approach to achieve a near-zero downtime migration:

  1. Configure the source database parameter to enable replication slots.
  2. Create replication slots in the source database to establish a checkpoint.
  3. Take a snapshot of the source database. (Fluent Commerce opted not to use the Aurora clone feature because the database clusters were being migrated across accounts; instead, it used CI/CD with AWS CloudFormation to deploy the newly restored clusters in separate accounts.)
  4. Copy the snapshot using the new KMS key and restore it to the target database.
  5. Drop the replication slots on the target database.
  6. Perform necessary prerequisites for major version upgrades (e.g., upgrading PostgreSQL extensions like PostGIS and pg_repack).
  7. Utilize AWS DMS to replicate data from the source to the target.
  8. Conduct data validation between the source and target databases using AWS DMS.
  9. Cut over the application from the source to the target.
  10. Execute post-cutover tasks such as decommissioning the old environment, backing up the new environment, and conducting end-to-end validation and testing.

The following diagram illustrates the solution architecture.

Configure the source database

To configure the source database, follow these steps:

  1. Configure the parameters for ongoing replication and CDC:
    max_worker_processes =              # On the source: 1 per database being replicated; on the target: 1 per node
    max_logical_replication_workers =
    max_replication_slots =             # Equal to the number of replication tasks that will run against the source
    max_parallel_workers =
    max_wal_senders =                   # Maximum number of concurrent connections allowed from all kinds of streaming replicas
    wal_sender_timeout = 0
    rds.logical_replication = 1
    shared_preload_libraries = pglogical
  2. Use the following code to set up the pglogical extension:
    # Install the pglogical extension
    database_name => CREATE EXTENSION pglogical;
  3. Configure the node:
    # Create pglogical node
    database_name => SELECT pglogical.create_node(node_name := 'database_name_node',
    	dsn := 'host=dmt-test-core-cluster.cluster-ID.ap-southeast-2.rds.amazonaws.com
    	dbname=database_name port=5432 user=dmt_user password=');
  4. Create a replication slot:
    database_name=> SELECT * FROM pg_create_logical_replication_slot('database_name_slot', 'pglogical');

    Note this replication slot name, as it will be needed later when configuring the source DMS endpoint.

  5. Get the confirmed flush log sequence number (LSN) for the slot:
    database_name=> SELECT slot_name, slot_type, plugin, database, active, confirmed_flush_lsn
    FROM pg_replication_slots;
  6. Create two replication sets using the pglogical.create_replication_set function:
    1. The first replication set tracks updates and deletes for tables with primary keys.
    2. The second replication set tracks only inserts and uses the same name as the first, prefixed with i.
    database_name=> SELECT pglogical.create_replication_set('database_name_slot', false, true, true, false);
    database_name=> SELECT pglogical.create_replication_set('idatabase_name_slot', true, false, false, true);

    Replication slots serve as markers in the PostgreSQL write-ahead log (WAL), enabling AWS DMS to identify the correct starting point for ingesting data from the source database. Given the high DML transactions on the database, Fluent Commerce utilized two replication sets for each replication slot to enhance granular control and overall migration performance. It is crucial for all tables involved in the migration to have primary keys and be included in these two sets to ensure data integrity and consistency throughout the migration process. For further information, refer to Using a PostgreSQL database as an AWS DMS source.

  7. Add tables to the replication set that tracks updates and deletes. The following query demonstrates how to add a table to the replication set:
    select pglogical.replication_set_add_table('database_name_slot', 'schema_name.my_table');

    Alternatively, you can query the catalog table to generate SQL for adding tables to the replication set in one step:

    database_name=> SELECT 'SELECT pglogical.replication_set_add_table(''database_name_slot'', '''
    ||tab.table_schema||'.'||tab.table_name||''');'
    FROM information_schema.tables tab
    LEFT JOIN information_schema.table_constraints tco
    ON tco.table_schema = tab.table_schema
    AND tco.table_name = tab.table_name
    AND tco.constraint_type = 'PRIMARY KEY'
    LEFT JOIN information_schema.key_column_usage kcu
    ON kcu.constraint_name = tco.constraint_name
    AND kcu.constraint_schema = tco.constraint_schema
    AND kcu.constraint_name = tco.constraint_name
    WHERE tab.table_schema NOT IN ('pg_catalog', 'information_schema')
    AND tab.table_type = 'BASE TABLE' AND tco.constraint_name IS NOT NULL
    GROUP BY tab.table_schema,
    tab.table_name,
    tco.constraint_name
    ORDER BY tab.table_schema,
    tab.table_name;
  8. Add tables to the replication set that tracks inserts only:
    select pglogical.replication_set_add_table('idatabase_name_slot', 'schema_name.my_table');

    Similarly, you can add all tables to replication sets in one step using the following SQL query:

    database_name=> SELECT 'SELECT pglogical.replication_set_add_table(''idatabase_name_slot'', '''
      ||tab.table_schema||'.'||tab.table_name||''');'
    FROM information_schema.tables tab
    LEFT JOIN information_schema.table_constraints tco
              ON tco.table_schema = tab.table_schema
              AND tco.table_name = tab.table_name
              AND tco.constraint_type = 'PRIMARY KEY'
    LEFT JOIN information_schema.key_column_usage kcu 
              ON kcu.constraint_name = tco.constraint_name
              AND kcu.constraint_schema = tco.constraint_schema
              AND kcu.constraint_name = tco.constraint_name
    WHERE tab.table_schema NOT IN ('pg_catalog', 'information_schema')
          AND tab.table_type = 'BASE TABLE' AND tco.constraint_name IS NOT NULL
    GROUP BY tab.table_schema,
             tab.table_name,
           tco.constraint_name
    ORDER BY tab.table_schema,
           tab.table_name;

Create a snapshot

Next, create a snapshot of the source database. Note the name of the snapshot and share it with the target AWS account, because the database is being migrated to a new AWS account.

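A minimal AWS CLI sketch follows, assuming a cluster-level snapshot of the Aurora source; the angle-bracket values are placeholders:

aws rds create-db-cluster-snapshot \
    --db-cluster-snapshot-identifier <db-cluster-snapshot-identifier> \
    --db-cluster-identifier <db-cluster-identifier>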

This code uses the following placeholders:

  • <db-cluster-snapshot-identifier> – The identifier for the DB cluster snapshot to be created.
  • <db-cluster-identifier> – The identifier of the DB cluster to snapshot, matching an existing Aurora DB cluster.

Share snapshots

If migrating to another account, share the snapshot with the target account so that it can be copied and restored there.

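A minimal AWS CLI sketch, assuming the cluster snapshot is shared by adding the target account to the snapshot's restore attribute (the angle-bracket values are placeholders):

aws rds modify-db-cluster-snapshot-attribute \
    --db-cluster-snapshot-identifier <db-cluster-snapshot-identifier> \
    --attribute-name restore \
    --values-to-add <target-account-id> \
    --region <region> \
    --profile <profile-name>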

This code uses the following placeholders:

  • <db-cluster-snapshot-identifier> – The identifier for the DB cluster snapshot to share.
  • <target-account-id> – The AWS account ID with which to share the snapshot. Multiple account IDs can be specified, separated by spaces.
  • <region> – The AWS Region where the snapshot is located (e.g., us-west-2).
  • <profile-name> – The name of the AWS Command Line Interface (AWS CLI) profile to use for the command. This is optional if using the default profile.

When using a custom KMS key, make sure the key policy allows cross-account access. Once the snapshot is shared, copy it in the target account using a new KMS key.

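A minimal AWS CLI sketch of the cross-account copy; the ARN, identifiers, KMS key, and profile are placeholders:

# Run from the target account, referencing the shared snapshot by its ARN
aws rds copy-db-cluster-snapshot \
    --source-db-cluster-snapshot-identifier arn:aws:rds:<region>:<source-account-id>:cluster-snapshot:<db-cluster-snapshot-identifier> \
    --target-db-cluster-snapshot-identifier <copied-snapshot-identifier> \
    --kms-key-id <target-kms-key-arn> \
    --region <region> \
    --profile <target-profile-name>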

Restore the target database

Next, restore the target database cluster from the copied snapshot.

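A minimal AWS CLI sketch follows (Fluent Commerce performed this step through CloudFormation and CI/CD, as noted in the solution overview); identifiers are placeholders, networking options are omitted, and the instance class shown is an example non-Graviton class because the restored cluster still runs the source engine version:

aws rds restore-db-cluster-from-snapshot \
    --db-cluster-identifier <new-db-cluster-identifier> \
    --snapshot-identifier <copied-snapshot-identifier> \
    --engine aurora-postgresql \
    --region <region> \
    --profile <profile-name>

# Restoring a cluster snapshot creates only the cluster; add a writer instance to it
aws rds create-db-instance \
    --db-instance-identifier <new-db-instance-identifier> \
    --db-cluster-identifier <new-db-cluster-identifier> \
    --db-instance-class db.r5.2xlarge \
    --engine aurora-postgresql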

This code uses the following placeholders:

  • <new-db-cluster-identifier> – The identifier for the new DB cluster to be created from the snapshot.
  • <copied-snapshot-identifier> – The identifier of the DB cluster snapshot to restore from.
  • --engine – The name of the database engine to be used for the new DB cluster (e.g., aurora, aurora-mysql, or aurora-postgresql; here, aurora-postgresql).
  • <region> – The Region where the snapshot is located (e.g., us-west-2).
  • <profile-name> – The name of the AWS CLI profile to use for the command. This is optional if using the default profile.

Upgrade the target database

Due to the absence of a direct upgrade route from version 10.20 to 14.3, the initial step is to upgrade the target database to a minor version that offers a direct upgrade path to 14.3. For a comprehensive list of compatible versions, refer to Upgrading Amazon Aurora PostgreSQL DB clusters.

In this case, the upgrade process involves transitioning from 10.20 to 10.21, and subsequently from 10.21 to 14.3:

  1. To upgrade from 10.20 to 10.21, update the CloudFormation template with the following:
    RDSEngineVersion: 10.21
  2. To upgrade from 10.21 to 14.3, update the CloudFormation template with the following:
    RDSEngineVersion: 14.3
  3. After upgrading Aurora PostgreSQL-Compatible to version 14.3, change the instance type to a supported Graviton instance and set the custom parameters to true (see the sketch after this list).
  4. Update the target database cluster and instance parameters, ensuring max_worker_processes is configured. This value must be greater than or equal to the number of databases intended for migration to this cluster. For instance, if migrating 10 databases from the source to the target, set this value to a minimum of 10 (also shown in the sketch after this list).
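The following AWS CLI sketch illustrates steps 3 and 4 under stated assumptions: the instance identifier, parameter group name, and db.r6g.2xlarge class are examples, and max_worker_processes is changed in whichever custom parameter group (cluster or instance level) exposes it in your setup:

# Step 3: move the writer instance to a Graviton instance class
aws rds modify-db-instance \
    --db-instance-identifier <new-db-instance-identifier> \
    --db-instance-class db.r6g.2xlarge \
    --apply-immediately

# Step 4: set max_worker_processes to at least the number of databases being migrated
aws rds modify-db-parameter-group \
    --db-parameter-group-name <custom-db-parameter-group-name> \
    --parameters "ParameterName=max_worker_processes,ParameterValue=10,ApplyMethod=pending-reboot"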

Configure AWS DMS

This section focuses on configuring the AWS DMS environment, including the replication instance and replication tasks. For additional information, refer to Set up replication for AWS Database Migration Service.

When configuring the source endpoint for the source PostgreSQL cluster, it is essential to specify the replication slot name created earlier. The following are key considerations for configuring your AWS DMS infrastructure for large-scale database migration with minimal downtime:

  • Separate the validation task from the main AWS DMS migration task to minimize the risk and overhead of a migration failure. This separation also improves the throughput of the data migration task.
  • If LOB data types are present in the database, consider setting the maximum LOB size (limited LOB mode) to increase throughput when migration latency is high (see the task settings excerpt after this list).
  • Position the AWS DMS replication instance in the same Availability Zone as the source database to reduce latency and cross-Availability Zone data transfer costs.
  • The validation task uses memory on the replication instance, so choose an instance size with enough memory for validation to move quickly out of PENDING status.
  • For rapid data transfer from the source to the target, choose a replication instance with higher CPU and network throughput (for example, memory-optimized r6i instances), and consider oversizing the replication instance to speed up the migration.
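As an illustration of the first two points, the following replication task settings excerpt shows a validation-only task and limited LOB mode with a tuned maximum LOB size. The values are examples, not Fluent Commerce's production settings:

# Example task settings excerpt; LobMaxSize is in KB
cat > task-settings.json <<'JSON'
{
  "ValidationSettings": {
    "EnableValidation": true,
    "ValidationOnly": true,
    "ThreadCount": 5
  },
  "TargetMetadata": {
    "SupportLobs": true,
    "LimitedSizeLobMode": true,
    "LobMaxSize": 32
  }
}
JSON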

Track the migration progress

This section discusses various methods to monitor migration progress.

Data ingest start time

To track when the ingest starts, use a script that loops over the AWS DMS describe-table-statistics API and monitors when data is inserted into any of the tables. The last update timestamp indicates when data began ingesting into the target database.

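A minimal bash sketch of such a loop; the replication task ARN and the 60-second polling interval are placeholders:

#!/bin/bash
# Poll AWS DMS table statistics and report tables that have started receiving inserts
TASK_ARN="<replication-task-arn>"
while true; do
    aws dms describe-table-statistics \
        --replication-task-arn "$TASK_ARN" \
        --query 'TableStatistics[?Inserts > `0`].[SchemaName,TableName,Inserts,LastUpdateTime]' \
        --output table
    sleep 60
done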

Monitor validation tasks

The following script performs data verification between the source and target databases for specified tables. It reads SQL queries from a file (verify-${DBTYPE}-sql.txt), executes them on both databases, compares the results, and logs any discrepancies.

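A simplified bash sketch of the comparison logic; the connection variables (SOURCE_HOST, TARGET_HOST, DB_USER, DB_NAME, DBTYPE) and the one-query-per-line file format are assumptions:

#!/bin/bash
# Compare the result of each query in verify-${DBTYPE}-sql.txt between source and target.
# Credentials are expected via PGPASSWORD or ~/.pgpass.
while IFS= read -r QUERY; do
    SRC=$(psql -h "$SOURCE_HOST" -U "$DB_USER" -d "$DB_NAME" -At -c "$QUERY")
    TGT=$(psql -h "$TARGET_HOST" -U "$DB_USER" -d "$DB_NAME" -At -c "$QUERY")
    if [ "$SRC" != "$TGT" ]; then
        echo "$(date -u) MISMATCH: $QUERY | source=$SRC | target=$TGT" >> verify-discrepancies.log
    fi
done < "verify-${DBTYPE}-sql.txt"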

Source ingestion rate

Employing a custom script to monitor ingestion rates at the source and target databases provides a more granular, near real-time view of data flow, with specific insight into row counts over time intervals. The following command tracks the data ingestion rate for a table (source_table) based on records created in the last 5 minutes.

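A sketch of such a command, assuming the table has a created_on timestamp column (the column name and endpoint are hypothetical):

# Rows inserted into the source table in the last 5 minutes
psql -h <source-endpoint> -U dmt_user -d database_name -c \
    "SELECT count(*) AS rows_last_5_min
       FROM schema_name.source_table
      WHERE created_on > now() - interval '5 minutes';"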

Target ingestion rate

This command mirrors the previous example for the source database, but applies to the target database. Comparing the insert ingest rates between source and target allows for assessing the speed at which AWS DMS operates in relation to API ingest. Note that this method is not foolproof; it merely serves as a high-level tracking mechanism for ingest rates.
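For example, under the same assumptions as the source command, pointed at the target endpoint:

# Rows arriving in the same table on the target in the last 5 minutes
psql -h <target-endpoint> -U dmt_user -d database_name -c \
    "SELECT count(*) AS rows_last_5_min
       FROM schema_name.source_table
      WHERE created_on > now() - interval '5 minutes';"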

CDC latency

To troubleshoot latency issues in AWS DMS, monitoring the CDCLatencySource and CDCLatencyTarget metrics in CloudWatch is advisable. For further details, refer to Types of CDC latency.
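For example, the CDCLatencySource metric can be pulled with the AWS CLI; the identifiers are placeholders, and DMS publishes task metrics under the ReplicationInstanceIdentifier and ReplicationTaskIdentifier dimensions:

# GNU date syntax is used for the time window
aws cloudwatch get-metric-statistics \
    --namespace AWS/DMS \
    --metric-name CDCLatencySource \
    --dimensions Name=ReplicationInstanceIdentifier,Value=<replication-instance-id> \
                 Name=ReplicationTaskIdentifier,Value=<replication-task-resource-id> \
    --statistics Average \
    --period 60 \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"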

Perform the cutover

Once the target database is synchronized with the source and successful data migration and validation are confirmed, the application cutover can proceed. Key considerations prior to cutting over include:

  • Configuration changes – The application retrieves part of its database configuration from an internal table, specifically the dbhost value, which governs the database connection. Updating the dbhost in the application’s configuration allows for redirection to a new database (e.g., AURORA02), simplifying the process to merely modifying the configuration.
  • Bumping the sequences – Identify all sequences in the target database (the one being cut over to) and generate a query to increment their current values. Fluent Commerce used an arbitrary offset of 500,000, though the exact number can vary; the key is to ensure that sequence values generated on the source cannot catch up to those on the target during the cutover, so a larger offset provides a safer operation (see the sketch after this list).
  • Vacuum and analyze tables – Execute vacuum and analyze on the tables in the target DB instance to validate optimal performance.
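A hedged sketch of the sequence bump, using the pg_sequences catalog view and psql's \gexec to run the generated statements; the endpoint and user are placeholders, and 500,000 is the offset described above:

# Bump every sequence on the target database 500,000 ahead of its current value
psql -h <target-endpoint> -U dmt_user -d database_name <<'SQL'
SELECT format('ALTER SEQUENCE %I.%I RESTART WITH %s;',
              schemaname, sequencename, COALESCE(last_value, 1) + 500000)
  FROM pg_sequences
\gexec
SQL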

About the authors
