Using delayed read replicas for Amazon RDS for PostgreSQL disaster recovery

Human errors pose significant threats to business continuity, with a single erroneous DELETE statement or a mismanaged application deployment capable of corrupting vital business data in an instant. To mitigate these risks, Amazon Relational Database Service (Amazon RDS) offers automated backups and transaction log backups, providing a reliable safety net. However, traditional recovery methods often involve creating new database instances and executing point-in-time recovery, which can take hours for large databases, thereby disrupting business operations.

Delayed Read Replicas: A New Approach

In response to these challenges, AWS has introduced delayed read replicas for Amazon RDS for PostgreSQL. The feature provides an alternative disaster recovery strategy by maintaining a standby replica that intentionally lags behind the primary database by a configurable time interval. Because the replica holds the data as it existed minutes or hours earlier, users can detect data corruption on the production instance and promote the replica before the problematic operations are replayed on it. This mechanism serves as a safety net that is simpler and faster than restoring from a point-in-time backup.

In the event of data corruption, promoting the delayed replica to become the new primary instance can be accomplished within minutes. The feature is enabled with the recovery_min_apply_delay parameter, available in Amazon RDS for PostgreSQL versions 14.19, 15.14, 16.10, and 17.6 and later. This article covers the use cases for delayed replication, the recovery procedures, and best practices for managing delayed replicas.

Use Cases for Delayed Replication

Delayed replicas address three primary use cases: preventing accidental data modifications, protecting against logical errors in applications, and enabling auditing and forensic analysis. Below is a detailed exploration of each scenario:

  1. Preventing Accidental Data Modifications – Human errors, such as executing UPDATE or DELETE statements without proper WHERE clauses, can lead to immediate corruption of large datasets. A delayed replica acts as a buffer, allowing for the detection of such mistakes in production and the promotion of the replica to recover from unintentional changes. For instance, if a database administrator mistakenly executes DELETE FROM customer_orders WHERE status = 'pending' instead of targeting a specific date range, the delayed replica provides a critical window to recover from the error.
  2. Protection Against Errors in Applications – Bugs in applications or incorrect deployment logic can corrupt data through unwanted changes, such as erroneous bulk inserts or unintended cascade operations affecting multiple tables. Delayed read replicas offer a recovery opportunity from such errors. If new code contains bugs that modify critical data, you can pause write-ahead log (WAL) replay on the delayed replica before those changes are applied, which preserves the correct state for restoration.
  3. Auditing and Forensic Analysis of Data Changes – A delayed replica can serve as an auditing resource, preserving the history of data for a configurable delay period. This allows for examination and comparison of past and present data side by side. If unauthorized or unintended changes are suspected, querying both the delayed replica and the primary can reveal alterations during that interval. Advanced users can also inspect the WAL on the delayed replica using tools like the pg_walinspect extension to identify specific transactions that occurred, aiding in auditing and incident investigation without the complexity of restoring point-in-time backups.

In all these scenarios, delayed replication functions as an “undo buffer” or safety net. While it does not replace automated backups, it complements disaster recovery strategies by keeping a recent, already-running copy of the data that can be promoted without a restore. As noted in the PostgreSQL documentation, time-delayed replicas can be invaluable for correcting data loss errors by providing a crucial window for reaction.

Setting Up Delayed Replication in Amazon RDS for PostgreSQL

The recovery_min_apply_delay parameter governs PostgreSQL’s WAL replay mechanism at the transaction commit level. When configured in Amazon RDS for PostgreSQL, it modifies the replica’s recovery process by comparing the commit timestamp in each WAL record against the replica’s system clock, creating a deliberate lag in transaction visibility.
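As a rough illustration of that comparison, the following shell sketch models the rule with Unix epoch seconds: a commit becomes eligible for replay once the replica's clock reaches the commit time plus the configured delay. The function names are hypothetical, and this is a simplified model of the behavior described above, not PostgreSQL's actual implementation.

```shell
# Simplified model of recovery_min_apply_delay (illustrative only).
# Times are Unix epoch seconds; the delay is in milliseconds,
# the same unit the parameter uses.

# Print the earliest epoch second at which a commit may be replayed.
# $1 = commit time (epoch seconds), $2 = delay (milliseconds)
earliest_apply_time() {
  echo $(( $1 + $2 / 1000 ))
}

# Succeed (exit 0) if the commit may be replayed at the given replica time.
# $1 = commit time, $2 = delay (milliseconds), $3 = replica clock (epoch seconds)
can_apply() {
  [ "$3" -ge "$(earliest_apply_time "$1" "$2")" ]
}
```

For example, with a 12-hour delay (43200000 ms), `can_apply 1700000000 43200000 1700040000` fails because the replica's clock has not yet reached 1700043200, so the commit would still be held back.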

For an Amazon RDS for PostgreSQL DB instance that already has a read replica, the following procedure demonstrates how to configure delayed replication using the AWS CLI:

  1. Create a custom database parameter group:
    aws rds create-db-parameter-group \
    --db-parameter-group-name awsblog-demo-delayedrepl-param-grp \
    --db-parameter-group-family postgres17 \
    --description "database param group to configure delayed replica" \
    --region us-west-2
  2. Modify the newly created custom database parameter group to configure the recovery_min_apply_delay parameter. The default value is 0 milliseconds (no delay), with a maximum of 86400000 milliseconds (24 hours). In this example, we set it to 43200000 milliseconds (12 hours):
    aws rds modify-db-parameter-group \
    --db-parameter-group-name awsblog-demo-delayedrepl-param-grp \
    --parameters \
    '[{ 
    "ParameterName": "recovery_min_apply_delay", 
    "ParameterValue": "43200000", 
    "ApplyMethod": "immediate" 
    }]' \
    --region us-west-2
  3. Modify the read replica database instance to use the custom database parameter group, then reboot the replica for the configuration to take effect:
    aws rds modify-db-instance \
    --db-instance-identifier awsblog-demo-delayed-replica \
    --db-parameter-group-name awsblog-demo-delayedrepl-param-grp \
    --apply-immediately \
    --region us-west-2
    
    aws rds reboot-db-instance \
    --db-instance-identifier awsblog-demo-delayed-replica \
    --region us-west-2

    Note: The recovery_min_apply_delay parameter is static; a reboot is required for the change to take effect.

  4. Verify that the replica is configured with a 12-hour delay by connecting to the RDS read replica instance and executing one of the following queries:
    • SHOW recovery_min_apply_delay;
         recovery_min_apply_delay
        --------------------------
         12h
        (1 row)
    • SELECT setting FROM pg_settings WHERE name = 'recovery_min_apply_delay';
         setting
        ----------
         43200000
        (1 row)
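The millisecond value passed in step 2 is easy to get wrong by a factor of 1,000. A minimal sketch of a helper that converts whole hours to milliseconds and rejects values outside the 0 to 24 hour range quoted above; the function and variable names are hypothetical and not part of the AWS CLI:

```shell
# Largest delay cited above for recovery_min_apply_delay: 24 hours, in ms.
MAX_DELAY_MS=86400000

# Convert whole hours to milliseconds.
# Prints the value, or fails if it falls outside 0..24 hours.
# $1 = delay in hours
delay_hours_to_ms() {
  ms=$(( $1 * 60 * 60 * 1000 ))
  if [ "$ms" -lt 0 ] || [ "$ms" -gt "$MAX_DELAY_MS" ]; then
    echo "delay must be between 0 and 24 hours" >&2
    return 1
  fi
  echo "$ms"
}
```

Calling `delay_hours_to_ms 12` prints 43200000, the value used in the example parameter group.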

Recovery Control Functions with Delayed Replication

The delayed replication feature on RDS for PostgreSQL also provides access to two recovery control functions for finer control over the recovery process. Both require the rds_superuser role:

  • pg_wal_replay_pause(): This function requests a pause in the recovery process. The call only registers the request, so the actual pause may not occur immediately. To confirm that recovery has fully paused, call pg_get_wal_replay_pause_state(), which reports not paused, pause requested, or paused. While replay is paused, no new changes are applied to the delayed replica, providing a stable point-in-time view of the data.
  • pg_wal_replay_resume(): When ready to resume the recovery process, this function can be called to continue normal operations. The delayed replica will start applying changes from the point where it was paused.

Once WAL replay is paused with pg_wal_replay_pause(), it must eventually be resumed using pg_wal_replay_resume(). Otherwise the read replica keeps receiving WAL without applying it, so WAL files accumulate indefinitely and consume storage.
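Because the pause is asynchronous, a script that depends on a stable view should confirm the state before proceeding. The following is a minimal sketch that requests a pause and polls pg_get_wal_replay_pause_state(), the PostgreSQL 14+ function that reports the pause state, until it returns paused. The wrapper names are hypothetical, and the psql calls assume that PGHOST, PGUSER, and PGDATABASE already point at the delayed replica.

```shell
# Hypothetical wrappers around psql. They assume the standard PG*
# environment variables point at the delayed replica and that the
# connecting user has the rds_superuser role.
request_pause() {
  psql -qAt -c "SELECT pg_wal_replay_pause();" >/dev/null
}

get_pause_state() {
  # Prints 'not paused', 'pause requested', or 'paused'.
  psql -qAt -c "SELECT pg_get_wal_replay_pause_state();"
}

# Request a pause, then poll until replay is confirmed paused.
# $1 = maximum number of polls (default 30)
# $2 = seconds to sleep between polls (default 2)
wait_until_paused() {
  max_polls="${1:-30}"
  interval="${2:-2}"
  request_pause || return 1
  polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    if [ "$(get_pause_state)" = "paused" ]; then
      return 0
    fi
    polls=$((polls + 1))
    sleep "$interval"
  done
  return 1   # replay never reported 'paused' within the allotted polls
}
```

A caller might run `wait_until_paused 30 2 || echo "replay did not pause in time"` before taking any further recovery action.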

Demonstration and Recovery with Delayed Replicas

To illustrate the utility of delayed replication, consider a scenario where a user accidentally drops a logical database from a production RDS instance, resulting in service outages. The read replica, configured with a 12-hour delay, provides a crucial opportunity to implement a recovery plan before the DROP statement propagates to the replica.

Recovering from a Dropped Database

Upon connecting to the delayed replica using an account with rds_superuser privileges, the first step is to verify the replication status and immediately pause the WAL replay:

  1. Check if the WAL replay is ongoing on the replica:
    SELECT pg_is_wal_replay_paused();
     pg_is_wal_replay_paused
    -------------------------
     f
    (1 row)
    
    SELECT pg_is_in_recovery();
     pg_is_in_recovery
    -------------------
     t
    (1 row)
  2. Pause WAL replay to ensure no transactions are applied on the read replica:
    SELECT pg_wal_replay_pause();
     pg_wal_replay_pause
    ---------------------
     
    (1 row)
  3. Capture comprehensive replica metrics:
    SELECT pg_last_xact_replay_timestamp() AS last_replay_timestamp,
           NOW() - pg_last_xact_replay_timestamp() AS replication_lag,
           pg_last_wal_receive_lsn() AS last_received_lsn,
           pg_last_wal_replay_lsn() AS last_replayed_lsn,
           pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_lag_bytes,
           current_setting('recovery_min_apply_delay') AS configured_delay;

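    The output shows when the last transaction was replayed, how far replay lags behind the current time, how many bytes of WAL have been received but not yet replayed, and the configured delay.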

  4. Enable log_statement=all on the source instance to assist in investigating the incident. This allows tracing the sequence of events leading to the database drop.
  5. Set the recovery_target_lsn and recovery_target_inclusive parameters on the delayed read replica to facilitate recovery.
  6. Modify the read replica’s parameters and remove the replication delay while setting the target recovery point. Rebooting the database will be necessary for these changes to take effect.
  7. Resume WAL replay and monitor the recovery process.
  8. After confirming the blog_production database exists on the replica, promote the read replica to become the new primary.
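The replay_lag_bytes value in step 3 and the recovery_target_lsn in step 5 both rest on LSN arithmetic. An LSN such as 0/3000060 is two hexadecimal numbers: the high 32 bits and the low 32 bits of a byte position in the WAL stream. The sketch below reproduces that arithmetic in shell so LSNs copied from logs can be compared without a database connection. The function names are hypothetical; the subtraction mirrors what pg_wal_lsn_diff computes.

```shell
# Convert an LSN such as '0/3000060' to an absolute byte position.
# The part before the slash is the high 32 bits (hex),
# the part after it is the low 32 bits (hex).
lsn_to_bytes() {
  hi="${1%%/*}"
  lo="${1##*/}"
  echo $(( (0x$hi << 32) + 0x$lo ))
}

# Bytes between two LSNs (first minus second), like pg_wal_lsn_diff.
lsn_diff_bytes() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}
```

For example, `lsn_diff_bytes 1/0 0/FFFFFFFF` prints 1, because the two positions are adjacent bytes on either side of a 4 GiB boundary.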

Recovering Applications from Production Outage

Once the RDS for PostgreSQL read replica is promoted, it is essential to verify its status, ensuring it is active, healthy, and accepting new connections. The next step involves renaming the original source instance to facilitate routing application traffic to the newly promoted instance.
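One possible cutover sequence, sketched as a function that only prints the AWS CLI commands so they can be reviewed before anything is run. The function name, the "-old" suffix, and the final step that gives the promoted instance the original identifier are assumptions for illustration; the article itself only states that the original source is renamed.

```shell
# Print (do not run) a possible cutover sequence.
# $1 = delayed replica identifier
# $2 = original source identifier
# $3 = AWS Region
print_cutover_commands() {
  replica="$1"
  source_db="$2"
  region="$3"
  cat <<EOF
aws rds promote-read-replica --db-instance-identifier $replica --region $region
aws rds wait db-instance-available --db-instance-identifier $replica --region $region
aws rds modify-db-instance --db-instance-identifier $source_db --new-db-instance-identifier ${source_db}-old --apply-immediately --region $region
aws rds modify-db-instance --db-instance-identifier $replica --new-db-instance-identifier $source_db --apply-immediately --region $region
EOF
}
```

For example, `print_cutover_commands awsblog-demo-delayed-replica awsblog-demo-source us-west-2` prints four commands, where awsblog-demo-source is a placeholder for the real source identifier. Each rename takes time to complete, so the second rename should be issued only after the first has finished and the old identifier is free.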

Best Practices

Implementing delayed replication in Amazon RDS for PostgreSQL requires adherence to best practices to optimize performance and prevent storage-related issues:

Storage Management and Monitoring

  • Establish comprehensive monitoring through Amazon CloudWatch Alarms to track FreeStorageSpace on both the source and delayed replica instances.
  • Enable storage auto-scaling on both instances to accommodate WAL log accumulation, which is particularly crucial when using delayed replication.
  • Consider configuring the max_slot_wal_keep_size parameter to automatically rotate WAL logs, preventing storage-full conditions while maintaining replication integrity.
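A FreeStorageSpace alarm can be created with the aws cloudwatch put-metric-alarm command. The sketch below only prints the command so it can be reviewed first. The function names, the alarm name, and the choice of three 5-minute evaluation periods are illustrative assumptions, and an --alarm-actions argument (for example, an SNS topic) would normally be added so that someone is notified.

```shell
# FreeStorageSpace is reported in bytes, so convert a GiB threshold first.
# $1 = size in GiB
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print (do not run) a put-metric-alarm command for low free storage.
# $1 = DB instance identifier
# $2 = threshold in GiB
# $3 = AWS Region
print_storage_alarm_command() {
  instance="$1"
  threshold_bytes="$(gib_to_bytes "$2")"
  region="$3"
  echo "aws cloudwatch put-metric-alarm" \
    "--alarm-name ${instance}-low-free-storage" \
    "--namespace AWS/RDS --metric-name FreeStorageSpace" \
    "--dimensions Name=DBInstanceIdentifier,Value=${instance}" \
    "--statistic Average --period 300 --evaluation-periods 3" \
    "--comparison-operator LessThanThreshold" \
    "--threshold ${threshold_bytes}" \
    "--region ${region}"
}
```

Running `print_storage_alarm_command awsblog-demo-delayed-replica 20 us-west-2` prints a command that would alarm when average free storage stays below 20 GiB; the same function can be called again with the source instance identifier.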

Recovery Management

  • If storage consumption becomes excessive due to accumulating WAL logs, manually advance the replica to catch up and free space using built-in recovery controls.
  • Regularly review the delayed replica’s replication status to monitor lag and storage consumption, and adjust the delay interval based on disaster recovery requirements and storage constraints.