Human errors can significantly threaten business continuity, with a single incorrect DELETE statement or mismanaged application deployment potentially corrupting critical business data. Amazon Relational Database Service (Amazon RDS) provides automated backups and transaction log backups to mitigate these risks. Traditional recovery methods often involve creating new database instances and executing point-in-time recovery, which can take hours for large databases.
AWS has introduced delayed read replicas for Amazon RDS for PostgreSQL, which maintain a standby replica that intentionally lags behind the primary database by a configurable time interval. This allows users to identify data corruption on the production instance and promote the replica before problematic operations are executed. Promoting the delayed replica to become the new primary cluster can be done within minutes. The feature can be enabled using the recovery_min_apply_delay parameter in specific PostgreSQL versions.
Delayed replicas address three main use cases: preventing accidental data modifications, protecting against logical errors in applications, and enabling auditing and forensic analysis of data changes. They act as an "undo buffer," complementing disaster recovery strategies by providing a real-time point-in-time recovery mechanism.
To set up delayed replication, the recovery_min_apply_delay parameter must be configured. The process involves creating a custom database parameter group, modifying it to set the desired delay, and applying it to the read replica instance. The parameter is static, requiring a reboot to take effect.
The delayed replication feature also includes two recovery functions: pg_wal_replay_pause() to pause the recovery process and pg_wal_replay_resume() to resume it. If WAL replay is paused, it must be resumed to prevent excessive storage consumption.
In a recovery scenario, if a user accidentally drops a logical database, the delayed replica provides a crucial opportunity to implement a recovery plan before the DROP statement propagates. The recovery process involves pausing WAL replay, capturing replica metrics, setting recovery target parameters, and promoting the read replica to become the new primary.
Best practices for implementing delayed replication include comprehensive monitoring of storage space, enabling storage auto-scaling, managing WAL log accumulation, and regularly reviewing the delayed replica’s replication status.