How a large financial AWS customer implemented high availability and fast disaster recovery for Amazon Aurora PostgreSQL using Global Database and Amazon RDS Proxy

In a recent collaboration with AWS, a prominent financial institution successfully engineered a solution that achieves sub-minute failover between Availability Zones and single-digit minute recovery times across AWS Regions. This initiative was aimed at enhancing high availability (HA) and disaster recovery (DR) for their wealth management customer portal, a critical application where downtime translates directly to lost business. The design objectives included minimizing failover times and reducing human error during such processes.

The solution relied on automation for failure detection and failover, together with AWS-managed data replication, and it had to remain resilient to potential AWS control plane outages. For the data tier, the team chose Amazon Aurora PostgreSQL-Compatible Edition with Amazon Aurora Global Database, which provides managed, scalable cross-Region replication.

The architecture incorporated several key components: canary outage detection via AWS Lambda, DNS redirection facilitated by Amazon Route 53, and control plane resilience through the Amazon Route 53 Application Recovery Controller. This design effectively eliminated the need for human intervention, which often prolongs recovery efforts.

The architecture is built on a classic three-tier model, consisting of a web frontend, an application logic middle tier, and a backend database. Given the critical nature of the wealth management platform, the architecture was designed to ensure minimal downtime, with in-Region failovers expected to occur in seconds and cross-Region recoveries within minutes. The primary operations are conducted in the US East (N. Virginia) Region, with a backup instance located in US East (Ohio).

The design process

In 2021, the customer’s architecture team set an ambitious goal to reduce the platform’s Recovery Time Objective (RTO) from tens of minutes to mere seconds in response to software failures or large-scale events. They also established a Recovery Point Objective (RPO) of under one minute for user data, ensuring that only the most recent changes would be at risk during a failover. The resilience requirements were clearly defined:

Less than a minute for in-Region failover
5-minute RTO and 15-minute RPO for cross-Region recovery
A reusable design applicable to other Tier 1 workloads

To achieve these objectives, the customer collaborated closely with the AWS Solutions Architecture team to design, build, and rigorously test the solution. Traditionally, AWS customers would manually initiate a Regional failover, which could take several minutes. Recognizing the inadequacy of this approach, the team focused on fully automating data replication, failure detection, and failover processes.

Cross-Region data replication

The initial exploration of Amazon DynamoDB global tables for data availability across Regions was deemed unsuitable due to the wealth management application’s reliance on a highly normalized relational schema. Instead, the team selected Aurora PostgreSQL global database, which creates a read replica in a secondary Region and continuously replicates data changes. This setup allows for swift promotion of any reader Region to a writer in the event of a failure or large-scale incident.

The failover process was designed to prioritize speed, opting for an immediate Failover approach rather than a controlled Switchover, which could introduce unnecessary delays. This decision was crucial for minimizing downtime during unexpected outages.
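The difference between the two modes can be sketched with the RDS `FailoverGlobalCluster` operation as exposed by boto3. This is only an illustration under assumptions: the article does not say which API call the customer's automation used, and the cluster identifiers and function names below are hypothetical placeholders. The client is passed in so the logic can be exercised without AWS credentials.

```python
from typing import Any, Dict


def build_failover_request(global_cluster_id: str, target_cluster_arn: str,
                           planned: bool = False) -> Dict[str, Any]:
    """Build keyword arguments for the RDS FailoverGlobalCluster operation.

    planned=False -> unplanned Failover (AllowDataLoss=True): promote right away.
    planned=True  -> Switchover: wait for the secondary to catch up first.
    """
    request: Dict[str, Any] = {
        "GlobalClusterIdentifier": global_cluster_id,
        # ARN of the secondary cluster that should become the writer.
        "TargetDbClusterIdentifier": target_cluster_arn,
    }
    if planned:
        request["Switchover"] = True
    else:
        request["AllowDataLoss"] = True
    return request


def promote_secondary(rds_client: Any, global_cluster_id: str,
                      target_cluster_arn: str) -> Dict[str, Any]:
    """Trigger an unplanned failover to the given secondary cluster.

    rds_client is expected to behave like boto3.client("rds") created in the
    Region that should take over (an assumption of this sketch).
    """
    request = build_failover_request(global_cluster_id, target_cluster_arn)
    return rds_client.failover_global_cluster(**request)
```

In a real deployment the call would look something like `promote_secondary(boto3.client("rds", region_name="us-east-2"), "my-global-cluster", secondary_arn)`, where both identifiers are placeholders.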

In-Region high availability

To bolster uptime, the architecture included an Aurora PostgreSQL replica in a second Availability Zone within each Region. These replicas are kept in sync with the primary writer instance and can take over within seconds if the writer fails. Additionally, Amazon RDS Proxy was placed in front of the database to manage incoming connections, keeping the SQL endpoint available to the application while a failover is in progress.
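Even with a proxy in front of the database, a client may see a few failed statements while the writer changes. A common complement is a short retry loop with capped exponential backoff. The following is a generic sketch, not the customer's code: `run_with_retry` and its parameters are hypothetical, and `ConnectionError` stands in for whatever exception the actual database driver raises.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(operation: Callable[[], T],
                   attempts: int = 5,
                   base_delay: float = 0.2,
                   max_delay: float = 2.0,
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying on ConnectionError with jittered backoff.

    The delay before retry n is at most min(max_delay, base_delay * 2**n).
    The last error is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter keeps many clients from retrying in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")
```

The `operation` callable would typically open a connection to the proxy endpoint and run one unit of work, so that each retry starts from a fresh connection.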

Application layer failover

The application layer was designed to be stateless, allowing requests to be routed to any instance without affecting user experience. A Route 53 CNAME was created to serve as the global DNS entry for the application, enabling traffic distribution across Application Load Balancers (ALBs) in both Regions. This configuration allowed for real-time updates to request routing based on the health of the Regions.
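One way such a DNS entry could be expressed is as a pair of health-checked CNAME records. The sketch below builds a Route 53 change batch as a plain dictionary. It assumes a failover routing policy, which the article does not confirm, and every name, hostname, and health check ID in it is a hypothetical placeholder.

```python
from typing import Any, Dict


def failover_record(name: str, role: str, target: str,
                    health_check_id: str, ttl: int = 30) -> Dict[str, Any]:
    """Build one UPSERT change for a failover CNAME record.

    role must be "PRIMARY" or "SECONDARY" (Route 53 failover routing values).
    """
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-region",
            "Failover": role,
            "TTL": ttl,  # a short TTL lets clients pick up the change quickly
            "ResourceRecords": [{"Value": target}],
            "HealthCheckId": health_check_id,
        },
    }


def build_change_batch(name: str, primary_alb: str, secondary_alb: str,
                       primary_check: str, secondary_check: str) -> Dict[str, Any]:
    """Build a change batch holding the primary and secondary records."""
    return {
        "Comment": "Global entry for the application (illustrative)",
        "Changes": [
            failover_record(name, "PRIMARY", primary_alb, primary_check),
            failover_record(name, "SECONDARY", secondary_alb, secondary_check),
        ],
    }
```

The resulting dictionary has the shape expected by the `ChangeBatch` parameter of the Route 53 `change_resource_record_sets` call in boto3. The health checks referenced by each record are what allow routing to follow the health of a Region.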

To address the RTO requirements, the team implemented a canary Lambda function that continuously tests the application's components so that failures are detected quickly. Running every 10 seconds keeps detection time short, and the canary is designed to limit false positives so that a transient error does not trigger an unnecessary failover.
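The core of such a canary can be sketched in a few lines. The 10-second interval comes from the description above; the rule of declaring an outage only after three consecutive failed probes is an assumption added here to illustrate one way of limiting false positives, and all names in the code are hypothetical.

```python
import http.client
import urllib.request
from typing import Callable


class FailureTracker:
    """Counts consecutive failed probes and flags an outage at a threshold."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold  # assumed value, not from the article
        self.consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True if an outage is declared."""
        if healthy:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


def probe_http(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        return False


def run_canary(probe: Callable[[], bool], tracker: FailureTracker,
               on_outage: Callable[[], None], checks: int,
               wait: Callable[[], None] = lambda: None) -> bool:
    """Run up to `checks` probes; call on_outage once if the threshold is hit."""
    for _ in range(checks):
        if tracker.record(probe()):
            # In a real setup this hook might flip a routing control, e.g. via
            # the route53-recovery-cluster update_routing_control_state call.
            on_outage()
            return True
        wait()  # e.g. sleep 10 seconds between probes
    return False
```

With a 10-second interval and a threshold of three, an outage would be declared roughly 20 to 30 seconds after the first failure begins, which is the kind of detection budget a multi-minute RTO has to accommodate.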

Testing results

Testing scenarios demonstrated the effectiveness of the failover mechanisms. In-Region failover tests showed that users experienced minimal disruption, with errors occurring for only a few seconds when using RDS Proxy. In contrast, direct connections to the database resulted in longer error durations. Cross-Region failover tests confirmed that the entire process, including DNS redirection, could be completed in single-digit minutes.

Enhancing the solution with Global Database failover

Recent enhancements to Amazon RDS Proxy and Aurora PostgreSQL have further simplified the architecture. RDS Proxy now supports secondary Regions, which allows database connections to be established even before the failover completes, improving user experience and reducing potential SQL errors. Additionally, the managed unplanned failover feature for Aurora PostgreSQL automates the re-synchronization of instances after a failover, streamlining the recovery process.
