software failures

Tech Optimizer
October 2, 2024
A financial institution collaborated with AWS to create a solution that enables sub-minute failover between Availability Zones and single-digit minute recovery times across AWS Regions for their wealth management customer portal. The solution utilizes automation for failure detection and failover, along with AWS-managed data replication, specifically employing the Amazon Aurora PostgreSQL-Compatible Edition and Amazon Aurora Global Database for cross-Region replication. Key components include canary outage detection via AWS Lambda, DNS redirection through Amazon Route 53, and control plane resilience using the Amazon Route 53 Application Recovery Controller. The architecture is based on a three-tier model, with in-Region failovers expected to occur in seconds and cross-Region recoveries within minutes. The architecture team aimed to reduce the Recovery Time Objective (RTO) from tens of minutes to seconds and established a Recovery Point Objective (RPO) of under one minute. The failover process prioritizes speed, opting for immediate failover rather than controlled switchover. Testing confirmed that in-Region failover resulted in minimal disruption, while cross-Region failover could be completed in single-digit minutes. Recent enhancements to Amazon RDS Proxy and Aurora PostgreSQL have improved the architecture's efficiency and user experience.
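The canary-driven failover described above can be sketched as a small decision loop. This is a minimal illustration, not the institution's implementation: the real solution runs its canary in AWS Lambda and redirects traffic through Amazon Route 53, whereas the class below models only the decision logic in plain Python. The names `CanaryMonitor`, `failure_threshold`, and `worst_case_detection_seconds`, and the values of 3 failed probes and a 10-second interval, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CanaryMonitor:
    """Tracks canary probe results and decides when to fail over.

    Hypothetical model: after `failure_threshold` consecutive failed
    probes, traffic moves from "primary" to "standby" immediately,
    with no controlled switchover and no automatic failback.
    """
    failure_threshold: int = 3      # assumed value, not from the article
    consecutive_failures: int = 0
    active: str = "primary"

    def record(self, healthy: bool) -> str:
        """Record one probe result and return the currently active target."""
        if healthy:
            # Any successful probe resets the failure streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if (self.consecutive_failures >= self.failure_threshold
                    and self.active == "primary"):
                # In a real deployment this is where DNS would be redirected.
                self.active = "standby"
        return self.active


def worst_case_detection_seconds(probe_interval_s: int, failure_threshold: int) -> int:
    """Upper bound on detection time: every required probe must fail in turn."""
    return probe_interval_s * failure_threshold


monitor = CanaryMonitor()
history = [monitor.record(ok) for ok in (True, False, False, False, True)]
print(history)
print(worst_case_detection_seconds(10, 3))
```

Under these assumed settings, detection alone takes up to 30 seconds, which shows why the probe interval and threshold have to be tuned tightly to reach a sub-minute failover target.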
Winsage
July 20, 2024
The article examines how software failures disrupt sectors such as airlines and hospitals, and highlights the lack of stringent standards for tech companies such as Microsoft and CrowdStrike. It asks why the tech industry is not held to the same accountability as other sectors, and calls for mandatory redundancies and regulation to prevent catastrophic failures in digital infrastructure.