Database maintenance is a hard problem, and nowhere more so than in asset trading, where uptime is paramount. Paxos builds regulated infrastructure that institutions and consumers rely on to manage assets around the clock, so even scheduled downtime has real consequences for the people who depend on our services. Blue-Green upgrades transformed our maintenance strategy and let us tackle these challenges far more effectively.
Implementing Blue-Green upgrades was genuinely complex, especially around temporary roles and Change Data Capture (CDC), but the results were compelling: a 50-fold reduction in downtime and a full month saved in coordination effort. Our customers had pushed us to rethink uptime, asking why annual downtime was necessary at all and whether we could consistently deliver four nines of availability. The honest answer used to be no: we run more than 60 Postgres clusters, and traditional upgrades meant 30 to 120 minutes of downtime each.
Aurora Blue-Green upgrades gave us a solution, but the implementation held surprises. In particular, the CREATE ROLE statements issued by our Vault-based temporary roles for human logins broke every upgrade attempt. After working through the DDL issues and minimizing data loss when replication slots dropped, we arrived at a reusable pattern that kept us online and met customer expectations.
How Blue-Green Works
Unlike traditional dump/reload or pg_upgrade approaches, which require taking the database offline, Blue-Green upgrades use PostgreSQL's logical replication to build a parallel environment. The existing blue cluster keeps serving production traffic while a new green cluster is provisioned on the target version, with logical replication keeping the two in sync in real time. Once green has caught up, switchover takes roughly a minute: writes are briefly paused, the last changes flush across, and traffic is redirected to the green cluster.
For comprehensive technical details, refer to AWS’s Blue-Green documentation.
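To make the mechanics concrete, here is a minimal sketch of driving a deployment with boto3. The cluster ARN, deployment name, engine version, and parameter group below are all placeholders, not our actual setup.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Provision the green environment alongside the running blue cluster.
resp = rds.create_blue_green_deployment(
    BlueGreenDeploymentName="ledger-pg17-upgrade",
    Source="arn:aws:rds:us-east-1:123456789012:cluster:ledger-prod",
    TargetEngineVersion="17.4",
    TargetDBClusterParameterGroupName="aurora-postgresql17-params",
)
deployment_id = resp["BlueGreenDeployment"]["BlueGreenDeploymentIdentifier"]

# ... poll until the deployment is available, i.e. green has caught up ...

# Switch over: writes pause briefly, green finishes syncing, traffic moves.
rds.switchover_blue_green_deployment(
    BlueGreenDeploymentIdentifier=deployment_id,
    SwitchoverTimeout=300,  # abort if switchover can't complete in 5 minutes
)
```

In practice you would poll `describe_blue_green_deployments` between the two calls and only trigger the switchover once the deployment reports a healthy, caught-up state.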
Challenges
Blue-Green upgrades are a major step forward, but they are not without hurdles. These are the key challenges we faced:
The Ephemeral Roles Trap
The first staging upgrade failed with an error saying DDL could not be replicated. I knew schema changes were prohibited during a Blue-Green upgrade, but the Postgres logs pointed somewhere unexpected: a CREATE ROLE statement for a human logging into the database. We use Vault for temporary database credentials, so every login creates a short-lived role. That is DDL, and it broke the upgrade.
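For context on why a login turns into DDL: a Vault database secrets engine role carries creation statements that Vault runs against Postgres on every credential request. A simplified sketch of such a role using hvac, with illustrative names and TTLs rather than our production config:

```python
import hvac

client = hvac.Client(url="https://vault.example.internal:8200")

# Every credential request against this role runs these statements on
# Postgres. CREATE ROLE is DDL, which Blue-Green replication refuses
# to replicate, so every human login breaks the upgrade.
client.secrets.database.create_role(
    name="readonly-human",
    db_name="ledger-prod",
    creation_statements=[
        "CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' "
        "VALID UNTIL '{{expiration}}';",
        "GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";",
    ],
    default_ttl="1h",
    max_ttl="8h",
)
```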
That discovery signaled a painful road ahead. We had to disable Vault-based role management during upgrade windows and get much stricter about access, and even then we hit retries across multiple clusters. If you use Vault or any dynamic role management system, solve this before starting your upgrade project. One way to pause role creation is sketched below.
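Assuming the role definition sketched above, one illustrative approach (not necessarily how our tooling does it) is to save the Vault role, delete it so no new credentials and therefore no new CREATE ROLE statements can be issued, and restore it after switchover:

```python
import hvac

client = hvac.Client(url="https://vault.example.internal:8200")

# Before the upgrade window: stash the role definition, then delete it
# so Vault cannot issue new credentials (and thus no new DDL).
role = client.secrets.database.read_role(name="readonly-human")["data"]
client.secrets.database.delete_role(name="readonly-human")

# Caveat: lease expiry also runs DDL (typically DROP ROLE), so existing
# leases should be long enough to outlive the upgrade window.

# ... run the Blue-Green upgrade and switchover ...

# After switchover: restore the role from the saved definition.
client.secrets.database.create_role(
    name="readonly-human",
    db_name=role["db_name"],
    creation_statements=role["creation_statements"],
    default_ttl=role["default_ttl"],
    max_ttl=role["max_ttl"],
)
```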
Replication Slots Must Drop (Pre-PG17)
The hardest problem: before PostgreSQL 17, a Blue-Green upgrade requires dropping logical replication slots, and for us that meant data loss from a CDC perspective. We run Debezium-based CDC for two essential functions: replicating data to our warehouse and powering event-driven workflows triggered by database writes. Dropping the slots loses every event in the gap, which forced per-table, per-use-case backfill strategies that took anywhere from 3 to 30 hours to design and implement.
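Since the slots have to go, the most useful preparation is to bound the gap precisely. Here is a sketch of recording each logical slot's acknowledged position just before the upgrade, so every backfill has an exact starting point (connection details are placeholders):

```python
import psycopg2

# Connect to the blue cluster immediately before the upgrade starts.
conn = psycopg2.connect("host=ledger-prod.example.internal dbname=ledger")

with conn, conn.cursor() as cur:
    # confirmed_flush_lsn is the last position the consumer (Debezium)
    # acknowledged; everything after it is the gap to backfill.
    cur.execute(
        """
        SELECT slot_name, plugin, confirmed_flush_lsn
        FROM pg_replication_slots
        WHERE slot_type = 'logical'
        """
    )
    for slot_name, plugin, flush_lsn in cur.fetchall():
        print(f"{slot_name} ({plugin}): gap starts at {flush_lsn}")
```

The end of the gap is wherever the re-created Debezium connector begins streaming on the green cluster.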
For a similar story, read about InstantDB's experience with replication slots during their Postgres upgrade, which pushed them toward a different approach entirely. We chose to stay with Blue-Green and absorb the backfill cost, but this remains the biggest caveat.
IAM Cluster IDs Change
When the green cluster becomes primary, it gets a new cluster ID. If you use RDS IAM authentication, every client needs updating, roughly one to two hours of work per cluster. That is manageable in isolation but adds up quickly across 60+ clusters.
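For illustration: the token call itself is keyed on hostname and keeps working across the switchover; what typically has to change is anything that embeds the cluster's resource ID, such as an rds-db:connect IAM policy. All ARNs, hostnames, and user names below are placeholders:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Token generation is keyed on hostname, not cluster ID, so this call
# is unaffected by the switchover.
token = rds.generate_db_auth_token(
    DBHostname="ledger-prod.cluster-xyz.us-east-1.rds.amazonaws.com",
    Port=5432,
    DBUsername="app_user",
)

# What does have to change is any IAM policy resource that embeds the
# cluster resource ID, e.g.:
#   arn:aws:rds-db:us-east-1:123456789012:dbuser:cluster-OLDID/app_user
# After switchover, cluster-OLDID must be replaced with the green
# cluster's new ID in every such policy.
```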
The Results
Downtime dropped from 30-120 minutes per cluster to roughly one minute, a 50x improvement. More importantly, it changed what we can promise: maintaining 99.9% monthly uptime through maintenance used to be a struggle, and now we can perform upgrades without compromising our 99.99% uptime service level objectives on most products.
Customer coordination changed just as much. Extended downtimes meant weeks of planning: meetings to review contingency plans, extensive documentation, the works. Sub-five-minute downtimes reduce all of that to a simple notification, saving us at least a month of coordination work across our clusters.
What’s Coming: PostgreSQL 17
PostgreSQL 17 removes our biggest remaining pain point. Starting with upgrades from PostgreSQL 17, logical replication slots no longer need to be dropped, preserving CDC continuity through future upgrades. Our upgrades to PG17 still required dropping slots, but once you are on 17 the path forward is much smoother. If you depend on replication-slot-based CDC, that is a strong reason to prioritize getting to PostgreSQL 17.
What I’d Tell Someone Starting This
Two things I wish I had known at the outset:
- Address Vault auth first. At minimum, have a clean way to disable dynamic role creation during upgrade windows. The DDL sensitivity is documented, but the implication for Vault-based authentication is easy to miss until your first upgrade fails.
- Advocate for replication slot support with AWS. Being able to keep CDC running from a reader during the upgrade window would eliminate the backfill problem entirely. If enough customers ask for it, it may get built.
While infrastructure reliability may not be glamorous, it is essential for fostering the trust that underpins regulated digital assets. If you are navigating similar challenges with Postgres upgrades at scale, I would be keen to hear about your experiences. Feel free to connect with me on LinkedIn.