How Wiz achieved near-zero downtime for Amazon Aurora PostgreSQL major version upgrades at scale using Aurora Blue/Green Deployments

August 12, 2025

This is a guest post by Sagi Tsofan, Senior Software Engineer at Wiz, in partnership with AWS.

Wiz, a leader in cloud security, identifies and mitigates risks across the major cloud platforms. The company’s agentless scanner processes tens of billions of cloud resource metadata entries every day. This demand for high throughput and low-latency processing makes Amazon Aurora PostgreSQL-Compatible Edition a vital component of Wiz’s architecture, serving hundreds of microservices at scale.

The decision to upgrade from PostgreSQL 14 to 16 was motivated by the substantial performance enhancements in the newer version, including improved parallel query execution, faster aggregations, and up to a 300% improvement in COPY performance. Version 16 also introduced enhanced security features, smarter query planning, and reduced CPU and I/O usage, making it a compelling upgrade for our infrastructure.

In this post, we share our experience upgrading our Aurora PostgreSQL databases from version 14 to 16 with near-zero downtime using Amazon Aurora Blue/Green Deployments.

Challenges with minimizing downtime for Aurora PostgreSQL database major version upgrades

At Wiz, we manage numerous production Aurora PostgreSQL clusters and have encountered significant challenges during major version upgrades. Traditionally, these upgrades, which use pg_upgrade internally, resulted in around one hour of downtime in our environment, largely due to the substantial size and heavy workloads of our clusters. An hour of downtime was unacceptable for customers who depend on continuous service, so we sought a more effective solution, ultimately reducing downtime from approximately one hour to just 30 seconds.

Solution overview

To address these downtime challenges, we developed our own DB Upgrade Pilot, an automated, eight-step flow based on Aurora Blue/Green Deployments and enhanced with custom features. Aurora Blue/Green Deployments copy a production database to a synchronized staging environment, where changes such as version upgrades can be made without impacting production, and the eventual promotion typically completes in under one minute. Our solution builds on this foundation with fully automated validation steps, enhanced synchronization monitoring, and end-to-end orchestration that runs the entire upgrade process across our diverse Aurora PostgreSQL fleet. Designed for minimal downtime, maximum operational safety, and low resource overhead, it creates a mirrored production environment, ensures stable synchronization, and performs a sub-minute switchover, all initiated by a single command. The following diagram illustrates the automated flow.

Overall Automated Flow

Here, we outline the high-level steps of the DB Upgrade Pilot, with a deeper exploration of each step to follow.

  1. Prerequisites – Takes a cluster snapshot, configures necessary database parameters such as logical replication and DDL logging, and temporarily deactivates specific extensions.
  2. Provisioning and Upgrade – Creates the Aurora Blue/Green Deployment and provisions a synchronized replica (green) of the production database (blue), targeting PostgreSQL 16.
  3. Post-Provisioning – Executes VACUUM ANALYZE on green cluster tables to optimize performance and refresh database statistics after the upgrade.
  4. Replicating Data – Tracks replication lag through CloudWatch metrics and performs data verification health checks between environments.
  5. Switchover – Validates zero replication lag, halts long-running transactions on the blue cluster, and performs the actual switchover from blue to green using Aurora Blue/Green Deployments.
  6. Post-Switchover – Re-enables and upgrades extensions, then resumes data pipeline ingestion after confirming compatibility.
  7. Monitoring – Maintains real-time visibility throughout the process, tracking replication lag, system metrics, and the duration of each step.
  8. Cleanup – Decommissions the old infrastructure by removing deletion protection and deleting the Aurora Blue/Green Deployment.

Prerequisites

In the initial step of the automation flow, DB Upgrade Pilot prepares the necessary components and configurations (a minimal sketch of these calls follows the list):

  1. Create a cluster snapshot before proceeding to enable rollback capability if issues arise during the process.
  2. Enable rds.logical_replication within the database parameters, which is required for the logical replication utilized by Aurora Blue/Green Deployments.
  3. Set the log_statement parameter to ddl to record all schema-changing operations in the database logs. This aids in identifying DDL operations that might cause replication failures during execution.
    Important: Be aware that this setting may log sensitive information, including passwords in plain text, in your log files.
  4. Deactivate extensions that could interfere with or disrupt logical replication, such as pg_cron and pg_partman.
  5. Because of the parameter group changes, validate that the writer instance reports its parameters as in-sync and the new configuration has been applied, rebooting the instance if necessary.
  6. Temporarily pause data ingestion pipelines to minimize writes and Write-Ahead Log (WAL) generation.
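
The following minimal sketch uses boto3 to illustrate how the snapshot and parameter changes in steps 1 through 3 could be automated; the snapshot name, cluster identifier, and parameter group name are hypothetical placeholders, and DB Upgrade Pilot implements the equivalent logic internally.

import boto3

rds = boto3.client("rds")

# Step 1: snapshot the blue cluster so we can roll back if anything goes wrong.
rds.create_db_cluster_snapshot(
    DBClusterSnapshotIdentifier="pre-upgrade-db-cluster-prod",  # hypothetical name
    DBClusterIdentifier="db-cluster-prod",                      # hypothetical identifier
)

# Steps 2 and 3: enable logical replication (static parameter, applied on reboot)
# and log all DDL statements.
rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="db-cluster-prod-pg14",         # hypothetical name
    Parameters=[
        {"ParameterName": "rds.logical_replication", "ParameterValue": "1",
         "ApplyMethod": "pending-reboot"},
        {"ParameterName": "log_statement", "ParameterValue": "ddl",
         "ApplyMethod": "immediate"},
    ],
)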

The following diagram illustrates DB Upgrade Pilot in Amazon Elastic Kubernetes Service (Amazon EKS) connecting to Amazon Aurora PostgreSQL 14 to apply the necessary prerequisites.

Provisioning and Upgrade

Next, we create a new Aurora Blue/Green Deployment, targeting PostgreSQL version 16. This deployment provisions a synchronized replica (green) of the production (blue) database, allowing updates to the green database without impacting the original blue database. Caution is essential during this process, as improper changes to the green environment can disrupt replication, potentially necessitating a full reprovision. Once changes are complete and replication has fully caught up, a switchover to the green environment can occur with minimal downtime.
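
A minimal sketch of this step with the boto3 RDS API follows; the deployment name, source cluster ARN, target engine version, and target parameter group are placeholders that depend on your environment.

import boto3

rds = boto3.client("rds")

# Create the Blue/Green Deployment and ask Aurora to upgrade the green
# environment to PostgreSQL 16 with a parameter group built for that version.
response = rds.create_blue_green_deployment(
    BlueGreenDeploymentName="upgrade-db-cluster-prod-to-16",              # hypothetical
    Source="arn:aws:rds:us-east-1:123456789012:cluster:db-cluster-prod",  # hypothetical ARN
    TargetEngineVersion="16.4",                                           # example target version
    TargetDBClusterParameterGroupName="db-cluster-prod-pg16",             # hypothetical
)
deployment_id = response["BlueGreenDeployment"]["BlueGreenDeploymentIdentifier"]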

Throughout the provisioning phase, the blue cluster continues to serve our customers without interruption. The following diagram depicts DB Upgrade Pilot triggering the Aurora Blue/Green Deployment and the WALs being replicated.

The accompanying screenshot of the Aurora console displays the Aurora Blue/Green Deployment with the green cluster being upgraded to PostgreSQL version 16.

Post-Provisioning

After the upgrade, DB Upgrade Pilot connects to the green cluster and executes a VACUUM ANALYZE on the tables to optimize performance and refresh database statistics. Since pg_upgrade does not transfer table statistics during the upgrade process, recalculating them is strongly recommended to ensure optimal performance from the outset. This step should be performed before making the green database available for application traffic.
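
As an illustration, a psycopg2 sketch of this step might look like the following; the connection string and table names are hypothetical, and DB Upgrade Pilot runs the equivalent commands automatically against the green endpoint.

import psycopg2

# Connect directly to the green cluster endpoint (placeholder DSN).
conn = psycopg2.connect("host=green-cluster.example.internal dbname=app user=admin")
conn.autocommit = True  # VACUUM cannot run inside a transaction block

with conn.cursor() as cur:
    for table in ("public.assets", "public.findings"):  # hypothetical key tables
        # Reclaim dead tuples and rebuild the planner statistics that pg_upgrade
        # does not carry over.
        cur.execute(f"VACUUM (ANALYZE, VERBOSE) {table};")

conn.close()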

The subsequent screenshot illustrates Amazon CloudWatch Database Insights details of the green cluster tables being vacuumed after provisioning.

Replicating Data

DB Upgrade Pilot continuously monitors the replication lag prior to the final cutover via the Amazon CloudWatch OldestReplicationSlotLag metric. The following SQL query lists the logical replication slots and calculates lsn_distance, the amount of WAL data that the subscriber has not yet confirmed as flushed (the replication lag):

SELECT
    slot_name,
    confirmed_flush_lsn AS flushed,
    pg_current_wal_lsn() AS current_lsn,
    (pg_current_wal_lsn() - confirmed_flush_lsn) AS lsn_distance
FROM
    pg_catalog.pg_replication_slots
WHERE
    slot_type = 'logical';

The accompanying screenshot shows sample output for the SQL query, where the lsn_distance of 15624 for the second slot indicates a lag in replication for that particular slot, while the other two slots (with distance 0) are fully caught up.

Before the switchover, DB Upgrade Pilot automatically conducts essential sanity checks to confirm the green environment’s readiness. This includes comprehensive data verification (row counts, key business data) and rigorous database statistics validation for optimal query performance and planning.
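
A simplified version of the row-count portion of these checks is sketched below; the endpoints, credentials, and table list are placeholders, and in practice the comparison runs only once replication lag has reached zero.

import psycopg2

def row_counts(dsn, tables):
    # Collect exact row counts for a small set of key tables.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        counts = {}
        for table in tables:
            cur.execute(f"SELECT count(*) FROM {table};")
            counts[table] = cur.fetchone()[0]
        return counts

tables = ["public.assets", "public.findings"]  # hypothetical key tables
blue = row_counts("host=blue-cluster.example.internal dbname=app user=readonly", tables)
green = row_counts("host=green-cluster.example.internal dbname=app user=readonly", tables)

mismatches = {t: (blue[t], green[t]) for t in tables if blue[t] != green[t]}
if mismatches:
    raise RuntimeError(f"Green environment is not in sync with blue: {mismatches}")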

Switchover

This critical phase commences with the confirmation that the green environment is fully synchronized with the blue environment (zero replication lag). If synchronization is incomplete, the operation fails and automatically retries. The DB Upgrade Pilot then identifies and halts long-running transactions on the blue environment to prevent switchover delays, as the process must wait for all active queries to complete. Following the switchover, the automation automatically resubmits any interrupted transactions to the green environment to ensure data continuity.

The accompanying screenshot illustrates the Amazon CloudWatch OldestReplicationSlotLag metric, which tracks the amount of WAL data on the source that has not yet been received by the most lagging replica. The switchover is triggered only when this lag reaches zero, a check integrated into our automation.
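
As a rough illustration, the gate on this metric could be implemented as below; the writer instance identifier, polling interval, and lookback window are assumptions, and our automation applies additional retry and timeout logic around it.

import time
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

def oldest_replication_slot_lag(instance_id):
    # Return the most recent OldestReplicationSlotLag datapoint for the blue writer.
    now = datetime.datetime.now(datetime.timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="OldestReplicationSlotLag",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=now - datetime.timedelta(minutes=5),
        EndTime=now,
        Period=60,
        Statistics=["Maximum"],
    )
    datapoints = sorted(stats["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Maximum"] if datapoints else None

# Block until the most lagging replication slot has fully caught up.
while oldest_replication_slot_lag("db-cluster-prod-instance-1") != 0:  # hypothetical instance ID
    time.sleep(30)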

The switchover is initiated with a 5-minute timeout; if this window is exceeded, the process automatically reverts, preserving the blue environment in its original state. Upon a successful switchover, production traffic is immediately redirected from the blue cluster to the new green cluster, with downtime of just 30 seconds. Aurora Blue/Green Deployments keep the same database endpoints when redirecting traffic from blue to green, so no application-side endpoint changes are necessary.
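
The switchover itself maps to a single RDS API call, sketched below; the deployment identifier is a placeholder for the value returned when the Blue/Green Deployment was created.

import boto3

rds = boto3.client("rds")

deployment_id = "bgd-0123456789abcdef"  # hypothetical Blue/Green Deployment identifier

# Ask Aurora to switch over, allowing at most 5 minutes (300 seconds); if the
# timeout is exceeded, the switchover is rolled back and blue keeps serving traffic.
rds.switchover_blue_green_deployment(
    BlueGreenDeploymentIdentifier=deployment_id,
    SwitchoverTimeout=300,
)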

Post-Switchover

After confirming that traffic is successfully routing to the new PostgreSQL 16 cluster, we proceed to upgrade and re-enable the pg_cron and pg_partman extensions. Subsequently, we test these extensions to verify their compatibility with PostgreSQL 16.
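
Assuming the extensions were dropped during the prerequisites step, re-enabling and upgrading them reduces to a few statements, sketched here with psycopg2 against the now-production endpoint (the DSN is a placeholder):

import psycopg2

conn = psycopg2.connect("host=db-cluster-prod.example.internal dbname=app user=admin")
conn.autocommit = True

with conn.cursor() as cur:
    for extension in ("pg_cron", "pg_partman"):
        # Recreate the extension if it was removed, then move it to the latest
        # version shipped with the PostgreSQL 16 engine.
        cur.execute(f"CREATE EXTENSION IF NOT EXISTS {extension};")
        cur.execute(f"ALTER EXTENSION {extension} UPDATE;")

conn.close()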

Monitoring

We developed a Grafana dashboard to display real-time metrics of the system, as illustrated in the accompanying screenshot. This dashboard is part of DB Upgrade Pilot and is specifically designed for this upgrade flow. It tracks critical metrics, including replication lag between clusters and data pipeline ingestion rates, while measuring the duration of each migration step. This monitoring capability enables our team to swiftly identify any issues during the upgrade process and provides detailed visibility into the migration progress.

Here’s a breakdown of each section in the screenshot:

  • Current phase (upper center) – Displays the current step of the DB Upgrade Pilot.
  • Backlogs (upper left) – Monitors our ingestion pipeline, which we pause during the upgrade to minimize writes and WAL generation.
  • Replications (upper right) – Shows replication lag, which we expect to sharply decrease after the upgrade. If not, it signals a problem, necessitating a potential revert and retry.
  • Runner Configuration (lower left) – Displays the parameters used to trigger the upgrade, such as target version, source cluster, whether to run a vacuum, and other custom settings.
  • Finished Steps (lower right) – Lists completed upgrade steps and their durations to help track progress and identify bottlenecks.
  • Total Downtime (lower left) – Measures the cluster’s unavailability during the switchover, critical for assessing customer impact.
  • DDL Changes – Captures any schema changes during the upgrade that could disrupt replication and necessitate rollback.

Cleanup

In the final step of the workflow, we decommission the old infrastructure. This process involves removing deletion protection and deleting the Aurora Blue/Green Deployment associated with the former blue cluster, resulting in a clean and cost-effective transition.
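
A sketch of this teardown with boto3 follows; the renamed blue cluster identifier and the deployment identifier are placeholders for the values our automation tracks during the flow.

import boto3

rds = boto3.client("rds")

old_blue_cluster = "db-cluster-prod-old1"  # hypothetical identifier of the retired blue cluster
deployment_id = "bgd-0123456789abcdef"     # hypothetical Blue/Green Deployment identifier

# Allow the retired blue cluster to be deleted later.
rds.modify_db_cluster(
    DBClusterIdentifier=old_blue_cluster,
    DeletionProtection=False,
    ApplyImmediately=True,
)

# Remove the Blue/Green Deployment object itself; the green cluster, now serving
# production, is left untouched. The old blue instances and cluster can then be
# deleted separately.
rds.delete_blue_green_deployment(BlueGreenDeploymentIdentifier=deployment_id)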

Production traffic is now fully operational on the upgraded PostgreSQL 16 instance, while the old PostgreSQL 14 clusters are gracefully retired, leaving behind a streamlined, high-performance database infrastructure. This achievement marks a significant milestone in our database management capabilities, establishing a new standard for Aurora PostgreSQL database major version upgrades.

Safety measures

The integrity and stability of our production environment were paramount throughout this automated upgrade journey. DB Upgrade Pilot is equipped with multiple layers of comprehensive safety measures to mitigate risks, including:

  • Automatic rollback capability – Our system automatically reverts to a stable state if issues (such as replication lag not catching up, DDL changes that disrupt replication, or Blue/Green failures) are detected prior to switchover. Automatic rollback is supported only until the switchover step, not after traffic has shifted.
  • Robust steps validation – Each upgrade step includes built-in validations to ensure prerequisites are met and the environment is healthy before proceeding.
  • Idempotency – DB Upgrade Pilot can be safely rerun if a step fails or is interrupted, preventing unintended side effects and enhancing reliability.

Challenges and lessons learned

While our solution delivers impressive results, no complex process is without its hurdles. During development, we encountered several challenges that provided invaluable lessons:

  • Testing – Thorough, iterative validation across all scenarios and workloads (lab, staging, dogfooding, production simulations) proved essential. With hundreds of production database clusters of varying sizes (4xl, 8xl, 24xl) and unique workloads, comprehensive testing helped us refine our tooling and uncover critical edge cases. One significant finding led to the creation of our “idle transaction purger”: during Blue/Green Deployments in PostgreSQL, replication slot creation can fail if idle transactions block consistent snapshots. Our purger continuously detects and terminates these idle transactions, ensuring reliable slot creation (see the sketch after this list).
  • Replication lag monitoring – We learned the importance of not switching over with high replication lag, as this increases downtime. We implemented safeguards to achieve minimal lag before switchover; high lag can render both environments unresponsive while replication catches up. For further insights, refer to Aurora Blue/Green Switchover best practices.
  • Schema change impact – Certain operations, such as materialized view refreshes, can disrupt logical replication, necessitating a complete rerun of the process. This realization led us to log DDL statements for vital visibility, enabling troubleshooting of replication failures. Close monitoring of the Grafana dashboard status is paramount.
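
For reference, the core of an idle transaction purger can be as small as the sketch below; the connection string and the five-minute threshold are illustrative, and our implementation runs continuously alongside the provisioning step with additional safeguards.

import time
import psycopg2

# Terminate sessions that sit idle inside an open transaction, which can block the
# consistent snapshot needed to create logical replication slots.
PURGE_SQL = """
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > interval '5 minutes'
  AND pid <> pg_backend_pid();
"""

conn = psycopg2.connect("host=blue-cluster.example.internal dbname=app user=admin")
conn.autocommit = True

while True:
    with conn.cursor() as cur:
        cur.execute(PURGE_SQL)
    time.sleep(60)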

Results and future steps

Our automated approach has reduced average downtime from approximately one hour to just 30 seconds per cluster, with minimal overhead, providing near-zero business disruption during Aurora PostgreSQL major version upgrades. Looking ahead, we plan to expand the system’s capabilities to support future major version upgrades, add support for VACUUM FULL operations (which otherwise incur downtime), and handle other critical maintenance tasks with minimal customer impact.

