Netflix Automates RDS PostgreSQL to Aurora PostgreSQL Migration Across Nearly 400 Production Clusters

Netflix has built an internal automation platform that migrates Amazon RDS for PostgreSQL databases to Amazon Aurora PostgreSQL, reducing operational risk and downtime across nearly 400 production clusters. Service teams start a migration through a self-service workflow, and the platform enforces the required steps: replication validation, controlled cutover, change data capture coordination, and rollback safeguards.

Database Access Management

Netflix routes database access through a platform-managed data access layer built on Envoy. The layer standardizes mutual TLS and abstracts database endpoints from application code, so services do not handle credentials or connection strings directly and a migration can take place beneath the layer without application changes. The automation therefore coordinates replication, validation, cutover, change data capture (CDC) handling, and rollback entirely at the infrastructure level.

As the Netflix engineers put it: "Our goal was to make RDS to Aurora migrations repeatable and low-touch, while preserving correctness guarantees for both transactional workloads and CDC pipelines."

Migration Workflow

The migration begins with the creation of an Aurora PostgreSQL cluster that acts as a physical read replica of the source RDS PostgreSQL instance, a capability provided by Amazon Web Services. The replica is initialized from a storage snapshot and continuously replays write-ahead log (WAL) records streamed from the source. During this phase, the system validates replication slot health, WAL generation rates, parameter compatibility, extension parity, and sustained replication lag under production traffic. These checks confirm that the replica can keep up with peak write throughput before cutover.
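The "sustained replication lag" check can be illustrated with a minimal Python sketch. It is not Netflix's implementation: the `LagSample` shape, the function name, and the threshold and window values are assumptions chosen for illustration. The idea is that a single low reading is not enough; lag must stay under the threshold for an entire observation window.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LagSample:
    """One observation of replica lag (hypothetical shape)."""
    timestamp: float     # seconds since epoch
    lag_seconds: float   # how far the replica trails the source


def lag_is_sustained_below(
    samples: Sequence[LagSample],
    threshold_seconds: float,
    window_seconds: float,
) -> bool:
    """Return True only if the samples cover the whole trailing window
    and every sample inside that window is under the threshold."""
    if not samples:
        return False
    ordered = sorted(samples, key=lambda s: s.timestamp)
    window_start = ordered[-1].timestamp - window_seconds
    # Without a sample at or before the window start, coverage is unproven.
    if ordered[0].timestamp > window_start:
        return False
    recent = [s for s in ordered if s.timestamp >= window_start]
    return all(s.lag_seconds < threshold_seconds for s in recent)
```

A caller would feed in lag readings collected under production traffic and proceed toward cutover only when the function returns True for a window long enough to include peak write load.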

RDS to Aurora PostgreSQL Migration Workflow (Source: Netflix Blog Post)

For workloads that use change data capture, including logical replication slots or downstream stream processors, the automation coordinates slot state before the source enters quiescence. CDC consumers are paused to prevent excessive WAL retention, and slot positions are recorded so that equivalent replication slots can be recreated on Aurora at the correct log sequence number (LSN) after promotion. This preserves downstream consistency while avoiding the WAL buildup that would increase replication lag.
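Recording slot positions can be sketched as follows. This is an illustrative assumption, not the platform's code: the input rows are assumed to mirror columns of PostgreSQL's `pg_replication_slots` view (`slot_name`, `plugin`, `slot_type`, `confirmed_flush_lsn`), and `SlotCheckpoint` and `record_checkpoints` are hypothetical names. PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash, which together form a 64-bit byte position in the WAL.

```python
from dataclasses import dataclass
from typing import Iterable, List, Mapping


def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass(frozen=True)
class SlotCheckpoint:
    """What would be needed to recreate an equivalent slot later (hypothetical)."""
    slot_name: str
    plugin: str
    confirmed_flush_lsn: str
    bytes_behind: int  # WAL bytes the consumer had not yet confirmed


def record_checkpoints(
    slot_rows: Iterable[Mapping[str, str]],
    current_wal_lsn: str,
) -> List[SlotCheckpoint]:
    """Capture the position of every logical slot relative to the current WAL."""
    current = parse_lsn(current_wal_lsn)
    checkpoints = []
    for row in slot_rows:
        if row["slot_type"] != "logical":
            continue  # physical slots carry no CDC consumer position
        flushed = row["confirmed_flush_lsn"]
        checkpoints.append(
            SlotCheckpoint(
                slot_name=row["slot_name"],
                plugin=row["plugin"],
                confirmed_flush_lsn=flushed,
                bytes_behind=current - parse_lsn(flushed),
            )
        )
    return checkpoints
```

A `bytes_behind` of zero for every slot would indicate that paused consumers had confirmed everything written so far, which is the state in which recreating slots on the promoted cluster leaves no gap.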

Real-World Application

The Enablement Applications team at Netflix was among the early adopters of the migration platform, moving databases that support device certification and partner billing workflows. During replication, engineers observed an elevated OldestReplicationSlotLag metric caused by an inactive logical replication slot that was retaining WAL segments and thereby increasing replication lag. Once the stale slot was addressed, replication converged and the migration completed, with post-cutover metrics matching pre-migration baselines.
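The kind of check that surfaces such a slot can be sketched in a few lines of Python. The row shape is an assumption: `slot_name` and `active` correspond to columns of PostgreSQL's `pg_replication_slots` view, while `retained_wal_bytes` is a hypothetical precomputed field (it could be derived with `pg_wal_lsn_diff` between the current WAL position and the slot's `restart_lsn`). The function name and threshold are likewise illustrative.

```python
from typing import Any, Iterable, List, Mapping


def find_stale_slots(
    slot_rows: Iterable[Mapping[str, Any]],
    min_retained_bytes: int,
) -> List[str]:
    """Return names of slots that have no connected consumer yet still pin
    at least `min_retained_bytes` of WAL on the source."""
    return [
        row["slot_name"]
        for row in slot_rows
        if not row["active"] and row["retained_wal_bytes"] >= min_retained_bytes
    ]


# Example input: one healthy slot and one abandoned slot holding 8 GiB of WAL.
slots = [
    {"slot_name": "cdc_live", "active": True, "retained_wal_bytes": 4096},
    {"slot_name": "cdc_old", "active": False, "retained_wal_bytes": 8 * 1024**3},
]
stale = find_stale_slots(slots, min_retained_bytes=1024**3)
```

Here `stale` contains only `cdc_old`. An active slot is ignored regardless of how much WAL it retains, because a connected consumer is expected to catch up.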

Simplified Enablement Applications Overview (Source: Netflix Blog Post)

As replication lag approaches zero, the system moves into a controlled quiescence phase. Security group rules are adjusted and the source RDS instance is rebooted, which blocks new connections at the infrastructure level. Once the system confirms that all in-flight transactions have completed and that the Aurora replica has replayed the final WAL records, the replica is promoted to a writable Aurora cluster and the data access layer reroutes traffic to the new endpoint.
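The preconditions for promotion described above can be expressed as a small gate. This is a sketch under stated assumptions rather than the platform's logic: `CutoverState` and `promotion_blockers` are hypothetical names, and LSNs are assumed to be already converted to integers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CutoverState:
    """Snapshot of the facts checked before promotion (hypothetical shape)."""
    new_connections_blocked: bool  # security groups adjusted, source rebooted
    open_transactions: int         # in-flight transactions left on the source
    source_final_lsn: int          # last WAL position written by the source
    replica_replay_lsn: int        # last WAL position replayed by the replica


def promotion_blockers(state: CutoverState) -> List[str]:
    """Return the reasons promotion must wait; an empty list means proceed."""
    blockers = []
    if not state.new_connections_blocked:
        blockers.append("source still accepts new connections")
    if state.open_transactions > 0:
        blockers.append(f"{state.open_transactions} transactions still in flight")
    if state.replica_replay_lsn < state.source_final_lsn:
        missing = state.source_final_lsn - state.replica_replay_lsn
        blockers.append(f"replica is {missing} WAL bytes behind the source")
    return blockers
```

Returning the list of reasons, rather than a single boolean, makes it easier for an operator or a workflow engine to report why a cutover is waiting.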

Netflix engineers have emphasized that rollback is treated as a first-class concern. Until promotion is finalized and traffic is fully redirected, the original RDS instance remains intact as the authoritative source. If validation checks fail during synchronization, or if post-promotion health checks reveal anomalies, traffic can be redirected back to the RDS instance through the data access layer. Because applications are decoupled from physical endpoints, the previous state can be restored without redeploying any service. CDC consumers can also resume from the previously recorded slot positions on the original instance if necessary.
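Why endpoint abstraction makes rollback cheap can be shown with a toy routing table. This is only a conceptual sketch: it does not represent Netflix's Envoy-based data access layer or its configuration, and the `EndpointRouter` class and the example hostnames are hypothetical.

```python
from typing import Dict


class EndpointRouter:
    """Maps a stable logical database name to its current physical endpoint."""

    def __init__(self) -> None:
        self._active: Dict[str, str] = {}
        self._previous: Dict[str, str] = {}

    def register(self, name: str, endpoint: str) -> None:
        self._active[name] = endpoint

    def resolve(self, name: str) -> str:
        """What a client connecting to `name` is routed to right now."""
        return self._active[name]

    def cut_over(self, name: str, new_endpoint: str) -> None:
        """Point the logical name at a new endpoint, remembering the old one."""
        self._previous[name] = self._active[name]
        self._active[name] = new_endpoint

    def roll_back(self, name: str) -> str:
        """Restore the endpoint that was active before the last cutover."""
        if name not in self._previous:
            raise LookupError(f"no earlier endpoint recorded for {name!r}")
        self._active[name] = self._previous.pop(name)
        return self._active[name]


router = EndpointRouter()
router.register("billing-db", "rds.example.internal:5432")
router.cut_over("billing-db", "aurora.example.internal:5432")
after_cutover = router.resolve("billing-db")
after_rollback = router.roll_back("billing-db")
```

The application only ever refers to `billing-db`, so both the cutover and the rollback are changes to the routing table and require no redeployment.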

Tech Optimizer