Netflix has developed an internal automation platform to migrate Amazon RDS for PostgreSQL databases to Amazon Aurora PostgreSQL, reducing operational risks and downtime for nearly 400 production clusters. The platform allows service teams to perform migrations through a self-service workflow while ensuring processes like replication validation and rollback safeguards are maintained. Database access is managed through a platform-managed layer using Envoy, which standardizes mutual TLS and abstracts database endpoints, enhancing security and efficiency.
The migration process starts with creating an Aurora PostgreSQL cluster as a read replica of the source RDS instance, initialized from a storage snapshot and continuously replaying write-ahead log (WAL) records. Validation checks are performed to ensure the replica can handle peak write throughput before cutover. For change data capture workloads, the system coordinates the state of replication slots and pauses CDC consumers to prevent excessive WAL retention.
The Enablement Applications team at Netflix successfully migrated databases for device certification and partner billing workflows, addressing issues like elevated replication lag due to inactive logical replication slots. As replication lag decreases, the system enters a controlled quiescence phase, adjusts security rules, and reboots the source RDS instance. Once all transactions are processed and the Aurora replica is ready, it is promoted to a writable cluster, and traffic is rerouted.
Rollback capabilities are prioritized, allowing redirection back to the original RDS instance if validation checks fail or anomalies are detected post-promotion. This setup enables seamless restoration without redeployment, and CDC consumers can resume from recorded slot positions if needed.