Netflix operates a global streaming service that caters to hundreds of millions of users through a sophisticated distributed microservices architecture. Central to this operation is the engineering organization, which relies heavily on its infrastructure teams to develop internal tools and abstractions that enhance developer productivity while ensuring operational excellence. Among these teams is the Online Data Stores (ODS) team, tasked with managing persistent data store solutions throughout the organization. This team evaluates developer requirements, assesses production workloads, and provides expert guidance on data store decisions.
Business challenge
The ODS team aims to create data infrastructure solutions that boost developer productivity, minimize operational overhead for both infrastructure and application teams, deliver consistent and reliable performance under varying loads, and offer scalability to accommodate growing data volumes and user bases. However, the team identified that their fragmented relational database strategy was hindering these goals. The management of multiple PostgreSQL-compatible engines, including a licensed self-managed distributed PostgreSQL-compatible database, led to operational inefficiencies that affected both infrastructure teams and the developer community.
Infrastructure teams were burdened with self-managed databases on Amazon Elastic Compute Cloud (Amazon EC2), spending valuable time on deployments, patching, scaling, and maintenance while facing escalating licensing costs. Developers felt the impact of this fragmented approach as well: inconsistent database deployment processes across engines slowed application development, and manual scaling procedures during traffic spikes resulted in performance degradation and production incidents. The diverse database landscape also required teams to maintain expertise across multiple systems, complicating the establishment of unified best practices and deep specialization. Acknowledging these challenges, the ODS team began evaluating database solutions that could consolidate their infrastructure and enhance both operational efficiency and developer experience.
Database evaluation criteria
To evaluate potential database options, Netflix established criteria aligned with their team principles across four key dimensions. Firstly, for developer productivity, the solution needed to leverage existing PostgreSQL expertise to minimize learning curves, maintain PostgreSQL compatibility for application portability with minimal code changes, and integrate seamlessly with existing business intelligence (BI) and developer tools. Secondly, operational efficiency requirements focused on reducing management complexity through simplified replica management and the ability to adapt to Netflix’s unpredictable traffic patterns. Full infrastructure abstraction was also paramount: eliminating concerns around backup, failover, and infrastructure management would allow engineers to concentrate on innovation rather than maintenance.
Thirdly, the team emphasized performance reliability, requiring high availability to meet Netflix’s stringent uptime demands, with near-zero downtime during upgrades, automatic storage scaling capabilities for enhanced operational experience, and performance consistency that either matched or improved upon existing infrastructure. Multi-Region reader support for cross-Region read replicas was also essential for enabling low-latency local reads. Finally, scalability considerations revolved around cost-efficiency, aiming for a lower total cost of ownership compared to legacy database licensing while supporting expanding workloads and accommodating future use cases as Netflix’s data ecosystem continues to evolve.
After a thorough evaluation of multiple database solutions against these criteria, Netflix selected Aurora PostgreSQL as the preferred database solution for relational workloads.
Why Netflix chose Aurora
This section delves into the key reasons behind Netflix’s choice of Aurora for its database infrastructure.
Meeting data infrastructure performance requirements
The evaluation indicated that most use cases were single-Region workloads, while others required multi-Region support, which could be effectively served by Aurora Global Database. This feature employs asynchronous storage-based replication, typically resulting in less than 1 second of cross-Region replication lag, facilitating low-latency read operations for applications connecting from geographically distributed locations. Among the single-Region workloads, the team identified an opportunity to move away from the replication model used by the self-managed distributed PostgreSQL-compatible database, eliminating the coordination overhead of Raft-style consensus. This change resulted in lower write latencies, reduced operational costs, and better overall performance. The architecture of Aurora provided the necessary performance improvements through several key features:
- Log-based write operations – Aurora employs a log-based approach, sending only redo log records to the distributed storage layer instead of writing full data pages, unlike traditional database engines. These log records are written in parallel to a quorum of storage nodes across three Availability Zones, requiring four of six nodes to acknowledge, which enables higher write throughput and lower latency while ensuring durability.
- Shared storage architecture – Aurora separates compute and storage layers through a fault-tolerant distributed system that spans three Availability Zones in an AWS Region. This architecture addresses the core requirements of the data infrastructure while maintaining high availability and durability.
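The 4-of-6 write quorum described above can be sketched in a few lines; the node layout and helper name below are illustrative only, not Aurora's internal protocol:

```python
def quorum_write(acks, total_nodes=6, quorum=4):
    """Return True when enough storage-node acknowledgments arrived."""
    if quorum > total_nodes:
        raise ValueError("quorum cannot exceed node count")
    return sum(1 for ok in acks if ok) >= quorum

# Two storage nodes in each of three Availability Zones.
# Losing one full AZ still leaves 4 acks, so writes commit:
print(quorum_write([True, True, True, True, False, False]))   # True
# Losing an AZ plus one more node leaves only 3 acks:
print(quorum_write([True, True, True, False, False, False]))  # False
```

This tolerance profile is why a single Availability Zone outage does not interrupt writes, while durability still requires a majority-plus-one of the six copies.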
Eliminating operational overhead
Aurora is a fully managed relational database engine that removes the manual deployment, patching, scaling, and maintenance effort previously required for self-managed PostgreSQL-compatible databases on Amazon EC2. Aurora read replicas serve a dual purpose as both read scaling solutions and automatic failover targets, sharing the same storage volume as the writer instance while consuming the log stream asynchronously; replica lag is typically much less than 100 milliseconds after the primary instance has written an update. If the primary writer instance encounters issues, Aurora automatically fails over to one of up to 15 read replicas within the same Region, which assumes the writer role without data loss thanks to the shared storage architecture. This automated failover eliminates complex manual failover procedures, providing continuous availability without operational overhead. Furthermore, with Aurora PostgreSQL, developers can apply their existing PostgreSQL expertise without retraining. Because of PostgreSQL compatibility, applications required minimal or no code changes during migration, preserving development velocity and allowing teams to stay productive throughout the transition.
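Applications typically pair this automated failover with client-side retry logic so that in-flight connections recover once the new writer is promoted. A minimal sketch, assuming a generic `connect` callable (the helper names, retry counts, and simulated responses are illustrative, not an Aurora API):

```python
import time

def connect_with_retry(connect, attempts=5, backoff_s=0.0):
    """Retry a connection attempt across a failover window."""
    last_error = None
    for attempt in range(attempts):
        try:
            # The cluster endpoint's DNS follows the newly promoted writer.
            return connect()
        except ConnectionError as err:
            last_error = err
            time.sleep(backoff_s * (2 ** attempt))  # simple exponential backoff
    raise last_error

# Simulate a failover: two failed attempts, then the new writer responds.
responses = iter([ConnectionError("failing over"),
                  ConnectionError("failing over"),
                  "conn-ok"])

def fake_connect():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item

print(connect_with_retry(fake_connect))  # prints: conn-ok
```

In practice the `connect` callable would be a driver call (for example, a PostgreSQL client connecting to the Aurora cluster writer endpoint) with a nonzero backoff.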
Ammar Khaku, Staff Software Engineer on the Netflix Online Data Stores team, remarked,
“We no longer have to build and deploy custom binaries on EC2 with internal security and metrics-related patches. Switching to off-the-shelf managed Aurora PostgreSQL lets us focus on business logic and data access patterns.”
Enhancing application responsiveness and developer experience
Aurora significantly enhanced the developer experience by minimizing the cross-AZ latency overhead that had previously hampered application and development tool responsiveness on the self-managed distributed PostgreSQL-compatible database. The distributed nature of the prior solution required even simple read queries to be routed from coordinator nodes to other cluster nodes in different Availability Zones, resulting in multiple network hops and increased latency. The shared storage architecture in Aurora allows reads to be served locally while maintaining data consistency, and it enables the database engine to allocate 75% of instance memory to shared buffers by default. This allocation is higher than the typical 25–40% in standard PostgreSQL because Aurora avoids double buffering between PostgreSQL’s shared buffers and the operating system page cache, so more queries are served from memory rather than disk. These architectural enhancements minimized network overhead and delivered up to 75% faster response times.
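The buffer-cache difference is simple arithmetic; the 64 GiB instance size below is hypothetical, chosen only to make the comparison concrete:

```python
# Hypothetical 64 GiB instance; percentages from the comparison above.
instance_memory_gib = 64

aurora_shared_buffers_gib = instance_memory_gib * 0.75    # Aurora default (~75%)
standard_shared_buffers_gib = instance_memory_gib * 0.25  # common self-managed setting (~25%)

print(aurora_shared_buffers_gib, standard_shared_buffers_gib)  # 48.0 16.0
```

On the same hardware, roughly three times as much memory is available for caching data pages, which is what lets more reads complete without touching storage.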
Achieving cost-efficiency
Aurora’s pay-as-you-go pricing model resulted in a 28% cost savings compared to traditional license-based pricing. As a fully managed database service, Aurora removes the burden of manual capacity management and backup procedures, which previously required dedicated operational resources. Features like automatic storage scaling up to 128 TiB and continuous incremental backup to Amazon Simple Storage Service (Amazon S3), with up to 35 days of automated backup retention, further streamline operations. Additionally, read replicas incur no extra storage costs because instances share the underlying storage volume, further reducing expenses while maintaining high availability and performance.
Migration results
As of October 2025, Netflix has successfully migrated several applications from the self-managed distributed PostgreSQL-compatible database to Aurora PostgreSQL. This section reviews the performance outcomes post-migration for two applications: Spinnaker (Front50) and Policy Engine.
Spinnaker (Front50)
Front50 serves as the metadata microservice for Spinnaker, the continuous delivery system utilized across Netflix. Its workload involves storing and retrieving orchestration components such as pipelines. Faster querying of pipeline states directly influences the responsiveness of the Spinnaker UI, expediting deployment management for nearly all Netflix developers. Following the migration, Front50 experienced:
- Average latency – A reduction of approximately 38% (from 67.57 milliseconds to 41.70 milliseconds)
- Maximum latency – A reduction of approximately 70% with fewer spikes
- Stability – Much more consistent performance patterns
Policy Engine
The Policy Engine functions as a rules engine and state machine for Netflix, providing a framework for implementing, evaluating, and enforcing data governance and efficiency policies across the data store systems in use at Netflix. Its workload includes flagging data assets (tables, databases, clusters) for policy violations, managing violation state machines that automatically execute remediation actions, and notifying stakeholders. The reduction in latency has enabled Policy Engine jobs to execute more swiftly, allowing for quicker triage of alerts and ensuring compliance gaps are addressed promptly. Post-migration, the Policy Engine noted the following latency improvements:
- Overall improvement – Decreased latency across all endpoints since the migration date of July 4, 2025
- Specific endpoints – Notable latency reductions in the following endpoints:
  - countDatasets – reduced from approximately 5.40 milliseconds to 1.90 milliseconds
  - findDatasets – reduced from approximately 26.72 milliseconds to 6.51 milliseconds
  - getAggregatedFilterTerms – reduced from approximately 12.11 milliseconds to 3.51 milliseconds
- Stability – Latencies consistently under 20 milliseconds, compared to previous latencies of 40–80 milliseconds with frequent spikes
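The per-endpoint improvements can be computed directly from the reported before/after figures:

```python
# Reported before/after latencies (milliseconds) per endpoint.
endpoints = {
    "countDatasets": (5.40, 1.90),
    "findDatasets": (26.72, 6.51),
    "getAggregatedFilterTerms": (12.11, 3.51),
}

for name, (before_ms, after_ms) in endpoints.items():
    pct = (before_ms - after_ms) / before_ms * 100
    print(f"{name}: {pct:.1f}% lower latency")
```

This works out to roughly 65–76% lower latency per endpoint, consistent with the faster triage of policy alerts described above.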