OpenAI, the company behind ChatGPT and other advanced AI models, has made significant strides in the tech landscape. Yet an intriguing aspect of their journey is their reliance on a time-tested technology: PostgreSQL. This relational database serves as the backbone for OpenAI’s most essential systems. In this article, we look at OpenAI’s experience scaling PostgreSQL on Microsoft Azure: the challenges encountered, the solutions devised, and the outcomes achieved. More importantly, we distill insights that can help others scale their own databases.
The beginning: Initial architecture focused on simplicity
From the outset, OpenAI adopted Azure Database for PostgreSQL, which relieved the team of low-level database maintenance while offering essential features like automated backups and high availability. The initial architecture was straightforward: a single primary Postgres instance handled all write operations, complemented by multiple read-only replicas that absorbed the substantial read traffic. This classic primary-replica configuration proved effective during OpenAI’s early growth phases.
For workloads heavy on reads, this single-shard approach yielded significant advantages. The scalability of read operations was exceptional, facilitated by the addition of numerous replicas as needed. Each replica served as a live copy of the primary database, allowing OpenAI to distribute read queries efficiently, thereby catering to millions of users with minimal latency. The geographic distribution of these replicas further enhanced read performance for users globally, exemplifying the efficiency of cloud-managed Postgres in scaling out reads.
However, as the demand for ChatGPT and other services surged, the limitations of this architecture began to surface. Write requests became a bottleneck, as all write operations were funneled into the single primary database. As traffic escalated, there were instances where database performance impacted OpenAI’s services, prompting the need for new strategies to support both read and write scalability for their PostgreSQL workloads.
Scaling up with PostgreSQL on Azure as demand grows
At POSETTE 2025, OpenAI shared insights into how their team scaled PostgreSQL to accommodate ChatGPT and other critical services. Collaborating closely with the Microsoft Azure Database for PostgreSQL team, OpenAI’s engineers pushed the service to new heights, resulting in a series of upgrades and best practices that transformed the database layer into a resilient component of OpenAI’s data platform.
Key strategies employed by OpenAI to enhance PostgreSQL, as discussed in Bohan Zhang’s talk, include:
1. Offloading and smoothing write workloads
- Minimizing unnecessary writes at the source
- Implementing controlled timing for specific operations
- Offloading write-heavy loads to alternative systems when feasible
These optimizations helped maintain a lean and efficient primary database.
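To make the write-smoothing idea concrete, here is a minimal sketch (an illustration, not OpenAI’s actual implementation) of a coalescing write buffer: redundant updates to the same key collapse into a single row, and writes reach the primary in batches rather than one at a time. The `flush` callback stands in for the real batch write to the database:

```python
import time
from collections import OrderedDict

class WriteCoalescer:
    """Buffer writes and flush them in batches, keeping only the latest
    value per key so redundant updates never reach the primary."""

    def __init__(self, flush, max_batch=100, max_delay=0.5):
        self.flush = flush          # callable that performs the real batch write
        self.max_batch = max_batch  # flush when this many distinct keys are pending
        self.max_delay = max_delay  # or when the oldest pending write is this old (seconds)
        self.pending = OrderedDict()
        self.oldest = None

    def write(self, key, value):
        if not self.pending:
            self.oldest = time.monotonic()
        self.pending[key] = value   # a later write to the same key overwrites the earlier one
        if (len(self.pending) >= self.max_batch
                or time.monotonic() - self.oldest >= self.max_delay):
            self._flush()

    def _flush(self):
        if self.pending:
            self.flush(list(self.pending.items()))
            self.pending.clear()

batches = []
c = WriteCoalescer(batches.append, max_batch=3)
for k, v in [("a", 1), ("a", 2), ("b", 1), ("c", 1)]:
    c.write(k, v)
print(batches)  # one batch of 3 rows; the two writes to "a" collapsed into one
```

Four application-level writes become a single three-row batch against the primary, which is the kind of reduction that keeps a lone writer healthy under load.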
2. Scaling reads with replicas and smart query routing
With write pressures alleviated, OpenAI turned its attention to optimizing read-heavy workloads, which constitute the majority of ChatGPT’s traffic. Their approach included:
- Maximizing read offloading to replicas
- Prioritizing requests and assigning dedicated replica servers for high-priority traffic
- Optimizing slow queries
- Utilizing connection pooling with PgBouncer
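The last item can be shown concretely. Below is a minimal PgBouncer configuration sketch; the host name and pool sizes are illustrative, not OpenAI’s settings. Transaction-level pooling lets thousands of application clients share a small number of server connections, keeping Postgres’s per-connection overhead in check:

```ini
; illustrative PgBouncer config, not OpenAI's actual settings
[databases]
appdb = host=primary.example.com port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction      ; return server connections after each transaction
default_pool_size = 20       ; server connections per database/user pair
max_client_conn = 2000       ; many clients, few actual Postgres connections
```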
These efforts transformed OpenAI’s operational dynamics from reactive to proactive.
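The routing approach described above can be sketched in a few lines of Python. This is a simplified illustration under assumed names (`replica-p1` and so on), not OpenAI’s router: writes go to the primary, high-priority reads get dedicated replicas, and everything else round-robins across the general pool:

```python
import itertools

class QueryRouter:
    """Route writes to the primary and spread reads across replicas,
    reserving dedicated replicas for high-priority traffic.
    (Server names are hypothetical; real routing would live in the
    application layer or a proxy in front of PgBouncer.)"""

    def __init__(self, primary, priority_replicas, general_replicas):
        self.primary = primary
        self._prio = itertools.cycle(priority_replicas)
        self._gen = itertools.cycle(general_replicas)

    def route(self, sql, high_priority=False):
        verb = sql.lstrip().split()[0].upper()
        if verb != "SELECT":                 # writes and DDL always hit the primary
            return self.primary
        return next(self._prio if high_priority else self._gen)

r = QueryRouter("primary", ["replica-p1"], ["replica-1", "replica-2"])
print(r.route("INSERT INTO users VALUES (1)"))             # primary
print(r.route("SELECT * FROM users", high_priority=True))  # replica-p1
print(r.route("SELECT 1"), r.route("SELECT 2"))            # replica-1 replica-2
```

Isolating high-priority traffic on its own replicas means a slow analytical query from a low-priority client can never starve latency-sensitive reads.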
3. Schema governance and safeguards
Scaling is not solely about enhancing performance; it also involves ensuring stability and uptime. OpenAI instituted processes to guarantee that pushing PostgreSQL’s limits would not compromise reliability:
- Establishing strict schema change protocols
- Managing long transactions effectively
- Implementing rate limits at the application, connection, and query levels
- Ensuring high availability as a standard
These measures contributed to a robust PostgreSQL setup with cloud-grade reliability.
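Query-level rate limiting, the third safeguard above, is commonly implemented as a token bucket; the sketch below is a generic illustration, not OpenAI’s implementation. Each query consumes a token, tokens refill at a fixed rate, and the caller passes in a timestamp so the logic is deterministic and easy to test:

```python
class TokenBucket:
    """Simple token-bucket limiter: each query consumes one token;
    tokens refill at a fixed rate, capping sustained query throughput
    while still allowing short bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=2)   # roughly 2 queries/second sustained
results = [bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.2)]
print(results)  # [True, True, False, True]: the burst drains, then refill allows more
```

The same shape works at any of the three levels mentioned above: per application, per connection, or per query class, with rejected requests retried or shed before they ever reach the database.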
The result: PostgreSQL at scale
OpenAI’s collaboration with Azure Database for PostgreSQL has yielded significant outcomes, demonstrating the potential of a well-architected relational database in the cloud:
- Peak throughput: The PostgreSQL cluster now processes millions of queries per second (combining reads and writes), showcasing the capability of a single coordinated database cluster.
- Global read scale: OpenAI has deployed numerous read replicas, including cross-region replicas, to serve a global user base with low latency while avoiding overwhelming the primary database.
- Reliability: In the nine months after these improvements, only one critical (Sev0) incident was linked to PostgreSQL, a marked enhancement in reliability compared to previous periods.
- Ten times faster: Database response times improved dramatically, dropping from approximately 50 milliseconds to under five milliseconds for many queries, creating an instantaneous user experience.
OpenAI’s PostgreSQL architecture is handling an unprecedented workload, all built on a foundation of open-source technology and cloud services accessible to any startup. This level of scalability, once thought to require specialized databases or vast engineering teams, has been achieved by OpenAI with a focused team dedicated to systematic optimizations. In Bohan Zhang’s words, “After all the optimization we did, we are super happy with Postgres right now for our read-heavy workloads.”
Why Azure Database for PostgreSQL was key
By leveraging Azure Database for PostgreSQL, OpenAI gained access to a service designed for high-scale, mission-critical workloads. This platform offered several advantages that complemented OpenAI’s engineering efforts.
Ease of scaling and replication
Azure facilitated the seamless addition of replicas on demand. Drawing on insights from OpenAI’s evolving workload, the Azure Database for PostgreSQL team developed the elastic clusters feature, currently in preview, which enables horizontal scaling through row-based and schema-based sharding. The introduction of cascading read replicas, also in preview, allows additional read replicas to be created from existing ones, making it more efficient to scale read workloads across regions.
Bohan Zhang, a member of OpenAI’s infrastructure team, emphasized, “At OpenAI, we utilize an unsharded architecture with one writer and multiple readers, demonstrating that PostgreSQL can scale gracefully under massive read loads.”
Additional advantages of Azure included:
- High availability and management
- Co-innovation and support
- Security and compliance
Azure Database for PostgreSQL provided a reliable foundation upon which OpenAI executed these optimizations. For startups, utilizing a managed database offers enterprise readiness from the outset, allowing teams to focus on product innovation and the specific tuning required for their unique use cases.
Making Postgres work for you
OpenAI’s achievements with Azure Database for PostgreSQL exemplify resilience and innovation, highlighting the possibilities that arise when a startup combines a powerful cloud platform with intelligent engineering. This blend of established solutions and novel approaches often proves to be a successful strategy—innovating where it matters while relying on proven technologies for foundational elements like databases. For startup developers and technical decision-makers aspiring to replicate this success, key takeaways include:
- Start simple and optimize gradually
- Leverage cloud-managed services
- Monitor, measure, and address bottlenecks
- Apply best practices from the Postgres community
For those inspired to enhance their startup’s data layer, a promising starting point is to explore Azure Database for PostgreSQL and discover effective utilization strategies.