OpenAI relies on PostgreSQL as the backbone of its critical systems, running on Microsoft Azure Database for PostgreSQL. Initially, OpenAI used a straightforward architecture: a single primary Postgres instance handled all writes, while multiple read-only replicas served read traffic. This design scaled reads well, but as demand grew, the single primary became a bottleneck for writes.
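The primary/replica split described above can be sketched as a simple query router: writes and DDL go to the primary, reads round-robin across replicas. This is an illustrative sketch, not OpenAI's implementation; the hostnames and routing rules are hypothetical.

```python
from itertools import cycle

# Hypothetical hosts for illustration only.
PRIMARY = "pg-primary.example.internal"
REPLICAS = cycle([
    "pg-replica-1.example.internal",
    "pg-replica-2.example.internal",
])

# Statements that must run on the primary (writes and schema changes).
WRITE_VERBS = ("insert", "update", "delete", "create", "alter", "drop")

def route(sql: str) -> str:
    """Return the host a statement should run on: writes go to the
    primary; reads are spread round-robin across the replicas."""
    verb = sql.lstrip().split(None, 1)[0].lower()
    return PRIMARY if verb in WRITE_VERBS else next(REPLICAS)
```

In a real deployment this decision is usually made by a connection pooler or proxy (for example, separate connection strings for the primary and a replica load balancer) rather than by parsing SQL in application code.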
To address these scalability challenges, OpenAI pursued several strategies: offloading write workloads, optimizing read-heavy workloads with replicas and smart query routing, and establishing schema governance for stability. The results were significant: the PostgreSQL cluster now serves millions of queries per second, numerous global read replicas provide low-latency access, and response times for many queries dropped from roughly 50 milliseconds to under five milliseconds.
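Schema governance typically means gating migrations behind automated checks that reject operations which take long locks on busy tables. A minimal sketch of such a check is below; the rules shown are illustrative examples of common Postgres pitfalls, not OpenAI's actual policy.

```python
import re

# Illustrative rules: each pairs a pattern with the reason it is rejected.
RULES = [
    (re.compile(r"\bcreate\s+index\b(?!.*\bconcurrently\b)", re.I | re.S),
     "CREATE INDEX without CONCURRENTLY blocks writes on the table"),
    (re.compile(r"\balter\s+table\b.*\bset\s+not\s+null\b", re.I | re.S),
     "SET NOT NULL validates every row; add a CHECK constraint first"),
]

def lint_migration(sql: str) -> list[str]:
    """Return a list of policy violations found in a migration script."""
    return [msg for pattern, msg in RULES if pattern.search(sql)]
```

A check like this can run in CI so that risky DDL never reaches the primary, which is one way a team keeps a heavily loaded single-writer cluster stable.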
OpenAI's collaboration with Azure Database for PostgreSQL made scaling and replication straightforward, enabling the seamless addition of replicas and the development of features such as elastic clusters and cascading read replicas. Azure also provided high availability, co-innovation support, and security compliance, giving OpenAI a reliable foundation for these optimizations.