Efforts to merge two distinct storage roles, the hot transactional path and cheap durable bulk storage, into a single solution continue, driven largely by the allure of Amazon S3’s durability, cost-effectiveness, and seemingly infinite capacity. But running a PostgreSQL database well involves more than holding data; it hinges on the critical moments when the database must pause until storage confirms that an operation is durable.
Understanding Commit Latency
In PostgreSQL, achieving a durable commit necessitates flushing the Write-Ahead Log (WAL) before signaling to the client that a transaction has concluded. This flushing process occurs in the XLogFlush() function, which blocks the backend until the kernel confirms that the write is durable. On high-performance enterprise NVMe drives equipped with power-loss protection, this operation can take mere tens of microseconds. However, when slower networked storage is involved, this duration can extend into milliseconds. Introducing object-storage-like solutions into this workflow exacerbates the latency issue.
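The cost of that flush is easy to feel on your own hardware. The hypothetical micro-benchmark below approximates the per-commit price of XLogFlush() by timing write-then-fsync cycles against a scratch file; it measures whatever storage backs the temp directory, not a real WAL, and the function name and sizes are illustrative only.

```python
import os
import statistics
import tempfile
import time

def fsync_latency_us(iterations=50, record=b"x" * 256):
    """Median latency (microseconds) of a small write followed by fsync,
    a rough stand-in for the flush a PostgreSQL backend pays per commit."""
    fd, path = tempfile.mkstemp()
    samples = []
    try:
        for _ in range(iterations):
            start = time.perf_counter()
            os.write(fd, record)   # append a small "WAL record"
            os.fsync(fd)           # block until the kernel reports durability
            samples.append((time.perf_counter() - start) * 1e6)
    finally:
        os.close(fd)
        os.unlink(path)
    return statistics.median(samples)

if __name__ == "__main__":
    print(f"median write+fsync latency: {fsync_latency_us():.0f} us")
```

On an enterprise NVMe drive with power-loss protection this typically lands in the tens of microseconds; on networked storage it can climb into milliseconds, which is exactly the gap the article describes.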
This latency is not merely a theoretical metric; it directly impacts the performance of lightly loaded Online Transaction Processing (OLTP) systems, establishing a tangible limit on how swiftly a single session can finalize transactions. While group commit strategies can alleviate this issue by allowing multiple transactions to share a single flush, many production applications do not consistently operate within this ideal scenario. As a result, storage latency can significantly affect user-facing response times.
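Group commit’s benefit is simple arithmetic: when several backends share one WAL flush, each pays only a fraction of its latency. This hypothetical back-of-the-envelope model uses illustrative numbers, not measurements.

```python
def per_txn_commit_us(flush_latency_us, group_size):
    """Effective per-transaction flush cost when group_size commits
    share a single WAL flush (idealized: no queueing, perfect batching)."""
    return flush_latency_us / group_size

# A 2 ms flush (plausible for networked storage) amortized across group sizes.
# A lightly loaded session commits alone (group_size=1) and pays the full 2 ms.
for n in (1, 4, 16):
    print(f"group of {n}: {per_txn_commit_us(2000, n):.0f} us per transaction")
```

The model also shows why the optimization fails exactly when it matters most: a lightly loaded OLTP system rarely has concurrent commits to batch, so each session pays the full flush latency.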
Recent benchmarking of managed PostgreSQL services illustrates the trend: once a workload exceeds memory capacity, the advantage of systems built on faster local storage widens noticeably over their object-storage-backed counterparts.
“For PostgreSQL, fsync is a promise, not just a write.”
It is essential to recognize that not all SSDs exhibit the same performance characteristics. In the context of PostgreSQL, the fsync operation represents a commitment rather than a simple write action. Enterprise-grade drives with power-loss protection can often fulfill this promise more swiftly, thanks to their capacitor-backed caches. In contrast, consumer-grade SSDs typically lack the same level of safety, leading to divergent behaviors even among drives that appear equally “fast” on paper, particularly under commit-heavy workloads.
The Read Challenge
The challenges extend to read operations as well. PostgreSQL organizes heap data and indexes into 8 KB pages. On a buffer cache miss, a backend blocks while it reads a single small page. OLTP workloads hit this pattern constantly: an index lookup, a heap fetch, a visibility check, then another index lookup. NVMe storage excels in this regard, while object storage does not. The issue is not merely bandwidth; rather, PostgreSQL’s need for many small, latency-sensitive reads clashes with the design of S3, which is optimized for larger, higher-latency object requests.
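That access pattern can be sketched directly: many random 8 KB reads, mimicking PostgreSQL fetching heap and index pages on cache misses. This hypothetical sketch times pread() against a scratch file; real numbers depend entirely on what storage (and page cache) sits underneath.

```python
import os
import random
import statistics
import tempfile
import time

PAGE = 8192  # PostgreSQL's default block size

def random_read_latency_us(n_pages=256, n_reads=100):
    """Median latency (microseconds) of random single-page reads,
    a rough stand-in for OLTP cache-miss I/O."""
    fd, path = tempfile.mkstemp()
    samples = []
    try:
        os.pwrite(fd, b"\0" * (PAGE * n_pages), 0)  # build a small "relation"
        for _ in range(n_reads):
            offset = random.randrange(n_pages) * PAGE
            start = time.perf_counter()
            block = os.pread(fd, PAGE, offset)      # one 8 KB page read
            samples.append((time.perf_counter() - start) * 1e6)
            assert len(block) == PAGE
    finally:
        os.close(fd)
        os.unlink(path)
    return statistics.median(samples)

if __name__ == "__main__":
    print(f"median 8 KB read: {random_read_latency_us():.1f} us")
```

Run against local NVMe, the per-page cost is tiny; an object-store GET for the same 8 KB would cost orders of magnitude more per round trip, which is the mismatch the article describes.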
“The problem is not bandwidth. It is that PostgreSQL wants lots of tiny, latency-sensitive reads, while S3 is built around larger, higher-latency object requests.”
This mismatch becomes increasingly pronounced when the working set exceeds available memory. While shared_buffers and the OS page cache provide significant assistance, they have their limits. As the database begins to miss cache on frequently accessed queries, the latency of the underlying storage transitions from a background concern to a primary factor affecting performance.
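The transition from background concern to primary factor is visible in a simple weighted-average model of page-access latency. The latencies here are illustrative assumptions, not measurements: roughly a microsecond for a shared_buffers hit, ~100 microseconds for local NVMe, and ~20 ms for an object-store GET.

```python
def avg_access_us(hit_rate, miss_latency_us, hit_latency_us=1.0):
    """Average page-access latency given a cache hit rate and the
    latency of the storage that serves misses."""
    return hit_rate * hit_latency_us + (1 - hit_rate) * miss_latency_us

# As the working set outgrows memory and the hit rate slips, miss
# latency comes to dominate, and it dominates far faster on slow storage.
for hit_rate in (0.999, 0.99, 0.9):
    nvme = avg_access_us(hit_rate, 100)
    obj = avg_access_us(hit_rate, 20_000)
    print(f"hit rate {hit_rate}: nvme {nvme:.1f} us, object store {obj:.1f} us")
```

Even a 1% miss rate multiplies average access cost by a factor of hundreds on object storage, while the same slip on NVMe is barely visible.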
Moreover, Multi-Version Concurrency Control (MVCC) introduces its own complexities regarding I/O amplification. Updates do not overwrite existing rows; instead, they create new tuple versions and modify the relevant indexes. Checkpoints result in full-page writes being added to the WAL stream, while hint bits can convert reads into writes. The vacuum process must eventually clean up obsolete tuples and manage transaction ID age. These mechanisms are integral to PostgreSQL’s concurrency and crash safety, but they also necessitate storage solutions capable of handling a high volume of small, scattered I/O operations without faltering.
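The amplification from a single UPDATE can be sketched with a hypothetical accounting model: a new tuple version, an entry per affected index, and, for the first touch of a page after a checkpoint, a full-page image in the WAL. The sizes are illustrative, and the model deliberately omits hint bits, TOAST, and vacuum.

```python
def bytes_written_for_update(row_bytes, n_indexes, index_entry_bytes=32,
                             first_touch_after_checkpoint=False,
                             page_bytes=8192):
    """Rough count of bytes written for one UPDATE under MVCC:
    new tuple version + one entry per index + WAL (a full-page
    image if this is the page's first modification post-checkpoint)."""
    wal = page_bytes if first_touch_after_checkpoint else row_bytes
    return row_bytes + n_indexes * index_entry_bytes + wal

# Updating a 100-byte row in a table with 3 indexes:
print(bytes_written_for_update(100, 3))                                     # steady state
print(bytes_written_for_update(100, 3, first_touch_after_checkpoint=True))  # full-page write
```

Even this simplified model shows a 100-byte logical change fanning out into kilobytes of small, scattered writes, which is why the storage underneath needs to absorb that pattern gracefully.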
Separation of Concerns
Consequently, modern managed PostgreSQL systems that leverage object storage typically do not place it directly in the critical commit path. Implementations vary, but a consistent pattern emerges: keep a fast log, cache, or page-serving layer close to the database, and relegate colder or reconstructable data to a more durable remote tier. The point is not any one vendor’s branding but a convergence of design: across transactional PostgreSQL architectures, there is a persistent effort to shield the commit path from object-storage latency.
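The tiering pattern can be sketched in a few lines. This is a hypothetical toy, not any vendor’s design: both “tiers” are plain dicts, standing in for a local NVMe layer and a remote object store, and the class and method names are invented for illustration.

```python
class TieredPageStore:
    """Toy two-tier page store: hot pages served locally, cold pages
    fetched from a durable tier that never sits on the commit path."""

    def __init__(self, durable_tier):
        self.local = {}              # fast cache / page-serving layer
        self.durable = durable_tier  # cold, reconstructable data

    def read_page(self, page_id):
        if page_id in self.local:    # hot path: local hit, microseconds
            return self.local[page_id]
        page = self.durable[page_id] # cold path: remote fetch, milliseconds
        self.local[page_id] = page   # promote so the next read is local
        return page

    def write_page(self, page_id, data):
        # Writes land locally; a background process (not modeled here)
        # would checkpoint or archive them to the durable tier later.
        self.local[page_id] = data

store = TieredPageStore({1: b"cold page"})
store.write_page(2, b"hot page")
print(store.read_page(1), store.read_page(2))
```

The essential property is that no foreground read or write ever waits on the durable tier twice, and no commit waits on it at all.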
Recent developments in the PostgreSQL upstream, particularly with the introduction of asynchronous I/O support in version 18, further emphasize this direction. This enhancement aims to empower PostgreSQL to leverage fast storage more effectively, rather than attempting to make object storage mimic the performance of local SSDs. As PostgreSQL enhances its capability to issue parallel low-latency I/O, it stands to gain significantly from NVMe and other storage solutions that can respond promptly and reliably.
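Why overlapping I/O helps is easy to demonstrate. This hypothetical illustration simulates eight 1 ms “page reads” with sleeps: issued sequentially they take roughly 8 ms, issued concurrently roughly 1 ms, provided the storage can absorb the queue depth. It is a thread-pool stand-in for the effect, not a model of PostgreSQL 18’s actual AIO machinery.

```python
import concurrent.futures
import time

def fake_read(page_id, latency_s=0.001):
    time.sleep(latency_s)  # stand-in for one storage round trip
    return page_id

def read_sequential(pages):
    return [fake_read(p) for p in pages]

def read_concurrent(pages):
    # Issue all reads at once; total time approaches one round trip
    # instead of one round trip per page.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(pages)) as pool:
        return list(pool.map(fake_read, pages))

pages = list(range(8))
for fn in (read_sequential, read_concurrent):
    start = time.perf_counter()
    fn(pages)
    print(fn.__name__, f"{(time.perf_counter() - start) * 1e3:.1f} ms")
```

The gain depends entirely on the storage answering each request quickly and in parallel, which is why asynchronous I/O rewards NVMe far more than it rewards a high-latency remote tier.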
This is not to undermine the value of S3. It excels in tasks for which it was designed: WAL archiving, base backups, snapshots, retention, and supporting downstream analytical systems. S3 also serves well for the colder aspects of replication and migration workflows, whether that involves initial data loads, backfills, or significant cutovers to alternative systems. The key operational strategy lies in keeping these tasks separate from the commit path. Teams utilizing Change Data Capture (CDC) pipelines or orchestrating large PostgreSQL migrations are often addressing a different set of challenges than those associated with transaction commits.
This separation of responsibilities also benefits analytical workloads. While PostgreSQL is adept at handling transactions, attempting to execute extensive scans and aggregations on the same infrastructure that manages commits, vacuum processes, and cache misses can lead to resource contention. Consequently, substantial engineering efforts are directed toward offloading analytical tasks from the OLTP path, whether through replication or by employing a distinct open-source stack for analytics alongside PostgreSQL. The objective is not to diminish PostgreSQL’s capabilities but to allow it to excel in its core competencies.
Thus, the solution is not a binary choice between NVMe and S3. Instead, it involves leveraging both, with a clear demarcation between their respective roles. Fast local or block storage should manage commits, cache misses, checkpoints, and vacuum processes, while object storage is best suited for archives, backups, and historical data. PostgreSQL thrives when the hot path operates within the microsecond range, while the cold path is designated for durability. It encounters challenges when these two functions are forced into a singular layer.