PostgreSQL cost model: estimating the cost of flushing temp buffers to disk

Today I share results from the first phase of a project aimed at enabling temporary tables in PostgreSQL’s parallel query execution. The work involved benchmarking sequential writes and reads of temporary buffers. To do this, I added functions to the PostgreSQL system catalog to measure buffer flush operations and ran a series of tests. For the query planner, I estimated the cost coefficient for flushing a temporary-table buffer page to disk relative to the established DEFAULT_SEQ_PAGE_COST, and assessed the cost of a plain scan of temporary buffers via a preliminary dry run. The results show that sequential writes are roughly 30% slower than reads, which leads to the proposed cost model. I also ran a limited dry-run estimate of write speed when table blocks are scattered randomly across memory pages.

Since their inception, temporary tables in PostgreSQL have been excluded from parallel query execution. The limitation stems from their primary role as relational variables: to keep both performance and the code simple, they must stay uncomplicated. Because PostgreSQL parallel workers run as separate backends, they have no access to the leader process’s local state, where temporary tables live. Allowing parallel operations on these tables would significantly complicate the underlying machinery.

Nevertheless, two production-grade implementations of parallel operations on temporary tables are already available in enterprise PostgreSQL distributions, such as Postgres Pro and Tantor. Furthermore, ongoing discussions within the community regarding logical replication of DDL highlight the need for improved tooling for temporary tables. This raises the question of how to effectively replicate temporary objects, which may be nested within various UTILITY commands. The current landscape suggests that it may be time to advance this topic within the PostgreSQL community.

Despite years of incremental enhancements related to temporary objects—such as temporary types, indexes, and functions—one fundamental challenge persists: temporary buffer pages are local to the leader process. If these pages are not flushed to disk, parallel workers cannot access them. Conversely, if we do flush them, a cost model is required for the planner to determine the feasibility of this operation. This consideration may gain significance with the potential introduction of the SQL directive GLOBAL TEMP TABLE, although there is no current indication of its arrival. However, we can take steps to address this issue while maintaining the locality of temporary table buffers.

A comment in the code (80558c1), authored by Robert Haas in 2015, sheds light on the existing situation:

/*
 * Currently, parallel workers can't access the leader's temporary
 * tables.  We could possibly relax this if we wrote all of its
 * local buffers at the start of the query and made no changes
 * thereafter (maybe we could allow hint bit changes), and if we
 * taught the workers to read them.  Writing a large number of
 * temporary buffers could be expensive, though, and we don't have
 * the rest of the necessary infrastructure right now anyway.  So
 * for now, bail out if we see a temporary table.
 */

This comment suggests a potential pathway forward: if we flush the leader process’s temporary buffers to disk prior to initiating parallel operations, workers could then safely scan the table in parallel directly from disk. However, the question of cost arises—would the overhead of writing buffers negate the benefits of parallelism?

It is easy to envision a scenario where such functionality would hurt rather than help: many simple scan queries over temporary tables, a large temp_buffers setting, and a continuous stream of DML (even one touching only a tiny fraction of rows) modifying one or more temporary tables. In that case, scanning every temporary buffer and writing it to disk before each query would degrade overall performance.

As we progress toward implementing parallel scans of temporary tables, addressing the cost of flushing buffers is paramount. By equipping the optimizer with a robust cost model, we can enable it to choose between a parallel scan with buffer-flush overhead and a sequential scan devoid of parallel workers.
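The choice the optimizer would face can be sketched in a few lines. Everything below is hypothetical and for illustration only: the function names, the per-page flush cost, and the naive divide-the-work model of parallel scan speed-up are assumptions, not PostgreSQL planner code.

```python
# A minimal sketch of the planner's decision: a parallel scan pays a
# one-time flush cost for dirty local pages, then divides the scan work
# among the leader and its workers; a plain sequential scan pays neither.
# All names and coefficients here are illustrative assumptions.

def seq_scan_cost(nblocks, seq_page_cost=1.0):
    return seq_page_cost * nblocks

def parallel_scan_cost(nblocks, dirty_blocks, nworkers,
                       flush_page_cost=1.3, seq_page_cost=1.0):
    flush = flush_page_cost * dirty_blocks           # pre-flush overhead
    scan = seq_page_cost * nblocks / (nworkers + 1)  # leader also scans
    return flush + scan

# Large table, few dirty pages: parallelism wins despite the flush.
print(parallel_scan_cost(100_000, 1_000, 3) < seq_scan_cost(100_000))  # True
# Small table, every page dirty: the flush overhead dominates.
print(parallel_scan_cost(1_000, 1_000, 3) < seq_scan_cost(1_000))      # False
```

The point is only that the comparison is cheap once the flush coefficients are known; deriving those coefficients is what the benchmarks below attempt.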

Benchmarking Toolkit

Currently, PostgreSQL does not offer direct access to local buffers for measurement or state manipulation. To facilitate this, I incorporated several functions into pg_proc. The temp-buffers-sandbox branch, based on the latest PostgreSQL master, encompasses all modifications necessary for our tests.

The implementation comprises two pivotal commits:

Commit #1. Statistics Infrastructure

  • allocated_localbufs tracks the total number of buffers allocated for temporary tables.
  • dirtied_localbufs counts how many local buffers contain DIRTY (unsynchronized to disk) pages.

These statistics could serve as the foundation for a future cost model, granting the query optimizer insight into the current state of temporary buffers when estimating flushing costs.

Commit #2. UI Functions

  • pg_allocated_local_buffers() returns the number of currently allocated local buffers.
  • pg_flush_local_buffers() explicitly flushes all dirty pages to disk.
  • pg_read_temp_relation(relname, randomize) sequentially (or in random order) reads all blocks of a temporary table into temp_buffers.
  • pg_temp_buffers_dirty(relname) marks all pages of the specified table as DIRTY.

These functions enable explicit, block-level measurement of flush and read operations, which is crucial for developing accurate cost estimates.

Methodology

The test scripts can be found here. Operations on local buffers are relatively straightforward: they do not require locks, do not write WAL, and avoid other costly tasks. This simplicity eliminates concurrency concerns from the experiment and streamlines the benchmark logic. To construct a cost model, we need to measure three key aspects: sequential write speed, sequential read speed, and the overhead of scanning buffers without I/O (the so-called dry-run).

The ratio between read and write speeds will allow us to derive a sequential page-write cost parameter from the existing DEFAULT_SEQ_PAGE_COST utilized in PostgreSQL core. The planner can then leverage this parameter to estimate the cost of flushing dirty local buffers prior to parallel operations.

The benchmark sequence is as follows:

  1. Create a temporary table and populate it with data that fits within the local buffer pool (ensuring all pages are dirty in memory).
  2. Invoke pg_flush_local_buffers() to write all dirty buffers to disk and measure the I/O.
  3. Call pg_flush_local_buffers() again to assess scan overhead without a real flush (dry-write-run).
  4. Evict the test table’s pages by creating a new auxiliary table that fills the entire buffer pool, then drop it.
  5. Utilize pg_read_temp_relation() to read all test table blocks from disk into buffers and measure the I/O.
  6. Invoke pg_read_temp_relation() again to evaluate scan overhead without actual reads (dry-read-run).
  7. Evict the test table’s pages from temp_buffers.
  8. Write the test table’s pages into temp_buffers at random addresses.
  9. Mark the test table’s pages in temp_buffers as DIRTY using pg_temp_buffers_dirty().
  10. Flush all test table pages to disk with pg_flush_local_buffers().

Steps 7–10 provide a rough model of random access. I employ this approach to gauge the disparity between sequential and random disk access. All measurements are conducted using EXPLAIN (ANALYZE, BUFFERS), which logs execution time and buffer I/O statistics (local read, local write, blocks hit/read). Planning time is negligible (typically < 0.02 ms) and is excluded from the analysis. While it is feasible to bypass EXPLAIN entirely (thus avoiding the overhead of instrumentation structures), I assume this overhead is minimal and comparable for both reads and writes. Utilizing EXPLAIN offers a convenient method to verify execution time and confirm the actual number of affected blocks.

The tests encompass buffer pool sizes in powers of two, ranging from 128 to 262,144 blocks (equivalent to 1 MB to 2 GB), with 30 iterations per size to ensure statistical reliability. Each test allocates 110% of the number of blocks needed to accommodate the table, with the additional 10% reserved for Free Space Map and Visibility Map metadata. Larger buffer counts may trigger memory swapping, resulting in unreliable outcomes.
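The sizing arithmetic above is easy to verify: PostgreSQL pages are 8 kB by default (BLCKSZ), and each test reserves 10% extra room for FSM/VM metadata. A quick sanity check (the helper names are mine, not part of PostgreSQL):

```python
# Sanity-check the benchmark sizing: 8 kB pages, plus a 10% slack
# allowance on top of the table's own blocks for FSM/VM metadata.
BLOCK_SIZE = 8 * 1024  # bytes per page (default BLCKSZ)

def blocks_to_mib(nblocks):
    return nblocks * BLOCK_SIZE // (1024 * 1024)

def temp_buffers_for(table_blocks):
    # 110% of the table's blocks, rounded up (integer ceil division)
    return (table_blocks * 11 + 9) // 10

print(blocks_to_mib(512))       # 4    -> matches the 4 MB row below
print(blocks_to_mib(262_144))   # 2048 -> the 2 GB upper bound
print(temp_buffers_for(1000))   # 1100
```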

Benchmark Results

The raw test results are accessible here. On my laptop, the most consistent performance is observed within the 4–512 MB range:

  nblocks   bufsize   Write (ms)   Dry-run write (ms)   Read (ms)   Dry-run read (ms)
      512      4 MB         0.54                0.002        0.58              0.016
    1,024      8 MB         1.07                0.003        1.13              0.028
    2,048     16 MB         3.02                0.004        2.42              0.054
    4,096     32 MB         6.36                0.007        4.81              0.107
    8,192     64 MB        12.34                0.013        9.79              0.210
   16,384    128 MB        24.63                0.026       19.35              0.421
   32,768    256 MB        49.60                0.051       38.72              0.838
   65,536    512 MB        98.93                0.102       77.46              1.681
With limited RAM, larger buffer sizes exhibit qualitatively higher write overhead and increased variability:

  nblocks   bufsize   Write (ms)   Dry-run write (ms)   Read (ms)   Dry-run read (ms)
  131,072      1 GB       283.15                0.204      180.06              3.353
  262,144      2 GB       728.18                0.413      373.46              6.725
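The write/read ratio that drives the proposed coefficient can be recomputed directly from the two tables above (the data is copied from the Write and Read columns; the script itself is just arithmetic):

```python
# Write/read time ratios from the benchmark tables above.
sizes_mb = [4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
write_ms = [0.54, 1.07, 3.02, 6.36, 12.34, 24.63, 49.60, 98.93, 283.15, 728.18]
read_ms  = [0.58, 1.13, 2.42, 4.81, 9.79, 19.35, 38.72, 77.46, 180.06, 373.46]

ratios = [w / r for w, r in zip(write_ms, read_ms)]
for mb, r in zip(sizes_mb, ratios):
    print(f"{mb:5d} MB: write/read = {r:.2f}")
```

The two smallest pools are cache-friendly outliers (ratio below 1.0); from 16 MB through 512 MB the ratio settles around 1.25–1.32, and past 1 GB, where swapping kicks in, it climbs to 1.6–1.95.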

Dry-run scanning without I/O is cheap, ranging from 0.002 to 0.240 ms. Based on the experiment, we can infer that the cost of writing a temporary-table page should be close to the sequential page read cost. More specifically, I propose the following formula:

DEFAULT_WRITE_TEMP_PAGE_COST = 1.20 × DEFAULT_SEQ_PAGE_COST

The limited modeling of random access indicates a 10–24% slowdown compared to sequential mode. To err on the side of caution, I recommend applying a correction factor, suggesting a value of 1.3 instead of 1.2. The low write cost can be attributed to the characteristics of temporary tables, which do not involve locks or WAL writes. However, several questions remain unanswered. For instance, what storage type does the current default seq_page_cost pertain to? My measurements were conducted on an NVMe SSD. Would the read/write ratio differ on HDD? Is it worthwhile to investigate various storage types, or is a single universal value sufficient for most scenarios? Additionally, my modeling of random writes was quite limited—would general random writes incur a different cost?

The tests also reveal that we can factor in the size of temp_buffers: dry-run buffer scanning overhead constitutes roughly 1% of the write cost. Consequently, a comprehensive formula for pre-flushing temp buffers could be articulated as follows (assuming DEFAULT_SEQ_PAGE_COST = 1):

flush_cost = 1.3 × dirtied_localbufs + 0.01 × allocated_localbufs
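Transcribed into code, the formula is a one-liner. The parameter names mirror the statistics counters from commit #1; this is a sketch assuming DEFAULT_SEQ_PAGE_COST = 1.0, not planner code:

```python
# The proposed pre-flush cost: full write cost per dirty page, plus a
# ~1% scan overhead for every allocated local buffer that must be
# examined.  Assumes DEFAULT_SEQ_PAGE_COST = 1.0.

def flush_cost(dirtied_localbufs, allocated_localbufs,
               write_page_cost=1.3, scan_cost=0.01):
    return (write_page_cost * dirtied_localbufs
            + scan_cost * allocated_localbufs)

# Example: a 512 MB temp_buffers pool (65,536 pages), half of them dirty.
print(flush_cost(32_768, 65_536))   # ~43,254 cost units
```

The planner would add this amount to the parallel plan's startup cost and let the usual cost comparison decide whether flushing is worth it.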

What’s Next?

With a cost model established, the subsequent steps are clear—though not without challenges:

  1. Add a planner flag to indicate the presence of temporary objects within a plan subtree. Existing flags like parallel_safe and consider_parallel could be extended to incorporate this logic.
  2. Implement executor support to enable a worker to access the leader process’s temporary table storage on disk. Much of this work will involve removing asserts and error messages.
  3. Introduce a temp-buffer flush operation to disk—most likely within Gather / GatherMerge, just prior to launching parallel workers.
  4. Integrate the cost model so the planner can decide whether to permit parallel scans of temporary tables. Questions about temp_buffers sizing for workers will also arise: should the maximum size be divided among all workers, should each worker be allowed to allocate temp_buffers blocks of its own, or is some other scheme preferable?

These steps encapsulate the necessary actions to extend PostgreSQL’s parallelism to temporary tables.
