pg_stat_statements is the quintessential PostgreSQL extension for monitoring query statistics, providing insights into execution counts, total and average execution times, rows returned, and other vital metrics. This data makes it possible to analyze query performance over time, identify bottlenecks, and make informed optimization decisions. However, in environments characterized by high contention, pg_stat_statements can itself become a performance hindrance. This analysis examines the scenarios in which the extension may falter, how sampling works, and the conditions under which it can alleviate the overhead.
Issue
To understand when and why pg_stat_statements may impede performance, it is essential to revisit its architecture. The primary data structure used by pg_stat_statements is a hash table in which each bucket holds the execution metrics for a specific query. The key for this table is built from four fields:
- queryid — a unique identifier for the normalized query;
- user OID;
- database OID;
- the toplevel flag, indicating whether the query is top-level, meaning it is not nested within any function or subquery.
Using this key, the hash table locates or creates the appropriate entry and updates the collected query metrics.
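These key fields are exposed directly as columns of the extension's view, so you can inspect them yourself (a minimal illustration using the stock pg_stat_statements columns):
psql -c "SELECT userid, dbid, toplevel, queryid, calls
         FROM pg_stat_statements
         ORDER BY calls DESC
         LIMIT 5;"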
Upon receiving a query, pg_stat_statements performs a sequence of operations:
- Search for the bucket: a shared LWLock is acquired on the hash table, which is then searched by key;
- Normalization (optional): if no suitable bucket exists, the query is first normalized by replacing literals with placeholders such as $1, $2, etc.;
- Creating a new bucket: the LWLock is escalated to exclusive mode, allowing a new bucket to be created;
- Recording query information in the bucket: to update the query metrics, the SpinLock of that bucket is acquired, after which both the SpinLock and the LWLock are released.
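Normalization is easy to observe: run a query containing literals, then look it up in the view. Assuming the extension is installed and the tables from init_script.sql exist, something like the following shows the constant replaced by a placeholder:
# Execute a query with a literal in the WHERE clause
psql -c "SELECT value FROM table_1 WHERE id = 1;"
# The stored text should read "... WHERE id = $1"
psql -c "SELECT query, calls FROM pg_stat_statements
         WHERE query LIKE '%FROM table_1%';"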
These locking operations can become a bottleneck, particularly under high contention or when faced with a multitude of unique queries. For instance, consider a scenario where all SQL queries are unique from the perspective of pg_stat_statements. A machine equipped with 48 CPUs can replicate such a load. To ensure query uniqueness, we create 1000 similar tables with distinct names:
init_script.sql
DO $$
DECLARE
i INT;
BEGIN
FOR i IN 1..1000 LOOP
EXECUTE format('CREATE TABLE table_%s (id INT PRIMARY KEY, value TEXT);', i);
EXECUTE format('INSERT INTO table_%s (id, value) VALUES (1, ''test'');', i);
END LOOP;
END;
$$;
Next, using the built-in random number generator in pgbench (pgbench_script.sql), we generate queries against these tables so that each query is distinct and lands in a new bucket of the pg_stat_statements hash table.
pgbench_script.sql
\set table1_id random(1, 1000)
\set table2_id random(1, 1000)
\set table3_id random(1, 1000)
SELECT t1.value AS value1, t2.value AS value2, t3.value AS value3
FROM table_:table1_id t1
JOIN table_:table2_id t2 ON t1.id = t2.id
JOIN table_:table3_id t3 ON t2.id = t3.id
WHERE t1.id = 1 AND t2.id = 1 AND t3.id = 1;
To visualize the performance drop, we can query pg_stat_activity every second during the benchmark, recording the results in a temporary file:
waits.sql
\o /tmp/waits
SELECT 'OUT', COALESCE(wait_event, 'None') wait_event, COALESCE(wait_event_type, 'No wait') wait_event_type
FROM pg_stat_activity WHERE state ='active';
\watch 1
After completing the benchmark, we can categorize all types of delays and tally their occurrences:
cat /tmp/waits | grep OUT | awk -F'|' '{print $2 "|" $3}' | sort | uniq -c | sort -n -r -k1
Subsequently, we will run the benchmark, comparing system performance with pg_stat_statements enabled and disabled, using the standard pgbench utility:
- Number of clients (-c): 48 – matching the number of CPUs;
- Threads (-j): 20 – to limit contention at the OS level and prevent CPU overload and context switching;
- Duration (-T): 120 seconds;
- Script: execute pgbench_script.sql (-f pgbench_script.sql);
- Metrics: total number of delays and final TPS.
pgbench -c48 -j20 -T120 -f pgbench_script.sql --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &
We can consolidate all these actions into a single script:
RESULTS="/tmp/results"
rm -rfv $RESULTS
nohup pgbench -c48 -j20 -T120 -f pgbench_script.sql --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &
timeout 125 psql -f waits.sql
echo " count | wait_event | wait_event_type" >>$RESULTS
echo "--------------------------------------" >>$RESULTS
cat /tmp/waits | grep OUT | awk -F'|' '{print $2 "|" $3}' | sort | uniq -c | sort -n -r -k1 >>$RESULTS
cat $RESULTS
rm -rfv /tmp/waits
The results obtained are as follows:
# With pg_stat_statements off
tps = 237 437.104223 (without initial connection time)
count | wait_event | wait_event_type
--------------------------------------
2922 None No wait
918 ClientRead Client
# With pg_stat_statements on
tps = 32 112.129029 (without initial connection time)
count | wait_event | wait_event_type
--------------------------------------
4703 pg_stat_statements LWLock
884 None No wait
143 ClientRead Client
The data illustrates that with a significant number of unique queries, enabling pg_stat_statements can drastically diminish performance, resulting in a substantial drop in TPS. This decline is primarily attributed to the frequent acquisition of the exclusive LWLock.
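One way to confirm that a workload is flooding the hash table with unique entries is to compare the number of stored entries with pg_stat_statements.max and watch the dealloc counter, which grows each time the extension evicts entries to make room for new ones (the pg_stat_statements_info view is available in PostgreSQL 14 and later):
psql -c "SHOW pg_stat_statements.max;"
psql -c "SELECT count(*) FROM pg_stat_statements;"
psql -c "SELECT dealloc, stats_reset FROM pg_stat_statements_info;"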
In a contrasting scenario involving a high volume of similar queries, we will employ a more robust machine with 192 CPUs. This time, we will utilize a script that periodically checks pg_stat_activity, while generating load through the same query executed via pgbench -M prepared -S with 192 clients:
pgbench -c192 -j20 -T120 -M prepared -S --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &
Running this benchmark yields the following results:
# Results with pg_stat_statements off
tps = 1 015 425.438193 (without initial connection time)
count | wait_event | wait_event_type
--------------------------------------
13201 None No wait
3482 ClientRead Client
# Results with pg_stat_statements on
tps = 484 338.163894 (without initial connection time)
count | wait_event | wait_event_type
--------------------------------------
11214 SpinDelay Timeout
9481 None No wait
930 ClientRead Client
When attempting to replicate this on a machine with 48 CPUs (that is, with 48 clients), the difference in performance remains well within statistical error.
This indicates that the impact of pg_stat_statements on duplicate queries becomes significant only at elevated levels of parallelism. The underlying cause is contention for the same entry in the hash table, exacerbated by frequent SpinLock acquisitions while updating query metrics within the bucket. When numerous backends execute the same query simultaneously, they compete to update the same structure: incrementing call counters, execution times, and other metrics. This SpinLock contention leads to delays and reduced TPS under high load.
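In such a workload the contention concentrates in a handful of buckets. A quick, if rough, way to see which entries absorb most of the updates:
psql -c "SELECT queryid, calls,
                round(mean_exec_time::numeric, 3) AS mean_ms
         FROM pg_stat_statements
         ORDER BY calls DESC
         LIMIT 5;"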
What is sampling?
Query sampling is a method of uniform filtering, wherein only a subset of all executed queries is included in the sample. In the context of pg_stat_statements, this means that metric information is recorded not for every executed query, but only for a fraction of them, selected with equal probability. The same approach is employed in PostgreSQL for log_transaction_sample_rate and log_statement_sample_rate to mitigate log volume, as well as in auto_explain.sample_rate and pg_store_plans.sample_rate. In Tantor Postgres 17.5, a corresponding setting was introduced to pg_stat_statements — the GUC parameter pg_stat_statements.sample_rate, which configures the fraction of queries tracked in the extension's statistics. The parameter value (ranging from 0 to 1) determines the proportion of queries that will be monitored in pg_stat_statements. A query is sampled if the following condition is satisfied:
random() <= pg_stat_statements.sample_rate
where random() is a value drawn uniformly from [0, 1) for each query.
Given the large volume of queries, this inequality filters out all but the specified fraction of them. However, sampling carries a notable drawback: not all queries are recorded in pg_stat_statements, which can affect the completeness of the collected information, particularly when debugging or analyzing rare but problematic queries. Conversely, since only a portion of the queries is tracked, the load on the locks is diminished, enhancing overall system performance.
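Enabling sampling amounts to setting the GUC and reloading the configuration. A minimal sketch (assuming a superuser connection; the parameter name is as introduced in Tantor Postgres 17.5):
psql -c "ALTER SYSTEM SET pg_stat_statements.sample_rate = 0.5;"
psql -c "SELECT pg_reload_conf();"
psql -c "SHOW pg_stat_statements.sample_rate;"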
Limitations of sampling
If queries are sampled out at the moment a new bucket would be added to the hash table, thereby relieving the LWLock, there is a risk of omitting a valuable query that could have provided critical insights had it been tracked in pg_stat_statements. Furthermore, even if sampling mitigates the performance issue, concerns about the security and accuracy of query storage remain. A bucket can be added to the hash table either before or after query execution, but the structure required to normalize the query exists only before execution: it is created during parsing, when the literals needed for normalization are collected. If at that stage pg_stat_statements decides, because of sampling, not to save the query, yet the query is later (after execution) added to the hash table anyway, it will be recorded in its original, non-normalized form. This poses a risk of leaking sensitive information into pg_stat_statements (such as passwords or personal data embedded in SQL literals). Sampling during query parsing is therefore inadvisable, as it could contravene security requirements.
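To make the difference concrete: a normalized entry stores a text like SELECT value FROM table_1 WHERE id = $1, while an unnormalized one would retain the literal, e.g. WHERE token = 'hunter2'. A rough heuristic for spotting entries that slipped through with quoted literals (purely illustrative and prone to false positives):
psql -c "SELECT queryid, left(query, 60)
         FROM pg_stat_statements
         WHERE query LIKE '%''%'
         LIMIT 10;"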
Nevertheless, both the PostgreSQL community and developers at Tantor Labs are exploring alternative solutions to the challenge of managing a large number of unique queries, specifically by merging similar queries under a single queryid. This strategy reduces the number of unique entries in the hash table, thereby decreasing the frequency of locking, and the community has already begun merging certain classes of queries into a single queryid. In scenarios where SpinLock contention becomes the bottleneck, such as when the same hash-table entry is updated again and again, sampling can prove quite effective: by skipping some queries it reduces the number of SpinLock acquisitions, minimizing contention between backends and improving overall performance.
Sampling results
Let us revisit the earlier scenario on the powerful machine with 192 CPUs. The same waits.sql script and enabled pg_stat_statements will be used for testing. We will execute a benchmark to evaluate the impact of pg_stat_statements.sample_rate on performance and the nature of waits. A loop will be run over five sample_rate values: 1, 0.75, 0.5, 0.25, and 0. For each value, we conduct load testing with pgbench, as in the sketch below.
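A minimal driver for that loop might look as follows (a sketch: it assumes a superuser psql connection and reuses the pgbench invocation, waits.sql, and the $RESULTS file from above):
for rate in 1 0.75 0.5 0.25 0; do
    # Apply the sampling fraction and reload the configuration
    psql -c "ALTER SYSTEM SET pg_stat_statements.sample_rate = $rate;"
    psql -c "SELECT pg_reload_conf();"
    psql -c "SELECT pg_stat_statements_reset();"

    # Same select-only benchmark as before, collecting waits in parallel
    echo "sample_rate = $rate" >>$RESULTS
    nohup pgbench -c192 -j20 -T120 -M prepared -S --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &
    timeout 125 psql -f waits.sql
    cat /tmp/waits | grep OUT | awk -F'|' '{print $2 "|" $3}' | sort | uniq -c | sort -n -r -k1 >>$RESULTS
    rm -f /tmp/waits
done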
After executing the benchmark with various pg_stat_statements.sample_rate values, the results are summarized in the table below, illustrating how performance and the nature of waits evolve as the fraction of queries hitting the hash table changes:
sample_rate |       tps | SpinDelay | NoWait | ClientRead
       1.00 |   484 338 |    11 107 |  9 568 |        929
       0.75 |   909 547 |     4 781 | 12 079 |      2 100
       0.50 | 1 019 507 |       174 | 13 253 |      3 378
       0.25 | 1 019 507 |         – | 13 397 |      3 423
       0.00 | 1 015 425 |         – | 13 106 |      3 502
At a sample_rate of 1.0, where metrics are collected for all queries, TPS is at its lowest, accompanied by a significant number of SpinLock waits. As the sample_rate decreases to 0.75, TPS rises noticeably and SpinDelay drops by a factor of 2.3. At a sample_rate of 0.25 and below, SpinDelay disappears entirely.
Thus, sampling proves to be an effective strategy for reducing the overhead associated with pg_stat_statements, significantly enhancing performance in scenarios marked by high SpinLock contention.