pg_stat_statements is the quintessential PostgreSQL extension for monitoring query statistics, providing insights into execution counts, total and average execution times, rows returned, and other vital metrics. This data makes it possible to analyze query performance over time, identify bottlenecks, and make informed optimization decisions. However, in environments characterized by high contention, pg_stat_statements can itself become a performance hindrance. This analysis examines the scenarios in which the extension may falter, how sampling works, and the conditions under which it can alleviate the overhead.
Issue
To understand when and why pg_stat_statements may impede performance, it is essential to revisit its architecture. The primary data structure used by pg_stat_statements is a hash table in which each bucket holds the execution metrics for a specific query. The key for this table is built from four fields:
- queryid — a unique identifier for the normalized query;
- user OID;
- database OID;
- the toplevel flag, indicating whether the query is top-level, meaning it is not nested within any function or subquery.
Using this key, the hash table locates or creates the appropriate entry and updates the collected query metrics.
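These key fields are exposed directly as columns of the extension's view, so you can inspect them yourself (a minimal illustration using the stock pg_stat_statements columns):
psql -c "SELECT userid, dbid, toplevel, queryid, calls
         FROM pg_stat_statements
         ORDER BY calls DESC
         LIMIT 5;"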
Upon receiving a query, pg_stat_statements performs a sequence of operations:
- Search for the bucket: a shared LWLock is acquired on the hash table, which is then searched by key;
- Normalization (optional): if no suitable bucket exists, the query is first normalized by replacing literals with placeholders such as $1, $2, etc.;
- Creating a new bucket: the LWLock is escalated to exclusive mode, allowing a new bucket to be created;
- Recording query information in the bucket: to update the query metrics, the SpinLock of that bucket is acquired, after which both the SpinLock and the LWLock are released.
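Normalization is easy to observe: run a query containing literals, then look it up in the view. Assuming the extension is installed and the tables from init_script.sql exist, something like the following shows the constant replaced by a placeholder:
# Execute a query with a literal in the WHERE clause
psql -c "SELECT value FROM table_1 WHERE id = 1;"
# The stored text should read "... WHERE id = $1"
psql -c "SELECT query, calls FROM pg_stat_statements
         WHERE query LIKE '%FROM table_1%';"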
These locking operations can become a bottleneck, particularly under high contention or when faced with a multitude of unique queries. For instance, consider a scenario where all SQL queries are unique from the perspective of pg_stat_statements. A machine equipped with 48 CPUs can replicate such a load. To ensure query uniqueness, we create 1000 similar tables with distinct names:
init_script.sql
DO $$
DECLARE
i INT;
BEGIN
FOR i IN 1..1000 LOOP
EXECUTE format('CREATE TABLE table_%s (id INT PRIMARY KEY, value TEXT);', i);
EXECUTE format('INSERT INTO table_%s (id, value) VALUES (1, ''test'');', i);
END LOOP;
END;
$$;
Next, using the built-in random number generator in pgbench (pgbench_script.sql), we generate queries against these tables so that each query is distinct and lands in a new bucket of the pg_stat_statements hash table.
pgbench_script.sql
\set table1_id random(1, 1000)
\set table2_id random(1, 1000)
\set table3_id random(1, 1000)
SELECT t1.value AS value1, t2.value AS value2, t3.value AS value3
FROM table_:table1_id t1
JOIN table_:table2_id t2 ON t1.id = t2.id
JOIN table_:table3_id t3 ON t2.id = t3.id
WHERE t1.id = 1 AND t2.id = 1 AND t3.id = 1;
To visualize the performance drop, we can query pg_stat_activity every second during the benchmark, recording the results in a temporary file:
waits.sql
\o /tmp/waits
SELECT 'OUT', COALESCE(wait_event, 'None') wait_event, COALESCE(wait_event_type, 'No wait') wait_event_type
FROM pg_stat_activity WHERE state ='active';
\watch 1
After completing the benchmark, we can categorize all types of delays and tally their occurrences:
cat /tmp/waits | grep OUT | awk -F'|' '{print $2 "|" $3}' | sort | uniq -c | sort -n -r -k1
Subsequently, we will run the benchmark, comparing system performance with pg_stat_statements enabled and disabled, using the standard pgbench utility:
- Number of clients (-c): 48 – matching the number of CPUs;
- Threads (-j): 20 – to limit contention at the OS level and prevent CPU overload and context switching;
- Duration (-T): 120 seconds;
- Script: execute pgbench_script.sql (-f pgbench_script.sql);
- Metrics: total number of delays and final TPS.
pgbench -c48 -j20 -T120 -f pgbench_script.sql --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &
We can consolidate all these actions into a single script:
RESULTS="/tmp/results"
rm -rfv $RESULTS
nohup pgbench -c48 -j20 -T120 -f pgbench_script.sql --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &
timeout 125 psql -f waits.sql
echo " count | wait_event | wait_event_type" >>$RESULTS
echo "--------------------------------------" >>$RESULTS
cat /tmp/waits | grep OUT | awk -F'|' '{print $2 "|" $3}' | sort | uniq -c | sort -n -r -k1 >>$RESULTS
cat $RESULTS
rm -rfv /tmp/waits
The results obtained are as follows:
# With pg_stat_statements off
tps = 237 437.104223 (without initial connection time)
count | wait_event | wait_event_type
--------------------------------------
2922 None No wait
918 ClientRead Client
# With pg_stat_statements on
tps = 32 112.129029 (without initial connection time)
count | wait_event | wait_event_type
--------------------------------------
4703 pg_stat_statements LWLock
884 None No wait
143 ClientRead Client
The data illustrates that with a significant number of unique queries, enabling pg_stat_statements can drastically diminish performance, resulting in a substantial drop in TPS. This decline is primarily attributed to the frequent acquisition of the exclusive LWLock.
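One way to confirm that a workload is flooding the hash table with unique entries is to compare the number of stored entries with pg_stat_statements.max and watch the dealloc counter, which grows each time the extension evicts entries to make room for new ones (the pg_stat_statements_info view is available in PostgreSQL 14 and later):
psql -c "SHOW pg_stat_statements.max;"
psql -c "SELECT count(*) FROM pg_stat_statements;"
psql -c "SELECT dealloc, stats_reset FROM pg_stat_statements_info;"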
In a contrasting scenario involving a high volume of similar queries, we will employ a more robust machine with 192 CPUs. This time, we will utilize a script that periodically checks pg_stat_activity, while generating load through the same query executed via pgbench -M prepared -S with 192 clients:
pgbench -c192 -j20 -T120 -M prepared -S --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &
Running this benchmark yields the following results:
# Results with pg_stat_statements off
tps = 1 015 425.438193 (without initial connection time)
count | wait_event | wait_event_type
--------------------------------------
13201 None No wait
3482 ClientRead Client
# Results with pg_stat_statements on
tps = 484 338.163894 (without initial connection time)
count | wait_event | wait_event_type
--------------------------------------
11214 SpinDelay Timeout
9481 None No wait
930 ClientRead Client
When attempting to replicate this on a machine with 48 CPUs (that is, with 48 clients), the difference in performance remains well within statistical error.
This indicates that the impact of pg_stat_statements on duplicate queries becomes significant only at elevated levels of parallelism. The underlying cause is contention for the same entry in the hash table, exacerbated by frequent SpinLock acquisitions while updating query metrics within the bucket. When numerous backends execute the same query simultaneously, they compete to update the same structure: incrementing call counters, execution times, and other metrics. This SpinLock contention leads to delays and reduced TPS under high load.
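In such a workload the contention concentrates in a handful of buckets. A quick, if rough, way to see which entries absorb most of the updates:
psql -c "SELECT queryid, calls,
                round(mean_exec_time::numeric, 3) AS mean_ms
         FROM pg_stat_statements
         ORDER BY calls DESC
         LIMIT 5;"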
What is sampling?
Query sampling is a method of uniform filtering, wherein only a subset of all executed queries is included in the sample. In the context of pg_stat_statements, this means that metric information is recorded not for every executed query, but only for a fraction of them, selected with equal probability. The same approach is employed in PostgreSQL for log_transaction_sample_rate and log_statement_sample_rate to mitigate log volume, as well as in auto_explain.sample_rate and pg_store_plans.sample_rate. In Tantor Postgres 17.5, a corresponding setting was introduced to pg_stat_statements — the GUC parameter pg_stat_statements.sample_rate, which configures the fraction of queries tracked in the extension's statistics. The parameter value (ranging from 0 to 1) determines the proportion of queries that will be monitored in pg_stat_statements. A query is sampled if the following condition is satisfied:
random() <= pg_stat_statements.sample_rate
where random() is a value drawn uniformly from [0, 1) for each query.
Given the large volume of queries, this inequality filters out all but the specified fraction of them. However, sampling carries a notable drawback: not all queries are recorded in pg_stat_statements, which can affect the completeness of the collected information, particularly when debugging or analyzing rare but problematic queries. Conversely, since only a portion of the queries is tracked, the load on the locks is diminished, enhancing overall system performance.
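Enabling sampling amounts to setting the GUC and reloading the configuration. A minimal sketch (assuming a superuser connection; the parameter name is as introduced in Tantor Postgres 17.5):
psql -c "ALTER SYSTEM SET pg_stat_statements.sample_rate = 0.5;"
psql -c "SELECT pg_reload_conf();"
psql -c "SHOW pg_stat_statements.sample_rate;"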
Limitations of sampling
If queries are sampled out at the moment a new bucket would be added to the hash table, thereby relieving the LWLock, there is a risk of omitting a valuable query that could have provided critical insights had it been tracked in pg_stat_statements. Furthermore, even if sampling mitigates the performance issue, concerns about the security and accuracy of query storage remain. A bucket can be added to the hash table either before or after query execution, but the structure required to normalize the query exists only before execution: it is created during parsing, when the literals needed for normalization are collected. If at that stage pg_stat_statements decides, because of sampling, not to save the query, yet the query is later (after execution) added to the hash table anyway, it will be recorded in its original, non-normalized form. This poses a risk of leaking sensitive information into pg_stat_statements (such as passwords or personal data embedded in SQL literals). Sampling during query parsing is therefore inadvisable, as it could contravene security requirements.
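To make the difference concrete: a normalized entry stores a text like SELECT value FROM table_1 WHERE id = $1, while an unnormalized one would retain the literal, e.g. WHERE token = 'hunter2'. A rough heuristic for spotting entries that slipped through with quoted literals (purely illustrative and prone to false positives):
psql -c "SELECT queryid, left(query, 60)
         FROM pg_stat_statements
         WHERE query LIKE '%''%'
         LIMIT 10;"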
Nevertheless, both the PostgreSQL community and developers at Tantor Labs are exploring alternative solutions to the challenge of managing a large number of unique queries, specifically by merging similar queries under a single queryid. This strategy reduces the number of unique entries in the hash table, thereby decreasing the frequency of locking, and the community has already begun merging certain classes of queries into a single queryid. In scenarios where SpinLock contention becomes the bottleneck, such as when the same hash-table entry is updated again and again, sampling can prove quite effective: by skipping some queries it reduces the number of SpinLock acquisitions, minimizing contention between backends and improving overall performance.
Sampling results
Let us revisit the earlier scenario on the powerful machine with 192 CPUs. The same waits.sql script and enabled pg_stat_statements will be used for testing. We will execute a benchmark to evaluate the impact of pg_stat_statements.sample_rate on performance and the nature of waits. A loop will be run over five sample_rate values: 1, 0.75, 0.5, 0.25, and 0. For each value, we conduct load testing with pgbench, as in the sketch below.
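A minimal driver for that loop might look as follows (a sketch: it assumes a superuser psql connection and reuses the pgbench invocation, waits.sql, and the $RESULTS file from above):
for rate in 1 0.75 0.5 0.25 0; do
    # Apply the sampling fraction and reload the configuration
    psql -c "ALTER SYSTEM SET pg_stat_statements.sample_rate = $rate;"
    psql -c "SELECT pg_reload_conf();"
    psql -c "SELECT pg_stat_statements_reset();"

    # Same select-only benchmark as before, collecting waits in parallel
    echo "sample_rate = $rate" >>$RESULTS
    nohup pgbench -c192 -j20 -T120 -M prepared -S --progress 10 | grep "tps = " 2>>$RESULTS >>$RESULTS &
    timeout 125 psql -f waits.sql
    cat /tmp/waits | grep OUT | awk -F'|' '{print $2 "|" $3}' | sort | uniq -c | sort -n -r -k1 >>$RESULTS
    rm -f /tmp/waits
done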
After executing the benchmark with various pg_stat_statements.sample_rate values, the results are summarized in the table below, illustrating how performance and the nature of waits evolve as the fraction of queries hitting the hash table changes:
sample_rate |       tps | SpinDelay | NoWait | ClientRead
       1.00 |   484 338 |    11 107 |  9 568 |        929
       0.75 |   909 547 |     4 781 | 12 079 |      2 100
       0.50 | 1 019 507 |       174 | 13 253 |      3 378
       0.25 | 1 019 507 |         – | 13 397 |      3 423
       0.00 | 1 015 425 |         – | 13 106 |      3 502
At a sample_rate of 1.0, where metrics are collected for all queries, TPS is at its lowest, accompanied by a significant number of SpinLock waits. As the sample_rate decreases to 0.75, TPS rises noticeably and SpinDelay drops by a factor of 2.3. At a sample_rate of 0.25 and below, SpinDelay disappears entirely.
Thus, sampling proves to be an effective strategy for reducing the overhead associated with pg_stat_statements, significantly enhancing performance in scenarios marked by high SpinLock contention.