The extended statistics feature in PostgreSQL provides a mechanism for collecting additional statistics on specific sets of table columns. This capability becomes particularly valuable when dealing with datasets that exhibit implicit relationships between columns, such as the public power plant dataset. For example, the primary_fuel column is inherently linked to the country column. When executing a query like:
```sql
SELECT count(*) FROM power_plants
WHERE country = 'NOR' AND primary_fuel = 'Solar';  -- 'NOR' for Norway, 'ESP' for Spain
```
the results reveal that Norway returns 0 rows while Spain yields 243. While this outcome is intuitive based on geographical and climatic factors, PostgreSQL lacks this contextual understanding. Consequently, during query planning, it inaccurately estimates row counts—93 for Norway and 253 for Spain. Such misestimations can lead to inefficiencies, especially in complex queries involving JOINs or other operators. Extended statistics help mitigate this issue by calculating joint value distributions across column sets, thereby uncovering these dependencies.
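The misestimate can be observed directly in the plan. A minimal sketch, assuming the same table; the node type and exact numbers will vary with the installation:

```sql
EXPLAIN (ANALYZE, TIMING OFF, SUMMARY OFF)
SELECT count(*) FROM power_plants
WHERE country = 'NOR' AND primary_fuel = 'Solar';
-- Without extended statistics, the scan node reports an estimate like
-- rows=93 against actual rows=0 for Norway.
```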
Object-Relational Mappers (ORMs) can exacerbate these estimation issues. In the same dataset, filters on country and country_long are often used together. Their direct relationship is evident, yet when an ORM generates a GROUP BY clause for both fields, the estimation can fail dramatically:
```sql
EXPLAIN (ANALYZE, COSTS ON, TIMING OFF, BUFFERS OFF, SUMMARY OFF)
SELECT country, country_long FROM power_plants
GROUP BY country, country_long;
```
This scenario highlights the challenges posed by AI-generated queries, which may not always align with human logic. The extended statistics feature addresses these challenges by allowing the definition of three types of statistics on column sets: MCV (Most Common Values), ndistinct, and dependencies.
Understanding Extended Statistics
For scan filters, MCV proves to be the most effective. If the value combination being filtered is common within the table, the planner can estimate it accurately. Conversely, for rare combinations, such as solar power in Norway, PostgreSQL resorts to rough estimates, which can be refined by excluding values from the MCV list. For estimating group counts in operations like GROUP BY, DISTINCT, and Hash Join, the ndistinct statistic on column combinations becomes invaluable.
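For reference, the statistics kind can be selected explicitly when creating the object; a minimal sketch with illustrative statistics names:

```sql
-- MCV targets multi-column scan filters:
CREATE STATISTICS pp_mcv (mcv) ON country, primary_fuel FROM power_plants;
-- ndistinct targets group-count estimates (GROUP BY, DISTINCT, Hash Join):
CREATE STATISTICS pp_nd (ndistinct) ON country, country_long FROM power_plants;
ANALYZE power_plants;
```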
To illustrate the impact of extended statistics, consider the following command:
```sql
CREATE STATISTICS ON country, primary_fuel FROM power_plants;
ANALYZE;
```
After implementing this, the earlier query yields much more accurate cardinality estimates for filtering and grouping by these two fields. For Norway, the estimate improves to 1 row, while Spain remains at 253. Although the dataset size is modest, the effectiveness of the tool is evident.
Despite its advantages, extended statistics are seldom utilized in practice. This reluctance may stem from concerns that the ANALYZE command could be time-consuming, but more likely, it arises from the complexity of determining when and where to create these statistics.
Finding the Right Stats Definition
Determining the appropriate statistics to create can be guided by two empirical rules of thumb:
- Rule 1: Based on Index Definitions – If a Database Administrator (DBA) considers creating an index on a set of columns, it suggests that queries will frequently filter on those columns. Enhancing query performance on these filters is a compelling reason to improve plan quality. However, not every multi-column filter suffers from poor estimates, which can lead to unnecessary statistics creation.
- Rule 2: Based on Real-World Filter Patterns – This rule involves selecting candidate queries based on two factors: the load the query imposes on the database (e.g., pages-read) and the presence of multi-column scan filters. It is also beneficial to focus on cases where the actual row count significantly deviates from the estimate; a sketch of such a candidate-selection query follows this list.
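As a sketch of Rule 2, candidates can be pulled from pg_stat_statements, ordered by the load they generate (the pages-read expression below is one reasonable choice among several):

```sql
-- Top queries by pages read, as candidates for multi-column filter inspection:
SELECT queryid, calls,
       shared_blks_hit + shared_blks_read AS pages_read,
       query
FROM pg_stat_statements
ORDER BY pages_read DESC
LIMIT 10;
```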
While Rule 1 allows for immediate statistics creation alongside index creation, Rule 2 requires either a timer to collect queries or manual triggers. The absence of a heavy query for an extended period does not negate the need for speed when it eventually runs. Additionally, determining when to clean up old statistics poses challenges, particularly with Rule 2, where queries may re-emerge after periods of inactivity.
To streamline the process, the simpler Rule 1 was chosen as a starting point, with plans to develop technology for Rule 2 in the future. The goal is to create a recommender that analyzes pg_stat_statements and completed query plans, suggesting statistics definitions along with justifications for each recommendation.
Extension Overview
The extension concept is straightforward (see the repository). Initially, a hook is needed to collect created object IDs, which can be accomplished using object_access_hook. Next, identifying the appropriate time to filter for composite indexes and add statistics definitions to the database can be effectively managed with ProcessUtility_hook.
As extended statistics (specifically the ndistinct and dependencies types) are computed for all column combinations, the computational cost escalates quickly. To prevent overwhelming the database, a columns_limit parameter (indicating the maximum number of columns per statistic) and a stattypes parameter (specifying which types to include) were introduced. When auto-generated statistics are created, a dependency is registered not only on the table but also on the index used as a template. Consequently, if the index is dropped, the associated statistics are removed as well.
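In practice, the configuration and the dependency behavior look roughly like this (a sketch; the index name is illustrative, and the GUCs are the ones used later in this article):

```sql
-- Cap auto-generated statistics at four columns and two statistic types:
SET pg_index_stats.columns_limit = 4;
SET pg_index_stats.stattypes = 'mcv, ndistinct';

-- Creating a composite index triggers generation of a statistics object;
-- dropping the index removes the dependent statistics object as well.
CREATE INDEX pp_country_fuel_idx ON power_plants (country, primary_fuel);
DROP INDEX pp_country_fuel_idx;
```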
Deciding whether the extension should own the statistics, meaning that DROP EXTENSION would delete them, remains uncertain. The extension can function as a module without necessitating CREATE EXTENSION, impacting all databases within the cluster. To differentiate auto-generated statistics from manually created ones, the extension appends a comment to each statistic object, detailing the library and statistic name. Additionally, functions such as pg_index_stats_remove and pg_index_stats_rebuild facilitate bulk cleanup and regeneration, which is particularly useful if the schema existed prior to loading the module or if database settings have changed.
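Since each auto-generated object carries such a comment, one quick way to inspect them is to read the comments back; a minimal sketch:

```sql
-- List extended-statistics objects with their comments; auto-generated ones
-- can be recognized by the library and statistic name recorded in the comment.
SELECT stxname, obj_description(oid, 'pg_statistic_ext') AS comment
FROM pg_statistic_ext;
```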
One challenge involves minimizing redundant statistics. With numerous indexes, it is essential to reduce the ANALYZE load. A deduplication procedure was introduced (see pg_index_stats.compactify). For instance, if an index exists on t(x1,x2) and another is created on t(x2,x1), no new statistics are necessary. However, if t(x1,x2,x3) already exists, creating t(x2,x1) does require new MCV statistics, while ndistinct and dependencies can be omitted.
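That rule can be illustrated as follows (table and index names are hypothetical):

```sql
CREATE TABLE t (x1 int, x2 int, x3 int);

CREATE INDEX t_x1_x2_x3 ON t (x1, x2, x3); -- full statistics are generated here
CREATE INDEX t_x2_x1    ON t (x2, x1);     -- only MCV is added: ndistinct and
                                           -- dependencies on (x1,x2) are already
                                           -- covered by the three-column statistic
```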
Experimentation
Theoretical insights are valuable, but practical application is essential. Lacking access to a heavily loaded PostgreSQL instance, an older dump from a system with approximately 10,000 tables and three times as many indexes was utilized. Over 20,000 of these indexes contained multiple columns, with more than 1,000 having five or more columns.
Testing the extension revealed that running ANALYZE on this database took 22 seconds. With the extension activated and a 5-column limit, the duration increased to 55 seconds. The following table summarizes the raw ANALYZE times based on column limits and statistic types:
Limit | MCV | MCV + NDISTINCT | MCV + NDISTINCT + DEPENDENCIES |
---|---|---|---|
2 | 21s | 27s | 28s |
4 | 23s | 37s | 67s |
5 | 24s | 51s | 118s |
6 | 25s | 68s | 196s |
8 | 30s | 137s | 574s |
The results clearly indicate that encompassing all column combinations can become costly, particularly with dependencies. Thus, maintaining a limit of 3-5 columns or adhering to Rule 2 is advisable. This experience helps explain why SQL Server employs a dedicated worker for updating such statistics, given the associated costs.
To evaluate redundancy cleanup, another test was conducted:
```sql
SET pg_index_stats.columns_limit = 5;
SET pg_index_stats.stattypes = 'mcv, ndistinct, dependencies';

SET pg_index_stats.compactify = 'off';
SELECT pg_index_stats_rebuild();
ANALYZE;

SET pg_index_stats.compactify = 'on';
SELECT pg_index_stats_rebuild();
ANALYZE;
```
To monitor the statistics objects, the following queries were executed:
```sql
-- Total number of stat items
SELECT sum(nelems) FROM (
  SELECT array_length(stxkind, 1) AS nelems
  FROM pg_statistic_ext
) AS s;

-- Total per stat type (stxkind letters: m = MCV, d = ndistinct,
-- f = dependencies, e = expressions)
SELECT elem, count(elem) FROM (
  SELECT unnest(stxkind) AS elem FROM pg_statistic_ext
) AS s
GROUP BY elem;
```
The results are summarized in the table below:
Metric | Without Deduplication | With Deduplication |
---|---|---|
ANALYZE time (s) | 141 | 123 |
Total stat elements | 74,353 | 61,409 |
MCV | 24,783 | 24,089 |
DISTINCT | 24,783 | 18,658 |
DEPENDENCIES | 24,783 | 18,658 |
EXPRESSIONS | 4 | 4 |
The gains from deduplication are modest: approximately 15% in time and a more significant reduction in the volume of statistics. However, this approach offers protection against corner cases. Notably, deduplication also reduced the MCV count, indicating that some indexes differ solely in column order. Additionally, expression statistics appeared unexpectedly, as PostgreSQL automatically generates them when the statistics definition includes expressions. While not a major concern, controlling this behavior would be beneficial.
Comparing with joinsel
At Postgres Professional, another statistics collector known as joinsel exists, which is not a direct competitor to extended statistics. It creates a composite type based on the index definition and utilizes standard statistics from pg_statistic. The advantages of this approach include the inclusion of MCV, histograms (allowing for range filters), and compatibility with PostgreSQL’s core mechanics. However, it lacks dependencies and provides only one ndistinct value for the entire composite type, although this limitation can be addressed.
Testing the ANALYZE command with joinsel yielded the following:
```
SET enable_compound_index_stats = 'on';
SELECT pg_index_stats_remove();
\timing on
ANALYZE;
Time: 41248.977 ms (00:41.249)
```
The ANALYZE time doubled, which is a reasonable outcome, and the complexity increases linearly with the number of index columns.