PostgreSQL's extended statistics feature collects additional statistics on specified sets of table columns, which helps when a dataset has implicit relationships between columns. In the power plant dataset, for instance, the primary_fuel column is correlated with the country column. The correlation does not change query results, but it skews row count estimates, because by default the planner treats conditions on different columns as independent. With extended statistics defined on country and primary_fuel, cardinality estimates become more accurate: the estimate for Norway improves from 93 rows to 1.
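The effect of correlated columns on an independence-based estimate can be sketched in a few lines of Python. The rows below are invented toy data, not the power plant dataset, and both estimator functions are hypothetical illustrations rather than PostgreSQL's actual code.

```python
from collections import Counter

# Toy rows of (country, primary_fuel); invented for illustration only.
rows = (
    [("NOR", "Hydro")] * 8
    + [("NOR", "Gas")] * 2
    + [("POL", "Coal")] * 9
    + [("POL", "Hydro")] * 1
)


def independent_estimate(rows, country, fuel):
    """Estimate matching rows by multiplying per-column selectivities."""
    n = len(rows)
    sel_country = sum(1 for c, _ in rows if c == country) / n
    sel_fuel = sum(1 for _, f in rows if f == fuel) / n
    return sel_country * sel_fuel * n


def joint_estimate(rows, country, fuel):
    """Estimate matching rows from the frequency of the value pair itself,
    which is roughly the information a multi-column MCV list records."""
    return Counter(rows)[(country, fuel)]


print(independent_estimate(rows, "NOR", "Coal"))
print(joint_estimate(rows, "NOR", "Coal"))
```

With this toy data the independence assumption predicts about 4.5 rows for a country and fuel pair that never occurs together, while the joint frequency gives 0.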
Extended statistics come in three kinds: MCV (most common values), ndistinct, and dependencies. MCV lists are effective for frequently occurring value combinations, ndistinct helps estimate the number of groups in operations such as GROUP BY, and dependencies capture functional dependencies between columns. Despite these advantages, extended statistics are rarely used, both because of concerns that they make the ANALYZE command slower and because it is hard to decide when creating them is worthwhile.
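As a rough illustration of why ndistinct matters for GROUP BY, the Python sketch below compares a group count derived from per-column distinct counts with the true number of distinct value pairs. It is deliberately simplified: the toy rows are invented, the function names are hypothetical, and PostgreSQL's real group estimation applies further adjustments.

```python
# Toy rows of (country, primary_fuel); invented for illustration only.
rows = (
    [("NOR", "Hydro")] * 5
    + [("NOR", "Gas")] * 1
    + [("POL", "Coal")] * 5
    + [("DEU", "Coal")] * 3
)


def groups_assuming_independence(rows):
    """Multiply per-column distinct counts, capped by the row count."""
    n_country = len({country for country, _ in rows})
    n_fuel = len({fuel for _, fuel in rows})
    return min(n_country * n_fuel, len(rows))


def groups_from_ndistinct(rows):
    """Count distinct (country, primary_fuel) pairs directly, which is the
    quantity a multi-column ndistinct statistic stores."""
    return len(set(rows))


print(groups_assuming_independence(rows))
print(groups_from_ndistinct(rows))
```

Here three countries and three fuels suggest nine groups, although only four combinations actually exist.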
Two rules of thumb guide the choice of which statistics to create: Rule 1 derives statistics from index definitions, while Rule 2 derives them from real-world filter patterns. The extension works by collecting the IDs of created objects and controlling when statistics definitions are added to the database. Two parameters, columns_limit and stattypes, help keep the computational cost of generating extended statistics under control.
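Rule 1 can be sketched as a small Python helper that turns an index's column list into a CREATE STATISTICS statement. This is a sketch under stated assumptions, not the extension's implementation: the function name, the naming scheme, the default values, and the table name power_plants are hypothetical, and it assumes that columns_limit caps how many leading index columns are used and that stattypes lists the statistics kinds to build. Identifiers are assumed to need no quoting.

```python
# Statistics kinds accepted by PostgreSQL's CREATE STATISTICS.
VALID_KINDS = ("ndistinct", "dependencies", "mcv")


def statistics_for_index(table, index_columns, columns_limit=5,
                         stattypes=VALID_KINDS):
    """Return a CREATE STATISTICS statement for a multi-column index,
    or None when no extended statistics apply.

    Assumption: columns_limit keeps only the leading index columns.
    """
    cols = list(index_columns)[:columns_limit]
    if len(cols) < 2:
        # Extended statistics need at least two columns.
        return None
    kinds = [kind for kind in stattypes if kind in VALID_KINDS]
    if not kinds:
        return None
    name = f"{table}_{'_'.join(cols)}_stat"  # hypothetical naming scheme
    return (
        f"CREATE STATISTICS IF NOT EXISTS {name} ({', '.join(kinds)}) "
        f"ON {', '.join(cols)} FROM {table};"
    )


print(statistics_for_index("power_plants", ["country", "primary_fuel"],
                           stattypes=("mcv", "ndistinct")))
```

Restricting stattypes is the cheaper lever in this sketch, since each kind listed is computed separately during ANALYZE.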
Testing showed that ANALYZE took longer with the extension enabled, particularly when dependencies statistics were included. Deduplication procedures were introduced to avoid redundant statistics; they yielded modest time savings and a significant reduction in the volume of statistics. A comparison with another statistics collector, joinsel, indicated that although it provides some benefits, it lacks the full capabilities of extended statistics, particularly with respect to dependencies.
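The summary does not spell out the deduplication procedure, so the following Python sketch shows only one plausible reading: candidate definitions that cover the same set of columns, regardless of order, are merged into a single definition. The function name and the input shape are hypothetical.

```python
def deduplicate(candidates):
    """Merge candidate statistics that cover the same set of columns.

    candidates: iterable of (columns, kinds) pairs, where columns is a
    sequence of column names and kinds is a set of statistics kinds.
    Assumption: column order does not matter for the resulting statistics.
    """
    merged = {}
    for columns, kinds in candidates:
        key = frozenset(columns)
        if key in merged:
            # Same column set seen before: just union the requested kinds.
            merged[key][1].update(kinds)
        else:
            merged[key] = (tuple(columns), set(kinds))
    return [(cols, sorted(kinds)) for cols, kinds in merged.values()]


candidates = [
    (("country", "primary_fuel"), {"mcv"}),
    (("primary_fuel", "country"), {"ndistinct"}),
    (("country", "capacity"), {"mcv"}),
]
print(deduplicate(candidates))
```

In this example, three candidates collapse into two definitions, which mirrors the kind of volume reduction described above.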