Improving GROUP BY with CREATE STATISTICS

April 6, 2025

From https://www.cybertec-postgresql.com/en/improving-group-by-with-create-statistics

Creating Sample Data

Create some sample data:

> CREATE TABLE t_timeseries (
     t   TIMESTAMPTZ
    ,val NUMERIC DEFAULT random()
    );
CREATE TABLE

> INSERT INTO t_timeseries SELECT *
  FROM GENERATE_SERIES('2022-01-01', '2024-12-31', '1 second'::INTERVAL);
INSERT 0 94608001

This will insert 94M rows. Finally you can create optimizer statistics for this data:

>  ANALYZE;
ANALYZE

GROUP BY Queries Featuring Expressions

What you want to do now is to count how many entries we can expect per day. You can use DATE_TRUNC function which turns a precise timestamp into day, week, month, etc.

> EXPLAIN ANALYZE
  SELECT DATE_TRUNC('day', t), COUNT(*)
  FROM t_timeseries
  GROUP BY 1;

The problem is that the planner estimates that the GROUP BY will return 94M rows. The planner thinks that it’s grouped by second instead of day.

CREATE STATISTICS: Helping the PostgreSQL Optimizer

The reason for the failure is that PostgreSQL does have statistics for the column but it does not have statistics about the expression (= the output of DATE_TRUNC).

CREATE STATISTICS is a good way to tell the system keep track of additional statistics which can help the planner:

> \h CREATE STATISTICS

This will get you help description of the command. We can tell PostgreSQL to keep track of multi-column related statistics, but also about expressions.

> CREATE STATISTICS mystats
   ON (DATE_TRUNC('day', t)
   FROM t_timeseries;
CREATE STATISTICS

> ANALYZE;
ANALYZE

The system is will now sample for DATE_TRUNC('day', t) and maintain this information just like simple column-related statistics.

> EXPLAIN ANALYZE
  SELECT DATE_TRUNC('day', t), COUNT(*)
  FROM t_timeseries
  GROUP BY 1;

Now PostgreSQL has changed to a parallel query will return 1,096 rows instead which the estimate was 1,095.

Include Schema in psycopg2 Connection String Find and Replace in Vim