PG Phriday: Primal Planner Prep

The Postgres query planner is a house of cards built upon the ever-shifting sand of our data. It has the utterly impossible mission of converting our ridiculous and inane requests into a logical series of fetch, filter, sort, join, and other instructions. Then the resulting steps must be ruthlessly efficient, or the execution phase could very well saturate every hardware resource available; Set Theory isn’t very forgiving.

The saying “forewarned is forearmed” is very apt when applied to database query planners. Without proper statistics, they are reduced to assumptions that make adequate first approximations. But scale is the utter enemy of imprecision, as multiplicative effects quickly overwhelm reality. This is how seemingly simple report scripts fall endlessly into a pit of smoldering system resources.

To perhaps translate that analogy a bit, let’s start with a very basic schema:

CREATE TABLE sensor_log (
  id            SERIAL PRIMARY KEY NOT NULL,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);

INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, s.id % 100,
       CURRENT_DATE - (s.id || 's')::INTERVAL
  FROM generate_series(1, 5000000) s(id);

CREATE INDEX idx_sensor_log_reading_date
    ON sensor_log (reading_date DESC);

ANALYZE sensor_log;

It’s just a basic sensor log table with a mere five million rows. There are only one thousand locations and one hundred distinct reading values, and readings are captured once per second. There’s nothing ground-breaking here.
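Just to confirm the shape of that data, here’s a quick sanity check (a hypothetical query, not part of the original script):

SELECT count(DISTINCT location) AS locations,
       count(DISTINCT reading) AS distinct_readings,
       min(reading_date) AS oldest_reading,
       max(reading_date) AS newest_reading
  FROM sensor_log;

The counts come back as 1000 and 100 respectively, exactly what the modulo math above dictates.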

Given how unassuming this structure appears, imagine we have a regularly scheduled script that executes in multiple steps. Each step produces an intermediate UNLOGGED TABLE because there’s a chance the script is reentrant, and we taught our users that valuable trick.

But there’s a problem. Execution time, disk IO, and CPU usage are all much higher than we’d like. If we investigated the first two parts of the script, we might see something like this:

CREATE UNLOGGED TABLE recent_info AS
SELECT * FROM sensor_log
 WHERE reading_date >= CURRENT_DATE - INTERVAL '5 day';

EXPLAIN ANALYZE
SELECT s.*
  FROM sensor_log s
 WHERE s.id IN (
         SELECT r.id FROM recent_info r
          WHERE r.reading BETWEEN 10 AND 50
       );

                              QUERY PLAN
------------------------------------------------------------------------
 Nested Loop  (cost=6966.50..6977.01 rows=2500000 width=23)
              (actual time=225.389..2151.500 rows=177120 loops=1)
   ->  HashAggregate  (cost=6966.07..6966.09 rows=2 width=4)
                      (actual time=225.342..334.923 rows=177120 loops=1)
         Group Key: r.id
         ->  Seq Scan on recent_info r
                 (cost=0.00..6962.56 rows=1404 width=4)
                 (actual time=0.030..135.195 rows=177120 loops=1)
               Filter: ((reading >= 10) AND (reading <= 50))
               Rows Removed by Filter: 254880
   ->  Index Scan using sensor_log_pkey on sensor_log s
           (cost=0.43..5.45 rows=1 width=23)
           (actual time=0.009..0.010 rows=1 loops=177120)
         Index Cond: (id = r.id)

 Planning time: 0.290 ms
 Execution time: 2164.980 ms

The first step creates an unlogged table to store some small fraction of the source data. That in itself is extremely common. The problems start immediately when we try to use that intermediate table as the basis for further tables. If we examine the query that builds the second table, it’s readily apparent something is horribly wrong.

Don’t worry about learning to read EXPLAIN output just yet. In all of this output, the first set of parentheses is the planner’s estimate for that particular step, while the second outlines what actually happened. If we compare the expected and actual row counts, the estimate of matched rows in recent_info was off by two orders of magnitude. Due to this drastic underestimation, the planner figured it would be faster to loop through the 1404 expected rows and find corresponding matches in sensor_log.

Well, there’s a dramatic difference between looping over some 1,400 matches and looping over nearly 200,000. Our example isn’t terrible at this scale, but a real system likely dwarfs ours by a factor of 100 at minimum. Errors scale, and not always linearly.

Postgres has a mechanism for fixing bad estimates like this called ANALYZE. There’s even a series of background workers with the sole duty of analyzing tables as data accumulates. If we inserted an arbitrary pause in the script, it’s possible one of those workers would eventually accumulate the missing statistics and transform the planner assumptions into hard facts. Or we could force the issue and analyze the table ourselves.
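If we were curious whether one of those autoanalyze workers had already gotten there, the statistics collector would tell us (a hypothetical check, not part of the original script):

-- If last_autoanalyze is still NULL, nobody has analyzed the table
-- yet, and we fall back to running ANALYZE ourselves.
SELECT relname, last_analyze, last_autoanalyze
  FROM pg_stat_user_tables
 WHERE relname = 'recent_info';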

Let’s see how that changes the execution plan:

ANALYZE recent_info;

EXPLAIN ANALYZE
SELECT s.*
  FROM sensor_log s
 WHERE s.id IN (
         SELECT r.id FROM recent_info r
          WHERE r.reading BETWEEN 10 AND 50
       );

                              QUERY PLAN
------------------------------------------------------------------------
 Merge Semi Join  (cost=26533.87..42842.33 rows=176707 width=23)
                  (actual time=195.161..419.301 rows=177120 loops=1)
   Merge Cond: (s.id = r.id)
   ->  Index Scan using sensor_log_pkey on sensor_log s
           (cost=0.43..141129.93 rows=5000000 width=23)
           (actual time=0.010..92.786 rows=431951 loops=1)
   ->  Materialize  (cost=26533.15..27416.68 rows=176707 width=4)
                    (actual time=195.130..237.153 rows=177120 loops=1)
         ->  Sort  (cost=26533.15..26974.92 rows=176707 width=4)
                   (actual time=195.125..218.283 rows=177120 loops=1)
               Sort Key: r.id
               Sort Method: external sort  Disk: 2424kB
               ->  Seq Scan on recent_info r
                       (cost=0.00..9232.00 rows=176707 width=4)
                       (actual time=0.018..93.871 rows=177120 loops=1)
                     Filter: ((reading >= 10) AND (reading <= 50))
                     Rows Removed by Filter: 254880

 Planning time: 0.305 ms
 Execution time: 426.620 ms

The degree of difference here really illustrates how well the planner adapts. Instead of a nested loop, it opted to fetch, filter, and sort the rows from recent_info and merge that into the primary key for sensor_log to find the intersection. All of our row estimates are much better, too. Why such a radical departure from the original plan?

To find that answer, we need to examine the Postgres catalog. This is where Postgres maintains everything it knows about tables, indexes, and other objects that reside within its confines. Of particular interest to us are the pg_stats view and pg_class table. This is a small part of what they contain after we analyzed recent_info:

SELECT reltuples, relpages
  FROM pg_class
 WHERE relname = 'recent_info';

 reltuples | relpages
-----------+----------
    432000 |     2752

SELECT attname, n_distinct
  FROM pg_stats
 WHERE tablename = 'recent_info';

   attname    | n_distinct 
--------------+------------
 id           |         -1
 location     |       1000
 reading      |        100
 reading_date |         -1

If we ran these same two queries immediately after creating recent_info, the first would report zero tuples, and the second would show no matches at all. At that point, Postgres knew next to nothing about the table, and that is reflected in the row estimates and planner decisions.

After we analyzed recent_info, Postgres garnered a vast smorgasbord of pertinent statistics. Not only does it have an approximate row count, it also knows how many data pages the table occupies. This lets Postgres calculate the expense related to hardware interaction: how much work is involved in fetching and processing these rows.
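In fact, assuming the default cost settings (an assumption on my part), those two numbers land right on the 9232.00 sequential scan estimate from the second plan. Treat this as a back-of-the-envelope sketch rather than the planner’s literal code path:

-- Seq Scan on recent_info, estimated at 9232.00 in the plan above:
--   relpages  * seq_page_cost          = 2752   * 1.0        = 2752
--   reltuples * cpu_tuple_cost         = 432000 * 0.01       = 4320
--   reltuples * cpu_operator_cost * 2  = 432000 * 0.0025 * 2 = 2160
SELECT 2752 * current_setting('seq_page_cost')::NUMERIC
     + 432000 * current_setting('cpu_tuple_cost')::NUMERIC
     + 432000 * current_setting('cpu_operator_cost')::NUMERIC * 2
    AS seq_scan_cost;

The two extra operator costs account for the pair of comparisons in our BETWEEN filter.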

What Postgres gains from the contents of pg_stats is altogether different. When Postgres analyzes a table, it performs a heuristic statistical sampling of its physical contents. This includes facts such as the most frequent values for each column, the average size of column data, the number of distinct values per column, and so on. As we can see, Postgres did a great job of sampling the table contents, as it identified the exact number of distinct values for our location and reading data.
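All of that raw material is visible in the same view. Here’s a hypothetical follow-up query pulling a few more of those columns for the reading column (the exact frequencies will vary from sample to sample):

SELECT attname, null_frac, avg_width, n_distinct,
       most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'recent_info'
   AND attname = 'reading';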

Negative values in n_distinct denote a ratio between the number of distinct values for that column and the total row count of the table. From that, we can derive that there’s basically one unique id or reading_date per row. Neat! For us mundane humans, this shows us the best columns to index. For Postgres, it means value histograms and frequencies can be multiplied together to produce much more accurate row estimates. Better estimates almost always result in improved query execution times.
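We can even replay a simplified version of that reasoning for our filter. With one hundred evenly distributed reading values, BETWEEN 10 AND 50 covers 41 of them, so a little arithmetic gets us very close to the planner’s guess (the real calculation uses histograms and sampled frequencies, so treat this as a sketch):

-- 41 of the 100 distinct reading values fall within [10, 50], over a
-- table of roughly 432000 rows:
SELECT round(41 / 100.0 * 432000) AS rough_row_estimate;  -- 177120

That lands dead on the 177120 rows the second plan actually produced, and within a whisker of its 176707-row estimate.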

Without table analysis, Postgres is effectively operating blindly. Adding an ANALYZE statement after initializing every temporary or unlogged table is a critical element of script performance. Yet it’s also a step that’s often omitted. Inexperienced users aren’t aware of Postgres internals, and may not even know the ANALYZE command exists.
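In script form, the fix is a single extra line after each intermediate table, sketched here for the first step of our example:

CREATE UNLOGGED TABLE recent_info AS
SELECT * FROM sensor_log
 WHERE reading_date >= CURRENT_DATE - INTERVAL '5 day';

-- One extra line, and every subsequent step plans against real
-- statistics instead of blind assumptions.
ANALYZE recent_info;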

There is, of course, a way to cheat and prevent our users from having to become Postgres experts. Postgres added event triggers in version 9.3. These triggers can activate any time DDL is detected, and that means we can detect new tables right when they’re created.

Watch this:

CREATE OR REPLACE FUNCTION analyze_new_table()
RETURNS event_trigger AS
$$
DECLARE
  tab_name TEXT;
BEGIN
  -- Only react to statements that create and populate a table in one go.
  IF tg_tag IN ('CREATE TABLE AS', 'SELECT INTO') THEN
    -- pg_event_trigger_ddl_commands() lists every object the statement
    -- created, so analyze each new table it reports.
    FOR tab_name IN
      SELECT objid::REGCLASS::TEXT
        FROM pg_event_trigger_ddl_commands()
    LOOP
      EXECUTE 'ANALYZE ' || tab_name;
    END LOOP;
  END IF;
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER t_analyze_new
    ON ddl_command_end 
       EXECUTE PROCEDURE analyze_new_table();

Now any time a table is created with CREATE TABLE AS or SELECT INTO, Postgres will instantly analyze the contents. Depending on the size of the table and the granularity of the default settings, this may be a relatively demanding operation. However, considering how greatly statistics determine performance, such a trigger may actually be a requirement in some settings.
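A quick way to verify the trigger is earning its keep (using a hypothetical throwaway table name):

-- With the event trigger installed, statistics should appear without
-- any manual ANALYZE step:
CREATE UNLOGGED TABLE trigger_test AS
SELECT * FROM sensor_log LIMIT 1000;

SELECT attname, n_distinct
  FROM pg_stats
 WHERE tablename = 'trigger_test';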

The alternative is letting illiteracy of Postgres internals, or a simply forgotten ANALYZE statement, spell the difference between a report script that finishes in a matter of minutes and one that grinds away for several grueling hours.

I know which scenario I’d prefer!