The Postgres query planner is a house of cards built upon the ever-shifting sand of our data. It has the utterly impossible mission of converting our ridiculous and inane requests into a logical series of fetch, filter, sort, join, and other instructions. The resulting steps must also be ruthlessly efficient, or the execution phase could very well saturate every hardware resource available; Set Theory isn’t very forgiving.
Forewarned is forearmed is very apt when applied to database query planners. Without proper statistics, they are reduced to assumptions that make adequate first approximations. But scale is the utter enemy of imprecision, as multiplicative effects quickly overwhelm reality. This allows seemingly simple report scripts to fall endlessly into a pit of smoldering system resources.
To perhaps translate that analogy a bit, let’s start with a very basic schema:
CREATE TABLE sensor_log (
  id            SERIAL PRIMARY KEY NOT NULL,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);

INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, s.id % 100,
       CURRENT_DATE - (s.id || 's')::INTERVAL
  FROM generate_series(1, 5000000) s(id);

CREATE INDEX idx_sensor_log_reading_date
    ON sensor_log (reading_date DESC);

ANALYZE sensor_log;
It’s just a basic sensor log table with a mere five million rows. There are only one thousand locations and one hundred distinct reading values, with readings captured once per second. There’s nothing ground-breaking here.
Given how unassuming this structure appears, imagine we have a regularly scheduled script that executes in multiple steps. Each step produces an intermediate UNLOGGED TABLE, since there’s a chance the script is reentrant, and we’ve taught our users that valuable trick.
But there’s a problem. Execution time, disk IO, and CPU usage are all much higher than we’d like. If we investigated the first two parts of the script, we might see something like this:
CREATE UNLOGGED TABLE recent_info AS
SELECT * FROM sensor_log
 WHERE reading_date >= CURRENT_DATE - INTERVAL '5 day';

EXPLAIN ANALYZE
SELECT s.*
  FROM sensor_log s
 WHERE s.id IN (
         SELECT r.id FROM recent_info r
          WHERE r.reading BETWEEN 10 AND 50
       );

                               QUERY PLAN
------------------------------------------------------------------------
 Nested Loop  (cost=6966.50..6977.01 rows=2500000 width=23)
              (actual time=225.389..2151.500 rows=177120 loops=1)
   ->  HashAggregate  (cost=6966.07..6966.09 rows=2 width=4)
                      (actual time=225.342..334.923 rows=177120 loops=1)
         Group Key: r.id
         ->  Seq Scan on recent_info r
               (cost=0.00..6962.56 rows=1404 width=4)
               (actual time=0.030..135.195 rows=177120 loops=1)
               Filter: ((reading >= 10) AND (reading <= 50))
               Rows Removed by Filter: 254880
   ->  Index Scan using sensor_log_pkey on sensor_log s
         (cost=0.43..5.45 rows=1 width=23)
         (actual time=0.009..0.010 rows=1 loops=177120)
         Index Cond: (id = r.id)
 Planning time: 0.290 ms
 Execution time: 2164.980 ms
The first step creates an unlogged table to store some small fraction of the source data. That in itself is extremely common. The problems start immediately when we try to use that intermediate table as the basis for further tables. If we examine the query that builds the second table, it’s readily apparent something is horribly wrong.
Don’t worry about learning to read EXPLAIN output. In all of this output, the first set of parentheses is the planner estimate for that particular step, while the second outlines what actually happened. If we just compare the expected and actual row counts, the estimate of matched rows in recent_info was off by two orders of magnitude. Due to this drastic underestimation, the planner figured it would be faster to loop through the 1404 rows and find corresponding matches in sensor_log via its primary key.
Well, there’s a dramatic difference between looping over an expected 1,404 matches and the nearly 200,000 we actually got. Our example isn’t terrible at this scale, but a real system likely dwarfs ours by a factor of 100 at minimum. Errors scale, and not always linearly.
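To put a number on that error, here’s a quick back-of-the-envelope check; the figures come straight from the plan above, and Python is used purely for illustration:

```python
# Figures from the EXPLAIN output above (variable names are ours).
estimated_outer_rows = 1404   # planner's guess for the filtered subquery
actual_outer_rows = 177120    # rows the filter really produced

# A nested loop performs one inner index scan per outer row, so the
# number of index probes scales linearly with the outer row count.
error_factor = actual_outer_rows / estimated_outer_rows
print(f"{error_factor:.0f}x more index probes than budgeted")  # 126x
```

A 126x miss on the driving side of a nested loop means 126x more index probes than the plan was costed for, which is exactly the kind of multiplicative effect that turns a quick script into a resource fire.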
Postgres has a mechanism for fixing bad estimates like this called ANALYZE. There’s even a series of background workers with the sole duty of analyzing tables as data accumulates. If we inserted an arbitrary pause in the script, it’s possible one of those workers would eventually accumulate the missing statistics and transform the planner assumptions into hard facts. Or we could force the issue and analyze the table ourselves.
Let’s see how that changes the execution plan:
ANALYZE recent_info;

EXPLAIN ANALYZE
SELECT s.*
  FROM sensor_log s
 WHERE s.id IN (
         SELECT r.id FROM recent_info r
          WHERE r.reading BETWEEN 10 AND 50
       );

                               QUERY PLAN
------------------------------------------------------------------------
 Merge Semi Join  (cost=26533.87..42842.33 rows=176707 width=23)
                  (actual time=195.161..419.301 rows=177120 loops=1)
   Merge Cond: (s.id = r.id)
   ->  Index Scan using sensor_log_pkey on sensor_log s
         (cost=0.43..141129.93 rows=5000000 width=23)
         (actual time=0.010..92.786 rows=431951 loops=1)
   ->  Materialize  (cost=26533.15..27416.68 rows=176707 width=4)
                    (actual time=195.130..237.153 rows=177120 loops=1)
         ->  Sort  (cost=26533.15..26974.92 rows=176707 width=4)
                   (actual time=195.125..218.283 rows=177120 loops=1)
               Sort Key: r.id
               Sort Method: external sort  Disk: 2424kB
               ->  Seq Scan on recent_info r
                     (cost=0.00..9232.00 rows=176707 width=4)
                     (actual time=0.018..93.871 rows=177120 loops=1)
                     Filter: ((reading >= 10) AND (reading <= 50))
                     Rows Removed by Filter: 254880
 Planning time: 0.305 ms
 Execution time: 426.620 ms
The degree of difference here really illustrates how well the planner adapts. Instead of a nested loop, it opted to fetch, filter, and sort the rows from recent_info and merge that into the primary key for sensor_log to find the intersection. All of our row estimates are much better, too. Why such a radical departure from the original plan?
To find that answer, we need to examine the Postgres catalog. This is where Postgres maintains everything it knows about tables, indexes, and other objects that reside within its confines. Of particular interest to us are the pg_stats view and the pg_class table. This is a small part of what they contain after we analyzed recent_info:
SELECT reltuples, relpages
  FROM pg_class
 WHERE relname = 'recent_info';

 reltuples | relpages
-----------+----------
    432000 |     2752

SELECT attname, n_distinct
  FROM pg_stats
 WHERE tablename = 'recent_info';

   attname    | n_distinct
--------------+------------
 id           |         -1
 location     |       1000
 reading      |        100
 reading_date |         -1
If we ran these same two queries immediately after creating recent_info, the first would report zero tuples, and the second would show no matches at all. At that point, Postgres knew next to nothing about the table, and that is reflected in the row estimates and planner decisions.
After we analyzed recent_info, Postgres garnered a vast smorgasbord of pertinent statistics. Not only does it have an approximate row count, it also knows how many data pages the table occupies. This lets Postgres calculate the expense related to hardware interaction: how much work is involved in fetching and processing these rows.
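As a sketch of what that calculation looks like, here’s the standard sequential-scan cost formula applied to the pg_class figures above, using the default planner cost settings. This is a simplification of the real logic (which lives in the planner’s costsize.c), but the result lines up with the Seq Scan node in the plan:

```python
# Default planner cost parameters (postgresql.conf defaults).
seq_page_cost = 1.0         # cost of reading one page sequentially
cpu_tuple_cost = 0.01       # cost of processing one row
cpu_operator_cost = 0.0025  # cost of evaluating one operator

# Post-ANALYZE figures for recent_info, from pg_class above.
relpages = 2752
reltuples = 432000
filter_operators = 2        # reading >= 10 AND reading <= 50

total_cost = (relpages * seq_page_cost
              + reltuples * cpu_tuple_cost
              + reltuples * filter_operators * cpu_operator_cost)
print(total_cost)  # 9232.0 -- matching the Seq Scan's cost=0.00..9232.00
```

Without reltuples and relpages, none of these terms can be computed, which is why a freshly created table leaves the planner guessing.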
What Postgres gains from the contents of pg_stats is altogether different. When Postgres analyzes a table, it performs a heuristic statistical sampling of its physical contents. This includes facts such as the most frequent values in each column, the average size of column data, the number of distinct values per column, and so on. As we can see, Postgres did a great job of scanning the table contents, as it identified the exact variance for our location and reading columns.
Negative values in n_distinct denote a ratio between the number of distinct values for that column and the total row count of the table. From that, we can derive that there’s basically one unique reading_date per row. Neat! For us mundane humans, this shows us the best columns to index. For Postgres, it will consider value histograms, multiply frequencies together, and produce much more accurate row estimates. Better estimates almost always result in improved query execution times.
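To illustrate both points, here’s a small Python sketch of how n_distinct decodes, and how a naive uniform-distribution estimate compares to the planner’s. The planner itself refines this with histograms and most-common-value lists, so its figure differs slightly:

```python
# Row count for recent_info, from pg_class above.
reltuples = 432000

def distinct_count(n_distinct, reltuples):
    """Negative n_distinct is a fraction of the row count; positive is absolute."""
    return -n_distinct * reltuples if n_distinct < 0 else n_distinct

print(distinct_count(-1, reltuples))   # reading_date: 432000, one per row
print(distinct_count(100, reltuples))  # reading: exactly 100 distinct values

# If the 100 reading values are uniformly distributed, the filter
# "reading BETWEEN 10 AND 50" keeps 41 of them:
selectivity = 41 / 100
print(reltuples * selectivity)  # 177120.0 -- near the plan's 176707 estimate
```

The naive estimate already lands within a fraction of a percent of the plan’s rows=176707, which is all the planner needed to abandon the nested loop.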
Without table analysis, Postgres is effectively operating blindly. Adding an ANALYZE statement after initializing every temporary or unlogged table is a critical element of script performance. Yet it’s also a step that’s often omitted. Inexperienced users aren’t aware of Postgres internals, and may not even know the ANALYZE command exists.
There is, of course, a way to cheat and prevent our users from having to become Postgres experts. Postgres added event triggers in version 9.3. These triggers can activate any time DDL is detected, and that means we can detect new tables right when they’re created.
CREATE OR REPLACE FUNCTION analyze_new_table()
RETURNS event_trigger AS
$$
DECLARE
  tab_name TEXT;
BEGIN
  IF tg_tag IN ('CREATE TABLE AS', 'SELECT INTO') THEN
    FOR tab_name IN
        SELECT objid::REGCLASS::TEXT
          FROM pg_event_trigger_ddl_commands()
    LOOP
      EXECUTE 'ANALYZE ' || tab_name;
    END LOOP;
  END IF;
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER t_analyze_new ON ddl_command_end
EXECUTE PROCEDURE analyze_new_table();
Now any time a table is created with CREATE TABLE AS or SELECT INTO, Postgres will instantly analyze the contents. Depending on the size of the table and the granularity of the default settings, this may be a relatively demanding operation. However, considering how greatly statistics determine performance, such a trigger may actually be a requirement in some settings.
The alternative is allowing illiteracy of Postgres internals, or an accidentally forgotten ANALYZE statement, to spell the difference between a report script executing in a matter of minutes or over the course of several grueling hours.
I know which scenario I’d prefer!