PG Phriday: Pesky Partition Plans

For all of those warehouse queries that never seem to complete before the heat death of the universe, there’s often a faster version. Sometimes this is due to a fundamental misunderstanding of how queries work, or how Postgres specifically functions. The trick is knowing when to back away slowly from an ugly but efficient query, and when to inject a flurry of predicates to fully illustrate the original intent of the query so the planner makes better decisions. When partitions are involved, this can be quite an adventure.

Sometimes optimizing partitioned plans is an art. Other times, things are a little easier to isolate. Imagine we have a sensor log partitioned by month, and our app is fairly new, so there are only three months available. Anything illustrated in this case would only get worse as time progresses.

Just to make things simple, there will be no trigger logic or anything to actually move rows around. We’re just looking at query performance, not the feasibility of the structure itself.

CREATE TABLE sensor_log (
  sensor_id     INT NOT NULL,
  location      VARCHAR NOT NULL,
  reading       INT NOT NULL,
  reading_date  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE sensor_log_201607 () INHERITS (sensor_log);
CREATE TABLE sensor_log_201608 () INHERITS (sensor_log);
CREATE TABLE sensor_log_201609 () INHERITS (sensor_log);

INSERT INTO sensor_log_201607 (sensor_id, location, reading, reading_date)
SELECT s.id % 1000, (random() * 100)::INT, s.id % 100,
       '2016-08-01'::TIMESTAMPTZ - (s.id || 's')::INTERVAL
  FROM generate_series(1, 3600*24*30) s(id);

INSERT INTO sensor_log_201608 (sensor_id, location, reading, reading_date)
SELECT s.id % 1000, (random() * 100)::INT, s.id % 100,
       '2016-09-01'::TIMESTAMPTZ - (s.id || 's')::INTERVAL
  FROM generate_series(1, 3600*24*30) s(id);

INSERT INTO sensor_log_201609 (sensor_id, location, reading, reading_date)
SELECT s.id % 1000, (random() * 100)::INT, s.id % 100,
       '2016-10-01'::TIMESTAMPTZ - (s.id || 's')::INTERVAL
  FROM generate_series(1, 3600*24*30) s(id);

ALTER TABLE sensor_log_201607
  ADD CONSTRAINT ck_201607_part_reading_date
CHECK (reading_date >= '2016-07-01' AND reading_date < '2016-08-01');

ALTER TABLE sensor_log_201608
  ADD CONSTRAINT ck_201608_part_reading_date
CHECK (reading_date >= '2016-08-01' AND reading_date < '2016-09-01');

ALTER TABLE sensor_log_201609
  ADD CONSTRAINT ck_201609_part_reading_date
CHECK (reading_date >= '2016-09-01' AND reading_date < '2016-10-01');

CREATE INDEX idx_sensor_log_201607_sensor_id
    ON sensor_log_201607 (sensor_id);
CREATE INDEX idx_sensor_log_201607_reading_date
    ON sensor_log_201607 (reading_date DESC);

CREATE INDEX idx_sensor_log_201608_sensor_id
    ON sensor_log_201608 (sensor_id);
CREATE INDEX idx_sensor_log_201608_reading_date
    ON sensor_log_201608 (reading_date DESC);

CREATE INDEX idx_sensor_log_201609_sensor_id
    ON sensor_log_201609 (sensor_id);
CREATE INDEX idx_sensor_log_201609_reading_date
    ON sensor_log_201609 (reading_date DESC);

ANALYZE sensor_log_201607;
ANALYZE sensor_log_201608;
ANALYZE sensor_log_201609;

With these three (not including the root) tables in place, a query can target the base table and operate as if the partitioning didn’t exist. There isn’t anything new in that regard, but it’s the first place where a business intelligence operative might stumble. The base table can obscure potentially hundreds of other segments and the associated overhead.
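
As a quick sanity check, a direct query against the root table with a date predicate illustrates the mechanism. This is just a sketch and the exact plan will differ, but with the CHECK constraints in place and the default constraint_exclusion setting, only the September partition (plus the empty root) should appear in the plan:

EXPLAIN
SELECT count(*)
  FROM sensor_log
 WHERE reading_date >= '2016-09-11'
   AND reading_date < '2016-09-18';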

This is especially concerning in cumulative warehouse archives where there are no primary keys. In these contexts, data is unique per key and date combination, so the date isn't merely a matter of isolating specific ranges for comparative aggregates; it's part of each row's identity.
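
If we wanted to enforce that kind of uniqueness in this example, it would have to be declared on each partition. A sketch for the September segment might look like this, with an invented index name:

CREATE UNIQUE INDEX udx_sensor_log_201609_key_date
    ON sensor_log_201609 (sensor_id, reading_date);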

An operator might understand that smaller result sets mean fewer index matches and faster execution times, so maybe they try to isolate data step by step with temporary tables or CTEs. Using that as a foundation, they produce this query:

EXPLAIN
WITH source AS (
    SELECT *
      FROM sensor_log
     WHERE reading_date >= '2016-09-11'
       AND reading_date < '2016-09-18'
       AND location = '5'
)
SELECT source.*, target.*
  FROM source 
  LEFT JOIN sensor_log target ON (
         source.sensor_id = target.sensor_id AND
         target.location = '10'
       );

 Merge Left Join  (cost=178118.79..212683.46 rows=2302397 width=66)
   Merge Cond: (source.sensor_id = target.sensor_id)
   CTE source
     ->  Append  (cost=0.00..24384.21 rows=5743 width=18)
           ->  Seq Scan on sensor_log  (cost=0.00..0.00 rows=1 width=48)
                 Filter: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone) AND ((location)::text = '5'::text))
           ->  Index Scan using idx_sensor_log_201609_reading_date on sensor_log_201609  (cost=0.43..24384.21 rows=5742 width=18)
                 Index Cond: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone))
                 Filter: ((location)::text = '5'::text)
   ->  Sort  (cost=473.44..487.80 rows=5743 width=48)
         Sort Key: source.sensor_id
         ->  CTE Scan on source  (cost=0.00..114.86 rows=5743 width=48)
   ->  Sort  (cost=153261.13..153461.58 rows=80181 width=18)
         Sort Key: target.sensor_id
         ->  Append  (cost=0.00..146730.00 rows=80181 width=18)
               ->  Seq Scan on sensor_log target  (cost=0.00..0.00 rows=1 width=48)
                     Filter: ((location)::text = '10'::text)
               ->  Seq Scan on sensor_log_201607 target_1  (cost=0.00..48910.00 rows=26352 width=18)
                     Filter: ((location)::text = '10'::text)
               ->  Seq Scan on sensor_log_201608 target_2  (cost=0.00..48910.00 rows=27562 width=18)
                     Filter: ((location)::text = '10'::text)
               ->  Seq Scan on sensor_log_201609 target_3  (cost=0.00..48910.00 rows=26266 width=18)
                     Filter: ((location)::text = '10'::text)

In this example, the CTE simulates just some other random table joining into our partitioned sample. Notice those last three elements where the planner decides to perform a sequential scan on all of the partitions? It’s not entirely naive to think date constraints are transitive when working with tables that are commonly unique. Nor is it strictly common knowledge that CTEs act as optimization fences that completely isolate the internal query. But in large archives that trend toward accumulation, those assumptions can be extremely dangerous.
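
The fence itself is easy to demonstrate in isolation. In this sketch, which assumes a Postgres version that always materializes CTEs (true of anything available to these partitions), every partition is scanned inside the CTE and the date filter is only applied afterward to the materialized result:

EXPLAIN
WITH source AS (
    SELECT *
      FROM sensor_log
)
SELECT *
  FROM source
 WHERE reading_date >= '2016-09-11'
   AND reading_date < '2016-09-18';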

Let’s say we know about the optimization fence and rewrite the query, retaining our naive assumption that date constraints are shared by the join:

EXPLAIN
SELECT source.*, target.*
  FROM sensor_log source 
  LEFT JOIN sensor_log target ON (
         source.sensor_id = target.sensor_id AND
         target.location = '10'
       )
 WHERE source.reading_date >= '2016-09-11'
   AND source.reading_date < '2016-09-18'
   AND source.location = '5';

 Merge Left Join  (cost=178003.93..212568.60 rows=2302397 width=36)
   Merge Cond: (source.sensor_id = target.sensor_id)
   ->  Sort  (cost=24742.79..24757.15 rows=5743 width=18)
         Sort Key: source.sensor_id
         ->  Append  (cost=0.00..24384.21 rows=5743 width=18)
               ->  Seq Scan on sensor_log source  (cost=0.00..0.00 rows=1 width=48)
                     Filter: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone) AND ((location)::text = '5'::text))
               ->  Index Scan using idx_sensor_log_201609_reading_date on sensor_log_201609 source_1  (cost=0.43..24384.21 rows=5742 width=18)
                     Index Cond: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone))
                     Filter: ((location)::text = '5'::text)
   ->  Sort  (cost=153261.13..153461.58 rows=80181 width=18)
         Sort Key: target.sensor_id
         ->  Append  (cost=0.00..146730.00 rows=80181 width=18)
               ->  Seq Scan on sensor_log target  (cost=0.00..0.00 rows=1 width=48)
                     Filter: ((location)::text = '10'::text)
               ->  Seq Scan on sensor_log_201607 target_1  (cost=0.00..48910.00 rows=26352 width=18)
                     Filter: ((location)::text = '10'::text)
               ->  Seq Scan on sensor_log_201608 target_2  (cost=0.00..48910.00 rows=27562 width=18)
                     Filter: ((location)::text = '10'::text)
               ->  Seq Scan on sensor_log_201609 target_3  (cost=0.00..48910.00 rows=26266 width=18)
                     Filter: ((location)::text = '10'::text)

As expected, the planner properly applied the date range to the source table but not the target, so we still end up with sequential scans on all of the target partitions. Obvious, right? Maybe not. It's actually not uncommon to compare historical data against current readings using fragments of composite keys. When operating like this, it's easy to forget that the date constraints are equally critical, especially with partitioned tables that can span years and represent several TB of data.

Literally every human being who has touched a warehouse in my presence has made this mistake, including myself. Whether it's a date, an ID, or some other value, conditionals shouldn't be assumed. Nor are they transitive, even if the database engine supports that kind of predicate migration.
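
To be fair, Postgres does migrate some predicates. Constant equalities travel across an equijoin through equivalence classes, but range conditions stay put. A sketch of where that line falls, with a join on reading_date purely for illustration:

EXPLAIN
SELECT source.*
  FROM sensor_log source
  JOIN sensor_log target USING (reading_date)
 WHERE source.reading_date >= '2016-09-11'
   AND source.reading_date < '2016-09-18';

The range above applies to source alone, so every target partition stays in the plan until the same bounds are repeated for target.reading_date. An equality such as source.reading_date = '2016-09-15' would be a different story, since the planner can derive the same constant for the other member of the equivalence class.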

Here’s what the query and its plan should resemble:

EXPLAIN
SELECT source.*, target.*
  FROM sensor_log source 
  LEFT JOIN sensor_log target ON (
         source.sensor_id = target.sensor_id AND
         target.location = '10' AND
         target.reading_date >= '2016-09-11' AND
         target.reading_date < '2016-09-18'
       )
 WHERE source.reading_date >= '2016-09-11'
   AND source.reading_date < '2016-09-18'
   AND source.location = '5';

 Merge Left Join  (cost=49517.62..52217.26 rows=178062 width=36)
   Merge Cond: (source.sensor_id = target.sensor_id)
   ->  Sort  (cost=24742.79..24757.15 rows=5743 width=18)
         Sort Key: source.sensor_id
         ->  Append  (cost=0.00..24384.21 rows=5743 width=18)
               ->  Seq Scan on sensor_log source  (cost=0.00..0.00 rows=1 width=48)
                     Filter: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone) AND ((location)::text = '5'::text))
               ->  Index Scan using idx_sensor_log_201609_reading_date on sensor_log_201609 source_1  (cost=0.43..24384.21 rows=5742 width=18)
                     Index Cond: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone))
                     Filter: ((location)::text = '5'::text)
   ->  Sort  (cost=24774.82..24790.32 rows=6201 width=18)
         Sort Key: target.sensor_id
         ->  Append  (cost=0.00..24384.21 rows=6201 width=18)
               ->  Seq Scan on sensor_log target  (cost=0.00..0.00 rows=1 width=48)
                     Filter: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone) AND ((location)::text = '10'::text))
               ->  Index Scan using idx_sensor_log_201609_reading_date on sensor_log_201609 target_1  (cost=0.43..24384.21 rows=6200 width=18)
                     Index Cond: ((reading_date >= '2016-09-11 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-09-18 00:00:00-05'::timestamp with time zone))
                     Filter: ((location)::text = '10'::text)

And there we have it: the partitioned table results are restricted to equivalent date ranges on both sides of the join. Those date clauses we added to the join must exist for any similarly partitioned table, unless the intent really is to compare against every historical record in every partition. That isn't exactly recommended either; that kind of analysis really belongs in fact tables and data cubes, which are far better suited to aggregate use than raw results.
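
As a rough sketch of that direction, with a table name and granularity invented purely for illustration, even a basic daily rollup collapses millions of raw readings into a summary that can be queried without walking the partitions every time:

CREATE TABLE sensor_log_daily AS
SELECT sensor_id, location, reading_date::DATE AS reading_day,
       count(*) AS readings, avg(reading) AS avg_reading
  FROM sensor_log
 GROUP BY sensor_id, location, reading_date::DATE;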

Just be cognizant of partitioned tables. Quite a few people claim there are better ways to implement them in Postgres provided some underlying code changes are committed. Until that time, it’s critical to always include the partition column in every query, join, sub-clause, partridge, and pear tree. It’s how Postgres decides which partitions to include in results, and often enough, which indexes to consider.

The alternative is having to wait for hours or days while Postgres checks every existing partition of every included table for the search criteria, and nobody wants that.

Well… except attackers who are trying to DOS your database.