PG Phriday: Broken Parts

Partitioning tables in Postgres can be an extremely risky endeavor. Unfortunately, on many larger systems it’s also essentially a requirement; the maximum size of a Postgres table is 32TB. The risk isn’t just that converting an existing table to a series of partitions is expensive or time consuming. We must also consider how the query planner will react to the partitioned version of a table, and there’s the very real chance we will introduce (or already have introduced) flaws in the trigger or constraint logic.

Tools like pg_partman or pg_pathman are great for new systems, but existing deployments require retrofitting. So what happens if we have a large warehouse that’s already using a slightly flawed partition engine? And what other concerns remain even after achieving a pristine setup?

Let’s start with a relatively large sample table with three partitions. Each partition will consist of roughly 26M rows representing sensor readings every 100ms from 1000 sensors.

CREATE TABLE sensor_log (
  sensor_id     INT NOT NULL,
  location      VARCHAR NOT NULL,
  reading       INT NOT NULL,
  reading_date  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE sensor_log_201608 () INHERITS (sensor_log);
CREATE TABLE sensor_log_201609 () INHERITS (sensor_log);
CREATE TABLE sensor_log_201610 () INHERITS (sensor_log);

INSERT INTO sensor_log_201608 (sensor_id, location, reading, reading_date)
SELECT s.id % 1000, (random() * 100)::INT, s.id % 100,
       '2016-09-01'::TIMESTAMPTZ - (s.id * '100ms'::INTERVAL)
  FROM generate_series(1, 36000*24*31) s(id);

INSERT INTO sensor_log_201609 (sensor_id, location, reading, reading_date)
SELECT s.id % 1000, (random() * 100)::INT, s.id % 100,
       '2016-10-01'::TIMESTAMPTZ - (s.id * '100ms'::INTERVAL)
  FROM generate_series(1, 36000*24*30) s(id);

INSERT INTO sensor_log_201610 (sensor_id, location, reading, reading_date)
SELECT s.id % 1000, (random() * 100)::INT, s.id % 100,
       '2016-11-01'::TIMESTAMPTZ - (s.id * '100ms'::INTERVAL)
  FROM generate_series(1, 36000*24*31) s(id);

ALTER TABLE sensor_log_201608
  ADD CONSTRAINT ck_201608_part_reading_date
CHECK (reading_date >= '2016-08-01'::TIMESTAMPTZ AND
       reading_date < '2016-08-01'::TIMESTAMPTZ + INTERVAL '1 mon');

ALTER TABLE sensor_log_201609
  ADD CONSTRAINT ck_201609_part_reading_date
CHECK (reading_date >= '2016-09-01'::TIMESTAMPTZ AND
       reading_date < '2016-09-01'::TIMESTAMPTZ + INTERVAL '1 mon');

ALTER TABLE sensor_log_201610
  ADD CONSTRAINT ck_201610_part_reading_date
CHECK (reading_date >= '2016-10-01'::TIMESTAMPTZ AND
       reading_date < '2016-10-01'::TIMESTAMPTZ + INTERVAL '1 mon');

CREATE INDEX idx_sensor_log_201608_sensor_id
    ON sensor_log_201608 (sensor_id);
CREATE INDEX idx_sensor_log_201608_reading_date
    ON sensor_log_201608 (reading_date DESC);

CREATE INDEX idx_sensor_log_201609_sensor_id
    ON sensor_log_201609 (sensor_id);
CREATE INDEX idx_sensor_log_201609_reading_date
    ON sensor_log_201609 (reading_date DESC);

CREATE INDEX idx_sensor_log_201610_sensor_id
    ON sensor_log_201610 (sensor_id);
CREATE INDEX idx_sensor_log_201610_reading_date
    ON sensor_log_201610 (reading_date DESC);

SET default_statistics_target = 400;

ANALYZE sensor_log_201608;
ANALYZE sensor_log_201609;
ANALYZE sensor_log_201610;

We’re early in the analysis and there’s already a major problem. Adding an interval to a timestamptz is only a stable operation, not an immutable one, because the result can depend on the session time zone. The planner can’t fold that kind of expression into a constant when it performs constraint exclusion, which effectively erases the second boundary from all of our constraints. Simply because we used “+ INTERVAL ‘1 mon’” instead of a static value, every query will always target every partition.
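If you want to confirm that for yourself, the volatility of the function behind the operator can be checked directly in the system catalogs. This is just a diagnostic sketch and isn’t required for anything that follows:

-- The function implementing (timestamptz + interval) is marked 's' for
-- STABLE rather than 'i' for IMMUTABLE, so the planner can't treat our
-- calculated boundary as a constant during constraint exclusion.
SELECT p.proname, p.provolatile
  FROM pg_operator o
  JOIN pg_proc p ON (p.oid = o.oprcode)
 WHERE o.oprname = '+'
   AND o.oprleft = 'timestamptz'::REGTYPE
   AND o.oprright = 'interval'::REGTYPE;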

Here’s how that looks:

EXPLAIN 
SELECT s1.*
  FROM sensor_log s1
 WHERE reading_date >= '2016-10-01'
   AND reading_date < '2016-10-21'
   AND location = '42';

 Append  (cost=0.00..639335.50 rows=164796 width=18)
   ->  Seq Scan on sensor_log s1  (cost=0.00..0.00 rows=1 width=48)
         Filter: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone) AND ((location)::text = '42'::text))
   ->  Index Scan using idx_sensor_log_201608_reading_date on sensor_log_201608 s1_1  (cost=0.44..8.46 rows=1 width=18)
         Index Cond: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone))
         Filter: ((location)::text = '42'::text)
   ->  Index Scan using idx_sensor_log_201609_reading_date on sensor_log_201609 s1_2  (cost=0.44..8.46 rows=1 width=18)
         Index Cond: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone))
         Filter: ((location)::text = '42'::text)
   ->  Seq Scan on sensor_log_201610 s1_3  (cost=0.00..639318.58 rows=164793 width=18)
         Filter: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone) AND ((location)::text = '42'::text))

This particular snafu invokes several related and extremely subtle problems. Consider a simple report where we want to compare sensor variance based on location. Operating under the assumption that certain sensor locations don’t always trigger during certain sample periods, that’s just a single left join. If this were an inventory system, we might be comparing customers who order rubber chickens versus those who lean toward high performance graphics cards.

With our data, we can do this with a single self join, but why fetch everything? We have millions of entries going back for three months! Let’s apply a LIMIT clause so we can decide how to further analyze the results:

EXPLAIN
SELECT s1.*
  FROM sensor_log s1
  LEFT JOIN sensor_log s2 ON (
         s2.sensor_id = s1.sensor_id AND
         s2.reading_date >= '2016-10-01' AND
         s2.reading_date < '2016-10-21' AND
         s2.location = '87'
       )
 WHERE s1.reading_date >= '2016-10-01'
   AND s1.reading_date < '2016-10-21'
   AND s1.location = '42'
 LIMIT 10;

 Limit  (cost=2.73..26.59 rows=10 width=18)
   ->  Merge Left Join  (cost=2.73..336756014.96 rows=141129646 width=18)
         Merge Cond: (s1.sensor_id = s2.sensor_id)
         ->  Merge Append  (cost=1.36..167319256.22 rows=164796 width=18)
               Sort Key: s1.sensor_id
               ->  Sort  (cost=0.01..0.02 rows=1 width=48)
                     Sort Key: s1.sensor_id
                     ->  Seq Scan on sensor_log s1  (cost=0.00..0.00 rows=1 width=48)
                           Filter: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone) AND ((location)::text = '42'::text))
               ->  Index Scan using idx_sensor_log_201608_sensor_id on sensor_log_201608 s1_1  (cost=0.44..56414786.91 rows=1 width=18)
                     Filter: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone) AND ((location)::text = '42'::text))
               ->  Index Scan using idx_sensor_log_201609_sensor_id on sensor_log_201609 s1_2  (cost=0.44..54487523.33 rows=1 width=18)
                     Filter: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone) AND ((location)::text = '42'::text))
               ->  Index Scan using idx_sensor_log_201610_sensor_id on sensor_log_201610 s1_3  (cost=0.44..56413238.01 rows=164793 width=18)
                     Filter: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone) AND ((location)::text = '42'::text))
         ->  Materialize  (cost=1.36..167319830.26 rows=171278 width=4)
               ->  Merge Append  (cost=1.36..167319402.06 rows=171278 width=4)
                     Sort Key: s2.sensor_id
                     ->  Sort  (cost=0.01..0.02 rows=1 width=4)
                           Sort Key: s2.sensor_id
                           ->  Seq Scan on sensor_log s2  (cost=0.00..0.00 rows=1 width=4)
                                 Filter: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone) AND ((location)::text = '87'::text))
                     ->  Index Scan using idx_sensor_log_201608_sensor_id on sensor_log_201608 s2_1  (cost=0.44..56414786.91 rows=1 width=4)
                           Filter: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone) AND ((location)::text = '87'::text))
                     ->  Index Scan using idx_sensor_log_201609_sensor_id on sensor_log_201609 s2_2  (cost=0.44..54487523.33 rows=1 width=4)
                           Filter: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone) AND ((location)::text = '87'::text))
                     ->  Index Scan using idx_sensor_log_201610_sensor_id on sensor_log_201610 s2_3  (cost=0.44..56413238.01 rows=171275 width=4)
                           Filter: ((reading_date >= '2016-10-01 00:00:00-05'::timestamp with time zone) AND (reading_date < '2016-10-21 00:00:00-05'::timestamp with time zone) AND ((location)::text = '87'::text))

There’s a lot going on here, so we’ll just focus on what went wrong. In this case, the culprit is the LIMIT clause itself. By trying to reduce output, we told Postgres to optimize for returning the first few rows cheaply rather than generating the entire result. This means it will read rows in order (sensor_id order, given the merge join it chose) until it accumulates 10 that match all of the predicates we supplied. Simple and efficient, right?

Well… no. If we look closer at the indexes it’s using, we’ll see it opted for the index on sensor_id instead of the index on reading_date. This isn’t necessarily a problem on the October partition, since it should match ten rows relatively quickly. Assuming an even distribution across 100 sensor locations, it may have to read around 1,000 rows to find the 10 we wanted. That’s not awful, is it?

But consider the other two partitions. Remember those? There will be no match for August or September. But Postgres doesn’t know that yet, meaning it will read the entire sensor_id index for both of them. That index is over 500MB on each partition, so we just read an extra 1GB for no reason at all. But wait! There’s more! We still need to apply the filters to remove them from consideration, right? That means a random read to the table heap for every tuple in the index, plus or minus some batching for the sake of efficiency. What’s a few hundred thousand extra random reads between friends?
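For the curious, it’s easy to see how much dead weight those index scans represent. Here’s a quick size check against the example indexes:

-- How big are the per-partition sensor_id indexes we might read in full?
SELECT relname AS index_name,
       pg_size_pretty(pg_relation_size(oid)) AS index_size
  FROM pg_class
 WHERE relname LIKE 'idx_sensor_log_2016%_sensor_id';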

Now imagine each table contains ten times more data, and there are three years’ worth of partitions instead of three months. Our particular example requires about six minutes to execute, but real-world cases of this can quickly cascade into hours or even days.

How can we fix this? It’s clear we need to rebuild the constraints so they properly convey the fact that certain partitions should be ignored. But we already have billions of rows, and new check constraints are normally validated against all existing data when they’re added. Or are they? There is a way to automate this:

DO $$
DECLARE
  r RECORD;
BEGIN
  -- Find every child table whose check constraint still relies on the
  -- "+ 1 mon" interval math, then derive static month boundaries from the
  -- YYYYMM suffix of the partition name. (pg_constraint.consrc exists
  -- through Postgres 11; newer versions would need pg_get_constraintdef.)
  FOR r IN SELECT tab,
           to_date(ym, 'YYYYMM01') AS sdate,
           to_date(ym, 'YYYYMM01') + '1 mon'::INTERVAL AS edate,
           cname, col
      FROM (SELECT i.inhrelid::REGCLASS::TEXT AS tab,
                   substring(i.inhrelid::REGCLASS::TEXT FROM '......$') AS ym,
                   co.conname AS cname,
                   substring(co.consrc FROM '\w+') AS col
              FROM pg_inherits i
              JOIN pg_constraint co ON (co.conrelid = i.inhrelid)
             WHERE co.contype = 'c'
               AND co.consrc ILIKE '%1 mon%') cfix
  LOOP
    -- Swap the flawed constraint for one with static boundaries, flagged
    -- NOT VALID so Postgres doesn't re-check every existing row.
    EXECUTE 'ALTER TABLE ' || r.tab || ' DROP CONSTRAINT ' ||
        quote_ident(r.cname);
    EXECUTE 'ALTER TABLE ' || r.tab || ' ADD CONSTRAINT ' ||
        quote_ident(r.cname) || ' CHECK (' ||
        quote_ident(r.col) || ' >= ' || quote_literal(r.sdate) || ' AND ' ||
        quote_ident(r.col) || ' < ' || quote_literal(r.edate) || ') NOT VALID';
  END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Mark only our newly rebuilt constraints as validated, directly in the
-- catalog, skipping the verification scan entirely. The name pattern
-- matches the convention used by this example's partitions.
UPDATE pg_constraint
   SET convalidated = TRUE
 WHERE NOT convalidated
   AND conname LIKE 'ck_%_part_reading_date';

The giant gross query that defines the loop simply identifies all of the broken partition constraints and provides replacement boundaries. Within the loop, we drop all of the bad constraints and replace them with better ones, being careful to tell Postgres they’re not valid so it doesn’t check them. At the end, we modify the Postgres catalog and directly mark the constraints as valid, circumventing the whole verification process entirely.
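Before trusting the patched catalog, it’s easy to eyeball the rebuilt definitions. This spot-check assumes the ck_*_part_reading_date naming convention from our example:

-- Show each partition's rebuilt constraint with its static boundaries.
SELECT conrelid::REGCLASS AS partition_name,
       conname,
       pg_get_constraintdef(oid) AS definition
  FROM pg_constraint
 WHERE conname LIKE 'ck_%_part_reading_date';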

And it doesn’t matter if we have three bad constraints as in this example, or three thousand. We’ve just fixed all of them. After this, our six-minute query executes in a mere 100ms, making it roughly 3,600 times faster. Remember, our example is conservative; a real warehouse would see much more striking results.

But we need to be careful. Problems like this also apply to using CURRENT_DATE, now() + INTERVAL '1 mon', or any other computed value in our WHERE clauses or join conditions. Without a static value, Postgres makes assumptions about the potential row counts and estimates accordingly. Sometimes those assumptions don’t change the plan much. In other cases, they can mean the difference between a sequential scan and an index scan (and the associated random reads).

Index scans are usually great! Unless, of course, the table has a few hundred million rows and the predicate matches 60% of them. With the variable substitution in place, Postgres may well prefer the index scan, and our query will be much slower as a consequence. Wherever possible, use static values in predicates that touch indexes or act as the primary filter condition for a query. All of the extra conditions will still apply, but they won’t adversely affect query performance.
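As a minimal illustration (the literal date below is just an example value), compare these two forms. The first forces Postgres to guess, because the boundary is computed rather than constant; the second hands constraint exclusion a concrete value it can compare against the partition constraints:

-- Computed boundary: constraint exclusion can't use it to skip partitions.
SELECT count(*)
  FROM sensor_log
 WHERE reading_date >= now() - INTERVAL '21 days';

-- Static boundary: partitions whose constraints fall outside the range
-- can be excluded at plan time.
SELECT count(*)
  FROM sensor_log
 WHERE reading_date >= '2016-10-01'::TIMESTAMPTZ;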

As a side note, this is one of the other reasons indexes aren’t merely a “go faster” button. Putting them in the wrong place or invoking them improperly can make performance worse. In most cases this translates to a standard deviation or two. In others, it results in geometric or exponential increases in query times. It’s those outliers that can really ruin our day.

Where partitions are concerned, it pays to be extra vigilant.