PG Phriday: Basic Partitioning

Most Postgres (PostgreSQL) users who are familiar with partitioning use the method described in the partitioning documentation. This architecture comes in a fairly standard stack:

  • One empty base table for structure.
  • At least one child table that inherits the base design.
  • A trigger to redirect inserts based on the partitioning scheme.
  • A constraint on each child table to enforce the partition scheme, and help the planner exclude child partitions from inapplicable queries.

It looks simple, but there’s a lot of SQL involved. Let’s take the sensor_log table we introduced a couple weeks ago and turn it into a partitioned structure.

TRUNCATE TABLE sensor_log;

CREATE TABLE sensor_log_part_2012 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2012-01-01' AND
           reading_date < '2013-01-01')    
) INHERITS (sensor_log);

CREATE TABLE sensor_log_part_2013 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2013-01-01' AND
           reading_date < '2014-01-01')    
) INHERITS (sensor_log);

CREATE TABLE sensor_log_part_2014 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2014-01-01' AND
           reading_date < '2015-01-01')    
) INHERITS (sensor_log);

CREATE TABLE sensor_log_part_2015 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2015-01-01' AND
           reading_date < '2016-01-01')    
) INHERITS (sensor_log);

CREATE OR REPLACE FUNCTION sensor_partition()
RETURNS TRIGGER AS $$
BEGIN
  CASE extract(year FROM NEW.reading_date)
    WHEN 2015 THEN
      INSERT INTO sensor_log_part_2015 VALUES (NEW.*);
    WHEN 2014 THEN
      INSERT INTO sensor_log_part_2014 VALUES (NEW.*);
    WHEN 2013 THEN
      INSERT INTO sensor_log_part_2013 VALUES (NEW.*);
    WHEN 2012 THEN
      INSERT INTO sensor_log_part_2012 VALUES (NEW.*);
  END CASE;

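  -- Returning NULL keeps the row out of the (empty) parent table.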
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t_correct_partition
BEFORE INSERT ON sensor_log
   FOR EACH ROW EXECUTE PROCEDURE sensor_partition();

This probably looks a bit different from the examples in the documentation, for a few reasons. First, when creating a table, the LIKE clause can copy various attributes from another table. Using it here redundantly copies the columns from the parent, since INHERITS already handles that. What isn’t redundant is the INCLUDING INDEXES modifier for LIKE, which saves us the work of manually recreating each of the parent’s indexes on every child table.

Since our CREATE TABLE statements are already more complicated than the ones shown in the documentation, we might as well include the partition constraints when creating each child table instead of adding them afterwards with ALTER TABLE. But these are just shortcuts; the overall structure is the same, as are the performance considerations and benefits.

Using our Python script to insert 100 million rows takes about three times longer than it did in the non-partitioned case. The difference is essentially the trigger overhead of redirecting each row. It’s painful, but not debilitating on systems that aren’t dependent on transaction performance.
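
For reference, here is a rough SQL-only stand-in for that loader. The id, location, and reading columns are assumptions (only reading_date appears in the code above), and the timing math is purely illustrative; the real script’s distribution will differ. Every generated row passes through the trigger and lands in whichever child table matches its year.

-- Illustrative bulk load: roughly three years of rows, one per second,
-- routed to the correct child table by the t_correct_partition trigger.
-- Column names other than reading_date are assumptions.
INSERT INTO sensor_log (id, location, reading, reading_date)
SELECT s.id,
       (s.id % 1000)::text,
       s.id % 100,
       '2015-08-31'::timestamp - (s.id || 's')::interval
  FROM generate_series(1, 100000000) s(id);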

We could also reduce this overhead by simplifying the logic of the sensor_partition function. An easy way to accomplish that would be to have it target only the most recent partition. Then we would just need a maintenance job to replace the function whenever the most recent partition changes. Since most ongoing loader jobs target recent incoming data, that would probably make more sense anyway.
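
A sketch of that simplified function might look like this; a scheduled job would simply re-issue the CREATE OR REPLACE FUNCTION with next year’s table name when the calendar rolls over.

-- Simplified redirection: assume every incoming row belongs to the
-- current partition. Rows outside 2015 would violate that table's
-- CHECK constraint and fail loudly.
CREATE OR REPLACE FUNCTION sensor_partition()
RETURNS TRIGGER AS $$
BEGIN
  INSERT INTO sensor_log_part_2015 VALUES (NEW.*);
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;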

Barring that, what do we gain? One huge benefit is constraint exclusion. The Postgres query planner will consider the CHECK constraints we placed on the partitions when planning statements. Here’s a query that triggers this effect:

EXPLAIN ANALYZE
SELECT count(*)
  FROM sensor_log
 WHERE reading_date >= '2015-08-14';

                             QUERY PLAN                             
--------------------------------------------------------------------
 Aggregate  (cost=16003.26..16003.27 rows=1 width=0)
            (actual time=1181.681..1181.682 rows=1 loops=1)
   ->  Append  (cost=0.00..15003.79 rows=399791 width=0)
               (actual time=0.578..929.660 rows=411314 loops=1)
         ->  Seq Scan on sensor_log
                  (cost=0.00..0.00 rows=1 width=0)
                  (actual time=0.001..0.001 rows=0 loops=1)
               Filter: (reading_date >= '2015-08-14 00:00:00')
         ->  Index Only Scan
             using sensor_log_part_2015_reading_date_idx
                on sensor_log_part_2015
                    (cost=0.44..15003.79 rows=399790 width=0)
                    (actual time=0.576..474.890 rows=411314 loops=1)
               Index Cond: (reading_date >= '2015-08-14 00:00:00')
               Heap Fetches: 411314

 Planning time: 0.282 ms
 Execution time: 1181.719 ms

With a Postgres partition set, the base table is always included. Since it’s empty, the sequential scan against it costs essentially nothing. The WHERE clause only matches the check constraint for the 2015 partition, so all results come from that table alone. The data is extremely dense, however, so we still fetch over 400k records, resulting in an execution time of around one second.

The primary caveat with constraint exclusion is that it only works with values the planner can see at planning time. Were we to replace the static date with CURRENT_DATE instead, Postgres would check every child partition. As it turns out, that’s only one empty index probe for each of the inapplicable partitions, so the cost isn’t much higher, though the query plan would be much more complex.
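
You can see this for yourself by swapping the literal for CURRENT_DATE and examining the plan; every child partition will show up under the Append node:

EXPLAIN
SELECT count(*)
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE;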

Is there anything else that goes in our favor? We’ve mentioned maintenance costs several times in the past, and this is one reason why. Our data is now distributed across four tables. Assuming we retain this partitioning scheme, all future data will continue to be distributed by year. Remember our query that finds all tables with at least 10 million rows? Let’s try it again:

SELECT oid::regclass::text AS TABLE_NAME,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size,
       CEIL(reltuples / 1000000) AS million_rows
  FROM pg_class
 WHERE relkind = 'r'
   AND reltuples > 10000000
 ORDER BY relpages DESC;

      table_name      | total_size | million_rows 
----------------------+------------+--------------
 sensor_log_part_2014 | 4163 MB    |           37
 sensor_log_part_2013 | 4508 MB    |           37
 sensor_log_part_2015 | 2832 MB    |           23

Aside from 2012, which is much smaller simply due to the date of this article, rows are distributed proportionally. From this, we can estimate that all maintenance operations on each child table will be roughly three times faster than before. While it’s true that we’ve also multiplied the amount of maintenance by a similar factor, such operations are generally automated.

Autovacuum and autoanalyze will now be much faster and will only focus on modified data. This means our 2015 table will likely experience several more of these maintenance cycles as it accumulates rows until 2016, after which it might remain totally inert. Before we partitioned, each of these operations had to process all 100 million rows every time, since they were all contained in a single table. Now they can focus on just the portion that’s changing frequently.
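
The same focus applies if we ever run maintenance by hand; only the partition that’s still accumulating data needs attention:

-- Manual maintenance can target just the active partition.
VACUUM ANALYZE sensor_log_part_2015;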

We also win the option of easy archival. If we no longer need 2012 data, we can export it with a COPY command, and then drop the table. With a single table, we could only remove it with a DELETE statement. Not only is this orders of magnitude slower since it has to locate and then remove each row for 2012, but due to Postgres’s versioned storage, we now have several million dead rows to account for. If a VACUUM doesn’t immediately follow such a procedure, we’ve virtually guaranteed performance-robbing bloat.
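
A minimal sketch of that archival path, assuming a writable server-side directory (the path here is hypothetical; \copy from psql works just as well for a client-side file):

-- Export the retired partition, then drop it. Dropping the child
-- removes its rows instantly with no dead tuples left behind.
COPY sensor_log_part_2012 TO '/archive/sensor_log_2012.csv' WITH (FORMAT csv);
DROP TABLE sensor_log_part_2012;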

Large tables also take a very long time to VACUUM, which can cause a chain of delays along the data import stack. This kind of maintenance doesn’t block table writes, but if we don’t delay such jobs following significant changes, we again risk unnecessary table bloat, or the disk activity needed for the maintenance will impose that delay for us. Really, using a single table in these cases brings a lot of downsides.

Used properly, this kind of partitioning is close to an equivalent exchange regarding query performance. Of course, there are known caveats associated with this kind of partitioning structure.

  1. Trigger overhead to redistribute data entry is very real; even fast triggers add measurable cost to every insert.
  2. Primary keys cannot enforce uniqueness across a partitioned table. Since each unique index exists independently on its own partition, there’s no way to prevent duplicates across the set.
  3. A system must be in place to change trigger targets and build new partitions as they are needed. This is a lot more complicated than just using a single static table.

With that said, the first caveat is easy to circumvent. Any good partitioning scheme has an associated naming scheme. Loading jobs can target new data at the appropriate partition directly, thus reverting to single-table performance in that regard. Software that is Postgres partition-aware is amazingly beneficial, but also unfortunately rare due to all the custom tooling surrounding it.
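
In practice, that just means the loader addresses the current child by name and skips the trigger entirely (column names other than reading_date follow the same assumed layout as the load sketch above):

-- A partition-aware loader bypasses the trigger on the base table
-- by writing straight to the current child.
INSERT INTO sensor_log_part_2015 (id, location, reading, reading_date)
VALUES (100000001, '42', 27, '2015-08-31 08:00:00');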

The second concern is a bit trickier, since such checks are inherent to how Postgres operates. We can, however, supplement the PRIMARY KEY syntax with a stored procedure that performs verification across all partitions, then attach a trigger to each child partition that invokes it on INSERT or UPDATE. Due to the associated overhead, such an extreme measure should be reserved for incredibly critical cases. We should also note that large tables are generally data sinks; their primary key columns are usually fed by sequences or other auto-numbering that makes collisions incredibly rare.
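
As a rough illustration, such a check might look like the following. Every name here is hypothetical, it assumes id is the primary key column, and it is not race-proof against concurrent inserts without additional locking:

-- Hypothetical cross-partition uniqueness check. Querying the parent
-- table searches every child as well, so one EXISTS covers the set.
CREATE OR REPLACE FUNCTION sensor_check_unique_id()
RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'UPDATE' AND NEW.id = OLD.id THEN
    RETURN NEW;  -- unchanged key; nothing to verify
  END IF;

  IF EXISTS (SELECT 1 FROM sensor_log WHERE id = NEW.id) THEN
    RAISE EXCEPTION 'duplicate sensor_log id: %', NEW.id;
  END IF;

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Repeat for each child partition.
CREATE TRIGGER t_check_unique_id
BEFORE INSERT OR UPDATE ON sensor_log_part_2015
   FOR EACH ROW EXECUTE PROCEDURE sensor_check_unique_id();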

The third issue is simply a software engineering problem. In that regard, the pg_partman extension provides quite a bit of automation for provisioning and maintaining partitioned tables. If that doesn’t work, the entire partition stack is in our first block of code. A bit of variable substitution and scripting could transform it into a generalized process. In fact, I’ve seen such things in Groovy and Python data import applications.
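
For what it’s worth, the general shape of a pg_partman setup looks something like the call below; argument names and accepted values vary between pg_partman releases, so treat this as a pointer to its documentation rather than a recipe:

-- pg_partman is commonly installed into its own schema.
CREATE SCHEMA partman;
CREATE EXTENSION pg_partman SCHEMA partman;

-- Register sensor_log for yearly time-based partition maintenance.
-- The valid p_type values depend on the pg_partman version in use.
SELECT partman.create_parent(
    p_parent_table := 'public.sensor_log',
    p_control      := 'reading_date',
    p_type         := 'time-static',
    p_interval     := 'yearly'
);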

It all comes down to what kind of tradeoffs you want. Is it worth taking a potential performance hit in some SELECT and INSERT operations to prevent long-term data juggling? Is maintenance frequent enough that large objects act as a noticeable bottleneck? Is there enough data accumulation and turnover that distributing it is desirable? There are a lot of questions to consider before jumping straight into this partition style.

This methodology is a first step to a basic distributed architecture. Next week, we’ll consider something a bit more exotic.