Most PGDB (PostgreSQL) users who are familiar with partitioning use the method described in the partitioning documentation. This architecture comes in a fairly standard stack:
- One empty base table for structure.
- At least one child table that inherits the base design.
- A trigger to redirect inserts based on the partitioning scheme.
- A constraint on each child table to enforce the partition scheme, and help the planner exclude child partitions from inapplicable queries.
It looks simple, but there's a lot of SQL involved. Let's take the sensor_log table we introduced a couple weeks ago and turn it into a partitioned structure.
TRUNCATE TABLE sensor_log;

CREATE TABLE sensor_log_part_2012 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2012-01-01' AND
           reading_date < '2013-01-01')
) INHERITS (sensor_log);

CREATE TABLE sensor_log_part_2013 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2013-01-01' AND
           reading_date < '2014-01-01')
) INHERITS (sensor_log);

CREATE TABLE sensor_log_part_2014 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2014-01-01' AND
           reading_date < '2015-01-01')
) INHERITS (sensor_log);

CREATE TABLE sensor_log_part_2015 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2015-01-01' AND
           reading_date < '2016-01-01')
) INHERITS (sensor_log);

CREATE OR REPLACE FUNCTION sensor_partition()
RETURNS TRIGGER AS $$
BEGIN
  CASE EXTRACT(YEAR FROM NEW.reading_date)
    WHEN 2015 THEN
      INSERT INTO sensor_log_part_2015 VALUES (NEW.*);
    WHEN 2014 THEN
      INSERT INTO sensor_log_part_2014 VALUES (NEW.*);
    WHEN 2013 THEN
      INSERT INTO sensor_log_part_2013 VALUES (NEW.*);
    WHEN 2012 THEN
      INSERT INTO sensor_log_part_2012 VALUES (NEW.*);
  END CASE;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t_correct_partition
BEFORE INSERT ON sensor_log
   FOR EACH ROW EXECUTE PROCEDURE sensor_partition();
This probably looks a bit different from the examples in the documentation for a few reasons. First, when creating a table, the LIKE clause is available to copy various attributes from another table. By using it here, we're redundantly copying the columns from the parent table, since that's already handled by INHERITS. What isn't redundant is the INCLUDING INDEXES modifier for LIKE, which saves us the work of manually recreating each parent index on every child table.
Since we already have a more complicated CREATE TABLE than the documentation shows, we might as well include the partition constraints when creating each child table, instead of adding them afterwards with ALTER TABLE. But these are just shortcuts; the overall structure is the same, as are the performance considerations and benefits.
Using our Python script to insert 100-million rows takes about three times longer than it did in the non-partitioned case. That difference is mostly the trigger overhead of redirecting each row. It's painful, but not debilitating on systems that aren't dependent on transaction performance.
We could also reduce this overhead by simplifying the logic of the sensor_partition function. An easy way to accomplish that would be to have it target only the most recent partition, with a maintenance job that replaces the function whenever the most recent partition changes. Since most ongoing loader jobs target recent incoming data, that would probably make more sense anyway.
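Such a maintenance job could be as simple as rendering the trigger function from a template once a year. Here's a minimal sketch; the job itself and the helper name current_partition_function are assumptions, while the table and function names follow the scheme above:

```python
from datetime import date

# Simplified partition trigger that only handles the current year's
# partition, avoiding the CASE over every historical year.
TRIGGER_TEMPLATE = """\
CREATE OR REPLACE FUNCTION sensor_partition()
RETURNS TRIGGER AS $$
BEGIN
  INSERT INTO sensor_log_part_{year} VALUES (NEW.*);
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;
"""

def current_partition_function(today=None):
    """Render the trigger function body for the most recent partition."""
    today = today or date.today()
    return TRIGGER_TEMPLATE.format(year=today.year)

# A scheduled job would execute the result through its database
# connection each January, e.g. cursor.execute(current_partition_function()).
```

Because CREATE OR REPLACE FUNCTION swaps the function atomically, the loader never sees a gap while the target changes.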
Barring that, what do we gain? One huge benefit is constraint exclusion. The PGDB query planner will consider the CHECK constraints we placed on the partitions when processing statements. Here's a query that triggers this effect:
EXPLAIN ANALYZE
SELECT COUNT(*)
  FROM sensor_log
 WHERE reading_date >= '2015-08-14';

                              QUERY PLAN
----------------------------------------------------------------------
 Aggregate  (cost=16003.26..16003.27 rows=1 width=0)
            (actual time=1181.681..1181.682 rows=1 loops=1)
   ->  Append  (cost=0.00..15003.79 rows=399791 width=0)
               (actual time=0.578..929.660 rows=411314 loops=1)
         ->  Seq Scan on sensor_log
               (cost=0.00..0.00 rows=1 width=0)
               (actual time=0.001..0.001 rows=0 loops=1)
               Filter: (reading_date >= '2015-08-14 00:00:00')
         ->  Index Only Scan using sensor_log_part_2015_reading_date_idx
               on sensor_log_part_2015
               (cost=0.44..15003.79 rows=399790 width=0)
               (actual time=0.576..474.890 rows=411314 loops=1)
               Index Cond: (reading_date >= '2015-08-14 00:00:00')
               Heap Fetches: 411314
 Planning time: 0.282 ms
 Execution time: 1181.719 ms
With a PGDB partition, the base table is always included. Since it's empty, the sequential scan costs nothing. The WHERE clause only matches the check constraint for the 2015 partition, so all results come from that table alone. The data is extremely dense, however, so we still fetch over 400k records, resulting in an execution time of around one second.
The primary caveat with constraint exclusion is that it doesn't work with variable substitution. Were we to replace the static date with CURRENT_DATE instead, PGDB would check all of the child partitions. As it turns out, that's only one empty index poll for each of the inapplicable partitions, so the cost isn't much higher, though the query plan would be much more complex.
Is there anything else that goes in our favor? We’ve mentioned maintenance costs several times in the past, and this is one reason why. Our data is now distributed across four tables. Assuming we retain this partitioning scheme, all future data will continue to be distributed by year. Remember our query that checks for all tables with at least 10-million rows? Let’s try it again:
SELECT oid::regclass::text AS table_name,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size,
       ceil(reltuples / 1000000) AS million_rows
  FROM pg_class
 WHERE relkind = 'r'
   AND reltuples > 10000000
 ORDER BY relpages DESC;

      table_name      | total_size | million_rows
----------------------+------------+--------------
 sensor_log_part_2014 | 4163 MB    |           37
 sensor_log_part_2013 | 4508 MB    |           37
 sensor_log_part_2015 | 2832 MB    |           23
Aside from 2012, which is much smaller simply due to the date of this article, rows are distributed proportionally. From this, we can estimate that all maintenance operations on each child table will be roughly three times faster than before. While it’s true that we’ve also multiplied the amount of maintenance by a similar factor, such operations are generally automated.
Autovacuum and autoanalyze will now be much faster, and only focus on modified data. This means our 2015 table will likely experience several more of these maintenance cycles as it accumulates rows until 2016, after which it might remain totally inert. Before we partitioned, each of these operations would have processed all 100-million rows every time since they were all contained in a single table. Now, it can focus on just the portion that’s changing frequently.
We also win the option of easy archival. If we no longer need 2012 data, we can export it with a COPY command and then drop the table. With a single monolithic table, we could only remove it with a DELETE statement. Not only is this orders of magnitude slower, since it has to locate and then remove each row for 2012, but due to PGDB's versioned storage, we'd be left with several million dead rows to account for. If a VACUUM doesn't immediately follow such a procedure, we've virtually guaranteed performance-robbing bloat.
Large tables also take a very long time to VACUUM, which would cause a chain of delays along the data import stack. This kind of maintenance doesn't block table writes, but if we don't delay such jobs following significant changes, we again risk unnecessary table bloat; or the disk activity needed for the maintenance will delay the writes for us. Really, using a single table in these cases carries a lot of downsides.
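The COPY-then-drop archival path lends itself to automation. A minimal sketch, assuming our yearly naming scheme and a hypothetical /archive directory that the PostgreSQL server process can write to:

```python
def archive_partition_sql(year, directory="/archive"):
    """Build the statements that retire one yearly partition.

    The directory path is an assumption; COPY ... TO writes on the
    database server, so the file lands on that host, not the client.
    """
    table = f"sensor_log_part_{year}"
    return [
        f"COPY {table} TO '{directory}/{table}.csv' WITH (FORMAT csv);",
        f"DROP TABLE {table};",
    ]

# e.g. for stmt in archive_partition_sql(2012): cursor.execute(stmt)
```

Dropping the table is effectively instantaneous and leaves no dead rows behind, which is exactly the property a DELETE cannot offer.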
Used properly, this kind of partitioning is close to an equivalent exchange regarding query performance. Of course, there are known caveats associated with this kind of partitioning structure.
- Trigger overhead to redistribute data entry is very real. Even fast triggers add significant overhead.
- Primary keys cannot distinguish rows across a partitioned table. Since each unique index exists independently on its partition, there's no way to prevent duplicates across the set.
- A system must be in place to change trigger targets and build new partitions as they are needed. This is a lot more complicated than just using a single static table.
With that said, the first caveat is easy to circumvent. Any good partitioning scheme has an associated naming scheme. Loading jobs can target new data at the appropriate partition directly, thus reverting to single-table performance in that regard. Software that is PGDB partition-aware is amazingly beneficial, but also unfortunately rare due to all the custom tooling surrounding it.
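A partition-aware loader only needs to derive the child table name from each record's date. This sketch assumes the column layout from the earlier sensor_log articles (location, reading, reading_date); only reading_date is confirmed by the constraints above:

```python
from datetime import datetime

def target_partition(reading_date):
    """Map a reading to its child table, mirroring the yearly naming scheme."""
    return f"sensor_log_part_{reading_date.year}"

def insert_sql(reading_date):
    # Inserting into the child table directly bypasses the redirection
    # trigger on the base table, restoring single-table insert speed.
    return (
        f"INSERT INTO {target_partition(reading_date)} "
        "(location, reading, reading_date) VALUES (%s, %s, %s)"
    )
```

Note that rows inserted this way are no longer validated by the trigger, so the loader becomes responsible for honoring the partition boundaries (the CHECK constraints will still reject outright mismatches).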
The second concern is a bit trickier, since such checks are inherent to PGDB operation. We can, however, supplement the PRIMARY KEY syntax with a stored procedure that performs verification across all partitions, and then attach a trigger to the child partitions that invokes it on UPDATE. Due to the associated overhead, such an extreme measure should be reserved for incredibly critical cases. Note, though, that large tables are generally data sinks; their primary key columns are usually filled by sequences or other auto-numbering that makes collisions incredibly rare.
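The cross-partition verification function could itself be generated from the partition list. A sketch under stated assumptions: the function name sensor_log_check_unique and the key column id are both hypothetical, and the generated trigger function would still need to be attached to each child table:

```python
def uniqueness_check_sql(partitions, key="id"):
    """Render a plpgsql function that rejects a key already present
    in any sibling partition."""
    probes = "\n      UNION ALL ".join(
        f"SELECT 1 FROM {p} WHERE {key} = NEW.{key}" for p in partitions
    )
    return f"""\
CREATE OR REPLACE FUNCTION sensor_log_check_unique()
RETURNS TRIGGER AS $$
BEGIN
  PERFORM 1 FROM (
      {probes}
  ) dup;
  IF FOUND THEN
    RAISE EXCEPTION 'duplicate key % in sensor_log partitions', NEW.{key};
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
"""
```

Every probe is an index lookup, so each insert or update pays one extra index poll per partition; that is the overhead that restricts this technique to critical cases.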
The third issue is simply a software engineering problem. In that regard, the pg_partman extension provides quite a bit of automation for provisioning and maintaining partitioned tables. If that doesn’t work, the entire partition stack is in our first block of code. A bit of variable substitution and scripting could transform it into a generalized process. In fact, I’ve seen such things in Groovy and Python data import applications.
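That variable substitution really is just a template and a loop. A minimal sketch, assuming the reading_date partition column from our schema (pg_partman would handle all of this, plus ongoing maintenance, for us):

```python
# Same LIKE / CHECK / INHERITS stack as the hand-written DDL above,
# parameterized by parent table and year.
PARTITION_TEMPLATE = """\
CREATE TABLE {parent}_part_{year} (
    LIKE {parent} INCLUDING INDEXES,
    CHECK (reading_date >= '{year}-01-01' AND
           reading_date < '{next_year}-01-01')
) INHERITS ({parent});
"""

def partition_ddl(parent, start_year, end_year):
    """Generate yearly child-table DDL for an inclusive range of years."""
    return [
        PARTITION_TEMPLATE.format(parent=parent, year=y, next_year=y + 1)
        for y in range(start_year, end_year + 1)
    ]
```

A scheduled job running this once a year, together with the trigger-function swap described earlier, covers most of the provisioning problem.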
It all comes down to what kind of tradeoffs you want. Is it worth taking a potential performance hit on some INSERT operations to prevent long-term data juggling? Is maintenance frequent enough that large objects act as a noticeable bottleneck? Is there enough data accumulation and turnover that distributing it is desirable? There are a lot of questions to consider before jumping straight into this partition style.
This methodology is a first step to a basic distributed architecture. Next week, we’ll consider something a bit more exotic.