Most PGDB (PostgreSQL) users who are familiar with partitioning use the method described in the partitioning documentation. This architecture comes in a fairly standard stack:

  • One empty base table for structure.
  • At least one child table that inherits the base design.
  • A trigger to redirect inserts based on the partitioning scheme.
  • A constraint on each child table to enforce the partition scheme, and help the planner exclude child partitions from inapplicable queries.

It looks simple, but there’s a lot of SQL involved. Let’s take the sensor_log table we introduced a couple weeks ago and turn it into a partitioned structure.

TRUNCATE TABLE sensor_log;
 
CREATE TABLE sensor_log_part_2012 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2012-01-01' AND
           reading_date < '2013-01-01')    
) INHERITS (sensor_log);
 
CREATE TABLE sensor_log_part_2013 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2013-01-01' AND
           reading_date < '2014-01-01')    
) INHERITS (sensor_log);
 
CREATE TABLE sensor_log_part_2014 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2014-01-01' AND
           reading_date < '2015-01-01')    
) INHERITS (sensor_log);
 
CREATE TABLE sensor_log_part_2015 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2015-01-01' AND
           reading_date < '2016-01-01')    
) INHERITS (sensor_log);
 
CREATE OR REPLACE FUNCTION sensor_partition()
RETURNS TRIGGER AS $$
BEGIN
  CASE EXTRACT(YEAR FROM NEW.reading_date)
    WHEN 2015 THEN
      INSERT INTO sensor_log_part_2015 VALUES (NEW.*);
    WHEN 2014 THEN
      INSERT INTO sensor_log_part_2014 VALUES (NEW.*);
    WHEN 2013 THEN
      INSERT INTO sensor_log_part_2013 VALUES (NEW.*);
    WHEN 2012 THEN
      INSERT INTO sensor_log_part_2012 VALUES (NEW.*);
  END CASE;
 
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;
 
CREATE TRIGGER t_correct_partition
BEFORE INSERT ON sensor_log
   FOR EACH ROW EXECUTE PROCEDURE sensor_partition();

This probably looks a bit different from the examples in the documentation for a few reasons. First, when creating a table, the LIKE markup is available to copy various attributes from another table. By using this operation, we’re redundantly copying the columns from the parent table, because this is already handled by the INHERITS command. What isn’t redundant is the INCLUDING INDEXES modifier for LIKE, which saves us the work of manually creating each index from the parent table on each child table.

Since we already have a more complicated CREATE TABLE than what they show in the documentation, we might as well include the partition constraints when creating each child table, instead of using ALTER TABLE to add them afterwards. But these are just shortcuts; the overall structure is the same, as are the performance considerations and benefits.

Using our python script and inserting 100-million rows takes about three times longer than it did in the non-partitioned case. Basically, this includes the trigger overhead for redirecting the rows. It’s painful, but not debilitating on systems that aren’t dependent on transaction performance.

We could also reduce this overhead by simplifying the logic of the sensor_partition function. An easy way to accomplish that, would be to have it only target the most recent partition. Then we would just need some maintenance job that would replace the function when the most recent partition changes. Since most ongoing loader jobs target recent incoming data, that would probably make more sense anyway.

Barring that, what do we gain? One huge benefit is constraint exclusion. The PGDB query planner will consider the CHECK constraints we placed on the partitions when processing statements. This is what a query looks like that triggers this effect:

EXPLAIN ANALYZE
SELECT COUNT(*)
  FROM sensor_log
 WHERE reading_date >= '2015-08-14';
 
                             QUERY PLAN                             
--------------------------------------------------------------------
 Aggregate  (cost=16003.26..16003.27 ROWS=1 width=0)
            (actual TIME=1181.681..1181.682 ROWS=1 loops=1)
   ->  Append  (cost=0.00..15003.79 ROWS=399791 width=0)
               (actual TIME=0.578..929.660 ROWS=411314 loops=1)
         ->  Seq Scan ON sensor_log
                  (cost=0.00..0.00 ROWS=1 width=0)
                  (actual TIME=0.001..0.001 ROWS=0 loops=1)
               FILTER: (reading_date >= '2015-08-14 00:00:00')
         ->  INDEX ONLY Scan
             USING sensor_log_part_2015_reading_date_idx
                ON sensor_log_part_2015
                    (cost=0.44..15003.79 ROWS=399790 width=0)
                    (actual TIME=0.576..474.890 ROWS=411314 loops=1)
               INDEX Cond: (reading_date >= '2015-08-14 00:00:00')
               Heap Fetches: 411314
 
 Planning TIME: 0.282 ms
 Execution TIME: 1181.719 ms

With a PGDB partition, the base table is always included. Since it’s empty, a sequence scan has no cost. The WHERE clause only matches the check constraint for the 2015 partition, so all results are from that table alone. The data is extremely dense, so we still fetch over 400k records however, resulting in an execution time of around one second.

The primary caveat with constraint exclusion is that it doesn’t work with variable substitution. Were we to replace the static date with CURRENT_DATE instead, PGDB would check all of the child partitions. As it turns out, that’s only one empty index poll for each of the invalid partitions. Thus the cost isn’t much higher, though the query plan would be much more complex.

Is there anything else that goes in our favor? We’ve mentioned maintenance costs several times in the past, and this is one reason why. Our data is now distributed across four tables. Assuming we retain this partitioning scheme, all future data will continue to be distributed by year. Remember our query that checks for all tables with at least 10-million rows? Let’s try it again:

SELECT oid::regclass::text AS TABLE_NAME,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size,
       CEIL(reltuples / 1000000) AS million_rows
  FROM pg_class
 WHERE relkind = 'r'
   AND reltuples > 10000000
 ORDER BY relpages DESC;
 
      TABLE_NAME      | total_size | million_rows 
----------------------+------------+--------------
 sensor_log_part_2014 | 4163 MB    |           37
 sensor_log_part_2013 | 4508 MB    |           37
 sensor_log_part_2015 | 2832 MB    |           23

Aside from 2012, which is much smaller simply due to the date of this article, rows are distributed proportionally. From this, we can estimate that all maintenance operations on each child table will be roughly three times faster than before. While it’s true that we’ve also multiplied the amount of maintenance by a similar factor, such operations are generally automated.

Autovacuum and autoanalyze will now be much faster, and only focus on modified data. This means our 2015 table will likely experience several more of these maintenance cycles as it accumulates rows until 2016, after which it might remain totally inert. Before we partitioned, each of these operations would have processed all 100-million rows every time since they were all contained in a single table. Now, it can focus on just the portion that’s changing frequently.

We also win the option of easy archival. If we no longer need 2012 data, we can export it with a COPY command, and then drop the table. With a single table, we could only remove it with a DELETE statement. Not only is this orders of magnitude slower since it has to locate and then remove each row for 2012, but due to PGDB’s versioned storage, we now have several million dead rows to account for. If a VACUUM doesn’t immediately follow such a procedure, we’ve virtually guaranteed performance-robbing bloat.

Large tables take a very long time to VACUUM, which would cause a chain of delays along the data import stack. This kind of maintenance doesn’t block table writes, but if we don’t delay such jobs following significant changes, we again risk unnecessary table bloat. Or the disk activity needed for the maintenance will delay it for us. Really, using a single table in these cases brings a lot of downsides.

Used properly, this kind of partitioning is close to an equivalent exchange regarding query performance. Of course, there are known caveats associated with this kind of partitioning structure.

  1. Trigger overhead to redistribute data entry is very real. Even fast triggers add significant overhead.
  2. Primary keys can not distinguish rows in a partitioned table. Since each unique index exists independently on the partition, there’s no way to prevent duplicates across the set.
  3. A system must be in place to change trigger targets and build new partitions as they are needed. This is a lot more complicated than just using a single static table.

With that said, the first caveat is easy to circumvent. Any good partitioning scheme has an associated naming scheme. Loading jobs can target new data at the appropriate partition directly, thus reverting to single-table performance in that regard. Software that is PGDB partition-aware is amazingly beneficial, but also unfortunately rare due to all the custom tooling surrounding it.

The second concern is a bit tricky since such checks are inherent to PGDB operation. We can however, supplement the PRIMARY KEY syntax, with a stored procedure that performs verification across all partitions. Then we simply attach a trigger to child partitions that would invoke it on INSERT or UPDATE. Due to the associated overhead, such an extreme measure should be reserved for incredibly critical cases. However, we should note that large tables are generally data sinks; their primary key columns are usually handled through sequences or other types of auto-numbering that makes collisions incredibly rare.

The third issue is simply a software engineering problem. In that regard, the pg_partman extension provides quite a bit of automation for provisioning and maintaining partitioned tables. If that doesn’t work, the entire partition stack is in our first block of code. A bit of variable substitution and scripting could transform it into a generalized process. In fact, I’ve seen such things in Groovy and Python data import applications.

It all comes down to what kind of tradeoffs you want. Is it worth taking a potential performance hit in some SELECT and INSERT operations to prevent long-term data juggling? Is maintenance frequent enough that large objects act as a noticeable bottleneck? Is there enough data accumulation and turnover that distributing it is desirable? There are a lot of questions to consider before jumping straight into this partition style.

This methodology is a first step to a basic distributed architecture. Next week, we’ll consider something a bit more exotic.

PG Phriday: Basic Partitioning
Tagged on:                 

15 thoughts on “PG Phriday: Basic Partitioning

    1. Honestly? Because I don’t like the PostgreSQL name. It’s annoying to have to explain what the hell it is every time I talk to someone about it, and then I get to tell them not to write it as PostgresSQL, and then how it’s really pronounced, and then…

      Beyond that, with all of the extensions and JSON and everything else, Postgres is becoming more and more of a middleware. Just like it was post-Ingress, I think it’s moved beyond needing the SQL as part of the actual name. I’m hoping it catches on, but I’m not stupid enough to think I have any influence in the matter.

  • Why not call it postgres then? That’s fairly widespread and immediately recognizable as being the same project as PostgreSQL. I don’t think PGDB shares much of the name recognition the project has built after this long time, so it doesn’t seem to be a name to me.

    I think it was a mistake to not just choose postgres as a project name…

    1. Yeah, that is an option. I was always under the impression that PostgreSQL was the preferred nomenclature when referring to the project in general, and that just “Postgres” was somehow misleading. I even made sure to always use the full official name everywhere in the High Availability Cookbook.

      So, hmm. Do I abandon my little pet project at advocating PGDB as the new name, or just give up and hope the SQL part will eventually be removed and consider that good enough. Decisions, decisions. 😉

      1. Well it was originally named Postgres and it’s officially accepted alias while PGDB is something nobody uses, I think you are better off with Postgres 🙂

        1. Well, I had read a thread a long time ago about various potential names being bandied about. For some reason, I latched onto PGDB. It’s short, easy to say, references the history of Postgres, etc. But thinking upon it a bit more, the reason I gave Andres applies to PGDB too: Postgres is becoming more of a middleware. The extensions and some of the existing and upcoming enhancements are making it something more than a simple database.

          So yeah. Postgres makes more sense. From now on, that’s what I’ll call it. PostgreSQL being the official nomenclature or not, I’m dropping the SQL.

  • Thanks for the very interesting and informative series, Shaun.

    Question: if the unneeded sensor_log_part_2012 table is dropped, will the insert trigger still function normally, or will sensor_partition() need to be modified to remove the reference to the dropped table?

    1. You’d have to alter the trigger too, which is actually a common failing of this type of partitioning. Usually the trigger will only target two tables:

      • The most recent partition.
      • One partition in the future.

      This way, you don’t risk a race condition at midnight. Unfortunately that means you need to recreate the trigger every time your chosen interval advances. This only really applies to date-based partitions, but that’s the vast majority of what I’ve seen and used in the field.

      For the 2012-2015 example here, the trigger was only necessary for the initial loading process. In a real scenario, the trigger would be for 2015 and 2016 only, and when 2016 arrived, it would be altered to 2016 and 2017. Yearly partitions are also a bit too big; I’ve yet to see anything larger than monthly partitions in practice. On one of our production tables, we have implemented daily partitions because the volume is staggeringly immense. I’ll get into it more this Friday, but we don’t actually use triggers to maintain that architecture. I’m sneaky that way.

  • Thanks for the informative series. My fond hope is that a not-crazy, i.e. declarative, form of partitioning will displace what you’re describing here. Everybody who’s worked with Postgres even casually has discovered this gigantic hole in our implementation, and not everybody has dealt with it as reasonably as this post outlines.

  • I think one important reason to use partitioning is to avoid any single table overgrowing the size of a FS because in PG, a tablespace can not span across multiple FS. Even with a performance hit in insert, it is desirable to have a large table split into multiple child tables, each on its own tablespace (and by extension, different FS), just to avoid maxing out on a FS.

    1. In general, this is usually not a concern. If you’re afraid of your tables becoming too large for the underlying filesystem, you should be using a better filesystem (XFS for instance) or a volume manager like LVM. Splitting tables across multiple filesystems with partitions is a bit of a hack.

Comments are closed.