Over the next few weeks, I’m going to be presenting a long discussion of PGDB (PostgreSQL) table partitioning. I’ve covered it in previous articles, but only in terms of basic performance considerations. That’s a very limited view of what partitioning can offer; there’s a lot more variance and background that deserves elucidation.

So for the next few articles, the topic of discussion will be partitioning. There isn’t really enough coverage of it out there, and a lot of the techniques used in the field are pulled straight from the documentation. I think we can go much further.

To that end, this is the table structure all subsequent articles will reference:

CREATE TABLE sensor_log (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       VARCHAR NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);
 
CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

Very large, continuously growing tables tend to be great candidates for partitioning. Further, we can explore several different techniques and cover various considerations related to expected volume. To help that along, I threw together a basic Python loader script that both bootstraps the partition data and profiles timings for analysis purposes.

To keep things interesting, the values are set to fill the table with 100M rows. The fact that this is scripted helps simulate a process that might actually load data in a production environment, and it lets us test basic insert speed with and without partitioning as the size of the table (and its indexes) increases over time.

# Set these to modify how large the COMMIT blocks will be,
# and how many days the data will represent.
# Total row count = chunk_size * date_range.
 
chunk_size = 100000
date_range = 1000
 
# Main program body
 
import psycopg2
from StringIO import StringIO
from datetime import datetime, timedelta
 
raw_data = StringIO()
db_conn = psycopg2.connect(database = 'postgres', user = 'postgres')
cur = db_conn.cursor()
 
print 'rows,seconds,microseconds'
 
for j in xrange(0, date_range):
    raw_data.seek(0)
    raw_data.truncate(0)
    stamp = datetime.now() - timedelta(days = j)
 
    for i in xrange(1, chunk_size + 1):
        stamp -= timedelta(seconds = 5)
        raw_data.write(
            '%s\t%s\t%s\n' % (i % 1000, i % 100, stamp)
        )
 
    raw_data.seek(0)
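    # Time the COPY of this chunk into sensor_log.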
    t1 = datetime.now()
 
    cur.copy_from(
        raw_data, 'sensor_log', '\t',
        columns = ('location', 'reading', 'reading_date')
    )
 
    t2 = datetime.now()
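    # Keep only the SS.ffffff portion of the delta; this assumes
    # each chunk loads in under a minute.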
    diff_s = str(t2 - t1).split(':')[2]
 
    print '%s,%s' % (
        (j + 1) * chunk_size, diff_s.replace('.', ',')
    )
 
    db_conn.commit()
 
raw_data.close()
db_conn.close()

Before we get to any partitioning discussion: using this script as a baseline, loading the base sensor_log table takes about 45 minutes on my testing VM. Note that we didn’t cheat by loading the table without indexes, as we’ve done previously; existing tables won’t have that luxury, so the partitioning discussion can ignore that shortcut for now.
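
For reference, that shortcut amounts to roughly the following: drop the indexes, bulk load, then rebuild them afterward. We’re deliberately not doing this here.

DROP INDEX idx_sensor_log_location;
DROP INDEX idx_sensor_log_date;
 
-- ... bulk load the data here ...
 
CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);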

So, what are the approaches we can cover?

  • No partitioning at all.
  • Use the technique discussed in the PostgreSQL documentation (triggers); a rough sketch follows this list.
  • Use the base table as current data, with partitions as older, archived information.
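
For anyone who hasn’t seen it, a minimal sketch of that documentation approach looks something like this. The child table name and date range here are purely illustrative; we’ll build this out properly later.

CREATE TABLE sensor_log_part_2015_09 (
  CHECK (reading_date >= DATE '2015-09-01' AND
         reading_date <  DATE '2015-10-01')
) INHERITS (sensor_log);
 
CREATE INDEX idx_sensor_log_part_2015_09_date
    ON sensor_log_part_2015_09 (reading_date);
 
CREATE OR REPLACE FUNCTION sensor_log_insert()
RETURNS TRIGGER AS
$$
BEGIN
  -- Route each row to the child matching its reading_date.
  IF NEW.reading_date >= DATE '2015-09-01' AND
     NEW.reading_date <  DATE '2015-10-01'
  THEN
    INSERT INTO sensor_log_part_2015_09 VALUES (NEW.*);
  ELSE
    RAISE EXCEPTION 'No partition for %', NEW.reading_date;
  END IF;
 
  RETURN NULL;  -- Keep the row out of the base table.
END;
$$ LANGUAGE plpgsql;
 
CREATE TRIGGER t_sensor_log_insert
BEFORE INSERT ON sensor_log
   FOR EACH ROW EXECUTE PROCEDURE sensor_log_insert();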

And knowing the techniques, what do we examine?

  • Statement speed (see the example query after this list).
  • Potential overhead.
  • Maintenance concerns.
  • Implementation caveats.
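
For the statement speed item, for instance, the rough idea is to run the same representative query against each layout and compare timings and plans. Something along these lines, where the query itself is just a stand-in:

EXPLAIN ANALYZE
SELECT count(*)
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE - INTERVAL '1 day';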

Partitioning is a hairy procedure. No matter what method is chosen, the result will have definite performance and management issues. The idea is to choose a good balance that addresses current needs, based on volume. I plan on being as comprehensive as possible to advocate partitioning without simply being a cheerleader.

Due to the 100-million-row sample size and the associated 13GB of space required, the time scale for even the non-partitioned case should make everything obvious. Unfortunately it also greatly inflates the time required for any statement tests, and for bootstrapping the tables themselves. I think that effort is warranted, though, especially because projects likely to benefit from partitioning will often exceed even that scale.
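
Incidentally, that footprint is easy to verify once the loader finishes; this reports the total size of sensor_log, indexes included:

SELECT pg_size_pretty(pg_total_relation_size('sensor_log'));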

So think about partitioning a bit, because I’m about to go into some crazy depth about it, and hopefully make a case for using it early and often.


4 thoughts on “PG Phriday: The Case for Partitioning”

  • I don’t agree with Hans’s comment. I can create a child table and add foreign key constraints to it, and I can put a primary key constraint on a child table and use it for foreign key constraints on other tables, so I don’t see what’s missing.

    I was able to do all of this on PG 9.4:

    CREATE TABLE dim1 (i1 INT PRIMARY KEY, c1 VARCHAR(30));
    CREATE TABLE dim2 (i2 INT PRIMARY KEY, c2 VARCHAR(30));
    CREATE TABLE fact (i1 INT REFERENCES dim1, i2 INT REFERENCES dim2, fact_date DATE, facta INT, factb INT, CONSTRAINT pk_fact PRIMARY KEY (i1, i2, fact_date));
    CREATE TABLE fact_2014 () inherits (fact);
    ALTER TABLE fact_2014 ADD CONSTRAINT ck_2014_fact_date CHECK (fact_date BETWEEN DATE '2014-01-01' AND '2014-12-31');

    At this point, fact_2014 didn’t inherit the primary key or foreign key constraints from its parent, so I have to add them. Maybe that’s inconvenient, but it’s certainly doable:

    ALTER TABLE fact_2014 ADD CONSTRAINT pk_2014 PRIMARY KEY (i1, i2, fact_date);
    ALTER TABLE fact_2014 ADD CONSTRAINT fk_2014_i1 FOREIGN KEY (i1) REFERENCES dim1;
    ALTER TABLE fact_2014 ADD CONSTRAINT fk_2014_i2 FOREIGN KEY (i2) REFERENCES dim2;

    I can also use the primary key on the partitioned table as another table’s foreign key:

    CREATE TABLE other (i1 INT, i2 INT, fact_date DATE, something VARCHAR(30), CONSTRAINT pk_other PRIMARY KEY (i1, i2, fact_date));
    ALTER TABLE other ADD CONSTRAINT fk_other FOREIGN KEY (i1, i2, fact_date) REFERENCES fact_2014;

    Maybe Hans would like to define a primary key on the parent table in the partitioning scheme, have it apply to all values in the parent table and all of its child tables, and use that as a foreign key in other tables? Is there a case where that is necessary?

    When I first started using Postgres, after using Oracle for many years, the partitioning scheme in PG struck me as slightly weird, but after a while I saw that it could do everything that I needed. It would be good to see a scenario that illustrates the specific limitation of concern to Hans, and whether there are ways to deal with that.
