PG Phriday: Fancy Partitioning
This week we’ll be covering another method of Postgres partitioning. This is a technique I personally prefer and try to use and advocate at every opportunity. It’s designed to straddle the line between traditional partitioning and standard monolithic table structure by using table inheritance as a convenience factor. The assumption here is that end-user applications either:
- Know that partitioning is in use.
- Only load “current” data and don’t care about partitions.
These two things may seem mutually exclusive, but they can actually be complementary. There are also notable performance benefits to this approach that we’ll explore later.
The SQL for this structure is similar to what we used last week, but there are a couple of notable omissions.
TRUNCATE TABLE sensor_log; CREATE TABLE sensor_log_part_2012 ( LIKE sensor_log INCLUDING INDEXES, CHECK (reading_date >= '2012-01-01' AND reading_date < '2013-01-01') ) INHERITS (sensor_log); CREATE TABLE sensor_log_part_2013 ( LIKE sensor_log INCLUDING INDEXES, CHECK (reading_date >= '2013-01-01' AND reading_date < '2014-01-01') ) INHERITS (sensor_log); CREATE TABLE sensor_log_part_2014 ( LIKE sensor_log INCLUDING INDEXES, CHECK (reading_date >= '2014-01-01' AND reading_date < '2015-01-01') ) INHERITS (sensor_log);
This time, we don’t need the function that distributes the rows to the correct partition, or the associated trigger. Further, note that the 2015 partition is missing. This is possible because we’ve made two assumptions about how partitions will be used:
- Current data will be stored in the base table.
- Old or “archived” data will be moved into partitions.
To fulfill the first requirement, we made a slight change to the python loader script introduced at the beginning of this series. We could have done this with the partitioning introduced last week as well, but while that is a performance tweak for that format, this method depends on it.
# Set these to modify how large the COMMIT blocks will be, # and how many days the data will represent. # Total row count = chunk_size * date_range. chunk_size = 100000 date_range = 1000 # Main program body import psycopg2 from StringIO import StringIO from datetime import datetime, timedelta raw_data = StringIO() db_conn = psycopg2.connect(database = 'postgres', user = 'postgres', port = 5433) cur = db_conn.cursor() print 'rows,seconds,microseconds' for j in xrange(0, date_range): raw_data.seek(0) raw_data.truncate(0) usedate = datetime.now() - timedelta(days = j) stamp = datetime.combine(usedate.date(), usedate.time().max) for i in xrange(1, chunk_size + 1): stamp -= timedelta(seconds = 0.5) raw_data.write( '%s\t%s\t%s\n' % (i % 1000, i % 100, stamp) ) raw_data.seek(0) t1 = datetime.now() # This is the part where we become partition-aware. # If this script only handled current data, we wouldn't # even need the 'part' variable at all. part = '' if stamp.year != datetime.now().year: part = '_part_%s' % stamp.year cur.copy_from( raw_data, 'sensor_log%s' % part, '\t', columns = ('location', 'reading', 'reading_date') ) t2 = datetime.now() diff_s = str(t2 - t1).split(':') print '%s,%s' % ( (j + 1) * chunk_size, diff_s.replace('.', ',') ) db_conn.commit() raw_data.close() db_conn.close()
As the comments in the code note, if the loader script only handles a stream of new and current data, we could remove the partition handling block entirely. We should also note that the script loads all of the partitions about 10% faster than a single monolithic table. Likely this is due to the index calculations being reduced in cost from reduced tree depth.
The primary benefit from this structure is that we can get more use out of the
ONLY keyword. Most of the partitioned tables I’ve encountered are primarily concerned with “hot” data. Since know that Postgres will always check the base table regardless of partition contents, why not use that to our advantage? The
ONLY keyword makes that the first and last step.
This is how that looks to the planner:
EXPLAIN ANALYZE SELECT count(*) FROM ONLY sensor_log WHERE reading_date >= CURRENT_DATE; QUERY PLAN -------------------------------------------------------------------- Aggregate (cost=5141.55..5141.56 rows=1 width=0) (actual time=140.079..140.080 rows=1 loops=1) -> Index Only Scan using idx_sensor_log_date on sensor_log (cost=0.57..4853.80 rows=115099 width=0) (actual time=0.043..75.019 rows=100000 loops=1) Index Cond: (reading_date >= ('now'::cstring)::date) Heap Fetches: 0 Planning time: 0.184 ms Execution time: 140.126 ms
We can see that we only touched the base
sensor_log table, ignoring any partitions that might exist. This is the part where being partition-aware is beneficial. Now systems that focus on current data can directly restrict queries to the “current” partition without having to know anything about the naming scheme or distribution process. And as we’ve already seen, table loading can also ignore any partitions if processing recent entries.
Of course, now we have a new problem. Since the current partition is the base table, once the data no longer matches our desired constraints, it needs to be relocated. When 2016 arrives, we need to move the 2015 data, or reorganize the tables themselves. Depending on partition granularity, the latter actually isn’t so bad. In fact, here’s a process that does it all at once in a transaction:
BEGIN; DELETE FROM pg_inherits WHERE inhparent = 'sensor_log'::REGCLASS; ALTER TABLE sensor_log RENAME TO sensor_log_part_2015; CREATE TABLE sensor_log (LIKE sensor_log_part_2015 INCLUDING ALL); ALTER TABLE sensor_log_part_2015 ADD CONSTRAINT sensor_log_part_2015_reading_date_check CHECK (reading_date >= '2015-01-01' AND reading_date < '2016-01-01'); INSERT INTO pg_inherits SELECT oid, 'sensor_log'::REGCLASS, 1 FROM pg_class WHERE relname LIKE 'sensor_log_part_%' AND relkind = 'r'; COMMIT;
Alternatively, we can maintain a more elastic system. Another way to use the base table + partition style is to only keep extremely recent data in the base table, such as the most recent week. Then, move any older data to a prepared partition. In this case we would already have the 2015 partition, and then every night we would run this maintenance SQL:
BEGIN; INSERT INTO sensor_log_part_2015 SELECT * FROM ONLY sensor_log WHERE reading_date >= CURRENT_DATE - INTERVAL '1 week'; DELETE FROM ONLY sensor_log WHERE reading_date >= CURRENT_DATE - INTERVAL '1 week'; COMMIT;
In either case, this maintenance is easily scriptable, though clearly a bit more invasive than simply distributing the data outright with a trigger. What do we gain from that caveat?
- No more performance-robbing redirection trigger. This is great for high transaction volumes.
- The root table contains only the most relevant data. This allows us to treat inherited tables like optional extended archives.
- Data in the root table does not need to match the granularity of our partitions. Given the second data-movement approach, the base partition could hold one week of data, while each archive table reflects month or year distributions.
- Data readers only need to remember one rule: fast, recent data should use the
- Both readers and writers can remain blissfully ignorant of the naming scheme, distribution rules, and any other partitioning overhead. Everything can just target the base table and maintenance will sort it out later.
In the end, maybe this structure doesn’t qualify as “fancy”, but it is most definitely a departure from the methodology discussed in the Postgres documentation. It has some benefits beyond the traditional approach, and of course brings its own drawbacks to the table. As always, choosing is up to the DBA and the development teams.
I wrote a Postgres extension for my current employer that abstracts much of this and automates the rest. I hope to open-source it soon so everyone can squeeze some benefits out of applicable use cases. But now that you know about it, there’s no reason to wait for me to wade through red tape; explore!