PG Phriday: Basic Partitioning

August 21st, 2015 | Published in Database, Tech Talk | 9 Comments

Most PGDB (PostgreSQL) users who are familiar with partitioning use the method described in the partitioning documentation. This architecture comes in a fairly standard stack:

  • One empty base table for structure.
  • At least one child table that inherits the base design.
  • A trigger to redirect inserts based on the partitioning scheme.
  • A constraint on each child table to enforce the partition scheme, and help the planner exclude child partitions from inapplicable queries.

It looks simple, but there’s a lot of SQL involved. Let’s take the sensor_log table we introduced a couple weeks ago and turn it into a partitioned structure.

TRUNCATE TABLE sensor_log;

CREATE TABLE sensor_log_part_2012 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2012-01-01' AND
           reading_date < '2013-01-01')
) INHERITS (sensor_log);

CREATE TABLE sensor_log_part_2013 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2013-01-01' AND
           reading_date < '2014-01-01')
) INHERITS (sensor_log);

CREATE TABLE sensor_log_part_2014 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2014-01-01' AND
           reading_date < '2015-01-01')
) INHERITS (sensor_log);

CREATE TABLE sensor_log_part_2015 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2015-01-01' AND
           reading_date < '2016-01-01')
) INHERITS (sensor_log);

CREATE OR REPLACE FUNCTION sensor_partition()
RETURNS TRIGGER AS $$
BEGIN
  CASE EXTRACT(YEAR FROM NEW.reading_date)
    WHEN 2015 THEN
      INSERT INTO sensor_log_part_2015 VALUES (NEW.*);
    WHEN 2014 THEN
      INSERT INTO sensor_log_part_2014 VALUES (NEW.*);
    WHEN 2013 THEN
      INSERT INTO sensor_log_part_2013 VALUES (NEW.*);
    WHEN 2012 THEN
      INSERT INTO sensor_log_part_2012 VALUES (NEW.*);
  END CASE;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t_correct_partition
BEFORE INSERT ON sensor_log
   FOR EACH ROW EXECUTE PROCEDURE sensor_partition();

This probably looks a bit different from the examples in the documentation for a few reasons. First, when creating a table, the LIKE markup is available to copy various attributes from another table. By using this operation, we’re redundantly copying the columns from the parent table, because this is already handled by the INHERITS command. What isn’t redundant is the INCLUDING INDEXES modifier for LIKE, which saves us the work of manually creating each index from the parent table on each child table.

Since we already have a more complicated CREATE TABLE than what they show in the documentation, we might as well include the partition constraints when creating each child table, instead of using ALTER TABLE to add them afterwards. But these are just shortcuts; the overall structure is the same, as are the performance considerations and benefits.

Using our Python script to insert 100-million rows takes about three times longer than it did in the non-partitioned case. Basically, this reflects the trigger overhead for redirecting the rows. It’s painful, but not debilitating on systems that aren’t dependent on transaction performance.

We could also reduce this overhead by simplifying the logic of the sensor_partition function. An easy way to accomplish that would be to have it target only the most recent partition. Then we would just need a maintenance job to replace the function when the most recent partition changes. Since most ongoing loader jobs target recent incoming data, that would probably make more sense anyway.
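As a rough sketch of that idea (the helper name and layout here are my own, not code from this article), a maintenance job could simply re-render the trigger function so it routes every row straight into the newest partition:

```python
def make_fast_partition_function(year):
    """Render a simplified sensor_partition() that inserts every row
    directly into the given year's partition. A scheduled job would
    re-run this DDL whenever a new partition comes online.
    Illustrative sketch only."""
    return (
        "CREATE OR REPLACE FUNCTION sensor_partition()\n"
        "RETURNS TRIGGER AS $$\n"
        "BEGIN\n"
        "  INSERT INTO sensor_log_part_%d VALUES (NEW.*);\n"
        "  RETURN NULL;\n"
        "END;\n"
        "$$ LANGUAGE plpgsql;\n"
    ) % year

print(make_fast_partition_function(2015))
```

Executing the rendered DDL once per year keeps the trigger body to a single INSERT, with no CASE evaluation per row.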

Barring that, what do we gain? One huge benefit is constraint exclusion. The PGDB query planner will consider the CHECK constraints we placed on the partitions when processing statements. This is what a query looks like that triggers this effect:

SELECT COUNT(*)
  FROM sensor_log
 WHERE reading_date >= '2015-08-14';
                             QUERY PLAN                             
 Aggregate  (cost=16003.26..16003.27 rows=1 width=0)
            (actual time=1181.681..1181.682 rows=1 loops=1)
   ->  Append  (cost=0.00..15003.79 rows=399791 width=0)
               (actual time=0.578..929.660 rows=411314 loops=1)
         ->  Seq Scan on sensor_log
                  (cost=0.00..0.00 rows=1 width=0)
                  (actual time=0.001..0.001 rows=0 loops=1)
               Filter: (reading_date >= '2015-08-14 00:00:00')
         ->  Index Only Scan
             using sensor_log_part_2015_reading_date_idx
                on sensor_log_part_2015
                    (cost=0.44..15003.79 rows=399790 width=0)
                    (actual time=0.576..474.890 rows=411314 loops=1)
               Index Cond: (reading_date >= '2015-08-14 00:00:00')
               Heap Fetches: 411314
 Planning time: 0.282 ms
 Execution time: 1181.719 ms

With a PGDB partition, the base table is always included. Since it’s empty, a sequence scan has no cost. The WHERE clause only matches the check constraint for the 2015 partition, so all results are from that table alone. The data is extremely dense, so we still fetch over 400k records, resulting in an execution time of around one second.

The primary caveat with constraint exclusion is that it doesn’t work with variable substitution. Were we to replace the static date with CURRENT_DATE instead, PGDB would check all of the child partitions. As it turns out, that’s only one empty index poll for each of the invalid partitions. Thus the cost isn’t much higher, though the query plan would be much more complex.

Is there anything else that goes in our favor? We’ve mentioned maintenance costs several times in the past, and this is one reason why. Our data is now distributed across four tables. Assuming we retain this partitioning scheme, all future data will continue to be distributed by year. Remember our query that checks for all tables with at least 10-million rows? Let’s try it again:

SELECT oid::regclass::text AS table_name,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size,
       CEIL(reltuples / 1000000) AS million_rows
  FROM pg_class
 WHERE relkind = 'r'
   AND reltuples > 10000000
 ORDER BY relpages DESC;
      table_name      | total_size | million_rows 
 sensor_log_part_2014 | 4163 MB    |           37
 sensor_log_part_2013 | 4508 MB    |           37
 sensor_log_part_2015 | 2832 MB    |           23

Aside from 2012, which is much smaller simply due to the date of this article, rows are distributed proportionally. From this, we can estimate that all maintenance operations on each child table will be roughly three times faster than before. While it’s true that we’ve also multiplied the amount of maintenance by a similar factor, such operations are generally automated.

Autovacuum and autoanalyze will now be much faster, and only focus on modified data. This means our 2015 table will likely experience several more of these maintenance cycles as it accumulates rows until 2016, after which it might remain totally inert. Before we partitioned, each of these operations would have processed all 100-million rows every time since they were all contained in a single table. Now, it can focus on just the portion that’s changing frequently.

We also win the option of easy archival. If we no longer need 2012 data, we can export it with a COPY command, and then drop the table. With a single table, we could only remove it with a DELETE statement. Not only is this orders of magnitude slower since it has to locate and then remove each row for 2012, but due to PGDB’s versioned storage, we now have several million dead rows to account for. If a VACUUM doesn’t immediately follow such a procedure, we’ve virtually guaranteed performance-robbing bloat.

Large tables take a very long time to VACUUM, which would cause a chain of delays along the data import stack. This kind of maintenance doesn’t block table writes, but if we don’t delay such jobs following significant changes, we again risk unnecessary table bloat. Or the disk activity needed for the maintenance will delay it for us. Really, using a single table in these cases brings a lot of downsides.

Used properly, this kind of partitioning is close to an equivalent exchange regarding query performance. Of course, there are known caveats associated with this kind of partitioning structure.

  1. Trigger overhead to redistribute data entry is very real. Even fast triggers add significant overhead.
  2. Primary keys cannot distinguish rows in a partitioned table. Since each unique index exists independently on the partition, there’s no way to prevent duplicates across the set.
  3. A system must be in place to change trigger targets and build new partitions as they are needed. This is a lot more complicated than just using a single static table.

With that said, the first caveat is easy to circumvent. Any good partitioning scheme has an associated naming scheme. Loading jobs can target new data at the appropriate partition directly, thus reverting to single-table performance in that regard. Software that is PGDB partition-aware is amazingly beneficial, but also unfortunately rare due to all the custom tooling surrounding it.
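Given the yearly naming scheme used above, a partition-aware loader can compute its target table directly. A minimal sketch, assuming that convention:

```python
from datetime import date

def partition_for(reading_date, prefix='sensor_log_part'):
    # Map a reading date onto the yearly naming scheme used above.
    return '%s_%d' % (prefix, reading_date.year)

print(partition_for(date(2014, 5, 1)))  # sensor_log_part_2014
```

The loader would then COPY or INSERT into that table name directly, bypassing the trigger entirely and recovering single-table insert speed.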

The second concern is a bit tricky since such checks are inherent to PGDB operation. We can, however, supplement the PRIMARY KEY syntax with a stored procedure that performs verification across all partitions. Then we simply attach a trigger to child partitions that would invoke it on INSERT or UPDATE. Due to the associated overhead, such an extreme measure should be reserved for incredibly critical cases. However, we should note that large tables are generally data sinks; their primary key columns are usually handled through sequences or other types of auto-numbering that make collisions incredibly rare.
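As a rough illustration of what such a verification could look for (the query shape here is my own sketch, not code from the article), one can union the key column across all partitions and report any collisions:

```python
def uniqueness_check_sql(partitions, key):
    # Build a query that reports any key value appearing more than
    # once across the whole partition set. Sketch only; a real
    # trigger-based check would be far more targeted per-row.
    union = '\n      UNION ALL\n      '.join(
        'SELECT %s FROM %s' % (key, p) for p in partitions
    )
    return (
        'SELECT %s, count(*)\n'
        '  FROM (%s) all_parts\n'
        ' GROUP BY %s\n'
        'HAVING count(*) > 1;'
    ) % (key, union, key)

print(uniqueness_check_sql(
    ['sensor_log_part_2014', 'sensor_log_part_2015'], 'sensor_log_id'
))
```

Running the generated query occasionally as an audit is far cheaper than invoking it from a per-row trigger.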

The third issue is simply a software engineering problem. In that regard, the pg_partman extension provides quite a bit of automation for provisioning and maintaining partitioned tables. If that doesn’t work, the entire partition stack is in our first block of code. A bit of variable substitution and scripting could transform it into a generalized process. In fact, I’ve seen such things in Groovy and Python data import applications.
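That variable substitution can be sketched in a few lines; parameter names below are mine, and the output mirrors the LIKE/CHECK/INHERITS pattern from the first code block:

```python
def child_table_ddl(parent, date_col, year):
    # Render the CREATE TABLE for one yearly child partition,
    # following the pattern shown earlier. Illustrative only.
    return (
        "CREATE TABLE %(parent)s_part_%(year)d (\n"
        "    LIKE %(parent)s INCLUDING INDEXES,\n"
        "    CHECK (%(col)s >= '%(year)d-01-01' AND\n"
        "           %(col)s < '%(next)d-01-01')\n"
        ") INHERITS (%(parent)s);"
    ) % {'parent': parent, 'col': date_col, 'year': year, 'next': year + 1}

print(child_table_ddl('sensor_log', 'reading_date', 2016))
```

A provisioning job could run this each December to build the next year's partition before any rows arrive for it.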

It all comes down to what kind of tradeoffs you want. Is it worth taking a potential performance hit in some SELECT and INSERT operations to prevent long-term data juggling? Is maintenance frequent enough that large objects act as a noticeable bottleneck? Is there enough data accumulation and turnover that distributing it is desirable? There are a lot of questions to consider before jumping straight into this partition style.

This methodology is a first step to a basic distributed architecture. Next week, we’ll consider something a bit more exotic.


PG Phriday: Partitioning Candidates

August 14th, 2015 | Published in Database, Tech Talk | 2 Comments

What’s a good table to partition? It’s not always a question with an obvious answer. Most often, size and volume determine whether or not a table should be broken into several chunks. However, there are also cases where business or architecture considerations might use partitioning to preserve a shared table structure, or drive aggregate analysis over a common core. In PGDB (PostgreSQL), this is even more relevant because of how partitioning is handled through inheritance.

For now, let’s focus on elements that can be directly quantified, such as size and row count. Luckily, the PGDB system catalog contains this information. Let’s see what it has to say about the sensor_log table introduced in last week’s article on The Case for Partitioning.

SELECT oid::regclass::text AS table_name,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size,
       CEIL(reltuples / 1000000) AS million_rows
  FROM pg_class
 WHERE relkind = 'r'
   AND reltuples > 10000000
 ORDER BY relpages DESC;
 table_name | total_size | million_rows 
 sensor_log | 12 GB      |          100

With the help of a couple administration functions, we can focus on any tables that are immediate outliers. In this case, the only criterion that really matters is that we want to see any tables with more than 10-million rows. Why that particular number? Tables that cross this threshold often continue to grow relatively quickly due to a consistent data stream. This volume tends to increase over time, and large tables face a number of problems.

Consider the sensor_log table as a basis for analysis. At 12GB, it probably fits in memory fairly easily. In addition, it’s extremely narrow, having only four columns. Yet even with this going in our favor, anything that needs to read the entire table will encounter occasional issues. Specifically:

  • Rebuilding a single index takes 2-5 minutes on a completely idle server with no memory pressure. Change any of these variables and the rebuild time can increase dramatically.
  • Directly vacuuming takes 1-2 minutes because it uses all available IO. Autovacuum, however, halts frequently to prevent interfering with normal operations and can persist for hours depending on activity. PGDB tables must be vacuumed to reclaim dead rows because of MVCC, and those reusable rows are not available until the vacuum is complete. As the table size increases, so does the time required to vacuum it, and thus the chance of table or index bloat, since rows are not recycled as frequently.
  • Large bulk inserts, updates, or deletes exacerbate the above concerns since they are likely to trigger auto-vacuum or auto-analyze.
  • Risk of process interruption increases. Dumping or importing larger tables is generally safe, yet the costs of canceling, resuming, or retrying are significantly higher.

I’ll illustrate some of these points with a couple of anecdotes.

I frequently work with ridiculously over-provisioned systems. One of these was completely PCIe NAND-based storage with over 128GB of RAM. Even so, we had a single 500GB table with over 250M extremely wide rows based on interactions with an upstream service exchange. Occasionally we would run maintenance where large chunks of the table would be exported to disk, archived, and removed. After that process completed, it wasn’t uncommon to dump the table, re-import it, and then rebuild the indexes.

Even after reducing the table by 50%, simply rebuilding the indexes required at least an hour for each one. For one particular composite index on a couple of VARCHAR columns, it was closer to two hours. That was as fast as the machine could perform the task, despite taking several extreme steps to reduce these delays and the related downtime.

  • We increased maintenance_work_mem to 1GB to match the PGDB base file size.
  • We pointed base/pgsql_tmp at /dev/shm, a shared-memory RAM drive.
  • We wrote a custom process to rebuild the indexes in parallel.

None of this was enough. After including verification steps, we could not reduce the outage window below two hours. All because a single table acted as a single entity comprised of a vast amount of data. Physics has its limitations, and processing 250-million of anything requires vast resources. If the table was partitioned by month, we could have made better use of the idle processors and reduced the downtime to a mere twenty minutes.

Another more recent instance illustrates the risks associated with bulk loading large structures. As is sometimes common with bulk data, a large volume of data is imported, and to prevent build-up, older data is sent to an archive server and deleted locally. The problem for this second example is sheer volume. Every single day consists of over 100-million data points.

The loading process itself is extremely efficient, making use of COPY for direct imports from the standard input stream, parallel threads, small batches, the works. Unfortunately the tables were originally designed for 1/10th that amount of data. Thus once the archival job starts, the real pain begins. Isolating the old rows is extremely slow. Removing them with DELETE is even slower. Then vacuuming a 500GB-1TB table continues to harass the storage subsystem.

What could we do instead? Partition. Then the archive process would look like this:

  1. Find oldest partition.
  2. Obtain the contents and send them to the archive.
  3. Drop the partition.

No slow DELETE. No VACUUM. And instead of a SELECT statement to retrieve the rows, we can just use COPY to consume the entire partition. Revising the table architecture in this case prevents several fruitless hours of needless disk churning and associated risks of delaying dependent jobs.
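Scripted, those three steps can be as small as this dry-run sketch (the archive path, CSV format, and helper names are assumptions of mine):

```python
def archive_statements(partitions, archive_dir):
    # Pick the oldest yearly partition by the trailing year in its
    # name, then emit the COPY/DROP pair. Nothing touches a database
    # here; a wrapper would execute these against the server.
    oldest = min(partitions, key=lambda p: int(p.rsplit('_', 1)[-1]))
    return [
        "COPY %s TO '%s/%s.csv' WITH (FORMAT csv);"
            % (oldest, archive_dir, oldest),
        "DROP TABLE %s;" % oldest,
    ]

for stmt in archive_statements(
        ['sensor_log_part_2014', 'sensor_log_part_2012',
         'sensor_log_part_2013'], '/backup'):
    print(stmt)
```

Compare that to hours of DELETE and VACUUM churn on a monolithic table: the DROP is nearly instantaneous, and the COPY reads the partition sequentially.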

One of the major benefits of PGDB partitioning is to make use of constraint exclusion. This allows the query planner to only consider partitions that have a WHERE clause matching the partition constraint. This approach can vastly improve query performance in some cases, especially where sequence scans are concerned. Instead of scanning hundreds of millions of rows, the database might only need to touch some fraction of that amount. There’s no other way to have a “partial” sequence scan.

But partitioning is so much more versatile than simply reducing query costs. As above, maintenance considerations can play a major role in determining partitioning candidates. Even if a table simply accumulates rows day after day, and no process ever updates or deletes anything for the duration of its existence, partitioning is still a good idea. Why?

Statistics. In order to make informed decisions regarding indexes, looping, and so on, PostgreSQL builds statistics for tables during the ANALYZE process. The default_statistics_target setting has a maximum value of 10,000. This means that no matter the size of the table, the database will never sample and calculate more than 10,000 values per column. Because of the costs involved with tabulating these statistics, I rarely set this over 1000, and query plans are generally fine with at least 100 samples.

For large tables however, this becomes more important. With only 1000 samples potentially representing billions of rows in a single table, the representative capability of that data becomes diluted and inconsistent. If a large table is loaded during the day, for instance, the amount of data for the current day is vastly outweighed by past data. This can cause queries that only need current data to be underestimated, and then the database uses a nested loop because it thinks there are only ten results, when there are really two million.
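To put rough numbers on that dilution (the row counts below are illustrative assumptions, not measurements from this article): if 2-million of 100-million rows belong to the current day, even a generous sample barely sees them.

```python
# Back-of-the-envelope view of statistics dilution in a large table.
total_rows = 100 * 10**6   # rows in the monolithic table (assumed)
todays_rows = 2 * 10**6    # rows loaded today (assumed)

# Expected number of "today" rows ANALYZE would see at two
# statistics targets:
expected_hits_1000 = 1000 * todays_rows / float(total_rows)
expected_hits_100 = 100 * todays_rows / float(total_rows)

print(expected_hits_1000)  # 20.0
print(expected_hits_100)   # 2.0
```

Twenty samples, or two, is very little for the histogram to represent the current day accurately, which is exactly how the planner ends up expecting ten rows and receiving two million.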

We’ve covered nested loops and how they can be problematic in the past. Looping through two million unexpected random reads is a worst-case scenario, but it happens every day due to inadequate statistics. If, however, current data were isolated to a more relevant partition, this scenario would be much less likely. Combined with constraint exclusion, query execution times can be reduced from several hours to a few milliseconds simply due to planning improvements.

And yet businesses continue to suffer with monolithic tables containing billions of rows, wondering why everything is so slow. They complain that it takes hours, or even days, to rebuild certain tables due to excessive turnover causing runaway bloat. They throw faster disks and more memory at the problem, confused at why that didn’t help.

It doesn’t have to be that way, if we use the tools we’re given.


Just Call Me Peter

August 8th, 2015 | Published in Contemplation | No Comments

I just realized I am a victim of the Peter Principle.

Ever since I can remember, I’ve been a very quiet and withdrawn person. When people see that, they need to assign a cause. Well, if someone isn’t talking, they must be listening or thinking. If they think a lot, they must be smart. So every adult I ever met when I was a child treated me that way. Of course, I was then pressured to push myself—to fulfill their expectations.

But through life, I’ve always seemed to hit a skill ceiling pretty consistently. I’m pretty good at math, and slept through calculus, but my early career goal of working in Quantum Physics hit a wall. I’m really good at solving formulae, but I’m terrible at coming up with them. That made a career in math or physics pretty much impossible. So I gave up and went into computers instead. But even there, I’m only mediocre. I wrote successful software, but could never comprehend graphics. So I couldn’t make games, and back then, there were no game engines. I’m a fairly knowledgeable PostgreSQL DBA, but its database source code blew my mind, so I can’t contribute to it.

Of course, then I feel like a failure for not living up to the standards I had when I was younger. Everyone said I was a prodigy, and I’d be making the things that change the world. Wasn’t I some kind of genius? Well, it turns out I never was, but I’m a passable mimic. I drank the Kool-Aid my whole life and never stopped to question it. I kept getting pushed ahead until I reached a level well beyond my actual ability, where I quickly retreated.

I always felt I was a disappointment because I’d wasted my “gift.” But I was never gifted. At most, I’m above average. So having an average life with average expectations should be the normal result. Why have I been beating myself up over this for so long? I don’t suck because I’m not Dolph Lundgren. I’m not a failure simply because I’m not Notch. So why have I been mentally telling myself that for the last 20 years?

Don’t mind me. I’ll just be over here enjoying my success for once.


There Will Never Be World Peace

August 7th, 2015 | Published in Contemplation | No Comments

Humanity simply doesn’t have a brain capable of it.

People can only mentally relate to a certain number of people or ideals. Anything else becomes foreign. Our animal brains see foreign things as potential threats, and put up guards. Suddenly, a person starts to wonder: why don’t those other people do it the way we do? The ‘it’ here can be anything: religion, politics, pick any subject. That question evolves into a disagreement, then an argument, then a schism, and so on.

Within a population where that is happening, there will be a lunatic fringe. This fringe will take that idealism to the extreme, and as power shifts through time, eventually will have their turn at the wheel. In power, those former disagreements will turn into demands, for rectification or subjugation. Eventually, it will become war. This war can be within the power structure of an online forum, up to clashing nations. It’s inevitable.

Is there an existing evolutionary pressure to push our species away from this tendency? If not, and I don’t believe there is, we will continue to fight amongst ourselves forever in one aspect or another.

As demonstration, disagree with this assessment. Consider your amount of zeal, and realize it varies among people. How might they react, given enough impetus? Some would disagree by yelling or striking out through social, legal, or physical means. Even just one with enough power in social or legal contexts could exact revenge. Or perhaps someone else gains enough followers to shout this article down, or have it banned.

Driven far enough, someone will eventually make it personal and direct their attention to the author. The author and their supporters will become the new enemy. Again, various attacks would be used to force the article down, or have adherents socially abused, arrested, or ultimately killed.

Writ large, we see this happen every single day even within otherwise genial groups. Our inability to recognize it within ourselves is our downfall. It becomes Democrats versus Republicans. It becomes Pro Choice versus Right to Life. It becomes Free Speech versus Social Justice. It becomes “us” versus “them”.

We are simply too primitive to do otherwise.

Humanity simply isn’t wired for peace and understanding. We’re too easily frightened, too subjective, too aggressive. Too many base, animal instincts we can’t consistently override.

The world will continue to remain turbulent, and anyone who says otherwise is full of lies.


PG Phriday: The Case for Partitioning

August 7th, 2015 | Published in Database, Tech Talk | 4 Comments

In the next few weeks, I’m going to be pushing a long discussion regarding PGDB (PostgreSQL) table partitioning. I’ve covered it in previous articles, but only regarding basic performance considerations. That’s a very limited view of what partitioning can offer; there’s a lot more variance and background that deserves elucidation.

So for the next few articles, the topic of discussion will be partitioning. There’s not really enough of it, and a lot of the techniques used in the field are effectively pulled straight from the documentation. I think we can go much further.

To that effect, this is the table structure all subsequent articles will reference:

CREATE TABLE sensor_log (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       VARCHAR NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);
CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

Very large, continuously growing tables tend to be great candidates for partitioning. Further, we can explore several different techniques and cover various considerations related to expected volume. To help that along, I threw together a basic Python loader script that both bootstraps the partition data, and can profile timings for analysis purposes.

To keep things interesting, the values are set to fill the table with 100M rows. The fact this is scripted helps simulate a process that might actually load data in a production environment. This allows testing basic insert speed with and without partitioning, as the size of the table (and indexes) increases over time.

# Set these to modify how large the COMMIT blocks will be,
# and how many days the data will represent.
# Total row count = chunk_size * date_range.
chunk_size = 100000
date_range = 1000
# Main program body
import psycopg2
from StringIO import StringIO
from datetime import datetime, timedelta
raw_data = StringIO()
db_conn = psycopg2.connect(database = 'postgres', user = 'postgres')
cur = db_conn.cursor()
print 'rows,seconds,microseconds'
for j in xrange(0, date_range):
    raw_data.seek(0)
    raw_data.truncate()

    stamp = datetime.now() - timedelta(days = j)

    for i in xrange(1, chunk_size + 1):
        stamp -= timedelta(seconds = 5)
        raw_data.write(
            '%s\t%s\t%s\n' % (i % 1000, i % 100, stamp)
        )

    raw_data.seek(0)
    t1 = datetime.now()

    cur.copy_from(
        raw_data, 'sensor_log', '\t',
        columns = ('location', 'reading', 'reading_date')
    )

    t2 = datetime.now()
    diff_s = str(t2 - t1).split(':')[2]

    print '%s,%s' % (
        (j + 1) * chunk_size, diff_s.replace('.', ',')
    )

    db_conn.commit()

Before we get to any partitioning discussion, using this script as a baseline, loading the base sensor_log table takes about 45 minutes on my testing VM. Note that we didn’t cheat by loading the table without indexes as we’ve done previously. Existing tables won’t have that luxury, so for the partitioning discussion, we can ignore it for now.

So, what are the approaches we can cover?

  • No partitioning at all.
  • Use the technique discussed in the PostgreSQL documentation (triggers).
  • Use the base table as current data, with partitions as older, archived information.

And knowing the techniques, what do we examine?

  • Statement speed.
  • Potential overhead.
  • Maintenance concerns.
  • Implementation caveats.

Partitioning is a hairy procedure. No matter what method is chosen, the result will have definite performance and management issues. The idea is to choose a good balance that addresses current needs, based on volume. I plan on being as comprehensive as possible to advocate partitioning without simply being a cheerleader.

Due to the 100-million row sample size and the associated 13GB of space needed, the time scale for even the non-partitioned case should make everything obvious. Unfortunately it also greatly inflates the time required for any statement tests, and bootstrapping the tables themselves. I think that effort is warranted though, especially because projects that are likely to benefit from partitioning will likely exceed even that scale.

So think about partitioning a bit, because I’m about to go into some crazy depth about it, and hopefully make a case for using it early and often.

