PG Phriday: When Partitioning Goes Wrong

October 2nd, 2015 | Published in Database, Tech Talk | 2 Comments

I’ve been talking about partitions a lot recently, and I’ve painted them in a very positive light. Postgres partitions are a great way to distribute data along a logical grouping, and they work best when data is addressed in a fairly isolated manner. But what happens if we direct a basic query at a partitioned table in a way that ignores the allocation scheme? Well, what happens isn’t pretty. Let’s explore in more detail.

Let’s use the fancy partitioning I introduced a while back. Using this structure, I ran a bunch of tests with a slight modification to my python script. I changed the reading_date granularity to daily, and had the script itself create and destroy the partitions. This allowed me to run several iterations and then meter the performance of queries on the resulting output.

The relevant python chunk looks like this:

# Daily granularity: each partition covers [stamp, stamp + 1 day).
end_stamp = stamp + timedelta(days = 1)

part = ''
if j > 0:
    part = '_part_%d%02d%02d' % (stamp.year, stamp.month,

    cur.execute("DROP TABLE IF EXISTS sensor_log%s" % part)

    cur.execute(
      "CREATE TABLE sensor_log%s (" % part +
      "  LIKE sensor_log INCLUDING ALL," +
      "  CHECK (reading_date >= '%s' AND" % +
      "        reading_date < '%s')" % +
      ") INHERITS (sensor_log)"
    )

With that, the script will assume daily granularity and build the check constraints properly so constraint exclusion works as expected. To get an idea of how partitions scale with basic queries, I ran the script with 10, 20, 50, 100, 150, and 250 days. This gave a wide distribution of partition counts and the numbers I got made it fairly clear what kind of drawbacks exist.
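
With daily granularity, the DDL the script generates for a single day ends up looking roughly like this (the date is just an illustrative placeholder):

CREATE TABLE sensor_log_part_20150901 (
  LIKE sensor_log INCLUDING ALL,
  CHECK (reading_date >= '2015-09-01' AND
         reading_date < '2015-09-02')
) INHERITS (sensor_log);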

All tests on the partition sets used these two queries:

-- Check 100 distributed values in the partitions.
SELECT *
  FROM sensor_log
 WHERE sensor_log_id IN (
         SELECT generate_series(1, 10000000, 100000)
       );

-- Check 100 random values in the partitions.
SELECT *
  FROM sensor_log
 WHERE sensor_log_id IN (
         SELECT (random() * 10000000)::INT
           FROM generate_series(1, 100)
       );

The goal of these queries is to stress the query planner. I’ve omitted the reading_date column entirely so no partitions can be excluded. Since the primary key isn’t how the tables are grouped, a random assortment of keys must be retrieved from all partitions.
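
For comparison, constraint exclusion only helps when the partitioned column actually appears in the query. A quick sanity check with EXPLAIN looks something like this (a sketch; the dates are placeholders, and constraint_exclusion defaults to 'partition' in modern versions):

-- Exclusion works: only one child partition shows up in the plan.
EXPLAIN
SELECT *
  FROM sensor_log
 WHERE reading_date >= '2015-09-01'
   AND reading_date < '2015-09-02';

-- No exclusion possible: every child partition must be scanned.
EXPLAIN
SELECT *
  FROM sensor_log
 WHERE sensor_log_id = 100;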

Thinking on this for a moment, we might expect query execution time to increase linearly with partition count. The reason is fairly simple: each child partition is addressed independently. But we also need to take the query planner itself into account; as the number of partitions increases, so does the number of potential query plans.

Do the tests bear that out? These are the results obtained from a VM. All plan and execution times are in milliseconds, and in every case, the entry is the best time obtained, not an average.

Partitions | Plan 1 | Exec 1 | Plan 2 | Exec 2 | Diff Plan (x) | Diff Exec (x)
---------- | ------ | ------ | ------ | ------ | ------------- | -------------
         1 |  0.151 |  0.450 |  0.174 |  0.480 |           N/A |           N/A
        10 |  0.539 |  2.298 |  0.753 |  2.816 |           4.0 |           5.5
        20 |  0.890 |  4.655 |  1.384 |  5.006 |           6.9 |          10.4
        50 |  2.320 | 13.712 |  2.207 | 14.120 |          14.0 |          30.0
       100 |  6.050 | 32.822 |  6.777 | 41.311 |          39.5 |          79.5
       150 | 10.088 | 48.535 | 10.594 | 52.406 |          63.8 |         108.5
       250 | 24.267 | 81.278 | 28.226 | 81.618 |         161.5 |         168.3

If we examine the contents of this table, a few things pop out. First, partitions definitely decrease performance when the partitioned column isn’t included in the query. We already knew that, so what else is there? The two Diff columns express the degree of slowdown: each is the average, across both queries, of how many times slower planning or execution was compared to the single-partition baseline. Here’s a graph of just those two columns:


There is some variance, but the degree of slowdown is not linear. As the number of partitions increases, the divergence from the baseline accelerates, and it appears there’s a point where the planner simply falls apart. A more in-depth analysis could probably reveal where that happens for these queries, but it would be different for others. Consider the planning column: ten partitions plan roughly 4x slower than a single table, so a linear progression would put 250 partitions (25x more) at about 100x, yet we actually measured over 160x. At 150 partitions, the linear estimate of 60x is still reasonably close to the measured 64x.

What this means in the long run is that query plans cost time. As the number of partitions grows, we should expect planning time to rise even faster. Times are also additive: at 100 partitions, total query time (about 6.1ms of planning plus 32.8ms of execution, versus 0.15ms plus 0.45ms for a single table) is over 60x slower than no partitions at all.

The lesson here isn’t to avoid Postgres partitions entirely, but to use them sparingly and in ideal circumstances. If data can’t be delineated along specific groupings and addressed by individual segment, it should not be partitioned. In other words, if the column used to split up the data isn’t in almost every query, it’s probably the wrong approach.

There is of course, a caveat to that statement: parallelism can address part of this problem. Postgres developers have been working hard on adding parallelism to the base engine, and partitions are a natural target. It would seem possible to launch a separate backend process to obtain data from each segment independently, and aggregate it at the end. This would massively reduce query execution time, but query planning time would likely increase to accommodate the extra functionality.

Yet I’ve heard people are looking into fixing how the planner handles partitions, so that too could eventually decrease substantially. In the end, I stand by partitions simply because they’re useful, and I’ll continue to recommend them. But like all tasty things, this should come with a grain of salt; it’s easy to partition something that shouldn’t be, or do so on the wrong vector, and suffer the consequences.


PG Phriday: Database Infrastructure

September 25th, 2015 | Published in Database, Tech Talk | 2 Comments

This PG Phriday is going to be a bit different. During my trip to Postgres Open this year, I attended a talk I had originally written off as “some Red Hat stuff.” But I saw the word “containers” in the PostgreSQL in Containers at Scale talk and became intrigued. A few days later, I had something of an epiphany: I’ve been wrong about servers for years; we all have.

That’s a pretty bold claim, so it needs some background.

Fairly often in infrastructure and development, we see servers like this diagram:


But what’s the most important thing in the diagram? Can you even tell? Well, everything except the AppData component can be replaced. The primary goal of highly available design is building an architecture to ensure this is the case. I don’t care where the data lives, or what server is hosting it, so long as certain performance metrics are supplied by the underlying hardware. But from that diagram, the most important element would appear to be the servers themselves.

This kind of thinking is a trap, and one pretty much everyone I’ve met in the industry is guilty of falling into. In a two-server Postgres stack, where one server is the primary database and the other is a replica, we tend to gravitate toward preserving the original 1->2 orientation. This may be the case even if retaining this arrangement requires a second failover after an outage or some kind of maintenance. For some reason, the physical server gets the focus. How many times have you seen this? I can’t even count the number of times I’ve encountered servers named after their primary role as a database host, yet I have to physically log in to find out what the databases are. Sure, sometimes there’s a Wiki page of some kind outlining the data to host relationship, but is that really putting the data first?

That’s what I mean when I say we’ve all had it wrong. This obviously isn’t absolutely true, but the prevailing momentum still treats servers as important. They’re not. They’re a medium for whatever is running on them. Businesses with dozens or hundreds of servers and VMs already realize abstraction is necessary, but even here, the focus is misplaced. How many companies use Chef, Puppet, Ansible, Bcfg2, or some other tool to manage server configurations? Every one of these systems needs a master list of servers. Each server is given a semi-permanent profile.

Coordinating servers en masse with such tools is much easier, yet my AppData is still on PGDB1, and my Java app is still on JSRV1 through JSRV34. These are things we have to know, because deployment tools need them, and ad-hoc monitoring views need them, and we need to know why JSRV14 fell over and died. We have spare servers, so we create JSRV35. After months or years of this, the server names resemble an old blanket, full of holes and arbitrary assignments. I need to check on FooDB. That’s on PGDB1 and PGDB8, right? Why does the dev staff, DBA staff, or even the infrastructure team need to know any of that?

What I would love to see is something like this:


What server is the data being hosted on? It doesn’t matter. How do I get to the system for diagnostics or forensics? An alias or a dashboard. Can you at least tell me where the data is? A SAN. How do I get to the Postgres database? An alias. What alias? How about appdata-prod?

But why? What does this design provide that actually improves anything? That’s subjective to a degree, but some of the major benefits are centered around management.

  • Is an assigned server misbehaving? Cycle it back into the pool and allocate a replacement.
  • Is an application repeatedly having resource problems? Change the requirement list and reprovision.
  • Are there persistent issues even after replacing servers? Investigate and modify the software or base container and test in iterations until a stable profile emerges. Then reprovision.
  • Did the database fail over? Allocate a replacement replica and investigate the cause on the original primary. If it’s not hardware related, recycle it into the pool or deallocate the resources.
  • How do I clone FooDB? Press the “Clone FooDB” button.

By advocating a transient nature, we can focus on data and application roles, and not where they live. What do we want from our Postgres database? The data itself, necessary extensions, good performance, and the ACID guarantee. None of that requires us to know FooDB is on lp4cc1pgsql12 today and lp4cc3pgsql8 tomorrow. We should never have to type that, look it up, or even know it exists.

Yet it’s easy to say something should exist. How do we actually get there?

I’ve been watching things like Docker and LXC for a while, but without a management system, they’re effectively just low-overhead VMs. In some cases, they’re actually worse or even useless for databases due to how they handle disk storage. Things like Kubernetes persistent volumes help solve that issue, though there is inherent overhead caused by the fact the storage is considered external and must be accessed through network traffic. For larger servers, this is almost expected; how many systems handle storage with a SAN and several LUNs? Local storage is almost exclusively used for the operating system these days, especially in the era of VM penetration.

But again, that’s too low-level. Creating and managing containers isn’t really that different from doing so with VMs, except that it’s probably a different set of tools. Containers negate the requirement for configuration management, but now we need to coordinate container profiles and instances. Moving from bare hardware and VMs to containers just moves the problem. We need a hardware mapping system, a management console and tooling, and a bunch of other stuff to really make containers work at a higher level.

This is starting to dangerously resemble a corporate cloud infrastructure, because it is. While solutions like Heroku and AWS are great for external-facing applications and databases, quite a few companies build internal software that is never used by anyone outside of the organization. In those cases, all data is internal, the platform is local, and latency is necessarily low. If software needs to churn through vast amounts of data on a parallel stack, or the Postgres database is 30TB of distributed shards, that’s a lot of servers to manage.

I manage over 100 Postgres instances, and 80% of those are replicas that could be replaced by containers. I wrote a Django application to handle synchronization, DR failover needs, and basic provisioning, yet none of that code should have been necessary. In a container world, a clone is allocated when the container is instantiated. If it falls out of sync for some reason, it gets recycled into the pool and reallocated to fit the requirement of having thirty of them. A hyper-elastic system like this is much harder to kill, gives better uptime, and even addresses horizontal scaling concerns more consistently.

Given all of this, it should be no surprise that several management platforms have emerged to remove the suck-factor from handling containers. The Postgres Open talk was about Red Hat OpenShift, but there’s also Pivotal Cloud Foundry and Apache Stratos. Why single these out? The distinction is that all of them are considered Platform as a Service (PaaS) instead of Software (SaaS) or Infrastructure (IaaS). Most existing corporate cloud systems are IaaS, making it easy to quickly deploy a lot of VMs. By adding the platform layer, devs and DBAs alike can spin up the full stack necessary to implement an entire platform, consisting of any number of software or hardware prerequisites.

If a trading app needs Django on top of nginx serving from a local clone of the current market symbol pricing database, that’s an easy deployment. Even if there needs to be 50 of them, the Postgres database cloning and replication process would be the same. Nobody has to build that stack. Nobody has to modify the configuration management system to install system packages and add the new server to the deployment tool, then cut and paste from a Wiki to install the actual app and launch everything in the proper order. The fact I even wrote that sentence says a lot about how servers are handled in the current infrastructure climate at many organizations.

The role of Postgres in this is simple: be Postgres. As just another part of the stack, with built-in replication capabilities and a transient nature tied to the PGDATA directory, it already matches this model. Our Postgres data needs a place to live, and a place to execute. SAN hardware tends to provide a lot of extended functionality like deduplication, snapshots, and CoW, all of which greatly simplify data cloning and deallocation. Containers tied to such data clones are equally disposable, and make that data available to applications and end users without much regard to the underlying hardware.

I want to care about Postgres and my data, not how I get to it. From now on, my primary goal at work will be to make this a reality. Even if that doesn’t mean a PaaS deployment, virtual host names and tools like pgpool can do some of the heavy lifting to hide physical servers. We all do too much work to have to remember a bunch of arbitrary server names, so why do it at all?


PG Phriday: The Bones of High Availability

September 18th, 2015 | Published in Database, Tech Talk | 1 Comment

Well, the bell has tolled, the day is over, and at the end of it all, Postgres Open has ended its fifth year in service of the community. I will say it was certainly an honor to speak again this year, though now that it’s not conveniently in Chicago, I’ll have to work harder to justify hauling myself across the country next year. Of course at this point, I’d feel guilty if I didn’t at least try, assuming any of my submissions are accepted. :)

Given that the conference has ended, I would be remiss if I didn’t post my slides. The official location on the PostgreSQL Wiki is a start, but I have a website, so I might as well use it. So if you want to view my presentation directly, there are two ways:

What was the presentation about? If you believe Gabby’s tweet, it was about puns. That’s not too far from the truth, but despite my propensity for wordplay, the actual topic focused on (what else?) Postgres high availability. By starting the journey with a single server, I discuss how each additional server on the stack can reinforce the (spooky) skeleton of a successful database architecture. In the process I also share a couple of the sobering disasters that probably shaved a few years off of my life due to some level of insufficient paranoia.

High availability really is the result of a cost-benefit analysis: how much downtime costs the company versus the expense of hardware and space in a data center. With Postgres, there are a lot of ways to leverage built-in features and take advantage of its underlying capabilities in avoiding such scenarios. This isn’t like the year I built a DRBD + Pacemaker + Postgres stack live on stage, so don’t be afraid to read through it. If nothing else, the slides are good for a few laughs.

Hope to see you at Postgres Open 2016!


PG Phriday: Dealing With Table Bloating

September 11th, 2015 | Published in Database, Tech Talk | 1 Comment

Most Postgres operators and informed users are aware that Postgres uses MVCC for storage. One of the main drawbacks of this versioning mechanism is related to tuple reuse: dead rows left behind by updates and deletes can’t be reused until VACUUM completes a pass over the table. Unfortunately, for larger tables that isn’t always enough to keep bloat under control. How so?

If a large table needs to have a calculated column added, or some other bulk query updates a large portion of its content, a large fragment of the table is now empty space. This can be painful in a warehouse scenario, but it can be even more disruptive if a table that once fit in memory is now causing endless disk paging on a transaction-heavy system. It can mean the difference between a server being driven into the ground by IO requests, or one that runs smoothly with an order of magnitude more requests.

Take, for instance, a product mapping table:

CREATE TABLE product_map
(
    map_id        SERIAL  PRIMARY KEY,
    product_id    INT     NOT NULL,
    vendor_id     INT     NOT NULL,
    product_code  TEXT    NOT NULL
);

INSERT INTO product_map (product_id, vendor_id, product_code)
SELECT, b.id, 'P' || abs(100000 -
  FROM generate_series(1, 100000) a(id),
       generate_series(1, 10) b(id);

ANALYZE product_map;

This architecture is fairly common when mapping vendor product codes to internal tracking numbers. The approach insulates us from vendor dependency and from collisions caused by two vendors using the same ID system. These tables don’t tend to be very large, and here is how Postgres sees ours:

SELECT pg_size_pretty(pg_relation_size('product_map'));
 50 MB

In this case, we have a million mappings with ten vendors and a made up string product code. Imagine for a moment that a few vendors changed their code systems, or we made a mistake loading the table. Now we need to update a large portion of the table. To simulate this, here’s an update statement that will modify half of the rows to have different product code values. We’ll also check the size of the table following the update.

UPDATE product_map
   SET product_code = 'D' || abs(50000 - product_id)
 WHERE vendor_id > 5;
SELECT pg_size_pretty(pg_relation_size('product_map'));
 75 MB

This is where our story begins. The table is now 50% larger than before, but contains the same number of rows. This won’t affect index usage, but any query that reads the entire table is in trouble. Not only does this slow down VACUUM, but even something as simple as a daily report would have to work that much harder to summarize the contents of this table. Ad-hoc jobs will also take longer to complete.
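
The dead space is visible directly through the standard statistics views. A quick sketch (autovacuum may have already cleaned up some of this by the time it runs):

SELECT n_live_tup, n_dead_tup,
       round(100.0 * n_dead_tup / (n_live_tup + n_dead_tup), 2) AS pct_dead
  FROM pg_stat_user_tables
 WHERE relname = 'product_map';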

This is fine with our measly million row table, but imagine a content map instead of a product map. Every book, song, movie, or piece of media supplied by an outside source could have multiple entries in this table. That inflates our row count into the tens or hundreds of millions. If 50% of that were empty space due to a bulk operation, we could be in trouble.

This kind of thing happens every day. If a company doesn’t have a Postgres DBA, it might take them by surprise when a nicely running system becomes a stuttering monstrosity because they changed some data the previous night. Not only is VACUUM using disk cycles trying to keep up with maintaining a multi-million row table, but the table no longer fits in memory and is causing massive IO thrashing.

The only permanent escape is to fix the table. There are two mainstream ways to accomplish this that are supported by SQL syntax. We can either CLUSTER the table by reorganizing it by one of the indexes, or use VACUUM FULL to rebuild it. Here’s what the table looks like after using CLUSTER on it:

CLUSTER product_map USING product_map_pkey;
SELECT pg_size_pretty(pg_relation_size('product_map'));
 50 MB
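
The VACUUM FULL route mentioned earlier reclaims the same space, just without reordering rows by an index. A sketch, assuming we start from the same bloated 75 MB table; the size should land back near the original 50 MB:

VACUUM FULL product_map;
SELECT pg_size_pretty(pg_relation_size('product_map'));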

There, back to normal! Sadly there are some major caveats to using either approach:

  1. Both methods require an exclusive table lock for the duration of the process.
  2. Both are serial processes.

This is primarily a concern because such an exclusive lock prevents even reading the table until the command completes. If there’s no available maintenance window to accommodate potentially hours of locking a table, that’s a huge complication. That drawback alone could prevent fixing the table in all but the most dire of circumstances. This also makes Postgres look bad to anyone who just wants something that works; MySQL doesn’t have such a caveat after all.

This is all related to how the commands actually function. Here’s a quick summary of what’s going on:

  1. Create a copy of the table as an empty framework.
  2. Copy the active table rows from the old table.
  3. Create each index.
  4. Atomically rename the table to replace the old one.

They do all that work behind the scenes so it resembles a single operation. Most of these steps are either destructive or modify the underlying table structure in some way. So long as the table is in flux, it’s not safe to use it. So, we wait. We wait for the rows to be copied. We wait for the indexes to be recreated based on the new page locations. We wait.

The problem is that the index creation step is serial. If there were six indexes, Postgres would dutifully create each, one after the other. That’s something that could be done in parallel, but not in current versions of Postgres. So if we have a table with one hundred million rows and four indexes, that could be several hours of rebuilding, during which, the table is completely unusable.

Enter pg_repack, a fork of the older pg_reorg project. It boasts the ability to reorganize a table without holding an exclusive lock for the duration of the process. As an added benefit, it can rebuild indexes in parallel, drastically reducing the time spent on this step. Now we have an external tool that not only allows us to do maintenance after bloating a table, but finishes more quickly too.

This is particularly relevant with operations that can’t be streamlined. The UPDATE statement we used might be followed by an INSERT, for example. Well, VACUUM can’t run inside a transaction block, so we would have to fragment the job into two separate transactions separated by an explicit VACUUM, as sketched below. Yet the very nature of an RDBMS makes this an undesirable approach. Transactions and atomic operations protect data integrity, yet we have to manually circumvent this in order to avoid unnecessary table bloat.
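
A minimal sketch of that fragmented approach, using a purely hypothetical follow-up INSERT for illustration:

BEGIN;
UPDATE product_map
   SET product_code = 'D' || abs(50000 - product_id)
 WHERE vendor_id > 5;
COMMIT;

-- Reclaim the dead tuples so later statements can reuse the space.
VACUUM product_map;

BEGIN;
-- Hypothetical follow-up write; it now fills the freed space instead
-- of extending the table.
INSERT INTO product_map (product_id, vendor_id, product_code)
VALUES (100001, 11, 'D100001');
COMMIT;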

Having a non-invasive maintenance tool like this is critical to Postgres. So critical that whatever magic it uses should be part of the core engine. MVCC, despite its benefits, makes maintenance an integral part of operating a Postgres database. When performing that maintenance becomes invasive, disruptive, or requires circumventing normal operation, people notice. It’s one of the few remaining hurdles to mass adoption Postgres still faces.

Without a Postgres DBA on staff to get a project beyond these issues, the perception becomes that Postgres is a pain in the neck to use. Unfortunately, Postgres DBAs are in short supply. That is, after all, why I’m writing these articles.


PG Phriday: Postgres as Middleware

September 4th, 2015 | Published in Database, Tech Talk | 3 Comments

One of the cool things I like most about Postgres is that it’s probably the most inclusive database software I’ve ever encountered. It’s so full of features and functionality these days, it’s practically middleware. Almost anything plugs into it, and if it doesn’t, there’s usually a way to make it happen.

Want a demonstration?

SciDB is often used for large analytical data warehouses. They even use Postgres for metadata storage. Despite this, they still haven’t written a foreign data wrapper for back-and-forth interaction. But they have written a Python API. So what can we do with it?

This isn’t a SciDB tutorial, so let’s just create a very simple data array with one attribute over one dimension. Nothing crazy here:

CREATE ARRAY thing <msg:string> [id=0:10,10,0];
store(build(thing, 'Thing ' + string(id)), thing);
{id} msg
{0} 'Thing 0'
{1} 'Thing 1'
{2} 'Thing 2'
{3} 'Thing 3'
{4} 'Thing 4'
{5} 'Thing 5'
{6} 'Thing 6'
{7} 'Thing 7'
{8} 'Thing 8'
{9} 'Thing 9'
{10} 'Thing 10'

The hardest part about accessing this data via Postgres was actually getting the SciDB python driver to work. Instead of directly utilizing their own client libraries, it obtains data through an HTTP proxy they call shim. Once that was done, I just had to find the right calls in the SciDB-Py documentation and refresh my memory on PL/Python.

Given all of those parts, our whole stack resembles something like this:

  1. SciDB
  2. Shim
  3. SciDB-Py
  4. PL/Pythonu
  5. Postgres

If there were an actual foreign data wrapper, we could use it to replace the middle three layers. Until then, that’s the overhead involved. And here is all of the magic for viewing that external data:

CREATE TYPE scidb_thing AS (
  id   INT,
  msg  TEXT
);

CREATE OR REPLACE FUNCTION get_scidb_things()
RETURNS SETOF scidb_thing
AS $$
  import numpy
  import scidbpy as scidb

  sdb = scidb.connect('http://localhost:8080')

  for pair in sdb.wrap_array('thing').tosparse():
    yield pair
$$ LANGUAGE plpythonu;

CREATE VIEW v_scidb_thing AS
SELECT *
  FROM get_scidb_things();

This usage comes in two parts, with a third for the sake of convenience. We create a type because we want to interact with this data as if it were a table. Then we create a function that uses python to access the external data, and cast it into that type. The end result is that we can get the external data rather easily. With a bit more coding, we could make the function generic and use it to access any external SciDB array.
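
As a rough sketch of that generic direction, the array name can simply become a parameter. This version still returns the fixed scidb_thing type and assumes the same shim URL, so it only truly fits arrays with that shape:

CREATE OR REPLACE FUNCTION get_scidb_array(array_name TEXT)
RETURNS SETOF scidb_thing
AS $$
  import scidbpy as scidb

  sdb = scidb.connect('http://localhost:8080')

  # Wrap whichever server-side array was requested by name.
  for pair in sdb.wrap_array(array_name).tosparse():
    yield pair
$$ LANGUAGE plpythonu;

Calling SELECT * FROM get_scidb_array('thing') would then behave just like the original function.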

The view merely obfuscates the function call to emulate local usage. Let’s use the view and see what we get:

SELECT *
  FROM v_scidb_thing
 WHERE id = 9;

 id |   msg   
----+---------
  9 | Thing 9

Given a lot of time and inspiration, we could write a local Postgres API with the capability to create, fill, or otherwise treat any SciDB array like any other external data object. I would never advocate such an approach in this case simply because of the awful overhead. The above query required 300ms to execute in my VM, which is about 100x slower than directly querying SciDB. But this is also all of the old stuff! Imagine what we could do with more recent techniques like an FDW extension.

The thing about Postgres is that it’s a database, and an interface. I just used Python to give it access to data in another system without a native client library or extension. The fact that the whole process only took about an hour makes it even more ludicrous. For small to medium-sized objects, we could combine this technique with materialized views for speedy local access.
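
For instance, caching the remote array locally could be as simple as this sketch built on the view above; how often to refresh would depend on how frequently the SciDB array changes:

CREATE MATERIALIZED VIEW mv_scidb_thing AS
SELECT *
  FROM v_scidb_thing;

-- Re-pull from SciDB whenever the local copy should be updated.
REFRESH MATERIALIZED VIEW mv_scidb_thing;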

Really, the only thing preventing Postgres from doing something is your imagination. That’s pretty liberating.

