PG Phriday: Cluster Control

November 20th, 2015 | Published in Database, Tech Talk | No Comments

It has occurred to me that I may have been spending a bit too much time being excited about new Postgres features and developments in the community. One of the intents of this weekly series is education, so this week, let’s get back to basics. To that end, the topic boils down to the tools available for managing Postgres instances, and how to use them. Surprisingly, it’s not as straightforward as you might think.

Having used Postgres for almost 15 years now, it’s easy to forget (or mentally block!) the early struggles. With new tools and wrappers always being developed, it’s not always clear what the best practices for managing a Postgres cluster actually are. Indeed, it often depends on how Postgres is installed.

Let’s start with a basic source-code installation. I won’t cover that process, but we can assume that after the smoke clears, the binaries are somewhere in /usr/bin or /usr/local/bin, and are thus in our execution path. Given that, let’s say we have an available mount point at /data and want to create a new cluster there. Here’s how that might go:

sudo mkdir /data/pgsql
sudo chown postgres:postgres /data/pgsql
sudo su - postgres
initdb -D /data/pgsql/my_db
pg_ctl -D /data/pgsql/my_db start

We now have an active Postgres instance at /data/pgsql/my_db. It’s extremely common to use the postgres OS user for this kind of thing, hence all of our sudo commands to prepare. It’s entirely possible to do this as a regular user, but I usually don’t recommend that approach.

In any case, this type of installation essentially depends on the pg_ctl command-line tool. It does everything related to controlling a Postgres instance. But it’s also annoying to use the -D parameter all the time when using Postgres tools, so there are several environment variables that can also do the job. This lets us prime our environment with .bashrc, for example. Let’s stop the instance:

export PGDATA=/data/pgsql/my_db
pg_ctl stop -m fast

Why the “-m fast” part? By default, Postgres is exceedingly polite. If we had simply asked it to stop, it would patiently wait for any pending transactions to complete before stopping. But if there are several users connected, some may leave transactions idle, or a really long process may be running, or maybe we’re just in a hurry. Using a fast shutdown tells Postgres to abort any transactions and stop as soon as it can—safely, of course. There’s no data loss, but a few users or applications may grumble a bit.
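
For reference, pg_ctl understands three shutdown modes, and fast sits in the middle:

pg_ctl stop -m smart      # the polite default: wait for all clients to disconnect
pg_ctl stop -m fast       # roll back open transactions and shut down cleanly
pg_ctl stop -m immediate  # bail out without a clean shutdown; crash recovery runs on the next start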

At the system level, it’s very common for a global service manager to wrap much of this process. To start or stop Postgres on CentOS or RedHat, we’d do something like this:

sudo service postgresql-9.4 start
sudo service postgresql-9.4 stop

On a Debian or Ubuntu system, it would be this:

sudo service postgresql start
sudo service postgresql stop

And this is where things start to get complicated. These two major Linux flavors can’t even agree on what to name the service control mechanism, and they have very different views on handling multiple versions. Both, however, go to great lengths to “hide” the Postgres-provided binaries so that only the known wrappers can access them, unless we cheat and add the binary location directly to our PATH variable.

By and large, that isn’t actually necessary, but it does add a major complication: standardization. As in, there isn’t any. Depending on distribution, wrappers will take different parameters, install Postgres in varying locations, and have divergent control methods. Let’s dive a little deeper into Debian/Ubuntu and their slightly more comprehensive wrappers.

Ubuntu assumes any number of Postgres versions may be installed at any one time. As such, starting and stopping the postgresql service on an Ubuntu system will cascade that operation to any configured instances hosted on the machine. What if we only want to manage a single instance? First, we need to know which instances are even available. This is where the cluster tools come in.

Because a single Postgres instance can represent several individual databases, it’s not uncommon to refer to it as a cluster. Debian, and thus Ubuntu, took that concept and ran with it, producing something that’s actually pretty handy. Let’s get a look at a test system with multiple running clusters:

sudo su - postgres
pg_lsclusters

Ver Cluster Port Status Owner    Data directory               Log file
9.4 main    5432 online postgres /var/lib/postgresql/9.4/main /var/log/postgresql/postgresql-9.4-main.log

Neat, eh? The pg_lsclusters tool gives us a lot of information that’s immediately useful. With multiple clusters running on one system, we need to know which port each is running on. Knowing where the data resides, and where to find the logs, has obvious merit. And to differentiate each, the clusters are named. Let’s repeat that instance creation from above using a couple more wrapper tools:

pg_createcluster 9.4 my_db -D /data/pgsql/my_db
pg_ctlcluster 9.4 my_db start
pg_lsclusters

Ver Cluster Port Status Owner    Data directory               Log file
9.4 main    5432 online postgres /var/lib/postgresql/9.4/main /var/log/postgresql/postgresql-9.4-main.log
9.4 my_db   5433 online postgres /data/pgsql/my_db            /var/log/postgresql/postgresql-9.4-my_db.log

The pg_createcluster tool takes the place of initdb on Debian-based systems like Ubuntu. It’s aware of the other wrappers and recommended settings, locations, and so on. It also did us the favor of automatically incrementing the port so the clusters can run simultaneously. And pg_ctlcluster supplants pg_ctl for similar reasons. It should be fairly clear the tools are inherently designed such that the version is a required parameter. They’re pretty dedicated to the concept of hosting multiple clusters and Postgres versions on a single server.

Since they’re wrappers, they also accept many of the same parameters. If we wanted to stop the my_db cluster quickly, we could use a similar command as before:

pg_ctlcluster 9.4 my_db stop -m fast

Note that there’s one major omission here: we didn’t specify the data directory. Of course, the wrappers know where all of the clusters live, so we don’t need to. The version and name uniquely identify all clusters under this architecture, removing the necessity of memorizing cluster locations or hacking our environment with variables.
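
The same version-plus-name pair drives the rest of the wrapper family as well. For instance, tearing down our experimental cluster later should be as simple as this (pg_dropcluster ships in the same postgresql-common package as the other tools):

pg_ctlcluster 9.4 my_db stop -m fast
pg_dropcluster 9.4 my_db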

All of this comes with something of a caveat related to configuration. Normally, Postgres configuration files reside within the data directory itself, making the database a self-contained entity that is not tied to any one server. If the data is on a SAN, for instance, it could be mounted on a replacement server and started with no modifications. But for the wrappers to work, they need a standard location to find information regarding the clusters they control. On systems that use these tools, that’s the /etc/postgresql directory.

Here’s what the configuration directories look like on our test system:

find /etc/postgresql -type d

It’s a simple organization to remember, and it closely resembles the version + name format all of the tools require. All of the configuration files the Postgres documentation or online articles refer to actually live here instead. Most UNIX systems expect configuration files somewhere in the /etc directory, so that’s not unexpected. But it does introduce a rather unusual complication.

Some Postgres tools unrelated to this organization expect configuration files to be in the Postgres data directory. If we were to restore a backup of the data directory on an unrelated server, we would be unable to start it, and we would have no idea who should be allowed to connect. To compound the issue, the cluster tools themselves would have no way to administer it, because the /etc files would be missing.

The easy solution would be to include the corresponding /etc/postgresql folder in the backup process itself. We could (and should) also use configuration management to distribute these files to replacement or restoration servers. But that doesn’t fix external tools that look for config files in the data directory, so I tend to also do a bit of file linking. For example:

for x in /etc/postgresql/9.4/my_db/*.conf; do
  ln -s $x /data/pgsql/my_db
done
This way we don’t have to worry; the necessary files are all there and we don’t lose any benefits provided by the cluster tools.

In any case, I hope any developers reading this are now better equipped to experiment with local instances. It’s unfortunate there’s not more standardization across the distributions in this regard, but at least now you have a starting point and some insight into how it got this way. Enjoy!


PG Phriday: Parallel Sequence Scans

November 13th, 2015 | Published in Database, Tech Talk | 2 Comments

A couple days ago, Robert Haas announced that he checked in the first iteration of parallel sequence scans in the Postgres 9.6 branch. And no, that’s not a typo. One of the great things about the Postgres devs is that they have a very regimented system of feature freezes to help ensure timely releases. Thus even though 9.5 just released its second beta, they’re already working on 9.6.

So what is a sequence scan, and why does this matter? Past articles have covered this, but Postgres uses sequence scans when it needs to read the entire contents of a table. For larger entities consisting of tens or hundreds of millions of rows, this is a time-consuming process. Even the best processor can only handle a few million rows per second, so as scale increases vertically, there’s currently no way to address these larger data volumes efficiently without significant effort.

As slow as a sequence scan is, it’s also a relatively simple process, and a great starting point for the long road to a parallel Postgres engine. Postgres knows how big the table files are and obviously knows its own storage format, so it merely needs to set scan ranges and farm them out to workers that report the results back to a coordinator process. In theory, that’s a clean transition that avoids complicated aggregation rules or clause pushdown headaches.

But how well does it work? To see this for myself, I did something I’ve actually never done before now: download the Postgres git tree. Then I set up a test case:

CREATE TABLE test_tab AS
SELECT, repeat(' ', 20) AS junk
  FROM generate_series(1, 20000000) a(id);

ANALYZE test_tab;

With this test table, there are 20-million rows of empty junk and no indexes, so we’re effectively forcing Postgres to use a sequence scan for any query. Then we have to enable the feature with the max_parallel_degree parameter. And finally we invoke a query with a naive WHERE clause applied to every row so the engine has to actually do some work.

SET max_parallel_degree TO 1;

EXPLAIN ANALYZE
SELECT *
  FROM test_tab
 WHERE junk LIKE '%s%';

                             QUERY PLAN
 Gather  (cost=1000.00..265706.30 rows=1 width=25)
         (actual time=1832.456..1832.456 rows=0 loops=1)
   Number of Workers: 1
   ->  Parallel Seq Scan on test_tab
         (cost=0.00..264706.20 rows=1 width=25)
         (actual time=1828.830..5489.878 rows=0 loops=1)
         Filter: (junk ~~ '%s%'::text)
         Rows Removed by Filter: 29594528

 Planning time: 0.062 ms
 Execution time: 1834.332 ms

There’s a lot going on here. First, we need to talk about how many processes actually worked on this query. The max_parallel_degree parameter controls how many background workers assist the main process, so there are actually two Postgres processes performing independent sequence scans. Some parallel systems use the parent as a mere coordinator, so we might expect the number of workers to be greater than 1 before actual parallel operation occurs. That isn’t the case with this implementation.

The query itself simply asks for something that doesn’t exist in our sample set. This helps us get a best-case scenario where no results are handed between the processes, but time is still consumed scanning table data. And the resulting plan from that query is rather different from a standard scan. We can see how many extra workers were involved and that the results were “gathered” by an extra execution layer.

The only odd detail is that 29-million rows were removed from the results of a 20-million row table. We’ll just chalk that up as an implementation quirk considering this is pre-alpha code. Otherwise, this patch appears to scale in a relatively linear manner. Let’s check out a few different variants of max_parallel_degree.

Workers Avg Time (s)
0 3.8
1 1.9
2 1.3
3 1.1
4 0.9

There’s a bit of jitter in the timings on our test system, but the trend is fairly clear. With no extra workers, one process can scan a 20M row table in about four seconds. With three extra workers, those four processes can perform the same task in about one second.

This iteration of the patch takes the size of the table into account, possibly to compensate for implicit worker management overhead. With 20M rows, we couldn’t get more than five dedicated workers, while 100M rows used seven workers, which together with the parent process saturated all eight of our CPU cores. Beyond that are a couple important caveats:

  1. Aggregates are not currently handled.
  2. Certain clauses are not pushed down into the workers.

The first can actually be circumvented rather easily. For example:

SET max_parallel_degree TO 3;

SELECT count(*)
  FROM (SELECT *
          FROM test_tab
         WHERE junk LIKE '%s%'
       ) s;

 count
-------
     0
(1 row)

Time: 1111.904 ms

The second is a natural result of the patch’s immaturity. We need to have parallel functionality before it can be optimized. I’m perfectly content waiting for it to be done right. In the meantime, we have all of the functionality added to make this possible. After 9.4 added background workers, new Postgres extensions began leveraging them. And now 9.6 will probably use them for their original purpose, based on how stable the patch appears so far.

It’s exciting to see that Postgres will finally be able to scale vertically in a way that can handle the large tables some organizations have been accumulating. We have a 30TB database with tens of billions of rows. Even though we’ll be sharding that in the future, looming parallel features imply the shards themselves will scale as we cross over into trillions of rows.

It’s an exciting time to be a Postgres end-user.


PG Phriday: Sidewinder

November 6th, 2015 | Published in Database, Tech Talk | 1 Comment

Maintaining a Postgres database can involve a lot of busywork. This is especially true for more robust architectures that allocate at least one replica for failover purposes. It’s still fairly common for a DBA to create a replica to accommodate emergency or upgrade scenarios, only to have to repeat the process when it comes time to revert to the original master system. It’s not safe to simply subscribe the original primary to the newly promoted secondary, so this leaves either creating a new clone, or using rsync to synchronize all of the files first.

This not only makes the DBA’s job harder, it also complicates disaster recovery tests or other scenarios that require reversing the replication stream. That’s where Postgres 9.5, with its inclusion of the pg_rewind utility, comes in. The only real criterion for use is that the wal_log_hints setting is enabled in postgresql.conf.
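
For reference, that’s a single line in postgresql.conf, though it only takes effect after a restart, so it pays to enable it before it’s ever needed:

# Required on any cluster we might later need to rewind
wal_log_hints = on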

Given that stipulation, imagine we have a 9.5 Postgres instance located at /data/testdb. From here, let’s create some fake data with pgbench and create a new replica:

pgbench -i -s 100 postgres
pg_basebackup -D /data/testdb-clone -R

Next we modify the postgresql.conf file for the clone and change the port parameter so both instances can run simultaneously. Anything will work here, so long as it’s different from the parent instance. Let’s just say the new port is 5433 for this test.
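
That amounts to a one-line edit in the clone’s copy of the configuration; something like this:

# /data/testdb-clone/postgresql.conf
port = 5433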

Then comes the fun part: promoting and modifying the replica. A 30-second pgbench test should modify enough data such that the replica and parent are no longer even minimally interchangeable.

pg_ctl -D /data/testdb-clone start
pg_ctl -D /data/testdb-clone promote
pg_ctl -D /data/testdb stop -m fast
pgbench -p 5433 -T 30 postgres

The reason we stopped the old primary after promoting the new one is to simulate them being out of sync. Even one transaction that differs on either instance is enough to cause this effect. In fact, the previous two code blocks will be used again later, so keep them in mind.

To have some comparison, we need to test the “old” method of using rsync. Our assumption here is that the database is too large to use pg_basebackup. Let’s see it in action:

pg_ctl -D /data/testdb-clone stop -m fast
time rsync -a /data/testdb-clone/ /data/testdb/
real    0m15.019s
user    0m9.804s
sys     0m10.909s

Nothing especially complicated, right? 15 seconds isn’t bad, considering we’re dealing with about 2GB of test data. If this were a database that measured in TBs instead, we’d be extremely grateful for any files rsync chose not to transfer. But what if we used pg_rewind instead? How would that even work? Oddly enough, it’s actually pretty similar.

pg_ctl -D /data/testdb-clone stop -m fast
time pg_rewind --source-pgdata=/data/testdb-clone \
               --target-pgdata=/data/testdb

servers diverged at WAL position 0/9E0004D8 on timeline 1
rewinding from last common checkpoint at 0/9E000430 on timeline 1

real    0m0.679s
user    0m0.024s
sys     0m0.653s

We’re not quite done yet. Executing pg_rewind merely “reverts” the old primary to a point where it can safely consume transaction logs from the new primary. After this, it needs to run in recovery mode and apply any pending changes as if it were a newly allocated replica. This means we need a new recovery.conf pointing at the replica we promoted at the beginning of this process.

standby_mode = 'on'
primary_conninfo = 'user=postgres port=5433'

Note that we’re only using the postgres user to simplify our examples. We don’t recommend using the primary superuser for replication if it can be avoided. Beyond that, the only remaining steps are to start the new replica and see how long it takes to catch up.

pg_ctl -D /data/testdb start
sleep 5
egrep 'standby|ready' /data/testdb/pg_log/postgresql-*.log
2015-11-06 17:56:49 GMT LOG:  entering standby mode
2015-11-06 17:56:50 GMT LOG:  database system is ready to accept read only connections

Wow! That’s quite a difference! Add that up and we spend about two seconds for the whole process. One to rewind the data, and another to re-apply the data modified since the systems were last in sync. Further, we can use the same process on testdb-clone to safely convert it back into a replica.

Part of running a Postgres cluster of any description is coordination. Previously, swapping between two systems frequently was extremely slow and annoying for larger databases. This made maintenance and failovers to secondary pairs or DR tests much more complicated than necessary. Neither of those scenarios promote a healthy high availability architecture.

We want to swap to the secondary to perform maintenance as often as necessary, since that’s its role. We want to test our disaster recovery environment to prove its functionality. When doing so requires nearly a full synchronization in both directions every time, we’re much less likely to schedule these events very frequently. The pg_rewind utility finally gives us a safe way to do that.

As a final note, while directly included as part of Postgres 9.5, pg_rewind is actually available for 9.4 and 9.3 as well. I highly recommend integrating it into any robust HA stack; the QA and Infrastructure departments will be glad when the DR test can finish ahead of schedule for once!


PG Phriday: Massively Distributed Operation

October 30th, 2015 | Published in Database, News, Tech Talk | No Comments

Postgres has been lacking something for quite a while, and more than a few people have attempted to alleviate the missing functionality multiple times. I’m speaking, of course, about parallel queries. There are several reasons to want them, among them the various distribution and sharding needs of large data sets. When tables start to reach hundreds of millions, or even billions of rows, even high cardinality indexes produce results very slowly.

I recently ran across an extension called pmpp for Poor Man’s Parallel Processing and decided to give it a try. It uses Postgres’ dblink system to invoke queries asynchronously and then collates the results locally. This allows further queries on that result set, as if it were a temporary table.

Theoretically this should be ideal for a distributed cluster of shards, so let’s see what happens if we try this with our sensor_log table in that configuration:

CREATE TABLE sensor_log (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       VARCHAR NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);

CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
CREATE INDEX idx_sensor_log_time ON sensor_log ((reading_date::TIME));

CREATE SCHEMA shard_1;
SET search_path TO shard_1;
CREATE TABLE sensor_log (LIKE public.sensor_log INCLUDING ALL)
INHERITS (public.sensor_log);

CREATE SCHEMA shard_2;
SET search_path TO shard_2;
CREATE TABLE sensor_log (LIKE public.sensor_log INCLUDING ALL)
INHERITS (public.sensor_log);

CREATE SCHEMA shard_3;
SET search_path TO shard_3;
CREATE TABLE sensor_log (LIKE public.sensor_log INCLUDING ALL)
INHERITS (public.sensor_log);

CREATE SCHEMA shard_4;
SET search_path TO shard_4;
CREATE TABLE sensor_log (LIKE public.sensor_log INCLUDING ALL)
INHERITS (public.sensor_log);

The top sensor_log table in the public schema exists merely so we can query all the table sets as a whole without using a bunch of UNION statements. This should allow us to simulate how such a query would run without the benefit of parallel execution on each shard.

Now we need to fill the shards with data. Fortunately the generate_series function has an option to increment by arbitrary amounts, so simulating a hash function for distribution is pretty easy. Here’s what that looks like:

INSERT INTO shard_1.sensor_log (location, reading, reading_date)
SELECT % 1000, % 100, now() - ( || 's')::INTERVAL
  FROM generate_series(1, 4000000, 4) s(id);

INSERT INTO shard_2.sensor_log (location, reading, reading_date)
SELECT % 1000, % 100, now() - ( || 's')::INTERVAL
  FROM generate_series(2, 4000000, 4) s(id);

INSERT INTO shard_3.sensor_log (location, reading, reading_date)
SELECT % 1000, % 100, now() - ( || 's')::INTERVAL
  FROM generate_series(3, 4000000, 4) s(id);

INSERT INTO shard_4.sensor_log (location, reading, reading_date)
SELECT % 1000, % 100, now() - ( || 's')::INTERVAL
  FROM generate_series(4, 4000000, 4) s(id);

Clearly a real sharding scenario would have a lot more involved in distributing the data. But this is poor-man’s parallelism, so it’s only appropriate to have a bunch of lazy shards to go with it.

In any case, we’re ready to query these tables. The way we generated the data, each table contains a million rows representing about six weeks of entries. A not infrequent use case for this structure is checking various time periods distributed across multiple days. That’s why we created the index on the TIME portion of our reading_date column.

If, for example, we wanted to examine how 2PM looked across all of our data, we would do something like this:

\timing on

SELECT count(*)
  FROM public.sensor_log
 WHERE reading_date::TIME >= '14:00'
   AND reading_date::TIME < '15:00';

Time: 1215.589 ms

SELECT count(*)
  FROM shard_1.sensor_log
 WHERE reading_date::TIME >= '14:00'
   AND reading_date::TIME < '15:00';

Time: 265.620 ms

The second run, against just one partition, is included to give some insight into how fast the query could be if all four partitions could be checked at once. Here’s where the pmpp extension comes into play. It lets us send as many queries as we want in parallel, and pulls in the results as they complete. Each query can be sent over a different connection, too.

For the sake of simplicity, we’ll just simulate the remote connections with a local loopback to the database where we created all of the shards. In a more advanced scenario, we would be using at least two Postgres instances on potentially separate servers.

Prepare to be amazed!

CREATE EXTENSION postgres_fdw;

CREATE SERVER loopback
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'localhost', dbname 'postgres', port '5433');

CREATE USER MAPPING
   FOR postgres
SERVER loopback
OPTIONS (user 'postgres', password 'test');

\timing on

CREATE TEMP TABLE tmp_foo (total INT);

SELECT SUM(total) FROM pmpp.distribute( NULL::tmp_foo, 'loopback',
    'SELECT count(*) FROM shard_1.sensor_log
      WHERE reading_date::TIME >= ''14:00''
        AND reading_date::TIME < ''15:00''',
    'SELECT count(*) FROM shard_2.sensor_log
      WHERE reading_date::TIME >= ''14:00''
        AND reading_date::TIME < ''15:00''',
    'SELECT count(*) FROM shard_3.sensor_log
      WHERE reading_date::TIME >= ''14:00''
        AND reading_date::TIME < ''15:00''',
    'SELECT count(*) FROM shard_4.sensor_log
      WHERE reading_date::TIME >= ''14:00''
        AND reading_date::TIME < ''15:00'''
);

Time: 349.503 ms

Nice, eh? With a bit more “wrapping” to hide the ugliness of broadcasting a query to multiple servers, this has some major potential! Since we separated the shards by schema, we could even bootstrap the connections so the same query could be sent to each without any further modifications. Of course, the Postgres foreign data wrapper doesn’t let us set the schema for created servers, so we’d need another workaround like pg_service.conf, but the components are there.
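
As a sketch of that workaround, with placeholder connection details: each service entry in pg_service.conf pins its connection to one shard schema, so the same query text works against every shard.

# pg_service.conf -- one service per shard, each forcing its own search_path
[shard_1]
host=localhost
port=5433
dbname=postgres
options=-c search_path=shard_1

[shard_2]
host=localhost
port=5433
dbname=postgres
options=-c search_path=shard_2

Since pmpp rides on dblink, the connection strings could then be as simple as service=shard_1, service=shard_2, and so on.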

The primary caveat is that this approach works best for queries that drastically reduce the query set using aggregates. The problem is that Postgres needs to run the query on each system, fetch the results into a temporary structure, and then return it again from the distribute() function. This means there’s an inversely proportional relationship between row count and speed; there’s a lot of overhead involved.

Take a look at what happens when we try to run the aggregate locally:

\timing on

SELECT COUNT(*) FROM pmpp.distribute( NULL::sensor_log, 'loopback',
    'SELECT * FROM shard_1.sensor_log
      WHERE reading_date::TIME >= ''14:00''
        AND reading_date::TIME < ''15:00''',
    'SELECT * FROM shard_2.sensor_log
      WHERE reading_date::TIME >= ''14:00''
        AND reading_date::TIME < ''15:00''',
    'SELECT * FROM shard_3.sensor_log
      WHERE reading_date::TIME >= ''14:00''
        AND reading_date::TIME < ''15:00''',
    'SELECT * FROM shard_4.sensor_log
      WHERE reading_date::TIME >= ''14:00''
        AND reading_date::TIME < ''15:00'''
);

Time: 4059.308 ms

Basically, do anything possible to reduce the result set on queries. Perform aggregation remotely if possible, and don’t be surprised if pulling back tens of thousands of rows takes a bit longer than expected.

Given that caveat, this is a very powerful extension. It’s still very early in its development cycle, and could use some more functionality, but the core is definitely there. Now please excuse me while I contemplate ways to glue it to my shard_manager extension and methods of wrapping it to simplify invocation.


PG Phriday: Parallel-O-Postgres

October 16th, 2015 | Published in Database, Tech Talk | No Comments

I wasn’t able to write an article last week due to an unexpected complication regarding tests I was running to verify its contents. So this week, it’s going to be extra special! Also long.

What’s the fastest way to load a Postgres table? If you believe the documentation, the COPY command is the best way to unceremoniously heave data into a table. Fortunately after all of our talk about partitions, our minds are primed and ready to think in chunks. Why restrict ourselves to one COPY, when we can summon an army?

Let’s start with our tried and true sensor_log table:

CREATE TABLE sensor_log (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       VARCHAR NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);

CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

Now, there are numerous methods for filling this table. We’ve explored this in the past and determined that COPY is the fastest method among built-in Postgres functionality. It does this by circumventing a large portion of otherwise necessary overhead. As an example, a simple Python script can insert 100k records in about 25 seconds with INSERT statements, or 1.5 seconds with a COPY equivalent like psycopg’s copy_from.
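
For reference, the INSERT figure comes from a loop of single-row statements; this is a rough sketch of that approach rather than the exact benchmark script:

import psycopg2
from datetime import datetime

# One INSERT per row: simple, but every statement pays parsing, planning,
# and round-trip overhead, which is where the ~25 seconds goes.
db_conn = psycopg2.connect(database = 'postgres', user = 'postgres')
cur = db_conn.cursor()
stamp =

for i in xrange(1, 100001):
    cur.execute(
        "INSERT INTO sensor_log (location, reading, reading_date)"
        " VALUES (%s, %s, %s)",
        (i % 1000, i % 100, stamp)
    )

db_conn.commit()
db_conn.close()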

And as previously mentioned, we can parallelize everything for even more benefit. Or can we? Take a look at this Python code for a very simple parallel loader:

# Set these to modify how large the COMMIT blocks will be,
# and how many days the data will represent.
# Total row count = chunk_size * date_range.
chunk_size = 100000
date_range = 8
threads = 1

# Includes
import psycopg2

from StringIO import StringIO
from datetime import datetime, timedelta, date, time
from multiprocessing import Pool

# Main program body

def load_part(part_date):

    # Create all of the fake data we'll be inserting. We'll use an
    # in-memory file so we can use a PostgreSQL COPY import.
    stamp = datetime.combine(part_date, time.max)
    raw_data = StringIO()

    for i in xrange(1, chunk_size + 1):
        stamp -= timedelta(seconds = 0.5)
        raw_data.write(
            '%s\t%s\t%s\n' % (i % 1000, i % 100, stamp)
        )

    # Rewind the in-memory file so copy_from reads from the beginning.

    # Connect to the database and insert everything we've generated.
    # This is really basic and lame, but works as a demo.
    db_conn = psycopg2.connect(database = 'postgres', user = 'postgres')
    cur = db_conn.cursor()
    cur.copy_from(
        raw_data, 'sensor_log', '\t',
        columns = ('location', 'reading', 'reading_date')
    )
    db_conn.commit()
    db_conn.close()

# Generate a list of dates that we can send to the pool mapping function.
# This should keep our multiprocess stack full until all data is loaded.
date_list = [ - timedelta(i) for i in xrange(0, date_range)]

pool = Pool(processes = threads)
res =, date_list)

There’s not really anything complicated going on in that script. We have a configurable worker pool, we’ve elected to group data in 100k-row batches, and we have eight batches so we can test worker counts at several multiples of two. The real fun is getting the results. Here’s what they look like on an 8-CPU system with 32GB of RAM using Postgres 9.4 for 1, 2, 4, and 8 processes:

Pool Size Avg Time (s)
1 18.6
2 11.3
4 7.0
8 8.2

What happened at eight processes? It turns out our Python script does its own work, so with every CPU occupied by a loader, the extra context switching becomes measurable overhead. For this particular approach, it’s best to restrict parallelism to half the number of CPUs on the system so that doesn’t happen. Why not a pool size of 6? Since the batches are equal amounts of work, six workers would process six batches in the first round and the remaining two in a second round. That’s still two rounds, just like the 4-process case, so our run-time would remain essentially unchanged.

But this is loading the same table. We’ve been covering various aspects of partitions for weeks now, so why not see what happens with partitions? As it turns out, a lot changes. Here’s the same tests with a slightly altered Python script that assumes daily partitions for our simulated data:

Pool Size Avg Time (s)
1 16.2
2 9.6
4 6.0
8 6.8
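
The altered script isn’t reproduced here, but the change is small: each worker simply targets its own daily partition rather than the parent table. A minimal sketch, assuming a hypothetical sensor_log_part_YYYYMMDD naming scheme and reusing the imports and settings from the script above:

def load_part(part_date):

    # Same data generation as before, but COPY into the day's partition
    # directly instead of the parent sensor_log table.
    part_name = 'sensor_log_part_%s' % part_date.strftime('%Y%m%d')

    stamp = datetime.combine(part_date, time.max)
    raw_data = StringIO()

    for i in xrange(1, chunk_size + 1):
        stamp -= timedelta(seconds = 0.5)
        raw_data.write('%s\t%s\t%s\n' % (i % 1000, i % 100, stamp))

    db_conn = psycopg2.connect(database = 'postgres', user = 'postgres')
    cur = db_conn.cursor()
    cur.copy_from(
        raw_data, part_name, '\t',
        columns = ('location', 'reading', 'reading_date')
    )
    db_conn.commit()
    db_conn.close()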

We can account for the slightly lower times mainly due to differences in primary key calculations. The unique indexes that define them are only valid per partition, so building the indexes is faster with fewer candidates. That also means the effect should persist for larger, production-scale tables.

There’s a major reason I used Postgres 9.4 for these tests. One of the mechanisms COPY uses to streamline data into the target table relies on the WAL backend. Previous versions of Postgres only really had one working slot for this, so parallel COPY wasn’t really something that existed. Let’s see what happens with the parallel partition numbers with a 9.3 instance:

Pool Size Avg Time (s)
1 19.8
2 12.1
4 10.5
8 16.2

Take special note of that last item on the list. What happened is clear if we examine the logs for the COPY commands themselves:

LOG:  duration: 14688.654 ms  statement: COPY ...

Remember that there are eight of these executing in parallel, and each of them required about 14.5 seconds to run. So that’s actually 14.5 * 8, or almost two entire minutes of CPU time to load 800k rows. That’s six times slower than the sequential case. Even if we go down to four processes, each load requires about four seconds to run. That means four parallel processes require 32 seconds of CPU time. By trying to parallelize this in 9.3, we actually made it much more resource intensive.

It looks faster due to time elapsed, but that’s simply a side-effect of increased concurrency. Remember, at four seconds per batch set, it would take eight seconds of elapsed time to process two sets. After factoring in the Python script overhead, that’s perfectly inline with our results. This is why it’s important to consider the overall work being done, and not how quickly it finishes.

Knowing that, 9.4 isn’t exactly innocent when it comes to this issue. At four concurrent processes, each COPY requires about 2.5 seconds on this system. So even four concurrent processes is less efficient than a sequential load in 9.4, though the effect is far less pronounced than in 9.3.

Not to leave 9.5 out now that it’s in beta, here are its numbers using the partitioning script:

Pool Size Avg Time (s)
1 15.7
2 7.7
4 5.1
8 6.2

That looks like a pretty nice improvement over 9.4 in general, which is always nice to see between version bumps. At this point, each COPY requires about two seconds, bringing the 4-CPU case almost to parity with sequentially loading the tables. Converting a process to be parallel always introduces some amount of overhead, but this is a vast improvement from 9.3.

It’s also what I consider the final nail in the coffin for pushing upgrades. Postgres 9.3 is awesome, but 9.4 contains so many improvements it’s difficult to quantify them all. Most users and DBAs probably didn’t know COPY was essentially sequential across a server instance prior to 9.4, and that’s only one improvement that got very little fanfare according to Google. Imagine what else is in there!

There are probably a lot of things in 9.5 that equally defy simple analysis and could bear even more drastic results. But for now, check your Postgres version and if you are using parallel COPY on anything less than 9.4, you might want to upgrade soon.

