PG Phriday: Sidewinder

Maintaining a Postgres database can involve a lot of busywork. This is especially true for more robust architectures that allocate at least one replica for failover purposes. It’s still fairly common for a DBA to create a replica to accommodate emergency or upgrade scenarios, only to have to repeat the process when it comes time to revert to the original master system. It’s not safe to simply subscribe the original primary to the newly promoted secondary, so that leaves either creating a new clone from scratch, or using rsync to synchronize all of the files first.

This not only makes the DBA’s job harder, it also complicates disaster recovery tests and other scenarios that require reversing the replication stream. That’s where Postgres 9.5 and its new pg_rewind utility come in. The only real criterion for use is that the wal_log_hints setting is enabled in postgresql.conf, or that the cluster was initialized with data checksums.
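
Enabling it is a single line in postgresql.conf, though it requires a restart to take effect:

wal_log_hints = on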

Given that stipulation, imagine we have a 9.5 Postgres instance located at /data/testdb. From there, let’s generate some sample data with pgbench and create a new replica:

# Generate roughly 2GB of sample data in the postgres database.
pgbench -i -s 100 postgres

# Clone the running instance; -R writes a recovery.conf for us.
pg_basebackup -D /data/testdb-clone -R

Next we modify the postgresql.conf file for the clone and change the port parameter so both instances can run simultaneously. Any value will work here, so long as it differs from the parent instance. Let’s say the new port is 5433 for this test.
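
In the clone’s postgresql.conf, that amounts to a single line:

port = 5433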

Then comes the fun part: promoting and modifying the replica. A 30-second pgbench test should modify enough data such that the replica and parent are no longer even minimally interchangeable.

# Start the clone on its new port and promote it out of recovery,
# then stop the original primary.
pg_ctl -D /data/testdb-clone start
pg_ctl -D /data/testdb-clone promote
pg_ctl -D /data/testdb stop -m fast

# Generate 30 seconds of write activity on the promoted clone.
pgbench -p 5433 -T 30 postgres

The reason we stopped the old primary after promoting the new one is to simulate them being out of sync. Even one transaction that differs on either instance is enough to cause this effect. In fact, the previous two code blocks will be used again later, so keep them in mind.

For comparison, we should first test the “old” method of using rsync. Our assumption here is that the database is too large to simply re-clone with pg_basebackup. Let’s see it in action:

# Stop the promoted clone so its files are consistent on disk.
pg_ctl -D /data/testdb-clone stop -m fast

# Copy the clone's data directory back over the old primary.
time rsync -a /data/testdb-clone/ /data/testdb/

real    0m15.019s
user    0m9.804s
sys     0m10.909s

Nothing especially complicated, right? Fifteen seconds isn’t bad, considering we’re dealing with about 2GB of test data. If this were a database measured in TBs instead, we’d be extremely grateful for any files rsync chose not to transfer. But what if we used pg_rewind instead? How would that even work? Oddly enough, it’s actually pretty similar. Remember those earlier code blocks? Re-run them first so the instances diverge again, then:

# The clone can't be running while --source-pgdata reads its data
# directory directly, so stop it first.
pg_ctl -D /data/testdb-clone stop -m fast

time pg_rewind --source-pgdata=/data/testdb-clone \
               --target-pgdata=/data/testdb

servers diverged at WAL position 0/9E0004D8 on timeline 1
rewinding from last common checkpoint at 0/9E000430 on timeline 1
Done!

real    0m0.679s
user    0m0.024s
sys     0m0.653s

We’re not quite done yet. Executing pg_rewind merely “reverts” the old primary to a point where it can safely consume transaction logs from the new primary. After that, it needs to run in recovery mode and apply any pending changes, as if it were a newly allocated replica. That means we need a new recovery.conf pointing at the clone we promoted at the beginning of this process.

standby_mode = 'on'
primary_conninfo = 'user=postgres port=5433'

Note that we’re only using the postgres user to simplify our examples. We don’t recommend using the primary superuser for replication if it can be avoided; a dedicated role with the REPLICATION attribute is a safer choice.
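
A minimal sketch of such a role, with a purely hypothetical name and password:

CREATE ROLE replicator WITH LOGIN REPLICATION PASSWORD 'changeme';

The role also needs a matching replication entry in pg_hba.conf, and primary_conninfo would then reference it instead of postgres. Beyond that, the only remaining steps are to start the new replica and see how long it takes to catch up.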

pg_ctl -D /data/testdb start
sleep 5
egrep 'standby|ready' /data/testdb/pg_log/postgresql-*.log

2015-11-06 17:56:49 GMT LOG:  entering standby mode
2015-11-06 17:56:50 GMT LOG:  database system is ready to accept read only connections
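
For extra assurance, pg_stat_replication on the new primary should show the rewound node streaming. A quick check, assuming local connections:

psql -p 5433 -c "SELECT state, sent_location FROM pg_stat_replication;"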

Wow! That’s quite a difference! Add that up and we spend about two seconds on the whole process: one to rewind the data, and another to re-apply everything modified since the systems were last in sync. Further, we can use the same process on testdb-clone to safely convert it back into a replica.
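
For the reverse direction, one possible sequence is to promote testdb, stop the clone, and rewind it in turn. A sketch, assuming testdb is back on its default port of 5432:

# Promote the rewound instance back to primary status,
# then stop the clone so it can be rewound in turn.
pg_ctl -D /data/testdb promote
pg_ctl -D /data/testdb-clone stop -m fast

# --source-server connects to the running primary instead of
# reading a stopped data directory.
pg_rewind --source-server='port=5432 user=postgres' \
          --target-pgdata=/data/testdb-clone

After that, a recovery.conf in testdb-clone with primary_conninfo pointing at port 5432 completes the role reversal.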

Part of running a Postgres cluster of any description is coordination. Previously, swapping between two systems was extremely slow and annoying for larger databases, which made regular maintenance, failovers to secondary pairs, and DR tests far more complicated than necessary. None of that promotes a healthy high availability architecture.

We want to swap to the secondary to perform maintenance as often as necessary, since that’s its role. We want to test our disaster recovery environment to prove its functionality. When doing so requires nearly a full synchronization in both directions every time, we’re much less likely to schedule these events very frequently. The pg_rewind utility finally gives us a safe way to do that.

As a final note, while pg_rewind is directly included as part of Postgres 9.5, it’s also available as a separate utility for 9.4 and 9.3. I highly recommend integrating it into any robust HA stack; the QA and Infrastructure departments will be glad when the DR test finishes ahead of schedule for once!