PG Phriday: Sidewinder
Maintaining a Postgres database can involve a lot of busywork. This is especially true for more robust architectures that allocate at least one replica for failover purposes. It’s still fairly common for a DBA to create a replica to accommodate emergency or upgrade scenarios, only to have to repeat the process when it comes time to revert to the original master system. It’s not safe to simply subscribe the original primary to the newly promoted secondary, so this leaves either creating a new clone, or using rsync to synchronize all of the files first.
This not only makes the DBA’s job harder, it also complicates disaster recovery tests or other scenarios that require reversing the replication stream. That’s where Postgres 9.5 comes in, with its inclusion of the pg_rewind utility. The only real criterion for use is that the wal_log_hints setting is enabled in postgresql.conf (or that the cluster was initialized with data checksums).
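For reference, that only amounts to a single line, though a restart is required for it to take effect:
# In postgresql.conf; requires a restart.
wal_log_hints = on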
Given that stipulation, imagine we have a 9.5 Postgres instance located at /data/testdb. From here, let’s create some fake data with pgbench and create a new replica:
pgbench -i -s 100 postgres
pg_basebackup -D /data/testdb-clone -R
Next we modify the postgresql.conf file for the clone and change the port parameter so both instances can run simultaneously. Any port will work here, so long as it’s different from the parent instance. Let’s just say the new port is 5433 for this test.
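Assuming a stock configuration where port still sits commented out at its default of 5432, the change is a single line:
# In /data/testdb-clone/postgresql.conf:
port = 5433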
Then comes the fun part: promoting and modifying the replica. A 30-second pgbench test should modify enough data such that the replica and parent are no longer even minimally interchangeable.
pg_ctl -D /data/testdb-clone start
pg_ctl -D /data/testdb-clone promote
pg_ctl -D /data/testdb stop -m fast
pgbench -p 5433 -T 30 postgres
The reason we stopped the old primary after promoting the new one is to simulate them being out of sync. Even one transaction that differs on either instance is enough to cause this effect. In fact, the previous two code blocks will be used again later, so keep them in mind.
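If we want to see the divergence for ourselves, pg_controldata can report checkpoint details for each data directory, even while the clone is still running:
pg_controldata /data/testdb | grep 'checkpoint location'
pg_controldata /data/testdb-clone | grep 'checkpoint location'
Once the clone reports checkpoint locations beyond anything the stopped parent has seen, the two can no longer follow one another without intervention.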
To have some comparison, we need to test the “old” method of using rsync. Our assumption here is that the database is too large to use pg_basebackup. Let’s see it in action:
pg_ctl -D /data/testdb-clone stop -m fast
time rsync -a /data/testdb-clone/ /data/testdb/
real 0m15.019s
user 0m9.804s
sys 0m10.909s
Nothing especially complicated, right? 15 seconds isn’t bad, considering we’re dealing with about 2GB of test data. If this were a database that measured in TBs instead, we’d be extremely grateful for any files rsync chose not to transfer. But what if we used pg_rewind instead? How would that even work? Oddly enough, it’s actually pretty similar.
pg_ctl -D /data/testdb-clone stop -m fast
time pg_rewind --source-pgdata=/data/testdb-clone \
--target-pgdata=/data/testdb
servers diverged at WAL position 0/9E0004D8 on timeline 1
rewinding from last common checkpoint at 0/9E000430 on timeline 1
Done!
real 0m0.679s
user 0m0.024s
sys 0m0.653s
We’re not quite done yet. Executing pg_rewind merely “reverts” the old primary to a point where it can safely consume transaction logs from the new primary. After this, it needs to run in recovery mode and apply any pending changes as if it were a newly allocated replica. This means we need a new recovery.conf pointing at the replica we promoted at the beginning of this process.
standby_mode = 'on'
primary_conninfo = 'user=postgres port=5433'
Note that we’re only using the postgres user to simplify our examples. We don’t recommend using the primary superuser for replication if it can be avoided.
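A dedicated role with the REPLICATION attribute only takes a moment to create. The rep_user name and password here are placeholders of our own, and pg_hba.conf would also need a matching replication entry:
psql -p 5433 -c "CREATE ROLE rep_user WITH LOGIN REPLICATION PASSWORD 'secret'"
That role would then replace postgres in primary_conninfo. Beyond that, the only remaining steps are to make sure the promoted clone is running, start the rewound instance, and see how long it takes to catch up.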
pg_ctl -D /data/testdb-clone start
pg_ctl -D /data/testdb start
sleep 5
egrep 'standby|ready' /data/testdb/pg_log/postgresql-*.log
2015-11-06 17:56:49 GMT LOG: entering standby mode
2015-11-06 17:56:50 GMT LOG: database system is ready to accept read only connections
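Before tallying the results, we can also confirm the rewound node is actually streaming rather than merely accepting read-only connections; the pg_stat_replication view on the promoted clone lists every attached standby:
psql -p 5433 -c "SELECT state, sent_location, replay_location FROM pg_stat_replication"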
Wow! That’s quite a difference! Add that up and we spend about two seconds for the whole process. One to rewind the data, and another to re-apply the data modified since the systems were last in sync. Further, we can use the same process on testdb-clone to safely convert it back into a replica.
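Sketching it out, one way that reverse swap might go is shown below. Here we lean on pg_rewind’s --source-server option instead of --source-pgdata, which lets the newly promoted primary stay online while the clone is rewound:
pg_ctl -D /data/testdb promote
pg_ctl -D /data/testdb-clone stop -m fast
pg_rewind --source-server='port=5432 user=postgres' \
          --target-pgdata=/data/testdb-clone
# Write a recovery.conf in testdb-clone pointing at port 5432, then:
pg_ctl -D /data/testdb-clone start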
Part of running a Postgres cluster of any description is coordination. Previously, frequently swapping between two systems was extremely slow and annoying for larger databases. This made maintenance failovers to secondary pairs and DR tests much more complicated than necessary. Neither of those scenarios promotes a healthy high availability architecture.
We want to swap to the secondary to perform maintenance as often as necessary, since that’s its role. We want to test our disaster recovery environment to prove its functionality. When doing so requires nearly a full synchronization in both directions every time, we’re much less likely to schedule these events very frequently. The pg_rewind utility finally gives us a safe way to do that.
As a final note, while pg_rewind is directly included as part of Postgres 9.5, it’s actually available for 9.4 and 9.3 as well. I highly recommend integrating it into any robust HA stack; the QA and Infrastructure departments will be glad when the DR test can finish ahead of schedule for once!