PG Phriday: High Availability Through Delayed Replication
High availability of PostgreSQL databases is incredibly important to me. You might even say it’s a special interest of mine. It’s one reason I’m both excited and saddened by a feature introduced in 9.4. I’m Excited because it’s a feature I plan to make extensive use of, and saddened because it has flown under the radar thus far. It’s not even listed in the What’s new in PostgreSQL 9.4 Wiki page. If they’ll let me, I may have to rectify that.
What is this mysterious change that has me drooling all over my keyboard? The new recovery_min_apply_delay standby server setting. In name and intent, it forces a standby server to delay application of upstream changes. The implications, however, are much, much more important.
Let me tell you a story; it’s not long, actually. A couple years ago, I had to help a client that was using a hilariously over-engineered stack to prevent data loss. I only say that because at first glance, the number of layers and duplicate servers would shock most people, and the expense would finish the job. This was one of my full recommended stacks, plus a few extra bits for the truly paranoid. DRBD-bonded servers, Pacemaker failover, off-site disaster recovery streaming clones, nightly backup, off-site backup and historical WAL storage, and long-term tape archival in a vault for up to seven years. You would need to firebomb several cities to get rid of this data.
But data permanence and availability are not synonymous. All it took was a single misbehaving CPU to take out the entire constellation of database servers, and corrupt a bunch of recent WAL files for good measure. How this is possible, and how difficult it is to avoid, is a natural extension of using live streaming replicas for availability purposes. We always need to consider one important factor: immediacy applies to everything.
Here’s what actually happened:
- A CPU on master-1 went bad.
- Data being written to database files was corrupt.
- DRBD copied the bad blocks, immediately corrupting master-2.
- Shared memory became corrupt.
- Streaming replication copied the corrupt data to dr-master-1.
- DRBD copied the bad blocks, immediately corrupting dr-master-2.
- In turn, PostgreSQL noticed the corruption and crashed on each server.
- Monitoring systems started screaming on all channels.
Just like that, a bulletproof high-availability cluster imploded into itself. All we had left at that point were the pristine backups, and the off-site WAL archives. This is one of the major reasons I wrote walctl, actually. Keeping archived WAL files on a tertiary server isolates them from issues that affect the primary or disaster recovery clusters. Further, it means the files can be pulled by any number of new clones without overloading the masters, which are intended to be used for OLTP.
In this case, we pulled a backup from the off-site backup vault, gathered the WAL files that were generated before the CPU went bad, and got the cluster running again in a couple of hours. But this could have easily been much worse, and without the previously-mentioned expensive paranoia and surplus of redundancy levels, it would have. And despite the fact we recovered everything, there’s still the several-hour outage to address.
You see, we weren’t paranoid enough. For a truly high-available architecture, corruption of the data source should always be considered a possibility. Both DRBD and PostgreSQL strive to copy data as quickly as possible, just as they should. Synchronization delay is another huge, but unrelated problem applications often need to circumvent when communicating with replicas. One way to solve this is to keep a third standby server that uses traditional WAL consumption, and then implement a time delay.
Effectively, this means preventing the extra server from processing WAL files for some period of time. This interval allows a DBA to interrupt replay before corruption reaches a fully online replica. It takes time for monitoring systems to report outages, and for the DBA to log into a server and diagnose the problem. As we’ve seen, it can already be too late; the data is already corrupt, and a backup is the only recourse. But a delayed server is already online, can easily be recovered to a point right before the corruption started, and can drastically reduce the duration of an outage.
There are several ways of imposing this delay, and all of them require at least one more series of scripts or software to strictly regulate file availability. They’re also largely irrelevant since the introduction of PostgreSQL 9.4 and the
recovery_min_apply_delay setting. Instead of a cron job, or using a complicated script as the
recovery.conf, or some other method, we just set this variable and we get the desired offset. Here’s a two-hour window:
recovery_min_apply_delay = '2h'
This works with both streaming replication, and more traditional WAL file recovery. There is however, one caveat to using this setting. Since the replica can not apply the changes as they’re presented, they are held in the
pg_xlog directory until the imposed purgatory expires. On highly transactional systems, this can result in unexpected space usage on replicas that activate the setting. The larger the safety margin, the more files will accumulate awaiting replay.
Barring that, it’s a huge win for anyone who wants to run a highly available cluster. In fact, it can even be integrated into cluster automation, so a delayed server is stopped if the primary system is down. This keeps our desired window intact while we investigate the problem, without us having to stop troubleshooting and shut down the time-offset replica ourselves.
In addition, a delayed server can be used for standard recovery purposes. If a user erroneously deletes data, or a rogue process drops a critical object, there’s a replica ready and waiting to let us recover the data and reintroduce it to the master server.
Having a server sitting around with self-inflicted synchronization offset seems ridiculous at first glance. But from the perspective of a DBA, it can literally save the database if used properly. I highly recommend anyone who can afford to implement this technique, does so. Your system uptime will thank you.