In light of recent events where GitLab suffered a massive database loss, this is a great opportunity to examine what happened from a Postgres perspective. Since Simon Riggs over at 2ndQuadrant has already chimed in on improvements GitLab might consider in their procedures, maybe we should walk the conversation back slightly.
This isn’t the first time Postgres backup tooling has been misused or misunderstood. The topic of backups hits forums and mailing lists rather frequently. It’s no secret that there’s no official “push button to back up” tool for Postgres. Companies like 2ndQuadrant do their best with offerings like Barman, but this will never stop admins from rolling their own solutions instead of resorting to another external utility.
Part of the reason for this stems from the extract and backup tools Postgres itself provides. After reading the documentation, it’s only natural to encounter
pg_dump and stop looking. And why not? MySQL has its
mysqldump after all and that’s fine, right?
Well, not always. Our first clue comes from the ‘dump’ suffix. By its nature, this means we’re extracting database contents by essentially executing a
SELECT statement against each table and waiting for the results. Sometimes this involves compression, and there’s a special format Postgres can use to inline the process, and… none of that matters.
An extract is not a backup. Full stop. Depending on the size of our database, performing a full dump can consume a span of a few seconds to several days. We’re limited by the laws of physics in this case; reading a 50TB database takes time, as does the processing power necessary to compress that output. Restoring it is often much slower, given how we must account for recreating indexes. In these times of gigantic databases,
pg_dump should probably be relegated to exporting and importing individual tables unless the instance is truly tiny.
In the Old Days(tm), the most common advice was to supply the
pg_dump command with the
-Fc flags to activate its internal compression and storage format. But tools evolve, and Postgres 9.3 added the
-j option so tables can be dumped or restored in parallel. Have a huge machine with 64 CPUs living on really fast SSDs and insist on dumping every table? Do this:
    # This works for parallel dumps
    pg_dump -Fd -j 48 -d my_db -f /my/backup/dir

    # This is for restoring it
    pg_restore -j 48 -d my_db /my/backup/dir
This approach scales to surprisingly high levels, provided the underlying storage can keep up. It’s not perfectly linear, but a backup or restore using multiple parallel processes will complete in a fraction of the time. After a couple hundred GB however, even parallel dumps start to exceed standard maintenance windows.
Real database engines require a binary backup tool, and Postgres gained one of these back in version 9.1. There are multiple helpful options to customize its behavior, but its real job is to ignore database contents completely. The
pg_basebackup utility can back up a full instance of multiple databases nearly as quickly as the operating system can read the files from disk. A well-equipped SAN, RAID, or SSD usually provides enough IO to back up even a 5TB instance in a couple of hours.
Unfortunately these backups are not compressed by default, and since we lose the ability to perform parallel dumps when using
pg_basebackup, compression is single-threaded if we activate it. That means our 5TB cluster will produce a 5TB backup unless we don’t mind a much longer runtime. Ouch. On the other hand, this means the output it produces is an exact physical copy. What else is an exact physical copy of a Postgres instance?
That’s right: a replica. What’s nice is that the Postgres developers know that as well. Want to copy a remote Postgres cluster on another server and have it start streaming from the primary as soon as it’s started? Easy:
    pg_basebackup -h primary-server -D /new/datadir -R
    pg_ctl -D /new/datadir start
Since we supplied the -R flag,
pg_basebackup writes a
recovery.conf file that will instruct our new replica to stream from the same server we just fetched the backup from. We can do this as many times as we wish, even if the upstream is another replica. Our only limit is network bandwidth, and we’ve all upgraded to 10Gbps equipment, right?
Unfortunately, scenarios beyond this point are where the process breaks down. What happens if we have a replica that falls behind and needs to be rebuilt? For all of its benefits,
pg_basebackup still cannot (currently) skip unchanged files, or make small patches where necessary. Relying on it in this case would require erasing the replica and starting from scratch. This is where GitLab really ran into trouble.
Yet we started with synchronized files, didn’t we? Could we use
rsync to “catch up”? Yes, but it’s a somewhat convoluted procedure. We would first need to connect to the upstream server and issue a
SELECT pg_start_backup('my_backup') command so Postgres knows to archive transaction logs produced during the sync. Then after the sync is completed, we would need to stop the backup with
SELECT pg_stop_backup(). Then we would have to make our own
recovery.conf file, obtain all of the WAL files the upstream server archived, and so on.
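The sequence just described can be sketched as a shell script run from the replica host. Treat it as a hedged illustration rather than a vetted tool. It assumes a Postgres 9.x cluster (the `pg_start_backup` and `pg_stop_backup` functions were removed in Postgres 15, and `recovery.conf` went away in 12), WAL archiving already enabled upstream, and SSH access for `rsync`. The data directory `/db/pgdata`, the archive path `/mnt/wal_archive`, and the `DRY_RUN` switch are hypothetical names introduced for this example.

```shell
#!/bin/sh
# Sketch only: re-sync a lagging replica from its upstream with rsync.
# Assumes Postgres 9.x (exclusive backup functions, pg_xlog, recovery.conf).
# Hypothetical values: host name, both data directories, the archive path.
set -eu

PRIMARY="${PRIMARY:-primary-server}"
PRIMARY_DATADIR="${PRIMARY_DATADIR:-/db/pgdata}"
REPLICA_DATADIR="${REPLICA_DATADIR:-/new/datadir}"
WAL_ARCHIVE="${WAL_ARCHIVE:-/mnt/wal_archive}"
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands (the default)

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Emit a minimal recovery.conf: stream from upstream, fall back to the archive.
recovery_conf() {
    cat <<EOF
standby_mode = 'on'
primary_conninfo = 'host=${PRIMARY}'
restore_command = 'cp ${WAL_ARCHIVE}/%f "%p"'
EOF
}

resync_replica() {
    # The replica must be offline while its files are replaced.
    # Ignore the error pg_ctl reports if it is already stopped.
    run pg_ctl -D "$REPLICA_DATADIR" stop -m fast || true

    run psql -h "$PRIMARY" -d postgres \
        -c "SELECT pg_start_backup('replica_resync', true)"

    # Copy only files that differ. Exit code 24 means files vanished
    # during the copy, which is expected on a live upstream.
    rc=0
    run rsync -a --delete \
        --exclude=postmaster.pid --exclude='pg_xlog/*' \
        "${PRIMARY}:${PRIMARY_DATADIR}/" "${REPLICA_DATADIR}/" || rc=$?

    # Always leave backup mode, even when the copy failed.
    run psql -h "$PRIMARY" -d postgres -c "SELECT pg_stop_backup()"
    if [ "$rc" -ne 0 ] && [ "$rc" -ne 24 ]; then
        echo "rsync failed with exit code $rc" >&2
        return 1
    fi

    if [ "$DRY_RUN" = "1" ]; then
        recovery_conf
    else
        recovery_conf > "${REPLICA_DATADIR}/recovery.conf"
    fi
    run pg_ctl -D "$REPLICA_DATADIR" start
}

resync_replica
```

With the default `DRY_RUN=1` the script only prints each step, which makes it easier to review the order of operations before letting it touch a real data directory.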
None of that is something a typical system administrator is likely to know, and it’s fiddly even for an experienced Postgres DBA. A mistake at any point in that procedure can result in a non-functional or otherwise unsafe replica. All of that is the exact reason software like Barman exists. Supplied utilities only get us so far. For larger or more critical installations, either our custom scripts must flawlessly account for every failure scenario and automate everything, or we defer to someone who already did all of that work.
Even without that, there are some general guidelines to follow:
- Don’t use pg_dump for backups. If your database isn’t too large now, it will be later.
- Always have a replica handy for critical systems. If a company depends on a database, there should always be at least one online copy. It’s easy to play fast-and-loose here when performing maintenance on a replica, but scripts and CLI snafus happen. Bring up another replica until maintenance is over.
- If performing dangerous work on the primary, temporarily decouple at least one replica. I had to learn from experience here that hardware or filesystem modifications can corrupt the entire replication chain. Have an online copy of critical instances, and disable replication during maintenance until work is done.
- Have an off-site tertiary replica for disaster recovery purposes. Another datacenter, another city, another state. Just put it somewhere else. Not only is this a supplementary copy for failover purposes or crash recovery, but even regional outages won’t result in lost data or availability.
- Perform backup operations on a replica for 24/7 systems. Websites that people expect to never go down for maintenance should be backed up from a replica. Postgres backups are online and non-blocking, but are IO intensive and may substantially decrease query performance. Don’t take the risk.
- Fail over to a working replica at least twice per year. Either it happens when nobody expects it, or we manage the process ourselves. The latter scenario is highly preferred and establishes a checklist to apply to the former.
- Test backup restores at least twice a year. An untested backup is equivalent to random 1s and 0s in a file somewhere. Assuming it exists, it could be anything. It may not work at all. Make sure.
- Always back up transaction log archives through the day on critical clusters. They can be used to help a replica catch up if it falls behind, and are an invaluable component of PITR functionality.
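The last guideline relies on the `archive_mode` and `archive_command` settings, where Postgres substitutes `%p` with the path of a finished WAL segment and `%f` with its file name. As a minimal sketch, a small script could do the copying. The archive directory `/mnt/wal_archive` and the script location `/usr/local/bin/archive_wal.sh` are hypothetical, and a production version would also need to worry about flushing files to disk and monitoring failures.

```shell
#!/bin/sh
# Sketch only: a WAL archive script. postgresql.conf would reference it as
#   archive_mode = on
#   archive_command = '/usr/local/bin/archive_wal.sh %p %f'
# Both the script path and the archive directory are hypothetical.
set -eu

ARCHIVE_DIR="${ARCHIVE_DIR:-/mnt/wal_archive}"

# $1 = path of the WAL segment (%p), $2 = its file name (%f)
archive_wal() {
    src="$1"
    dest="$ARCHIVE_DIR/$2"

    # Never overwrite an archived segment. A non-zero status tells
    # Postgres the segment was not archived, so it keeps it and retries.
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        return 1
    fi

    # Copy under a temporary name, then rename, so a partial file
    # never appears under the final segment name.
    cp "$src" "$dest.tmp"
    mv "$dest.tmp" "$dest"
}

# Run only when invoked with the two arguments Postgres supplies.
if [ "$#" -eq 2 ]; then
    archive_wal "$1" "$2"
fi
```

Refusing to overwrite is the important design choice here: if two clusters are ever pointed at the same archive by mistake, the second one fails loudly instead of silently corrupting the first one's history.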
These kinds of procedural rules are not restricted to Postgres, and aren’t even really a complete list. Administering a database architecture is a thankless job that never ends, and there’s always something we’ll miss. The catastrophe at GitLab was a failure of process and architecture, not of a command typed on the wrong server. I’m extremely grateful for their openness regarding the situation, because we can all learn from it.
As an aside, while I highly respect Simon and 2ndQuadrant, Barman isn’t the only game in town for backup management. It seems backing up huge installations was a problem long enough that several developers solved it almost simultaneously. Before Barman had it, pgBackRest was designed around hard links and incremental backups of gargantuan 1TB+ systems. Unfortunately I don’t know how they compare because I haven’t tested them. Why not? Because even I got caught up in the race for a better backup tool.
One of our databases first tipped the 1TB scale around 2013 and
pg_basebackup was consuming inordinate amounts of time. So I threw together a bunch of scripts that leveraged hard links, parallel compression with
pigz, and unleashed it on a replica of our production system. It was twelve times faster on the first iteration thanks to the parallel compression, and over 100 times faster while using hard links to produce incrementals. I eventually rewrote it as a more robust set of Python libraries, but pgBackRest and Barman finally made that last step, rendering my efforts effectively moot.
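The scripts themselves aren't shown here, so what follows is only a sketch of the two general ingredients, not the actual tool. It assumes `rsync --link-dest` for hard-linked incrementals and `tar` piped through `pigz -p` for parallel compression. The paths, snapshot names, job count, and `DRY_RUN` switch are all hypothetical. Either copy must still be taken between `pg_start_backup` and `pg_stop_backup`, or from a cleanly stopped replica, to be restorable.

```shell
#!/bin/sh
# Sketch only: hard-link incrementals and parallel compression.
# All paths and names below are hypothetical.
set -eu

DATADIR="${DATADIR:-/db/pgdata}"
BACKUP_ROOT="${BACKUP_ROOT:-/backup}"
JOBS="${JOBS:-8}"          # compression threads for pigz
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands (the default)

# Hard-link incremental: files unchanged since the previous snapshot
# become hard links into it, so only changed files cost space and IO.
# $1 = previous snapshot (absolute path), $2 = new snapshot
incremental_snapshot() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "rsync -a --delete --link-dest=$1 $DATADIR/ $2/"
    else
        rsync -a --delete --link-dest="$1" "$DATADIR/" "$2/"
    fi
}

# Parallel compression: tar streams the files, pigz compresses the
# stream on several cores. $1 = output archive
# POSIX sh cannot detect a tar failure inside the pipeline; a real
# script should verify the archive afterwards.
compressed_snapshot() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "tar -C $DATADIR -cf - . | pigz -p $JOBS > $1"
    else
        tar -C "$DATADIR" -cf - . | pigz -p "$JOBS" > "$1"
    fi
}

incremental_snapshot "$BACKUP_ROOT/monday" "$BACKUP_ROOT/tuesday"
compressed_snapshot "$BACKUP_ROOT/full.tar.gz"
```

Because Postgres stores tables as a series of segment files, a table that wasn't touched since the previous snapshot costs nothing to "copy" with hard links, which is where most of an incremental's speedup would come from.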
I still use my libraries for now, but the extra management capabilities the other tools supply are somewhat enticing. Want some idea how much better Barman or pgBackRest will perform than relying on
pg_basebackup? We have a 50TB database that grows by 200GB per day, which many might consider slightly cumbersome. My homegrown tool transforms that into a 7TB backup in four hours, while the 24GB nightly incrementals usually require about 20 minutes. We back up a 50TB database cluster in 20 minutes.
And yes, I’ve tested restoring from this backup. Restoring takes a while, but it’s also done in parallel. There’s little that’s more satisfying than watching a system with 32 CPUs pegged at 100% decompressing 50TB worth of data as fast as the underlying filesystem is capable of writing.
But I was only able to do that because I’m intimately familiar with our tools and requirements. Our company process standards revealed limitations in the supplied Postgres backup tooling, prompting me to find or create an alternative. There is no substitute for established protocol, despite the fact building those procedures might not have an immediately obvious benefit. The payoff is avoiding downtime, and that’s something you can’t really plan.
It’s the difference between six hours of lost data, and six seconds.