In light of recent events where GitLab suffered a massive database loss, this is a great opportunity to examine what happened from a Postgres perspective. Since Simon Riggs over at 2ndQuadrant has already chimed in on improvements GitLab might consider in their procedures, maybe we should walk the conversation back slightly.

This isn’t the first time Postgres backup tooling has been misused or misunderstood. The topic of backups hits forums and mailing lists rather frequently. It’s no secret that there’s no official “push button to back up” tool for Postgres. Companies like 2ndQuadrant do their best with offerings like Barman, but this will never stop admins from rolling their own solutions instead of resorting to another external utility.

Part of the reason for this stems from the extract and backup tools Postgres itself provides. After reading the documentation, it’s only natural to encounter pg_dump and pg_basebackup and stop looking. And why not? MySQL has its mysqldump after all and that’s fine, right?

Well, not always. Our first clue comes from the ‘dump’ suffix. By its nature, this means we’re extracting database contents by essentially executing a SELECT statement against each table and waiting for the results. Sometimes this involves compression, and there’s a special format Postgres can use to inline the process, and… none of that matters.

An extract is not a backup. Full stop. Depending on the size of our database, performing a full dump can consume a span of a few seconds to several days. We’re limited by the laws of physics in this case; reading a 50TB database takes time, as does the processing power necessary to compress that output. Restoring it is often much slower, given how we must account for recreating indexes. In these times of gigantic databases, pg_dump should probably be relegated to exporting and importing individual tables unless the instance is truly tiny.
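For that table-level use case, a quick sketch (the database and table names here are placeholders):

```shell
# Export a single table using the custom format, which supports
# compression and selective restore.
pg_dump -d my_db -t big_table -Fc -f big_table.pgdump

# Later, restore just that table into the same or another database.
pg_restore -d my_db big_table.pgdump
```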

In the Old Days(tm), the most common advice was to supply the pg_dump command with the -Fc flags to activate its internal compression and storage format. But tools evolve, and Postgres 9.3 added the -j option so tables can be dumped or restored in parallel. Have a huge machine with 64 CPUs living on really fast SSDs and insist on dumping every table? Do this:

# This works for parallel dumps
pg_dump -Fd -j 48 -d my_db -f /my/backup/dir

# This is for restoring it
pg_restore -j 48 -d my_db /my/backup/dir

This approach scales to surprisingly high levels, provided the underlying storage can keep up. It’s not perfectly linear, but a backup or restore using multiple parallel processes will complete in a fraction of the time. After a couple hundred GB however, even parallel dumps start to exceed standard maintenance windows.

Real database engines require a binary backup tool, and Postgres gained one of these back in version 9.1. There are multiple helpful options to customize its behavior, but its real job is to ignore database contents completely. The pg_basebackup utility can back up a full instance of multiple databases nearly as quickly as the operating system can read the files from disk. A well-equipped SAN, RAID, or SSD usually provides enough IO to back up even a 5TB instance in a couple of hours.

Unfortunately these backups are not compressed by default, and since we lose the ability to perform parallel dumps when using pg_basebackup, compression is single-threaded if we activate it. That means our 5TB cluster will produce a 5TB backup unless we don’t mind a much longer runtime. Ouch. On the other hand, this means the output it produces is an exact physical copy. What else is an exact physical copy of a Postgres instance?

That’s right: a replica. What’s nice is that the Postgres developers know that as well. Want to copy a remote Postgres cluster on another server and have it start streaming from the primary as soon as it’s started? Easy:

pg_basebackup -h primary-server -D /new/datadir -R
pg_ctl -D /new/datadir start

Since we supplied the -R flag, pg_basebackup writes a recovery.conf file that will instruct our new replica to stream from the same server we just fetched the backup from. We can do this as many times as we wish, even if the upstream is another replica. Our only limit is network bandwidth, and we’ve all upgraded to 10Gbps equipment, right?
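For reference, the generated recovery.conf looks roughly like this; the exact connection string depends on the parameters passed to pg_basebackup:

```
standby_mode = 'on'
primary_conninfo = 'host=primary-server port=5432 user=replication_user'
```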

Unfortunately, it’s beyond this point that the process breaks down. What happens if we have a replica that falls behind and needs to be rebuilt? For all of its benefits, pg_basebackup still cannot (currently) skip unchanged files or make small patches where necessary. Relying on it in this case would require erasing the replica and starting from scratch. This is where GitLab really ran into trouble.

Yet we started with synchronized files, didn’t we? Could we use rsync to “catch up”? Yes, but it’s a somewhat convoluted procedure. We would first need to connect to the upstream server and issue a SELECT pg_start_backup('my_backup') command so Postgres knows to archive transaction logs produced during the sync. Then after the sync is completed, we would need to stop the backup with SELECT pg_stop_backup(). Then we would have to make our own recovery.conf file, obtain all of the WAL files the upstream server archived, and so on.
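A rough sketch of that dance, with host names and paths invented for illustration (and glossing over the WAL retrieval step entirely):

```shell
# 1. Put the upstream into backup mode so it retains/archives WAL.
psql -h upstream -U postgres -c "SELECT pg_start_backup('resync', true);"

# 2. Synchronize the data directory, skipping transient files.
rsync -a --delete --exclude=pg_xlog --exclude=postmaster.pid \
      upstream:/db/pgdata/ /db/pgdata/

# 3. Take the upstream out of backup mode.
psql -h upstream -U postgres -c "SELECT pg_stop_backup();"

# 4. Hand-write a recovery.conf so the replica streams on startup,
#    then start it and let it replay the WAL produced meanwhile.
cat > /db/pgdata/recovery.conf <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=upstream user=replication_user'
EOF
pg_ctl -D /db/pgdata start
```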

None of that is something a system administrator will know, and it’s fiddly even for an experienced Postgres DBA. A mistake during any part of that procedure will result in a non-functional or otherwise unsafe replica. All of this is the exact reason software like Barman exists. Supplied utilities only get us so far. For larger or more critical installations, either our custom scripts must flawlessly account for every failure scenario and automate everything, or we defer to someone who already did all of that work.

Even without that, there are some general guidelines to follow:

  1. Don’t use pg_dump for backups. If your database isn’t too large now, it will be later.
  2. Always have a replica handy for critical systems. If a company depends on a database, there should always be at least one online copy. It’s easy to play fast-and-loose here when performing maintenance on a replica, but scripts and CLI snafus happen. Bring up another replica until maintenance is over.
  3. If performing dangerous work on the primary, temporarily decouple at least one replica. I had to learn from experience here that hardware or filesystem modifications can corrupt the entire replication chain. Have an online copy of critical instances, and disable replication during maintenance until work is done.
  4. Have an off-site tertiary replica for disaster recovery purposes. Another datacenter, another city, another state. Just put it somewhere else. Not only is this a supplementary copy for failover purposes or crash recovery, but even regional outages won’t result in lost data or availability.
  5. Perform backup operations on a replica for 24/7 systems. Websites that people expect to never go down for maintenance should be backed up from a replica. Postgres backups are online and non-blocking, but are IO intensive and may substantially decrease query performance. Don’t take the risk.
  6. Fail over to a working replica at least twice per year. Either it happens when nobody expects it, or we manage the process ourselves. The latter scenario is highly preferred and establishes a checklist to apply to the former.
  7. Test backup restores at least twice a year. An untested backup is equivalent to random 1s and 0s in a file somewhere. Assuming it exists, it could be anything. It may not work at all. Make sure.
  8. Always back up transaction log archives throughout the day on critical clusters. They can be used to help a replica catch up if it falls behind, and are an invaluable component of PITR functionality.
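That last guideline only takes a couple of postgresql.conf settings to get started; the archive path here is a placeholder, and a real setup should ship the files to another machine:

```
archive_mode = on
archive_command = 'test ! -f /db/wal_archive/%f && cp %p /db/wal_archive/%f'
```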

These kinds of procedural rules are not restricted to Postgres, and aren’t even really a complete list. Administering a database architecture is a thankless job that never ends, and there’s always something we’ll miss. The catastrophe at GitLab was a failure of process and architecture, not of a command typed on the wrong server. I’m extremely grateful for their openness regarding the situation, because we can all learn from it.

As an aside, while I highly respect Simon and 2ndQuadrant, Barman isn’t the only game in town for backup management. It seems backing up huge installations was a problem long enough that several developers solved it almost simultaneously. Before Barman had it, pgBackRest was designed around hard links and incremental backups of gargantuan 1TB+ systems. Unfortunately I don’t know how they compare because I haven’t tested them. Why not? Because even I got caught up in the race for a better backup tool.

One of our databases first tipped the 1TB scale around 2013 and pg_basebackup was consuming inordinate amounts of time. So I threw together a bunch of scripts that leveraged hard links, parallel compression with pigz, and unleashed it on a replica of our production system. It was twelve times faster on the first iteration thanks to the parallel compression, and over 100 times faster while using hard links to produce incrementals. I eventually rewrote it as a more robust set of Python libraries, but pgBackRest and Barman finally made that last step, rendering my efforts effectively moot.
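The hard-link trick itself is no secret; as a simplified sketch (paths are invented, and my actual scripts did considerably more bookkeeping), rsync can produce an incremental where unchanged files cost almost nothing:

```shell
# Files identical to the previous backup become hard links; only
# changed relation files consume transfer time and new space.
rsync -a --delete --link-dest=/backups/monday \
      replica:/db/pgdata/ /backups/tuesday/
```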

I still use my libraries for now, but the extra management capabilities the other tools supply are somewhat enticing. Want some idea how much better Barman or pgBackRest will perform than relying on pg_basebackup? We have a 50TB database that grows by 200GB per day, which many might consider slightly cumbersome. My homegrown tool transforms that into a 7TB backup in four hours, while the 24GB nightly incrementals usually require about 20 minutes. We back up a 50TB database cluster in 20 minutes.

And yes, I’ve tested restoring from this backup. Restoring takes a while, but it’s also done in parallel. There’s little that’s more satisfying than watching a system with 32 CPUs pegged at 100% decompressing 50TB worth of data as fast as the underlying filesystem is capable of writing.

But I was only able to do that because I’m intimately familiar with our tools and requirements. Our company process standards revealed limitations in the supplied Postgres backup tooling, prompting me to find or create an alternative. There is no substitute for established protocol, even though building those procedures might not have an immediately obvious benefit. The payoff is avoiding downtime, and that’s something you can’t really plan.

It’s the difference between six hours of lost data, and six seconds.

PG Phriday: Getting Back Up

10 thoughts on “PG Phriday: Getting Back Up”

  • Excellent guidelines to follow if you ever want to, you know, actually recover from disaster! Guidelines 3 and 4 aren’t often considered so good call on those. I’ve had good experiences with pgBackrest, it was able to recover a 1.5 TB database in 2 hours, and that was from a remote backup server. I would add that these tools are also capable of checksums for backup integrity, and the ability to resume a backup from where it left off, which is great for very large databases.

  • Great write up. Hosting your DB cluster’s file on ZFS is also another great way to get easy backup snapshots and cloning. Been using it for years on Ubuntu with large data warehouses and it’s a real life (and time) saver. You can even stream your ZFS snapshots over pigz for compressed backups or just to send to a new test replica to test your backups regularly.

  • When it comes to backups, the PostgreSQL community’s efforts are scattered. Instead of making pg_basebackup more capable, isolated pockets of backup scripts get created.

  • Considering that the only method that finally worked for GitLab was based on pg_dump, is it really good general advice to avoid it?

    I mean, as a last resort, let’s say you run pg_dump once a week and store it “offline”, so that you never lose more than a week of work even in a scary worst-case scenario.

    1. Hmm, create a database extract with a tool and store it offsite eh? Yeah, we call that a backup. 🙂 In which case, it doesn’t matter what tool produced it, or whether it’s binary or raw text like a SQL dump. Just so long as there’s consistency and they’ve tested their process.

      The pg_basebackup tool would have worked, provided they used it properly. It won’t back up into a non-empty directory because that would mean it is destroying an existing set of data. It tells the remote system to run a checkpoint and commit pending data to disk so it gets fresh files and has a stable starting point in the transaction logs. Sometimes that sync takes a while, so there’s a “fast” option to force it to flush to disk as fast as possible instead of write spreading.
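In command form, that combination (an empty target directory plus the fast checkpoint; the host and path are placeholders):

```shell
# --checkpoint=fast forces an immediate checkpoint instead of a
# spread one, so the backup can begin right away.
pg_basebackup -h primary-server -D /empty/new/dir --checkpoint=fast -P
```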

      And who says pg_dump would always work? Because of its lack of speed, it can fail for any number of reasons before it finishes. What if they didn’t know what the custom format did, and just did a regular SQL dump and filled their disk? The point is they didn’t know the tools they were using, and kept using them anyway. They got lucky that pg_dump worked for them in that particular case. It’s good to have multiple redundancies, yet I will maintain the “it will be” caveat regarding excessive size. It’s not indicative, but I would never ever try to use pg_dump to extract our 50TB database because it would literally take days, even with the parallel options.

      There’s a reason pg_upgrade exists, and it’s not because admins enjoyed taking the cluster down for half a day to perform a dump/restore cycle between Postgres versions. Binary is better. Native formats are faster. The pg_dump tool is great for slurping a few tables or a whole schema, or grabbing the database object definitions, but it’s not a backup. Even MySQL users basically switch to Percona’s XtraBackup when they start to get serious. Extract dumps don’t scale.

      1. I just read the section on pg_receivexlog.

        A QQ: How does it know where to start from? Suppose we have set up this (taken from your book):

        pg_receivexlog -h 192.168.56.10 -U rep_user \
            -D /db/pg_archived -v --synchronous \
            &> /db/pg_archived/wal_archive.log &

        So let us say, for some reason, the pg_receivexlog process dies. When it is restarted later on, how does it know where to start? Does it detect the last known position by examining the files in the /db/pg_archived directory?

        Thanks

        1. Versions before 9.6 don’t know. For those older versions, the pg_receivexlog utility just starts streaming at the current location of the transaction log. Version 9.6 added the ‘--slot’ parameter, which makes use of Postgres replication slots. When using a replication slot, Postgres will refrain from removing a transaction log until it’s been acknowledged by the user of that slot. So if it crashes and is restarted several hours (or even days) later, it should pick up where it left off.
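As a sketch for 9.6 and later (the host and user are carried over from the question above; the slot name is invented for the example):

```shell
# Create a physical replication slot on the upstream, one time only.
psql -h 192.168.56.10 -U rep_user \
     -c "SELECT pg_create_physical_replication_slot('wal_archive');"

# Stream WAL through the slot; if this process dies, the upstream
# retains segments until we reconnect and resume where we left off.
pg_receivexlog -h 192.168.56.10 -U rep_user --slot=wal_archive \
    -D /db/pg_archived -v --synchronous
```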
