I always advocate breaking up large Postgres tables for a few reasons. Beyond query performance concerns, maintaining one monolithic structure is always more time consuming and consequentially more dangerous. The time required to create a dozen small indexes may be slightly longer than a single larger one, but we can treat the smaller indexes as incremental. If we want to rebuild, add more indexes, or fix any corruption, why advocate an all-or-nothing proposition? Deleting from one large table will be positively glacial compared to simply dropping an entire expired partition. The list just goes on and on.
Items in Category: Tech Talk
The Postgres developers recently announced the availability of the first public beta for Postgres 9.6. I would be highly remiss to ignore such an opportunity to dig into any interesting functionality listed in the 9.6 release notes. All in all, it’s a pretty exciting series of advancements, and assuming this is a glimpse of what we see when 9.6 drops, I’d say we’re on the right track.
Let’s just get the obvious out of the way early: dealing with multiple Terabytes or Petabytes in a database context is something of a nightmare. Distributing it, retrieving it, processing it, aggregating and reporting on it, are all complicated—and perhaps worst of all—non-intuitive. Everything from tooling and maintenance, to usage and input, are either ad-hoc or obfuscated by several special-purpose APIs and wrappers.
One of the reasons a self-scaling database is such a killer app, derives from the failure rate from having so many moving parts. A proxy here, a shard API there, a few design modifications to accommodate implementation quirks, and suddenly we have a fragile, janky golem. A lumbering monstrosity that our applications and data depend on, where modifying or replacing any part of it will mean redistributing all the data at the very least.
Having run into a [intlink id='pg-phriday-growing-pains']bit of a snag[/intlink] with Postgres-XL, and not wanting to be dead in the water with our project, I went on a bit of a knowledge quest. Database scaling is hard, so I expected a bunch of either abandoned or proprietary approaches. In addition, as a huge fans of Postgres, compatibility or outright use of the Postgres core was a strict prerequisite.
So, what options are out there? Is there even anything worth further investigation? Maybe more importantly, what do you do when you’re under a bit of a scheduling constraint? Projects need to move forward after all, and regardless of preferences, sometimes concessions are necessary. The first step was obviously the list of databases derived from Postgres.
Postgres is a great tool for most databases. Larger installations however, pretty much require horizontal scaling; addressing multi-TB tables relies on multiple parallel storage streams thanks to the laws of physics. It’s how all immense data stores work, and for a long time, Postgres really had no equivalent that wasn’t a home-grown shard management wrapper. To that end, we’ve been considering Postgres-XL as a way to fill that role. At first, everything was going well. Performance tests showed huge improvements, initial deployments uncovered no outright incompatibilities, and the conversion was underway.
Then we started to use it.