The pure, unadulterated, presumptuous impudence of NoSQL. Engines like MongoDB recklessly discard concepts like ACID in some futile quest to achieve “web scale”, and end up accomplishing neither. RDBMS systems have literally decades of history to draw upon, and have long since conquered the pitfalls NoSQL platforms are just now encountering. There may be something to a couple of them, but by and large, they’re nothing we really need.
At least, that’s something I might have said a couple of weeks ago.
Now that we’ve decided to really start embracing horizontal scaling builds, there is a critically important engine-agnostic element we need to examine. Given an existing table, how exactly should we split up the contents across our various nodes during the…
Having run into a [intlink id='pg-phriday-growing-pains']bit of a snag[/intlink] with Postgres-XL, and not wanting to be dead in the water with our project, I went on a bit of a knowledge quest. Database scaling is hard, so I expected a bunch of either abandoned or proprietary approaches. In addition, as a huge fans of Postgres, compatibility or outright use of the Postgres core was a strict prerequisite.
So, what options are out there? Is there even anything worth further investigation? Maybe more importantly, what do you do when you’re under a bit of a scheduling constraint? Projects need to move forward after all, and regardless of preferences, sometimes concessions are necessary. The first step was obviously the list of databases derived from Postgres.
Postgres is a great tool for most databases. Larger installations however, pretty much require horizontal scaling; addressing multi-TB tables relies on multiple parallel storage streams thanks to the laws of physics. It’s how all immense data stores work, and for a long time, Postgres really had no equivalent that wasn’t a home-grown shard management wrapper. To that end, we’ve been considering Postgres-XL as a way to fill that role. At first, everything was going well. Performance tests showed huge improvements, initial deployments uncovered no outright incompatibilities, and the conversion was underway.
Then we started to use it.
I’ve maintained since about 2011, that the problem with scaling Postgres to truly immense installations could be solved by a query coordinator. Why? Most sharding systems utilize an application-level distribution mechanism, which may or may not leverage an inherent hashing algorithm. This means each Postgres instance can be treated independently of all the others if the distribution process is known. On a cleverly architected system, the application is algorithm aware, and can address individual shards through a driver proxy or accessor class.