Having run into a bit of a snag with Postgres-XL, and not wanting to be dead in the water with our project, I went on a bit of a knowledge quest. Database scaling is hard, so I expected a bunch of either abandoned or proprietary approaches. In addition, as a huge fans of Postgres, compatibility or outright use of the Postgres core was a strict prerequisite.
So, what options are out there? Is there even anything worth further investigation? Maybe more importantly, what do you do when you’re under a bit of a scheduling constraint? Projects need to move forward after all, and regardless of preferences, sometimes concessions are necessary. The first step was obviously the list of databases derived from Postgres.
At first glance, that’s a pretty big list. If we look carefully though, we can see that quite a few of those projects were abandoned years ago. Others are commercial, not based on scalability, or both. Being commercial isn’t automatically a disqualification, but most of the commercial options were forked from practically ancient versions of Postgres and never kept up compatibility, or don’t mention the version at all. Amazon Redshift fits that profile, being based on Postgres 8.0, which few would want to use these days. Fujitsu Enterprise is another, which doesn’t even list which version they’re based on, nor do they provide a download for testing purposes.
What’s left? It’s hard to tell from the Wiki page, so I just started with the projects that include some kind of data scaling not based on replication. These candidates present a longer list than I’d initially anticipated, which is always a good problem to have!
Let’s scrutinize the nominees.
It’s not really a secret that CitusDB and Postgres-XL are both tackling the same problem, and are currently the top two contenders. Unlike Postgres-XL and its approach of extending SQL syntax to directly embrace data distribution, CitusDB is actually just a Postgres extension.
As a result, it’s a bit more janky. There’s no
CREATE TABLE ... DISTRIBUTE BY magic. Instead, and like other Postgres extensions, we call functions to populate metadata and control CitusDB operation. Fortunately, the CitusDB documentation is amazingly thorough. Not only are there downloads with cut-and-paste tutorials for evaluation, they also provide Docker containers with fully operational test clusters.
I also have to admit that I originally wrote this off entirely, based on my experience with pg_shard last year. I opened a ticket complaining about missing transaction support, and they never closed it. I didn’t realize that was because
pg_shard was merged into the new Citus extension, along with a veritable raft of other modifications and fixes.
My easy tutorial install didn’t exhibit the problems I had with
pg_shard, so this warrants deeper testing. I’ve got VMs galore, and itchy scripting fingers.
Well, this comes with a giant caveat. If we look at the dates listed in the Wiki, Greenplum is listed as starting in 2005. They’re not kidding, and unfortunately it seems Pivotal executed a “fork it and forget it” maneuver. The documentation admits Greenplum is based on Postgres 8.2, with elements of functionality from 8.3.
Like Amazon’s Redshift, this immediately disqualifies Greenplum from consideration for anyone using a newer version. Our own databases are on 9.4 pending an upgrade plan; there’s no way we could justify such a massive downgrade, even for horizontal scaling improvements. EnterpriseDB had a similar problem when they started selling their version of 8.3; they were far behind for years before they managed to reduce their version lag by only a few months. Greenplum never even bothered.
This may be an amazing product, but we can’t use it to replace existing Postgres 9.4 databases that need scaling. Will Greenplum catch up now that it’s been open-sourced? I can’t say. It would definitely be cool, but I’m not holding my breath. Incidentally, this is one of the reasons all of those projects on the Wiki have definitive end dates. Keeping up with Postgres after forking is extremely difficult if you don’t merge your enhancements back into core. It’s all too easy to fall hopelessly behind and become nothing but an academic concern.
Hadoop is the… uh, “new” kid on the block regarding big data. It’s designed to leverage multiple systems or VMs to spread storage mining, which would be a great way to address a VLDB system. So in 2009, a university project spawned HadoopDB to turn Postgres into a SQL interface and aggregator for Hadoop.
Sadly, that’s where the story ends. The Postgres Wiki says it’s still active, but for all intents and purposes, it has been abandoned. The quick start guide hasn’t been updated since 2009, and Thomas Koch did a presentation as late as 2011 denoting it as an effective proof of concept, but not much else.
In the end, it’s a really cool “glue” between Hadoop and Postgres. But without updates to enhance the interlinks, speed, efficiency, and bugs, it’s not suitable for a production environment. The project lived on in Hadapt before being acquired by Teradata and renamed to presto. That means there’s some potential to contact Teradata and make an existing Hadoop datastore more user friendly. The job of converting an existing Postgres cluster to a Hadoop equivalent is left as an (onerous) exercise for the user.
The Postgres-X2 project is a bit of a conundrum. Unlike Postgres-XL which is active and backed by 2ndQuadrant, Postgres-X2 seems to be a direct continuation of the abandoned Postgres-XC codebase. As a result, they’re still stuck on Postgres 9.3. Further, they likely have similar issues as we encountered with Postgres-XL, or worse due to the project’s stagnancy. After exploring the github repository, it turns out the last update to any code was two years ago.
Maybe there’s another repository elsewhere, but this project should be considered dead unless they pull a Holy Grail and claim otherwise.
I first encountered Stado back when it was GridSQL, one of many EnterpriseDB projects. It works by abstracting several database instances through a port proxy, distributing writes and reads arbitrarily based on its own internal magic. It uses Postgres as a filesystem of sorts, and connecting to Postgres directly reveals this in greater detail. Object names are the result of hash functions, and even databases are only tangentially linked to the desired given nomenclature.
Stado is all about metadata, and Postgres is its chosen host. Because I had experience with a previous incarnation, I made an exception and considered it undead for testing purposes, even though the Wiki says it died in 2011. It’s just a Java proxy after all, so what could it hurt to see if it still works with recent Postgres releases?
As it turns out, it can hurt a lot. It seems my memory of GridSQL was a little hazy, as what’s going on here isn’t merely a distribution proxy. It’s transformative and extremely restrictive, throwing up errors for “unknown” keywords such as
SCHEMA. No schema support means there’s no way we can use it, which is something of a shame. The performance metrics were encouraging back in the day, and the concept it represents is sound.
Consider the PMPP extension, for example. When I looked at it late last year, I loved the simplicity. Take a query, broadcast it to every known Postgres node, and present the results. Wrap the function in another query, and it can be re-aggregated to merge everything together. I was hoping Stado did this automatically, and that was all. Nope. Oh well.
If someone could write something that worked like I thought Stado did, I’d probably kiss them. It would require manipulating the Postgres query planner or a proxy of some kind, but that’s all I really want. Send a query to multiple nodes, let them execute it independently in parallel, keep track of aggregate functions used, and apply them to the appropriate columns in the final result. It seems so simple, but the underlying complexity is clearly more involved.
The thing is, large warehouse databases usually contain data that’s already been in another source. Primary key collisions are unlikely, as some kind of engine (Postgres-XL, ETL, etc.) has already distributed data according to some kind of hash function. I just want a query that can invoke the cluster in parallel. That’s all. Extensions like PMPP do half of the job, but short of rewriting existing applications to leverage it properly, it’s only really usable for new projects.
So I’ll probably be looking into CitusDB a bit more. It seems to work the way I want, and adds shard redundancy as an extra benefit. I’ll put it on some VMs and unceremoniously thrash it after dumping hundreds of GB into its lap and record the ensuing chaos. Hopefully these tests go better than when I subjected
pg_shard to the same treatment.
Otherwise, the state of Postgres scaling is incomplete, and there are no projects I know of that will suit our requirements. As a Postgres DBA, I probably try too hard to use it as a hammer on every nail, but it’s just so close as to be immensely frustrating.
Wish me luck!