Having run into a bit of a snag with Postgres-XL, and not wanting to be dead in the water with our project, I went on a bit of a knowledge quest. Database scaling is hard, so I expected a bunch of either abandoned or proprietary approaches. In addition, since we’re huge fans of Postgres, compatibility with, or outright use of, the Postgres core was a strict prerequisite.

So, what options are out there? Is there even anything worth further investigation? Maybe more importantly, what do you do when you’re under a bit of a scheduling constraint? Projects need to move forward after all, and regardless of preferences, sometimes concessions are necessary. The first step was obviously the list of databases derived from Postgres.

At first glance, that’s a pretty big list. If we look carefully though, we can see that quite a few of those projects were abandoned years ago. Others are commercial, not focused on scalability, or both. Being commercial isn’t automatically a disqualification, but most of the commercial options were forked from practically ancient versions of Postgres and never kept up with compatibility, or don’t mention the version at all. Amazon Redshift fits that profile, being based on Postgres 8.0, which few would want to use these days. Fujitsu Enterprise is another; it doesn’t even list which version it’s based on, nor does it provide a download for testing purposes.

What’s left? It’s hard to tell from the Wiki page, so I just started with the projects that include some kind of data scaling not based on replication. These candidates present a longer list than I’d initially anticipated, which is always a good problem to have!

Let’s scrutinize the nominees.

CitusDB

It’s not really a secret that CitusDB and Postgres-XL are both tackling the same problem, and are currently the top two contenders. Unlike Postgres-XL and its approach of extending SQL syntax to directly embrace data distribution, CitusDB is actually just a Postgres extension.

As a result, it’s a bit more janky. There’s no CREATE TABLE ... DISTRIBUTE BY magic. Instead, as with other Postgres extensions, we call functions to populate metadata and control CitusDB’s operation. Fortunately, the CitusDB documentation is amazingly thorough. Not only are there downloads with cut-and-paste tutorials for evaluation, they also provide Docker containers with fully operational test clusters.
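
As a rough sketch of what that looks like, here’s the function-call approach using the names from the Citus 5.x tutorials; the table itself is invented for illustration, so treat the details as approximate rather than authoritative:

    -- A perfectly ordinary table definition; no special DDL keywords needed.
    CREATE TABLE sensor_log (
        id            BIGINT NOT NULL,
        location      VARCHAR NOT NULL,
        reading       BIGINT NOT NULL,
        reading_date  TIMESTAMP NOT NULL
    );

    -- Distribution is configured by calling extension functions instead.
    -- This marks the table as hash-distributed on "id", then creates
    -- 16 shards with one replica each across the worker nodes.
    SELECT master_create_distributed_table('sensor_log', 'id', 'hash');
    SELECT master_create_worker_shards('sensor_log', 16, 1);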

I also have to admit that I originally wrote this off entirely, based on my experience with pg_shard last year. I opened a ticket complaining about missing transaction support, and they never closed it. I didn’t realize that was because pg_shard was merged into the new Citus extension, along with a veritable raft of other modifications and fixes.

My easy tutorial install didn’t exhibit the problems I had with pg_shard, so this warrants deeper testing. I’ve got VMs galore, and itchy scripting fingers.

Greenplum

I’ve known about Greenplum for a long time. There was a lot of excitement when Pivotal announced that they were opening the source. A parallel scaling Postgres? No way!

Well, this comes with a giant caveat. The Wiki lists Greenplum as starting in 2005. They’re not kidding, and unfortunately it seems Pivotal executed a “fork it and forget it” maneuver. The documentation admits Greenplum is based on Postgres 8.2, with elements of functionality from 8.3.

Like Amazon’s Redshift, this immediately disqualifies Greenplum from consideration for anyone using a newer version. Our own databases are on 9.4 pending an upgrade plan; there’s no way we could justify such a massive downgrade, even for horizontal scaling improvements. EnterpriseDB had a similar problem when they started selling their version of 8.3; they were far behind for years before they managed to reduce their version lag to only a few months. Greenplum never even bothered.

This may be an amazing product, but we can’t use it to replace existing Postgres 9.4 databases that need scaling. Will Greenplum catch up now that it’s been open-sourced? I can’t say. It would definitely be cool, but I’m not holding my breath. Incidentally, this is one of the reasons all of those projects on the Wiki have definitive end dates. Keeping up with Postgres after forking is extremely difficult if you don’t merge your enhancements back into core. It’s all too easy to fall hopelessly behind and become nothing but an academic concern.

HadoopDB

Hadoop is the… uh, “new” kid on the block when it comes to big data. It’s designed to leverage multiple systems or VMs to spread out storage and data mining, which would be a great way to address a VLDB system. So in 2009, a university project spawned HadoopDB to turn Postgres into a SQL interface and aggregator for Hadoop.

Sadly, that’s where the story ends. The Postgres Wiki says it’s still active, but for all intents and purposes, it has been abandoned. The quick start guide hasn’t been updated since 2009, and Thomas Koch gave a presentation as late as 2011 describing it as an effective proof of concept, but not much else.

In the end, it’s a really cool “glue” between Hadoop and Postgres. But without updates to improve the interlinks, speed, and efficiency, and to fix bugs, it’s not suitable for a production environment. The project lived on in Hadapt before being acquired by Teradata and renamed to Presto. That means there’s some potential to contact Teradata about making an existing Hadoop datastore more user friendly. The job of converting an existing Postgres cluster to a Hadoop equivalent is left as an (onerous) exercise for the user.

Postgres-X2

The Postgres-X2 project is a bit of a conundrum. Unlike Postgres-XL, which is active and backed by 2ndQuadrant, Postgres-X2 seems to be a direct continuation of the abandoned Postgres-XC codebase. As a result, it’s still stuck on Postgres 9.3. Further, it likely has issues similar to those we encountered with Postgres-XL, or worse given the project’s stagnancy. A look through the GitHub repository shows that the last update to any code was two years ago.

Maybe there’s another repository elsewhere, but this project should be considered dead unless they pull a Holy Grail and claim otherwise.

Stado

I first encountered Stado back when it was GridSQL, one of many EnterpriseDB projects. It works by abstracting several database instances behind a port proxy, distributing writes and reads arbitrarily based on its own internal magic. It uses Postgres as a filesystem of sorts, and connecting to Postgres directly reveals this in greater detail. Object names are the output of hash functions, and even databases are only tangentially linked to the names they were originally given.

Stado is all about metadata, and Postgres is its chosen host. Because I had experience with a previous incarnation, I made an exception and considered it undead for testing purposes, even though the Wiki says it died in 2011. It’s just a Java proxy after all, so what could it hurt to see if it still works with recent Postgres releases?

As it turns out, it can hurt a lot. It seems my memory of GridSQL was a little hazy, as what’s going on here isn’t merely a distribution proxy. It’s transformative and extremely restrictive, throwing up errors for “unknown” keywords such as SCHEMA. No schema support means there’s no way we can use it, which is something of a shame. The performance metrics were encouraging back in the day, and the concept it represents is sound.

Consider the PMPP extension, for example. When I looked at it late last year, I loved the simplicity. Take a query, broadcast it to every known Postgres node, and present the results. Wrap the function in another query, and it can be re-aggregated to merge everything together. I was hoping Stado simply did this automatically, and nothing more. Nope. Oh well.

The Winner?

If someone could write something that worked like I thought Stado did, I’d probably kiss them. It would require manipulating the Postgres query planner or a proxy of some kind, but that’s all I really want. Send a query to multiple nodes, let them execute it independently in parallel, keep track of aggregate functions used, and apply them to the appropriate columns in the final result. It seems so simple, but the underlying machinery is clearly more involved.
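
To make the wish concrete, here’s a minimal sketch of that scatter/gather pattern done by hand with the stock dblink extension. The node connection strings and the table are invented, and this is exactly the bookkeeping I’d rather a tool handled for me:

    -- Hypothetical two-node scatter/gather using dblink.
    CREATE EXTENSION IF NOT EXISTS dblink;

    SELECT dblink_connect('node1', 'host=10.0.0.1 dbname=warehouse');
    SELECT dblink_connect('node2', 'host=10.0.0.2 dbname=warehouse');

    -- Scatter: send the same aggregate query to every node asynchronously
    -- so they all run in parallel.
    SELECT dblink_send_query('node1',
        'SELECT location, count(*), sum(reading) FROM sensor_log GROUP BY 1');
    SELECT dblink_send_query('node2',
        'SELECT location, count(*), sum(reading) FROM sensor_log GROUP BY 1');

    -- Gather: collect the partial results and merge them. Note that the
    -- outer query has to know a partial count() merges via sum(), which is
    -- precisely the aggregate tracking I want automated.
    SELECT location,
           sum(partial_count) AS total_count,
           sum(partial_sum) AS total_sum
      FROM (
            SELECT * FROM dblink_get_result('node1')
                AS t (location VARCHAR, partial_count BIGINT, partial_sum BIGINT)
            UNION ALL
            SELECT * FROM dblink_get_result('node2')
                AS t (location VARCHAR, partial_count BIGINT, partial_sum BIGINT)
           ) parts
     GROUP BY location;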

The thing is, large warehouse databases usually contain data that originated in some other source system. Primary key collisions are unlikely, as some kind of engine (Postgres-XL, an ETL process, etc.) has already distributed the data according to a hash function. I just want a query that can invoke the cluster in parallel. That’s all. Extensions like PMPP do half of the job, but short of rewriting existing applications to leverage it properly, it’s only really usable for new projects.
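
For illustration, the kind of hash distribution I mean is nothing fancier than routing each row by a hash of its key, as in this throwaway example (the four-node count is arbitrary):

    -- Assign each primary key to one of four hypothetical nodes using
    -- the built-in hashtext() function.
    SELECT id, abs(hashtext(id::text)) % 4 AS node_id
      FROM generate_series(1, 10) AS id;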

So I’ll probably be looking into CitusDB a bit more. It seems to work the way I want, and adds shard redundancy as an extra benefit. I’ll put it on some VMs and unceremoniously thrash it after dumping hundreds of GB into its lap and record the ensuing chaos. Hopefully these tests go better than when I subjected pg_shard to the same treatment.

Otherwise, the state of Postgres scaling is incomplete, and there are no projects I know of that will suit our requirements. As a Postgres DBA, I probably try too hard to use it as a hammer on every nail, but it’s just so close that it’s immensely frustrating.

Wish me luck!


13 thoughts on “PG Phriday: Derivation Deluge”

  • “The documentation admits Greenplum is based on Postgres 8.2, with elements of functionality from 8.3.”

    Which documentation? I’ve had the same nagging doubt for a long time, but want to confirm. The GP DBAs at my place are claiming that it is based on PG 9.4.

    Thanks.

    1. It was forked from 8.2.15, but it has been developed in parallel since then. Almost all 9.5 queries also work on Greenplum, but the plpgsql language is stuck at 8.2, so there’s no ‘RETURNS TABLE’ or other nifty syntactic sugar from more modern versions. Also, it doesn’t have JSON/JSONB, range types, or similar modern types. They are actively working on merging upstream PostgreSQL and then keeping the two in sync, but I wouldn’t hold my breath waiting for a stable release.

  • CitusDB is great for what it does, but it is a highly specialized extension/fork that only works if you specifically design your schema and query patterns around CitusDB from the start. It is not a generic solution for making an existing database distributed. When distributed tables are involved, it doesn’t support, for example, transactions, window functions, left/right joins, or volatile functions; only fairly simple queries with equality joins.

    1. Darn, it would appear I jumped too soon. My earlier tests showed no transaction support at all. It looks like they’ve made it slightly better and at least support transactions on their own tables, but a JOIN against a regular table breaks it. Schema support is also broken, LEFT JOIN isn’t supported, and neither are UNIQUE constraints. We can get around some of that, but not all of it.

      Whelp, back to the drawing board.

    2. Hi,

      Citus is open source and we are actively working to make it better. In this process, your feedback is the most valuable input to focus our efforts. Just drop by our Slack Channel and tell us about your experience.

      Currently, Citus is geared towards a high rate of fast, simple statements, high-throughput bulk loading (with the new COPY FROM support), and a high rate of sub-second analytical queries. When switching to Citus, rethinking the queries might be necessary, though.

      We will release 5.1 soon, and that will bring new features and performance improvements, e.g., COPY FROM. For 5.2, we are focusing on usability, and we are evaluating transaction support.

  • Another great article! You’ve saved me many days/weeks of research on this topic (scaling). At 43TB, your db IS big… I’m curious how you’re handling backups/redundancy.

    I’m starting the design/build of the ETL that will bring 30TB (30B rows) of genomics data (versus warehouse data) currently stored in 200 MySQL dbs into its new home in a single Postgres 9.5 db.

    My concern is your reality – when the next big customer hits, will I be able to scale up quickly and reliably without compromising all that makes 9.5 so good? I’ll be following your travels on this topic, and thanks for doing such good work.

    1. We do our backups through a Python app I wrote about four years ago. It works very similarly to pgBackRest, which didn’t exist when I wrote it. The secret sauce is that it uses hard links for unchanged files (like Time Machine). The full 40TB of files doesn’t change often, so the differentials are much smaller. My app uses parallel gzip instead of parallel compression threads, but the end result is the same. The full 40TB backup becomes 6TB, and the diffs are 30-50GB, which is much easier to handle.

    2. One of the big caveats (and there are many) with Greenplum is that when you want to scale up, you have to stop writing (to any of the currently affected tables, I think) and completely redistribute the data.

      There are other major pain points, but I wouldn’t reject it purely on language features.

      We currently have a larger dataset, but EMC/Pivotal say there are even larger ones out there.

      You can have a lot of data, but unless you fit their particular use case, it’s going to be a dog, and a not particularly stable dog at that.

      Also, the hardware costs (even if you risk going down the non-EMC route) can be significant. In a DCAv2, each server has 24 x 300GB disks, which provide up to 1.9TB of usable storage (after RAID, halving it for primaries & mirrors, and keeping below 70% usage). You can see that for 43TB you’ll need a 24-node cluster (+ 2 for master/standby + ETL, etc.) and at least 2×32-port 10GigE switches.

      Would I choose Greenplum DB again? I don’t know.

  • Just a quick note… if you’re looking into this for analytics, you need to take a look at http://madlib.apache.org/. It exists to make it easy to push analytics computations into Postgres. That sounds counter-intuitive to scaling, but a lot of these problems are actually data problems, so not moving the data any more than necessary can make a huge difference. It also makes it easy to materialize intermediate results so that the analytics team doesn’t have to go back to the baseline data as often.

  • Hi, three years ago I worked with Netezza. Perhaps you could look into it. I know it was based on Postgres, but which version? Éric.
