The pure, unadulterated, presumptuous impudence of NoSQL. Engines like MongoDB recklessly discard concepts like ACID in some futile quest to achieve “web scale”, and end up accomplishing neither. RDBMS systems have literally decades of history to draw upon, and have long since conquered the pitfalls NoSQL platforms are just now encountering. There may be something to a couple of them, but by and large, they’re nothing we really need.

At least, that’s something I might have said a couple of weeks ago.

Now, I’ve never really ascribed to the supposition that NoSQL is the strict antithesis of traditional SQL-driven systems. The intro paragraph is something of a hyperbolic exaggeration of the general disdain that seems to permeate the NoSQL vs SQL debate. Most DBAs I’ve met basically say “meh” and move on with their day, optimizing queries, tweaking storage metrics, and what not. Usually NoSQL is on our RADAR, but we have other stuff to worry about.

Last week, I was forced to reconsider due to an existing internal project that needed my input. That analysis really showed me why devs and data guys from all walks of life are seduced to the Dark Side of data. After another week slogging through documentation, experimenting with self-healing replication sets, self-balancing sharding, and taking copious notes on everything, something finally clicked. The Dark Side, it seems, has some merit beyond simplifying data structure.


So what can Postgres potentially learn from its eternal adversaries? Beyond stealing their encoding mechanisms via JSON and JSONB datatypes, that is. Though my analysis thus far has focused mostly on MongoDB, it alone incorporates several fundamental concepts that I’ve always felt Postgres lacked. For now, I’ll focus on the two that have—in my opinion, anyway—really prevented Postgres from reaching its full potential.

Break Me

Postgres has replication, but no meaningful application beyond having a hot standby or a few read slaves. Why no master election consensus system like RAFT? Last year, I came across an excellent article that discussed implementing it with HAProxy, Governor, and etcd instead of the more historic (and infinitely more complicated) Pacemaker stack. Someone even forked Governer into Patroni to incorporate alternative consensus approaches. Still, that’s a lot of external dependencies.

Postgres has all of the parts to handle this itself, it just doesn’t. It has the replication stream. It knows among all replica nodes which has the highest transaction id. It has a process for cloning an existing instance in pg_basebackup. If an old primary node needs to become a replica, it can either do it directly, or use pg_rewind if necessary. Yet these fragments must be combined manually. Some use scripts, others use Pacemaker or some variant of the Governer approach above, but it’s much easier for a new project to simply chose another platform.

The specific issue at hand, is that Postgres elects to consider itself an autonomous process. In this view, a database is a self-contained data management structure, where everything goes into a single extremely high integrity bucket. As a consequence, Postgres has no concept of a cluster at all, except perhaps a slight awareness of currently streaming replicas.

MongoDB takes itself much less seriously. In the MongoDB world, a mongod instance is a bucket that will accept any and all content, and do a passable job of protecting it. As such, it’s much less concerned with spillage, giving it greater leverage for passing that bucket around. The authors of MongoDB clearly wanted to focus on data availability rather than integrity. To get that, they settled on a consensus system that some might argue is inadequate when combined with their data writing strategies.

Despite that, it works. MongoDB is a concept beyond mere data storage. Where Postgres is becoming more of a storage middleware through additions like extensions and foreign data wrappers, MongoDB is “where your data goes.” That’s a very alluring attitude, and a subtle shift in focus few (if any) mainstream RDBMS have adopted. First and foremost, engines like MongoDB are a cluster, incorporating multiple nodes that interchangeably share roles but present a united face to the world. Postgres has no equivalent.

But it could! Imagine a universe where we have the integrity guarantee of an ACID database, with a Mongo-like front-end that manages the nodes such that our bulletproof data is always available. Speaking of front ends…

Into Pieces

Shards. I have 50TB of data, and I have no Earthly idea what to do with it. I’d love to blow it up into a trillion little bits, but Postgres gives me no easy way to do that. Here’s the second place MongoDB trumps Postgres in its current incarnation. It’s not exactly a cake-walk, considering I drew up this chart while compiling my notes:

Mongo Cluster

But all the tools are there. Anyone with a dozen servers or VMs can build a scalable data store without writing their own distribution implementation. That’s hard to ignore.

For Postgres, it really boils down to the citus extension and Postgres-XL. Unfortunately citus has severe limitations with joins including non-distributed tables. Likewise, Postgres-XL has been trying to merge the scaling code into the 9.5 branch for a year now. Despite the increased visibility and presumed quality of 2ndQuadrant’s work, there are a couple of orders of magnitude fewer eyes on that code. Even assuming they manage to merge everything before 9.6 drops, will they be able to keep up with core afterwards? Citus has its limitations, but in the end, it is an extension anyone can install. Anything that isn’t an extension risks falling woefully behind or being abandoned.

Is there another way? If we treated Postgres backends like dumb containers the way MongoDB does, things suddenly change quite a bit. MongoDB has a process called mongos which is basically just a glorified proxy that manages shard metadata and forks out a balancing job that ushers data around to ensure shard content isn’t lopsided. That’s the “Manager / Balancer” in my chart. What if it also included a transaction manager?

Postgres has supported background workers since 9.3. In theory, it could adopt management of shard metadata, chunk migration, or global transaction roles where appropriate. This is essentially what Postgres-XL (and similar) is doing. Where MongoDB wrote a glue process to bring everything together, Postgres-XL opted to directly patch the Postgres source, and all the pain that entails.

I’m in danger of being too dismissive of implementation details, of course. Yet there are massive potential benefits to taking a cluster approach by default. Such a distribution proxy could be database agnostic, compatible with any engine that speaks SQL. Consider too that Postgres foreign data wrappers imply it could be the communication layer.

Last Thoughts

Postgres is tantalizingly close to being a cluster system simply through the communication protocols it provides. It just needs that final piece. I’m fully convinced that someone with the right skillset (definitely not me) could whip up a proof of concept for this in a few days. This is, after all, how most application-level sharding approaches work anyway. In the end, that’s all MongoDB, and several other NoSQL solutions ultimately provide. Take one part integrated RAFT process to always present a writable primary, and mix in a data distribution and balancing proxy. Done.

If they so desired, the MongoDB developers could theoretically even use Postgres as a back-end storage mechanism. In fact, I highly encourage them to do so! Given a natural shard split and extensions like cstore_fdw, suddenly NoSQL becomes a scalable column-store.

Barring that, corporate use of Postgres will be necessarily limited. There’s only so much you can do in a single instance, and without integral scaling features, it becomes migration time. I don’t want to tell our devs to convert all of their ORMs and queries to JSON, give up easy joins, or forgo NULL handling, but I will. I have to. This 50TB is only growing every day, and without a straight-forward and reliable migration path, we need to retool.

I disagree with too many of the data compromises MongoDB makes to use it for important data, but the availability approach is fundamentally sound. Mongo’s implementation is refreshing and innovative, and one which many NoSQL engines appear to share. Is it any wonder there has been a silent transition to these systems, when everything has several billion rows, and applications fully expect to reach well beyond Terabytes of storage into Petabyte territory?

No single instance can absorb that kind of volume, and not everyone has time or inclination to work around that limitation. We need a common middle-ground with the NoSQL world. We have the better storage engine, and they have the better distribution and availability mechanisms. Can’t there be a place where both of those things work in tandem?

I, for one, welcome our cluster overlords.

PG Phriday: The Audacity of NoSQL
Tagged on:                                         

12 thoughts on “PG Phriday: The Audacity of NoSQL

  • For HA on Postgres you can use repmgr too, it has autofailover capabilities. I use it with pgbouncer to isolate old master and reconfigure what needs to be reconfigured. And is less complicated than pacemaker.

  • Really insightful article. As someone with primarily SQL background, it is good to know how nosql solutions can offer that sort of scalability. Thank you!

  • So this is a problem I have looked into, and worked with.

    There are three challenges: There isn’t a way to ensure reliable distributed transactions (The CAP theorem). There isn’t a good way to do complex joins unless you ship data around without transactions. Data constraints cannot be used reliably (the other end of a foreign key could end up on a different host).

    This means that you don’t get most of the benefits of SQL, and have to do a lot of work in your application.

  • Great write up.

    Two mongo-related features that may be of interest:

    1) majority writes/reads, to eliminate the chance of data loss, and to minimize the chance of rollback during elections. This, along with a rewrite of the election algorithm in 3.2, address a lot of the issues in the Jepsen post you mention.

    2) my favorite 3.2 feature, document validation, which gives you a lot of control for enforcing rules at the database level, but without inventing a new language for schema definitions (uses the find syntax), and while giving you lots of flexibility at the data level (use it for only the fields you want) and how validation failures are handled (log a message or throw an error on update/insert)

    docs: blog with examples:

    1. Interesting. I still recommend enabling journal acknowledgment as the default write concern for a replica set, though. As a DBA, my trust of underlying disk subsystems is on the level of “mega paranoid”. 🙂

      I’ll see about pushing our Mongo guys to upgrade our instances. Maybe during the shard migration.

      1. Those aren’t mutually exclusive:

        “Secondary members acknowledge replicated write operations after the secondary members have written to their respective on-disk journals, regardless of the j option used for the write on the primary.”


        “Changed in version 3.2: With j: true, MongoDB returns only after the requested number of members, including the primary, have written to the journal. Previously j: true write concern in a replica set only requires the primary to write to the journal, regardless of the w: write concern.”

  • Hi Shaun,

    Thank you for mentioning Patroni. While it’s true that Patroni has a few external dependencies, your typical enterprise may already run Zookeeper, etcd or a consul cluster, and with Kubernetes etcd is already there for you. We also have more than just support for multiple consensus-based systems – to name some extras there is a pg_rewind support with an ability to turn it off or decide what to do with a non-rewindable former master, a feature to initialize the replica with a configurable script that is different from pg_basebackup (which comes in handy if you have a 50TB database), scheduled switchovers, support for changing the configuration cluster-wide and dynamically and labels for the replicas to make some of them not suitable for becoming a master.

    Adding a RAFT- or PAXOS-based automatic failover to PostgreSQL would be great. I think the hard part is agreeing on a sane set of knobs to tune with reasonable defaults (i.e. timeouts on the leader key, how to elect a potential new master, how to inform an external connection pooler like pgbouncer, etc.) and documenting them properly.

  • @Shaun I was excited like you about 3 years ago about MongoDB. Once you lose data once (I did!) you will never praise MongoDB again. TBH, their design has fundamental flaws in it!

    1. Don’t get me wrong, Denish; I don’t know enough about, nor do I have adequate experience with the platform to declare it safe or worthwhile. But some of the concepts it introduces are worthy of further exploration.

      Note that I said it did an adequate job of protecting data, rather than good. But that doesn’t appear to be the primary focus of the software. I’m more interested in the scaling and availability features.

  • Similar here Postgresql, Repmgr and Consul. Repmgr handles clustering and failover, Repmgr scripts trigger and update Consul which directs write traffic to the new master. The old master may come back but it will never get to update Consul. We perform fully automated deploys using Puppet and Terraform.

  • PG has to come up with shard based solution, very much like cassandra. If that is done, then they can significantly be more useful to apps than nosql.

Comments are closed.