PG Phriday: Big Data is Hard

Let’s just get the obvious out of the way early: dealing with multiple Terabytes or Petabytes in a database context is something of a nightmare. Distributing it, retrieving it, processing it, aggregating and reporting on it, are all complicated—and perhaps worst of all—non-intuitive. Everything from tooling and maintenance, to usage and input, are either ad-hoc or obfuscated by several special-purpose APIs and wrappers.

One of the reasons a self-scaling database is such a killer app, derives from the failure rate from having so many moving parts. A proxy here, a shard API there, a few design modifications to accommodate implementation quirks, and suddenly we have a fragile, janky golem. A lumbering monstrosity that our applications and data depend on, where modifying or replacing any part of it will mean redistributing all the data at the very least.

And that’s only one reason why big data is hard. Another stems from something a little more subtle: deciding where everything goes. I hope experienced readers just groaned a bit. Done improperly, data pulled from such a system is not only inefficient, but it’s often wrong. The groan is caused by the knowledge that it’s exceptionally easy to miss accounting for some critical detail related to the data, and end up with 40TB of useless garbage by the time someone notices.

Here’s a way that might happen. Consider an application that sells products to cartoon characters. For illustrative purposes, it’s pretty basic and not fully normalized. That in itself isn’t a problem, and it’s more fun to query anyway. The structure consists of two simple tables in a container schema, and we are building it with sharding in mind:

CREATE SCHEMA toon_store;
SET search_path TO toon_store;

CREATE TABLE toon_account 
  account_id   INT NOT NULL PRIMARY KEY,
  first_name   VARCHAR NOT NULL,
  last_name    VARCHAR NOT NULL,
  email        VARCHAR NOT NULL,

CREATE TABLE toon_order
  order_id     INT NOT NULL PRIMARY KEY,
  account_id   INT NOT NULL,
  product      TEXT NOT NULL,
  quantity     INT NOT NULL,

Notice that we didn’t use SERIAL to handle autonumbering the ID fields. That’s one of the first concessions we’ve had to make in order to shard our data in the future. If we allowed the tables to assign surrogate IDs in independent shards, we would inevitably encounter conflicts. There are a few ways around this, because we still want IDs.

  1. Postgres sequences can start on an arbitrary number, and increment by arbitrary values. We could specifically create sequences that were tailored to each shard, such that there would be no conflicting assignments. This is a very simple approach, but is extremely inelastic. The incremental value must exceed the amount of potential shards, or we end up with wraparound problems. Implementation is also somewhat annoying, requiring a separate custom application to create and manage shards.
  2. Use UUID. UUIDs are generated in such a way that they can’t conflict. The main complication here is that UUIDs are rarely utilized in existing architectures. This means converting all existing data to implement them where distribution is necessary. They’re also much larger than an INT, and have a cascading effect of making all foreign keys, indexes, and network transfers of the underlying data larger and slower. These caveats don’t matter in most cases, but should be considered.
  3. We could use a function to generate the ID as the DEFAULT. In the function, we would define our shard distribution and ID algorithm. This is how shard_manager and similar extensions work. Instagram used this process in their large Postgres database, and it worked fine for them. Like UUIDs, this kind of custom solution must happen before any data is loaded into the system for best effect. Further, we actually have to make decisions regarding how many shards we could potentially create, and relying on 64-bit integers means we’ll eventually run out of IDs. We might think our application has moved on in 50-100 years, but y2k problems in the year 2000 suggest making those kind of assumptions is ultimately destructive.
  4. We let the application itself handle ID generation. Often in the case of sharding, the application has some kind of API that intercepts all queries, and makes shard-aware decisions like where to insert data, or what values need to be provided to prevent duplicate keys. This is basically just a lazier and more dangerous version of the previous option. But what happens when another ad-hoc application or data-source appears? Well, we either have to hack together some way of utilizing the existing key distribution system into incorporating these components, or duplicate the algorithm in every new system that wants to use our data. Developers hate duplicating efforts, because it’s a vector for bugs and code drift. It also means everything is ideally written in the same language, which is limiting in many circumstances.

That’s a lot to think about, so we’ll ignore it for now. Let’s just imagine we’ve fixed that problem, and IDs are magically solved. The next step is to decide on where to put the shards. A Postgres instance can have multiple databases, so let’s just isolate each shard in its own database, so we don’t have to worry about renaming our schemas or tables. We can just run the same creation script in each database and call it a day:

createdb shard1
createdb shard2

psql -f schema.sql shard1
psql -f schema.sql shard2

Federating data this way isn’t exactly efficient. Postgres databases can’t join between each other unless it’s through some kind of access layer like a Foreign Data Wrapper. That means an external connection to each database is necessary to obtain data. But isn’t the idea of Shared Nothing structures built upon no interaction between the shards? And suppose we did use schemas instead of databases to segment the data; we’d want separate connections anyway to prevent needlessly complicating the physical/logical shard map. We may need to move shards to re-balance the cluster after all, so the assumption should be that shards are physically distinct entities.

With that in mind, let’s fill our cluster with some arbitrary data. We need to distribute everything, so let’s just use some naive modulus math on the primary key of each table:

-- Run this on shard1
SET search_path TO toon_store;

INSERT INTO toon_account (account_id, first_name, last_name, email)
VALUES (1, 'Fred', 'Flintstone', '');

INSERT INTO toon_order (order_id, account_id, product, quantity)
VALUES (1, 1, 'Apple Tablet', 2);

INSERT INTO toon_order (order_id, account_id, product, quantity)
VALUES (3, 2, 'Bolt #7', 5);

-- Run this on shard2
SET search_path TO toon_store;

INSERT INTO toon_account (account_id, first_name, last_name, email)
VALUES (2, 'Gadget', 'Hackwrench', '');

INSERT INTO toon_order (order_id, account_id, product, quantity)
VALUES (2, 1, 'TromBONE', 1);

INSERT INTO toon_order (order_id, account_id, product, quantity)
VALUES (4, 2, 'Coo-Coo Cola', 12);

This is fairly straight-forward; Fred wanted some entertainment options, and Gadget has some work to do, and grabbed a case of refreshment.

At least one of you started screaming and executed an epic facepalm at what we just did. For everyone else, this is where we return to the importance of choosing distribution methods. In this case, we made an extremely elementary mistake and didn’t account for keeping associated data together. In this case, we distributed data based on the primary key of each individual table, hoping that would evenly distribute the data itself. This only works for tables that are completely independent of each other. If we ever want to execute a JOIN, the arbitrary row distribution means some rows that would normally return are simply missing.

To make this more obvious, let’s use some magic similar to PMPP, and broadcast the same query to both of our shards and examine the results:

SELECT a.first_name, a.last_name, o.product, o.quantity
  FROM toon_store.toon_account a
  JOIN toon_store.toon_order o USING (account_id);

-- From shard1

 first_name | last_name  |   product    | quantity 
 Fred       | Flintstone | Apple Tablet |        2

-- From shard2

 first_name | last_name  |   product    | quantity 
 Gadget     | Hackwrench | Coo-Coo Cola |       12

Well, we’re obviously missing two rows from those results. But why? Because Fred is only on shard1, and Gadget is only on shard2. If one or the other has orders on both shards, we’ll only retrieve those that happen to reside on the same shard as they do. This is one of the simplest mistakes to make when building a distributed data system, and the easiest to fix. There are two mainstream approaches to addressing this:

  1. Key hashing must be consistent across the system. This means taking one concept that runs through the application and distributing based on that. In our case, customers only interact with their own data. Since both tables share the account_id field, we could organize the data so that field determines the appropriate shard. Consequently, the application can make that same assumption. Combine this with a physical to logical map, and an application could instantly transform the ID to connect to, and retrieve from, the appropriate shard. If we cache the mapping itself, we could wrap the database driver with an intermediate conversion layer to obfuscate much of this.
  2. Replicate critical tables to all shards as metadata. In some ways, this is a variant of the previous approach. Compared to tables with several hundred million or billions of rows, some tables are simply inconsequential and do not require shard distribution. In these cases, it’s common to simply duplicate the table to every shard in its complete form. The toon_account table is a perfect candidate for this, since the number of accounts is minuscule compared to the volume of orders.

More often than not, the two approaches are combined. With extensions such as pg_logical taking the place of older trigger-based table replication mechanisms, we don’t have to worry so much about performance concerns, either. This may encourage DBAs to replicate tables more often than before, instead of suppressing an inward cringe at the prospect of dealing with a brittle trigger parasite slurping a changelog from every table. Properly leveraged, our broken joins work exactly as expected, with few to no concessions at the user level.

Unfortunately as with icebergs, the bulky expanse of further concerns exists below the surface. If we made incorrect assumptions about how data is distributed or related, we may need to start over. If we want a use case where new interactions are not supported by the current distribution model, we may need to start over. If our hashing algorithm is flawed in some way, we could have uneven data distribution, and may need to start over.

Consider our example. Say we fixed it by distributing based on account_id. If Gadget buys orders of magnitude more product from our store than Fred, her shard would contain far more data and be much larger. Now we’re stuck with one giant shard in relation to the others. To fix it, we’d want to replicate the toon_account table to every shard, and hash on order_id instead. Now we need to designate some shard as the primary source for the account table, such that all data modification commands only target that location. Or we could use some type of bi-directional replcation to merge all insert vectors, and deal with the inherent complexity that approach implies. Oh yeah, and we have to re-balance the toon_order table.

The rabbit hole only gets deeper. This example has two very simple tables. In a real production context with dozens or even hundreds of tables, we must be even more careful. In the end, that’s part of the reason there’s no single solution, or even a feature-complete working model. For every circumstance where Postgres-XL’s approach is perfect, there are others where depending on a coordinator isn’t as efficient as a shard-aware driver wrapper. For every query that can simply be broadcast to every known node and aggregated, there are others where all associated data is best when strongly tied to the same location in an independent silo.

I like to pretend I’m some kind of expert in this realm, but the reality is that there’s no such thing. The more I learn about the full implications of horizontal scaling, the more overwhelming it becomes. I suspect this is why so many of the Postgres scaling solutions are either proprietary (Redshift), abandoned (Stado), unstable (Postgres-XL), or a shambling heap of caveats and implementation quirks (Citus, Greenplum).

In the end, we’re still restricted to the realm of custom solutions to get the job done. Whether that’s adapting loading and querying around Citus and its many limitations, or wrapping PMPP with a few functions and building a driver wrapper to abstract away the complexity, we’re firmly in ad-hoc land. This still works, but the expertise required to build, maintain, and service the end result is the kind of job security I could do without. ;)