PG Phriday: Joining the Big Leagues

I’ve maintained since about 2011 that the problem with scaling Postgres to truly immense installations could be solved by a query coordinator. Why? Most sharding systems use an application-level distribution mechanism, which may or may not leverage an inherent hashing algorithm. This means each Postgres instance can be treated independently of all the others, provided the distribution process is known. On a cleverly architected system, the application is algorithm-aware and can address individual shards through a driver proxy or accessor class.
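To put that routing idea in concrete terms, here’s a minimal sketch of what an algorithm-aware accessor might do, using the built-in hashtext() function in Postgres. The key and the eight-shard count are purely illustrative assumptions.

```sql
-- Hypothetical routing step: hash a natural key and map it to one of eight
-- shards, much like an algorithm-aware driver proxy or accessor class would.
SELECT abs(hashtext('AAPL')) % 8 AS target_shard;
```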

The real complications arise when there is no clear natural column for evenly distributing data over a wide cluster. Further, when accessing that data in an OLAP context, individual IDs derived through an algorithm are essentially worthless. To address the data in general, a query must be broadcast to every individual node, and the results collated and re-aggregated at some final step. Without this key component, obtaining aggregate results from a large database cluster is either an exercise in futility or a job for over-engineered application-level query parsers.
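The broadcast-and-collate pattern itself is nothing exotic. Here’s a minimal sketch using plain dblink, with invented connection strings and a hypothetical trades table; a real cluster would need one branch per node, which is exactly the boilerplate a coordinator exists to eliminate.

```sql
-- Minimal broadcast-and-re-aggregate sketch. Each dblink() call runs the same
-- partial aggregate on one shard, and the outer query collates the partials.
-- Connection strings and the "trades" table are assumptions for illustration.
CREATE EXTENSION IF NOT EXISTS dblink;

SELECT symbol, sum(partial_count) AS total_count
  FROM (
        SELECT * FROM dblink('host=shard1 dbname=market',
                 'SELECT symbol, count(*) FROM trades GROUP BY symbol')
               AS t (symbol TEXT, partial_count BIGINT)
        UNION ALL
        SELECT * FROM dblink('host=shard2 dbname=market',
                 'SELECT symbol, count(*) FROM trades GROUP BY symbol')
               AS t (symbol TEXT, partial_count BIGINT)
       ) partials
 GROUP BY symbol;
```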

This is one of the reasons I like Postgres-XL and its ilk over something like CitusDB. The Postgres-XL approach treats the data nodes as nothing more than dumb containers, while the coordinator does all of the heavy lifting. It broadcasts queries to all nodes, combines the results, and does so in a parallel context. Going back to my article on PMPP, we can see that a query must be broadcast to all of the shards, and then another query must be wrapped around the function call to summarize the disparate results. Postgres-XL does that without the extra step, and for general users, this is a massive difference.

Given all of this, there is still a limit to the current implementation, and it’s likely a side effect of Postgres-XL descending from Postgres-XC. That code fell so far behind core Postgres that the latest stable release is based on 9.2. They’re working hard on merging their patches into a 9.5 branch and including other enhancements, but until that happens, this is the current state of Big Postgres.

So where are we? I’ve recently been given access to what I consider hilariously immense servers with ridiculous performance. We’re talking 48-core, 512GB servers, with 50TB of storage each. Eight of them. If this doesn’t represent the maximum possible performance of Postgres in a large cluster, I’m afraid to ask what would.

But results are what matters. It turns out that the Postgres-XL 9.2 coordinator node crashes at cluster sizes greater than 36 data nodes. With that constraint in mind, I used one server as a coordinator, and six more servers as hosts for data nodes. Then I ran a battery of load tests consisting of up to six nodes per server, with up to eight concurrent import processes. That gives us a maximum of 36 nodes, and no server is handling more than 48 simultaneous queries—perfect!

Next, I took a 90-million-row sample of our 20-billion-row data set and imported it into a simple table that exactly matched the CSV layout, repeating the load under each of the permutations above. To make this a real-world test, I also added a single index on the column holding stock symbols, so the imports had to do some actual work.
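For reference, the setup looked roughly like the sketch below. The column names and distribution key are hypothetical stand-ins, since the actual CSV layout isn’t reproduced here; only the single index on the symbol column reflects the test as described.

```sql
-- Rough sketch of the import target. Column names are invented placeholders
-- for the real CSV layout; the distribution column is likewise an assumption.
CREATE TABLE stock_ticks (
    symbol     VARCHAR(10)  NOT NULL,
    tick_time  TIMESTAMP    NOT NULL,
    price      NUMERIC(12,4),
    quantity   BIGINT
)
DISTRIBUTE BY HASH (symbol);  -- Postgres-XL spreads rows across the data nodes

CREATE INDEX idx_stock_ticks_symbol ON stock_ticks (symbol);

-- Each concurrent loader process then ran a COPY through the coordinator:
COPY stock_ticks FROM '/path/to/sample_chunk_01.csv' WITH (FORMAT csv);
```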

The results—listed below—are pretty illuminating.

Nodes per Server   Loader Procs   Time (s)   Rows per Second
       1                1            213            419694
       1                2             87           1027526
       1                4             46           1943365
       1                8             42           2128448
       2                1            172            519737
       2                2             89           1004436
       2                4             55           1625360
       2                8             21           4256896
       4                1            184            485841
       4                2             84           1064224
       4                4             43           2078949
       4                8             24           3724784
       6                1            172            519737
       6                2             86           1039474
       6                4             43           2078949
       6                8             24           3724784

It’s pretty clear that one node per server is a big loser. After that, there’s quite a bit of variance in the results, probably stemming from the fact these servers aren’t completely idle and dedicated to this testing. But there’s enough here to draw a few conclusions.

First, we hit a parallel import bottleneck relatively quickly. This is perfectly in line with past analysis of parallel COPY and the locking patches that went into 9.4. Which means the 9.5 XL branch should provide much better results. Not that three to four million rows per second is sluggish.

Second, after the outlier of one node per server, performance seems relatively consistent across the board. More nodes don’t appear to diminish (or improve) import performance. This is important considering the work the coordinator is doing in splitting up and broadcasting the CSV contents. Assuming the nodes act as independent agents, they could theoretically import CSV data directly, provided they do this while using the GTM to prevent transaction mismatches. Removing the coordinator as the bottleneck here could drastically improve import speed on extremely large clusters.

Finally, maximizing the node count is the best approach for future query needs. Even if our tests implied that import speed suffered at larger scales, the difference would need to be much larger to justify fewer nodes. With 36 parallel workers, each node holds roughly 2.5 million of the 90 million rows, so a naive sequential scan over the entire sample takes about half a second on this hardware. And in theory, up to eight queries pushing every worker node to 100% CPU could run this way simultaneously.
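As a hedged example, reusing the hypothetical stock_ticks table from the earlier sketch, this is exactly the kind of broadcast aggregate that benefits: every node scans only its own slice of the data, and the coordinator merges the partial results into the final answer.

```sql
-- A typical broadcast aggregate. Each of the 36 data nodes scans only its own
-- portion of the 90-million-row sample, and the coordinator combines the
-- partial results returned by every node.
SELECT symbol, count(*) AS ticks, sum(quantity) AS shares
  FROM stock_ticks
 GROUP BY symbol
 ORDER BY shares DESC
 LIMIT 10;
```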

Of course, all of this data fits in memory. We’d have to load much more than a mere 32GB to overwhelm the 3TB available across these servers. Even all 20 billion rows only consume about 2.4TB. This means the current day of data could be available in real time, while older information requiring disk seeks would be understandably slower. Yet even those queries would run at least 6x faster than a single Postgres instance could manage.

All of this seems to indicate that maximizing read scalability using an architecture like Postgres-XL requires several much smaller servers, each with its own independent storage array. While these 50TB behemoths are fun to play with, a sequential scan issued from each of the nodes would quickly saturate the local storage, and likely cause severe disk contention. Six nodes each pulling several 1GB files from the same array would resemble random IO to the controller or SAN. At the very least, there will be elevated IO and queue waits. And unfortunately, 50TB of SSD-based storage (per server!) is still staggeringly expensive.

The bottom line seems to be this: the practical ceiling for a large Postgres-XL cluster is currently 36 medium-size servers. If using larger servers like the ones in these tests, storage should be provisioned as independent LUNs so that each node gets its own allocation. That allows read performance to scale nearly linearly, and it also reduces index and vacuum maintenance concerns by spreading the data over so many nodes.

Once XL can go beyond 36 nodes, we’ll see what Postgres can really do. Hopefully this happens with the release of 9.5. Until then, Postgres can be big, but not yet truly gargantuan. Well… and still capable of returning queries before the heat death of the universe.