PG Phriday: Postgres-XL and Horizontal Scaling

With all of the upheaval in the Postgres world thanks to advancements in extensions, foreign data wrappers, and background workers, it’s getting pretty difficult to keep track of everything! One of these rapidly moving targets is Postgres-XL and its role in helping Postgres scale outward. Large warehouses have a critical need for horizontal scaling, as the very laws of physics make it effectively impossible to perform aggregate queries on tables consisting of several billion rows. Even furiously indexed, partitioned, and clustered tables, on over-provisioned servers loaded with the best NVRAM-based storage, 64 CPUs, and 1TB of RAM, become lumbering beasts when someone wants a cumulative summary of last year’s activity.

In my previous analysis of PMPP, I wondered how we could use it to rebalance one of our larger databases so that it’s actually usable in linear time. But that also got me thinking about Postgres-XC and its ilk. I’d long since forgotten about those projects because of how far they lagged behind Postgres core patches. But desperate times call for desperate measures, so I looked again, remembering how Postgres-XL had taken up the mantle of open-source big-data Postgres.

What I found was initially discouraging. Postgres-XL is currently stable on the 9.2 branch even though 9.4 has been out for over a year. But the devs claim to be feverishly merging 9.5 into Postgres-XL, so assuming they eventually complete that effort, this could be something great. As it turns out—though it’s a bit of a juggling act—Postgres-XL may actually be an option for our 30+TB instance.

Of course, first we have to install it. We don’t use RPM systems where I work, so I built from the source tarball with configure and make so I could simply use the binaries normally. After that, the documentation suggests using pgxc_ctl to provision everything. Unfortunately, the sample configuration file for that tool doesn’t adequately describe all of the variables, and the defaults refer to node names and a stack that isn’t well defined. So let’s just do this the “hard” way. One benefit to this approach is a more complete understanding of what’s going on.
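
For reference, the build itself is the usual source-install routine. Something like the following works, where the tarball name and install prefix are placeholders of my own rather than anything official:

tar -xzf postgres-xl-9.2.tar.gz    # placeholder tarball name
cd postgres-xl-9.2
./configure --prefix=/opt/pgxl
make
make install                       # may need sudo, depending on the prefix
export PATH=/opt/pgxl/bin:$PATH    # so the binaries behave normally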

Everything starts with GTM, the Global Transaction Manager that ensures transactions commit safely across the cluster, since we are working with several independent Postgres instances. We also need a data coordinator, and a few data nodes. The test system has eight CPUs, so it makes sense to allocate four data nodes to avoid saturating the system. GTM can be extremely active, and the coordinator does its own work, so we should assume this uses six of our eight CPUs.

initgtm -Z gtm -D /data/pgsql/pgxl/gtm
gtm_ctl -Z gtm -D /data/pgsql/pgxl/gtm start

initdb -D /data/pgsql/pgxl/coord1 --nodename coord1
initdb -D /data/pgsql/pgxl/data1 --nodename data1
initdb -D /data/pgsql/pgxl/data2 --nodename data2
initdb -D /data/pgsql/pgxl/data3 --nodename data3
initdb -D /data/pgsql/pgxl/data4 --nodename data4

We started GTM at this point because there’s no reason not to. It’s a very small and light service, and we’ll need it later anyway. We can’t start the cluster itself just yet, however, because we need to make a bunch of configuration changes. Note that we’ve only differentiated the node roles by name so far, and every instance is still listening on the default Postgres port of 5432. So here’s the minimum needed to get all of this running (a sample of the resulting settings follows the list):

  • Change the port setting for each node. To keep things simple, let’s use 6543 for the coordinator, and 6601 to 6604 for the data nodes.
  • Set pooler_port to a unique value on each instance. Every instance is configured as a coordinator by default, and coordinators start an internal connection pool on a separate port; without unique values, none of the data nodes will start once the coordinator is running. Let’s move these into their own range and use 6743 for the coordinator, and 6701 to 6704 for the data nodes. We’ll be able to fix this later.
  • Set logging_collector to on. Unless we want the console flooded with output, this is a good idea from the start.
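
For the coordinator, those three changes amount to just a few lines in postgresql.conf; each data node gets the same treatment with its own port and pooler_port. The values below are simply the ones chosen above:

# coord1/postgresql.conf
port = 6543
pooler_port = 6743
logging_collector = on

# data1/postgresql.conf (data2 through data4 follow the same pattern)
port = 6601
pooler_port = 6701
logging_collector = on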

Next, we can actually start the cluster by launching each node:

pg_ctl -D /data/pgsql/pgxl/coord1 -Z coordinator start
pg_ctl -D /data/pgsql/pgxl/data1 -Z datanode start
pg_ctl -D /data/pgsql/pgxl/data2 -Z datanode start
pg_ctl -D /data/pgsql/pgxl/data3 -Z datanode start
pg_ctl -D /data/pgsql/pgxl/data4 -Z datanode start

At this point, our cluster still isn’t usable. Postgres-XL adds a NODE primitive, but a freshly initialized instance only knows about itself, considers itself a coordinator, and assumes the default port of 5432. This means none of the nodes can actually communicate with each other. To fix that, we have to connect to each node and tell it its real role, along with where to find its peers.
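
Postgres-XL provides ALTER NODE and CREATE NODE for exactly this purpose. As a rough sketch using the node names and ports chosen earlier, and assuming everything lives on localhost, the coordinator’s view of the cluster could be set up like so; each data node needs the equivalent statements from its own perspective:

psql -p 6543 postgres <<'SQL'
ALTER NODE coord1 WITH (TYPE = 'coordinator', PORT = 6543);
CREATE NODE data1 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6601);
CREATE NODE data2 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6602);
CREATE NODE data3 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6603);
CREATE NODE data4 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6604);
SELECT pgxc_pool_reload();
SQL

Once every instance knows about the other four, the pgxc_node catalog on each of them should list all five members.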

With every node aware of its peers, the cluster is finally usable. To get a rough sense of the payoff, we loaded the same sensor_log test data into both the cluster and a standalone Postgres 9.4 instance, then timed a couple of simple aggregates against each:

-- Query 1

SELECT count(*)
  FROM sensor_log
 WHERE reading_date::TIME >= '14:00'
   AND reading_date::TIME < '15:00';

-- Query 2

SELECT count(*)
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE - INTERVAL '1 day';
Query   Postgres 9.4 (ms)   Postgres-XL (ms)
1       977                 303
2       46                  12
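
For context, data distribution in Postgres-XL is declared per table with a DISTRIBUTE BY clause. A sensor_log spread across the data nodes by hash might look something like the sketch below; the column list is an assumption on my part, since only reading_date actually appears in the queries above:

CREATE TABLE sensor_log (
  id            BIGSERIAL NOT NULL,
  location      VARCHAR NOT NULL,   -- assumed column
  reading       BIGINT NOT NULL,    -- assumed column
  reading_date  TIMESTAMP NOT NULL
)
DISTRIBUTE BY HASH (id);

The serial column is worth noting too, since Postgres-XL coordinates sequences through the GTM, and that feeds directly into the sequence concern below.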

We should probably run some more comprehensive tests containing joins to see how it handles comparisons that span data nodes; something like the sketch below is the kind of thing I have in mind. But this preliminary testing seems to indicate nearly linear scaling for simple data access. Given a wide enough cluster, we could reduce query times by two or three orders of magnitude on larger tables. Combine this with fact tables, and we could potentially provide real-time analysis over wide aggregate ranges.
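
Here the sensor_dim table and the location and reading columns are pure invention, standing in for whatever dimension data we would actually join against; this is just the shape of the test, not something that ran as part of the timings above:

SELECT d.location_type,
       count(*) AS readings,
       round(avg(s.reading), 2) AS avg_reading
  FROM sensor_log s
  JOIN sensor_dim d ON (d.location = s.location)
 WHERE s.reading_date >= CURRENT_DATE - INTERVAL '1 week'
 GROUP BY d.location_type;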

My only real concern, aside from the performance of sequences, is that every node has to be aware of every other node. This suggests Postgres-XL uses an N-squared network mesh to facilitate node communication. This is fine for small values of N, but I suspect severely diminishing returns at higher node counts due to communication overhead. Multi-CPU computational clusters abandoned this approach decades ago in favor of architectures that use more efficient pathways. Given this limitation, extreme-scale clusters might not be possible with Postgres-XL in its current state.

If someone out there has a 32+ node cluster using Postgres-XC or Postgres-XL, I’d love to hear from them. Solutions like this are currently one of the few ways of efficiently accessing several billion rows. At the very least, there’s a lot of potential here.