PG Phriday: Postgres-XL and Horizontal Scaling
December 11th, 2015 | Published in Database, Tech Talk | 2 Comments
With all of the upheaval in the Postgres world thanks to advancements in extensions, foreign data wrappers, and background workers, it’s getting pretty difficult to keep track of everything! One of these rapidly moving targets is Postgres-XL and its role in helping Postgres scale outward. Large warehouses have a critical need for horizontal scaling, as the very laws of physics make it effectively impossible to perform aggregate queries on tables consisting of several billion rows. Even furiously indexed, partitioned, and clustered tables, on over-provisioned servers loaded with the best NVRAM-based storage, 64 CPUs, and 1TB of RAM, become lumbering beasts when someone wants a cumulative summary of last year’s activity.
In my previous analysis of PMPP, I wondered how we could use it to possibly rebalance one of our larger databases so that it’s actually usable in linear time. But that also got me thinking about past Postgres projects such as Postgres-XC and its ilk. I’d long since forgotten about them due to how far they lagged behind Postgres core patches. But desperate times call for desperate measures, so I looked again, remembering how Postgres-XL had taken up the mantle of open-source big-data Postgres.
What I found was initially discouraging. Postgres-XL is currently stable on the 9.2 branch even though 9.4 has been out for over a year. But the devs claim to be feverishly merging 9.5 into Postgres-XL, so assuming they eventually complete that effort, this could be something great. As it turns out—though it’s a bit of a juggling act—Postgres-XL may actually be an option for our 30+TB instance.
Of course, first we have to install it. We don’t use RPM systems where I work, so I used the source tarball and installed with configure and make such that I could simply use the binaries normally. After that, the documentation suggests using pgxc_ctl to provision everything. Unfortunately the sample configuration file for that tool doesn’t adequately describe all of the variables, and the defaults refer to node names and a stack that isn’t well defined. So let’s just do this the “hard” way. One benefit to this approach is a more complete understanding of what’s going on.
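For anyone following along, the build itself was just the usual source routine. The tarball name and install prefix below are only examples, not prescriptions:

```sh
# Exact tarball name varies by release; the prefix is an arbitrary choice.
tar -xzf postgres-xl-*.tar.gz
cd postgres-xl-*

./configure --prefix=/opt/pgxl
make && sudo make install

# Put the Postgres-XL binaries on the PATH so the commands below work as written.
export PATH=/opt/pgxl/bin:$PATH
```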
Everything starts with GTM, the Global Transaction Manager that ensures transactions commit safely across the cluster, since we are working with several independent Postgres instances. We also need a data coordinator, and a few data nodes. The test system has eight CPUs, so it makes sense to allocate four data nodes to avoid saturating the system. GTM can be extremely active, and the coordinator does its own work, so we should assume this uses six of our eight CPUs.
```sh
initgtm -Z gtm -D /data/pgsql/pgxl/gtm
gtm_ctl -Z gtm -D /data/pgsql/pgxl/gtm start

initdb -D /data/pgsql/pgxl/coord1 --nodename coord1
initdb -D /data/pgsql/pgxl/data1 --nodename data1
initdb -D /data/pgsql/pgxl/data2 --nodename data2
initdb -D /data/pgsql/pgxl/data3 --nodename data3
initdb -D /data/pgsql/pgxl/data4 --nodename data4
```
We started GTM at this point because there’s no reason not to. It’s a very small and light service, and we’ll need it later anyway. We can’t start the cluster itself just yet, however, because we need to make a bunch of configuration changes. Note that we’ve only differentiated the node roles by name at this point, and all of the ports are still the default 5432 for any Postgres service. So the minimum to get everything running is to make the following changes (a sample of the resulting settings appears after this list):
- Set the port for each node. To keep things simple, let’s use 6543 for the coordinator, and 6601 to 6604 for the data nodes.
- Set pooler_port to unique values. This is because any instance is configured by default as a coordinator, and coordinators start up an internal connection pool on a separate port. Without changing this value, none of the data nodes will start once the coordinator is running. Let’s bump the values and use 6743 for the coordinator, and 6701 to 6704 for the data nodes. We’ll be able to fix this later.
- Set logging_collector to on, unless we want the console flooded with output.
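For reference, the relevant lines for the coordinator and the first data node end up looking something like this; the remaining data nodes follow the same pattern with their own ports:

```
# /data/pgsql/pgxl/coord1/postgresql.conf
port = 6543
pooler_port = 6743
logging_collector = on

# /data/pgsql/pgxl/data1/postgresql.conf
port = 6601
pooler_port = 6701
logging_collector = on
```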
Next, we can actually start the cluster by launching each node:
```sh
pg_ctl -D /data/pgsql/pgxl/coord1 -Z coordinator start
pg_ctl -D /data/pgsql/pgxl/data1 -Z datanode start
pg_ctl -D /data/pgsql/pgxl/data2 -Z datanode start
pg_ctl -D /data/pgsql/pgxl/data3 -Z datanode start
pg_ctl -D /data/pgsql/pgxl/data4 -Z datanode start
```
At this point, our cluster still isn’t usable. Postgres-XL adds a NODE primitive, but a freshly initialized instance is only aware of itself, assumes it’s a coordinator, and uses the default port of 5432. This means none of the nodes can actually communicate with each other. To fix that, we have to connect to each node and inform it of its actual role and its peers:
```sh
cat <<EOF | psql -p 6543
ALTER NODE coord1 WITH (PORT = 6543);
CREATE NODE data1 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6601);
CREATE NODE data2 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6602);
CREATE NODE data3 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6603);
CREATE NODE data4 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6604);
SELECT pgxc_pool_reload();
EOF

cat <<EOF | psql -p 6601
ALTER NODE data1 WITH (PORT = 6601, TYPE = 'datanode');
CREATE NODE coord1 WITH (TYPE = 'coordinator', HOST = 'localhost', PORT = 6543);
CREATE NODE data2 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6602);
CREATE NODE data3 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6603);
CREATE NODE data4 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6604);
SELECT pgxc_pool_reload();
EOF

cat <<EOF | psql -p 6602
ALTER NODE data2 WITH (PORT = 6602, TYPE = 'datanode');
CREATE NODE coord1 WITH (TYPE = 'coordinator', HOST = 'localhost', PORT = 6543);
CREATE NODE data1 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6601);
CREATE NODE data3 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6603);
CREATE NODE data4 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6604);
SELECT pgxc_pool_reload();
EOF

cat <<EOF | psql -p 6603
ALTER NODE data3 WITH (PORT = 6603, TYPE = 'datanode');
CREATE NODE coord1 WITH (TYPE = 'coordinator', HOST = 'localhost', PORT = 6543);
CREATE NODE data1 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6601);
CREATE NODE data2 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6602);
CREATE NODE data4 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6604);
SELECT pgxc_pool_reload();
EOF

cat <<EOF | psql -p 6604
ALTER NODE data4 WITH (PORT = 6604, TYPE = 'datanode');
CREATE NODE coord1 WITH (TYPE = 'coordinator', HOST = 'localhost', PORT = 6543);
CREATE NODE data1 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6601);
CREATE NODE data2 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6602);
CREATE NODE data3 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6603);
SELECT pgxc_pool_reload();
EOF
```
And then we can restart all of the data nodes to get rid of the extra pools:
```sh
pg_ctl -D /data/pgsql/pgxl/data1 -Z datanode restart
pg_ctl -D /data/pgsql/pgxl/data2 -Z datanode restart
pg_ctl -D /data/pgsql/pgxl/data3 -Z datanode restart
pg_ctl -D /data/pgsql/pgxl/data4 -Z datanode restart
```
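Each node keeps its view of the cluster in the pgxc_node catalog, so a quick query against the coordinator (and optionally each data node) is a reasonable sanity check that the topology registered correctly:

```sql
-- Run against the coordinator (port 6543); each data node has its own copy.
SELECT node_name, node_type, node_host, node_port
  FROM pgxc_node
 ORDER BY node_name;
```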
Finally. Finally the cluster is usable. And this is the minimum setup necessary to get a working Postgres-XL stack. With that out of the way, let’s revisit our trusty sensor_log table:
```sql
CREATE TABLE sensor_log (
    sensor_log_id  SERIAL PRIMARY KEY,
    location       VARCHAR NOT NULL,
    reading        BIGINT NOT NULL,
    reading_date   TIMESTAMP NOT NULL
) DISTRIBUTE BY HASH (sensor_log_id);

CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, s.id % 100,
       now() - (s.id || 's')::INTERVAL
  FROM generate_series(1, 4000000) s(id);
```
This is very similar to the setup we used for testing the PMPP extension. We didn’t have to manually allocate the shards, but we do have to tell Postgres-XL how to distribute the table contents across our data nodes. We should also mention here that the INSERT will be extremely slow. The primary reason for this is that we’re relying on GTM to control the sequence values, and it seems to get progressively slower as bulk insert size increases. For instance, 1000 rows insert 5x slower than on a regular Postgres database, while the full 4M rows are 25x slower. This would suggest sequence handling either needs work, or sequences should be avoided in favor of pre-generated values.
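Had we wanted to sidestep GTM for the bulk load entirely, something like the following would likely have worked instead. This is just a sketch, and it assumes the default name Postgres gives the serial column’s sequence:

```sql
-- Supply the IDs directly so the load never touches the GTM-managed sequence.
INSERT INTO sensor_log (sensor_log_id, location, reading, reading_date)
SELECT s.id, s.id % 1000, s.id % 100,
       now() - (s.id || 's')::INTERVAL
  FROM generate_series(1, 4000000) s(id);

-- Then advance the sequence past the highest pre-generated value so later
-- single-row inserts don't collide with existing keys.
SELECT setval('sensor_log_sensor_log_id_seq', 4000000);
```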
In any case, here are a couple queries we can use to test against a standard Postgres 9.4 install:
```sql
-- Query 1

SELECT COUNT(*)
  FROM sensor_log
 WHERE reading_date::TIME >= '14:00'
   AND reading_date::TIME < '15:00';

-- Query 2

SELECT COUNT(*)
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE - INTERVAL '1 day';
```
| Query | Postgres 9.4 (ms) | Postgres-XL (ms) |
|---|---|---|
| 1 | 977 | 303 |
| 2 | 46 | 12 |
We should probably eventually run some more comprehensive tests containing joins to see how it handles cross-database data comparisons. But this preliminary testing seems to indicate nearly linear scaling for simple data access. Given a wide enough cluster, we could reduce query times by two or three orders of magnitude on larger tables. Combine this with fact tables, and we could potentially provide real-time analysis over wide aggregate ranges.
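As a purely illustrative sketch of what such a test might look like, a small replicated dimension table joined against the hash-distributed fact table would exercise cross-node planning. The sensor_location table here is hypothetical and doesn’t exist in the schema above:

```sql
-- Hypothetical dimension table, copied in full to every data node so the
-- join against the hash-distributed fact table can happen locally.
CREATE TABLE sensor_location (
    location  VARCHAR PRIMARY KEY,
    region    VARCHAR NOT NULL
) DISTRIBUTE BY REPLICATION;

-- A join of this shape would be a reasonable starting point for testing.
SELECT l.region, COUNT(*)
  FROM sensor_log s
  JOIN sensor_location l USING (location)
 WHERE s.reading_date >= CURRENT_DATE - INTERVAL '7 days'
 GROUP BY l.region;
```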
My only real concern, aside from the performance of sequences, is that every node has to be aware of every other node. This suggests Postgres-XL uses an N-squared network mesh to facilitate node communication. This is fine for small values of N, but I suspect severely diminishing returns at higher node counts due to communication overhead. Multiple-CPU computational clusters abandoned this approach decades ago in favor of architectures that use more efficient pathways. Given this limitation, extreme-scale clusters might not be possible with Postgres-XL in its current state.
If someone out there has a 32+ node cluster using Postgres-XC or Postgres-XL, I’d love to hear from them. Solutions like this are currently one of the few ways of efficiently accessing several billion rows. At the very least, there’s a lot of potential here.
Baron Schwartz says:
December 11th, 2015 at 6:04 pm
Have you looked at citusdb ?
Shaun says:
December 11th, 2015 at 6:24 pm
Not yet, but it’s on my list. 🙂