PG Phriday: Postgres-XL and Horizontal Scaling

December 11th, 2015 | Published in Database, Tech Talk | 2 Comments


With all of the upheaval in the Postgres world thanks to advancements in extensions, foreign data wrappers, and background workers, it’s getting pretty difficult to keep track of everything! One of these rapidly moving targets is Postgres-XL and its role in helping Postgres scale outward. Large warehouses have a critical need for horizontal scaling, as the very laws of physics make it effectively impossible to perform aggregate queries on tables consisting of several billion rows. Even furiously indexed, partitioned, and clustered tables, on over-provisioned servers loaded with the best NVRAM-based storage, 64 CPUs, and 1TB of RAM, become lumbering beasts when someone wants a cumulative summary of last year’s activity.

In my previous analysis of PMPP, I wondered how we could use it to possibly rebalance one of our larger databases so that it’s actually usable in linear time. But that also got me thinking about past Postgres projects such as Postgres-XC and its ilk. I’d long since forgotten about them due to how far they lagged behind Postgres core patches. But desperate times call for desperate measures, so I looked again, remembering how Postgres-XL had taken up the mantle of open-source big-data Postgres.

What I found was initially discouraging. Postgres-XL is currently stable on the 9.2 branch even though 9.4 has been out for over a year. But the devs claim to be feverishly merging 9.5 into Postgres-XL, so assuming they eventually complete that effort, this could be something great. As it turns out—though it’s a bit of a juggling act—Postgres-XL may actually be an option for our 30+TB instance.

Of course, first we have to install it. We don’t use RPM systems where I work, so I used the source tarball and installed with configure and make such that I could simply use the binaries normally. After that, the documentation suggests using pgxc_ctl to provision everything. Unfortunately the sample configuration file for that tool doesn’t adequately describe all of the variables, and the defaults refer to node names and a stack that isn’t well defined. So let’s just do this the “hard” way. One benefit to this approach is a more complete understanding of what’s going on.
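
For reference, building from the source tarball follows the standard Postgres dance. This is only a sketch; the tarball name and install prefix below are placeholders for whatever was actually downloaded and wherever the binaries should live:

tar -xzf postgres-xl-9.2-latest.tar.gz
cd postgres-xl
./configure --prefix=/opt/pgxl
make
make install

With /opt/pgxl/bin on the PATH, the usual binaries (initgtm, gtm_ctl, initdb, pg_ctl) work as they would for any Postgres install.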

Everything starts with GTM, the Global Transaction Manager that ensures transactions commit safely across the cluster, since we are working with several independent Postgres instances. We also need a data coordinator, and a few data nodes. The test system has eight CPUs, so it makes sense to allocate four data nodes to avoid saturating the system. GTM can be extremely active, and the coordinator does its own work, so we should assume this uses six of our eight CPUs.

initgtm -Z gtm -D /data/pgsql/pgxl/gtm
gtm_ctl -Z gtm -D /data/pgsql/pgxl/gtm start
 
initdb -D /data/pgsql/pgxl/coord1 --nodename coord1
initdb -D /data/pgsql/pgxl/data1 --nodename data1
initdb -D /data/pgsql/pgxl/data2 --nodename data2
initdb -D /data/pgsql/pgxl/data3 --nodename data3
initdb -D /data/pgsql/pgxl/data4 --nodename data4

We started GTM at this point, because there’s no reason not to. It’s a very small and light service, and we’ll need it later anyway. We can’t start the cluster itself just yet, however, because we need to make a bunch of configuration changes. Note that we’ve only differentiated the node roles by name at this point, and all of the ports are still the default 5432 for any Postgres service. So the minimum to get all of this running is to change:

  • The port setting for each node. To keep things simple, let’s use 6543 for the coordinator, and 6601 to 6604 for the data nodes.
  • Set pooler_port to a unique value on each node. Any instance is configured as a coordinator by default, and coordinators start an internal connection pool on a separate port; without unique values, none of the data nodes will start once the coordinator is running. Let’s use 6743 for the coordinator, and 6701 to 6704 for the data nodes. We’ll fix this after the nodes are registered.
  • Set logging_collector to on, unless we want the console flooded with output. A sample of the resulting settings follows this list.
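
Concretely, the coordinator’s postgresql.conf ends up with something like the following, while each data node takes its own values from the ranges above (data1 gets 6601 and 6701, and so on):

port = 6543
pooler_port = 6743
logging_collector = on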

Next, we can actually start the cluster by launching each node:

pg_ctl -D /data/pgsql/pgxl/coord1 -Z coordinator start
pg_ctl -D /data/pgsql/pgxl/data1 -Z datanode start
pg_ctl -D /data/pgsql/pgxl/data2 -Z datanode start
pg_ctl -D /data/pgsql/pgxl/data3 -Z datanode start
pg_ctl -D /data/pgsql/pgxl/data4 -Z datanode start

At this point, our cluster still isn’t usable. Postgres-XL adds a NODE primitive, but a new instance is only aware of itself, considers itself a coordinator, and assumes the default port of 5432. This means none of the nodes can actually communicate with each other. To fix that, we have to connect to each node and inform it of its actual role:

cat <<EOF | psql -p 6543
ALTER NODE coord1 WITH (PORT = 6543);
 
CREATE NODE data1 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6601);
CREATE NODE data2 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6602);
CREATE NODE data3 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6603);
CREATE NODE data4 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6604);
 
SELECT pgxc_pool_reload();
EOF
 
cat <<EOF | psql -p 6601
ALTER NODE data1 WITH (PORT = 6601, TYPE = 'datanode');
 
CREATE NODE coord1 WITH (TYPE = 'coordinator', HOST = 'localhost', PORT = 6543);
CREATE NODE data2 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6602);
CREATE NODE data3 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6603);
CREATE NODE data4 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6604);
 
SELECT pgxc_pool_reload();
EOF
 
cat <<EOF | psql -p 6602
ALTER NODE data2 WITH (PORT = 6602, TYPE = 'datanode');
 
CREATE NODE coord1 WITH (TYPE = 'coordinator', HOST = 'localhost', PORT = 6543);
CREATE NODE data1 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6601);
CREATE NODE data3 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6603);
CREATE NODE data4 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6604);
 
SELECT pgxc_pool_reload();
EOF
 
cat <<EOF | psql -p 6603
ALTER NODE data3 WITH (PORT = 6603, TYPE = 'datanode');
 
CREATE NODE coord1 WITH (TYPE = 'coordinator', HOST = 'localhost', PORT = 6543);
CREATE NODE data1 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6601);
CREATE NODE data2 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6602);
CREATE NODE data4 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6604);
 
SELECT pgxc_pool_reload();
EOF
 
cat <<EOF | psql -p 6604
ALTER NODE data4 WITH (PORT = 6604, TYPE = 'datanode');
 
CREATE NODE coord1 WITH (TYPE = 'coordinator', HOST = 'localhost', PORT = 6543);
CREATE NODE data1 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6601);
CREATE NODE data2 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6602);
CREATE NODE data3 WITH (TYPE = 'datanode', HOST = 'localhost', PORT = 6603);
 
SELECT pgxc_pool_reload();
EOF

And then we can restart all of the data nodes to get rid of the extra pools:

pg_ctl -D /data/pgsql/pgxl/data1 -Z datanode restart
pg_ctl -D /data/pgsql/pgxl/data2 -Z datanode restart
pg_ctl -D /data/pgsql/pgxl/data3 -Z datanode restart
pg_ctl -D /data/pgsql/pgxl/data4 -Z datanode restart
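
As a quick sanity check, the pgxc_node catalog on the coordinator should now list every node we just registered, along with its type, host, and port:

psql -p 6543 -c "SELECT node_name, node_type, node_host, node_port FROM pgxc_node;"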

Finally. Finally the cluster is usable. And this is the minimum setup necessary to get a working Postgres-XL stack. With that out of the way, let’s revisit our trusty sensor_log table:

CREATE TABLE sensor_log (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       VARCHAR NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
 
) DISTRIBUTE BY HASH (sensor_log_id);
 
CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
 
INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, s.id % 100, now() - (s.id || 's')::INTERVAL
  FROM generate_series(1, 4000000) s(id);

This is very similar to the setup we used for testing the PMPP extension. We didn’t have to manually allocate the shards, but we do have to tell Postgres-XL how to distribute the table contents across our data nodes. We should also mention here that the INSERT will be extremely slow. The primary reason for this is that we’re relying on GTM to control the sequence values, and it seems to get progressively slower as bulk insert size increases. For instance, 1000 rows insert 5x slower than they would on a regular Postgres database, while the full 4M rows go in 25x slower. This suggests sequence handling either needs work, or sequences should be avoided in favor of pre-generated values.
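
As a rough sketch of the pre-generated approach, we could supply the IDs ourselves during the bulk load and skip the GTM-managed sequence entirely, then nudge the sequence past the values we consumed (assuming the default sequence name). This wasn’t benchmarked here; it just illustrates the idea:

-- Assign the surrogate keys directly instead of calling nextval for every row.
INSERT INTO sensor_log (sensor_log_id, location, reading, reading_date)
SELECT s.id, s.id % 1000, s.id % 100, now() - (s.id || 's')::INTERVAL
  FROM generate_series(1, 4000000) s(id);
 
-- Keep the sequence consistent with the manually assigned IDs.
SELECT setval('sensor_log_sensor_log_id_seq', 4000000);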

In any case, here are a couple queries we can use to test against a standard Postgres 9.4 install:

-- Query 1
 
SELECT COUNT(*)
  FROM sensor_log
 WHERE reading_date::TIME >= '14:00'
   AND reading_date::TIME < '15:00';
 
-- Query 2
 
SELECT COUNT(*)
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE - INTERVAL '1 day';

Query    Postgres 9.4 (ms)    Postgres-XL (ms)
1        977                  303
2        46                   12

We should eventually run some more comprehensive tests with joins to see how Postgres-XL handles cross-node data comparisons. But this preliminary testing seems to indicate nearly linear scaling for simple data access. Given a wide enough cluster, we could reduce query times by two or three orders of magnitude on larger tables. Combine this with fact tables, and we could potentially provide real-time analysis over wide aggregate ranges.
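
To sketch what one of those join tests might look like, a small dimension table could be replicated to every node so joins against sensor_log stay local; the sensor_location table here is purely hypothetical:

CREATE TABLE sensor_location (
  location  VARCHAR PRIMARY KEY,
  region    VARCHAR NOT NULL
) DISTRIBUTE BY REPLICATION;
 
SELECT l.region, count(*) AS readings
  FROM sensor_log s
  JOIN sensor_location l USING (location)
 WHERE s.reading_date >= CURRENT_DATE - INTERVAL '1 day'
 GROUP BY l.region;

Since a replicated table exists in full on every data node, a join like this never has to ship dimension rows between nodes.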

My only real concern, aside from the performance of sequences, is that every node has to be aware of every other node. This suggests Postgres-XL uses an N-squared network mesh to facilitate node communication. This is fine for small values of N, but I suspect severely diminishing returns at higher node counts due to communication overhead. Multiple-CPU computational clusters abandoned this approach decades ago in favor of architectures that use more efficient pathways. Given this limitation, extreme-scale clusters might not be possible with Postgres-XL in its current state.

If someone out there has a 32+ node cluster using Postgres-XC or Postgres-XL, I’d love to hear from them. Solutions like this are currently one of the few ways of efficiently accessing several billion rows. At the very least, there’s a lot of potential here.



2 Comments

Baron Schwartz says:
December 11th, 2015 at 6:04 pm

Have you looked at citusdb?

Shaun says:
December 11th, 2015 at 6:24 pm

Not yet, but it’s on my list. 🙂

