PG Phriday: DIY in the CLI (Part 1)

June 24th, 2016 | Published in News


On a higher level, Postgres has a bevy of libraries, interfaces, and clients for accessing a database instance. From language APIs to GUIs like pgAdmin, or SaaS entries like JackDB, every flavor of interaction is covered. And yet, that’s only a small part of the story. For those who dare to tread into the watery depths, there’s also the world of dark incantations that is the command-line.

While most are aware of psql, the Postgres command-line client for accessing databases, there are far more creatures lurking in the black oblivion which deserve more visibility. In the spirit of getting the most from our tools, let’s take a look at these and consider how they might be useful in scenarios that aren’t restricted to boring old DBA considerations. To that end, let’s cover some of the standard client applications that anyone might use.

To begin, there are several commands that cover some standard SQL operations. Things like creating or dropping databases, users, or procedural languages, are all possible without actually connecting to Postgres. Normally, we’d have to use the equivalent SQL commands:

CREATE DATABASE foo;
DROP DATABASE foo;
 
CREATE USER kitty_cat WITH PASSWORD 'M30w9|_|rr';
DROP USER kitty_cat;
 
CREATE LANGUAGE plpythonu;
DROP LANGUAGE plpythonu;

While it’s possible to force all of these through psql -c, that makes automation, testing, and other scripted tasks somewhat awkward, or forces sysadmins to learn SQL for no reason. So Postgres provides functional wrappers we can use instead. They’re named about what we’d expect, too:

createdb foo
dropdb foo
 
createuser kitty_cat --pwprompt
dropuser kitty_cat
 
createlang plpythonu
droplang plpythonu

As would be expected, all of these commands include the optional parameters from their SQL equivalents. This means a couple of handy things: flags can be listed with -? or --help, and longer descriptions are available via comprehensive man pages. The Postgres online documentation is great, and having all of this information at our fingertips is even better.

Of course, we’re not limited to databases, users, and languages. There are a few commands to streamline basic maintenance as well, and as might be expected, these also have SQL equivalents. Well, mostly. Consider these three scenarios:

  1. We want to VACUUM every database every night following a busy day of transactions. We’ve disabled autovacuum because we have a very active database, and have instead opted for nightly maintenance.
  2. Over the years, our indexes on the user_session and login_state tables have gotten a bit fragmented, and we want to rebuild them.
  3. We have clustered multiple tables, sorting their disk pages along beneficial indexes. Due to MVCC storage, tables that experience updates need periodic re-clustering to maintain this ordering. To do this, we need to run CLUSTER regularly.

Assuming we have two databases, named ‘trading’ and ‘operations’ for example, we could do these things using SQL commands:

\c trading
VACUUM ANALYZE;
CLUSTER;
REINDEX TABLE user_session;
REINDEX TABLE login_state;
 
\c operations
VACUUM ANALYZE;
CLUSTER;

This is all perfectly serviceable. Yet for administrators or end users that don’t really care about SQL syntax, or for operations we want to automate, there is an easier way. We could get the same result using three command-line tools:

vacuumdb --all --analyze
reindexdb --table=user_session --table=login_state trading
clusterdb --all
 
# Or with short options:
 
vacuumdb -az
reindexdb -t user_session -t login_state trading
clusterdb -a

Unlike their SQL equivalents, the command-line tools can combine operations or target multiple objects. We leveraged that to reindex both tables with a single command, and vacuum or cluster all databases in our instance. This is about the time our tools become more than functional wrappers of the SQL commands. Parameters like --all illustrate client-level looping at the very least, and provide a great segue into heavier utilities.

Backing up databases and restoring them, for instance, leverages multiple high and low level Postgres interfaces. Due to this complexity, there is no SQL analog. The Postgres backup and restore utilities have also matured a lot over the years, gaining parallel dumping functionality as recently as 9.3.

Taking this into account, we could clone one of the previously mentioned databases using provided tools in two different ways:

# Just copy a target database by itself:
 
pg_dump --jobs=2 --format=directory --file=backup_dir trading
createdb trading_clone
pg_restore --jobs=2 --dbname=trading_clone backup_dir
 
# Copy the whole freaking instance at the binary level.
 
pg_basebackup -D clone_dir

Just so we’re clear, both approaches have pluses and minuses as well as applicable scenarios beyond the scope of this article. We’re merely illustrating very basic usage. Beyond that, the first method is a variant of the common dump/restore pattern used since time immemorial to perform Postgres upgrades until pg_upgrade hit the scene in 9.0. One of the primary reasons it fell out of favor was due to the growing trend of immense databases.

Even using parallel functionality, dumping the contents of every table in a database 200GB or larger will be extremely slow. Restoring that information is even worse, as we not only have to import all of the same data, but all indexes and constraints must be rebuilt from scratch. Before pg_restore was capable of parallel restores in 8.4, restores were even more frustrating. This, along with the new online replication capabilities, is the origin of pg_basebackup.

By default, pg_basebackup merely utilizes the Postgres data replication stream and reconstructs the host instance in a specified directory. This means we can start Postgres from that directory as a replica, or use pg_upgrade to test newer versions while the old one keeps running. This enables online failbacks in case of failed upgrades, multiple upgrade test runs, or running applications on new versions to test for edge cases and compatibility concerns.
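
As a rough sketch of that workflow, here’s one way to take a binary copy and bring it up as an independent instance on another port. The paths and port number are illustrative assumptions rather than anything Postgres prescribes:

# Clone the running instance at the binary level, streaming WAL as we go.
pg_basebackup -D /srv/pg_clone -X stream --progress
 
# Give the copy its own port and start it up alongside the original.
echo "port = 5433" >> /srv/pg_clone/postgresql.conf
pg_ctl -D /srv/pg_clone -l /srv/pg_clone/startup.log start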

Principally, it allows us to separate instance management from content management. We still need pg_dump and pg_restore, but they’re no longer primarily backup tools. This is especially true for pg_dump. It has multiple export formats which can theoretically be used to transfer table data into other storage systems. Or maybe we just want to export and inspect raw schema creation commands.
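
For example, here are a couple of sketches along those lines, reusing the trading database from earlier; the output file names are arbitrary:

# Export and inspect raw schema creation commands:
pg_dump --schema-only --format=plain trading > trading_schema.sql
 
# Dump only one table's contents, ready to massage into another system:
pg_dump --data-only --table=user_session trading > user_session_data.sql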

What else is left on the client end of things? For those that want to build Postgres extensions, or get information about the Postgres binaries installed on a server, we have pg_config. Its output is basically a bunch of environment settings that were used to build Postgres itself. This is especially useful with distributed builds that have long lists of arbitrary configure flags. After all, here’s how Ubuntu compiles it:

pg_config --configure
 
'--with-tcl' '--with-perl' '--with-python' '--with-pam' '--with-openssl' '--with-libxml' '--with-libxslt' '--with-tclconfig=/usr/lib/x86_64-linux-gnu/tcl8.6' '--with-includes=/usr/include/tcl8.6' 'PYTHON=/usr/bin/python' '--mandir=/usr/share/postgresql/9.4/man' '--docdir=/usr/share/doc/postgresql-doc-9.4' '--sysconfdir=/etc/postgresql-common' '--datarootdir=/usr/share/' '--datadir=/usr/share/postgresql/9.4' '--bindir=/usr/lib/postgresql/9.4/bin' '--libdir=/usr/lib/x86_64-linux-gnu/' '--libexecdir=/usr/lib/postgresql/' '--includedir=/usr/include/postgresql/' '--enable-nls' '--enable-integer-datetimes' '--enable-thread-safety' '--enable-tap-tests' '--enable-debug' '--disable-rpath' '--with-uuid=e2fs' '--with-gnu-ld' '--with-pgport=5432' '--with-system-tzdata=/usr/share/zoneinfo' 'CFLAGS=-g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -I/usr/include/mit-krb5 -fPIC -pie -DLINUX_OOM_SCORE_ADJ=0 -fno-omit-frame-pointer' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -L/usr/lib/mit-krb5 -L/usr/lib/x86_64-linux-gnu/mit-krb5' '--with-krb5' '--with-gssapi' '--with-ldap' '--with-selinux' 'CPPFLAGS=-D_FORTIFY_SOURCE=2

Some of that might be nice to know on occasion.

We also have pg_isready, a “new” addition with 9.3. It’s just a quick connection check that reports its result through standard exit codes. This is good for basic monitoring, but not much else. On the other hand, we no longer have to run a bogus command through psql and deal with the much more verbose client-oriented output. On an automation front, that’s a major step forward.
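
A minimal monitoring sketch might look like the following; the host, port, and timeout are examples, and the exit codes follow the documented convention of 0 for accepting connections, 1 for rejecting them, 2 for no response, and 3 for no attempt made:

# Poll the server and branch on the result.
if pg_isready -h localhost -p 5432 -t 5 -q; then
    echo "Postgres is accepting connections"
else
    echo "Connection check failed with status $?"
fi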

And finally, there are benchmarks. Perhaps we’re testing new hardware, or a new Postgres version, or maybe we’re just curious. Having recently escaped existence as a “mere” contrib tool in 9.5, pgbench now lets anyone and their dog beat up Postgres for fun and profit. It’s great for obtaining data for articles like this, too.

Here’s a quick example:

createdb pgbench
pgbench --initialize --scale=100 pgbench
pgbench --client=4 --jobs=2 --select-only --time=10 pgbench
 
starting vacuum...end.
transaction type: SELECT only
scaling factor: 100
query mode: simple
number of clients: 4
number of threads: 2
duration: 10 s
number of transactions actually processed: 168447
latency average: 0.237 ms
tps = 16844.169409 (including connections establishing)
tps = 16853.718425 (excluding connections establishing)

Generally read/write tests should be larger than memory to really stress a system, but we’d need a much larger scale for that on most modern systems. At scale 100 we get about 1.5GB over 10M rows of data, so on a small 8GB test VM, we’d need a scale of 600 or higher to even start hitting the disks.

Beyond the basic test framework, we can also execute arbitrary SQL scripts. This makes it possible to stress-test transactional databases with multiple parallel clients, running anything from simple SELECT statements on our own tables, to intricate stored procedures that update fact tables. As such, it’s one of the most valuable tools in the arsenal of both users and DBAs, provided either even know it exists.
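
As a quick sketch, here’s how we might point pgbench at a custom script instead of the built-in test. The query itself is just an example against one of the tables pgbench creates during initialization:

# Hypothetical custom script; pgbench_accounts comes from --initialize.
cat > custom.sql <<'EOF'
SELECT abalance FROM pgbench_accounts WHERE aid = 12345;
EOF
 
pgbench --client=4 --jobs=2 --time=10 --file=custom.sql pgbench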

There are a couple more tools I’ve left out, but they’re related to logical and stream replication and are better covered in a more in-depth manner. Beyond that, take a look at your Postgres binary directory; you might be surprised what’s hiding there.

Because distributions like to hide all the cool stuff, we may have to go looking for some of these utilities. Debian and Ubuntu users can find binaries in /usr/lib/postgresql/9.x/bin, while RHEL variants like CentOS should look in /usr/pgsql-9.x/bin.

Explore often, but do so cautiously!



PG Phriday: Let There Be Jank

June 17th, 2016 | Published in Database, Tech Talk


One way the Postgres project is subtly misleading is that it becomes easy to forget that not all other projects are nearly as well managed. This becomes more relevant when delving into niches that lack sufficient visibility to expose the more obvious deficiencies. As much as we like Postgres, it’s not quite as popular as it could be. This makes some of the side projects infrequently used, and as a direct consequence, they can often resemble jerky automatons cobbled together out of spit and baling wire.

A good example of this is the hdfs_fdw extension for accessing Hadoop. To be fair, a large portion of its complexity is due to dependency on Hive and hastily assembled APIs like Thrift and fb303. Unfortunately it also suffers from “Works for Me” syndrome, lacking an autoconf to autodiscover these requirements, and it also doesn’t use automake to properly bootstrap the make environment. As a result, most builds will fail outright or require hand-modifying the Makefile—a task many will simply balk at before abandoning the extension outright.

So, let’s install the hdfs_fdw extension, going through all the necessary steps, and not just some incomplete shorthand that makes assumptions regarding the build system. To facilitate this, no default install locations will be used at all, because hardcoded defaults are how we got into this mess in the first place.

The hdfs_fdw extension depends on Thrift, so let’s start there. As of this writing, the latest version is 0.9.3. Here’s the full build process we used:

wget http://apache.osuosl.org/thrift/0.9.3/thrift-0.9.3.tar.gz
tar -xzf thrift-0.9.3.tar.gz
cd thrift-0.9.3
./configure --prefix=/opt/thrift
make -j2
sudo make install

So far, this is pretty normal. What isn’t normal, is that the thrift source includes a contrib module named fb303 we also need. Unlike the Postgres build environment, configuration settings and subsequent Makefile components do not cascade to the contrib modules. This is where the trouble begins.

The first issue is that, unlike Thrift, fb303 defaults to only static linking, and won’t generate a dynamic object file unless explicitly told to do so. Considering these components are packaged together, this restriction is unnecessarily jarring. Why choose between dynamic or static linking for a system library that may be used as either? In our case, we want dynamic shared objects, so we need to configure with --disable-static.

The second roadblock comes from the Thrift team itself, which closed a bug as “Cannot Reproduce” despite the fact one of their include path references is simply wrong. The user that reported the issue even provided a patch. So we need to fix that, too.

The whole process looks like this:

cd contrib/fb303
./bootstrap.sh --prefix=/opt/thrift --with-thriftpath=/opt/thrift \
    --disable-static
sed -i 's%/include/thrift%/include%' cpp/Makefile.am
make -j2
sudo make install

With Thrift out of the way, it’s time to move on to hdfs_fdw itself. Unfortunately we can’t quite do that yet. The hdfs_fdw extension is distributed with a library it depends on, but that is not accounted for in the Makefile. We need to build and install it separately for some reason. Further, this dependency has hardcoded paths in its own Makefile, so we must modify it or end up with the library in an arbitrary location, or with it unable to find required headers.

So let’s build the libhive library:

git clone https://github.com/EnterpriseDB/hdfs_fdw.git
cd hdfs_fdw/libhive
sed -i 's%THRIFT_HOME=.*%THRIFT_HOME=/opt/thrift/include%' Makefile
sed -i 's%INSTALL_DIR=.*%INSTALL_DIR=/opt/thrift/lib%' Makefile
make -j2
sudo make install

We elected to install libhive in the same location as Thrift because they’ll be used together in our case. This prevents cluttering up our /usr/local/lib directory, as well as allowing us to easily redistribute these prerequisites to our other Postgres systems since we don’t have a nice package.

Finally, we can build the hdfs_fdw extension itself. Or can we?

Unfortunately, we’re still not done setting up. Because we elected to install Thrift as an optional piece of software, and because we use dynamic linking, we need to tell the operating system where to find libraries. Usually this means modifying /etc/ld.so.conf and running ldconfig to re-read library paths. On Ubuntu and other Debian variants, we can actually put these in a subdirectory in a less intrusive fashion.

Here’s how that might look:

echo /opt/thrift/lib | sudo tee /etc/ld.so.conf.d/hive.conf
sudo /sbin/ldconfig

Our particular build environment is an Ubuntu system, which is a Debian variant. As such, build tools like pg_config are actually wrappers around the real utilities, which are hidden away in deep paths to prevent accidental use. This is to help facilitate having multiple Postgres versions on the same server, but it also complicates installing extensions, since the wrappers always assume the most recent version. Suppose Postgres 9.6 beta is on our system, but we wanted to install an extension for 9.5?

That means we need to alter our path before building hdfs_fdw itself. Debian variants put everything we need in /usr/lib/postgresql/[version]/bin, but other UNIX systems may use a different location. We’ll need that information to proceed. In addition, since hdfs_fdw doesn’t use a configure script, we can’t tell it where to find the Thrift and fb303 libraries. This isn’t strictly necessary because we modified ldconfig, but it’s always better to be safe.

So, assuming we’re still in the hdfs_fdw/libhive directory, we’d finish installing the hdfs_fdw extension like this:

export PATH=/usr/lib/postgresql/9.5/bin:$PATH
 
cd ..
sed -i 's%/usr/local/thrift%/opt/thrift%' Makefile
make -j2 USE_PGXS=1
sudo -E make install

We needed the -E flag to preserve our $PATH variable. Otherwise the root user’s path would be used, and then the extension would be installed into the most recent Postgres version, regardless of our wishes.

Since we modified ldconfig with the new library references, we also need to restart Postgres. Otherwise, it won’t have /opt/thrift/lib in its library cache, and as a result, would throw an error when trying to activate hdfs_fdw. Ubuntu systems use pg_ctlcluster for this, while others will use the Postgres pg_ctl tool directly. Let’s make that our final step to “activate” the library before using it.

sudo pg_ctlcluster 9.5 main restart

And finally… finally we’re done. But did it work? Let’s check:

CREATE EXTENSION hdfs_fdw;
 
CREATE SERVER hdfs_server
         FOREIGN DATA WRAPPER hdfs_fdw
         OPTIONS (host 'hive_host');
 
CREATE USER MAPPING FOR postgres
    SERVER hdfs_server;
 
CREATE FOREIGN TABLE hive_sensor_log
(
  id            BIGINT,
  location      VARCHAR(255),
  reading       BIGINT,
  reading_date  TIMESTAMP
 
) SERVER hdfs_server
OPTIONS (dbname 'default', table_name 'sensor_log');
 
SELECT * FROM hive_sensor_log;
 
 id | location | reading |    reading_date     
----+----------+---------+---------------------
  1 | place    |      82 | 2016-06-17 08:15:31
  2 | stuff    |      22 | 2016-06-17 08:18:31

Well then, that was quite an adventure. In the end, we got something that worked, though the amount of hoops we had to jump through was a little disconcerting. It shouldn’t have to be this way.

On a personal note, this was actually the easy part. Hadoop is a truculent beast, and other elements in the stack—of which there are many—just make it more ridiculous. Hive itself is probably one of the most janky things I’ve ever encountered. Postgres has libpq, so why isn’t there an equivalent for Hive? Is it a protocol or not? Why do I need to install a freaking one-off Facebook library to access my Hadoop install with a SQL interface?

Worse, I needed to follow multiple incomplete tutorials online to get Hive working at all. Beyond simply installing it, it must be started with SASL disabled for hdfs_fdw. But doing that means it defaults to Kerberos authentication. If that isn’t set up, commands need to run as the user that launched Hive, since users are mapped from the operating system. To get that to work, I had to modify several more XML files. I feel like I’d need to read at least three books on this subject before even trying to approach this with any amount of confidence.

Then Hive crashed with an OOM error after the query output above. It turns out I could select data to my heart’s content, but following an insert through Hive (which took almost 20 seconds per row), there were ceaseless problems. Pulling data after an insert always caused it to go into an infinite uncaught OOM exception crash loop that required kill -9 to stop. When all of my work for this article was complete, I shut it all down and backed away slowly, lest it rend me to dripping gobbets because I pressed the wrong key in its presence.

Postgres, despite its versatility, just works. It can be used in more advanced architectures and be leveraged for even more power, but it still functions in its base configuration. After today, I sincerely wish more projects followed that philosophy.



PG Phriday: Moving to 9.5

June 10th, 2016 | Published in Database, Tech Talk


There comes a day in every young database’s life that it’s time to move on. I’m sorry 9.4, but the day has come that we must say goodbye. It’s not like we haven’t had our good times. While I truly appreciate everything you’ve done for me, we must part ways. I’m far too needy, and I can’t demand so much of you in good conscience. May your future patches make you and your other suitors happy!

In all seriousness, Postgres 9.5 has been out since January 7th. Despite that, I’ve never really been an advocate of deploying dot-0 versions in production environments. Since the Postgres dev team is notoriously, fanatically responsible, this is probably excessively paranoid. Still, I maintain that some day, my certifiable anxiety regarding fresh releases will pay off. Until then, it’s time to examine a few reasons to start upgrading the 120+ Postgres instances littering our server fabric. Especially now that 9.5.3 is out, making it practically rock solid in terms of field testing.

JSONBetter

A common complaint about JSON and JSONB up to 9.4 was that there was no easy way to modify existing JSON objects. Adding fields was an exercise in text conversion, unpacking, concatenation, and re-casting. It was a mess. As a consequence, I recommended using it as a mere container column, or relying on PL/V8 or PL/Python to actually manipulate JSON data.
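
For anyone who never had the pleasure, here’s a sketch of one such 9.4-era hack: convert to text, chop off the closing brace, concatenate, and cast the whole mess back to JSONB. It assumes a non-empty object and is every bit as fragile as it looks:

SELECT (left('{"Hairy": true}'::JSONB::TEXT, -1)
        || ', "Wobbly": false}')::JSONB;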

That’s no longer the case with 9.5. Not only are several standard operators overloaded to support JSONB, but 9.5 also introduces the jsonb_set function for more direct manipulation. I covered this more extensively in the past, but here are a couple quick examples for reference:

-- Concatenation
 
SELECT '{"Hairy": true}'::JSONB || '{"Wobbly": false}'::JSONB;
 
             ?column?             
----------------------------------
 {"Hairy": true, "Wobbly": false}
 
-- Subtraction 
 
SELECT '{"Hairy": true, "Excited": false}'::JSONB - 'Excited';
 
    ?column?     
-----------------
 {"Hairy": true}
 
-- Explicit jsonb_set
 
SELECT jsonb_set(
  '{"Hairy": true, "Status": {"Burning": true}}'::JSONB,
  '{Status,Burning}',
  'false'
);
 
                   jsonb_set                   
-----------------------------------------------
 {"Hairy": true, "Status": {"Burning": false}}

This is no less an exciting advancement than it was back in December, and could be a singular reason to upgrade depending on JSON usage saturation.

INSERT += UPDATE

One feature the MySQL camp has lorded over Postgres is the REPLACE INTO syntax. For those who don’t already know, it operates like an INSERT unless it finds an existing row for that primary key, upon which it acts more like an UPDATE. This has been a… point of contention for more than a decade.

I could find similar links until the end of time, but stopped at 2002 because this article needs to end eventually. Nary a month has gone by without someone asking about it in the Postgres mailing lists, being dissatisfied with the answer, and going back to MySQL. The only equivalent we had was a convoluted function that made use of exceptions and a loop in order to avoid a race condition. No more. Now, this is trivial:

CREATE TABLE upsert_me (
  id     INT PRIMARY KEY,
  stuff  TEXT
);
 
INSERT INTO upsert_me VALUES (1, 'Woo!');
INSERT INTO upsert_me VALUES (1, 'Woohoo!');
 
ERROR:  duplicate key value violates unique constraint "upsert_me_pkey"
 
INSERT INTO upsert_me VALUES (1, 'Woohoo!')
    ON CONFLICT (id) DO UPDATE
   SET stuff = EXCLUDED.stuff;
 
SELECT * FROM upsert_me;
 
 id |  stuff  
----+---------
  1 | Woohoo!
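
For contrast, the old workaround resembled this sketch, adapted from the looping pattern long recommended in the documentation; the function name and parameters here are invented for illustration:

CREATE FUNCTION merge_upsert_me(p_id INT, p_stuff TEXT)
RETURNS VOID AS
$BODY$
BEGIN
  LOOP
    -- Try the UPDATE first; if it touched a row, we're done.
    UPDATE upsert_me SET stuff = p_stuff WHERE id = p_id;
    IF FOUND THEN
      RETURN;
    END IF;
 
    -- No row yet, so INSERT. If someone else wins the race,
    -- loop around and UPDATE their row instead.
    BEGIN
      INSERT INTO upsert_me (id, stuff) VALUES (p_id, p_stuff);
      RETURN;
    EXCEPTION WHEN unique_violation THEN
      NULL;
    END;
  END LOOP;
END;
$BODY$
LANGUAGE plpgsql;

One statement with ON CONFLICT replaces all of that looping and exception juggling.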

The Postgres implementation is actually safer, given the explicit control over when the replacement actually happens, under what criteria, and which columns are included. This kind of solution is exactly what the community has come to expect; the Postgres dev team never simply adds a new feature without careful and meticulous consideration. We get the same functionality, with a far more flexible and less fragile implementation.

Block Range Indexes

Otherwise known as a BRIN index, this type of indexing is geared toward VLDB installations with gargantuan tables that are generally INSERT only. They work by storing a page offset followed by a range of values represented there. This is extremely lossy, but for large sequentially organized tables, can help Postgres limit page scans to certain table regions immediately. Unlike a regular BTREE index which might require a random read per match, a BRIN index will match several potentially unnecessary pages and read them all. The primary difference here is between a random or a sequential read operation.

In the end, this requires more filtering to remove unwanted matches for values pulled in from the whole range. Yet this can be far cheaper in aggregate depending on the granularity of the ranges. Let’s try an example with 10M rows:

CREATE TABLE sensor_log (
  id            INT PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);
 
INSERT INTO sensor_log (id, location, reading, reading_date)
SELECT s.id, s.id % 1000, s.id % 100,
       '2016-12-31'::DATE - (s.id || 's')::INTERVAL
  FROM generate_series(1, 10000000) s(id);
 
CREATE INDEX idx_sensor_log_date_brin
    ON sensor_log USING BRIN (reading_date);
 
ANALYZE sensor_log;
 
\timing ON
 
SELECT COUNT(*)
  FROM sensor_log
 WHERE reading_date >= '2016-12-01'
   AND reading_date < '2016-12-02';
 
Time: 20.805 ms

How does the sample query compare to using a regular BTREE index? Tests put the performance at about 10% faster when all results are in memory, otherwise the BRIN index was 3-5x faster. But there’s another benefit that isn’t quite as obvious. Due to the lossy nature of BRIN structure, these indexes are also orders of magnitude smaller. Let’s compare just how much:

CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
 
SELECT relname, pg_size_pretty(pg_relation_size(oid))
  FROM pg_class
 WHERE relname LIKE '%sensor_log%';
 
         relname          | pg_size_pretty 
--------------------------+----------------
 sensor_log               | 498 MB
 sensor_log_pkey          | 214 MB
 idx_sensor_log_date_brin | 32 kB
 idx_sensor_log_date      | 214 MB

This means an index that’s nearly 7000x larger is only 10% faster for this particular case under ideal circumstances. On a moderately busy warehouse system that mostly sees reporting queries spanning segments of multiple TB-size tables that would never fit in memory, the BRIN index would actually be a better use of resources. In cases where the query planner would take random page cost into account for a large result set and revert to a sequential scan, this is even more relevant. We may end up reading a small fraction of a vast table instead of the entire thing.

Thus BRIN indexes definitely have a niche role, but given the number of larger enterprises making use of Postgres for VLDB installations, that niche might be larger than we think.

Moving On

In the end, there are quite a few more features I will probably cover in more depth later. The Wiki is more than complete in addressing them, and includes cases I glossed over or skipped entirely. These are the ones I wanted to highlight due to their potential to directly impact instances I manage, or empower end-users under my purview. Any one of them would be an excellent justification to upgrade to 9.5, and taken together, they practically demand it.

Regardless, this is old news to anyone drooling over 9.6, including myself. However, following my own standards for not immediately deploying new versions, it’ll be a while before 9.6 will reach our production systems. To that end, I can rest easy knowing I’ll be writing a similar article about 9.5 when 9.6 reaches maturity. I’m glad that’s the one constant I’ve observed regarding Postgres over the years: every single upgrade is worth the trouble, often to a hilarious degree.

That’s not something you can always say in the software world, and I appreciate it.



PG Phriday: Rapid Prototyping

June 3rd, 2016 | Published in Database, Tech Talk


Ah, source control. From Subversion to git and everything in between, we all love to manage our code. The ability to quickly branch from an existing base is incredibly important to exploring and potentially abandoning divergent code paths. One often overlooked Postgres feature is the template database. At first glance, it’s just a way to ensure newly created databases contain some base functionality without having to bootstrap every time, but it’s so much more than that.

Our first clue to the true potential of template databases should probably start with the template0 and template1 databases. Every Postgres instance has them, and unless we’re familiar with the internals or intended use, they’re easy to ignore. They’re empty after all, right? Not quite. Individually, these templates actually define all of the core structures that must exist in all standard Postgres databases. System tables, views, the PUBLIC schema, and even the character encoding, are all defined in template0, and by extension, template1. The primary difference between the two is that template0 is intended to remain pristine in case we need to create a database without any of our own standard bootstrapping code.

Heck, we can’t even connect to the thing.

psql template0
psql: FATAL:  database "template0" is not currently accepting connections
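
The pg_database catalog spells out why; here’s a quick check we can run from any database, with nothing in it specific to a particular install:

SELECT datname, datistemplate, datallowconn
  FROM pg_database
 WHERE datname LIKE 'template%';
 
-- On a stock install, template0 reports datallowconn as false,
-- while template1 happily accepts connections.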

But we can make changes to template1. What happens if we create a table in the template1 database and then create a new database?

\c template1
 
CREATE TABLE some_junk (id SERIAL, trash TEXT);
CREATE DATABASE foo;
 
\c foo
\d
 
                List of relations
 Schema |       Name        |   Type   |  Owner   
--------+-------------------+----------+----------
 public | some_junk         | table    | postgres
 public | some_junk_id_seq  | sequence | postgres

So now we can see firsthand that objects created in template1 will automatically be created in any new database. Existing databases don’t benefit from new objects in any template database; it’s a one-time snapshot of the template state at the time the new database is created. This also applies to any object in the template. Functions, types, tables, views, or anything else in template1, will be copied to new databases upon creation.

This much is almost prevalent enough to be common knowledge. As such, DBAs and some users leverage it for popular extensions and other functionality they want included everywhere. It’s not uncommon to see something like this in a new installation:

\c template1
 
CREATE EXTENSION pg_stat_statements; -- Useful query stats.
CREATE EXTENSION pgstattuple;       -- Table data distribution info.
CREATE EXTENSION postgres_fdw;      -- Postgres foreign data wrapper.
CREATE EXTENSION plpythonu;         -- We use a lot of Python.

Now when we create a database, we’ll always have the ability to perform query forensics, analyze table bloat in the storage files, set up connections to other Postgres databases, and deploy Python-based procedures. The last one probably isn’t as common as the first three, but if a series of applications make heavy use of Python and all databases in an organization reflect that, it’s nice we have the option. Every object in the template is created on our behalf, and we don’t even need to ask for it.

What is a bit more esoteric though, is that this also applies to data. Here’s what happens if we put a few rows in our some_junk table:

\c template1
 
INSERT INTO some_junk (trash)
SELECT repeat(a.id::TEXT, a.id) FROM generate_series(1, 5) a(id);
 
DROP DATABASE foo;
CREATE DATABASE foo;
 
\c foo
 
SELECT * FROM some_junk;
 
 id | trash 
----+-------
  1 | 1
  2 | 22
  3 | 333
  4 | 4444
  5 | 55555

Well that changes everything, doesn’t it? Imagine this in a development or QA environment, where we may want to repeatedly build and tear down prototypes and test cases. Knowing that we can include data in template databases, means we can start with a master branch of sorts, fork a database including data fixtures, and not worry about foreign keys or other constraints wreaking havoc on the initial import process.

That’s huge, and yet it’s nearly an unknown feature in many circles. The remaining puzzle piece that really unleashes templates, however, is that any database can act as a template. It’s far more obvious with template0 and template1 since they include it in their names, but we can create a database and base its contents on any other database in our instance.

Let’s remove the some_junk table from template1 and put it somewhere more appropriate. While we’re at it, let’s define a basic table relationship we can test:

\c template1
 
DROP TABLE some_junk;
 
CREATE DATABASE bootstrap;
 
\c bootstrap
 
CREATE TABLE some_junk (
  id         SERIAL PRIMARY KEY,
  trash      TEXT
);
 
CREATE TABLE some_stuff (
  id       SERIAL PRIMARY KEY,
  junk_id  INT REFERENCES some_junk (id),
  garbage  TEXT
);
 
INSERT INTO some_junk (trash)
SELECT repeat(a.id::TEXT, a.id) FROM generate_series(1, 5) a(id);
 
INSERT INTO some_stuff (junk_id, garbage)
SELECT a.id % 5 + 1, repeat(a.id::TEXT, a.id % 5 + 1)
  FROM generate_series(1, 50) a(id);
 
CREATE DATABASE test_a TEMPLATE bootstrap;

With these two tables in place along with a base set of sample data, we can run any number of verification steps before tearing it down and trying again. Let’s perform a very basic failure test by trying to insert an invalid relationship:

(cat <<EOF | psql test_a &>/dev/null
\set ON_ERROR_STOP on
BEGIN;
INSERT INTO some_stuff (junk_id, garbage) VALUES (6, 'YAY');
ROLLBACK;
EOF
); test $? -eq 3 && echo "passed"
 
passed

The psql command reports an exit status of 3 when a script fails in some manner. We designed the script to fail, so that’s exactly what we want. In this case, our test passed and we can continue with more tests, or drop the test_a database completely if some of the tests sufficiently tainted the data. We don’t care about any of the contents, and in fact, should throw them away as often as possible to ensure a sterile testing environment.

When it comes to development, we could use the contents of test_a to isolate some potentially dangerous set of code changes without affecting the main working data set in our development environment. Each developer can have their own playground on a shared server. We could even write a system to compare the differences between the main database and one of our forks, and produce a script to effect a migration. There are a lot of exciting use cases lurking here.
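
A crude first pass at that comparison could be as simple as diffing schema-only dumps; this is only a sketch, and real migration tooling would be far more discerning:

pg_dump --schema-only bootstrap > bootstrap_schema.sql
pg_dump --schema-only test_a > test_a_schema.sql
 
diff -u bootstrap_schema.sql test_a_schema.sql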

All of this does of course carry one rather glaring caveat: it’s not free. It takes time and storage resources to copy the contents of a template database. The larger the data set and count of objects, the more work Postgres must perform in order to initialize everything. If we have 100GB in the template database we’re basing further databases upon, we have to wait for that much data to be copied. Taking that into consideration, there’s probably an upper bound on how large a database can get before using it as a template becomes rather cumbersome.
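
It’s worth knowing how heavy a template actually is before leaning on it this way, and a quick size check is all it takes:

-- How much data will each clone of our bootstrap database copy?
SELECT pg_size_pretty(pg_database_size('bootstrap'));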

On the other hand, Postgres is smart enough to realize data files themselves won’t be any different just because they reside in another database container, so it copies them wholesale. We don’t need to wait for index creation or any other high-level allocation command like we would if we were replaying a script. If we increase the row count of some_stuff to five million, filling the table takes 95 seconds on our test VM and consumes about 400MB of space. Creating a new database with it as a template however, merely requires about one second. This drastically increases iterative test throughput, provided we have such a contraption.

I almost never see this kind of usage in the wild, which is a huge shame. It’s not quite git, but we can version and branch our database contents to our heart’s content with templates. Imagine if we have a product that is distributed with a core data set. We could package a deployable binary extract by targeting the proper database for a specific application version, while still maintaining the entire product database history on our side.

Few database engines even have a mechanism for this kind of database cloning. Why let it go to waste?



PG Phriday: Converting to Horizontal Distribution

May 27th, 2016 | Published in Database, Tech Talk


Now that we’ve decided to really start embracing horizontal scaling builds, there is a critically important engine-agnostic element we need to examine. Given an existing table, how exactly should we split up the contents across our various nodes during the conversion process? Generally this is done by selecting a specific column and applying some kind of hash or custom distribution mechanism to ensure all node contents are reasonably balanced. But how do we go about figuring that out?

This question is usually answered with “use the primary key!” But this gets a bit more complex in cases where tables rely on composite keys. This doesn’t happen often, but can really throw a wrench into the works. Imagine, for example, that we’re using Postgres-XL and have four nodes numbered data0000 through data0003. Then we find this table:

CREATE TABLE comp_event
(
  group_code   TEXT NOT NULL,
  event_id     BIGINT NOT NULL,
  entry_tm     TIMETZ NOT NULL,
  some_data    TEXT NOT NULL
);
 
INSERT INTO comp_event
SELECT a.id % 10, a.id % 100000,
       '08:00'::TIMETZ + (a.id % 43200 || 's')::INTERVAL,
       repeat('a', a.id%10)
  FROM generate_series(1, 1000000) a(id);
 
ALTER TABLE comp_event ADD CONSTRAINT pk_comp_event
      PRIMARY KEY (group_code, event_id, entry_tm);
 
ANALYZE comp_event;

The default for Postgres-XL is to simply use the first column for distribution. This tends to fit most cases, as the first column is usually either the primary key, or a reasonable facsimile of it. We can even use a system view to confirm this is the case:

SELECT pcrelid::regclass AS TABLE_NAME, a.attname AS column_name
  FROM pgxc_class c
  JOIN pg_attribute a ON (a.attrelid = c.pcrelid)
 WHERE a.attnum = c.pcattnum;
 
 table_name | column_name 
------------+-------------
 comp_event | group_code

But is this what we want? What would happen if we naively went ahead with the default value and converted the database? Well, the major problem is that we don’t know the hash algorithm Postgres-XL is using. It’s entirely possible that the resulting data distribution will be anywhere from “slightly off” to “completely awful,” and we need a way to verify uniform distribution before moving forward.

In the case of Postgres-XL, we can actually poll each node directly with EXECUTE DIRECT. Repeatedly executing the same query and just substituting the node name is both inefficient and cumbersome, especially if we have dozens or hundreds of nodes. Thankfully Postgres makes it easy to create functions that return sets, so let’s leverage that power in our favor:

CREATE TYPE pgxl_row_dist AS (node_name TEXT, total BIGINT);
 
CREATE OR REPLACE FUNCTION check_row_counts(tab_name REGCLASS)
RETURNS SETOF pgxl_row_dist AS
$BODY$
DECLARE
  r pgxl_row_dist;
  query TEXT;
BEGIN
  FOR r.node_name IN
      SELECT node_name
        FROM pgxc_node WHERE node_type = 'D'
  LOOP
    query = 'EXECUTE DIRECT ON (' || r.node_name || ') 
      ''SELECT count(*) FROM ' || tab_name::TEXT || '''';
    EXECUTE query INTO r.total;
    RETURN NEXT r;
  END LOOP;
END;
$BODY$ LANGUAGE plpgsql;

A function like this really should ship in some form with the standard Postgres-XL distribution; unfortunately, if it does, I couldn’t find it. Regardless, with this in hand, we can provide a table name and see how many rows exist on each node no matter our cluster size. For our four node cluster, each node should have about 250,000 rows, give or take some variance caused by the hashing algorithm. Let’s see what the distribution actually resembles:

SELECT * FROM check_row_counts('comp_event');
 
 node_name | total  
-----------+--------
 data0000  | 600000
 data0001  | 200000
 data0002  | 200000
 data0003  |      0

That’s… unfortunate. The table doesn’t list its columns in order of cardinality since that’s never been a concern before now. Beyond that, the first column is part of our primary key, so it makes sense for it to be listed near the top anyway. Position is hardly a reliable criterion beyond a first approximation, so how do we fix this?

Let’s examine the Postgres statistics catalog for the comp_event table, and see how cardinality is actually represented:

SELECT attname, n_distinct
  FROM pg_stats
 WHERE tablename = 'comp_event';
 
  attname   | n_distinct 
------------+------------
 group_code |         10
 event_id   |    12471.5
 entry_tm   |  -0.158365
 some_data  |         10

The sample insert statement we used to fill comp_event should have already made this clear, but not every table comes with such an obvious giveaway. If the table already existed, or we loaded it from multiple sources or scripts, the statistics would be our primary guide.

In this particular case, the event_id or entry_tm columns would be much better candidates to achieve balanced distribution. For now, let’s just keep things simple and use the event_id column since the primary difference is the cardinality. There’s no reason to introduce multiple variables such as column type quite yet.

Let’s check our row totals after telling Postgres-XL we want to use event_id for hashing:

TRUNCATE TABLE comp_event;
ALTER TABLE comp_event DISTRIBUTE BY HASH (event_id);
 
INSERT INTO comp_event
SELECT a.id % 10, a.id % 100000,
       '08:00'::TIMETZ + (a.id % 43200 || 's')::INTERVAL,
       repeat('a', a.id%10)
  FROM generate_series(1, 1000000) a(id);
 
SELECT * FROM check_row_counts('comp_event');
 
 node_name | total  
-----------+--------
 data0000  | 250050
 data0001  | 249020
 data0002  | 249730
 data0003  | 251200

Much better! Now our queries will retrieve data from all four nodes, and the first node isn’t working three times harder than the others. If we had gone into production using the previous distribution, our cluster would be unbalanced and we’d be chasing performance problems. Or if we figured this out too late, we’d have to rebalance all of the data, which can take hours or even days depending on row count. No thanks!

It’s important to do this kind of analysis before moving data into a horizontally capable cluster. The Postgres pg_stats view makes that easy to accomplish. And if repeating this process for every table is too irritating, we can even do it in bulk. Let’s construct an unholy abomination that returns the primary key column with the highest cardinality for all tables:

SELECT DISTINCT ON (schemaname, tablename)
       schemaname, tablename, attname
  FROM (
    SELECT s.schemaname, c.relname AS tablename,
           a.attname, i.indisprimary, i.indisunique,
           SUM(s.n_distinct) AS total_values
      FROM pg_index i
      JOIN pg_attribute a ON (
               a.attrelid = i.indrelid AND
               a.attnum = ANY(i.indkey)
           )
      JOIN pg_class c ON (c.oid = i.indrelid)
      JOIN pg_namespace n ON (n.oid = c.relnamespace)
      JOIN pg_stats s ON (
               s.schemaname = n.nspname AND
               s.tablename = c.relname AND
               s.attname = a.attname
           )
     WHERE i.indisunique
       AND s.schemaname NOT IN ('pg_catalog', 'information_schema')
     GROUP BY 1, 2, 3, 4, 5
) cols
ORDER BY schemaname, tablename, 
      CASE WHEN total_values < 0 THEN -total_values * 9e20
           ELSE total_values END DESC,
      indisprimary DESC, indisunique DESC;

Gross! But at least we only have to do that once or twice before restoring all of our data in the new horizontally scaled cluster. We could even make the query uglier and have it generate our ALTER TABLE statements so we don’t need to manually correct the distribution of every table. And don’t forget that this process applies to nearly all distribution mechanisms which depend on column contents, not just Postgres-XL. Just do your due diligence, and everything should work out.
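
As a parting sketch, here’s a simplified take on that idea. Rather than the full monster above, this version just grabs each table’s highest-cardinality column straight from pg_stats and emits a Postgres-XL ALTER statement for it, ignoring the primary key restriction entirely, so treat it as a starting point rather than gospel:

SELECT DISTINCT ON (schemaname, tablename)
       format('ALTER TABLE %I.%I DISTRIBUTE BY HASH (%I);',
              schemaname, tablename, attname)
  FROM pg_stats
 WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
 ORDER BY schemaname, tablename,
       CASE WHEN n_distinct < 0 THEN -n_distinct * 9e20
            ELSE n_distinct END DESC;

Piping that output through psql -At and back into another psql session would even apply it for us, assuming we trust ourselves that much.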

Happy scaling!


