PG Phriday: Rapid Prototyping

Ah, source control. From Subversion to git and everything in between, we all love to manage our code. The ability to quickly branch from an existing base is incredibly important to exploring and potentially abandoning divergent code paths. One often overlooked Postgres feature is the template database. At first glance, it’s just a way to ensure newly created databases contain some base functionality without having to bootstrap every time, but it’s so much more than that.

Our first clue to the true potential of template databases should probably start with the template0 and template1 databases. Every Postgres instance has them, and unless we’re familiar with the internals or their intended use, they’re easy to ignore. They’re empty after all, right? Not quite. These templates actually define all of the core structures that must exist in every standard Postgres database. System tables, views, the PUBLIC schema, and even the character encoding are all defined in template0, and by extension, template1. The primary difference between the two is that template0 is intended to remain pristine, so we can always create a database without any of our own standard bootstrapping code.

Heck, we can’t even connect to the thing.

psql template0
psql: FATAL:  database "template0" is not currently accepting connections
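
The catalog makes the distinction obvious: both databases are flagged as templates, but template0 simply refuses connections.

SELECT datname, datistemplate, datallowconn
  FROM pg_database
 WHERE datname LIKE 'template%';

  datname   | datistemplate | datallowconn 
------------+---------------+--------------
 template0  | t             | f
 template1  | t             | t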

But we can make changes to template1. What happens if we create a table in the template1 database and then create a new database?

\c template1

CREATE TABLE some_junk (id SERIAL, trash TEXT);
CREATE DATABASE foo;

\c foo
\d

                List of relations
 Schema |       Name        |   Type   |  Owner   
--------+-------------------+----------+----------
 public | some_junk         | table    | postgres
 public | some_junk_id_seq  | sequence | postgres

So now we can see firsthand that objects created in template1 will automatically be created in any new database. Existing databases don’t benefit from new objects in any template database; the copy is a one-time snapshot of the template state at the moment the new database is created. This also applies to any object in the template: functions, types, tables, views, or anything else in template1 will be copied to new databases upon creation.
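
It’s easy to verify that snapshot behavior directly. If we add a hypothetical late_arrival table to template1 after foo already exists, foo never sees it:

\c template1

CREATE TABLE late_arrival (id SERIAL);

\c foo

\d late_arrival

Did not find any relation named "late_arrival".

We’ll want to drop late_arrival from template1 afterwards so it doesn’t sneak into every future database.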

This much is almost prevalent enough to be common knowledge. As such, DBAs and some users leverage it for popular extensions and other functionality they want included everywhere. It’s not uncommon to see something like this in a new installation:

\c template1

CREATE EXTENSION pg_stat_statements; -- Useful query stats.
CREATE EXTENSION pgstattuple;       -- Table data distribution info.
CREATE EXTENSION postgres_fdw;      -- Postgres foreign data wrapper.
CREATE EXTENSION plpythonu;         -- We use a lot of Python.

Now when we create a database, we’ll always have the ability to perform query forensics, analyze table bloat in the storage files, set up connections to other Postgres databases, and deploy Python-based procedures. The last one probably isn’t as common as the first three, but if a series of applications makes heavy use of Python and all databases in an organization reflect that, it’s nice to have the option. Every one of these objects is created on our behalf; we don’t even need to ask.
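
A quick peek at pg_extension in any newly created database confirms the bootstrapping took hold. Here we use a throwaway extension_check database as a stand-in; the exact list will of course depend on what was installed in template1.

CREATE DATABASE extension_check;

\c extension_check

SELECT extname FROM pg_extension ORDER BY extname;

       extname       
---------------------
 pg_stat_statements
 pgstattuple
 plpgsql
 plpythonu
 postgres_fdw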

What is a bit more esoteric, though, is that this also applies to data. Here’s what happens if we put a few rows in our some_junk table:

\c template1

INSERT INTO some_junk (trash)
SELECT repeat(a.id::TEXT, a.id) FROM generate_series(1, 5) a(id);

DROP DATABASE foo;
CREATE DATABASE foo;

\c foo

SELECT * FROM some_junk;

 id | trash 
----+-------
  1 | 1
  2 | 22
  3 | 333
  4 | 4444
  5 | 55555

Well, that changes everything, doesn’t it? Imagine this in a development or QA environment, where we may want to repeatedly build and tear down prototypes and test cases. Knowing that we can include data in template databases means we can start with a master branch of sorts, fork a database including data fixtures, and not worry about foreign keys or other constraints wreaking havoc on the initial import process.

That’s huge, and yet it’s nearly an unknown feature in many circles. The remaining puzzle piece that really unleashes templates, however, is that any database can act as a template. It’s far more obvious with template0 and template1 since the word is right there in their names, but we can create a database and base its contents on any other database in our instance.

Let’s remove the some_junk table from template1 and put it somewhere more appropriate. While we’re at it, let’s define a basic table relationship we can test:

\c template1

DROP TABLE some_junk;

CREATE DATABASE bootstrap;

\c bootstrap

CREATE TABLE some_junk (
  id         SERIAL PRIMARY KEY,
  trash      TEXT
);

CREATE TABLE some_stuff (
  id       SERIAL PRIMARY KEY,
  junk_id  INT REFERENCES some_junk (id),
  garbage  TEXT
);

INSERT INTO some_junk (trash)
SELECT repeat(a.id::TEXT, a.id) FROM generate_series(1, 5) a(id);

INSERT INTO some_stuff (junk_id, garbage)
SELECT a.id % 5 + 1, repeat(a.id::TEXT, a.id % 5 + 1)
  FROM generate_series(1, 50) a(id);

CREATE DATABASE test_a TEMPLATE bootstrap;

With these two tables in place along with a base set of sample data, we can run any number of verification steps before tearing it down and trying again. Let’s perform a very basic failure test by trying to insert an invalid relationship:

psql test_a <<EOF
\set ON_ERROR_STOP on
BEGIN;
INSERT INTO some_stuff (junk_id, garbage) VALUES (6, 'YAY');
ROLLBACK;
EOF

test $? -eq 3 && echo "passed"

passed

The psql command reports an exit status of 3 when a script fails while ON_ERROR_STOP is enabled. We designed the script to fail, so that’s exactly what we want. In this case, our test passed, and we can continue with more tests, or drop the test_a database completely if some of the tests sufficiently tainted the data. We don’t care about any of the contents, and in fact, should throw them away as often as possible to ensure a sterile testing environment.
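
Resetting between runs is as cheap as throwing the fork away and cutting a fresh one from the same template:

DROP DATABASE test_a;
CREATE DATABASE test_a TEMPLATE bootstrap;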

When it comes to development, we could use the contents of test_a to isolate some potentially dangerous set of code changes without affecting the main working data set in our development environment. Each developer can have their own playground on a shared server. We could even write a system to compare the differences between the main database and one of our forks, and produce a script to affect a migration. There are a lot of exciting use cases lurking here.
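
As a rough sketch of that shared-server workflow, each developer might simply fork their own copy of the bootstrap data under a hypothetical naming convention, keeping in mind that Postgres won’t copy a template while other sessions are connected to it:

CREATE DATABASE dev_alice TEMPLATE bootstrap;
CREATE DATABASE dev_bob   TEMPLATE bootstrap;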

All of this does of course carry one rather glaring caveat: it’s not free. It takes time and storage resources to copy the contents of a template database. The larger the data set and count of objects, the more work Postgres must perform in order to initialize everything. If we have 100GB in the template database we’re basing further databases upon, we have to wait for that much data to be copied. Taking that into consideration, there’s probably an upper bound on how large a database can get before using it as a template becomes rather cumbersome.

On the other hand, Postgres is smart enough to realize data files themselves won’t be any different just because they reside in another database container, so it copies them wholesale. We don’t need to wait for index creation or any other high-level allocation command like we would if we were replaying a script. If we increase the row count of some_stuff to five million, filling the table takes 95 seconds on our test VM and consumes about 400MB of space. Creating a new database with it as a template however, merely requires about one second. This drastically increases iterative test throughput, provided we have such a contraption.
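
Those numbers are easy to reproduce: psql’s \timing meta-command and the pg_database_size function are all we need, though the figures will naturally vary by hardware. The test_b name below is just a stand-in.

\c bootstrap
\timing on

TRUNCATE some_stuff;

INSERT INTO some_stuff (junk_id, garbage)
SELECT a.id % 5 + 1, repeat(a.id::TEXT, a.id % 5 + 1)
  FROM generate_series(1, 5000000) a(id);

SELECT pg_size_pretty(pg_database_size('bootstrap'));

CREATE DATABASE test_b TEMPLATE bootstrap;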

I almost never see this kind of usage in the wild, which is a huge shame. It’s not quite git, but we can version and branch our database contents to our heart’s content with templates. Imagine if we have a product that is distributed with a core data set. We could package a deployable binary extract by targeting the proper database for a specific application version, while still maintaining the entire product database history on our side.
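
A branching scheme under that model might look something like this, with purely hypothetical database names standing in for product versions:

-- Branch the 1.2 data set from the 1.1 release without touching it.
CREATE DATABASE product_v1_2 TEMPLATE product_v1_1;

-- ...apply the schema and data migrations 1.2 requires here...

-- Older releases stay available as their own databases, ready to be
-- re-forked for a hotfix or dumped as a deployable extract.
CREATE DATABASE hotfix_v1_1 TEMPLATE product_v1_1;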

Few database engines even have a mechanism for this kind of database cloning. Why let it go to waste?