PG Phriday: Down in the Dumps

These days, with multiple Postgres database environments being commonplace, it’s not unheard of to copy data from one to another. Perhaps a production extract is necessary to properly vet data in a staging environment. Maybe the development environment needs to update its data to reflect an especially pernicious and intractable edge case someone observed. In any of these scenarios, we are likely to extract data from multiple tables and import it elsewhere. The question is: how?

Last week I flatly stated that pg_dump is not an adequate backup tool. Despite that, it’s perfect for situations like these, where we only need a few tables out of an entire database. However, tables frequently reference other tables in foreign keys. While such integrity concerns do not affect data export, they greatly complicate import efforts.

The more deeply nested the referential integrity rules become, the more hair we’ll pull out trying to trace them all to their requisite tables. And even after we’ve found them all, we still need to unravel those same horrors again in the target cluster.

A typical table dependency chart

To better illustrate what we mean by “horrors”, take this relatively simple table hierarchy:

CREATE TABLE tab_a (id INT PRIMARY KEY);
CREATE TABLE tab_a1 (id INT PRIMARY KEY REFERENCES tab_a);
CREATE TABLE tab_b (id INT PRIMARY KEY REFERENCES tab_a);
CREATE TABLE tab_b1 (id INT PRIMARY KEY REFERENCES tab_b);
CREATE TABLE tab_c (id INT PRIMARY KEY REFERENCES tab_b);
CREATE TABLE tab_c1 (id INT PRIMARY KEY REFERENCES tab_c);

INSERT INTO tab_a VALUES (1);
INSERT INTO tab_a1 VALUES (1);
INSERT INTO tab_b VALUES (1);
INSERT INTO tab_b1 VALUES (1);
INSERT INTO tab_c VALUES (1);
INSERT INTO tab_c1 VALUES (1);

Imagine these are six tables out of potentially hundreds. From an outside perspective, we may only know that tab_a needs to be copied from one system to another. No matter what tool we use, finding the other tables is somewhat irritating. An ORM might display the relationships in a chart, and the psql command-line client or pgAdmin will report tables that immediately depend on the one we’re examining, but none of these is a great way to isolate them all at once.
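For example, psql’s \d command only reports dependents one level deep. On the system where these tables were created, the output looks something like this (the constraint names assume the auto-generated defaults):

\d tab_a

      Table "public.tab_a"
 Column |  Type   | Modifiers 
--------+---------+-----------
 id     | integer | not null
Indexes:
    "tab_a_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "tab_a1" CONSTRAINT "tab_a1_id_fkey" FOREIGN KEY (id) REFERENCES tab_a(id)
    TABLE "tab_b" CONSTRAINT "tab_b_id_fkey" FOREIGN KEY (id) REFERENCES tab_a(id)

Note that tab_b1, tab_c, and tab_c1 are nowhere to be seen; we’d have to repeat this for every table in the chain.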

Fortunately we have the Postgres system catalog. The pg_constraint system table tracks all of those table relationships, and we can recursively fetch the entire tree using a CTE. Remember how awesome CTEs are? Let’s mix some chocolate with our peanut butter by using one to mine the catalog.

Starting with tab_a, here’s how we might find all of the other tables in the list:

WITH RECURSIVE deptree AS (
  SELECT conrelid, conrelid::REGCLASS::TEXT AS table_name,
         confrelid, confrelid::REGCLASS::TEXT AS dep_table
    FROM pg_constraint
   WHERE contype IN ('f', 'p')
     AND (confrelid::REGCLASS::TEXT = 'tab_a' OR
          conrelid::REGCLASS::TEXT = 'tab_a')
   UNION
  SELECT c.conrelid, c.conrelid::REGCLASS::TEXT AS table_name,
         c.confrelid, c.confrelid::REGCLASS::TEXT AS dep_table
    FROM pg_constraint c
    JOIN deptree d ON (d.conrelid = c.confrelid)
   WHERE c.contype = 'f'
)
SELECT *
  FROM deptree;

 conrelid | table_name | confrelid | dep_table 
----------+------------+-----------+-----------
    69327 | tab_a      |         0 | -
    69332 | tab_a1     |     69327 | tab_a
    69342 | tab_b      |     69327 | tab_a
    69352 | tab_b1     |     69342 | tab_b
    69362 | tab_c      |     69342 | tab_b
    69372 | tab_c1     |     69362 | tab_c

As complicated as this query appears, it’s not actually that bad. We simply bootstrap the CTE with any tables that depend on tab_a, and repeat the process with each successive loop. Even if there are dozens or hundreds of incestuous relationships, we’ll capture all of them.
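Incidentally, if we ever need the tables in dependency order (say, for a serial restore), a small variation on the same idea works. Here’s a minimal sketch that bootstraps from the primary key alone and tracks recursion depth; since a table can appear at several depths in a diamond-shaped graph, we group by name and sort by the deepest occurrence:

WITH RECURSIVE deptree AS (
  SELECT conrelid, conrelid::REGCLASS::TEXT AS table_name,
         0 AS depth  -- root of the tree
    FROM pg_constraint
   WHERE contype = 'p'
     AND conrelid::REGCLASS::TEXT = 'tab_a'
   UNION
  SELECT c.conrelid, c.conrelid::REGCLASS::TEXT,
         d.depth + 1  -- one FK hop further from tab_a
    FROM pg_constraint c
    JOIN deptree d ON (d.conrelid = c.confrelid)
   WHERE c.contype = 'f'
)
SELECT table_name
  FROM deptree
 GROUP BY table_name
 ORDER BY max(depth);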

In any case, we have all the data we need to construct a pg_dump command to export all of these tables. To keep this as speedy as possible, let’s perform a parallel dump of the indicated tables into a directory we can transmit to another server.

pg_dump -t tab_a -t tab_a1 -t tab_b -t tab_b1 -t tab_c -t tab_c1 \
        -a -Fd -j 8 -f /tmp/tab_exports postgres

Though we can list each table individually, it would be nice if we could supply a filename of desired tables instead. A dependency graph of sufficient depth could make for an impossibly long command line, given character limit constraints. That would leave us executing multiple dumps, or relying on wildcards and hoping there’s enough name overlap.
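Still, nothing stops us from generating such a file ourselves. As a sketch, assume we’ve saved the recursive query above as /tmp/find_deps.sql with its final line changed to SELECT DISTINCT table_name FROM deptree. Then psql can write the list for us:

# -A and -t strip alignment and headers so we get one bare table name per line;
# -X skips any local psqlrc that might pollute the output.
psql -AtX -f /tmp/find_deps.sql postgres > /tmp/tablist.txt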

Given that caveat, maybe we want to combine pg_dump with xargs and its parallel capabilities instead:

mkdir /tmp/tab_exports

cat /tmp/tablist.txt | xargs -I{} -P 8 \
    pg_dump -t {} -a -Fc -f /tmp/tab_exports/{}.pgr postgres

Either way, we now have a data-only representation of every table in the dependency tree of the table we’re actually importing. Now we just have to insert these values into the target system. And this is usually where we first encounter a major roadblock: existing contents.

First of all, that data is in our way. Secondly, the most efficient manner of ridding ourselves of its oppression is the TRUNCATE command. While TRUNCATE is a DDL-level command whose foreign key checks can’t be circumvented, it does have a CASCADE decorator which will empty the entire dependency chain. This actually helps us, since we have the data to rebuild anything affected by a cascading truncate.

Let’s give it a try on the recipient database:

TRUNCATE TABLE tab_a;

ERROR:  cannot truncate a table referenced in a foreign key constraint
DETAIL:  Table "tab_a1" references "tab_a".
HINT:  Truncate table "tab_a1" at the same time, or use TRUNCATE ... CASCADE.

TRUNCATE TABLE tab_a CASCADE;

NOTICE:  truncate cascades to table "tab_a1"
NOTICE:  truncate cascades to table "tab_b"
NOTICE:  truncate cascades to table "tab_b1"
NOTICE:  truncate cascades to table "tab_c"
NOTICE:  truncate cascades to table "tab_c1"

Well that was terrifying. Also note that the error only reported one of the dependencies. If we relied only on this information, we might attempt to drop the foreign key constraint thinking it was the only one. Every subsequent attempt to truncate the table would result in another error.

We’ve avoided all of that, but imagine if we had attempted to restrict our efforts to only the data in tab_a. We’d forfeit the benefits of the cascade: manually tracing each error until the truncate finally succeeded, being unable to re-establish the constraints afterwards due to missing keys, and so on. No thanks.

Though the cascading truncate helped us empty the tables on the recipient, restoring from our dump is somewhat problematic. We want to do it in parallel, but foreign keys directly prevent this by enforcing key integrity at each level. So we’re stuck either importing each table in dependency-tree order, or doing a bit of preparation first.

We can leverage that CTE again and build a couple of SQL scripts to help us out:

CREATE TEMP TABLE tab_deps AS
WITH RECURSIVE deptree AS (
  SELECT conrelid, conrelid::REGCLASS::TEXT AS table_name,
         confrelid, confrelid::REGCLASS::TEXT AS dep_table
    FROM pg_constraint
   WHERE contype IN ('f', 'p')
     AND (confrelid::REGCLASS::TEXT = 'tab_a' OR
          conrelid::REGCLASS::TEXT = 'tab_a')
   UNION
  SELECT c.conrelid, c.conrelid::REGCLASS::TEXT AS table_name,
         c.confrelid, c.confrelid::REGCLASS::TEXT AS dep_table
    FROM pg_constraint c
    JOIN deptree d ON (d.conrelid = c.confrelid)
   WHERE c.contype = 'f'
)
SELECT *
  FROM deptree;

COPY (
  SELECT 'ALTER TABLE ' || table_name || ' DISABLE TRIGGER ALL;'
    FROM tab_deps
) TO '/tmp/no_trigs.sql';

COPY (
  SELECT 'ALTER TABLE ' || table_name || ' ENABLE TRIGGER ALL;'
    FROM tab_deps
) TO '/tmp/fix_trigs.sql';
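One caveat worth mentioning: a server-side COPY ... TO writes its file as the database server itself and generally requires superuser privileges. If that’s inconvenient, psql’s client-side \copy variant produces the same scripts from wherever we’re connected, provided we run it in the same session that created the tab_deps temp table:

-- Each \copy must fit on a single line.
\copy (SELECT 'ALTER TABLE ' || table_name || ' DISABLE TRIGGER ALL;' FROM tab_deps) TO '/tmp/no_trigs.sql'
\copy (SELECT 'ALTER TABLE ' || table_name || ' ENABLE TRIGGER ALL;' FROM tab_deps) TO '/tmp/fix_trigs.sql'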

One thing that’s interesting about Postgres is that foreign keys are actually enforced by hidden system-level triggers. Given that information, we can simply turn them off for the duration of our import, and re-enable them afterwards. Now we have all the parts we need to restore the donor data into the environment of our choice.
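Bear in mind that disabling system triggers requires superuser access. If we’re curious, the triggers themselves are visible in the pg_trigger catalog, flagged by tgisinternal and named with an RI_ConstraintTrigger prefix. A quick peek (the LIKE pattern is just for our example tables):

SELECT tgrelid::REGCLASS AS table_name, tgname, tgenabled
  FROM pg_trigger
 WHERE tgisinternal  -- only internally generated constraint triggers
   AND tgrelid::REGCLASS::TEXT LIKE 'tab_%';

The tgenabled column reads 'O' while a trigger is active and 'D' once disabled, which makes for a handy sanity check before and after the import.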

It should be cautioned that all of the import steps should be performed on the target system itself, even if this means synchronizing the dump directory to that system first. We don’t want to fat-finger these commands in production or an equally important system.
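For instance, something like rsync can ship the whole directory-format dump over (the host name here is purely illustrative):

# target-host stands in for wherever the restore will actually run
rsync -av /tmp/tab_exports/ target-host:/tmp/tab_exports/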

In any case, this is how that might go:

psql -f /tmp/no_trigs.sql postgres
psql -c 'TRUNCATE TABLE tab_a CASCADE;' postgres
pg_restore -d postgres -j 8 /tmp/tab_exports
psql -f /tmp/fix_trigs.sql postgres

Or if we used xargs, we’ll want this procedure instead:

psql -f /tmp/no_trigs.sql postgres
psql -c 'TRUNCATE TABLE tab_a CASCADE;' postgres

cat /tmp/tablist.txt | xargs -I{} -P 8 \
    pg_restore -d postgres /tmp/tab_exports/{}.pgr

psql -f /tmp/fix_trigs.sql postgres

And voila, we’re saved!

Of course, table structures between the environments may not match. In that case, we would simply drop all of the involved tables instead of truncating them and stunning their constraint triggers. As such, we would want to remove the -a data-only flag from the pg_dump commands we used, so pg_restore can recreate the tables as well. Ultimately, we’ve broken the paradox of satisfying constraints while simultaneously circumventing them.
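Our temp table from earlier can write that script too. Here’s a sketch, keeping in mind that DROP ... CASCADE also removes any foreign keys pointing at these tables from elsewhere in the schema:

COPY (
  -- IF EXISTS keeps the script safe to run repeatedly
  SELECT 'DROP TABLE IF EXISTS ' || table_name || ' CASCADE;'
    FROM tab_deps
) TO '/tmp/drop_tabs.sql';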

It’s not an ideal situation, but hey, whatever works.