PG Phriday: In the Window

I’ll be the first to admit that I found Postgres window functions fantastically confusing when I first encountered them. They’re a powerful and versatile tool for building reports and summaries, but that functionality hides behind a fairly steep learning curve. One of the ways to combat their inherent complexity is to fully explore how they work, instead of just trying to wrap our heads around the expected results.

Window doggies have gotten decidedly smug

To that end, let’s set up a quick set of data in the customary fashion:

CREATE TABLE sensor_log (
  id            SERIAL PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);
 
INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, s.id % 100,
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10) || 'm')::INTERVAL
  FROM generate_series(1, 5000) s(id);
 
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
 
ANALYZE sensor_log;

Yes, it’s the trusty sensor_log table once again. This time around we only really care about demonstration rather than volume, so we’ve elected for five thousand rows in place of the usual five million. Our data represents one thousand sensors taking readings in sequence, ten minutes apart. This should provide enough overlap to easily demonstrate what’s going on behind the scenes.

Let’s start with probably the easiest window function of the lot: row_number. All it does is number rows in the result set so we have a kind of counter that’s useful in a lot of different contexts. How do the first ten rows for today look?

SELECT location, reading, ROW_NUMBER() OVER ()
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 LIMIT 10;
 
 location | reading | row_number 
----------+---------+------------
 143      |      43 |          1
 142      |      42 |          2
 141      |      41 |          3
 140      |      40 |          4
 139      |      39 |          5
 138      |      38 |          6
 137      |      37 |          7
 136      |      36 |          8
 135      |      35 |          9
 134      |      34 |         10

Window functions must be called on some kind of data window. An empty set of () represents the entire data set, with no ordering, groups, or other shenanigans involved. We’re just numbering the results, and the output would have been no different if we removed the window function. This kind of use is very similar to Oracle’s ROWNUM pseudo-column.

Yet a row number by itself isn’t that interesting. Let’s number the rows in order of sensor reading, and fetch the first ten rows of those results:

SELECT location, reading, ROW_NUMBER() OVER (ORDER BY reading)
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 LIMIT 10;
 
 location | reading | row_number 
----------+---------+------------
 100      |       0 |          1
 101      |       1 |          2
 1        |       1 |          3
 102      |       2 |          4
 2        |       2 |          5
 103      |       3 |          6
 3        |       3 |          7
 4        |       4 |          8
 104      |       4 |          9
 105      |       5 |         10

If () was the window representing all rows, then (ORDER BY reading) is that same content after being sorted by the reading column. Not only did Postgres sort our results, but it numbered them in the post-sorted order. This is a very fine distinction! Consider what happens when we move the ORDER BY clause into the query proper.

SELECT location, reading, ROW_NUMBER() OVER ()
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 ORDER BY reading
 LIMIT 10;
 
 location | reading | row_number 
----------+---------+------------
 100      |       0 |         44
 1        |       1 |        143
 101      |       1 |         43
 2        |       2 |        142
 102      |       2 |         42
 3        |       3 |        141
 103      |       3 |         41
 104      |       4 |         40
 4        |       4 |        140
 105      |       5 |         39

What the heck happened here? The data looks exactly the same, but the artificial row numbers are seemingly arbitrary. Indeed they are! By design. This is part of the reason window functions are so difficult to explain and comprehend. The fact of the matter is that each window is a virtual and separate manifestation of the plain query results.
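
We can observe that separation directly by declaring two windows in the same query. Each one numbers its own private copy of the results, so the counters disagree with each other (the rn_sorted and rn_raw aliases here are our own):

SELECT location, reading,
       ROW_NUMBER() OVER (ORDER BY reading) AS rn_sorted,
       ROW_NUMBER() OVER () AS rn_raw
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 LIMIT 10;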

Anyone who has struggled with pointers in C or C++ knows that abstracted structures introduce certain pitfalls when it comes to obtaining the desired results.

Window functions and you!

To help unravel the mystery a bit, let’s look at the natural state of the results without any window function nonsense. Of course we must also shift the data by 40 rows so we can see some of the same information the window received.

SELECT location, reading
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 LIMIT 10 OFFSET 40;
 
 location | reading 
----------+---------
 103      |       3
 102      |       2
 101      |       1
 100      |       0
 99       |      99
 98       |      98
 97       |      97
 96       |      96
 95       |      95
 94       |      94

This output represents the table rows as sorted by the index we created on reading_date. Since these results are not artificially sorted in any way, this is what the window function actually sees without its own specific sort operation. We shifted the results by 40 rows, and as expected, row 44 holds the reading of 0. The window function gave us exactly what we requested, but it numbered the rows before Postgres sorted them.

We can actually watch this in action by looking at the two query plans:

EXPLAIN
SELECT location, reading, ROW_NUMBER() OVER (ORDER BY reading)
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 LIMIT 10;
 
                              QUERY PLAN
----------------------------------------------------------------------
 Limit  (cost=15.91..16.08 rows=10 width=11)
   ->  WindowAgg  (cost=15.91..18.41 rows=143 width=11)
         ->  Sort  (cost=15.91..16.27 rows=143 width=11)
               Sort Key: reading
               ->  Index Scan using idx_sensor_log_date on sensor_log
                     (cost=0.29..10.79 rows=143 width=11)
                   Index Cond: (reading_date > ('now'::cstring)::date)
 
EXPLAIN
SELECT location, reading, ROW_NUMBER() OVER ()
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 ORDER BY reading
 LIMIT 10;
 
                              QUERY PLAN
----------------------------------------------------------------------
 Limit  (cost=15.67..15.69 rows=10 width=11)
   ->  Sort  (cost=15.67..16.03 rows=143 width=11)
         Sort Key: reading
         ->  WindowAgg  (cost=0.29..12.58 rows=143 width=11)
               ->  Index Scan using idx_sensor_log_date on sensor_log
                     (cost=0.29..10.79 rows=143 width=11)
                   Index Cond: (reading_date > ('now'::cstring)::date)

Note that the WindowAgg step occurs at the end of query execution in the first example, taking place directly after a sort operation. The second query sorts after the WindowAgg, indicating the window only had access to unsorted rows. The key detail is that window functions only see the rows of the query as if it had executed without them. This also happens after other aggregates are applied, meaning window functions operate on grouped results rather than raw rows; mixing regular aggregates with window functions demands caution for exactly that reason.

The easiest way to comprehend how a window function works is to run the query without it. That’s the data the window has access to, regardless of how we slice and dice it within the window itself. It also explains why we’re unable to refer to window function elements in other query clauses: they’re unusable in predicates, and we can’t leverage their calculations to group or limit results.
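
As a quick demonstration of that limitation, filtering on the window output directly fails outright (rn being a hypothetical alias):

SELECT location, reading,
       ROW_NUMBER() OVER (ORDER BY reading) AS rn
  FROM sensor_log
 WHERE rn <= 10;
 
ERROR:  column "rn" does not exist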

So imagine for a moment we don’t have the standard Postgres LIMIT clause. This is how we would snag the top ten results of our location readings:

SELECT *
  FROM (SELECT location, reading,
               ROW_NUMBER() OVER (ORDER BY reading)
          FROM sensor_log
         WHERE reading_date > CURRENT_DATE
       ) sub
 WHERE row_number <= 10;
 
 location | reading | row_number 
----------+---------+------------
 100      |       0 |          1
 101      |       1 |          2
 1        |       1 |          3
 102      |       2 |          4
 2        |       2 |          5
 103      |       3 |          6
 3        |       3 |          7
 4        |       4 |          8
 104      |       4 |          9
 105      |       5 |         10

This is clearly silly when using row_number, but the trick works the same with other window functions. Here’s how we’d obtain the 10th ranked readings for today’s data:

SELECT *
  FROM (SELECT location, reading,
               dense_rank() OVER (ORDER BY reading)
          FROM sensor_log
         WHERE reading_date > CURRENT_DATE
       ) sub
 WHERE dense_rank = 10;
 
 location | reading | dense_rank 
----------+---------+------------
 109      |       9 |         10
 9        |       9 |         10

Keep in mind that Postgres must fetch the full results internally to materialize them for the window functions. This is true whether there are 143 rows, as with our example, or 143 million.

The more advanced use cases for window functions are a topic for another day. Consider this a very high-level introduction to how they work and their inherent limitations instead of a comprehensive guide. There’s a lot of material here that deserves closer inspection, so there’s no need to rush. Either way, don’t let window functions confuse you more than necessary. Like any independent agent, you just need to know what they’re doing behind the scenes.

Like this, but with slightly fewer bombs

PG Phriday: Getting Assertive

There are a lot of languages available for authoring Postgres functions, but there’s nothing quite like the classic PL/pgSQL. It’s SQL! It’s not SQL! It’s a kind of horrifying mutant crossbreed suitable only for terrifying small children and generating complex reports from a cavalcade of dubious sources! And deep within its twisted entrails is an often overlooked feature usually only available in far more mature entities.

Just wait until they add sharding

That’s right, it’s obvious we’re referring to the ASSERT statement.

When we say “often overlooked”, we’re not kidding. The pertinent documentation offers a meager overview following a far more comprehensive summary of the RAISE statement. We thought it deserved better.
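
Before diving in, it helps to know the entire syntax, which is refreshingly terse:

ASSERT condition [ , message ];

If the condition evaluates to TRUE, nothing happens; if it comes back FALSE or NULL, Postgres raises an ASSERT_FAILURE exception using the optional message.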

Let’s begin with a simple table example that can form the basis of a more complex function:

CREATE TABLE sensor_log (
  id            SERIAL PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);
 
INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, s.id % 100,
       CURRENT_DATE - ((s.id * 10) || 's')::INTERVAL
  FROM generate_series(1, 5000000) s(id);
 
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
 
ANALYZE sensor_log;

What can we do with five million rows of simulated data from 1000 sensors that collect data every ten seconds? A good place to start is a summary or report table of some kind. We’re only doing cursory analysis, so we should begin with aggregating the daily average, minimum, and maximum. If we get adventurous later, we can add standard deviations, jitter, deltas, and other fancy logic report owners might find useful.

Since this type of data is easy to tabulate, we need just a single table and a function to manage its contents.

CREATE TABLE sensor_report (
  id            SERIAL PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading_date  DATE NOT NULL,
  avg_reading   BIGINT NOT NULL,
  min_reading   BIGINT NOT NULL,
  max_reading   BIGINT NOT NULL
);
 
CREATE INDEX idx_sensor_report_date ON sensor_report (reading_date);
 
CREATE OR REPLACE FUNCTION sp_trickle_report()
RETURNS BIGINT AS
$$
DECLARE
  last_update DATE;
  insert_count BIGINT;
BEGIN
  -- Find the most recent data in the report so we only
  -- refresh those contents.
 
  SELECT INTO last_update MAX(reading_date)
    FROM sensor_report;
 
  -- Bootstrap the report table if it's currently empty.
 
  IF last_update IS NULL THEN
    SELECT INTO last_update MIN(reading_date)
      FROM sensor_log;
  END IF;
 
  -- Since data comes in regularly, we only need to rebuild
  -- current data. Delete old aggregates so we can replace them.
 
  DELETE FROM sensor_report
   WHERE reading_date >= last_update;
 
  -- Perform the insert in a CTE so we can capture the count of
  -- inserted rows for further analysis or to return to our caller.
 
  WITH capture AS (
    INSERT INTO sensor_report (
             location, reading_date, avg_reading,
             min_reading, max_reading
           )
    SELECT location, reading_date::DATE,
           AVG(reading), MIN(reading), MAX(reading)
      FROM sensor_log
     WHERE reading_date >= last_update
     GROUP BY location, reading_date::DATE
     RETURNING *
  )
  SELECT COUNT(*) INTO insert_count
    FROM capture;
 
   RETURN insert_count;
END;
$$ LANGUAGE plpgsql;
 
SELECT sp_trickle_report();
 
 sp_trickle_report 
-------------------
            579000
 
ANALYZE sensor_report;

Thankfully this is a very simple (and somewhat naive) reporting function. Imagine this is our first iteration and we want to debug some of its activity. If we were using Python for example, we might pepper the code with assert statements or craft specific tests to accomplish that same goal. Well, PL/pgSQL allows us to do just that.

For the second iteration, let’s assume we want to avoid bootstrapping the report table if it’s empty. After all, bootstrapping only needs to be done once, so why pollute the function with that edge case? Besides, it’s entirely possible we might require a separate procedure to initialize the reporting table, and isolated functionality is generally less buggy.

If the table isn’t bootstrapped, it’s probably because we’re in a testing environment that is missing data. In this case, it makes sense to complain so the missing data is loaded, even though it’s not strictly an erroneous scenario. So let’s replace the bootstrap with an assertion. While we’re at it, let’s say a new requirement from the upstream code suggests a minimum of 2000 sensors are required for meaningful output. In such a case, we might need to revise our fixture data in the development and QA environments.

Here’s how that might look:

CREATE OR REPLACE FUNCTION sp_trickle_report()
RETURNS BIGINT AS
$$
DECLARE
  last_update DATE;
  insert_count BIGINT;
BEGIN
  -- Find the most recent data in the report so we only
  -- refresh those contents.
 
  SELECT INTO last_update MAX(reading_date)
    FROM sensor_report;
 
  ASSERT last_update IS NOT NULL, 'Need to bootstrap report table.';
 
  -- Since data comes in regularly, we only need to rebuild
  -- current data. Delete old aggregates so we can replace them.
 
  DELETE FROM sensor_report
   WHERE reading_date >= last_update;
 
  -- Perform the insert in a CTE so we can capture the count of
  -- inserted rows for further analysis or to return to our caller.
 
  WITH capture AS (
    INSERT INTO sensor_report (
             location, reading_date, avg_reading,
             min_reading, max_reading
           )
    SELECT location, reading_date::DATE,
           AVG(reading), MIN(reading), MAX(reading)
      FROM sensor_log
     WHERE reading_date >= last_update
     GROUP BY location, reading_date::DATE
     RETURNING *
  )
  SELECT COUNT(*) INTO insert_count
    FROM capture;
 
  ASSERT insert_count >= 2000, 'Not enough current sensor activity!';
 
  RETURN insert_count;
END;
$$ LANGUAGE plpgsql;

The function itself isn’t that much different; the important part is the result of our changes. The only truly meaningful modification is that the function no longer initializes the report table if it’s empty. Ignoring that, we can treat the function exactly as before: so long as the assertions hold, no output changes and no new failure conditions are triggered.
 
Assertion checking can also be toggled wholesale. Since assertions are primarily intended for debugging, Postgres exposes the plpgsql.check_asserts parameter to control them. It defaults to on, but we can disable it with an entry in postgresql.conf for production systems, or flip it within our current session.
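
An instance-wide entry would look like this (a sketch for a production postgresql.conf; a configuration reload applies it):

# postgresql.conf: silence ASSERT statements everywhere
plpgsql.check_asserts = off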

Let’s use our new assertions:

SET plpgsql.check_asserts = TRUE;
 
TRUNCATE TABLE sensor_report;
 
SELECT sp_trickle_report();
 
ERROR:  Need to bootstrap report table.
CONTEXT:  PL/pgSQL function sp_trickle_report() line 12 at ASSERT
 
INSERT INTO sensor_report (
         location, reading_date, avg_reading,
         min_reading, max_reading
       )
SELECT location, reading_date::DATE,
       AVG(reading), MIN(reading), MAX(reading)
  FROM sensor_log
 GROUP BY location, reading_date::DATE;
 
ANALYZE sensor_report;
 
SELECT sp_trickle_report();
 
ERROR:  Not enough current sensor activity!
CONTEXT:  PL/pgSQL function sp_trickle_report() line 38 at ASSERT
 
INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 5000, s.id % 100,
       CURRENT_DATE - ((s.id * 1) || 's')::INTERVAL
  FROM generate_series(1, 50000) s(id);
 
SELECT sp_trickle_report();
 
 sp_trickle_report 
-------------------
              5000

We begin by truncating the report table so we can witness the first assertion. As we can see, Postgres correctly complains that the table is empty when it shouldn’t be. Since this is a true ERROR, execution immediately halts where the assertion failed. We see that error and use some alternative method for filling the report table, and try the function again.

Stop! Assert time!

The second execution fails because the sensor_log table only contains data based on 1000 sensors, yet we expect 2000. Besides telling us to update our fixture data to account for more sensors, we also learn what the function expects. So we decide to insert some fake data from 5000 sensors and try again. The last attempt works as expected, and we’re free to move on with our lives.

It seems odd initially that a mere debugging statement gets escalated all the way to a fatal error, doesn’t it? But how else should assertions work? If we just needed information, we could have simply used RAISE NOTICE with the relevant details. The power of ASSERT is that we don’t need to remove the statements to deactivate them.
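
Deactivating them doesn’t require touching the function at all:

SET plpgsql.check_asserts = FALSE;
 
-- The assertions are now skipped entirely.
SELECT sp_trickle_report();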

The reason assertions are fatal is that they raise an ASSERT_FAILURE exception. We can actually take advantage of that in debugging: we can catch the exception just like any other, meaning we can ignore it or escalate as necessary. Consider this example:

CREATE OR REPLACE FUNCTION sp_trickle_report()
RETURNS BIGINT AS
$$
DECLARE
  last_update DATE;
  insert_count BIGINT;
BEGIN
  -- Find the most recent data in the report so we only
  -- refresh those contents.
 
  SELECT INTO last_update MAX(reading_date)
    FROM sensor_report;
 
  ASSERT last_update IS NOT NULL, 'Need to bootstrap report table.';
 
  -- Since data comes in regularly, we only need to rebuild
  -- current data. Delete old aggregates so we can replace them.
 
  DELETE FROM sensor_report
   WHERE reading_date >= last_update;
 
  -- Perform the insert in a CTE so we can capture the count of
  -- inserted rows for further analysis or to return to our caller.
 
  WITH capture AS (
    INSERT INTO sensor_report (
             location, reading_date, avg_reading,
             min_reading, max_reading
           )
    SELECT location, reading_date::DATE,
           AVG(reading), MIN(reading), MAX(reading)
      FROM sensor_log
     WHERE reading_date >= last_update
     GROUP BY location, reading_date::DATE
     RETURNING *
  )
  SELECT COUNT(*) INTO insert_count
    FROM capture;
 
  ASSERT insert_count >= 2000, 'Not enough current sensor activity!';
 
  RETURN insert_count;
 
EXCEPTION
  WHEN ASSERT_FAILURE THEN
    RAISE NOTICE 'Tripped over debugging code: %', SQLERRM;
    RETURN 0;
END;
$$ LANGUAGE plpgsql;

The only thing we added was a single exception block to catch any exceptions thrown within the function. Then we toned down the error slightly while maintaining the original assertion message, and returned 0 from the function since it didn’t process any rows due to the exception.

We can see this for ourselves by truncating the report table again:

SET plpgsql.check_asserts = TRUE;
 
TRUNCATE TABLE sensor_report;
 
SELECT sp_trickle_report();
 
NOTICE:  Tripped over debugging code: Need to bootstrap report table.
 sp_trickle_report 
-------------------
                 0

Now the function itself doesn’t fail or introduce an error into any transaction contexts we might have open, but we can still see something is wrong. The function claims to have inserted no rows, and we see a notice specifically from our own debugging system. In the production environment, this code would be completely silent and the function would operate without the overly oppressive assertions.

Is ASSERT overlooked because there are more suitable approaches to solve these problems? Or is it because the feature was added in 9.5, and it’s extremely easy to miss in the deluge of other headline-grabbing advancements? Honestly it’s probably a combination of those two and a myriad of other causes. The important thing is that we know about it now.

And as we are all well aware…

G.I. Joe!

PG Phriday: Down in the Dumps

These days, with multiple Postgres database environments being commonplace, it’s not unheard of to copy data from one to another. Perhaps a production extract is necessary to properly vet data in a staging environment. Maybe the development environment needs to update its data to reflect an especially pernicious and intractable edge case someone observed. In any of these scenarios, we are likely to extract data from multiple tables to import it elsewhere. The question is: how?

Last week I flatly stated that pg_dump is not an adequate backup tool. Despite that, it’s perfect for situations like these, where we only need a few tables out of an entire database. However, tables frequently reference other tables in foreign keys. While such integrity concerns do not affect data export, they greatly complicate import efforts.

The more deeply nested the referential integrity rules become, the exponentially more hair we’ll pull out trying to track them all to their requisite tables. Even after we’ve found them all, we still need to unravel the horrors again in the target cluster.

A typical table dependency chart

To better illustrate what we mean by “horrors”, take this relatively simple table hierarchy:

CREATE TABLE tab_a (id INT PRIMARY KEY);
CREATE TABLE tab_a1 (id INT PRIMARY KEY REFERENCES tab_a);
CREATE TABLE tab_b (id INT PRIMARY KEY REFERENCES tab_a);
CREATE TABLE tab_b1 (id INT PRIMARY KEY REFERENCES tab_b);
CREATE TABLE tab_c (id INT PRIMARY KEY REFERENCES tab_b);
CREATE TABLE tab_c1 (id INT PRIMARY KEY REFERENCES tab_c);
 
INSERT INTO tab_a VALUES (1);
INSERT INTO tab_a1 VALUES (1);
INSERT INTO tab_b VALUES (1);
INSERT INTO tab_b1 VALUES (1);
INSERT INTO tab_c VALUES (1);
INSERT INTO tab_c1 VALUES (1);

Imagine these are six tables out of potentially hundreds. From an outside perspective, we may only know that tab_a needs to be copied from one system to another. No matter what tool we use for this, finding the other tables is somewhat irritating. An ORM will display the relationships in a physical chart, and the psql command line client or pgAdmin will report tables that immediately depend on a table we examine. Neither is a great way to isolate them all at once.

Fortunately we have the Postgres system catalog. The pg_constraint system table tracks all of those table relationships, and we can recursively fetch the entire tree using a CTE. Remember how awesome CTEs are? Let’s mix some chocolate with our peanut butter by using one to mine the catalog.

Starting with tab_a, here’s how we might find all of the other tables in the list:

WITH RECURSIVE deptree AS (
  SELECT conrelid, conrelid::REGCLASS::TEXT AS table_name,
         confrelid, confrelid::REGCLASS::TEXT AS dep_table
    FROM pg_constraint
   WHERE contype IN ('f', 'p')
     AND (confrelid::REGCLASS::TEXT = 'tab_a' OR
          conrelid::REGCLASS::TEXT = 'tab_a')
   UNION
  SELECT c.conrelid, c.conrelid::REGCLASS::TEXT AS table_name,
         c.confrelid, c.confrelid::REGCLASS::TEXT AS dep_table
    FROM pg_constraint c
    JOIN deptree d ON (d.conrelid = c.confrelid)
   WHERE c.contype = 'f'
)
SELECT *
  FROM deptree;
 
 conrelid | table_name | confrelid | dep_table 
----------+------------+-----------+-----------
    69327 | tab_a      |         0 | -
    69332 | tab_a1     |     69327 | tab_a
    69342 | tab_b      |     69327 | tab_a
    69352 | tab_b1     |     69342 | tab_b
    69362 | tab_c      |     69342 | tab_b
    69372 | tab_c1     |     69362 | tab_c

As complicated as this query appears, it’s not actually that bad. We simply bootstrap the CTE with any tables that depend on tab_a, and repeat the process with each successive loop. Even if there are dozens or hundreds of incestuous relationships, we’ll capture all of them.

In any case, we have all the data we need to construct a pg_dump command to export all of these tables. To keep this as speedy as possible, let’s perform a parallel dump of the indicated tables into a directory we can transmit to another server.

pg_dump -t tab_a -t tab_a1 -t tab_b -t tab_b1 -t tab_c -t tab_c1 \
        -a -Fd -j 8 -f /tmp/tab_exports postgres

Though we can list each table individually, it would be nice if we could supply a filename of desired tables instead. Dependency graphs of sufficient depth could make for an impossibly long command line given character limits. This means we may need to execute multiple dumps, or rely on wildcards and hope there’s enough name overlap.
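
Incidentally, the /tmp/tablist.txt file used below doesn’t have to be written by hand; we can produce it straight from our dependency CTE with a server-side COPY, assuming the server account can write to that path:

COPY (
  WITH RECURSIVE deptree AS (
    SELECT conrelid, conrelid::REGCLASS::TEXT AS table_name,
           confrelid, confrelid::REGCLASS::TEXT AS dep_table
      FROM pg_constraint
     WHERE contype IN ('f', 'p')
       AND (confrelid::REGCLASS::TEXT = 'tab_a' OR
            conrelid::REGCLASS::TEXT = 'tab_a')
     UNION
    SELECT c.conrelid, c.conrelid::REGCLASS::TEXT,
           c.confrelid, c.confrelid::REGCLASS::TEXT
      FROM pg_constraint c
      JOIN deptree d ON (d.conrelid = c.confrelid)
     WHERE c.contype = 'f'
  )
  SELECT DISTINCT table_name FROM deptree
) TO '/tmp/tablist.txt';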

Given that caveat, maybe we want to combine pg_dump with xargs and its parallel capabilities instead:

mkdir /tmp/tab_exports
 
cat /tmp/tablist.txt | xargs -I{} -P 8 \
    pg_dump -t {} -a -Fc -f /tmp/tab_exports/{}.pgr postgres

Either way, we now have a data-only representation of every table in the dependency tree of the table we’re actually importing. Now we just have to insert these values into the target system. And this is usually where we first encounter a major roadblock: existing contents.

First of all, that data is in our way. Secondly, the most efficient manner of ridding ourselves of it is the TRUNCATE command. While TRUNCATE refuses to bypass foreign keys, it does have a CASCADE decorator which will empty the entire dependency chain. This actually helps us since we have data to rebuild anything that might be affected by a cascading truncate.

Let’s give it a try on the recipient database:

TRUNCATE TABLE tab_a;
 
ERROR:  cannot truncate a table referenced in a foreign key constraint
DETAIL:  Table "tab_a1" references "tab_a".
HINT:  Truncate table "tab_a1" at the same time, or use TRUNCATE ... CASCADE.
 
TRUNCATE TABLE tab_a CASCADE;
 
NOTICE:  truncate cascades to table "tab_a1"
NOTICE:  truncate cascades to table "tab_b"
NOTICE:  truncate cascades to table "tab_b1"
NOTICE:  truncate cascades to table "tab_c"
NOTICE:  truncate cascades to table "tab_c1"

Well that was terrifying. Also note that the error only reported one of the dependencies. If we relied only on this information, we might attempt to drop the foreign key constraint thinking it was the only one. Every subsequent attempt to truncate the table would result in another error.

We’ve avoided all of that, but imagine if we had attempted to restrict our efforts to only the data in tab_a. We’d be left with no benefits of cascade, manually tracing the errors until success, inability to reestablish the constraints due to likely missing keys, and so on. No thanks.

Though truncate and the foreign keys aided us in emptying the tables on the recipient, restoring from our dump is somewhat problematic. We want to do it in parallel, but foreign keys directly prevent this by enforcing key integrity at each level. So we’re stuck either importing each table in order of the dependency tree, or we need to do a bit of preparation.

We can leverage that CTE again and build a couple of SQL scripts to help us out:

CREATE TEMP TABLE tab_deps AS
WITH RECURSIVE deptree AS (
  SELECT conrelid, conrelid::REGCLASS::TEXT AS table_name,
         confrelid, confrelid::REGCLASS::TEXT AS dep_table
    FROM pg_constraint
   WHERE contype IN ('f', 'p')
     AND (confrelid::REGCLASS::TEXT = 'tab_a' OR
          conrelid::REGCLASS::TEXT = 'tab_a')
   UNION
  SELECT c.conrelid, c.conrelid::REGCLASS::TEXT AS table_name,
         c.confrelid, c.confrelid::REGCLASS::TEXT AS dep_table
    FROM pg_constraint c
    JOIN deptree d ON (d.conrelid = c.confrelid)
   WHERE c.contype = 'f'
)
SELECT *
  FROM deptree;
 
COPY (
  SELECT 'ALTER TABLE ' || table_name || ' DISABLE TRIGGER ALL;'
    FROM tab_deps
) TO '/tmp/no_trigs.sql';
 
COPY (
  SELECT 'ALTER TABLE ' || table_name || ' ENABLE TRIGGER ALL;'
    FROM tab_deps
) TO '/tmp/fix_trigs.sql';

One thing that’s interesting about Postgres is that foreign keys are actually enforced by hidden system-level triggers. Given that information, we can simply turn them off for the duration of our import and re-enable them afterwards. Do note that disabling these internally generated constraint triggers requires superuser privileges. With that caveat aside, we have all the parts we need to restore the donor data into the environment of our choice.

It should be cautioned that all of the import steps should be performed on the target system itself, even if this means synchronizing the dump directory to another system first. We don’t want to fat-finger these commands in production or an equally important system.

In any case, this is how that might go:

psql -f /tmp/no_trigs.sql postgres
psql -c 'TRUNCATE TABLE tab_a CASCADE;' postgres
pg_restore -d postgres -j 8 /tmp/tab_exports
psql -f /tmp/fix_trigs.sql postgres

Or if we used xargs, we’ll want this procedure instead:

psql -f /tmp/no_trigs.sql postgres
psql -c 'TRUNCATE TABLE tab_a CASCADE;' postgres
 
cat /tmp/tablist.txt | xargs -I{} -P 8 \
    pg_restore -d postgres /tmp/tab_exports/{}.pgr
 
psql -f /tmp/fix_trigs.sql postgres

And voila, we’re saved!

Of course table structures between the environments may not match. In that case, we would simply drop all of the involved tables instead of truncating them and stunning their constraint triggers. As such, we would want to remove the -a data-only flag from the pg_dump commands we used so pg_restore can recreate them.
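
If we went that route, the full cycle might resemble this sketch, with our six example tables listed explicitly:

pg_dump -t tab_a -t tab_a1 -t tab_b -t tab_b1 -t tab_c -t tab_c1 \
        -Fd -j 8 -f /tmp/tab_exports postgres
 
psql -c 'DROP TABLE tab_a, tab_a1, tab_b, tab_b1, tab_c, tab_c1 CASCADE;' \
     postgres
 
pg_restore -d postgres -j 8 /tmp/tab_exports

Ultimately, we’ve broken the paradox of satisfying constraints while simultaneously circumventing them.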

It’s not an ideal situation, but hey, whatever works.

PG Phriday: Getting Back Up

In light of recent events where GitLab suffered a massive database loss, this is a great opportunity to examine what happened from a Postgres perspective. Since Simon Riggs over at 2ndQuadrant has already chimed in on improvements GitLab might consider in their procedures, maybe we should walk the conversation back slightly.

This isn’t the first time Postgres backup tooling has been misused or misunderstood. The topic of backups hits forums and mailing lists rather frequently. It’s no secret that there’s no official “push button to back up” tool for Postgres. Companies like 2ndQuadrant do their best with offerings like Barman, but this will never stop admins from rolling their own solutions instead of resorting to another external utility.

Part of the reason for this stems from the extract and backup tools Postgres itself provides. After reading the documentation, it’s only natural to encounter pg_dump and pg_basebackup and stop looking. And why not? MySQL has its mysqldump after all and that’s fine, right?

Well, not always. Our first clue comes from the ‘dump’ suffix. By its nature, this means we’re extracting database contents by essentially executing a SELECT statement against each table and waiting for the results. Sometimes this involves compression, and there’s a special format Postgres can use to inline the process, and… none of that matters.

An extract is not a backup. Full stop. Depending on the size of our database, performing a full dump can take anywhere from a few seconds to several days. We’re limited by the laws of physics in this case; reading a 50TB database takes time, as does the processing power necessary to compress that output. Restoring it is often much slower, given how we must account for recreating indexes. In these times of gigantic databases, pg_dump should probably be relegated to exporting and importing individual tables unless the instance is truly tiny.

In the Old Days(tm), the most common advice was to supply the pg_dump command with the -Fc flags to activate its internal compression and storage format. But tools evolve, and Postgres 9.3 added the -j option so tables can be dumped or restored in parallel. Have a huge machine with 64 CPUs living on really fast SSDs and insist on dumping every table? Do this:

# This works for parallel dumps
 
pg_dump -Fd -j 48 -d my_db -f /my/backup/dir
 
# This is for restoring it
 
pg_restore -j 48 -d my_db /my/backup/dir

This approach scales to surprisingly high levels, provided the underlying storage can keep up. It’s not perfectly linear, but a backup or restore using multiple parallel processes will complete in a fraction of the time. After a couple hundred GB however, even parallel dumps start to exceed standard maintenance windows.

Real database engines require a binary backup tool, and Postgres gained one of these back in version 9.1. There are multiple helpful options to customize its behavior, but its real job is to ignore database contents completely. The pg_basebackup utility can back up a full instance of multiple databases nearly as quickly as the operating system can read the files from disk. A well equipped SAN, RAID, or SSD usually provide enough IO to back up even a 5TB instance in a couple of hours.

Unfortunately these backups are not compressed by default, and since we lose the ability to perform parallel dumps when using pg_basebackup, compression is single-threaded if we activate it. That means our 5TB cluster will produce a 5TB backup unless we don’t mind a much longer runtime. Ouch.
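
Activating that single-threaded compression is at least simple; something along these lines emits gzipped tar output (-Z tunes the compression level):

pg_basebackup -h primary-server -D /backup/dir -Ft -z -P

On the other hand, an uncompressed backup is an exact physical copy of the instance. What else is an exact physical copy of a Postgres instance?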

That’s right: a replica. What’s nice is that the Postgres developers know that as well. Want to copy a remote Postgres cluster on another server and have it start streaming from the primary as soon as it’s started? Easy:

pg_basebackup -h primary-server -D /new/datadir -R
pg_ctl -D /new/datadir start

Since we supplied the -R flag, pg_basebackup writes a recovery.conf file that will instruct our new replica to stream from the same server we just fetched the backup from. We can do this as many times as we wish, even if the upstream is another replica. Our only limit is network bandwidth, and we’ve all upgraded to 10Gbps equipment, right?
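
For the curious, the generated recovery.conf boils down to a couple of lines like these; the exact conninfo mirrors whatever connection options we passed:

standby_mode = 'on'
primary_conninfo = 'host=primary-server port=5432'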

Unfortunately, scenarios beyond this point are where the process breaks down. What happens if we have a replica that falls behind and needs to be rebuilt? For all of its benefits, pg_basebackup still cannot (currently) skip unchanged files or make small patches where necessary. Relying on it in this case would require erasing the replica and starting from scratch. This is where GitLab really ran into trouble.

Yet we started with synchronized files, didn’t we? Could we use rsync to “catch up”? Yes, but it’s a somewhat convoluted procedure. We would first need to connect to the upstream server and issue a SELECT pg_start_backup('my_backup') command so Postgres knows to archive transaction logs produced during the sync. Then after the sync is completed, we would need to stop the backup with SELECT pg_stop_backup(). Then we would have to make our own recovery.conf file, obtain all of the WAL files the upstream server archived, and so on.
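
A bare-bones sketch of that dance, with placeholder paths and every WAL consideration omitted, might look like this:

# Mark the start of the backup on the upstream server.
psql -h primary-server -c "SELECT pg_start_backup('replica_resync', TRUE)"
 
# Synchronize the data directory, skipping the primary's PID file.
rsync -av --delete --exclude=postmaster.pid \
      primary-server:/db/pgdata/ /db/pgdata/
 
# Tell the upstream the backup is finished.
psql -h primary-server -c "SELECT pg_stop_backup()"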

None of that is something a system administrator will simply know, and it’s fiddly even for an experienced Postgres DBA. A mistake during any part of that procedure will result in a non-functional or otherwise unsafe replica. All of this is the exact reason software like Barman exists. Supplied utilities only get us so far. For larger or more critical installations, either our custom scripts must flawlessly account for every failure scenario and automate everything, or we defer to someone who already did all of that work.

Even without that, there are some general guidelines to follow:

  1. Don’t use pg_dump for backups. If your database isn’t too large now, it will be later.
  2. Always have a replica handy for critical systems. If a company depends on a database, there should always be at least one online copy. It’s easy to play fast-and-loose here when performing maintenance on a replica, but scripts and CLI snafus happen. Bring up another replica until maintenance is over.
  3. If performing dangerous work on the primary, temporarily decouple at least one replica. I had to learn from experience here that hardware or filesystem modifications can corrupt the entire replication chain. Have an online copy of critical instances, and disable replication during maintenance until work is done.
  4. Have an off-site tertiary replica for disaster recovery purposes. Another datacenter, another city, another state. Just put it somewhere else. Not only is this a supplementary copy for failover purposes or crash recovery, but even regional outages won’t result in lost data or availability.
  5. Perform backup operations on a replica for 24/7 systems. Websites that people expect to never go down for maintenance should be backed up from a replica. Postgres backups are online and non-blocking, but are IO intensive and may substantially decrease query performance. Don’t take the risk.
  6. Fail over to a working replica at least twice per year. Either it happens when nobody expects it, or we manage the process ourselves. The latter scenario is highly preferred and establishes a checklist to apply to the former.
  7. Test backup restores at least twice a year. An untested backup is equivalent to random 1s and 0s in a file somewhere. Assuming it exists, it could be anything. It may not work at all. Make sure.
  8. Always back up transaction log archives through the day on critical clusters. They can be used to help a replica catch up if it falls behind, and are an invaluable component of PITR functionality.
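
Regarding that last point, continuous WAL archiving comes down to a pair of postgresql.conf settings on the primary. The archive_command here is a bare-minimum sketch with a placeholder path; production commands should be far more defensive:

# postgresql.conf
archive_mode = on
archive_command = 'cp %p /path/to/wal_archive/%f'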

These kinds of procedural rules are not restricted to Postgres, and aren’t even really a complete list. Administering a database architecture is a thankless job that never ends, and there’s always something we’ll miss. The catastrophe at GitLab was a failure of process and architecture, not of a command typed on the wrong server. I’m extremely grateful for their openness regarding the situation, because we can all learn from it.

As an aside, while I highly respect Simon and 2ndQuadrant, Barman isn’t the only game in town for backup management. It seems backing up huge installations was a problem long enough that several developers solved it almost simultaneously. Before Barman had it, pgBackRest was designed around hard links and incremental backups of gargantuan 1TB+ systems. Unfortunately I don’t know how they compare because I haven’t tested them. Why not? Because even I got caught up in the race for a better backup tool.

One of our databases first tipped the 1TB scale around 2013 and pg_basebackup was consuming inordinate amounts of time. So I threw together a bunch of scripts that leveraged hard links, parallel compression with pigz, and unleashed it on a replica of our production system. It was twelve times faster on the first iteration thanks to the parallel compression, and over 100 times faster while using hard links to produce incrementals. I eventually rewrote it as a more robust set of Python libraries, but pgBackRest and Barman finally made that last step, rendering my efforts effectively moot.

I still use my libraries for now, but the extra management capabilities the other tools supply is somewhat enticing. Want some idea how much better Barman or pgBackRest will perform than relying on pg_basebackup? We have a 50TB database that grows by 200GB per day, which many might consider slightly cumbersome. My homegrown tool transforms that into a 7TB backup in four hours, while the 24GB nightly incrementals usually require about 20 minutes. We back up a 50TB database cluster in 20 minutes.

And yes, I’ve tested restoring from this backup. Restoring takes a while, but it’s also done in parallel. There’s little that’s more satisfying than watching a system with 32 CPUs pegged at 100% decompressing 50TB worth of data as fast as the underlying filesystem is capable of writing.

But I was only able to do that because I’m intimately familiar with our tools and requirements. Our company process standards revealed limitations in the supplied Postgres backup tooling, prompting me to find or create an alternative. There is no substitute for established protocol, despite the fact building those procedures might not have an immediately obvious benefit. The payoff is avoiding downtime, and that’s something you can’t really plan.

It’s the difference between six hours of lost data, and six seconds.

PG Phriday: Alien Incursion

Foreign tables have been a headline feature of Postgres ever since the release of version 9.2. Combined with extensions, they’re the secret sauce that allows Postgres to pull data from other database engines, flat files, REST interfaces, and possibly every gas station, residence, warehouse, farmhouse, hen house, outhouse, and doghouse in the area.

Postgres all the things

But that kind of power comes at a significant cost. Since the remote data comes from what is essentially a black box, there are a lot of performance optimizations Postgres can’t apply. Foreign data wrappers are also still somewhat in their infancy—even the Postgres foreign data wrapper lacks the ability to simultaneously join and sort on the remote server in the same query.

That makes data federation incredibly painful and could reduce the feature to nothing but an amusing toy. It’s functional but slow since it’s likely data processing is done locally, potentially after retrieving the entire contents of the remote table. Of course, this also means several disparate sources can all be handled within the confines of our Postgres database since it has access to all of the data… eventually.

Is there a better way? Since the introduction of materialized views in 9.3, it has been possible to trivially take a snapshot of a remote table, or some subsection of its contents, and use it repeatedly. Unlike the retrieval Postgres executes within a query on a foreign object, materialized views allow remote data to persist.

With that in mind, let’s set up a very basic sensor database that merely accumulates readings.

CREATE TABLE sensor_log (
  id            INT PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);
 
INSERT INTO sensor_log (id, location, reading, reading_date)
SELECT s.id, s.id % 1000, s.id % 100,
       CURRENT_DATE - ((s.id * 10) || 's')::INTERVAL
  FROM generate_series(1, 5000000) s(id);
 
CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
 
ANALYZE sensor_log;

It’s a table we use frequently in examples, and there’s no reason it can’t be used here. If we connect to any other Postgres database, we can execute the following SQL to produce a standard foreign table.

For the purposes of this example, let’s just assume the database is on the same server and the server is somewhat trusting with connections in that context. Properly securing a foreign server is a topic for a different day.

CREATE SCHEMA ext_info;
 
CREATE EXTENSION postgres_fdw;
 
CREATE SERVER sys_sensors
       FOREIGN DATA WRAPPER postgres_fdw
       OPTIONS (dbname 'sensors', host 'sensor-host', port '5432');
 
CREATE USER MAPPING FOR postgres
       SERVER sys_sensors
       OPTIONS (user 'postgres');
 
IMPORT FOREIGN SCHEMA public
 LIMIT TO (sensor_log)
  FROM SERVER sys_sensors
  INTO ext_info;
 
ANALYZE ext_info.sensor_log;
 
SELECT COUNT(*) FROM ext_info.sensor_log;
 
  count  
---------
 5000000

Take note that we’ve leveraged the IMPORT FOREIGN SCHEMA feature added in Postgres 9.5. This makes it much easier to reflect a remote data source since we don’t necessarily need to know the structure of the objects we’re importing. Either way, it worked and we were able to query the contents. So far there are no surprises here.

Now let’s wrap our external table with a very limited view that essentially fetches the entire contents of the remote table.

CREATE SCHEMA ext_materialized;
 
CREATE MATERIALIZED VIEW ext_materialized.sensor_log AS
SELECT * FROM ext_info.sensor_log;
 
CREATE UNIQUE INDEX pk_ext_sensor_log
    ON ext_materialized.sensor_log (id);
 
CREATE INDEX idx_ext_sensor_log_reading_date
    ON ext_materialized.sensor_log (reading_date);
 
ANALYZE ext_materialized.sensor_log;
 
EXPLAIN
SELECT *
  FROM ext_materialized.sensor_log
 WHERE reading_date >= CURRENT_DATE - INTERVAL '1 day';
 
                            QUERY PLAN                         
-----------------------------------------------------------------
 Index Scan using idx_ext_sensor_log_reading_date on sensor_log
       (cost=0.44..304.87 rows=8596 width=23)
   Index Cond: (reading_date >=
       (('now'::cstring)::date - '1 day'::interval))

We added an index on the reading_date column and analyzed the contents to illustrate that materialized views act just like regular tables in very important ways. Postgres can collect statistics, index, and build views on top of materialized views as if they were regular tables. This means we can treat them like a local cache of remote data, with all of the ancillary benefits the Postgres planner provides for our own data.

When the remote data changes, we must rebuild the materialized view to capture the new data. Here’s how that works:

-- In the sensors database:
 
INSERT INTO sensor_log (id, location, reading, reading_date)
SELECT s.id, s.id % 1000, s.id % 100,
       CURRENT_DATE + ((s.id * 10) || 's')::INTERVAL
  FROM generate_series(5000001, 5001000) s(id);
 
-- In the database where we have the materialized view:
 
\timing ON
 
REFRESH MATERIALIZED VIEW CONCURRENTLY ext_materialized.sensor_log;
 
TIME: 117734.945 ms
 
ANALYZE ext_materialized.sensor_log;

Just… ouch. Why did that take so long? We supplied the CONCURRENTLY keyword to avoid locking the view during the refresh. This means Postgres fetches the data and compares it to the existing view contents, inserting and deleting only the data that changed since the last refresh. That’s a pretty intensive procedure, and for higher data volumes, extremely inefficient. It’s also why we created the unique pk_ext_sensor_log index: CONCURRENTLY requires one to match rows between the old and new contents. Still, we added the “missing” content without preventing other users from accessing the materialized view, which is certainly advantageous if we don’t mind the refresh delay.

We can improve this process a number of ways. If we consider our data and how we use it, we might realize we don’t need the entire contents of the remote table. In an analysis context, maybe we only need the most recent day of readings to compare with information we’ve gathered from other sources. Since Postgres doesn’t transmit calculated predicates to the remote server, we also need to leverage the ability to create foreign tables based on remote views.

Here’s how that would look:

-- In the sensors database:
 
CREATE VIEW sensor_log_today AS
SELECT *
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE;
 
-- In the database where we have the materialized views:
 
IMPORT FOREIGN SCHEMA public
 LIMIT TO (sensor_log_today)
  FROM SERVER sys_sensors
  INTO ext_info;
 
CREATE MATERIALIZED VIEW ext_materialized.sensor_log_today AS
SELECT * FROM ext_info.sensor_log_today;
 
CREATE UNIQUE INDEX pk_ext_sensor_log_today
    ON ext_materialized.sensor_log_today (id);
 
CREATE INDEX idx_ext_sensor_log_today_reading_date
    ON ext_materialized.sensor_log_today (reading_date);
 
ANALYZE ext_materialized.sensor_log_today;
 
\timing ON
 
REFRESH MATERIALIZED VIEW CONCURRENTLY ext_materialized.sensor_log_today;
 
TIME: 17.531 ms

That’s a bit of an improvement. This kind of quick snapshot system is a solid foundation for a virtual database that contains none of its own data, and only exists for analytics. If we need to retain data for longer periods of time, we can build our own data loading process that references the smaller snapshot.

Something like this:

-- Bootstrap the giant repo of sensor readings.
 
CREATE SCHEMA pile;
 
CREATE TABLE pile.sensor_log (LIKE ext_info.sensor_log);
 
INSERT INTO pile.sensor_log
SELECT * FROM ext_info.sensor_log;
 
CREATE UNIQUE INDEX pk_pile_sensor_log
    ON pile.sensor_log (id);
 
CREATE INDEX idx_pile_sensor_log_reading_date
    ON pile.sensor_log (reading_date);
 
ANALYZE pile.sensor_log;
 
-- Now a "refresh" looks like this:
 
REFRESH MATERIALIZED VIEW CONCURRENTLY ext_materialized.sensor_log_today;
 
ANALYZE ext_materialized.sensor_log_today;
 
INSERT INTO pile.sensor_log
SELECT *
  FROM ext_materialized.sensor_log_today
 WHERE id > (SELECT MAX(id) FROM pile.sensor_log);

We may have already executed the bootstrapping section any time in the past. The refresh, on the other hand, lets us “top up” our local cache with the remote data as often as we want. The remote table might even be regularly truncated, but it doesn’t matter to us; we always see the most recent information and we can handle it as we please.

All of these tables and materialized views are driven by remote data. If we were working with several customers, we might perform aggregate analysis across all of them in this virtual staging area using cumulative snapshot updates. We could decide to retain this information for a year or two and cycle anything older, or any other retention period that matches our reporting needs.
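
A retention pass under such a policy amounts to a single scheduled statement; here’s a sketch using a two-year cutoff:

DELETE FROM pile.sensor_log
 WHERE reading_date < CURRENT_DATE - INTERVAL '2 years';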

This works the same way across environments. Perhaps we want to combine data from a production and user acceptance context, but only for the last work week. Though this kind of transient architecture isn’t exactly trivial, at least the possibility exists.

The primary caveat to relying on foreign structures so extensively is that we might not have direct control over the remote objects. If a column type is changed for example, our foreign table would stop working. Depending on how deeply nested all of our materialized views are, rebuilding in the face of remote modifications will likely incur significant maintenance costs. Even if we do control the infrastructure and software on the entire stack, we’ve contributed a sprawling decoupled dependency chain to the mix.

But that’s a small price to pay for the freedom to have our hand in every cookie jar.

cookie!