PG Phriday: Through the Window

Now that we know how Postgres window functions work, why not play with them a bit to get a better understanding of their capabilities? So long as we understand window functions are applied after data gathering and aggregation steps, much of their mystery and complexity is defanged. Let’s start actually using them for stuff!

Captain Murphy is tired of your nonsense

(Note: I’m a bit under the weather today, so this Phriday will probably be a bit truncated and potentially incoherent thanks to the drugs. Apologies in advance.)

Let’s start off with the same example data as last week with one tiny alteration. To properly illustrate some of these concepts, we need some actual data variance, so we’ll be using random numbers for readings instead of modulo math.

CREATE TABLE sensor_log (
  id            SERIAL PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);
 
INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, round(random() * 100),
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10) || 'm')::INTERVAL
  FROM generate_series(1, 5000) s(id);
 
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
 
ANALYZE sensor_log;

First things first: ordering matters. Window functions are either applied over the whole data-set or some partition of it. They’re also cumulative in nature, meaning we can change the aggregation results by altering the window itself. To better see what this means, consider these two sums:

SELECT location, reading, SUM(reading) OVER ()
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 ORDER BY location
 LIMIT 10;
 
 location | reading | SUM  
----------+---------+------
 1        |      12 | 7058
 10       |      26 | 7058
 100      |      98 | 7058
 101      |      99 | 7058
 102      |      46 | 7058
 103      |      84 | 7058
 104      |      60 | 7058
 105      |      35 | 7058
 106      |      58 | 7058
 107      |       6 | 7058
 
SELECT location, reading, SUM(reading) OVER (ORDER BY location)
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 ORDER BY location
 LIMIT 10;
 
 location | reading | SUM 
----------+---------+-----
 1        |      12 |  12
 10       |      26 |  38
 100      |      98 | 136
 101      |      99 | 235
 102      |      46 | 281
 103      |      84 | 365
 104      |      60 | 425
 105      |      35 | 460
 106      |      58 | 518
 107      |       6 | 524

Interestingly, we actually learned two things here. First, window functions are applied before the LIMIT clause; that sum reflects every reading in the table for today. Second, ordering by the sensor location turned the sum into a cumulative total for each row. This applies to all of the available window functions. A maximum would reflect the current maximum for each row until a higher value replaces it. An average is a cumulative average as values are processed, and so on.
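
To see that in action, here’s a quick sketch of the running maximum over the same filtered data:

SELECT location, reading,
       MAX(reading) OVER (ORDER BY location) AS running_max
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 ORDER BY location
 LIMIT 10;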

This isn’t particularly useful for sums, but what if we wanted to watch our data converge upon the overall average?

SELECT location, reading,
       round(avg(reading) OVER (ORDER BY location), 2) AS running_avg,
       round(reading - 
           avg(reading) OVER (ORDER BY location), 2) AS variance
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 ORDER BY location
 LIMIT 10;
 
 location | reading | running_avg | variance 
----------+---------+-------------+----------
 1        |      12 |       12.00 |     0.00
 10       |      26 |       19.00 |     7.00
 100      |      98 |       45.33 |    52.67
 101      |      99 |       58.75 |    40.25
 102      |      46 |       56.20 |   -10.20
 103      |      84 |       60.83 |    23.17
 104      |      60 |       60.71 |    -0.71
 105      |      35 |       57.50 |   -22.50
 106      |      58 |       57.56 |     0.44
 107      |       6 |       52.40 |   -46.40

If we monitored those kinds of results second by second, we could flag any kind of appreciable jitter from the average as currently represented by the data. It may be completely expected for values to increase through the day, so the total average for all data is meaningless to us. Ordering allows us to control the aggregate’s construction in a way that isn’t really possible using other methods, barring some kind of recursive CTE magic.
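
As a rough sketch of that kind of monitoring, with an arbitrary jitter threshold of 30, we could wrap the window in a subquery so the variance is usable in a predicate:

SELECT *
  FROM (SELECT location, reading,
               round(reading -
                   avg(reading) OVER (ORDER BY location), 2) AS variance
          FROM sensor_log
         WHERE reading_date >= CURRENT_DATE
       ) sub
 WHERE abs(variance) > 30;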

Postgres’ new slogan

With ordering out of the way, there’s the small detail of data partitioning as well. The manual goes into much more detail about this, but it’s just a fancy way of saying GROUP BY within the window itself.

SELECT location, reading,
       SUM(reading) OVER (PARTITION BY location) AS all_readings,
       rank() OVER (PARTITION BY location ORDER BY reading)
  FROM sensor_log
 WHERE location::INT <= 10
 LIMIT 10;
 
 location | reading | all_readings | rank 
----------+---------+--------------+------
 0        |      32 |          260 |    1
 0        |      35 |          260 |    2
 0        |      37 |          260 |    3
 0        |      71 |          260 |    4
 0        |      85 |          260 |    5
 1        |      12 |          173 |    1
 1        |      14 |          173 |    2
 1        |      17 |          173 |    3
 1        |      44 |          173 |    4
 1        |      86 |          173 |    5

Again, we can learn a few different things from these results. First is that the window results are restricted to the partition we declared. We set the partition to limit sums, ranks, and other window functions to the domain of the location. Postgres will apply window aggregates specifically to each location as it appears in the query output.

Next consider that we partitioned the sum, but did not order it. In case it wasn’t obvious already, this tells us that PARTITION and ORDER BY clauses are independent, as are the window definitions. In this case, we want the reading total for each location, followed by the rank of each reading within that group. If we had ordered the sum window, we would have gotten a running total instead of an overall value. On the other hand, we want the rank to behave in an ordered manner.

Separate windows, separate effects. Of course, we may not actually want that to happen. If we end up using the same window over and over again, it doesn’t make sense to declare it for each column. Thankfully Postgres has a shorthand for that:

SELECT location, reading,
       SUM(reading) OVER (PARTITION BY location) AS all_readings,
       SUM(reading) OVER locs AS running_total,
       rank() OVER locs
  FROM sensor_log
 WHERE location::INT <= 10
WINDOW locs AS (PARTITION BY location ORDER BY reading)
 LIMIT 10;
 
 location | reading | all_readings | running_total | rank 
----------+---------+--------------+---------------+------
 0        |      32 |          260 |            32 |    1
 0        |      35 |          260 |            67 |    2
 0        |      37 |          260 |           104 |    3
 0        |      71 |          260 |           175 |    4
 0        |      85 |          260 |           260 |    5
 1        |      12 |          173 |            12 |    1
 1        |      14 |          173 |            26 |    2
 1        |      17 |          173 |            43 |    3
 1        |      44 |          173 |            87 |    4
 1        |      86 |          173 |           173 |    5

Perfect! Not only did we use the same window twice, but we were able to mix it with another direct declaration within the same query. That isn’t exactly ground-breaking territory, but it’s reassuring when things work as expected.

Keep in mind that the same window function limitations apply to this syntax as well. Since windows are evaluated after WHERE, GROUP BY, or HAVING clauses, the WINDOW portion of a query must appear after them as well.
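
In other words, the clause order ends up looking something like this bare-bones sketch:

SELECT location, reading,
       rank() OVER locs
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
WINDOW locs AS (PARTITION BY location ORDER BY reading)
 ORDER BY location
 LIMIT 10;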

Now go forth and leverage window functions to produce reports without awkward subqueries and CTEs muddying up the works! The only better way to use a window is to cool pies.

Mmmmm… pie!

PG Phriday: In the Window

I’ll be the first to admit that I found Postgres window functions fantastically confusing when I first encountered them. They’re a powerful and versatile tool for building reports and summaries, but that functionality hides behind a fairly steep learning curve. One of the ways to combat their inherent complexity is to fully explore how they work, instead of just trying to wrap our heads around the expected results.

Window doggies have gotten decidedly smug

To that end, let’s set up a quick set of data in the customary fashion:

CREATE TABLE sensor_log (
  id            SERIAL PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);
 
INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, s.id % 100,
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10) || 'm')::INTERVAL
  FROM generate_series(1, 5000) s(id);
 
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
 
ANALYZE sensor_log;

Yes, it’s the trusty sensor_log table once again. This time around, we only really care about demonstration in lieu of volume, so we’ve elected for five thousand rows in place of the usual five million. Our data represents one thousand sensors taking readings in sequence, one every ten minutes. This should provide enough overlap to easily demonstrate what’s going on behind the scenes.

Let’s start with probably the easiest window function of the lot: row_number. All it does is number rows in the result set so we have a kind of counter that’s useful in a lot of different contexts. How do the first ten rows for today look?

SELECT location, reading, ROW_NUMBER() OVER ()
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 LIMIT 10;
 
 location | reading | ROW_NUMBER 
----------+---------+------------
 143      |      43 |          1
 142      |      42 |          2
 141      |      41 |          3
 140      |      40 |          4
 139      |      39 |          5
 138      |      38 |          6
 137      |      37 |          7
 136      |      36 |          8
 135      |      35 |          9
 134      |      34 |         10

Window functions must be called on some kind of data window. An empty set of () represents the entire data set, with no ordering, groups, or other shenanigans involved. We’re just numbering the results, and the output would have been no different if we removed the window function. This kind of use is very similar to Oracle’s ROWNUM pseudo-column.

Yet a row number by itself isn’t that interesting. Let’s number the rows in order of sensor reading, and fetch the first ten rows of those results:

SELECT location, reading, ROW_NUMBER() OVER (ORDER BY reading)
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 LIMIT 10;
 
 location | reading | ROW_NUMBER 
----------+---------+------------
 100      |       0 |          1
 101      |       1 |          2
 1        |       1 |          3
 102      |       2 |          4
 2        |       2 |          5
 103      |       3 |          6
 3        |       3 |          7
 4        |       4 |          8
 104      |       4 |          9
 105      |       5 |         10

If () was the window representing all rows, then (ORDER BY reading) is that same content after being sorted by the reading column. Not only did Postgres sort our results, but it numbered them in the post-sorted order. This is a very fine distinction! Consider what happens when we move the ORDER BY clause into the query proper.

SELECT location, reading, ROW_NUMBER() OVER ()
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 ORDER BY reading
 LIMIT 10;
 
 location | reading | ROW_NUMBER 
----------+---------+------------
 100      |       0 |         44
 1        |       1 |        143
 101      |       1 |         43
 2        |       2 |        142
 102      |       2 |         42
 3        |       3 |        141
 103      |       3 |         41
 104      |       4 |         40
 4        |       4 |        140
 105      |       5 |         39

What the heck happened here? The data looks exactly the same, but the artificial row numbers are seemingly arbitrary. Indeed they are! By design. This is part of the reason window functions are so difficult to explain and comprehend. The fact of the matter is that each window is a virtual and separate manifestation of the plain query results.

Anyone who has struggled with pointers in C or C++ knows that abstracted structures introduce certain pitfalls when it comes to obtaining the desired results.

Window functions and you!

To help unravel the mystery a bit, let’s look at the natural state of the results without any window function nonsense. Of course we must also shift the data by 40 rows so we can see some of the same information the window received.

SELECT location, reading
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 LIMIT 10 OFFSET 40;
 
 location | reading 
----------+---------
 103      |       3
 102      |       2
 101      |       1
 100      |       0
 99       |      99
 98       |      98
 97       |      97
 96       |      96
 95       |      95
 94       |      94

This output represents the table rows as sorted by the index we created on reading_date. Since these results are not artificially sorted in any way, this is what the window function is actually seeing without its own specific sort operation. We shifted the results by 40 rows, and as expected, row 44 has the reading of 0. The window function gave us exactly what we requested, but it numbered the rows before Postgres sorted them.

We can actually watch this in action by looking at the two query plans:

EXPLAIN
SELECT location, reading, ROW_NUMBER() OVER (ORDER BY reading)
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 LIMIT 10;
 
                              QUERY PLAN
----------------------------------------------------------------------
 Limit  (cost=15.91..16.08 rows=10 width=11)
   ->  WindowAgg  (cost=15.91..18.41 rows=143 width=11)
         ->  Sort  (cost=15.91..16.27 rows=143 width=11)
               Sort Key: reading
               ->  Index Scan using idx_sensor_log_date on sensor_log
                     (cost=0.29..10.79 rows=143 width=11)
                   Index Cond: (reading_date > ('now'::cstring)::date)
 
EXPLAIN
SELECT location, reading, ROW_NUMBER() OVER ()
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 ORDER BY reading
 LIMIT 10;
 
                              QUERY PLAN
----------------------------------------------------------------------
 Limit  (cost=15.67..15.69 rows=10 width=11)
   ->  Sort  (cost=15.67..16.03 rows=143 width=11)
         Sort Key: reading
         ->  WindowAgg  (cost=0.29..12.58 rows=143 width=11)
               ->  Index Scan using idx_sensor_log_date on sensor_log
                     (cost=0.29..10.79 rows=143 width=11)
                   Index Cond: (reading_date > ('now'::cstring)::date)

Note that the WindowAgg step occurs at the end of query execution in the first example, taking place directly after a sort operation. The second query sorts after the WindowAgg, indicating the window only has access to unsorted rows. The key detail is that window functions only have access to rows within the query as if it had executed without them. This also happens after other aggregates are applied, meaning it’s a bad idea (or even impossible) to mix regular aggregates with window functions.

The easiest way to comprehend how a window function works is to run the query without them. That’s the data the window has access to, regardless of how we slice and dice them within the window itself. It also explains why we’re unable to refer to window function elements in other query clauses. They’re unusable in predicates and we can’t leverage their calculations to group or limit results.
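
Try it, and Postgres will refuse outright with something like this:

SELECT location, reading
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
   AND ROW_NUMBER() OVER (ORDER BY reading) <= 10;
 
ERROR:  window functions are not allowed in WHERE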

So imagine for a moment we don’t have the standard Postgres LIMIT clause. This is how we would snag the top ten results of our location readings:

SELECT *
  FROM (SELECT location, reading,
               ROW_NUMBER() OVER (ORDER BY reading)
          FROM sensor_log
         WHERE reading_date > CURRENT_DATE
       ) sub
 WHERE ROW_NUMBER <= 10;
 
 location | reading | ROW_NUMBER 
----------+---------+------------
 100      |       0 |          1
 101      |       1 |          2
 1        |       1 |          3
 102      |       2 |          4
 2        |       2 |          5
 103      |       3 |          6
 3        |       3 |          7
 4        |       4 |          8
 104      |       4 |          9
 105      |       5 |         10

This is clearly silly when using row_number, but the trick works the same with other window functions. Here’s how we’d obtain the 10th ranked readings for today’s data:

SELECT *
  FROM (SELECT location, reading,
               dense_rank() OVER (ORDER BY reading)
          FROM sensor_log
         WHERE reading_date > CURRENT_DATE
       ) sub
 WHERE dense_rank = 10;
 
 location | reading | dense_rank 
----------+---------+------------
 109      |       9 |         10
 9        |       9 |         10

Keep in mind that Postgres must fetch the full results internally to materialize them for the window functions. This is true whether there are 143 rows as with our example, or 143 million.

The more advanced use cases for window functions are a topic for another day. Consider this a very high-level introduction to how they work and their inherent limitations instead of a comprehensive guide. There’s a lot of material here that deserves closer inspection, so there’s no need to rush. Either way, don’t let window functions confuse you more than necessary. Like any independent agent, you just need to know what they’re doing behind the scenes.

Like this, but with slightly fewer bombs

PG Phriday: Getting Assertive

There are a lot of languages available for authoring Postgres functions, but there’s nothing quite like the classic PL/pgSQL. It’s SQL! It’s not SQL! It’s a kind of horrifying mutant crossbreed suitable only for terrifying small children and generating complex reports from a cavalcade of dubious sources! And deep within its twisted entrails is an often overlooked feature usually only available in far more mature entities.

Just wait until they add sharding

That’s right, it’s obvious we’re referring to the ASSERT statement.

When we say “often overlooked”, we’re not kidding. The pertinent documentation offers a meager overview following a far more comprehensive summary of the RAISE statement. We thought it deserved better.
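
Before diving in, the basic shape is about as simple as it gets: ASSERT some condition, followed by an optional message. Here’s a trivial sketch; whether it actually fires depends on the plpgsql.check_asserts setting we’ll cover below.

DO $$
BEGIN
  -- The message is optional; without one, Postgres reports a generic
  -- assertion failure instead.
  ASSERT 1 > 2, 'Basic math seems to have failed us.';
END;
$$ LANGUAGE plpgsql;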

Let’s begin with a simple table example that can form the basis of a more complex function:

CREATE TABLE sensor_log (
  id            SERIAL PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);
 
INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, s.id % 100,
       CURRENT_DATE - ((s.id * 10) || 's')::INTERVAL
  FROM generate_series(1, 5000000) s(id);
 
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
 
ANALYZE sensor_log;

What can we do with five million rows of simulated data from 1000 sensors that collect data every ten seconds? A good place to start is a summary or report table of some kind. We’re only doing cursory analysis, so we should begin with aggregating the daily average, minimum, and maximum. If we get adventurous later, we can add standard deviations, jitter, deltas, and other fancy logic report owners might find useful.

Since this type of data is easy to tabulate, we require one single table and a function to manage its contents.

CREATE TABLE sensor_report (
  id            SERIAL PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading_date  DATE NOT NULL,
  avg_reading   BIGINT NOT NULL,
  min_reading   BIGINT NOT NULL,
  max_reading   BIGINT NOT NULL
);
 
CREATE INDEX idx_sensor_report_date ON sensor_report (reading_date);
 
CREATE OR REPLACE FUNCTION sp_trickle_report()
RETURNS BIGINT AS
$$
DECLARE
  last_update DATE;
  insert_count BIGINT;
BEGIN
  -- Find the most recent data in the report so we only
  -- refresh those contents.
 
  SELECT INTO last_update MAX(reading_date)
    FROM sensor_report;
 
  -- Bootstrap the report table if it's currently empty.
 
  IF last_update IS NULL THEN
    SELECT INTO last_update MIN(reading_date)
      FROM sensor_log;
  END IF;
 
  -- Since data comes in regularly, we only need to rebuild
  -- current data. Delete old aggregates so we can replace them.
 
  DELETE FROM sensor_report
   WHERE reading_date >= last_update;
 
  -- Perform the insert in a CTE so we can capture the count of
  -- inserted rows for further analysis or to return to our caller.
 
  WITH capture AS (
    INSERT INTO sensor_report (
             location, reading_date, avg_reading,
             min_reading, max_reading
           )
    SELECT location, reading_date::DATE,
           avg(reading), MIN(reading), MAX(reading)
      FROM sensor_log
     WHERE reading_date >= last_update
     GROUP BY location, reading_date::DATE
     RETURNING *
  )
  SELECT COUNT(*) INTO insert_count
    FROM capture;
 
   RETURN insert_count;
END;
$$ LANGUAGE plpgsql;
 
SELECT sp_trickle_report();
 
 sp_trickle_report 
-------------------
            579000
 
ANALYZE sensor_report;

Thankfully this is a very simple (and somewhat naive) reporting function. Imagine this is our first iteration and we want to debug some of its activity. If we were using Python, for example, we might pepper the code with assert statements or craft specific tests to accomplish that same goal. Well, PL/pgSQL allows us to do just that.

For the second iteration, let’s assume we want to avoid bootstrapping the report table if it’s empty. After all, bootstrapping only needs to be done once, so why pollute the function with that edge case? Besides, it’s entirely possible we might require a separate procedure to initialize the reporting table, and isolated functionality is generally less buggy.

If the table isn’t bootstrapped, it’s probably because we’re in a testing environment that is missing data. In this case, it makes sense to complain so the missing data is loaded, even though it’s not strictly an erroneous scenario. So let’s replace the bootstrap with an assertion. While we’re at it, let’s say a new requirement from the upstream code suggests a minimum of 2000 sensors are required for meaningful output. In such a case, we might need to revise our fixture data in the development and QA environments.

Here’s how that might look:

CREATE OR REPLACE FUNCTION sp_trickle_report()
RETURNS BIGINT AS
$$
DECLARE
  last_update DATE;
  insert_count BIGINT;
BEGIN
  -- Find the most recent data in the report so we only
  -- refresh those contents.
 
  SELECT INTO last_update MAX(reading_date)
    FROM sensor_report;
 
  ASSERT last_update IS NOT NULL, 'Need to bootstrap report table.';
 
  -- Since data comes in regularly, we only need to rebuild
  -- current data. Delete old aggregates so we can replace them.
 
  DELETE FROM sensor_report
   WHERE reading_date >= last_update;
 
  -- Perform the insert in a CTE so we can capture the count of
  -- inserted rows for further analysis or to return to our caller.
 
  WITH capture AS (
    INSERT INTO sensor_report (
             location, reading_date, avg_reading,
             min_reading, max_reading
           )
    SELECT location, reading_date::DATE,
           avg(reading), MIN(reading), MAX(reading)
      FROM sensor_log
     WHERE reading_date >= last_update
     GROUP BY location, reading_date::DATE
     RETURNING *
  )
  SELECT COUNT(*) INTO insert_count
    FROM capture;
 
  ASSERT insert_count >= 2000, 'Not enough current sensor activity!';
 
  RETURN insert_count;
END;
$$ LANGUAGE plpgsql;

The function itself isn’t that much different; the important part is the result of our changes. The only truly meaningful modification is that the function no longer initializes the report table if it was empty. Ignoring that, we can treat the function exactly as before, as no output will change, and no new failure conditions are triggered.

This is due to the fact that assertions are ignored by default. Remember, they’re primarily intended for debugging purposes, so they must be explicitly enabled. We can do this by making an entry in postgresql.conf, but we can also activate them within our current session.
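
The postgresql.conf route uses the same parameter we’re about to set interactively; a one-line sketch:

# In postgresql.conf, followed by a configuration reload:
plpgsql.check_asserts = on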

Let’s use our new assertions:

SET plpgsql.check_asserts = TRUE;
 
TRUNCATE TABLE sensor_report;
 
SELECT sp_trickle_report();
 
ERROR:  Need to bootstrap report table.
CONTEXT:  PL/pgSQL function sp_trickle_report() line 12 at ASSERT
 
INSERT INTO sensor_report (
         location, reading_date, avg_reading,
         min_reading, max_reading
       )
SELECT location, reading_date::DATE,
       avg(reading), MIN(reading), MAX(reading)
  FROM sensor_log
 GROUP BY location, reading_date::DATE;
 
ANALYZE sensor_report;
 
SELECT sp_trickle_report();
 
ERROR:  Not enough current sensor activity!
CONTEXT:  PL/pgSQL function sp_trickle_report() line 38 at ASSERT
 
INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 5000, s.id % 100,
       CURRENT_DATE - ((s.id * 1) || 's')::INTERVAL
  FROM generate_series(1, 50000) s(id);
 
SELECT sp_trickle_report();
 
 sp_trickle_report 
-------------------
              5000

We begin by truncating the report table so we can witness the first assertion. As we can see, Postgres correctly complains that the table is empty when it shouldn’t be. Since this is a true ERROR, execution immediately halts where the assertion failed. We see that error and use some alternative method for filling the report table, and try the function again.

Stop! Assert time!

The second execution fails because the sensor_log table only contains data based on 1000 sensors, yet we expect 2000. Besides telling us to update our fixture data to account for more sensors, we also learn what the function expects. So we decide to insert some fake data from 5000 sensors and try again. The last attempt works as expected, and we’re free to move on with our lives.

It seems odd initially that a formerly ignored statement gets escalated all the way to a fatal error, doesn’t it? But how else should assertions work? If we just needed information, we could have simply used RAISE NOTICE with the relevant details. The power of ASSERT is that we don’t need to remove the statements to deactivate them.

The reason assertions are fatal is that they raise an ASSERT_FAILURE exception. We can actually take advantage of that in debugging. We can catch that exception just like any other, meaning we can ignore it or escalate as necessary. Consider this example:

CREATE OR REPLACE FUNCTION sp_trickle_report()
RETURNS BIGINT AS
$$
DECLARE
  last_update DATE;
  insert_count BIGINT;
BEGIN
  -- Find the most recent data in the report so we only
  -- refresh those contents.
 
  SELECT INTO last_update MAX(reading_date)
    FROM sensor_report;
 
  ASSERT last_update IS NOT NULL, 'Need to bootstrap report table.';
 
  -- Since data comes in regularly, we only need to rebuild
  -- current data. Delete old aggregates so we can replace them.
 
  DELETE FROM sensor_report
   WHERE reading_date >= last_update;
 
  -- Perform the insert in a CTE so we can capture the count of
  -- inserted rows for further analysis or to return to our caller.
 
  WITH capture AS (
    INSERT INTO sensor_report (
             location, reading_date, avg_reading,
             min_reading, max_reading
           )
    SELECT location, reading_date::DATE,
           avg(reading), MIN(reading), MAX(reading)
      FROM sensor_log
     WHERE reading_date >= last_update
     GROUP BY location, reading_date::DATE
     RETURNING *
  )
  SELECT COUNT(*) INTO insert_count
    FROM capture;
 
  ASSERT insert_count >= 2000, 'Not enough current sensor activity!';
 
  RETURN insert_count;
 
EXCEPTION
  WHEN ASSERT_FAILURE THEN
    RAISE NOTICE 'Tripped over debugging code: %', SQLERRM;
    RETURN 0;
END;
$$ LANGUAGE plpgsql;

The only thing we added was a single exception block to catch any exceptions thrown within the function. Then we toned down the error slightly while maintaining the original assertion message, and returned 0 from the function since it didn’t process any rows due to the exception.

We can see this for ourselves by truncating the report table again:

SET plpgsql.check_asserts = TRUE;
 
TRUNCATE TABLE sensor_report;
 
SELECT sp_trickle_report();
 
NOTICE:  Tripped over debugging code: Need to bootstrap report table.
 sp_trickle_report 
-------------------
                 0

Now the function itself doesn’t fail or introduce an error into any transaction contexts we might have open, but we can still see something is wrong. The function claims to have inserted no rows, and we see a notice specifically from our own debugging system. In the production environment, this code would be completely silent and the function would operate without the overly oppressive assertions.

Is ASSERT overlooked because there are more suitable approaches to solve these problems? Or is it because the feature was added in 9.5, and it’s extremely easy to miss in the deluge of other headline-grabbing advancements? Honestly it’s probably a combination of those two and a myriad of other causes. The important thing is that we know about it now.

And as we are all well aware…

G.I. Joe!

PG Phriday: Down in the Dumps

These days, with multiple Postgres database environments being commonplace, it’s not unheard of to copy data from one to another. Perhaps a production extract is necessary to properly vet data in a staging environment. Maybe the development environment needs to update its data to reflect an especially pernicious and intractable edge case someone observed. In any of these scenarios, we are likely to extract data from multiple tables to import it elsewhere. The question is: how?

Last week I flatly stated that pg_dump is not an adequate backup tool. Despite that, it’s perfect for situations like these, where we only need a few tables out of an entire database. However, tables frequently reference other tables in foreign keys. While such integrity concerns do not affect data export, they greatly complicate import efforts.

The more deeply nested the referential integrity rules become, the more hair we’ll pull out trying to trace all of them to their requisite tables. Even after we’ve found them all, we still need to unravel the same horrors again in the target cluster.

A typical table dependency chart

To better illustrate what we mean by “horrors”, take this relatively simple table hierarchy:

CREATE TABLE tab_a (id INT PRIMARY KEY);
CREATE TABLE tab_a1 (id INT PRIMARY KEY REFERENCES tab_a);
CREATE TABLE tab_b (id INT PRIMARY KEY REFERENCES tab_a);
CREATE TABLE tab_b1 (id INT PRIMARY KEY REFERENCES tab_b);
CREATE TABLE tab_c (id INT PRIMARY KEY REFERENCES tab_b);
CREATE TABLE tab_c1 (id INT PRIMARY KEY REFERENCES tab_c);
 
INSERT INTO tab_a VALUES (1);
INSERT INTO tab_a1 VALUES (1);
INSERT INTO tab_b VALUES (1);
INSERT INTO tab_b1 VALUES (1);
INSERT INTO tab_c VALUES (1);
INSERT INTO tab_c1 VALUES (1);

Imagine these are six tables out of potentially hundreds. From an outside perspective, we may only know that tab_a needs to be copied from one system to another. No matter what tool we use for this, finding the other tables is somewhat irritating. An ORM will display the relationships in a physical chart, and the psql command line client or pgAdmin will report tables that immediately depend on a table we examine. Neither gives us a great way to isolate them all at once.

Fortunately we have the Postgres system catalog. The pg_constraint system table tracks all of those table relationships, and we can recursively fetch the entire tree using a CTE. Remember how awesome CTEs are? Let’s mix some chocolate with our peanut butter by using one to mine the catalog.

Starting with tab_a, here’s how we might find all of the other tables in the list:

WITH RECURSIVE deptree AS (
  SELECT conrelid, conrelid::REGCLASS::TEXT AS TABLE_NAME,
         confrelid, confrelid::REGCLASS::TEXT AS dep_table
    FROM pg_constraint
   WHERE contype IN ('f', 'p')
     AND (confrelid::REGCLASS::TEXT = 'tab_a' OR
          conrelid::REGCLASS::TEXT = 'tab_a')
   UNION
  SELECT c.conrelid, c.conrelid::REGCLASS::TEXT AS TABLE_NAME,
         c.confrelid, c.confrelid::REGCLASS::TEXT AS dep_table
    FROM pg_constraint c
    JOIN deptree d ON (d.conrelid = c.confrelid)
   WHERE c.contype = 'f'
)
SELECT *
  FROM deptree;
 
 conrelid | TABLE_NAME | confrelid | dep_table 
----------+------------+-----------+-----------
    69327 | tab_a      |         0 | -
    69332 | tab_a1     |     69327 | tab_a
    69342 | tab_b      |     69327 | tab_a
    69352 | tab_b1     |     69342 | tab_b
    69362 | tab_c      |     69342 | tab_b
    69372 | tab_c1     |     69362 | tab_c

As complicated as this query appears, it’s not actually that bad. We simply bootstrap the CTE with any tables that depend on tab_a, and repeat the process with each successive loop. Even if there are dozens or hundreds of incestuous relationships, we’ll capture all of them.

In any case, we have all the data we need to construct a pg_dump command to export all of these tables. To keep this as speedy as possible, let’s perform a parallel dump of the indicated tables into a directory we can transmit to another server.

pg_dump -t tab_a -t tab_a1 -t tab_b -t tab_b1 -t tab_c -t tab_c1 \
        -a -Fd -j 8 -f /tmp/tab_exports postgres

Though we can list each table individually, it would be nice if we could supply a file of desired tables instead. Dependency graphs of sufficient depth could produce a command line that exceeds the shell’s character limits. This means we may need to execute multiple dumps or rely on wildcards and hope there’s enough name overlap.

Given that caveat, maybe we want to combine pg_dump with xargs and its parallel capabilities instead:

mkdir /tmp/tab_exports
 
cat /tmp/tablist.txt | xargs -I{} -P 8 \
    pg_dump -t {} -a -Fc -f /tmp/tab_exports/{}.pgr postgres
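
For the record, the /tmp/tablist.txt used above doesn’t appear out of thin air. One way to build it, assuming we stash the results of the dependency CTE in a temporary table named tab_deps (a trick we’ll use again shortly), is a simple COPY:

COPY (
  SELECT DISTINCT TABLE_NAME
    FROM tab_deps
) TO '/tmp/tablist.txt';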

Either way, we now have a data-only representation of every table in the dependency tree of the table we’re actually importing. Now we just have to insert these values into the target system. And this is usually where we first encounter a major roadblock: existing contents.

First of all, that data is in our way. Secondly, the most efficient manner of ridding ourselves of their oppression is the TRUNCATE command. While TRUNCATE is a DDL-level command and can’t be circumvented, it does have a CASCADE decorator which will empty the entire dependency chain. This actually helps us since we have data to rebuild anything that might be affected by a cascading truncate.

Let’s give it a try on the recipient database:

TRUNCATE TABLE tab_a;
 
ERROR:  cannot truncate a table referenced in a foreign key constraint
DETAIL:  Table "tab_a1" references "tab_a".
HINT:  Truncate table "tab_a1" at the same time, or use TRUNCATE ... CASCADE.
 
TRUNCATE TABLE tab_a CASCADE;
 
NOTICE:  truncate cascades to table "tab_a1"
NOTICE:  truncate cascades to table "tab_b"
NOTICE:  truncate cascades to table "tab_b1"
NOTICE:  truncate cascades to table "tab_c"
NOTICE:  truncate cascades to table "tab_c1"

Well that was terrifying. Also note that the error only reported one of the dependencies. If we relied only on this information, we might attempt to drop the foreign key constraint thinking it was the only one. Every subsequent attempt to truncate the table would result in another error.

We’ve avoided all of that, but imagine if we had attempted to restrict our efforts to only the data in tab_a. We’d be left with no benefits of cascade, manually tracing the errors until success, inability to reestablish the constraints due to likely missing keys, and so on. No thanks.

Though truncate and the foreign keys aided us in emptying the tables on the recipient, restoring from our dump is somewhat problematic. We want to do it in parallel, but foreign keys directly prevent this by enforcing key integrity at each level. So we’re stuck either importing each table in order of the dependency tree, or we need to do a bit of preparation.

We can leverage that CTE again and build a couple of SQL scripts to help us out:

CREATE TEMP TABLE tab_deps AS
WITH RECURSIVE deptree AS (
  SELECT conrelid, conrelid::REGCLASS::TEXT AS TABLE_NAME,
         confrelid, confrelid::REGCLASS::TEXT AS dep_table
    FROM pg_constraint
   WHERE contype IN ('f', 'p')
     AND (confrelid::REGCLASS::TEXT = 'tab_a' OR
          conrelid::REGCLASS::TEXT = 'tab_a')
   UNION
  SELECT c.conrelid, c.conrelid::REGCLASS::TEXT AS TABLE_NAME,
         c.confrelid, c.confrelid::REGCLASS::TEXT AS dep_table
    FROM pg_constraint c
    JOIN deptree d ON (d.conrelid = c.confrelid)
   WHERE c.contype = 'f'
)
SELECT *
  FROM deptree;
 
COPY (
  SELECT 'ALTER TABLE ' || TABLE_NAME || ' DISABLE TRIGGER ALL;'
    FROM tab_deps
) TO '/tmp/no_trigs.sql';
 
COPY (
  SELECT 'ALTER TABLE ' || TABLE_NAME || ' ENABLE TRIGGER ALL;'
    FROM tab_deps
) TO '/tmp/fix_trigs.sql';

One thing that’s interesting about Postgres is that foreign keys are actually enforced by hidden system-level triggers. Given that information, we can simply turn them off for the duration of our import, and re-enable them afterwards. Now we have all the parts we need to restore the donor data into the environment of our choice.
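
We can even peek at those internal triggers directly. A quick sketch against one of our tables (the generated trigger names will vary) looks like this:

SELECT tgname, tgisinternal
  FROM pg_trigger
 WHERE tgrelid = 'tab_a1'::REGCLASS;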

It should be cautioned that all of the import steps should be performed on the target system itself, even if that means synchronizing the dump directory to another server first. We don’t want to fat-finger these commands in production or another equally important system.

In any case, this is how that might go:

psql -f /tmp/no_trigs.sql postgres
psql -c 'TRUNCATE TABLE tab_a CASCADE;' postgres
pg_restore -d postgres -j 8 /tmp/tab_exports
psql -f /tmp/fix_trigs.sql postgres

Or if we used xargs, we’ll want this procedure instead:

psql -f /tmp/no_trigs.sql postgres
psql -c 'TRUNCATE TABLE tab_a CASCADE;' postgres
 
cat /tmp/tablist.txt | xargs -I{} -P 8 \
    pg_restore -d postgres /tmp/tab_exports/{}.pgr
 
psql -f /tmp/fix_trigs.sql postgres

And voila, we’re saved!

Of course, table structures between the environments may not match. In that case, we would simply drop all of the involved tables instead of truncating them and stunning their constraint triggers. As such, we would want to remove the -a data-only flag from the pg_dump commands we used so pg_restore can recreate the tables themselves. Ultimately, we’ve broken the paradox of satisfying constraints while simultaneously circumventing them.
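
A sketch of that variation, assuming we re-ran the earlier pg_dump without -a and still have the session-local tab_deps table around, might go like this:

COPY (
  -- CASCADE also removes any foreign keys pointing in from tables
  -- outside our list, so those would need recreating afterwards.
  SELECT 'DROP TABLE IF EXISTS ' || TABLE_NAME || ' CASCADE;'
    FROM tab_deps
) TO '/tmp/drop_tabs.sql';

Then, on the target system:

psql -f /tmp/drop_tabs.sql postgres
pg_restore -d postgres -j 8 /tmp/tab_exports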

It’s not an ideal situation, but hey, whatever works.

PG Phriday: Getting Back Up

In light of recent events where GitLab suffered a massive database loss, this is a great opportunity to examine what happened from a Postgres perspective. Since Simon Riggs over at 2ndQuadrant has already chimed in on improvements GitLab might consider in their procedures, maybe we should walk the conversation back slightly.

This isn’t the first time Postgres backup tooling has been misused or misunderstood. The topic of backups hits forums and mailing lists rather frequently. It’s no secret that there’s no official “push button to back up” tool for Postgres. Companies like 2ndQuadrant do their best with offerings like Barman, but this will never stop admins from rolling their own solutions instead of resorting to another external utility.

Part of the reason for this stems from the extract and backup tools Postgres itself provides. After reading the documentation, it’s only natural to encounter pg_dump and pg_basebackup and stop looking. And why not? MySQL has its mysqldump after all and that’s fine, right?

Well, not always. Our first clue comes from the ‘dump’ suffix. By its nature, this means we’re extracting database contents by essentially executing a SELECT statement against each table and waiting for the results. Sometimes this involves compression, and there’s a special format Postgres can use to inline the process, and… none of that matters.

An extract is not a backup. Full stop. Depending on the size of our database, performing a full dump can consume a span of a few seconds to several days. We’re limited by the laws of physics in this case; reading a 50TB database takes time, as does compressing that output. Restoring it is often much slower, given how we must account for recreating indexes. In these times of gigantic databases, pg_dump should probably be relegated to exporting and importing individual tables unless the instance is truly tiny.

In the Old Days(tm), the most common advice was to supply the pg_dump command with the -Fc flags to activate its internal compression and storage format. But tools evolve, and Postgres 9.3 added the -j option so tables can be dumped or restored in parallel. Have a huge machine with 64 CPUs living on really fast SSDs and insist on dumping every table? Do this:

# This works for parallel dumps
 
pg_dump -Fd -j 48 -d my_db -f /my/backup/dir
 
# This is for restoring it
 
pg_restore -j 48 -d my_db /my/backup/dir

This approach scales to surprisingly high levels, provided the underlying storage can keep up. It’s not perfectly linear, but a backup or restore using multiple parallel processes will complete in a fraction of the time. After a couple hundred GB however, even parallel dumps start to exceed standard maintenance windows.

Real database engines require a binary backup tool, and Postgres gained one of these back in version 9.1. There are multiple helpful options to customize its behavior, but its real job is to ignore database contents completely. The pg_basebackup utility can back up a full instance of multiple databases nearly as quickly as the operating system can read the files from disk. A well equipped SAN, RAID, or SSD usually provide enough IO to back up even a 5TB instance in a couple of hours.
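
A bare-bones invocation is little more than this sketch, where the destination path is just an example and -P merely shows progress:

pg_basebackup -D /backups/base_backup -Fp -P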

Unfortunately these backups are not compressed by default, and since we lose the ability to perform parallel dumps when using pg_basebackup, compression is single-threaded if we activate it. That means our 5TB cluster will produce a 5TB backup unless we don’t mind a much longer runtime. Ouch. On the other hand, this means the output it produces is an exact physical copy. What else is an exact physical copy of a Postgres instance?

That’s right: a replica. What’s nice is that the Postgres developers know that as well. Want to copy a remote Postgres cluster on another server and have it start streaming from the primary as soon as it’s started? Easy:

pg_basebackup -h primary-server -D /new/datadir -R
pg_ctl -D /new/datadir start

Since we supplied the -R flag, pg_basebackup writes a recovery.conf file that will instruct our new replica to stream from the same server we just fetched the backup from. We can do this as many times as we wish, even if the upstream is another replica. Our only limit is network bandwidth, and we’ve all upgraded to 10Gbps equipment, right?

Unfortunately, scenarios beyond this point are where the process breaks down. What happens if we have a replica that falls behind and needs to be rebuilt? For all of its benefits, pg_basebackup still cannot (currently) skip unchanged files, or make small patches where necessary. Relying on it in this case would require erasing the replica and starting from scratch. This is where GitLab really ran into trouble.

Yet we started with synchronized files, didn’t we? Could we use rsync to “catch up”? Yes, but it’s a somewhat convoluted procedure. We would first need to connect to the upstream server and issue a SELECT pg_start_backup('my_backup') command so Postgres knows to archive transaction logs produced during the sync. Then after the sync is completed, we would need to stop the backup with SELECT pg_stop_backup(). Then we would have to make our own recovery.conf file, obtain all of the WAL files the upstream server archived, and so on.
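
Glossing over WAL handling entirely, with host names and paths as pure placeholders, and using the recovery.conf style current as of this writing, the skeleton of that procedure resembles this sketch:

# On the upstream server: flag the start of the file copy.
psql -h primary-server \
     -c "SELECT pg_start_backup('resync', true);" postgres
 
# With the lagging replica stopped, copy only changed files into its
# data directory.
rsync -a --delete --exclude 'pg_xlog/*' --exclude 'postmaster.pid' \
      primary-server:/db/pgdata/ /db/pgdata/
 
# On the upstream server: end the backup.
psql -h primary-server -c "SELECT pg_stop_backup();" postgres
 
# Recreate recovery.conf by hand, then start the replica so it can
# stream or replay the WAL produced during the sync.
cat > /db/pgdata/recovery.conf <<EOF
standby_mode = 'on'
primary_conninfo = 'host=primary-server'
EOF
 
pg_ctl -D /db/pgdata start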

None of that is something a system administrator will know, and it’s fiddly even to an experienced Postgres DBA. A mistake during any of that procedure will result in a non-functional or otherwise unsafe replica. All of that is the exact reason software like Barman exists. Supplied utilities only get us so far. For larger or more critical installations, either our custom scripts must flawlessly account for every failure scenario and automate everything, or we defer to someone who already did all of that work.

Even without that, there are some general guidelines to follow:

  1. Don’t use pg_dump for backups. If your database isn’t too large now, it will be later.
  2. Always have a replica handy for critical systems. If a company depends on a database, there should always be at least one online copy. It’s easy to play fast-and-loose here when performing maintenance on a replica, but scripts and CLI snafus happen. Bring up another replica until maintenance is over.
  3. If performing dangerous work on the primary, temporarily decouple at least one replica. I had to learn from experience here that hardware or filesystem modifications can corrupt the entire replication chain. Have an online copy of critical instances, and disable replication during maintenance until work is done.
  4. Have an off-site tertiary replica for disaster recovery purposes. Another datacenter, another city, another state. Just put it somewhere else. Not only is this a supplementary copy for failover purposes or crash recovery, but even regional outages won’t result in lost data or availability.
  5. Perform backup operations on a replica for 24/7 systems. Websites that people expect to never go down for maintenance should be backed up from a replica. Postgres backups are online and non-blocking, but are IO intensive and may substantially decrease query performance. Don’t take the risk.
  6. Fail over to a working replica at least twice per year. Either it happens when nobody expects it, or we manage the process ourselves. The latter scenario is highly preferred and establishes a checklist to apply to the former.
  7. Test backup restores at least twice a year. An untested backup is equivalent to random 1s and 0s in a file somewhere. Assuming it exists, it could be anything. It may not work at all. Make sure.
  8. Always back up transaction log archives through the day on critical clusters. They can be used to help a replica catch up if it falls behind, and are an invaluable component of PITR functionality.

These kinds of procedural rules are not restricted to Postgres, and aren’t even really a complete list. Administering a database architecture is a thankless job that never ends, and there’s always something we’ll miss. The catastrophe at GitLab was a failure of process and architecture, not of a command typed on the wrong server. I’m extremely grateful for their openness regarding the situation, because we can all learn from it.

As an aside, while I highly respect Simon and 2ndQuadrant, Barman isn’t the only game in town for backup management. It seems backing up huge installations was a problem long enough that several developers solved it almost simultaneously. Before Barman had it, pgBackRest was designed around hard links and incremental backups of gargantuan 1TB+ systems. Unfortunately I don’t know how they compare because I haven’t tested them. Why not? Because even I got caught up in the race for a better backup tool.

One of our databases first tipped the 1TB scale around 2013 and pg_basebackup was consuming inordinate amounts of time. So I threw together a bunch of scripts that leveraged hard links, parallel compression with pigz, and unleashed it on a replica of our production system. It was twelve times faster on the first iteration thanks to the parallel compression, and over 100 times faster while using hard links to produce incrementals. I eventually rewrote it as a more robust set of Python libraries, but pgBackRest and Barman finally made that last step, rendering my efforts effectively moot.

I still use my libraries for now, but the extra management capabilities the other tools supply is somewhat enticing. Want some idea how much better Barman or pgBackRest will perform than relying on pg_basebackup? We have a 50TB database that grows by 200GB per day, which many might consider slightly cumbersome. My homegrown tool transforms that into a 7TB backup in four hours, while the 24GB nightly incrementals usually require about 20 minutes. We back up a 50TB database cluster in 20 minutes.

And yes, I’ve tested restoring from this backup. Restoring takes a while, but it’s also done in parallel. There’s little that’s more satisfying than watching a system with 32 CPUs pegged at 100% decompressing 50TB worth of data as fast as the underlying filesystem is capable of writing.

But I was only able to do that because I’m intimately familiar with our tools and requirements. Our company process standards revealed limitations in the supplied Postgres backup tooling, prompting me to find or create an alternative. There is no substitute for established protocol, despite the fact building those procedures might not have an immediately obvious benefit. The payoff is avoiding downtime, and that’s something you can’t really plan.

It’s the difference between six hours of lost data, and six seconds.