PG Phriday: Getting Back Up

In light of recent events where GitLab suffered a massive database loss, this is a great opportunity to examine what happened from a Postgres perspective. Since Simon Riggs over at 2ndQuadrant has already chimed in on improvements GitLab might consider in their procedures, maybe we should walk the conversation back slightly.

This isn’t the first time Postgres backup tooling has been misused or misunderstood. The topic of backups hits forums and mailing lists rather frequently. It’s no secret that there’s no official “push button to back up” tool for Postgres. Companies like 2ndQuadrant do their best with offerings like Barman, but this will never stop admins from rolling their own solutions instead of resorting to another external utility.

Part of the reason for this stems from the extract and backup tools Postgres itself provides. After reading the documentation, it’s only natural to encounter pg_dump and pg_basebackup and stop looking. And why not? MySQL has its mysqldump after all and that’s fine, right?

Well, not always. Our first clue comes from the ‘dump’ suffix. By its nature, this means we’re extracting database contents by essentially executing a SELECT statement against each table and waiting for the results. Sometimes this involves compression, and there’s a special format Postgres can use to inline the process, and… none of that matters.

An extract is not a backup. Full stop. Depending on the size of our database, performing a full dump can take anywhere from a few seconds to several days. We’re limited by the laws of physics in this case; reading a 50TB database takes time, as does the processing power necessary to compress that output. Restoring it is often much slower, given how we must account for recreating indexes. In these times of gigantic databases, pg_dump should probably be relegated to exporting and importing individual tables unless the instance is truly tiny.

In the Old Days(tm), the most common advice was to supply the pg_dump command with the -Fc flags to activate its internal compression and storage format. But tools evolve, and Postgres 9.3 added the -j option so tables can be dumped or restored in parallel. Have a huge machine with 64 CPUs living on really fast SSDs and insist on dumping every table? Do this:

# This works for parallel dumps
 
pg_dump -Fd -j 48 -d my_db -f /my/backup/dir
 
# This is for restoring it
 
pg_restore -j 48 -d my_db /my/backup/dir

This approach scales to surprisingly high levels, provided the underlying storage can keep up. It’s not perfectly linear, but a backup or restore using multiple parallel processes will complete in a fraction of the time. After a couple hundred GB however, even parallel dumps start to exceed standard maintenance windows.

Real database engines require a binary backup tool, and Postgres gained one of these back in version 9.1. There are multiple helpful options to customize its behavior, but its real job is to ignore database contents completely. The pg_basebackup utility can back up a full instance of multiple databases nearly as quickly as the operating system can read the files from disk. A well equipped SAN, RAID, or SSD usually provides enough IO to back up even a 5TB instance in a couple of hours.
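For reference, invoking it looks something like this. Consider it a sketch; the host name and target directory are placeholders:

# Plain copy of the entire instance, showing progress as it goes.

pg_basebackup -h primary-server -D /my/backup/dir -Fp -P

# Or as compressed tar output; the gzip step is single-threaded.

pg_basebackup -h primary-server -D /my/backup/dir -Ft -z -P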

Unfortunately these backups are not compressed by default, and since we lose the ability to perform parallel dumps when using pg_basebackup, compression is single-threaded if we activate it. That means our 5TB cluster will produce a 5TB backup unless we’re willing to accept a much longer runtime. Ouch. On the other hand, this means the output it produces is an exact physical copy. What else is an exact physical copy of a Postgres instance?

That’s right: a replica. What’s nice is that the Postgres developers know that as well. Want to copy a remote Postgres cluster on another server and have it start streaming from the primary as soon as it’s started? Easy:

pg_basebackup -h primary-server -D /new/datadir -R
pg_ctl -D /new/datadir start

Since we supplied the -R flag, pg_basebackup writes a recovery.conf file that will instruct our new replica to stream from the same server we just fetched the backup from. We can do this as many times as we wish, even if the upstream is another replica. Our only limit is network bandwidth, and we’ve all upgraded to 10Gbps equipment, right?
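Incidentally, the recovery.conf that -R generates is tiny. It looks roughly like this, with the connection string reflecting whatever parameters were used to take the backup:

standby_mode = 'on'
primary_conninfo = 'host=primary-server port=5432 user=postgres'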

Unfortunately, scenarios beyond this point are where the process breaks down. What happens if we have a replica that falls behind and needs to be rebuilt? For all of its benefits, pg_basebackup still cannot (currently) skip unchanged files, or make small patches where necessary. Relying on it in this case would require erasing the replica and starting from scratch. This is where GitLab really ran into trouble.

Yet we started with synchronized files, didn’t we? Could we use rsync to “catch up”? Yes, but it’s a somewhat convoluted procedure. We would first need to connect to the upstream server and issue a SELECT pg_start_backup('my_backup') command so Postgres knows to archive transaction logs produced during the sync. Then after the sync is completed, we would need to stop the backup with SELECT pg_stop_backup(). Then we would have to make our own recovery.conf file, obtain all of the WAL files the upstream server archived, and so on.
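Sketched very roughly, and glossing over authentication, WAL retrieval, and error handling entirely, that procedure looks something like this (hostnames and paths are placeholders):

# Tell the upstream server a backup is starting; the second argument
# requests an immediate checkpoint.

psql -h upstream-host -c "SELECT pg_start_backup('resync', true)"

# With the replica shut down, copy only the files that changed,
# skipping transient content.

rsync -av --delete --exclude=postmaster.pid --exclude=pg_xlog/* \
      upstream-host:/data/pgsql/main/data/ /data/pgsql/main/data/

# Close the backup so the upstream archives the final WAL segments.

psql -h upstream-host -c "SELECT pg_stop_backup()"

# Then write recovery.conf, restore the archived WAL files, and start
# the replica... assuming nothing above went wrong.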

None of that is something a system administrator will simply know, and it’s fiddly even for an experienced Postgres DBA. A mistake anywhere in that procedure will result in a non-functional or otherwise unsafe replica. All of that is the exact reason software like Barman exists. Supplied utilities only get us so far. For larger or more critical installations, either our custom scripts must flawlessly account for every failure scenario and automate everything, or we defer to someone who already did all of that work.

Even without that, there are some general guidelines to follow:

  1. Don’t use pg_dump for backups. If your database isn’t too large now, it will be later.
  2. Always have a replica handy for critical systems. If a company depends on a database, there should always be at least one online copy. It’s easy to play fast-and-loose here when performing maintenance on a replica, but scripts and CLI snafus happen. Bring up another replica until maintenance is over.
  3. If performing dangerous work on the primary, temporarily decouple at least one replica. I had to learn from experience here that hardware or filesystem modifications can corrupt the entire replication chain. Have an online copy of critical instances, and disable replication during maintenance until work is done.
  4. Have an off-site tertiary replica for disaster recovery purposes. Another datacenter, another city, another state. Just put it somewhere else. Not only is this a supplementary copy for failover purposes or crash recovery, but even regional outages won’t result in lost data or availability.
  5. Perform backup operations on a replica for 24/7 systems. Websites that people expect to never go down for maintenance should be backed up from a replica. Postgres backups are online and non-blocking, but are IO intensive and may substantially decrease query performance. Don’t take the risk.
  6. Fail over to a working replica at least twice per year. Either it happens when nobody expects it, or we manage the process ourselves. The latter scenario is highly preferred and establishes a checklist to apply to the former.
  7. Test backup restores at least twice a year. An untested backup is equivalent to random 1s and 0s in a file somewhere. Assuming it exists, it could be anything. It may not work at all. Make sure.
  8. Always back up transaction log archives throughout the day on critical clusters. They can be used to help a replica catch up if it falls behind, and are an invaluable component of PITR functionality. (There’s a small configuration sketch just after this list.)
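On that last point, WAL archiving boils down to a couple of postgresql.conf settings. Consider this a minimal sketch; the archive destination is a placeholder, and a real setup should copy segments somewhere durable that isn’t the primary host:

wal_level = replica     # 'hot_standby' on 9.5 and earlier
archive_mode = on
archive_command = 'cp %p /path/to/wal_archive/%f'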

These kinds of procedural rules are not restricted to Postgres, and aren’t even really a complete list. Administering a database architecture is a thankless job that never ends, and there’s always something we’ll miss. The catastrophe at GitLab was a failure of process and architecture, not of a command typed on the wrong server. I’m extremely grateful for their openness regarding the situation, because we can all learn from it.

As an aside, while I highly respect Simon and 2ndQuadrant, Barman isn’t the only game in town for backup management. It seems backing up huge installations was a problem long enough that several developers solved it almost simultaneously. Before Barman had it, pgBackRest was designed around hard links and incremental backups of gargantuan 1TB+ systems. Unfortunately I don’t know how they compare because I haven’t tested them. Why not? Because even I got caught up in the race for a better backup tool.

One of our databases first tipped the 1TB scale around 2013 and pg_basebackup was consuming inordinate amounts of time. So I threw together a bunch of scripts that leveraged hard links, parallel compression with pigz, and unleashed it on a replica of our production system. It was twelve times faster on the first iteration thanks to the parallel compression, and over 100 times faster while using hard links to produce incrementals. I eventually rewrote it as a more robust set of Python libraries, but pgBackRest and Barman finally made that last step, rendering my efforts effectively moot.
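The original scripts are long gone into that Python rewrite, but the core tricks were nothing exotic. Here’s a stripped-down sketch of the idea, with every path and parallelism setting invented for illustration, run against a replica or wrapped in pg_start_backup / pg_stop_backup as described earlier:

# Full backup: pipe a copy of the data directory through pigz so
# compression uses every CPU instead of just one.

tar -cf - -C /data/pgsql/main data | pigz -p 32 > /backups/full.tar.gz

# Incremental: rsync with --link-dest hard-links unchanged files to the
# previous backup, so only modified files consume any space or time.

rsync -a --delete --link-dest=/backups/daily.1 \
      /data/pgsql/main/data/ /backups/daily.0/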

I still use my libraries for now, but the extra management capabilities the other tools supply are somewhat enticing. Want some idea how much better Barman or pgBackRest will perform than relying on pg_basebackup? We have a 50TB database that grows by 200GB per day, which many might consider slightly cumbersome. My homegrown tool transforms that into a 7TB compressed backup in four hours, while the 24GB nightly incrementals usually require about 20 minutes. In effect, we back up a 50TB database cluster in 20 minutes.

And yes, I’ve tested restoring from this backup. Restoring takes a while, but it’s also done in parallel. There’s little that’s more satisfying than watching a system with 32 CPUs pegged at 100% decompressing 50TB worth of data as fast as the underlying filesystem is capable of writing.

But I was only able to do that because I’m intimately familiar with our tools and requirements. Our company process standards revealed limitations in the supplied Postgres backup tooling, prompting me to find or create an alternative. There is no substitute for established protocol, despite the fact that building those procedures might not have an immediately obvious benefit. The payoff is avoiding downtime, and that’s not something you can plan for after the fact.

It’s the difference between six hours of lost data, and six seconds.

PG Phriday: Alien Incursion

Foreign tables have been a headline feature of Postgres ever since the release of version 9.1. Combined with extensions, they’re the secret sauce that allows Postgres to pull data from other database engines, flat files, REST interfaces, and possibly every gas station, residence, warehouse, farmhouse, hen house, outhouse, and doghouse in the area.

Postgres all the things

But that kind of power comes at a significant cost. Since the remote data comes from what is essentially a black box, there are a lot of performance optimizations Postgres can’t apply. Foreign data wrappers are also still somewhat in their infancy—even the Postgres foreign data wrapper lacks the ability to simultaneously join and sort on the remote server in the same query.

That makes data federation incredibly painful and could reduce the feature to nothing but an amusing toy. It’s functional but slow since it’s likely data processing is done locally, potentially after retrieving the entire contents of the remote table. Of course, this also means several disparate sources can all be handled within the confines of our Postgres database since it has access to all of the data… eventually.

Is there a better way? Since the introduction of materialized views in 9.3, it has been possible to trivially take a snapshot of a remote table, or some subsection of its contents, and use it repeatedly. Unlike the retrieval Postgres executes within a query on a foreign object, materialized views allow remote data to persist locally.

With that in mind, let’s set up a very basic sensor database that merely accumulates readings.

CREATE TABLE sensor_log (
  id            INT PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);
 
INSERT INTO sensor_log (id, location, reading, reading_date)
SELECT s.id, s.id % 1000, s.id % 100,
       CURRENT_DATE - ((s.id * 10) || 's')::INTERVAL
  FROM generate_series(1, 5000000) s(id);
 
CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
 
ANALYZE sensor_log;

It’s a table we use frequently in examples, and there’s no reason it can’t be used here. If we connect to any other Postgres database, we can execute the following SQL to produce a standard foreign table.

For the purposes of this example, let’s just assume the database is on the same server and the server is somewhat trusting with connections in that context. Properly securing a foreign server is a topic for a different day.

CREATE SCHEMA ext_info;
 
CREATE EXTENSION postgres_fdw;
 
CREATE SERVER sys_sensors
       FOREIGN DATA WRAPPER postgres_fdw
       OPTIONS (dbname 'sensors', host 'sensor-host', port '5432');
 
CREATE USER MAPPING FOR postgres
       SERVER sys_sensors
       OPTIONS (USER 'postgres');
 
IMPORT FOREIGN SCHEMA public
 LIMIT TO (sensor_log)
  FROM SERVER sys_sensors
  INTO ext_info;
 
ANALYZE ext_info.sensor_log;
 
SELECT COUNT(*) FROM ext_info.sensor_log;
 
  count  
---------
 5000000

Take note that we’ve leveraged the IMPORT FOREIGN SCHEMA feature added in Postgres 9.5. This makes it much easier to reflect a remote data source since we don’t necessarily need to know the structure of the objects we’re importing. Either way, it worked and we were able to query the contents. So far there are no surprises here.

Now let’s wrap our external table with a very limited view that essentially fetches the entire contents of the remote table.

CREATE SCHEMA ext_materialized;
 
CREATE MATERIALIZED VIEW ext_materialized.sensor_log AS
SELECT * FROM ext_info.sensor_log;
 
CREATE UNIQUE INDEX pk_ext_sensor_log
    ON ext_materialized.sensor_log (id);
 
CREATE INDEX idx_ext_sensor_log_reading_date
    ON ext_materialized.sensor_log (reading_date);
 
ANALYZE ext_materialized.sensor_log;
 
EXPLAIN
SELECT *
  FROM ext_materialized.sensor_log
 WHERE reading_date >= CURRENT_DATE - INTERVAL '1 day';
 
                            QUERY PLAN                         
-----------------------------------------------------------------
 Index Scan using idx_ext_sensor_log_reading_date on sensor_log
       (cost=0.44..304.87 rows=8596 width=23)
   Index Cond: (reading_date >= 
       (('now'::cstring)::date - '1 day'::interval))

We added an index on the reading_date column and analyzed the contents to illustrate that materialized views act just like regular tables in very important ways. Postgres can collect statistics, index, and build views on top of materialized views as if they were regular tables. This means we can treat them like a local cache of remote data, with all of the ancillary benefits the Postgres planner provides for our own data.

When the remote data changes, we must rebuild the materialized view to capture the new data. Here’s how that works:

-- In the sensors database:
 
INSERT INTO sensor_log (id, location, reading, reading_date)
SELECT s.id, s.id % 1000, s.id % 100,
       CURRENT_DATE + ((s.id * 10) || 's')::INTERVAL
  FROM generate_series(5000001, 5001000) s(id);
 
-- In the database where we have the materialized view:
 
\timing on
 
REFRESH MATERIALIZED VIEW CONCURRENTLY ext_materialized.sensor_log;
 
Time: 117734.945 ms
 
ANALYZE ext_materialized.sensor_log;

Just… ouch. Why did that take so long? We supplied the CONCURRENTLY keyword to avoid locking the view during the refresh. This means Postgres fetches the data and compares it to the existing view contents, inserting and deleting only the data that changed since the last refresh. That’s a pretty intensive procedure, and for higher data volumes, extremely inefficient. Yet we added the “missing” content without preventing other users from accessing the materialized view, which is certainly advantageous if we don’t mind the refresh delay.

We can improve this process a number of ways. If we consider our data and how we use it, we might realize we don’t need the entire contents of the remote table. In an analysis context, maybe we only need the most recent day of readings to compare with information we’ve gathered from other sources. Since Postgres doesn’t transmit calculated predicates to the remote server, we also need to leverage the ability to create foreign tables based on remote views.

Here’s how that would look:

-- In the sensors database:
 
CREATE VIEW sensor_log_today AS
SELECT *
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE;
 
-- In the database where we have the materialized views:
 
IMPORT FOREIGN SCHEMA public
 LIMIT TO (sensor_log_today)
  FROM SERVER sys_sensors
  INTO ext_info;
 
CREATE MATERIALIZED VIEW ext_materialized.sensor_log_today AS
SELECT * FROM ext_info.sensor_log_today;
 
CREATE UNIQUE INDEX pk_ext_sensor_log_today
    ON ext_materialized.sensor_log_today (id);
 
CREATE INDEX idx_ext_sensor_log_today_reading_date
    ON ext_materialized.sensor_log_today (reading_date);
 
ANALYZE ext_materialized.sensor_log_today;
 
\timing on
 
REFRESH MATERIALIZED VIEW CONCURRENTLY ext_materialized.sensor_log_today;
 
Time: 17.531 ms

That’s a bit of an improvement. This kind of quick snapshot system is a solid foundation for a virtual database that contains none of its own data, and only exists for analytics. If we need to retain data for longer periods of time, we can build our own data loading process that references the smaller snapshot.

Something like this:

-- Bootstrap the giant repo of sensor readings.
 
CREATE SCHEMA pile;
 
CREATE TABLE pile.sensor_log (LIKE ext_info.sensor_log);
 
INSERT INTO pile.sensor_log
SELECT * FROM ext_info.sensor_log;
 
CREATE UNIQUE INDEX pk_pile_sensor_log
    ON pile.sensor_log (id);
 
CREATE INDEX idx_pile_sensor_log_reading_date
    ON pile.sensor_log (reading_date);
 
ANALYZE pile.sensor_log;
 
-- Now a "refresh" looks like this:
 
REFRESH MATERIALIZED VIEW CONCURRENTLY ext_materialized.sensor_log_today;
 
ANALYZE ext_materialized.sensor_log_today;
 
INSERT INTO pile.sensor_log
SELECT *
  FROM ext_materialized.sensor_log_today
 WHERE id > (SELECT MAX(id) FROM pile.sensor_log);

We may have already executed the bootstrapping section any time in the past. The refresh, on the other hand, lets us “top up” our local cache with the remote data as often as we want. The remote table might even be regularly truncated, but it doesn’t matter to us; we always see the most recent information and we can handle it as we please.

All of these tables and materialized views are driven by remote data. If we were working with several customers, we might perform aggregate analysis across all of them in this virtual staging area using cumulative snapshot updates. We could decide to retain this information for a year or two and cycle anything older, or any other retention period that matches our reporting needs.

This works the same way across environments. Perhaps we want to combine data from a production and user acceptance context, but only for the last work week. Though this kind of transient architecture isn’t exactly trivial, at least the possibility exists.

The primary caveat to relying on foreign structures so extensively is that we might not have direct control over the remote objects. If a column type is changed for example, our foreign table would stop working. Depending on how deeply nested all of our materialized views are, rebuilding in the face of remote modifications will likely incur significant maintenance costs. Even if we do control the infrastructure and software on the entire stack, we’ve contributed a sprawling decoupled dependency chain to the mix.

But that’s a small price to pay for the freedom to have our hand in every cookie jar.

cookie!

PG Phriday: Everything in Common

Not a lot of people remember what Postgres was like before version 8.4. In many ways, this was the first “modern” release of the database engine. CTEs, Window Functions, column level permissions, in-place upgrade compatible with subsequent versions, collation support, continuous query statistic collection; it was just a smorgasbord of functionality.

Of these, CTEs, or Common Table Expressions, probably enjoy the most user-level exposure, and for good reason. Before this, there was no way to perform a recursive query in Postgres, which really hurts in certain situations. Want to display all related child threads in an online discussion? How about fetching the components of an organization chart by following management assignments? Better get ready for a lot of queries in a loop.

In addition to that, complicated queries were difficult to logically simplify. Reporting queries are especially prone to frequent sequences of aggregates and subqueries. It’s not uncommon to build a query that’s several pages long in this kind of context. Optimizing such an unwieldy beast is often difficult or even impossible simply due to all of the components and confusing nesting.

CTEs changed these things for the better and in the eyes of many, finally brought Postgres to parity with Oracle and its long-established recursive query support. So let’s explore what CTEs really deliver, and how they can improve our Postgres experience—caveats and all.

Let’s start with a trivial table and some data:

CREATE TABLE employee 
(
  employee_id  SERIAL PRIMARY KEY,
  full_name    VARCHAR NOT NULL,
  manager_id   INT REFERENCES employee
);
 
INSERT INTO employee (full_name, manager_id) VALUES
  ('King Randor', NULL),
  ('Prince Adam', 1),
  ('Teela', 2),
  ('Man-at-Arms', 2),
  ('Skeletor', NULL),
  ('Evil-Lyn', 5),
  ('Trap Jaw', 5),
  ('Clawful', 6);

It’s easy enough to display the management relationships. Here’s how our cartoon cohorts look with a basic JOIN:

SELECT m.full_name AS boss, e.full_name AS goon
  FROM employee e
  JOIN employee m ON (m.employee_id = e.manager_id)
 ORDER BY e.manager_id;
 
    boss     |    goon     
-------------+-------------
 King Randor | Prince Adam
 Prince Adam | Teela
 Prince Adam | Man-at-Arms
 Skeletor    | Evil-Lyn
 Skeletor    | Trap Jaw
 Evil-Lyn    | Clawful

In this trivial example, we can visually follow the results and understand that Clawful is ultimately a minion of Skeletor. We could also leverage our knowledge that the organization chart is only three levels deep and employ a third join to fully represent all relationships. But such a shallow corporate hierarchy is exceedingly rare, so let’s use a CTE to walk the full hierarchy instead.

WITH RECURSIVE org_tree AS (
    SELECT NULL::VARCHAR AS boss, *, 0 AS level,
           employee_id AS end_boss
      FROM employee
     WHERE manager_id IS NULL
    UNION ALL
    SELECT t.full_name AS boss, e.*, t.level + 1 AS level,
           t.end_boss
      FROM employee e
      JOIN org_tree t ON (t.employee_id = e.manager_id)
)
SELECT repeat(' ', level * 5) || full_name AS relationship
  FROM org_tree
 ORDER BY end_boss, level;
 
     relationship      
-----------------------
 King Randor
      Prince Adam
           Teela
           Man-at-Arms
 Skeletor
      Trap Jaw
      Evil-Lyn
           Clawful

Well that’s quite an improvement! But how does it work?

Our initial clue is the first query within the CTE. Other databases may do this differently, but Postgres creates a temporary in-memory table to act as a holding area to represent the CTE contents as they’re constructed. When we specify the RECURSIVE decorator, we gain the ability to bootstrap that temporary data with one query. The second query can then refer to the cumulative result in each iteration of the recursion.

The result is one query that loops in on itself three times in our example. We took advantage of this by adding a new column to track how deep the recursion is so we can visualize this more easily. Here’s what the contents of the “tree” table look like for each phase:

WITH RECURSIVE tree AS (
    SELECT NULL::VARCHAR AS boss, *, 0 AS level
      FROM employee
     WHERE manager_id IS NULL
    UNION ALL
    SELECT t.full_name AS boss, e.*, t.level + 1 AS level
      FROM employee e
      JOIN tree t ON (t.employee_id = e.manager_id)
)
SELECT * FROM tree;
 
    boss     | employee_id |  full_name  | manager_id | level 
-------------+-------------+-------------+------------+-------
             |           1 | King Randor |            |     0
             |           5 | Skeletor    |            |     0
 King Randor |           2 | Prince Adam |          1 |     1
 Skeletor    |           6 | Evil-Lyn    |          5 |     1
 Skeletor    |           7 | Trap Jaw    |          5 |     1
 Prince Adam |           3 | Teela       |          2 |     2
 Prince Adam |           4 | Man-at-Arms |          2 |     2
 Evil-Lyn    |           8 | Clawful     |          6 |     2

Each “level” here represents one dive into the employee table to fetch employees of the employees already listed. This loop naturally terminates once every boss is listed in the results. But there’s one flaw in this particular construction: what if we wanted to choose any grunt and see the whole chain of command from that point? To do that, we need to modify the CTE slightly to incorporate our desired predicate in the CTE portion itself so we can follow the relationship properly.

Here’s how that looks:

WITH RECURSIVE tree AS (
    SELECT *, 0 AS level
      FROM employee
     WHERE full_name = 'Clawful'
    UNION ALL
    SELECT e.*, t.level + 1 AS level
      FROM tree t
      JOIN employee e ON (e.employee_id = t.manager_id)
)
SELECT full_name
  FROM tree
 ORDER BY level DESC;
 
 full_name 
-----------
 Skeletor
 Evil-Lyn
 Clawful

Not bad, eh? We had to flip the JOIN because we started with a specific minion instead of the list of all executives. Then we followed the chain backwards, adding one middle-management peon per iteration until we reached the End Boss. We could combine this kind of trickery by writing a CTE that refers to another CTE and produce a query that would output the entire organization given any member in the hierarchy. We won’t, because that’s a gigantic and rather ugly query, but the capability is there.

What we can do, is demonstrate using CTEs to logically separate query fragments of a larger whole. In the past, a reporting query might consist of an imposing bulk of awkward subqueries to produce necessary aggregates and decode or label various summaries. In the worst cases, such queries might meander for dozens of pages. It’s often a miracle the end result executes at all, and debugging it is equally problematic.

Here’s how we might use CTEs to solve that conundrum:

WITH RECURSIVE org_tree AS (
    SELECT NULL::VARCHAR AS boss, *, 0 AS level,
           employee_id AS end_boss
      FROM employee
     WHERE manager_id IS NULL
    UNION ALL
    SELECT t.full_name AS boss, e.*, t.level + 1 AS level,
           t.end_boss
      FROM employee e
      JOIN org_tree t ON (t.employee_id = e.manager_id)
),
org_stats AS (
  SELECT m.full_name AS ceo, COUNT(*)-1 AS minions,
         MAX(level) AS cruelty
    FROM org_tree org
    JOIN employee m ON (m.employee_id = org.end_boss)
   GROUP BY m.full_name
),
org_attributes AS (
  SELECT m.full_name AS ceo,
         SUM(1) FILTER (WHERE org.full_name ILIKE '%evil%') AS evil,
         SUM(1) FILTER (WHERE org.full_name ILIKE '%prince%' OR
                              org.full_name ILIKE '%king%') AS royalty
    FROM org_tree org
    JOIN employee m ON (m.employee_id = org.end_boss)
   GROUP BY m.full_name
)
SELECT st.*, atr.evil, atr.royalty
  FROM org_stats st
  JOIN org_attributes atr USING (ceo);
 
     ceo     | minions | cruelty | evil | royalty 
-------------+---------+---------+------+---------
 King Randor |       3 |       2 |      |       2
 Skeletor    |       3 |       2 |    1 |

The first portion of the query is just our previous recursive attempt to flatten the organization chart and see how everything is related. The second summarizes basic statistics like employee count and maximum abstraction through middle-management. The third is just a bunch of miscellaneous attributes that might be interesting in a report. All of our examples are trivial, but in a real report, each of these may reflect much more comprehensive aggregates and formulas. Yet despite query complexity, we can determine the end goal of a fragment at a glance. Combine this with SQL comments, and we have a very user-friendly report.

Of course, CTEs are not all sunshine and roses. Remember when we said a CTE is built in a temporary memory location to facilitate recursive functionality and allow CTEs to reference each other? A consequence is that every CTE acts as what we call an optimization fence.

Normally before a query is executed, it is broken down into its component parts and the planner translates those elements into execution instructions. This might mean collapsing certain conditionals, simplifying or substituting a subquery, pushing predicates down into a stack for better row elimination, and so on.

When the planner encounters a CTE however, it can go no further. It will optimize the CTE query itself, but it does so as an encapsulated black box. Even if a WHERE clause from the referring query could greatly reduce matched rows during the CTE execution, that optimization cannot be applied. The CTE executes as written as if we had done this instead:

CREATE TEMP TABLE my_cte_chunk AS
SELECT ...

This applies to every CTE in a query. It’s better to think of each CTE as a virtual temporary table. While that allows each CTE to refer to the entire contents of another CTE, it also means we may lose several opportunities to optimize a query. It’s not uncommon to unroll a CTE and receive a much faster query in return. Query planners are complex beasts, and like any software compiler, may simplify necessary instructions by eliminating entire branches from the execution tree due to redundancy or empty result paths. Using a CTE reduces the planner’s ability to do that.
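As a contrived illustration of what “unrolling” means, consider the employee table from earlier. These two queries return the same row, but on current Postgres versions only the second lets the planner collapse the subquery and use the primary key index; the first must materialize every employee before filtering:

-- The CTE is evaluated in full before the outer WHERE is applied.
WITH staff AS (
    SELECT * FROM employee
)
SELECT * FROM staff WHERE employee_id = 8;
 
-- The equivalent subquery is pulled up, so the index is used instead.
SELECT *
  FROM (SELECT * FROM employee) sub
 WHERE employee_id = 8;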

On the other hand, an experienced user can leverage this knowledge to their benefit. Since the query planner cannot penetrate optimization fences, it means we can override its decision tree. When the data or statistics indicate the planner will improperly prefer a highly inefficient plan, we can force it along an improved path. In these cases, we’re actively trading the potential for future planner improvements for immediate advantage.

The primary argument here is that the planner improvements we need may not arrive for years, or at all. Can we justify suffering bad performance for an undetermined length of time until some nebulous future planner addresses our obscure data edge case? Often the answer to this question is ‘no’. In the rare instances where this justification applies, leveraging optimization fences is probably a safe bet. At least we have the option!

In the end, Postgres improved its reputation among power users, and we gained a versatile tool that enabled the previously impossible. New recursion, simplification, and optimization options, all from a single feature? Yes, please!

PG Phriday: Why Postgres

There is a smorgasbord of database engines out there. From an outside perspective, Postgres is just another entry on a steadily growing pile of structured data storage mechanisms. Much as with programming languages like Rust and Go, it’s the new and shiny database systems like MongoDB that tend to garner the most attention. On the other hand, more established engines like Oracle or MySQL have a vastly larger lead that seems insurmountable. In either case, enthusiasm and support are likely to be better represented in exciting or established installations.

So why? Why out of the myriad choices available, use Postgres? I tend to get asked this question by other DBAs or systems engineers that learn I strongly advocate Postgres. It’s actually a pretty fair inquiry, so why not make it the subject of the first PG Phriday for 2017? What distinguishes it from its brethren so strongly that I staked my entire career on it?

Boring!

Postgres isn’t new. It didn’t enjoy the popularity that practically made MySQL a household name as part of the LAMP stack. It didn’t infiltrate corporations several decades ago as the de facto standard for performance and security like Oracle. It isn’t part of a much larger supported data environment like SQL Server. It isn’t small and easy like SQLite. It’s not a distributed hulk like Hadoop, or enticingly sharded like MongoDB or Cassandra. It’s not in-memory hotness like VoltDB.

It’s just a regular, plain old ACID RDBMS.

You can't explain that!

It does, after all, have all of the basics many expect in an RDBMS:

  • Tables, Views, Sequences, etc.
  • Subqueries
  • Functions in various languages
  • Triggers
  • Point In Time Recovery

Certain… other database platforms weren’t so complete. As a consequence, Postgres was preferred by those who knew the difference and needed that extra functionality without breaking the bank. It’s not much, but it’s a great way to develop a niche. From there, things get more interesting.

Durababble

For the most part, being boring but reliable was a fact of life until the release of 9.0 when Postgres introduced streaming replication and hot standby. Postgres was still a very capable platform before that juncture, but built-in high-availability made it more viable in a business context. Now the secondary copy could be online and supply query results. Now the replica would lag behind the primary by a small handful of transactions instead of entire 16MB segments of transaction log files.

Postgres had finally joined the rest of the world in that regard. MySQL used a different mechanism, but that was one of its selling-points for years before Postgres. The primary distinction is that Postgres streams the changes at a binary level, meaning very little calculation is necessary to apply them. As a result, Postgres replicas are much less likely to fall behind the upstream primary.

The second I tested this feature in 2010, any lingering doubts about the future of Postgres vanished.

Cult of Extensibility

A future version of Postgres. Probably.

The next huge—and arguably most important—advancement in Postgres accompanied the release of 9.1: extensions. The true implications here are hard to overstate. Not all of the internal API is exposed, but extensions make it possible for practically any inclined individual to just bolt functionality onto Postgres. When 9.1 also added foreign data wrappers, even arbitrary alternative backends became a possibility.

Hijack the query planner to route data through video card GPUs? Got it. Add basic sharding and distributed query support? No problem. Interact with Cassandra, SQL Server, or even Facebook? Easy peasy. Store data in an analytics-friendly column structure? Child’s play.

Perl has the dubious honor of being labeled the Swiss Army Chainsaw of languages because it enables a practitioner to do anything. Extensions convey almost that same capability to Postgres. And while a badly written extension can crash your database, good ones can elevate it beyond the imaginations and time constraints of the core developers.

Extensions that provide enough added value have even inspired fully supported internal adaptations, as in the case of materialized views in Postgres 9.3. What other database does that?

Consider what happens when these features are combined.

  1. Create a materialized view that refers to a remote table.
  2. Refresh the above view before using it in a report.
  3. Alternatively, siphon updated rows from the view into a more permanent aggregate summary table.
  4. Get local data processing performance in ad-hoc analytics over heterogeneous platforms.

Now Postgres can be the central nexus for a constellation of various data environments and provide snapshot analysis for the live data. Without a convoluted ETL infrastructure. Technically the materialized views or intermediate aggregate tables aren’t strictly necessary, so Postgres wouldn’t even need to store actual data. Such a configuration would be hilariously slow, but now the ironic scenario exists where Postgres can power a database empty of actual contents.

That combination transformed Postgres into a platform, and it’s one of the reasons I don’t use the SQL part of PostgreSQL anymore.

Developers, Developers, Developers!

Ballmer likes developers

The folks hacking on the Postgres code are both insanely skilled and notoriously available. It’s almost a running joke to guess which of the core devs will answer a basic SQL question first. There’s practically a race to answer questions in the mailing lists regardless of sophistication or perceived merit, and anyone subscribed to the list can participate.

Their dedication to fostering community interaction is unrelenting. While not quite as organized as the Linux kernel developers thanks to Linus’ role as benevolent dictator, they’ve pushed Postgres forward every year. Due to their strict commit-fests and automated testing and code review, they’ve delivered a stable update roughly every year since 2008. Is there another database engine that can boast the same?

And every release has at least one headliner feature that makes upgrading worth the effort. Every. Last. Version.

  • 8.4: Window functions + CTEs
  • 9.0: Streaming replication
  • 9.1: Foreign tables, extensions
  • 9.2: Cascading replication, JSON support
  • 9.3: Materialized views, event triggers, data checksums
  • 9.4: JSONB, background workers, logical WAL decoding
  • 9.5: Upsert
  • 9.6: Parallel execution
  • 10.0?: Standby quorum, native table partitioning

While it would be wrong to demand that kind of dedication and quality, appreciating it is quite a different story. The community pushes Postgres forward because the devs give it a voice. That’s rare in any project.

In the end, I consider it a privilege to even participate from the sidelines. Is it perfect? Of course not; I’ve pointed out serious flaws in Postgres performance that have yet to be successfully addressed. Yet given the alternatives, and what Postgres really delivers when it’s fully leveraged, I can’t think of a commercial RDBMS I’d rather use.

Why Postgres? Maybe the better question is: why not?

PG Phriday: Who Died and Made You Boss?!

Postgres is great, but it can’t run itself in all cases. Things come up. Queries go awry. Hardware fails, and users leave transactions open for interminable lengths of time. What happens if one of these things occur while the DBA themselves has a hardware fault? While they’re down for maintenance, someone still has to keep an eye on things. For the last PG Phriday of the year completely unrelated to my upcoming surgery, let’s talk about what happens when your DBA becomes inoperative due to medical complications.

This is Fake Postgres DBA 101!

Getting Around

When in doubt, SSH is the name of the game. Database accounts are either locked-down or lack sufficient permissions to do everything, but usually the postgres user itself has total access. Invoke your favorite SSH client to connect to the host running the database in question, and use sudo to become the postgres user:

ssh my-db-host
sudo su -l postgres

Afterwards, it’s a good idea to add our public key to the postgres user’s .ssh/authorized_keys file so we can log in as the postgres user without the intermediate sudo. Here’s a good guide for doing that. If there’s configuration management like salt or Puppet involved, that file is probably part of the stack and needs to be modified there or it’ll be overwritten.
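In case that guide isn’t handy, the short version amounts to appending our public key to that file. The key name and home directory here are assumptions, and ssh-copy-id only works if we can already authenticate as the postgres user somehow:

# From our own workstation, copy our default public key into the
# postgres user's authorized_keys on the database host.

ssh-copy-id postgres@my-db-host

# Or, if we're already on the host as postgres, paste the key manually.

cat /tmp/my_key.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys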

Either way, we’re in. If we’re lucky enough to be on a Debian derivative like Ubuntu or Mint, we can use the pg_lsclusters command to see which database instances are running on this server. Here’s what that looks like on a sample VM:

pg_lsclusters 
 
Ver Cluster Port Status Owner    Data directory          Log file
9.5 main    5432 online postgres /data/pgsql/main/data   /var/log/postgresql/postgresql-9.5-main.log
9.6 96test  5440 online postgres /data/pgsql/96test/data /var/log/postgresql/postgresql-9.6-96test.log

The next thing we need is the psql command-line client—the nexus of Postgres interoperability. Postgres instances monitor the listed port for connection attempts, so if we wanted to target a specific instance, we’d want to refer to this output. The default is 5432 if we don’t pass any value.

Let’s get a list of all databases in the 96test Postgres instance on this system.

psql -p 5440 -l
 
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
-----------+----------+----------+-------------+-------------+-----------------------
 examples  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 sensors   | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(5 rows)

I’ve covered the template databases, and they can safely be ignored. The only “real” databases in this installation are examples, postgres, and sensors. The postgres database is a default and is often used as scratch space; hopefully nobody is using it as their actual production database.

Each of these is distinct and cannot interact with the data contained in the others. Admins familiar with other database engines might find this odd, but that’s how Postgres works. Usually databases that require lots of data sharing use schemas to keep tables in namespaces within the database. And of course, I have a long-winded explanation of this as well.

Either way, we have a list of databases from the instance we want to examine. How do we connect? Well, if the -l flag shows us a list of available databases, what happens if we remove it and append the name of a database?

psql -p 5440 sensors
 
psql (9.6.1)
Type "help" for help.
 
sensors=# SELECT 'Hello World!';
 
   ?column?   
--------------
 Hello World!
(1 row)
 
sensors=# help
 
You are using psql, the command-line interface to PostgreSQL.
Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

Sweet! We connected to the sensors database, successfully executed a basic query, and got a bit of assistance from the client software itself. The psql client is an extremely robust tool for interacting with a Postgres database. There are a lot of shortcut commands, and entering \? lists all of them. Be ready to scroll!

The one that really matters is \d and its variants. It’s short for “describe” and does exactly that. It can retrieve lists of tables, schemas, views, indexes, or any other database object. It can also provide greater detail about the object in question. Let’s use it to list the available schemas, see the contents of a schema, and get more insight on a particular table.

sensors=# \dn
 
  List of schemas
  Name  |  Owner   
--------+----------
 logs   | postgres
 public | postgres
(2 rows)
 
sensors=# \dt logs.*
 
           List of relations
 Schema |    Name    | Type  |  Owner   
--------+------------+-------+----------
 logs   | sensor     | table | postgres
 logs   | sensor_log | table | postgres
(2 rows)
 
sensors=# \d logs.sensor
 
                                           Table "logs.sensor"
    Column     |            Type             |                         Modifiers                          
---------------+-----------------------------+------------------------------------------------------------
 sensor_id     | integer                     | not null default nextval('sensor_sensor_id_seq'::regclass)
 location      | character varying           | not null
 reading       | bigint                      | not null
 modified_date | timestamp without time zone | not null
Indexes:
    "sensor_pkey" PRIMARY KEY, btree (sensor_id)
    "idx_sensor_location" btree (location)

The \dn command shows the namespaces in the current database. From that list, we see that the logs schema might have something interesting in it. So we then rely on \dt to list all of the tables the logs schema contains. From there, we can just use the regular \d describe command to get all of the information Postgres has about the structure of the logs.sensor table, complete with indexes, constraints, triggers, and so on.

Now that we can properly introspect to see where we are, it’s time to look deeper.

Peering Into the Void

A common action upon connecting to a database is to view connections and see what they’re doing. This is when we turn to the Postgres system catalog, a series of tables and views that reflect the current state of the database. The pg_stat_activity view will tell us anything we need to know regarding user connections. For example:

SELECT pid, datname, usename, state
  FROM pg_stat_activity;
 
  pid  | datname | usename  |        state        
-------+---------+----------+---------------------
 10392 | sensors | postgres | active
 16202 | sensors | postgres | idle in transaction
 
\x
Expanded display is on.
 
SELECT * FROM pg_stat_activity WHERE pid=16202;
 
-[ RECORD 1 ]----+------------------------------------
datid            | 16384
datname          | sensors
pid              | 16202
usesysid         | 10
usename          | postgres
application_name | psql
client_addr      | 
client_hostname  | 
client_port      | -1
backend_start    | 2016-12-16 08:58:44.584559-06
xact_start       | 2016-12-16 08:58:46.34946-06
query_start      | 2016-12-16 08:59:01.336305-06
state_change     | 2016-12-16 08:59:01.336895-06
wait_event_type  | 
wait_event       | 
state            | idle in transaction
backend_xid      | 
backend_xmin     | 
query            | SELECT * FROM logs.sensor LIMIT 10;

The first query is a pretty basic terse listing of the activity of all clients. We noticed one of them was idle within a transaction and decided to view everything Postgres knew about that connection. From the various listed times, we can see that it’s only been idle for a few seconds, so there’s no cause for alarm. We can also see the last query the user executed against the database, even if it completed long ago. This is all useful debugging information.

Viewing all of that as a single row would have been extremely inconvenient, so we utilized another of the psql commands and used \x to enable extended output. When this option is active, psql presents every column within a result as a key/value row pair. It’s not especially convenient for hundreds or thousands of rows, but it’s pretty indispensable when viewing the results of a wide column list.

A good combination of the columns in the pg_stat_activity view might look something like this:

-- This query for 9.6
 
SELECT pid, datname, usename, state, client_addr, wait_event,
       now() - query_start AS duration,
       SUBSTRING(query, 1, 30) AS query_part
  FROM pg_stat_activity
 WHERE state != 'idle'
 ORDER BY now() - query_start DESC
 LIMIT 10;
 
-- This query for 9.2 to 9.5
 
SELECT pid, datname, usename, state, client_addr, waiting,
       now() - query_start AS duration,
       SUBSTRING(query, 1, 30) AS query_part
  FROM pg_stat_activity
 WHERE state != 'idle'
 ORDER BY now() - query_start DESC
 LIMIT 10;
 
  pid  | datname | usename  |        state        | client_addr | wait_event |    duration     |           query_part           
-------+---------+----------+---------------------+-------------+------------+-----------------+--------------------------------
 22503 | sensors | postgres | idle in transaction | 10.2.2.91   |            | 00:38:21.952753 | SELECT * FROM logs.sensor LIMI
 10392 | sensors | postgres | active              |             |            | 00:00:00        | SELECT pid, datname, usename,

This query only reports non-idle connections prioritized by activity duration. From here, we can see that the connection that was idle in transaction is still idle several minutes later. Nothing is waiting on it, but maybe it’s time to clean up a bit anyway.

The Terminator

Maybe that idle connection is blocking someone and it needs to go away. There are two ways to make that happen directly within Postgres. The first and safest option is to try and cancel the offending query with pg_cancel_backend. If that doesn’t work, we escalate to pg_terminate_backend which actually breaks the connection and rolls back any pending transactions. Let’s try those now:

SELECT pg_cancel_backend(22503);
 
 pg_cancel_backend 
-------------------
 t
 
SELECT pid, state, query
  FROM pg_stat_activity
 WHERE pid = 22503;
 
  pid  |        state        |                query                
-------+---------------------+-------------------------------------
 22503 | idle in transaction | SELECT * FROM logs.sensor LIMIT 10;
(1 row)
 
SELECT pg_terminate_backend(22503);
 
 pg_terminate_backend 
----------------------
 t
 
SELECT pid, state, query
  FROM pg_stat_activity
 WHERE pid = 22503;
 
 pid | state | query 
-----+-------+-------
(0 rows)

The pg_cancel_backend function didn’t “work” because it only cancels the currently operating query. If there’s a transaction in place, the user will simply return to their database prompt. An application that was stuck may be fixed by canceling a stuck query. If the application is threaded, it may have started a transaction and is injecting commands as another portion of the program does work elsewhere. If that process gets interrupted, the transaction may never complete.

That’s when we roll out pg_terminate_backend, which discards such polite approaches. This is the kill shot most DBAs are familiar with. Queries are canceled. Transactions are rolled back. Connections are broken. Resources are reclaimed. This isn’t quite cold-blooded murder, though. From the perspective of Postgres, we’re merely pointing a rifle at the offending connection and politely requesting it to vacate the premises.

Unlimited Cosmic Power

For DBAs more comfortable with GUIs, pgAdmin might be a better interface than psql. Version 3 is a stand-alone client for Windows, Linux, or OSX. In smaller environments, it’s not entirely uncommon to already have connection capabilities to servers hosting Postgres instances. Just connect and thrill in exploring the drilldown menus, object creation scripts, query interface, and all of the other niceties it provides.

Many of the operations that really matter, like terminating unruly connections, are only available to superusers. There are a couple ways to get this kind of access. In a pinch, it’s easy to connect to the host server as above and grant superuser access directly to ourselves from psql:

ALTER USER sthomas WITH SUPERUSER;
 
\du sthomas
 
           List of roles
 Role name | Attributes | Member of 
-----------+------------+-----------
 sthomas   | Superuser  | {}

Hopefully though, the current DBA set aside a dedicated role for this kind of thing. Giving users direct superuser access is usually frowned upon. Imagine we have a sysdba role with superuser capabilities and we want to share in the glory. This would be the correct approach:

ALTER USER sthomas WITH NOSUPERUSER;
 
GRANT sysdba TO sthomas;
 
\du s*
 
                  List of roles
 Role name |       Attributes        | Member of 
-----------+-------------------------+-----------
 sthomas   |                         | {sysdba}
 sysdba    | Superuser, Cannot login | {}

However, we should note that since it’s the role that holds superuser access, we aren’t actually superusers ourselves. As with sudo, we need to “activate” our new powers before they’ll work. Postgres allows users to adopt roles they’re members of, so if we wanted to perform superuser actions from pgAdmin, we could issue this command from a query pane:

SET ROLE sysdba;

Is there more? Oh, there’s a lot more. But this is enough for a competent techie to start digging around and becoming familiar with the territory. Quick, seize control while the DBA isn’t around to stop you! They’ll never know! Viva la resistance!

Of course, an experienced Postgres DBA will know and revoke your abilities later, but it’s fun to pretend. Take the opportunity to learn more anyway; just be careful on production systems.