PG Phriday: Everything in Common

Not a lot of people remember what Postgres was like before version 8.4. In many ways, this was the first “modern” release of the database engine. CTEs, Window Functions, column level permissions, in-place upgrade compatible with subsequent versions, collation support, continuous query statistic collection; it was just a smorgasbord of functionality.

Of these, CTEs or Common Table Expressions, probably enjoy the most user-level exposure; for good reason. Before this, there was no way to perform a recursive query in Postgres, which really hurts in certain situations. Want to display all related child threads in an online discussion? How about fetching the components of an organization chart by following management assignments? Better get ready for a lot of queries in a loop.

In addition to that, complicated queries were difficult to logically simplify. Reporting queries are especially prone to frequent sequences of aggregates and subqueries. It’s not uncommon to build a query that’s several pages long in this kind of context. Optimizing such an unwieldy beast is often difficult or even impossible simply due to all of the components and confusing nesting.

CTEs changed these things for the better and in the eyes of many, finally brought Postgres to parity with Oracle and its long-established recursive query support. So let’s explore what CTEs really deliver, and how they can improve our Postgres experience—caveats and all.

Let’s start with a trivial table and some data:

CREATE TABLE employee 
  employee_id  SERIAL PRIMARY KEY,
  full_name    VARCHAR NOT NULL,
  manager_id   INT REFERENCES employee
INSERT INTO employee (full_name, manager_id) VALUES
  ('King Randor', NULL),
  ('Prince Adam', 1),
  ('Teela', 2),
  ('Man-at-Arms', 2),
  ('Skeletor', NULL),
  ('Evil-Lyn', 5),
  ('Trap Jaw', 5),
  ('Clawful', 6);

It’s easy enough to display the management relationships. Here’s how our cartoon cohorts look with a basic JOIN:

SELECT m.full_name AS boss, e.full_name AS goon
  FROM employee e
  JOIN employee m ON (m.employee_id = e.manager_id)
 ORDER BY e.manager_id;
    boss     |    goon     
 King Randor | Prince Adam
 Prince Adam | Teela
 Prince Adam | Man-at-Arms
 Skeletor    | Evil-Lyn
 Skeletor    | Trap Jaw
 Evil-Lyn    | Clawful

In this trivial example, we can visually follow the results and understand that Clawful is ultimately a minion of Skeletor. We could also leverage our knowledge that the organization chart is only three levels deep and employ a third join to fully represent all relationships. But such a shallow corporate hierarchy is exceedingly rare, so let’s use a CTE to flush out the table instead.

    SELECT NULL::VARCHAR AS boss, *, 0 AS level,
           employee_id AS end_boss
      FROM employee
     WHERE manager_id IS NULL
    SELECT t.full_name AS boss, e.*, t.level + 1 AS level,
      FROM employee e
      JOIN org_tree t ON (t.employee_id = e.manager_id)
SELECT repeat(' ', level * 5) || full_name AS relationship
  FROM org_tree
 ORDER BY end_boss, level;
 King Randor
      Prince Adam
      Trap Jaw

Well that’s quite an improvement! But how does it work?

Our initial clue is the first query within the CTE. Other databases may do this differently, but Postgres creates a temporary in-memory table to act as a holding area to represent the CTE contents as they’re constructed. When we specify the RECURSIVE decorator, we gain the ability to bootstrap that temporary data with one query. The second query can then refer to the cumulative result in each iteration of the recursion.

The result is one query that loops in on itself three times in our example. We took advantage of this by adding a new column to track how deep the recursion is so we can visualize this more easily. Here’s what the contents of the “tree” table look like for each phase:

    SELECT NULL::VARCHAR AS boss, *, 0 AS level
      FROM employee
     WHERE manager_id IS NULL
    SELECT t.full_name AS boss, e.*, t.level + 1 AS level
      FROM employee e
      JOIN tree t ON (t.employee_id = e.manager_id)
    boss     | employee_id |  full_name  | manager_id | level 
             |           1 | King Randor |            |     0
             |           5 | Skeletor    |            |     0
 King Randor |           2 | Prince Adam |          1 |     1
 Skeletor    |           6 | Evil-Lyn    |          5 |     1
 Skeletor    |           7 | Trap Jaw    |          5 |     1
 Prince Adam |           3 | Teela       |          2 |     2
 Prince Adam |           4 | Man-at-Arms |          2 |     2
 Evil-Lyn    |           8 | Clawful     |          6 |     2

Each “level” here represents one dive into the employee table to fetch employees of the employees already listed. This loop naturally terminates once every boss is listed in the results. But there’s one flaw in this particular construction: what if we wanted to choose any grunt and see the whole chain of command from that point? To do that, we need to modify the CTE slightly to incorporate our desired predicate in the CTE portion itself so we can follow the relationship properly.

Here’s how that looks:

    SELECT *, 0 AS level
      FROM employee
     WHERE full_name = 'Clawful'
    SELECT e.*, t.level + 1 AS level
      FROM tree t
      JOIN employee e ON (e.employee_id = t.manager_id)
SELECT full_name
  FROM tree

Not bad, eh? We had to flip the JOIN because we started with a specific minion instead of the list of all executives. Then we followed the chain backwards, adding one middle-management peon per iteration until we reached the End Boss. We could combine this kind of trickery by writing a CTE that refers to another CTE and produce a query that would output the entire organization given any member in the hierarchy. We won’t, because that’s a gigantic and rather ugly query, but the capability is there.

What we can do, is demonstrate using CTEs to logically separate query fragments of a larger whole. In the past, a reporting query might consist of an imposing bulk of awkward subqueries to produce necessary aggregates and decode or label various summaries. In the worst cases, such queries might meander for dozens of pages. It’s often a miracle the end result executes at all, and debugging it is equally problematic.

Here’s how we might use CTEs to solve that conundrum:

    SELECT NULL::VARCHAR AS boss, *, 0 AS level,
           employee_id AS end_boss
      FROM employee
     WHERE manager_id IS NULL
    SELECT t.full_name AS boss, e.*, t.level + 1 AS level,
      FROM employee e
      JOIN org_tree t ON (t.employee_id = e.manager_id)
org_stats AS (
  SELECT m.full_name AS ceo, COUNT(*)-1 AS minions,
         MAX(level) AS cruelty
    FROM org_tree org
    JOIN employee m ON (m.employee_id = org.end_boss)
   GROUP BY m.full_name
org_attributes AS (
  SELECT m.full_name AS ceo,
         SUM(1) FILTER (WHERE org.full_name ILIKE '%evil%') AS evil,
         SUM(1) FILTER (WHERE org.full_name ILIKE '%prince%' OR
                              org.full_name ILIKE '%king%') AS royalty
    FROM org_tree org
    JOIN employee m ON (m.employee_id = org.end_boss)
   GROUP BY m.full_name
SELECT st.*, atr.evil, atr.royalty
  FROM org_stats st
  JOIN org_attributes atr USING (ceo);
     ceo     | minions | cruelty | evil | royalty 
 King Randor |       3 |       2 |      |       2
 Skeletor    |       3 |       2 |    1 |

The first portion of the query is just our previous recursive attempt to flatten the organization chart and see how everything is related. The second summarizes basic statistics like employee count and maximum abstraction through middle-management. The third is just a bunch of miscellaneous attributes that might be interesting in a report. All of our examples are trivial, but in a real report, each of these may reflect much more comprehensive aggregates and formulas. Yet despite query complexity, we can determine the end goal of a fragment at a glance. Combine this with SQL comments, and we have a very user-friendly report.

Of course, CTEs are not all sunshine and roses. Remember when we said a CTE is built in a temporary memory location to facilitate recursive functionality and allow CTEs to reference each other? A consequence is that every CTE acts as what we call an optimization fence.

Normally before a query is executed, it is broken down into its component parts and the planner translates those elements into execution instructions. This might mean collapsing certain conditionals, simplifying or substituting a subquery, pushing predicates down into a stack for better row elimination, and so on.

When the planner encounters a CTE however, it can go no further. It will optimize the CTE query itself, but it does so as an encapsulated black box. Even if a WHERE clause from the referring query could greatly reduce matched rows during the CTE execution, that optimization cannot be applied. The CTE executes as written as if we had done this instead:


This applies to every CTE in a query. It’s better to think of each CTE as a virtual temporary table. While that allows each CTE to refer to the entire contents of another CTE, it also means we may lose several opportunities to optimize a query. It’s not uncommon to unroll a CTE and receive a much faster query in return. Query planners are complex beasts, and like any software compiler, may simplify necessary instructions by eliminating entire branches from the execution tree due to redundancy or empty result paths. Using a CTE reduces the planner’s ability to do that.

On the other hand, an experienced user can leverage this knowledge to their benefit. Since the query planner cannot penetrate optimization fences, it means we can override its decision tree. When the data or statistics indicate the planner will improperly prefer a highly inefficient plan, we can force it along an improved path. In these cases, we’re actively trading the potential for future planner improvements for immediate advantage.

The primary argument here is that the planner improvements we need may not arrive for years, or at all. Can we justify suffering bad performance for an undetermined length of time until some nebulous future planner addresses our obscure data edge case? Often the answer to this question is ‘no’. In the rare instances where this justification applies, leveraging optimization fences is probably a safe bet. At least we have the option!

In the end, Postgres improved its reputation among power users, and we gained a versatile tool that enabled the previously impossible. New recursion, simplification, and optimization options, all from a single feature? Yes, please!

PG Phriday: Why Postgres

There are a smorgasbord of database engines out there. From an outside perspective, Postgres is just another on a steadily growing pile of structured data storage mechanisms. Similarly to programming languages like Rust and Go, it’s the new and shiny database systems like MongoDB that tend to garner the most attention. On the other hand, more established engines like Oracle or MySQL have a vastly larger lead that seems insurmountable. In either case, enthusiasm and support is likely to be better represented in exciting or established installations.

So why? Why out of the myriad choices available, use Postgres? I tend to get asked this question by other DBAs or systems engineers that learn I strongly advocate Postgres. It’s actually a pretty fair inquiry, so why not make it the subject of the first PG Phriday for 2017? What distinguishes it from its brethren so strongly that I staked my entire career on it?


Postgres isn’t new. It didn’t enjoy the popularity that practically made MySQL a household name as part of the LAMP stack. It didn’t infiltrate corporations several decades ago as the de facto standard for performance and security like Oracle. It isn’t part of a much larger supported data environment like SQL Server. It isn’t small and easy like SQLite. It’s not a distributed hulk like Hadoop, or enticingly sharded like MongoDB or Cassandra. It’s not in-memory hotness like VoltDB.

It’s just a regular, plain old ACID RDBMS.

You can't explain that!

It does after all, have all of the basics many expect in an RDBMS:

  • Tables, Views, Sequences, etc.
  • Subqueries
  • Functions in various languages
  • Triggers
  • Point In Time Recovery

Certain… other database platforms weren’t so complete. As a consequence, Postgres was preferred by those who knew the difference and needed that extra functionality without breaking the bank. It’s not much, but it’s a great way to develop a niche. From there, things get more interesting.


For the most part, being boring but reliable was a fact of life until the release of 9.0 when Postgres introduced streaming replication and hot standby. Postgres was still a very capable platform before that juncture, but built-in high-availability made it more viable in a business context. Now the secondary copy could be online and supply query results. Now the replica would lag behind the primary by a small handful of transactions instead of entire 16MB segments of transaction log files.

Postgres had finally joined the rest of the world in that regard. MySQL used a different mechanism, but that was one of its selling-points for years before Postgres. The primary distinction is that Postgres streams the changes at a binary level, meaning very little calculation is necessary to apply them. As a result, Postgres replicas are much less likely to fall behind the upstream primary.

The second I tested this feature in 2010, any lingering doubts about the future of Postgres vanished.

Cult of Extensibility

A future version of Postgres. Probably.
A future version of Postgres. Probably.

The next huge—and arguably most important—advancement in Postgres accompanied the release of 9.1: extensions. The true implications here are hard to overstate. Not all of the internal API is exposed, but extensions make it possible for practically any inclined individual to just bolt functionality onto Postgres. When Postgres 9.2 added foreign data wrappers, even arbitrary alternative backends became a possibility.

Hijack the query planner to route data through video card CPUs? Got it. Add basic sharding and distributed query support? No problem. Interact with Cassandra, SQL Server, or even Facebook? Easy peasy. Store data in analytic-friendly column structure? Child’s play.

Perl has the dubious honor of being labeled the Swiss Army Chainsaw of languages because it enables a practitioner do anything. Extensions convey almost that same capability to Postgres. And while a badly written extension can crash your database, good ones can elevate it beyond the imaginations and time constraints of the core developers.

Extensions that provide enough added value have even inspired fully supported internal adaptations, as in the case of materialized views in Postgres 9.3. What other database does that?

Consider what happens when these features are combined.

  1. Create a materialized view that refers to a remote table.
  2. Refresh the above view before using it in a report.
  3. Alternatively, siphon updated rows from the view into a more permanent aggregate summary table.
  4. Get local data processing performance in ad-hoc analytics over heterogeneous platforms.

Now Postgres can be the central nexus for a constellation of various data environments and provide snapshot analysis for the live data. Without a convoluted ETL infrastructure. Technically the materialized views or intermediate aggregate tables aren’t strictly necessary, so Postgres wouldn’t even need to store actual data. Such a configuration would be hilariously slow, but now the ironic scenario exists where Postgres can power a database empty of actual contents.

The 9.2 release transformed Postgres into a platform, and one of the reasons I don’t use the SQL part of PostgreSQL anymore.

Developers, Developers, Developers!

Ballmer likes developers

The folks hacking on the Postgres code are both insanely skilled and notoriously available. It’s almost a running joke to guess which of the core devs will answer a basic SQL question first. There’s practically a race to answer questions in the mailing lists regardless of sophistication or perceived merit, and anyone subscribed to the list can participate.

Their dedication to fostering community interaction is unrelenting. While not quite as organized as the Linux kernel developers thanks to Linus’ role as benevolent dictator, they’ve pushed Postgres forward every year. Due to their strict commit-fests and automated testing and code review, they’ve delivered a stable update roughly every year since 2008. Is there another database engine that can boast the same?

And every release has at least one headliner feature that makes upgrading worth the effort. Every. Last. Version.

  • 8.4: Window functions + CTEs
  • 9.0: Streaming replication
  • 9.1: Foreign tables, extensions
  • 9.2: Cascading replication, JSON support
  • 9.3: Materialized views, event triggers, data checksums
  • 9.4: JSONB, background workers, logical WAL decoding
  • 9.5: Upsert
  • 9.6: Parallel execution
  • 10.0?: Standby quorum, native table partitioning

While it would be wrong to demand that kind of dedication and quality, appreciating it is quite a different story. The community pushes Postgres forward because the devs give it a voice. That’s rare in any project.

In the end, I consider it a privilege to even participate from the sidelines. Is it perfect? Of course not; I’ve pointed out serious flaws in Postgres performance that have yet to be successfully addressed. Yet given the alternatives, and what Postgres really delivers when it’s fully leveraged, I can’t even think of a better commercial RDBMS.

Why Postgres? Maybe the better question is: why not?

PG Phriday: Who Died and Made You Boss?!

Postgres is great, but it can’t run itself in all cases. Things come up. Queries go awry. Hardware fails, and users leave transactions open for interminable lengths of time. What happens if one of these things occur while the DBA themselves has a hardware fault? While they’re down for maintenance, someone still has to keep an eye on things. For the last PG Phriday of the year completely unrelated to my upcoming surgery, let’s talk about what happens when your DBA becomes inoperative due to medical complications.

This is Fake Postgres DBA 101!

Getting Around

When in doubt, SSH is the name of the game. Database accounts are either locked-down or lack sufficient permissions to do everything, but usually the postgres user itself has total access. Invoke your favorite SSH client to connect to the host running the database in question, and use sudo to become the postgres user:

ssh my-db-host
sudo su -l postgres

Afterwards, it’s a good idea to add our public key to the postgres user’s .ssh/authorized_keys file so we can log in as the postgres user without the intermediate sudo. Here’s a good guide for doing that. If there’s configuration management like salt or Puppet involved, that file is probably part of the stack and needs to be modified there or it’ll be overwritten.

Either way, we’re in. If we’re lucky enough to be on a Debian derivative like Ubuntu or Mint, we can use the pg_lsclusters command to see which database instances are running on this server. Here’s what that looks like on a sample VM:

Ver Cluster Port Status Owner    Data directory          Log file
9.5 main    5432 online postgres /data/pgsql/main/data   /var/log/postgresql/postgresql-9.5-main.log
9.6 96test  5440 online postgres /data/pgsql/96test/data /var/log/postgresql/postgresql-9.6-96test.log

The next thing we need is the psql command-line client—the nexus of Postgres interoperability. Postgres instances monitor the listed port for connection attempts, so if we wanted to target a specific instance, we’d want to refer to this output. The default is 5432 if we don’t pass any value.

Let’s get a list of all databases in the 96test Postgres instance on this system.

psql -p 5440 -l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
 examples  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 sensors   | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(5 rows)

I’ve covered the template databases, and they can safely be ignored. The only “real” databases in this installation are examples, postgres, and sensors. The postgres database is a default and is often used as scratch space; hopefully nobody is using it as their actual production database.

Each of these is distinct and can not interact with the data contained in the others. Admins familiar with other database engines might find this odd, but that’s how Postgres works. Usually databases that require lots of data sharing use schemas to keep tables in namespaces within the database. And of course, I have a long-winded explanation of this as well.

Either way, we have a list of databases from the instance we want to examine. How do we connect? Well, if the -l flag shows us a list of available databases, what happens if we remove it and append the name of a database?

psql -p 5440 sensors
psql (9.6.1)
Type "help" for help.
sensors=# SELECT 'Hello World!';
 Hello World!
(1 row)
sensors=# help
You are using psql, the command-line interface to PostgreSQL.
Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

Sweet! we connected to the sensors database, successfully executed a basic query, and got a bit of assistance from the client software itself. The psql client is an extremely robust tool for interacting with a Postgres database. There are a lot of shortcut commands, and entering \? lists all of them. Be ready to scroll!

The one that really matters is \d and its variants. It’s short for “describe” and does exactly that. It can retrieve lists of tables, schemas, views, indexes, or any other database object. It can also provide greater detail about the object in question. Let’s use it to list the available schemas, see the contents of a schema, and get more insight on a particular table.

sensors=# \dn
  List OF schemas
  Name  |  Owner   
 logs   | postgres
 public | postgres
(2 ROWS)
sensors=# \dt logs.*
           List OF relations
 Schema |    Name    | TYPE  |  Owner   
 logs   | sensor     | TABLE | postgres
 logs   | sensor_log | TABLE | postgres
(2 ROWS)
sensors=# \d logs.sensor
                                           TABLE "logs.sensor"
    COLUMN     |            TYPE             |                         Modifiers                          
 sensor_id     | INTEGER                     | NOT NULL DEFAULT NEXTVAL('sensor_sensor_id_seq'::regclass)
 location      | CHARACTER VARYING           | NOT NULL
 reading       | BIGINT                      | NOT NULL
 modified_date | TIMESTAMP WITHOUT TIME zone | NOT NULL
    "sensor_pkey" PRIMARY KEY, btree (sensor_id)
    "idx_sensor_location" btree (location)

The \dn command shows the namespaces in the current database. From that list, we see that the logs schema might have something interesting in it. So we then rely on \dt to list all of the tables the logs schema contains. From there, we can just use the regular \d describe command to get all of the information Postgres has about the structure of the logs.sensor table, complete with indexes, constraints, triggers, and so on.

Now that we can properly introspect to see where we are, it’s time to look deeper.

Peering Into the Void

A common action upon connecting to a database is to view connections and see what they’re doing. This is when we turn to the Postgres system catalog, a series of tables and views that reflect the current state of the database. The pg_stat_activity view will tell us anything we need to know regarding user connections. For example:

SELECT pid, datname, usename, state, query_start
  FROM pg_stat_activity;
  pid  | datname | usename  |        state        
 10392 | sensors | postgres | active
 16202 | sensors | postgres | idle IN TRANSACTION
Expanded display IS ON.
SELECT * FROM pg_stat_activity WHERE pid=16202;
-[ RECORD 1 ]----+------------------------------------
datid            | 16384
datname          | sensors
pid              | 16202
usesysid         | 10
usename          | postgres
application_name | psql
client_addr      | 
client_hostname  | 
client_port      | -1
backend_start    | 2016-12-16 08:58:44.584559-06
xact_start       | 2016-12-16 08:58:46.34946-06
query_start      | 2016-12-16 08:59:01.336305-06
state_change     | 2016-12-16 08:59:01.336895-06
wait_event_type  | 
wait_event       | 
state            | idle IN TRANSACTION
backend_xid      | 
backend_xmin     | 
query            | SELECT * FROM logs.sensor LIMIT 10;

The first query is a pretty basic terse listing of the activity of all clients. We noticed one of them was idle within a transaction and decided to view everything Postgres knew about that connection. From the various listed times, we can see that it’s only been idle for a few seconds, so there’s no cause for alarm. We can also see the last query the user executed against the database, even if it completed long ago. This is all useful debugging information.

Viewing all of that as a single row would have been extremely inconvenient, so we utilized another of the psql commands and used \x to enable extended output. When this option is active, psql presents every column within a result as a key/value row pair. It’s not especially convenient for hundreds or thousands of rows, but it’s pretty indispensable when viewing the results of a wide column list.

A good combination of the columns in the pg_stat_activity view might look something like this:

-- This query for 9.6
SELECT pid, datname, usename, state, client_addr, wait_event,
       now() - query_start AS duration,
       SUBSTRING(query, 1, 30) AS query_part
  FROM pg_stat_activity
 WHERE state != 'idle'
 ORDER BY now() - query_start DESC
 LIMIT 10;
-- This query for 9.2 to 9.5
SELECT pid, datname, usename, state, client_addr, waiting,
       now() - query_start AS duration,
       SUBSTRING(query, 1, 30) AS query_part
  FROM pg_stat_activity
 WHERE state != 'idle'
 ORDER BY now() - query_start DESC
 LIMIT 10;
  pid  | datname | usename  |        state        | client_addr | wait_event |    duration     |           query_part           
 22503 | sensors | postgres | idle IN TRANSACTION |   |            | 00:38:21.952753 | SELECT * FROM logs.sensor LIMI
 10392 | sensors | postgres | active              |             |            | 00:00:00        | SELECT pid, datname, usename,

This query only reports non-idle connections prioritized by activity duration. From here, we can see that the connection that was idle in transaction is still idle several minutes later. Nothing is waiting on it, but maybe it’s time to clean up a bit anyway.

The Terminator

Maybe that idle connection is blocking someone and it needs to go away. There are two ways to make that happen directly within Postgres. The first and safest option is to try and cancel the offending query with pg_cancel_backend. If that doesn’t work, we escalate to pg_terminate_backend which actually breaks the connection and rolls back any pending transactions. Let’s try those now:

SELECT pg_cancel_backend(22503);
SELECT pid, state, query
  FROM pg_stat_activity
 WHERE pid = 22503;
  pid  |        state        |                query                
 22503 | idle IN TRANSACTION | SELECT * FROM logs.sensor LIMIT 10;
(1 ROW)
SELECT pg_terminate_backend(22503);
SELECT pid, state, query
  FROM pg_stat_activity
 WHERE pid = 22503;
 pid | state | query 
(0 ROWS)

The pg_cancel_backend function didn’t “work” because it only cancels the currently operating query. If there’s a transaction in place, the user will simply return to their database prompt. An application that was stuck may be fixed by canceling a stuck query. If the application is threaded, it may have started a transaction and is injecting commands as another portion of the program does work elsewhere. If that process gets interrupted, the transaction may never complete.

That’s when we roll out pg_terminate_backend, which discards such polite approaches. This is the kill shot most DBAs are familiar with. Queries are canceled. Transactions are rolled back. Connections are broken. Resources are reclaimed. This isn’t quite cold-blooded murder, though. From the perspective of Postgres, we’re merely pointing a rifle at the offending connection and politely requesting it to vacate the premises.

Unlimited Cosmic Power

For DBAs more comfortable with GUIs, pgAdmin might be a better interface than psql. Version 3 is a stand-alone client for Windows, Linux, or OSX. In smaller environments, it’s not entirely uncommon to already have connection capabilities to servers hosting Postgres instances. Just connect and thrill in exploring the drilldown menus, object creation scripts, query interface, and all of the other niceties it provides.

Many of the operations that really matter, like terminating unruly connections, are only available to superusers. There are a couple ways to get this kind of access. In a pinch, it’s easy to connect to the host server as above and grant superuser access directly to ourselves from psql:

\du sthomas
           List OF roles
 ROLE name | Attributes | Member OF 
 sthomas   | Superuser  | {}

Hopefully though, the current DBA set aside a dedicated role for this kind of thing. Giving users direct superuser access is usually frowned upon. Imagine we have a sysdba role with superuser capabilities and we want to share in the glory. This would be the correct approach:

GRANT sysdba TO sthomas;
\du s*
                  List OF roles
 ROLE name |       Attributes        | Member OF 
 sthomas   |                         | {sysdba}
 sysdba    | Superuser, Cannot login | {}

However we should note that since it’s the role with superuser access, we aren’t actually superusers. Like with sudo, we need to “activate” our new powers before they’ll work. Postgres allows users to adopt roles they’re members of. If we wanted to perform superuser actions in Postgres from pgAdmin, we could issue this command from a query pane:

SET ROLE sysdba;

Is there more? Oh, there’s a lot more. But this is enough for a competent techie to start digging around and becoming familiar with the territory. Quick, seize control while the DBA isn’t around to stop you! They’ll never know! Viva la resistance!

Of course, an experienced Postgres DBA will know and revoke your abilities later, but it’s fun to pretend. Take the opportunity to learn more anyway; just be careful on production systems.

PG Phriday: Planner Pitfalls

Recently a coworker asked me this question:

Should I expect variance between minutes and hours for the same query?

And I was forced to give him this answer:

Potentially, but not commonly. Query planning is an inexact science, and regardless of the query being the “same query,” the data is not the “same data.” This isn’t generally the case, but on occasion, changes in data can affect the query execution path. Usually this is a good thing, as the database accounts for new value distributions.

For example, if there are a million distinct values in one column, but 90% of them are the same, certain values should trigger an index scan instead of a sequential scan. Those values will change over time, and if the stats don’t account for that, queries will have non-dependable performance. Of course, this introduces potential correlation assumptions that aren’t correct in some cases, and that also causes unreliable query performance. I guess the question is: which would you rather have?

That answer, despite being “right”, didn’t sit well with me. While we did eventually sit down and optimize the query in question so it was less likely to mislead Postgres, there’s no reason to assume an end-user is going to readily accept unreliable query performance. Nor should they.

But we need to perform some analysis to determine how things got to this point. Let’s start with distribution. We have a table that contains about two weeks worth of data, represented by 66M rows over 100GB of space (135GB including indexes). We’re not scheduled to upgrade to 9.6 until early 2017, so the instance is running Postgres 9.5. It isn’t bleeding edge, but this is hardly an ancient installation.

Consider the query plan for a standard retrieval of the last day worth of data:

 WHERE create_time >= '2016-12-08';
                               QUERY PLAN
 Bitmap Heap Scan ON NEW
     (cost=137498.78..8770565.51 ROWS=5819512 width=1032)
   Recheck Cond:
     (create_time >= '2016-12-08 00:00:00-06'::TIMESTAMP WITH TIME zone)
   ->  Bitmap INDEX Scan ON idx_new_create_time
     (cost=0.00..136043.90 ROWS=5819512 width=0)
         INDEX Cond:
         (create_time >= '2016-12-08 00:00:00-06'::TIMESTAMP WITH TIME zone)

There’s nothing surprising here. There’s an index on the create_time column, and that index is pulling the rows we requested. But what happens if we add a LIMIT clause?

 WHERE create_time >= '2016-12-08'
 LIMIT 10;
                               QUERY PLAN
 LIMIT  (cost=0.00..24.03 ROWS=10 width=1032)
   ->  Seq Scan ON NEW
           (cost=0.00..13985651.85 ROWS=5819512 width=1032)
         (create_time >= '2016-12-08 00:00:00-06'::TIMESTAMP WITH TIME zone)

WHAT!? How is reading the full contents of a 100GB table ever faster than fetching over 5M rows using an index? How is that a correct decision in any sane universe, and how is Postgres reaching it?

It turns out that Postgres is making two fundamentally flawed assumptions here:

  1. The distinct values in the create_time column are evenly distributed.
  2. Only a small sample of the table will be required to obtain 10 matches. Yes this means a sequential scan, but one that can be aborted fairly quickly.

Ultimately, the first invalid assumption compounds the second. As is true in many cases with columns that contain dates, those values in our table exist along a steadily increasing vector. With two weeks of historic data, Postgres would have to read almost the entire table to reach the portion where the most recent rows reside. As a result, Postgres isn’t reading 10 or even 1000 rows, it’s reading 60-million.

What’s worse is that this behavior is consistent. We have another table that’s 500GB in size with nearly 500M rows, and the query plan is the same. Naively adding a LIMIT clause to a query on that table could be outright disastrous. Not only would it mean effectively reading the entire table, but would result in flushing many other objects out of cache. So now we’ve saturated disk IO right when other queries have lost their own table caches. At that point, every query on the system will perform horribly, even after the rogue sequential scan is complete. Memory caches need to be rebuilt after all.

Which leads to the second stumbling block that led to the original question regarding unreliable query performance. A portion of that query looked like this:

SELECT DISTINCT nw3.col1, mw.col2, SUM(mw.quantity) AS quantity
         nw1.col1=mw.col3 AND nw1.col2=mw.col2
         nw2.col1=nw1.col3 AND nw2.col2=nw1.col2
         nw3.col1=nw2.col3 AND nw3.col2=nw2.col2
 WHERE mw.create_time > CURRENT_DATE
   AND = 'some'
   AND = 'unique'
   AND = 'value'
 GROUP BY nw3.col1, mw.col2;

To be completely fair, this query contains a number of problems:

  1. Predicates in a WHERE clause are applied after the LEFT JOIN, so many unintended rows will be removed because it’s not accounting for NULL values.
  2. Predicates are not transitive. The CURRENT_DATE clause should be applied to all of the joins so the planner has all necessary information.
  3. The DISTINCT is not necessary due to the GROUP BY clause.

If we take that into account and rewrite the query, we get this:

SELECT nw3.col1, mw.col2, SUM(mw.quantity) AS quantity
         nw1.col1=mw.col3 AND
         nw1.col2=mw.col2 AND'unique' AND
         nw1.create_time > CURRENT_DATE
         nw2.col1=nw1.col3 AND
         nw2.col2=nw1.col2 AND
         nw2.create_time > CURRENT_DATE
         nw3.col1=nw2.col3 AND
         nw3.col2=nw2.col2 AND'value' AND
         nw3.create_time > CURRENT_DATE
 WHERE mw.create_time > CURRENT_DATE
   AND = 'some'
 GROUP BY nw3.col1, mw.col2;

And as expected, this version of the query performed much better, executing about 20-times faster than the original incarnation. Unfortunately, that isn’t the end of the story. See all of those join conditions on each table in addition to the WHERE clause? Postgres multiplies the probabilities of column values together to obtain a rough row estimate it uses to calculate the cost of each potential query plan. Since these are all fractions, we have a steadily decreasing estimate with each additional clause.

This usually works fine until we’re dealing with closely correlated data. If col1 and col2 have a one-to-one relationship, multiplying their probabilities is exactly the wrong thing to do. As is the case with most underestimated row counts, Postgres will generally opt for a nested loop. Why not? Iterating over a few dozen values is cheap and has little setup cost compared to allocating in-memory merge or hash segments.

In our case, the row estimates were off by two orders of magnitude. This is fine in isolation! As mentioned previously, the new query plan was much faster than the old one. But that was only the first portion of a much larger CTE-driven query. Each fragment contained similar flaws as the first, and further reduced row estimates to nearly zero in the aggregate.

That means a lot of nested loops. Fine for a handful of rows, but not an end-of-day total of five million. The (admittedly large and complicated 31k query) required nearly three hours to complete. How did we fix it?

SET enable_nestloop TO FALSE;

That one modification before running the query reduced its execution time to 26 seconds. Since the query was essentially a report, I unilaterally decided that no imaginable nested loop could possibly improve the performance of that query and would instead be actively detrimental. According to the original query author, the previous run-time was usually a few minutes before it ballooned to several hours last week.

That kind of variance would understandably confuse and probably enrage most people. How are the worst case and best case scenarios for the same query so drastically different? The underlying issue is that Postgres trends toward the lowest cost estimate of the plans it examines without taking the worst case into account.

This is what happened when it chose to use a sequential scan when we supplied a LIMIT clause. Yes, the best scenario is that only a few hundred or thousand rows are required to obtain the full 10 matches after applying the WHERE clause. The worst case is never considered, despite how much performance suffers as a consequence. Our particular example could be addressed if Postgres collected generalized vector statistics to map data distribution shapes. This would cover steadily increasing, decreasing, or clustered data values. Yet that’s hardly a simple change.

The story is similar for selecting a nested loop over a hash or merge operation. A nested loop is great until it isn’t. Best case? A few seconds. Worst? Several hours. How do I explain to users who experience this on a regular basis, beyond striving to transform every single one of them into experts at coddling the Postgres query planner? It’s an untenable situation.

It’s the only solution I have and partially why PG Phriday is a regular occurrence. But I’ve always wished it wasn’t necessary. I don’t need to be a mechanic to drive my car. A query planner is much more complicated than a motor vehicle, yet the perception remains. How do we really address the true complexity of the planner without making excuses for its current shortcomings?

Despite my love of Postgres, I can’t really answer that. When everything is operating optimally, Postgres is the best database I’ve ever encountered. But when it trips, it faceplants into frozen January molasses en-route to the moon. I usually tell users to cancel those queries, or do it myself, because that molasses is never reaching the moon.

Until the worst case of a plan is integrated into the planning process, we can expect the occasional misstep. At least we can rely on temporary tables and disabling elements of the query planner for truly belligerent queries.

PG Phriday: Ambling Architecture

It’s about the time for year-end performance reviews. While I’m always afraid I’ll narrowly avoid being fired for gross incompetence, that’s not usually how it goes. But that meeting did remind me about a bit of restructuring I plan to impose for 2017 that should vastly improve database availability across our organization. Many of the techniques to accomplish that—while Postgres tools in our case—are not Postgres-specific concepts.

Much of database fabric design comes down to compromise. What kind of resources are involved? How much downtime is tolerable? Which failover or migration process is the least disruptive. Is it important that components integrate self-healing? There are several questions that demand answers, and in most cases, Postgres replication is completely sufficient to ensure data security and uptime.

This is what most Postgres admins and users probably see in the wild:

Disaster Recovery Pair

Usually it’s just two servers set up as a mirror using asynchronous or synchronous Postgres replication. On top is generally some kind of virtual IP address or DNS pointer to ensure connections always target the primary server. This is only critically important for connections that need write access, but it’s still a common theme.

Even the some of the most heavy-duty Postgres high availability solutions are just variations of this theme. In 2012, I gave a presentation to Postgres Open on using Pacemaker. The presentation notes are available on the wiki for that year. This is no simple stack, either:

  • LVM: Linux Volume Manager
  • DRBD: Distributed Replicating Block Device
  • Pacemaker: Failover automation
  • Corosync: Pacemaker’s communication layer
  • VIP: Virtual IP address

That particular setup uses DRBD for block-level synchronization instead of Postgres replication because it was designed for an extremely high volume transaction processing system. It’s difficult to survive 300-million writes per day with synchronous replication latency unless it’s fairly low-level.

For normal server pairs that don’t require absolutely bulletproof data-loss prevention and throughput guarantees, standard Postgres replication is fine. Yet even injecting Postgres replication in place of DRBD and accounting for LVM being standard on most Linux hosts, we must still account for Pacemaker, Corosync, and VIP network structure requirements. That’s a lot to ask for smaller companies or university research. Even mid-sized companies with a full infrastructure department tend to shy away from Pacemaker due to its management complexity.

So what else can we do for easy and quick database promotions in the case of unexpected outages or managed system migrations? In early 2015, Compose wrote about their solution to the problem. The diagram for the setup looks like this:

Governor Stack

Essentially the Governor process acts as a central nexus controller for Postgres and a few other pieces. The etcd process is just a distributed key-value storage system with a robust election system to ensure consistent values across the cluster. And HAProxy hides all of our IP addresses so we never have to know which system is the leader. Connecting to HAProxy will always contact the primary Postgres server.

It looks complicated—and to a certain extent it is—but it readily beats the alternative. Here’s how the whole process basically works while it’s running:

  1. The Governor checks etcd for the presence of a leader.
  2. If no leader is found, it sets a key claiming the position with a relatively short TTL.
  3. If there’s already a leader, it tries to put the local Postgres instance in a state where it can begin replication from that system.
  4. Postgres is restarted to fit current roles if necessary.
  5. The Governor presents a REST interface to HAProxy as a health status. Only the leader will report a successful check.
  6. Repeat.

If we connect to this stack through HAProxy, it only redirects traffic to the Postgres server that reports itself as the leader. There’s never a need for a VIP, or a CNAME, or any other kind of DNS shenanigans. Just connect to HAProxy. Empty servers get bootstrapped with the most recent data. Old leaders are rewound and become part of the existing cluster. It’s elastic and self-healing, and much easier to manage than Pacemaker.

Of course, this leaves us with a couple other issues to resolve. The first is that of race conditions. If both Postgres servers are down, how do we know the first to win the leader race is the one with the most recent data? We don’t. Once all keys and positions have expired from the key-value store, there’s a blank slate that opens up the possibility a very old server could take over as the new primary. Once the server with the most recent data tries to connect, it will notice the mismatch and fail pending admin intervention.

This is what we call a Split Brain, and this scenario is only one way to achieve it. Each member of the cluster thinks it should be the leader for perfectly valid reasons, yet the “wrong” one is now in charge. The only fool-proof method to resolve this is to always have at least one online Governor available to provide new members with a comparison transaction status. The more Postgres servers we have in any particular cluster, the easier this is to achieve.

We can also help by setting a short transaction log archive timeout and sending archived logs to a globally shared location available to all cluster members. This ensures a single minimal source of transaction status and contents. Even a very old server would apply these archived transaction contents, and we’d “only” lose data since the last archival. Not ideal, but it helps to at least mitigate risk. If we’re coming back from an outage that took down ever cluster member, it’s likely we have other problems anyway.

The other concern with this kind of structure is actually horizontal scaling. Most automated consensus algorithms have membership limits due to interaction complexity at higher counts. If we have a larger cluster with dozens of members, overhead of maintaining the key-value store could sap precious hardware resources or fail outright. If we have dozens of separate Postgres clusters for various disparate applications, we are limited to either maintaining multiple parallel stacks, or we share the key-value store and HAProxy between them. In the latter case, we run into the same membership overload.

This is where decoupling comes in. It’s easy, and even suggested to split things up for shared environments. Consider this revision to our previous diagram:

Scalable Governor Stack

Each of those Postgres groups are a separate cluster with an undetermined amount of replicas. We’ve moved HAProxy and etcd to their own resources, whether those are VMs, shared servers, or some kind of container.

In that separate location, HAProxy can connect to any of the Postgres cluster members. Now we have a much smaller pool to consider as connection candidates. It’s a layer of abstraction that might introduce more latency, but it also means we don’t need to divulge the location of any Postgres server members. For configuration purposes, this greatly simplifies larger application stacks that may consist of dozens of interacting elements. Why maintain a list of ten different hosts for various Postgres needs? Just use the HAProxy pool.

Separated from the numerious Postgres hosts, consensus complexity is greatly reduced for etcd. It’s probably a good idea to have more than three members for larger constellations, but the concept remains sound. So long as the key-value pool survives, the Governor process will always have a secure location to track the Postgres leader and the transaction state of all replicas.

It’s “easy” to simply deploy the entire stack to every Postgres server and rely on local connections in isolation. For smaller pools and non-shared resources, that’s probably the most reliable approach. But for mid to large-size businesses, Postgres doesn’t operate in isolation. It’s likely there are already other services that make use of HAProxy or etcd. This approach allows all of them to share the components, and prevents us from maintaining a parallel installation of potentially unnecessary daemons.

As a final note, Zalando forked Governor and produced Patroni for occasions like those above. In existing environments, there’s probably a key-value store already in place, so why not use it? While Governor is restricted to using etcd for this purpose, Patroni is also compatible with ZooKeeper or Consul. Because of that extra bit of functionality, we’ll probably start integrating it into our organization using that last diagram as a model.

And then? Well, the only thing better than a Postgres cluster is an immortal Postgres cluster. We just have to hope it never gains sentience and desires revenge for badly written queries.