PG Phriday: Forensic Fundamentals

All database engines, even Postgres, occasionally present a seemingly intractable problem that will drive everyone insane while attempting to isolate it. All it takes is the perfect storm of situational circumstances, and even a smoothly running stack of software will grind to a screeching halt. It’s in situations like this that we must turn to various Postgres forensic tools to track down the issue before management starts firing people in frustration.

We’ve already discussed how connections can be blocked by improper operation ordering. But what if things aren’t so straightforward? When there are more than two actors involved, the potential for weird interactions increases to the point where replicating the issue in a clean environment is often difficult or impossible. So let’s make sure a few things are in place.

First things first: logging. The default log settings are far too conservative; tweaked properly, the logs provide a wealth of after-the-fact diagnostic information. All of these should go in postgresql.conf:

log_checkpoints = on
log_statement = ddl
log_min_duration_statement = 1000
log_line_prefix = '%p|%u|%d|%r|%t|'

There are a lot of other log settings, but these are the ones I personally consider essential modifications. Why? Let’s examine them one by one.

  • log_checkpoints: A checkpoint happens when Postgres writes pending modifications to table files. If this setting is enabled, we learn when that process started, how long it took, how much data was written, how much time was spent syncing to disk and thus waiting for the underlying OS to flush the data, and so on. Enabling this can single-handedly track down insufficient write performance of a storage device, or reveal times of heavy write activity that may suggest better transaction log coverage, and so on. It’s a crime this isn’t enabled by default.
  • log_statement: The ddl setting will log any modification to a database object. Create a table? Logged. Modify a view? Logged. Drop an index? Logged. This is the minimal level of activity auditing a DBA should expect. Modifying database structures, especially in production, should be strictly controlled, and we need to know when such modifications occur, who made them, and so on.
  • log_min_duration_statement: In most databases, queries execute on the order of tens of milliseconds. Those that exceed one full second are almost always aberrations that require further investigation. Reporting-oriented servers might err toward higher values as their queries are generally slower, but this should still be set to something. This doesn’t just reveal slow queries, but queries that ran long for any reason, including time spent waiting on locks. That latter case makes query tracking an essential tool.
  • log_line_prefix: There are numerous environmental details in place for any session, or a particular operation. We know the account in question, where it connected from, which database it’s using, the PID of the Postgres backend it was assigned, and so on. It’s ideal to salt this to fit organization needs, and this is much better than the default. If a problem occurs, we might not be available or capable of viewing it right away. It’s up to the log to fill any gaps.
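None of these settings require a restart to take effect. Assuming a superuser connection, they can also be applied with ALTER SYSTEM and a configuration reload rather than editing postgresql.conf by hand:

```sql
-- Write the logging settings to postgresql.auto.conf.
ALTER SYSTEM SET log_checkpoints = 'on';
ALTER SYSTEM SET log_statement = 'ddl';
ALTER SYSTEM SET log_min_duration_statement = 1000;
ALTER SYSTEM SET log_line_prefix = '%p|%u|%d|%r|%t|';

-- Signal all backends to re-read their configuration.
SELECT pg_reload_conf();
```

Either way, the settings survive restarts, and pg_reload_conf() means we don’t have to wait for a maintenance window to start collecting evidence.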

There is a huge exception here that will be relevant soon: the TRUNCATE command. Postgres classifies this as a data modification, not DDL. As such, our chosen log_statement value will not log table truncations! This matters because TRUNCATE doesn’t just empty a table—it destroys it. What we see when truncating a table is that it becomes empty. In reality, Postgres creates a new, empty version of the table’s underlying files and performs an atomic swap, discarding the originals.

We could use that to argue TRUNCATE is DDL, and should be logged as such. We’re fundamentally modifying the table itself, and the documentation agrees:

TRUNCATE acquires an ACCESS EXCLUSIVE lock on each table it operates on, which blocks all other concurrent operations on the table.

This exclusive access lock is the key to understanding situations where TRUNCATE can be problematic. We have a process that isn’t logged as DDL, yet acts like DDL, and could originate from any application or script that touches the database. Consider the implications.
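If capturing truncations in the log matters for your environment, there is a setting one level above ddl that covers them. Per the Postgres documentation, the mod value logs everything ddl does, plus data-modifying statements such as INSERT, UPDATE, DELETE, COPY FROM, and crucially, TRUNCATE:

```
log_statement = mod
```

The trade-off is a much noisier log on write-heavy systems, so this is a judgment call rather than a universal recommendation.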

Now let’s construct the perfect storm. Imagine two access vectors: one reads data from a table, the other writes to it intermittently. They look like this:

CREATE TABLE foo (bar int);

-- Connection 1, reads from a table:

BEGIN;
SELECT * FROM foo;

-- Connection 2, writes from some script source:

INSERT INTO foo (bar) VALUES (42);

These look relatively harmless, but there are a couple of clues if we examine them more closely. Notice that the read-only connection is wrapped in a transaction? This looks odd since it’s completely unnecessary, but it’s actually very common. Many database connectors behave this way by default, since it allows developers to perform endless operations without affecting the database until they’re ready to commit.

Unfortunately for applications that are read-only, this can be a lot of unnecessary overhead. Further, the author of a read-only script may not be cognizant they’re operating within a transaction. It’s understandably common for such a process to request a large amount of data from Postgres and then process it, unaware that Postgres is calmly waiting for them to commit or roll back their work. We can see this in Postgres if we examine the pg_stat_activity and pg_locks views.

Imagine a connection retrieved some data and is spending hours processing it. This is what Postgres would see:

\x on

SELECT * FROM pg_stat_activity
 WHERE state = 'idle in transaction';

-[ RECORD 1 ]----+------------------------------
datid            | 12411
datname          | postgres
pid              | 19690
usesysid         | 10
usename          | postgres
application_name | psql
client_addr      | 
client_hostname  | 
client_port      | -1
backend_start    | 2016-08-19 11:14:11.194043-05
xact_start       | 2016-08-19 11:14:28.376349-05
query_start      | 2016-08-19 11:14:28.376627-05
state_change     | 2016-08-19 11:14:28.376977-05
waiting          | f
state            | idle in transaction
backend_xid      | 
backend_xmin     | 
query            | SELECT * FROM foo;

\x off

SELECT a.pid, a.waiting, l.relation::regclass::text,
       l.locktype, l.mode
  FROM pg_locks l
  JOIN pg_stat_activity a USING (pid)
 WHERE a.state = 'idle in transaction';

  pid  | waiting | relation |  locktype  |      mode       
-------+---------+----------+------------+-----------------
 19690 | f       | foo      | relation   | AccessShareLock
 19690 | f       |          | virtualxid | ExclusiveLock

This information paints the session as ultimately innocuous. It’s idle in transaction, but it only has a share lock on one table, which shouldn’t block anything trying to write to it. Imagine our surprise when applications and users start reporting they keep getting errors about “too many connections”. When we look at the sessions, we see something like this:

SELECT count(*)
  FROM pg_stat_activity
 WHERE waiting;


How is that possible? It’s easy if we have a large cluster of web-driven servers all trying to write to a single table. Or even just an aggressively threaded bulk loading application that pushes data as hard as it can without any kind of throttle checks. Both of these situations are fine when the database is behaving, so nobody ever thought about coding around scenarios where it becomes unresponsive.
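If we manage to catch the pile-up while it’s still live, pg_locks can name the culprit directly. This is a deliberately simplified sketch of the classic lock-blocking query: it matches only on relation locks, which is enough to expose this particular scenario, though a production-grade version should match on all lock fields:

```sql
-- Pair each waiting lock request with the granted lock
-- on the same relation that is standing in its way.
SELECT blocked.pid AS blocked_pid,
       blocking.pid AS blocking_pid,
       blocking.mode AS blocking_mode,
       blocked.relation::regclass AS locked_table
  FROM pg_locks blocked
  JOIN pg_locks blocking
    ON blocking.relation = blocked.relation
   AND blocking.pid != blocked.pid
 WHERE NOT blocked.granted
   AND blocking.granted;
```

In our storm, this would show every stuck writer pointing at the TRUNCATE, and the TRUNCATE pointing at the idle-in-transaction reader.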

Sensing an issue, developers, operators, or system administrators might simply start bouncing applications or even Postgres itself in a frantic effort to clear the logjam. In that case, we can’t check pg_stat_activity for the cause, and all we have are the logs. Remember those? Let’s see what they have to say about things, given we know roughly when the problem occurred:

grep 11:42 /var/log/postgresql/postgresql-9.5-main.log

25956|postgres|postgres|::1(49763)|2016-08-19 11:42:21 CDT|LOG:  duration: 660488.381 ms  statement: TRUNCATE TABLE foo;

Introducing our third vector, a periodic cleanup job that empties a shared table before an extremely aggressive bulk operation fills it again. Before this process began, our read-only script started running and was calmly processing query results, unaware it had inadvertently shut down the entire company.

How? Remember that access share lock it’s holding while idle in a transaction? Well, the TRUNCATE command requires an access exclusive lock, which it can’t get. It needs the exclusive lock because it must annihilate the table and replace it with a shady clone. While it waits for a lock it will never receive, the aggressive bulk loading application consumes all available connections with inserts that will never complete.

The inserts will never complete because each one needs a row exclusive lock on the table, and Postgres grants lock requests in order. Since the pending TRUNCATE is already queued for its access exclusive lock, every subsequent writer lines up behind it. Meanwhile, the loading engine dutifully eats up all available connections with subsequent attempts that simply exacerbate the problem. Once those connections are gone, standard applications can no longer connect at all for unrelated work, and suddenly the world stops.

All because of a single unnecessary transaction. If this sounds ridiculous and contrived, consider the actors. All of them operate independently of one another and function properly in a vacuum. It’s hilariously easy for that to go awry. I’ve seen this happen dozens of times in as many separate organizations across developers at all experience levels. I’ve never encountered a database-driven app that didn’t eventually trigger a DDoS on itself through some unlikely interaction between separate components.

So how can we solve this conundrum? The easiest way is to simply stop using implicit transactions wherever they may lurk. Most—if not all—database connectors provide an attribute or method that completely eliminates this behavior. In Python for example, we could tell psycopg2 we want to handle our own transactions:

import psycopg2

conn = psycopg2.connect("my connection string")
conn.autocommit = True

cur = conn.cursor()

# Some other db-related write code here

This reverts Postgres to standard operation: all queries resolve immediately unless commands need to be grouped together. In those cases, we can control that explicitly. This solves our first problem of dangling transactions.
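This connector-level toggle isn’t unique to psycopg2. As a runnable illustration using only Python’s standard library, the sqlite3 driver exhibits the identical pattern: it silently opens implicit transactions by default, and a single attribute switches it to autocommit. The table and values here are just stand-ins for demonstration:

```python
import sqlite3

# Default behavior: the driver silently opens a transaction
# before the first data-modifying statement.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE foo (bar INTEGER)")
cur.execute("INSERT INTO foo (bar) VALUES (42)")

# The INSERT is now sitting inside an open, implicit transaction:
# the exact dangling-transaction pattern described above.
print(conn.in_transaction)  # True

# Autocommit mode: isolation_level=None is sqlite3's rough
# equivalent of psycopg2's conn.autocommit = True.
auto = sqlite3.connect(":memory:", isolation_level=None)
auto.execute("CREATE TABLE foo (bar INTEGER)")
auto.execute("INSERT INTO foo (bar) VALUES (42)")

# No transaction is left open; each statement resolved immediately.
print(auto.in_transaction)  # False
```

Whatever the language or driver, the lesson is the same: know whether your connector is holding a transaction open on your behalf.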

Next, applications and scripts need to clean up after themselves. Even in environments with garbage collection, if an insert procedure gets stuck, the whole stack needs to be canceled and cleaned up before invoking subsequent attempts. This prevents the kind of out-of-control resource consumption that starves out other access vectors. The Postgres DBA can assist here by setting role-level connection limits:

-- "bulk_loader" is a stand-in for whatever role the loader uses.
ALTER ROLE bulk_loader CONNECTION LIMIT 20;

This won’t prevent misbehaving scripts from choking themselves to death, but the rest of the application stack should continue unaffected. Of course, if different applications are sharing accounts, that needs to be stopped immediately. We can’t use high granularity controls on coarse access vectors, otherwise we’ve just moved the problem instead of solving it.
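One more guard is worth considering for the cleanup job itself. Setting lock_timeout before the TRUNCATE means it gives up quickly instead of queueing behind an idle transaction and taking all of the insert traffic hostage with it; the two-second value here is just an illustrative choice:

```sql
-- Abort the TRUNCATE if the access exclusive lock isn't
-- granted within two seconds, rather than blocking all writers.
SET lock_timeout = '2s';
TRUNCATE TABLE foo;
```

The job still needs retry logic to handle the resulting error, but a failed cleanup run is a far smaller problem than a company-wide outage.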

Above all, remember that this is only one possible scenario that requires forensics to track down and eliminate. The tools demonstrated here, logs, activity views, and lock views, can identify most variants of this class of problem. We were even able to derive recommended practices to prevent future issues based on our observations. That balanced approach emerges naturally from Postgres itself; it’s a stage where everyone has a role to play.

Except implicit transactions; those things are cancer.