PG Phriday: Who Died and Made You Boss?! (The Investigating!)

The Postgres system catalog is a voluminous tome of intriguing metadata both obvious and stupendously esoteric. When inheriting a Postgres database infrastructure from another DBA, sometimes it falls upon us to dig into the writhing confines to derive a working knowledge of its lurking denizens. The trick is to do this before they burst forth and douse us with the database’s sticky innards and it experiences a horrible untimely demise.

Bane of unwatched databases everywhere.

To prevent that from happening, it’s a good idea to check various system views on occasion. The alternative isn’t always outright disaster, but why take the chance?

An ideal place to start is the pg_stat_activity view. It tells us what each session is (or was) doing, where it originated, who owns it, and a myriad of other juicy details. There’s just one problem:

SELECT pid, usename, query FROM pg_stat_activity;

  pid  | usename  |                       query                       
-------+----------+---------------------------------------------------
 11415 | postgres | 
   482 | sthomas  | SELECT pid, usename, query FROM pg_stat_activity;

While logged on as an unprivileged user, we can only see our own activity. The query column is protected such that only superusers can view its contents for all users. This is a security measure since it’s possible a query has sensitive information embedded somewhere. However, there are a lot of useful contextual or debugging elements in that field that a service monitor or other automated tool might find illuminating.

How do we give a user access to this data—automated or otherwise—without committing the greatest of sins by making them a superuser? One common technique is to simply wrap the view with a function that returns rows. Like this:

CREATE ROLE monitor;

CREATE OR REPLACE FUNCTION public.pg_stat_activity()
RETURNS SETOF pg_catalog.pg_stat_activity
AS $$
  SELECT * FROM pg_catalog.pg_stat_activity;
$$ LANGUAGE sql SECURITY DEFINER;

REVOKE ALL ON FUNCTION public.pg_stat_activity() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION public.pg_stat_activity() TO monitor;

GRANT monitor TO sthomas;

Note that we also created a role that we can use for this kind of elevated access. By granting access to the role, we have an abstract set of privileges we can grant to, or revoke from, other users. Why track down every single grant a user might have, when we can just revoke a few roles instead?
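
The flip side is just as convenient. If that elevated access ever needs to go away, a single statement handles it, no matter how many objects the role covers:

REVOKE monitor FROM sthomas;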

The function is set as a SECURITY DEFINER so it runs as the user who owns it. The presumption here is that we create the function as another superuser (usually postgres), and then the query column is no longer protected for users given access to the function. Be wary when doing this however! Did you notice that we prepended the pg_catalog schema in the SELECT statement itself? This prevents the user from changing their search path and tricking the system into giving them superuser access to a view that’s really a select on a sensitive table. Sneaky!

We also needed to explicitly revoke execute access from PUBLIC, the implicit group that includes every user. By default, functions created in Postgres can be executed by anyone. There are ways to change that, but most DBAs either don’t know about them or haven’t bothered. As a consequence, there are a lot of functions out there that are unnecessarily promiscuous. Elevated functions especially need this step!
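
One of those ways, available since Postgres 9.0, is ALTER DEFAULT PRIVILEGES. A minimal sketch for the public schema might look like this; note that it only affects functions created afterward by the role issuing it, so existing functions still need an explicit REVOKE:

ALTER DEFAULT PRIVILEGES IN SCHEMA public
    REVOKE EXECUTE ON FUNCTIONS FROM PUBLIC;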

After granting access to the new role, we can attempt the previous activity query again:

SELECT pid, usename, query FROM pg_stat_activity();

  pid  | usename  |                       query                       
-------+----------+---------------------------------------------------
 11415 | postgres | GRANT monitor TO sthomas;
   482 | sthomas  | SELECT pid, usename, query FROM pg_stat_activity();

Success! It feels inherently wrong to deliberately circumvent that kind of security measure, but we do as needs must. At least we did it safely. Having access to the query column is important in several contexts, especially if our application stack isn’t particularly sensitive.

Now that we have curated access to session activity, we may also need to observe how the database is working with its hardware resources. When multiple queries access data in shared memory for instance, knowing the contents might help us size it properly. It’s possible to access this information by activating the pg_buffercache extension.

With that knowledge firmly in hand and a bit of window function magic, here’s a query we can put to work. In this case, I created a benchmark database and ran a couple of iterations against it to generate some buffer activity.

CREATE EXTENSION pg_buffercache;

SELECT c.oid::REGCLASS::TEXT,
       round(count(*) * 8.0 / 1024, 2) AS mb_used,
       round(c.relpages * 8.0 / 1024, 2) AS object_mb,
       round(count(*) * 100.0 / c.relpages, 2) AS object_pct,
       round(count(*) * 100.0 / sum(count(*)) OVER (), 2) AS buffer_pct
  FROM pg_buffercache b
  JOIN pg_class c USING (relfilenode)
  JOIN pg_namespace n ON (c.relnamespace = n.oid)
 WHERE n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
 GROUP BY c.oid, c.relpages
 ORDER BY 2 DESC
 LIMIT 10;

          oid          | mb_used | object_mb | object_pct | buffer_pct 
-----------------------+---------+-----------+------------+------------
 pgbench_accounts_pkey |    3.22 |     21.45 |      15.01 |      53.16
 pgbench_accounts      |    2.75 |    128.08 |       2.15 |      45.42
 pgbench_history       |    0.05 |      0.62 |       8.86 |       0.90
 pgbench_tellers       |    0.02 |      0.01 |     200.00 |       0.26
 pgbench_branches      |    0.02 |      0.01 |     200.00 |       0.26

The database here is only working with 8MB of shared buffers, so there isn’t a whole lot of room for actual database caching. However, we can see that the accounts primary key is extremely active, yet only 15% of it is cached. From this information, it’s clear we should increase shared memory to accommodate the full size of the accounts primary key, and possibly a larger portion of the accounts table.
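
For what it’s worth, a change like that can be made with ALTER SYSTEM on Postgres 9.4 and above. The 256MB figure below is purely illustrative, and shared_buffers requires a full restart to take effect:

ALTER SYSTEM SET shared_buffers = '256MB';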

There are, of course, other ways we can view table activity in Postgres. Consider for instance that Postgres constantly aggregates statistics as tables are written to, or read from. The statistics collector is actually rather prolific.

This query is a great way of seeing which tables are the most active:

SELECT relname AS table_name, sum(n_tup_ins) AS inserts,
       sum(n_tup_upd) AS updates,
       sum(n_tup_del) AS deletes
  FROM pg_stat_user_tables
 GROUP BY 1
 ORDER BY sum(n_tup_ins + n_tup_upd + n_tup_del) DESC
 LIMIT 10;

    table_name    | inserts | updates | deletes 
------------------+---------+---------+---------
 pgbench_accounts | 1000000 |   15431 |       0
 pgbench_tellers  |     100 |   15431 |       0
 pgbench_branches |      10 |   15431 |       0
 pgbench_history  |   15431 |       0 |       0

Remember that benchmark I mentioned earlier? We can actually see how many rows each table started with, how many were modified during the benchmark itself, and the fact that the history table was part of the benchmark transaction.

These statistics can be lost in a number of ways, since they’re generally only relevant while the server is actually running. We can also reset them manually with the pg_stat_reset() function, in case we’re doing more active forensics and want to watch live accumulation.
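
The reset itself is as simple as it sounds, but remember that it clears the accumulated counters for the entire current database, which any monitoring tools watching those numbers will definitely notice:

SELECT pg_stat_reset();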

And this is only the write data. What about reads? Elements such as sequential or index scans are equally important if we want to know how well our indexes are performing, or identify tables that might need more or better indexes.

Here’s a different use of the same stat table:

SELECT s.relid::REGCLASS::TEXT AS table_name,
       s.seq_scan, s.idx_scan,
       s.idx_tup_fetch, c.reltuples AS row_count
  FROM pg_stat_user_tables s
  JOIN pg_class c ON (c.oid = s.relid)
 ORDER BY s.seq_scan DESC, idx_scan DESC
 LIMIT 10;

    table_name    | seq_scan | idx_scan | idx_tup_fetch | row_count 
------------------+----------+----------+---------------+-----------
 pgbench_branches |    15433 |        0 |             0 |        10
 pgbench_tellers  |    15431 |        0 |             0 |       100
 pgbench_accounts |        3 |    30862 |         30862 |     1e+06
 pgbench_history  |        0 |          |               |     23148

Those numbers line up pretty well. The fact that the account table has three sequential scans is because I was trying to force the data into the shared buffers, so I read the entire contents of the table a few times. Since the branch and teller tables are so small, reading their entire contents is probably the most efficient approach, though it may warrant investigation later if the behavior persists.

If we focus on the account table where all the real activity is taking place, it’s index fetches at all times, except for my manual scans. For a 1-million row table, that’s exactly what we want to see. The fact that there’s a 1-1 relationship with the number of scans to fetches suggests they’re single fetches from the primary key, and also tells us we have good selectivity.
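
If we wanted to turn that kind of observation into a quick check for index candidates, a rough heuristic like this might work. The thresholds are arbitrary guesses, and small tables are excluded because sequential scans are perfectly reasonable there:

SELECT relid::REGCLASS::TEXT AS table_name,
       seq_scan, idx_scan, n_live_tup
  FROM pg_stat_user_tables
 WHERE seq_scan > COALESCE(idx_scan, 0)
   AND n_live_tup > 10000
 ORDER BY seq_scan DESC
 LIMIT 10;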

And now that we know a lot of information about our tables, it’s time to move on to the users of those tables.

It only seems that way

More specifically, consider roles. In Postgres, the best practice is to create a role and grant it access to multiple tables, functions, or views. Then we grant the role to the specific users that require that access. This is a lot safer than direct assignments because it’s far easier to revoke and to share among related users. Say, multiple people in a reporting department, for instance.

But what if we’re given a username and want to see a full list of what they can actually access? That’s not as obvious as it might seem at first glance. Let’s do some permissions magic starting with this example group nesting.

CREATE ROLE benchmark;
CREATE ROLE nested_benchmark;

GRANT benchmark TO nested_benchmark;
GRANT nested_benchmark TO sthomas;

GRANT SELECT ON ALL TABLES IN SCHEMA public TO benchmark;

SELECT table_schema, table_name
  FROM information_schema.table_privileges
 WHERE grantee IN ('sthomas', 'nested_benchmark');

 table_schema | table_name 
--------------+------------
(0 rows)

What the what!? Just to be clear, the sthomas user does have full read access to all of the pgbench tables. So why aren’t they showing up in the list?

As handy as the information schema is, a critical flaw is that it only considers directly assigned privileges. Nested roles completely circumvent its ability to report access. We need a way to fix that.

Enter the wondrous recursive CTE syntax. We can create a view that “flattens” the role nesting:

CREATE OR REPLACE VIEW v_recursive_group_list AS
WITH RECURSIVE all_groups AS
(
  SELECT r.rolname AS user_name, g.rolname AS group_name
    FROM pg_authid r
    JOIN pg_auth_members m on (m.member=r.oid)
    JOIN pg_authid g on (m.roleid=g.oid)
  UNION ALL
  SELECT ag.user_name, g.rolname AS group_name
    FROM pg_authid r
    JOIN pg_auth_members m on (m.member=r.oid)
    JOIN pg_authid g on (m.roleid=g.oid)
    JOIN all_groups ag ON (r.rolname = ag.group_name)
)
SELECT * FROM all_groups;

Again, this isn’t as difficult as it looks. We’re just bootstrapping the results with the existing list of user/role associations. The UNION ALL combines each of these with successive levels of nesting until we’ve exhausted all of them. The result is a username/group row for every role granted to a user, every role granted to that role, and so on.

That’s something we can combine with the information schema to actually derive the full list of objects a user can access. Let’s see how that works:

SELECT *
  FROM v_recursive_group_list
 WHERE user_name = 'sthomas';

 user_name |    group_name    
-----------+------------------
 sthomas   | nested_benchmark
 sthomas   | monitor
 sthomas   | benchmark

SELECT t.table_schema, t.table_name
  FROM v_recursive_group_list gl
  JOIN information_schema.table_privileges t ON (t.grantee IN (gl.user_name, gl.group_name))
 WHERE gl.user_name = 'sthomas';

 table_schema |    table_name    
--------------+------------------
 public       | pgbench_tellers
 public       | pgbench_branches
 public       | pgbench_accounts
 public       | pgbench_history
 public       | pg_buffercache

Huzzah! Now we know what sessions are doing, how active tables are, and have a firm grasp of basic security and exposure. What else is left? How about locks?

There’s a deadlock somewhere in here…

Yes, locks. Sometimes session activity gets out of hand, or transactions get a little too fast and furious, and things go somewhat awry. Being able to unravel that mess is essential to keeping a database operating smoothly.

If we combine the contents of pg_stat_activity and pg_locks, we can get a lot of additional information regarding session activity. Before we were only interested in the query that was running and maybe some surrounding context. Now we can see exactly which tables, indexes, views, and other objects are locked, what kind of lock is involved, and so on.

This is probably one of my favorite views to use for this kind of work:

CREATE OR REPLACE VIEW v_activity_locks AS
SELECT a.pid, s.mode, s.locktype, a.wait_event, a.state,
       a.usename, a.query_start::TIMESTAMP(0), a.client_addr,
       now() - a.query_start as time_used, a.query, s.tables
  FROM pg_stat_activity() a
  LEFT JOIN (
         SELECT pid AS pid, mode, locktype,
                string_agg(relname::text, ', ') AS tables
           FROM (SELECT l.pid, l.mode, l.locktype, c.relname
                   FROM pg_locks l
                   JOIN pg_class c ON (l.relation=c.oid)
                  WHERE c.relkind = 'r'
                  ORDER BY pid, relname) agg
          GROUP BY 1, 2, 3
       ) s USING (pid);

GRANT SELECT ON v_activity_locks TO monitor;

SELECT pid, usename, state,
       extract(minutes FROM time_used) AS minutes,
       substring(query, 1, 30) AS query_part,
       tables
  FROM v_activity_locks;

  pid  | usename  | state  | minutes |          query_part           |              tables              
-------+----------+--------+---------+-------------------------------+----------------------------------
  1927 | postgres | idle   |         |                               | 
 28305 | postgres | idle   |      49 | SELECT c.oid::REGCLASS::TEXT, | 
 11415 | postgres | idle   |       4 | grant monitor to sthomas;     | 
 16022 | sthomas  | active |       0 | SELECT pid, usename, state,   | pg_authid, pg_class, pg_database

Despite being horrifically ugly—and hence why it’s a view—I prefer it to simply joining the two tables. The magic is in the tables column, which lists every table that is locked by the session, in alphabetical order, all in the same result. No more tracing through multiple rows to see all of the participants in a locking mess, it’s all there for everyone to see. One line per session. Add a filter to remove idle queries. Maybe an extra predicate to only consider queries which have been active for more than a few seconds. In the end, we have a single query that can instantly identify areas worthy of additional forensics, and all involved resources.
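
As a hypothetical example of those extra predicates, something like this drops the idle sessions and only keeps queries that have been running for more than five seconds; the threshold is just a guess and should be tuned to taste:

SELECT pid, usename, state, time_used, query, tables
  FROM v_activity_locks
 WHERE state != 'idle'
   AND time_used > INTERVAL '5 seconds'
 ORDER BY time_used DESC;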

While we’re thinking about locks, the previous view only really told us what resources a particular session was using and perhaps how it was doing so. If we sorted these results by the query_start column, it wouldn’t be too difficult to see whether or not one session was blocking another. But it’s not exactly a scenario that requires an alibi, and related activity can obscure our results if the database is particularly busy.

Thanks to the newly available pg_blocking_pids function in Postgres 9.6, we can actually see what is blocking a certain action. Before this function was introduced, the only way to figure out what was causing the block was to trace which connections were using the same resources and had their locks granted, versus those that didn’t. In a sufficiently busy system, this wasn’t necessarily a causal relationship, but it provided a good starting point. Now we can see exactly what’s causing the block, and we can use that information to our benefit.

Here’s an example of a lingering modification in a transaction that caused a block for another session:

\x

SELECT DISTINCT l1.pid AS blocker_pid, a.query AS blocker_query,
       a.usename AS blocker_user, a.client_addr AS blocker_client,
       l2.pid AS blocked_pid, a2.query AS blocked_query,
       a2.usename AS blocked_user, a2.client_addr AS blocked_client
  FROM pg_locks l1
  JOIN pg_stat_activity() a on (a.pid = l1.pid)
  JOIN pg_locks l2 ON (l1.pid = ANY(pg_blocking_pids(l2.pid)))
  JOIN pg_stat_activity() a2 on (a2.pid = l2.pid)
 WHERE l1.granted
   AND NOT l2.granted;

-[ RECORD 1 ]--+------------------------------------------
blocker_pid    | 11415
blocker_query  | ALTER TABLE pgbench_accounts ADD foo INT;
blocker_user   | postgres
blocker_client | 
blocked_pid    | 12337
blocked_query  | SELECT count(*) FROM pgbench_accounts ;
blocked_user   | sthomas
blocked_client | 

Instead of using the pg_locks view and pg_stat_activity() function, we could use the v_activity_locks view to see both the blocked session and what blocked it on the same row. This would be indispensable in a really busy system, especially if a single connection has blocked several others. Such an occurrence would practically glow and demand immediate attention. Heck, it would even make an ideal automated system check.
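
That combination might look roughly like this sketch, joining the view to itself through pg_blocking_pids so the blocker and its victim land on the same row:

SELECT blocker.pid AS blocker_pid, blocker.query AS blocker_query,
       blocked.pid AS blocked_pid, blocked.query AS blocked_query,
       blocker.tables
  FROM v_activity_locks blocked
  JOIN v_activity_locks blocker
    ON (blocker.pid = ANY (pg_blocking_pids(blocked.pid)));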

And isn’t that what all of this really boils down to? Some of it requires direct observation. Tracing user access makes sense during security audits, and the table stats can help with optimizing the database itself in many cases. But the rest is really just novel ways of combining the system views to produce quantifiable data that stands out in some way. How many queries have been running for over a minute? How many sessions have been waiting on a lock for over ten seconds?
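
Questions like those translate almost directly into a check query. The thresholds below are arbitrary examples, and the FILTER clause assumes Postgres 9.4 or later:

SELECT count(*) FILTER (WHERE state = 'active'
                          AND now() - query_start > INTERVAL '1 minute') AS long_queries,
       count(*) FILTER (WHERE wait_event_type = 'Lock'
                          AND now() - query_start > INTERVAL '10 seconds') AS long_lock_waits
  FROM pg_stat_activity();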

Those first approximations are the bloodhound to a seasoned DBA. It would be silly to run all of these manually on a regular basis, but when there is a problem, all bets are off. These queries and views barely even scratch the surface of what’s really available in the Postgres system catalog. The Postgres metadata isn’t just there to reflect object organization, and curiosity goes a long way when exploring it.

So follow that bloodhound to the crime scene. There are more than enough tools available to identify, apprehend, and even punish the culprit. Or maybe just explore the stats and tweak to wring extra performance from the existing tables. This should be a good start for doing either, and infinitely more.