PG Phriday: Who Died and Made You Boss?! (The Investigatining!)

The Postgres system catalog is a voluminous tome of intriguing metadata both obvious and stupendously esoteric. When inheriting a Postgres database infrastructure from another DBA, sometimes it falls upon us to dig into the writhing confines to derive a working knowledge of its lurking denizens. The trick is to do this before they burst forth and douse us with the database’s sticky innards and it experiences a horrible untimely demise.

Bane of unwatched databases everywhere.

To prevent that from happening, it’s a good idea to check various system views on occasion. The alternative isn’t always outright disaster, but why take the chance?

An ideal place to start is the pg_stat_activity view. It tells us what each session is (or was) doing, where it originated, who owns it, and a myriad of other juicy details. There’s just one problem:

SELECT pid, usename, query FROM pg_stat_activity;
  pid  | usename  |                       query                       
 11415 | postgres | <insufficient privilege>
   482 | sthomas  | SELECT pid, usename, query FROM pg_stat_activity;

While logged on as an unprivileged user, we can only see our own activity. The query column is protected such that only superusers can view its contents for all users. This is a security measure since it’s possible a query has sensitive information embedded somewhere. However, there are a lot of useful contextual or debugging elements in that field that a service monitor or other automated tool might find illuminating.

How do we give a user access to this data—automated or otherwise—without committing the greatest of sins by making them a superuser? One common technique is to simply wrap the view with a function that returns rows. Like this:

CREATE ROLE monitor;
CREATE OR REPLACE FUNCTION public.pg_stat_activity()
RETURNS SETOF pg_catalog.pg_stat_activity
AS $$
  SELECT * FROM pg_catalog.pg_stat_activity;
REVOKE ALL ON FUNCTION public.pg_stat_activity() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION public.pg_stat_activity() TO monitor;
GRANT monitor TO sthomas;

Note that we also created a role that we can use for this kind of elevated access. By granting access to the role, we have an abstract set of privileges we can grant to, or revoke from, other users. Why track down every single grant a user might have, when we can just revoke a few roles instead?

The function is set as a SECURITY DEFINER so it runs as the user who owns it. The presumption here is that we create the function as another superuser (usually postgres), and then the query column is no longer protected for users given access to the function. Be wary when doing this however! Did you notice that we prepended the pg_catalog schema in the SELECT statement itself? This prevents the user from changing their search path and tricking the system into giving them superuser access to a view that’s really a select on a sensitive table. Sneaky!

We also needed to explicitly revoke execution access from the PUBLIC domain of all users. By default, functions created in Postgres can be executed by anyone. There are ways to change this, but most DBAs either don’t know about it, or haven’t done so. As a consequence, there are a lot of functions out there that are unnecessarily promiscuous. Elevated functions especially need this step!

After granting access to the new role, we can attempt the previous activity query again:

SELECT pid, usename, query FROM pg_stat_activity();
  pid  | usename  |                       query                       
 11415 | postgres | GRANT monitor TO sthomas;
   482 | sthomas  | SELECT pid, usename, query FROM pg_stat_activity();

Success! It feels inherently wrong to deliberately circumvent that kind of security measure, but we do as needs must. At least we did it safely. Having access to the query column is important in several contexts, especially if our application stack isn’t particularly sensitive.

Now that we have curated access to session activity, we may also need to observe how the database is working with its hardware resources. When multiple queries access data in shared memory for instance, knowing the contents might help us size it properly. It’s possible to access this information by activating the pg_buffercache extension.

With that knowledge firmly in hand and a bit of window function magic, here’s a query that we might work. In this case, I created a benchmark database and ran a couple iterations on it to generate some buffer activity.

CREATE EXTENSION pg_buffercache;
       round(COUNT(*) * 8.0 / 1024, 2) AS mb_used,
       round(c.relpages * 8.0 / 1024, 2) AS object_mb,
       round(COUNT(*) * 100.0 / c.relpages, 2) AS object_pct,
       round(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS buffer_pct
  FROM pg_buffercache b
  JOIN pg_class c USING (relfilenode)
  JOIN pg_namespace n ON (c.relnamespace = n.oid)
 WHERE n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
 GROUP BY c.oid, c.relpages
 LIMIT 10;
          oid          | mb_used | object_mb | object_pct | buffer_pct 
 pgbench_accounts_pkey |    3.22 |     21.45 |      15.01 |      53.16
 pgbench_accounts      |    2.75 |    128.08 |       2.15 |      45.42
 pgbench_history       |    0.05 |      0.62 |       8.86 |       0.90
 pgbench_tellers       |    0.02 |      0.01 |     200.00 |       0.26
 pgbench_branches      |    0.02 |      0.01 |     200.00 |       0.26

The database here is only using the default 8MB buffer, so there isn’t a whole lot of room for actual database caching. However, we can see that the account table is extremely active, and only 15% of it is cached. From this information, it’s clear we should increase shared memory to accommodate the full size of the accounts primary key, and possibly a larger portion of the accounts table.

There are, of course, other ways we can view table activity in Postgres. Consider for instance that Postgres constantly aggregates statistics as tables are written to, or read from. The statistics collector is actually rather prolific.

This query is a great way of seeing which tables are the most active:

SELECT relname AS TABLE_NAME, SUM(n_tup_ins) AS inserts,
       SUM(n_tup_upd) AS updates,
       SUM(n_tup_del) AS deletes
  FROM pg_stat_user_tables
 ORDER BY SUM(n_tup_ins + n_tup_upd + n_tup_del) DESC
 LIMIT 10;
    TABLE_NAME    | inserts | updates | deletes 
 pgbench_accounts | 1000000 |   15431 |       0
 pgbench_tellers  |     100 |   15431 |       0
 pgbench_branches |      10 |   15431 |       0
 pgbench_history  |   15431 |       0 |       0

Remember that benchmark I mentioned earlier? We can actually see how many rows each table started with, how many were modified during the benchmark itself, and the fact that the history table was part of the benchmark transaction.

These statistics are lost a number of ways because they’re generally only relevant while the server is actually running. We can also manually reset them with the pg_stat_reset() function, in case we are doing more active forensics and want to see live accumulation.

And this is only the write data. What about reads? Elements such as sequential or index scans are equally important if we want to know how well our indexes are performing, or identify tables that might need more or better indexes.

Here’s a different use of the same stat table:

       s.seq_scan, s.idx_scan,
       s.idx_tup_fetch, c.reltuples AS ROW_COUNT
  FROM pg_stat_user_tables s
  JOIN pg_class c ON (c.oid = s.relid)
 ORDER BY s.seq_scan DESC, idx_scan DESC
 LIMIT 10;
    TABLE_NAME    | seq_scan | idx_scan | idx_tup_fetch | ROW_COUNT 
 pgbench_branches |    15433 |        0 |             0 |        10
 pgbench_tellers  |    15431 |        0 |             0 |       100
 pgbench_accounts |        3 |    30862 |         30862 |     1e+06
 pgbench_history  |        0 |          |               |     23148

Those numbers line up pretty well. The fact that the account table has three sequential scans is because I was trying to force the data into the shared buffers, so I read the entire contents of the table a few times. Since the branch and teller tables are so small, reading their entire contents is probably the most efficient approach, though it may warrant investigation later if the behavior persists.

If we focus on the account table where all the real activity is taking place, it’s index fetches at all times, except for my manual scans. For a 1-million row table, that’s exactly what we want to see. The fact that there’s a 1-1 relationship with the number of scans to fetches suggests they’re single fetches from the primary key, and also tells us we have good selectivity.

And now that we know a lot of information about our tables, it’s time to move on to the users of those tables.

It only seems that way

More specifically, consider roles. In Postgres, best practices are to create a role and grant it access for multiple tables, functions, or views. Then we would grant the role to specific users that require that access. This is a lot safer than having direct assignments because it’s far easier to revoke and share among related users. Say, multiple people in a reporting department for instance.

But what if we’re given a username and want to see a full list of what they can actually access? That’s not as obvious as it might seem at first glance. Let’s do some permissions magic starting with this example group nesting.

CREATE ROLE benchmark;
CREATE ROLE nested_benchmark;
GRANT benchmark TO nested_benchmark;
GRANT nested_benchmark TO sthomas;
SELECT table_schema, TABLE_NAME
  FROM information_schema.table_privileges
 WHERE grantee IN ('sthomas', 'nested_benchmark');
 table_schema | TABLE_NAME 
(0 ROWS)

What the what!? Just to be clear, the sthomas user does have full read access to all of the pgbench tables. So why aren’t they showing up in the list?

As handy as the information schema is, a critical flaw is that it only considers directly assigned privileges. Nested roles completely circumvent its ability to report access. We need a way to fix that.

Enter the wondrous recursive CTE syntax. We can create a view that “flattens” the role nesting:

CREATE OR REPLACE VIEW v_recursive_group_list AS
  SELECT r.rolname AS user_name, g.rolname AS group_name
    FROM pg_authid r
    JOIN pg_auth_members m ON (m.member=r.oid)
    JOIN pg_authid g ON (m.roleid=g.oid)
  SELECT ag.user_name, g.rolname AS group_name
    FROM pg_authid r
    JOIN pg_auth_members m ON (m.member=r.oid)
    JOIN pg_authid g ON (m.roleid=g.oid)
    JOIN all_groups ag ON (r.rolname = ag.group_name)
SELECT * FROM all_groups;

Again, this isn’t as difficult as it looks. We’re just bootstrapping the results with the existing list of user/role associations. The UNION combines each of these with successive levels of nesting until we’ve exhausted all of them. The result is a username/group row for every role that is granted to a user, or a role which was granted to that role, and so on.

That’s something we can combine with the information schema to actually derive the full list of objects a user can access. Let’s see how that works:

  FROM v_recursive_group_list
 WHERE user_name = 'sthomas';
 user_name |    group_name    
 sthomas   | nested_benchmark
 sthomas   | monitor
 sthomas   | benchmark
SELECT t.table_schema, t.table_name
  FROM v_recursive_group_list gl
  JOIN information_schema.table_privileges t ON (t.grantee IN (gl.user_name, gl.group_name))
 WHERE gl.user_name = 'sthomas';
 table_schema |    TABLE_NAME    
 public       | pgbench_tellers
 public       | pgbench_branches
 public       | pgbench_accounts
 public       | pgbench_history
 public       | pg_buffercache

Huzzah! Now we know what sessions are doing, how active tables are, and have a firm grasp of basic security and exposure. What else is left? How about locks?

There’s a deadlock somewhere in here…

Yes, locks. Sometimes session activity gets out of hand, or transactions a little to fast and furious, and things go somewhat awry. Being able to unravel that mess is essential to keeping a database operating smoothly.

If we combine the contents of pg_stat_activity and pg_locks, we can get a lot of additional information regarding session activity. Before we were only interested in the query that was running and maybe some surrounding context. Now we can see exactly which tables, indexes, views, and other objects are locked, what kind of lock is involved, and so on.

This is probably one of my favorite views to use for this kind of work:

CREATE OR REPLACE VIEW v_activity_locks AS
SELECT, s.mode, s.locktype, a.wait_event, a.state,
       a.usename, a.query_start::TIMESTAMP(0), a.client_addr,
       now() - a.query_start AS time_used, a.query, s.tables
  FROM pg_stat_activity() a
         SELECT pid AS pid, mode, locktype,
                string_agg(relname::text, ', ') AS TABLES
           FROM (SELECT, l.mode, l.locktype, c.relname
                   FROM pg_locks l
                   JOIN pg_class c ON (l.relation=c.oid)
                  WHERE c.relkind = 'r'
                  ORDER BY pid, relname) agg
          GROUP BY 1, 2, 3
       ) s USING (pid);
GRANT SELECT ON v_activity_locks TO monitor;
SELECT pid, usename, state,
       EXTRACT(minutes FROM time_used) AS minutes,
       SUBSTRING(query, 1, 30) AS query_part,
  FROM v_activity_locks;
  pid  | usename  | state  | minutes |          query_part           |              TABLES              
  1927 | postgres | idle   |         |                               | 
 28305 | postgres | idle   |      49 | SELECT c.oid::REGCLASS::TEXT, | 
 11415 | postgres | idle   |       4 | GRANT monitor TO sthomas;     | 
 16022 | sthomas  | active |       0 | SELECT pid, usename, state,   | pg_authid, pg_class, pg_database

Despite being horrifically ugly—and hence why it’s a view—I prefer it to simply joining the two tables. The magic is in the tables column, which lists every table that is locked by the session, in alphabetical order, all in the same result. No more tracing through multiple rows to see all of the participants in a locking mess, it’s all there for everyone to see. One line per session. Add a filter to remove idle queries. Maybe an extra predicate to only consider queries which have been active for more than a few seconds. In the end, we have a single query that can instantly identify areas worthy of additional forensics, and all involved resources.

While we’re thinking about locks, the previous view only really told us what resources a particular session was using and perhaps how it was doing so. If we sorted these results by the query_start column, it wouldn’t be too difficult to see whether or not one session was blocking another. But it’s not exactly a scenario that requires an alibi, and related activity can obscure our results if the database is particularly busy.

Thanks to the newly available pg_blocking_pids function in Postgres 9.6, we can actually see what is blocking a certain action. Before this function was introduced, the only way to figure out what was causing the block was to trace which connections were using the same resources and had their locks granted, versus those that didn’t. In a sufficiently busy system, this wasn’t necessary a causal relationship, but it provided a good starting point. Now we can see exactly what’s causing the block, and we can use that information to our benefit.

Here’s an example of a lingering modification in a transaction that caused a block for another session:

SELECT DISTINCT AS blocker_pid, a.query AS blocker_query,
       a.usename AS blocker_user, a.client_addr AS blocker_client, AS blocked_pid, a2.query AS blocked_query,
       a2.usename AS blocked_user, a2.client_addr AS blocked_client
  FROM pg_locks l1
  JOIN pg_stat_activity() a ON ( =
  JOIN pg_locks l2 ON ( = ANY(pg_blocking_pids(
  JOIN pg_stat_activity() a2 ON ( =
 WHERE l1.granted
   AND NOT l2.granted;
-[ RECORD 1 ]--+------------------------------------------
blocker_pid    | 11415
blocker_query  | ALTER TABLE pgbench_accounts ADD foo INT;
blocker_user   | postgres
blocker_client | 
blocked_pid    | 12337
blocked_query  | SELECT COUNT(*) FROM pgbench_accounts ;
blocked_user   | sthomas
blocked_client |

Instead of using the pg_locks view and pg_stat_activity() function, we could use the v_activity_locks view to see both the blocked session and what blocked it on the same row. This would be indispensable in a really busy system, especially if a single connection has blocked several others. Such an occurrence would practically glow and demand immediate attention. Heck, it would even make an ideal automated system check.

And isn’t that what all of this really boils down to? Some of this is something that requires direct observation. Tracing user access makes sense during security audits, and the table stats can help with optimizing the database itself in many cases. But the rest is really just novel ways of combining the system views to produce quantifiable data that stands out in some way. How many queries have been running for over a minute? How many sessions have been waiting on a lock for over ten seconds?

Those first approximations are the bloodhound to a seasoned DBA. It would be silly to run all of these manually on a regular basis, but when there is a problem, all bets are off. These queries and views barely even scratch the surface of what’s really available in the Postgres system catalog. The Postgres metadata isn’t just there to reflect object organization, and curiosity goes a long way when exploring it.

So follow that bloodhound to the crime scene. There are more than enough tools available to identify, apprehend, and even punish the culprit. Or maybe just explore the stats and tweak to wring extra performance from the existing tables. This should be a good start for doing either, and infinitely more.

PG Phriday: Design Pattern Workshop

Recently on the pgsql-performance mailing list, a question popped up regarding Postgres RAM usage. In this instance Pietro wondered why Postgres wasn’t using more RAM, and why his process was taking so long. There were a few insightful replies, and they’re each interesting for reasons that aren’t immediately obvious. Let’s see what is really going on here, and perhaps answer a question while we’re at it.

Pietro presents several postgresql.conf settings, but here are the ones that matter:

shared_buffers = 24GB
effective_cache_size = 72GB
work_mem = 512MB
maintenance_work_mem = 1GB

The configured system has 96GB of total RAM. Keeping that in consideration, we see that shared_buffers is set to 24GB, which is a quarter of the server’s available memory. The only potential issue here is that it’s possible a checkpoint could be a very painful experience if a significant portion of that memory becomes dirty. Since there are no complaints of high IO waits, let’s just assume that isn’t happening.

Common advice suggests setting effective_cache_size to an amount equivalent to half of system RAM + shared_buffers. A setting of 72GB looks high, but Postgres does not allocate any of that on its own. This parameter merely influences the query planner, since higher values imply an increased probability a disk page is in either Postgres or OS caches. This is an optimization setting, not a memory allocation.

A work_mem value of 512MB is extremely high. In some cases however, that’s perfectly acceptable. The amount specified here will be used for all internal query execution operations that require a memory allocation. This means sorts, hashes, bitmaps, materializations, and so on. Thus a single query can instantiate multiple 512MB segments for each necessary operation. Effectively the user has greatly reduced the likelihood Postgres will utilize disk storage for any query operations.

And finally we have maintenance_work_mem. This is so often set to 1GB, it might as well be the default. This is the amount of RAM Postgres uses when doing things like creating indexes, vacuuming, and anything else that might need vast amounts of working space to proceed. We don’t know Pietro’s procedures, but we can make some inferences later. This setting really only matters if some part of his data onboarding includes adding new indexes or triggers an autovacuum.

Then we see how memory is performing once Postgres is running:

              total        used        free      shared  buff/cache   available
Mem:            94G         28G         46G         17M         19G         64G
Swap:           15G          0B         15G
Total:         109G         28G         61G

That’s not a lot of usage. Of that 28GB, we know Postgres has 24 simply due to the setting for shared_buffers. This means that the actual queries are only consuming 4GB at most.

This is particularly relevant because if we skip to the end of the post, we see this output from top:

 9686 postgres  20   0 24.918g 214236  12628 R 100.0  0.2   2872:38 postgres
 9687 postgres  20   0 24.918g 214212  12600 R 100.0  0.2   2872:27 postgres
 9688 postgres  20   0 25.391g 709936  12708 R 100.0  0.7   2872:40 postgres
 9691 postgres  20   0 24.918g 214516  12900 R 100.0  0.2   2865:23 postgres
 9697 postgres  20   0 24.918g 214284  12676 R 100.0  0.2   2866:05 postgres

All 16 of the backends are pegged at 100% CPU, and have been for almost 48 hours each (2872/60). That definitely corroborates Pietro’s claim that this workload has been running for 48 hours. This suggests little to no waiting for disk IO during the entire process.

So will more RAM help? Probably not. Pietro says he’s doing “time series expansion and some arithmetics”. This kind of procedure doesn’t generally require vast amounts of RAM, as calculations don’t often invoke sorting or other temporary memory structures. So what’s really going on?

The second reply gets at the real heart of the matter immediately:

You’d be better off to ask for help in optimizing your queries IMHO.

Why would Scott say that?

Well, there are a lot of ways to do the same thing. In the database world, some of these are highly preferable to others. How exactly is Pietro expanding those time series, for example? He could be doing something like this:

DO $$
  a INT;
  b INT;
  new_a INT;
  new_b INT;
  FOR a, b, ... IN SELECT col_a, col_b, ...
    new_a = some_calculation(a, b, c, ..., n);
    new_b = other_calculation(a, b, c, ..., n);
    INSERT INTO target_table VALUES (new_a, new_b, ...);
$$ LANGUAGE plpgsql;

If the necessary logic is complicated enough, that might be the right approach. However, it is extremely expensive to loop through each individual row from a driver query, allocate variables, perform calculations, and invoke a single insert. This kind of process would peg a CPU at 100% while still being highly inefficient by multiple orders of magnitude.

There’s also the possibility Pietro knows this, and the calculations were fairly simple. So maybe he decided the process could be inlined a bit more. Instead of that anonymous block (or something equivalent on the application side), he decides to write helper functions to do the calculations instead:

CREATE OR REPLACE FUNCTION some_calculation(a INT, b INT, ...)
  retval INT;
  -- Some calculations here
  RETURN retval;

Being a smart guy, Pietro knows immutable function results are cached much better than otherwise, and strict functions won’t even run if a parameter is null. With these two declarations, if any invocations take the same parameters, or a parameter is empty, Postgres may not even execute the function at all. Avoiding a jump operation is a huge optimization all by itself, so consider the benefits of preventing the overhead of a function call and its associated logic.

By defining the calculations in this manner, his process would look like this instead:

INSERT INTO target_table
SELECT some_calculation(col_a, col_b, ...),
       other_calculation(col_d, col_e, ...),
  FROM ...
 WHERE ...;

Or maybe Pietro knows the value of data staging. Caching is great, but in sufficiently complicated examples of business logic, transforming it into immutable functions may be somewhat difficult. In addition, some “calculations” may rely on the results of another query. This is where temporary and unlogged tables, or CTEs come in.

Consider the fact databases natively communicate using Set Theory. They are highly optimized, and specifically engineered to bulk-process rows in ways user-generated applications simply can’t match in all but the edgiest of cases. So the final iteration of the data onboarding might resemble this:

CREATE TEMP TABLE expand_series AS
CREATE TEMP TABLE relate_series AS
  FROM expand_series
  JOIN relate_series ON (...);
INSERT INTO target_table
  FROM basic_tally
  JOIN ...
 WHERE ...;

Or maybe Pietro likes CTEs more, so uses those instead of multiple temporary table steps. Postgres will fully materialize everything anyway. The primary benefit to temporary tables is that repeated execution of multiple steps won’t require re-running all previous elements. If this were some kind of one-time report, a CTE would arguably make more sense as it operates atomically.

Regardless, this kind of iterative transformation allows us to fully embrace Set Theory. Don’t operate on one row, but all of them. If we can organize calculations into distinct categories or steps, then we’re actually requesting the database to transform the data as a whole for each operation. For this entire data set, add some derived date column. For this entire data set, produce a standard deviation variant along this window. For this entire data set, derive these additional columns from these formulae.

That’s how databases think, and Postgres is no different in this regard. By looping through each individual result, we’re actively disrupting that process and literally making it hundreds or thousands of times slower. We could have all the memory in the world, and even find some way to fully utilize it, and it wouldn’t influence that universal truth.

Now, we don’t know what Pietro is actually doing, just that he’s convinced more RAM might help. The evidence he provided suggests that isn’t the case. Having sixteen processes running at full blast for two days either means he has a massive amount of data, or there’s some unaddressed inefficiency in his overall approach.

I suspect the latter only because it’s so common. People don’t usually think like databases, and really, who can blame them? What kind of nutcase would willingly do this for a living?

Figure 1: A Nutcase

PG Phriday: CONFLICT of Interests

MySQL has had a REPLACE INTO syntax to perform “UPSERT” logic since practically the very beginning. For the longest time, users who wanted to switch to Postgres, but for whatever reason relied on this functionality, were essentially trapped. Postgres 9.5 changed all that, but why did it take so long? As with much of Postgres history, it’s a long story.

To really understand where Postgres started, we need to look at the “old” way of handling a row merge. Many in the database world have probably encountered this once or twice:

    -- First, try to update the key.
    UPDATE my_tab SET b = v
     WHERE a = k;
    END IF;
    -- The key doesn't exist, so try to insert it.
      INSERT INTO my_tab (a, b) VALUES (k, v);
    EXCEPTION WHEN unique_violation THEN
      -- Nothing here, allow the loop to continue.
$$ LANGUAGE plpgsql;
SELECT merge_tab(1, 'James');
SELECT merge_tab(1, 'Jimmy');

What on Earth is all of that? Oddly enough, the somewhat convoluted logic is not only sound, it’s actually required to avoid a race condition. In the microseconds between attempting our UPDATE and following to the INSERT, some other transaction may have inserted the “missing” key. In that case, we’d encounter a unique constraint violation.

By catching the exception, we’re not immediately kicked out of the function and are presented with a choice. Do we assume our value is “right” and repeat the loop to apply the update, or just exit silently under the assumption that the successful transaction that beat us is probably fine? This particular function selected the previous assertion because that’s what a merge or upsert tries to guarantee: that the requested action is applied. Were we to omit the loop, the exception block would ensure there was no conflict or fatal error, but we could no longer rely on the function operating as advertised.

So why not invert the logic and remove the loop entirely? After all, we could just attempt the insert and if it fails, perform the update within the exception block, right? Actually no. Consider what happens if the target key is deleted by a concurrent transaction. Say we try our insert, and in the space of time between the key violation and our update, it gets deleted. Suddenly our update also produces an error. That’s probably an extremely unlikely edge case, but in OLTP databases, the unlikely becomes frighteningly common. So to be safe, we’re stuck with the loop.

Don’t think about it too much

That is a lot of overhead for what many consider basic functionality. Since that’s no longer a concern, let’s take a look at the actual syntax the Postgres team selected. To do that, let’s start with a very basic table with a handful of rows:

CREATE TABLE symbol_mapper (
  vendor_id    BIGINT   NOT NULL,
  ext_mapping  VARCHAR  NOT NULL,
  symbol       VARCHAR  NOT NULL,
  PRIMARY KEY (vendor_id, ext_mapping)
INSERT INTO symbol_mapper VALUES (1, 'Google', 'GOOGL');
INSERT INTO symbol_mapper VALUES (1, 'Apple', 'AAPL');
INSERT INTO symbol_mapper VALUES (2, 'goo', 'GOOGL');
INSERT INTO symbol_mapper VALUES (2, 'app', 'AAPL');
ANALYZE symbol_mapper;

The purpose of a mapping table is to fill the role of decoding external names or lookup values to match internal ones. Since each vendor may have its own designation structure, we require a mapping for each. That also protects us in case two vendors use the same identifiers.

So far we have a fairly standard application of tables. Now let’s do something interesting:

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (2, 'app', 'AAPL')

In this particular case, the “aap” mapping already exists for vendor 2, so no insert takes place. This is the equivalent of not wrapping our insert/update with a loop. We don’t care what the value is, just as long as something is there. In reality, this is more of a way to remove error output from violations than anything immediately useful.

The real fun doesn’t start until we integrate the DO UPDATE functionality. Consider the case where we want to add Samsung as a mapping for a vendor. In this particular case, someone sneaked an existing row into the system, and it contains a typo.

INSERT INTO symbol_mapper VALUES (1, 'Samsung', 'SSLNF');
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Samsung', 'SSNLF')
    ON CONFLICT (vendor_id, ext_mapping)
   SET symbol = 'SSNLF';
SELECT * FROM symbol_mapper;
 vendor_id | ext_mapping | symbol 
         1 | Google      | GOOGL
         1 | Apple       | AAPL
         2 | goo         | GOOGL
         2 | app         | AAPL
         1 | Samsung     | SSNLF

What we’ve done here is ensure the newest incoming mapping is the “correct” one; that’s our merge. An observant reader might ask how this is any different from our futile desire to attempt an INSERT with an UPDATE in an exception block. Unlike those two separate statements and the time-consuming exception handling, this is a single atomic action.

Did another session delete the row before us? We’ll just insert it again. Did another transaction delete the row we just inserted or updated? Oh well. The important part is that it is impossible to delete the row while our statement is running. So while the logic is similar to using an exception, the difference is that DO UPDATE is built into the database itself, so it can’t be broken into multiple actions that can be interrupted.

Another interesting bit of syntax is that we can actually incorporate a WHERE clause into the update beyond the implicit assumption that our update affects the same key we tried to insert.

INSERT INTO symbol_mapper VALUES (1, 'Sony', 'SONY');
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Sony', 'SNY')
    ON CONFLICT (vendor_id, ext_mapping)
   SET symbol = 'SNY'
 WHERE symbol_mapper.symbol = 'SONY';
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Sony', 'SNY')
    ON CONFLICT (vendor_id, ext_mapping)
   SET symbol = 'SNY'
 WHERE symbol_mapper.symbol = 'SONY';

In this case, we had a mapping for Sony that needed a correction. The first query affected the row we targeted, and the second did nothing. This is important because if we had not specified that predicate, both updates would have successfully modified the row. And so would all subsequent attempts. Remember Postgres keeps a row version for every update, even if the new and old values are identical. That’s just how MVCC works.

In a loosely built application environment, it isn’t uncommon for several vectors to operate simultaneously. If a dozen of these each upsert the same value, they’ll all be satisfied that their work is complete, and Postgres would be stuck with a dozen duplicate old rows. VACUUM (and autovacuum) will ensure old row versions are recycled, but again, that’s more overhead we don’t need to invoke.

And of course, the WHERE clause isn’t restricted to deflecting repeated update attempts. There may be circumstances where we simply don’t want to apply changes. By specifying the table name, we can introspect into any of the existing table values. What about the values we attempted to insert? Since these were part of an inherent violation, they’re assigned to a record named “excluded”.

Here it is in action:

INSERT INTO symbol_mapper VALUES (1, 'Microsoft', 'msft');
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Microsoft', 'MSFT')
    ON CONFLICT ON CONSTRAINT symbol_mapper_pkey
   SET symbol = 'MSFT'
 WHERE symbol_mapper.symbol != excluded.symbol;

This is a very similar situation as we had with Sony. The mapping for Microsoft needs an update if the existing value doesn’t match the one we’re attempting to insert. Well, we can perform that check explicitly without hard-coding those values into the query multiple times. It’s possible to refer to anything in the VALUES tuple by specifying “excluded”. Handy, eh?

Also notice that we’ve changed our conflict condition. Previously we had simply listed the columns in the primary key, and Postgres inferred the proper constraint from that definition. In this case, we directly stated the constraint that Postgres should use in resolving the conflict. It’s somewhat uncommon (and probably not entirely safe) to directly invoke constraint names, but the option is there in case we want it.

This feature was a long time coming; Postgres 9.5 was released in early 2016. As impossible as it sounds, we’ve only really had a little over a year to leverage ON CONFLICT. As a consequence, it’s still slowly seeping into existing Postgres application stacks. Users are still incorporating it into their workflows. It’ll be a while before it’s taken for granted with the rest of the kitchen sink Postgres offers.

Until then, it’s that special toy we’ve always wanted but couldn’t afford until now. There’s nothing else to do but make up for lost time!


P.S. Before I forget, Postgres Open is getting ready for 2017! If you or someone else wants to attend or even submit a talk, I highly recommend doing so. I’ve been something of a regular fixture there since it started in 2011, and I fully intend to try my hand again. It’s taking place in San Francisco this time around, so the change in venue will definitely be interesting. Apparently it’s all part of the new United States PostgreSQL Association, so it’s good seeing everything pull together behind a united entity. Here’s looking to the future!

PG Phriday: RESTing in the Corn

Last week we explored using Postgres as a central communication nexus between several data sources. At the time, I made a fairly hand-wavy allusion to REST interfaces. Since I hadn’t really explored further, I had assumed PLV8 could use core node.js or other similar libraries to invoke HTTP APIs. Of course as a trusted language, PLV8 isn’t allowed to do that. It’s more of a language for easily manipulating JSON and JSONB objects within Postgres.

Only slightly less violent than JSON

So we need some other way of calling out to an external HTTP resource, wrapping it in column trimmings, and sending the data to Postgres. As demonstrated last week, we could do this directly with a Python (or some other language) function. Alternatively we can use an extension that has access to Python. Then all we would need to do is write a python library, and suddenly we can interact with the REST interface as if it were an actual Postgres table. This is where Multicorn comes in.

One of the easiest ways to install Multicorn is through PGXN. It goes something like this:

sudo pgxn install multicorn

Otherwise users of Red Hat variants have a yum repo or Debian/Ubuntu can rely on the apt repo. Multicorn is popular enough that it’s included in both of these resources, which is very handy for us.

In any case, we need a basic REST service to poll. In the interests in keeping things simple, let’s use Flask to whip up an API for fetching all the words in a UNIX dictionary file. Since the jsonify method is extremely slow, it’s probably a good idea to cache the output too. The end result might look something like this:

from flask import Flask, jsonify
app = Flask(__name__)
counter = 1
dictionary = []
cached = None
words = open('/usr/share/dict/words', 'r')
for line in words:
    dictionary.append({'id': counter, 'word': line.strip()})
    counter += 1
def index():
    global cached
    if not cached:
        cached = jsonify({'records': dictionary})
    return cached
if __name__ == '__main__':, port=9999)

Our REST interface replies to only one URL and accepts no parameters. It’s not exciting, but it does the job of transmitting data to HTTP requests. With this “service” running in the background somewhere, we can continue with our task of turning that output into a Postgres table.

The next thing we need to do is make use of the Multicorn Python libraries. In this arena, Python setuptools are our friend. With them, we can write a Python library that imports Multicorn and it will automatically install everything in a compatible location. Here’s a bare-bones file that might work:

import subprocess 
from setuptools import setup, find_packages, Extension
  author='Shaun Thomas',

That takes care of actually installing the project library files. Next we need to actually use the Multicorn API. As a pure foreign data wrapper API, it has a lot more capability than we actually need. By default it can represent direct access to CSV files, filesystem objects, RSS feeds, and a bevy of other cute implementations.

Unfortunately none of these use cases apply to wrapping up /usr/share/dict/words. Multicorn can interact with authentication, pass predicates to help pre-optimize matches, and has several other capabilities we simply won’t leverage here. We just need one table from one file, with very little in the way of parsing.

To that end, we only need to define a single execute method in a class derived from ForeignDataWrapper. With that in place, Python has a very capable protocol request handler we can leverage to actually interact with our REST API. It provides everything we need to get the data and convert the HTTP response from JSON into Python dicts that represent each row of the dictionary contents.

There isn’t much exciting here:

import requests
from multicorn import ForeignDataWrapper
class DictionaryFDW(ForeignDataWrapper):
    def execute(self, quals, columns):
        response = requests.get('http://localhost:9999/')
        data = response.json()
        for row in data['records']:
            yield row

To install it, we merely invoke

sudo python install

Everything is now in place. As a Python library, Multicorn has access to other Python tools installed on the system. This means the primary option to the wrapper will be a named Python class. So long as that class is compatible, we will have a functional Postgres extension.

It just so happens that Multicorn implements this at the SERVER level. This makes sense as each server generally has its own defined interaction model. The CSV server would take filenames as parameters for foreign tables, RSS feeds would need the URL for each external resource, etc. Our wrapper is somewhat less… sophisticated. It takes no parameters and offers no advanced functionality. As such, once we define the server that invokes the library, we’re basically done.

Here’s the code that would create our dictionary table and a sample invocation:

CREATE SERVER multicorn_srv
       FOREIGN DATA WRAPPER multicorn
       OPTIONS (wrapper 'fdw.DictionaryFDW');
  id    INT,
  word  VARCHAR
) SERVER multicorn_srv;
\timing ON
SELECT * FROM dictionary
  id  |    word    
 9001 | Lubbock
 9002 | Lubumbashi
 9003 | Lucas
 9004 | Luce
 9005 | Luce's
(5 rows)
Time: 167.939 ms

It works! Beyond that, notice how slow the results are. Though there are only about 100k rows, even fetching a mere handful takes an exorbitant amount of time. This happens for a few reasons.

First consider that this is an external API call. Each request must trigger an HTTP request, parse the results, and then encode and return each row. While we know what our REST process does, others may not be so forthcoming and are essentially black boxes. At least some of the delay is due to whatever work the REST service performs on its end. Then we have the unfortunate fact that, bless its heart, Python is generally a dog’s breakfast when it comes to performance.


We could fix some of these problems. Multicorn does have the ability to pass along predicates after all. A more advanced library could implement basic regular expression, equality, or other logic to pass along WHERE clauses like these:

SELECT * FROM dictionary
 WHERE word LIKE '%house%';
                         QUERY PLAN
 FOREIGN Scan ON dictionary
     (cost=20.00..20000000000.00 ROWS=100000000 width=36)
     (actual TIME=164.616..326.678 ROWS=189 loops=1)
   FILTER: ((word)::text ~~ '%house%'::text)
   ROWS Removed BY FILTER: 98982
 Planning TIME: 0.248 ms
 Execution TIME: 326.896 ms

While in our case Postgres manually removed all inapplicable matches, a more advanced wrapper could transform these into parameters to the API call itself. In such a case, the REST call might return a much smaller (and faster) row subset. Barring that, there’s another way to cheat the system: materialized views!

We have, after all, already used these to represent regularly updated external snapshots. If we treat the API like a rolling data window, we can just refresh a materialized view with its contents and enjoy direct manipulation of instantiated contents. That means indexes and all the trimmings! A major benefit to this approach over spending all of our time optimizing our wrapper, is that we don’t have to duplicate existing Postgres functionality.

We would naturally want to eventually fix the more obvious inefficiencies in our wrapper interface. Yet until that happens—and even afterwards, since performance enhancements are cumulative—we can do something like this:

SELECT * FROM dictionary;
ANALYZE mv_dictionary;
CREATE INDEX dictionary_word_trigram
    ON mv_dictionary USING GIST (word gist_trgm_ops);
SELECT * FROM mv_dictionary
 WHERE word LIKE '%house%';
                         QUERY PLAN
 Bitmap Heap Scan ON mv_dictionary
     (cost=3.36..30.78 ROWS=10 width=13)
     (actual TIME=6.454..6.628 ROWS=189 loops=1)
   Recheck Cond: ((word)::text ~~ '%house%'::text)
   ROWS Removed BY INDEX Recheck: 1
   Heap Blocks: exact=41
   ->  Bitmap INDEX Scan ON dictionary_word_trigram
           (cost=0.00..3.35 ROWS=10 width=0)
           (actual TIME=6.435..6.435 ROWS=190 loops=1)
         INDEX Cond: ((word)::text ~~ '%house%'::text)
 Planning TIME: 0.152 ms
 Execution TIME: 6.783 ms

One cool trick Postgres has up its sleeves is that it can use leverage quirky index types. One of these is based on word trigrams. It’s a lossy format that will inevitably return more matches than we actually want on the first iteration, but remember that Postgres will apply a post-filter to remove excess matches. We can see that in the Recheck cond line from the execution plan above, which yanked one row that didn’t match our LIKE clause. That’s a small price to pay for reducing a 326ms runtime to 6ms!

Implementing trigram logic in our Python library or REST service would be a huge headache, as would duplicating LIKE handling. Since words are most likely to have inexact and fuzzy searches, we may want to avoid the whole mess. With a materialized view, we can let Postgres do all of the complicated data manipulation while the REST API focuses on bridging the gap to our extrernal resource.

Once again, we’re left with a database that contains none of our own data. Tables we create by wrapping and caching external contents gives us a transitive scratch zone and staging area for actual analysis.

In a real-world scenario, we could only hope the REST interface and its corresponding foreign data wrapper component wouldn’t be as lazy as the one presented here. Perhaps a better alternative would be to write a full REST-specific foreign data wrapper in Go. In such a world, feed sources and parameters, column decoding, caching layers, and any number of other optimizations would be efficient and native to web-driven content. One wrapper and server combination could represent any number of foreign tables instead of ours which only works with a single service.

The options, like Postgres, are endless and varied. That’s the world foreign data wrappers give us.

PG Phriday: Stuck in the Middle with Postgres

Earlier this year, I implied Postgres was some kind of super middleware for dragging data out of every external resource it could locate. But that example only used the Postgres foreign data wrapper to contact another Postgres server. Why be so unimaginative? The future is as unlimited as it is terrifying.

Meet the new Postgres mascot

Let’s start with a few prerequisites. We want to see the combined functionality of a few different Postgres capabilities, so let’s try and combine data from MySQL, Postgres, and Javascript using V8. To do that, we need a couple of easily obtained resources. Debian-based systems can do something like this:

sudo apt-get install postgresql-9.6-plv8 postgresql-9.6-mysql-fdw

Otherwise Postgres packages are available on the Postgres site or as some other name in your favorite distribution. All we need for this is some kind of MySQL install, and the MySQL and V8 Postgres extensions. More ambitious readers can try applying these techniques to SQL Server as well.

Now imagine we have a marketing infrastructure with data strewn all over the world, in a vast assortment of incompatible or legacy apps, each with its own portion of the data we want to see. Not an ideal situation really, and one normally solved by ETL to pull all of the data into a single location for perusal. But transformation takes time, and we’re exceptionally lazy. Instead of all that extract, transform, and load action, let’s just query the data directly.

We can get started by creating a basic market MySQL table with two measly rows for demonstration purposes:

CREATE TABLE market_data (
  ext_root   VARCHAR(10) PRIMARY KEY,
  trader     VARCHAR(20) NOT NULL,
  stock      VARCHAR(10) NOT NULL,
  action     CHAR(1) NOT NULL,
  price      NUMERIC(10,4) NOT NULL,
  xact_date  TIMESTAMP
INSERT INTO market_data (ext_root, trader, stock, action, price)
       VALUES ('IZ724JJ5:3', 'bsmith', 'GOOG', 'b', 800);
INSERT INTO market_data (ext_root, trader, stock, action, price)
       VALUES ('IZ724JJ5:4', 'bsmith', 'GOOG', 's', 842);

Easy, right? The rows should automatically have a timestamp applied as if they’re part of a rapidly accumulating feed. This data isn’t normalized because that would greatly complicate the data model. For the purposes of this activity, that’s not really necessary. If it helps, imagine this is just a view on top of all of the relevant tables and flattens the data to something an external resource can easily consume.

The next thing we need is some data in a Postgres database, because why not? Consider this on some server in a far-away land filled with unicorns and butterflies cavorting with innocent children in a field of daisies. It’s definitely not in our datacenter!

CREATE TABLE trade_calc (
  ext_root  VARCHAR(10) PRIMARY KEY,
  stdev     NUMERIC NOT NULL,
  mean      NUMERIC NOT NULL,
  delta     NUMERIC NOT NULL
INSERT INTO trade_calc VALUES ('IZ724JJ5:3', 13.7, 825, 11.3);
INSERT INTO trade_calc VALUES ('IZ724JJ5:4', 8.5, 832, 1.5);

In this case, the table contains supplementary information regarding the trade. Maybe an application is generating various calculations derived from the trade and storing it in a local resource for later consumption. Regardless, we’ve retained the external root identifier used to tie the rows together, which is a common feature of market data.

The next step is to introduce some magical binding substance of some kind.

Why not?

We’ll be installing all of these wonderful resources in a Postgres database dedicated specifically for that purpose. One with user accounts and grants in a centralized location where everyone and his pet fish can access and aggregate data stored there. The MySQL portion of our tentacle would look and function something like this:

CREATE SERVER mysql_market
  OPTIONS (host 'localhost');
  SERVER mysql_market
  OPTIONS (username 'root');
  FROM SERVER mysql_market
  INTO mysql;
SELECT * FROM mysql.market_data;
  ext_root  | trader | stock | action |  price   |      xact_date      
 IZ724JJ5:3 | bsmith | GOOG  | b      | 800.0000 | 2017-03-10 15:09:32
 IZ724JJ5:4 | bsmith | GOOG  | s      | 842.0000 | 2017-03-10 15:09:45

One foreign MySQL resource allocated and tested. Check!

Without a readily available node.js or REST interface to contact, we can simulate polling some kind of JSON resource with the PLV8 language. This a good way to demonstrate tying elements to Postgres using something other than SQL. It also gives us a way to grab data for situations where no foreign data wrapper exists and we’re not confident enough to create one.

In this case, we’re calling a V8 function which gets the data through some dubious means and sends it back to us:

SET search_path TO js;
CREATE TYPE clearing_data AS (
  ext_root   VARCHAR(10),
  market     VARCHAR(10),
  bank       VARCHAR(10),
  is_funded  BOOLEAN
CREATE OR REPLACE FUNCTION get_clearing_info()
RETURNS SETOF clearing_data AS
    "ext_root": "IZ724JJ5:3", "market": "NASDAQ",
    "bank": "Chase", "is_funded": FALSE
    "ext_root": "IZ724JJ5:4", "market":
    "NASDAQ", "bank": "Citi", "is_funded": TRUE
$$ LANGUAGE plv8;
CREATE VIEW all_clearing_data AS
SELECT * FROM get_clearing_info();
SELECT * FROM js.all_clearing_data;
  ext_root  | market | bank  | is_funded 
 IZ724JJ5:3 | NASDAQ | Chase | f
 IZ724JJ5:4 | NASDAQ | Citi  | t

Success! Hard-coded data isn’t exactly robust, but demos are open season, and hey look over there! In a real scenario, we’d probably have a function that accepted parameters and performed a pull from some external resource. The view is just a way of hiding the function call for users who might have their head explode to know they can select rows from a function.

Lastly we need to contact that Postgres server out in the magical field of unicorns, butterflies, and candy canes. What, I didn’t mention candy last time? Well there’s candy now. Deal with it.

CREATE EXTENSION postgres_fdw;
CREATE SERVER pgsql_market
  OPTIONS (dbname 'market', host 'localhost');
  SERVER pgsql_market
  OPTIONS (USER 'postgres');
  FROM SERVER pgsql_market
  INTO pgsql;
SELECT * FROM pgsql.trade_calc;
  ext_root  | stdev | mean | delta 
 IZ724JJ5:3 |  13.7 |  825 |  11.3
 IZ724JJ5:4 |   8.5 |  832 |   1.5

Essentially that was the same process we used with MySQL, except this time we used the Postgres FDW instead. Either way, it works and returns data exactly as expected.

The final step is to slap on the duct tape and invoke everything at once:

SELECT m.*, t.*
  FROM mysql.market_data m
  JOIN pgsql.trade_calc t USING (ext_root)
  JOIN js.all_clearing_data c USING (ext_root);
  ext_root  | trader | stock | action |  price   |      xact_date      |  ext_root  | stdev | mean | delta 
 IZ724JJ5:3 | bsmith | GOOG  | b      | 800.0000 | 2017-03-10 15:09:32 | IZ724JJ5:3 |  13.7 |  825 |  11.3
 IZ724JJ5:4 | bsmith | GOOG  | s      | 842.0000 | 2017-03-10 15:09:45 | IZ724JJ5:4 |   8.5 |  832 |   1.5

The real power here is in combining this with the external resource gathering techniques we explored in the previously mentioned article. Is the external REST call way too slow? Slap on a materialized view and index the crap out of it. Have a super stable API for an internally developed resource? Write a foreign data wrapper so Postgres can invoke it directly. Then add a materialized view for good measure. Add a refresh schedule for any resources that adversely affect performance for direct inquiries.

What we end up with is a container database that represents 0% of its own data. It’s a resource we can truncate at will, may crash with no consequences, and might exist anywhere since it’s nothing but an interface layer. A Postgres instance designed this way becomes a SQL API to access assorted data feeds through a standardized execution model. Heck, everyone can have their very own.

Were we to attempt this with our favorite language or application stack, we would probably need to write our own data joining techniques, implement set theory, incorporate hashing and sorting algorithms, and so on. In the end, we’d have written a tiny database engine to handle data combination and aggregation, without most capabilities of even a mediocre out-of-the-box solution. In this scenario, Postgres bridges that gap for us.

If Postgres has already done all the hard work, why replicate it? And isn’t that what being lazy is all about?

No coding, only nap