PG Phriday: Community Edition

Postgres is one of those database engines that carves out a niche and garners adherents with various levels of religious zeal. The community, while relatively small compared to that of something like MongoDB, is helpful almost to a fault. Members from the freshest-minted newb to the most battle-tested veteran will often trip over themselves to answer questions in the various dedicated forums, mailing lists, and chat rooms. To that end, let’s answer one particular question that ties everything together: what exactly is available to someone who wants to participate in the Postgres universe these days?

First and foremost, the Postgres site has a community page to start our trip into those most arcane depths. For obvious reasons, this is probably the most comprehensive resource available for interacting with the Postgres community in one way or another. Let’s explore a bit.

Easily Digestible

At first glance, it’s just two innocuous paragraphs with a sprinkling of links and a couple of sidebars. Delve into the mailing lists, however, and our trip down the rabbit hole begins in earnest. There are quite a few categories of interest, and subscriptions work in the usual ways: regular delivery, digest, or lurk-only for those who feel more comfortable asking questions than answering them. I personally subscribe to these, but there’s no reason to restrict yourself to my esoteric interests:

  • pgsql-admin: Admins and admin wannabes; the perfect place to just talk shop with other DBAs.
  • pgsql-general: The potpourri of database discussion. They’re also not kidding when they say “Please note that many of the developers monitor this area.” Several of the most prolific committers take time to answer questions from this list.
  • pgsql-performance: I spent a lot of time with a database that handled over a billion queries per day, and needed this list in order to survive. After a while, I picked up a few things and started answering questions myself. Anyone can follow that path.
  • pgsql-hackers: The Postgres devs post here. And boy do they take advantage of every opportunity to do so. Threads range from proposed features to detailed critiques of potential patch-sets. These discussions have been known to go on for years, covering hundreds of back-and-forth messages before something finally becomes fully incorporated into Postgres. There’s no better way to keep up with the future of Postgres and watch the feature vetting process in action.

That is just a tiny fragment of what’s available there, and the message traffic is fast and furious.

Chattier Than a Chat Bot

It would seem that a sizable contingent of the Postgres community maintains careful watch of the #postgresql channel over at irc.freenode.net. But it wouldn’t be the New Web if there wasn’t also a lively group over at the Postgres Team Slack.

While the mailing lists are great and responsive, there’s something to be said for live responses. In the case of the Slack channel, there’s also the benefit of formatted code pasting. Unfortunately, neither of these resources is readily archivable for posterity, nor is there any easy delineation of topics, since chat is a free-for-all. This makes it somewhat difficult to refer back to past discussions with any reliability—such is the black hole of transient conversations.

Despite that, being in a virtual room with other Postgres users makes it much easier to informally swap examples and ideas. If AOL or Yahoo chat rooms were still a thing, there would likely be Postgres communities there as well. Maybe I’m looking in the wrong spots, but I find this particular resource critically under-utilized. Social media has purportedly replaced chat rooms, but it’s a demonstrably different concept; here’s hoping chat makes a comeback.

Interactivity

Speaking of social media, there is a Postgres Facebook group, and of course a PostgreSQL Google+ group with a related PostgreSQL community. And we can’t leave out the PostgreSQL Twitter.

Beyond these, local communities often have their own user groups so people can actually get together and treat Postgres like a hobby. I joined the Chicago PUG back when it started, and try to attend whenever I’m in town on business. There’s something to be said for the personal touch. It’s not official, but many of these groups tend to coordinate their activities through Meetup, which is handy for users of that service.

For those who live in an area without a local Postgres group and don’t feel comfortable starting one, Postgres has a bevy of conferences running through the year. One or more of these might be nearby, so why not attend and meet some of the people who make Postgres tick? I don’t go to all of these, but Postgres Open started in Chicago, so I’ve tried to make a regular appearance when I have interesting (and presentable) material. I also hear good things about PGConf US over in New Jersey.

And that brings us to PgUS and its ilk. While Postgres has no official owner, there is a lot of coordination between the various enclaves to reach some sort of consensus as to the direction the project takes. Some of this is in the form of advocacy and education, so there’s no need to be a Postgres prodigy to contribute.

For those stark few who find database engines exciting, meet up with your fellow “non-robots” and spread the word!

Readership

Part of learning about Postgres and spreading its use is consuming the copious blogs, wikis, and manuals dedicated to the subject. You likely came here from Planet PostgreSQL, which acts as the official blogroll/RSS feed for the various articles regularly churned out by sources such as myself. If not, know that there is no better way to stay abreast of new Postgres techniques, features, or just useful knowledge otherwise lost to the annals of history. Many of the Big Names(tm) regularly post on their personal or company blogs, and all of these critical insights get syndicated in a single place for everyone to enjoy.

Lest we forget, there’s also the PostgreSQL Wiki. Any sufficiently vetted community member can make entries here to help make the Postgres world a bit more complete. Unlike the official documentation, which concentrates on describing functionality as reference material, the Wiki is more human-friendly and elastic. I often use the new release pages as a cheat sheet when sharing new features. And where else could you find a complete list of foreign data wrappers or procedural languages, when most aren’t even officially Postgres projects?

I have to admit I usually feel too timid to taint any official Postgres resource with my inane rambling, but someone did it for me a while ago, so maybe that fear is unwarranted. We’ll see what the future holds!

PostScript

The Postgres community is part of the reason I am where I am, and know what I do. I learned enough on the job and through the mailing lists that I wrote a book on High Availability and Postgres to share the wealth. It is an invaluable resource for anyone who is willing to learn, and potentially a better one for those who are capable of contributing. The people are friendly, capable, and enthusiastic. That last one always makes me laugh considering the relatively dry nature of the underlying material: database software. We’re excited users and advocates of database software. If anyone had told me as a child that this would be my future, I would have just responded with: boring!

But it isn’t. It really isn’t, and I have no real answer as to why. Maybe it’s the challenge, or perhaps the community itself and its limitless encouragement. Whatever the case, Postgres is my bag. To that end, this will also be the last official PG Phriday on this site. Perhaps it was inevitable given my dedication to the cause, but I’m now a member of the team at 2nd Quadrant. As such, it only makes sense that my Postgres-related material originate from that unified front. Their blog is also part of the Planet PostgreSQL feed, so if you haven’t already, consider this a second entreaty to become a regular reader of that site.

Here’s to the future of Postgres, and everyone that works to ensure it’s a bright one!

PG Phriday: Who Died and Made You Boss?! (The Investigatining!)

The Postgres system catalog is a voluminous tome of intriguing metadata, both obvious and stupendously esoteric. When inheriting a Postgres database infrastructure from another DBA, sometimes it falls upon us to dig into its writhing confines to derive a working knowledge of its lurking denizens. The trick is to do this before they burst forth and douse us with the database’s sticky innards as it experiences a horrible, untimely demise.

Bane of unwatched databases everywhere.

To prevent that from happening, it’s a good idea to check various system views on occasion. The alternative isn’t always outright disaster, but why take the chance?

An ideal place to start is the pg_stat_activity view. It tells us what each session is (or was) doing, where it originated, who owns it, and a myriad of other juicy details. There’s just one problem:

SELECT pid, usename, query FROM pg_stat_activity;
 
  pid  | usename  |                       query                       
-------+----------+---------------------------------------------------
 11415 | postgres | <insufficient privilege>
   482 | sthomas  | SELECT pid, usename, query FROM pg_stat_activity;

While logged on as an unprivileged user, we can only see our own activity. The query column is protected such that only superusers can view its contents for all users. This is a security measure since it’s possible a query has sensitive information embedded somewhere. However, there are a lot of useful contextual or debugging elements in that field that a service monitor or other automated tool might find illuminating.

How do we give a user access to this data—automated or otherwise—without committing the greatest of sins by making them a superuser? One common technique is to simply wrap the view with a function that returns rows. Like this:

CREATE ROLE monitor;
 
CREATE OR REPLACE FUNCTION public.pg_stat_activity()
RETURNS SETOF pg_catalog.pg_stat_activity
AS $$
  SELECT * FROM pg_catalog.pg_stat_activity;
$$ LANGUAGE SQL SECURITY DEFINER;
 
REVOKE ALL ON FUNCTION public.pg_stat_activity() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION public.pg_stat_activity() TO monitor;
 
GRANT monitor TO sthomas;

Note that we also created a role that we can use for this kind of elevated access. By granting access to the role, we have an abstract set of privileges we can grant to, or revoke from, other users. Why track down every single grant a user might have, when we can just revoke a few roles instead?

The function is set as a SECURITY DEFINER so it runs as the user who owns it. The presumption here is that we create the function as another superuser (usually postgres), and then the query column is no longer protected for users given access to the function. Be wary when doing this however! Did you notice that we prepended the pg_catalog schema in the SELECT statement itself? This prevents the user from changing their search path and tricking the system into giving them superuser access to a view that’s really a select on a sensitive table. Sneaky!

We also needed to explicitly revoke execution access from PUBLIC, the implicit group that includes every user. By default, functions created in Postgres can be executed by anyone. There are ways to change this, but most DBAs either don’t know about them, or haven’t bothered. As a consequence, there are a lot of functions out there that are unnecessarily promiscuous. Elevated functions especially need this step!
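For reference, the lever for changing that default going forward is ALTER DEFAULT PRIVILEGES. Here’s a minimal sketch, assuming we want functions created in the public schema to start out locked down:

-- Only affects functions created later by the role issuing this command;
-- existing functions keep whatever grants they already have.
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  REVOKE EXECUTE ON FUNCTIONS FROM PUBLIC;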

After granting access to the new role, we can attempt the previous activity query again:

SELECT pid, usename, query FROM pg_stat_activity();
 
  pid  | usename  |                       query                       
-------+----------+---------------------------------------------------
 11415 | postgres | GRANT monitor TO sthomas;
   482 | sthomas  | SELECT pid, usename, query FROM pg_stat_activity();

Success! It feels inherently wrong to deliberately circumvent that kind of security measure, but we do as needs must. At least we did it safely. Having access to the query column is important in several contexts, especially if our application stack isn’t particularly sensitive.

Now that we have curated access to session activity, we may also need to observe how the database is working with its hardware resources. When multiple queries access data in shared memory for instance, knowing the contents might help us size it properly. It’s possible to access this information by activating the pg_buffercache extension.

With that knowledge firmly in hand and a bit of window function magic, here’s a query that might do the job. In this case, I created a benchmark database and ran a couple of iterations against it to generate some buffer activity.

CREATE EXTENSION pg_buffercache;
 
SELECT c.oid::REGCLASS::TEXT,
       round(COUNT(*) * 8.0 / 1024, 2) AS mb_used,
       round(c.relpages * 8.0 / 1024, 2) AS object_mb,
       round(COUNT(*) * 100.0 / c.relpages, 2) AS object_pct,
       round(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS buffer_pct
  FROM pg_buffercache b
  JOIN pg_class c USING (relfilenode)
  JOIN pg_namespace n ON (c.relnamespace = n.oid)
 WHERE n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
 GROUP BY c.oid, c.relpages
 ORDER BY 2 DESC
 LIMIT 10;
 
          oid          | mb_used | object_mb | object_pct | buffer_pct 
-----------------------+---------+-----------+------------+------------
 pgbench_accounts_pkey |    3.22 |     21.45 |      15.01 |      53.16
 pgbench_accounts      |    2.75 |    128.08 |       2.15 |      45.42
 pgbench_history       |    0.05 |      0.62 |       8.86 |       0.90
 pgbench_tellers       |    0.02 |      0.01 |     200.00 |       0.26
 pgbench_branches      |    0.02 |      0.01 |     200.00 |       0.26

The database here is only using the default 8MB buffer, so there isn’t a whole lot of room for actual database caching. However, we can see that the accounts data is extremely active: only 15% of its primary key is cached, and barely 2% of the table itself. From this information, it’s clear we should increase shared memory to accommodate the full size of the accounts primary key, and possibly a larger portion of the accounts table.
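Assuming we’re free to change the configuration, the adjustment itself is a one-liner. The 256MB figure below is purely illustrative, sized to hold the primary key with plenty of room to spare:

-- ALTER SYSTEM (9.4+) writes the override to postgresql.auto.conf;
-- shared_buffers itself still requires a full server restart to apply.
ALTER SYSTEM SET shared_buffers = '256MB';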

There are, of course, other ways we can view table activity in Postgres. Consider, for instance, that Postgres constantly aggregates statistics as tables are written to or read from. The statistics collector is actually rather prolific.

This query is a great way of seeing which tables are the most active:

SELECT relname AS TABLE_NAME, SUM(n_tup_ins) AS inserts,
       SUM(n_tup_upd) AS updates,
       SUM(n_tup_del) AS deletes
  FROM pg_stat_user_tables
 GROUP BY 1
 ORDER BY SUM(n_tup_ins + n_tup_upd + n_tup_del) DESC
 LIMIT 10;
 
    TABLE_NAME    | inserts | updates | deletes 
------------------+---------+---------+---------
 pgbench_accounts | 1000000 |   15431 |       0
 pgbench_tellers  |     100 |   15431 |       0
 pgbench_branches |      10 |   15431 |       0
 pgbench_history  |   15431 |       0 |       0

Remember that benchmark I mentioned earlier? We can actually see how many rows each table started with, how many were modified during the benchmark itself, and the fact that the history table was part of the benchmark transaction.

These statistics can be lost in a number of ways, since they’re only maintained while the server is running; a crash or unclean shutdown will discard them, for instance. We can also manually reset them with the pg_stat_reset() function, in case we are doing more active forensics and want to see live accumulation.
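For the record, the reset itself is nothing fancy, and the cluster-wide collectors have their own variant:

-- Zero out per-table and per-index counters for the current database.
SELECT pg_stat_reset();

-- Shared collectors, such as the background writer, are reset separately.
SELECT pg_stat_reset_shared('bgwriter');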

And this is only the write data. What about reads? Elements such as sequential or index scans are equally important if we want to know how well our indexes are performing, or identify tables that might need more or better indexes.

Here’s a different use of the same stat table:

SELECT s.relid::REGCLASS::TEXT AS TABLE_NAME,
       s.seq_scan, s.idx_scan,
       s.idx_tup_fetch, c.reltuples AS ROW_COUNT
  FROM pg_stat_user_tables s
  JOIN pg_class c ON (c.oid = s.relid)
 ORDER BY s.seq_scan DESC, idx_scan DESC
 LIMIT 10;
 
    TABLE_NAME    | seq_scan | idx_scan | idx_tup_fetch | ROW_COUNT 
------------------+----------+----------+---------------+-----------
 pgbench_branches |    15433 |        0 |             0 |        10
 pgbench_tellers  |    15431 |        0 |             0 |       100
 pgbench_accounts |        3 |    30862 |         30862 |     1e+06
 pgbench_history  |        0 |          |               |     23148

Those numbers line up pretty well. The fact that the account table has three sequential scans is because I was trying to force the data into the shared buffers, so I read the entire contents of the table a few times. Since the branch and teller tables are so small, reading their entire contents is probably the most efficient approach, though it may warrant investigation later if the behavior persists.

If we focus on the account table where all the real activity is taking place, it’s index fetches at all times, except for my manual scans. For a 1-million row table, that’s exactly what we want to see. The fact that there’s a 1-1 relationship with the number of scans to fetches suggests they’re single fetches from the primary key, and also tells us we have good selectivity.

And now that we know a lot of information about our tables, it’s time to move on to the users of those tables.

It only seems that way

More specifically, consider roles. In Postgres, the best practice is to create a role and grant it access to multiple tables, functions, or views, and then grant that role to the specific users who require that access. This is a lot safer than direct assignments because it’s far easier to revoke and to share among related users; say, multiple people in a reporting department.

But what if we’re given a username and want to see a full list of what they can actually access? That’s not as obvious as it might seem at first glance. Let’s do some permissions magic starting with this example group nesting.

CREATE ROLE benchmark;
CREATE ROLE nested_benchmark;
 
GRANT benchmark TO nested_benchmark;
GRANT nested_benchmark TO sthomas;
 
GRANT SELECT ON ALL TABLES IN SCHEMA public TO benchmark;
 
SELECT table_schema, TABLE_NAME
  FROM information_schema.table_privileges
 WHERE grantee IN ('sthomas', 'nested_benchmark');
 
 table_schema | TABLE_NAME 
--------------+------------
(0 ROWS)

What the what!? Just to be clear, the sthomas user does have full read access to all of the pgbench tables. So why aren’t they showing up in the list?

As handy as the information schema is, a critical flaw is that it only considers directly assigned privileges. Nested roles completely circumvent its ability to report access. We need a way to fix that.

Enter the wondrous recursive CTE syntax. We can create a view that “flattens” the role nesting:

CREATE OR REPLACE VIEW v_recursive_group_list AS
WITH RECURSIVE all_groups AS
(
  SELECT r.rolname AS user_name, g.rolname AS group_name
    FROM pg_authid r
    JOIN pg_auth_members m ON (m.member=r.oid)
    JOIN pg_authid g ON (m.roleid=g.oid)
  UNION ALL
  SELECT ag.user_name, g.rolname AS group_name
    FROM pg_authid r
    JOIN pg_auth_members m ON (m.member=r.oid)
    JOIN pg_authid g ON (m.roleid=g.oid)
    JOIN all_groups ag ON (r.rolname = ag.group_name)
)
SELECT * FROM all_groups;

Again, this isn’t as difficult as it looks. We’re just bootstrapping the results with the existing list of user/role associations. The UNION combines each of these with successive levels of nesting until we’ve exhausted all of them. The result is a username/group row for every role that is granted to a user, or a role which was granted to that role, and so on.

That’s something we can combine with the information schema to actually derive the full list of objects a user can access. Let’s see how that works:

SELECT *
  FROM v_recursive_group_list
 WHERE user_name = 'sthomas';
 
 user_name |    group_name    
-----------+------------------
 sthomas   | nested_benchmark
 sthomas   | monitor
 sthomas   | benchmark
 
SELECT t.table_schema, t.table_name
  FROM v_recursive_group_list gl
  JOIN information_schema.table_privileges t ON (t.grantee IN (gl.user_name, gl.group_name))
 WHERE gl.user_name = 'sthomas';
 
 table_schema |    TABLE_NAME    
--------------+------------------
 public       | pgbench_tellers
 public       | pgbench_branches
 public       | pgbench_accounts
 public       | pgbench_history
 public       | pg_buffercache

Huzzah! Now we know what sessions are doing, how active tables are, and have a firm grasp of basic security and exposure. What else is left? How about locks?

There’s a deadlock somewhere in here…

Yes, locks. Sometimes session activity gets out of hand, or transactions get a little too fast and furious, and things go somewhat awry. Being able to unravel that mess is essential to keeping a database operating smoothly.

If we combine the contents of pg_stat_activity and pg_locks, we can get a lot of additional information regarding session activity. Before we were only interested in the query that was running and maybe some surrounding context. Now we can see exactly which tables, indexes, views, and other objects are locked, what kind of lock is involved, and so on.

This is probably one of my favorite views to use for this kind of work:

CREATE OR REPLACE VIEW v_activity_locks AS
SELECT a.pid, s.mode, s.locktype, a.wait_event, a.state,
       a.usename, a.query_start::TIMESTAMP(0), a.client_addr,
       now() - a.query_start AS time_used, a.query, s.tables
  FROM pg_stat_activity() a
  LEFT JOIN (
         SELECT pid AS pid, mode, locktype,
                string_agg(relname::text, ', ') AS TABLES
           FROM (SELECT l.pid, l.mode, l.locktype, c.relname
                   FROM pg_locks l
                   JOIN pg_class c ON (l.relation=c.oid)
                  WHERE c.relkind = 'r'
                  ORDER BY pid, relname) agg
          GROUP BY 1, 2, 3
       ) s USING (pid);
 
GRANT SELECT ON v_activity_locks TO monitor;
 
SELECT pid, usename, state,
       EXTRACT(minutes FROM time_used) AS minutes,
       SUBSTRING(query, 1, 30) AS query_part,
       TABLES
  FROM v_activity_locks;
 
  pid  | usename  | state  | minutes |          query_part           |              TABLES              
-------+----------+--------+---------+-------------------------------+----------------------------------
  1927 | postgres | idle   |         |                               | 
 28305 | postgres | idle   |      49 | SELECT c.oid::REGCLASS::TEXT, | 
 11415 | postgres | idle   |       4 | GRANT monitor TO sthomas;     | 
 16022 | sthomas  | active |       0 | SELECT pid, usename, state,   | pg_authid, pg_class, pg_database

Despite being horrifically ugly—which is why it’s a view—I prefer it to simply joining the two tables. The magic is in the tables column, which lists every table locked by the session, in alphabetical order, all in the same result. No more tracing through multiple rows to see all of the participants in a locking mess; it’s all there for everyone to see. One line per session. Add a filter to remove idle queries. Maybe an extra predicate to only consider queries which have been active for more than a few seconds. In the end, we have a single query that can instantly identify areas worthy of additional forensics, and all involved resources.
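Here’s a sketch of that filtered variant; the five-second threshold is an arbitrary choice:

SELECT pid, usename, state, time_used, query, tables
  FROM v_activity_locks
 WHERE state != 'idle'
   AND time_used > INTERVAL '5 seconds'
 ORDER BY time_used DESC;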

While we’re thinking about locks, the previous view only really told us what resources a particular session was using and perhaps how it was doing so. If we sorted these results by the query_start column, it wouldn’t be too difficult to see whether or not one session was blocking another. But it’s not exactly a scenario that requires an alibi, and related activity can obscure our results if the database is particularly busy.

Thanks to the newly available pg_blocking_pids function in Postgres 9.6, we can actually see what is blocking a certain action. Before this function was introduced, the only way to figure out what was causing the block was to trace which connections were using the same resources and had their locks granted, versus those that didn’t. In a sufficiently busy system, this wasn’t necessarily a causal relationship, but it provided a good starting point. Now we can see exactly what’s causing the block, and we can use that information to our benefit.

Here’s an example of a lingering modification in a transaction that caused a block for another session:

\x
 
SELECT DISTINCT l1.pid AS blocker_pid, a.query AS blocker_query,
       a.usename AS blocker_user, a.client_addr AS blocker_client,
       l2.pid AS blocked_pid, a2.query AS blocked_query,
       a2.usename AS blocked_user, a2.client_addr AS blocked_client
  FROM pg_locks l1
  JOIN pg_stat_activity() a ON (a.pid = l1.pid)
  JOIN pg_locks l2 ON (l1.pid = ANY(pg_blocking_pids(l2.pid)))
  JOIN pg_stat_activity() a2 ON (a2.pid = l2.pid)
 WHERE l1.granted
   AND NOT l2.granted;
 
-[ RECORD 1 ]--+------------------------------------------
blocker_pid    | 11415
blocker_query  | ALTER TABLE pgbench_accounts ADD foo INT;
blocker_user   | postgres
blocker_client | 
blocked_pid    | 12337
blocked_query  | SELECT COUNT(*) FROM pgbench_accounts ;
blocked_user   | sthomas
blocked_client |

Instead of using the pg_locks view and pg_stat_activity() function, we could use the v_activity_locks view to see both the blocked session and what blocked it on the same row. This would be indispensable in a really busy system, especially if a single connection has blocked several others. Such an occurrence would practically glow and demand immediate attention. Heck, it would even make an ideal automated system check.
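As a rough sketch of that idea, we can pair each blocked session with its blocker through pg_blocking_pids. A session holding several lock types may appear more than once, but it gets the point across:

SELECT blocked.pid AS blocked_pid, blocked.usename AS blocked_user,
       SUBSTRING(blocked.query, 1, 30) AS blocked_query,
       blocker.pid AS blocker_pid, blocker.usename AS blocker_user,
       SUBSTRING(blocker.query, 1, 30) AS blocker_query,
       blocker.tables
  FROM v_activity_locks blocked
  JOIN v_activity_locks blocker
    ON (blocker.pid = ANY (pg_blocking_pids(blocked.pid)));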

And isn’t that what all of this really boils down to? Some of it requires direct observation: tracing user access makes sense during security audits, and the table stats can help with optimizing the database itself in many cases. But the rest is really just novel ways of combining the system views to produce quantifiable data that stands out in some way. How many queries have been running for over a minute? How many sessions have been waiting on a lock for over ten seconds?
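Both of those questions boil down to a single check query. Here’s a sketch that uses query_start as a rough proxy for how long a session has been waiting:

SELECT COUNT(*) FILTER (WHERE state = 'active'
                          AND now() - query_start > INTERVAL '1 minute')
         AS long_running,
       COUNT(*) FILTER (WHERE wait_event_type = 'Lock'
                          AND now() - query_start > INTERVAL '10 seconds')
         AS waiting_on_locks
  FROM pg_stat_activity();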

Those first approximations are the bloodhound to a seasoned DBA. It would be silly to run all of these manually on a regular basis, but when there is a problem, all bets are off. These queries and views barely even scratch the surface of what’s really available in the Postgres system catalog. The Postgres metadata isn’t just there to reflect object organization, and curiosity goes a long way when exploring it.

So follow that bloodhound to the crime scene. There are more than enough tools available to identify, apprehend, and even punish the culprit. Or maybe just explore the stats and tweak to wring extra performance from the existing tables. This should be a good start for doing either, and infinitely more.

PG Phriday: Design Pattern Workshop

Recently on the pgsql-performance mailing list, a question popped up regarding Postgres RAM usage. In this instance Pietro wondered why Postgres wasn’t using more RAM, and why his process was taking so long. There were a few insightful replies, and they’re each interesting for reasons that aren’t immediately obvious. Let’s see what is really going on here, and perhaps answer a question while we’re at it.

Pietro presents several postgresql.conf settings, but here are the ones that matter:

shared_buffers = 24GB
effective_cache_size = 72GB
work_mem = 512MB
maintenance_work_mem = 1GB

The configured system has 96GB of total RAM. Keeping that in consideration, we see that shared_buffers is set to 24GB, which is a quarter of the server’s available memory. The only potential issue here is that it’s possible a checkpoint could be a very painful experience if a significant portion of that memory becomes dirty. Since there are no complaints of high IO waits, let’s just assume that isn’t happening.

Common advice suggests setting effective_cache_size to an amount equivalent to half of system RAM + shared_buffers. A setting of 72GB looks high, but Postgres does not allocate any of that on its own. This parameter merely influences the query planner, since higher values imply an increased probability a disk page is in either Postgres or OS caches. This is an optimization setting, not a memory allocation.

A work_mem value of 512MB is extremely high. In some cases however, that’s perfectly acceptable. The amount specified here will be used for all internal query execution operations that require a memory allocation. This means sorts, hashes, bitmaps, materializations, and so on. Thus a single query can instantiate multiple 512MB segments for each necessary operation. Effectively the user has greatly reduced the likelihood Postgres will utilize disk storage for any query operations.
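An easy way to check whether a given query actually needs such a generous allocation is to watch the Sort Method line in EXPLAIN (ANALYZE) output while adjusting work_mem for the session. Here’s a sketch that borrows the pgbench tables from an earlier article as a stand-in:

-- Session-level override, so nothing is committed globally.
SET work_mem = '64MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT aid, abalance
  FROM pgbench_accounts
 ORDER BY abalance;

-- "Sort Method: quicksort  Memory: ..." means the sort fit in work_mem;
-- "Sort Method: external merge  Disk: ..." means it spilled to temp files.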

And finally we have maintenance_work_mem. This is so often set to 1GB, it might as well be the default. This is the amount of RAM Postgres uses when doing things like creating indexes, vacuuming, and anything else that might need vast amounts of working space to proceed. We don’t know Pietro’s procedures, but we can make some inferences later. This setting really only matters if some part of his data onboarding includes adding new indexes or triggers an autovacuum.

Then we see how memory is performing once Postgres is running:

              total        used        free      shared  buff/cache   available
Mem:            94G         28G         46G         17M         19G         64G
Swap:           15G          0B         15G
Total:         109G         28G         61G

That’s not a lot of usage. Of that 28GB, we know Postgres has 24 simply due to the setting for shared_buffers. This means that the actual queries are only consuming 4GB at most.

This is particularly relevant because if we skip to the end of the post, we see this output from top:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9686 postgres  20   0 24.918g 214236  12628 R 100.0  0.2   2872:38 postgres
 9687 postgres  20   0 24.918g 214212  12600 R 100.0  0.2   2872:27 postgres
 9688 postgres  20   0 25.391g 709936  12708 R 100.0  0.7   2872:40 postgres
 9691 postgres  20   0 24.918g 214516  12900 R 100.0  0.2   2865:23 postgres
 9697 postgres  20   0 24.918g 214284  12676 R 100.0  0.2   2866:05 postgres
 ...

All 16 of the backends are pegged at 100% CPU, and have been for almost 48 hours each (2872/60). That definitely corroborates Pietro’s claim that this workload has been running for 48 hours. This suggests little to no waiting for disk IO during the entire process.

So will more RAM help? Probably not. Pietro says he’s doing “time series expansion and some arithmetics”. This kind of procedure doesn’t generally require vast amounts of RAM, as calculations don’t often invoke sorting or other temporary memory structures. So what’s really going on?

The second reply gets at the real heart of the matter immediately:

You’d be better off to ask for help in optimizing your queries IMHO.

Why would Scott say that?

Well, there are a lot of ways to do the same thing. In the database world, some of these are highly preferable to others. How exactly is Pietro expanding those time series, for example? He could be doing something like this:

DO $$
DECLARE
  a INT;
  b INT;
  ...
  new_a INT;
  new_b INT;
  ...
BEGIN
  FOR a, b, ... IN SELECT col_a, col_b, ...
  LOOP
    new_a = some_calculation(a, b, c, ..., n);
    new_b = other_calculation(a, b, c, ..., n);
 
    INSERT INTO target_table VALUES (new_a, new_b, ...);
  END LOOP;
END;
$$ LANGUAGE plpgsql;

If the necessary logic is complicated enough, that might be the right approach. However, it is extremely expensive to loop through each individual row from a driver query, allocate variables, perform calculations, and invoke a single insert. This kind of process would peg a CPU at 100% while still being highly inefficient by multiple orders of magnitude.

There’s also the possibility Pietro knows this, and the calculations were fairly simple. So maybe he decided the process could be inlined a bit more. Instead of that anonymous block (or something equivalent on the application side), he decides to write helper functions to do the calculations instead:

CREATE OR REPLACE FUNCTION some_calculation(a INT, b INT, ...)
RETURNS INT AS
$$
DECLARE
  retval INT;
BEGIN
  -- Some calculations here
 
  RETURN retval;
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;

Being a smart guy, Pietro knows immutable function results are cached much better than otherwise, and strict functions won’t even run if a parameter is null. With these two declarations, if any invocations take the same parameters, or a parameter is empty, Postgres may not even execute the function at all. Avoiding a jump operation is a huge optimization all by itself, so consider the benefits of preventing the overhead of a function call and its associated logic.

By defining the calculations in this manner, his process would look like this instead:

INSERT INTO target_table
SELECT some_calculation(col_a, col_b, ...),
       other_calculation(col_d, col_e, ...),
       ...
  FROM ...
 WHERE ...;

Or maybe Pietro knows the value of data staging. Caching is great, but in sufficiently complicated examples of business logic, transforming it into immutable functions may be somewhat difficult. In addition, some “calculations” may rely on the results of another query. This is where temporary and unlogged tables, or CTEs come in.

Consider the fact databases natively communicate using Set Theory. They are highly optimized, and specifically engineered to bulk-process rows in ways user-generated applications simply can’t match in all but the edgiest of cases. So the final iteration of the data onboarding might resemble this:

CREATE TEMP TABLE expand_series AS
SELECT ...;
 
CREATE TEMP TABLE relate_series AS
SELECT ...;
 
CREATE TEMP TABLE basic_tally AS
SELECT ...
  FROM expand_series
  JOIN relate_series ON (...);
 
INSERT INTO target_table
SELECT ...
  FROM basic_tally
  JOIN ...
  ...
 WHERE ...;

Or maybe Pietro likes CTEs more, so uses those instead of multiple temporary table steps. Postgres will fully materialize everything anyway. The primary benefit to temporary tables is that repeated execution of multiple steps won’t require re-running all previous elements. If this were some kind of one-time report, a CTE would arguably make more sense as it operates atomically.

Regardless, this kind of iterative transformation allows us to fully embrace Set Theory. Don’t operate on one row, but all of them. If we can organize calculations into distinct categories or steps, then we’re actually requesting the database to transform the data as a whole for each operation. For this entire data set, add some derived date column. For this entire data set, produce a standard deviation variant along this window. For this entire data set, derive these additional columns from these formulae.
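As a tiny sketch of what that looks like in practice, here’s a single pass over a hypothetical readings table; the table and column names are stand-ins, since we don’t know what Pietro’s series actually contains:

-- Every derived column and windowed aggregate below is computed for the
-- entire set in one statement, rather than row by row in a loop.
CREATE TEMP TABLE expanded_readings AS
SELECT reading_time,
       reading_time::DATE AS reading_date,
       val,
       AVG(val)    OVER w AS rolling_avg,
       STDDEV(val) OVER w AS rolling_stddev
  FROM raw_readings
WINDOW w AS (ORDER BY reading_time
             ROWS BETWEEN 6 PRECEDING AND CURRENT ROW);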

That’s how databases think, and Postgres is no different in this regard. By looping through each individual result, we’re actively disrupting that process and literally making it hundreds or thousands of times slower. We could have all the memory in the world, and even find some way to fully utilize it, and it wouldn’t influence that universal truth.

Now, we don’t know what Pietro is actually doing, just that he’s convinced more RAM might help. The evidence he provided suggests that isn’t the case. Having sixteen processes running at full blast for two days either means he has a massive amount of data, or there’s some unaddressed inefficiency in his overall approach.

I suspect the latter only because it’s so common. People don’t usually think like databases, and really, who can blame them? What kind of nutcase would willingly do this for a living?

Figure 1: A Nutcase

PG Phriday: CONFLICT of Interests

MySQL has had a REPLACE INTO syntax to perform “UPSERT” logic since practically the very beginning. For the longest time, users who wanted to switch to Postgres, but for whatever reason relied on this functionality, were essentially trapped. Postgres 9.5 changed all that, but why did it take so long? As with much of Postgres history, it’s a long story.

To really understand where Postgres started, we need to look at the “old” way of handling a row merge. Many in the database world have probably encountered this once or twice:

CREATE TABLE my_tab (a INT PRIMARY KEY, b TEXT);
 
CREATE FUNCTION merge_tab(k INT, v TEXT)
RETURNS VOID AS
$$
BEGIN
  LOOP
    -- First, try to update the key.
 
    UPDATE my_tab SET b = v
     WHERE a = k;
 
    IF FOUND THEN
      EXIT;
    END IF;
 
    -- The key doesn't exist, so try to insert it.
 
    BEGIN
      INSERT INTO my_tab (a, b) VALUES (k, v);
      EXIT;
 
    EXCEPTION WHEN unique_violation THEN
      -- Nothing here, allow the loop to continue.
    END;
  END LOOP;
 
END;
$$ LANGUAGE plpgsql;
 
SELECT merge_tab(1, 'James');
SELECT merge_tab(1, 'Jimmy');

What on Earth is all of that? Oddly enough, the somewhat convoluted logic is not only sound, it’s actually required to avoid a race condition. In the microseconds between attempting our UPDATE and following to the INSERT, some other transaction may have inserted the “missing” key. In that case, we’d encounter a unique constraint violation.

By catching the exception, we’re not immediately kicked out of the function and are presented with a choice. Do we assume our value is “right” and repeat the loop to apply the update, or just exit silently under the assumption that the successful transaction that beat us is probably fine? This particular function selects the former, because that’s what a merge or upsert tries to guarantee: that the requested action is applied. Were we to omit the loop, the exception block would ensure there was no conflict or fatal error, but we could no longer rely on the function operating as advertised.

So why not invert the logic and remove the loop entirely? After all, we could just attempt the insert and if it fails, perform the update within the exception block, right? Actually no. Consider what happens if the target key is deleted by a concurrent transaction. Say we try our insert, and in the space of time between the key violation and our update, the row gets deleted. Suddenly our update matches nothing, and the merge silently accomplishes nothing at all. That’s probably an extremely unlikely edge case, but in OLTP databases, the unlikely becomes frighteningly common. So to be safe, we’re stuck with the loop.
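For illustration, here’s a minimal sketch of that inverted version, with the hole called out in a comment. It looks simpler, but it can’t deliver the guarantee we’re after:

CREATE FUNCTION merge_tab_inverted(k INT, v TEXT)
RETURNS VOID AS
$$
BEGIN
  INSERT INTO my_tab (a, b) VALUES (k, v);

EXCEPTION WHEN unique_violation THEN
  -- If a concurrent session deletes the conflicting row right here,
  -- this UPDATE matches zero rows and the merge silently evaporates.
  UPDATE my_tab SET b = v
   WHERE a = k;
END;
$$ LANGUAGE plpgsql;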

Don’t think about it too much

That is a lot of overhead for what many consider basic functionality. Since that’s no longer a concern, let’s take a look at the actual syntax the Postgres team selected. To do that, let’s start with a very basic table with a handful of rows:

CREATE TABLE symbol_mapper (
  vendor_id    BIGINT   NOT NULL,
  ext_mapping  VARCHAR  NOT NULL,
  symbol       VARCHAR  NOT NULL,
  PRIMARY KEY (vendor_id, ext_mapping)
);
 
INSERT INTO symbol_mapper VALUES (1, 'Google', 'GOOGL');
INSERT INTO symbol_mapper VALUES (1, 'Apple', 'AAPL');
INSERT INTO symbol_mapper VALUES (2, 'goo', 'GOOGL');
INSERT INTO symbol_mapper VALUES (2, 'app', 'AAPL');
 
ANALYZE symbol_mapper;

The purpose of a mapping table is to fill the role of decoding external names or lookup values to match internal ones. Since each vendor may have its own designation structure, we require a mapping for each. That also protects us in case two vendors use the same identifiers.

So far we have a fairly standard application of tables. Now let’s do something interesting:

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (2, 'app', 'AAPL')
    ON CONFLICT DO NOTHING;
 
INSERT 0 0

In this particular case, the “app” mapping already exists for vendor 2, so no insert takes place. This is the equivalent of not wrapping our insert/update with a loop. We don’t care what the value is, just as long as something is there. In reality, this is more of a way to suppress error output from violations than anything immediately useful.

The real fun doesn’t start until we integrate the DO UPDATE functionality. Consider the case where we want to add Samsung as a mapping for a vendor. In this particular case, someone sneaked an existing row into the system, and it contains a typo.

INSERT INTO symbol_mapper VALUES (1, 'Samsung', 'SSLNF');
 
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Samsung', 'SSNLF')
    ON CONFLICT (vendor_id, ext_mapping)
    DO UPDATE
   SET symbol = 'SSNLF';
 
INSERT 0 1
 
SELECT * FROM symbol_mapper;
 
 vendor_id | ext_mapping | symbol 
-----------+-------------+--------
         1 | Google      | GOOGL
         1 | Apple       | AAPL
         2 | goo         | GOOGL
         2 | app         | AAPL
         1 | Samsung     | SSNLF

What we’ve done here is ensure the newest incoming mapping is the “correct” one; that’s our merge. An observant reader might ask how this is any different from our futile desire to attempt an INSERT with an UPDATE in an exception block. Unlike those two separate statements and the time-consuming exception handling, this is a single atomic action.

Did another session delete the row before us? We’ll just insert it again. Did another transaction delete the row we just inserted or updated? Oh well. The important part is that it is impossible to delete the row while our statement is running. So while the logic is similar to using an exception, the difference is that DO UPDATE is built into the database itself, so it can’t be broken into multiple actions that can be interrupted.

Another interesting bit of syntax is that we can actually incorporate a WHERE clause into the update beyond the implicit assumption that our update affects the same key we tried to insert.

INSERT INTO symbol_mapper VALUES (1, 'Sony', 'SONY');
 
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Sony', 'SNY')
    ON CONFLICT (vendor_id, ext_mapping)
    DO UPDATE
   SET symbol = 'SNY'
 WHERE symbol_mapper.symbol = 'SONY';
 
INSERT 0 1
 
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Sony', 'SNY')
    ON CONFLICT (vendor_id, ext_mapping)
    DO UPDATE
   SET symbol = 'SNY'
 WHERE symbol_mapper.symbol = 'SONY';
 
INSERT 0 0

In this case, we had a mapping for Sony that needed a correction. The first query affected the row we targeted, and the second did nothing. This is important because if we had not specified that predicate, both updates would have successfully modified the row. And so would all subsequent attempts. Remember Postgres keeps a row version for every update, even if the new and old values are identical. That’s just how MVCC works.

In a loosely built application environment, it isn’t uncommon for several vectors to operate simultaneously. If a dozen of these each upsert the same value, they’ll all be satisfied that their work is complete, and Postgres would be stuck with a dozen duplicate old rows. VACUUM (and autovacuum) will ensure old row versions are recycled, but again, that’s more overhead we don’t need to invoke.

And of course, the WHERE clause isn’t restricted to deflecting repeated update attempts. There may be circumstances where we simply don’t want to apply changes. By specifying the table name, we can introspect into any of the existing table values. What about the values we attempted to insert? Since these were part of an inherent violation, they’re assigned to a record named “excluded”.

Here it is in action:

INSERT INTO symbol_mapper VALUES (1, 'Microsoft', 'msft');
 
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Microsoft', 'MSFT')
    ON CONFLICT ON CONSTRAINT symbol_mapper_pkey
    DO UPDATE
   SET symbol = 'MSFT'
 WHERE symbol_mapper.symbol != excluded.symbol;

This is a very similar situation as we had with Sony. The mapping for Microsoft needs an update if the existing value doesn’t match the one we’re attempting to insert. Well, we can perform that check explicitly without hard-coding those values into the query multiple times. It’s possible to refer to anything in the VALUES tuple by specifying “excluded”. Handy, eh?

Also notice that we’ve changed our conflict condition. Previously we had simply listed the columns in the primary key, and Postgres inferred the proper constraint from that definition. In this case, we directly stated the constraint that Postgres should use in resolving the conflict. It’s somewhat uncommon (and probably not entirely safe) to directly invoke constraint names, but the option is there in case we want it.

This feature was a long time coming; Postgres 9.5 was released in early 2016. As impossible as it sounds, we’ve only really had a little over a year to leverage ON CONFLICT. As a consequence, it’s still slowly seeping into existing Postgres application stacks. Users are still incorporating it into their workflows. It’ll be a while before it’s taken for granted with the rest of the kitchen sink Postgres offers.

Until then, it’s that special toy we’ve always wanted but couldn’t afford until now. There’s nothing else to do but make up for lost time!

“Toy”

P.S. Before I forget, Postgres Open is getting ready for 2017! If you or someone else wants to attend or even submit a talk, I highly recommend doing so. I’ve been something of a regular fixture there since it started in 2011, and I fully intend to try my hand again. It’s taking place in San Francisco this time around, so the change in venue will definitely be interesting. Apparently it’s all part of the new United States PostgreSQL Association, so it’s good seeing everything pull together behind a united entity. Here’s looking to the future!

PG Phriday: RESTing in the Corn

Last week we explored using Postgres as a central communication nexus between several data sources. At the time, I made a fairly hand-wavy allusion to REST interfaces. Since I hadn’t really explored further, I had assumed PLV8 could use core node.js or other similar libraries to invoke HTTP APIs. Of course as a trusted language, PLV8 isn’t allowed to do that. It’s more of a language for easily manipulating JSON and JSONB objects within Postgres.

Only slightly less violent than JSON

So we need some other way of calling out to an external HTTP resource, wrapping it in column trimmings, and sending the data to Postgres. As demonstrated last week, we could do this directly with a Python (or some other language) function. Alternatively, we can use an extension that has access to Python. Then all we would need to do is write a Python library, and suddenly we can interact with the REST interface as if it were an actual Postgres table. This is where Multicorn comes in.

One of the easiest ways to install Multicorn is through PGXN. It goes something like this:

sudo pgxn install multicorn

Otherwise, users of Red Hat variants have a yum repo, while Debian/Ubuntu users can rely on the apt repo. Multicorn is popular enough that it’s included in both of these resources, which is very handy for us.

In any case, we need a basic REST service to poll. In the interest of keeping things simple, let’s use Flask to whip up an API for fetching all the words in a UNIX dictionary file. Since the jsonify method is extremely slow, it’s probably a good idea to cache the output too. The end result might look something like this:

from flask import Flask, jsonify
 
app = Flask(__name__)
 
counter = 1
dictionary = []
cached = None
 
words = open('/usr/share/dict/words', 'r')
 
for line in words:
    dictionary.append({'id': counter, 'word': line.strip()})
    counter += 1
 
words.close()
 
@app.route('/')
def index():
    global cached
 
    if not cached:
        cached = jsonify({'records': dictionary})
 
    return cached
 
if __name__ == '__main__':
    app.run(debug=True, port=9999)

Our REST interface replies to only one URL and accepts no parameters. It’s not exciting, but it does the job of transmitting data to HTTP requests. With this “service” running in the background somewhere, we can continue with our task of turning that output into a Postgres table.

The next thing we need to do is make use of the Multicorn Python libraries. In this arena, Python setuptools are our friend. With them, we can write a Python library that imports Multicorn and it will automatically install everything in a compatible location. Here’s a bare-bones setup.py file that might work:

from setuptools import setup
 
setup(
  name='dictionaryfdw',
  version='0.0.1',
  author='Shaun Thomas',
  license='Postgresql',
  packages=['fdw']
)

That takes care of actually installing the project library files. Next we need to actually use the Multicorn API. As a pure foreign data wrapper API, it has a lot more capability than we actually need. By default it can represent direct access to CSV files, filesystem objects, RSS feeds, and a bevy of other cute implementations.

Unfortunately none of these use cases apply to wrapping up /usr/share/dict/words. Multicorn can interact with authentication, pass predicates to help pre-optimize matches, and has several other capabilities we simply won’t leverage here. We just need one table from one file, with very little in the way of parsing.

To that end, we only need to define a single execute method in a class derived from ForeignDataWrapper. With that in place, Python has a very capable protocol request handler we can leverage to actually interact with our REST API. It provides everything we need to get the data and convert the HTTP response from JSON into Python dicts that represent each row of the dictionary contents.

There isn’t much exciting here:

import requests
from multicorn import ForeignDataWrapper
 
class DictionaryFDW(ForeignDataWrapper):
 
    def execute(self, quals, columns):
        response = requests.get('http://localhost:9999/')
        data = response.json()
 
        for row in data['records']:
            yield row

To install it, we merely invoke setup.py:

sudo python setup.py install

Everything is now in place. As a Python library, Multicorn has access to other Python tools installed on the system. This means the primary option to the wrapper will be a named Python class. So long as that class is compatible, we will have a functional Postgres extension.

It just so happens that Multicorn implements this at the SERVER level. This makes sense as each server generally has its own defined interaction model. The CSV server would take filenames as parameters for foreign tables, RSS feeds would need the URL for each external resource, etc. Our wrapper is somewhat less… sophisticated. It takes no parameters and offers no advanced functionality. As such, once we define the server that invokes the library, we’re basically done.

Here’s the code that would create our dictionary table and a sample invocation:

CREATE EXTENSION multicorn;
CREATE SERVER multicorn_srv
       FOREIGN DATA WRAPPER multicorn
       OPTIONS (wrapper 'fdw.DictionaryFDW');
 
CREATE FOREIGN TABLE dictionary (
  id    INT,
  word  VARCHAR
) SERVER multicorn_srv;
 
\timing ON
 
SELECT * FROM dictionary
 LIMIT 5 OFFSET 9000;
 
  id  |    word    
------+------------
 9001 | Lubbock
 9002 | Lubumbashi
 9003 | Lucas
 9004 | Luce
 9005 | Luce's
(5 rows)
 
Time: 167.939 ms

It works! Beyond that, notice how slow the results are. Though there are only about 100k rows, even fetching a mere handful takes an exorbitant amount of time. This happens for a few reasons.

First consider that this is an external API call. Each request must trigger an HTTP request, parse the results, and then encode and return each row. While we know what our REST process does, others may not be so forthcoming and are essentially black boxes. At least some of the delay is due to whatever work the REST service performs on its end. Then we have the unfortunate fact that, bless its heart, Python is generally a dog’s breakfast when it comes to performance.

Yum!

We could fix some of these problems. Multicorn does have the ability to pass along predicates after all. A more advanced library could implement basic regular expression, equality, or other logic to pass along WHERE clauses like these:

EXPLAIN ANALYZE
SELECT * FROM dictionary
 WHERE word LIKE '%house%';
 
                         QUERY PLAN
------------------------------------------------------------
 FOREIGN Scan ON dictionary
     (cost=20.00..20000000000.00 ROWS=100000000 width=36)
     (actual TIME=164.616..326.678 ROWS=189 loops=1)
   FILTER: ((word)::text ~~ '%house%'::text)
   ROWS Removed BY FILTER: 98982
 Planning TIME: 0.248 ms
 Execution TIME: 326.896 ms

While in our case Postgres manually removed all inapplicable matches, a more advanced wrapper could transform these into parameters to the API call itself. In such a case, the REST call might return a much smaller (and faster) row subset. Barring that, there’s another way to cheat the system: materialized views!

We have, after all, already used these to represent regularly updated external snapshots. If we treat the API like a rolling data window, we can just refresh a materialized view with its contents and enjoy direct manipulation of instantiated contents. That means indexes and all the trimmings! A major benefit of this approach over spending all of our time optimizing our wrapper is that we don’t have to duplicate existing Postgres functionality.

We would naturally want to eventually fix the more obvious inefficiencies in our wrapper interface. Yet until that happens—and even afterwards, since performance enhancements are cumulative—we can do something like this:

CREATE EXTENSION pg_trgm;
 
CREATE MATERIALIZED VIEW mv_dictionary AS
SELECT * FROM dictionary;
 
ANALYZE mv_dictionary;
 
CREATE INDEX dictionary_word_trigram
    ON mv_dictionary USING GIST (word gist_trgm_ops);
 
EXPLAIN ANALYZE
SELECT * FROM mv_dictionary
 WHERE word LIKE '%house%';
 
                         QUERY PLAN
------------------------------------------------------------
 Bitmap Heap Scan ON mv_dictionary
     (cost=3.36..30.78 ROWS=10 width=13)
     (actual TIME=6.454..6.628 ROWS=189 loops=1)
   Recheck Cond: ((word)::text ~~ '%house%'::text)
   ROWS Removed BY INDEX Recheck: 1
   Heap Blocks: exact=41
   ->  Bitmap INDEX Scan ON dictionary_word_trigram
           (cost=0.00..3.35 ROWS=10 width=0)
           (actual TIME=6.435..6.435 ROWS=190 loops=1)
         INDEX Cond: ((word)::text ~~ '%house%'::text)
 Planning TIME: 0.152 ms
 Execution TIME: 6.783 ms

One cool trick Postgres has up its sleeve is that it can leverage quirky index types. One of these is based on word trigrams. It’s a lossy format that will inevitably return more matches than we actually want on the first pass, but remember that Postgres will apply a post-filter to remove excess matches. We can see that in the Recheck Cond line from the execution plan above, which yanked one row that didn’t match our LIKE clause. That’s a small price to pay for reducing a 326ms runtime to 6ms!

Implementing trigram logic in our Python library or REST service would be a huge headache, as would duplicating LIKE handling. Since words are most likely to have inexact and fuzzy searches, we may want to avoid the whole mess. With a materialized view, we can let Postgres do all of the complicated data manipulation while the REST API focuses on bridging the gap to our external resource.
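Keeping that cached copy current then becomes a simple scheduling problem. Here’s a sketch, assuming we add a unique index so the refresh can run without locking out readers:

-- REFRESH ... CONCURRENTLY requires a unique index on the materialized view.
CREATE UNIQUE INDEX mv_dictionary_id_idx ON mv_dictionary (id);

-- Run from cron or any scheduler at whatever cadence the feed warrants.
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_dictionary;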

Once again, we’re left with a database that contains none of our own data. Tables we create by wrapping and caching external contents give us a transient scratch zone and staging area for actual analysis.

In a real-world scenario, we could only hope the REST interface and its corresponding foreign data wrapper component wouldn’t be as lazy as the one presented here. Perhaps a better alternative would be to write a full REST-specific foreign data wrapper in Go. In such a world, feed sources and parameters, column decoding, caching layers, and any number of other optimizations would be efficient and native to web-driven content. One wrapper and server combination could represent any number of foreign tables instead of ours which only works with a single service.

The options, like Postgres, are endless and varied. That’s the world foreign data wrappers give us.