Adventures in Server Sitting

To support more of my tinkering in an effort to test various Postgres cluster configurations, I decided it would be really nifty to have a virtual server. I could not only spin up VMs and containers to validate architectures, but experiment to my heart’s content with other potential technologies.

At first, I was going to buy an Antsle. But the fact such a thing existed made me wonder what other kinds of dedicated virtual device hardware might exist. That led me in endless random directions until I stumbled over a link to eBay that boasted a price that seemed both ridiculous and impossible. Apparently old servers are incredibly cheap and plentiful these days. After the shock wore off, I eventually settled on a Dell PowerEdge R710. This is what I ended up getting for about $300:

The iDRAC6 Enterprise is important, as it enables a virtual console over TCP through a Java app that launches from a web browser pointed to the configured IP address. No need for monitors, here! It’s an optional module and cheap by itself, but the particular model I bought had one already. Well, it would have, but they forgot to send that part. Though I fiddled with the server after it arrived, I couldn’t really go nuts with it without the iDRAC card, because that would mean dragging it over to my desk and actually hooking it up to one of the other inputs on my monitor. No thanks.

I called the seller’s support line and got them to send me the missing card. Having taken care of that, I did some more searches to decide how I wanted to configure the beast. For example, how should I configure the storage so that it’s expandable in the future? RAID 10? JBOD with software RAID and LVM for the sake of flexibility?

Naturally I searched Reddit for R710 threads, and stumbled into the Homelab reddit. Apparently it’s one of the most popular servers to start with because it’s so hilariously inexpensive and plentiful. A few helpful replies later and I had my answer: install two more 1TB drives and install ZFS.

But that left me with a new problem. ZFS likes to have direct control of devices, and the included PERC 6/i RAID card didn’t have device pass-through mode. Fortunately Homelab was helpful in that regard as well. It’s possible to flash a H200 RAID card to Host Bus Adapter (HBA) mode because it’s basically a re-branded LSI card. Well, as with all parts for the R710, such a card is also comparatively cheap, as are the cables necessary to connect it to the SAS backplane for the drives.

Which of course led to the next stumbling block. ZFS on Linux is still a bit sketchy as a boot device since its license isn’t compatible with the Linux Kernel. As such, it’s only included by Ubuntu, and I’d be one sketchy kernel away from having an unbootable system. It’s usually better to install Linux on a separate boot device and use the hard drives as a storage pool. Well, I’m not using the included DVD drive for anything, so out it goes. It’s being replaced with a 250GB Samsung 850 Evo. Not only can that act as a stable filesystem, but I can partition it as a ZFS read/write cache as well. After reserving 50GB for the OS, 200GB beats a 1GB RAID controller any day.

I finally got most of the parts and spent the weekend reconfiguring everything. I have to say, re-flashing a RAID controller into an unsupported configuration isn’t easy, especially when the available instructions are conflicting, haven’t been updated in a year, and weren’t written specifically for the R710. These were the most comprehensive instructions I could find, and supplied all of the files I’d need to pull off the entire surgery.

First I needed to get the iDRAC6 working so I could decouple the server from my monitor. That took over an hour because the included module includes its own dedicated LAN port which is required to access the virtual console. I didn’t know it gets bypassed unless you boot into the iDRAC and change its configuration to use the dedicated port instead. That took a lot of reboots. Did I mention that the R710 takes several minutes to boot? It’s true, and that’s something that’s apparently common with most server hardware, even if you don’t count the numerous prompts to configure various add-on components.

Well, I finally got that working, and moved on to the H200. I followed the instructions I found, and got stuck because they were wrong. For whatever reason, the R710 or the H200 don’t like the megarec.exe utility, and hung every time I tried to use it for the first two steps. Cue a few more reboots while I figured out it wasn’t just really slow at flashing the firmware and was actually stuck. With a bit more research, I eventually got everything working with these commands:

  1. Erase the existing BIOS and firmware, then replace it with a generic 6GB SAS profile.

    sas2flsh.exe -o -e 6
    sas2flsh.exe -o -f 6GBPSAS.fw -b mptsas2.rom
  2. Reboot.

  3. Flash the generic LSI HBA firmware.

    sas2flsh.exe -o -f 2118it.bin

I say I eventually got it to work, because the instructions I used were based on the megarec.exe utility working as described. Since I had to use sas2flsh.exe to erase the card’s firmware instead, I also removed the card’s BIOS. So even though the card had the right software, there was no BIOS to actually invoke it. That took several more reboots to diagnose and remedy.

But finally after three befuddling hours, everything was working. I didn’t fully remove the existing RAID card just in case all of this went sideways, so I’ll be finishing the operation later. At that point, I’ll remove the old controller so it stops complaining about having no cables attached. The 850 Evo hasn’t arrived yet, so I’m in no hurry. Once it comes in, it will replace the DVD drive, and then I can shove the server in my closet where it will supply me with VMs and containers until it dies.

The next leg of my adventure will be installing Proxmox to manage the VM army. I experimented with it today to verify all four drives function properly in a RAID-10 configuration with the new controller card, but that’s about all. All I know is that I managed to get things mostly working, and I’ve got a lot further to travel before I have something I can use as a virtual laboratory.

But hey, at least I’m closer than I was before.

PG Phriday: Community Edition

Postgres is one of those database engines that carves out a niche and garners adherents with various levels of religious zeal. The community, while relatively small when compared to that of something like MongoDB, is helpful almost to a fault. Members from the freshest minted newb to the most battle tested veteran will often trip over themselves to answer questions found in the various dedicated forums, mailing lists, and chat rooms. To that end, let’s answer one particular question that ties everything together: what exactly is available to someone who wants to participate with the Postgres universe these days?

First and foremost, the Postgres site has a community page to start our trip into those most arcane depths. For obvious reasons, this is probably the most comprehensive resource available for interacting with the Postgres community in one way or the other. Let’s explore a bit.

Easily Digestible

At first glance, it’s just two innocuous paragraphs with a sprinkling of links and a couple of sidebars. Delve into the mailing lists however, and our trip down the rabbit hole begins in earnest. There are quite a few categories of interest, and subscriptions work in the usual ways: regular delivery, digest, or lurk-only for those who only feel comfortable asking questions rather than answering them. I personally subscribe to these, but there’s no reason to restrict yourself to my esoteric interests:

  • pgsql-admin: Admins and admin wannabes; the perfect place to just talk shop with other DBAs.
  • pgsql-general: The potpourri of database discussion. They’re also not kidding when they say “Please note that many of the developers monitor this area.” Several of the most prolific committers take time to answer questions from this list.
  • pgsql-performance: I spent a lot of time with a database that handled over a billion queries per day, and needed this list in order to survive. After a while, I picked up a few things and started answering questions myself. Anyone can follow that path.
  • pgsql-hackers: The Postgres devs post here. And boy do they take advantage of every opportunity to do so. Threads range from proposed features to detailed critiques of potential patch-sets. These discussions have been known to go on for years, covering hundreds of back-and-forth messages before something finally becomes fully incorporated into Postgres. There’s no better way to keep up with the future of Postgres and watch the feature vetting process in action.

That is just a tiny fragment of what’s available there, and the message traffic is fast and furious.

Chattier Than a Chat Bot

It would seem that a sizable contingent of the Postgres community maintains careful watch of the #postgresql channel over at But it wouldn’t be the New Web if there wasn’t also a lively group over at the Postgres Team Slack.

While the mailing lists are great and responsive, there’s something to be said about live responses. In the case of the Slack channel, there’s also the benefit of formatted code pasting. Unfortunately neither of these resources are readily archivable to be available in perpetuity, nor is there any easy delineation of topics since chat is a free-for-all. This makes it somewhat difficult to refer to past discussions with any reliability—such is the black-hole of transient conversations.

Despite that, being in a virtual room with other Postgres users makes it much easier to informally swap examples and ideas. If AOL or Yahoo chat rooms were still a thing, there would likely be Postgres communities there as well. Maybe I’m looking in the wrong spots, but I find this particular resource critically under-utilized. Social media has purportedly replaced chat rooms, but it’s a demonstrably different concept that hopefully makes a comeback.


Speaking of social media, there is a Postgres Facebook group, and of course a PostgreSQL Google+ group with a related PostgreSQL community. And we can’t leave out the PostgreSQL Twitter.

Beyond these, local communities often have their own user groups so people can actually get together and treat Postgres like a hobby. I joined the Chicago PUG back when it started, and try to attend whenever I’m in town on business. There’s something to be said for the personal touch. It’s not official, but many of these groups tend to coordinate their activities through Meetup, which is handy for users of that service.

For those who live in an area without a local Postgres group and don’t feel comfortable starting one, Postgres has a bevy of conferences running through the year. One or more of these might be nearby, so why not attend and meet some of the people who make Postgres tick? I don’t go to all of these, but Postgres Open started in Chicago, so I’ve tried to make a regular appearance when I have interesting (and presentable) material. I also hear good things about PGConf US over in New Jersey.

And that brings us to PgUS and its ilk. While Postgres has no official owner, there is a lot of coordination between the various enclaves to reach some sort of consensus as to the direction the project takes. Some of this is in the form of advocacy and education, so there’s no need to be a Postgres prodigy to contribute.

For those stark few who find database engines exciting, meet up with your fellow “non-robots” and spread the word!


Part of learning about Postgres and spreading its use is consuming the copious blogs, wikis, and manuals dedicated to the subject. You likely came here from Planet PostgreSQL, which acts as the official blogroll/RSS-feed for all of the various articles that regularly churn out from sources such as myself. If not, there is no better way to stay abreast of new Postgres techniques, features, or just useful knowledge lost to the annals of history. Many of the Big Names(tm) regularly post on their personal or company blogs, and all of these critical insights get syndicated in a single place for everyone to enjoy.

Lest we forget, there’s also the PostgreSQL Wiki. Any sufficiently vetted community member can make entries here to make the Postgres world a bit more complete. Unlike the official documentation which concentrates specifically on describing functionality in the context of reference material, the Wiki is more human friendly and elastic. I often refer to new release pages to act as a cheat sheet when sharing new features. And where else can they put a complete list of foreign data wrappers or procedural languages, when most are not even officially Postgres projects?

I have to admit I usually feel too timid to taint any official Postgres resource with my inane rambling, but someone did it for me a while ago, so maybe that fear is unwarranted. We’ll see what the future holds!


The Postgres community is part of the reason I am where I am, and know what I do. I learned enough on the job and through the mailing lists that I wrote a book on High Availability and Postgres to share the wealth. It is an invaluable resource for anyone who is willing to learn, and potentially a better one for those who are capable of contributing. The people are friendly, capable, and enthusiastic. That last one always makes me laugh considering the relatively dry nature of the underlying material: database software. We’re excited users and advocates of database software. If anyone had told me as a child that this would be my future, I would have just responded with: boring!

But it isn’t. It really isn’t, and I have no real answer as to why. Maybe it’s the challenge, or perhaps the community itself and its limitless encouragement. Whatever the case, Postgres is my bag. To that end, this will also be the last official PG Phriday on this site. Perhaps it was inevitable given my dedication to the cause, but I’m now a member of the team at 2nd Quadrant. As such, it only makes sense that my Postgres-related material originate from that unified front. Their blog is also part of the Planet PostgreSQL feed, so if you haven’t already, consider this a second entreaty to become a regular reader of that site.

Here’s to the future of Postgres, and everyone that works to ensure it’s a bright one!

PG Phriday: Who Died and Made You Boss?! (The Investigatining!)

The Postgres system catalog is a voluminous tome of intriguing metadata both obvious and stupendously esoteric. When inheriting a Postgres database infrastructure from another DBA, sometimes it falls upon us to dig into the writhing confines to derive a working knowledge of its lurking denizens. The trick is to do this before they burst forth and douse us with the database’s sticky innards and it experiences a horrible untimely demise.

Bane of unwatched databases everywhere.

To prevent that from happening, it’s a good idea to check various system views on occasion. The alternative isn’t always outright disaster, but why take the chance?

An ideal place to start is the pg_stat_activity view. It tells us what each session is (or was) doing, where it originated, who owns it, and a myriad of other juicy details. There’s just one problem:

SELECT pid, usename, query FROM pg_stat_activity;
  pid  | usename  |                       query                       
 11415 | postgres | <insufficient privilege>
   482 | sthomas  | SELECT pid, usename, query FROM pg_stat_activity;

While logged on as an unprivileged user, we can only see our own activity. The query column is protected such that only superusers can view its contents for all users. This is a security measure since it’s possible a query has sensitive information embedded somewhere. However, there are a lot of useful contextual or debugging elements in that field that a service monitor or other automated tool might find illuminating.

How do we give a user access to this data—automated or otherwise—without committing the greatest of sins by making them a superuser? One common technique is to simply wrap the view with a function that returns rows. Like this:

CREATE ROLE monitor;
CREATE OR REPLACE FUNCTION public.pg_stat_activity()
RETURNS SETOF pg_catalog.pg_stat_activity
AS $$
  SELECT * FROM pg_catalog.pg_stat_activity;
REVOKE ALL ON FUNCTION public.pg_stat_activity() FROM PUBLIC;
GRANT EXECUTE ON FUNCTION public.pg_stat_activity() TO monitor;
GRANT monitor TO sthomas;

Note that we also created a role that we can use for this kind of elevated access. By granting access to the role, we have an abstract set of privileges we can grant to, or revoke from, other users. Why track down every single grant a user might have, when we can just revoke a few roles instead?

The function is set as a SECURITY DEFINER so it runs as the user who owns it. The presumption here is that we create the function as another superuser (usually postgres), and then the query column is no longer protected for users given access to the function. Be wary when doing this however! Did you notice that we prepended the pg_catalog schema in the SELECT statement itself? This prevents the user from changing their search path and tricking the system into giving them superuser access to a view that’s really a select on a sensitive table. Sneaky!

We also needed to explicitly revoke execution access from the PUBLIC domain of all users. By default, functions created in Postgres can be executed by anyone. There are ways to change this, but most DBAs either don’t know about it, or haven’t done so. As a consequence, there are a lot of functions out there that are unnecessarily promiscuous. Elevated functions especially need this step!

After granting access to the new role, we can attempt the previous activity query again:

SELECT pid, usename, query FROM pg_stat_activity();
  pid  | usename  |                       query                       
 11415 | postgres | GRANT monitor TO sthomas;
   482 | sthomas  | SELECT pid, usename, query FROM pg_stat_activity();

Success! It feels inherently wrong to deliberately circumvent that kind of security measure, but we do as needs must. At least we did it safely. Having access to the query column is important in several contexts, especially if our application stack isn’t particularly sensitive.

Now that we have curated access to session activity, we may also need to observe how the database is working with its hardware resources. When multiple queries access data in shared memory for instance, knowing the contents might help us size it properly. It’s possible to access this information by activating the pg_buffercache extension.

With that knowledge firmly in hand and a bit of window function magic, here’s a query that we might work. In this case, I created a benchmark database and ran a couple iterations on it to generate some buffer activity.

CREATE EXTENSION pg_buffercache;
       round(COUNT(*) * 8.0 / 1024, 2) AS mb_used,
       round(c.relpages * 8.0 / 1024, 2) AS object_mb,
       round(COUNT(*) * 100.0 / c.relpages, 2) AS object_pct,
       round(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS buffer_pct
  FROM pg_buffercache b
  JOIN pg_class c USING (relfilenode)
  JOIN pg_namespace n ON (c.relnamespace = n.oid)
 WHERE n.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
 GROUP BY c.oid, c.relpages
 LIMIT 10;
          oid          | mb_used | object_mb | object_pct | buffer_pct 
 pgbench_accounts_pkey |    3.22 |     21.45 |      15.01 |      53.16
 pgbench_accounts      |    2.75 |    128.08 |       2.15 |      45.42
 pgbench_history       |    0.05 |      0.62 |       8.86 |       0.90
 pgbench_tellers       |    0.02 |      0.01 |     200.00 |       0.26
 pgbench_branches      |    0.02 |      0.01 |     200.00 |       0.26

The database here is only using the default 8MB buffer, so there isn’t a whole lot of room for actual database caching. However, we can see that the account table is extremely active, and only 15% of it is cached. From this information, it’s clear we should increase shared memory to accommodate the full size of the accounts primary key, and possibly a larger portion of the accounts table.

There are, of course, other ways we can view table activity in Postgres. Consider for instance that Postgres constantly aggregates statistics as tables are written to, or read from. The statistics collector is actually rather prolific.

This query is a great way of seeing which tables are the most active:

SELECT relname AS TABLE_NAME, SUM(n_tup_ins) AS inserts,
       SUM(n_tup_upd) AS updates,
       SUM(n_tup_del) AS deletes
  FROM pg_stat_user_tables
 ORDER BY SUM(n_tup_ins + n_tup_upd + n_tup_del) DESC
 LIMIT 10;
    TABLE_NAME    | inserts | updates | deletes 
 pgbench_accounts | 1000000 |   15431 |       0
 pgbench_tellers  |     100 |   15431 |       0
 pgbench_branches |      10 |   15431 |       0
 pgbench_history  |   15431 |       0 |       0

Remember that benchmark I mentioned earlier? We can actually see how many rows each table started with, how many were modified during the benchmark itself, and the fact that the history table was part of the benchmark transaction.

These statistics are lost a number of ways because they’re generally only relevant while the server is actually running. We can also manually reset them with the pg_stat_reset() function, in case we are doing more active forensics and want to see live accumulation.

And this is only the write data. What about reads? Elements such as sequential or index scans are equally important if we want to know how well our indexes are performing, or identify tables that might need more or better indexes.

Here’s a different use of the same stat table:

       s.seq_scan, s.idx_scan,
       s.idx_tup_fetch, c.reltuples AS ROW_COUNT
  FROM pg_stat_user_tables s
  JOIN pg_class c ON (c.oid = s.relid)
 ORDER BY s.seq_scan DESC, idx_scan DESC
 LIMIT 10;
    TABLE_NAME    | seq_scan | idx_scan | idx_tup_fetch | ROW_COUNT 
 pgbench_branches |    15433 |        0 |             0 |        10
 pgbench_tellers  |    15431 |        0 |             0 |       100
 pgbench_accounts |        3 |    30862 |         30862 |     1e+06
 pgbench_history  |        0 |          |               |     23148

Those numbers line up pretty well. The fact that the account table has three sequential scans is because I was trying to force the data into the shared buffers, so I read the entire contents of the table a few times. Since the branch and teller tables are so small, reading their entire contents is probably the most efficient approach, though it may warrant investigation later if the behavior persists.

If we focus on the account table where all the real activity is taking place, it’s index fetches at all times, except for my manual scans. For a 1-million row table, that’s exactly what we want to see. The fact that there’s a 1-1 relationship with the number of scans to fetches suggests they’re single fetches from the primary key, and also tells us we have good selectivity.

And now that we know a lot of information about our tables, it’s time to move on to the users of those tables.

It only seems that way

More specifically, consider roles. In Postgres, best practices are to create a role and grant it access for multiple tables, functions, or views. Then we would grant the role to specific users that require that access. This is a lot safer than having direct assignments because it’s far easier to revoke and share among related users. Say, multiple people in a reporting department for instance.

But what if we’re given a username and want to see a full list of what they can actually access? That’s not as obvious as it might seem at first glance. Let’s do some permissions magic starting with this example group nesting.

CREATE ROLE benchmark;
CREATE ROLE nested_benchmark;
GRANT benchmark TO nested_benchmark;
GRANT nested_benchmark TO sthomas;
SELECT table_schema, TABLE_NAME
  FROM information_schema.table_privileges
 WHERE grantee IN ('sthomas', 'nested_benchmark');
 table_schema | TABLE_NAME 
(0 ROWS)

What the what!? Just to be clear, the sthomas user does have full read access to all of the pgbench tables. So why aren’t they showing up in the list?

As handy as the information schema is, a critical flaw is that it only considers directly assigned privileges. Nested roles completely circumvent its ability to report access. We need a way to fix that.

Enter the wondrous recursive CTE syntax. We can create a view that “flattens” the role nesting:

CREATE OR REPLACE VIEW v_recursive_group_list AS
  SELECT r.rolname AS user_name, g.rolname AS group_name
    FROM pg_authid r
    JOIN pg_auth_members m ON (m.member=r.oid)
    JOIN pg_authid g ON (m.roleid=g.oid)
  SELECT ag.user_name, g.rolname AS group_name
    FROM pg_authid r
    JOIN pg_auth_members m ON (m.member=r.oid)
    JOIN pg_authid g ON (m.roleid=g.oid)
    JOIN all_groups ag ON (r.rolname = ag.group_name)
SELECT * FROM all_groups;

Again, this isn’t as difficult as it looks. We’re just bootstrapping the results with the existing list of user/role associations. The UNION combines each of these with successive levels of nesting until we’ve exhausted all of them. The result is a username/group row for every role that is granted to a user, or a role which was granted to that role, and so on.

That’s something we can combine with the information schema to actually derive the full list of objects a user can access. Let’s see how that works:

  FROM v_recursive_group_list
 WHERE user_name = 'sthomas';
 user_name |    group_name    
 sthomas   | nested_benchmark
 sthomas   | monitor
 sthomas   | benchmark
SELECT t.table_schema, t.table_name
  FROM v_recursive_group_list gl
  JOIN information_schema.table_privileges t ON (t.grantee IN (gl.user_name, gl.group_name))
 WHERE gl.user_name = 'sthomas';
 table_schema |    TABLE_NAME    
 public       | pgbench_tellers
 public       | pgbench_branches
 public       | pgbench_accounts
 public       | pgbench_history
 public       | pg_buffercache

Huzzah! Now we know what sessions are doing, how active tables are, and have a firm grasp of basic security and exposure. What else is left? How about locks?

There’s a deadlock somewhere in here…

Yes, locks. Sometimes session activity gets out of hand, or transactions a little to fast and furious, and things go somewhat awry. Being able to unravel that mess is essential to keeping a database operating smoothly.

If we combine the contents of pg_stat_activity and pg_locks, we can get a lot of additional information regarding session activity. Before we were only interested in the query that was running and maybe some surrounding context. Now we can see exactly which tables, indexes, views, and other objects are locked, what kind of lock is involved, and so on.

This is probably one of my favorite views to use for this kind of work:

CREATE OR REPLACE VIEW v_activity_locks AS
SELECT, s.mode, s.locktype, a.wait_event, a.state,
       a.usename, a.query_start::TIMESTAMP(0), a.client_addr,
       now() - a.query_start AS time_used, a.query, s.tables
  FROM pg_stat_activity() a
         SELECT pid AS pid, mode, locktype,
                string_agg(relname::text, ', ') AS TABLES
           FROM (SELECT, l.mode, l.locktype, c.relname
                   FROM pg_locks l
                   JOIN pg_class c ON (l.relation=c.oid)
                  WHERE c.relkind = 'r'
                  ORDER BY pid, relname) agg
          GROUP BY 1, 2, 3
       ) s USING (pid);
GRANT SELECT ON v_activity_locks TO monitor;
SELECT pid, usename, state,
       EXTRACT(minutes FROM time_used) AS minutes,
       SUBSTRING(query, 1, 30) AS query_part,
  FROM v_activity_locks;
  pid  | usename  | state  | minutes |          query_part           |              TABLES              
  1927 | postgres | idle   |         |                               | 
 28305 | postgres | idle   |      49 | SELECT c.oid::REGCLASS::TEXT, | 
 11415 | postgres | idle   |       4 | GRANT monitor TO sthomas;     | 
 16022 | sthomas  | active |       0 | SELECT pid, usename, state,   | pg_authid, pg_class, pg_database

Despite being horrifically ugly—and hence why it’s a view—I prefer it to simply joining the two tables. The magic is in the tables column, which lists every table that is locked by the session, in alphabetical order, all in the same result. No more tracing through multiple rows to see all of the participants in a locking mess, it’s all there for everyone to see. One line per session. Add a filter to remove idle queries. Maybe an extra predicate to only consider queries which have been active for more than a few seconds. In the end, we have a single query that can instantly identify areas worthy of additional forensics, and all involved resources.

While we’re thinking about locks, the previous view only really told us what resources a particular session was using and perhaps how it was doing so. If we sorted these results by the query_start column, it wouldn’t be too difficult to see whether or not one session was blocking another. But it’s not exactly a scenario that requires an alibi, and related activity can obscure our results if the database is particularly busy.

Thanks to the newly available pg_blocking_pids function in Postgres 9.6, we can actually see what is blocking a certain action. Before this function was introduced, the only way to figure out what was causing the block was to trace which connections were using the same resources and had their locks granted, versus those that didn’t. In a sufficiently busy system, this wasn’t necessary a causal relationship, but it provided a good starting point. Now we can see exactly what’s causing the block, and we can use that information to our benefit.

Here’s an example of a lingering modification in a transaction that caused a block for another session:

SELECT DISTINCT AS blocker_pid, a.query AS blocker_query,
       a.usename AS blocker_user, a.client_addr AS blocker_client, AS blocked_pid, a2.query AS blocked_query,
       a2.usename AS blocked_user, a2.client_addr AS blocked_client
  FROM pg_locks l1
  JOIN pg_stat_activity() a ON ( =
  JOIN pg_locks l2 ON ( = ANY(pg_blocking_pids(
  JOIN pg_stat_activity() a2 ON ( =
 WHERE l1.granted
   AND NOT l2.granted;
-[ RECORD 1 ]--+------------------------------------------
blocker_pid    | 11415
blocker_query  | ALTER TABLE pgbench_accounts ADD foo INT;
blocker_user   | postgres
blocker_client | 
blocked_pid    | 12337
blocked_query  | SELECT COUNT(*) FROM pgbench_accounts ;
blocked_user   | sthomas
blocked_client |

Instead of using the pg_locks view and pg_stat_activity() function, we could use the v_activity_locks view to see both the blocked session and what blocked it on the same row. This would be indispensable in a really busy system, especially if a single connection has blocked several others. Such an occurrence would practically glow and demand immediate attention. Heck, it would even make an ideal automated system check.

And isn’t that what all of this really boils down to? Some of this is something that requires direct observation. Tracing user access makes sense during security audits, and the table stats can help with optimizing the database itself in many cases. But the rest is really just novel ways of combining the system views to produce quantifiable data that stands out in some way. How many queries have been running for over a minute? How many sessions have been waiting on a lock for over ten seconds?

Those first approximations are the bloodhound to a seasoned DBA. It would be silly to run all of these manually on a regular basis, but when there is a problem, all bets are off. These queries and views barely even scratch the surface of what’s really available in the Postgres system catalog. The Postgres metadata isn’t just there to reflect object organization, and curiosity goes a long way when exploring it.

So follow that bloodhound to the crime scene. There are more than enough tools available to identify, apprehend, and even punish the culprit. Or maybe just explore the stats and tweak to wring extra performance from the existing tables. This should be a good start for doing either, and infinitely more.

PG Phriday: Design Pattern Workshop

Recently on the pgsql-performance mailing list, a question popped up regarding Postgres RAM usage. In this instance Pietro wondered why Postgres wasn’t using more RAM, and why his process was taking so long. There were a few insightful replies, and they’re each interesting for reasons that aren’t immediately obvious. Let’s see what is really going on here, and perhaps answer a question while we’re at it.

Pietro presents several postgresql.conf settings, but here are the ones that matter:

shared_buffers = 24GB
effective_cache_size = 72GB
work_mem = 512MB
maintenance_work_mem = 1GB

The configured system has 96GB of total RAM. Keeping that in consideration, we see that shared_buffers is set to 24GB, which is a quarter of the server’s available memory. The only potential issue here is that it’s possible a checkpoint could be a very painful experience if a significant portion of that memory becomes dirty. Since there are no complaints of high IO waits, let’s just assume that isn’t happening.

Common advice suggests setting effective_cache_size to an amount equivalent to half of system RAM + shared_buffers. A setting of 72GB looks high, but Postgres does not allocate any of that on its own. This parameter merely influences the query planner, since higher values imply an increased probability a disk page is in either Postgres or OS caches. This is an optimization setting, not a memory allocation.

A work_mem value of 512MB is extremely high. In some cases however, that’s perfectly acceptable. The amount specified here will be used for all internal query execution operations that require a memory allocation. This means sorts, hashes, bitmaps, materializations, and so on. Thus a single query can instantiate multiple 512MB segments for each necessary operation. Effectively the user has greatly reduced the likelihood Postgres will utilize disk storage for any query operations.

And finally we have maintenance_work_mem. This is so often set to 1GB, it might as well be the default. This is the amount of RAM Postgres uses when doing things like creating indexes, vacuuming, and anything else that might need vast amounts of working space to proceed. We don’t know Pietro’s procedures, but we can make some inferences later. This setting really only matters if some part of his data onboarding includes adding new indexes or triggers an autovacuum.

Then we see how memory is performing once Postgres is running:

              total        used        free      shared  buff/cache   available
Mem:            94G         28G         46G         17M         19G         64G
Swap:           15G          0B         15G
Total:         109G         28G         61G

That’s not a lot of usage. Of that 28GB, we know Postgres has 24 simply due to the setting for shared_buffers. This means that the actual queries are only consuming 4GB at most.

This is particularly relevant because if we skip to the end of the post, we see this output from top:

 9686 postgres  20   0 24.918g 214236  12628 R 100.0  0.2   2872:38 postgres
 9687 postgres  20   0 24.918g 214212  12600 R 100.0  0.2   2872:27 postgres
 9688 postgres  20   0 25.391g 709936  12708 R 100.0  0.7   2872:40 postgres
 9691 postgres  20   0 24.918g 214516  12900 R 100.0  0.2   2865:23 postgres
 9697 postgres  20   0 24.918g 214284  12676 R 100.0  0.2   2866:05 postgres

All 16 of the backends are pegged at 100% CPU, and have been for almost 48 hours each (2872/60). That definitely corroborates Pietro’s claim that this workload has been running for 48 hours. This suggests little to no waiting for disk IO during the entire process.

So will more RAM help? Probably not. Pietro says he’s doing “time series expansion and some arithmetics”. This kind of procedure doesn’t generally require vast amounts of RAM, as calculations don’t often invoke sorting or other temporary memory structures. So what’s really going on?

The second reply gets at the real heart of the matter immediately:

You’d be better off to ask for help in optimizing your queries IMHO.

Why would Scott say that?

Well, there are a lot of ways to do the same thing. In the database world, some of these are highly preferable to others. How exactly is Pietro expanding those time series, for example? He could be doing something like this:

DO $$
  a INT;
  b INT;
  new_a INT;
  new_b INT;
  FOR a, b, ... IN SELECT col_a, col_b, ...
    new_a = some_calculation(a, b, c, ..., n);
    new_b = other_calculation(a, b, c, ..., n);
    INSERT INTO target_table VALUES (new_a, new_b, ...);
$$ LANGUAGE plpgsql;

If the necessary logic is complicated enough, that might be the right approach. However, it is extremely expensive to loop through each individual row from a driver query, allocate variables, perform calculations, and invoke a single insert. This kind of process would peg a CPU at 100% while still being highly inefficient by multiple orders of magnitude.

There’s also the possibility Pietro knows this, and the calculations were fairly simple. So maybe he decided the process could be inlined a bit more. Instead of that anonymous block (or something equivalent on the application side), he decides to write helper functions to do the calculations instead:

CREATE OR REPLACE FUNCTION some_calculation(a INT, b INT, ...)
  retval INT;
  -- Some calculations here
  RETURN retval;

Being a smart guy, Pietro knows immutable function results are cached much better than otherwise, and strict functions won’t even run if a parameter is null. With these two declarations, if any invocations take the same parameters, or a parameter is empty, Postgres may not even execute the function at all. Avoiding a jump operation is a huge optimization all by itself, so consider the benefits of preventing the overhead of a function call and its associated logic.

By defining the calculations in this manner, his process would look like this instead:

INSERT INTO target_table
SELECT some_calculation(col_a, col_b, ...),
       other_calculation(col_d, col_e, ...),
  FROM ...
 WHERE ...;

Or maybe Pietro knows the value of data staging. Caching is great, but in sufficiently complicated examples of business logic, transforming it into immutable functions may be somewhat difficult. In addition, some “calculations” may rely on the results of another query. This is where temporary and unlogged tables, or CTEs come in.

Consider the fact databases natively communicate using Set Theory. They are highly optimized, and specifically engineered to bulk-process rows in ways user-generated applications simply can’t match in all but the edgiest of cases. So the final iteration of the data onboarding might resemble this:

CREATE TEMP TABLE expand_series AS
CREATE TEMP TABLE relate_series AS
  FROM expand_series
  JOIN relate_series ON (...);
INSERT INTO target_table
  FROM basic_tally
  JOIN ...
 WHERE ...;

Or maybe Pietro likes CTEs more, so uses those instead of multiple temporary table steps. Postgres will fully materialize everything anyway. The primary benefit to temporary tables is that repeated execution of multiple steps won’t require re-running all previous elements. If this were some kind of one-time report, a CTE would arguably make more sense as it operates atomically.

Regardless, this kind of iterative transformation allows us to fully embrace Set Theory. Don’t operate on one row, but all of them. If we can organize calculations into distinct categories or steps, then we’re actually requesting the database to transform the data as a whole for each operation. For this entire data set, add some derived date column. For this entire data set, produce a standard deviation variant along this window. For this entire data set, derive these additional columns from these formulae.

That’s how databases think, and Postgres is no different in this regard. By looping through each individual result, we’re actively disrupting that process and literally making it hundreds or thousands of times slower. We could have all the memory in the world, and even find some way to fully utilize it, and it wouldn’t influence that universal truth.

Now, we don’t know what Pietro is actually doing, just that he’s convinced more RAM might help. The evidence he provided suggests that isn’t the case. Having sixteen processes running at full blast for two days either means he has a massive amount of data, or there’s some unaddressed inefficiency in his overall approach.

I suspect the latter only because it’s so common. People don’t usually think like databases, and really, who can blame them? What kind of nutcase would willingly do this for a living?

Figure 1: A Nutcase

PG Phriday: CONFLICT of Interests

MySQL has had a REPLACE INTO syntax to perform “UPSERT” logic since practically the very beginning. For the longest time, users who wanted to switch to Postgres, but for whatever reason relied on this functionality, were essentially trapped. Postgres 9.5 changed all that, but why did it take so long? As with much of Postgres history, it’s a long story.

To really understand where Postgres started, we need to look at the “old” way of handling a row merge. Many in the database world have probably encountered this once or twice:

    -- First, try to update the key.
    UPDATE my_tab SET b = v
     WHERE a = k;
    END IF;
    -- The key doesn't exist, so try to insert it.
      INSERT INTO my_tab (a, b) VALUES (k, v);
    EXCEPTION WHEN unique_violation THEN
      -- Nothing here, allow the loop to continue.
$$ LANGUAGE plpgsql;
SELECT merge_tab(1, 'James');
SELECT merge_tab(1, 'Jimmy');

What on Earth is all of that? Oddly enough, the somewhat convoluted logic is not only sound, it’s actually required to avoid a race condition. In the microseconds between attempting our UPDATE and following to the INSERT, some other transaction may have inserted the “missing” key. In that case, we’d encounter a unique constraint violation.

By catching the exception, we’re not immediately kicked out of the function and are presented with a choice. Do we assume our value is “right” and repeat the loop to apply the update, or just exit silently under the assumption that the successful transaction that beat us is probably fine? This particular function selected the previous assertion because that’s what a merge or upsert tries to guarantee: that the requested action is applied. Were we to omit the loop, the exception block would ensure there was no conflict or fatal error, but we could no longer rely on the function operating as advertised.

So why not invert the logic and remove the loop entirely? After all, we could just attempt the insert and if it fails, perform the update within the exception block, right? Actually no. Consider what happens if the target key is deleted by a concurrent transaction. Say we try our insert, and in the space of time between the key violation and our update, it gets deleted. Suddenly our update also produces an error. That’s probably an extremely unlikely edge case, but in OLTP databases, the unlikely becomes frighteningly common. So to be safe, we’re stuck with the loop.

Don’t think about it too much

That is a lot of overhead for what many consider basic functionality. Since that’s no longer a concern, let’s take a look at the actual syntax the Postgres team selected. To do that, let’s start with a very basic table with a handful of rows:

CREATE TABLE symbol_mapper (
  vendor_id    BIGINT   NOT NULL,
  ext_mapping  VARCHAR  NOT NULL,
  symbol       VARCHAR  NOT NULL,
  PRIMARY KEY (vendor_id, ext_mapping)
INSERT INTO symbol_mapper VALUES (1, 'Google', 'GOOGL');
INSERT INTO symbol_mapper VALUES (1, 'Apple', 'AAPL');
INSERT INTO symbol_mapper VALUES (2, 'goo', 'GOOGL');
INSERT INTO symbol_mapper VALUES (2, 'app', 'AAPL');
ANALYZE symbol_mapper;

The purpose of a mapping table is to fill the role of decoding external names or lookup values to match internal ones. Since each vendor may have its own designation structure, we require a mapping for each. That also protects us in case two vendors use the same identifiers.

So far we have a fairly standard application of tables. Now let’s do something interesting:

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (2, 'app', 'AAPL')

In this particular case, the “aap” mapping already exists for vendor 2, so no insert takes place. This is the equivalent of not wrapping our insert/update with a loop. We don’t care what the value is, just as long as something is there. In reality, this is more of a way to remove error output from violations than anything immediately useful.

The real fun doesn’t start until we integrate the DO UPDATE functionality. Consider the case where we want to add Samsung as a mapping for a vendor. In this particular case, someone sneaked an existing row into the system, and it contains a typo.

INSERT INTO symbol_mapper VALUES (1, 'Samsung', 'SSLNF');
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Samsung', 'SSNLF')
    ON CONFLICT (vendor_id, ext_mapping)
   SET symbol = 'SSNLF';
SELECT * FROM symbol_mapper;
 vendor_id | ext_mapping | symbol 
         1 | Google      | GOOGL
         1 | Apple       | AAPL
         2 | goo         | GOOGL
         2 | app         | AAPL
         1 | Samsung     | SSNLF

What we’ve done here is ensure the newest incoming mapping is the “correct” one; that’s our merge. An observant reader might ask how this is any different from our futile desire to attempt an INSERT with an UPDATE in an exception block. Unlike those two separate statements and the time-consuming exception handling, this is a single atomic action.

Did another session delete the row before us? We’ll just insert it again. Did another transaction delete the row we just inserted or updated? Oh well. The important part is that it is impossible to delete the row while our statement is running. So while the logic is similar to using an exception, the difference is that DO UPDATE is built into the database itself, so it can’t be broken into multiple actions that can be interrupted.

Another interesting bit of syntax is that we can actually incorporate a WHERE clause into the update beyond the implicit assumption that our update affects the same key we tried to insert.

INSERT INTO symbol_mapper VALUES (1, 'Sony', 'SONY');
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Sony', 'SNY')
    ON CONFLICT (vendor_id, ext_mapping)
   SET symbol = 'SNY'
 WHERE symbol_mapper.symbol = 'SONY';
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Sony', 'SNY')
    ON CONFLICT (vendor_id, ext_mapping)
   SET symbol = 'SNY'
 WHERE symbol_mapper.symbol = 'SONY';

In this case, we had a mapping for Sony that needed a correction. The first query affected the row we targeted, and the second did nothing. This is important because if we had not specified that predicate, both updates would have successfully modified the row. And so would all subsequent attempts. Remember Postgres keeps a row version for every update, even if the new and old values are identical. That’s just how MVCC works.

In a loosely built application environment, it isn’t uncommon for several vectors to operate simultaneously. If a dozen of these each upsert the same value, they’ll all be satisfied that their work is complete, and Postgres would be stuck with a dozen duplicate old rows. VACUUM (and autovacuum) will ensure old row versions are recycled, but again, that’s more overhead we don’t need to invoke.

And of course, the WHERE clause isn’t restricted to deflecting repeated update attempts. There may be circumstances where we simply don’t want to apply changes. By specifying the table name, we can introspect into any of the existing table values. What about the values we attempted to insert? Since these were part of an inherent violation, they’re assigned to a record named “excluded”.

Here it is in action:

INSERT INTO symbol_mapper VALUES (1, 'Microsoft', 'msft');
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Microsoft', 'MSFT')
    ON CONFLICT ON CONSTRAINT symbol_mapper_pkey
   SET symbol = 'MSFT'
 WHERE symbol_mapper.symbol != excluded.symbol;

This is a very similar situation as we had with Sony. The mapping for Microsoft needs an update if the existing value doesn’t match the one we’re attempting to insert. Well, we can perform that check explicitly without hard-coding those values into the query multiple times. It’s possible to refer to anything in the VALUES tuple by specifying “excluded”. Handy, eh?

Also notice that we’ve changed our conflict condition. Previously we had simply listed the columns in the primary key, and Postgres inferred the proper constraint from that definition. In this case, we directly stated the constraint that Postgres should use in resolving the conflict. It’s somewhat uncommon (and probably not entirely safe) to directly invoke constraint names, but the option is there in case we want it.

This feature was a long time coming; Postgres 9.5 was released in early 2016. As impossible as it sounds, we’ve only really had a little over a year to leverage ON CONFLICT. As a consequence, it’s still slowly seeping into existing Postgres application stacks. Users are still incorporating it into their workflows. It’ll be a while before it’s taken for granted with the rest of the kitchen sink Postgres offers.

Until then, it’s that special toy we’ve always wanted but couldn’t afford until now. There’s nothing else to do but make up for lost time!


P.S. Before I forget, Postgres Open is getting ready for 2017! If you or someone else wants to attend or even submit a talk, I highly recommend doing so. I’ve been something of a regular fixture there since it started in 2011, and I fully intend to try my hand again. It’s taking place in San Francisco this time around, so the change in venue will definitely be interesting. Apparently it’s all part of the new United States PostgreSQL Association, so it’s good seeing everything pull together behind a united entity. Here’s looking to the future!