PG Phriday: Postgres 9.6 Pluses

Timing can often be extremely fortuitous. Yesterday marked the official release of Postgres 9.6!

Yaaaaayy…

I’ve covered 9.6 previously, but that was a beta and clearly doesn’t count. Besides, while the beta was undoubtedly high quality, the frequency of patch turnover is enough to produce a significantly different final release. So let’s skim through the release notes a bit for stuff that really stands out or seems different from last time.

Provisional Parallelism

Oddly enough, this is all pretty much the same as previous tests suggested. The final release seems to reflect almost exactly the same parallel performance as the beta. A simple query that relied on a sequential scan kept improving until we hit a total of six background workers addressing the query. They did rename the max_parallel_degree setting to max_parallel_workers_per_gather for some reason, but there’s no accounting for taste.

What is surprising is that nested loop performance still seems awful. This full test easily demonstrates that a parallel nested loop is much slower than a standard one:

\timing on

CREATE TABLE para_test AS
SELECT a.id, repeat(' ', 20) AS junk
  FROM generate_series(1, 20000000) a(id);

CREATE INDEX idx_test_id ON para_test (id);

ANALYZE para_test;

SET max_parallel_workers_per_gather TO 1;

EXPLAIN ANALYZE
SELECT p1.id
  FROM para_test p1
  JOIN para_test p2 USING (id)
 WHERE id BETWEEN 1 AND 100000;

Admittedly our contrived query might confuse the planner sufficiently that performance worsens for only this specific case. But I was hoping they would “fix” nested loops before releasing the final iteration of 9.6. For now, I’m not sure whether or not to enable this, knowing what it seems to do to nested loops. It would be nice if there were a knob to have it only use parallelism on sequential scans until they fix this particular snag.
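In the meantime, the setting can at least be adjusted per session or per transaction, so a query that we know favors a nested loop can opt out. A minimal sketch, reusing the para_test table from above:

BEGIN;

SET LOCAL max_parallel_workers_per_gather TO 0;

-- Parallelism is disabled for this transaction only, so the planner
-- falls back to a non-parallel plan for the join.
SELECT p1.id
  FROM para_test p1
  JOIN para_test p2 USING (id)
 WHERE id BETWEEN 1 AND 100000;

COMMIT;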

Mega Maid

VACUUM no longer sucks!

Since tables require regular maintenance to mark dead tuples for reuse, VACUUM is the tool of choice. The release notes suggest that “frozen” pages are not revisited in subsequent vacuums. The key word here is “page”. Since Postgres’ page size is (almost always) 8KB, each page will generally contain several tuples. In any table with a large number of static pages, those pages will be skipped by all subsequent vacuums. The implication here is that vast warehouse-style tables just got a lot easier to maintain.
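For the curious, 9.6 also ships a pg_visibility extension that exposes the all-frozen bit for each page, which makes it easy to check how much of a table vacuum can now skip. A rough sketch against our para_test table:

CREATE EXTENSION pg_visibility;

VACUUM FREEZE para_test;

-- Count pages marked all-frozen versus the total page count.
SELECT count(*) FILTER (WHERE all_frozen) AS frozen_pages,
       count(*) AS total_pages
  FROM pg_visibility_map('para_test');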

This is a huge step forward. It pairs especially well with partitioned tables, as older partitions tend to just sit there and gather dust. Vacuuming them over and over again is a massive waste of resources. Good riddance!

Replication Redux

Synchronous replication got a notable improvement that brings it to parity with many other engines that provide high availability.

In Postgres synchronous replication, no transaction may return from commit until the first listed online replica acknowledges that it has written the data. If there’s another candidate listed, it might as well not exist until the first one becomes unavailable for some reason. And that’s it: one other server receives writes from the primary in what amounts to real time.

Well, in 9.6 things are a bit different. Now multiple synchronous standby servers can act in conjunction. If we have four synchronous standby servers, for example, we can require that writes be confirmed by at least two of them before proceeding. Or, as in the case above, we can list two standby servers and require that one of them confirm.
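The configuration still lives in synchronous_standby_names, which now accepts a count in front of the server list. As a sketch with four invented standby names, this would require confirmation from the two highest-priority connected standbys:

-- Run on the primary; replica1 through replica4 are hypothetical names.
ALTER SYSTEM SET synchronous_standby_names = '2 (replica1, replica2, replica3, replica4)';

SELECT pg_reload_conf();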

What I’d like to see next is for Postgres to actually utilize the other listed servers as active alternates. Imagine there’s some network hiccough that disrupts or delays communication sufficiently without cutting the connection entirely. Suddenly the master can no longer commit transactions. Ouch. If any replica may confirm—not just the first available in the list—suddenly whichever is fastest wins. As long as at some minimum of synchronous systems report back, the state of the cluster is secure.

Besides that, there’s also the new remote_apply setting. Replication usually works by the replica acknowledging that the incoming data has been written to its WAL on disk. For situations where that isn’t sufficient, the confirmation now won’t go through until the changes are actually replayed on the replica. In practice, this is almost the same thing. However, sufficiently high throughput could expose edge cases where it wasn’t. Not anymore.
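Turning it on is just a matter of raising synchronous_commit one notch, globally or only for the transactions that really need it:

-- Per session or transaction; also valid in postgresql.conf.
SET synchronous_commit TO remote_apply;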

Wonderful Writing

Some of these line-items are easy to miss, but fantastically important. In previous versions, when Postgres was ready to write data to tables, it simply flushed working memory to disk in whatever order it happened to be in. Sure, various throttles and spreading were implemented to prevent overwhelming the disk subsystem, but we have something better now: write ordering. Checkpoints are now sorted in page order before being written to disk, replacing primarily random writes with sequential ones.

Along this same line, writes can now be batched in smaller groups. Dirty buffers get flushed to the kernel page cache, which will make it to storage… eventually. Unfortunately, some write strategies are better than others, and many operating systems optimize for throughput instead of latency. The kernel is happy to accumulate dirty pages and then dump them all at once, saturating write bandwidth for a few seconds. That’s fine for most workloads; of course, it’s awful for databases.

Now Postgres has a stick to poke the kernel with, telling it to flush to disk more often. Previously, admins needed to tweak very crude kernel parameters that operated on percentages of memory, or overly restrictive byte counts. The first is terrible for systems with a lot of RAM, as flushing even 1% of 512GB would overwhelm practically any storage hardware. The second is better, but is often turned down too far, sacrificing performance by underutilizing memory that could otherwise hold dirty pages for faster access.
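That stick takes the form of a handful of new *_flush_after settings, which ask the kernel to start writing after a configurable amount of dirty data instead of waiting for its own thresholds. The values below are only a sketch loosely based on the documented defaults:

-- Checkpoints, the background writer, and ordinary backends each get a knob.
ALTER SYSTEM SET checkpoint_flush_after = '256kB';
ALTER SYSTEM SET bgwriter_flush_after = '512kB';
ALTER SYSTEM SET backend_flush_after = '256kB';  -- disabled (0) by default

SELECT pg_reload_conf();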

Table extensions are also allocated in wider swaths when multiple extension requests arrive at once. Instead of one block per request and the implied locking, Postgres now adds a batch of blocks that scales with the number of waiting backends. For quickly growing tables, this should do an admirable job of reducing write contention.

It’s too bad spinning rust is on the way out, especially in server environments. Otherwise, this should greatly improve write metrics for disk-based systems until they’re finally retired.

Ceremonious Cessation

Ever had a connection that was idle in transaction that prevented some DDL, which in turn blocked a bunch of pending updates? Well, no more! The very instant 9.6 goes live on our systems, I’m setting idle_in_transaction_session_timeout to 3600 and never looking back. Timeouts like this are long overdue, really. Webservers have had this right for decades: never trust the client. The client is a big poopie-head.
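One caveat worth noting: like most timeouts in Postgres, a bare number here is interpreted as milliseconds, so it’s safer to spell out the unit. A quick sketch for an hour-long leash:

ALTER SYSTEM SET idle_in_transaction_session_timeout = '1h';

SELECT pg_reload_conf();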

Better Benchmarks

The pgbench utility is great, as we’ve already explored, but we’ve still only really scratched the surface. That article took advantage of the fact that we can write custom scripts for testing, but didn’t mention that multiple scripts can be specified for the same test. Servers have mixed workloads, so why not run mixed scripts? Each could exercise a different access vector, set of tables, or read/write balance.

It’s not a perfect replacement for a full production harness that replays the previous day of activity or something equally grandiose, but it’s an excellent facsimile. Especially since it’s possible to individually set the probability of each script.
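As of 9.6, pgbench accepts multiple -f scripts in a single run, each with an optional @weight that controls how often it’s chosen. A sketch with invented script and database names:

# Roughly 80% read-heavy, 20% write-heavy, 16 clients for 5 minutes.
pgbench -c 16 -T 300 \
        -f read_heavy.sql@8 \
        -f write_heavy.sql@2 \
        example_db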

Wrangled Wrappers

Beyond our tests with the beta, it’s impressive how far the Postgres foreign data wrapper has advanced since its previous iteration in 9.5. Pushing down foreign joins, sorts, and UPDATE or DELETE statements gives the remote system enough leverage to perform its own optimizations. The more data federation Postgres supports, the closer we get to a native sharding extension that doesn’t carry an interminable list of caveats and incompatibilities with the core Postgres feature set.
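As a quick illustration, an UPDATE against a postgres_fdw foreign table can now appear in the plan as a single foreign operation, meaning the whole statement was shipped to the remote server rather than fetching rows locally and writing them back one at a time. The foreign table here is purely hypothetical:

-- remote_orders is an imaginary foreign table served by postgres_fdw.
EXPLAIN (VERBOSE)
UPDATE remote_orders
   SET status = 'shipped'
 WHERE order_date < '2016-01-01';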

Summary

Postgres 9.6 is looking good so far. It’s great to see so much consistency with our earlier beta tests back in May. There’s a story in there about the reliability of the Postgres development model: several commitfests can produce a beta that performs roughly the same as the final release after months of battle testing and squashed bugs.

This is also the last release of Postgres 9.x. We hate to see it go, but the 9.x tree has already lived longer than any previous iteration. Still, it’ll be a weird feeling to see the new numbering scheme in action.

I, for one, welcome our Postgres 10 overlords.