PG Phriday: Bountiful Beta Benefits

The Postgres developers recently announced the availability of the first public beta for Postgres 9.6. I would be highly remiss to ignore such an opportunity to dig into any interesting functionality listed in the 9.6 release notes. All in all, it’s a pretty exciting series of advancements, and assuming this is a glimpse of what we’ll see when 9.6 drops, I’d say we’re on the right track.

Plentiful Parallelism

Probably the most high-profile addition for 9.6 is parallel operation on certain query plans. I already examined this in depth, but how did the feature stack up over several months of development? Let’s use the same test and see:

CREATE TABLE para_test AS
SELECT a.id, repeat(' ', 20) AS junk
  FROM generate_series(1, 20000000) a(id);

ANALYZE para_test;

SET max_parallel_degree TO 1; -- Repeated with each value from 0 to 5.

EXPLAIN ANALYZE
SELECT *
  FROM para_test
 WHERE junk LIKE '%s%';

Give or take a bit of variance, the results look pretty similar to the earlier tests, based on the best case of several runs:

 Workers | Avg Time (s)
---------+--------------
       0 |          3.7
       1 |          1.9
       2 |          1.4
       3 |          1.1
       4 |          0.9
       5 |          0.8

This is really good to see. We still get the most benefit from the initial activation; even one parallel worker drastically improves performance. After that, at least for sequential scans, returns diminish quite a bit. Perhaps more interesting, however, is the new support for other operations, like hash joins and nested loops. Check this out:

CREATE INDEX idx_test_id ON para_test (id); -- needed for the plan below

EXPLAIN ANALYZE
SELECT p1.id
  FROM para_test p1
  JOIN para_test p2 USING (id)
 WHERE id BETWEEN 1 AND 100000;

 Gather  (cost=5500.25..217096.45 rows=93502 width=4)
         (actual time=67.443..1651.985 rows=100000 loops=1)
   Workers Planned: 5
   Workers Launched: 5
   ->  Hash Join  (cost=4500.25..207681.27 rows=93502 width=4)
                  (actual time=77.348..1621.807 rows=16667 loops=6)
         Hash Cond: (p2.id = p1.id)
         ->  Parallel Seq Scan on para_test p2
              (cost=0.00..187059.00 rows=4000000 width=4)
              (actual time=0.014..498.137 rows=3333333 loops=6)
         ->  Hash  (cost=3331.48..3331.48 rows=93502 width=4)
                   (actual time=76.616..76.616 rows=100000 loops=6)
               Buckets: 131072  Batches: 1  Memory Usage: 4540kB
               ->  Index Only Scan using idx_test_id on para_test p1
                     (cost=0.44..3331.48 rows=93502 width=4)
                     (actual time=0.069..43.965 rows=100000 loops=6)
                     Index Cond: ((id >= 1) AND (id <= 100000))
                     Heap Fetches: 100000
 Planning time: 0.371 ms
 Execution time: 1662.068 ms

Now, it turns out this is actually a major performance hit for this contrived example. Without parallelism enabled, the planner chooses a plain nested loop that completes in about 350ms. There’s clearly still some work needed under the hood, but the fact that this works at all is an encouraging step forward.
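
To verify that comparison yourself, just switch parallelism off for the session and run the same query again. This is a minimal sketch using the max_parallel_degree setting from earlier:

SET max_parallel_degree TO 0; -- disable parallel workers entirely

EXPLAIN ANALYZE
SELECT p1.id
  FROM para_test p1
  JOIN para_test p2 USING (id)
 WHERE id BETWEEN 1 AND 100000;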

Spinlock Showdown

Several spinlocks have been removed and replaced with atomic operations to avoid the costs associated with tight lock loops. Results of these modifications look encouraging. Our own tests reflect those glowing datapoints, with a bit of a caveat.

We began the comparison with 9.4 by running pgbench on an 8-CPU VM with 32GB of RAM under various conditions, and we weren’t encouraged by the output.
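
For reference, the runs amounted to something like the following; the scale factor and duration here are illustrative guesses on our part rather than the exact invocation:

pgbench -i -s 100 postgres                    # initialize test data
pgbench -c 8 -j 8 -T 60 postgres              # standard queries
pgbench -c 8 -j 8 -T 60 -M prepared postgres  # prepared queries

Varying the client count produced the following numbers, in transactions per second: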

 Clients | 9.4 (TPS) | 9.6 (TPS)
---------+-----------+-----------
       1 |      6100 |      5500
       2 |     12350 |     11000
       4 |     23850 |     23500
       8 |     38450 |     34500
      16 |     47800 |     46000
      32 |     43800 |     42750

Pgbench performance is worse and inconsistent in our VM environment across the board. Postgres 9.4 regularly outperforms 9.6 on both standard and prepared queries, and there’s some very strange judder between tests when the number of clients matches the number of system CPUs. After that, everything seems to even out within variance, but something strange is going on. Is it because the tests are on a VM instead of bare hardware? Is it because the VM was on kernel 3.13? Something clearly smells fishy here.

Why can this kind of variance be a problem? Because these are the numbers on a piece of bare metal with 32 CPUs and 64GB of RAM:

 Clients | 9.4 (TPS) | 9.6 (TPS)
---------+-----------+-----------
       1 |     10800 |     10500
       2 |     21500 |     21000
       4 |     41000 |     40600
       8 |     69000 |     73000
      16 |     97000 |    118000
      32 |    100000 |    167000

Note that the scaling improvements are far more obvious in this scenario than on our VM. It seems 9.6 is currently more environment-sensitive than 9.4. Even if the VM software is at fault, the fact that it affects 9.6 disproportionately is distressing. Hopefully that will be resolved by the time 9.6 is officially released.

Foreign Federation

In Postgres versions older than 9.6, foreign tables are treated as individual entities, and the external server isn’t really taken into consideration. If two foreign tables exist on the same remote server, all rows from both are fetched locally and joined as a third step. This not only fetches far more rows than necessary, but also forces the local system to work harder, joining everything after building temporary structures. It’s incredibly inefficient.

An easy way to visualize this is to create a couple garbage tables:

-- On external server (trust auth for testing)

CREATE TABLE foo AS
SELECT a.id FROM generate_series(1, 10) a(id);

CREATE TABLE bar AS
SELECT a.id FROM generate_series(1, 10) a(id);

-- On another system

CREATE EXTENSION postgres_fdw;

CREATE SERVER ext_srv 
  FOREIGN DATA WRAPPER postgres_fdw 
  OPTIONS (dbname 'postgres', host 'localhost', port '5434');

CREATE USER MAPPING for postgres 
  SERVER ext_srv 
  OPTIONS (user 'postgres');

CREATE FOREIGN TABLE foo (
  id INT
) SERVER ext_srv OPTIONS (table_name 'foo');

CREATE FOREIGN TABLE bar (
  id INT
) SERVER ext_srv OPTIONS (table_name 'bar');

These tables are exceedingly simple. Yet here’s what happens if we try a basic JOIN with a WHERE clause in 9.4 or 9.5:

-- In 9.4

EXPLAIN
SELECT foo.*
  FROM foo
  JOIN bar ON (foo.id = bar.id)
 WHERE foo.id = 5;

                               QUERY PLAN                               
------------------------------------------------------------------------
 Nested Loop  (cost=200.00..296.58 rows=225 width=4)
   ->  Foreign Scan on foo  (cost=100.00..146.86 rows=15 width=4)
   ->  Materialize  (cost=100.00..146.94 rows=15 width=4)
         ->  Foreign Scan on bar  (cost=100.00..146.86 rows=15 width=4)

-- In 9.5

                           QUERY PLAN                            
-----------------------------------------------------------------
 Nested Loop  (cost=200.00..202.30 rows=1 width=4)
   ->  Foreign Scan on foo  (cost=100.00..101.15 rows=1 width=4)
   ->  Foreign Scan on bar  (cost=100.00..101.15 rows=1 width=4)

The 9.4 plan means Postgres will obtain the results of foo, and then loop through each row and obtain a result from bar. That’s potentially really awful. The 9.5 plan is better, but still not ideal. In that case, it’s pushing down the WHERE clause and pulling the matches from both tables, then joining them locally.

But 9.6 uses a much different approach:

                     QUERY PLAN                     
----------------------------------------------------
 Foreign Scan  (cost=100.00..102.28 rows=1 width=4)
   Relations: (public.foo) INNER JOIN (public.bar)

That’s right, it’s just letting the foreign system perform the entire JOIN, since both tables are local in that context. That’s a massive, game-changing improvement. Pushing work to the remote server where appropriate saves bandwidth, allocation, and work on both ends of the equation. For heavy users of foreign tables, the potential gains span multiple orders of magnitude.
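
If you’re curious what actually gets sent to the other side, postgres_fdw reveals the generated statement through EXPLAIN’s VERBOSE option. The Remote SQL line sketched in the comment below is illustrative of the general shape rather than exact output:

EXPLAIN VERBOSE
SELECT foo.*
  FROM foo
  JOIN bar ON (foo.id = bar.id)
 WHERE foo.id = 5;

-- The verbose plan includes a line resembling this, showing the entire
-- join shipped to the external server:
--
--   Remote SQL: SELECT r1.id FROM (public.foo r1 INNER JOIN public.bar r2
--               ON (r1.id = r2.id)) WHERE (r1.id = 5)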

Updating and deleting data from foreign tables is equally affected by this kind of pushdown logic. Look at the subtle difference between 9.5 and 9.6:

EXPLAIN
DELETE FROM foo
 WHERE id BETWEEN 5 AND 7;

-- On 9.5

                           QUERY PLAN                            
-----------------------------------------------------------------
 Delete on foo  (cost=100.00..101.15 rows=2 width=6)
   ->  Foreign Scan on foo  (cost=100.00..101.15 rows=1 width=6)

-- On 9.6

                            QUERY PLAN                             
-------------------------------------------------------------------
 Delete on foo  (cost=100.00..101.15 rows=2 width=6)
   ->  Foreign Delete on foo  (cost=100.00..101.15 rows=2 width=6)

So what is the difference between a “Foreign Scan” and a “Foreign Delete”? Believe it or not, the scan means Postgres 9.5 and older fetch the matching rows, and then issue a separate DELETE statement for each one. They use the same process for UPDATE statements. That’s… not great, especially as the number of matches increases.

Once again, 9.6 saves the day. It allows the foreign system to take care of the whole process, and send back the results. The foreign system acts as if the whole DELETE statement were issued locally without any of that back-and-forth nonsense.

Vacuuming the Warehouse

The VACUUM process is vastly improved. Because Postgres uses MVCC, it requires regular maintenance to ensure the transaction counter does not wrap around and cause data loss. This makes sense, but in the context of a warehouse, or any large system with several TB of data, it becomes more and more difficult to perform this maintenance simply due to the performance constraints of the storage systems.

In Postgres 9.5 and below, a VACUUM process must occasionally visit every single page of every single table. If a data warehouse is just sitting around accumulating data, why repeatedly vacuum all of the old data that hasn’t changed? Well, that all changes with 9.6. Postgres now tracks pages where every row is already frozen, and can skip them outright during these maintenance passes. This should drastically reduce storage IO during maintenance, which should also complete much faster as a bonus. For large, mostly idle warehouses, the difference can be an order of magnitude or more.
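
The new bookkeeping lives in the visibility map, and the pg_visibility extension, also new in 9.6, lets us peek at it. Here’s a quick sketch against the para_test table from earlier:

CREATE EXTENSION pg_visibility;

-- Aggressively freeze everything so the new all-frozen bits get set.
VACUUM FREEZE para_test;

-- Summarize how many pages a future anti-wraparound vacuum can skip.
SELECT * FROM pg_visibility_map_summary('para_test');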

Persnickety Planner

There’s a nearly endless list of improvements in the release notes. The last one I found noteworthy, beyond the exciting stuff everyone drools over, was a somewhat obscure planner tweak.

Do you have tables that use composite foreign keys? The query planner would normally multiply the statistical selectivity of each column individually, even though a composite foreign key usually indicates a strong correlation between the columns. This in turn causes row underestimates and bad query plans based on those artificially low values. If the planner expects 10 rows when there are really 8000, that’s the difference between a nested loop biting off more than it can chew, and a relatively faster merge or hash operation on the larger result set.
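
Here’s a contrived sketch of the problem; the tables and data are made up purely for illustration. With perfectly correlated key columns, a pre-9.6 planner multiplies two 1-in-10,000 selectivities together and expects roughly one row, while 9.6 recognizes the foreign key and estimates something close to the true 10,000:

CREATE TABLE fk_parent (
  a INT,
  b INT,
  PRIMARY KEY (a, b)
);

CREATE TABLE fk_child (
  a INT,
  b INT,
  FOREIGN KEY (a, b) REFERENCES fk_parent (a, b)
);

INSERT INTO fk_parent
SELECT i, i FROM generate_series(1, 10000) i;

INSERT INTO fk_child
SELECT i, i FROM generate_series(1, 10000) i;

ANALYZE fk_parent;
ANALYZE fk_child;

-- Pre-9.6 planners drastically underestimate the row count here;
-- 9.6 should expect roughly the full 10000.
EXPLAIN
SELECT *
  FROM fk_child c
  JOIN fk_parent p USING (a, b);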

This isn’t a general fix for other implicit or explicit correlations, but it’s a move in the right direction. There is a multivariate statistics patch under development, but it didn’t survive the latest code freeze. It looks like we won’t see this kind of improvement on a general basis in 9.6, but this foreign key tweak is an example of how things could operate if a version of the patch is eventually accepted.

In the end, 9.6 is looking like another great advancement over previous versions. I can’t wait to use it for realsies.