PG Phriday: Parallel Sequence Scans

A couple days ago, Robert Haas announced that he checked in the first iteration of parallel sequence scans in the Postgres 9.6 branch. And no, that’s not a typo. One of the great things about the Postgres devs is that they have a very regimented system of feature freezes to help ensure timely releases. Thus even though 9.5 just released its second beta, they’re already working on 9.6.

So what is a sequence scan, and why does this matter? Past articles have covered this, but to recap: Postgres uses a sequence scan when it needs to read the entire contents of a table. For larger tables consisting of tens or hundreds of millions of rows, that is a time-consuming process. Even the best processor can only filter a few million rows per second in a single backend, so as a single system accumulates ever more data, there’s currently no way to address these larger volumes efficiently without significant effort.

As slow as a sequence scan is, it’s also a relatively simple process, and a great starting point for the long road to a parallel Postgres engine. Postgres knows how big the table files are and obviously knows its own storage format, so it merely needs to set scan ranges and farm them out to workers that report the results back to a coordinator process. In theory, that’s a clean transition that avoids complicated aggregation rules or clause pushdown headaches.
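
As a quick aside, the size information the planner works from is easy to inspect. Here’s a small sketch using the test_tab table defined in the next section; the block count is presumably the raw material those scan ranges would be carved from:

-- relpages and reltuples are planner statistics refreshed by ANALYZE;
-- actual_blocks is the table's current physical size in blocks.
SELECT relpages, reltuples,
       pg_relation_size('test_tab') /
           current_setting('block_size')::int AS actual_blocks
  FROM pg_class
 WHERE relname = 'test_tab';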

But how well does it work? To see for myself, I did something I’d never actually done before: download the Postgres git tree. Then I set up a test case:

CREATE TABLE test_tab AS
SELECT a.id, repeat(' ', 20) AS junk
  FROM generate_series(1, 20000000) a(id);

ANALYZE test_tab;

This test table contains 20 million rows of empty junk and no indexes, so we’re effectively forcing Postgres to use a sequence scan for any query. Next we enable the feature with the max_parallel_degree parameter, and finally we invoke a query with a naive WHERE clause that must be checked against every row, so the engine has to actually do some work.

SET max_parallel_degree TO 1;

EXPLAIN ANALYZE
SELECT *
  FROM test_tab
 WHERE junk LIKE '%s%';

                             QUERY PLAN
--------------------------------------------------------------------
 Gather  (cost=1000.00..265706.30 rows=1 width=25)
         (actual time=1832.456..1832.456 rows=0 loops=1)
   Number of Workers: 1
   ->  Parallel Seq Scan on test_tab
         (cost=0.00..264706.20 rows=1 width=25)
         (actual time=1828.830..5489.878 rows=0 loops=1)
         Filter: (junk ~~ '%s%'::text)
         Rows Removed by Filter: 29594528
 Planning time: 0.062 ms
 Execution time: 1834.332 ms

There’s a lot going on here. First, we need to talk about how many processes actually worked on this query. The max_parallel_degree parameter controls how many background workers assist the main process, so with a setting of 1, there are actually two Postgres processes splitting the sequence scan between them. Some parallel systems use the parent as a mere coordinator, so we might have expected the worker count to require a value greater than 1 before any actual parallel operation occurs. That isn’t the case with this implementation.

The query itself simply asks for something that doesn’t exist in our sample set. This gives us a best-case scenario in which no result rows are passed between the processes, yet time is still consumed scanning the table data. The resulting plan is also rather different from a standard scan: we can see how many extra workers were involved, and that the results were “gathered” by an extra execution layer.
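
For comparison with a standard scan, here’s a quick sanity check. This is just a sketch; exact costs and timings will obviously differ:

SET max_parallel_degree TO 0;

-- With no workers requested, the Gather node should disappear and the
-- plan should fall back to an ordinary Seq Scan in the lone backend.
EXPLAIN
SELECT *
  FROM test_tab
 WHERE junk LIKE '%s%';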

The only odd detail is that the filter reportedly removed 29 million rows from a 20-million-row table. We’ll chalk that up to an implementation quirk, considering this is pre-alpha code. Otherwise, this patch appears to scale in a relatively linear manner. Let’s check out a few different variants of max_parallel_degree.

 Workers | Avg Time (s)
---------+--------------
       0 |          3.8
       1 |          1.9
       2 |          1.3
       3 |          1.1
       4 |          0.9

There’s a bit of jitter in the timings on our test system, but the trend is fairly clear. With no extra workers, one process can scan a 20M row table in about four seconds. With three extra workers, those four processes can perform the same task in about one second.
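
For the curious, these averages came from nothing fancier than repeated runs of the same query at each setting, roughly like this:

\timing

-- Repeat with max_parallel_degree set to 0 through 4, running the
-- query a few times at each setting and averaging the reported times.
SET max_parallel_degree TO 4;

SELECT *
  FROM test_tab
 WHERE junk LIKE '%s%';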

This iteration of the patch takes the size of the table into account, possibly to compensate for the implicit overhead of managing workers. With 20M rows, we couldn’t get more than five dedicated workers, while a 100M-row table used seven workers, which together with the parent process saturated all eight of our CPU cores (a sketch of that larger test follows the list below). Beyond that, there are a couple of important caveats:

  1. Aggregates are not currently handled.
  2. Certain clauses are not pushed down into the workers.
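
Regarding that table-size scaling, here’s the kind of larger test alluded to above. Consider it a sketch: the test_tab_big name is purely illustrative, building a 100M-row table takes a while, and the worker count you see will depend on this iteration’s sizing heuristics and your hardware.

CREATE TABLE test_tab_big AS
SELECT a.id, repeat(' ', 20) AS junk
  FROM generate_series(1, 100000000) a(id);

ANALYZE test_tab_big;

SET max_parallel_degree TO 8;

-- On our hardware, a table this size reported seven workers, in line
-- with the sizing behavior described above.
EXPLAIN
SELECT *
  FROM test_tab_big
 WHERE junk LIKE '%s%';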

The first caveat, the lack of aggregate handling, can actually be circumvented rather easily. For example:

\timing

SET max_parallel_degree TO 3;

SELECT count(*) FROM (
  SELECT *
    FROM test_tab
   WHERE junk LIKE '%s%'
) s;

 count 
-------
     0
(1 row)

Time: 1111.904 ms

The second is a natural result of the patch’s immaturity; we need working parallel functionality before it can be optimized, and I’m perfectly content waiting for it to be done right. In the meantime, the groundwork is already in place. After 9.4 added background workers, new Postgres extensions began leveraging them, and based on how stable this patch appears so far, 9.6 will probably put them to their originally intended use.

It’s exciting to see that Postgres will finally be able to scale vertically in a way that can handle the large tables some organizations have been accumulating. We have a 30TB database with tens of billions of rows. Even though we’ll be sharding that in the future, looming parallel features imply the shards themselves will scale as we cross over into trillions of rows.

It’s an exciting time to be a Postgres end-user.