PG Phriday: 10 Ways to Ruin Performance: Forgetting it EXISTS

Friday May 22, 2015

For the second of my ten part series on hidden Postgres (PostgreSQL) performance killers, I’m going to talk about something called an anti-join. It’s not a well-known approach outside of the database world, but due to how it works, it can impart almost magical plan revisions that drastically improve query performance in the right scenario. Developers can add it to a growing bag of tricks when working on database-driven content, since it comes in handy more often than you might expect.

Let’s build a test-case, shall we?

CREATE TABLE sys_product
(
    product_id   SERIAL  PRIMARY KEY,
    prod_name    TEXT    NOT NULL,
    quality      INT     NOT NULL,
    descr        TEXT    NOT NULL DEFAULT now()
);

INSERT INTO sys_product (prod_name, quality, descr)
SELECT 'Product ' || a.id::TEXT,
       log((a.id % 100) + 1)::int,
       'It does stuff.'
  FROM generate_series(1, 100000) a(id);

CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);

INSERT INTO sys_order (product_id, item_count, order_dt, valid_dt)
SELECT (a.id % 100000) + 1, (a.id % 100) + 1,
       now() - (id % 1000 || 'd')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 1000000) a(id);

ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id
      PRIMARY KEY (order_id);

CREATE INDEX idx_order_product_id
    ON sys_order (product_id);

CREATE INDEX idx_order_valid_dt
    ON sys_order (valid_dt);

ANALYZE sys_product;
ANALYZE sys_order;

This is a very basic product and order table structure, and we’ve used it before in the last installment of this series. The only columns we’ve added since last time is the quality column in sys_product and the valid_dt column in sys_order. This way, we can introduce some variability into the query plan due to data correlation effects. We’ve tried to distribute the data such that new orders have not been validated, and there are still some non-validated orders in the past. This distribution is more reliable than using random().

Now let’s build a query that retrieves all orders on products with a quality of five using a simple JOIN:

EXPLAIN ANALYZE
SELECT o.*
  FROM sys_order o
  JOIN sys_product p USING (product_id)
 WHERE o.valid_dt IS NULL
   AND p.quality = 2;

                             QUERY PLAN                             
--------------------------------------------------------------------
 Hash Join  (cost=2985.66..7406.44 rows=1303 width=28)
            (actual time=80.438..90.406 rows=1383 loops=1)
   Hash Cond: (o.product_id = p.product_id)
   ->  Bitmap Heap Scan on sys_order o
              (cost=39.41..4423.00 rows=1933 width=28)
              (actual time=0.600..6.879 rows=2004 loops=1)
         Recheck Cond: (valid_dt IS NULL)
         Heap Blocks: exact=2004
         ->  Bitmap Index Scan on idx_order_valid_dt
                    (cost=0.00..38.92 rows=1933 width=0)
                    (actual time=0.327..0.327 rows=2004 loops=1)
               Index Cond: (valid_dt IS NULL)
   ->  Hash  (cost=2084.00..2084.00 rows=68980 width=4)
             (actual time=79.773..79.773 rows=69000 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 2426
         ->  Seq Scan on sys_product p
                 (cost=0.00..2084.00 rows=68980 width=4)
                 (actual time=0.013..41.150 rows=69000 loops=1)
               Filter: (quality = 2)
               Rows Removed by Filter: 31000
 Planning time: 0.732 ms
 Execution time: 91.875 ms

We can actually infer quite a bit from this execution plan.

The sys_order table was checked with the index on valid_dt as desired.
The sys_product table was scanned in its entirety.
The planner estimates all match up almost exactly.
Final result was obtained by hashing the two row sets together.

From this, we could say that this is the ideal query plan. The index we wanted was used, and it reduced one million rows to about 2000. Since we don’t have an index on product quality, we might as well fetch them all and filter out the wrong quality before executing the hash. Once the hash between both is computed, we get about 1300 rows as a result.

Of course, this approach makes a huge mistake. Even though the order and product tables are within an order of magnitude of each other, we’re only interested in a relatively small fraction of the order table. Thus, fetching all products at this scale is probably a losing proposition. However, since the planner doesn’t have data regarding how closely these tables are correlated, it can’t make that assumption. Indeed, is it fetching 2000 unique product IDs, or 200? Since it can’t fetch the related products at this point, it does the best it can.

But can we do better? As it turns out, we can do a whole lot better by telling the planner what we really wanted. Note that we didn’t actually use the results from the product table in the SELECT portion of the query. Further, we know that we are only interested in a maximum of about 1300 products, so why check all of them? Is there a way to force the planner to fetch the matching orders first, and then check to see which products correspond as a second step?

We already know from last time that this is what a nested loop does, and one way to encourage the planner to pick a nested loop is by using EXISTS to break apart our JOIN condition. This accomplishes two things:

Don’t join against a table to fetch columns that will never be returned.
Favor some kind of looping construct to avoid excessive record comparisons.

This is also what is known as an anti-join, and Postgres performs very well when guided in that direction. Here’s what our query and plan look like using this concept:

EXPLAIN ANALYZE
SELECT o.*
  FROM sys_order o
 WHERE o.valid_dt IS NULL
   AND EXISTS (
         SELECT 1
           FROM sys_product p
          WHERE p.product_id = o.product_id
            AND p.quality = 2
          LIMIT 1
       );

                             QUERY PLAN                             
--------------------------------------------------------------------
 Bitmap Heap Scan on sys_order o 
             (cost=39.16..20490.82 rows=967 width=28)
             (actual time=0.565..11.462 rows=1383 loops=1)
   Recheck Cond: (valid_dt IS NULL)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 621
   Heap Blocks: exact=2004
   ->  Bitmap Index Scan on idx_order_valid_dt
              (cost=0.00..38.92 rows=1933 width=0)
              (actual time=0.269..0.269 rows=2004 loops=1)
         Index Cond: (valid_dt IS NULL)
   SubPlan 1
     ->  Limit  (cost=0.29..8.31 rows=1 width=0)
                (actual time=0.003..0.003 rows=1 loops=2004)
           ->  Index Scan using sys_product_pkey on sys_product p
                     (cost=0.29..8.31 rows=1 width=0)
                     (actual time=0.002..0.002 rows=1 loops=2004)
                 Index Cond: (product_id = o.product_id)
                 Filter: (quality = 2)
                 Rows Removed by Filter: 0
 Planning time: 0.126 ms
 Execution time: 12.154 ms

This time, something drastically different happened:

The executor took our matching rows from the valid_dt index and stashed them in the memory heap.
Using those same rows, it executed about 2000 index lookups into the product table.
The results are produced almost 8x faster.

Before anyone takes this result and starts rewriting all queries everywhere to utilize anti-joins, keep in mind that they’re extremely situational. The example here was built based on an actual application query at a company I used to work with. In their case, execution went from 2500ms to 280ms following the rewrite. But in building this test-case to simulate their data, I ended up creating a dozen data distributions that were slightly slower using this approach.

In addition, take a look at the LIMIT 1 section of the subquery. We’re explicitly telling the planner that only one product can match per seek, which is what the word EXISTS should imply. Yet if we remove it, the execution plan actually reverts to the same approach used by the more naive JOIN. Ouch!

But that’s OK because queries are, by their nature, more of a linguistic construct than a computational one. Having a larger vocabulary can help convey more meaning and sometimes produce different results, but our audience is perfectly within their rights to summarize or ignore irrelevant prattle. The goal in this case is adding to our lexicon, and trying to frame applicable scenarios in a different way in hopes that the planner will catch on.

For data distributions that apply, anti-joins are a very powerful technique. If nothing else, they’re worth a try when other approaches fall flat. The only way to know for sure is experimentation! They’re just a tool, and like any tool, should only be used at opportune instances.

So remember the anti-join—it’s a great tool for that rare hex-socket screw you occasionally stumble across.

See Also