For the second entry in my ten-part series on hidden PGDB (PostgreSQL) performance killers, I’m going to talk about something called an anti-join. It’s not a well-known approach outside of the database world, but due to how it works, it can impart almost magical plan revisions that drastically improve query performance in the right scenario. Developers can add it to a growing bag of tricks when working on database-driven applications, since it comes in handy more often than you might expect.

Let’s build a test-case, shall we?

CREATE TABLE sys_product
(
    product_id   SERIAL  PRIMARY KEY,
    prod_name    TEXT    NOT NULL,
    quality      INT     NOT NULL,
    descr        TEXT    NOT NULL DEFAULT now()
);
 
INSERT INTO sys_product (prod_name, quality, descr)
SELECT 'Product ' || a.id::TEXT,
       log((a.id % 100) + 1)::INT,
       'It does stuff.'
  FROM generate_series(1, 100000) a(id);
 
CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);
 
INSERT INTO sys_order (product_id, item_count, order_dt, valid_dt)
SELECT (a.id % 100000) + 1, (a.id % 100) + 1,
       now() - (id % 1000 || 'd')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 1000000) a(id);
 
ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id
      PRIMARY KEY (order_id);
 
CREATE INDEX idx_order_product_id
    ON sys_order (product_id);
 
CREATE INDEX idx_order_valid_dt
    ON sys_order (valid_dt);
 
ANALYZE sys_product;
ANALYZE sys_order;

This is a very basic product and order table structure, and we used it in the last installment of this series. The only columns we’ve added since last time are the quality column in sys_product and the valid_dt column in sys_order. This way, we can introduce some variability into the query plan due to data correlation effects. We’ve tried to distribute the data such that new orders have not been validated, and there are still some non-validated orders in the past. This distribution is more reliable than using random().
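
If you want to see how that shook out, a quick sanity check shows the quality buckets the formula above produces; the exact counts follow directly from the generate_series expressions:

SELECT quality, count(*) AS products
  FROM sys_product
 GROUP BY quality
 ORDER BY quality;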

Now let’s build a query that retrieves all orders on products with a quality of two using a simple JOIN:

EXPLAIN ANALYZE
SELECT o.*
  FROM sys_order o
  JOIN sys_product p USING (product_id)
 WHERE o.valid_dt IS NULL
   AND p.quality = 2;
 
                             QUERY PLAN                             
--------------------------------------------------------------------
 Hash Join  (cost=2985.66..7406.44 rows=1303 width=28)
            (actual time=80.438..90.406 rows=1383 loops=1)
   Hash Cond: (o.product_id = p.product_id)
   ->  Bitmap Heap Scan on sys_order o
              (cost=39.41..4423.00 rows=1933 width=28)
              (actual time=0.600..6.879 rows=2004 loops=1)
         Recheck Cond: (valid_dt IS NULL)
         Heap Blocks: exact=2004
         ->  Bitmap Index Scan on idx_order_valid_dt
                    (cost=0.00..38.92 rows=1933 width=0)
                    (actual time=0.327..0.327 rows=2004 loops=1)
               Index Cond: (valid_dt IS NULL)
   ->  Hash  (cost=2084.00..2084.00 rows=68980 width=4)
             (actual time=79.773..79.773 rows=69000 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 2426kB
         ->  Seq Scan on sys_product p
                 (cost=0.00..2084.00 rows=68980 width=4)
                 (actual time=0.013..41.150 rows=69000 loops=1)
               Filter: (quality = 2)
               Rows Removed by Filter: 31000
 Planning time: 0.732 ms
 Execution time: 91.875 ms

We can actually infer quite a bit from this execution plan.

  • The sys_order table was checked with the index on valid_dt as desired.
  • The sys_product table was scanned in its entirety.
  • The planner estimates all match up almost exactly.
  • The final result was obtained by hashing the two row sets together.

On the surface, this looks like the ideal query plan. The index we wanted was used, and it reduced one million rows to about 2000. Since we don’t have an index on product quality, we might as well fetch them all and filter out the wrong quality before building the hash. Once the order rows are compared against that hash, we get just under 1400 rows as a result.

Of course, this approach makes a huge mistake. Even though the order and product tables are within an order of magnitude of each other in size, we’re only interested in a relatively small fraction of the order table. Thus, fetching and hashing all 69,000 products of that quality is probably a losing proposition. However, since the planner doesn’t have data regarding how closely these tables are correlated, it can’t make that assumption. Indeed, is it fetching 2000 unique product IDs, or 200? Since it can’t know which products are related until execution time, it does the best it can.
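
We can actually answer that question ourselves with a quick count, even though the planner can’t:

SELECT count(DISTINCT product_id) AS products_needed
  FROM sys_order
 WHERE valid_dt IS NULL;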

But can we do better? As it turns out, we can do a whole lot better by telling the planner what we really wanted. Note that we didn’t actually use the results from the product table in the SELECT portion of the query. Further, we know that we are only interested in at most about 2000 products, one per matching order, so why check all of them? Is there a way to force the planner to fetch the matching orders first, and then check to see which products correspond as a second step?

We already know from last time that this is what a nested loop does, and one way to encourage the planner to pick a nested loop is by using EXISTS to break apart our JOIN condition. This accomplishes two things:

  1. Don’t join against a table to fetch columns that will never be returned.
  2. Favor some kind of looping construct to avoid excessive record comparisons.

This is also what is known as an anti-join, and PGDB performs very well when guided in that direction. Here’s what our query and plan look like using this concept:

EXPLAIN ANALYZE
SELECT o.*
  FROM sys_order o
 WHERE o.valid_dt IS NULL
   AND EXISTS (
         SELECT 1
           FROM sys_product p
          WHERE p.product_id = o.product_id
            AND p.quality = 2
          LIMIT 1
       );
 
                             QUERY PLAN                             
--------------------------------------------------------------------
 Bitmap Heap Scan on sys_order o
             (cost=39.16..20490.82 rows=967 width=28)
             (actual time=0.565..11.462 rows=1383 loops=1)
   Recheck Cond: (valid_dt IS NULL)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 621
   Heap Blocks: exact=2004
   ->  Bitmap Index Scan on idx_order_valid_dt
              (cost=0.00..38.92 rows=1933 width=0)
              (actual time=0.269..0.269 rows=2004 loops=1)
         Index Cond: (valid_dt IS NULL)
   SubPlan 1
     ->  Limit  (cost=0.29..8.31 rows=1 width=0)
                (actual time=0.003..0.003 rows=1 loops=2004)
           ->  Index Scan using sys_product_pkey on sys_product p
                     (cost=0.29..8.31 rows=1 width=0)
                     (actual time=0.002..0.002 rows=1 loops=2004)
                 Index Cond: (product_id = o.product_id)
                 Filter: (quality = 2)
                 Rows Removed by Filter: 0
 Planning time: 0.126 ms
 Execution time: 12.154 ms

This time, something drastically different happened:

  • The executor used the valid_dt index to find the matching order rows and fetched them from the table heap.
  • Using those same rows, it executed about 2000 index lookups into the product table.
  • The results are produced almost 8x faster.

Before anyone takes this result and starts rewriting all queries everywhere to utilize anti-joins, keep in mind that they’re extremely situational. The example here was built based on an actual application query at a company I used to work with. In their case, execution went from 2500ms to 280ms following the rewrite. But in building this test-case to simulate their data, I ended up creating a dozen data distributions that were slightly slower using this approach.

In addition, take a look at the LIMIT 1 section of the subquery. We’re explicitly telling the planner that only one product can match per seek, which is what the word EXISTS should imply. Yet if we remove it, the execution plan actually reverts to the same approach used by the more naive JOIN. Ouch!
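
For reference, here’s the variant without the LIMIT 1 that reverted on this test data; whether it does the same for you will depend on PostgreSQL version and settings:

EXPLAIN ANALYZE
SELECT o.*
  FROM sys_order o
 WHERE o.valid_dt IS NULL
   AND EXISTS (
         SELECT 1
           FROM sys_product p
          WHERE p.product_id = o.product_id
            AND p.quality = 2
       );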

But that’s OK because queries are, by their nature, more of a linguistic construct than a computational one. Having a larger vocabulary can help convey more meaning and sometimes produce different results, but our audience is perfectly within their rights to summarize or ignore irrelevant prattle. The goal in this case is adding to our lexicon, and trying to frame applicable scenarios in a different way in hopes that the planner will catch on.

For data distributions that apply, anti-joins are a very powerful technique. If nothing else, they’re worth a try when other approaches fall flat. The only way to know for sure is experimentation! They’re just a tool, and like any tool, should only be used in the right circumstances.

So remember the anti-join—it’s a great tool for that rare hex-socket screw you occasionally stumble across.


17 thoughts on “PG Phriday: 10 Ways to Ruin Performance: Forgetting it EXISTS”

  • Hi, thanks for this great and clear post. There is a little typo in the first example: the quality is equal to 2 in the query, but you mentioned 5 in the description. Regards

  • How hard would it be to modify the query planner to automatically detect situations in which the second plan is better?

    The two criteria seem to be:

    1. No columns from the joined table used in the SELECT section, and
    2. WHERE (and JOIN) criteria only specify a small fraction of the total rows in the joined table.

    Detecting #1 should be easy. Not sure about #2.

    How much worse can the anti-join plan potentially be if the table contents are such that it’s the wrong choice? If the penalty is minor then maybe the planner should always prefer it when none of the joined table’s columns are used in the SELECT.

    1. Good question. The guys who work on the planner have improved it a lot, actually. It used to be much easier to cause the plan difference. In fact, a little more tweaking of system settings makes it even harder. At one point, I changed:

      SET random_page_cost = 1.5;
      

      To simulate having better disks, like an SSD backing. Suddenly, both variants had the exact same plan. I could probably still force the scenario, but I was starting to run out of ways to make the situation look natural; you can come up with contrived examples for almost anything.

      Then again, you can’t predict the real world, and data rarely matches expectations. 😉

      1. All true. My point was just that when two logically equivalent queries produce query plans that are wildly divergent in performance, that should probably be considered a planner “bug”. This is assuming it’s reasonable to expect the planner to detect that the two versions of the query are actually logically equivalent and/or transform one into the other.

        Side question: is there any place on the blog where you field questions? I’ve got a query against a partitioned table where I absolutely cannot figure out how to make postgres do the right thing. It’s driving me nuts.

          1. Do you read / respond on the mailing list? I’ll try posting there.

            Short version:

            I have large “events” table partitioned by date. It has a “batch_id” FK to a “batches” table. There’s on batch_id. There are ~20 child tables, one for each month.

            The query tries to find events whose “batch” meets certain criteria, one of which is a time range. In order to get the planner to prune irrelevant partition tables I have to include a date constraint on “events”. However, when I do that, the planner chooses to scan “events” using the date index of the batch_id index, when the batch_id index specifies a much smaller set of rows.

            It actually ends up being faster to not include a date constraint on “events” and let postgres do an index scan (using batch_id) on all the child tables including the irrelevant ones. But that seems non-ideal.

            What I’d like to find is some way to give it enough information to prune the irrelevant child tables but still use the index on batch_id instead of the one on date.

          2. I need to sign up again. 🙂

            Your specific example is more common than it might seem. As it turns out, your date constraint isn’t good enough, but the planner can’t tell. Sure, it removes irrelevant child tables, but the date index matches more rows, and that means more random seeks. By itself, that can be a much more important element. In an index scan, the first seek can immediately discount the entirety of the index, which implicitly disqualifies the table it’s on. It’s not as early as ignoring the table in the plan, but it can be much faster, provided the alternate approach does less overall work.

            I’ve seen this happen often, actually. Except what tends to occur is that PostgreSQL has enough statistics that it ignores the date index because it knows the amount of matches is actually too high to justify that index. It just so happens that, with your data, it isn’t doing this. The most likely culprit is that you have a high amount of correlation between your batch and the dates, which makes sense. If batches are in the same order as the dates, they’re highly correlated, and instead of multiple WHERE clauses being subtractive, they’re equivalent.

            So it could use the batch index or the date index, and the stats suggest there’s a slight advantage to using the date, probably because there’s a higher cardinality for that column. For your data set, that’s exactly the wrong choice, because that means it has to select all of a certain date, and then filter out anything that doesn’t fall into the batches you are checking. Unfortunately, to get the benefit of constraint exclusion, you need the date constraint, which causes it to use the wrong index, which… you get the idea.

            The easiest way to address this is to create a composite index on batch_id and event_date. The planner should see that as the most selective and use it in the presence of those two columns, and it would still work with your partitioning scheme. Since you already have an index on batch_id, create the index in the order of (event_date, batch_id) so it can use the index in the absence of batch_id if necessary.

          3. Too many typos in there. 🙁

            Should read “There’s an index on batch_id…”.

            And “…date index instead of the batch_id index…”.

          4. Great info. Thanks! I’m reluctant to create a compound index, though, purely because of the amount of data in these tables. They also get a high volume of inserts, but the inserts aren’t time-sensitive in the sense that they need to happen as quickly as possible.

            Also, I misspoke: the column is actually a timestamp without time zone, not a date. And the range the query is concerned with is on the order of one day, so around 1/30th of the rows in the table for the corresponding month.

            However, the number of batches that satisfy the query’s criteria is even smaller.

          5. Date or timestamp doesn’t really matter in the long run. Your one-day of data is more data than a list of batches will match, and that’s what’s screwing you. In fact, since the column is a timestamp, stats would suggest an extremely high cardinality so it would greatly prefer that index any time you use the column in a query. That just compounds the problem.

            So far as creating the index, you can do it concurrently. Do that, then drop the existing date index, and you should see an improvement. Obviously, I’d do it in a test environment first, but index creation isn’t nearly as disruptive as it was in the Bad Old Days(tm).
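
            Something along these lines, with made-up index names since I don’t know your schema; with inheritance-style partitioning you’d repeat it for each monthly child table:

            -- Hypothetical names; adjust for the real schema.
            CREATE INDEX CONCURRENTLY idx_events_201505_dt_batch
                ON events_201505 (event_date, batch_id);

            -- Once the new index is built and verified, the old
            -- single-column date index on that child can go.
            DROP INDEX CONCURRENTLY idx_events_201505_event_dt;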

          6. Had a breakthrough. Turns out default_statistics_target was set to “50” on this server. I set it to “200” and ran an ANALYZE on the big events table. Now the planner prefers the index on batch_id.

            Without time constraint on events (i.e. not pruning child tables): 1300 ms.

            With time constraint on events (and using the batch_id index): 900 ms.

            Prior to re-analyzing the table and with the inclusion of the time constraint on events it was taking around 2500 ms.

            This raises another question though: does “default_statistics_target” only affect explicit “ANALYZE” commands, or does it also affect how extensive the “organically collected” stats are?

            That is, could I set it to 200 as a database default, or would that kill performance due to “over-collection” of statistics?

          7. Awesome! Yeah, that would have been my second suggestion. I personally use 200 as my minimum for default_statistics_target. The default is still too low for my liking.

            And yes, it will affect all ANALYZE activity for the entire database unless you lower it (or raise it) for specific tables. Having better stats is more “expensive” in that it has to process more rows and spend more time building the stats, but it can pay off in major ways with execution times. The planner needs info to do its job.
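
            In concrete terms, with placeholder names, the knobs look like this:

            -- Raise the default for the whole database; sessions connecting
            -- afterward (and their ANALYZE runs) will use the new value.
            ALTER DATABASE my_db SET default_statistics_target = 200;

            -- Or target just the column that needs better stats.
            ALTER TABLE events ALTER COLUMN batch_id SET STATISTICS 400;
            ANALYZE events;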

    1. I did my tests with 9.4.1 with all default settings. But beyond that, settings are somewhat relevant. After lowering random_page_cost, the plans were the same in this particular case.

      The reason I didn’t mention that is because the technique itself is what the post was about. Anti-joins using EXISTS really is a good technique to have. If my test case is lacking, I’ll gladly welcome a better one. 🙂

      1. I explained the two queries on my production server. Fairly large machine running 9.3.5 with mostly default query-planning settings.

        I got the same query plans as what Shaun describes. EXPLAIN ANALYZE estimated the first one would take 40ms vs. 15ms for the second.

        HOWEVER. When I actually ran the queries they both took around the same time.

        1. FWIW, EXPLAIN ANALYZE is not an estimate. That’s how long the query actually took to run. The primary difference is that it discards the results instead of fetching them into the client. In addition, running the queries once isn’t a sufficient test due to caching effects. You need to run the query several times to see what the average execution time is before a meaningful comparison is possible.

          Without the ANALYZE, fetching the data into the client will consume most of the time involved at the scale we’re discussing, inflating the results of both so they’re roughly equivalent. A 20ms difference gets swallowed by the 200ms (or whatever) spent delivering the data to the client. That’s why I use ANALYZE; simply comparing run-times isn’t precise enough.

          1. Yeah, I ran both of them multiple times. Given the size of the result set (and the fact that I wasn’t on the physical machine hosting the database) it makes sense that most of the time was spent transferring the results to my client.
