PG Phriday: 10 Ways to Ruin Performance: IN-Sanity

When working with a database, sometimes performance problems are both far more subtle, and much worse than a query itself might suggest. The topic of this week’s Postgres (PostgreSQL) performance killers article concerns the use of the IN clause, and how misusing it can catastrophically obliterate the database in mysterious ways.

To that end, we’ll use a slightly revised single-table test case since it’s served us pretty well so far:

DROP TABLE sys_order;
CREATE TABLE sys_order
    order_id     SERIAL       NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now()

INSERT INTO sys_order (product_id, item_count, order_dt)
SELECT ( % 100) + 1, ( % 100) + 1,
       now() - (id % 1000 || 'd')::INTERVAL
  FROM generate_series(1, 1000000) a(id);

ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id
      PRIMARY KEY (order_id);

CREATE INDEX idx_order_product_id
    ON sys_order (product_id);

CREATE INDEX idx_order_order_dt
    ON sys_order (order_dt);

ANALYZE sys_order;

As usual, my test system is a simple dual-CPU VM with 16GB of RAM and some mysterious storage allocation from a SAN. All settings are default, and the version of PostgreSQL is the latest release of the 9.4 branch. I always recommend using the latest version of Postgres when possible, otherwise there’s a risk of missing important planner improvements.

As it turns out in this particular story, the IN clause is actually pretty well known to most developers I’ve worked with. This isn’t some obtuse syntax that only experts have encountered, and it’s used regularly in applications and ORMs through the industry. It’s ubiquitous, and consequentially, extremely dangerous.

Why dangerous? Let’s examine a sanitized query I encountered in an actual running production system. Now, our test case is scaled down by a couple orders of magnitude, so the results won’t be as drastic as what I encountered. Still, the query below performs much worse than anything we’ve discussed so far:

SELECT * FROM sys_order
 WHERE order_id IN (
        SELECT DISTINCT order_id
          FROM sys_order
         WHERE product_id = 10
 ORDER BY order_dt DESC
 LIMIT 30;

                             QUERY PLAN                             
 Limit  (cost=27768.47..27768.55 rows=30 width=20)
        (actual time=1362.794..1362.840 rows=30 loops=1)
   ->  Sort  (cost=27768.47..27791.97 rows=9400 width=20)
             (actual time=1362.785..1362.801 rows=30 loops=1)

 [ Horrible ugly mess redacted ]

 Planning time: 0.699 ms
 Execution time: 1363.219 ms

What we’re looking at here, is the planner ripping itself to shreds trying to optimize a query with several problems:

  • A subquery containing the same table with no aggregates.
  • Use of DISTINCT on a primary-key column.
  • Ordering the results on the outside query.
  • Using the LIMIT on the outer query.

Taking these in order, it should be obvious that the subquery is pointless in this example. The inside query is essentially the same as the outside query, minus the ordering and result limit. There are a number of reasons this might happen. The IN clause is primarily used as a type of glue. Often, a developer or ORM will take a working query and embed it as a subquery unchanged. The justification is fairly simple: I’m interested in these how these records are related, and I already have this working query.

In most cases, IN can be simplified into some kind of JOIN, since that’s how databases tend to combine related data. By using IN and a subquery, the planner has to perform numerous unnecessary optimization steps in an attempt to reach the best plan. As the amount of complexity increases, so does the number of potential execution paths. How many elements from the inner query, for instance, can be collapsed into the outer one? What is the resource cost for each variant of doing so?

Then there’s the DISTINCT clause within the inner query. That sys_order table is not joined with anything, and there are no row multiplying functions. It’s not possible for more than one of the same primary key in the results. Yet there it is, making the planner do more work.

The last two are closely related. Since the outer query doesn’t add any new WHERE clauses, applying an order and limiting the results at that point, is simply inefficient. The database must first execute the inner query to find the relevant order_id values, and afterwards, throw away all but the top 30 results. The planner could have used the index on order_dt and read it backwards. Or it could have used the index on product_id and then ordered the results afterward, depending on which was more efficient based on statistics. Instead, it has to produce, and then subsequently discard, all data that matched the subquery.

Here’s what the query should have been:

  FROM sys_order
 WHERE product_id = 10
 ORDER BY order_dt DESC
 LIMIT 30;

                             QUERY PLAN                             
 Limit  (cost=0.42..172.64 rows=30 width=20)
        (actual time=5.049..5.101 rows=30 loops=1)
   ->  Index Scan Backward using idx_order_order_dt on sys_order
           (cost=0.42..53960.42 rows=9400 width=20)
           (actual time=5.046..5.073 rows=30 loops=1)
         Filter: (product_id = 10)
         Rows Removed by Filter: 9000
 Planning time: 0.099 ms
 Execution time: 5.137 ms

Oh, look! There’s that backward index scan I mentioned. The row estimates are a bit off, and we’ll probably want to increase statistics and analyze to produce better values, but this is a speed improvement of over 250x. In a production system, even a few milliseconds can be a huge problem with enough throughput. Multiply that by 250, and the issue is upgraded to a catastrophe.

So how did this happen?

In this particular instance, it was the fault of Java Hibernate. An object for the inner query was passed to another object to flesh out the order detail, and the result almost crashed a production system. The fix was to make smarter use of Hibernate capabilities so it didn’t generate such a terrible query. Indeed, code refactors are probably something we should all consider doing more often in order to reduce accumulated technical debt.

Unfortunately, human-generated queries aren’t free from fault, either. It’s far too tempting to smash two queries together, than to rewrite them as a merged version using proper JOIN syntax. I’ve done it myself when in a hurry to check something. The difference is that I know better than to commit such a hack to a permanent location in our code tree. And the reason I don’t do that, is because I know the potential havoc such a query can wreak.

And now, so do you. By itself, IN is a wonderful tool. But its allure can bewitch and ultimately betray when it becomes a crutch. Postgres has a lot of powerful syntax potential, and it would be a shame to limit ourselves to the basics out of familiarity.