When working with a database, sometimes performance problems are both far more subtle, and much worse than a query itself might suggest. The topic of this week’s PGDB (PostgreSQL) performance killers article concerns the use of the
IN clause, and how misusing it can catastrophically obliterate the database in mysterious ways.
To that end, we’ll use a slightly revised single-table test case since it’s served us pretty well so far:
DROP TABLE sys_order; CREATE TABLE sys_order ( order_id SERIAL NOT NULL, product_id INT NOT NULL, item_count INT NOT NULL, order_dt TIMESTAMPTZ NOT NULL DEFAULT now() ); INSERT INTO sys_order (product_id, item_count, order_dt) SELECT (a.id % 100) + 1, (a.id % 100) + 1, now() - (id % 1000 || 'd')::INTERVAL FROM generate_series(1, 1000000) a(id); ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id PRIMARY KEY (order_id); CREATE INDEX idx_order_product_id ON sys_order (product_id); CREATE INDEX idx_order_order_dt ON sys_order (order_dt); ANALYZE sys_order;
As usual, my test system is a simple dual-CPU VM with 16GB of RAM and some mysterious storage allocation from a SAN. All settings are default, and the version of PostgreSQL is the latest release of the 9.4 branch. I always recommend using the latest version of PGDB when possible, otherwise there’s a risk of missing important planner improvements.
As it turns out in this particular story, the
IN clause is actually pretty well known to most developers I’ve worked with. This isn’t some obtuse syntax that only experts have encountered, and it’s used regularly in applications and ORMs through the industry. It’s ubiquitous, and consequentially, extremely dangerous.
Why dangerous? Let’s examine a sanitized query I encountered in an actual running production system. Now, our test case is scaled down by a couple orders of magnitude, so the results won’t be as drastic as what I encountered. Still, the query below performs much worse than anything we’ve discussed so far:
EXPLAIN ANALYZE SELECT * FROM sys_order WHERE order_id IN ( SELECT DISTINCT order_id FROM sys_order WHERE product_id = 10 ) ORDER BY order_dt DESC LIMIT 30; QUERY PLAN -------------------------------------------------------------------- LIMIT (cost=27768.47..27768.55 ROWS=30 width=20) (actual TIME=1362.794..1362.840 ROWS=30 loops=1) -> Sort (cost=27768.47..27791.97 ROWS=9400 width=20) (actual TIME=1362.785..1362.801 ROWS=30 loops=1) [ Horrible ugly mess redacted ] Planning TIME: 0.699 ms Execution TIME: 1363.219 ms
What we’re looking at here, is the planner ripping itself to shreds trying to optimize a query with several problems:
- A subquery containing the same table with no aggregates.
- Use of
DISTINCTon a primary-key column.
- Ordering the results on the outside query.
- Using the
LIMITon the outer query.
Taking these in order, it should be obvious that the subquery is pointless in this example. The inside query is essentially the same as the outside query, minus the ordering and result limit. There are a number of reasons this might happen. The IN clause is primarily used as a type of glue. Often, a developer or ORM will take a working query and embed it as a subquery unchanged. The justification is fairly simple: I’m interested in these how these records are related, and I already have this working query.
In most cases,
IN can be simplified into some kind of
JOIN, since that’s how databases tend to combine related data. By using
IN and a subquery, the planner has to perform numerous unnecessary optimization steps in an attempt to reach the best plan. As the amount of complexity increases, so does the number of potential execution paths. How many elements from the inner query, for instance, can be collapsed into the outer one? What is the resource cost for each variant of doing so?
Then there’s the
DISTINCT clause within the inner query. That
sys_order table is not joined with anything, and there are no row multiplying functions. It’s not possible for more than one of the same primary key in the results. Yet there it is, making the planner do more work.
The last two are closely related. Since the outer query doesn’t add any new
WHERE clauses, applying an order and limiting the results at that point, is simply inefficient. The database must first execute the inner query to find the relevant
order_id values, and afterwards, throw away all but the top 30 results. The planner could have used the index on
order_dt and read it backwards. Or it could have used the index on
product_id and then ordered the results afterward, depending on which was more efficient based on statistics. Instead, it has to produce, and then subsequently discard, all data that matched the subquery.
Here’s what the query should have been:
EXPLAIN ANALYZE SELECT * FROM sys_order WHERE product_id = 10 ORDER BY order_dt DESC LIMIT 30; QUERY PLAN -------------------------------------------------------------------- LIMIT (cost=0.42..172.64 ROWS=30 width=20) (actual TIME=5.049..5.101 ROWS=30 loops=1) -> INDEX Scan Backward USING idx_order_order_dt ON sys_order (cost=0.42..53960.42 ROWS=9400 width=20) (actual TIME=5.046..5.073 ROWS=30 loops=1) FILTER: (product_id = 10) ROWS Removed BY FILTER: 9000 Planning TIME: 0.099 ms Execution TIME: 5.137 ms
Oh, look! There’s that backward index scan I mentioned. The row estimates are a bit off, and we’ll probably want to increase statistics and analyze to produce better values, but this is a speed improvement of over 250x. In a production system, even a few milliseconds can be a huge problem with enough throughput. Multiply that by 250, and the issue is upgraded to a catastrophe.
So how did this happen?
In this particular instance, it was the fault of Java Hibernate. An object for the inner query was passed to another object to flesh out the order detail, and the result almost crashed a production system. The fix was to make smarter use of Hibernate capabilities so it didn’t generate such a terrible query. Indeed, code refactors are probably something we should all consider doing more often in order to reduce accumulated technical debt.
Unfortunately, human-generated queries aren’t free from fault, either. It’s far too tempting to smash two queries together, than to rewrite them as a merged version using proper
JOIN syntax. I’ve done it myself when in a hurry to check something. The difference is that I know better than to commit such a hack to a permanent location in our code tree. And the reason I don’t do that, is because I know the potential havoc such a query can wreak.
And now, so do you. By itself,
IN is a wonderful tool. But its allure can bewitch and ultimately betray when it becomes a crutch. PGDB has a lot of powerful syntax potential, and it would be a shame to limit ourselves to the basics out of familiarity.