Recently a coworker asked me this question:
Should I expect variance between minutes and hours for the same query?
And I was forced to give him this answer:
Potentially, but not commonly. Query planning is an inexact science, and regardless of the query being the “same query,” the data is not the “same data.” This isn’t generally the case, but on occasion, changes in data can affect the query execution path. Usually this is a good thing, as the database accounts for new value distributions.
For example, if a column holds a million distinct values, but 90% of the rows share a single one of them, only certain values should trigger an index scan instead of a sequential scan. Those values will change over time, and if the stats don’t account for that, queries will have undependable performance. Of course, this introduces potential correlation assumptions that aren’t correct in some cases, and that also causes unreliable query performance. I guess the question is: which would you rather have?
That answer, despite being “right”, didn’t sit well with me. While we did eventually sit down and optimize the query in question so it was less likely to mislead Postgres, there’s no reason to assume an end-user is going to readily accept unreliable query performance. Nor should they.
But we need to perform some analysis to determine how things got to this point. Let’s start with distribution. We have a table that contains about two weeks’ worth of data, represented by 66M rows over 100GB of space (135GB including indexes). We’re not scheduled to upgrade to 9.6 until early 2017, so the instance is running Postgres 9.5. It isn’t bleeding edge, but this is hardly an ancient installation.
Consider the query plan for a standard retrieval of the last day’s worth of data:
EXPLAIN
SELECT *
  FROM NEW
 WHERE create_time >= '2016-12-08';

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Bitmap Heap Scan on new  (cost=137498.78..8770565.51 rows=5819512 width=1032)
   Recheck Cond: (create_time >= '2016-12-08 00:00:00-06'::timestamp with time zone)
   ->  Bitmap Index Scan on idx_new_create_time  (cost=0.00..136043.90 rows=5819512 width=0)
         Index Cond: (create_time >= '2016-12-08 00:00:00-06'::timestamp with time zone)
There’s nothing surprising here. There’s an index on the create_time column, and that index is pulling the rows we requested. But what happens if we add a LIMIT clause?
EXPLAIN
SELECT *
  FROM NEW
 WHERE create_time >= '2016-12-08'
 LIMIT 10;

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Limit  (cost=0.00..24.03 rows=10 width=1032)
   ->  Seq Scan on new  (cost=0.00..13985651.85 rows=5819512 width=1032)
         Filter: (create_time >= '2016-12-08 00:00:00-06'::timestamp with time zone)
… WHAT!? How is reading the full contents of a 100GB table ever faster than fetching over 5M rows using an index? How is that a correct decision in any sane universe, and how is Postgres reaching it?
It turns out that Postgres is making two fundamentally flawed assumptions here:
- The distinct values in the create_time column are evenly distributed.
- Only a small sample of the table will be required to obtain 10 matches. Yes, this means a sequential scan, but one that can be aborted fairly quickly.
Ultimately, the first invalid assumption compounds the second. As is true in many cases with columns that contain dates, those values in our table exist along a steadily increasing vector. With two weeks of historic data, Postgres would have to read almost the entire table to reach the portion where the most recent rows reside. As a result, Postgres isn’t reading 10 or even 1000 rows, it’s reading 60 million.
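The estimate behind that decision can be reproduced from the numbers in the plan. This is only a rough sketch, assuming the planner scales the sequential scan’s total cost by the fraction of matching rows it expects to need, which is how I understand the Limit node to be costed. The 66M figure is the approximate table size mentioned earlier, not an exact count.

```sql
-- Figures copied from the EXPLAIN output above.
-- Assumption: matches are spread evenly through the table, so fetching
-- 10 of the 5819512 estimated matches costs that fraction of a full scan.
SELECT round(13985651.85 * 10 / 5819512, 2) AS estimated_limit_cost,
       round(66000000.0 * 10 / 5819512)     AS rows_expected_to_be_read;
```

The first column comes out to 24.03, matching the cost shown on the Limit node. The second suggests the planner expects to stop after reading only about 113 rows, which is a long way from the 60 million it actually has to wade through.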
What’s worse is that this behavior is consistent. We have another table that’s 500GB in size with nearly 500M rows, and the query plan is the same. Naively adding a LIMIT clause to a query on that table could be outright disastrous. Not only would it mean effectively reading the entire table, but it would also flush many other objects out of cache. So now we’ve saturated disk IO right when other queries have lost their own table caches. At that point, every query on the system will perform horribly, even after the rogue sequential scan is complete. Memory caches need to be rebuilt, after all.
Which leads to the second stumbling block that led to the original question regarding unreliable query performance. A portion of that query looked like this:
SELECT DISTINCT nw3.col1, mw.col2, SUM(mw.quantity) AS quantity
  FROM NEW mw
  LEFT JOIN NEW nw1 ON nw1.col1=mw.col3 AND nw1.col2=mw.col2
  LEFT JOIN NEW nw2 ON nw2.col1=nw1.col3 AND nw2.col2=nw1.col2
  LEFT JOIN NEW nw3 ON nw3.col1=nw2.col3 AND nw3.col2=nw2.col2
 WHERE mw.create_time > CURRENT_DATE
   AND mw.actor = 'some'
   AND nw3.actor = 'unique'
   AND nw1.actor = 'value'
 GROUP BY nw3.col1, mw.col2;
To be completely fair, this query contains a number of problems:
- Predicates in a WHERE clause are applied after the LEFT JOIN, so many unintended rows will be removed because it’s not accounting for NULL.
- Predicates are not transitive. The CURRENT_DATE clause should be applied to all of the joins so the planner has all necessary information.
- DISTINCT is not necessary due to the GROUP BY.
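The first of those problems is easy to reproduce in isolation. Here is a minimal sketch using two throwaway CTEs; the names parent and child are hypothetical and have nothing to do with our real table.

```sql
-- Hypothetical data: two parent rows, only one of which has a child.
WITH parent (id) AS (VALUES (1), (2)),
     child (parent_id, actor) AS (VALUES (1, 'value'))
-- Filtering in WHERE discards the NULL-extended row for parent 2.
SELECT 'filter in WHERE' AS variant, count(*) AS row_count
  FROM parent p
  LEFT JOIN child c ON c.parent_id = p.id
 WHERE c.actor = 'value'
UNION ALL
-- Filtering in ON keeps every parent row, matched or not.
SELECT 'filter in ON', count(*)
  FROM parent p
  LEFT JOIN child c ON c.parent_id = p.id AND c.actor = 'value';
```

The WHERE variant counts one row and the ON variant counts two. With the predicate in the WHERE clause, the LEFT JOIN quietly behaves like an inner join.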
If we take that into account and rewrite the query, we get this:
SELECT nw3.col1, mw.col2, SUM(mw.quantity) AS quantity
  FROM NEW mw
  LEFT JOIN NEW nw1 ON (
         nw1.col1=mw.col3 AND
         nw1.col2=mw.col2 AND
         nw1.actor='unique' AND
         nw1.create_time > CURRENT_DATE
       )
  LEFT JOIN NEW nw2 ON (
         nw2.col1=nw1.col3 AND
         nw2.col2=nw1.col2 AND
         nw2.create_time > CURRENT_DATE
       )
  LEFT JOIN NEW nw3 ON (
         nw3.col1=nw2.col3 AND
         nw3.col2=nw2.col2 AND
         nw3.actor='value' AND
         nw3.create_time > CURRENT_DATE
       )
 WHERE mw.create_time > CURRENT_DATE
   AND mw.actor = 'some'
 GROUP BY nw3.col1, mw.col2;
And as expected, this version of the query performed much better, executing about 20 times faster than the original incarnation. Unfortunately, that isn’t the end of the story. See all of those join conditions on each table in addition to the WHERE clause? Postgres multiplies the probabilities of column values together to obtain a rough row estimate it uses to calculate the cost of each potential query plan. Since these are all fractions, we have a steadily decreasing estimate with each additional clause.
This usually works fine until we’re dealing with closely correlated data. If col1 and col2 have a one-to-one relationship, multiplying their probabilities is exactly the wrong thing to do. As is the case with most underestimated row counts, Postgres will generally opt for a nested loop. Why not? Iterating over a few dozen values is cheap and has little setup cost compared to allocating in-memory merge or hash segments.
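A bit of back-of-the-envelope arithmetic shows how quickly this goes wrong. The 1% selectivities below are made-up numbers purely for illustration; only the 66M row count comes from the table described earlier.

```sql
-- Hypothetical selectivities: each predicate is assumed to match 1% of rows.
-- 66000000 is the approximate row count of the table discussed earlier.
SELECT 66000000 * 0.01 * 0.01 AS estimate_if_independent,
       66000000 * 0.01        AS rows_if_perfectly_correlated;
```

Treating the two predicates as independent yields an estimate of 6,600 rows. If the columns are perfectly correlated, the second predicate filters nothing further, and the real answer is 660,000. That is a hundredfold underestimate from only two clauses.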
In our case, the row estimates were off by two orders of magnitude. This is fine in isolation! As mentioned previously, the new query plan was much faster than the old one. But that was only the first portion of a much larger CTE-driven query. Each fragment contained flaws similar to the first, which further reduced row estimates to nearly zero in the aggregate.
That means a lot of nested loops. Fine for a handful of rows, but not an end-of-day total of five million. The (admittedly large and complicated) 31k query required nearly three hours to complete. How did we fix it?
SET enable_nestloop TO FALSE;
That one modification before running the query reduced its execution time to 26 seconds. Since the query was essentially a report, I unilaterally decided that no imaginable nested loop could possibly improve the performance of that query and would instead be actively detrimental. According to the original query author, the previous run-time was usually a few minutes before it ballooned to several hours last week.
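If changing the setting for the whole session feels too heavy-handed, it can be scoped to a single transaction instead. This is a sketch rather than what we actually ran, and the SELECT 1 is only a placeholder for the report query.

```sql
BEGIN;

-- SET LOCAL only lasts until the end of the current transaction,
-- so other queries on this connection keep the default planner settings.
SET LOCAL enable_nestloop TO FALSE;

-- Placeholder: the report query would run here.
SELECT 1;

COMMIT;

-- Back to the session default after COMMIT.
SHOW enable_nestloop;
```

Keep in mind that disabling nested loops doesn’t forbid them outright. It makes them look extremely expensive, so the planner will still fall back to one if no other join method is possible.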
That kind of variance would understandably confuse and probably enrage most people. How are the worst case and best case scenarios for the same query so drastically different? The underlying issue is that Postgres trends toward the lowest cost estimate of the plans it examines without taking the worst case into account.
This is what happened when it chose to use a sequential scan when we supplied a LIMIT clause. Yes, the best scenario is that only a few hundred or thousand rows are required to obtain the full 10 matches after applying the WHERE clause. The worst case is never considered, despite how much performance suffers as a consequence. Our particular example could be addressed if Postgres collected generalized vector statistics to map data distribution shapes. This would cover steadily increasing, decreasing, or clustered data values. Yet that’s hardly a simple change.
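Postgres does already track one narrow piece of this. The pg_stats view exposes a per-column correlation value describing how closely the physical row order follows the column’s sort order. A quick way to look at it, using the table and column names from the earlier examples:

```sql
-- 'new' and 'create_time' are the table and column from the examples above.
SELECT attname, n_distinct, correlation
  FROM pg_stats
 WHERE tablename = 'new'
   AND attname = 'create_time';
```

A correlation near 1 would be consistent with the steadily increasing layout described earlier. As far as I can tell, though, the planner only consults it when costing index scans, not when guessing how soon a sequential scan under a LIMIT can stop. So it doesn’t save us here.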
The story is similar for selecting a nested loop over a hash or merge operation. A nested loop is great until it isn’t. Best case? A few seconds. Worst? Several hours. How do I explain to users who experience this on a regular basis, beyond striving to transform every single one of them into experts at coddling the Postgres query planner? It’s an untenable situation.
It’s the only solution I have and partially why PG Phriday is a regular occurrence. But I’ve always wished it wasn’t necessary. I don’t need to be a mechanic to drive my car. A query planner is much more complicated than a motor vehicle, yet the perception remains. How do we really address the true complexity of the planner without making excuses for its current shortcomings?
Despite my love of Postgres, I can’t really answer that. When everything is operating optimally, Postgres is the best database I’ve ever encountered. But when it trips, it faceplants into frozen January molasses en route to the moon. I usually tell users to cancel those queries, or do it myself, because that molasses is never reaching the moon.
Until the worst case of a plan is integrated into the planning process, we can expect the occasional misstep. At least we can rely on temporary tables and disabling elements of the query planner for truly belligerent queries.
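For the temporary table approach, the general idea is to stage one fragment of a large query and give the planner real numbers to work with. The sketch below is not taken from our actual report. The name tmp_recent is hypothetical, and the columns are the ones used in the queries above.

```sql
-- Stage one fragment of the larger query in a temporary table.
-- Hypothetical name: tmp_recent. Columns are those used in the queries above.
CREATE TEMP TABLE tmp_recent AS
SELECT col1, col2, col3, actor, quantity
  FROM new
 WHERE create_time > CURRENT_DATE;

-- Temp tables are not handled by autovacuum, so gather stats by hand.
ANALYZE tmp_recent;

-- Later fragments join against the staged rows instead of the full table.
SELECT col2, SUM(quantity) AS quantity
  FROM tmp_recent
 WHERE actor = 'some'
 GROUP BY col2;
```

Because the ANALYZE runs against rows that have already been filtered, later steps start from measured row counts rather than a chain of multiplied guesses.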