Trumping the PostgreSQL Query Planner

With the release of PostgreSQL 8.4, the community gained the ability to use CTE syntax. As such, this is a fairly old feature, yet it’s still misunderstood in a lot of ways. At the same time, the query planner has been advancing incrementally since that time. Most recently, PostgreSQL has gained the ability to perform index-only scans, making it possible to fetch results straight from the index, without confirming rows with the table data.

Unfortunately, this still isn’t enough. There are still quite a few areas where the PostgreSQL query planner is extremely naive, despite the advances we’ve seen recently. For instance, PostgreSQL still can’t do a basic loose index scan natively. It has to be tricked by using CTE syntax.

To demonstrate this further, imagine this relatively common scenario: an order processing system where clients can order products. What happens when we want to find the most recent order for all current customers? Boiled down to its minimum elements, this extremely simplified table will act as our order system.

CREATE TABLE test_order
(
  client_id   INT        NOT NULL,
  order_date  TIMESTAMP  NOT NULL,
  filler      TEXT       NOT NULL
);

Now we need data to test with. We can simulate a relatively old order processing system by taking the current date and subtracting 1,000 days. We can also bootstrap with 10,000 clients, and make the assumption that newer clients will be more active. This allows us to represent clients that have left our services as time goes on. So we start with this test data:

INSERT INTO test_order
SELECT s1.id,
       (CURRENT_DATE - INTERVAL '1000 days')::DATE 
           + generate_series(1, s1.id%1000),
       repeat(' ', 20)
  FROM generate_series(1, 10000) s1 (id);

The generate_series function is very handy for building fake data. We’re still not ready to use that data, however. Since we want to find the most recent order for all customers, we need an index that will combine the client_id and order_date columns in such a way that a single lookup will provide the value we want for any particular client. This index should do nicely:

CREATE INDEX idx_test_order_client_id_order_date
    ON test_order (client_id, order_date DESC);

Finally, we analyze to make sure the PostgreSQL engine has the most recent stats for our table. Just to make everything easily repeatable, we also set the default_statistics_target to a higher value than default as well.

SET default_statistics_target TO 500;
ANALYZE test_order;

Now we’ll start with the most obvious query. Here, we just use the client_id column and look for the max order_date for each:

EXPLAIN ANALYZE
SELECT client_id, max(order_date)
  FROM test_order
 GROUP BY client_id;

The query plan is fairly straight-forward, and will probably include a sequence scan. On the virtual server we’re testing with, the total runtime for us ended up looking like this:

Total runtime: 1117.408 ms

There is some variance, but the end result is just over one second per execution. We ran this query several times to ensure it was properly cached by PostgreSQL. Why didn’t the planner use the index we created? Let’s assume the planner doesn’t know what max does, and treats it like any other function. With that in mind, we can exploit a different type of syntax that should make the index much more usable. So let’s try DISTINCT ON with an explicit ORDER clause that matches the definition of our index:

EXPLAIN ANALYZE SELECT DISTINCT ON (client_id) client_id, order_date FROM test_order ORDER BY client_id, order_date DESC;

Well, this time our test system used an index-only scan, and produced the results somewhat faster. Our new runtime looks like this:

Total runtime: 923.300 ms

That’s almost 20% faster than the sequence scan. Depending on how much bigger the table is than the index, reading the index and producing these results can vary significantly. And while the query time improved, it’s still pretty bad. For systems with tens or hundreds of millions of orders, the performance of this query will continue to degrade along with the row count. We’re also not really using the index effectively.

Reading the index from top to bottom and pulling out the desired results is faster than reading the whole table. But why should we do that? Due to the way we built this index, the root node for each client should always represent the value we’re looking for. So why doesn’t the planner simply perform a shallow index scan along the root nodes? It doesn’t matter what the reason is, because we can force it to do so. This is going to be ugly, but this query will act just as we described:

EXPLAIN ANALYZE
WITH RECURSIVE skip AS
(
  (SELECT client_id, order_date
    FROM test_order
   ORDER BY client_id, order_date DESC
   LIMIT 1)
  UNION ALL
  (SELECT (SELECT min(client_id)
             FROM test_order
            WHERE client_id > skip.client_id
          ) AS client_id,
          (SELECT max(order_date)
             FROM test_order
            WHERE client_id = (
                    SELECT min(client_id)
                      FROM test_order
                     WHERE client_id > skip.client_id
                  )
          ) AS order_date
    FROM skip
   WHERE skip.client_id IS NOT NULL)
)
SELECT *
  FROM skip;

The query plan for this is extremely convoluted, and we’re not even going to try to explain what it’s doing. But the final query execution time is hard to discount:

Total runtime: 181.501 ms

So what happened here? How can the abusive and ugly CTE above outwit the PostgreSQL query planner? We use the same principle as described in the PostgreSQL wiki for loose index scans. We start with the desired maximum order date for a single client_id, then recursively begin adding clients one by one until the index is exhausted. Due to limitations preventing us from using the recursive element in a sub-query, we have to use the SELECT clause to get the next client ID and the associated order date for that client.

This technique works universally for performing sparse index scans, and actually improves as cardinality (the number of unique values) decreases. As unlikely as that sounds, since we are only using the root nodes within the index tree, performance increases when there are less root nodes to check. This is the exact opposite to how indexes are normally used, so we can see why PostgreSQL doesn’t natively integrate this technique. Yet we would like to see it added eventually so query authors can use the first query example we wrote, instead of the excessively unintuitive version that actually produced good performance.

In any case, all PostgreSQL DBAs owe it to themselves and their clusters to learn CTEs. They provide a powerful override for the query planner, and helps solve the edge cases it doesn’t yet handle.