Functions are great. Having cut my teeth on a database that didn’t even provide the ability to define functions, I’ve come to almost take them for granted in PGDB (PostgreSQL). However, with this kind of ubiquity, sometimes they can be overused in ways that don’t seem to be part of the common programmer lexicon. In this week’s PG Phriday series on performance-killing missteps, I’m going to talk a bit about set theory, and how a certain amount of familiarity is necessary to properly interact with a database.
One of the axioms of set theory states that f(x) = y, and x !~ y. That is, a function applied to some value produces a value that may not be equivalent to the original. Put another way, a spayed pet does not have the same reproductive capacity as a regular pet, and is thus not equal to the original. In the context of a database, this is extremely relevant because equivalence is what makes calculations and restrictions possible.
Given this, consider what a database index does. When creating an index, the database takes the value of one or more columns and essentially builds a pointer tree to quickly locate a record in ln time. This is much faster than reading every record in a table and applying a filter to remove unwanted records. That’s something even novice developers tend to know. But they don’t know how functions modify this scenario.
Given the very simplified introduction to set theory, applying a function to the column value means the database must discard equality. Remember: x !~ y. The database has indexed x, not y. So when a function is used in a WHERE clause, any index on x will be ignored. This makes perfect sense, considering there are an infinite number of possible functions, and the output of each is indeterminate. It’s not possible to predict the end result of every possible function.
To further illustrate this, we’ll reuse one of our tried-and-true examples:
CREATE TABLE sys_order ( order_id SERIAL NOT NULL, product_id INT NOT NULL, item_count INT NOT NULL, order_dt TIMESTAMPTZ NOT NULL DEFAULT now(), valid_dt TIMESTAMPTZ NULL ); INSERT INTO sys_order (product_id, item_count, order_dt, valid_dt) SELECT (a.id % 100000) + 1, (a.id % 100) + 1, now() - (id % 1000 || 'd')::INTERVAL, CASE WHEN a.id % 499 = 0 THEN NULL ELSE now() - (id % 999 || 'd')::INTERVAL END FROM generate_series(1, 1000000) a(id); ALTER TABLE sys_order ADD CONSTRAINT pk_order_id PRIMARY KEY (order_id); CREATE INDEX idx_order_order_dt ON sys_order (order_dt DESC);
Pay special attention to the index on
order_dt. I’ve applied a modifier that builds the index in descending order, as it’s very common to retrieve the newest records of a table. This doesn’t actually affect the values being indexed, and is a very handy feature of PGDB. We’re also going to try and use this index, because we don’t want to search through one-million records any time we want to get a few recent data points.
Here’s a query that will count the number of orders yesterday:
EXPLAIN ANALYZE SELECT COUNT(1) FROM sys_order WHERE date_trunc('day', order_dt) = CURRENT_DATE - INTERVAL '1 day'; QUERY PLAN -------------------------------------------------------------------- Aggregate (cost=29865.50..29865.51 ROWS=1 width=0) (actual TIME=989.773..989.774 ROWS=1 loops=1) -> Seq Scan ON sys_order (cost=0.00..29853.00 ROWS=5000 width=0) (actual TIME=0.021..988.407 ROWS=1000 loops=1) FILTER: (date_trunc('day'::text, order_dt) = (('now'::cstring)::DATE - '1 day'::INTERVAL)) ROWS Removed BY FILTER: 999000 Planning TIME: 0.120 ms Execution TIME: 989.810 ms
From this output, we can see that PGDB ignored the index on
order_dt and did exactly what we didn’t want. Instead of using an index to jump to the relevant values, it scanned the entire table and filtered out the values that didn’t apply. Yet the implications are actually much worse than that. The
date_trunc function, even though it’s written in C, is not a free operation. So not only have we unnecessarily read a million rows, we applied a function to each and every row, just to see if the result is within the boundary we specified. And my example fits in memory; the situation degrades exponentially when that isn’t possible.
That’s insanely expensive, and our relatively small million-row table illustrates that well enough. Imagine the same operation on a table with 100M rows or more. So we lose twice: disk resources are wasted retrieving unwanted rows, and CPU time is consumed performing unnecessary function executions. Given how many times I’ve encountered this at several unrelated companies, it’s a pretty serious problem. Here’s how the query should look:
EXPLAIN ANALYZE SELECT COUNT(1) FROM sys_order WHERE order_dt >= CURRENT_DATE - INTERVAL '1 day' AND order_dt < CURRENT_DATE; QUERY PLAN -------------------------------------------------------------------- Aggregate (cost=2564.11..2564.12 ROWS=1 width=0) [ snip ] -> Bitmap INDEX Scan ON idx_order_order_dt (cost=0.00..21.38 ROWS=894 width=0) (actual TIME=0.381..0.381 ROWS=1000 loops=1) INDEX Cond: ((order_dt >= (('now'::cstring)::DATE - '1 day'::INTERVAL)) AND (order_dt < ('now'::cstring)::DATE)) Planning TIME: 0.211 ms Execution TIME: 5.855 ms
The overall execution plan is slightly more complicated since the index involved now, but note the execution time: it’s almost 200 times faster than the original. All we did was modify the query to use a range that includes all the possible date and time combinations for yesterday. We needed to do that for the same reason we’d tried
date_trunc previously, but the end result is the same. The only difference is that this allows the database to use a value range scan on the index and obtain all matching rows immediately.
If you’ve fallen into this trap, don’t feel bad. I’ve seen everyone do this at least once. From newbies right out of college, to highly seasoned technical leads, and even including other DBAs, there doesn’t seem to be any discernible pattern. It’s too easy to frame a query without considering the underlying mechanisms that make everything work. I also want to point out that since PGDB supports functional indexes, it’s also possible to do something like this:
CREATE INDEX idx_order_order_dt ON sys_order (date_trunc('day', order_dt));
In this case, we’re simply indexing the resulting value of f(x), so as long as the function call is the same in any
WHERE clauses, the index will be used. To PGDB, it’s all the same. If 99% of the development staff, the application itself, and stray dogs are all using a function instead of doing it the “right” way, it’s actually the DBA that is going against the tide.
The only reason I don’t tend to recommend this pattern, is that the functional index is generally too restrictive. What if we wanted to search for a four hour window during yesterday? Well, now we can’t, because the index is only relevant for the date information. The simple index case is applicable to far more potential scenarios, and in the database world, index versatility is extremely important. I try to ensure developers I work with are cognizant of the pitfalls of arbitrarily using functions to filter results.
After all, it’s better to learn these things preemptively!