There are a lot of database engines out there. As such, a developer or DBA will naturally have varying levels of experience with each, and some of this might conflict with how PGDB (PostgreSQL) operates. These kinds of embedded misunderstandings can cause potential issues by themselves, but in this particular case, corrective action is fairly simple.
So this week, I’d like to talk about indexes. Many people treat them as a “make query faster” button, and this often results in egregious misuse. Indexes take space, require CPU resources to maintain, and bring overhead that adversely affects
UPDATE performance. Most know these drawbacks and generally acknowledge that too many indexes can be detrimental. However, with some databases, even column order can make a drastic difference. PGDB is among these.
To illustrate the point, let’s make a simple test case of one-million records in an inventory tracking table of one thousand products over the course of about three years.
CREATE TABLE sys_inventory_snapshot ( product_id SERIAL NOT NULL, record_dt TIMESTAMPTZ NOT NULL, item_count INT NOT NULL, order_count INT NOT NULL ); INSERT INTO sys_inventory_snapshot ( product_id, item_count, order_count, record_dt ) SELECT b.id, a.id, a.id, now() - (a.id || 'd')::INTERVAL FROM generate_series(1, 1000) a(id), generate_series(1, 1000) b(id); ALTER TABLE sys_inventory_snapshot ADD CONSTRAINT pk_inventory_snapshot PRIMARY KEY (product_id, record_dt);
At first glance, this looks pretty good. If we want information based on date or product, it’s right there. Things are even better if we have both values available! The query plan is also encouraging:
EXPLAIN ANALYZE SELECT * FROM sys_inventory_snapshot WHERE record_dt >= CURRENT_DATE - INTERVAL '1 day'; QUERY PLAN -------------------------------------------------------------------- Bitmap Heap Scan ON sys_inventory_snapshot (cost=22912.60..24880.07 ROWS=674 width=20) (actual TIME=106.051..106.530 ROWS=1000 loops=1) Recheck Cond: (record_dt >= (('now'::cstring)::DATE - '1 day'::INTERVAL)) Heap Blocks: exact=7 -> Bitmap INDEX Scan ON pk_inventory_snapshot (cost=0.00..22912.43 ROWS=674 width=0) (actual TIME=106.027..106.027 ROWS=1000 loops=1) INDEX Cond: (record_dt >= (('now'::cstring)::DATE - '1 day'::INTERVAL)) Planning TIME: 0.115 ms Execution TIME: 106.988 ms
Look at that! It’s using our date constraint to grab the most recent rows using the primary key for the table. That’s good, right?
Well, no. What’s actually happened here is somewhat esoteric, and the only real clue we have that something isn’t right, is the execution time. We just spent 100ms fetching 1000 records, which is downright offensive to an experienced DBA. Imagine this query in an application that’s executing it hundreds of times per second from multiple threads. In the database world, 100ms is basically forever. This table and query do not warrant such inexcusably bad performance.
So what happened? Buckets. Indexes take one value and point it to a bucket of other values. In the case of a composite index, one bucket value points to another bucket, which contains the values which satisfy both constraints. Now consider that
record_dt is the second bucket. In order to ensure we get the most recent data for all products, we have to visit all of the top-level buckets before we can continue to the most recent date for each.
See the problem yet? Instead of simply jumping straight to the first applicable date value and retrieving the entire bucket, we have to visit 1000 buckets and then continue to 1000 smaller buckets. This is a fine access pattern if we had both the
record_dt in the query, since that jumps straight to the values we want. But for everything else, it’s a non-ideal solution. It might be faster than a sequence scan if we’re lucky, but that’s not a guarantee. In some scenarios, it’s actually much slower.
PGDB veterans tend to solve this one of two ways:
- Reverse the column order of the index. If it turns out that 90% of queries use both columns, and 10% only use the second column, it’s better to flip them. The primary key in this case still fills its role, but without impacting performance of date searches.
- Add another index to represent the second column. This happens when the first column is actually used in exclusion or combination most of the time, but queries using only the second column are frequent enough to cause concern.
Because I’m lazy and don’t want to redefine the primary key, let’s take the second approach:
CREATE INDEX idx_inventory_snapshot_record_dt ON sys_inventory_snapshot (record_dt); EXPLAIN ANALYZE SELECT * FROM sys_inventory_snapshot WHERE record_dt >= CURRENT_DATE - INTERVAL '1 day'; QUERY PLAN -------------------------------------------------------------------- INDEX Scan USING idx_inventory_snapshot_record_dt ON sys_inventory_snapshot (cost=0.43..28.24 ROWS=674 width=20) (actual TIME=0.038..0.663 ROWS=1000 loops=1) INDEX Cond: (record_dt >= (('now'::cstring)::DATE - '1 day'::INTERVAL)) Planning TIME: 0.238 ms Execution TIME: 1.087 ms
That’s… quite a difference. If you were unfamiliar with databases or PGDB, you might be surprised by this result. But the reality is quite clear: simply listing a column in an index does not magically improve performance of queries that reference it. Order matters. In addition, the further a column strays from the principal position, the worse the degradation becomes.
I’ve seen indexes with six columns, and then a baffled developer asks why his query is slow. Well, his query only used the fourth column in the list. At that point, reading the entire 30-million row table and filtering for desired results would have been faster. The planner tries to account for this, but sometimes statistics or settings lead it astray, and it performs an index scan and dutifully follows all those buckets down the chain to the fourth column. It happens.
Now that you know why, you’re armed to avoid contributing to the problem. Think about indexes before adding them. Consider existing indexes and queries against their parent table, and ensure that column order faithfully represents the real access pattern. It’s an important—and often missed—optimization technique.
As a side note, this is why serious companies have at least one DBA on staff. Designing table structures and optimizing queries are hard enough, but invisible killers like these are more prevalent than one might realize. An experienced database-driven application developer might be familiar with these kind of minutiae, but that’s a lot to expect. Not everyone reads database performance articles, or have encountered and addressed related problems.
Until next week!