We’re finally at the end of the 10-part PGDB (PostgreSQL) performance series I use to initiate new developers into the database world. To that end, we’re going to discuss something that affects everyone at one point or another: index criteria. Or to put it another way:
Why isn’t the database using an index?
It’s a fairly innocuous question, but one that may have a surprising answer: the index was created using erroneous assumptions. Let’s explore what happens in a hospital environment with a pared-down table of patients.
DROP TABLE IF EXISTS sys_patient; CREATE TABLE sys_patient ( patient_id SERIAL NOT NULL, full_name VARCHAR NOT NULL, birth_dt DATE NOT NULL, sex CHAR NOT NULL ); INSERT INTO sys_patient (full_name, birth_dt, sex) SELECT 'Crazy Person ' || a.id, CURRENT_DATE - (a.id % 100 || 'y')::INTERVAL + (a.id % 365 || 'd')::INTERVAL, CASE WHEN a.id % 2 = 0 THEN 'M' ELSE 'F' END FROM generate_series(1, 1000000) a(id); ALTER TABLE sys_patient ADD CONSTRAINT pk_patient_id PRIMARY KEY (patient_id); CREATE INDEX idx_patient_birth_dt ON sys_patient (birth_dt); CREATE INDEX idx_patient_sex ON sys_patient (sex); ANALYZE sys_patient;
This particular hospital has a few queries that operate based on the sex of the patient, so someone created an index on that column. One day, another developer is doing some code refactoring and, being well-trained by the resident DBA, runs the query through
EXPLAIN to check the query plan. Upon seeing the result, the dev curses a bit, tries a few variants, and ultimately takes the issue to the DBA.
This is what the developer saw:
EXPLAIN ANALYZE SELECT * FROM sys_patient WHERE sex = 'F'; QUERY PLAN -------------------------------------------------------------------- Seq Scan ON sys_patient (cost=0.00..19853.00 ROWS=498233 width=29) (actual TIME=0.018..541.738 ROWS=500000 loops=1) FILTER: (sex = 'F'::bpchar) ROWS Removed BY FILTER: 500000 Planning TIME: 0.292 ms Execution TIME: 823.901 ms
No matter what the dev did, the database adamantly refused to use the
idx_patient_sex index. The answer is generally obvious to a DBA or a relatively seasoned developer, but this actually happens far more frequently than one might think. This is an extreme example, yet even experienced database users, report writers, and analysts make this mistake.
Before using an index, the database essentially asks a series of questions:
- How many matches do I expect from this index?
- What proportion of the table do these matches represent?
- Are the cumulative random seeks faster than filtering the table?
If the answer to any of those questions is too large or negative, the database will not use the index. In our example, the
sex column only has two values, and thus the answer to the above questions are more obvious than usual. With one million rows, a query only on the
sex column would match half of them. In addition, randomly seeking 500,000 results is likely an order of magnitude slower than simply filtering the whole table for matches.
But it’s not always so easy to figure out what kind of cardinality to expect from a table column. Short of checking every column of every table with
count(DISTINCT my_col) or something equally ridiculous, someone unfamiliar with the data in a complex table architecture would get stuck. However, in order to answer the above questions, the database itself must track certain statistics about table contents.
It just so happens that PGDB makes that data available to everyone through the
pg_stats view. Let’s check what PostgreSQL has stored regarding the
SELECT attname AS column_name, n_distinct FROM pg_stats WHERE tablename = 'sys_patient'; column_name | n_distinct -------------+------------ patient_id | -1 full_name | -1 birth_dt | 7310 sex | 2
Interpreting these results is actually very easy. Any column with a negative
n_distinct value is a ratio approaching 1. At -1, there’s a one-to-one relationship with the number of rows in the table, and the number of distinct values in that column. As a general rule, nearly any column with a negative value here is a good index candidate because a
WHERE clause will reduce the potential results significantly.
Positive values are an absolute count of unique values for that column. During table analysis, the database checks a random sampling of rows and tabulates statistics based on them. That means the value in
n_distinct is representative instead of exact, but usually doesn’t deviate by a significant margin. The data here doesn’t need to be viable for reporting, just to calculate an efficient query plan.
From here, we can see that the
sex column would likely be a terrible index candidate, even if we know nothing else about the table. There are simply not enough distinct values to reduce the amount of matches for a query.
Given all of this, the dev has nothing to fear; the
idx_patient_sex index should have never existed in the first place. A query that needs to fetch all of any particular sex will simply require a sequential scan, and that’s fine.
Creating indexes can be a game, and sometimes the only way to win is not to play.