We’re finally at the end of the 10-part Postgres (PostgreSQL) performance series I use to initiate new developers into the database world. To that end, we’re going to discuss something that affects everyone at one point or another: index criteria. Or to put it another way:
Why isn’t the database using an index?
It’s a fairly innocuous question, but one that may have a surprising answer: the index was created using erroneous assumptions. Let’s explore what happens in a hospital environment with a pared-down table of patients.
```sql
DROP TABLE IF EXISTS sys_patient;

CREATE TABLE sys_patient
(
    patient_id  SERIAL   NOT NULL,
    full_name   VARCHAR  NOT NULL,
    birth_dt    DATE     NOT NULL,
    sex         CHAR     NOT NULL
);

INSERT INTO sys_patient (full_name, birth_dt, sex)
SELECT 'Crazy Person ' || a.id,
       CURRENT_DATE - (a.id % 100 || 'y')::INTERVAL
                    + (a.id % 365 || 'd')::INTERVAL,
       CASE WHEN a.id % 2 = 0 THEN 'M' ELSE 'F' END
  FROM generate_series(1, 1000000) a(id);

ALTER TABLE sys_patient
  ADD CONSTRAINT pk_patient_id PRIMARY KEY (patient_id);

CREATE INDEX idx_patient_birth_dt ON sys_patient (birth_dt);
CREATE INDEX idx_patient_sex ON sys_patient (sex);

ANALYZE sys_patient;
```
This particular hospital has a few queries that operate based on the sex of the patient, so someone created an index on that column. One day, another developer is doing some code refactoring and, being well-trained by the resident DBA, runs the query through
EXPLAIN to check the query plan. Upon seeing the result, the dev curses a bit, tries a few variants, and ultimately takes the issue to the DBA.
This is what the developer saw:
```sql
EXPLAIN ANALYZE
SELECT *
  FROM sys_patient
 WHERE sex = 'F';
```

```
                             QUERY PLAN
--------------------------------------------------------------------
 Seq Scan on sys_patient  (cost=0.00..19853.00 rows=498233 width=29)
                          (actual time=0.018..541.738 rows=500000 loops=1)
   Filter: (sex = 'F'::bpchar)
   Rows Removed by Filter: 500000
 Planning time: 0.292 ms
 Execution time: 823.901 ms
```
No matter what the dev did, the database adamantly refused to use the
idx_patient_sex index. The answer is generally obvious to a DBA or a relatively seasoned developer, but this actually happens far more frequently than one might think. This is an extreme example, yet even experienced database users, report writers, and analysts make this mistake.
Before using an index, the database essentially asks a series of questions:
- How many matches do I expect from this index?
- What proportion of the table do these matches represent?
- Are the cumulative random seeks faster than filtering the table?
If any of those answers is unfavorable (too many expected matches, too large a proportion of the table, or random seeks slower than a scan), the database will not use the index. In our example, the sex column has only two values, so the answers are more obvious than usual. With one million rows, a query on the sex column alone would match half of them, and randomly seeking 500,000 results is likely an order of magnitude slower than simply filtering the whole table for matches.
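For contrast, consider querying the far more selective birth_dt column. The birth dates generated above are spread across a century, so an equality match covers only a sliver of the table, and the planner should reach the opposite conclusion and use idx_patient_birth_dt. A sketch of the check (the date is illustrative; exact costs and timings will vary by system):

```sql
-- An equality match on birth_dt covers only a tiny fraction of
-- the table, so the planner should pick an index scan here.
EXPLAIN ANALYZE
SELECT *
  FROM sys_patient
 WHERE birth_dt = '1983-06-14';
```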
But it’s not always so easy to figure out what kind of cardinality to expect from a table column. Short of checking every column of every table with
count(DISTINCT my_col) or something equally ridiculous, someone unfamiliar with the data in a complex table architecture would get stuck. However, in order to answer the above questions, the database itself must track certain statistics about table contents.
It just so happens that Postgres makes that data available to everyone through the
pg_stats view. Let’s check what PostgreSQL has stored regarding the sys_patient table:
```sql
SELECT attname AS column_name, n_distinct
  FROM pg_stats
 WHERE tablename = 'sys_patient';
```

```
 column_name | n_distinct
-------------+------------
 patient_id  |         -1
 full_name   |         -1
 birth_dt    |       7310
 sex         |          2
```
Interpreting these results is actually very easy. A negative n_distinct value is not a count but a ratio: the number of distinct values scales with the row count, and the magnitude gives the proportion. At -1, there is a one-to-one relationship between the number of rows in the table and the number of distinct values in the column. As a general rule, nearly any column with a negative value here is a good index candidate, because a WHERE clause on it will reduce the potential results significantly.
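We can sanity-check this interpretation against the real data. Assuming the sys_patient table built earlier, both counts below should come back as one million, matching the -1 ratio; of course, this kind of full count(DISTINCT ...) scan is exactly the expensive work the stored statistics let us avoid day to day:

```sql
-- Every generated full_name is unique, so distinct_names
-- should equal total_rows, which pg_stats reports as -1.
SELECT count(*)                  AS total_rows,
       count(DISTINCT full_name) AS distinct_names
  FROM sys_patient;
```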
Positive values are an absolute count of unique values for that column. During table analysis, the database checks a random sampling of rows and tabulates statistics from them. That means the value in n_distinct is representative rather than exact, but it usually doesn’t deviate by a significant margin. The data here doesn’t need to be accurate enough for reporting, merely good enough to calculate an efficient query plan.
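When a sampled estimate does drift too far from reality, Postgres allows the sample size to be raised per column above the default_statistics_target setting. A minimal sketch (the target of 500 is merely illustrative):

```sql
-- Track up to 500 most-common values / histogram buckets for
-- birth_dt instead of the default 100, then resample the table.
ALTER TABLE sys_patient
  ALTER COLUMN birth_dt SET STATISTICS 500;

ANALYZE sys_patient;
```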
From here, we can see that the sex column would make a terrible index candidate, even if we knew nothing else about the table. There are simply not enough distinct values to reduce the number of matches for a query.
Given all of this, the dev has nothing to fear; the idx_patient_sex index should never have existed in the first place. A query that needs to fetch all patients of a particular sex will simply require a sequential scan, and that’s fine.
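That said, if queries genuinely do filter on sex alongside something selective, the column can still earn its keep as a trailing member of a composite index. This is a common pattern rather than anything prescribed above, and the index name here is hypothetical:

```sql
-- The leading birth_dt column does the real work of narrowing
-- the candidate rows; sex merely refines them within each date.
CREATE INDEX idx_patient_birth_dt_sex
    ON sys_patient (birth_dt, sex);
```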
Creating indexes can be a game, and sometimes the only way to win is not to play.