PG Phriday: 10 Ways to Ruin Performance: Indexing the World

An easy way to give Postgres (PostgreSQL) a performance boost is to judiciously use indexes based on queries observed in the system. For most situations, this is as simple as indexing columns that are referenced frequently in WHERE clauses. Postgres is one of the few database engines that takes this idea even further with partial indexes. Unfortunately as a consequence of insufficient exposure, most DBAs and users are unfamiliar with this extremely powerful functionality.

Imagine we have an order system that tracks order state, such that entries are marked as new, processing, or done. These kinds of transient states are not uncommon in various inventory management systems, so it’s a great example for this use case. Often with such systems, data is distributed in such a way that more than 90% of orders are marked as ‘done’. To make this interesting, let’s just cap the done state at 90%, and distribute another 5% to processing, and 5% to new.

This somewhat complex SQL should emulate the above scenario:

DROP TABLE IF EXISTS sys_order;

CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    account_id   INT          NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_state  CHAR         NOT NULL DEFAULT 'N',
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);

INSERT INTO sys_order (
         account_id, product_id, item_count, order_state,
         order_dt, valid_dt
       )
SELECT (a.id % 100) + 1, (a.id % 100000) + 1, (a.id % 100) + 1,
       CASE WHEN b.id BETWEEN 1 AND 5 THEN 'N'
            WHEN b.id BETWEEN 6 AND 10 THEN 'P'
            ELSE 'D'
       END,
       now() - (a.id % 1000 || 'd')::INTERVAL
             + (a.id || 'ms')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (a.id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 10000) a(id),
       generate_series(1, 100) b(id);

ALTER TABLE sys_order ADD CONSTRAINT pk_order_id
      PRIMARY KEY (order_id);

CREATE INDEX idx_order_account_id
    ON sys_order (account_id);

CREATE INDEX idx_order_order_dt
    ON sys_order (order_dt DESC);

ANALYZE sys_order;

In many, if not most situations, automated tools and interfaces are only interested in pending order states. If an order is new, or in the middle of processing, it’s going to see a lot of activity. Because of this, a lot of queries will request orders based on status. Let’s presume an example of this is to fetch all new orders from a certain account, possibly for display on a web site.

This is the query Postgres sees:

SELECT *
  FROM sys_order
 WHERE account_id = 42
   AND order_state = 'N';

In my virtual environment, this executes in about 3ms and uses the index on account_id as expected. Few people would consider a 3ms execution time as slow, so this is where most optimization stops. But we know something about this data! We know that 90% (or more) of all possible order states are ‘D’. In the case of this query, Postgres has to reduce 10,000 matches from the index, down to 500 to return our results.

This is hardly ideal. A naive approach to correct this, may be to index both account_id and order_state. While this works, we’re still indexing 90% of values that provide no benefit to the index. This is where Postgres differs from many other database engines. Let’s go a bit further with our example and create a partial index and try the query again:

CREATE INDEX idx_order_state_account_id
    ON sys_order (account_id, order_state)
 WHERE order_state != 'D';

SELECT *
  FROM sys_order
 WHERE account_id = 42
   AND order_state = 'N';

The revised execution time of our query with the new index is about 0.7ms. While this is 4-5x faster than before, the benefits go beyond execution time. Let’s take a look at the size of each index on disk:

SELECT indexrelid::REGCLASS::TEXT AS index_name,
       pg_relation_size(indexrelid) / 1048576 AS size_mb
  FROM pg_index
 WHERE indrelid::REGCLASS::TEXT = 'sys_order';

         index_name         | size_mb 
----------------------------+---------
 pk_order_id                |      21
 idx_order_account_id       |      21
 idx_order_order_dt         |      21
 idx_order_state_account_id |       2

Hopefully it comes as no surprise that an index which includes 90% less data will be 90% smaller. Also, keep in mind that this contrived example was designed to somewhat downplay the effect of partial indexes. In a real order system, far more than 90% of orders would be marked as done, thus magnifying the speed increase and index size reduction.

There are a lot of ways partial indexes can be used, and like most tools, they’re not appropriate for all situations. But when the data cooperates, they’re extremely relevant. When time permits, take a look at data distribution and queries against the system. Chances are, there will be at least one situation that could be improved with a partial index.

If only every database was so accommodating.