PG Phriday: 10 Ways to Ruin Performance: Cast Away

Unlike a lot of programming languages that have loose typing, databases are extremely serious about data types. So serious, in fact, that many patently refuse to perform automatic type conversions in case the data being compared is no longer exactly equivalent afterward. This is the case with Postgres (PostgreSQL), and surprisingly often, queries will suffer as a result. Fortunately this is something we can address!

There are easier ways to demonstrate this, but in the interests of making this more of an adventure, let’s use a join between two tables:

CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);

INSERT INTO sys_order (product_id, item_count, order_dt, valid_dt)
SELECT (id % 100000) + 1, (id % 100) + 1,
       now() - (id % 1000 || 'd')::INTERVAL,
       CASE WHEN id % 499 = 0
            THEN NULL
            ELSE now() - (id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 1000000) a(id);

ALTER TABLE sys_order ADD CONSTRAINT pk_order_id
      PRIMARY KEY (order_id);

CREATE TABLE sys_exchange_map
(
    exchange_map_id  SERIAL       NOT NULL,
    vendor_id        INT          NOT NULL,
    ext_order_code   TEXT         NOT NULL,
    map_message      TEXT         NOT NULL,
    xmit_dt          TIMESTAMPTZ  NOT NULL DEFAULT now()
);

INSERT INTO sys_exchange_map (
         exchange_map_id, vendor_id, ext_order_code,
         map_message, xmit_dt
)
SELECT id, (id % 10) + 1,
       ((id % 10000) + 1) || ':' || substring(random()::TEXT, 3),
       'This is a bunch of junk from an external vendor queue!',
       now() - (id % 30 || 'd')::INTERVAL
  FROM generate_series(1, 100000) a(id);

ALTER TABLE sys_exchange_map ADD CONSTRAINT pk_exchange_map_id
      PRIMARY KEY (exchange_map_id);

CREATE INDEX idx_exchange_map_xmit_dt
    ON sys_exchange_map (xmit_dt);

ANALYZE sys_order;
ANALYZE sys_exchange_map;

This structure simulates something I’ve seen somewhat often in real systems. In this case, our order table is normal, but there’s also an exchange mapping table that maps our orders to external vendor orders. Imagine this table is part of the queue engine running the application, so the various codes and message columns are describing the interactions within the platform.

Given this structure, it’s not uncommon to encounter a query that manually parses one or more of these codes and tries to find matching orders. This is what that might look like:

SELECT o.*
  FROM sys_exchange_map m
  JOIN sys_order o ON (
         o.order_id = substring(m.ext_order_code, '(.*):')::NUMERIC
       )
 WHERE m.xmit_dt > CURRENT_DATE;

Unfortunately on my test VM, this query takes almost four seconds! Experienced developers and DBAs probably already know what the problem is, but let’s take a gander at two segments of the EXPLAIN ANALYZE output and see if there are any obvious clues.

                             QUERY PLAN
 Merge Join  (cost=142800.24..568726.04 rows=16935000 width=28)
             (actual time=4844.280..4875.651 rows=3333 loops=1)

 [ SNIP ]

 ->  Seq Scan on sys_order o
       (cost=0.00..17353.00 rows=1000000 width=28)
       (actual time=0.036..884.354 rows=1000000 loops=1)

 Planning time: 0.240 ms
 Execution time: 3626.783 ms

We should be able to immediately see two potential problems. Thinking back on our discussions regarding estimates matching actual results, notice that the database thinks it will be returning almost 17M rows, when there are really only around 3300. Further, it's scanning the entire sys_order table instead of using the primary key. What happened here?

This may or may not come as a surprise, but NUMERIC is not compatible with standard integer types in Postgres. For those who have worked with other databases such as Oracle, it may seem natural to use NUMERIC as a catch-all for any number. This is especially true for developers who work with different data types in a myriad of languages, some of which feature multiple compatible numeric types.
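We can see this effect in isolation, without the join, by comparing plans for a single-row lookup against the sys_order table we built above. Only the cast differs between the two statements:

```sql
-- The INT comparison matches the type of order_id, so the primary key
-- index is available. The NUMERIC comparison forces the column into a
-- NUMERIC context instead, and the planner typically falls back to
-- scanning the whole table.
EXPLAIN SELECT * FROM sys_order WHERE order_id = 42::INT;      -- Index Scan
EXPLAIN SELECT * FROM sys_order WHERE order_id = 42::NUMERIC;  -- Seq Scan
```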

With Postgres, casting is controlled through explicit declaration of type compatibility. We can look at NUMERIC and see which types it will convert to with this query:

SELECT s.typname AS source_type,
       t.typname AS target_type,
       CASE c.castcontext
         WHEN 'a' THEN 'Assignment'
         WHEN 'i' THEN 'Implicit'
         WHEN 'e' THEN 'Explicit'
       END AS context
  FROM pg_cast c
  JOIN pg_type s ON (s.oid = c.castsource)
  JOIN pg_type t ON (t.oid = c.casttarget)
 WHERE s.typname = 'numeric';

 source_type | target_type |  context   
 numeric     | int8        | Assignment
 numeric     | int2        | Assignment
 numeric     | int4        | Assignment
 numeric     | float4      | Implicit
 numeric     | float8      | Implicit
 numeric     | money       | Assignment
 numeric     | numeric     | Implicit

From this chart, it would appear that INT and NUMERIC are compatible, but only through assignment. This means we can use them interchangeably in calculations, for example, or save an integer to a numeric column or the reverse. This does not mean they’ll be automatically converted or treated as compatible for equality tests. Since the types are not implicitly compatible, Postgres will not use the INT index for the NUMERIC cast.
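Assignment casts are easy to demonstrate: storing a value of one type into a column of the other works without any explicit cast at all, which is exactly why the types feel interchangeable until an equality test gets involved. A small throwaway example:

```sql
-- Assignment casts fire automatically when storing values, so both of
-- these succeed even though the literal types don't match the columns.
CREATE TEMP TABLE cast_demo (n NUMERIC, i INT);

INSERT INTO cast_demo VALUES (10, 10);            -- INT stored in NUMERIC column
INSERT INTO cast_demo (i) VALUES (3.7::NUMERIC);  -- NUMERIC stored in INT column
```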

Unable to use the proper index, the planner does its best to come up with a working plan. Unfortunately the type incompatibility wreaks havoc during that phase as well. What we end up with is something of a mess, with completely incorrect row estimates, an unnecessary sequential scan, and an expensive in-memory merge. All it would take to fix this is one tiny change to our query:

SELECT o.*
  FROM sys_exchange_map m
  JOIN sys_order o ON (
         o.order_id = substring(m.ext_order_code, '(.*):')::INT
       )
 WHERE m.xmit_dt > CURRENT_DATE;

                             QUERY PLAN                             
 Nested Loop  (cost=66.98..22648.71 rows=3387 width=28)
              (actual time=1.058..42.084 rows=3333 loops=1)

 [ SNIP ]

 ->  Index Scan using pk_order_id on sys_order o 
       (cost=0.43..6.19 rows=1 width=28)
       (actual time=0.003..0.003 rows=1 loops=3333)

 Planning time: 1.176 ms
 Execution time: 43.635 ms

That’s a pretty massive difference! In this particular case, it’s 80 times faster, but I’ve run into situations where a single miscast caused the query to be almost 80,000 times slower. It all depends on the scale of the data involved, and how badly everything compounds at each step of the execution process.

The message here is to pay particular attention to data types and casting. If a query is performing badly when it should be much faster, make sure types are compatible across the board. Some languages are particularly strict with regard to data type usage, so consider Postgres among them.
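When auditing a suspect query, the built-in pg_typeof function is a quick way to confirm what an expression actually produces before it ever touches an index. Applied to the parsing expression from our join condition:

```sql
-- pg_typeof reports the result type of any expression. The first form is
-- what the slow query compared against an INT primary key.
SELECT pg_typeof(substring('1234:56789', '(.*):')::NUMERIC);  -- numeric
SELECT pg_typeof(substring('1234:56789', '(.*):')::INT);      -- integer
```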