PG Phriday: PostgreSQL Select Filters

Long have CASE statements been a double-edged sword in the database world. They’re functional, diverse, adaptive, and simple. Unfortunately they’re also somewhat bulky, and when it comes to using them to categorize aggregates, something of a hack. This is why I wanted to cry with joy when I found out that PostgreSQL 9.4 introduced a feature I’ve always wanted, but found difficult to express as a need. I mean, CASE statements are fine, right? Well, yes they are, but now we have something better. Now, we have the FILTER aggregate expression.

I always like working with examples, so let’s create some test data to illustrate just what I’m talking about.

CREATE TABLE sys_order
(
    order_id     SERIAL     NOT NULL,
    product_id   INT        NOT NULL,
    item_count   INT        NOT NULL,
    order_dt     TIMESTAMP  NOT NULL DEFAULT now()
);

INSERT INTO sys_order (product_id, item_count)
SELECT (a.id % 100) + 1, (random()*10)::INT + 1
  FROM generate_series(1, 1000000) a(id);

ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id
      PRIMARY KEY (order_id);

We now have a table for tracking fake orders, using 100 nonexistent products. I added the primary key after loading the table as a well known DBA trick. Doing this after data loading means the index can be created as a single step, which is much more efficient than repeatedly extending an existing index.

With that out of the way, let’s do a basic product order count, since that’s something many people are already familiar with:

SELECT sum(item_count) AS total
  FROM sys_order;

-[ RECORD 1 ]--
total | 1000000

No surprises here. But what happens when Jeff from Accounting wants to know how many people ordered five specific products as a column list? In the old days, we might do something like this:

SELECT sum(CASE WHEN product_id = 1 THEN item_count ELSE 0 END) AS horse_mask_count,
       sum(CASE WHEN product_id = 7 THEN item_count ELSE 0 END) AS eyeball_count,
       sum(CASE WHEN product_id = 13 THEN item_count ELSE 0 END) AS badger_count,
       sum(CASE WHEN product_id = 29 THEN item_count ELSE 0 END) AS orb_count,
       sum(CASE WHEN product_id = 73 THEN item_count ELSE 0 END) AS memebox_count
  FROM sys_order;

 horse_mask_count | eyeball_count | badger_count | orb_count | memebox_count 
------------------+---------------+--------------+-----------+---------------
            59870 |         59951 |        59601 |     59887 |         60189

Gross.

As a DBA, I’ve seen more of these than I can reasonably stand, and hate them every single time. It’s not the use of the CASE statement that is so irksome, but the micromanaging methodology necessary to reduce the count to zero for unwanted items. With FILTER however, this query changes quite a bit:

SELECT sum(item_count) FILTER (WHERE product_id = 1) AS horse_mask_count,
       sum(item_count) FILTER (WHERE product_id = 7) AS eyeball_count,
       sum(item_count) FILTER (WHERE product_id = 13) AS badger_count,
       sum(item_count) FILTER (WHERE product_id = 29) AS orb_count,
       sum(item_count) FILTER (WHERE product_id = 73) AS memebox_count
  FROM sys_order;

 horse_mask_count | eyeball_count | badger_count | orb_count | memebox_count 
------------------+---------------+--------------+-----------+---------------
            59870 |         59951 |        59601 |     59887 |         60189

The query itself isn’t much shorter, but semantically, it’s far easier to understand what we’re trying to accomplish. It’s clear that we want the item count for each specific product, and nothing else. Further, since this is built-in functionality instead of a gross hack, it’s much faster. On the system we used for testing, after ten runs, the average time for the CASE variant was about 500ms; the FILTER version was about 300ms. In both cases, the query execution plan is identical. Internally however, invoking hundreds of thousands of CASE statements causes an immense CPU impact, where FILTER can utilize set grouping or another efficient stratagem based on the filter criteria. For large OLAP databases, this is a significant improvement in both query simplicity and performance.

This is good stuff we are getting in these new releases, and I encourage everyone to enjoy all of the new toys we get every year. Some of them are much more than mere window-dressing.