PG Phriday: 10 Ways to Ruin Performance: MAXimized Value

I apologize for putting this series on a short hiatus last week for the 4th of July. But worry not, for this week is something special for all the developers out there! I’m going to try to make your life easier for a change. Screw the database!

As a Postgres (PostgreSQL) DBA, it’s easy to get tied up in performance hints, obscure syntax, and mangled queries, but it’s really all about the people. These men and women who hit our databases all day long in an attempt to hit insane deadlines often stare at the screen in defeat, trying to get something to work because they’re afraid to ask for help. I know, because I used to be one of them in my bygone developer days.

Sometimes performance isn’t just about the database. Queries can be insanely complex for seemingly simple requests. Even developers who are familiar with SQL syntax don’t speak it fluently, and will often come up with rather bizarre—yet functional—solutions to get a specific data set. Nobody wins when this happens, not the poor dev who fought a truculent query all day, nor the DBA who probably doesn’t understand all the business logic that drove it.

Let’s use an example from a company I worked with in the past. This data set should give us something fairly close to what happened back then.

CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    account_id   INT          NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);

INSERT INTO sys_order (account_id, product_id, item_count,
       order_dt, valid_dt)
SELECT (a.id % 1000) + 1, (a.id % 100000) + 1, (a.id % 100) + 1,
       now() - (id % 1000 || 'd')::INTERVAL + (id || 'ms')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 1000000) a(id);

ALTER TABLE sys_order ADD CONSTRAINT pk_order_id
      PRIMARY KEY (order_id);

CREATE INDEX idx_order_account_id
    ON sys_order (account_id);

CREATE INDEX idx_order_order_dt
    ON sys_order (order_dt DESC);

ANALYZE sys_order;

Consider the scenario: retrieve all of the order details for the most recent order on every account in the system that ordered something in the last month. The developer tasked with solving this conundrum knew about GROUP BY syntax, and also had enough experience to realize it wouldn’t work. Using GROUP BY is mandatory for all non-aggregate columns in a query. Thus, if we wanted the maximum order date for each account, we’d lose all of the other order information.

Given those constraints, the developer came up with this:

SELECT o.*
  FROM sys_order o
  JOIN (
        SELECT account_id,
               MAX(order_dt) AS order_dt
          FROM sys_order
         WHERE order_dt > CURRENT_DATE
                        - INTERVAL '1 month'
         GROUP BY account_id
       ) s USING (account_id, order_dt);

It may not be immediately obvious, but this approach only worked because the order system was built in such a way that an account could not order two different items simultaneously. Without that scenario, more than one row would come back for each account when joined back to the order table. But beyond that caveat, there’s an even more distressing implication. Forget the previous discussion if you can, then ask yourself this question: what does that query do?

I’ll admit I scrached my head for a couple minutes the first time I encountered it. The intent is almost completely obfuscated by the structure, which really defeats the purpose of SQL as a language. Generally the idea of SQL is to make a request, not tell the database how to fulfill it. So that’s why it’s a good thing SQL has syntax specifically to address this kind of request.

In Postgres, the DISTINCT ON clause lets us specify columns that should be used in determining whether or not a row is actually distinct, independent of any potential aggregation. It works like this: the first unique combination of any listed columns wins. That means we can naturally control the output simply by issuing a standard ORDER BY.

Here’s how that looks:

SELECT DISTINCT ON (account_id) *
  FROM sys_order
 WHERE order_dt > CURRENT_DATE
                - INTERVAL '1 month'
 ORDER BY account_id, order_dt DESC

Now look at the query and ask the same question as before: what does this query do? Well, for each distinct account, return all columns for the most recent order in the last month. Isn’t that much more straight-forward? As a bonus, this doesn’t have the potential for breaking if the system changes and allows batch ordering of multiple products per transaction.

So we now have a query that’s easier to understand, is safer and more consistent, and uses far simpler SQL syntax. The only possible drawback is that DISTINCT ON is often a bit slower than a more convoluted approach. But that’s fine; developer time is valuable. As a DBA, I know which query I’d rather try and debug!

Hopefully, I’m not the only one.