PG Phriday: 10 Ways to Ruin Performance: In The Loop

As a database, Postgres (PostgreSQL) is fairly standard in its use of SQL. Developers of all colors however, might have trouble switching gears and thinking in set operations, since so many language constructs focus on conditionals and looping. Last week in the performance pitfalls series, we discussed a bit of Set Theory, and how ignorance of its implications can be disastrous. But what about the more mundane?

What happens, for instance, when we treat a database like a programming language?

This time, the example tables will emulate a pretty terrible ordering system with a user table of 1000 users, and a million orders distributed over roughly the last three years. Here it is, in all its glory:

CREATE TABLE sys_user
(
    user_id      SERIAL       NOT NULL,
    username     VARCHAR      NOT NULL,
    password     VARCHAR      NOT NULL,
    last_order   TIMESTAMPTZ  NULL,
    created_dt   TIMESTAMPTZ  NOT NULL DEFAULT now(),
    modified_dt  TIMESTAMPTZ  NOT NULL DEFAULT now()
);

INSERT INTO sys_user (username, password, created_dt, modified_dt)
SELECT 'user' || a.id,
       md5('use-bcrypt-instead' || 'user' || a.id || 'somepassword'),
       now() - (a.id % 1000 || 'd')::INTERVAL,
       now() - (a.id % 100 || 'd')::INTERVAL
  FROM generate_series(1, 1000) a(id);

ALTER TABLE sys_user ADD CONSTRAINT pk_user_id
      PRIMARY KEY (user_id);

ANALYZE sys_user;

CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    product_id   INT          NOT NULL,
    user_id      INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);

INSERT INTO sys_order (product_id, item_count, user_id, order_dt, valid_dt)
SELECT (a.id % 100000) + 1, (a.id % 100) + 1, (a.id % 1000) + 1,
       now() - (a.id % 1000 || 'd')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 1000000) a(id);

ALTER TABLE sys_order ADD CONSTRAINT pk_order_id
      PRIMARY KEY (order_id);

CREATE INDEX idx_order_order_dt
    ON sys_order (order_dt DESC);

ANALYZE sys_order;

With that out of the way, please note that I do not now, nor will I ever advocate using md5 for passwords. Just don’t. Use crypt (bcrypt) from the pgcrypto extension instead. Note that in even this woefully simplified schema, I use a rudimentary salt. I’m not a security analyst so going further is beyond the scope of this article. But if you plan on writing any kind of secured application of any kind, read up on proper cryptography protocols and two-factor authentication. Never, never, never throw an authentication system together, or you will be hacked and every password in your system will be compromised. Just ask linkedin

Going back to our example, imagine we want to keep some denormalized information about the user in the user table itself. There’s a batch job that updates the user table and maintains the last time the user ordered something. The job does something like this:

\timing on

DO $$
DECLARE
    i INT;
BEGIN
    FOR i IN SELECT user_id
               FROM sys_order
              WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
    LOOP
      UPDATE sys_user
         SET last_order = now()
       WHERE user_id = i;
    END LOOP;
END
$$ LANGUAGE plpgsql;

Time: 4107.704 ms

“Contrived,” you scream! And you’re partially right. Unfortunately, this kind of scenario happens more often than you might expect. This is actually a modified example I caught in a system I was auditing two years ago, and its execution time was on the order of hours. In that scenario, there was also the language and network overhead to consider. It wasn’t a Postgres DO loop running directly in the database, but some Groovy code that was operating on every result from another query.

So while this example is clearly contrived, that situation definitely wasn’t. Part of the reason I began my education crusade within our company was to prevent such things from ever occurring. Which brings me to the alternative: anything else. Almost literally any other database-focused solution will work better than a language looping construct.

I’ve specifically advocated against using IN syntax in many contexts, but look what happens when we convert the loop to operating on the set of results in the loop query:

\timing on

UPDATE sys_user
   SET last_order = now()
 WHERE user_id IN (
         SELECT user_id
           FROM sys_order o
          WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
       );

UPDATE 32
Time: 17.466 ms

This particular example is almost 250x faster, but I’ve seen instances where the duration differs by five or even six orders of magnitude. Remember that example I mentioned that used to run for hours? The post-modification execution time was fifteen seconds. If we had further modified the update job to use a temporary table, it would have been five seconds.

It all depends, of course, on the number of records involved, the speed of the underlying disks, the health and correlation of underlying indexes and data heap, and a lot of other factors. The lesson here is, as with the last article, to try and think in sets and groups, and perform operations on the whole collection of data at once.

But there’s yet another way I suggest using that’s even simpler. Postgres has UPDATE ... FROM syntax that lets you combine the two statements into a single operation that resembles a standard JOIN. Here’s what it looks like using our example:

\timing on

UPDATE sys_user u
   SET last_order = now()
  FROM sys_order o
 WHERE o.user_id = u.user_id
   AND o.order_dt > CURRENT_DATE - INTERVAL '1 month';

UPDATE 32
Time: 40.822 ms

Anything you can do in a SELECT can be translated into this syntax. And like the IN example, it’s compatible with EXPLAIN so we can investigate how many records are affected before executing it. But why is it slower? In this particular instance, the reason is due to the number of orders that occurred in the last month. Because of the data distribution, there are 32,000 orders from 32 customers. Unfortunately the planner sees the 32,000 first, and doesn’t know it’s only 32 distinct values, so it performs a sequence scan on the user table, inflating the overall execution time.

That won’t always be the case, and regardless, it’s still much faster than the alternative. The primary benefit from this syntax is its simplicity. If you can write a JOIN, you can use UPDATE ... FROM, and not have to worry about properly formatting or vetting a subselect.

But wait, there’s more! Examples like this are too straight-forward. Sometimes a lot of preliminary work is required before the final UPDATE. Sometimes this means using several intermediate temporary tables, or as before, building a program loop to perform several supplementary tasks and issuing individual updates.

With Postgres Common Table Expression (CTE) syntax, those prerequisite steps can be done inline. Check this out:

\timing on

WITH ord AS
(
    SELECT DISTINCT user_id
      FROM sys_order
     WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
)
UPDATE sys_user u
   SET modified_dt = now()
  FROM ord
 WHERE u.user_id = ord.user_id;

UPDATE 32
Time: 19.849 ms

We’ve now done two things.

  1. Brought execution time back inline with the IN case.
  2. Broken the UPDATE into two distinct steps.

Postgres CTEs act as an optimization fence, because each segment is physically materialized as a temporary object. This object has very real dimensions, and the results can be fed into further CTE steps, or used as in our example, directly in a final query. Here’s how adding a second step might look:

\timing on

WITH ord AS (
    SELECT DISTINCT user_id
      FROM sys_order
     WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
),
good AS (
    SELECT user_id
      FROM ord
     WHERE user_id % 3 = 0
)
UPDATE sys_user u
   SET modified_dt = now()
  FROM good
 WHERE u.user_id = good.user_id;

UPDATE 10
Time: 19.741 ms

Now the example really is contrived, because we could have just added the extra WHERE clause to the first step. But that’s because our example is hilariously basic. In a real system, there would be dozens or even hundreds of other tables, and a lot more potential for subsequent calculations, joins, and so on.

The point here isn’t that the second step was pointless, but that it was possible at all. This is what happens when you think in sets. Each result set is something that can be reduced or combined with another result set, and the combination can be used for further operations. Imagine it as object-oriented programming; each set is a blob that should be considered one entity with various attributes.

I’ll end with this bit of Python that should make it obvious what I mean:

# This is set theory:

foo = [1, 2, 3, 3, 3, 4]
print foo.count(3)

# This isn't:

total = 0
for i in foo:
  if i == 3:
    total += 1

print total