As a database, PGDB (PostgreSQL) is fairly standard in its use of SQL. Developers of all colors however, might have trouble switching gears and thinking in set operations, since so many language constructs focus on conditionals and looping. Last week in the performance pitfalls series, we discussed a bit of Set Theory, and how ignorance of its implications can be disastrous. But what about the more mundane?
What happens, for instance, when we treat a database like a programming language?
This time, the example tables will emulate a pretty terrible ordering system with a user table of 1000 users, and a million orders distributed over roughly the last three years. Here it is, in all its glory:
CREATE TABLE sys_user ( user_id SERIAL NOT NULL, username VARCHAR NOT NULL, password VARCHAR NOT NULL, last_order TIMESTAMPTZ NULL, created_dt TIMESTAMPTZ NOT NULL DEFAULT now(), modified_dt TIMESTAMPTZ NOT NULL DEFAULT now() ); INSERT INTO sys_user (username, password, created_dt, modified_dt) SELECT 'user' || a.id, md5('use-bcrypt-instead' || 'user' || a.id || 'somepassword'), now() - (a.id % 1000 || 'd')::INTERVAL, now() - (a.id % 100 || 'd')::INTERVAL FROM generate_series(1, 1000) a(id); ALTER TABLE sys_user ADD CONSTRAINT pk_user_id PRIMARY KEY (user_id); ANALYZE sys_user; CREATE TABLE sys_order ( order_id SERIAL NOT NULL, product_id INT NOT NULL, user_id INT NOT NULL, item_count INT NOT NULL, order_dt TIMESTAMPTZ NOT NULL DEFAULT now(), valid_dt TIMESTAMPTZ NULL ); INSERT INTO sys_order (product_id, item_count, user_id, order_dt, valid_dt) SELECT (a.id % 100000) + 1, (a.id % 100) + 1, (a.id % 1000) + 1, now() - (a.id % 1000 || 'd')::INTERVAL, CASE WHEN a.id % 499 = 0 THEN NULL ELSE now() - (id % 999 || 'd')::INTERVAL END FROM generate_series(1, 1000000) a(id); ALTER TABLE sys_order ADD CONSTRAINT pk_order_id PRIMARY KEY (order_id); CREATE INDEX idx_order_order_dt ON sys_order (order_dt DESC); ANALYZE sys_order;
With that out of the way, please note that I do not now, nor will I ever advocate using md5 for passwords. Just don’t. Use crypt (bcrypt) from the pgcrypto extension instead. Note that in even this woefully simplified schema, I use a rudimentary salt. I’m not a security analyst so going further is beyond the scope of this article. But if you plan on writing any kind of secured application of any kind, read up on proper cryptography protocols and two-factor authentication. Never, never, never throw an authentication system together, or you will be hacked and every password in your system will be compromised. Just ask linkedin
Going back to our example, imagine we want to keep some denormalized information about the user in the user table itself. There’s a batch job that updates the user table and maintains the last time the user ordered something. The job does something like this:
\timing ON DO $$ DECLARE i INT; BEGIN FOR i IN SELECT user_id FROM sys_order WHERE order_dt > CURRENT_DATE - INTERVAL '1 month' LOOP UPDATE sys_user SET last_order = now() WHERE user_id = i; END LOOP; END $$ LANGUAGE plpgsql; TIME: 4107.704 ms
“Contrived,” you scream! And you’re partially right. Unfortunately, this kind of scenario happens more often than you might expect. This is actually a modified example I caught in a system I was auditing two years ago, and its execution time was on the order of hours. In that scenario, there was also the language and network overhead to consider. It wasn’t a PGDB DO loop running directly in the database, but some Groovy code that was operating on every result from another query.
So while this example is clearly contrived, that situation definitely wasn’t. Part of the reason I began my education crusade within our company was to prevent such things from ever occurring. Which brings me to the alternative: anything else. Almost literally any other database-focused solution will work better than a language looping construct.
I’ve specifically advocated against using
IN syntax in many contexts, but look what happens when we convert the loop to operating on the set of results in the loop query:
\timing ON UPDATE sys_user SET last_order = now() WHERE user_id IN ( SELECT user_id FROM sys_order o WHERE order_dt > CURRENT_DATE - INTERVAL '1 month' ); UPDATE 32 TIME: 17.466 ms
This particular example is almost 250x faster, but I’ve seen instances where the duration differs by five or even six orders of magnitude. Remember that example I mentioned that used to run for hours? The post-modification execution time was fifteen seconds. If we had further modified the update job to use a temporary table, it would have been five seconds.
It all depends, of course, on the number of records involved, the speed of the underlying disks, the health and correlation of underlying indexes and data heap, and a lot of other factors. The lesson here is, as with the last article, to try and think in sets and groups, and perform operations on the whole collection of data at once.
But there’s yet another way I suggest using that’s even simpler. PGDB has
UPDATE ... FROM syntax that lets you combine the two statements into a single operation that resembles a standard
JOIN. Here’s what it looks like using our example:
\timing ON UPDATE sys_user u SET last_order = now() FROM sys_order o WHERE o.user_id = u.user_id AND o.order_dt > CURRENT_DATE - INTERVAL '1 month'; UPDATE 32 TIME: 40.822 ms
Anything you can do in a
SELECT can be translated into this syntax. And like the
IN example, it’s compatible with
EXPLAIN so we can investigate how many records are affected before executing it. But why is it slower? In this particular instance, the reason is due to the number of orders that occurred in the last month. Because of the data distribution, there are 32,000 orders from 32 customers. Unfortunately the planner sees the 32,000 first, and doesn’t know it’s only 32 distinct values, so it performs a sequence scan on the user table, inflating the overall execution time.
That won’t always be the case, and regardless, it’s still much faster than the alternative. The primary benefit from this syntax is its simplicity. If you can write a
JOIN, you can use
UPDATE ... FROM, and not have to worry about properly formatting or vetting a subselect.
But wait, there’s more! Examples like this are too straight-forward. Sometimes a lot of preliminary work is required before the final
UPDATE. Sometimes this means using several intermediate temporary tables, or as before, building a program loop to perform several supplementary tasks and issuing individual updates.
With PGDB Common Table Expression (CTE) syntax, those prerequisite steps can be done inline. Check this out:
\timing ON WITH ord AS ( SELECT DISTINCT user_id FROM sys_order WHERE order_dt > CURRENT_DATE - INTERVAL '1 month' ) UPDATE sys_user u SET modified_dt = now() FROM ord WHERE u.user_id = ord.user_id; UPDATE 32 TIME: 19.849 ms
We’ve now done two things.
- Brought execution time back inline with the
- Broken the
UPDATEinto two distinct steps.
PGDB CTEs act as an optimization fence, because each segment is physically materialized as a temporary object. This object has very real dimensions, and the results can be fed into further CTE steps, or used as in our example, directly in a final query. Here’s how adding a second step might look:
\timing ON WITH ord AS ( SELECT DISTINCT user_id FROM sys_order WHERE order_dt > CURRENT_DATE - INTERVAL '1 month' ), good AS ( SELECT user_id FROM ord WHERE user_id % 3 = 0 ) UPDATE sys_user u SET modified_dt = now() FROM good WHERE u.user_id = good.user_id; UPDATE 10 TIME: 19.741 ms
Now the example really is contrived, because we could have just added the extra
WHERE clause to the first step. But that’s because our example is hilariously basic. In a real system, there would be dozens or even hundreds of other tables, and a lot more potential for subsequent calculations, joins, and so on.
The point here isn’t that the second step was pointless, but that it was possible at all. This is what happens when you think in sets. Each result set is something that can be reduced or combined with another result set, and the combination can be used for further operations. Imagine it as object-oriented programming; each set is a blob that should be considered one entity with various attributes.
I’ll end with this bit of Python that should make it obvious what I mean:
# This is set theory: foo = [1, 2, 3, 3, 3, 4] print foo.count(3) # This isn't: total = 0 for i in foo: if i == 3: total += 1 print total