PG Phriday: The Case for Partitioning

August 7th, 2015 | Published in Database, Tech Talk | 4 Comments


In the next few weeks, I’m going to be publishing a long discussion regarding PGDB (PostgreSQL) table partitioning. I’ve covered it in previous articles, but only regarding basic performance considerations. That’s a very limited view of what partitioning can offer; there’s a lot more variety and background that deserves elucidation.

So for the next few articles, the topic of discussion will be partitioning. There’s not really enough written about it, and a lot of the techniques used in the field are pulled straight from the documentation. I think we can go much further.

To that end, this is the table structure all subsequent articles will reference:

CREATE TABLE sensor_log (
  sensor_log_id  SERIAL PRIMARY KEY,
  location       VARCHAR NOT NULL,
  reading        BIGINT NOT NULL,
  reading_date   TIMESTAMP NOT NULL
);
 
CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

Very large, continuously growing tables tend to be great candidates for partitioning. Further, we can explore several different techniques and cover various considerations related to expected volume. To help that along, I threw together a basic Python loader script that both bootstraps the partition data and can profile timings for analysis purposes.

To keep things interesting, the values are set to fill the table with 100M rows. The fact that this is scripted helps simulate a process that might actually load data in a production environment. It also allows testing basic insert speed with and without partitioning, as the size of the table (and its indexes) increases over time.

# Set these to modify how large the COMMIT blocks will be,
# and how many days the data will represent.
# Total row count = chunk_size * date_range.
 
chunk_size = 100000
date_range = 1000
 
# Main program body
 
import psycopg2
from StringIO import StringIO
from datetime import datetime, timedelta
 
raw_data = StringIO()
db_conn = psycopg2.connect(database = 'postgres', user = 'postgres')
cur = db_conn.cursor()
 
print 'rows,seconds,microseconds'
 
for j in xrange(0, date_range):
    raw_data.seek(0)
    raw_data.truncate(0)
    stamp = datetime.now() - timedelta(days = j)
 
    # Each row steps back five seconds from the day's starting timestamp.
    for i in xrange(1, chunk_size + 1):
        stamp -= timedelta(seconds = 5)
        raw_data.write(
            '%s\t%s\t%s\n' % (i % 1000, i % 100, stamp)
        )
 
    raw_data.seek(0)
    t1 = datetime.now()
 
    # Bulk-load the generated chunk via COPY for speed.
    cur.copy_from(
        raw_data, 'sensor_log', '\t',
        columns = ('location', 'reading', 'reading_date')
    )
 
    t2 = datetime.now()
    diff_s = str(t2 - t1).split(':')[2]
 
    print '%s,%s' % (
        (j + 1) * chunk_size, diff_s.replace('.', ',')
    )
 
    db_conn.commit()
 
raw_data.close()
db_conn.close()

Before we get to any partitioning discussion, let’s establish a baseline: using this script, loading the base sensor_log table takes about 45 minutes on my testing VM. Note that we didn’t cheat by loading the table without indexes, as we’ve done previously. Existing tables won’t have that luxury, so for the partitioning discussion, we can ignore that shortcut for now.

So, what are the approaches we can cover?

  • No partitioning at all.
  • Use the technique discussed in the PostgreSQL documentation (triggers); a minimal sketch follows this list.
  • Use the base table as current data, with partitions as older, archived information.
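
For reference, here’s a minimal sketch of that trigger-based technique, roughly as the PostgreSQL documentation presents it, using a single hypothetical monthly child table. Later articles will examine the approach in far more detail.

CREATE TABLE sensor_log_part_2015_08 (
    CHECK (reading_date >= '2015-08-01' AND
           reading_date <  '2015-09-01')
) INHERITS (sensor_log);
 
CREATE INDEX idx_sensor_log_part_2015_08_date
    ON sensor_log_part_2015_08 (reading_date);
 
-- Route inserts on the parent table into the appropriate child.
CREATE OR REPLACE FUNCTION sensor_log_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF NEW.reading_date >= '2015-08-01' AND
       NEW.reading_date <  '2015-09-01' THEN
        INSERT INTO sensor_log_part_2015_08 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'reading_date out of range; add a partition!';
    END IF;
    RETURN NULL;  -- Prevent the row from also landing in the parent.
END;
$$ LANGUAGE plpgsql;
 
CREATE TRIGGER trg_sensor_log_insert
BEFORE INSERT ON sensor_log
   FOR EACH ROW EXECUTE PROCEDURE sensor_log_insert_trigger();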

And knowing the techniques, what do we examine?

  • Statement speed.
  • Potential overhead.
  • Maintenance concerns.
  • Implementation caveats.

Partitioning is a hairy procedure. No matter what method is chosen, the result will have definite performance and management issues. The idea is to choose a good balance that addresses current needs, based on volume. I plan on being as comprehensive as possible to advocate partitioning without simply being a cheerleader.

Due to the 100-million-row sample size and the associated 13GB of space needed, timing differences should be obvious even in the non-partitioned case. Unfortunately that size also greatly inflates the time required for any statement tests, and for bootstrapping the tables themselves. I think the effort is warranted though, especially because projects that are likely to benefit from partitioning will probably exceed even that scale.
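
If you’d like to verify the footprint on your own hardware, a quick size check looks something like this (results will vary by version and fill factor):

SELECT pg_size_pretty(pg_relation_size('sensor_log')) AS table_size,
       pg_size_pretty(pg_total_relation_size('sensor_log')) AS total_size;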

So think about partitioning a bit, because I’m about to go into some crazy depth about it, and hopefully make a case for using it early and often.



PG Phriday: 10 Ways to Ruin Performance: Sex Offenders

July 24th, 2015 | Published in Database, Tech Talk | 4 Comments


We’re finally at the end of the 10-part PGDB (PostgreSQL) performance series I use to initiate new developers into the database world. To that end, we’re going to discuss something that affects everyone at one point or another: index criteria. Or to put it another way:

Why isn’t the database using an index?

It’s a fairly innocuous question, but one that may have a surprising answer: the index was created using erroneous assumptions. Let’s explore what happens in a hospital environment with a pared-down table of patients.

DROP TABLE IF EXISTS sys_patient;
 
CREATE TABLE sys_patient
(
    patient_id  SERIAL   NOT NULL,
    full_name   VARCHAR  NOT NULL,
    birth_dt    DATE     NOT NULL,
    sex         CHAR     NOT NULL
);
 
INSERT INTO sys_patient (full_name, birth_dt, sex)
SELECT 'Crazy Person ' || a.id,
       CURRENT_DATE - (a.id % 100 || 'y')::INTERVAL
                    + (a.id % 365 || 'd')::INTERVAL,
       CASE WHEN a.id % 2 = 0 THEN 'M' ELSE 'F' END
  FROM generate_series(1, 1000000) a(id);
 
ALTER TABLE sys_patient ADD CONSTRAINT pk_patient_id
      PRIMARY KEY (patient_id);
 
CREATE INDEX idx_patient_birth_dt ON sys_patient (birth_dt);
CREATE INDEX idx_patient_sex ON sys_patient (sex);
 
ANALYZE sys_patient;

This particular hospital has a few queries that operate based on the sex of the patient, so someone created an index on that column. One day, another developer is doing some code refactoring and, being well-trained by the resident DBA, runs the query through EXPLAIN to check the query plan. Upon seeing the result, the dev curses a bit, tries a few variants, and ultimately takes the issue to the DBA.

This is what the developer saw:

EXPLAIN ANALYZE 
SELECT *
  FROM sys_patient
 WHERE sex = 'F';
 
                             QUERY PLAN                             
--------------------------------------------------------------------
 Seq Scan on sys_patient  
      (cost=0.00..19853.00 rows=498233 width=29)
      (actual time=0.018..541.738 rows=500000 loops=1)
   Filter: (sex = 'F'::bpchar)
   Rows Removed by Filter: 500000
 
 Planning time: 0.292 ms
 Execution time: 823.901 ms

No matter what the dev did, the database adamantly refused to use the idx_patient_sex index. The answer is generally obvious to a DBA or a relatively seasoned developer, but this actually happens far more frequently than one might think. This is an extreme example, yet even experienced database users, report writers, and analysts make this mistake.

Before using an index, the database essentially asks a series of questions:

  1. How many matches do I expect from this index?
  2. What proportion of the table do these matches represent?
  3. Are the cumulative random seeks faster than filtering the table?

If the answer to any of those questions is too large or unfavorable, the database will not use the index. In our example, the sex column only has two values, so the answers to the above questions are more obvious than usual. With one million rows, a query on the sex column alone would match half of them. In addition, randomly seeking 500,000 results is likely an order of magnitude slower than simply filtering the whole table for matches.
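
If you’re curious just how lopsided that comparison is, one quick experiment is to temporarily forbid sequential scans and let the planner price the index path anyway:

-- Compare the estimated cost here against the Seq Scan plan above.
SET enable_seqscan TO off;
 
EXPLAIN
SELECT *
  FROM sys_patient
 WHERE sex = 'F';
 
RESET enable_seqscan;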

But it’s not always so easy to figure out what kind of cardinality to expect from a table column. Short of checking every column of every table with count(DISTINCT my_col) or something equally ridiculous, someone unfamiliar with the data in a complex table architecture would get stuck. However, in order to answer the above questions, the database itself must track certain statistics about table contents.

It just so happens that PGDB makes that data available to everyone through the pg_stats view. Let’s check what PostgreSQL has stored regarding the sys_patient table.

SELECT attname AS column_name, n_distinct
  FROM pg_stats
 WHERE tablename = 'sys_patient';
 
 column_name | n_distinct 
-------------+------------
 patient_id  |         -1
 full_name   |         -1
 birth_dt    |       7310
 sex         |          2

Interpreting these results is actually very easy. A negative n_distinct value is a ratio: the number of distinct values divided by the number of rows, negated. At -1, there’s a one-to-one relationship between the number of rows in the table and the number of distinct values in that column. As a general rule, nearly any column with a negative value here is a good index candidate, because a WHERE clause will reduce the potential results significantly.

Positive values are an absolute count of unique values for that column. During table analysis, the database checks a random sampling of rows and tabulates statistics based on them. That means the value in n_distinct is representative instead of exact, but usually doesn’t deviate by a significant margin. The data here doesn’t need to be viable for reporting, just to calculate an efficient query plan.
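
When the ratio form is inconvenient, it can be converted back into an estimated count by joining against pg_class; a small sketch, assuming the table name is unique in the search path:

SELECT s.attname AS column_name,
       CASE WHEN s.n_distinct < 0
            THEN round(-s.n_distinct * c.reltuples)
            ELSE s.n_distinct
       END AS distinct_estimate
  FROM pg_stats s
  JOIN pg_class c ON (c.relname = s.tablename)
 WHERE s.tablename = 'sys_patient';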

From here, we can see that the sex column would likely be a terrible index candidate, even if we knew nothing else about the table. There are simply not enough distinct values to reduce the number of matches for a query.

Given all of this, the dev has nothing to fear; the idx_patient_sex index should have never existed in the first place. A query that needs to fetch all of any particular sex will simply require a sequential scan, and that’s fine.
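
Before actually dropping it, a glance at pg_stat_user_indexes can confirm the index never gets chosen; something along these lines:

SELECT indexrelname, idx_scan
  FROM pg_stat_user_indexes
 WHERE relname = 'sys_patient';
 
-- If idx_scan remains zero for idx_patient_sex, reclaim the space.
DROP INDEX idx_patient_sex;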

Creating indexes can be a game, and sometimes the only way to win is not to play.



PG Phriday: 10 Ways to Ruin Performance: Indexing the World

July 17th, 2015 | Published in Database, Tech Talk | 1 Comment


An easy way to give PGDB (PostgreSQL) a performance boost is to judiciously use indexes based on queries observed in the system. For most situations, this is as simple as indexing columns that are referenced frequently in WHERE clauses. PGDB is one of the few database engines that takes this idea even further with partial indexes. Unfortunately as a consequence of insufficient exposure, most DBAs and users are unfamiliar with this extremely powerful functionality.

Imagine we have an order system that tracks order state, such that entries are marked as new, processing, or done. These kinds of transient states are not uncommon in various inventory management systems, so it’s a great example for this use case. Often with such systems, data is distributed in such a way that more than 90% of orders are marked as ‘done’. To make this interesting, let’s just cap the done state at 90%, and distribute another 5% to processing, and 5% to new.

This somewhat complex SQL should emulate the above scenario:

DROP TABLE IF EXISTS sys_order;
 
CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    account_id   INT          NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_state  CHAR         NOT NULL DEFAULT 'N',
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);
 
INSERT INTO sys_order (
         account_id, product_id, item_count, order_state,
         order_dt, valid_dt
       )
SELECT (a.id % 100) + 1, (a.id % 100000) + 1, (a.id % 100) + 1,
       CASE WHEN b.id BETWEEN 1 AND 5 THEN 'N'
            WHEN b.id BETWEEN 6 AND 10 THEN 'P'
            ELSE 'D'
       END,
       now() - (a.id % 1000 || 'd')::INTERVAL
             + (a.id || 'ms')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (a.id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 10000) a(id),
       generate_series(1, 100) b(id);
 
ALTER TABLE sys_order ADD CONSTRAINT pk_order_id
      PRIMARY KEY (order_id);
 
CREATE INDEX idx_order_account_id
    ON sys_order (account_id);
 
CREATE INDEX idx_order_order_dt
    ON sys_order (order_dt DESC);
 
ANALYZE sys_order;

In many, if not most situations, automated tools and interfaces are only interested in pending order states. If an order is new, or in the middle of processing, it’s going to see a lot of activity. Because of this, a lot of queries will request orders based on status. Let’s presume an example of this is to fetch all new orders from a certain account, possibly for display on a web site.

This is the query PGDB sees:

SELECT *
  FROM sys_order
 WHERE account_id = 42
   AND order_state = 'N';

In my virtual environment, this executes in about 3ms and uses the index on account_id as expected. Few people would consider a 3ms execution time slow, so this is where most optimization stops. But we know something about this data! We know that 90% (or more) of all possible order states are ‘D’. In the case of this query, PGDB has to whittle the 10,000 matches from the index down to 500 to return our results.
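
For the curious, running the query through EXPLAIN ANALYZE makes that reduction visible; look for the Rows Removed by Filter line in the output:

EXPLAIN ANALYZE
SELECT *
  FROM sys_order
 WHERE account_id = 42
   AND order_state = 'N';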

This is hardly ideal. A naive approach to correct it would be to index both account_id and order_state. While this works, we’re still indexing the 90% of values that provide no benefit to the index. This is where PGDB differs from many other database engines. Let’s go a bit further with our example, create a partial index, and try the query again:

CREATE INDEX idx_order_state_account_id
    ON sys_order (account_id, order_state)
 WHERE order_state != 'D';
 
SELECT *
  FROM sys_order
 WHERE account_id = 42
   AND order_state = 'N';

The revised execution time of our query with the new index is about 0.7ms. While this is 4-5x faster than before, the benefits go beyond execution time. Let’s take a look at the size of each index on disk:

SELECT indexrelid::REGCLASS::TEXT AS index_name,
       pg_relation_size(indexrelid) / 1048576 AS size_mb
  FROM pg_index
 WHERE indrelid::REGCLASS::TEXT = 'sys_order';
 
         index_name         | size_mb 
----------------------------+---------
 pk_order_id                |      21
 idx_order_account_id       |      21
 idx_order_order_dt         |      21
 idx_order_state_account_id |       2

Hopefully it comes as no surprise that an index which includes 90% less data will be 90% smaller. Also, keep in mind that this contrived example was designed to somewhat downplay the effect of partial indexes. In a real order system, far more than 90% of orders would be marked as done, thus magnifying the speed increase and index size reduction.

There are a lot of ways partial indexes can be used, and like most tools, they’re not appropriate for all situations. But when the data cooperates, they’re extremely relevant. When time permits, take a look at data distribution and queries against the system. Chances are, there will be at least one situation that could be improved with a partial index.
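
A couple of quick checks, something like the following, make that kind of skew easy to spot before designing a partial index:

-- Measure the distribution directly...
SELECT order_state, count(*) AS total,
       round(count(*) * 100.0 / sum(count(*)) OVER (), 2) AS pct
  FROM sys_order
 GROUP BY order_state;
 
-- ...or consult the statistics gathered during ANALYZE.
SELECT most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'sys_order'
   AND attname = 'order_state';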

If only every database was so accommodating.



PG Phriday: 10 Ways to Ruin Performance: MAXimized Value

July 10th, 2015 | Published in Database, Tech Talk | No Comments


I apologize for putting this series on a short hiatus last week for the 4th of July. But worry not, for this week is something special for all the developers out there! I’m going to try to make your life easier for a change. Screw the database!

As a PGDB (PostgreSQL) DBA, it’s easy to get tied up in performance hints, obscure syntax, and mangled queries, but it’s really all about the people. These men and women who hit our databases all day long in an attempt to hit insane deadlines often stare at the screen in defeat, trying to get something to work because they’re afraid to ask for help. I know, because I used to be one of them in my bygone developer days.

Sometimes performance isn’t just about the database. Queries can be insanely complex for seemingly simple requests. Even developers who are familiar with SQL syntax don’t speak it fluently, and will often come up with rather bizarre—yet functional—solutions to get a specific data set. Nobody wins when this happens, not the poor dev who fought a truculent query all day, nor the DBA who probably doesn’t understand all the business logic that drove it.

Let’s use an example from a company I worked with in the past. This data set should give us something fairly close to what happened back then.

CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    account_id   INT          NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);
 
INSERT INTO sys_order (account_id, product_id, item_count,
       order_dt, valid_dt)
SELECT (a.id % 1000) + 1, (a.id % 100000) + 1, (a.id % 100) + 1,
       now() - (id % 1000 || 'd')::INTERVAL + (id || 'ms')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 1000000) a(id);
 
ALTER TABLE sys_order ADD CONSTRAINT pk_order_id
      PRIMARY KEY (order_id);
 
CREATE INDEX idx_order_account_id
    ON sys_order (account_id);
 
CREATE INDEX idx_order_order_dt
    ON sys_order (order_dt DESC);
 
ANALYZE sys_order;

Consider the scenario: retrieve all of the order details for the most recent order on every account in the system that ordered something in the last month. The developer tasked with solving this conundrum knew about GROUP BY syntax, and also had enough experience to realize it wouldn’t work on its own. Every non-aggregate column in a query must appear in the GROUP BY clause, so if we wanted only the maximum order date for each account, we’d lose all of the other order information.
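
To illustrate, a first attempt along those lines might have looked something like this (a hypothetical reconstruction, not the developer’s actual code), and PGDB would reject it:

SELECT account_id, product_id, item_count, MAX(order_dt)
  FROM sys_order
 GROUP BY account_id;
 
-- ERROR:  column "sys_order.product_id" must appear in the GROUP BY
--         clause or be used in an aggregate function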

Given those constraints, the developer came up with this:

SELECT o.*
  FROM sys_order o
  JOIN (
        SELECT account_id,
               MAX(order_dt) AS order_dt
          FROM sys_order
         WHERE order_dt > CURRENT_DATE
                        - INTERVAL '1 month'
         GROUP BY account_id
       ) s USING (account_id, order_dt);

It may not be immediately obvious, but this approach only worked because the order system was built in such a way that an account could not order two different items simultaneously. Without that scenario, more than one row would come back for each account when joined back to the order table. But beyond that caveat, there’s an even more distressing implication. Forget the previous discussion if you can, then ask yourself this question: what does that query do?

I’ll admit I scratched my head for a couple of minutes the first time I encountered it. The intent is almost completely obfuscated by the structure, which really defeats the purpose of SQL as a language. Generally the idea of SQL is to make a request, not tell the database how to fulfill it. That’s why it’s a good thing SQL has syntax specifically to address this kind of request.

In PGDB, the DISTINCT ON clause lets us specify columns that should be used in determining whether or not a row is actually distinct, independent of any potential aggregation. It works like this: the first unique combination of any listed columns wins. That means we can naturally control the output simply by issuing a standard ORDER BY.

Here’s how that looks:

SELECT DISTINCT ON (account_id) *
  FROM sys_order
 WHERE order_dt > CURRENT_DATE
                - INTERVAL '1 month'
 ORDER BY account_id, order_dt DESC;

Now look at the query and ask the same question as before: what does this query do? Well, for each distinct account, return all columns for the most recent order in the last month. Isn’t that much more straight-forward? As a bonus, this doesn’t have the potential for breaking if the system changes and allows batch ordering of multiple products per transaction.
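
For comparison, the same request is often written with a window function instead; it works, but it’s noticeably wordier than DISTINCT ON:

SELECT *
  FROM (SELECT o.*,
               row_number() OVER (PARTITION BY account_id
                                  ORDER BY order_dt DESC) AS rn
          FROM sys_order o
         WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
       ) ranked
 WHERE rn = 1;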

So we now have a query that’s easier to understand, is safer and more consistent, and uses far simpler SQL syntax. The only possible drawback is that DISTINCT ON is often a bit slower than a more convoluted approach. But that’s fine; developer time is valuable. As a DBA, I know which query I’d rather try and debug!

Hopefully, I’m not the only one.



PG Phriday: 10 Ways to Ruin Performance: In The Loop

June 26th, 2015 | Published in Database, Tech Talk | 1 Comment


As a database, PGDB (PostgreSQL) is fairly standard in its use of SQL. Developers of all stripes, however, might have trouble switching gears and thinking in set operations, since so many language constructs focus on conditionals and looping. Last week in the performance pitfalls series, we discussed a bit of Set Theory, and how ignorance of its implications can be disastrous. But what about the more mundane?

What happens, for instance, when we treat a database like a programming language?

This time, the example tables will emulate a pretty terrible ordering system with a user table of 1000 users, and a million orders distributed over roughly the last three years. Here it is, in all its glory:

CREATE TABLE sys_user
(
    user_id      SERIAL       NOT NULL,
    username     VARCHAR      NOT NULL,
    password     VARCHAR      NOT NULL,
    last_order   TIMESTAMPTZ  NULL,
    created_dt   TIMESTAMPTZ  NOT NULL DEFAULT now(),
    modified_dt  TIMESTAMPTZ  NOT NULL DEFAULT now()
);
 
INSERT INTO sys_user (username, password, created_dt, modified_dt)
SELECT 'user' || a.id,
       md5('use-bcrypt-instead' || 'user' || a.id || 'somepassword'),
       now() - (a.id % 1000 || 'd')::INTERVAL,
       now() - (a.id % 100 || 'd')::INTERVAL
  FROM generate_series(1, 1000) a(id);
 
ALTER TABLE sys_user ADD CONSTRAINT pk_user_id
      PRIMARY KEY (user_id);
 
ANALYZE sys_user;
 
CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    product_id   INT          NOT NULL,
    user_id      INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);
 
INSERT INTO sys_order (product_id, item_count, user_id, order_dt, valid_dt)
SELECT (a.id % 100000) + 1, (a.id % 100) + 1, (a.id % 1000) + 1,
       now() - (a.id % 1000 || 'd')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 1000000) a(id);
 
ALTER TABLE sys_order ADD CONSTRAINT pk_order_id
      PRIMARY KEY (order_id);
 
CREATE INDEX idx_order_order_dt
    ON sys_order (order_dt DESC);
 
ANALYZE sys_order;

With that out of the way, please note that I do not now, nor will I ever, advocate using md5 for passwords. Just don’t. Use crypt (bcrypt) from the pgcrypto extension instead. Note that even in this woefully simplified schema, I use a rudimentary salt. I’m not a security analyst, so going further is beyond the scope of this article. But if you plan on writing any kind of secured application, read up on proper cryptography protocols and two-factor authentication. Never, never, never throw an authentication system together, or you will be hacked and every password in your system will be compromised. Just ask LinkedIn.
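
For anyone wondering what that looks like in practice, here’s a minimal sketch using pgcrypto; treat it as an illustration, not a complete authentication design:

CREATE EXTENSION IF NOT EXISTS pgcrypto;
 
-- Hash a password with a randomly generated bcrypt salt.
SELECT crypt('somepassword', gen_salt('bf')) AS password_hash;
 
-- Verify a login attempt: re-hashing the candidate with the stored hash
-- as the salt must reproduce the stored hash exactly.
SELECT stored = crypt('somepassword', stored) AS is_valid
  FROM (SELECT crypt('somepassword', gen_salt('bf')) AS stored) h;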

Going back to our example, imagine we want to keep some denormalized information about the user in the user table itself. There’s a batch job that updates the user table and maintains the last time the user ordered something. The job does something like this:

\timing ON
 
DO $$
DECLARE
    i INT;
BEGIN
    FOR i IN SELECT user_id
               FROM sys_order
              WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
    LOOP
      UPDATE sys_user
         SET last_order = now()
       WHERE user_id = i;
    END LOOP;
END
$$ LANGUAGE plpgsql;
 
Time: 4107.704 ms

“Contrived,” you scream! And you’re partially right. Unfortunately, this kind of scenario happens more often than you might expect. This is actually a modified example I caught in a system I was auditing two years ago, and its execution time was on the order of hours. In that scenario, there was also the language and network overhead to consider. It wasn’t a PGDB DO loop running directly in the database, but some Groovy code that was operating on every result from another query.

So while this example is clearly contrived, that situation definitely wasn’t. Part of the reason I began my education crusade within our company was to prevent such things from ever occurring. Which brings me to the alternative: anything else. Almost literally any other database-focused solution will work better than a language looping construct.

I’ve specifically advocated against using IN syntax in many contexts, but look what happens when we convert the loop into a single statement that operates on the set of results from the loop query:

\timing ON
 
UPDATE sys_user
   SET last_order = now()
 WHERE user_id IN (
         SELECT user_id
           FROM sys_order o
          WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
       );
 
UPDATE 32
Time: 17.466 ms

This particular example is almost 250x faster, but I’ve seen instances where the duration differs by five or even six orders of magnitude. Remember that example I mentioned that used to run for hours? The post-modification execution time was fifteen seconds. If we had further modified the update job to use a temporary table, it would have been five seconds.
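
For reference, the temporary table variant I’m alluding to would look something like this; the exact timings will obviously depend on the system:

CREATE TEMP TABLE tmp_recent_users AS
SELECT DISTINCT user_id
  FROM sys_order
 WHERE order_dt > CURRENT_DATE - INTERVAL '1 month';
 
ANALYZE tmp_recent_users;
 
UPDATE sys_user u
   SET last_order = now()
  FROM tmp_recent_users t
 WHERE u.user_id = t.user_id;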

It all depends, of course, on the number of records involved, the speed of the underlying disks, the health and correlation of underlying indexes and data heap, and a lot of other factors. The lesson here is, as with the last article, to try and think in sets and groups, and perform operations on the whole collection of data at once.

But there’s yet another approach that’s even simpler. PGDB has UPDATE ... FROM syntax that lets you combine the two statements into a single operation that resembles a standard JOIN. Here’s what it looks like using our example:

\timing ON
 
UPDATE sys_user u
   SET last_order = now()
  FROM sys_order o
 WHERE o.user_id = u.user_id
   AND o.order_dt > CURRENT_DATE - INTERVAL '1 month';
 
UPDATE 32
Time: 40.822 ms

Anything you can do in a SELECT can be translated into this syntax. And like the IN example, it’s compatible with EXPLAIN, so we can investigate how many records are affected before executing it. But why is it slower? In this particular instance, the reason is the number of orders that occurred in the last month. Because of the data distribution, there are 32,000 orders from 32 customers. Unfortunately the planner sees the 32,000 first, and doesn’t know they represent only 32 distinct users, so it performs a sequential scan on the user table, inflating the overall execution time.

That won’t always be the case, and regardless, it’s still much faster than the alternative. The primary benefit from this syntax is its simplicity. If you can write a JOIN, you can use UPDATE ... FROM, and not have to worry about properly formatting or vetting a subselect.
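
And because the statement is EXPLAIN-friendly, a dry run before touching any data is cheap:

EXPLAIN
UPDATE sys_user u
   SET last_order = now()
  FROM sys_order o
 WHERE o.user_id = u.user_id
   AND o.order_dt > CURRENT_DATE - INTERVAL '1 month';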

But wait, there’s more! Examples like this are too straight-forward. Sometimes a lot of preliminary work is required before the final UPDATE. Sometimes this means using several intermediate temporary tables, or as before, building a program loop to perform several supplementary tasks and issuing individual updates.

With PGDB Common Table Expression (CTE) syntax, those prerequisite steps can be done inline. Check this out:

\timing ON
 
WITH ord AS
(
    SELECT DISTINCT user_id
      FROM sys_order
     WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
)
UPDATE sys_user u
   SET modified_dt = now()
  FROM ord
 WHERE u.user_id = ord.user_id;
 
UPDATE 32
Time: 19.849 ms

We’ve now done two things.

  1. Brought execution time back inline with the IN case.
  2. Broken the UPDATE into two distinct steps.

PGDB CTEs act as an optimization fence, because each segment is physically materialized as a temporary object. This object has very real dimensions, and the results can be fed into further CTE steps, or used as in our example, directly in a final query. Here’s how adding a second step might look:

\timing ON
 
WITH ord AS (
    SELECT DISTINCT user_id
      FROM sys_order
     WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
),
good AS (
    SELECT user_id
      FROM ord
     WHERE user_id % 3 = 0
)
UPDATE sys_user u
   SET modified_dt = now()
  FROM good
 WHERE u.user_id = good.user_id;
 
UPDATE 10
Time: 19.741 ms

Now the example really is contrived, because we could have just added the extra WHERE clause to the first step. But that’s because our example is hilariously basic. In a real system, there would be dozens or even hundreds of other tables, and a lot more potential for subsequent calculations, joins, and so on.

The point here isn’t that the second step was pointless, but that it was possible at all. This is what happens when you think in sets. Each result set is something that can be reduced or combined with another result set, and the combination can be used for further operations. Imagine it as object-oriented programming; each set is a blob that should be considered one entity with various attributes.

I’ll end with this bit of Python that should make it obvious what I mean:

# This is set theory:
 
foo = [1, 2, 3, 3, 3, 4]
print foo.count(3)
 
# This isn't:
 
total = 0
for i in foo:
  if i == 3:
    total += 1
 
print total

