PG Phriday: 10 Ways to Ruin Performance: Forgetting it EXISTS

May 22nd, 2015 | Published in Database, Tech Talk | 1 Comment


For the second of my ten part series on hidden PGDB (PostgreSQL) performance killers, I’m going to talk about something called an anti-join. It’s not a well-known approach outside of the database world, but due to how it works, it can impart almost magical plan revisions that drastically improve query performance in the right scenario. Developers can add it to a growing bag of tricks when working on database-driven content, since it comes in handy more often than you might expect.

Let’s build a test-case, shall we?

CREATE TABLE sys_product
(
    product_id   SERIAL  PRIMARY KEY,
    prod_name    TEXT    NOT NULL,
    quality      INT     NOT NULL,
    descr        TEXT    NOT NULL DEFAULT now()
);
 
INSERT INTO sys_product (prod_name, quality, descr)
SELECT 'Product ' || a.id::TEXT,
       log((a.id % 100) + 1)::INT,
       'It does stuff.'
  FROM generate_series(1, 100000) a(id);
 
CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);
 
INSERT INTO sys_order (product_id, item_count, order_dt, valid_dt)
SELECT (a.id % 100000) + 1, (a.id % 100) + 1,
       now() - (id % 1000 || 'd')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 1000000) a(id);
 
ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id
      PRIMARY KEY (order_id);
 
CREATE INDEX idx_order_product_id
    ON sys_order (product_id);
 
CREATE INDEX idx_order_valid_dt
    ON sys_order (valid_dt);
 
ANALYZE sys_product;
ANALYZE sys_order;

This is a very basic product and order table structure, and we used it in the last installment of this series. The only columns we've added since last time are the quality column in sys_product and the valid_dt column in sys_order. This way, we can introduce some variability into the query plan due to data correlation effects. We've tried to distribute the data such that new orders have not been validated, and there are still some non-validated orders in the past. This distribution is more reliable than using random().

Now let’s build a query that retrieves all unvalidated orders on products with a quality of two, using a simple JOIN:

EXPLAIN ANALYZE
SELECT o.*
  FROM sys_order o
  JOIN sys_product p USING (product_id)
 WHERE o.valid_dt IS NULL
   AND p.quality = 2;
 
                             QUERY PLAN                             
--------------------------------------------------------------------
 Hash Join  (cost=2985.66..7406.44 rows=1303 width=28)
            (actual time=80.438..90.406 rows=1383 loops=1)
   Hash Cond: (o.product_id = p.product_id)
   ->  Bitmap Heap Scan on sys_order o
              (cost=39.41..4423.00 rows=1933 width=28)
              (actual time=0.600..6.879 rows=2004 loops=1)
         Recheck Cond: (valid_dt IS NULL)
         Heap Blocks: exact=2004
         ->  Bitmap Index Scan on idx_order_valid_dt
                    (cost=0.00..38.92 rows=1933 width=0)
                    (actual time=0.327..0.327 rows=2004 loops=1)
               Index Cond: (valid_dt IS NULL)
   ->  Hash  (cost=2084.00..2084.00 rows=68980 width=4)
             (actual time=79.773..79.773 rows=69000 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 2426
         ->  Seq Scan on sys_product p
                 (cost=0.00..2084.00 rows=68980 width=4)
                 (actual time=0.013..41.150 rows=69000 loops=1)
               Filter: (quality = 2)
               Rows Removed by Filter: 31000
 Planning time: 0.732 ms
 Execution time: 91.875 ms

We can actually infer quite a bit from this execution plan.

  • The sys_order table was checked with the index on valid_dt as desired.
  • The sys_product table was scanned in its entirety.
  • The planner estimates all match up almost exactly.
  • Final result was obtained by hashing the two row sets together.

At first glance, this looks like the ideal query plan. The index we wanted was used, and it reduced one million rows to about 2000. Since we don’t have an index on product quality, we might as well fetch all of the products and filter out the wrong quality before building the hash. Once both sides are hashed together, we get about 1400 rows as a result.

Of course, this approach makes a huge mistake. Even though the order and product tables are within an order of magnitude of each other, we’re only interested in a relatively small fraction of the order table. Thus, fetching all products at this scale is probably a losing proposition. However, since the planner doesn’t have data regarding how closely these tables are correlated, it can’t make that assumption. Indeed, is it fetching 2000 unique product IDs, or 200? Since it has no way of knowing that before actually fetching the related products, it does the best it can.

But can we do better? As it turns out, we can do a whole lot better by telling the planner what we really wanted. Note that we didn’t actually use the results from the product table in the SELECT portion of the query. Further, we know that we’re only interested in a maximum of about 2000 products, one per matching order, so why check all of them? Is there a way to force the planner to fetch the matching orders first, and then check to see which products correspond as a second step?

We already know from last time that this is what a nested loop does, and one way to encourage the planner to pick a nested loop is by using EXISTS to break apart our JOIN condition. This accomplishes two things:

  1. Don’t join against a table to fetch columns that will never be returned.
  2. Favor some kind of looping construct to avoid excessive record comparisons.

This is also what is known as an anti-join, and PGDB performs very well when guided in that direction. Here’s what our query and plan look like using this concept:

EXPLAIN ANALYZE
SELECT o.*
  FROM sys_order o
 WHERE o.valid_dt IS NULL
   AND EXISTS (
         SELECT 1
           FROM sys_product p
          WHERE p.product_id = o.product_id
            AND p.quality = 2
          LIMIT 1
       );
 
                             QUERY PLAN                             
--------------------------------------------------------------------
 Bitmap Heap Scan on sys_order o
             (cost=39.16..20490.82 rows=967 width=28)
             (actual time=0.565..11.462 rows=1383 loops=1)
   Recheck Cond: (valid_dt IS NULL)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 621
   Heap Blocks: exact=2004
   ->  Bitmap Index Scan on idx_order_valid_dt
              (cost=0.00..38.92 rows=1933 width=0)
              (actual time=0.269..0.269 rows=2004 loops=1)
         Index Cond: (valid_dt IS NULL)
   SubPlan 1
     ->  Limit  (cost=0.29..8.31 rows=1 width=0)
                (actual time=0.003..0.003 rows=1 loops=2004)
           ->  Index Scan using sys_product_pkey on sys_product p
                     (cost=0.29..8.31 rows=1 width=0)
                     (actual time=0.002..0.002 rows=1 loops=2004)
                 Index Cond: (product_id = o.product_id)
                 Filter: (quality = 2)
                 Rows Removed by Filter: 0
 Planning time: 0.126 ms
 Execution time: 12.154 ms

This time, something drastically different happened:

  • The executor used the valid_dt index to find our matching rows, and fetched them from the table heap.
  • Using those same rows, it executed about 2000 index lookups into the product table.
  • The results are produced almost 8x faster.

Before anyone takes this result and starts rewriting all queries everywhere to utilize anti-joins, keep in mind that they’re extremely situational. The example here was built based on an actual application query at a company I used to work with. In their case, execution went from 2500ms to 280ms following the rewrite. But in building this test-case to simulate their data, I ended up creating a dozen data distributions that were slightly slower using this approach.

In addition, take a look at the LIMIT 1 section of the subquery. We’re explicitly telling the planner that only one product can match per seek, which is what the word EXISTS should imply. Yet if we remove it, the execution plan actually reverts to the same approach used by the more naive JOIN. Ouch!
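
For reference, here is the same rewrite with the LIMIT clause removed. I won’t repeat the plan output, but on this data set it’s the variant that fell back to the hash-based approach shown earlier; your results may differ depending on version and statistics:

EXPLAIN ANALYZE
SELECT o.*
  FROM sys_order o
 WHERE o.valid_dt IS NULL
   AND EXISTS (
         SELECT 1
           FROM sys_product p
          WHERE p.product_id = o.product_id
            AND p.quality = 2
       );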

But that’s OK because queries are, by their nature, more of a linguistic construct than a computational one. Having a larger vocabulary can help convey more meaning and sometimes produce different results, but our audience is perfectly within their rights to summarize or ignore irrelevant prattle. The goal in this case is adding to our lexicon, and trying to frame applicable scenarios in a different way in hopes that the planner will catch on.

For data distributions that apply, anti-joins are a very powerful technique. If nothing else, they’re worth a try when other approaches fall flat. The only way to know for sure is experimentation! They’re just a tool, and like any tool, should only be used at opportune instances.

So remember the anti-join—it’s a great tool for that rare hex-socket screw you occasionally stumble across.



PG Phriday: 10 Ways to Ruin Performance: Forgetting to EXPLAIN

May 15th, 2015 | Published in Database, Tech Talk | 6 Comments


Yesterday I gave the developers at my company what I call a DBA Chat. It’s something I try to do every month to keep them apprised on various features, caveats, performance considerations, and so on. I find that educating the folks who regularly work with the database does wonders for application performance and my sanity. The benefit of this long format is that I can go over more information than a time constrained set of slides.

This month we went over one of my slightly older talks, and even though the material was three years old from my perspective, it was all new to them. And why not? Developers have a job to do, and while they do work with the database, it generally isn’t their responsibility to research the voluminous minutia and quirks commonly associated with a specific platform. That’s my job.

So here is the first of a ten part series on anti-patterns that can annihilate PGDB (PostgreSQL) performance, some of which are outright insidious. This week we’ll start slow with a short discussion on the importance of EXPLAIN. This is PGDB’s first line of defense against bad queries, and it’s overlooked more than you might expect, as well as somewhat difficult to interpret for the uninformed. I’ll probably have a more in-depth version of this article in the future for truly interpreting EXPLAIN output, but this time we’ll be focusing on using it in general.

First, we need to set up a use case to illustrate a couple of variants. Two tables with a close relationship should do it:

CREATE TABLE sys_product
(
    product_id   SERIAL  PRIMARY KEY,
    prod_name    TEXT    NOT NULL,
    descr        TEXT    NOT NULL DEFAULT now()
);
 
INSERT INTO sys_product (prod_name, descr)
SELECT 'Product ' || a.id::TEXT, 'It does stuff.'
  FROM generate_series(1, 1000) a(id);
 
CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now()
);
 
INSERT INTO sys_order (product_id, item_count)
SELECT (a.id % 100) + 1, (random()*10)::INT + 1
  FROM generate_series(1, 1000000) a(id);
 
ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id
      PRIMARY KEY (order_id);
 
CREATE INDEX idx_order_product_id
    ON sys_order (product_id);

Now, EXPLAIN comes in many variants, but there are two that directly concern us: EXPLAIN and EXPLAIN ANALYZE. The first will list the estimated cost of the query, which indexes will be used, how many rows are estimated to match, and so on. The second will physically execute the query and obtain the actual match counts along with elapsed time for each step. Here’s what that looks like:

EXPLAIN SELECT * FROM sys_order;
 
                             QUERY PLAN                             
--------------------------------------------------------------------
 Seq Scan on sys_order  (cost=0.00..16370.00 rows=1000000 width=20)
 
EXPLAIN ANALYZE SELECT * FROM sys_order;
 
                             QUERY PLAN
--------------------------------------------------------------------
 Seq Scan on sys_order  (cost=0.00..16370.00 rows=1000000 width=20)
     (actual time=0.006..465.956 rows=1000000 loops=1)
 Planning time: 0.041 ms
 Execution time: 886.766 ms

I’ve reformatted the results a bit, but the difference in output should be readily apparent. Interpreting this isn’t even very difficult at a high level. Begin by examining the first set of parentheses in both results, and ignore the cost for now; it’s essentially meaningless noise intended primarily for the query planner. What concerns us is the row count, which is one million in this case. It’s how many rows PGDB estimated it would return.

The second set of parentheses in the ANALYZE version gives the actual values obtained by executing the query. In this case, the query matched exactly one million rows, and took 465ms to do so. If you don’t know already, a Seq Scan is short for Sequential Scan, and means the database read the entire contents of the table. That should be expected, since we didn’t provide any WHERE clause to restrict the results.
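
As an aside, EXPLAIN also accepts a parenthesized option list. The BUFFERS option, for instance, adds shared-buffer hit and read counts to each node, which can help spot excessive I/O. A quick sketch:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM sys_order;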

Pay special attention to the rows section, because it can be a major clue for ferreting out performance problems. PGDB keeps statistics on table contents, but these are necessarily sparse and in aggregate form. There’s also currently no way to correlate statistics between tables, so a JOIN can drastically skew row estimates; since those estimates drive the cost calculations, the result can be a wildly different execution plan than the optimal one.

So we know that a Sequential Scan will read the entire table. Hopefully that point by itself is an obvious indication that doing so is generally undesirable. If we are querying a table with 300-million rows, reading its entire contents is almost never the intended approach, nor should it be.

What other variants should we look for? How about an Index Scan:

EXPLAIN ANALYZE
 SELECT * FROM sys_order
  WHERE order_id BETWEEN 5 AND 10;
 
                             QUERY PLAN                             
--------------------------------------------------------------------
 Index Scan using pk_order_order_id on sys_order
       (cost=0.42..8.53 rows=5 width=20)
       (actual time=0.008..0.012 rows=6 loops=1)
   Index Cond: ((order_id >= 5) AND (order_id <= 10))
 Planning time: 0.112 ms
 Execution time: 0.041 ms

We can see a few things from these results:

  • Estimated row count is not the same as how many actually matched.
  • The rows were found with an Index Scan.
  • Execution time was much faster than the Sequential Scan case.

It’s not critical for estimated and actual row counts to match, but they should be within an order of magnitude of each other. If this isn’t the case, it’s likely the planner has insufficient information about the data, the statistics are wrong, or the data is more closely correlated than it assumed, resulting in fewer eliminations than expected, and so on. This is one of the first ways to see that there could be a problem with a query’s execution plan.
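
If the estimates really are that far off, a couple of maintenance commands are worth trying before rewriting the query. This is just a sketch, not a prescription; the column and statistics target here are examples:

-- Gather more granular statistics for a column the planner is
-- misjudging, then refresh the stats.
ALTER TABLE sys_order ALTER COLUMN product_id SET STATISTICS 1000;
ANALYZE sys_order;
 
-- Peek at what the planner currently believes about that column.
SELECT attname, null_frac, n_distinct
  FROM pg_stats
 WHERE tablename = 'sys_order'
   AND attname = 'product_id';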

Regarding the index scan, matching five or six values against a million isn’t a big deal, but there are limits. Indexes in PGDB work, in most cases, by fetching very quickly from the index file and then consulting the base table file to retrieve the matching row. This results in two random seeks to the underlying storage subsystem, and if you don’t know already, random reads are usually slower than sequential reads by an order of magnitude or more. This penalty largely disappears on SSD devices, but most databases don’t reside entirely on such expensive hardware just yet.

That’s why the rows element of the EXPLAIN output is important with respect to index scans; it’s entirely possible that reading the entire table would be faster than performing several hundred thousand index and table seeks. For the most part, the planner knows this, but calculating this relationship is still more of an art than a science. Consequently, sometimes the planner guesses wrong, and query performance will suffer. By using EXPLAIN, we can see what happened, and find a way around the problem.

Let’s turn this into a JOIN to make it a little more fun:

EXPLAIN ANALYZE
 SELECT o.*
   FROM sys_order o
   JOIN sys_product p USING (product_id)
  WHERE p.product_id IN (5, 10, 20);
 
                             QUERY PLAN                             
--------------------------------------------------------------------
 Nested Loop  (cost=188.87..15715.29 rows=3614 width=20)
              (actual time=2.317..66.377 rows=30000 loops=1)
   ->  Index Only Scan using sys_product_pkey on sys_product p
             (cost=0.28..24.88 rows=3 width=4)
             (actual time=0.033..0.056 rows=3 loops=1)
         Index Cond: (product_id = ANY ('{5,10,20}'::integer[]))
         Heap Fetches: 3
   ->  Bitmap Heap Scan on sys_order o
              (cost=188.59..5130.14 rows=10000 width=20)
              (actual time=2.105..13.037 rows=10000 loops=3)
         Recheck Cond: (product_id = p.product_id)
         Heap Blocks: exact=19107
         ->  Bitmap Index Scan on idx_order_product_id
               (cost=0.00..186.09 rows=10000 width=0)
               (actual time=1.170..1.170 rows=10000 loops=3)
               Index Cond: (product_id = p.product_id)
 Planning time: 0.274 ms
 Execution time: 78.834 ms

Wow! Luckily we can ignore most of this output by focusing on the Nested Loop at the top. This is really just a fancy way of saying “For Loop”. This is also the outermost level of the query plan, as EXPLAIN output is somewhat inverted. Each nesting level is a prerequisite step to construct the layer above it, somewhat like an onion.

If we look at the actual execution information for each of these blocks, we can see that the inner steps really ran three times (the nested entries show loops=3), even though the Nested Loop node itself reports loops=1. The distinction is that the outermost node only executes once; it’s everything inside it that repeats, which can be somewhat misleading.

But basically PGDB fetched three rows from the product table, and looped through memory using those values to find corresponding data in the order table. And as you might imagine, this is fine for small numbers of loops. There are occasions where PostgreSQL will underestimate row counts and opt for a Nested Loop, when really there are hundreds of thousands or millions of potential loop candidates, and it will use the same process as we saw above.

In such cases, execution time is extremely adversely impacted. Why? Nested loops and other looping constructs are computationally expensive. As far as the database is concerned, it’s almost always faster to create an in-memory object of some kind and perform a bulk operation on all values at once, such as a comparison or a transformation. Nested loops make this impossible, and impose a series of calculations and function callbacks on each individual row set at O(n*m) scale. For small loop counts, this is often cheaper than building those aforementioned memory representations, but scale is vastly important.

Here’s an example of what I mean, and we don’t even need the whole EXPLAIN output to see the problem:

EXPLAIN ANALYZE
 SELECT o.*
   FROM sys_order o
   JOIN sys_product p USING (product_id)
  WHERE p.product_id BETWEEN 6 AND 10;
 
                             QUERY PLAN                             
--------------------------------------------------------------------
 Nested Loop  (cost=188.20..17588.55 rows=4000 width=20)
              (actual time=2.542..105.528 rows=50000 loops=1)

This should be ringing alarm bells. Note that PGDB expected 4000 rows, and actually matched 50,000. At this scale that isn’t really a major issue, but expand these tables by a couple orders of magnitude, and that can quickly change. I’ve seen several queries where the estimated row count is five orders of magnitude less than the actual results. Disk fetches are exceptionally expensive, and multiplying random read times by a few thousand adds up. If a query that uses a nested loop is unexpectedly slow, always check the statistics in the EXPLAIN output; you may need to revise your query to compensate.
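
One quick way to test that suspicion is to temporarily discourage nested loops for a single session and compare plans. This is purely a diagnostic sketch; leaving enable_nestloop disabled globally is a bad idea:

SET enable_nestloop = off;
 
EXPLAIN ANALYZE
 SELECT o.*
   FROM sys_order o
   JOIN sys_product p USING (product_id)
  WHERE p.product_id BETWEEN 6 AND 10;
 
RESET enable_nestloop;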

Those are the Big Three in a nutshell. Even if you know nothing else about PGDB, how it plans or executes queries, or even databases in general, try to examine the execution path of your queries whenever possible. This is especially true if you have a query that’s unexpectedly slow. Being cognizant of potential issues prevents malformed queries or structures from worming their way permanently into a production application, and that’s something we can all celebrate.

Happy querying.



PG Phriday: Functions and Performance Attributes

May 8th, 2015 | Published in Database, Tech Talk | No Comments


Users both old and new are likely aware that PGDB has functions. Some lucky readers may have even written one or two. For those without much experience in this area, or are thinking about contributing a few functions to your database for the first time, there are a couple things you should know. This week, I want to cover a few of the function declaration decorators. If you have no idea what I’m talking about, then this is probably something you’ll want to read regardless of your experience level.

Let’s begin with a very basic function that, as usual, does nothing:

CREATE OR REPLACE FUNCTION f_do_nothing(myvar INT)
RETURNS INT AS
$$
BEGIN
    RETURN myvar;
END;
$$ LANGUAGE plpgsql;
 
\timing ON
 
SELECT f_do_nothing(1)
  FROM generate_series(1, 1000000) AS s (id);

Now, hopefully we can agree that a function with one parameter, which simply returns that parameter and does no other work, is pretty simple. Calling this function one million times in a test VM required about two seconds. Some astute readers might point out that I could have used the SQL language instead of PL/pgSQL, and they’d be absolutely right. Doing so would reduce run time to about 600ms for this example.
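
For the curious, a minimal sketch of that SQL-language equivalent might look like this; the _sql suffix is just to keep it distinct from the original function:

CREATE OR REPLACE FUNCTION f_do_nothing_sql(myvar INT)
RETURNS INT AS
$$
    SELECT myvar;
$$ LANGUAGE SQL;
 
SELECT f_do_nothing_sql(1)
  FROM generate_series(1, 1000000) AS s (id);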

Fortunately, the decorators I’ll be discussing are much more powerful than even switching to a simpler, though less versatile, language. As of PostgreSQL 9.4, there are four that will actually impact performance: VOLATILE, STABLE, STRICT, and IMMUTABLE.

The VOLATILE hint isn’t strictly necessary, as it’s the default. Effectively it means that the function can’t be optimized at all, as its effects can vary wildly based on table contents and parameters, even during execution. This causes PostgreSQL to invoke the function each time it’s used, regardless of what the function is actually doing. For our function, that’s clearly a detrimental approach. So what other hints do we have available?

The STABLE attribute is a bit more stringent. The implication in this case is that the function cannot modify the contents of the database in any way. Within the context of a single statement, this means the function can be semi-optimized by preventing extra calls, because the data it depends on should not change mid-statement. For this particular example, the STABLE keyword reduced run time by about ten percent, to about 1.7 seconds on average. Here’s what that looks like:

CREATE OR REPLACE FUNCTION f_do_nothing(myvar INT)
RETURNS INT AS
$$
BEGIN
    RETURN myvar;
END;
$$ LANGUAGE plpgsql STABLE;
 
\timing ON
 
SELECT f_do_nothing(1)
  FROM generate_series(1, 1000000) AS s (id);

After this comes the IMMUTABLE keyword, which takes the optimization one step further. This tells the planner that the function will always return the same result for the same parameters. Due to that guarantee, function calls can tacitly be replaced by their cached results wherever that benefits the planner. One way to look at this is by examining the performance:

CREATE OR REPLACE FUNCTION f_do_nothing(myvar INT)
RETURNS INT AS
$$
BEGIN
    RETURN myvar;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
 
\timing ON
 
SELECT f_do_nothing(1)
  FROM generate_series(1, 1000000) AS s (id);

The IMMUTABLE decorator reduced run time to 400ms in this case, but we don’t really know how many times the function was actually called. Just because a function call can be replaced by its results, doesn’t mean that will happen. Let’s modify the situation slightly by adding an explicit pause in the code:

CREATE OR REPLACE FUNCTION f_do_nothing(myvar INT)
RETURNS INT AS
$$
BEGIN
    PERFORM pg_sleep(1);
    RETURN myvar;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
 
\timing ON
 
SELECT f_do_nothing(1), f_do_nothing(1);

If your system is like mine, this example ran for two seconds, despite the fact that PostgreSQL could have replaced the second call with the results of the first. This seems counter-intuitive, but the real power of the IMMUTABLE keyword is not simply in the replacement structure, but in the parameter/result equivalence. It’s a fundamental misunderstanding that the database must prevent excess function calls; it’s allowed to, but not obligated, though eliminating them would be ideal. Let’s modify the code again to make the situation more explicit:

CREATE OR REPLACE FUNCTION f_do_nothing_imm(myvar INT)
RETURNS INT AS
$$
BEGIN
    RAISE NOTICE 'IMMUTABLE RAN!';
    RETURN myvar;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
 
CREATE OR REPLACE FUNCTION f_do_nothing_vol(myvar INT)
RETURNS INT AS
$$
BEGIN
    RAISE NOTICE 'VOLATILE RAN!';
    RETURN myvar;
END;
$$ LANGUAGE plpgsql VOLATILE;
 
SELECT 1
  FROM generate_series(1, 5) AS s (id)
 WHERE f_do_nothing_imm(10) = 10
   AND f_do_nothing_vol(10) = 10;

And the NOTICE output in this case is very telling:

NOTICE:  IMMUTABLE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  VOLATILE RAN!

While which portions of a query get optimized this way may seem somewhat arbitrary, the WHERE portion appears to have been reduced appropriately. It all comes down to how the planner decides to simplify the execution plan of the query, and the conditionals in the WHERE clause are much more likely to experience substitution.

Finally comes the STRICT decorator. Unlike the others, it isn’t about volatility at all: it tells the planner that a function called with any NULL parameter will simply return NULL. Again, this is an optimization where the planner can completely substitute a function call with NULL, thereby not only removing the function execution itself, but avoiding the requisite function parameter and return value mapping. Its nature also means it can be combined with the other keywords, like so:

CREATE OR REPLACE FUNCTION f_do_nothing(myvar INT)
RETURNS INT AS
$$
BEGIN
    RETURN myvar;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;
 
\timing ON
 
SELECT f_do_nothing(NULL)
  FROM generate_series(1, 1000000) AS s (id);

This time, even though the function is clearly marked VOLATILE, it doesn’t execute even once. The run time is merely the time required to generate and return one million values.
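
To see that the call really is skipped, we can repeat the NOTICE trick from before. A small sketch, using a throwaway function name:

CREATE OR REPLACE FUNCTION f_do_nothing_strict(myvar INT)
RETURNS INT AS
$$
BEGIN
    RAISE NOTICE 'STRICT RAN!';
    RETURN myvar;
END;
$$ LANGUAGE plpgsql STRICT;
 
SELECT f_do_nothing_strict(NULL);  -- no NOTICE; the result is simply NULL
SELECT f_do_nothing_strict(10);    -- NOTICE:  STRICT RAN!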

Now, regardless of the power these keywords provide, there are some important caveats that apply. When we used notices to view the call path of our functions, the planner did what we expected, and prevented extra calls to the immutable version. So let’s try that example one more time with a variable in place and examine those notices again:

SELECT 1
  FROM generate_series(1, 5) AS s (id)
 WHERE f_do_nothing_imm(s.id % 2) = s.id % 2
   AND f_do_nothing_vol(s.id % 2) = s.id % 2;
 
NOTICE:  VOLATILE RAN!
NOTICE:  IMMUTABLE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  IMMUTABLE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  IMMUTABLE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  IMMUTABLE RAN!
NOTICE:  VOLATILE RAN!
NOTICE:  IMMUTABLE RAN!

Doh! Though the immutable function should have only run twice, we see that it ran just as often as the volatile version. It would appear that the function declaration documentation clears this up (emphasis mine):

If this option is given, any call of the function with all-constant arguments can be immediately replaced with the function value.

The problem in this case is that the PostgreSQL planner treats variables as their worst-case scenario: any of an infinite number of values could be substituted here. Even though we know only two distinct values were passed to the function, PostgreSQL does not know that until execution time. Since the planner is what determines how IMMUTABLE affects execution, that keyword is effectively ignored for dynamic parameters.

With that said, the IMMUTABLE keyword exists beyond the context of optimization. In Set Theory, functions must be deterministic. Thus for indexes to operate as expected, functions must always return the same values when given the same inputs. Since PostgreSQL supports functional indexes, this means that only IMMUTABLE functions are candidates for providing indexable values.
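
As a quick illustration, here’s a functional index sketch using the built-in lower() function, which is marked IMMUTABLE, against the sys_product table from the earlier posts in this series. Attempting the same thing with our volatile function should be rejected with an error stating that functions in index expressions must be marked IMMUTABLE:

CREATE INDEX idx_product_name_lower
    ON sys_product (lower(prod_name));
 
-- This, by contrast, should fail:
-- CREATE INDEX idx_order_bad
--     ON sys_order (f_do_nothing_vol(product_id));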

This is also why the substitution process used by the planner seems arbitrary from an outside perspective; we’re not privy to the entire decision tree that generates the final execution plan. In many cases, either multiple function executions can not be reduced, or the cost and complexity of doing so would be computationally prohibitive.

In the end, it comes down to opportunity cost. By using these decorators, we give PostgreSQL as much information as possible to make decisions that affect the query execution plan. Consequently, we may not only reduce query run time, we also increase the number of functions that are candidates for table indexing. This last part is especially important, because such indexes do reduce the computational overhead of executing the function on the indexed value.

As the query planner improves over successive generations, these decorators only get more powerful. Try to use them where appropriate, and you can only improve the situation.



PG Phriday: Reducing Writes With Unlogged Tables

May 1st, 2015 | Published in Database, Tech Talk | 2 Comments


Last week, I covered how MVCC, PGDB’s storage system, works on a very high level. Near the end, I also mentioned that it doesn’t quite lend itself well to certain use cases, such as rapidly mutating session storage. Well, there is one caveat to that statement that I’d forgotten about because of its relatively limited utility: unlogged tables.

Here’s what the PGDB documentation has to say about unlogged tables:

Data written to unlogged tables is not written to the write-ahead log, which makes them considerably faster than ordinary tables. However, they are not crash-safe: an unlogged table is automatically truncated after a crash or unclean shutdown. The contents of an unlogged table are also not replicated to standby servers.

But what does this actually mean in the back end? That comes down to how databases interact with the underlying filesystem. In the case of PGDB, all writes to the database are actually first saved to a binary file that is decoupled from the actual data files. Then a periodic event called a checkpoint will commit those changes to the target table files. This way, crashes do not damage the database, the binary logs can be used to replicate to other instances, and backups can be rolled forward in time by applying transaction logs manually.

This is why unlogged tables have absolutely no crash protection, will not be copied to the replication stream, and get truncated if there’s any doubt regarding the contents. There’s no way to verify the data they contain, and they’ve been pruned from the very infrastructure that gives PGDB its durability. So what are they good for, then?

Let’s run a very short experiment. I commonly recommend any PostgreSQL server be set up with these three options in postgresql.conf:

wal_level = hot_standby
archive_mode = on
archive_command = 'exit 0'

Since archive mode can only be modified by restarting the server, it’s a good idea to just leave it enabled, but neutralized. With these settings, PGDB does not actually archive transaction logs, just as with the defaults. However, by modifying archive_command and reloading the config files, we can easily enable archival some time later without disrupting the database. Let’s create a junk drawer, set up a basic copy command, and reload:

mkdir /tmp/xlog

# Set archive_command in postgresql.conf
archive_command = 'cp -f %p /tmp/xlog/%f'

# Reload PGDB
sudo service postgresql reload

This will copy transaction log files which are no longer needed by the database (%p), to the directory we created in /tmp. Now let’s create two tables that might benefit from an unlogged approach:

\timing
 
CREATE UNLOGGED TABLE sys_session_unlogged
(
    session_id    SERIAL     NOT NULL,
    session_data  TEXT       NOT NULL,
    modified_dt   TIMESTAMP  NOT NULL DEFAULT now()
);
 
INSERT INTO sys_session_unlogged (session_data)
SELECT repeat(chr((random() * 1000)::INT % 94 + 33), 32)
  FROM generate_series(1, 5000000) a(id);
 
-- Pause here to check the /tmp/xlog directory for its contents.
 
CREATE TABLE sys_session
(
    session_id    SERIAL     NOT NULL,
    session_data  TEXT       NOT NULL,
    modified_dt   TIMESTAMP  NOT NULL DEFAULT now()
);
 
INSERT INTO sys_session (session_data)
SELECT repeat(chr((random() * 1000)::INT % 94 + 33), 32)
  FROM generate_series(1, 5000000) a(id);

Why five million rows? It acts as a great amplifier for timings, making any benefits more obvious. In my test VM, the unlogged version took about 16 seconds to load, while the regular table required around 20. What might be more important, however, is what happened in the /tmp/xlog directory. After filling the unlogged table, there were two files there, suggesting some behind-the-scenes accounting still takes place, and that the table is not entirely unrepresented in the transaction log stream. For some perspective, the regular table contributed 35.

Now, a 25% performance benefit is probably not worth discarding crash protection and other safeguards, but avoiding needless writes just might be. All of the other aspects of MVCC still apply, so I still wouldn’t suggest using something like this for user session tables, but what can it be used for? Perhaps as a replacement for temporary tables.

Temporary tables have one major shortcoming: they’re unusable by anything but the current connection. If an application is composed of several disjointed components that need to manipulate the same in-transit data, there is no option but to create a real table for this collaborative work. In this context, a database crash or other disruption is not a disaster, as the process can be restarted and the table rebuilt. With a temporary table, on the other hand, an application crash takes its connection, and all of its previously inserted data, down with it. Unlogged tables avoid both of these problems.

They also make a great dumping ground for all kinds of volatile data that doesn’t need long-term protection, such as a web user activity log. Such tables can collect a period of clicks, API calls, logins, and so on, before being analyzed and stored permanently in aggregate form at a much lower volume. Essentially, any use case that doesn’t need a bullet-proof durability guarantee is a good candidate for conversion to an unlogged table.
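
As a rough sketch of that pattern, imagine an unlogged click log that periodically gets rolled up into a durable summary table. The table names here are hypothetical, and activity_summary is assumed to already exist as a regular table:

CREATE UNLOGGED TABLE activity_log
(
    event_dt  TIMESTAMPTZ  NOT NULL DEFAULT now(),
    user_id   INT          NOT NULL,
    action    TEXT         NOT NULL
);
 
-- Periodically aggregate the raw events into the durable summary table,
-- then clear out the scratch data.
INSERT INTO activity_summary (summary_day, action, total)
SELECT date_trunc('day', event_dt), action, count(*)
  FROM activity_log
 GROUP BY 1, 2;
 
TRUNCATE activity_log;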

As a DBA, it’s actually somewhat difficult to view data in this respect. Purposefully creating a table and explicitly disabling data persistence seems both foolhardy and dangerous at first glance. But there are use cases, as mentioned above, where we don’t need to jealously loom over the table with a loaded shotgun, suspiciously glaring at anyone in the vicinity. Some data is simply meant to be regularly destroyed, so why not accommodate that aspect?

In fact, I have a developer or two to talk into incorporating these into a multi-phase ETL system. Wish me luck!



PG Phriday: Let’s Talk About Data Storage

April 24th, 2015 | Published in Database, Tech Talk | 1 Comment


As a DBA, I strive not to live in an isolated ivory tower, away from the developers that are destined to fill our databases with volumes of data from a myriad of applications. It’s important, I think, to occasionally come back to the basics. So I’d like to discuss one of the core elements that PGDB DBAs might be intimately familiar with, but comes up often enough that some clarification is warranted. How does PostgreSQL store data? The answer may surprise you, and is critical to understand before regularly interacting with a PGDB system.

The storage engine PGDB uses is called Multi Version Concurrency Control, or MVCC for short. There is excellent documentation of the subject, along with further resources that attempt to convey a thorough understanding, and succeed in that regard. I strongly encourage anyone who works with PostgreSQL in any context, to read these until they can recite the basics from memory. It is the very bedrock of the database.

But we need an executive summary, so let’s start with an extremely simple table with an id and generic column:

ID     Stuff
1₁     foo
2₁     bar
3₁     baz

Let’s say that some time in the future, we update ID 2. This would be the table at that point:

ID     Stuff
1₁     foo
2₁     bar
3₁     baz
2₁₈    moo

Notice that there are now two copies of the row for ID 2. How is it possible for PGDB to differentiate between them? Which one is “correct”?

To answer those two questions, I’ll need a short aside to explain transaction counters. PGDB keeps a counter that it assigns to all database activities, regardless of origin. Because these numbers can’t cycle endlessly, it currently wraps after two billion values have been used. It then applies an “age” algorithm to produce an abstract representation of data antiquity. For the purposes of this discussion, the subscript in the examples represents the transaction age.

Every action gets a transaction ID, and every transaction ID acts like a snapshot because it provides a stable comparison point. Even if you don’t explicitly call BEGIN TRANSACTION, each statement gets a transaction ID, and thus an age to determine visibility.

Now, take note of the subscript for the second row version in the above example. It indicates that the second row was inserted by transaction 18. If transaction 5 were still open for some reason, it would not see this row, and thus the old version is still valid within that context. But transaction 20 would see both, and use the newest version.
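
We can actually watch this happen by inspecting the hidden xmin system column, which stores the inserting transaction ID for each visible row version. A small sketch on a throwaway table (mvcc_demo is just an example name):

CREATE TABLE mvcc_demo (id INT, stuff TEXT);
INSERT INTO mvcc_demo VALUES (1, 'foo'), (2, 'bar'), (3, 'baz');
 
SELECT txid_current();  -- note the current transaction ID
 
UPDATE mvcc_demo SET stuff = 'moo' WHERE id = 2;
 
SELECT xmin, id, stuff FROM mvcc_demo;  -- the updated row carries a newer xmin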

This is how MVCC works, in a nutshell. The primary benefits here are related to blocking. Readers can’t block readers, as they’re not creating new row versions. Further, writers won’t block readers, as the old row version is still valid, and the reader would not see the new row version anyway due to transaction age. And naturally, readers can’t block writers, as reading data doesn’t inherently change its representation. A lot of databases have extremely complex and disruptive rules and locking systems to enforce these outcomes, but it’s just a natural part of MVCC.

Now imagine we update that row again:

ID     Stuff
1₁     foo
2₁     bar
3₁     baz
2₁₈    moo
2₄₂    pop

Now there are three versions of the same row. This will keep happening for every UPDATE that row endures. As you might imagine, this is extremely undesirable in the long term; a very active table that receives a high volume of updates would quickly bloat out of control. And you’d be right. It’s not uncommon for high transaction volumes to quickly render tables 90+% old junk data.

PGDB solves this problem with VACUUM. The VACUUM system is given a transaction ID like everything else, but it can also do basic visibility testing because it knows which transactions are still open within the system. It works by recycling any row, provided the transaction age is lower than its own, and it’s no longer visible to other transactions. After being vacuumed, our table looks like this:

ID     Stuff
1₁     foo
2₁     bar    (dead; slot now reusable)
3₁     baz
2₁₈    moo
2₄₂    pop

Because PGDB wants to reduce potential locking contention, the old row is not actually removed from the data file. Instead, its location is entered into a free space map. Any location in the free space map can be reused by new row versions. Let’s add a new row:

ID     Stuff
1₁     foo
5₉₄    fun
3₁     baz
2₁₈    moo
2₄₂    pop
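
If you’re curious what the free space map actually contains, the pg_freespacemap extension from the standard contrib package exposes it. A quick sketch against the throwaway mvcc_demo table from the earlier example:

CREATE EXTENSION pg_freespacemap;
 
-- Available space per page, in bytes.
SELECT * FROM pg_freespace('mvcc_demo');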

There are two major benefits to this approach:

  1. New row versions do not necessarily cause the table data files to grow.
  2. New row versions are not always inserted at the end of the table. This area is known as a write “hot zone” and depending on the underlying file system, can be a huge source of contention and write delays.

However, this means our maintenance can never fall behind, or we risk filling the file system with a bunch of old, useless, junk data. Table activity becomes inherently slower as bloat increases, adversely impacting performance in several different ways. Even worse, without regular vacuums, old transaction IDs are never marked as reusable, and the wraparound I mentioned earlier becomes a real threat. If that is allowed to happen, the database will eventually shut down, since it can no longer generate unique transaction IDs for versioning data.

I won’t get into the automatic vacuum system that generally addresses this, but I can say that it’s not a complete solution. All vacuums function by reading table data files and maintaining the visibility map. This activity is necessarily limited by disk performance, and various internal settings that prevent vacuums from consuming all available disk throughput. Due to these limitations, tables that receive a high volume of updates will always be larger than their ideal size—sometimes much larger.
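
This is why monitoring matters. As a rough starting point, queries along these lines show which tables carry the most dead tuples and how far each has drifted toward the wraparound horizon:

-- Tables carrying the most dead tuples, a rough bloat indicator.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
  FROM pg_stat_user_tables
 ORDER BY n_dead_tup DESC
 LIMIT 10;
 
-- How far each table has drifted toward the transaction ID horizon.
SELECT relname, age(relfrozenxid)
  FROM pg_class
 WHERE relkind = 'r'
 ORDER BY age(relfrozenxid) DESC
 LIMIT 10;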

And this is where application developers come in. I ask that you think about how MVCC works, and consider how you develop a program that needs to store data. MVCC does not lend itself well to certain use cases, such as session tracking where each page load causes an update in the underlying table. There are better approaches for storing fungible session states than a persistent row in the database.

Most scenarios are fully suitable to how PGDB stores data. It’s a very powerful and capable database engine with far fewer locking problems than other systems. But we pay for that flexibility with extra maintenance concerns, and considerations for avoiding excessive update volume. Certainly, PGDB will function without specifically addressing these points, but it works better when everyone along the chain knows how to circumvent inherent limitations.

Know your tools, and they will rarely disappoint you.


