PG Phriday: 10 Ways to Ruin Performance: In The Loop

June 26th, 2015 | Published in Database, Tech Talk | 1 Comment


As a database, PGDB (PostgreSQL) is fairly standard in its use of SQL. Developers of all colors, however, might have trouble switching gears and thinking in set operations, since so many language constructs focus on conditionals and looping. Last week in the performance pitfalls series, we discussed a bit of set theory, and how ignorance of its implications can be disastrous. But what about the more mundane?

What happens, for instance, when we treat a database like a programming language?

This time, the example tables will emulate a pretty terrible ordering system with a user table of 1000 users, and a million orders distributed over roughly the last three years. Here it is, in all its glory:

CREATE TABLE sys_user
(
    user_id      SERIAL       NOT NULL,
    username     VARCHAR      NOT NULL,
    password     VARCHAR      NOT NULL,
    last_order   TIMESTAMPTZ  NULL,
    created_dt   TIMESTAMPTZ  NOT NULL DEFAULT now(),
    modified_dt  TIMESTAMPTZ  NOT NULL DEFAULT now()
);
 
INSERT INTO sys_user (username, password, created_dt, modified_dt)
SELECT 'user' || a.id,
       md5('use-bcrypt-instead' || 'user' || a.id || 'somepassword'),
       now() - (a.id % 1000 || 'd')::INTERVAL,
       now() - (a.id % 100 || 'd')::INTERVAL
  FROM generate_series(1, 1000) a(id);
 
ALTER TABLE sys_user ADD CONSTRAINT pk_user_id
      PRIMARY KEY (user_id);
 
ANALYZE sys_user;
 
CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    product_id   INT          NOT NULL,
    user_id      INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);
 
INSERT INTO sys_order (product_id, item_count, user_id, order_dt, valid_dt)
SELECT (a.id % 100000) + 1, (a.id % 100) + 1, (a.id % 1000) + 1,
       now() - (a.id % 1000 || 'd')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 1000000) a(id);
 
ALTER TABLE sys_order ADD CONSTRAINT pk_order_id
      PRIMARY KEY (order_id);
 
CREATE INDEX idx_order_order_dt
    ON sys_order (order_dt DESC);
 
ANALYZE sys_order;

With that out of the way, please note that I do not now, nor will I ever, advocate using md5 for passwords. Just don’t. Use crypt (bcrypt) from the pgcrypto extension instead. Note that even in this woefully simplified schema, I use a rudimentary salt. I’m not a security analyst, so going further is beyond the scope of this article. But if you plan on writing any kind of secured application, read up on proper cryptography protocols and two-factor authentication. Never, never, never throw an authentication system together, or you will be hacked and every password in your system will be compromised. Just ask LinkedIn.
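
For the curious, here’s roughly what the pgcrypto approach looks like against the example schema above. It’s only a sketch, and the username and password values are purely illustrative, not a complete authentication design:

-- pgcrypto provides crypt() and gen_salt()
CREATE EXTENSION IF NOT EXISTS pgcrypto;
 
-- Store a bcrypt hash; gen_salt('bf') produces a fresh salt per call.
INSERT INTO sys_user (username, password)
VALUES ('some_user', crypt('somepassword', gen_salt('bf', 10)));
 
-- Verify a login attempt; crypt() reuses the salt embedded in the stored hash.
SELECT 1
  FROM sys_user
 WHERE username = 'some_user'
   AND password = crypt('somepassword', password);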

Going back to our example, imagine we want to keep some denormalized information about the user in the user table itself. There’s a batch job that updates the user table and maintains the last time the user ordered something. The job does something like this:

\timing ON
 
DO $$
DECLARE
    i INT;
BEGIN
    FOR i IN SELECT user_id
               FROM sys_order
              WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
    LOOP
      UPDATE sys_user
         SET last_order = now()
       WHERE user_id = i;
    END LOOP;
END
$$ LANGUAGE plpgsql;
 
Time: 4107.704 ms

“Contrived,” you scream! And you’re partially right. Unfortunately, this kind of scenario happens more often than you might expect. This is actually a modified example I caught in a system I was auditing two years ago, and its execution time was on the order of hours. In that scenario, there was also the language and network overhead to consider. It wasn’t a PGDB DO loop running directly in the database, but some Groovy code that was operating on every result from another query.

So while this example is clearly contrived, that situation definitely wasn’t. Part of the reason I began my education crusade within our company was to prevent such things from ever occurring. Which brings me to the alternative: anything else. Almost literally any other database-focused solution will work better than a language looping construct.

I’ve specifically advocated against using IN syntax in many contexts, but look what happens when we convert the loop to operating on the set of results in the loop query:

\timing ON
 
UPDATE sys_user
   SET last_order = now()
 WHERE user_id IN (
         SELECT user_id
           FROM sys_order o
          WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
       );
 
UPDATE 32
Time: 17.466 ms

This particular example is almost 250x faster, but I’ve seen instances where the duration differs by five or even six orders of magnitude. Remember that example I mentioned that used to run for hours? The post-modification execution time was fifteen seconds. If we had further modified the update job to use a temporary table, it would have been five seconds.
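
For reference, the temporary table approach I’m describing would look something like this against our example schema. The table name is arbitrary and the real system was far messier, but the staging idea is the same:

-- Stage the distinct users with recent orders, then update against that set.
CREATE TEMP TABLE tmp_recent_users AS
SELECT DISTINCT user_id
  FROM sys_order
 WHERE order_dt > CURRENT_DATE - INTERVAL '1 month';
 
ANALYZE tmp_recent_users;
 
UPDATE sys_user u
   SET last_order = now()
  FROM tmp_recent_users t
 WHERE t.user_id = u.user_id;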

It all depends, of course, on the number of records involved, the speed of the underlying disks, the health and correlation of underlying indexes and data heap, and a lot of other factors. The lesson here is, as with the last article, to try and think in sets and groups, and perform operations on the whole collection of data at once.

But there’s yet another way I suggest using that’s even simpler. PGDB has UPDATE ... FROM syntax that lets you combine the two statements into a single operation that resembles a standard JOIN. Here’s what it looks like using our example:

\timing ON
 
UPDATE sys_user u
   SET last_order = now()
  FROM sys_order o
 WHERE o.user_id = u.user_id
   AND o.order_dt > CURRENT_DATE - INTERVAL '1 month';
 
UPDATE 32
Time: 40.822 ms

Anything you can do in a SELECT can be translated into this syntax. And like the IN example, it’s compatible with EXPLAIN so we can investigate how many records are affected before executing it. But why is it slower? In this particular instance, the reason is the number of orders that occurred in the last month. Because of the data distribution, there are 32,000 orders from 32 customers. Unfortunately the planner sees the 32,000 first, and doesn’t know it’s only 32 distinct values, so it performs a sequence scan on the user table, inflating the overall execution time.

That won’t always be the case, and regardless, it’s still much faster than the alternative. The primary benefit from this syntax is its simplicity. If you can write a JOIN, you can use UPDATE ... FROM, and not have to worry about properly formatting or vetting a subselect.
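
Since the statement works with EXPLAIN, previewing the plan and its row estimates before running anything destructive is just a matter of prepending the keyword. I’ll leave out the output here, since it will vary from system to system:

EXPLAIN
UPDATE sys_user u
   SET last_order = now()
  FROM sys_order o
 WHERE o.user_id = u.user_id
   AND o.order_dt > CURRENT_DATE - INTERVAL '1 month';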

But wait, there’s more! Examples like this are too straightforward. Sometimes a lot of preliminary work is required before the final UPDATE. Sometimes this means using several intermediate temporary tables, or as before, building a program loop to perform several supplementary tasks and issuing individual updates.

With PGDB Common Table Expression (CTE) syntax, those prerequisite steps can be done inline. Check this out:

\timing ON
 
WITH ord AS
(
    SELECT DISTINCT user_id
      FROM sys_order
     WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
)
UPDATE sys_user u
   SET modified_dt = now()
  FROM ord
 WHERE u.user_id = ord.user_id;
 
UPDATE 32
Time: 19.849 ms

We’ve now done two things.

  1. Brought execution time back in line with the IN case.
  2. Broken the UPDATE into two distinct steps.

PGDB CTEs act as an optimization fence, because each segment is physically materialized as a temporary object. This object has very real dimensions, and the results can be fed into further CTE steps, or used as in our example, directly in a final query. Here’s how adding a second step might look:

\timing ON
 
WITH ord AS (
    SELECT DISTINCT user_id
      FROM sys_order
     WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
),
good AS (
    SELECT user_id
      FROM ord
     WHERE user_id % 3 = 0
)
UPDATE sys_user u
   SET modified_dt = now()
  FROM good
 WHERE u.user_id = good.user_id;
 
UPDATE 10
Time: 19.741 ms

Now the example really is contrived, because we could have just added the extra WHERE clause to the first step. But that’s because our example is hilariously basic. In a real system, there would be dozens or even hundreds of other tables, and a lot more potential for subsequent calculations, joins, and so on.
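
For the record, collapsing both steps into one would look like this:

WITH ord AS (
    SELECT DISTINCT user_id
      FROM sys_order
     WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
       AND user_id % 3 = 0
)
UPDATE sys_user u
   SET modified_dt = now()
  FROM ord
 WHERE u.user_id = ord.user_id;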

The point here isn’t that the second step was pointless, but that it was possible at all. This is what happens when you think in sets. Each result set is something that can be reduced or combined with another result set, and the combination can be used for further operations. Imagine it as object-oriented programming; each set is a blob that should be considered one entity with various attributes.

I’ll end with this bit of Python that should make it obvious what I mean:

# This is set theory:
 
foo = [1, 2, 3, 3, 3, 4]
print(foo.count(3))
 
# This isn't:
 
total = 0
for i in foo:
  if i == 3:
    total += 1
 
print(total)
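
And just to tie it back to the database, the set-based version of that same count is a single aggregate over the whole collection. A quick sketch in SQL:

SELECT count(*)
  FROM unnest(ARRAY[1, 2, 3, 3, 3, 4]) vals(n)
 WHERE n = 3;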


PGCon 2015 Unconference: A Community

June 23rd, 2015 | Published in Database, Tech Talk | No Comments


Well, I’ve returned from PGCon 2015 in Canada, and after a couple of days to decompress, it’s time to share. I wrote about the PGCon 2014 unconference after returning to Chicago last year, so I felt it was only fitting that I start there. I feel as strongly now as I did a year ago that directly interacting with the PostgreSQL maintainers at this level helps the community thrive. Even though PGCon is generally a more developer-focused conference, being able to brainstorm with the bigwigs, even if nothing comes of it, means the ideas have been given a fair shake.

The format this year was quite a bit different than last year. I attribute this to Josh Berkus once again, in what I see as his attempt to formalize the process and give it more weight. Indeed, it’s hard to argue with his results. Just take a look at the 2015 Unconference Wiki page. It’s a deluge of information I wish we had about the 2014 talks, from attendees and talk summaries, to relevant external links and the all-important schedule. I’m a bit biased in that regard because I tend to produce and consume vast swaths of excessive information, but it’s an excellent reflection on how much the PostgreSQL community values documentation in general.

Unfortunately due to inclement weather, I missed the voting process and the first day of talks entirely. I desperately missed watching the talk selection process, though Josh said they did a lot of that electronically because several people would be late to the conference. I’m not sure how I missed that, so I’ll blame email; it deserves it anyway. Regardless, after witnessing the 2014 talk selection, I stand by my earlier assertion that it’s a sight to behold. It warms my crooked old heart to watch people so excited about technology and the amicable interaction they have with the friendly developers.

Despite the setbacks, I did attend several chats on Wednesday, which is another departure from last year’s format. In 2014, the selection process and every talk were constrained to one long Saturday, and the whole thing was very ad-hoc. I can’t tell whether or not distributing everything across two days is an improvement, but it certainly worked in favor of anyone delayed by rain this year. And the information available on Wednesday was certainly copious. Those at the same sessions I watched got a summary of semiconductor storage variants, PostgreSQL and Docker, the possible future of horizontal scaling, how pg_shard is progressing, and how partitioning might be better integrated into the core (my personal favorite). All told, it was a good assortment and most of them were fairly informative.

Through all of these audiences, something felt different from before, and it took me a while to figure out what it was: spontaneity and interactivity. Every session I attended, barring Josh’s own topic on Docker, had slides. It’s extremely difficult to have community discussion or collaboration when there’s a single speaker pushing his or her own agenda. From what I saw this year, the majority of the Unconference was plagued by this propensity to allow the speaker an open forum, as if they had won a bonus PGCon slot. I get the impression that was not the intention of the format, and I’ve heard slides might be disallowed next year. If that’s the case, that’s an impressively prompt response, and suggests I wasn’t the only one who voiced the concern.

I also suspect the unconference was harmed by holding it before the primary conference itself. I think part of what made everything work last year was that the unconference was a response to the primary talks. Once we had a chance to learn about the current direction of PostgreSQL, potential upcoming features, extensions, modules, kitchen sinks, and so on, everyone could discuss inherent implications. We bootstrapped our brainstorming with all of the exciting information presented in the conference, and organically produced further discussions based on it. Without that, the unconference was just… more talks, and unnecessarily covered some overlapping information. It’s impossible to know everything a speaker will include in a topic, so we were operating blindly in most cases. I think that might have contributed to the passive audiences; there was no direction to work with.

And that really is the challenge, isn’t it? Is it possible to wrangle a group of developers and DBAs into a room and encourage semi-focused conversations without allowing it to devolve into anecdotes and speculation? Yes! But can it be done without injecting a ringmaster into each session to ensure there’s some kind of convergence on the stated topic? I definitely don’t want the unconference to become the equivalent of a moderated internet forum, because that cheapens its impact. This is in many respects a once-in-a-lifetime opportunity for many attendees: to be part of PostgreSQL’s future. I’d hate for anyone to miss it because they didn’t want to listen to a bunch of loose squabbling.

And that’s why I have really high hopes for next year. The trick with anything is to find balance. The first attempt revealed the potential of the unconference format, the second gave it some necessary structure but may have overreached slightly, and the third will combine everything into a cohesive amalgam everyone can love. These people are all geniuses, and I have every bit of faith they’ll obliterate my tiny, little mind with something I never even considered. That’s what they do, it’s who they are.

Me? I just try to pay attention so I don’t miss anything. Join one of the mailing lists so you can, too. Also, don’t be afraid to attend a PostgreSQL User Group in your area. It’s not uncommon to see a committer there depending on the location; you don’t always have to attend a conference to bask in the glow of these luminous god creatures!



PG Phriday: 10 Ways to Ruin Performance: Functionally Bankrupt

June 19th, 2015 | Published in Database, Tech Talk | 4 Comments


Functions are great. Having cut my teeth on a database that didn’t even provide the ability to define functions, I’ve come to almost take them for granted in PGDB (PostgreSQL). However, with this kind of ubiquity, sometimes they can be overused in ways that don’t seem to be part of the common programmer lexicon. In this week’s PG Phriday series on performance-killing missteps, I’m going to talk a bit about set theory, and how a certain amount of familiarity is necessary to properly interact with a database.

Borrowing loosely from set theory: f(x) = y, and in general, x !~ y. That is, a function applied to some value produces a value that may not be equivalent to the original. Put another way, a spayed pet does not have the same reproductive capacity as a regular pet, and is thus not equal to the original. In the context of a database, this is extremely relevant because equivalence is what makes calculations and restrictions possible.

Given this, consider what a database index does. When creating an index, the database takes the value of one or more columns and essentially builds a pointer tree to quickly locate a record in logarithmic time. This is much faster than reading every record in a table and applying a filter to remove unwanted records. That’s something even novice developers tend to know. But they don’t know how functions modify this scenario.

Given the very simplified introduction to set theory, applying a function to the column value means the database must discard equality. Remember: x !~ y. The database has indexed x, not y. So when a function is used in a WHERE clause, any index on x will be ignored. This makes perfect sense, considering there are an infinite number of possible functions, and the output of each is indeterminate. It’s not possible to predict the end result of every possible function.

To further illustrate this, we’ll reuse one of our tried-and-true examples:

CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);
 
INSERT INTO sys_order (product_id, item_count, order_dt, valid_dt)
SELECT (a.id % 100000) + 1, (a.id % 100) + 1,
       now() - (id % 1000 || 'd')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 1000000) a(id);
 
ALTER TABLE sys_order ADD CONSTRAINT pk_order_id
      PRIMARY KEY (order_id);
 
CREATE INDEX idx_order_order_dt
    ON sys_order (order_dt DESC);

Pay special attention to the index on order_dt. I’ve applied a modifier that builds the index in descending order, as it’s very common to retrieve the newest records of a table. This doesn’t actually affect the values being indexed, and is a very handy feature of PGDB. We’re also going to try and use this index, because we don’t want to search through one-million records any time we want to get a few recent data points.

Here’s a query that will count the number of orders yesterday:

EXPLAIN ANALYZE
SELECT COUNT(1)
  FROM sys_order
 WHERE date_trunc('day', order_dt) = CURRENT_DATE - INTERVAL '1 day';
 
                             QUERY PLAN
--------------------------------------------------------------------
 Aggregate  (cost=29865.50..29865.51 rows=1 width=0)
            (actual time=989.773..989.774 rows=1 loops=1)
   ->  Seq Scan on sys_order
         (cost=0.00..29853.00 rows=5000 width=0)
         (actual time=0.021..988.407 rows=1000 loops=1)
         Filter: (date_trunc('day'::text, order_dt) =
                 (('now'::cstring)::date - '1 day'::interval))
         Rows Removed by Filter: 999000

 Planning time: 0.120 ms
 Execution time: 989.810 ms

From this output, we can see that PGDB ignored the index on order_dt and did exactly what we didn’t want. Instead of using an index to jump to the relevant values, it scanned the entire table and filtered out the values that didn’t apply. Yet the implications are actually much worse than that. The date_trunc function, even though it’s written in C, is not a free operation. So not only have we unnecessarily read a million rows, we applied a function to each and every row, just to see if the result is within the boundary we specified. And my example fits in memory; the situation degrades dramatically when that isn’t possible.

That’s insanely expensive, and our relatively small million-row table illustrates that well enough. Imagine the same operation on a table with 100M rows or more. So we lose twice: disk resources are wasted retrieving unwanted rows, and CPU time is consumed performing unnecessary function executions. Given how many times I’ve encountered this at several unrelated companies, it’s a pretty serious problem. Here’s how the query should look:

EXPLAIN ANALYZE
SELECT COUNT(1)
  FROM sys_order
 WHERE order_dt >= CURRENT_DATE - INTERVAL '1 day'
   AND order_dt < CURRENT_DATE;
 
                             QUERY PLAN
--------------------------------------------------------------------
 Aggregate  (cost=2564.11..2564.12 rows=1 width=0)

 [ snip ]

 ->  Bitmap Index Scan on idx_order_order_dt
       (cost=0.00..21.38 rows=894 width=0)
       (actual time=0.381..0.381 rows=1000 loops=1)
       Index Cond: ((order_dt >=
                     (('now'::cstring)::date - '1 day'::interval))
                 AND (order_dt < ('now'::cstring)::date))

 Planning time: 0.211 ms
 Execution time: 5.855 ms

The overall execution plan is slightly more complicated since an index is involved now, but note the execution time: it’s nearly 170 times faster than the original. All we did was modify the query to use a range that includes all the possible date and time combinations for yesterday. We needed to do that for the same reason we’d tried date_trunc previously, but the end result is the same. The only difference is that this allows the database to use a value range scan on the index and obtain all matching rows immediately.

If you’ve fallen into this trap, don’t feel bad. I’ve seen everyone do this at least once. From newbies right out of college, to highly seasoned technical leads, and even including other DBAs, there doesn’t seem to be any discernible pattern. It’s too easy to frame a query without considering the underlying mechanisms that make everything work. I also want to point out that since PGDB supports functional indexes, it’s also possible to do something like this:

CREATE INDEX idx_order_order_dt_day
    ON sys_order (date_trunc('day', order_dt));

In this case, we’re simply indexing the resulting value of f(x), so as long as the function call is the same in any WHERE clauses, the index will be used. To PGDB, it’s all the same. If 99% of the development staff, the application itself, and stray dogs are all using a function instead of doing it the “right” way, it’s actually the DBA that is going against the tide.

The only reason I don’t tend to recommend this pattern is that the functional index is generally too restrictive. What if we wanted to search for a four-hour window during yesterday? Well, now we can’t, because the index is only relevant for the date information. The simple index case is applicable to far more potential scenarios, and in the database world, index versatility is extremely important. I try to ensure developers I work with are cognizant of the pitfalls of arbitrarily using functions to filter results.
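
With the plain index on order_dt, that narrower search is just another range, and the same index still applies. A quick sketch, assuming we wanted noon to 4pm yesterday:

SELECT COUNT(1)
  FROM sys_order
 WHERE order_dt >= CURRENT_DATE - INTERVAL '1 day' + INTERVAL '12 hours'
   AND order_dt < CURRENT_DATE - INTERVAL '1 day' + INTERVAL '16 hours';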

After all, it’s better to learn these things preemptively!



PG Phriday: 10 Ways to Ruin Performance: Out Of Order

June 12th, 2015 | Published in Database, Tech Talk | 1 Comment


There are a lot of database engines out there. As such, a developer or DBA will naturally have varying levels of experience with each, and some of this might conflict with how PGDB (PostgreSQL) operates. These kinds of embedded misunderstandings can cause potential issues by themselves, but in this particular case, corrective action is fairly simple.

So this week, I’d like to talk about indexes. Many people treat them as a “make query faster” button, and this often results in egregious misuse. Indexes take space, require CPU resources to maintain, and bring overhead that adversely affects INSERT and UPDATE performance. Most know these drawbacks and generally acknowledge that too many indexes can be detrimental. However, with some databases, even column order can make a drastic difference. PGDB is among these.

To illustrate the point, let’s make a simple test case of one-million records in an inventory tracking table of one thousand products over the course of about three years.

CREATE TABLE sys_inventory_snapshot
(
    product_id   SERIAL       NOT NULL,
    record_dt    TIMESTAMPTZ  NOT NULL,
    item_count   INT          NOT NULL,
    order_count  INT          NOT NULL
);
 
INSERT INTO sys_inventory_snapshot (
       product_id, item_count, order_count, record_dt
)
SELECT b.id, a.id, a.id,
       now() - (a.id || 'd')::INTERVAL
  FROM generate_series(1, 1000) a(id),
       generate_series(1, 1000) b(id);
 
ALTER TABLE sys_inventory_snapshot ADD CONSTRAINT pk_inventory_snapshot
      PRIMARY KEY (product_id, record_dt);

At first glance, this looks pretty good. If we want information based on date or product, it’s right there. Things are even better if we have both values available! The query plan is also encouraging:

EXPLAIN ANALYZE
SELECT *
  FROM sys_inventory_snapshot
 WHERE record_dt >= CURRENT_DATE - INTERVAL '1 day';
 
                            QUERY PLAN                             
--------------------------------------------------------------------
 Bitmap Heap Scan on sys_inventory_snapshot
        (cost=22912.60..24880.07 rows=674 width=20)
        (actual time=106.051..106.530 rows=1000 loops=1)
   Recheck Cond: (record_dt >= (('now'::cstring)::date - '1 day'::interval))
   Heap Blocks: exact=7
   ->  Bitmap Index Scan on pk_inventory_snapshot
         (cost=0.00..22912.43 rows=674 width=0)
         (actual time=106.027..106.027 rows=1000 loops=1)
         Index Cond: (record_dt >= (('now'::cstring)::date - '1 day'::interval))

 Planning time: 0.115 ms
 Execution time: 106.988 ms

Look at that! It’s using our date constraint to grab the most recent rows using the primary key for the table. That’s good, right?

Well, no. What’s actually happened here is somewhat esoteric, and the only real clue we have that something isn’t right is the execution time. We just spent 100ms fetching 1000 records, which is downright offensive to an experienced DBA. Imagine this query in an application that’s executing it hundreds of times per second from multiple threads. In the database world, 100ms is basically forever. This table and query do not warrant such inexcusably bad performance.

So what happened? Buckets. Indexes take one value and point it to a bucket of other values. In the case of a composite index, one bucket value points to another bucket, which contains the values which satisfy both constraints. Now consider that record_dt is the second bucket. In order to ensure we get the most recent data for all products, we have to visit all of the top-level buckets before we can continue to the most recent date for each.

See the problem yet? Instead of simply jumping straight to the first applicable date value and retrieving the entire bucket, we have to visit 1000 buckets and then continue to 1000 smaller buckets. This would be a fine access pattern if we had both the product_id and record_dt in the query, since that jumps straight to the values we want. But for everything else, it’s a non-ideal solution. It might be faster than a sequence scan if we’re lucky, but that’s not a guarantee. In some scenarios, it’s actually much slower.
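
To illustrate the good case, a query that supplies both columns can walk the composite index directly. The product ID here is arbitrary:

SELECT *
  FROM sys_inventory_snapshot
 WHERE product_id = 42
   AND record_dt >= CURRENT_DATE - INTERVAL '1 day';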

PGDB veterans tend to solve this one of two ways:

  1. Reverse the column order of the index. If it turns out that 90% of queries use both columns, and 10% only use the second column, it’s better to flip them. The primary key in this case still fills its role, but without impacting performance of date searches. A sketch of what that would look like follows this list.
  2. Add another index to represent the second column. This happens when the first column is actually used in exclusion or combination most of the time, but queries using only the second column are frequent enough to cause concern.
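
Here’s a rough sketch of what the first approach would involve for our table. Rebuilding a primary key on a large production table is not a casual operation, so treat this as illustration only:

ALTER TABLE sys_inventory_snapshot
      DROP CONSTRAINT pk_inventory_snapshot;
 
ALTER TABLE sys_inventory_snapshot ADD CONSTRAINT pk_inventory_snapshot
      PRIMARY KEY (record_dt, product_id);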

Because I’m lazy and don’t want to redefine the primary key, let’s take the second approach:

CREATE INDEX idx_inventory_snapshot_record_dt
    ON sys_inventory_snapshot (record_dt);
 
EXPLAIN ANALYZE
SELECT *
  FROM sys_inventory_snapshot
 WHERE record_dt >= CURRENT_DATE - INTERVAL '1 day';
 
                            QUERY PLAN                             
--------------------------------------------------------------------
 Index Scan using idx_inventory_snapshot_record_dt
    on sys_inventory_snapshot
       (cost=0.43..28.24 rows=674 width=20)
       (actual time=0.038..0.663 rows=1000 loops=1)
   Index Cond: (record_dt >= (('now'::cstring)::date - '1 day'::interval))

 Planning time: 0.238 ms
 Execution time: 1.087 ms

That’s… quite a difference. If you were unfamiliar with databases or PGDB, you might be surprised by this result. But the reality is quite clear: simply listing a column in an index does not magically improve performance of queries that reference it. Order matters. In addition, the further a column strays from the principal position, the worse the degradation becomes.

I’ve seen indexes with six columns, and then a baffled developer asks why his query is slow. Well, his query only used the fourth column in the list. At that point, reading the entire 30-million row table and filtering for desired results would have been faster. The planner tries to account for this, but sometimes statistics or settings lead it astray, and it performs an index scan and dutifully follows all those buckets down the chain to the fourth column. It happens.

Now that you know why, you’re armed to avoid contributing to the problem. Think about indexes before adding them. Consider existing indexes and queries against their parent table, and ensure that column order faithfully represents the real access pattern. It’s an important—and often missed—optimization technique.

As a side note, this is why serious companies have at least one DBA on staff. Designing table structures and optimizing queries are hard enough, but invisible killers like these are more prevalent than one might realize. An experienced database-driven application developer might be familiar with these kinds of minutiae, but that’s a lot to expect. Not everyone reads database performance articles, or has encountered and addressed related problems.

Until next week!



Review: Learning Heroku Postgres

June 6th, 2015 | Published in Book, Review | No Comments


I recently got the opportunity to take a look at Learning Heroku Postgres, a new book by Patrick Espake that seems intended to help new PostgreSQL database administrators get their data into the cloud. The chapters are short, concise, and the questionnaires at the end are a nice touch. But does it hit the mark? Almost.

Before I get too far into this review, I should point out that Heroku is a proprietary service that presents a modular deployment system for various programming languages, applications, administration, monitoring, and other related services. Though there are free hobby-level instances for most modules, it is a commercial platform that provides SaaS (Software as a Service) across multiple geographic locations. If you intend to leverage it properly, I recommend the hobby-level instances only for experimentation.

Even so, this is very much a book that benefits the PostgreSQL community. Small and large businesses often have trouble distributing data and applications in a high availability environment, and as such, Heroku is a potential solution for quick-and-dirty scalability at a reasonable cost.

The book itself is essentially broken down into three major parts: Tooling, Basics, and Extras. Though this is not explicitly defined by the chapter overview, this is the way it reads. This is somewhat important, because it allows a bit of skipping around for users who are already familiar with PostgreSQL, Heroku, or both.

The adventure begins with a couple short chapters on how Heroku itself is organized, and acquiring Heroku command-line tools for managing account features. Here, Espake presents a good bird’s-eye view of Heroku’s deployment infrastructure and configuration, and spends time discussing just how everything is decoupled and bound together by queues so the elastic infrastructure accurately represents the intention of the user. This is critical, as understanding the underlying landscape can (and should) directly influence development cycles, since readers must account for Heroku’s quirks while organizing their application and associated data.

From here, the discussion naturally moves to PostgreSQL itself in chapter three. Espake makes it clear that PostgreSQL is managed as a fully automated solution, behind a thick wall of tools, interfaces, and somewhat limited management commands. This is one of the most important chapters, as it effectively lays out all of the necessary commands for synchronizing data and accessing the database itself for more direct manipulation with SQL clients, languages, and drivers. Afterwards in chapter four, he addresses the topic of backups and how to secure and obtain them. Both of these chapters combine to give a reader control over how Heroku represents their data, and the means to secure it from loss.

Chapter five is something of an oddity. Espake introduces Heroku dataclips as a method for sharing data without talking about the reality of what they are: versioned views with an exposure API. This is the first time I got the impression that this book is more of a usage manual than a true learning resource. Yes it is important to show how data can be shared or downloaded with this feature, but after the introduction in chapter one regarding Heroku’s operation, I found this omission particularly odd. Given how dataclips work, they could be combined with views for easier overall data management, and yet this option is never presented.

Chapter six moves on to instance management. By this, I mean various uses for database replicas, such as forking, failover, and replacing the current database with a previous version. All the necessary commands and GUI options are here to make juggling multiple copies of the database easier. But again I see wasted opportunity. Heroku considers ‘rollback’ the act of replacing the primary instance with a previous backup instance. The fact that this directly conflicts with the concept of a transaction rollback is never discussed. Nor are database followers equated with PostgreSQL streaming replication, the mechanism that’s probably behind the feature. I wish Espake spent more time explaining how things work, instead of just providing instructions. After all, that kind of information is probably available in Heroku’s documentation; this book should provide a deeper understanding the user can leverage toward a better PostgreSQL cluster.

The last two chapters tie up most of the remaining loose ends by covering logs and various PostgreSQL-specific extensions available on the Heroku platform. Chapter eight in particular is a laundry list of PostgreSQL extensions generally available within the contribution libraries commonly distributed with the PostgreSQL code or binaries. It’s a good resource for users unfamiliar with this functionality, and further links are provided where necessary so the reader can explore, should that feature be relevant. While not really a feature of Heroku, or even especially relevant since most PostgreSQL distributions include them anyway, extensions are part of what make PostgreSQL so powerful, so I’ll allow it.

In the end, the book adequately covers numerous Heroku commands and interface elements. I wish the author spent more time talking about how some of Heroku’s terminology conflicts with common database concepts. For example, Heroku’s idea of 'promote' isn’t quite what a seasoned database administrator would recognize. Allowing a new user to absorb this interpretation without caveat could lead to conceptual issues in the future. This happens often, unfortunately, as I’d already mentioned regarding rollback. From chapter four onward, the book is organized like a manual, as if it were written by an employee of Heroku, treating PostgreSQL as a mere Heroku module that needed a checklist of feature documentation. There’s a reason this book is so short!

Still, it’s a good way to bootstrap a Heroku deployment of PostgreSQL. If there aren’t more comprehensive books on integrating the two, there probably will be in the near future. Wait for these if you really want to delve into a Heroku deployment; for a newbie, you can’t go wrong here.


