PG Phriday: 10 Ways to Ruin Performance: Sex Offenders

July 24th, 2015 | Published in Database, Tech Talk | 3 Comments


We’re finally at the end of the 10-part PGDB (PostgreSQL) performance series I use to initiate new developers into the database world. To that end, we’re going to discuss something that affects everyone at one point or another: index criteria. Or to put it another way:

Why isn’t the database using an index?

It’s a fairly innocuous question, but one that may have a surprising answer: the index was created using erroneous assumptions. Let’s explore what happens in a hospital environment with a pared-down table of patients.

DROP TABLE IF EXISTS sys_patient;
 
CREATE TABLE sys_patient
(
    patient_id  SERIAL   NOT NULL,
    full_name   VARCHAR  NOT NULL,
    birth_dt    DATE     NOT NULL,
    sex         CHAR     NOT NULL
);
 
INSERT INTO sys_patient (full_name, birth_dt, sex)
SELECT 'Crazy Person ' || a.id,
       CURRENT_DATE - (a.id % 100 || 'y')::INTERVAL
                    + (a.id % 365 || 'd')::INTERVAL,
       CASE WHEN a.id % 2 = 0 THEN 'M' ELSE 'F' END
  FROM generate_series(1, 1000000) a(id);
 
ALTER TABLE sys_patient ADD CONSTRAINT pk_patient_id
      PRIMARY KEY (patient_id);
 
CREATE INDEX idx_patient_birth_dt ON sys_patient (birth_dt);
CREATE INDEX idx_patient_sex ON sys_patient (sex);
 
ANALYZE sys_patient;

This particular hospital has a few queries that operate based on the sex of the patient, so someone created an index on that column. One day, another developer is doing some code refactoring and, being well-trained by the resident DBA, runs the query through EXPLAIN to check the query plan. Upon seeing the result, the dev curses a bit, tries a few variants, and ultimately takes the issue to the DBA.

This is what the developer saw:

EXPLAIN ANALYZE 
SELECT *
  FROM sys_patient
 WHERE sex = 'F';
 
                             QUERY PLAN                             
--------------------------------------------------------------------
 Seq Scan on sys_patient
      (cost=0.00..19853.00 rows=498233 width=29)
      (actual time=0.018..541.738 rows=500000 loops=1)
   Filter: (sex = 'F'::bpchar)
   Rows Removed by Filter: 500000
 
 Planning time: 0.292 ms
 Execution time: 823.901 ms

No matter what the dev did, the database adamantly refused to use the idx_patient_sex index. The answer is generally obvious to a DBA or a relatively seasoned developer, but this actually happens far more frequently than one might think. This is an extreme example, yet even experienced database users, report writers, and analysts make this mistake.

Before using an index, the database essentially asks a series of questions:

  1. How many matches do I expect from this index?
  2. What proportion of the table do these matches represent?
  3. Are the cumulative random seeks faster than filtering the table?

If the first two answers are too large, or the answer to the last is no, the database will not use the index. In our example, the sex column only has two values, so the answers are more obvious than usual. With one million rows, a query on the sex column alone would match half of them. In addition, randomly seeking 500,000 results is likely an order of magnitude slower than simply filtering the whole table for matches.
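If you want to see the planner's reasoning for yourself, one trick is to temporarily discourage sequential scans for the current session and compare estimated costs. This is purely a diagnostic sketch, not a setting to leave enabled anywhere important:

SET enable_seqscan TO off;
 
-- With sequential scans penalized, the planner reveals its backup
-- plan for the same query. The estimated cost of the index-based
-- plan should come out higher than the 19853 we saw above, which
-- is exactly why idx_patient_sex gets ignored.
EXPLAIN
SELECT *
  FROM sys_patient
 WHERE sex = 'F';
 
RESET enable_seqscan;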

But it’s not always so easy to figure out what kind of cardinality to expect from a table column. Short of checking every column of every table with count(DISTINCT my_col) or something equally ridiculous, someone unfamiliar with the data in a complex table architecture would get stuck. However, in order to answer the above questions, the database itself must track certain statistics about table contents.

It just so happens that PGDB makes that data available to everyone through the pg_stats view. Let’s check what PostgreSQL has stored regarding the sys_patient table.

SELECT attname AS column_name, n_distinct
  FROM pg_stats
 WHERE tablename = 'sys_patient';
 
 column_name | n_distinct 
-------------+------------
 patient_id  |         -1
 full_name   |         -1
 birth_dt    |       7310
 sex         |          2

Interpreting these results is actually very easy. Any negative n_distinct value is a ratio: the number of distinct values divided by the number of rows, negated. At -1, there's a one-to-one relationship between the number of rows in the table and the number of distinct values in that column. As a general rule, nearly any column with a negative value here is a good index candidate, because a WHERE clause will reduce the potential results significantly.

Positive values are a direct estimate of the number of unique values for that column. During table analysis, the database checks a random sampling of rows and tabulates statistics based on them. That means the value in n_distinct is representative instead of exact, but it usually doesn’t deviate by a significant margin. The data here doesn’t need to be viable for reporting, just accurate enough to calculate an efficient query plan.
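If a particularly skewed column ever needs a bigger sample, the amount of data ANALYZE collects can be tuned per column. Here is a minimal sketch using our example table; the value 1000 is purely illustrative, and the default target is usually 100:

-- Gather a larger sample for birth_dt the next time the table
-- is analyzed.
ALTER TABLE sys_patient
      ALTER COLUMN birth_dt SET STATISTICS 1000;
 
ANALYZE sys_patient;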

From here, we can see that the sex column would likely be a terrible index candidate, even if we know nothing else about the table. There are simply not enough distinct values to meaningfully reduce the number of matches for a query.

Given all of this, the dev has nothing to fear; the idx_patient_sex index should have never existed in the first place. A query that needs to fetch all of any particular sex will simply require a sequential scan, and that’s fine.
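For contrast, consider the high-cardinality birth_dt column. With over 7,000 distinct values, any single date matches only a sliver of the table, so a query like the following should happily use idx_patient_birth_dt. This is a quick sketch; the chosen date is arbitrary and the exact plan may vary:

-- Expect an index or bitmap scan on idx_patient_birth_dt here,
-- since the predicate matches only a few hundred rows at most.
EXPLAIN ANALYZE
SELECT *
  FROM sys_patient
 WHERE birth_dt = (CURRENT_DATE - INTERVAL '30 years')::DATE;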

Creating indexes can be a game, and sometimes the only way to win is not to play.



PG Phriday: 10 Ways to Ruin Performance: Indexing the World

July 17th, 2015 | Published in Database, Tech Talk | 1 Comment


An easy way to give PGDB (PostgreSQL) a performance boost is to judiciously use indexes based on queries observed in the system. For most situations, this is as simple as indexing columns that are referenced frequently in WHERE clauses. PGDB is one of the few database engines that takes this idea even further with partial indexes. Unfortunately, thanks to insufficient exposure, most DBAs and users are unfamiliar with this extremely powerful functionality.

Imagine we have an order system that tracks order state, such that entries are marked as new, processing, or done. These kinds of transient states are not uncommon in various inventory management systems, so it’s a great example for this use case. Often with such systems, data is distributed in such a way that more than 90% of orders are marked as ‘done’. To make this interesting, let’s just cap the done state at 90%, and distribute another 5% to processing, and 5% to new.

This somewhat complex SQL should emulate the above scenario:

DROP TABLE IF EXISTS sys_order;
 
CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    account_id   INT          NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_state  CHAR         NOT NULL DEFAULT 'N',
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);
 
INSERT INTO sys_order (
         account_id, product_id, item_count, order_state,
         order_dt, valid_dt
       )
SELECT (a.id % 100) + 1, (a.id % 100000) + 1, (a.id % 100) + 1,
       CASE WHEN b.id BETWEEN 1 AND 5 THEN 'N'
            WHEN b.id BETWEEN 6 AND 10 THEN 'P'
            ELSE 'D'
       END,
       now() - (a.id % 1000 || 'd')::INTERVAL
             + (a.id || 'ms')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (a.id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 10000) a(id),
       generate_series(1, 100) b(id);
 
ALTER TABLE sys_order ADD CONSTRAINT pk_order_id
      PRIMARY KEY (order_id);
 
CREATE INDEX idx_order_account_id
    ON sys_order (account_id);
 
CREATE INDEX idx_order_order_dt
    ON sys_order (order_dt DESC);
 
ANALYZE sys_order;

In many, if not most, situations, automated tools and interfaces are only interested in pending order states. If an order is new, or in the middle of processing, it’s going to see a lot of activity. Because of this, a lot of queries will request orders based on status. As an example, let’s fetch all new orders for a certain account, possibly for display on a website.

This is the query PGDB sees:

SELECT *
  FROM sys_order
 WHERE account_id = 42
   AND order_state = 'N';

In my virtual environment, this executes in about 3ms and uses the index on account_id as expected. Few people would consider a 3ms execution time as slow, so this is where most optimization stops. But we know something about this data! We know that 90% (or more) of all possible order states are ‘D’. In the case of this query, PGDB has to reduce the 10,000 matches from the index down to 500 to return our results.
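To watch that reduction happen, wrap the query in EXPLAIN ANALYZE and look for the ‘Rows Removed by Filter’ line; given the distribution above, it should hover somewhere around 9,500. This is only a sketch, and exact numbers will differ on other hardware and data:

EXPLAIN ANALYZE
SELECT *
  FROM sys_order
 WHERE account_id = 42
   AND order_state = 'N';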

This is hardly ideal. A naive approach to correcting this may be to index both account_id and order_state. While this works, we’re still indexing the 90% of values that provide no benefit. This is where PGDB differs from many other database engines. Let’s go a bit further with our example: create a partial index and try the query again:

CREATE INDEX idx_order_state_account_id
    ON sys_order (account_id, order_state)
 WHERE order_state != 'D';
 
SELECT *
  FROM sys_order
 WHERE account_id = 42
   AND order_state = 'N';

The revised execution time of our query with the new index is about 0.7ms. While this is 4-5x faster than before, the benefits go beyond execution time. Let’s take a look at the size of each index on disk:

SELECT indexrelid::REGCLASS::TEXT AS index_name,
       pg_relation_size(indexrelid) / 1048576 AS size_mb
  FROM pg_index
 WHERE indrelid::REGCLASS::TEXT = 'sys_order';
 
         index_name         | size_mb 
----------------------------+---------
 pk_order_id                |      21
 idx_order_account_id       |      21
 idx_order_order_dt         |      21
 idx_order_state_account_id |       2

Hopefully it comes as no surprise that an index which includes 90% less data will be 90% smaller. Also, keep in mind that this contrived example was designed to somewhat downplay the effect of partial indexes. In a real order system, far more than 90% of orders would be marked as done, thus magnifying the speed increase and index size reduction.
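One caveat worth mentioning: the planner only considers a partial index when the query’s WHERE clause provably implies the index predicate. In our example, order_state = 'N' implies order_state != 'D', so the index qualifies; a query asking for completed orders does not. A quick sketch:

-- Can use idx_order_state_account_id: 'N' implies != 'D'.
EXPLAIN
SELECT *
  FROM sys_order
 WHERE account_id = 42
   AND order_state = 'N';
 
-- Cannot use it; the planner most likely falls back to
-- idx_order_account_id instead.
EXPLAIN
SELECT *
  FROM sys_order
 WHERE account_id = 42
   AND order_state = 'D';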

There are a lot of ways partial indexes can be used, and like most tools, they’re not appropriate for all situations. But when the data cooperates, they’re extremely relevant. When time permits, take a look at data distribution and queries against the system. Chances are, there will be at least one situation that could be improved with a partial index.
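As one more illustration of where they shine, the same table could index only the orders that haven’t been validated yet. The index stays tiny because validated rows, the overwhelming majority, never enter it. A sketch, with an invented index name:

-- Only rows with a NULL valid_dt are indexed, so lookups for
-- unvalidated orders stay fast without carrying the other ~99%
-- of the table.
CREATE INDEX idx_order_needs_validation
    ON sys_order (order_dt)
 WHERE valid_dt IS NULL;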

If only every database was so accommodating.



PG Phriday: 10 Ways to Ruin Performance: MAXimized Value

July 10th, 2015 | Published in Database, Tech Talk | No Comments


I apologize for putting this series on a short hiatus last week for the 4th of July. But worry not, for this week is something special for all the developers out there! I’m going to try to make your life easier for a change. Screw the database!

As a PGDB (PostgreSQL) DBA, it’s easy to get tied up in performance hints, obscure syntax, and mangled queries, but it’s really all about the people. These men and women who hammer our databases all day long in an attempt to hit insane deadlines often stare at the screen in defeat, trying to get something to work because they’re afraid to ask for help. I know, because I used to be one of them in my bygone developer days.

Sometimes performance isn’t just about the database. Queries can be insanely complex for seemingly simple requests. Even developers who are familiar with SQL syntax don’t speak it fluently, and will often come up with rather bizarre—yet functional—solutions to get a specific data set. Nobody wins when this happens, not the poor dev who fought a truculent query all day, nor the DBA who probably doesn’t understand all the business logic that drove it.

Let’s use an example from a company I worked with in the past. This data set should give us something fairly close to what happened back then.

CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    account_id   INT          NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);
 
INSERT INTO sys_order (account_id, product_id, item_count,
       order_dt, valid_dt)
SELECT (a.id % 1000) + 1, (a.id % 100000) + 1, (a.id % 100) + 1,
       now() - (id % 1000 || 'd')::INTERVAL + (id || 'ms')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 1000000) a(id);
 
ALTER TABLE sys_order ADD CONSTRAINT pk_order_id
      PRIMARY KEY (order_id);
 
CREATE INDEX idx_order_account_id
    ON sys_order (account_id);
 
CREATE INDEX idx_order_order_dt
    ON sys_order (order_dt DESC);
 
ANALYZE sys_order;

Consider the scenario: retrieve all of the order details for the most recent order on every account in the system that ordered something in the last month. The developer tasked with solving this conundrum knew about GROUP BY syntax, and also had enough experience to realize it wouldn’t work. Every non-aggregate column in a query must appear in the GROUP BY clause, so if we wanted the maximum order date for each account, we’d lose all of the other order information.
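For illustration, the aggregate-only version is easy enough to write, but it only surfaces two columns; every other detail about the order is gone. A quick sketch:

-- Returns the most recent order date per account, but nothing
-- else about the order itself.
SELECT account_id, MAX(order_dt) AS order_dt
  FROM sys_order
 WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
 GROUP BY account_id;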

Given those constraints, the developer came up with this:

SELECT o.*
  FROM sys_order o
  JOIN (
        SELECT account_id,
               MAX(order_dt) AS order_dt
          FROM sys_order
         WHERE order_dt > CURRENT_DATE
                        - INTERVAL '1 month'
         GROUP BY account_id
       ) s USING (account_id, order_dt);

It may not be immediately obvious, but this approach only worked because the order system was built in such a way that an account could not order two different items simultaneously. Without that restriction, more than one row would come back for each account when joined back to the order table. But beyond that caveat, there’s an even more distressing implication. Forget the previous discussion if you can, then ask yourself this question: what does that query do?

I’ll admit I scratched my head for a couple of minutes the first time I encountered it. The intent is almost completely obfuscated by the structure, which really defeats the purpose of SQL as a language. Generally the idea of SQL is to make a request, not to tell the database how to fulfill it. That’s why it’s a good thing SQL has syntax specifically designed to address this kind of request.

In PGDB, the DISTINCT ON clause lets us specify columns that should be used in determining whether or not a row is actually distinct, independent of any potential aggregation. It works like this: the first unique combination of any listed columns wins. That means we can naturally control the output simply by issuing a standard ORDER BY.

Here’s how that looks:

SELECT DISTINCT ON (account_id) *
  FROM sys_order
 WHERE order_dt > CURRENT_DATE
                - INTERVAL '1 month'
 ORDER BY account_id, order_dt DESC;

Now look at the query and ask the same question as before: what does this query do? Well, for each distinct account, return all columns for the most recent order in the last month. Isn’t that much more straightforward? As a bonus, this doesn’t have the potential for breaking if the system changes and allows batch ordering of multiple products per transaction.

So we now have a query that’s easier to understand, is safer and more consistent, and uses far simpler SQL syntax. The only possible drawback is that DISTINCT ON is often a bit slower than a more convoluted approach. But that’s fine; developer time is valuable. As a DBA, I know which query I’d rather try and debug!

Hopefully, I’m not the only one.



PG Phriday: 10 Ways to Ruin Performance: In The Loop

June 26th, 2015 | Published in Database, Tech Talk | 1 Comment


As a database, PGDB (PostgreSQL) is fairly standard in its use of SQL. Developers of all stripes, however, might have trouble switching gears and thinking in set operations, since so many language constructs focus on conditionals and looping. Last week in the performance pitfalls series, we discussed a bit of Set Theory, and how ignorance of its implications can be disastrous. But what about the more mundane?

What happens, for instance, when we treat a database like a programming language?

This time, the example tables will emulate a pretty terrible ordering system with a user table of 1000 users, and a million orders distributed over roughly the last three years. Here it is, in all its glory:

CREATE TABLE sys_user
(
    user_id      SERIAL       NOT NULL,
    username     VARCHAR      NOT NULL,
    password     VARCHAR      NOT NULL,
    last_order   TIMESTAMPTZ  NULL,
    created_dt   TIMESTAMPTZ  NOT NULL DEFAULT now(),
    modified_dt  TIMESTAMPTZ  NOT NULL DEFAULT now()
);
 
INSERT INTO sys_user (username, password, created_dt, modified_dt)
SELECT 'user' || a.id,
       md5('use-bcrypt-instead' || 'user' || a.id || 'somepassword'),
       now() - (a.id % 1000 || 'd')::INTERVAL,
       now() - (a.id % 100 || 'd')::INTERVAL
  FROM generate_series(1, 1000) a(id);
 
ALTER TABLE sys_user ADD CONSTRAINT pk_user_id
      PRIMARY KEY (user_id);
 
ANALYZE sys_user;
 
CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    product_id   INT          NOT NULL,
    user_id      INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);
 
INSERT INTO sys_order (product_id, item_count, user_id, order_dt, valid_dt)
SELECT (a.id % 100000) + 1, (a.id % 100) + 1, (a.id % 1000) + 1,
       now() - (a.id % 1000 || 'd')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 1000000) a(id);
 
ALTER TABLE sys_order ADD CONSTRAINT pk_order_id
      PRIMARY KEY (order_id);
 
CREATE INDEX idx_order_order_dt
    ON sys_order (order_dt DESC);
 
ANALYZE sys_order;

With that out of the way, please note that I do not now, nor will I ever, advocate using md5 for passwords. Just don’t. Use crypt (bcrypt) from the pgcrypto extension instead. Note that even in this woefully simplified schema, I use a rudimentary salt. I’m not a security analyst, so going further is beyond the scope of this article. But if you plan on writing any kind of secured application, read up on proper cryptographic protocols and two-factor authentication. Never, never, never throw an authentication system together, or you will be hacked and every password in your system will be compromised. Just ask LinkedIn.
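For reference, here’s a minimal sketch of the pgcrypto approach; the work factor and the sample password are purely illustrative, and this is nowhere near a complete authentication design:

CREATE EXTENSION IF NOT EXISTS pgcrypto;
 
-- Store a bcrypt hash; gen_salt('bf', 8) produces a blowfish salt
-- with a work factor of 8.
UPDATE sys_user
   SET password = crypt('somepassword', gen_salt('bf', 8))
 WHERE username = 'user1';
 
-- Verify a login attempt: re-hash the supplied password using the
-- stored hash as the salt and compare.
SELECT password = crypt('somepassword', password) AS password_ok
  FROM sys_user
 WHERE username = 'user1';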

Going back to our example, imagine we want to keep some denormalized information about the user in the user table itself. There’s a batch job that updates the user table and maintains the last time the user ordered something. The job does something like this:

\timing ON
 
DO $$
DECLARE
    i INT;
BEGIN
    FOR i IN SELECT user_id
               FROM sys_order
              WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
    LOOP
      UPDATE sys_user
         SET last_order = now()
       WHERE user_id = i;
    END LOOP;
END
$$ LANGUAGE plpgsql;
 
TIME: 4107.704 ms

“Contrived,” you scream! And you’re partially right. Unfortunately, this kind of scenario happens more often than you might expect. This is actually a modified example I caught in a system I was auditing two years ago, and its execution time was on the order of hours. In that scenario, there was also the language and network overhead to consider. It wasn’t a PGDB DO loop running directly in the database, but some Groovy code that was operating on every result from another query.

So while this example is clearly contrived, that situation definitely wasn’t. Part of the reason I began my education crusade within our company was to prevent such things from ever occurring. Which brings me to the alternative: anything else. Almost literally any other database-focused solution will work better than a language looping construct.

I’ve specifically advocated against using IN syntax in many contexts, but look what happens when we convert the loop to operating on the set of results in the loop query:

\timing ON
 
UPDATE sys_user
   SET last_order = now()
 WHERE user_id IN (
         SELECT user_id
           FROM sys_order o
          WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
       );
 
UPDATE 32
TIME: 17.466 ms

This particular example is almost 250x faster, but I’ve seen instances where the duration differs by five or even six orders of magnitude. Remember that example I mentioned that used to run for hours? The post-modification execution time was fifteen seconds. If we had further modified the update job to use a temporary table, it would have been five seconds.
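For the curious, the temporary table variant of our example is nothing exotic. A rough sketch, with the temp table name invented for illustration:

CREATE TEMP TABLE tmp_recent_users AS
SELECT DISTINCT user_id
  FROM sys_order
 WHERE order_dt > CURRENT_DATE - INTERVAL '1 month';
 
-- Give the planner statistics on the intermediate results.
ANALYZE tmp_recent_users;
 
UPDATE sys_user u
   SET last_order = now()
  FROM tmp_recent_users t
 WHERE u.user_id = t.user_id;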

It all depends, of course, on the number of records involved, the speed of the underlying disks, the health and correlation of underlying indexes and data heap, and a lot of other factors. The lesson here is, as with the last article, to try and think in sets and groups, and perform operations on the whole collection of data at once.

But there’s yet another way I suggest using that’s even simpler. PGDB has UPDATE ... FROM syntax that lets you combine the two statements into a single operation that resembles a standard JOIN. Here’s what it looks like using our example:

\timing ON
 
UPDATE sys_user u
   SET last_order = now()
  FROM sys_order o
 WHERE o.user_id = u.user_id
   AND o.order_dt > CURRENT_DATE - INTERVAL '1 month';
 
UPDATE 32
TIME: 40.822 ms

Anything you can do in a SELECT can be translated into this syntax. And like the IN example, it’s compatible with EXPLAIN, so we can investigate how many records are affected before executing it. But why is it slower? In this particular instance, the culprit is the number of orders that occurred in the last month. Because of the data distribution, there are 32,000 orders from 32 customers. Unfortunately the planner sees the 32,000 first and doesn’t know they represent only 32 distinct users, so it performs a sequential scan on the user table, inflating the overall execution time.

That won’t always be the case, and regardless, it’s still much faster than the alternative. The primary benefit from this syntax is its simplicity. If you can write a JOIN, you can use UPDATE ... FROM, and not have to worry about properly formatting or vetting a subselect.
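And since the whole statement is EXPLAIN-compatible, previewing the plan before pulling the trigger is just a matter of prepending the keyword. A sketch; the plan you get depends entirely on the data:

EXPLAIN
UPDATE sys_user u
   SET last_order = now()
  FROM sys_order o
 WHERE o.user_id = u.user_id
   AND o.order_dt > CURRENT_DATE - INTERVAL '1 month';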

But wait, there’s more! Examples like this are too straightforward. Sometimes a lot of preliminary work is required before the final UPDATE. Sometimes this means using several intermediate temporary tables, or as before, building a program loop to perform several supplementary tasks and issuing individual updates.

With PGDB Common Table Expression (CTE) syntax, those prerequisite steps can be done inline. Check this out:

\timing ON
 
WITH ord AS
(
    SELECT DISTINCT user_id
      FROM sys_order
     WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
)
UPDATE sys_user u
   SET modified_dt = now()
  FROM ord
 WHERE u.user_id = ord.user_id;
 
UPDATE 32
TIME: 19.849 ms

We’ve now done two things.

  1. Brought execution time back in line with the IN case.
  2. Broken the UPDATE into two distinct steps.

PGDB CTEs act as an optimization fence, because each segment is physically materialized as a temporary object. This object has very real dimensions, and the results can be fed into further CTE steps, or used as in our example, directly in a final query. Here’s how adding a second step might look:

\timing ON
 
WITH ord AS (
    SELECT DISTINCT user_id
      FROM sys_order
     WHERE order_dt > CURRENT_DATE - INTERVAL '1 month'
),
good AS (
    SELECT user_id
      FROM ord
     WHERE user_id % 3 = 0
)
UPDATE sys_user u
   SET modified_dt = now()
  FROM good
 WHERE u.user_id = good.user_id;
 
UPDATE 10
TIME: 19.741 ms

Now the example really is contrived, because we could have just added the extra WHERE clause to the first step. But that’s because our example is hilariously basic. In a real system, there would be dozens or even hundreds of other tables, and a lot more potential for subsequent calculations, joins, and so on.

The point here isn’t that the second step was pointless, but that it was possible at all. This is what happens when you think in sets. Each result set is something that can be reduced or combined with another result set, and the combination can be used for further operations. Imagine it as object-oriented programming; each set is a blob that should be considered one entity with various attributes.

I’ll end with this bit of Python that should make it obvious what I mean:

# This is set theory:
 
foo = [1, 2, 3, 3, 3, 4]
print(foo.count(3))
 
# This isn't:
 
total = 0
for i in foo:
    if i == 3:
        total += 1
 
print(total)


PGCon 2015 Unconference: A Community

June 23rd, 2015 | Published in Database, Tech Talk | No Comments


Well, I’ve returned from PGCon 2015 in Canada, and after a couple days to decompress, it’s time to share. I wrote about the PGCon 2014 unconference after returning to Chicago last year, so I felt it was only fitting that I start there. I feel as strongly now as I did a year ago, that directly interacting with the PostgreSQL maintainers at this level helps the community thrive. Even though PGCon is generally a more developer-focused conference, being able to brainstorm with the bigwigs, even if nothing comes of it, means the ideas have been given a fair shake.

The format this year was quite a bit different than last year. I attribute this to Josh Berkus once again, in what I see as his attempt to formalize the process and give it more weight. Indeed, it’s hard to argue with his results. Just take a look at the 2015 Unconference Wiki page. It’s a deluge of information I wish we had about the 2014 talks, from attendees and talk summaries, to relevant external links and the all-important schedule. I’m a bit biased in that regard because I tend to produce and consume vast swaths of excessive information, but it’s an excellent reflection on how much the PostgreSQL community values documentation in general.

Unfortunately due to inclement weather, I missed the voting process and the first day of talks entirely. I desperately missed watching the talk selection process, though Josh said they did a lot of that electronically because several people would be late to the conference. I’m not sure how I missed that, so I’ll blame email; it deserves it anyway. Regardless, after witnessing the 2014 talk selection, I stand by my earlier assertion that it’s a sight to behold. It warms my crooked old heart to watch people so excited about technology and the amicable interaction they have with the friendly developers.

Despite the setbacks, I did attend several chats on Wednesday, which is another departure from last year’s format. In 2014, the selection process and every talk were constrained to one long Saturday, and the whole affair was very ad-hoc. I can’t tell whether or not distributing everything across two days is an improvement, but it certainly worked in favor of anyone delayed by rain this year. And the information available on Wednesday was certainly copious. Those at the same sessions I watched got a summary of semiconductor storage variants, PostgreSQL and Docker, the possible future of horizontal scaling, how pg_shard is progressing, and how partitioning might be better integrated into the core (my personal favorite). All told, it was a good assortment and most of them were fairly informative.

Through all of these sessions, something felt different from before, and it took me a while to figure out what was missing: spontaneity and interactivity. Every session I attended, barring Josh’s own topic on Docker, had slides. It’s extremely difficult to have community discussion or collaboration when there’s a single speaker pushing his or her own agenda. From what I saw this year, the majority of the Unconference was plagued by this propensity to allow the speaker an open forum, as if they had won a bonus PGCon slot. I get the impression that was not the intention of the format, and I’ve heard slides might be disallowed next year. If that’s the case, that’s an impressively prompt response, and suggests I wasn’t the only one who voiced the concern.

I also suspect the unconference was harmed by holding it before the primary conference itself. I think part of what made everything work last year was that the unconference was a response to the primary talks. Once we had a chance to learn about the current direction of PostgreSQL, potential upcoming features, extensions, modules, kitchen sinks, and so on, everyone could discuss inherent implications. We bootstrapped our brainstorming with all of the exciting information presented in the conference, and organically produced further discussions based on it. Without that, the unconference was just… more talks, and unnecessarily covered some overlapping information. It’s impossible to know everything a speaker will include in a topic, so we were operating blindly in most cases. I think that might have contributed to the passive audiences; there was no direction to work with.

And that really is the challenge, isn’t it? Is it possible to wrangle a group of developers and DBAs into a room and encourage semi-focused conversations without allowing it to devolve into anecdotes and speculation? Yes! But can it be done without injecting a ringmaster into each session to ensure there’s some kind of convergence on the stated topic? I definitely don’t want the unconference to become the equivalent of a moderated internet forum, because that cheapens its impact. This is in many respects a once-in-a-lifetime opportunity for many attendees: to be part of PostgreSQL’s future. I’d hate for anyone to miss it because they didn’t want to listen to a bunch of loose squabbling.

And that’s why I have really high hopes for next year. The trick with anything is to find balance. The first attempt revealed the potential of the unconference format, the second gave it some necessary structure but may have overreached slightly, and the third will combine everything into a cohesive amalgam everyone can love. These people are all geniuses, and I have every bit of faith they’ll obliterate my tiny, little mind with something I never even considered. That’s what they do, it’s who they are.

Me? I just try to pay attention so I don’t miss anything. Join one of the mailing lists so you can, too. Also, don’t be afraid to attend a PostgreSQL User Group in your area. It’s not uncommon to see a committer there depending on the location; you don’t always have to attend a conference to bask in the glow of these luminous god creatures!


Tags: , , , ,
