Not a lot of people remember what Postgres was like before version 8.4. In many ways, this was the first “modern” release of the database engine. CTEs, window functions, column-level permissions, an in-place upgrade path compatible with subsequent versions, collation support, continuous query statistics collection: it was just a smorgasbord of functionality.
Of these, CTEs, or Common Table Expressions, probably enjoy the most user-level exposure, and for good reason. Before this, there was no way to perform a recursive query in Postgres, which really hurts in certain situations. Want to display all related child threads in an online discussion? How about fetching the components of an organization chart by following management assignments? Better get ready for a lot of queries in a loop.
In addition to that, complicated queries were difficult to logically simplify. Reporting queries are especially prone to frequent sequences of aggregates and subqueries. It’s not uncommon to build a query that’s several pages long in this kind of context. Optimizing such an unwieldy beast is often difficult or even impossible simply due to all of the components and confusing nesting.
CTEs changed these things for the better and, in the eyes of many, finally brought Postgres to parity with Oracle and its long-established recursive query support. So let’s explore what CTEs really deliver, and how they can improve our Postgres experience, caveats and all.
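To make the organization chart case concrete, here is a minimal sketch of a recursive CTE. The `employee` table, its columns, and both helper functions are hypothetical, and the demo runs against Python’s built-in sqlite3 module purely so it needs no server. For a query this simple, the `WITH RECURSIVE` text should carry over to Postgres 8.4 and later, though a Postgres driver would likely use a different parameter placeholder than `?`.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory org chart (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee ("
        "id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?, ?)",
        [
            (1, "Ada", None),  # top of the chart
            (2, "Bob", 1),
            (3, "Cy", 1),
            (4, "Dee", 2),
            (5, "Eve", 4),
        ],
    )
    return conn


def fetch_reports(conn, manager_id):
    """Return (id, name, depth) for everyone below manager_id in one query."""
    sql = """
    WITH RECURSIVE reports (id, name, depth) AS (
        -- Anchor term: direct reports of the requested manager.
        SELECT id, name, 1
          FROM employee
         WHERE manager_id = ?
        UNION ALL
        -- Recursive term: reports of anyone already found.
        SELECT e.id, e.name, r.depth + 1
          FROM employee e
          JOIN reports r ON e.manager_id = r.id
    )
    SELECT id, name, depth
      FROM reports
     ORDER BY depth, id
    """
    return conn.execute(sql, (manager_id,)).fetchall()


if __name__ == "__main__":
    for row in fetch_reports(build_demo_db(), 1):
        print(row)
```

The single statement replaces the loop of one query per management level that the pre-8.4 approach would have needed.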
Recently a coworker asked me this question:
Should I expect variance between minutes and hours for the same query?
And I was forced to give him this answer:
Potentially, but not commonly. Query planning is an inexact science, and regardless of the query being the “same query,” the data is not the “same data.” This isn’t generally the case, but on occasion, changes in data can affect the query execution path. Usually this is a good thing, as the database accounts for new value distributions.
For example, if a column holds a million values but 90% of them are the same, only the rarer values should trigger an index scan instead of a sequential scan. Which values are rare will change over time, and if the stats don’t account for that, queries will have undependable performance. Of course, this introduces potential correlation assumptions that aren’t correct in some cases, and that also causes unreliable query performance. I guess the question is: which would you rather have?
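The skew argument above can be sketched as a toy model. This is emphatically not the Postgres planner’s cost arithmetic: the `collect_stats` and `choose_scan` helpers, the 5% cutoff, and the sample order statuses are all assumptions made up for illustration. The only point is that the same lookup can get a different plan depending on how fresh the statistics are.

```python
from collections import Counter


def collect_stats(column_values):
    """Return the fraction of rows holding each value (a stand-in for real stats)."""
    counts = Counter(column_values)
    total = len(column_values)
    return {value: count / total for value, count in counts.items()}


def choose_scan(stats, lookup_value, index_threshold=0.05):
    """Pick a plan from the estimated fraction of rows matching lookup_value.

    index_threshold is a hypothetical cutoff, not a real planner setting.
    """
    fraction = stats.get(lookup_value, 0.0)
    return "index scan" if fraction < index_threshold else "sequential scan"


# Hypothetical snapshot: 90% of rows share a single value.
old_data = ["shipped"] * 900 + ["pending"] * 99 + ["failed"] * 1
old_stats = collect_stats(old_data)

# Later, the distribution shifts, but nobody has refreshed the stats yet.
new_data = ["shipped"] * 500 + ["failed"] * 500
fresh_stats = collect_stats(new_data)

if __name__ == "__main__":
    print(choose_scan(old_stats, "shipped"))   # dominant value
    print(choose_scan(old_stats, "failed"))    # rare value
    print(choose_scan(fresh_stats, "failed"))  # no longer rare
```

With the stale numbers, a lookup on `failed` still looks rare and gets the index treatment even though it now matches half the table. Swap in the fresh numbers and the choice flips, which is the "same query, different data" effect in miniature.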
The Postgres query planner is a house of cards built upon the ever-shifting sand of our data. It has the utterly impossible mission of converting our ridiculous and inane requests into a logical series of fetch, filter, sort, join, and other instructions. Then the resulting steps must be ruthlessly efficient, or the execution phase could very well saturate every hardware resource available; set theory isn’t very forgiving.
“Forewarned is forearmed” is very apt when applied to database query planners. Without proper statistics, they are reduced to assumptions that make adequate first approximations. But scale is the utter enemy of imprecision, as small errors multiply at every step until the estimate bears little resemblance to reality. This allows seemingly simple report scripts to fall endlessly into a pit of smoldering system resources.
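A bit of arithmetic shows how quickly that multiplication gets out of hand. In this sketch, the `estimated_rows` helper, the row count, and the selectivity figures are all invented for illustration. It assumes each step keeps an independent fraction of the rows, which is a simplification of what a real planner does.

```python
from fractions import Fraction


def estimated_rows(base_rows, selectivities):
    """Multiply a starting row count by the fraction of rows each step keeps."""
    rows = Fraction(base_rows)
    for selectivity in selectivities:
        rows *= selectivity
    return rows


# Hypothetical three-step query over a million rows.
actual = [Fraction(1, 100)] * 3   # each step really keeps 1% of its input
assumed = [Fraction(1, 10)] * 3   # the guess without good stats: 10% per step

true_rows = estimated_rows(1_000_000, actual)
planned_rows = estimated_rows(1_000_000, assumed)
error_factor = planned_rows / true_rows

if __name__ == "__main__":
    print(true_rows, planned_rows, error_factor)
```

Each individual guess is off by only a factor of ten, but three of them in a row leave the final estimate off by a factor of a thousand. Exact fractions are used here just to keep the arithmetic free of floating-point noise.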
Partitioning tables in Postgres can be an extremely risky endeavor. Unfortunately, on many larger systems it’s also essentially a requirement; the maximum size of a Postgres table is 32TB with the default 8kB block size. The risk isn’t just that converting an existing table to a series of partitions is expensive or time consuming. We must also consider how the query planner will react to the partitioned version of a table. There’s also the very real risk that we will introduce (or already have introduced) flaws in the trigger or constraint logic.
Tools like pg_partman or pg_pathman are great for new systems, but existing deployments require retrofitting. So what happens if we have a large warehouse that’s already using a slightly flawed partition engine? And what other concerns remain even after achieving a pristine setup?
The Postgres developers recently announced the availability of the first public beta for Postgres 9.6. I would be highly remiss to ignore such an opportunity to dig into any interesting functionality listed in the 9.6 release notes. All in all, it’s a pretty exciting series of advancements, and assuming this is a glimpse of what we’ll see when 9.6 drops, I’d say we’re on the right track.