PG Phriday: Secret of the Ooze

A few days ago, a developer came to me with that inevitable scenario that every DBA secretly dreads: a need for a dynamic table structure. After I’d finished dying inside, I explained the various architectures that could give him what he needed, and then I excused myself to another room so I could weep silently without disturbing my coworkers. But was it really that bad? Databases have come a long way since the Bad Old Days when there were really only two viable approaches to table polymorphism. Postgres in particular adds two options that greatly reduce the inherent horror of The Blob. In fact, I might even say it’s a viable strategy now that Postgres JSON support is so good.

Why might a dev team need or even want a table with semi-defined columns? That seems like a silly question to ask in the days of NoSQL databases. Yet an application traditionally tied to an RDBMS can easily reach a situation where a strict table is outright inadequate. Our particular case arose because application variance demands columnar representation of precalculated values. But the generators of those values are themselves dynamic, as are the labels. One version may have ten defined columns beyond the standard core, while another might have fifteen. So we need a dynamic design of some description.

Before we go crazy in our history lesson, let’s start with a base table we’ll be expanding in various ways throughout the article. It’ll have a million data points ranging from today back a few months, with a reading every ten seconds.

CREATE TABLE sensor_log (
  id            INT PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);

INSERT INTO sensor_log (id, location, reading, reading_date)
SELECT s.id, s.id % 1000, s.id % 100,
       '2016-03-11'::TIMESTAMP - ((s.id * 10) || 's')::INTERVAL
  FROM generate_series(1, 1000000) s(id);

CREATE INDEX idx_sensor_log_location ON sensor_log (location);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

ANALYZE sensor_log;

To understand just why a DBA might shudder in abject terror at the prospect of a dynamic table polluting a database, we need to examine the ancient methods of achieving it. The first of these is simply adding columns directly. New columns for every calculation. Columns for every misspelling of a word. Potentially useful columns. All columns. Every column. Hundreds. Thousands. Postgres can have up to 1600 per table, so let’s max that baby out! But then what? Another table with even more columns! Glued together into an unholy amalgam with an extremely inadvisable JOIN.
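Purely for illustration (we won’t actually inflict this on our running example), the flat-table approach boils down to an endless drip of statements like these, with every column name here being hypothetical:

ALTER TABLE sensor_log ADD COLUMN frequency TEXT;
ALTER TABLE sensor_log ADD COLUMN jitter DECIMAL;
ALTER TABLE sensor_log ADD COLUMN frequencey TEXT;       -- the inevitable misspelling
ALTER TABLE sensor_log ADD COLUMN might_need_later TEXT; -- "potentially useful"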

This is understandably… frowned upon. Instead, probably the most common technique leverages an attribute table. Attribute tables are just key/value pairs tied to the primary key of the source table. This lets us theoretically have infinite columns based on need. They commonly resemble something like this:

CREATE TABLE sensor_log_attr (
  id   INT,
  key  VARCHAR NOT NULL,
  val  VARCHAR NOT NULL,
  UNIQUE (id, key)
);

The primary concern here is one of efficiency. Each and every supplementary column requires a JOIN to represent the data. These virtual columns are impossible to index within the context of the base table itself. We also get no statistics for how values are distributed, because again, these columns aren’t really part of the table structure in question. This approach only really works for descriptive information.

Let’s see how we might use an attribute table by adding another day of data and a view to “flatten” the table for consumption:

INSERT INTO sensor_log (id, location, reading, reading_date)
SELECT 1000000 + s.id, s.id % 1000, s.id % 100,
       '2016-03-11'::TIMESTAMP + ((s.id * 10) || 's')::INTERVAL
  FROM generate_series(1, 8640) s(id);

INSERT INTO sensor_log_attr (id, key, val)
SELECT 1000000 + s.id, 'frequency', (s.id % 120) || 'hz'
  FROM generate_series(1, 8640) s(id);

INSERT INTO sensor_log_attr (id, key, val)
SELECT 1000000 + s.id, 'jitter', round(1.0 / ((s.id % 10)+10), 3)
  FROM generate_series(1, 8640) s(id);

CREATE OR REPLACE VIEW v_attr_flat AS
SELECT l.*, f.val AS frequency, j.val::DECIMAL AS jitter
  FROM sensor_log l
  LEFT JOIN sensor_log_attr f ON (f.id = l.id AND f.key = 'frequency')
  LEFT JOIN sensor_log_attr j ON (j.id = l.id AND j.key = 'jitter');

ANALYZE sensor_log_attr;

SELECT avg(jitter)
  FROM v_attr_flat
 WHERE reading_date > '2016-03-12'::timestamp - interval '10 minutes';

The query on this particular set of data on our test hardware required about 0.8ms to run. This is a best case scenario, unfortunately. Every column added compounds the joins and adds more work to the underlying query. Every row retrieved is likely another iteration of a nested loop over that complicated structure. On the other hand, even if there are hundreds of column variants, only a relatively small handful will apply in any one use case. As such, there is an upper bound on the performance penalty incurred by multiple joins and incomplete statistics.

Regardless, attribute tables are still the slowest technique available in our toolbox. For some database engines, this is where the story ends, much to the dismay of their DBAs. Postgres, however, has table inheritance. Some Postgres veterans probably immediately considered that approach, and why not? Optional columns are easy to add. Instead of using multiple views, we can simply create a table for each major variant to extend the common base of the root table. It’s a great place to start.

Our previous example added some data to the sensor_log table we want to use as our base, so let’s start by resetting its contents:

TRUNCATE TABLE sensor_log;

INSERT INTO sensor_log (id, location, reading, reading_date)
SELECT s.id, s.id % 1000, s.id % 100,
       '2016-03-11'::TIMESTAMP - ((s.id * 10) || 's')::INTERVAL
  FROM generate_series(1, 1000000) s(id);

ANALYZE sensor_log;

Now consider our frequency and jitter columns. In the attribute case, both had to be stored as VARCHAR, because it’s the only way a single column can hold any kind of value. Our inheritance model means we can customize the data types as we normally would. In addition, we get the full benefit of table statistics, including a far simpler query plan when addressing the new table. As such, inheritance is probably the most efficient approach available to us.

Here’s how we could use an inherited table for the new columns:

CREATE TABLE sensor_log_upgraded (
  frequency  TEXT,
  jitter     DECIMAL
)
INHERITS (sensor_log);

INSERT INTO sensor_log_upgraded (
  id, location, reading, reading_date, frequency, jitter
)
SELECT 1000000 + s.id, s.id % 1000, s.id % 100,
       '2016-03-11'::TIMESTAMP + ((s.id * 10) || 's')::INTERVAL,
       (s.id % 120) || 'hz', round(1.0 / ((s.id % 10)+10), 3)
  FROM generate_series(1, 8640) s(id);

CREATE INDEX idx_sensor_upgraded_location
    ON sensor_log_upgraded (location);

CREATE INDEX idx_sensor_upgraded_date
    ON sensor_log_upgraded (reading_date);

ANALYZE sensor_log_upgraded;

SELECT avg(jitter)
  FROM sensor_log_upgraded
 WHERE reading_date > '2016-03-12'::timestamp - interval '10 minutes';

For this very simple example, the query ran about 50% faster than the attribute version. And why wouldn’t it? The row count in the inherited table is much lower than that of the base table. This situation would persist for every child table modeled this way. Each would have its own set of extended columns tailored to the intended task, with the performance that implies. Imagine we used the “one giant flat table” model instead. Query performance would steadily degrade as more variants were introduced, because they’re all held in the same underlying table. Include the necessity of wading through dozens of unused columns, and the situation becomes even more dire.

If there’s any caveat to table inheritance, it’s shared aggregates. Beyond the common root columns held in the base table, it’s impossible to aggregate supplementary fields across variants from the base table without convoluted unions and liberal use of the ONLY decorator. The impact of this can be reduced by putting likely aggregate columns in the base table itself, or by using inheritance chains to group variants together.
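To make the problem concrete, suppose a second, purely hypothetical variant named sensor_log_excessive also carried a jitter column. Since jitter doesn’t exist on the base table, aggregating it across both variants means enumerating every child by hand, something like this sketch:

SELECT avg(jitter)
  FROM (SELECT jitter FROM ONLY sensor_log_upgraded
        UNION ALL
        SELECT jitter FROM ONLY sensor_log_excessive) variants;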

This concept of chained inheritance has a lot of potential. If a child table itself acts as a base for another table with some special case column, it’s far more likely the two share enough underlying elements to enable aggregation. With a solid naming scheme and targeted queries, it’s a completely viable—if a bit clunky—method for accommodating extremely dynamic applications.
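As a rough sketch of how that might look, a hypothetical GPS-equipped variant could inherit from sensor_log_upgraded instead of sensor_log, so aggregates over the shared frequency and jitter columns work by simply querying sensor_log_upgraded without ONLY:

CREATE TABLE sensor_log_gps (
  latitude   DECIMAL,
  longitude  DECIMAL
)
INHERITS (sensor_log_upgraded);

-- Aggregates on the shared columns now cover both variants.
SELECT avg(jitter)
  FROM sensor_log_upgraded;

-- Drop the sketch table so it doesn't interfere with the next section.
DROP TABLE sensor_log_gps;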

Yet there’s still one more approach to consider. Ever since Postgres added JSON as a supported datatype, it has threatened to eat the lunch of NoSQL engines. But why? In a way, JSON improves on table inheritance by making shared aggregates possible across variants. Like inheritance, the JSON data will only contain columns relevant to the application use case. As a bonus, table statistics and indexing of the JSON data are still possible, if a bit kludgy behind the scenes.

Here’s how we might implement JSON polymorphism:

DROP TABLE sensor_log_upgraded;

ALTER TABLE sensor_log ADD poly_data JSONB;

INSERT INTO sensor_log (
  id, location, reading, reading_date, poly_data
)
SELECT 1000000 + s.id, s.id % 1000, s.id % 100,
       '2016-03-11'::TIMESTAMP + ((s.id * 10) || 's')::INTERVAL,
       json_build_object(
         'frequency', (s.id % 120) || 'hz',
         'jitter', round(1.0 / ((s.id % 10)+10), 3)
       )::JSONB
  FROM generate_series(1, 8640) s(id);

ANALYZE sensor_log;

CREATE OR REPLACE VIEW v_poly_flat AS
SELECT id, location, reading, reading_date,
       poly_data->>'frequency' AS frequency,
       (poly_data->>'jitter')::DECIMAL AS jitter
  FROM sensor_log;

SELECT avg(jitter)
  FROM v_poly_flat
 WHERE reading_date > '2016-03-12'::timestamp - interval '10 minutes';

Like the attribute technique, we require a view to unroll optional fields. But we don’t need joins to do it, and statistics are at least somewhat available from the JSON contents. As a result, the sample query is only slightly slower than the inheritance example. Similar to the inheritance architecture, we can create views to represent specific use cases and deployment versions. Better yet, the syntax for these views is relatively simple and easy to automate, so applications can generate their own views as needs dictate. Altering one giant table or regularly creating new tables carries far more risk.
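As a minimal sketch of that kind of automation (the function name and key list here are only assumptions for illustration), a small plpgsql helper can build a flattened view from whatever JSON keys a given deployment cares about, leaving every optional field as text:

CREATE OR REPLACE FUNCTION build_poly_view(view_name TEXT, keys TEXT[])
RETURNS VOID AS
$$
DECLARE
  col_list TEXT;
BEGIN
  -- Turn each key into an expression like: poly_data->>'jitter' AS jitter
  SELECT string_agg(format('poly_data->>%L AS %I', k, k), ', ')
    INTO col_list
    FROM unnest(keys) AS k;

  EXECUTE format(
    'CREATE OR REPLACE VIEW %I AS
       SELECT id, location, reading, reading_date, %s
         FROM sensor_log',
    view_name, col_list
  );
END;
$$ LANGUAGE plpgsql;

SELECT build_poly_view('v_poly_freq_jitter', ARRAY['frequency', 'jitter']);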

As we mentioned before, another major benefit is that indexes on the JSON expressions still apply to queries through the view, so indexable poly fields are possible. This would be a given for table inheritance, but completely impossible for an attribute table. Here’s how we could use it to specifically target sensor jitter:

CREATE INDEX idx_sensor_log_jitter
    ON sensor_log ( ((poly_data->>'jitter')::DECIMAL) )
 WHERE ((poly_data->>'jitter')::DECIMAL) IS NOT NULL;

EXPLAIN analyze
SELECT count(*)
  FROM v_poly_flat
 WHERE jitter > 0.095;

                             QUERY PLAN
--------------------------------------------------------------------
 Aggregate  (cost=14301.94..14301.95 rows=1 width=0)
            (actual time=0.690..0.690 rows=1 loops=1)
   ->  Bitmap Heap Scan on sensor_log
           (cost=6298.08..19506.34 rows=336213 width=0)
           (actual time=0.151..0.631 rows=864 loops=1)
         Recheck Cond:
             (((poly_data ->> 'jitter'::text))::numeric > 0.095)
         Heap Blocks: exact=115
         ->  Bitmap Index Scan on idx_sensor_log_jitter
                 (cost=0.00..6214.03 rows=336213 width=0)
                 (actual time=0.130..0.130 rows=864 loops=1)
               Index Cond:
                   (((poly_data ->> 'jitter'::text))::numeric > 0.095)

This is pretty exciting in a couple of ways. First, we can see that the planner used our jitter index to restrict candidate rows as expected. Secondly, we were able to leverage partial indexes to completely ignore rows that don’t have the jitter column. Consider how powerful that is in the context of a dynamic table structure. A large majority of rows will simply lack most attributes, and that makes the index one or more orders of magnitude smaller than it would be otherwise. In our case, the index on the location column was 28MB, while our jitter index was only 208kB.
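If you want to verify that on your own copy of the data, the standard size functions make the comparison easy:

SELECT pg_size_pretty(pg_relation_size('idx_sensor_log_location')) AS location_size,
       pg_size_pretty(pg_relation_size('idx_sensor_log_jitter')) AS jitter_size;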

For data that relies on exact matches instead of ranges, there are even better indexing techniques specifically for JSON. And if the advanced Vodka indexes ever get integrated into Postgres core, the situation gets even better.
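For instance (a sketch, and not something our jitter example actually needs), a GIN index using the jsonb_path_ops operator class keeps containment searches fast and compact:

CREATE INDEX idx_sensor_log_poly
    ON sensor_log USING GIN (poly_data jsonb_path_ops);

-- Containment (@>) queries on exact values can use the GIN index.
SELECT count(*)
  FROM sensor_log
 WHERE poly_data @> '{"frequency": "100hz"}';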

On the potentially concerning side, we can see that the planner was confused a bit, since only a tiny fraction of our data even has poly_data contents. In a real case, it’s likely mis-estimates won’t be quite as drastic, but their presence in this example proves there’s still work to be done on the planner. We got a good query plan this time, but with the row estimate several hundred times larger than the actual results, things could have been far worse.

In the end, I still don’t like the general concept of polymorphic tables, or the prospect of supporting them. I’m still a DBA at heart, after all. Yet with the tools Postgres provides, my urge to swim in an acid volcano at the mere thought isn’t quite as intense. In fact, I can even visualize a future where Postgres JSON features bridge the gap between the need for NoSQL and the importance of standardized data structures. And I can do it without feeling eternally soiled.

I guess Postgres makes anything possible.