PG Phriday: CONFLICT of Interests

MySQL has had a REPLACE INTO syntax to perform “UPSERT” logic since practically the very beginning. For the longest time, users who wanted to switch to Postgres, but for whatever reason relied on this functionality, were essentially trapped. Postgres 9.5 changed all that, but why did it take so long? As with much of Postgres history, it’s a long story.

To really understand where Postgres started, we need to look at the “old” way of handling a row merge. Many in the database world have probably encountered this once or twice:

CREATE OR REPLACE FUNCTION merge_tab(k INT, v TEXT)
RETURNS VOID AS
$$
BEGIN
  LOOP
    -- First, try to update the key.
    UPDATE my_tab SET b = v
     WHERE a = k;
    IF found THEN
      RETURN;
    END IF;
    -- The key doesn't exist, so try to insert it.
    BEGIN
      INSERT INTO my_tab (a, b) VALUES (k, v);
      RETURN;
    EXCEPTION WHEN unique_violation THEN
      -- Nothing here, allow the loop to continue.
    END;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
SELECT merge_tab(1, 'James');
SELECT merge_tab(1, 'Jimmy');

What on Earth is all of that? Oddly enough, the somewhat convoluted logic is not only sound, it’s actually required to avoid a race condition. In the microseconds between attempting our UPDATE and following to the INSERT, some other transaction may have inserted the “missing” key. In that case, we’d encounter a unique constraint violation.

By catching the exception, we’re not immediately kicked out of the function and are presented with a choice. Do we assume our value is “right” and repeat the loop to apply the update, or just exit silently under the assumption that the successful transaction that beat us is probably fine? This particular function takes the former approach, because that’s what a merge or upsert tries to guarantee: that the requested action is applied. Were we to omit the loop, the exception block would ensure there was no conflict or fatal error, but we could no longer rely on the function operating as advertised.

So why not invert the logic and remove the loop entirely? After all, we could just attempt the insert and if it fails, perform the update within the exception block, right? Actually no. Consider what happens if the target key is deleted by a concurrent transaction. Say we try our insert, and in the space of time between the key violation and our update, it gets deleted. Suddenly our update also produces an error. That’s probably an extremely unlikely edge case, but in OLTP databases, the unlikely becomes frighteningly common. So to be safe, we’re stuck with the loop.

Don’t think about it too much

That is a lot of overhead for what many consider basic functionality. Since that’s no longer a concern, let’s take a look at the actual syntax the Postgres team selected. To do that, let’s start with a very basic table with a handful of rows:

CREATE TABLE symbol_mapper (
  vendor_id    BIGINT   NOT NULL,
  ext_mapping  VARCHAR  NOT NULL,
  symbol       VARCHAR  NOT NULL,
  PRIMARY KEY (vendor_id, ext_mapping)
);

INSERT INTO symbol_mapper VALUES (1, 'Google', 'GOOGL');
INSERT INTO symbol_mapper VALUES (1, 'Apple', 'AAPL');
INSERT INTO symbol_mapper VALUES (2, 'goo', 'GOOGL');
INSERT INTO symbol_mapper VALUES (2, 'app', 'AAPL');
ANALYZE symbol_mapper;

The purpose of a mapping table is to fill the role of decoding external names or lookup values to match internal ones. Since each vendor may have its own designation structure, we require a mapping for each. That also protects us in case two vendors use the same identifiers.

So far we have a fairly standard application of tables. Now let’s do something interesting:

INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (2, 'app', 'AAPL')
    ON CONFLICT (vendor_id, ext_mapping) DO NOTHING;

In this particular case, the “app” mapping already exists for vendor 2, so no insert takes place. This is the equivalent of not wrapping our insert/update with a loop. We don’t care what the value is, just as long as something is there. In reality, this is more of a way to remove error output from violations than anything immediately useful.
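
Since SQLite adopted the Postgres upsert syntax in version 3.24, we can sketch these DO NOTHING semantics with nothing but Python’s built-in sqlite3 module. This is an illustration, not Postgres itself; the table simply mirrors symbol_mapper above:

```python
import sqlite3

# Illustration only: SQLite 3.24+ borrowed Postgres' ON CONFLICT syntax,
# so the bundled sqlite3 module can demonstrate DO NOTHING semantics.
conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE symbol_mapper (
        vendor_id   INTEGER NOT NULL,
        ext_mapping TEXT    NOT NULL,
        symbol      TEXT    NOT NULL,
        PRIMARY KEY (vendor_id, ext_mapping)
    )
""")
conn.execute("INSERT INTO symbol_mapper VALUES (2, 'app', 'AAPL')")

# The conflicting insert is silently skipped instead of raising an error.
conn.execute("""
    INSERT INTO symbol_mapper VALUES (2, 'app', 'AAPL-DUP')
    ON CONFLICT (vendor_id, ext_mapping) DO NOTHING
""")

rows = conn.execute("SELECT symbol FROM symbol_mapper").fetchall()
print(rows)  # [('AAPL',)] -- the original mapping survives untouched
```

Without the ON CONFLICT clause, the second insert would have raised an integrity error, just like the unique violation our old plpgsql function had to trap.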

The real fun doesn’t start until we integrate the DO UPDATE functionality. Consider the case where we want to add Samsung as a mapping for a vendor. In this particular case, someone sneaked an existing row into the system, and it contains a typo.

INSERT INTO symbol_mapper VALUES (1, 'Samsung', 'SSLNF');
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Samsung', 'SSNLF')
    ON CONFLICT (vendor_id, ext_mapping)
    DO UPDATE
   SET symbol = 'SSNLF';
SELECT * FROM symbol_mapper;
 vendor_id | ext_mapping | symbol 
-----------+-------------+--------
         1 | Google      | GOOGL
         1 | Apple       | AAPL
         2 | goo         | GOOGL
         2 | app         | AAPL
         1 | Samsung     | SSNLF

What we’ve done here is ensure the newest incoming mapping is the “correct” one; that’s our merge. An observant reader might ask how this is any different from our futile desire to attempt an INSERT with an UPDATE in an exception block. Unlike those two separate statements and the time-consuming exception handling, this is a single atomic action.

Did another session delete the row before us? We’ll just insert it again. Did another transaction delete the row we just inserted or updated? Oh well. The important part is that it is impossible to delete the row while our statement is running. So while the logic is similar to using an exception, the difference is that DO UPDATE is built into the database itself, so it can’t be broken into multiple actions that can be interrupted.

Another interesting bit of syntax is that we can actually incorporate a WHERE clause into the update beyond the implicit assumption that our update affects the same key we tried to insert.

INSERT INTO symbol_mapper VALUES (1, 'Sony', 'SONY');
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Sony', 'SNY')
    ON CONFLICT (vendor_id, ext_mapping)
    DO UPDATE
   SET symbol = 'SNY'
 WHERE symbol_mapper.symbol = 'SONY';
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Sony', 'SNY')
    ON CONFLICT (vendor_id, ext_mapping)
    DO UPDATE
   SET symbol = 'SNY'
 WHERE symbol_mapper.symbol = 'SONY';

In this case, we had a mapping for Sony that needed a correction. The first query affected the row we targeted, and the second did nothing. This is important because if we had not specified that predicate, both updates would have successfully modified the row. And so would all subsequent attempts. Remember Postgres keeps a row version for every update, even if the new and old values are identical. That’s just how MVCC works.

In a loosely built application environment, it isn’t uncommon for several vectors to operate simultaneously. If a dozen of these each upsert the same value, they’ll all be satisfied that their work is complete, and Postgres would be stuck with a dozen duplicate old rows. VACUUM (and autovacuum) will ensure old row versions are recycled, but again, that’s more overhead we don’t need to invoke.
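
To see that deflection in action without a Postgres instance handy, here’s a sketch using SQLite (3.24+), which copies the Postgres upsert syntax. Its `total_changes` counter tallies rows actually written, standing in for the dead row versions Postgres would otherwise accumulate:

```python
import sqlite3

# Illustration via SQLite's Postgres-style upsert: a WHERE clause on
# DO UPDATE prevents writes when nothing would actually change.
conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE symbol_mapper (
        vendor_id   INTEGER NOT NULL,
        ext_mapping TEXT    NOT NULL,
        symbol      TEXT    NOT NULL,
        PRIMARY KEY (vendor_id, ext_mapping)
    )
""")
conn.execute("INSERT INTO symbol_mapper VALUES (1, 'Sony', 'SONY')")

upsert = """
    INSERT INTO symbol_mapper VALUES (1, 'Sony', 'SNY')
    ON CONFLICT (vendor_id, ext_mapping) DO UPDATE
       SET symbol = excluded.symbol
     WHERE symbol != excluded.symbol
"""

conn.execute(upsert)
first = conn.total_changes   # 2: the initial insert plus one real update

conn.execute(upsert)
second = conn.total_changes  # still 2: the predicate deflected the no-op

print(first, second)  # 2 2
```

Every upsert after the first is a clean no-op: no write happens, so there’s no extra row version for VACUUM to clean up later.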

And of course, the WHERE clause isn’t restricted to deflecting repeated update attempts. There may be circumstances where we simply don’t want to apply changes. By specifying the table name, we can introspect into any of the existing table values. What about the values we attempted to insert? Since these were part of an inherent violation, they’re assigned to a record named “excluded”.

Here it is in action:

INSERT INTO symbol_mapper VALUES (1, 'Microsoft', 'msft');
INSERT INTO symbol_mapper (vendor_id, ext_mapping, symbol)
       VALUES (1, 'Microsoft', 'MSFT')
    ON CONFLICT ON CONSTRAINT symbol_mapper_pkey
    DO UPDATE
   SET symbol = 'MSFT'
 WHERE symbol_mapper.symbol != excluded.symbol;

This is a very similar situation as we had with Sony. The mapping for Microsoft needs an update if the existing value doesn’t match the one we’re attempting to insert. Well, we can perform that check explicitly without hard-coding those values into the query multiple times. It’s possible to refer to anything in the VALUES tuple by specifying “excluded”. Handy, eh?

Also notice that we’ve changed our conflict condition. Previously we had simply listed the columns in the primary key, and Postgres inferred the proper constraint from that definition. In this case, we directly stated the constraint that Postgres should use in resolving the conflict. It’s somewhat uncommon (and probably not entirely safe) to directly invoke constraint names, but the option is there in case we want it.

This feature was a long time coming; Postgres 9.5 was released in early 2016. As impossible as it sounds, we’ve only really had a little over a year to leverage ON CONFLICT. As a consequence, it’s still slowly seeping into existing Postgres application stacks. Users are still incorporating it into their workflows. It’ll be a while before it’s taken for granted with the rest of the kitchen sink Postgres offers.

Until then, it’s that special toy we’ve always wanted but couldn’t afford until now. There’s nothing else to do but make up for lost time!


P.S. Before I forget, Postgres Open is getting ready for 2017! If you or someone else wants to attend or even submit a talk, I highly recommend doing so. I’ve been something of a regular fixture there since it started in 2011, and I fully intend to try my hand again. It’s taking place in San Francisco this time around, so the change in venue will definitely be interesting. Apparently it’s all part of the new United States PostgreSQL Association, so it’s good seeing everything pull together behind a united entity. Here’s looking to the future!

PG Phriday: RESTing in the Corn

Last week we explored using Postgres as a central communication nexus between several data sources. At the time, I made a fairly hand-wavy allusion to REST interfaces. Since I hadn’t really explored further, I had assumed PLV8 could use core node.js or other similar libraries to invoke HTTP APIs. Of course as a trusted language, PLV8 isn’t allowed to do that. It’s more of a language for easily manipulating JSON and JSONB objects within Postgres.

Only slightly less violent than JSON

So we need some other way of calling out to an external HTTP resource, wrapping it in column trimmings, and sending the data to Postgres. As demonstrated last week, we could do this directly with a Python (or some other language) function. Alternatively we can use an extension that has access to Python. Then all we would need to do is write a python library, and suddenly we can interact with the REST interface as if it were an actual Postgres table. This is where Multicorn comes in.

One of the easiest ways to install Multicorn is through PGXN. It goes something like this:

sudo pgxn install multicorn

Otherwise, users of Red Hat variants have a yum repo available, while Debian/Ubuntu users can rely on the apt repo. Multicorn is popular enough that it’s included in both of these resources, which is very handy for us.

In any case, we need a basic REST service to poll. In the interest of keeping things simple, let’s use Flask to whip up an API for fetching all the words in a UNIX dictionary file. Since the jsonify method is extremely slow, it’s probably a good idea to cache the output too. The end result might look something like this:

from flask import Flask, jsonify

app = Flask(__name__)

counter = 1
dictionary = []
cached = None

# Read the UNIX dictionary file into a list of row-like dicts.
words = open('/usr/share/dict/words', 'r')
for line in words:
    dictionary.append({'id': counter, 'word': line.strip()})
    counter += 1

@app.route('/')
def index():
    global cached
    if not cached:
        cached = jsonify({'records': dictionary})
    return cached

if __name__ == '__main__':
    app.run(port=9999)

Our REST interface replies to only one URL and accepts no parameters. It’s not exciting, but it does the job of transmitting data to HTTP requests. With this “service” running in the background somewhere, we can continue with our task of turning that output into a Postgres table.

The next thing we need to do is make use of the Multicorn Python libraries. In this arena, Python setuptools are our friend. With them, we can write a Python library that imports Multicorn and it will automatically install everything in a compatible location. Here’s a bare-bones file that might work:

from setuptools import setup

# Only the essentials; the package directory must be named "fdw" so the
# wrapper can be referenced later as 'fdw.DictionaryFDW'.
setup(
    name='fdw',
    version='0.0.1',
    author='Shaun Thomas',
    packages=['fdw'],
)

That takes care of actually installing the project library files. Next we need to actually use the Multicorn API. As a pure foreign data wrapper API, it has a lot more capability than we actually need. By default it can represent direct access to CSV files, filesystem objects, RSS feeds, and a bevy of other cute implementations.

Unfortunately none of these use cases apply to wrapping up /usr/share/dict/words. Multicorn can interact with authentication, pass predicates to help pre-optimize matches, and has several other capabilities we simply won’t leverage here. We just need one table from one file, with very little in the way of parsing.

To that end, we only need to define a single execute method in a class derived from ForeignDataWrapper. With that in place, Python has a very capable protocol request handler we can leverage to actually interact with our REST API. It provides everything we need to get the data and convert the HTTP response from JSON into Python dicts that represent each row of the dictionary contents.

There isn’t much exciting here:

import requests
from multicorn import ForeignDataWrapper
class DictionaryFDW(ForeignDataWrapper):
    def execute(self, quals, columns):
        response = requests.get('http://localhost:9999/')
        data = response.json()
        for row in data['records']:
            yield row

To install it, we merely invoke:

sudo python setup.py install

Everything is now in place. As a Python library, Multicorn has access to other Python tools installed on the system. This means the primary option to the wrapper will be a named Python class. So long as that class is compatible, we will have a functional Postgres extension.

It just so happens that Multicorn implements this at the SERVER level. This makes sense as each server generally has its own defined interaction model. The CSV server would take filenames as parameters for foreign tables, RSS feeds would need the URL for each external resource, etc. Our wrapper is somewhat less… sophisticated. It takes no parameters and offers no advanced functionality. As such, once we define the server that invokes the library, we’re basically done.

Here’s the code that would create our dictionary table and a sample invocation:

CREATE EXTENSION multicorn;

CREATE SERVER multicorn_srv
       FOREIGN DATA WRAPPER multicorn
       OPTIONS (wrapper 'fdw.DictionaryFDW');

CREATE FOREIGN TABLE dictionary (
  id    INT,
  word  VARCHAR
) SERVER multicorn_srv;
\timing ON
SELECT * FROM dictionary
 WHERE id BETWEEN 9001 AND 9005;
  id  |    word    
------+------------
 9001 | Lubbock
 9002 | Lubumbashi
 9003 | Lucas
 9004 | Luce
 9005 | Luce's
(5 rows)
Time: 167.939 ms

It works! Beyond that, notice how slow the results are. Though there are only about 100k rows, even fetching a mere handful takes an exorbitant amount of time. This happens for a few reasons.

First consider that this is an external API call. Each request must trigger an HTTP request, parse the results, and then encode and return each row. While we know what our REST process does, others may not be so forthcoming and are essentially black boxes. At least some of the delay is due to whatever work the REST service performs on its end. Then we have the unfortunate fact that, bless its heart, Python is generally a dog’s breakfast when it comes to performance.


We could fix some of these problems. Multicorn does have the ability to pass along predicates after all. A more advanced library could implement basic regular expression, equality, or other logic to pass along WHERE clauses like these:

EXPLAIN ANALYZE
SELECT * FROM dictionary
 WHERE word LIKE '%house%';

                         QUERY PLAN
------------------------------------------------------------
 Foreign Scan on dictionary
     (cost=20.00..20000000000.00 rows=100000000 width=36)
     (actual time=164.616..326.678 rows=189 loops=1)
   Filter: ((word)::text ~~ '%house%'::text)
   Rows Removed by Filter: 98982
 Planning time: 0.248 ms
 Execution time: 326.896 ms
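
Multicorn hands our execute() method those predicates as a list of qualifier objects. The sketch below is hypothetical: the `Qual` stand-in and `quals_to_params()` helper are inventions for illustration, assuming the conventional field-name/operator/value shape. The idea is simply translating equality quals into REST query parameters instead of fetching everything:

```python
# Hypothetical helper showing predicate pushdown for the dictionary
# wrapper. The Qual stand-in mimics the objects Multicorn passes to
# execute(); treat the exact attribute names as an assumption here.

class Qual(object):
    """Stand-in for multicorn's qualifier objects, illustration only."""
    def __init__(self, field_name, operator, value):
        self.field_name = field_name
        self.operator = operator
        self.value = value

def quals_to_params(quals):
    """Translate '=' qualifiers into URL parameters the API could honor.

    Anything we can't translate is simply ignored; Postgres rechecks
    every predicate against the returned rows anyway, so pushdown is
    purely an optimization, never a correctness requirement.
    """
    params = {}
    for qual in quals:
        if qual.operator == '=':
            params[qual.field_name] = qual.value
    return params

# e.g. SELECT * FROM dictionary WHERE word = 'house';
params = quals_to_params([Qual('word', '=', 'house')])
print(params)  # {'word': 'house'}
```

A smarter execute() could then pass these as query parameters on its HTTP request and let the service return only matching rows.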

While in our case Postgres manually removed all inapplicable matches, a more advanced wrapper could transform these into parameters to the API call itself. In such a case, the REST call might return a much smaller (and faster) row subset. Barring that, there’s another way to cheat the system: materialized views!

We have, after all, already used these to represent regularly updated external snapshots. If we treat the API like a rolling data window, we can just refresh a materialized view with its contents and enjoy direct manipulation of instantiated contents. That means indexes and all the trimmings! A major benefit to this approach over spending all of our time optimizing our wrapper is that we don’t have to duplicate existing Postgres functionality.

We would naturally want to eventually fix the more obvious inefficiencies in our wrapper interface. Yet until that happens—and even afterwards, since performance enhancements are cumulative—we can do something like this:

CREATE EXTENSION pg_trgm;

CREATE MATERIALIZED VIEW mv_dictionary AS
SELECT * FROM dictionary;
ANALYZE mv_dictionary;
CREATE INDEX dictionary_word_trigram
    ON mv_dictionary USING GIST (word gist_trgm_ops);
EXPLAIN ANALYZE
SELECT * FROM mv_dictionary
 WHERE word LIKE '%house%';

                         QUERY PLAN
------------------------------------------------------------
 Bitmap Heap Scan on mv_dictionary
     (cost=3.36..30.78 rows=10 width=13)
     (actual time=6.454..6.628 rows=189 loops=1)
   Recheck Cond: ((word)::text ~~ '%house%'::text)
   Rows Removed by Index Recheck: 1
   Heap Blocks: exact=41
   ->  Bitmap Index Scan on dictionary_word_trigram
           (cost=0.00..3.35 rows=10 width=0)
           (actual time=6.435..6.435 rows=190 loops=1)
         Index Cond: ((word)::text ~~ '%house%'::text)
 Planning time: 0.152 ms
 Execution time: 6.783 ms

One cool trick Postgres has up its sleeve is that it can leverage quirky index types. One of these is based on word trigrams. It’s a lossy format that will inevitably return more matches than we actually want on the first iteration, but remember that Postgres will apply a post-filter to remove excess matches. We can see that in the Recheck Cond line from the execution plan above, which yanked one row that didn’t match our LIKE clause. That’s a small price to pay for reducing a 326ms runtime to 6ms!
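
The lossy nature of trigram matching is easy to demonstrate in a few lines of Python. This toy sketch skips pg_trgm’s padding and indexing entirely and keeps only the core idea: candidates share every trigram of the pattern, then the real predicate filters out the false positives:

```python
# A toy illustration of why trigram matching is lossy. Real pg_trgm pads
# words and uses a GIN/GiST index; this sketch only keeps the idea: find
# candidates sharing every trigram of the pattern, then recheck LIKE.

def trigrams(text):
    """All three-character substrings (pg_trgm's padding omitted)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

words = ['house', 'houses', 'mouse', 'housuse']
pattern = 'house'

# "Index" probe: candidate rows contain every trigram of the pattern...
candidates = [w for w in words if trigrams(pattern) <= trigrams(w)]

# ...but that is lossy: 'housuse' contains hou/ous/use without containing
# 'house', so a recheck of the original LIKE predicate is required.
matches = [w for w in candidates if pattern in w]

print(candidates)  # ['house', 'houses', 'housuse']
print(matches)     # ['house', 'houses']
```

That recheck phase is exactly what the “Rows Removed by Index Recheck” line in an execution plan is reporting.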

Implementing trigram logic in our Python library or REST service would be a huge headache, as would duplicating LIKE handling. Since words are most likely to have inexact and fuzzy searches, we may want to avoid the whole mess. With a materialized view, we can let Postgres do all of the complicated data manipulation while the REST API focuses on bridging the gap to our external resource.

Once again, we’re left with a database that contains none of our own data. Tables we create by wrapping and caching external contents give us a transitive scratch zone and staging area for actual analysis.

In a real-world scenario, we could only hope the REST interface and its corresponding foreign data wrapper component wouldn’t be as lazy as the one presented here. Perhaps a better alternative would be to write a full REST-specific foreign data wrapper in Go. In such a world, feed sources and parameters, column decoding, caching layers, and any number of other optimizations would be efficient and native to web-driven content. One wrapper and server combination could represent any number of foreign tables instead of ours which only works with a single service.

The options, like Postgres, are endless and varied. That’s the world foreign data wrappers give us.

PG Phriday: Stuck in the Middle with Postgres

Earlier this year, I implied Postgres was some kind of super middleware for dragging data out of every external resource it could locate. But that example only used the Postgres foreign data wrapper to contact another Postgres server. Why be so unimaginative? The future is as unlimited as it is terrifying.

Meet the new Postgres mascot

Let’s start with a few prerequisites. We want to see the combined functionality of a few different Postgres capabilities, so let’s try and combine data from MySQL, Postgres, and Javascript using V8. To do that, we need a couple of easily obtained resources. Debian-based systems can do something like this:

sudo apt-get install postgresql-9.6-plv8 postgresql-9.6-mysql-fdw

Otherwise Postgres packages are available on the Postgres site or as some other name in your favorite distribution. All we need for this is some kind of MySQL install, and the MySQL and V8 Postgres extensions. More ambitious readers can try applying these techniques to SQL Server as well.

Now imagine we have a marketing infrastructure with data strewn all over the world, in a vast assortment of incompatible or legacy apps, each with its own portion of the data we want to see. Not an ideal situation really, and one normally solved by ETL to pull all of the data into a single location for perusal. But transformation takes time, and we’re exceptionally lazy. Instead of all that extract, transform, and load action, let’s just query the data directly.

We can get started by creating a basic market MySQL table with two measly rows for demonstration purposes:

CREATE TABLE market_data (
  ext_root   VARCHAR(10) PRIMARY KEY,
  trader     VARCHAR(20) NOT NULL,
  stock      VARCHAR(10) NOT NULL,
  action     CHAR(1) NOT NULL,
  price      NUMERIC(10,4) NOT NULL,
  xact_date  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

INSERT INTO market_data (ext_root, trader, stock, action, price)
       VALUES ('IZ724JJ5:3', 'bsmith', 'GOOG', 'b', 800);
INSERT INTO market_data (ext_root, trader, stock, action, price)
       VALUES ('IZ724JJ5:4', 'bsmith', 'GOOG', 's', 842);

Easy, right? The rows should automatically have a timestamp applied as if they’re part of a rapidly accumulating feed. This data isn’t normalized because that would greatly complicate the data model. For the purposes of this activity, that’s not really necessary. If it helps, imagine this is just a view on top of all of the relevant tables and flattens the data to something an external resource can easily consume.

The next thing we need is some data in a Postgres database, because why not? Consider this on some server in a far-away land filled with unicorns and butterflies cavorting with innocent children in a field of daisies. It’s definitely not in our datacenter!

CREATE TABLE trade_calc (
  ext_root  VARCHAR(10) PRIMARY KEY,
  stdev     NUMERIC NOT NULL,
  mean      NUMERIC NOT NULL,
  delta     NUMERIC NOT NULL
);

INSERT INTO trade_calc VALUES ('IZ724JJ5:3', 13.7, 825, 11.3);
INSERT INTO trade_calc VALUES ('IZ724JJ5:4', 8.5, 832, 1.5);

In this case, the table contains supplementary information regarding the trade. Maybe an application is generating various calculations derived from the trade and storing it in a local resource for later consumption. Regardless, we’ve retained the external root identifier used to tie the rows together, which is a common feature of market data.

The next step is to introduce some magical binding substance of some kind.

Why not?

We’ll be installing all of these wonderful resources in a Postgres database dedicated specifically for that purpose. One with user accounts and grants in a centralized location where everyone and his pet fish can access and aggregate data stored there. The MySQL portion of our tentacle would look and function something like this:

CREATE EXTENSION mysql_fdw;

CREATE SERVER mysql_market
  FOREIGN DATA WRAPPER mysql_fdw
  OPTIONS (host 'localhost');

CREATE USER MAPPING FOR postgres
  SERVER mysql_market
  OPTIONS (username 'root');

CREATE SCHEMA mysql;

-- Assuming the source MySQL database is named "market".
IMPORT FOREIGN SCHEMA market
  FROM SERVER mysql_market
  INTO mysql;
SELECT * FROM mysql.market_data;
  ext_root  | trader | stock | action |  price   |      xact_date      
------------+--------+-------+--------+----------+---------------------
 IZ724JJ5:3 | bsmith | GOOG  | b      | 800.0000 | 2017-03-10 15:09:32
 IZ724JJ5:4 | bsmith | GOOG  | s      | 842.0000 | 2017-03-10 15:09:45

One foreign MySQL resource allocated and tested. Check!

Without a readily available node.js or REST interface to contact, we can simulate polling some kind of JSON resource with the PLV8 language. This a good way to demonstrate tying elements to Postgres using something other than SQL. It also gives us a way to grab data for situations where no foreign data wrapper exists and we’re not confident enough to create one.

In this case, we’re calling a V8 function which gets the data through some dubious means and sends it back to us:

CREATE EXTENSION plv8;

CREATE SCHEMA js;
SET search_path TO js;

CREATE TYPE clearing_data AS (
  ext_root   VARCHAR(10),
  market     VARCHAR(10),
  bank       VARCHAR(10),
  is_funded  BOOLEAN
);

CREATE OR REPLACE FUNCTION get_clearing_info()
RETURNS SETOF clearing_data AS
$$
  return [
    {"ext_root": "IZ724JJ5:3", "market": "NASDAQ",
     "bank": "Chase", "is_funded": false},
    {"ext_root": "IZ724JJ5:4", "market": "NASDAQ",
     "bank": "Citi", "is_funded": true}
  ];
$$ LANGUAGE plv8;
CREATE VIEW all_clearing_data AS
SELECT * FROM get_clearing_info();
SELECT * FROM js.all_clearing_data;
  ext_root  | market | bank  | is_funded 
------------+--------+-------+-----------
 IZ724JJ5:3 | NASDAQ | Chase | f
 IZ724JJ5:4 | NASDAQ | Citi  | t

Success! Hard-coded data isn’t exactly robust, but demos are open season, and hey look over there! In a real scenario, we’d probably have a function that accepted parameters and performed a pull from some external resource. The view is just a way of hiding the function call for users who might have their head explode to know they can select rows from a function.

Lastly we need to contact that Postgres server out in the magical field of unicorns, butterflies, and candy canes. What, I didn’t mention candy last time? Well there’s candy now. Deal with it.

CREATE EXTENSION postgres_fdw;

CREATE SERVER pgsql_market
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (dbname 'market', host 'localhost');

CREATE USER MAPPING FOR postgres
  SERVER pgsql_market
  OPTIONS (user 'postgres');

CREATE SCHEMA pgsql;

-- Assuming trade_calc lives in the remote "public" schema.
IMPORT FOREIGN SCHEMA public
  FROM SERVER pgsql_market
  INTO pgsql;
SELECT * FROM pgsql.trade_calc;
  ext_root  | stdev | mean | delta 
------------+-------+------+-------
 IZ724JJ5:3 |  13.7 |  825 |  11.3
 IZ724JJ5:4 |   8.5 |  832 |   1.5

Essentially that was the same process we used with MySQL, except this time we used the Postgres FDW instead. Either way, it works and returns data exactly as expected.

The final step is to slap on the duct tape and invoke everything at once:

SELECT m.*, t.*
  FROM mysql.market_data m
  JOIN pgsql.trade_calc t USING (ext_root)
  JOIN js.all_clearing_data c USING (ext_root);
  ext_root  | trader | stock | action |  price   |      xact_date      |  ext_root  | stdev | mean | delta 
------------+--------+-------+--------+----------+---------------------+------------+-------+------+-------
 IZ724JJ5:3 | bsmith | GOOG  | b      | 800.0000 | 2017-03-10 15:09:32 | IZ724JJ5:3 |  13.7 |  825 |  11.3
 IZ724JJ5:4 | bsmith | GOOG  | s      | 842.0000 | 2017-03-10 15:09:45 | IZ724JJ5:4 |   8.5 |  832 |   1.5

The real power here is in combining this with the external resource gathering techniques we explored in the previously mentioned article. Is the external REST call way too slow? Slap on a materialized view and index the crap out of it. Have a super stable API for an internally developed resource? Write a foreign data wrapper so Postgres can invoke it directly. Then add a materialized view for good measure. Add a refresh schedule for any resources that adversely affect performance for direct inquiries.

What we end up with is a container database that represents 0% of its own data. It’s a resource we can truncate at will, may crash with no consequences, and might exist anywhere since it’s nothing but an interface layer. A Postgres instance designed this way becomes a SQL API to access assorted data feeds through a standardized execution model. Heck, everyone can have their very own.

Were we to attempt this with our favorite language or application stack, we would probably need to write our own data joining techniques, implement set theory, incorporate hashing and sorting algorithms, and so on. In the end, we’d have written a tiny database engine to handle data combination and aggregation, without most capabilities of even a mediocre out-of-the-box solution. In this scenario, Postgres bridges that gap for us.
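
As a taste of what that entails, even the simplest piece — joining two row sets — means hand-rolling something like a hash join, which Postgres performs for us (with costing, spilling, and parallelism) on every query. A minimal sketch, with sample rows echoing the market data above:

```python
# A bare-bones hash join on ext_root, illustrating what "writing our own
# data joining techniques" actually means. Sample rows mirror the
# market_data and trade_calc tables from the article.

market_data = [
    {'ext_root': 'IZ724JJ5:3', 'trader': 'bsmith', 'price': 800},
    {'ext_root': 'IZ724JJ5:4', 'trader': 'bsmith', 'price': 842},
]
trade_calc = [
    {'ext_root': 'IZ724JJ5:3', 'stdev': 13.7, 'mean': 825},
    {'ext_root': 'IZ724JJ5:4', 'stdev': 8.5, 'mean': 832},
]

def hash_join(left, right, key):
    # Build phase: hash one relation on the join key.
    build = {row[key]: row for row in right}
    # Probe phase: stream the other relation, emitting merged rows.
    for row in left:
        match = build.get(row[key])
        if match is not None:
            merged = dict(row)
            merged.update(match)
            yield merged

joined = list(hash_join(market_data, trade_calc, 'ext_root'))
print(joined[0]['mean'])  # 825
```

And that’s before handling outer joins, duplicate keys, sorting, aggregation, or data sets that don’t fit in memory.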

If Postgres has already done all the hard work, why replicate it? And isn’t that what being lazy is all about?

No coding, only nap

PG Phriday: Through the Window

Now that we know how Postgres window functions work, why not play with them a bit to get a better understanding of their capabilities? So long as we understand window functions are applied after data gathering and aggregation steps, much of their mystery and complexity is defanged. Let’s start actually using them for stuff!

Captain Murphy is tired of your nonsense

(Note: I’m a bit under the weather today, so this Phriday will probably be a bit truncated and potentially incoherent thanks to the drugs. Apologies in advance.)

Let’s start off with the same example data as last week with one tiny alteration. To properly illustrate some of these concepts, we need some actual data variance, so we’ll be using random numbers for readings instead of modulo math.

CREATE TABLE sensor_log (
  id            SERIAL PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);

INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, round(random() * 100),
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10) || 'm')::INTERVAL
  FROM generate_series(1, 5000) s(id);
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
ANALYZE sensor_log;

First things first: ordering matters. Window functions are either applied over the whole data-set or some partition of it. They’re also cumulative in nature, meaning we can change the aggregation results by altering the window itself. To better see what this means, consider these two sums:

SELECT location, reading, SUM(reading) OVER ()
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 ORDER BY location
 LIMIT 10;
 location | reading | SUM  
----------+---------+------
 1        |      12 | 7058
 10       |      26 | 7058
 100      |      98 | 7058
 101      |      99 | 7058
 102      |      46 | 7058
 103      |      84 | 7058
 104      |      60 | 7058
 105      |      35 | 7058
 106      |      58 | 7058
 107      |       6 | 7058
SELECT location, reading, SUM(reading) OVER (ORDER BY location)
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 ORDER BY location
 LIMIT 10;
 location | reading | SUM 
----------+---------+-----
 1        |      12 |  12
 10       |      26 |  38
 100      |      98 | 136
 101      |      99 | 235
 102      |      46 | 281
 103      |      84 | 365
 104      |      60 | 425
 105      |      35 | 460
 106      |      58 | 518
 107      |       6 | 524

Interestingly, we actually learned two things here. First, window functions ignore LIMIT clauses. That sum reflects every reading in the table for today. Second is that ordering by the sensor location resulted in a cumulative total for the sum for each row. This applies to all of the available window functions. A maximum would reflect the current maximum for each row until a higher value replaces it. An average is a cumulative average as values are processed, and so on.
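
The cumulative behavior is easy to mirror outside the database. Using the first five readings from the output above, Python’s itertools.accumulate produces exactly the running values an ordered window emits:

```python
from itertools import accumulate

# Mirror of SUM(reading) OVER (ORDER BY location): each row sees the
# aggregate over every row up to and including itself. Readings are the
# first five values from the query output above.
readings = [12, 26, 98, 99, 46]

running_sum = list(accumulate(readings))       # like SUM(...) OVER (ORDER BY ...)
running_max = list(accumulate(readings, max))  # like MAX(...) OVER (ORDER BY ...)

print(running_sum)  # [12, 38, 136, 235, 281]
print(running_max)  # [12, 26, 98, 99, 99]
```

Note the running maximum only advances when a larger value arrives, exactly as described for the window case.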

This isn’t particularly useful for sums, but what if we wanted to watch our data converge upon the overall average?

SELECT location, reading,
       round(avg(reading) OVER (ORDER BY location), 2) AS running_avg,
       round(reading - 
           avg(reading) OVER (ORDER BY location), 2) AS variance
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 ORDER BY location
 LIMIT 10;
 location | reading | running_avg | variance 
----------+---------+-------------+----------
 1        |      12 |       12.00 |     0.00
 10       |      26 |       19.00 |     7.00
 100      |      98 |       45.33 |    52.67
 101      |      99 |       58.75 |    40.25
 102      |      46 |       56.20 |   -10.20
 103      |      84 |       60.83 |    23.17
 104      |      60 |       60.71 |    -0.71
 105      |      35 |       57.50 |   -22.50
 106      |      58 |       57.56 |     0.44
 107      |       6 |       52.40 |   -46.40

If we monitored those kinds of results second by second, we could flag any kind of appreciable jitter from the average as currently represented by the data. It may be completely expected for values to increase through the day, so the total average for all data is meaningless to us. Ordering allows us to control the aggregate’s construction in a way that isn’t really possible using other methods, barring some kind of recursive CTE magic.

Postgres’ new slogan

With ordering out of the way, there’s the small detail of data partitioning as well. The manual goes into much more detail about this, but it’s just a fancy way of saying GROUP BY within the window itself.

SELECT location, reading,
       SUM(reading) OVER (PARTITION BY location) AS all_readings,
       rank() OVER (PARTITION BY location ORDER BY reading)
  FROM sensor_log
 WHERE location::INT <= 10
 LIMIT 10;
 location | reading | all_readings | rank 
 0        |      32 |          260 |    1
 0        |      35 |          260 |    2
 0        |      37 |          260 |    3
 0        |      71 |          260 |    4
 0        |      85 |          260 |    5
 1        |      12 |          173 |    1
 1        |      14 |          173 |    2
 1        |      17 |          173 |    3
 1        |      44 |          173 |    4
 1        |      86 |          173 |    5

Again, we can learn a few different things from these results. First is that the window results are restricted to the partition we declared. We set the partition to limit sums, ranks, and other window functions to the domain of the location. Postgres will apply window aggregates specifically to each location as it appears in the query output.

Next consider that we partitioned the sum, but did not order it. In case it wasn’t obvious already, this tells us that PARTITION and ORDER BY clauses are independent, as are the window definitions. In this case, we want the reading total for each location, followed by the rank of each reading within that group. If we had ordered the sum window, we would have gotten a running total instead of an overall value. On the other hand, we want the rank to behave in an ordered manner.

Separate windows, separate effects. Of course, we may not actually want that to happen. If we end up using the same window over and over again, it doesn’t make sense to declare it for each column. Thankfully Postgres has a shorthand for that:

SELECT location, reading,
       SUM(reading) OVER (PARTITION BY location) AS all_readings,
       SUM(reading) OVER locs AS running_total,
       rank() OVER locs
  FROM sensor_log
 WHERE location::INT <= 10
WINDOW locs AS (PARTITION BY location ORDER BY reading)
 LIMIT 10;
 location | reading | all_readings | running_total | rank 
 0        |      32 |          260 |            32 |    1
 0        |      35 |          260 |            67 |    2
 0        |      37 |          260 |           104 |    3
 0        |      71 |          260 |           175 |    4
 0        |      85 |          260 |           260 |    5
 1        |      12 |          173 |            12 |    1
 1        |      14 |          173 |            26 |    2
 1        |      17 |          173 |            43 |    3
 1        |      44 |          173 |            87 |    4
 1        |      86 |          173 |           173 |    5

Perfect! Not only did we use the same window twice, but we were able to mix it with another direct declaration within the same query. That isn’t exactly ground-breaking territory, but it’s reassuring when things work as expected.

Keep in mind that the same window function limitations apply to this syntax as well. Since windows are evaluated after WHERE, GROUP BY, or HAVING clauses, the WINDOW portion of a query must appear after them as well.
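As a quick skeleton of that clause ordering (the window name here is just a placeholder):

```sql
-- WHERE comes first, then WINDOW, then ORDER BY and LIMIT.
SELECT location, sum(reading) OVER w AS location_total
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
WINDOW w AS (PARTITION BY location)
 ORDER BY location
 LIMIT 10;
```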

Now go forth and leverage window functions to produce reports without awkward subqueries and CTEs muddying up the works! The only better way to use a window is to cool pies.

Mmmmm… pie!

PG Phriday: In the Window

I’ll be the first to admit that I found Postgres window functions fantastically confusing when I first encountered them. They’re a powerful and versatile tool for building reports and summaries, but that functionality hides behind a fairly steep learning curve. One of the ways to combat their inherent complexity is to fully explore how they work, instead of just trying to wrap our heads around the expected results.

Window doggies have gotten decidedly smug

To that end, let’s set up a quick set of data in the customary fashion:

CREATE TABLE sensor_log (
  id            SERIAL PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);
 
INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, s.id % 100,
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10) || 'm')::INTERVAL
  FROM generate_series(1, 5000) s(id);
 
CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);
ANALYZE sensor_log;

Yes, it’s the trusty sensor_log table once again. This time around, we only really care about demonstration in lieu of volume, so we’ve elected for five thousand rows in place of the usual five million. Our data represents one thousand sensors sequentially taking readings every minute. This should provide enough overlap to easily demonstrate what’s going on behind the scenes.

Let’s start with probably the easiest window function of the lot: row_number. All it does is number rows in the result set so we have a kind of counter that’s useful in a lot of different contexts. How do the first ten rows for today look?

SELECT location, reading, ROW_NUMBER() OVER ()
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 LIMIT 10;
 location | reading | ROW_NUMBER 
 143      |      43 |          1
 142      |      42 |          2
 141      |      41 |          3
 140      |      40 |          4
 139      |      39 |          5
 138      |      38 |          6
 137      |      37 |          7
 136      |      36 |          8
 135      |      35 |          9
 134      |      34 |         10

Window functions must be called on some kind of data window. An empty set of () represents the entire data set, with no ordering, groups, or other shenanigans involved. We’re just numbering the results, and the output would have been no different if we removed the window function. This kind of use is very similar to Oracle’s ROWNUM pseudo-column.

Yet a row number by itself isn’t that interesting. Let’s number the rows in order of sensor reading, and fetch the first ten rows of those results:

SELECT location, reading, ROW_NUMBER() OVER (ORDER BY reading)
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 LIMIT 10;
 location | reading | ROW_NUMBER 
 100      |       0 |          1
 101      |       1 |          2
 1        |       1 |          3
 102      |       2 |          4
 2        |       2 |          5
 103      |       3 |          6
 3        |       3 |          7
 4        |       4 |          8
 104      |       4 |          9
 105      |       5 |         10

If () was the window representing all rows, then (ORDER BY reading) is that same content after being sorted by the reading column. Not only did Postgres sort our results, but it numbered them in the post-sorted order. This is a very fine distinction! Consider what happens when we move the ORDER BY clause into the query proper.

SELECT location, reading, ROW_NUMBER() OVER ()
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 ORDER BY reading
 LIMIT 10;
 location | reading | ROW_NUMBER 
 100      |       0 |         44
 1        |       1 |        143
 101      |       1 |         43
 2        |       2 |        142
 102      |       2 |         42
 3        |       3 |        141
 103      |       3 |         41
 104      |       4 |         40
 4        |       4 |        140
 105      |       5 |         39

What the heck happened here? The data looks exactly the same, but the artificial row numbers are seemingly arbitrary. Indeed they are! By design. This is part of the reason window functions are so difficult to explain and comprehend. The fact of the matter is that each window is a virtual and separate manifestation of the plain query results.

Anyone who has struggled with pointers in C or C++ knows that abstracted structures introduce certain pitfalls into obtaining desired results.

Window functions and you!

To help unravel the mystery a bit, let’s look at the natural state of the results without any window function nonsense. Of course we must also shift the data by 40 rows so we can see some of the same information the window received.

SELECT location, reading
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
OFFSET 40 LIMIT 10;
 location | reading 
 103      |       3
 102      |       2
 101      |       1
 100      |       0
 99       |      99
 98       |      98
 97       |      97
 96       |      96
 95       |      95
 94       |      94

This output represents the table rows as sorted by the index we created on reading_date. Since these results are not artificially sorted in any way, this is what the window function is actually seeing without its own specific sort operation. We shifted the results by 40 rows and as expected, row 44 has the value of 0. The window function gave us exactly what we requested, but it numbered the rows before Postgres sorted them.

We can actually watch this in action by looking at the two query plans:

EXPLAIN
SELECT location, reading, ROW_NUMBER() OVER (ORDER BY reading)
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 LIMIT 10;
                              QUERY PLAN
 LIMIT  (cost=15.91..16.08 ROWS=10 width=11)
   ->  WindowAgg  (cost=15.91..18.41 ROWS=143 width=11)
         ->  Sort  (cost=15.91..16.27 ROWS=143 width=11)
               Sort KEY: reading
               ->  INDEX Scan USING idx_sensor_log_date ON sensor_log
                     (cost=0.29..10.79 ROWS=143 width=11)
                   INDEX Cond: (reading_date > ('now'::cstring)::DATE)
EXPLAIN
SELECT location, reading, ROW_NUMBER() OVER ()
  FROM sensor_log
 WHERE reading_date > CURRENT_DATE
 ORDER BY reading
 LIMIT 10;
                              QUERY PLAN
 LIMIT  (cost=15.67..15.69 ROWS=10 width=11)
   ->  Sort  (cost=15.67..16.03 ROWS=143 width=11)
         Sort KEY: reading
         ->  WindowAgg  (cost=0.29..12.58 ROWS=143 width=11)
               ->  INDEX Scan USING idx_sensor_log_date ON sensor_log
                     (cost=0.29..10.79 ROWS=143 width=11)
                   INDEX Cond: (reading_date > ('now'::cstring)::DATE)

Note that the WindowAgg step occurs at the end of query execution in the first example, taking place directly after a sort operation. The second query sorts after the WindowAgg, indicating the window only has access to unsorted rows. The key detail is that window functions only have access to rows within the query as if it had executed without them. This also happens after other aggregates are applied, meaning it’s a bad idea (or even impossible) to mix regular aggregates with window functions.

The easiest way to comprehend how a window function works is to run the query without them. That’s the data the window has access to, regardless of how we slice and dice them within the window itself. It also explains why we’re unable to refer to window function elements in other query clauses. They’re unusable in predicates and we can’t leverage their calculations to group or limit results.
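For example, attempting to filter on a window function directly fails outright; Postgres rejects it before the query ever runs (the exact error text may vary by version, but it is along the lines of "window functions are not allowed in WHERE"):

```sql
-- This does NOT work: window functions cannot appear in WHERE.
SELECT location, reading
  FROM sensor_log
 WHERE ROW_NUMBER() OVER (ORDER BY reading) <= 10;
```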

So imagine for a moment we don’t have the standard Postgres LIMIT clause. This is how we would snag the top ten results of our location readings:

SELECT *
  FROM (SELECT location, reading,
               ROW_NUMBER() OVER (ORDER BY reading)
          FROM sensor_log
         WHERE reading_date > CURRENT_DATE
       ) sub
 WHERE ROW_NUMBER <= 10;
 location | reading | ROW_NUMBER 
 100      |       0 |          1
 101      |       1 |          2
 1        |       1 |          3
 102      |       2 |          4
 2        |       2 |          5
 103      |       3 |          6
 3        |       3 |          7
 4        |       4 |          8
 104      |       4 |          9
 105      |       5 |         10

This is clearly silly when using row_number, but the trick works the same with other window functions. Here’s how we’d obtain the 10th ranked readings for today’s data:

SELECT *
  FROM (SELECT location, reading,
               dense_rank() OVER (ORDER BY reading)
          FROM sensor_log
         WHERE reading_date > CURRENT_DATE
       ) sub
 WHERE dense_rank = 10;
 location | reading | dense_rank 
 109      |       9 |         10
 9        |       9 |         10

Keep in mind that Postgres must fetch the full results internally to materialize them for the window functions. This is true whether there are 143 rows as with our example, or 143-million.

The more advanced use cases for window functions are a topic for another day. Consider this a very high-level introduction to how they work and their inherent limitations instead of a comprehensive guide. There’s a lot of material here that deserves closer inspection, so there’s no need to rush. Either way, don’t let window functions confuse you more than necessary. Like any independent agent, you just need to know what they’re doing behind the scenes.

Like this, but with slightly fewer bombs