PG Phriday: Through the Window

Now that we know how Postgres window functions work, why not play with them a bit to get a better understanding of their capabilities? So long as we understand window functions are applied after data gathering and aggregation steps, much of their mystery and complexity is defanged. Let’s start actually using them for stuff!

Captain Murphy is tired of your nonsense

(Note: I’m a bit under the weather today, so this Phriday will probably be a bit truncated and potentially incoherent thanks to the drugs. Apologies in advance.)

Let’s start off with the same example data as last week, with one tiny alteration. To properly illustrate some of these concepts, we need some actual data variance, so we’ll be using random numbers for the readings instead of modulo math.

CREATE TABLE sensor_log (
  id            SERIAL PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);

INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, round(random() * 100),
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10) || 'm')::INTERVAL
  FROM generate_series(1, 5000) s(id);

CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

ANALYZE sensor_log;

First things first: ordering matters. Window functions are applied either over the whole data set or some partition of it. When ordered, they’re also cumulative in nature, meaning we can change the aggregation results by altering the window itself. To better see what this means, consider these two sums:

SELECT location, reading, sum(reading) OVER ()
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 ORDER BY location
 LIMIT 10;

 location | reading | sum  
----------+---------+------
 1        |      12 | 7058
 10       |      26 | 7058
 100      |      98 | 7058
 101      |      99 | 7058
 102      |      46 | 7058
 103      |      84 | 7058
 104      |      60 | 7058
 105      |      35 | 7058
 106      |      58 | 7058
 107      |       6 | 7058

SELECT location, reading, sum(reading) OVER (ORDER BY location)
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 ORDER BY location
 LIMIT 10;

 location | reading | sum 
----------+---------+-----
 1        |      12 |  12
 10       |      26 |  38
 100      |      98 | 136
 101      |      99 | 235
 102      |      46 | 281
 103      |      84 | 365
 104      |      60 | 425
 105      |      35 | 460
 106      |      58 | 518
 107      |       6 | 524

Interestingly, we actually learned two things here. First, window functions ignore LIMIT clauses: that sum reflects every reading in the table for today, not just the ten rows we fetched. Second, ordering by the sensor location turned the sum into a running total, recalculated at each row. This applies to all of the available window aggregates. A maximum would reflect the current maximum at each row until a higher value replaces it, an average becomes a cumulative average as values are processed, and so on.
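As a quick sketch of that running maximum against the same data, the max column holds steady at each row until a larger reading appears:

SELECT location, reading,
       max(reading) OVER (ORDER BY location) AS running_max
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 ORDER BY location
 LIMIT 10;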

This isn’t particularly useful for sums, but what if we wanted to watch our data converge upon the overall average?

SELECT location, reading,
       round(avg(reading) OVER (ORDER BY location), 2) AS running_avg,
       round(reading - 
           avg(reading) OVER (ORDER BY location), 2) AS variance
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 ORDER BY location
 LIMIT 10;

 location | reading | running_avg | variance 
----------+---------+-------------+----------
 1        |      12 |       12.00 |     0.00
 10       |      26 |       19.00 |     7.00
 100      |      98 |       45.33 |    52.67
 101      |      99 |       58.75 |    40.25
 102      |      46 |       56.20 |   -10.20
 103      |      84 |       60.83 |    23.17
 104      |      60 |       60.71 |    -0.71
 105      |      35 |       57.50 |   -22.50
 106      |      58 |       57.56 |     0.44
 107      |       6 |       52.40 |   -46.40

If we monitored those kinds of results second by second, we could flag any appreciable jitter from the average as currently represented by the data. It may be completely expected for values to increase through the day, so the total average for all data is meaningless to us. Ordering allows us to control the aggregate’s construction in a way that isn’t really possible using other methods, barring some kind of recursive CTE magic.
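Here’s a rough sketch of that monitoring idea; the 40-point cutoff is an arbitrary threshold chosen purely for illustration. Since window results can’t be referenced in a WHERE clause directly, we filter them in an outer query:

-- Window output isn't visible to WHERE, so wrap it in a subquery first.
SELECT location, reading, jitter
  FROM (SELECT location, reading,
               round(reading -
                   avg(reading) OVER (ORDER BY location), 2) AS jitter
          FROM sensor_log
         WHERE reading_date >= CURRENT_DATE) deltas
 WHERE abs(jitter) > 40
 ORDER BY location;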

Postgres’ new slogan

With ordering out of the way, there’s the small detail of data partitioning as well. The manual goes into much more detail about this, but it’s just a fancy way of saying GROUP BY within the window itself.

SELECT location, reading,
       sum(reading) OVER (PARTITION BY location) AS all_readings,
       rank() OVER (PARTITION BY location ORDER BY reading)
  FROM sensor_log
 WHERE location::INT <= 10
 LIMIT 10;

 location | reading | all_readings | rank 
----------+---------+--------------+------
 0        |      32 |          260 |    1
 0        |      35 |          260 |    2
 0        |      37 |          260 |    3
 0        |      71 |          260 |    4
 0        |      85 |          260 |    5
 1        |      12 |          173 |    1
 1        |      14 |          173 |    2
 1        |      17 |          173 |    3
 1        |      44 |          173 |    4
 1        |      86 |          173 |    5

Again, we can learn a few different things from these results. First, the window results are restricted to the partition we declared. We set the partition to limit sums, ranks, and other window functions to the domain of each location. Postgres will apply window aggregates specifically to each location as it appears in the query output.

Next consider that we partitioned the sum, but did not order it. In case it wasn’t obvious already, this tells us that PARTITION and ORDER BY clauses are independent, as are the window definitions. In this case, we want the reading total for each location, followed by the rank of each reading within that group. If we had ordered the sum window, we would have gotten a running total instead of an overall value. On the other hand, we want the rank to behave in an ordered manner.

Separate windows, separate effects. Of course, we may not actually want that to happen. If we end up using the same window over and over again, it doesn’t make sense to declare it for each column. Thankfully Postgres has a shorthand for that:

SELECT location, reading,
       sum(reading) OVER (PARTITION BY location) AS all_readings,
       sum(reading) OVER locs AS running_total,
       rank() OVER locs
  FROM sensor_log
 WHERE location::INT <= 10
WINDOW locs AS (PARTITION BY location ORDER BY reading)
 LIMIT 10;

 location | reading | all_readings | running_total | rank 
----------+---------+--------------+---------------+------
 0        |      32 |          260 |            32 |    1
 0        |      35 |          260 |            67 |    2
 0        |      37 |          260 |           104 |    3
 0        |      71 |          260 |           175 |    4
 0        |      85 |          260 |           260 |    5
 1        |      12 |          173 |            12 |    1
 1        |      14 |          173 |            26 |    2
 1        |      17 |          173 |            43 |    3
 1        |      44 |          173 |            87 |    4
 1        |      86 |          173 |           173 |    5

Perfect! Not only did we use the same window twice, but we were able to mix it with another direct declaration within the same query. That isn’t exactly ground-breaking territory, but it’s reassuring when things work as expected.

Keep in mind that the usual window function limitations apply to this syntax, too. Since windows are evaluated after WHERE, GROUP BY, and HAVING clauses, the WINDOW portion of a query must appear after them as well.
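To make that ordering concrete, here’s a skeletal (and admittedly contrived) query that exercises every clause; note where the WINDOW definition sits, after HAVING but before ORDER BY and LIMIT:

SELECT location, sum(reading) AS total,
       rank() OVER totals AS heat_rank
  FROM sensor_log
 GROUP BY location
HAVING count(*) > 1
WINDOW totals AS (ORDER BY sum(reading) DESC)
 ORDER BY location
 LIMIT 10;

Aggregates like sum(reading) are fair game inside the window definition itself, since the grouping step has already finished by the time the window is evaluated.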

Now go forth and leverage window functions to produce reports without awkward subqueries and CTEs muddying up the works! The only better way to use a window is to cool pies.

Mmmmm… pie!