PG Phriday: Through the Window
Now that we know how Postgres window functions work, why not play with them a bit to get a better understanding of their capabilities? So long as we understand window functions are applied after data gathering and aggregation steps, much of their mystery and complexity is defanged. Let’s start actually using them for stuff!
(Note: I’m a bit under the weather today, so this Phriday will probably be a bit truncated and potentially incoherent thanks to the drugs. Apologies in advance.)
Let’s start off with the same example data as last week with one tiny alteration. To properly illustrate some of these concepts, we need some actual data variance, so we’ll be using random numbers for readings instead of modulo math.
CREATE TABLE sensor_log (
  id            SERIAL PRIMARY KEY,
  location      VARCHAR NOT NULL,
  reading       BIGINT NOT NULL,
  reading_date  TIMESTAMP NOT NULL
);

INSERT INTO sensor_log (location, reading, reading_date)
SELECT s.id % 1000, round(random() * 100),
       CURRENT_DATE + INTERVAL '1d' - ((s.id * 10) || 'm')::INTERVAL
  FROM generate_series(1, 5000) s(id);

CREATE INDEX idx_sensor_log_date ON sensor_log (reading_date);

ANALYZE sensor_log;
First things first: ordering matters. Window functions are applied either over the whole data set or over some partition of it. When the window is ordered, they're also cumulative in nature, meaning we can change the aggregation results by altering the window itself. To better see what this means, consider these two sums:
SELECT location, reading, sum(reading) OVER ()
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 ORDER BY location
 LIMIT 10;
location | reading | sum
----------+---------+------
1 | 12 | 7058
10 | 26 | 7058
100 | 98 | 7058
101 | 99 | 7058
102 | 46 | 7058
103 | 84 | 7058
104 | 60 | 7058
105 | 35 | 7058
106 | 58 | 7058
107 | 6 | 7058
SELECT location, reading, sum(reading) OVER (ORDER BY location)
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 ORDER BY location
 LIMIT 10;
location | reading | sum
----------+---------+-----
1 | 12 | 12
10 | 26 | 38
100 | 98 | 136
101 | 99 | 235
102 | 46 | 281
103 | 84 | 365
104 | 60 | 425
105 | 35 | 460
106 | 58 | 518
107 | 6 | 524
Interestingly, we actually learned two things here. First, window functions effectively ignore LIMIT clauses; they're computed before the LIMIT is applied, so that sum reflects every reading in the table for today. Second, ordering by the sensor location turned the sum into a cumulative total for each row. This applies to all of the available window functions: a maximum would reflect the current maximum for each row until a higher value replaces it, an average is a cumulative average as values are processed, and so on.
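For instance, here's a quick sketch of the running maximum described above; the running_max column only changes when a row introduces a new high:

-- With an ordered window, max() reports the highest reading seen so far.
SELECT location, reading,
       max(reading) OVER (ORDER BY location) AS running_max
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 ORDER BY location
 LIMIT 10;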
Running values aren’t particularly useful for sums, but what if we wanted to watch our data converge upon the overall average?
SELECT location, reading,
       round(avg(reading) OVER (ORDER BY location), 2) AS running_avg,
       round(reading - avg(reading) OVER (ORDER BY location), 2) AS variance
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 ORDER BY location
 LIMIT 10;
location | reading | running_avg | variance
----------+---------+-------------+----------
1 | 12 | 12.00 | 0.00
10 | 26 | 19.00 | 7.00
100 | 98 | 45.33 | 52.67
101 | 99 | 58.75 | 40.25
102 | 46 | 56.20 | -10.20
103 | 84 | 60.83 | 23.17
104 | 60 | 60.71 | -0.71
105 | 35 | 57.50 | -22.50
106 | 58 | 57.56 | 0.44
107 | 6 | 52.40 | -46.40
If we monitored those kinds of results second by second, we could flag any appreciable jitter from the average as the data currently represents it. It may be completely expected for values to increase through the day, so the overall average for all data is meaningless to us. Ordering allows us to control the aggregate’s construction in a way that isn’t really possible using other methods, barring some kind of recursive CTE magic.
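As a minimal sketch of that kind of check, we could wrap the windowed query in a subquery and filter on the variance, since window functions can’t appear directly in a WHERE clause. Here the window is ordered by reading_date instead, since we’re watching readings arrive over time, and the threshold of 40 is entirely arbitrary:

-- Flag readings that stray more than 40 points from the running average.
SELECT *
  FROM (SELECT location, reading, reading_date,
               round(reading - avg(reading)
                 OVER (ORDER BY reading_date), 2) AS variance
          FROM sensor_log
         WHERE reading_date >= CURRENT_DATE) jitter
 WHERE abs(variance) > 40
 ORDER BY reading_date;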
With ordering out of the way, there’s the small detail of data partitioning as well. The manual goes into much more detail about this, but it’s just a fancy way of saying GROUP BY within the window itself.
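To make the analogy concrete, here’s roughly the GROUP BY version of the totals we’re about to compute. It collapses each location into a single row, where the windowed version below keeps every reading and simply attaches the group’s result to it:

-- Same totals, but one row per location instead of one row per reading.
SELECT location, sum(reading) AS all_readings
  FROM sensor_log
 WHERE location::INT <= 10
 GROUP BY location
 ORDER BY location;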
SELECT location, reading,
       sum(reading) OVER (PARTITION BY location) AS all_readings,
       rank() OVER (PARTITION BY location ORDER BY reading)
  FROM sensor_log
 WHERE location::INT <= 10
 LIMIT 10;
location | reading | all_readings | rank
----------+---------+--------------+------
0 | 32 | 260 | 1
0 | 35 | 260 | 2
0 | 37 | 260 | 3
0 | 71 | 260 | 4
0 | 85 | 260 | 5
1 | 12 | 173 | 1
1 | 14 | 173 | 2
1 | 17 | 173 | 3
1 | 44 | 173 | 4
1 | 86 | 173 | 5
Again, we can learn a few different things from these results. First, the window results are restricted to the partition we declared. We set the partition to limit sums, ranks, and other window functions to the domain of the location, so Postgres applies the window aggregates specifically to each location as it appears in the query output.
Next, consider that we partitioned the sum but did not order it. In case it wasn’t obvious already, this tells us that the PARTITION BY and ORDER BY clauses are independent, as are the window definitions themselves. In this case, we want the reading total for each location, followed by the rank of each reading within that group. If we had ordered the sum window, we would have gotten a running total instead of an overall value. On the other hand, we want the rank to behave in an ordered manner.
Separate windows, separate effects. Of course, we may not actually want that to happen. If we end up using the same window over and over again, it doesn’t make sense to declare it for each column. Thankfully Postgres has a shorthand for that:
SELECT location, reading,
       sum(reading) OVER (PARTITION BY location) AS all_readings,
       sum(reading) OVER locs AS running_total,
       rank() OVER locs
  FROM sensor_log
 WHERE location::INT <= 10
WINDOW locs AS (PARTITION BY location ORDER BY reading)
 LIMIT 10;
location | reading | all_readings | running_total | rank
----------+---------+--------------+---------------+------
0 | 32 | 260 | 32 | 1
0 | 35 | 260 | 67 | 2
0 | 37 | 260 | 104 | 3
0 | 71 | 260 | 175 | 4
0 | 85 | 260 | 260 | 5
1 | 12 | 173 | 12 | 1
1 | 14 | 173 | 26 | 2
1 | 17 | 173 | 43 | 3
1 | 44 | 173 | 87 | 4
1 | 86 | 173 | 173 | 5
Perfect! Not only did we use the same window twice, but we were able to mix it with another direct declaration within the same query. That isn’t exactly ground-breaking territory, but it’s reassuring when things work as expected.
Keep in mind that the same window function limitations apply to this syntax as well. Since windows are evaluated after WHERE, GROUP BY, and HAVING clauses, the WINDOW portion of a query must appear after them as well.
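As a rough sketch of where everything lands in a fuller query, the clause order works out like this (the HAVING condition here is just a placeholder):

-- WINDOW sits after WHERE / GROUP BY / HAVING, before ORDER BY and LIMIT.
SELECT location, sum(reading) AS total,
       rank() OVER totals AS total_rank
  FROM sensor_log
 WHERE reading_date >= CURRENT_DATE
 GROUP BY location
HAVING count(*) > 0
WINDOW totals AS (ORDER BY sum(reading) DESC)
 ORDER BY location
 LIMIT 10;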
Now go forth and leverage window functions to produce reports without awkward subqueries and CTEs muddying up the works! The only better way to use a window is to cool pies.