PG Phriday: Materialized Views, Revisited
Materialized views are a great improvement to performance in many cases. Introduced in PostgreSQL 9.3, they finally added an easy method for turning a view into a transient table that could be indexed, mined for statistics for better planner performance, and easily rebuilt. Unfortunately, refreshing a materialized view in PostgreSQL 9.3 caused a full exclusive lock, blocking any use until the process was complete. In 9.4, this can finally be done concurrently, though there are still a couple caveats.
Fake data is always best to illustrate, so let’s create a very basic table and a materialized view based on its contents. Heck, why not use the table from last week’s PG Phriday? For the purposes of this article, we’ll modify it slightly so there’s more than one day of orders.
CREATE TABLE sys_order ( order_id SERIAL NOT NULL, product_id INT NOT NULL, item_count INT NOT NULL, order_dt TIMESTAMP NOT NULL DEFAULT now() ); INSERT INTO sys_order (product_id, item_count, order_dt) SELECT (a.id % 100) + 1, (random()*10)::INT + 1, CURRENT_DATE - ((random()*30)::INT || 'days')::INTERVAL FROM generate_series(1, 1000000) a(id); ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id PRIMARY KEY (order_id); ANALYZE sys_order;
Now, how would we create a materialized view from this? There are a million rows constructed of 100 products, with varying order totals. A good way to use materialized views is to collect the underlying data into some kind of aggregate. How about product order totals for the entire day?
CREATE MATERIALIZED VIEW mv_daily_orders AS SELECT order_dt, product_id, sum(item_count) AS item_total FROM sys_order GROUP BY order_dt, product_id; CREATE UNIQUE INDEX udx_daily_orders_order_dt ON mv_daily_orders (order_dt, product_id); ANALYZE mv_daily_orders;
Note that we added a unique index to the order date and product ID. This is a subtle but required element to concurrently refresh a materialized view. Without it, we would get this error trying a concurrent refresh:
ERROR: cannot refresh materialized view "public.mv_daily_orders" concurrently HINT: Create a unique index with no WHERE clause on one or more columns of the materialized view.
In addition, we also analyzed the table. It’s always good practice to
ANALYZE a table after significant modifications so statistics are fresh. The planner and your query run times will thank you. There is an automatic daemon that will analyze tables after a certain threshold of changes, but this threshold may be too low, especially for extremely large tables that are on the receiving end of nightly insert jobs.
The materialized view is much smaller than the original, making it theoretically a better choice for reports and summaries. Let’s take a look at the last five days of sales for product ID number 1:
SELECT order_dt, item_total FROM mv_daily_orders WHERE order_dt >= CURRENT_DATE - INTERVAL '4 days' AND product_id = 1 ORDER BY order_dt; order_dt | item_total ---------------------+------------ 2015-03-02 00:00:00 | 1875 2015-03-03 00:00:00 | 1977 2015-03-04 00:00:00 | 2150 2015-03-05 00:00:00 | 1859 2015-03-06 00:00:00 | 1003
Great! The 6th isn’t over yet, so let’s insert some more orders.
INSERT INTO sys_order (product_id, item_count, order_dt) SELECT (a.id % 100) + 1, (random()*10)::INT + 1, CURRENT_DATE FROM generate_series(1, 1000) a(id);
Materialized views to not update when their parent table is modified. This means we would have to update the view ourselves by calling
REFRESH MATERIALIZED VIEW. In order to illustrate that 9.4 doesn’t lock these during a refresh, we’ll send these commands to the first connection:
BEGIN; REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_orders;
Notice that we didn’t end the transaction? That will preserve any locks PostgreSQL allocates while we attempt to access data in the materialized view. Next, we need a second connection where we try to use the view:
SELECT count(1) FROM mv_daily_orders; count ------- 3100
If this were a 9.3 database, we couldn’t use the
CONCURRENTLY keyword, and the second connection would have hung until the first connection issued a
COMMIT statement. Otherwise, we have normal transaction isolation controls. The view contents would look as if they had not been refreshed until the transaction is committed. Let’s look at those totals again:
order_dt | item_total ---------------------+------------ 2015-03-02 00:00:00 | 1875 2015-03-03 00:00:00 | 1977 2015-03-04 00:00:00 | 2150 2015-03-05 00:00:00 | 1859 2015-03-06 00:00:00 | 1067
Now we have 64 more sales for product number 1.
But what about those caveats? First of all, this is from the PostgreSQL Wiki regarding the concurrent update process:
Instead of locking the materialized view up, it instead creates a temporary updated version of it, compares the two versions, then applies INSERTs and DELETEs against the materialized view to apply the difference. This means queries can still use the materialized view while it’s being updated.
This means a concurrent update needs to re-run the definition query, compare it against the existing rows of the view, and then merge in the changes using
DELETE statements. Note that in our case, it would have been much more efficient to simply ’top off’ the view by getting the totals for the current day and replacing them manually. A “real”
sys_order table would be much larger than a short 30-day window, and such volume would impart a profound performance impact.
But we can’t do that with the built-in materialized view structure. All manual attempts to modify the view produce this error:
ERROR: cannot change materialized view "mv_daily_orders"
Drat. Granted, such variance would not be reflected in our view definition and would therefore not be recommended. However, it would be nice if we could define a function and bind that as a callback when
REFRESH MATERIALIZED VIEW is invoked, especially when rebuilding the entire data set from scratch is slow and inefficient. The assumption in such a case would be that the function would leave the view in an accurate state when complete.
A concurrent refresh does not lock the view for use, but can require significant resources and time to rebuild and merge. Why this approach was chosen? The PostgreSQL
TRUNCATE command is also transaction safe and entirely atomic. A new table is created based on the definition of the current one, but empty, and the new structures replace the old ones when no transactions refer to them any longer. This same process could have been used for the concurrent view updates, considering the definition query has to be re-run anyway. There are probably some complicated internals that would have made this difficult, but I still wonder.
Which brings us to the second caveat:
VACUUM. Since the contents of the materialized view are directly modified with
INSERT behind the scenes, that leaves dead rows according to PostgreSQL’s MVCC storage mechanism. This means the view has to be vacuumed following a refresh or there’s a risk the structure will bloat over time. Now, the autovacuum daemon should take care of this for us in the long run, but it’s still an element to consider.
In the end,
CONCURRENT is a vastly needed improvement. Once a materialized view is created, there’s a high likelihood it will require periodic updates. If end users can’t treat materialized views the same way as they would a normal table or view, they’ll eventually migrate away from using them at all. That’s the path of least resistance, after all; why use a materialized view when it could be exclusively locked for long periods of time.
Such conduct directly conflicts with the expectation MVCC normally provides, and end users have come to expect: writers do not block readers. Since it’s only the data being modified, and not the structure of the view, I personally wonder why
CONCURRENTLY isn’t the default behavior. As such, if you’re using a 9.4 database in conjunction with materialized views, I strongly encourage using the
CONCURRENTLY keyword at all times when performing refreshes.
Here’s hoping materialized views become even more viable in future releases!