PG Phriday: Materialized Views, Revisited

March 6th, 2015 | Published in Database, Tech Talk | 3 Comments


Materialized views are a great performance improvement in many cases. Introduced in PostgreSQL 9.3, they finally provided an easy method for turning a view into a transient table that could be indexed, mined for statistics for better planner performance, and easily rebuilt. Unfortunately, refreshing a materialized view in PostgreSQL 9.3 takes a full exclusive lock, blocking any use until the process completes. In 9.4, this can finally be done concurrently, though there are still a couple of caveats.

Fake data is always best to illustrate, so let’s create a very basic table and a materialized view based on its contents. Heck, why not use the table from last week’s PG Phriday? For the purposes of this article, we’ll modify it slightly so there’s more than one day of orders.

CREATE TABLE sys_order
(
    order_id     SERIAL     NOT NULL,
    product_id   INT        NOT NULL,
    item_count   INT        NOT NULL,
    order_dt     TIMESTAMP  NOT NULL DEFAULT now()
);

INSERT INTO sys_order (product_id, item_count, order_dt)
SELECT (a.id % 100) + 1, (random()*10)::INT + 1,
       CURRENT_DATE - ((random()*30)::INT || 'days')::INTERVAL
  FROM generate_series(1, 1000000) a(id);

ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id
      PRIMARY KEY (order_id);

ANALYZE sys_order;

Now, how would we create a materialized view from this? We have a million rows spread across 100 products, with varying order totals. A good way to use materialized views is to collect the underlying data into some kind of aggregate. How about daily order totals per product?

CREATE MATERIALIZED VIEW mv_daily_orders AS
SELECT order_dt, product_id, sum(item_count) AS item_total
  FROM sys_order
 GROUP BY order_dt, product_id;

CREATE UNIQUE INDEX udx_daily_orders_order_dt
    ON mv_daily_orders (order_dt, product_id);

ANALYZE mv_daily_orders;

Note that we added a unique index to the order date and product ID. This is a subtle but required element to concurrently refresh a materialized view. Without it, we would get this error trying a concurrent refresh:

ERROR:  cannot refresh materialized view "public.mv_daily_orders" concurrently
HINT:  Create a unique index with no WHERE clause on one or more columns of the materialized view.

We also analyzed the table. It's always good practice to ANALYZE a table after significant modifications so statistics are fresh. The planner and your query run times will thank you. The autovacuum daemon will analyze a table automatically after a certain threshold of changes, but since that threshold scales with table size, it can take too long to trigger on extremely large tables that are on the receiving end of nightly insert jobs.
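
If that lag matters, the trigger point can be tuned per table. Here's a sketch using the standard storage parameters, where the default trigger is 50 rows plus 10% of the table:

-- Ask autovacuum to analyze sys_order after roughly 2% of it
-- changes, instead of the default 10%.
ALTER TABLE sys_order SET (autovacuum_analyze_scale_factor = 0.02);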

The materialized view is much smaller than the original, making it theoretically a better choice for reports and summaries. Let’s take a look at the last five days of sales for product ID number 1:

SELECT order_dt, item_total
  FROM mv_daily_orders
 WHERE order_dt >= CURRENT_DATE - INTERVAL '4 days'
   AND product_id = 1
 ORDER BY order_dt;

      order_dt       | item_total
---------------------+------------
 2015-03-02 00:00:00 |       1875
 2015-03-03 00:00:00 |       1977
 2015-03-04 00:00:00 |       2150
 2015-03-05 00:00:00 |       1859
 2015-03-06 00:00:00 |       1003

Great! The 6th isn’t over yet, so let’s insert some more orders.

INSERT INTO sys_order (product_id, item_count, order_dt)
SELECT (a.id % 100) + 1, (random()*10)::INT + 1, CURRENT_DATE
  FROM generate_series(1, 1000) a(id);

Materialized views do not update when their parent table is modified. This means we have to update the view ourselves by calling REFRESH MATERIALIZED VIEW. To illustrate that 9.4 doesn't lock the view during a refresh, we'll send these commands to the first connection:

BEGIN;
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_orders;

Notice that we didn’t end the transaction? That will preserve any locks PostgreSQL allocates while we attempt to access data in the materialized view. Next, we need a second connection where we try to use the view:

SELECT count(1) FROM mv_daily_orders;

 count
-------
  3100

If this were a 9.3 database, we couldn't use the CONCURRENTLY keyword, and the second connection would have hung until the first connection issued a COMMIT. Otherwise, normal transaction isolation controls apply: the view contents look as though they had not been refreshed until the transaction commits.
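
So let's finalize the refresh in the first connection:

COMMIT;

With the refresh committed, let's look at those totals again: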

      order_dt       | item_total
---------------------+------------
 2015-03-02 00:00:00 |       1875
 2015-03-03 00:00:00 |       1977
 2015-03-04 00:00:00 |       2150
 2015-03-05 00:00:00 |       1859
 2015-03-06 00:00:00 |       1067

Now we have 64 more sales for product number 1.

But what about those caveats? First of all, this is from the PostgreSQL Wiki regarding the concurrent update process:

Instead of locking the materialized view up, it instead creates a temporary updated version of it, compares the two versions, then applies INSERTs and DELETEs against the materialized view to apply the difference. This means queries can still use the materialized view while it’s being updated.

This means a concurrent refresh must re-run the view's defining query, compare the result against the existing rows of the view, and then merge in the changes using INSERT and DELETE statements. Note that in our case, it would have been much more efficient to simply ‘top off’ the view by recomputing the totals for the current day and replacing them manually. A “real” sys_order table would be much larger than a short 30-day window, and rebuilding the entire result at that volume carries a substantial performance cost.

But we can’t do that with the built-in materialized view structure. All manual attempts to modify the view produce this error:

ERROR:  cannot change materialized view "mv_daily_orders"

Drat. Granted, manual changes like that would not be reflected in the view's defining query, so PostgreSQL is right not to encourage them. Still, it would be nice if we could define a function and bind it as a callback invoked by REFRESH MATERIALIZED VIEW, especially when rebuilding the entire data set from scratch is slow and inefficient. The assumption in such a case would be that the function leaves the view in an accurate state when complete.
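
Purely for illustration, here is roughly what that “top off” might look like if mv_daily_orders were an ordinary table maintained by hand. This is a sketch of the idea only; PostgreSQL will reject it against a real materialized view:

BEGIN;

-- Hypothetical: works only on a plain table. Replace just today's
-- totals, assuming order_dt values are midnight-truncated as in
-- our test data.
DELETE FROM mv_daily_orders
 WHERE order_dt = CURRENT_DATE;

INSERT INTO mv_daily_orders (order_dt, product_id, item_total)
SELECT order_dt, product_id, sum(item_count) AS item_total
  FROM sys_order
 WHERE order_dt = CURRENT_DATE
 GROUP BY order_dt, product_id;

COMMIT;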

A concurrent refresh does not lock the view for use, but it can require significant resources and time to rebuild and merge. Why was this approach chosen? The PostgreSQL TRUNCATE command is also transaction-safe and entirely atomic: a new, empty table is created based on the definition of the current one, and the new structure replaces the old one once no transactions refer to it any longer. That same process could have been used for concurrent view refreshes, considering the defining query has to be re-run anyway. There are probably some complicated internals that would have made this difficult, but I still wonder.

Which brings us to the second caveat: VACUUM. Since a concurrent refresh modifies the contents of the materialized view with DELETE and INSERT statements behind the scenes, it leaves dead rows behind, per PostgreSQL's MVCC storage mechanism. This means the view has to be vacuumed following a refresh, or there's a risk the structure will bloat over time. The autovacuum daemon should take care of this for us in the long run, but it's still an element to consider.
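
Until autovacuum gets around to it, a manual vacuum after each refresh is cheap insurance:

REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_orders;
VACUUM ANALYZE mv_daily_orders;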

In the end, CONCURRENTLY is a much-needed improvement. Once a materialized view is created, there's a high likelihood it will require periodic updates. If end users can't treat materialized views the same way they would a normal table or view, they'll eventually migrate away from using them at all. That's the path of least resistance, after all; why use a materialized view when it could be exclusively locked for long periods of time?

Exclusive locks directly conflict with the guarantee MVCC normally provides, and which end users have come to expect: writers do not block readers. Since it's only the data being modified, and not the structure of the view, I personally wonder why CONCURRENTLY isn't the default behavior. As such, if you're using materialized views on a 9.4 database, I strongly encourage using the CONCURRENTLY keyword every time you refresh.

Here’s hoping materialized views become even more viable in future releases!



Recipe: Amazeballs Chili

March 4th, 2015 | Published in Recipe | No Comments


Every once in a while, I get the itch to cook something. This time around, chili was my poison of choice because this winter won’t end and I’m silently protesting. This is probably the best chili I’ve ever made, so I’m required by law to share it. Seriously, this stuff is amazeballs.

Ingredients

  • 1 lb ground pork
  • 1 lb ground beef
  • 1 28oz can crushed tomatoes
  • 1 28oz can stewed tomatoes
  • 2 15.5oz cans red kidney beans
  • 2 15.5oz cans great northern beans
  • 2 15.5oz cans black beans
  • 1 12oz can tomato paste
  • 1/4 – 1/3 cup chili powder
  • 1.5 tbsp cumin
  • 1 tbsp salt
  • 2 tsp paprika
  • 1 tsp cayenne pepper
  • 1 tsp white pepper
  • 1 tsp coriander

Directions

  1. Mix all dry ingredients (chili powder, cumin, salt, paprika, cayenne, white pepper, coriander) in a bowl.
  2. Brown ground pork and beef in large saucepan. Add half of the spice mixture to the meat as it cooks.
  3. Combine all ingredients in a large pot or crock pot (do not drain or clean beans) and simmer on low for 6-8 hours, stirring occasionally.

Conclusion / Notes

I didn’t add garlic because Jen is allergic. Otherwise, I consider this recipe basically perfect. Unfortunately I only eyeballed the spices as I was adding them, so the above measurements are only approximate.



PG Phriday: PostgreSQL Select Filters

February 27th, 2015 | Published in Database, Tech Talk | No Comments


Long have CASE statements been a double-edged sword in the database world. They’re functional, diverse, adaptive, and simple. Unfortunately they’re also somewhat bulky, and when it comes to using them to categorize aggregates, something of a hack. This is why I wanted to cry with joy when I found out that PostgreSQL 9.4 introduced a feature I’ve always wanted, but found difficult to express as a need. I mean, CASE statements are fine, right? Well, yes they are, but now we have something better. Now, we have the FILTER aggregate expression.

I always like working with examples, so let’s create some test data to illustrate just what I’m talking about.

CREATE TABLE sys_order
(
    order_id     SERIAL     NOT NULL,
    product_id   INT        NOT NULL,
    item_count   INT        NOT NULL,
    order_dt     TIMESTAMP  NOT NULL DEFAULT now()
);

INSERT INTO sys_order (product_id, item_count)
SELECT (a.id % 100) + 1, (random()*10)::INT + 1
  FROM generate_series(1, 1000000) a(id);

ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id
      PRIMARY KEY (order_id);

We now have a table for tracking fake orders of 100 nonexistent products. I added the primary key after loading the table, a well-known DBA trick. Doing this after data loading means the index can be built in a single pass, which is much more efficient than repeatedly extending an existing index as rows arrive.

With that out of the way, let’s do a basic product order count, since that’s something many people are already familiar with:

SELECT count(*) AS total
  FROM sys_order;

-[ RECORD 1 ]--
total | 1000000

No surprises here. But what happens when Jeff from Accounting wants to know how many people ordered five specific products as a column list? In the old days, we might do something like this:

SELECT sum(CASE WHEN product_id = 1 THEN item_count ELSE 0 END) AS horse_mask_count,
       sum(CASE WHEN product_id = 7 THEN item_count ELSE 0 END) AS eyeball_count,
       sum(CASE WHEN product_id = 13 THEN item_count ELSE 0 END) AS badger_count,
       sum(CASE WHEN product_id = 29 THEN item_count ELSE 0 END) AS orb_count,
       sum(CASE WHEN product_id = 73 THEN item_count ELSE 0 END) AS memebox_count
  FROM sys_order;

 horse_mask_count | eyeball_count | badger_count | orb_count | memebox_count 
------------------+---------------+--------------+-----------+---------------
            59870 |         59951 |        59601 |     59887 |         60189

Gross.

As a DBA, I’ve seen more of these than I can reasonably stand, and hate them every single time. It’s not the use of the CASE statement that is so irksome, but the micromanaging methodology necessary to reduce the count to zero for unwanted items. With FILTER however, this query changes quite a bit:

SELECT sum(item_count) FILTER (WHERE product_id = 1) AS horse_mask_count,
       sum(item_count) FILTER (WHERE product_id = 7) AS eyeball_count,
       sum(item_count) FILTER (WHERE product_id = 13) AS badger_count,
       sum(item_count) FILTER (WHERE product_id = 29) AS orb_count,
       sum(item_count) FILTER (WHERE product_id = 73) AS memebox_count
  FROM sys_order;

 horse_mask_count | eyeball_count | badger_count | orb_count | memebox_count 
------------------+---------------+--------------+-----------+---------------
            59870 |         59951 |        59601 |     59887 |         60189

The query itself isn’t much shorter, but semantically, it’s far easier to understand what we’re trying to accomplish. It’s clear that we want the item count for each specific product, and nothing else. Further, since this is built-in functionality instead of a gross hack, it’s much faster. On the system we used for testing, after ten runs, the average time for the CASE variant was about 500ms; the FILTER version was about 300ms. In both cases, the query execution plan is identical. Internally however, invoking hundreds of thousands of CASE statements causes an immense CPU impact, where FILTER can utilize set grouping or another efficient stratagem based on the filter criteria. For large OLAP databases, this is a significant improvement in both query simplicity and performance.

This is good stuff we are getting in these new releases, and I encourage everyone to enjoy all of the new toys we get every year. Some of them are much more than mere window-dressing.



About People and Poverty

December 22nd, 2014 | Published in Contemplation | 1 Comment


I want to tell a story, and I’m sure most people won’t like it for one reason or another. If you stop reading after the first paragraph or two, I won’t blame you. It’s hard to read, and says a lot of bad things about humanity. But I like to think that it also provides necessary perspective that helps society see where it needs to improve.

It’s about my family.

In a lot of ways, we’re not good people. My uncles are unashamedly racist, my aunt is a master manipulator, and my mother is almost incapable of supporting herself or finishing anything. All of them have one or several psychological problems, stemming from depression and bipolarism, body dysmorphic disorder, or outright schizophrenia. Every single one. And it’s no accident.

It all started with my grandmother, who by all accounts, is one of the most terrible people I’ve ever met. Her second husband, you see, was a pedophile and a rapist, as my mother and her sister discovered before they were even ten years old. The first time she was raped, my mother made the fateful mistake of telling her mother. She was betrayed in the worst way a child can be. Her mother accused her of being a husband-stealing slut, and broke her nose.

Ever since that day, my grandmother fostered a simmering hate for my mother; mom's childhood and everything after it suffered tremendously. From that point on, my mother would learn she was worthless, could do nothing right, and was beaten frequently for something as innocuous as improperly washing a dish. My aunt saw what happened and learned from my mother's mistake, and thus began her own defense mechanism of influencing situations to avoid a similar fate. My uncles were faced with an impossible amount of cognitive dissonance: believe their parents were terrible people that would abuse innocent children, or that the children somehow deserved it. Their own beatings at childish mistakes helped decide which interpretation was correct.

Thus began a legacy that continues to this day. My grandmother was a lifelong smoker and finally died at the age of 72, but the damage she caused lives on. I'm honestly surprised I ever met her, but my mother was just as damaged as the rest of the kids. Her belief that her suffering was deserved meant she saw her past as just another family quirk. I like to think my hate for her mother rubbed off on her, but I'll never really know for sure. Abuse can be fantastically enduring, and people are weak to indoctrination.

As a result, my mother didn’t really know how to raise me. Were I a less precocious child, I probably would have been deep into the drug scene, not to mention rebellious and resentful of everything and everyone for having a better life. Instead, I was incensed by our situation to study and dig my way out of poverty by any means necessary. But in the meantime the damage, as I said, was enduring. My mother had very little confidence, no higher education, and no parental advice on how to advocate herself or work within the system. She was basically trained to be poor, because society demands adherence to somewhat strict rules necessary for smooth operation. Anyone who doesn’t fit that mold is discarded.

So we lived in trailers. Or with her boyfriends. Or, for a while, in a car. At one point, I slept on a couch for about a year when I was five. Mom would work multiple jobs, and since she couldn’t always afford a babysitter, sometimes I’d spend hours just hanging out at the 7-11 where she worked. It wasn’t uncommon for me to sleep in the stock room in the back. I can only imagine how impotent my mother must have felt back then, being forced to raise her child that way. But the social safety net is thin, and she was loath to use it more than necessary. If anyone is curious as to my vehement anger at our treatment of the poor, it starts here.

But that wasn’t all. Because her self confidence was so terrible, and the fact she never learned what made a healthy relationship, she dated and married a string of verbally or physically abusive alcoholics. My heart condition made me a very weak and sickly child. These men told me that I would be better off dead because I was sick all the time, or rubbed my nose in soiled underwear because I had a weak bladder before I was three. My father abandoned us before I was born, and isn’t even listed on my birth certificate. I’ve never met the man, nor do I even know his name. As a result, mom never received any form of child support, not that he would have paid it, given his demonstrated lack of responsibility.

Now my mom is in her late 50’s. Due to her age, lack of higher education and marketable skills, and fragmented or absent work history, she has almost no job prospects. She’s been subsisting for the last twenty years on side jobs and whatever she can make by working out of her home as a seamstress. As a result, she has no retirement savings, can’t pay her rent regularly, and can barely afford food. She just told me she finally broke down and went to a food bank last week.

I help when I can. Rent is expensive these days, and I already have my own to worry about. But I make sure she has a roof over her head, as dismal as a run-down one-room Chicago basement apartment can be. Her Chicago Ventra card is always loaded. When she asks, I help with bills, though I grumble a bit. I feel bad about giving her lectures about finances, knowing her irresponsibility and lack of foresight isn’t wholly her fault, but I’m only human. This is the world we live in.

If you’ve ever wondered why I’m painfully practical, am prone to completely unexpected bouts of anger, and complain about almost every dime I spend, there’s a long and sordid history behind it. Age and perspective have tamed some of that, but in a properly constructed society where people actually cared about each other, either my mother or I would have been intercepted before our problems became so deeply ingrained. It’s almost as if society is driven to facilitate the implements of its own demise.

In another universe, I could have been much different. Without a heart condition, perhaps I would have had enough energy to violently rebel against my situation, making trouble for everyone in my wake. I did that anyway, but only half-heartedly due to my reluctance to permanently damage my future prospects. One of my uncles wasn’t so restrained; given his past, he cared about little, and got in trouble and fights constantly into his 20’s. Yet it didn’t have to be that way. If the community cared, and really paid attention, all of my grandmother’s children would have long been relocated to good homes before the damage was irreversible.

Instead the problem was ignored and propagated to yet another generation, and then another. Mine is hardly an isolated case, regardless of how it might seem. People like me get labeled as Trailer Trash, meth heads, or shiftless poor. People that fall through the cracks garner only resentment instead of the assistance they need to prevent the rot from corrupting our society. Our careless and cavalier attitude compounds the situation and fosters nothing but resentment between our artificial castes. It’s a troublesome brew that I’ve seen both sides of, and even though I’m middle class now, I vehemently hate our society and everything it represents.

In my mind, I’ll always be poor and wondering where my next meal will come from. I’ll always be that little kid who didn’t get a deadly heart condition corrected until I was six because we couldn’t afford the surgery. I’ll always be skeptical and leery of everyone’s intentions, and internally scream “Fuck you for judging me!” at every perceived slight. My rage will never be sated, no matter how collected I may seem on the surface.

Now imagine I was still poor. Imagine I still had nothing to lose. Imagine the damage I could do if I were backed into a corner. Just visit a poor neighborhood, and you don’t have to imagine. The resentment is almost palpable, and threatens to boil over at any provocation. I don’t condone when these people lash out, but I understand their motivation. Desperation is a dangerous thing, and every time we cut the social safety net in the name of fiscal responsibility, we bring the knife closer to our own throats.

My mom, my family, and I are merely symptoms of a much more insidious underlying problem. I wish I had an answer to any of this, but I don’t. But if we can’t even acknowledge the symptoms, we can hardly begin to develop a cure. It really does take a village to raise a child, and unfortunately, our village needs to stop and examine the children it’s been raising. We would all benefit from a little more care, and a lot less callous disregard. I’ve never understood why we don’t try harder, given the risk to reward ratio.

Given the season, please take the opportunity to give a little more love to your family. If the situation arises where you can share that love with someone else who needs it, please do so. A little can go a long way. I can assure you it won’t go unappreciated.



On PostgreSQL View Dependencies

November 5th, 2014 | Published in Database, Tech Talk | 5 Comments


As many seasoned DBAs might know, there's one area where PostgreSQL still manages to be highly aggravating: the role views play in mucking up PostgreSQL dependencies. The part that annoys me personally is that it doesn't have to be this way.

Take, for example, what happens if you try to modify a VARCHAR column so that the column length is longer. We're not changing the type, or dropping the column, or doing anything overly complicated. Yet we're faced with this message:

ERROR:  cannot alter type of a column used by a view or rule
DETAIL:  rule _RETURN on view v_change_me depends on column "too_short"

Though PostgreSQL tells us which view and column prompted this error, that's the last favor it provides. The only current way past this error is to drop the view, alter the column, then recreate the view. In a 24/7 production environment, this is extremely problematic. The system I work with handles over two billion queries per day; there's no way I'm dropping a view that the platform depends on, even in a transaction.

This problem is compounded when views depend on other views. The error doesn't say so, but I also defined a view named v_change_me_too that depends on v_change_me, and you would never know it from the output PostgreSQL generated. Large production systems can have dozens, or even hundreds, of views that depend on complex hierarchies of tables and other views. Yet there's no built-in way to identify these views, let alone modify them safely.

If you want to follow along, this is the code I used to build my test case:

CREATE TABLE change_me ( too_short VARCHAR(30) );
CREATE VIEW v_change_me AS SELECT * FROM change_me;
CREATE VIEW v_change_me_too AS SELECT * FROM v_change_me;

And here’s the statement I used to try and make the column bigger:

ALTER TABLE change_me ALTER too_short TYPE VARCHAR(50);

It turns out we can solve this for some cases, though it takes a very convoluted path. The first thing we need to do is identify all of the views in the dependency chain. To do this, we need a recursive query. Here’s one that should find all the views in our sample chain, starting with the table itself:

WITH RECURSIVE vlist AS (
    SELECT c.oid::REGCLASS AS view_name
      FROM pg_class c
     WHERE c.relname = 'change_me'
     UNION ALL
    SELECT DISTINCT r.ev_class::REGCLASS AS view_name
      FROM pg_depend d
      JOIN pg_rewrite r ON (r.oid = d.objid)
      JOIN vlist ON (vlist.view_name = d.refobjid)
     WHERE d.refobjsubid != 0
)
SELECT * FROM vlist;

If we execute that query, both v_change_me and v_change_me_too will show up in the results. Keep in mind that in actual production systems, this list can be much longer. For systems that can survive downtime, this list can be passed to pg_dump to obtain all of the view definitions. That will allow a DBA to drop the views, modify the table, then accurately recreate them.
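
As an aside, pg_dump isn't the only way to capture those definitions. The pg_get_viewdef function reproduces the SELECT statement behind any view, which can be saved off before dropping anything:

SELECT pg_get_viewdef('v_change_me'::REGCLASS);
SELECT pg_get_viewdef('v_change_me_too'::REGCLASS);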

For simple cases where we're just extending an existing column, we can take advantage of the fact that the pg_attribute catalog table allows direct manipulation. In PostgreSQL, the stored length (atttypmod) of a VARCHAR column is the declared limit plus four bytes of header. So we simply reuse the recursive query and extend that length:

WITH RECURSIVE vlist AS (
    SELECT c.oid::REGCLASS AS view_name
      FROM pg_class c
     WHERE c.relname = 'change_me'
     UNION ALL
    SELECT DISTINCT r.ev_class::REGCLASS AS view_name
      FROM pg_depend d
      JOIN pg_rewrite r ON (r.oid = d.objid)
      JOIN vlist ON (vlist.view_name = d.refobjid)
     WHERE d.refobjsubid != 0
)
UPDATE pg_attribute a
   SET atttypmod = 50 + 4
  FROM vlist
 WHERE a.attrelid = vlist.view_name
   AND a.attname = 'too_short';
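
If everything worked, a quick sanity check should report an atttypmod of 54, meaning VARCHAR(50), for the table and both views:

SELECT attrelid::REGCLASS AS rel_name, atttypmod
  FROM pg_attribute
 WHERE attname = 'too_short'
   AND attrelid IN ('change_me'::REGCLASS, 'v_change_me'::REGCLASS,
                    'v_change_me_too'::REGCLASS);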

Now, this isn’t exactly a perfect solution. If views alias the column name, things get a lot more complicated. We have to modify the recursive query to return both the view name, and the column alias. Unfortunately the pg_depend view always sets the objsubid column to 0 for views. The objsubid column is used to determine which which column corresponds to the aliased column.

Without having this value, it becomes impossible to know what to modify in pg_attribute for the views. In effect, instead of being a doubly-linked list, pg_depend is a singly-linked list we can only follow backwards. So we can discover what the aliases depend on, but not what the aliases are. I can’t really think of any reason this would be set for tables, but not for views.

This means, of course, that large production systems will still need to revert to the DROP -> ALTER -> CREATE route for column changes to dependent views. But why? PostgreSQL knows the entire dependency chain. Why is it impossible to modify these in an atomic transaction context? If I have one hundred views on a table, why do I have to drop all of them before modifying the table? And, again, the type of modification in this example is extremely trivial; we’re not going from a TEXT to an INT, or anything that would require drastically altering the view logic.
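
For reference, that route looks like this on our test case. The whole batch is transactional, but the DROP statements still take exclusive locks, so anything using the views blocks until the COMMIT:

BEGIN;

DROP VIEW v_change_me_too;
DROP VIEW v_change_me;

ALTER TABLE change_me ALTER too_short TYPE VARCHAR(50);

CREATE VIEW v_change_me AS SELECT * FROM change_me;
CREATE VIEW v_change_me_too AS SELECT * FROM v_change_me;

COMMIT;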

For highly available databases, this makes it extremely difficult to use PostgreSQL without some type of short outage. Column modifications, while not common, are a necessary evil. Since it would be silly to recommend never using views, we have to live with downtime imposed by the database software. Now that PostgreSQL is becoming popular in enterprise settings, issues like this are gaining more visibility.

Hopefully this is one of those easy fixes they can patch into 9.5 or 9.6. If not, I can see it hampering adoption.


