PG Phriday: PostgreSQL Select Filters

February 27th, 2015 | Published in Database, Tech Talk | No Comments

Long have CASE statements been a double-edged sword in the database world. They’re functional, diverse, adaptive, and simple. Unfortunately they’re also somewhat bulky, and when it comes to using them to categorize aggregates, something of a hack. This is why I wanted to cry with joy when I found out that PostgreSQL 9.4 introduced a feature I’ve always wanted, but found difficult to express as a need. I mean, CASE statements are fine, right? Well, yes they are, but now we have something better. Now, we have the FILTER aggregate expression.

I always like working with examples, so let’s create some test data to illustrate just what I’m talking about.

CREATE TABLE sys_order
    order_id     SERIAL     NOT NULL,
    product_id   INT        NOT NULL,
    item_count   INT        NOT NULL,
    order_dt     TIMESTAMP  NOT NULL DEFAULT now()

INSERT INTO sys_order (product_id, item_count)
SELECT ( % 100) + 1, (random()*10)::INT + 1
  FROM generate_series(1, 1000000) a(id);

ALTER TABLE sys_order ADD CONSTRAINT pk_order_order_id
      PRIMARY KEY (order_id);

We now have a table for tracking fake orders, using 100 nonexistent products. I added the primary key after loading the table as a well known DBA trick. Doing this after data loading means the index can be created as a single step, which is much more efficient than repeatedly extending an existing index.

With that out of the way, let’s do a basic product order count, since that’s something many people are already familiar with:

SELECT sum(item_count) AS total
  FROM sys_order;

-[ RECORD 1 ]--
total | 1000000

No surprises here. But what happens when Jeff from Accounting wants to know how many people ordered five specific products as a column list? In the old days, we might do something like this:

SELECT sum(CASE WHEN product_id = 1 THEN item_count ELSE 0 END) AS horse_mask_count,
       sum(CASE WHEN product_id = 7 THEN item_count ELSE 0 END) AS eyeball_count,
       sum(CASE WHEN product_id = 13 THEN item_count ELSE 0 END) AS badger_count,
       sum(CASE WHEN product_id = 29 THEN item_count ELSE 0 END) AS orb_count,
       sum(CASE WHEN product_id = 73 THEN item_count ELSE 0 END) AS memebox_count
  FROM sys_order;

 horse_mask_count | eyeball_count | badger_count | orb_count | memebox_count 
            59870 |         59951 |        59601 |     59887 |         60189


As a DBA, I’ve seen more of these than I can reasonably stand, and hate them every single time. It’s not the use of the CASE statement that is so irksome, but the micromanaging methodology necessary to reduce the count to zero for unwanted items. With FILTER however, this query changes quite a bit:

SELECT sum(item_count) FILTER (WHERE product_id = 1) AS horse_mask_count,
       sum(item_count) FILTER (WHERE product_id = 7) AS eyeball_count,
       sum(item_count) FILTER (WHERE product_id = 13) AS badger_count,
       sum(item_count) FILTER (WHERE product_id = 29) AS orb_count,
       sum(item_count) FILTER (WHERE product_id = 73) AS memebox_count
  FROM sys_order;

 horse_mask_count | eyeball_count | badger_count | orb_count | memebox_count 
            59870 |         59951 |        59601 |     59887 |         60189

The query itself isn’t much shorter, but semantically, it’s far easier to understand what we’re trying to accomplish. It’s clear that we want the item count for each specific product, and nothing else. Further, since this is built-in functionality instead of a gross hack, it’s much faster. On the system we used for testing, after ten runs, the average time for the CASE variant was about 500ms; the FILTER version was about 300ms. In both cases, the query execution plan is identical. Internally however, invoking hundreds of thousands of CASE statements causes an immense CPU impact, where FILTER can utilize set grouping or another efficient stratagem based on the filter criteria. For large OLAP databases, this is a significant improvement in both query simplicity and performance.

This is good stuff we are getting in these new releases, and I encourage everyone to enjoy all of the new toys we get every year. Some of them are much more than mere window-dressing.

Tags: , , ,

About People and Poverty

December 22nd, 2014 | Published in Contemplation | 1 Comment

I want to tell a story, and I’m sure most people won’t like it for one reason or another. If you stop reading after the first paragraph or two, I won’t blame you. It’s hard to read, and says a lot of bad things about humanity. But I like to think that it also provides necessary perspective that helps society see where it needs to improve.

It’s about my family.

In a lot of ways, we’re not good people. My uncles are unashamedly racist, my aunt is a master manipulator, and my mother is almost incapable of supporting herself or finishing anything. All of them have one or several psychological problems, stemming from depression and bipolarism, body dysmorphic disorder, or outright schizophrenia. Every single one. And it’s no accident.

It all started with my grandmother, who by all accounts, is one of the most terrible people I’ve ever met. Her second husband, you see, was a pedophile and a rapist, as my mother and her sister discovered before they were even ten years old. The first time she was raped, my mother made the fateful mistake of telling her mother. She was betrayed in the worst way a child can be. Her mother accused her of being a husband-stealing slut, and broke her nose.

Ever since that day, she fostered a simmering hate for my mother; mom’s childhood and everything after it suffered tremendously. From that point on, my mother would learn she was worthless, could do nothing right, and was beaten frequently for something as innocuous as improperly washing a dish. My aunt saw what happened and learned from my mother’s mistake, and thus began her own defense mechanism of influencing situations to avoid a similar fate. My uncles were faced with an impossible amount of cognitive dissonance: believe their parents were terrible people that would abuse innocent children, or that the children somehow deserved it. Their own beatings at childish mistakes helped decide which interpretation was correct.

Thus began a legacy that continues to this day. My grandmother was a lifelong smoker and finally died at the age of 72, but the damage she caused lives on. I’m honestly surprised I ever met her, but my mother was just as damaged as the rest of the kids. Her belief that her suffering was deserved, meant she saw her past as just another family quirk. I like to think my hate for her mother rubbed off on her, but I’ll never really know for sure. Abuse can be fantastically enduring, and people are weak to indoctrination.

As a result, my mother didn’t really know how to raise me. Were I a less precocious child, I probably would have been deep into the drug scene, not to mention rebellious and resentful of everything and everyone for having a better life. Instead, I was incensed by our situation to study and dig my way out of poverty by any means necessary. But in the meantime the damage, as I said, was enduring. My mother had very little confidence, no higher education, and no parental advice on how to advocate herself or work within the system. She was basically trained to be poor, because society demands adherence to somewhat strict rules necessary for smooth operation. Anyone who doesn’t fit that mold is discarded.

So we lived in trailers. Or with her boyfriends. Or, for a while, in a car. At one point, I slept on a couch for about a year when I was five. Mom would work multiple jobs, and since she couldn’t always afford a babysitter, sometimes I’d spend hours just hanging out at the 7-11 where she worked. It wasn’t uncommon for me to sleep in the stock room in the back. I can only imagine how impotent my mother must have felt back then, being forced to raise her child that way. But the social safety net is thin, and she was loath to use it more than necessary. If anyone is curious as to my vehement anger at our treatment of the poor, it starts here.

But that wasn’t all. Because her self confidence was so terrible, and the fact she never learned what made a healthy relationship, she dated and married a string of verbally or physically abusive alcoholics. My heart condition made me a very weak and sickly child. These men told me that I would be better off dead because I was sick all the time, or rubbed my nose in soiled underwear because I had a weak bladder before I was three. My father abandoned us before I was born, and isn’t even listed on my birth certificate. I’ve never met the man, nor do I even know his name. As a result, mom never received any form of child support, not that he would have paid it, given his demonstrated lack of responsibility.

Now my mom is in her late 50’s. Due to her age, lack of higher education and marketable skills, and fragmented or absent work history, she has almost no job prospects. She’s been subsisting for the last twenty years on side jobs and whatever she can make by working out of her home as a seamstress. As a result, she has no retirement savings, can’t pay her rent regularly, and can barely afford food. She just told me she finally broke down and went to a food bank last week.

I help when I can. Rent is expensive these days, and I already have my own to worry about. But I make sure she has a roof over her head, as dismal as a run-down one-room Chicago basement apartment can be. Her Chicago Ventra card is always loaded. When she asks, I help with bills, though I grumble a bit. I feel bad about giving her lectures about finances, knowing her irresponsibility and lack of foresight isn’t wholly her fault, but I’m only human. This is the world we live in.

If you’ve ever wondered why I’m painfully practical, am prone to completely unexpected bouts of anger, and complain about almost every dime I spend, there’s a long and sordid history behind it. Age and perspective have tamed some of that, but in a properly constructed society where people actually cared about each other, either my mother or I would have been intercepted before our problems became so deeply ingrained. It’s almost as if society is driven to facilitate the implements of its own demise.

In another universe, I could have been much different. Without a heart condition, perhaps I would have had enough energy to violently rebel against my situation, making trouble for everyone in my wake. I did that anyway, but only half-heartedly due to my reluctance to permanently damage my future prospects. One of my uncles wasn’t so restrained; given his past, he cared about little, and got in trouble and fights constantly into his 20’s. Yet it didn’t have to be that way. If the community cared, and really paid attention, all of my grandmother’s children would have long been relocated to good homes before the damage was irreversible.

Instead the problem was ignored and propagated to yet another generation, and then another. Mine is hardly an isolated case, regardless of how it might seem. People like me get labeled as Trailer Trash, meth heads, or shiftless poor. People that fall through the cracks garner only resentment instead of the assistance they need to prevent the rot from corrupting our society. Our careless and cavalier attitude compounds the situation and fosters nothing but resentment between our artificial castes. It’s a troublesome brew that I’ve seen both sides of, and even though I’m middle class now, I vehemently hate our society and everything it represents.

In my mind, I’ll always be poor and wondering where my next meal will come from. I’ll always be that little kid who didn’t get a deadly heart condition corrected until I was six because we couldn’t afford the surgery. I’ll always be skeptical and leery of everyone’s intentions, and internally scream “Fuck you for judging me!” at every perceived slight. My rage will never be sated, no matter how collected I may seem on the surface.

Now imagine I was still poor. Imagine I still had nothing to lose. Imagine the damage I could do if I were backed into a corner. Just visit a poor neighborhood, and you don’t have to imagine. The resentment is almost palpable, and threatens to boil over at any provocation. I don’t condone when these people lash out, but I understand their motivation. Desperation is a dangerous thing, and every time we cut the social safety net in the name of fiscal responsibility, we bring the knife closer to our own throats.

My mom, my family, and I are merely symptoms of a much more insidious underlying problem. I wish I had an answer to any of this, but I don’t. But if we can’t even acknowledge the symptoms, we can hardly begin to develop a cure. It really does take a village to raise a child, and unfortunately, our village needs to stop and examine the children it’s been raising. We would all benefit from a little more care, and a lot less callous disregard. I’ve never understood why we don’t try harder, given the risk to reward ratio.

Given the season, please take the opportunity to give a little more love to your family. If the situation arises where you can share that love with someone else who needs it, please do so. A little can go a long way. I can assure you it won’t go unappreciated.

Tags: , ,

On PostgreSQL View Dependencies

November 5th, 2014 | Published in Database, Tech Talk | 5 Comments

As many seasoned DBAs might know, there’s one area that PostgreSQL still manages to be highly aggravating. By this, I mean the role views have in mucking up PostgreSQL dependencies. The part that annoys me personally, is that it doesn’t have to be this way.

Take, for example, what happens if you try to modify a VARCHAR column so that the column length is higher. We’re not changing the type, or dropping the column, or anything overly complicated. Yet we’re faced with this message:

ERROR:  cannot alter type of a column used by a view or rule
DETAIL:  rule _RETURN on view v_change_me depends on column "too_short"

Though PostgreSQL tells us which view and column prompted this error, that’s the last favor it provides. The only current way to fix this error is to drop the view, alter the column, then recreate the view. In a production 24/7 environment, this is extremely problematic. The system I work with handles over two-billion queries per day; there’s no way I’m dropping a view that the platform depends on, even in a transaction.

This problem is compounded when views depend on other views. The error doesn’t say so, but I defined another view named v_change_me_too that depends on v_change_me, yet I would never know it by the output PostgreSQL generated. Large production systems can have dozens, or even hundreds of views that depend on complex hierarchies of tables and other views. Yet there’s no built-in way to identify these views, let alone modify them safely.

If you want to follow along, this is the code I used to build my test case:

CREATE TABLE change_me ( too_short VARCHAR(30) );
CREATE VIEW v_change_me AS SELECT * FROM change_me;
CREATE VIEW v_change_me_too AS SELECT * FROM v_change_me;

And here’s the statement I used to try and make the column bigger:

ALTER TABLE change_me ALTER too_short TYPE VARCHAR(50);

It turns out we can solve this for some cases, though it takes a very convoluted path. The first thing we need to do is identify all of the views in the dependency chain. To do this, we need a recursive query. Here’s one that should find all the views in our sample chain, starting with the table itself:

    SELECT c.oid::REGCLASS AS view_name
      FROM pg_class c
     WHERE c.relname = 'change_me'
    SELECT DISTINCT r.ev_class::REGCLASS AS view_name
      FROM pg_depend d
      JOIN pg_rewrite r ON (r.oid = d.objid)
      JOIN vlist ON (vlist.view_name = d.refobjid)
     WHERE d.refobjsubid != 0
SELECT * FROM vlist;

If we execute that query, both v_change_me and v_change_me_too will show up in the results. Keep in mind that in actual production systems, this list can be much longer. For systems that can survive downtime, this list can be passed to pg_dump to obtain all of the view definitions. That will allow a DBA to drop the views, modify the table, then accurately recreate them.

For simple cases where we’re just extending an existing column, we can take advantage of the fact the pg_attribute catalog table allows direct manipulation. In PostgreSQL, TEXT-type columns have a length 4-bytes longer than the column limit. So we simply reuse the recursive query and extend that length:

    SELECT c.oid::REGCLASS AS view_name
      FROM pg_class c
     WHERE c.relname = 'change_me'
    SELECT DISTINCT r.ev_class::REGCLASS AS view_name
      FROM pg_depend d
      JOIN pg_rewrite r ON (r.oid = d.objid)
      JOIN vlist ON (vlist.view_name = d.refobjid)
     WHERE d.refobjsubid != 0
UPDATE pg_attribute a
   SET a.atttypmod = 50 + 4
  FROM vlist
 WHERE a.attrelid = vlist.view_name
   AND a.attname = 'too_short';

Now, this isn’t exactly a perfect solution. If views alias the column name, things get a lot more complicated. We have to modify the recursive query to return both the view name, and the column alias. Unfortunately the pg_depend view always sets the objsubid column to 0 for views. The objsubid column is used to determine which which column corresponds to the aliased column.

Without having this value, it becomes impossible to know what to modify in pg_attribute for the views. In effect, instead of being a doubly-linked list, pg_depend is a singly-linked list we can only follow backwards. So we can discover what the aliases depend on, but not what the aliases are. I can’t really think of any reason this would be set for tables, but not for views.

This means, of course, that large production systems will still need to revert to the DROP -> ALTER -> CREATE route for column changes to dependent views. But why? PostgreSQL knows the entire dependency chain. Why is it impossible to modify these in an atomic transaction context? If I have one hundred views on a table, why do I have to drop all of them before modifying the table? And, again, the type of modification in this example is extremely trivial; we’re not going from a TEXT to an INT, or anything that would require drastically altering the view logic.

For highly available databases, this makes it extremely difficult to use PostgreSQL without some type of short outage. Column modifications, while not common, are a necessary evil. Since it would be silly to recommend never using views, we have to live with downtime imposed by the database software. Now that PostgreSQL is becoming popular in enterprise settings, issues like this are gaining more visibility.

Hopefully this is one of those easy fixes they can patch into 9.5 or 9.6. If not, I can see it hampering adoption.

Tags: , ,

Why I’m Somewhat Worried About Ebola

October 15th, 2014 | Published in News | No Comments

Before I really get going with this post, I want to say I’m not panicked, and I suggest you stay the same. Meanwhile, it’s pretty clear the currently cavalier attitude toward Ebola needs to change. And of course, it all boils down to humans being the fallible creatures they are.

How Ebola Works

There’s good information on How Ebola Works, and how it kills you, but I’ll summarize. Ebola is a Biosafety Level 4 contagion, meaning proper attire when interacting with infected is a fully sealed safety suit with respirator, which should be decontaminated before and after exposure.

These kinds of precautions are necessary because Ebola is a hemorrhagic fever that causes multiple organ failure within days of exposure. How does that happen? Ebola is capable of replicating without the immune system taking immediate notice, because it attacks the dendritic cells of the immune system itself. Since these cells are how the immune system recognizes new invaders, there’s no defense while Ebola replicates. During the infection, it enters cells and makes them make more Ebola, which in turn causes them to explode. Eventually enough of this happens that the immune system actually does something once it notices the damage.

Unfortunately that something is a cytokine storm. In effect, the immune system freaks out and disgorges all of its killing might, severely damaging blood vessels in the process. This internal bleeding, in turn, causes blood pressure to drop. Combined with tissue damaged by Ebola, this leads to organ failure and eventually death. The mortality rate is quoted as anywhere between 25% to 95%, but it’s quite a bit more potent than the flu.

During all of this, you can’t have painkillers to ease the agony of your dying organs due to the likelihood your liver is among the casualties. It is a horrible, painful way to die that I wouldn’t wish on anyone.

How Does it Spread?

Some comparisons have been made that suggest Ebola is about as communicative as Hepatitis C. The CDC suggests that bodily fluids become a vector when the patient starts showing symptoms, which may take up to three weeks. With an incubation period that long, this allows travelers to reach quite diverse destinations before symptoms appear and spreading becomes likely.

Usually it’s easy to avoid bodily fluids like blood, mucus, semen, or diarrhea. But the problem with Ebola is that sweat is also a vector. Even during a cool day, the body produces sweat, and that sweat can get smeared on things like doorknobs.

Why it Bothers Me

The patient who recently died in Dallas has already infected his third health care worker. I’m pretty sure Hepatitis C doesn’t spread so easily, or every nurse in the country would have it. What’s worse, the nurse traveled on a plane a day before being diagnosed, potentially infecting anyone she encountered at both the departure and destination airports, as well as anyone in the plane. Yes, that includes the flight attendants.

And that is the real problem: people.

People generally don’t wash their hands before eating or itching their eyes. People don’t cough or sneeze into their elbows. People work sick for fear of loosing their jobs, or falling behind, thereby spreading diseases to the entire workplace. People reuse gloves. People travel when they’re not supposed to. People get scared and make mistakes. People cut budgets so there aren’t sufficient resources to handle outbreaks. People send patients home with incorrect diagnoses.

People suck.

We can claim “it will never happen here,” or “you’d have to roll around in Ebola diarrhea to catch it,” or “our infrastructure will prevent spreading,” but that’s all wrong. All it takes is one weak link: one lazy person who didn’t fill in a checkbox on a medical form; one person who neglected to change gloves between patients; one person who thinks the rules don’t apply to them; one person in denial about how sick they are; one person who works in fast food and can’t afford a sick day. Or in the case of Dallas, a hilariously incompetent string of mishaps that led to at least three nurses being infected, people who presumably have better access to sterilization and proper handling of contaminated material than the rest of us.

And all of this is happening right as we start to enter flu season. The symptoms of Ebola are very similar to the flu, which means misdiagnoses will become problematic. Being sloppy with the flu is a nuisance, but with Ebola, it’s deadly. We need to stop fucking around and get serious, or we’ll end up like Liberia and Sierra Leone.

We can mock those countries all we want for shoddy infrastructure and lack of education, but are we really any better? People are fallible, and whether they’re in the US or Africa, we need to account for Murphy’s Law and Finagle’s Corollary. We can call Dallas a fluke, but it’s not. Shit happens, and the sooner we accept that, the sooner we can actually address Ebola before it becomes a real problem.

Hopefully, we still can.


On PostgreSQL Logging Verbosity

August 5th, 2014 | Published in Database, Tech Talk | 5 Comments

Recently I stumbled across a question on Reddit regarding the performance impact of using pgBadger on an active database server. The only real answer to this question is: do not use pgBadger. Before anyone asks—no, you shouldn’t use pgFouine either. This is not an indictment on the quality of either project, but a statement of their obsolescence in the face of recent PostgreSQL features.

One of the recommended postgresql.conf changes for both of these tools is to set log_min_duration_statement to 0. There are a few other changes they require to put the log entries in the correct format, but we can worry about those later. For now, let’s focus on a little innocent math.

One of the PostgreSQL servers I work with, processes almost two billion queries per day. Let’s imagine every such query is very simple, even though this is definitely not the case. Consider this an example query:

SELECT col1, col2 FROM my_table WHERE id=?

Assuming the query is paramterized, and the number is from one to a million, our average query length is 47 characters. Let’s just say it’s 50 to keep things easy. If we multiply that by two billion, that’s 100-billion bytes of logged SQL. Seen another way, that’s 93GB of logs per day, or about 1MB of log data per second.

In practice, such a short query will not constitute the bulk of a PostgreSQL server’s workload. In fact, if even a simple ORM is involved, all queries are likely to be far more verbose. Java’s hibernate in particular is especially prone to overly gratuitous aliases prepended to all result columns. This is what our query would look like after Hibernate was done with it:

SELECT opsmytable1_.col1, opsmytable1_.col1
  FROM my_table opsmytable1_

If we ignore the whitespace I added for readability, and use values from one to a million, the average query length becomes 99. Remember, this is ignoring all useful data PostgreSQL would also be logging! There are also a number of other problems with many of my operating assumptions. It’s very unlikely that query traffic will be consistent, nor will the queries themselves be so short. In addition, I didn’t account for the length of the log prefix that should contain relevant metadata about the query and its duration.

Once on a boring day, I enabled all query logging just to see how verbose our logs became. On that fateful day, I set log_min_duration_statement to 0 for approximately ten seconds, and the result was 140MB worth of log files. Thus was my curiosity sated, and my soul filled with abject horror. Faced with such untenable output, how can we do log analysis? There’s no way pgBadger can process 100GB of logs in a timely manner. I tried using it a while ago, and even that ten seconds of log output required over a minute of processing.

It turns out PostgreSQL has had an answer to this for a while, but it wasn’t until the release of 9.2 that the feature became mature enough to use regularly. The pg_stat_statements extension maintains a system catalog table that tracks query performance data in realtime. Constants and variables are replaced to generalize the results, and it exposes information such as the number of executions, the total average run time of all executions, the number of rows matched, and so on. This is more than any log processing utility can do given the most verbose settings available.

I could spend hours churning through log files, or I can execute a query like this:

SELECT calls, total_time/calls AS avg_time, query
  FROM pg_stat_statements
 LIMIT 10;

That query just returned the ten slowest queries in the database. I could easily modify this query to find the most frequently executed queries, and thus improve our caching layer to include that data. This module can be directly responsible for platform improvements if used properly, and the amount of overhead is minimal.

In addition, the log settings are still available in conjunction with pg_stat_statements. I normally recommend setting log_min_duration_statement to a value that’s high enough to remove log noise, but low enough that it exposes problems early. I have ours set to 1000 so any query that runs longer than one second is exposed. Even on a system as active as ours, this produces about 5MB of log entries per day. This is a much more reasonable amount of data for log analysis, spot-checking, or finding abnormal system behavior.

All of this said, we could just as easily watch the database cluster and set log_min_duration_statement to a nonzero amount of milliseconds. For most systems, even 20 milliseconds would be enough to prevent log output from saturating our disk write queue. However, the pg_stat_statements extension automatically takes care of performance statistics without any post-processing or corresponding increase in log verbosity, so why add pgBadger to the stack at all?

There may be a compelling argument I’m missing, but for now I suggest using pg_stat_statements without PostgreSQL-focused log post-processing. Ops tools like Graylog or logstash are specifically designed to parse logs for monitoring significant events, and keeping the signal to noise ratio high is better for these tools.

Save logs for errors, warnings, and notices; PostgreSQL is great at keeping track of its own performance.

Tags: , , , ,

« Older Posts