PG Phriday: Let’s Talk About Data Storage

April 24th, 2015 | Published in Database, Tech Talk | 1 Comment

As a DBA, I strive not to live in an isolated ivory tower, away from the developers that are destined to fill our databases with volumes of data from a myriad of applications. It’s important, I think, to occasionally come back to the basics. So I’d like to discuss one of the core elements that PGDB DBAs might be intimately familiar with, but comes up often enough that some clarification is warranted. How does PostgreSQL store data? The answer may surprise you, and is critical to understand before regularly interacting with a PGDB system.

The storage engine PGDB uses is called Multi Version Concurrency Control, or MVCC for short. There is excellent documentation of the subject, along with further resources that attempt to convey a thorough understanding, and succeed in that regard. I strongly encourage anyone who works with PostgreSQL in any context, to read these until they can recite the basics from memory. It is the very bedrock of the database.

But we need an executive summary, so let’s start with an extremely simple table with an id and generic column:

ID Stuff
11 foo
21 bar
31 baz

Let’s say that some time in the future, we update ID 2. This would be the table at that point:

ID Stuff
11 foo
21 bar
31 baz
218 moo

Notice that there are now two copies of the row for ID 2. How is it possible for PGDB to differentiate between them? Which one is “correct”?

To answer those two questions, I’ll need a short aside to explain transaction counters. PGDB keeps a counter that it assigns to all database activities, regardless of origin. Because these numbers can’t cycle endlessly, it currently wraps after two billion values have been used. It then applies an “age” algorithm to produce an abstract representation of data antiquity. For the purposes of this discussion, the subscript in the examples represents the transaction age.

Every action gets a transaction ID, and every transaction ID acts like a snapshot because it provides a stable comparison point. Even if you don’t explicitly call BEGIN TRANSACTION, each statement gets a transaction ID, and thus an age to determine visibility.

Now, take note of the subscript for the second row version in the above example. It indicates that the second row was inserted by transaction 18. If transaction 5 were still open for some reason, it would not see this row, and thus the old version is still valid within that context. But transaction 20 would see both, and use the newest version.

This is how MVCC works, in a nutshell. The primary benefits here are related to blocking. Readers can’t block readers, as they’re not creating new row versions. Further, writers won’t block readers, as the old row version is still valid, and the reader would not see the new row version anyway due to transaction age. And naturally, readers can’t block writers, as reading data doesn’t inherently change its representation. A lot of databases have extremely complex and disruptive rules and locking systems to enforce these outcomes, but it’s just a natural part of MVCC.

Now imagine we update that row again:

ID Stuff
11 foo
21 bar
31 baz
218 moo
242 pop

Now there are three versions of the same row. This will keep happening for every UPDATE that row endures. As you might imagine, this is extremely undesirable in the long term; a very active table that receives a high volume of updates would quickly bloat out of control. And you’d be right. It’s not uncommon for high transaction volumes to quickly render tables 90+% old junk data.

PGDB solves this problem with VACUUM. The VACUUM system is given a transaction ID like everything else, but it can also do basic visibility testing because it knows which transactions are still open within the system. It works by recycling any row, provided the transaction age is lower than its own, and it’s no longer visible to other transactions. After being vacuumed, our table looks like this:

ID Stuff
11 foo
21 bar
31 baz
218 moo
242 pop

Because PGDB wants to reduce potential locking contention, the old row is not actually removed from the data file. Instead, the location is entered into a free space map. Any row in the free space map will be reused by new row versions. Let’s add a new row:

ID Stuff
11 foo
594 fun
31 baz
218 moo
242 pop

There are two major benefits to this approach:

  1. New row versions may not cause the table data files to be grown.
  2. New row versions are not always inserted at the end of the table. This area is known as a write “hot zone” and depending on the underlying file system, can be a huge source of contention and write delays.

However, this means our maintenance can never fall behind or we risk filling the file system with a bunch of old, useless, junk data. Table activity becomes inherently slower as bloat increases, adversely impacting performance in several different ways. Even worse, without regular vacuums, the transaction ID reuse that I mentioned earlier isn’t properly reset. Without that, the database will eventually shut down since it can’t generate unique transaction IDs for versioning data.

I won’t get into the automatic vacuum system that generally addresses this, but I can say that it’s not a complete solution. All vacuums function by reading table data files and maintaining the visibility map. This activity is necessarily limited by disk performance, and various internal settings that prevent vacuums from consuming all available disk throughput. Due to these limitations, tables that receive a high volume of updates will always be larger than their ideal size—sometimes much larger.

And this is where application developers come in. I ask that you think about how MVCC works, and consider how you develop a program that needs to store data. MVCC does not lend itself well to certain use cases, such as session tracking where each page load causes an update in the underlying table. There are better approaches for storing fungible session states than a persistent row in the database.

Most scenarios are fully suitable to how PGDB stores data. It’s a very powerful and capable database engine with far fewer locking problems than other systems. But we pay for that flexibility with extra maintenance concerns, and considerations for avoiding excessive update volume. Certainly, PGDB will function without specifically addressing these points, but it works better when everyone along the chain knows how to circumvent inherent limitations.

Know your tools, and they will rarely disappoint you.

Tags: , , , , , ,

PG Phriday: Anonymous Blocks and Object Manipulation

April 17th, 2015 | Published in Database, Tech Talk | No Comments

PGDB has had anonymous blocks since the release of 9.0 in late 2010. But it must either be one of those features that got lost in the shuffle, or is otherwise considered too advanced, because I rarely see it used in the wild. If that’s the case, it’s a great shame considering the raw power it conveys. Without committing to a function, we can essentially execute any code in the database, with or without SQL input.

Why is that good? One potential element of overhead when communicating with a database is network transfer. If processing millions of rows, forcing PGDB to allocate and push those results over the network will be much slower than manipulating them locally within the database itself. However, the looming specter of ad-hoc scripts is always a threat as well.

It was the latter scenario that prompted this particular discussion. A few weeks ago, I addressed date-based constraints and how they’re easy to get wrong. Knowing this, there’s a good chance we have objects in our database that need revision in order to operate properly. In one particular instance, I needed to correct over 800 existing check constraints an automated system built over the last year.

I hope you can imagine that’s not something I would want to do by hand. So it was a great opportunity to invoke an anonymous block, because there’s very little chance I’d need to do this regularly enough to justify a fully-fledged function. In the end, I came up with something like this:

DO $$
  chk TEXT;
  col TEXT;
  edate DATE;
  sdate DATE;
  tab TEXT;
  ym TEXT;
  FOR tab, chk, col IN 
      SELECT i.inhrelid::REGCLASS::TEXT AS tab,
             co.conname AS cname,
             substring(co.consrc FROM '\w+') AS col
        FROM pg_inherits i
        JOIN pg_constraint co ON (co.conrelid = i.inhrelid)
       WHERE co.contype = 'c'
    ym := substring(tab FROM '......$');
    sdate := to_date(ym, 'YYYYMM01');
    edate := sdate + INTERVAL '1 mon';


        quote_ident(chk) || ' CHECK (' ||
        quote_ident(col) || ' >= ' || quote_literal(sdate) ||
          ' AND ' ||
        quote_ident(col) || ' < ' || quote_literal(edate) || ')';
$$ LANGUAGE plpgsql;

I didn’t just use a bunch of unnecessary variables for giggles. The original version of this block used a single RECORD and a subquery to collect all of the necessary substitutions in their calculated forms. However, I felt this discussion needed a simpler step-by-step logic. Now let’s discuss this rather large block of SQL, because it is a very interesting lesson in several different aspects of the PL/pgSQL procedural language.

If you didn’t already know, you can loop through SQL results in a FOR loop, and pull SELECT results into variables while doing so. This is fairly common knowledge, so I won’t dwell on it. We should however, examine the query itself:

SELECT i.inhrelid::REGCLASS::TEXT AS tab,
       co.conname AS cname,
       substring(co.consrc FROM '\w+') AS col
  FROM pg_inherits i
  JOIN pg_constraint co ON (co.conrelid = i.inhrelid)
 WHERE co.contype = 'c'

Here, we’re making use of system catalog tables that help PGDB manage table metadata. First comes pg_inherits for information on child partitions, since they’re extremely likely to inherit from some base table as suggested by the partition documentation. Next, we incorporate information from pg_constraint so we know the name of each check constraint (contype of ‘c’) to modify.

Regarding the SELECT block itself, there is admittedly some magic going on here. The REGCLASS type serves a dual purpose in PGDB. For one, it is compatible with the OID object identifier type used extensively in the catalog tables. And second, when cast to TEXT, it outputs the schema and object name it represents, based on the current namespace. This means that, no matter where we are in the database, we will get a full substitution of the object—wherever it lives.

In that same block, we also abuse the consrc field to obtain the name of the column used in the constraint. There’s probably a more official way to get this, but as it turns out, the \w wildcard will match any word character. By globbing with +, we essentially grab the first series of word characters in the check. It might not work with other check constraints, but date-based partition rules generally only have an upper and lower bound. For these, the first match gives us the column name, and we don’t care about the rest.

Within the loop itself, things are a bit more straight-forward. After a bit of variable juggling, we start by dropping the old check. It was malformed, so good riddance. Then we build the new constraint based on our desired start and end dates. Note the use of quote_literal here. By using this function, the date variables are converted to text and quoted as static values. The end result is a query like this:

ALTER TABLE some_child_201504
  ADD CONSTRAINT ck_insert_date_201504
CHECK (insert_date >= '2015-04-01' AND
       insert_date < '2015-05-01')

Because these static text values do not match the column type, PGDB will automatically cast them in the physical constraint it actually creates. This prevents the check type mismatches we wrestled with in the last article.

So ends this example of fixing some broken DDL with an ad-hoc anonymous block. In the past, it was fairly common for DBAs to write queries using concatenation to write the DDL commands in the SELECT section of the query. Then we would direct that output to a script, and execute it separately. In this particular case, we would need two scripts: one to drop the constraints, and another to re-create them. That approach is certainly an option for those still uncomfortable working with anonymous blocks or EXECUTE statements.

In the end, I always encourage exploring capabilities to their full extent. Dig into Server Programming documentation if you really want to learn more.

Tags: , ,

PG Phriday: Functions and Addressing JSON Data

April 10th, 2015 | Published in Database, Tech Talk | 6 Comments

Fairly recently, a friend of mine presented a problem he wanted to solve with some JSON he had in a table. After he presented the end result he was trying to reach, I made the assumption that this would be pretty easy to do. But then I looked at the JSON Functions to try and find that quick fix. Though I read extensively and used rather liberal interpretations of the functions, there’s no way to directly manipulate JSON object contents with PostgreSQL.

Wait! Before you start yelling at me for being an idiot, I know what you’re going to say. I thought the same thing… at first. Go ahead and look, though. As of PostgreSQL 9.4, there is no built-in functionality to add or remove JSON elements without one or more intermediate transformation steps through PostgreSQL arrays or records. But that isn’t necessarily unexpected. Why?

Because PostgreSQL is a database. Its primary purpose is to store data and subsequently extract and view it. From this perspective, there’s no reason for PostgreSQL to have an entire library of JSON modification functions or operators. Regardless of this however, actions such as data merges and bulk updates still need to be possible. Yet all other fields allow a single update statement to append information, or otherwise perform a native calculation to replace the value in-line. There must be a way to do this with JSON too, without jumping through too many burning hoops.

Luckily there is, but it does require some preliminary work. Let’s start with a simple JSON document, as seen by PostgreSQL:

SELECT '{"Hairy": true, "Smelly": false}'::JSON;

 {"Hairy": true, "Smelly": false}

Ok. Now, how would I add an attribute named “Wobbly”? Well, I could pull the data into an external application, add it, and store the result. But suppose this was in a table of millions of records? That’s probably the least efficient way to modify them. This could be parallelized to a certain extent, but that requires a lot of scaffolding code and is way too much work for something so simple.

Instead, let’s create a function to do it for us. We’ve already established that PostgreSQL JSON manipulation is extremely limited, so what other option do we have? Here’s a python function:

CREATE or REPLACE FUNCTION json_update(data JSON, key TEXT, value JSON)
AS $$

    if not key:
        return data

    from json import loads, dumps
    js = loads(data)
    js[key] = loads(value)
    return dumps(js)

$$ LANGUAGE plpythonu;

Now we could add the field with ease:

SELECT json_update('{"Hairy": true, "Smelly": false}'::JSON,
       'Wobbly', 'false'::JSON);

 {"Hairy": true, "Smelly": false, "Wobbly": false}

And if we want to get really fancy, there’s always PLV8:

CREATE or REPLACE FUNCTION json_update(data JSON, key TEXT, value JSON)
AS $$
    if (key)
        data[key] = value;
    return data;

$$ LANGUAGE plv8;

SELECT json_update('{"Hairy": true, "Smelly": false}'::JSON,
       'Wobbly', 'false'::JSON);


Though with PLV8, there are a couple of relatively minor caveats.

  1. PLV8 doesn’t work with JSONB yet, which is why all of these examples are in JSON instead.
  2. You might notice that it stripped all the extraneous whitespace, which may not be desirable.

Either way, both of these variants do something that PostgreSQL can’t do on its own. This is one of the reasons PostgreSQL is my favorite database; it’s so easy to extend and enhance.

Just as a thought experiment, which of these functional variants is faster? I didn’t use the IMMUTABLE or STRICT decorators, so it would be easy to run a loop of a few thousand iterations and see what the final run-time is. Here’s a modification of the test query:

SELECT json_update('{"Hairy": true, "Smelly": false}'::JSON,
       'Wobbly', 'false'::JSON)
  FROM generate_series(1, 100000);

On my test VM, the python function took around four seconds, while the PLV8 version only needed a second and a half. Clearly PLV8’s native handling of its own datatype helps here, and python having to repeatedly import the json library hurts its own execution. By adding IMMUTABLE, both fly through all 100-thousand iterations in less than 200ms.

Don’t be afraid to stray from SQL when using PostgreSQL. In fact, this might be a good case for thinking about PostgreSQL in an entirely different light. I might start calling it PGDB from now on, simply to honor its roots and its primary functionality. SQL is no longer the Alpha and Omega when it comes to its capabilities these days. So I feel it’s only right to adapt along with it.

Here’s to the future of PGDB!

Tags: , , , ,

Comments Working Again

March 30th, 2015 | Published in Tech Talk | 1 Comment

This is what happens when you don’t have a QA department. :p

Sorry everyone. I upgraded WordPress and a bunch of plugins a few weeks back, and didn’t realize the reCAPTCHA plugin changed providers, and has been marking all comments as spam since then. I went through the spam backlog and recovered anything obvious. Though just to make sure, I’m going to dig through any tertiary settings and make sure legit email addresses haven’t been identified as spam sources. Just so I don’t have to worry about approving every comment, even from pre-approved submitters.

I know this affects, at most, two or three people that actually visit this site occasionally, but now those people are happy. Or at least less annoyed. :)

Tags: , ,

PG Phriday: High Availability Through Delayed Replication

March 27th, 2015 | Published in Database, Tech Talk | 6 Comments

High availability of PostgreSQL databases is incredibly important to me. You might even say it’s a special interest of mine. It’s one reason I’m both excited and saddened by a feature introduced in 9.4. I’m Excited because it’s a feature I plan to make extensive use of, and saddened because it has flown under the radar thus far. It’s not even listed in the What’s new in PostgreSQL 9.4 Wiki page. If they’ll let me, I may have to rectify that.

What is this mysterious change that has me drooling all over my keyboard? The new recovery_min_apply_delay standby server setting. In name and intent, it forces a standby server to delay application of upstream changes. The implications, however, are much, much more important.

Let me tell you a story; it’s not long, actually. A couple years ago, I had to help a client that was using a hilariously over-engineered stack to prevent data loss. I only say that because at first glance, the number of layers and duplicate servers would shock most people, and the expense would finish the job. This was one of my full recommended stacks, plus a few extra bits for the truly paranoid. DRBD-bonded servers, Pacemaker failover, off-site disaster recovery streaming clones, nightly backup, off-site backup and historical WAL storage, and long-term tape archival in a vault for up to seven years. You would need to firebomb several cities to get rid of this data.

But data permanence and availability are not synonymous. All it took was a single misbehaving CPU to take out the entire constellation of database servers, and corrupt a bunch of recent WAL files for good measure. How this is possible, and how difficult it is to avoid, is a natural extension of using live streaming replicas for availability purposes. We always need to consider one important factor: immediacy applies to everything.

Here’s what actually happened:

  1. A CPU on master-1 went bad.
  2. Data being written to database files was corrupt.
  3. DRBD copied the bad blocks, immediately corrupting master-2.
  4. Shared memory became corrupt.
  5. Streaming replication copied the corrupt data to dr-master-1.
  6. DRBD copied the bad blocks, immediately corrupting dr-master-2.
  7. In turn, PostgreSQL noticed the corruption and crashed on each server.
  8. Monitoring systems started screaming on all channels.

Just like that, a bulletproof high-availability cluster imploded into itself. All we had left at that point were the pristine backups, and the off-site WAL archives. This is one of the major reasons I wrote walctl, actually. Keeping archived WAL files on a tertiary server isolates them from issues that affect the primary or disaster recovery clusters. Further, it means the files can be pulled by any number of new clones without overloading the masters, which are intended to be used for OLTP.

In this case, we pulled a backup from the off-site backup vault, gathered the WAL files that were generated before the CPU went bad, and got the cluster running again in a couple of hours. But this could have easily been much worse, and without the previously-mentioned expensive paranoia and surplus of redundancy levels, it would have. And despite the fact we recovered everything, there’s still the several-hour outage to address.

You see, we weren’t paranoid enough. For a truly high-available architecture, corruption of the data source should always be considered a possibility. Both DRBD and PostgreSQL strive to copy data as quickly as possible, just as they should. Synchronization delay is another huge, but unrelated problem applications often need to circumvent when communicating with replicas. One way to solve this is to keep a third standby server that uses traditional WAL consumption, and then implement a time delay.

Effectively, this means preventing the extra server from processing WAL files for some period of time. This interval allows a DBA to interrupt replay before corruption reaches a fully online replica. It takes time for monitoring systems to report outages, and for the DBA to log into a server and diagnose the problem. As we’ve seen, it can already be too late; the data is already corrupt, and a backup is the only recourse. But a delayed server is already online, can easily be recovered to a point right before the corruption started, and can drastically reduce the duration of an outage.

There are several ways of imposing this delay, and all of them require at least one more series of scripts or software to strictly regulate file availability. They’re also largely irrelevant since the introduction of PostgreSQL 9.4 and the recovery_min_apply_delay setting. Instead of a cron job, or using a complicated script as the restore_command in recovery.conf, or some other method, we just set this variable and we get the desired offset. Here’s a two-hour window:

recovery_min_apply_delay = '2h'

This works with both streaming replication, and more traditional WAL file recovery. There is however, one caveat to using this setting. Since the replica can not apply the changes as they’re presented, they are held in the pg_xlog directory until the imposed purgatory expires. On highly transactional systems, this can result in unexpected space usage on replicas that activate the setting. The larger the safety margin, the more files will accumulate awaiting replay.

Barring that, it’s a huge win for anyone who wants to run a highly available cluster. In fact, it can even be integrated into cluster automation, so a delayed server is stopped if the primary system is down. This keeps our desired window intact while we investigate the problem, without us having to stop troubleshooting and shut down the time-offset replica ourselves.

In addition, a delayed server can be used for standard recovery purposes. If a user erroneously deletes data, or a rogue process drops a critical object, there’s a replica ready and waiting to let us recover the data and reintroduce it to the master server.

Having a server sitting around with self-inflicted synchronization offset seems ridiculous at first glance. But from the perspective of a DBA, it can literally save the database if used properly. I highly recommend anyone who can afford to implement this technique, does so. Your system uptime will thank you.

Tags: , , , ,

« Older Posts