Comments Working Again

March 30th, 2015 | Published in Tech Talk | No Comments


This is what happens when you don’t have a QA department. :p

Sorry everyone. I upgraded WordPress and a bunch of plugins a few weeks back, and didn’t realize the reCAPTCHA plugin changed providers, and has been marking all comments as spam since then. I went through the spam backlog and recovered anything obvious. Though just to make sure, I’m going to dig through any tertiary settings and make sure legit email addresses haven’t been identified as spam sources. Just so I don’t have to worry about approving every comment, even from pre-approved submitters.

I know this affects, at most, two or three people that actually visit this site occasionally, but now those people are happy. Or at least less annoyed. :)


Tags: , ,

PG Phriday: High Availability Through Delayed Replication

March 27th, 2015 | Published in Database, Tech Talk | 6 Comments


High availability of PostgreSQL databases is incredibly important to me. You might even say it’s a special interest of mine. It’s one reason I’m both excited and saddened by a feature introduced in 9.4. I’m Excited because it’s a feature I plan to make extensive use of, and saddened because it has flown under the radar thus far. It’s not even listed in the What’s new in PostgreSQL 9.4 Wiki page. If they’ll let me, I may have to rectify that.

What is this mysterious change that has me drooling all over my keyboard? The new recovery_min_apply_delay standby server setting. In name and intent, it forces a standby server to delay application of upstream changes. The implications, however, are much, much more important.

Let me tell you a story; it’s not long, actually. A couple years ago, I had to help a client that was using a hilariously over-engineered stack to prevent data loss. I only say that because at first glance, the number of layers and duplicate servers would shock most people, and the expense would finish the job. This was one of my full recommended stacks, plus a few extra bits for the truly paranoid. DRBD-bonded servers, Pacemaker failover, off-site disaster recovery streaming clones, nightly backup, off-site backup and historical WAL storage, and long-term tape archival in a vault for up to seven years. You would need to firebomb several cities to get rid of this data.

But data permanence and availability are not synonymous. All it took was a single misbehaving CPU to take out the entire constellation of database servers, and corrupt a bunch of recent WAL files for good measure. How this is possible, and how difficult it is to avoid, is a natural extension of using live streaming replicas for availability purposes. We always need to consider one important factor: immediacy applies to everything.

Here’s what actually happened:

  1. A CPU on master-1 went bad.
  2. Data being written to database files was corrupt.
  3. DRBD copied the bad blocks, immediately corrupting master-2.
  4. Shared memory became corrupt.
  5. Streaming replication copied the corrupt data to dr-master-1.
  6. DRBD copied the bad blocks, immediately corrupting dr-master-2.
  7. In turn, PostgreSQL noticed the corruption and crashed on each server.
  8. Monitoring systems started screaming on all channels.

Just like that, a bulletproof high-availability cluster imploded into itself. All we had left at that point were the pristine backups, and the off-site WAL archives. This is one of the major reasons I wrote walctl, actually. Keeping archived WAL files on a tertiary server isolates them from issues that affect the primary or disaster recovery clusters. Further, it means the files can be pulled by any number of new clones without overloading the masters, which are intended to be used for OLTP.

In this case, we pulled a backup from the off-site backup vault, gathered the WAL files that were generated before the CPU went bad, and got the cluster running again in a couple of hours. But this could have easily been much worse, and without the previously-mentioned expensive paranoia and surplus of redundancy levels, it would have. And despite the fact we recovered everything, there’s still the several-hour outage to address.

You see, we weren’t paranoid enough. For a truly high-available architecture, corruption of the data source should always be considered a possibility. Both DRBD and PostgreSQL strive to copy data as quickly as possible, just as they should. Synchronization delay is another huge, but unrelated problem applications often need to circumvent when communicating with replicas. One way to solve this is to keep a third standby server that uses traditional WAL consumption, and then implement a time delay.

Effectively, this means preventing the extra server from processing WAL files for some period of time. This interval allows a DBA to interrupt replay before corruption reaches a fully online replica. It takes time for monitoring systems to report outages, and for the DBA to log into a server and diagnose the problem. As we’ve seen, it can already be too late; the data is already corrupt, and a backup is the only recourse. But a delayed server is already online, can easily be recovered to a point right before the corruption started, and can drastically reduce the duration of an outage.

There are several ways of imposing this delay, and all of them require at least one more series of scripts or software to strictly regulate file availability. They’re also largely irrelevant since the introduction of PostgreSQL 9.4 and the recovery_min_apply_delay setting. Instead of a cron job, or using a complicated script as the restore_command in recovery.conf, or some other method, we just set this variable and we get the desired offset. Here’s a two-hour window:

recovery_min_apply_delay = '2h'

This works with both streaming replication, and more traditional WAL file recovery. There is however, one caveat to using this setting. Since the replica can not apply the changes as they’re presented, they are held in the pg_xlog directory until the imposed purgatory expires. On highly transactional systems, this can result in unexpected space usage on replicas that activate the setting. The larger the safety margin, the more files will accumulate awaiting replay.

Barring that, it’s a huge win for anyone who wants to run a highly available cluster. In fact, it can even be integrated into cluster automation, so a delayed server is stopped if the primary system is down. This keeps our desired window intact while we investigate the problem, without us having to stop troubleshooting and shut down the time-offset replica ourselves.

In addition, a delayed server can be used for standard recovery purposes. If a user erroneously deletes data, or a rogue process drops a critical object, there’s a replica ready and waiting to let us recover the data and reintroduce it to the master server.

Having a server sitting around with self-inflicted synchronization offset seems ridiculous at first glance. But from the perspective of a DBA, it can literally save the database if used properly. I highly recommend anyone who can afford to implement this technique, does so. Your system uptime will thank you.


Tags: , , , ,

PG Phriday: Date Based Partition Constraints

March 20th, 2015 | Published in Database, Tech Talk | 3 Comments


PostgreSQL has provided table partitions for a long time. In fact, one might say it has always had partitioning. The functionality and performance of table inheritance has increased over the years, and there are innumerable arguments for using it, especially for larger tables consisting of hundreds of millions of rows. So I want to discuss a quirk that often catches developers off guard. In fact, it can render partitioning almost useless or counter-productive.

PostgreSQL has a very good overview in its partitioning documentation. And the pg_partman extension at PGXN follows the standard partitioning model to automate many of the pesky tasks for maintaining several aspects of partitioning. With modules like this, there’s no need to manually manage new partitions, constraint maintenance, or even some aspects of data movement and archival.

However, existing partition sets exist, and not everyone knows about extensions like this, or have developed in-house systems instead. Here’s something I encountered recently:

CREATE TABLE sys_order
(
    order_id     SERIAL       PRIMARY KEY,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now()
);

CREATE TABLE sys_order_part_201502 ()
       INHERITS (sys_order);

ALTER TABLE sys_order_part_201502
  ADD CONSTRAINT chk_order_part_201502
      CHECK (order_dt >= '2015-02-01'::DATE AND
             order_dt < '2015-02-01'::DATE + INTERVAL '1 mon');

This looks innocuous enough, but PostgreSQL veterans are already shaking their heads. The documentation alludes to how this could be a problem:

Keep the partitioning constraints simple, else the planner may not be able to prove that partitions don’t need to be visited.

The issue in this case, is that adding the interval of a month changes the right boundary of this range constraint into a dynamic value. PostgreSQL will not use dynamic values in evaluating check constraints. Here’s a query plan from PostgreSQL 9.4.1, which is the most recent release as of this writing:

EXPLAIN
SELECT * FROM sys_order
 WHERE order_dt = '2015-03-02';

                QUERY PLAN                                    
---------------------------------------------
 Append  (cost=0.00..30.38 rows=9 width=20)
   ->  Seq Scan on sys_order  ...
   ->  Seq Scan on sys_order_part_201502  ...

Well, it looks like the PostgreSQL planner wants to check both tables, even though the constraint we added to the child does not apply. Now, this isn’t a bug per se, but it might present as somewhat counter-intuitive. Let’s replace the constraint with one that does not use a dynamic value and try again:

ALTER TABLE sys_order_part_201502
 DROP CONSTRAINT chk_order_part_201502;

ALTER TABLE sys_order_part_201502
  ADD CONSTRAINT chk_order_part_201502
      CHECK (order_dt >= '2015-02-01'::DATE AND
             order_dt < '2015-03-01'::DATE);

EXPLAIN
SELECT * FROM sys_order
 WHERE order_dt = '2015-03-02';

                QUERY PLAN                                    
---------------------------------------------
 Append  (cost=0.00..30.38 rows=9 width=20)
   ->  Seq Scan on sys_order  ...
   ->  Seq Scan on sys_order_part_201502  ...

Wait a minute… what happened here? There’s no dynamic values; the constraint is a simple pair of static dates. Yet still, PostgreSQL wants to check both tables. Well, this was a trick question of sorts, because the real answer lies in the data types used in the constraint. The TIMESTAMP WITH TIME ZONE type, you see, is not interchangeable with TIMESTAMP. Since the time zone is preserved in this type, the actual time and date can vary depending on how it’s cast.

Watch what happens when we change the constraint to match the column type used for order_dt:

ALTER TABLE sys_order_part_201502
 DROP CONSTRAINT chk_order_part_201502;

ALTER TABLE sys_order_part_201502
  ADD CONSTRAINT chk_order_part_201502
      CHECK (order_dt >= '2015-02-01'::TIMESTAMPTZ AND
             order_dt < '2015-03-01'::TIMESTAMPTZ);

EXPLAIN
SELECT * FROM sys_order
 WHERE order_dt = '2015-03-02';

                QUERY PLAN                                    
---------------------------------------------
 Append  (cost=0.00..0.00 rows=1 width=20)
   ->  Seq Scan on sys_order  ...

Now all of the types will be directly compatible, removing any possibility of time zones being cast to a different date than the constraint uses. This is an extremely subtle type mismatch, as many developers and DBAs alike, consider these types as interchangeable. This is further complicated by the fact DATE seems to be the best type to use for the constraint, since time isn’t relevant to the desired boundaries.

It’s important to understand that even experienced developers and DBAs can get types wrong. This is especially true when including information like the time zone appears completely innocent. In fact, it’s the default PostgreSQL datetime type for a very good reason: time zones change. Without the time zone, data in the column is bound to the time zone wherever the server is running. That this applies to dates as well, can come as a bit of a surprise.

The lesson here is to always watch your types. PostgreSQL removed a lot of automatic casting in 8.3, and received no small amount of backlash for doing so. However, we can see how subtly incompatible types can cause major issues down the line. In the case of partitioning, a type mismatch can be the difference between reading 10-thousand rows, or 10-billion.


Tags: , ,

PG Phriday: Interacting with JSON and JSONB

March 13th, 2015 | Published in Database, Tech Talk | No Comments


With the release of PostgreSQL 9.4, comes the ability to use binary JSON objects. This internal representation is faster and more capable than the original JSON included in 9.3. But how do we actually interact with JSON and JSONB in a database connection context? The answer is actually a little complicated and somewhat surprising.

Casting. Casting Everywhere.

Despite its inclusion as an internal type, PostgreSQL maintains its position as encouraging explicit casting to avoid bugs inherent in magic type conversions. Unfortunately, JSON blurs several lines in this regard, and this could lead to confusion on several fronts.

Let’s take a look at JSON first. Here are three very basic JSON documents for illustration:

{ "name": "cow-man" }
{ "weight": 389.4 }
{ "alive": true }

Nothing crazy. We have a string, a number, and a boolean. The PostgreSQL JSON type documentation suggests it handles these internally, which we can see for ourselves.

SELECT '{ "name": "cow-man" }'::JSON;

         json          
-----------------------
 { "name": "cow-man" }

SELECT '{ "weight": 389.4 }'::JSON;

        json         
---------------------
 { "weight": 389.4 }

SELECT '{ "alive": true }'::JSON;

       json        
-------------------
 { "alive": true }

Great! We can see the string, the number, and the boolean preserved in PostgreSQL’s encoding. Things start to go a bit sideways when we pull fields, though:

SELECT '{ "name": "cow-man" }'::JSON->'name';

 ?column?  
-----------
 "cow-man"

So far, so good. The PostgreSQL JSON documentation for functions and operators says that the -> operator returns a JSON object. And indeed, we can re-cast this string to JSON:

SELECT '"cow-man"'::JSON;

   json    
-----------
 "cow-man"

What happens when we try to compare two JSON objects, though?

SELECT '{ "name": "cow-man" }'::JSON->'name' = '"cow-man"'::JSON;

ERROR:  operator does not exist: json = json

Wait… what? Hmm. Let’s try the same thing with JSONB:

SELECT '{ "name": "cow-man" }'::JSONB->'name' = '"cow-man"'::JSONB;

 ?column? 
----------
 t

That’s something of a surprise, isn’t it? It’s pretty clear from this that JSON and JSONB are much more than simply how the data gets encoded and stored. It also drastically affects how it’s possible to interact with the data itself.

Don’t relax yet, though! JSON and JSONB casting only succeed on TEXT or VARCHAR similar types. For example, these don’t work:

SELECT 365::JSON;
SELECT 365::JSONB;

But these do:

SELECT 365::TEXT::JSON;
SELECT 365::TEXT::JSONB;

So even though PostgreSQL acknowledges JSON datatypes, it can’t convert between those and its own internal types. A PostgreSQL NUMERIC is similar to a JSON NUMBER, but they’re not interchangeable, and can’t even be casted without first going through some kind of TEXT type. This is the same for boolean values. The only type that is treated natively is a string-based value.

While it may seem inconvenient to always use another type as an intermediary when interacting with JSON, that’s the current reality.

Just use TEXT and JSONB

If we reexamine the JSON type documentation, we also see the ->> operator. This not only pulls the indicated field, but automatically casts it to text. This means that we can turn this ugly monstrosity:

SELECT ('{ "field": "value" }'::JSON->'field')::TEXT;

Into this:

SELECT '{ "field": "value" }'::JSON->>'field';

From here, we can perform any action normally possible with a text-based value. This is the only way to pull a JSON or JSONB field directly into a PostgreSQL native type.

All of this would suggest that the safest way to work with JSON or JSONB is through text. Ironically, text is also the only way to exchange comparisons between JSON and JSONB. Observe:

SELECT '"moo"'::JSON = '"moo"'::JSONB;

ERROR:  operator does not exist: json = jsonb

And yet:

SELECT '"moo"'::JSON::TEXT = '"moo"'::JSONB::TEXT;

 ?column? 
----------
 t

Well, then. What this means is pretty clear: convert at the last minute, and always use some kind of text value when dealing with JSON and JSONB.

While I’m here, I’d also like to point out a somewhat amusing side-effect of how JSONB works as opposed to JSON. Textual data gets converted to JSONB automatically when JSONB is one of the equalities. What does that mean? All of these are valid, and note that I’m quoting everything so it’s treated as text:

SELECT '"moo"'::JSONB = '"moo"';
SELECT '365'::JSONB = '365';
SELECT 'true'::JSONB = 'true';

But all of these produce an error:

SELECT '"moo"'::JSON = '"moo"';
SELECT '365'::JSON = '365';
SELECT 'true'::JSON = 'true';

This alone suggests that the lack of interoperability between JSON and JSONB is more of an oversight, and that JSON is missing some casting rules. Hopefully, that means 9.5 will carry some corrections in this regard. It’s hard to imagine PostgreSQL will leave JSON as a lame-duck datatype that was only really useful for 9.3 while JSONB was being developed.

If not, I guess that means I don’t have to revisit this topic in the future. Everyone knows I love being lazy.


Tags: , ,

A Short Examination of pg_shard

March 12th, 2015 | Published in Database, Tech Talk | 5 Comments


For part of today, I’ve been experimenting with the new-ish pg_shard extension contributed by CitusData. I had pretty high hopes for this module and was extremely excited to try it out. After screwing around with it for a while, I can say it has a lot of potential. Yet I can’t reasonably recommend it in its current form. The README file suggests quite a few understandable caveats, but it’s the ones they don’t mention that hurt a lot more.

Here’s what I encountered while experimenting:

  • No support for transactions.
  • No support for the EXPLAIN command to produce query plans.
  • No support for COPY for bulk loading.
  • A bug that causes worker nodes to reject use of CURRENT_DATE in queries.

The first two are probably the worst, and the third is hardly trivial. I’m pretty long-winded, but here’s my view on potential impact.

By itself, lacking transaction support makes pg_shard more of a toy in my opinion. This breaks the A in ACID, and as such, reduces PostgreSQL from a legitimate database, to a fun experiment in assuming bad data never makes it into a sharded table. I would never, in good conscience, deploy such a thing into a production environment.

By not providing EXPLAIN support, it is not possible to see what a query might do on a sharded cluster. This is not only incredibly dangerous, but makes it impossible to troubleshoot or optimize queries. Which shards would the query run on? How much data came from each candidate shard? There’s no way to know. It is possible to load the auto_explain module on each worker node to examine what it did, but there’s no way to check the query plan beforehand.

And what about COPY? The documentation states that INSERT is the only way to get data into a table. Outside of a transaction, multiple inserts are incredibly slow due to round trip time, single-transaction context, fsync delays, and the list goes on. I created a VM and threw a measly 100k individual inserts at a regular, unsharded table, and the whole job took over a minute. Replaying the script in a transaction cut that time down to ten seconds. On the pg_shard copy of the table with two worker nodes, the same inserts required two minutes and twenty seconds. For 100k records. Presumably this could be corrected by utilizing several loader threads in parallel, but I haven’t tested that yet.

The primary reason sharding might be used, is to horizontally scale a very large table. Based on the times I saw, it would take 15 days to load a table with one billion rows. The sample table I used was only four columns and had no indexes to slow the loading process. Yet the COPY statement needed only 300ms for the same amount of data, and could load one billion rows of that table in under an hour. So even if I ignored the lack of transactions and EXPLAIN support, getting our 20-billion rows of data into pg_shard simply wouldn’t be feasible.

I really, really wanted to consider pg_shard on one of the large multi-TB instances I administer. I still do. But for now, I’m going to watch the project and check in on it occasionally and see if they eventually work out these kinks. It’s a great prototype, and having CitusData behind it suggests it’ll eventually become something spectacular.

Of course, there’s always the possibility that an as yet unnamed project somewhere out there already is. If so, please point me to it; pg_shard teased my salivary glands, and I want more.


Tags: , , , ,

« Older Posts