Li-Don't-Node

Friday December 03, 2010

About a week ago, my website and email vanished off the face of the internet. I think this deserves a certain amount of explanation, lest someone think I’m incompetent in my own field. Not too long ago, I switched off my colocated server because I don’t need my own personal machine for two websites, a couple of very small databases, and a low-volume email server. I didn’t downgrade fully to a shared host because I run a Django app, WordPress, PostgreSQL, MySQL for the aforementioned WordPress content, Postfix to better control my blacklists, Postgrey because greylisting kills an assload of spam that blacklists would miss, and so on. I’m sure there’s a regular webhost that has all that, but I haven’t found them yet.

So on to virtual hosting. There aren’t very many providers out there for this level of service, oddly enough. Effectively it boils down to Slicehost, Rackspace Cloud, and Linode. I read a lot of customer testimonials, service comparisons, feature and pricing pages, and eventually settled on Linode. Why them? With a few clicks on their web interface, I can provision my disk allocation into as many partitions as I want, clone partitions for testbeds or impromptu backups, administer DNS, even log into my node via a clever SSH client. There’s a lot more, but suffice it to say, there’s more than enough to satisfy a guy accustomed to building servers up from scratch, without all the mess of actually doing it. Hell, you can even pick a Linux distribution to bootstrap a partition.

And I must admit this worked great for quite a while. I never had any problems with my virtual machine, and I even managed to upgrade to Debian Lenny after making a nice and safe clone of my Sarge install. And then on November 23rd, just about every single thing that could have gone wrong, did. For once, I’m not exaggerating for comedic effect. Take a look at Linode’s status page, which details the outage in their Fremont data center. Let’s see here…

There was a huge lightning storm in the area at the time.
The data center lost power.
The generators for the Linode servers didn’t activate.
Both of the redundant UPSs for the Linode servers failed.
The RAID unmounted uncleanly and required repair.
Recovering the RAID led to many hosts being unrecoverable.
My host was one of the permanently damaged ones on that particular VM server.

In addition, I left a flag off my rsync backup script (-a does not imply -r when used with --files-from), so my most recent backup was from late 2009, aside from database extracts I happened to have from September. So all of my mail after September 2009 is gone, as are my site and engine tweaks, except for my WordPress redesign, because I just happened to have a local test copy. But looking back at the above list of utter chicanery, you’d be hard-pressed to wipe out a server that completely through targeted sabotage. I mean… the generators, both UPSs, the RAID, and the recovery all failed?
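For anyone who wants to avoid the same rsync gotcha, here is a minimal sketch of the fix. It is not my actual backup script; the file list path, source root, and destination below are hypothetical placeholders. The only point it makes is that when --files-from is in play, -r has to be spelled out alongside -a, or directories named in the list are created but never descended into.

```python
import shlex
import subprocess


def build_rsync_command(files_from, source_root, destination):
    """Build an rsync argument list that recurses into listed directories.

    All three arguments are caller-supplied; nothing here is specific
    to any real host or layout.
    """
    return [
        "rsync",
        "-a",  # archive mode: keeps permissions, times, symlinks, and so on
        "-r",  # explicit, because --files-from disables the -r that -a implies
        f"--files-from={files_from}",
        source_root,
        destination,
    ]


def run_backup(files_from, source_root, destination):
    """Run the backup and raise if rsync exits with a non-zero status."""
    cmd = build_rsync_command(files_from, source_root, destination)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths and host, for illustration only.
    # This only prints the command; it does not run rsync.
    cmd = build_rsync_command(
        "/etc/backup-list.txt",
        "/",
        "backup@example.com:/srv/backups/mysite/",
    )
    print(shlex.join(cmd))
```

Using check=True means a failed transfer raises an exception instead of passing silently, which would have been a nice thing to have sometime in late 2009.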

I really feel for the poor on-site admin who was probably just sitting there in shock as the damage reports rolled in and the problems piled up. I’ve never even heard of an outage that severe short of a firebombing; effectively everything capable of failing in that particular instance did so, and in spectacular fashion. What annoys me, though, is that Linode doesn’t back up the VM servers as a whole. They have a somewhat new backup service which takes regular snapshots of customer VMs, but the server itself is completely at the mercy of Entropy.

I understand it’s probably cheaper in dollars for them to do this, plus whatever they make from the customers paying more for the snapshots, but the engineer in me balks at this design. A cobbled-together system that crawls through the VMs and takes regular extracts (it doesn’t work on encrypted disks, so they’re not true byte-copies of the VM) incurs development time, software maintenance, the overhead of managing a heterogeneous environment, and any porting concerns. I personally would have shrugged, enabled filesystem snapshots, and shipped them to another server on a regular basis. Forget all the piddly little VMs; it’s vastly simpler to just copy the entire thing wholesale.

My borked backup system was my fault. Their server obliteration wasn’t really their fault, except it’s pretty clear they’ve never run a “shoot the server in the head” test. But honestly, no company I’ve ever worked for has ever actually done that. And for what it’s worth, I got everything working again, sans one or two articles I still need to re-copy from my LiveJournal mirror, and a short weekend of getting Postfix working with SSL and IMAP. No harm, no foul.

At the time, of course, I was intensely angry, partially because I’ve never heard of a server outage that spectacular. I’ve personally never lost any data since that time in 1997 when I bumped my hard drive while it was moving content from a DoubleSpace-compressed drive to a new drive I had just purchased. Nobody I’ve ever worked for has ever permanently lost any data. I’m not trying to invoke Murphy’s Law here, either. It’s simply never happened. Through some combination of backups, good hardware design, and probably more than a little luck, it just hadn’t happened. Not in the thirteen years I’ve worked in IT. Never. Maybe that’s why I was due, since it almost seems like the universe went out of its way to obliterate my site host. But in order to truly do that, my laptop would have to suffer a hard drive corruption, as would my external backup drive, and the thumb drive I occasionally fill just in case.

So now I’m left with a conundrum. Linode is still one of the best virtual hosting providers out there. And now that the shock has worn off and I’ve calmed down a little and fixed everything that needed immediate fixing, I have time to think. Do I retain or abandon a provider that lost my data in some ridiculous and impossible chain of coincidences? I can’t really trust them knowing about their backup methodology and shoddy hardware, much akin to the terrible Penguin servers a previous employer cursed with every failure. But I know all about keeping stuff working on bad hardware: have really, really good backups.

Linode gave me three free months as a kind of apology for obliterating my system so thoroughly, and now I’m going to make absolutely sure I can rebuild my system within a few minutes if necessary, even if only to facilitate jumping ship later. So why move? But I’ll take suggestions if people have them; I’m clearly willing to try new things if they’re worth it.
