Sysadmin war stories: when the RAID fails
April 4, 2009
Every system administrator who has been around for a few years has a few war stories. We have been around for a little while, so we have plenty of them. Some are interesting, some embarrassing, and others just plain weird. The best stories are the ones where you actually learn something, so that you can fix things quickly when something similar happens in the future.
When the RAID fails
We run our hosting servers in a RAID setup: every machine has two disks with identical partitions in a RAID-1 setup, except for partitions that don't necessarily need to be recovered, such as /tmp. The cool thing about RAID is that if one disk fails you have a bit of breathing space and don't start losing data immediately. The not so cool thing about RAID is that it is easy to be lulled into a false sense of security. While the chance that both disks fail at the same time is smaller than the chance that a single disk fails, it is not uncommon (think power spikes, etc.).
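One way to avoid that false sense of security is to check regularly whether an array is quietly running on a single disk. The following is a minimal sketch only. It assumes Linux software RAID (md), where the kernel reports array status in /proc/mdstat; your setup may use something else entirely. The function name degraded_arrays and the sample text are made up for illustration.

```python
import re

# Hypothetical sample in the style of /proc/mdstat (illustrative, not real output).
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      10485696 blocks [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that have fewer active members than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # An array section starts with a line such as "md0 : active raid1 ..."
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line carries "[expected/active] [UU]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


if __name__ == "__main__":
    # On a real machine you would read /proc/mdstat instead of the sample text.
    print(degraded_arrays(SAMPLE_MDSTAT))
```

Run from cron with the real file as input, a check like this could send a mail as soon as the list is not empty, so a dead disk gets replaced before its mirror has a chance to follow.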
We have been saved by RAID a few times. At one point one of the disks in our main hosting server died. We got a replacement, went to the data center, shut down the services, made a backup to another machine "just in case", removed the disk's partitions from the RAID configuration, shut down the server, installed the new pre-partitioned disk and rebooted. Then our worst nightmare came true: at that exact moment the other disk failed as well.
So there we were in a very noisy data center, with a failed RAID, just one spare disk and a deadline to be back online within a few hours (we had told our customers to expect at most 4 hours of downtime, and we had thought we would be done within 2). We had made the mistake of not bringing a DVD with the latest release of the operating system, so we needed to download and burn one. Fortunately we had a 100 Mbps connection, and the wait forced us to take a break and come up with a strategy. We did a reinstall from scratch, installed all updates, restored all backups and got the most essential systems (databases and mail) back up and running. We then went back home to fix things in a more relaxed environment. We exceeded the expected maintenance window by only 15 to 20 minutes or so. Back home, just a few minor things needed fixing before we were completely back in business.
We learned a few valuable lessons: always make a full backup at the last possible moment (if possible), because you might need it; bring a CD or DVD with the operating system you are running; and always prepare for the worst.