System administration war stories: BIOS
April 2, 2009
Every system administrator who has been around for a few years has a few war stories. We have been around for a little while, so we have plenty of them. Some of them are interesting, others are embarrassing and others are just plain weird. The best stories are the ones where you actually learn something, so you can quickly fix things in the future when something similar happens.
BIOS settings do count
One of the first real sysadmin jobs I had was running a student lab at Utrecht University. This lab was funded for the most part with money from Microsoft Research. Of course the machines, except for one (funded by the department), were running a version of Microsoft Windows. The admins had an interesting naming scheme: the computers were named after Microsoft products. The main server was called 'windows'. The computer called 'exchange' was indeed running Microsoft Exchange, but also SQL Server, which the machine called 'sqlserver' was not (or the other way around, it has been a while). Some people might think giving machines names is weird, but it does help avoid confusion in conversations ("SQLserver is down!", "no it's not...oh, you mean the machine").
The main file server was equipped with a hardware RAID controller and was running Windows 2000. The three disks were set up in a RAID-5 configuration. During maintenance the machine had to be rebooted and it did not come back up. One of the disks had failed a long time before and the system had been running in degraded mode ever since. Unfortunately the RAID controller had a bug, so it could not boot from a degraded RAID. A firmware update (which had been available for 2 years) fixed that, but after the upgrade Windows 2000 complained about not having a driver for this particular version of the RAID controller and could not boot.
I ended up booting a Linux rescue CD, mounting the file system, making a backup and replacing the whole machine with one running FreeBSD and Samba. I informed the users that their data was safe and that they could come to me for a restore. After about a year no one had come back for anything, so I deleted the data.
The Windows-only machines were eventually replaced with new machines, which were installed as dual boot Linux/Windows (but in practice only booted into Windows to install patches). The machines were Dell OptiPlex GX260s. They started with Red Hat 8 or 9 and were upgraded (not simultaneously) to Fedora Core 1, 2 and eventually 3. One day, after an upgrade of the X server on Fedora Core 3, the machines would no longer respond after switching to a virtual console and back to X. This feature was actually used a lot by some of the users in the lab and they disliked having to completely power off the machine and turn it on again.
After a few weeks of intense swearing that it was not working (with increasing irritation and stress levels on my part as a result), and after endless searching did not get me anywhere, I decided to look into a BIOS update as a last resort ("it probably won't hurt"). The BIOS update actually fixed the issue (it turned out to be a wrong video memory setting), so I quickly updated the BIOS on all machines and users were happy the next morning.
Although the BIOS is regarded as obsolete by some (and is in fact being actively replaced by cool projects like coreboot), it turns out that it can still influence the correct working of machines and applications, even when they completely ignore the BIOS. So whenever you hit a weird problem that seems hardware related and that you can't immediately find a solution for, try upgrading the BIOS. It might just help.