Anything can go wrong
If something can go wrong, it will go wrong. Even if there’s nothing that can go wrong, it still will go wrong.
Situation: rack-mounted server with all the goodies on it, Windows 2000 Server running SQL Server, IIS and a bunch of other stuff, including Symantec Antivirus. External HP tape drive to be attached. No-brainer to-do list: set the correct SCSI ID, connect the cable and restart the system so the SCSI controller finds the new device. Windows will detect the drive, then the drivers will be updated to the most recent ones.
No biggie, you’d think. Pretty much the same complexity/risk level as connecting a USB printer, 30 minutes of downtime. Tops. I have not seen, heard of, or even imagined that something this trivial could go so horribly wrong.
On reboot, the server is as slow as an old, feeble turtle which suffered some rare illness as a baby turtle. The tape drive is recognized, but when I try to access the device properties to update the driver, everything freezes. Nearly anything I try to do in the Management console results in MMC freezing. Disk management doesn’t work at all. Worst of all, the SQL server won’t start, and time is ticking: people need to use applications relying on this data, business isn’t running, you can almost see money flying out of the pocket. To top it off, I don’t have physical access to the server myself; only others do, and they all went home to enjoy the Friday afternoon. No “Safe mode” for me, no unplugging the tape drive and hoping things will turn back to normal. What to do?
First thought, the universal Windows fix: restart it. Servers aren’t exactly snappy at getting back online; in fact, their POST diagnostics can take more time to complete than the operating system takes to load. So 20 minutes later, I’m staring at the same situation.
Second thought, let’s see what is not running properly. Checking services, I notice that a few of them are trying to start but get stuck at “Starting” — stop action is disabled at this point. OK, let’s disable these services and restart. Nope, can’t disable them, management console freezes when applying changes. Hmmmm.
Third thought: this situation occurred as a consequence of attaching the tape drive. Being unable to physically detach the cable, I can only disable the device. No workie. Uninstall the device, to remove the driver from memory. Nada.
At this point, phone calls are building up and one of the people with access to the server has to return to the office. I know he’s pissed off; I would be if I were him. Goodbye weekend plans, when the worst-case scenario is spending the next 24 hours reinstalling the operating system and restoring all settings to the way they were before. We detach the tape drive, but the server doesn’t seem to care and continues to be as slow as a dead fish thrown down the toilet. We test the SCSI controller, diagnose drives, try anything we can think of; nothing wrong. Must be software, then: we go through the system and manually delete all references to the tape drive, even in the registry. Same outcome. Finally, the server is started in Safe mode from the console, all supplemental services are disabled, the server restarts and… whew, we have a stable system.
Sort of.
We enable critical services, SQL server starts in less than 10 seconds, and things are back afloat, except for a few management services including the Symantec Antivirus. Its event log contains a reference to some 40+ settings being changed, and it just won’t run properly for some reason. Maybe it detected the tape drive as Removable Storage and attempted to scan it (with no tape in the drive)? Maybe it was just Windows automatically creating scheduled backups. I don’t know. And frankly, at 1 in the morning on a Friday night, I don’t effin’ care. It’s been too long since the last time I tried to find logical explanations for Windows’ oddities.
Lesson learned? Never underestimate the possibility that things can, and will, go wrong when you least expect them to, and in a way that you couldn’t predict. What started as a 10-minute connect–reboot–verify-that-everything-works-before-leaving-for-home kind of after-hours job turned into a 5-hour ordeal, propagating into tens of hours lost by other people waiting for this system and probably thousands of dollars of losses from delays. And a pissed-off boss.
Effin’ tape drive and older Windows systems. Effin’ physical access restrictions. And effin’ single point of failure in case this server goes down. Which is why I added the tape drive, to have a full system and data backup ready for emergencies. Which is exactly what caused the emergency.
In a few minutes I’m going to fall asleep thinking “what have I done wrong?” That ain’t good.


August 14th, 2006 at 3:00 am
Now imagine how funny it would have been if it were an effin’ Linux server, with text mode diagnostics.
August 15th, 2006 at 6:29 am
Well, that’s true. But then, Linux being such a hands-on operating system, I doubt a Linux server would take the initiative to automatically do whatever the Windows server did on detecting the tape drive, which led to the screw-up. Between the two, I appreciate the Linux server more, for not having a mind of its own and not doing things automatically, unexpectedly and uncontrollably, than the Windows server, which does all that and then provides me some nice GUI tools to fix the problems it created.
The diagnostics we used were actually text mode, built into the SCSI RAID controller and activated with a key combination before POST completion. They had a basic text-mode interface with a menu bar, not a command line.