Archive for August, 2006

Love and hate for Dell

Tuesday, August 15th, 2006

I’ve got a mixed love/hate relationship with Dell.

I love them for having great ideas. Their computers are not designed for everyone; they’re specialized tools that perform very well in specific tasks and areas. The hardware is well thought out and well assembled: exactly what you need to get the job done. Custom-designed motherboards embrace the BTX form factor to improve air cooling and to minimize noise and size. If you open some Dell desktops or servers, you will simply be amazed at how every little piece of plastic has its well-thought-out role, how detailed the design is and how much more you are getting in the box compared to a generic brand PC. The servers particularly stand out: just go through the service manual of a Dell PowerEdge 2950 and be amazed. And, like most brand name systems, these are tested for incompatibilities and should give you fewer problems than self-built systems.

But I also hate Dell for screwing up more often than you’d like. The reason anyone goes with a brand PC is the set of extra services included in the price. Dell’s services have been slipping for years. Tech support is in India, and they simply cannot do anything to help you out other than record your complaint; it’s even more aggravating when you have to ask the dude to repeat what he’s saying five times, because you can’t understand his English. Order processing has its issues as well; as a company that sells directly to the consumer, they need to make sure orders are processed on time. Delays in shipment, incorrect addresses, lost packages and wrong configurations are a few of the things they need to get under control, along with sending refunds for cancelled orders.

And finally, there’s the hardware. When it runs, it’s awesome. When it fails, it’s a nightmare. You don’t know for sure what’s wrong with it. They don’t know either. There’s no easy fix, and you end up spending weeks asking yourself “Why?” Just like with my three Dell Latitude laptops: why on earth did they decide to go with SDRAM instead of DDR SDRAM? What a horrible performance bottleneck, also seen in many of their Pentium 4 desktops! Why did they decide to go with Hitachi hard drives, which nearly self-destruct on a periodic basis? Why are there hardware issues with power management when I install Western Digital replacement hard drives? And why are they still listing ancient driver versions on their support site, when component manufacturers (Intel for chipsets or 3Com for network controllers, for instance) have released newer, improved drivers?

I’ll probably buy a new computer in a few months. What will it be: a Dell, or a bag of components I’ll assemble myself? I’m oscillating between the two. If I get the Dell, I’ll probably be very happy with what it delivers without needing to open the box at all. However, it won’t be very flexible in how I can modify and upgrade it. And if something goes wrong, I’ll be cursing my decision. On the other hand, with a custom-built computer, I’ll probably spend a decent amount of time with my hands in it, tweaking and fiddling with stuff. Lots of fun, but also annoying at some point, and definitely not as impressive as the brand name hardware and construction.

Stay tuned for the next episode of this amazing soap opera.

Anything can go wrong

Saturday, August 12th, 2006

If something can go wrong, it will go wrong. Even if there’s nothing that can go wrong, it still will go wrong.

Situation: rack-mounted server with all the goodies on it, Windows 2000 Server running SQL Server, IIS and a bunch of other stuff, including Symantec Antivirus. External HP tape drive to be attached. No-brainer to-do list: set the correct SCSI ID, connect the cable and restart the system so the SCSI controller finds the new device. Windows will detect the drive, then the drivers will be updated to the most recent ones.

No biggie, you’d think. Pretty much the same complexity/risk level as connecting a USB printer, 30 minutes of downtime. Tops. I have not seen, heard of, or even imagined that something this trivial could go so horribly wrong.

On reboot, the server is as slow as an old, feeble turtle which suffered some rare illness as a baby turtle. The tape drive is recognized, but when I try to access the device properties to update the driver, everything freezes. Nearly anything I try to do in the Management Console results in MMC freezing. Disk Management doesn’t work at all. Worst of all, SQL Server won’t start, and time is ticking: people need to use applications relying on this data, business isn’t running, you can almost see money flying out of the pocket. To top it off, I don’t have physical access to the server myself; only others do, and they all went home to enjoy the Friday afternoon. No “Safe mode” for me, no unplugging the tape drive and hoping things return to normal. What to do?

First thought, the universal Windows fix: restart it. Servers aren’t exactly snappy at getting back online; in fact, their POST diagnostics can take longer to complete than the operating system takes to load. So 20 minutes later, I’m staring at the same situation.

Second thought, let’s see what is not running properly. Checking services, I notice that a few of them are trying to start but get stuck at “Starting” — stop action is disabled at this point. OK, let’s disable these services and restart. Nope, can’t disable them, management console freezes when applying changes. Hmmmm.
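Looking back, spotting the stuck services could be scripted rather than eyeballed in the services console. What follows is only a minimal sketch, not what was done that night: it assumes Python is available on the box and that the text comes from the built-in `sc query state= all` command, whose per-service `SERVICE_NAME:` and `STATE` lines it parses. The function name `stuck_services` and the sample text are made up for illustration.

```python
# Hypothetical helper: find services stuck in START_PENDING by parsing
# text in the format printed by "sc query state= all".

# Illustrative sample only; not captured from the server in this post.
SAMPLE = """\
SERVICE_NAME: W3SVC
DISPLAY_NAME: World Wide Web Publishing Service
        TYPE               : 20  WIN32_SHARE_PROCESS
        STATE              : 4  RUNNING

SERVICE_NAME: MSSQLSERVER
DISPLAY_NAME: MSSQLSERVER
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 2  START_PENDING
"""


def stuck_services(sc_output):
    """Return the names of services whose STATE line says START_PENDING."""
    stuck = []
    current = None  # name of the service block being read
    for raw in sc_output.splitlines():
        line = raw.strip()
        if line.startswith("SERVICE_NAME:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("STATE") and "START_PENDING" in line:
            if current is not None:
                stuck.append(current)
    return stuck


if __name__ == "__main__":
    for name in stuck_services(SAMPLE):
        print(name)
```

On a live system the text would come from the command itself, for instance via `subprocess.run(["sc", "query", "state=", "all"], capture_output=True, text=True).stdout`. Of course, this only tells you which services are stuck, not why.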

Third thought: this situation occurred as a consequence of attaching the tape drive. Being unable to physically detach the cable, I can only disable the device. No workie. Uninstall the device, to remove the driver from memory. Nada.

At this point, phone calls are building up and one of the people with access to the server has to return to the office. I know he’s pissed off; I would be if I were him. Goodbye weekend plans, when the worst-case scenario is spending the next 24 hours reinstalling the operating system and restoring all settings to the way they were before. We detach the tape drive, but the server doesn’t seem to care and continues to be as slow as a dead fish thrown down the toilet. We test the SCSI controller, diagnose drives, try anything we can think of; nothing wrong. Must be software, then: we go through the system and manually delete all references to the tape drive, even in the registry. Same outcome. Finally, the server is started in Safe mode from the console, all supplemental services are disabled, the server restarts and… whew, we have a stable system.

Sort of.

We enable critical services, SQL Server starts in less than 10 seconds, and things are back afloat, except for a few management services including the Symantec Antivirus. Its event log contains a reference to some 40+ settings being changed, and it just won’t run properly for some reason. Maybe it detected the tape drive as removable storage and attempted to scan it (with no tape in the drive)? Maybe it was just Windows automatically creating scheduled backups. I don’t know. And frankly, at 1 in the morning on a Friday night, I don’t effin’ care. It’s been too long since the last time I tried to find logical explanations for Windows’ oddities.

Lesson learned? Never underestimate the possibility that things can, and will, go wrong when you least expect them to, and in a way that you couldn’t predict. What started as a 10-minute connect–reboot–verify-that-everything-works-before-leaving-for-home kind of after-hours job turned into a 5-hour ordeal, propagating into tens of hours lost by other people waiting for this system and probably thousands of dollars in losses from delays. And a pissed off boss.

Effin’ tape drive and older Windows systems. Effin’ physical access restrictions. And effin’ single point of failure in case this server goes down. Which is why I added the tape drive, to have a full system and data backup ready for emergencies. Which is exactly what caused the emergency.

In a few minutes I’m going to fall asleep thinking “what have I done wrong?” That ain’t good.