It Comes Down To A Fan
July 4, 2007
Posted by gordonwatts in computers, physics.
Recently one of the readout crates in D0 started having odd problems. The hardware my group is responsible for started crashing. These were Linux kernel panics: the OS on the board had given up the ghost and was out to lunch instead of taking data for the D0 experiment.
These problems are very hard to track down, and this one was starting to look really tough. All of us were wondering what software bug had recently been introduced, or whether the other hardware in the crate had started feeding us bad data that we weren’t protected against. The error was happening in the Linux kernel, too, which makes debugging rather difficult.
Then Dan, who was in charge of the whole crate, noticed that one of the cooling fans had died. He fixed it, and there have been no crashes since. Most of us haven’t had to deal with the hardware in so long that it didn’t really occur to us that this could be a hardware problem, much less a cooling problem! The basic problem is that the hardware has been too reliable. 🙂
This reminds me of a set of computers we purchased to run a farm in this same system. The fans on the CPUs were all cheap and started to fail. The machines began crashing randomly, and it took us a while to figure out why.
It isn’t all software! How many of us are still working hard on hardware? Probably not enough!