New Level 3 Lesson
May 22, 2008
Posted by gordonwatts in computers, D0, Trigger.
Ok, here is a dumb lesson I’ve learned the hard way. And thanks to the many others who helped resolve it. Let’s say you design a distributed system, like the online, trigger, and data collection system at D0. This is a medium-sized system: perhaps 500 boxes and several thousand CPUs at this point. It is key to note that this is a heterogeneous system: many of those boxes are doing different things and have to be custom configured.
Now, since it is a heterogeneous but distributed system, and all the boxes have to communicate with each other, they need a way of finding each other. You definitely can’t use raw DNS and the machine name. Computers change. Sometimes you want to do a hot-swap to an experimental system. Your DNS is managed by a central facility, so the turn-around can be a day, and when the accelerator is delivering beam you need less than an hour.
So you have to decide on some sort of name service. Some service that can take a name and reply with a machine. If it is done right, this will disappear into the infrastructure and you’ll not even be aware it is there after a few years.
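To make "take a name and reply with a machine" concrete, here is a minimal Python sketch. It is not the D0 name server: the `NameService` class, the logical name `l3/monitor`, and the host names are all made up for illustration. The point is only that clients ask for a role, and a hot-swap is a single re-registration rather than a DNS change request.

```python
class NameService:
    """Hypothetical in-memory name service: logical name -> (host, port)."""

    def __init__(self):
        self._table = {}

    def register(self, name, host, port):
        # Registering an existing name overwrites the old entry,
        # which is what makes a hot-swap a one-line operation.
        self._table[name] = (host, port)

    def lookup(self, name):
        try:
            return self._table[name]
        except KeyError:
            # Fail with the logical name in the message so the cause is obvious.
            raise LookupError(f"no machine registered under {name!r}") from None


ns = NameService()
ns.register("l3/monitor", "node-017", 5000)      # made-up production box
ns.register("l3/monitor", "node-test-02", 5000)  # hot-swap to an experimental box
current = ns.lookup("l3/monitor")
```

A real service would sit behind a network socket and need its own well-known address, but the register/lookup contract is the whole idea.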
Let’s see, we’ve been running since 2001. In about 2003 we started using the “sanctioned” name server for our Level 3 Trigger and DAQ part of the system. Of course, for all this to work you have to know where that name server is. We had an alias in DNS for that purpose.
And it turns out that our stuff is one of the few things left using that name server. Everyone else loads a Python file on the command line. I’d originally designed our system so that you could change the location of a system on the fly without having to reboot one of the components, so the Python approach was never considered. And the online system recently cleared out a bunch of machines.
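The difference between the two approaches can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual D0 code: `StaticConfig` stands in for any scheme that reads locations once at startup, `LiveResolver` for one that asks on every use, and the plain `directory` dict stands in for the name server. All names and hosts are hypothetical.

```python
class StaticConfig:
    """Locations read once at startup (stand-in for a config-file approach)."""

    def __init__(self, table):
        # Take a snapshot; later changes to the source are never seen.
        self._table = dict(table)

    def locate(self, name):
        return self._table[name]


class LiveResolver:
    """Asks the name service again on every call."""

    def __init__(self, lookup):
        self._lookup = lookup

    def locate(self, name):
        return self._lookup(name)


# A plain dict plays the role of the name server here.
directory = {"l3/supervisor": ("node-001", 6000)}

static = StaticConfig(directory)
live = LiveResolver(lambda name: directory[name])

# The supervisor moves while both clients are still running.
directory["l3/supervisor"] = ("node-spare", 6000)

static_answer = static.locate("l3/supervisor")  # still the old box
live_answer = live.locate("l3/supervisor")      # follows the move
```

The live version is what allows a move without restarting components. The price is that every component now depends on being able to reach the name service at run time.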
The name server was moved. And that alias? Well, everyone forgot about it, so it was never set up for the new location. And then slowly, over time, parts of Level 3 started to fail. Thank goodness it was the monitoring code that failed first. But there were several hours of panic. It had been so long that all of us had forgotten how the system works.
Maintaining the same system running for years is so weird. With almost all the code I write, I think “Ok, get it running, debugged, and checked in, and move on.” Keeping some of it running for years, however, brings other considerations. I bet there are whole books on this. Too bad we HEP people never take the time to read that sort of thing before we do our software development…