8-core Machines
June 12, 2007
Posted by gordonwatts in computers.
At D0 we are getting ready to order our next batch of machines to upgrade the Level 3 Trigger and DAQ farm. The test machine we were sent to evaluate is an 8-core, 64-bit machine.
Here is the rub: our Level 3 code is currently only certified to run in 32-bit mode, and it requires about 1 gig of memory. We can easily buy a large amount of memory for this machine, say 8 gigs. But when we run in 32-bit mode we can only access 4 gigs of that (since 2^32 bytes = 4 gigs).
First of all: who would have thought that 4 gigs wasn’t enough!? Second: note the price we are paying for the style of multi-processing we do in particle physics.
Our parallel processing is event based, so it is only coarsely parallel. Each copy of the executable deals with one event at a time. If we want to run on a multi-core machine then we start up multiple copies of the executable, one per core. The price of this, of course, is that we pay for all the resources for each copy. In the case of the trigger executable that translates to close to a gig of memory per copy, so eight copies on eight cores want close to 8 gigs.
There is a second type of parallel processing: multi-threading. Here you run multiple threads of execution inside a single process. The beauty of this is that you are only running one executable, so you pay for the shared resources only once. For example, we could use two cores in each trigger executable. But there is a price: getting multi-threaded code to run properly is very difficult! For example, if you have a global variable and both threads of execution try to modify it at the same time, you can have a disaster! If you don’t design in the capability of using multiple threads from the start, it is nearly impossible to add at the end.
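To make the global-variable problem concrete, here is a minimal sketch. It assumes a compiler that already has the standard thread library (std::thread and std::mutex, which came in with C++11), and the names g_counter, add_events, and count_with_two_threads are made up for illustration; none of this is from our trigger code. Two threads increment the same global, and the lock is what keeps them from stepping on each other.

```cpp
#include <mutex>
#include <thread>

// Hypothetical stand-in for a piece of global state in a large code base.
long g_counter = 0;
// Lock that protects g_counter. Without it the two threads race and
// increments can be lost.
std::mutex g_counter_mutex;

// Each thread bumps the shared global n times, taking the lock per update.
void add_events(int n) {
    for (int i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> lock(g_counter_mutex);
        ++g_counter;
    }
}

// Run two threads against the same global and return the final count.
long count_with_two_threads(int n_per_thread) {
    g_counter = 0;
    std::thread a(add_events, n_per_thread);
    std::thread b(add_events, n_per_thread);
    a.join();
    b.join();
    return g_counter;
}
```

Calling count_with_two_threads(100000) should give 200000 every time. Take out the lock_guard line and the answer will typically come out short, and differently on each run. Now imagine hunting down every such global in code that was never written with threads in mind.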
Sadly, the C++ language is not much help. The new C++ standard is going to add some multi-threading capabilities, but that won’t help us modify our existing code. There are some cool pragmas, the so-called OpenMP pragmas. I’ve seen that these are supported in recent Visual C++ and GCC v4.1/4.2. You can slap a pragma on the top of a loop and it will automatically partition the loop's iterations out to multiple cores. How cool is that? But it won’t help with the most general problem, unfortunately.
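Here is roughly what the pragma trick looks like, as a small sketch rather than anything from our code: sum_of_squares is a made-up example function. The reduction clause tells OpenMP to give each thread its own private running total and add them up at the end, which sidesteps exactly the shared-variable problem described above.

```cpp
#include <vector>

// Hypothetical example: sum of squares over a vector.
// Build with OpenMP enabled (-fopenmp for GCC, /openmp for Visual C++).
// Without that flag the pragma is ignored and the loop simply runs serially.
double sum_of_squares(const std::vector<double>& values) {
    double total = 0.0;
    const int n = static_cast<int>(values.size());
    // Split the iterations across cores; each thread accumulates a private
    // copy of total, and the copies are summed when the loop finishes.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i) {
        total += values[i] * values[i];
    }
    return total;
}
```

This works nicely because each iteration is independent of the others. The moment the loop body touches shared state that is not covered by a clause like reduction, you are back to the general problem.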
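Finally, I should note that, apparently, it is possible to get around the 4 gig limit by using a special module in the Linux kernel which does some sort of bank swapping (presumably PAE, the Physical Address Extension). As I understand it, each process would still be limited to its own 32-bit address space, but since each copy of our executable only needs about a gig, that would be fine. That may end up being what we have to do, as it is the solution that will require the least amount of work.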
After writing this post, I stumbled on a post by a Sun engineer, which pointed to a number of other posts, in particular this one by Phillip Toland. Most of the posts are thinking about web servers and their issues, but they are strongly advocating the share-nothing path, because it is easier (and programmer/physicist time is more expensive than a CPU). In short, we should spend the time to validate our code on 64-bit machines (hard, but you only do it once) and then spend the extra money on extra memory to run it. I’d go for that. Now, who has some spare time to validate the Level 3 trigger on a 64-bit machine? 😉