It’s the Data, Stupid! May 5, 2008. Posted by gordonwatts in computers.
I’ve mentioned before that I think multicore computing is going to hit HEP hard. The basic problem is that we run all of our jobs in an embarrassingly parallel way. If the machine has 8 cores on it, then we run 8 reconstruction jobs. There is nothing wrong with this on the surface, and, indeed, performance tests indicate that so far we observe an almost linear speed increase as a function of the number of cores.
The problem, I fear, is getting data on and off the chip with the many cores. For code to run fast it must be well fed with CPU instructions and the data it is processing. Memory and the CPU have only so much bandwidth. The HEP way of running means that all the various cores are working on very different things – different data, different code – which means on-chip cache hits will be low – requiring more data from outside the chip.
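To make the cache point concrete, here is a toy sketch (my own illustration, nothing to do with any real reconstruction code): the same sum over a matrix, done in memory order and against memory order. The two functions compute identical results, but the second one jumps by a full row stride on every access, so it keeps fetching new cache lines from off-chip memory.

```cpp
#include <vector>
#include <cstddef>

// Sum a rows x cols matrix stored row-major, two ways.
// Same answer either way; only the memory access pattern differs.
double sum_row_major(const std::vector<double>& m,
                     std::size_t rows, std::size_t cols) {
    double s = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];   // stride-1: walks along cache lines
    return s;
}

double sum_col_major(const std::vector<double>& m,
                     std::size_t rows, std::size_t cols) {
    double s = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];   // stride of cols doubles: nearly every
                                    // access pulls a fresh cache line
    return s;
}
```

On a large enough matrix the first version runs several times faster than the second, purely because of how well the cache is fed. Eight cores running eight unrelated jobs are, in effect, all playing the part of the second loop as far as the shared memory bus is concerned.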
At some point this has to start to impact performance. We’ll see it when we stop getting linear increases in performance as we add cores.
Ars had a nice article a day or so ago on this issue when discussing the various processors and their multi-core capabilities. The study they reference looks specifically at memory bandwidth. IBM’s Cell processor did the best – this is the chip used in the PS3 gaming console, and it was specifically designed for high memory bandwidth. Intel’s architectures didn’t do nearly as well, however.
I also found it interesting that before they were able to test the memory bandwidth they had to deal with some other bottlenecks in the Intel chips. Specifically, the Translation Lookaside Buffer (TLB) (see section 4.1 of the paper for a detailed discussion). Basically, every time the CPU goes to memory it must translate the requested virtual memory address into a physical memory address. To speed up this process, the CPU maintains a cache of these translations. If you hit the cache your memory access proceeds at full speed. If you don’t, then things grind to a halt while the CPU walks the page tables to work out the translation. If you are accessing data spread over a large area of memory you are bound to constantly be overflowing the TLB. This is interesting because it strongly resonates with one of the presentations on optimizing CMS code at CHEP – specifically, how much time their code spends in memory allocation and deallocation (see page 18 of the talk). This means the CPU is constantly jumping around from whatever code it is working on, to the alloc/dealloc routines, to scanning memory for free blocks – all these operations use up slots in the TLB without actually “getting work done”.
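A small sketch of why allocation patterns matter here (again a toy of my own, not the CMS code): the same “events” laid out two ways. When each object is allocated separately, a pass over them touches pages scattered all over the heap, and each page needs its own TLB entry; one contiguous allocation visits pages in order.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical event record, just for illustration.
struct Event { double e[8]; };

// Scattered layout: each event was allocated on its own, so following
// the pointers can land on a different page (and TLB entry) each time.
double scan_scattered(const std::vector<Event*>& evts) {
    double s = 0.0;
    for (const Event* ev : evts)
        s += ev->e[0];
    return s;
}

// Contiguous layout: one big allocation, pages visited in order, so a
// handful of TLB entries cover long runs of events.
double scan_contiguous(const std::vector<Event>& evts) {
    double s = 0.0;
    for (const Event& ev : evts)
        s += ev.e[0];
    return s;
}
```

Both scans compute the same number; the difference shows up only in how hard they hammer the TLB and the caches, which is exactly the kind of cost that never appears in the algorithm itself.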
Fortunately, in the Ars article, they mention that the CPU manufacturers, like Intel, are aware of this, and the next generation of chips will have much larger TLBs – which should help with this issue. It is always the next generation, isn’t it?🙂
I wonder if we in HEP will ever hit this memory bandwidth bottleneck? Perhaps when we have 8 cores on a chip (when is that predicted to happen? Next year!? :-)). Getting around the bottleneck will require some major work in how we write our code: it is all inherently single-threaded.
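What would that work look like? A minimal sketch, using modern C++ threads (the function and variable names here are made up for illustration, not from any real framework): one process splits the event list across worker threads, so all the cores share a single copy of the code and the read-only conditions data instead of each running their own full job.

```cpp
#include <thread>
#include <vector>
#include <cstddef>

// Stand-in for the real per-event reconstruction work.
double reconstruct(double raw) { return raw * 2.0; }

// Process the event list with nthreads workers. Each thread takes every
// nthreads-th event, and each output slot is written by exactly one
// thread, so no locking is needed.
std::vector<double> process_events(const std::vector<double>& events,
                                   unsigned nthreads) {
    std::vector<double> out(events.size());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < events.size(); i += nthreads)
                out[i] = reconstruct(events[i]);
        });
    }
    for (auto& w : workers)
        w.join();
    return out;
}
```

The hard part, of course, is not this loop – it is making the reconstruction code itself safe to call from many threads at once, which is exactly the “major work” I mean.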
BTW, the point of this study was to optimize a physics simulation — a magnetohydrodynamic system. Physics is everywhere!