CHEP Day 2 – September 13, 2007. Posted by gordonwatts in computers, Conference, physics.
I’m spending this week in Victoria, Canada attending the Computers in High Energy Physics (CHEP) conference. At one time this was my favorite conference (more on that in a later post). These summaries are just things in the talks I found interesting on day 2. [I spent last week there — I’m just getting around to cleaning up these posts!!!]
Plenary: The Future of GRID Computing – Miron Livny
Most people I talked to really liked this talk. Miron’s group at Wisconsin is the author of Condor, one of the more popular GRID-aware free batch systems out there. We’ve made heavy use of it back at UW to process ATLAS events. But I didn’t like his talk all that much – mostly because I don’t like GRID talks all that much. I got made fun of in a plenary session 2 or 3 CHEPs ago for complaining that we were talking about the GRID too much. I didn’t think it could get worse — but it has: even more of CHEP is devoted to computer science problems (getting jobs to run on the GRID) than to physics problems that can be solved with computers.
Miron started on a promising note: GRID computing is dead. 😉 What he meant was that you no longer get automatic funding by attaching the word GRID to a proposal. Then he said, “but distributed computing is here to stay. Let’s get back to work and away from the hype.” !!
He also had one slide in his talk where he said that we have to do better on job reliability. I’d like to point you to a NYTimes article that was published today about complex systems and how error-prone they are. What irks me a bit is that the GRID folks have been stressing reliability for years now and their progress is, well, slow (at least to my eyes). New features and new toolkits seem to appear with alarming frequency — but no one goes back and makes sure that all the layers work well together.
He now helps lead the OSG consortium. This is a loosely affiliated group of labs and universities that have banded together to supply large computing resources to various branches of science (not just HEP). They have some umbrella funding, but its future is not obvious. I wonder if consortiums like OSG will last, or will we live in a world of very large facilities like TeraGrid that we buy time on?
He also stated what would be a repeated theme: moving data files around is the single largest weak point of the GRID currently. He is right about that!
Plenary: High Performance Computing at the Petascale and Beyond (Blue Gene!!) – James Sexton (IBM)
James gave another fantastic vendor talk. It was interesting to see the big difference between IBM’s vision and other companies’ visions. Most of the discussion was of Blue Gene and how it fits into the future of computing. IBM positions it as a research machine that they can sell. They expect to develop new technologies there and then slowly feed them down into other product lines.
Blue Gene is a move away from commodity processors: they attempt to put everything on a single chip: memory controller, bus interface, etc. With as many processors as they plan to put in a rack they need absolute reliability. They are using a core based on the PPC (duh) running at only 800 MHz. They put 32 of these cores on a single card along with 10 Gigs of memory. This means each core has about 300 megs to call its own. They then pack these cards into racks — about 4000 cores per rack. They then have high speed interconnects between the cards and racks.
Livermore has 64 racks. Their mean time between failures is 7 days. All I can say is wow! Our L3 system runs about a week before a node fails, and we have only about 1400 cores in that (small!) farm. 64 racks is 256,000 cores, or about 180 times bigger than our L3 trigger farm.
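A quick sanity check of the numbers above (my own back-of-the-envelope arithmetic, not IBM’s figures):

```python
# Blue Gene numbers as quoted in the talk: 32 cores per card sharing
# 10 GB of memory, ~4000 cores per rack, 64 racks at Livermore.
cores_per_card = 32
mem_per_card_mb = 10 * 1024
mem_per_core_mb = mem_per_card_mb / cores_per_card
print(f"memory per core: {mem_per_core_mb:.0f} MB")  # 320 MB, i.e. "about 300 megs"

cores_per_rack = 4000
racks = 64
total_cores = cores_per_rack * racks
print(f"total cores: {total_cores}")  # 256000

l3_cores = 1400  # our (small!) L3 trigger farm
print(f"ratio to L3 farm: {total_cores / l3_cores:.0f}x")  # ~183x, "about 180 times"
```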
Applications that have run on Blue Gene include atomic and materials simulations. The comment from the researchers was that these types of problems weren’t easy to put onto a Blue Gene, but it was possible. The protein folding people have also done some work — but they say they need another x3 in compute power. Ouch!!!
James, along with many others, pointed to the fact that the cost of memory is not falling very quickly, so economics is going to force one into the Blue Gene model, or something that looks like it. Further, disk storage, and getting data in and out of it, is also not keeping up, leading one to wonder: will it be cheaper to re-derive results than to store the intermediate ones?
His future predictions were pretty guessable: the 2-4 GHz max clock rate is here to stay, so be ready to move to using millions of cores. Memory bandwidth is going to be a problem, and reliability is going to become that much more important in these 1000+ core systems.
Plenary: Canadian Cyberinfrastructure – Bill St. Arnaud (CANARIE)
Bill started by saying that Canada was late to the GRID. This statement surprised me: I remember Canada being one of the first GRID sites to help D0 out with Monte Carlo processing. The CANARIE group is particularly strong in networking. Bill talked about several technologies that allow one to reconfigure networks on the fly. I’m not going to claim I understand this, nor do I understand why it is particularly attractive (except, perhaps, from a security point of view), but other talks at the conference made me think this scheme is going to be widely adopted.
Parallel: Booting ROOT with BOOT – Rene Brun
The idea behind this coming improvement to ROOT is that most ROOT projects use a very small portion of ROOT. So why does all that memory have to be loaded? Why do all those files have to be distributed? Further, there are small bug fixes and it would be nice to distribute them without having to redistribute the whole of ROOT — and allow the user to update on the fly.
In order to reduce ROOT’s memory footprint, the ROOT team is doing what people have been asking them to do for the last 10 years: reduce interdependencies. Rumor has it that the straw that broke the camel’s back was when Rene made a minor change to the TClass.h file and discovered that he had to recompile all of ROOT. This is fantastic! Reduced dependencies (and the ROOT team has discovered abstract classes in a big way! Wow!) should make ROOT a more stable product in the long run. Rene showed some data towards the end giving evidence that in the latest versions of ROOT the memory footprint has already been reduced — and, thus, the start-up time as well (they are down from a peak heap size of about 30 megs to 4!). Fantastic!
Parallel: ROOT Graphics: Status and Future – Oliver Couet
Oliver described a bunch of new plot styles (Spider, Parallel, Box, and others). In particular, the Parallel plots looked like they might be interesting for particle physics. The spider and box plots will take me a while to get my head around; I’m not yet sure how they could be useful.
They have also shifted ROOT over to OpenGL now, which has enabled some of ALICE’s amazing graphics.
Parallel: ROOT I/O Performance and Functionality – Philippe Canal
Philippe is a very brave man. He spends all of his time down in the I/O sub-layer of ROOT. The ROOT team is usually listed as Fons and Rene — but I think Philippe must be getting close to deserving a spot on that list (if he isn’t already past threshold). The I/O subsystem must be fairly thankless work: it contains some very cool problems that have to be solved without sacrificing speed, but it also has to be some of the most difficult and tedious code to maintain!
A lot of work has gone into storing data as compressed values — for when you don’t need to save full precision. It sounds like they are using a technique similar to the one Scott Snyder used when he wrote D0’s I/O subsystem many moons ago. They have also improved the asynchronous read-ahead over a network — techniques to keep a data-hungry ROOT application well fed despite network latencies. In anticipation of greater PROOF adoption, they have also improved the code that concatenates results from several files into a single file. Another very cool thing: they can take an ASCII file and read it into a TTree without the user having to write any code! This will be very helpful for quick-and-dirty TTree building! There was also a cool-sounding improvement to TEventList (use TEntryList now) – a sparse and highly optimized way to store the events that pass cuts.
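The reduced-precision storage idea can be sketched generically. This is my own illustration in plain Python, not ROOT’s actual implementation (ROOT exposes it through types like Double32_t): if you zero out the low-order mantissa bits of a double before writing it, the byte stream compresses far better, at a bounded cost in relative precision.

```python
import struct

def truncate_float(x: float, keep_mantissa_bits: int = 12) -> float:
    """Zero out the low-order mantissa bits of a 64-bit IEEE double.
    A double carries 52 mantissa bits; keeping only the top few makes
    the bytes much more compressible, with relative error below
    2**-keep_mantissa_bits."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    drop = 52 - keep_mantissa_bits
    bits &= ~((1 << drop) - 1)  # clear the low mantissa bits
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

x = 3.141592653589793
y = truncate_float(x, keep_mantissa_bits=12)
print(y)                         # close to pi, but with only ~12 mantissa bits
print(abs(x - y) / x < 2**-12)   # prints True: relative error is bounded
```

Exact values (like 1.0, whose low mantissa bits are already zero) round-trip unchanged, so the trick only costs precision where the full precision wasn’t meaningful anyway.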