It’s the Data, Stupid! May 5, 2008

Posted by gordonwatts in computers.

I’ve mentioned before that I think multicore computing is going to hit HEP hard. The basic problem is that we run all of our jobs in an embarrassingly parallel way. If the machine has 8 cores on it, then we run 8 reconstruction jobs. There is nothing wrong with this on the surface, and, indeed, performance tests indicate that so far we observe an almost linear speed increase as a function of the number of cores.

The problem, I fear, is getting data on and off the chip with the many cores. For code to run fast it must be well fed with CPU instructions and the data it is processing. Memory and the CPU have only so much bandwidth. The HEP way of running means that all the various cores are working on very different things - different data, different code - which means on-chip cache hits will be low - requiring more data from outside the chip.

At some point this has to start to impact performance. We’ll see that when we stop getting linear increases in performance as we add cores.

Ars had a nice article a day or so ago on this issue when discussing the various processors and their multi-core capabilities. The study they reference looks specifically at the memory bandwidth. IBM’s Cell processor did the best - this is the chip used in the PS3 gaming console. It was specifically designed for high memory bandwidth. Intel’s architectures didn’t do nearly as well.

I also found it interesting that before they were able to test the memory bandwidth they had to deal with some other bottlenecks in the Intel chips, specifically the Translation Lookaside Buffer (TLB) (see section 4.1 of the paper for a detailed discussion). Basically, every time the CPU goes to memory it must translate the requested virtual memory address into a physical memory address. To speed up this process, the CPU maintains a cache of these translations. If you hit the cache your memory access proceeds at a very quick speed. If you don’t, then things grind to a halt while the CPU looks up the translation. If you are accessing data spread over a large area of memory you are bound to be constantly overflowing the TLB. This is interesting because it strongly resonates with one of the presentations on optimizing CMS code at CHEP, specifically the part on their memory allocation and deallocation (see page 18 of the talk). Heavy allocation means the CPU is constantly jumping around from whatever code it is working on, to the alloc/dealloc, to scanning memory for free blocks - all these operations use up slots in the TLB without actually “getting work done”.
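
To make the "data spread over a large area of memory" point concrete, here is a minimal sketch that counts how many distinct pages an access pattern touches, which is roughly the number of translations the TLB would need to hold. The 4 KiB page size used below and the helper names (distinct_pages, strided_addresses) are assumptions for illustration; they are not taken from the paper or the Ars article.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Count how many distinct pages a sequence of byte addresses touches.
// Each distinct page needs its own address translation.
std::size_t distinct_pages(const std::vector<std::uint64_t>& addresses,
                           std::uint64_t page_size) {
    std::unordered_set<std::uint64_t> pages;
    for (std::uint64_t addr : addresses) {
        pages.insert(addr / page_size);  // page number containing this address
    }
    return pages.size();
}

// Build n addresses starting at 0, each `stride` bytes after the previous one.
std::vector<std::uint64_t> strided_addresses(std::size_t n, std::uint64_t stride) {
    std::vector<std::uint64_t> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(static_cast<std::uint64_t>(i) * stride);
    }
    return out;
}
```

With 4096-byte pages, 1024 reads of consecutive 8-byte values touch only 2 pages, while the same 1024 reads spaced one page apart touch 1024 pages. The second pattern is the kind that is far more likely to overflow a TLB.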

Fortunately, in the Ars article, they mention that the CPU manufacturers, like Intel, are aware of this, and the next generation of chips will have much larger TLBs - which should help with this issue. It is always the next generation, isn’t it? :-)

I wonder if we in HEP will ever hit this memory bandwidth bottleneck? Perhaps when we have 8 cores on a chip (when is that predicted to happen? Next year!? :-)). Getting around the bottleneck will require some major work in how we write our code: it is all inherently single-threaded.

BTW, the point of this study was to optimize a physics simulation — a magnetohydrodynamic system. Physics is everywhere!

Spam Facts April 26, 2008

Posted by gordonwatts in computers.

UW is a big place. And the computing folks there manage a lot of systems. Mail, of course. Our group’s computer guy knows a person involved in C&C’s mail operations and discovered that:

40,000!!! It is no wonder everyone is switching over to instant messaging.

A Little Too Self Involved… April 23, 2008

Posted by gordonwatts in computers, physics life.

An email just went by on a public mailing list that contained the following gem:

I have many users who have no compiler, as they are physicists and have no use for C++.

I forget that in HEP we could not do physics without C++!

Is There a Spam Storm? April 17, 2008

Posted by gordonwatts in computers.
1 comment so far

What is up with the spam today? Usually I have perhaps 5 or 6 spam messages when I first log in. Today I had over 100. And they are coming into my mailbox at the rate of about 2 per minute. This makes email almost totally useless!

Some spammer is using my email address as their fake from address. This is perfect for the spammers - they inflict all the bounce messages on someone else! How cool is that!?

I can see why folks are driven away from email. This is truly awful! I’ll not be answering much email today!

Update: More like 5 or 6 per minute right now!

HEP in the Cloud March 20, 2008

Posted by gordonwatts in ATLAS, computers.
10 comments

Amazon has done a lot of work to make GRID computing services accessible to anyone that wants it. Actually, it surprised me that Google or Microsoft didn’t do it first — to run their search engines and other similar things they must have farm computing down to a tee.

In HEP we spend a huge amount of money and time on the GRID. A discussion in a bar some time back generated the question: what would it cost to move HEP into the cloud?

Databases

Yesterday I mentioned databases for storing event data. Amazon has SimpleDB (see this posting to get an idea of how it works). On the surface it looks rather poorly suited to do what we would want to do with our highly structured data. But, ignoring that and some of the overhead it will charge - for the 100 GB of data that Rich had in his database it would cost about 150 bucks a month to store it. Querying is dirt cheap — 14 cents per hour of CPU time used. I have no idea what the performance would be on a database like this, but even if it were x10 slower I doubt it would matter much.

ATLAS’ equivalent database to Rich’s project is thought to be 14 TB/year. That works out to be $21,500/month.
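
As a quick check of that scaling, here is a minimal sketch of the arithmetic. The per-GB price is simply the one implied by the figures above ($150 a month for 100 GB), and taking 1 TB = 1024 GB is an assumption; it happens to be the convention that lands on about $21,500. The function name is made up for illustration.

```cpp
// Storage price implied by the figures above: about $150 per month for 100 GB.
constexpr double kSimpleDbDollarsPerGbMonth = 150.0 / 100.0;

// Assumption: 1 TB = 1024 GB (the convention that reproduces ~$21,500).
constexpr double kGbPerTb = 1024.0;

// Monthly SimpleDB storage bill for a database of the given size in TB.
double simpledb_monthly_dollars(double terabytes) {
    return terabytes * kGbPerTb * kSimpleDbDollarsPerGbMonth;
}
```

Calling simpledb_monthly_dollars(14.0) gives 21504, which rounds to the $21,500 quoted above. Note that this is the bill for one year's worth of data; it would grow with every additional year stored.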

Event Data

Amazon has a simple storage service as well (Amazon S3). Because the data is just a binary blob the cost of storage is much cheaper: 15 cents per GB per month. However, trying to figure out how much storage ATLAS would actually use if it stored everything in the cloud, ignoring the actual design, is difficult. Making some rough estimates from an old version of the computing model, I’m going to guess about 10 PB per year (that is petabytes!). That is about 1.6 million bucks per month. But we aren’t done yet - it costs money to move the data in and out. First, just to load the data it will cost about 1 million.

Then we have to use the data - let’s say each year we cycle through all the data once - so all 10 PB. That will run about 2.5 million per year (not per month!). But if we use Amazon’s EC2 compute cloud, moving data to it and back is free. In that case, only final datasets will probably be moved. That would be much cheaper.
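
The storage number can be reproduced with a few lines. This is only a sketch: the 15 cents per GB per month is from the text above, the binary conversion (1 PB = 1024 x 1024 GB) is an assumption, and the second helper just backs out whatever per-GB transfer price is implied by a quoted total, since the transfer rates themselves are not given here.

```cpp
// S3 storage price quoted above.
constexpr double kS3DollarsPerGbMonth = 0.15;

// Assumption: binary units, 1 PB = 1024 * 1024 GB.
constexpr double kGbPerPb = 1024.0 * 1024.0;

// Monthly S3 storage bill for the given number of petabytes.
double s3_monthly_storage_dollars(double petabytes) {
    return petabytes * kGbPerPb * kS3DollarsPerGbMonth;
}

// Per-GB price implied by a quoted total for moving `petabytes` of data.
double implied_dollars_per_gb(double total_dollars, double petabytes) {
    return total_dollars / (petabytes * kGbPerPb);
}
```

For 10 PB, s3_monthly_storage_dollars(10.0) comes to 1,572,864, i.e. the "about 1.6 million" above. The "about 1 million" to load the data corresponds to roughly 10 cents per GB under the same assumptions.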

Computing

This is even harder for me to calculate. This matches up with Amazon’s EC2. One cool thing is that moving data between these computers and S3 is free. Otherwise, for a 32 bit single processor machine that has enough memory to run ATLAS software it looks like it costs about 10 cents per hour of use. Now, in ATLAS an estimate in 2005 was that it would take about 3000 kSI2k-seconds to reconstruct the average event. So, on an Amazon machine (rated at about 1.9 kSI2k) that would take about 26 minutes, or about 5 cents per event. If we expect 2,000,000,000 events per year, then that will cost us about $100 million to reconstruct. If someone is familiar with SpecINT2000 and how it works, perhaps they can verify I did this math “ok”. And I’ve not included analysis time, which is probably x2 more.
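
For anyone who wants to check the math, here is a minimal sketch of the chain of divisions. It treats the 3000 figure as kSI2k-seconds of work per event, which is an assumption, though it is the reading under which the 26 minutes comes out. The struct and function names are made up for illustration.

```cpp
// Result of the back-of-the-envelope reconstruction estimate.
struct RecoCost {
    double minutes_per_event;
    double dollars_per_event;
    double dollars_per_year;
};

// work_ksi2k_sec:   CPU work per event, in kSI2k-seconds (assumed unit)
// machine_ksi2k:    rating of one rented machine, in kSI2k
// dollars_per_hour: rental price of that machine
// events_per_year:  number of events to reconstruct each year
RecoCost reco_cost(double work_ksi2k_sec, double machine_ksi2k,
                   double dollars_per_hour, double events_per_year) {
    const double seconds = work_ksi2k_sec / machine_ksi2k;
    const double per_event = (seconds / 3600.0) * dollars_per_hour;
    return RecoCost{seconds / 60.0, per_event, per_event * events_per_year};
}
```

With the numbers above, reco_cost(3000.0, 1.9, 0.10, 2.0e9) gives about 26.3 minutes and about 4.4 cents per event, or roughly $88 million per year. Rounding up to 5 cents per event is what turns that into the $100 million figure.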

So, there you have it. A lot of money would go into running this in the cloud. Of course, we could never walk up to someone like Amazon and dump this on them. In almost all cases we will do better on our own as we can optimize what we are doing for our uses. Further, the cash that gets spent on this is from all over, and in all different colors. Many nations, for example, buy GRID installations for all scientists in their country. ATLAS just piggybacks on these purchases and uses a portion of them. Still, interesting to see what the cost would be - about 120 million before you even start to analyze the data to produce a physics result!
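
One way to arrive at roughly 120 million is to add up the rounded first-year figures from the sections above. The breakdown is not spelled out in the text, so this tally is an assumption: twelve months of S3 storage, the one-time upload, and one pass of reconstruction, leaving out the SimpleDB bill and the cost of reading the data back out.

```cpp
// Rounded first-year figures taken from the sections above, in dollars.
struct FirstYearBill {
    double storage;         // 12 months of S3 storage at ~1.6 million per month
    double upload;          // one-time cost to load the data
    double reconstruction;  // EC2 time to reconstruct one year of events

    double total() const { return storage + upload + reconstruction; }
};

// Assumed composition of the total; other readings are possible.
FirstYearBill first_year_bill() {
    return FirstYearBill{12.0 * 1.6e6, 1.0e6, 100.0e6};
}
```

Here first_year_bill().total() is about 120.2 million, with reconstruction dominating everything else.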

WARNING: this is very much a back-of-the-envelope calculation!!

HEP in a Database March 19, 2008

Posted by gordonwatts in D0, computers.
7 comments

Not everyone is satisfied with ROOT as the “tool” to analyze HEP data. Back in D0’s Run I all the data was loaded into a commercial database.

So, before you roll your eyes - you are right. HEP is littered with database train wrecks (can anyone say Objectivity?). However, most of those had to do with trying to store every single last bit of data that came off the data acquisition system in the database. And then also store reconstructed data. And then, in some cases, even the analysis level objects. In fact, ROOT grew out of disagreement with this vision (and you can tell who won…).

This project, however, was different. The goal was to store only the high level physics information. For a reconstructed jet, for example, they had the four vector and some other quantities (like electromagnetic fraction of calorimeter energies - 28 values in all). They had separate markers for tight, very high quality electrons and loose, lower quality electrons. Same for muons, jets, etc. To understand the limitations of this, and what you might or might not do with this tool: if you changed your jet energy scale you would have to completely re-load the database. This is not something you do frequently, but you get the idea: this is to do your final selection - the last mile of your analysis. Indeed, the test case was to repeat the Run I top discovery analysis. However, if you can do selection quickly, imagine the power for scanning over a large SUSY parameter space!

How much data? About 62 million events. As raw ntuples it came to 62.4 GB (small by today’s standards, of course!). It took almost 1000 hours to generate these ntuples - applying jet energy scale, etc. After being inserted into the database it was 80 GB of raw data, and another 30 GB of database index data.

They used Microsoft’s SQL Server for this. On a qual 450 MHz Pentium II with 256 MB of memory. Does that tell you how long ago this experiment was done!?

Actually, their DB design was pretty clever. All electrons in one table, all jets in another. Then another table which just listed all tight electrons, and another one that listed all loose electrons, etc.
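
To give a feel for why that layout makes selection fast, here is a small in-memory analogy. It is only a sketch: the real thing was a set of SQL Server tables, and the column names, the row-index representation of the "tight" table, and the function name below are all assumptions for illustration rather than the actual schema.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// One row of a hypothetical "all electrons" table.
struct ElectronRow {
    int event_id;
    double pt;  // illustrative column only; the real table held more values
};

// Return the ids of events with at least `n` electrons listed in the tight
// table. `tight_rows` holds row indices into `electrons`, mimicking a
// separate table that just lists which electrons are tight.
std::vector<int> events_with_n_tight(const std::vector<ElectronRow>& electrons,
                                     const std::vector<std::size_t>& tight_rows,
                                     int n) {
    std::map<int, int> count_per_event;
    for (std::size_t row : tight_rows) {
        ++count_per_event[electrons.at(row).event_id];
    }
    std::vector<int> selected;
    for (const auto& entry : count_per_event) {
        if (entry.second >= n) {
            selected.push_back(entry.first);
        }
    }
    return selected;
}
```

The point of the design is that a selection like "two tight electrons" only has to walk the short tight list, not every electron in every event.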

So, how fast did this thing run? Looking for a Z boson decaying to two electrons took about 7 seconds. It found about 6000 events - the right number. Looking for a W boson decaying to an electron and neutrino took about 18 seconds to find 86,000 events. That is pretty darn good!

Are there plans to do this in ATLAS? Well, perhaps. We have a physics summary database - but it isn’t complete (e.g. doesn’t have all the jets in an event). Its design goal is different: you use it to select a sample of events you actually want to run over.

The project was led by Rich Partridge at Brown University (with a lot of help from an undergraduate, Matt Bowen). For more raw information you can see a talk by Rich at a SLAC meeting the other day (CERN ATLAS agendas, look for meetings on Feb 27, the SLAC ATLAS forum).

At any rate, this was something I’ve been meaning to write about for a while. Unfortunately for an approach like this, about 95% of an analyzer’s time is spent trying to understand what exactly is a tight electron - and its fake rate. However, anything that makes for fast turn around is a boon in my book!

Understanding an old Level 3 Bug March 17, 2008

Posted by gordonwatts in D0, computers.
1 comment so far

About one or two years ago I had to fix a bug in the D0 DAQ Supervisor. The Supervisor is responsible for coordinating the configuration of 400 or 500 farm nodes and about 80 front end crates that generate the data. It is massively multi-threaded. When it is at its busiest it has over 200 threads running. Most are simply me being too lazy to do anything but block while trying to send data to the Internet. Back in the day it ran on a slow dual-core machine under Linux and I did my best to avoid all locks that I could in my multi-threaded code - because locking is expensive, and the Supervisor needed every bit of speed help it could get back then (on a modern machine it is plenty fast enough).

My code was basically some initialization like the following:

global_a = 1.0;
global_b = 5.0;
global_inited = true;

Once global_inited was set to true, then I knew it was safe for the rest of my other threads to look at a and b:

if (global_inited) {
  // use global_a and global_b…
}

Unfortunately this didn’t always work - sometimes the program behaved as if random values had been entered for a and b. I was never able to reproduce this either. It would happen only once in a while, and restarting the supervisor usually fixed it. Eventually, to fix this bug, I re-structured my code so that all the initialization happened before any other thread was started. After that I never saw the bug again. But I never understood why I was seeing the bug!

A guy who works deep in the stack at Microsoft recently started a blog. One of his first posts explains, possibly, what bit me: the compiler and the CPU (both!!) are allowed to reorder the writes to global_a, global_b, and global_inited!! Since this bug was not reproducible it was probably done by the CPU, though at the time I never tested that (or ever really figured out what caused this).
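
If reordering really was the cause (which was never confirmed), one way to keep the flag pattern without a lock is to make the flag an atomic with release/acquire ordering. This is a minimal sketch using C++11's std::atomic, which was not available when the Supervisor was written, so it is not what the original code did; the function names are made up for illustration.

```cpp
#include <atomic>

// Plain (non-atomic) globals, as in the snippet above.
double global_a = 0.0;
double global_b = 0.0;

// The flag is atomic; it is the only variable the threads race on.
std::atomic<bool> global_inited{false};

// Writer side: fill in the data, then publish it.
void init_globals() {
    global_a = 1.0;
    global_b = 5.0;
    // Release store: the two writes above may not be moved after this line,
    // by either the compiler or the CPU.
    global_inited.store(true, std::memory_order_release);
}

// Reader side: copies out a and b only if they have been published.
// Returns false (and leaves a and b untouched) otherwise.
bool read_globals(double& a, double& b) {
    // Acquire load: the reads below may not be moved before this line.
    if (global_inited.load(std::memory_order_acquire)) {
        a = global_a;
        b = global_b;
        return true;
    }
    return false;
}
```

In real use init_globals would run on one thread and read_globals on the others. A reader that sees the flag as true is then guaranteed to see 1.0 and 5.0 rather than stale values.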

Building ROOT on Windows March 2, 2008

Posted by gordonwatts in ROOT, computers.
2 comments

Not so much interest to most people here…

So, on Windows, there has been the occasion that I’ve needed to build ROOT. In particular, on Windows, this is because the released version of ROOT is not built against python. I also find myself doing lots of building when bugs are found in ROOT (yeah, I know, ROOT doesn’t have bugs).

The standard way to do the build is to download and install Cygwin. It is a rather heavy install, however, as any of you who have installed it know. Axel, a member of the ROOT team, put together a minimal set of tools that will build ROOT using the MSVC compilers. I built a small UI on top of that along with an installer to make it easier to get onto your machine.

I found this pretty useful for getting my work done, so I thought I’d release it. Actually, it has been finished for months, I just never seemed to get around to finishing off the last bit. So, here it is. The UI is horrible, but it works…

arXiv Accepts OOXML!? February 28, 2008

Posted by gordonwatts in computers.
11 comments

This just floored me. Dave mentioned this in a comment in one of my last posts. It looks like the major pre-print archive accepts the OOXML format now. I thought they only accepted PDF and .tex submissions.

This makes me ask the question (though it might not do it to you): is there a tex2ooXML converter (some found here, but one or two I tried didn’t seem to work)? Hear me out before you write me off as crazy (as I think some of you already have). The reason that interests me is that PDF is a static page format. I now read almost all my papers on a screen. The size of the screen and its resolution rarely match letter-size paper. For example, my last laptop purchase was delayed 6 months because I needed a high resolution screen so I could read my PDFs full screen when the screen was rotated on its side - in tablet mode. The increased screen resolution, btw, makes PDFs look a lot better - especially when the computer modern font is used (sorry, couldn’t resist that dig).

What I’d love is to be able to re-flow the documents on the fly to adjust to the screen size. Now, what I could do is run latex in the background. The upside is that no format translation is required and TeX is certainly up to the task. The downside is that automation of this isn’t trivial - some programming work would be required (column sizes, screen sizes and resolutions, font rasterization?). On the other hand, Word will do re-flow and columns automatically, as well as resizing fonts. The downside is you would lose fidelity in the translation.

The place I’m particularly interested in this is these new MID devices - slow CPUs, relatively low resolution, and small screens. I’ve seen Word reflow documents on these devices - it performs ok on a 100 page document (not ideal). I have no idea how long it would take to regenerate a 20 page latex paper (which would then be 100 pages or something). Would it be fast enough to be usable?

LaTeX gives me headaches! February 27, 2008

Posted by gordonwatts in computers.
15 comments

For those of you out in the real world: you use Word or something similar to write all of your documents. In physics we use a typesetting package, TeX. Unlike Word, TeX uses text input files to fully describe the document being produced. This is not WYSIWYG!!

TeX’s inception was in 1977 (the first released version was much after that). Many people in the field of physics will disagree with me when I say that the output that TeX produces is ugly and hard to read (partly because of the font family it uses), and that TeX itself is quite hard to use - especially if you are an infrequent user.

For example, I am editor of a note in ATLAS that needs an internal and external version. The internal version is the same as the external version along with a bunch of extra material. I needed the equivalent of an if statement to turn on and off sections of text depending on what version of the note I was producing. In Word it would take me about 5 minutes to create a new style that would do this - because of the user interface almost all the options are basically “in your face.” When it came to TeX, however, I spent about 2 hours searching the web and trying to create my own commands to do the job before I finally stumbled on a good web page. What a waste of time!

That said, before people in physics think I’ve totally lost it, there are two things I think that TeX does better than Word: it handles long documents (like a thesis) and it deals with figure placement.

TeX is a macro language - built to solve a specific set of typesetting problems. Its macro language is amazingly flexible and gives you access to almost all parts of the layout engine from its code. But its power is also its weakness for someone like me: I almost never need advanced features - so I can never remember what they are or how to access them. And, in fact, the TeX macros are so low level there are numerous packages on top of it (LaTeX is probably the most popular). Word exposes a full object model that you can program against easily in any dynamic language (like the .NET version of python or VB) and, with a little more difficulty, C++. However, my impression is that for really complex typesetting jobs TeX is a bit easier to deal with. Of course, that is just it: as a physicist I rarely, if ever, need or want that level of control, and exposing all of it means the TeX macro language is prohibitively hard to use.