HEP in the Cloud March 20, 2008

Posted by gordonwatts in ATLAS, computers, Uncategorized.

Amazon has done a lot of work to make GRID computing services accessible to anyone who wants them. Actually, it surprised me that Google or Microsoft didn’t do it first: to run their search engines and other similar things they must have farm computing down to a tee.

In HEP we spend a huge amount of money and time on the GRID. A discussion in a bar some time back generated the question: what would it cost to move HEP into the cloud?

Databases

Yesterday I mentioned databases for storing event data. Amazon has SimpleDB (see this posting to get an idea of how it works). On the surface it looks rather poorly suited to what we would want to do with our highly structured data. But ignoring that, and some of the overhead Amazon will charge for, the 100 GB of data that Rich had in his database would cost about 150 bucks a month to store. Querying is dirt cheap: 14 cents per hour of CPU time used. I have no idea what the performance would be on a database like this, but even if it were 10x slower I doubt it would matter much.

ATLAS’ equivalent of Rich’s database is thought to grow by 14 TB/year. At the same storage rate, one year of that works out to about $21,500/month.
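The database arithmetic above can be sketched in a few lines of Python. This is only a rough check, not Amazon's price list: the per-GB rate is inferred from the "100 GB for about 150 bucks" figure, and the use of binary terabytes (1024 GB per TB) is an assumption that happens to reproduce the $21,500 number.

```python
# Back-of-the-envelope SimpleDB storage cost, using only figures from the post.
# Assumption: the rate is inferred from "100 GB costs about $150/month".
USD_PER_GB_MONTH = 150.0 / 100.0  # = 1.50, implied, not an official price
GB_PER_TB = 1024                  # assumption: binary terabytes


def monthly_storage_cost(size_gb, usd_per_gb_month=USD_PER_GB_MONTH):
    """Return the monthly cost in USD of keeping size_gb stored."""
    return size_gb * usd_per_gb_month


rich_db_cost = monthly_storage_cost(100)
atlas_db_cost = monthly_storage_cost(14 * GB_PER_TB)

print(f"Rich's 100 GB database: ${rich_db_cost:,.0f}/month")
print(f"ATLAS, 14 TB:           ${atlas_db_cost:,.0f}/month")
```

Query (CPU-hour) charges are left out, as in the estimate above.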

Event Data

Amazon has a simple storage service as well (Amazon S3). Because the data is just a binary blob the cost of storage is much cheaper: 15 cents per GB per month. However, it is difficult to figure out how much space ATLAS would actually use if it ignored its actual design and stored everything in the cloud. Making some rough estimates from an old version of the computing model, I’m going to guess about 10 PB per year (that is petabyte!). That is about 1.6 million bucks per month. But we aren’t done yet: it costs money to move the data in and out. First, just to load the data will cost about 1 million.

Then we have to use the data. Let’s say each year we cycle through all the data once, so all 10 PB. That will run about 2.5 million per year (not per month!). But if we use Amazon’s EC2 compute cloud, moving data to it and back is free. In that case, probably only the final datasets would be moved out. That would be much cheaper.
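As a rough check of the S3 numbers, here is a small Python sketch. The only rate taken directly from the text is the 15 cents per GB per month for storage. The transfer rates are not quoted above, so the sketch works backwards from the "about 1 million" and "about 2.5 million" totals to show what per-GB rates they imply. Binary units (1024 x 1024 GB per PB) are an assumption.

```python
# S3 storage and transfer estimate, using the figures quoted in the post.
GB_PER_PB = 1024 * 1024          # assumption: binary petabytes
S3_USD_PER_GB_MONTH = 0.15       # storage rate quoted in the post
DATA_PB = 10                     # guessed ATLAS volume per year
LOAD_COST_USD = 1.0e6            # "about 1 million" to load the data
READ_COST_USD = 2.5e6            # "about 2.5 million per year" to read it once

data_gb = DATA_PB * GB_PER_PB

# Monthly bill for holding one full year of data.
s3_monthly_storage = data_gb * S3_USD_PER_GB_MONTH

# Per-GB transfer rates implied by the post's totals (not official prices).
implied_load_rate = LOAD_COST_USD / data_gb
implied_read_rate = READ_COST_USD / data_gb

print(f"Storage: ${s3_monthly_storage:,.0f}/month")
print(f"Implied transfer-in rate:  ${implied_load_rate:.3f}/GB")
print(f"Implied transfer-out rate: ${implied_read_rate:.3f}/GB")
```

The storage line comes out just under 1.6 million per month, matching the estimate above.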

Computing

This matches up with Amazon’s EC2, and it is even harder for me to calculate. One cool thing is that data transfer between these computers and S3 is free. Otherwise, for a 32 bit single processor machine that has enough memory to run ATLAS software it looks like it costs about 10 cents per hour of use. Now, in ATLAS an estimate in 2005 was that it would take about 3000 kSI2k-seconds to reconstruct the average event. So, for an Amazon machine (that is about 1.9 kSI2k) that would take about 26 minutes, or about 5 cents per event. If we expect 2,000,000,000 events per year, then that will cost us $100 million to reconstruct. If someone is familiar with SpecINT2000 and how it works, perhaps they can verify I did this math “ok”. And I’ve not included analysis time, which is probably x2 more.
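The reconstruction estimate is easy to redo in Python. The sketch below uses only the numbers quoted above and assumes the 3000 figure is in kSI2k-seconds per event, which is what the 26 minute result requires. It shows both the unrounded cost per event and the effect of rounding up to 5 cents.

```python
# EC2 reconstruction cost estimate, using the figures quoted in the post.
RECO_KSI2K_SEC_PER_EVENT = 3000.0  # assumption: units are kSI2k-seconds
MACHINE_KSI2K = 1.9                # rating of one Amazon machine (post)
USD_PER_MACHINE_HOUR = 0.10        # EC2 price quoted in the post
EVENTS_PER_YEAR = 2_000_000_000

seconds_per_event = RECO_KSI2K_SEC_PER_EVENT / MACHINE_KSI2K
minutes_per_event = seconds_per_event / 60.0
cost_per_event = seconds_per_event / 3600.0 * USD_PER_MACHINE_HOUR

# Yearly cost with the unrounded per-event cost, and with the
# post's rounding up to 5 cents per event.
yearly_cost_exact = EVENTS_PER_YEAR * cost_per_event
yearly_cost_rounded = EVENTS_PER_YEAR * 0.05

print(f"{minutes_per_event:.1f} minutes per event")
print(f"${cost_per_event:.4f} per event")
print(f"${yearly_cost_exact / 1e6:.0f}M/year unrounded, "
      f"${yearly_cost_rounded / 1e6:.0f}M/year at 5 cents per event")
```

Without the rounding the yearly figure is closer to 88 million than 100 million, which is well within the precision of an estimate like this.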

So, there you have it. A lot of money would go into running this in the cloud. Of course, we could never walk up to someone like Amazon and dump this on them. In almost all cases we will do better on our own as we can optimize what we are doing for our uses. Further, the cash that gets spent on this is from all over, and in all different colors. Many nations, for example, buy GRID installations for all scientists in their country. ATLAS just piggybacks on these purchases and uses a portion of them. Still, interesting to see what the cost would be – about 120 million before you even start to analyze the data to produce a physics result!
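The post does not spell out which items go into the 120 million, so the following Python sketch is only one guess at the breakdown: a year of reconstruction, twelve months of S3 storage at the full 10 PB, and the one-time load. All three inputs are the rounded figures from the sections above. The choice of which items to include is an assumption.

```python
# One possible breakdown of the "about 120 million" first-year total.
# Assumption: the post does not list the components; this is a guess.
RECONSTRUCTION_USD = 100.0e6       # yearly EC2 reconstruction (post)
S3_STORAGE_USD_PER_MONTH = 1.6e6   # S3 storage for 10 PB (post)
S3_LOAD_USD = 1.0e6                # one-time load into S3 (post)

yearly_storage = 12 * S3_STORAGE_USD_PER_MONTH
total_first_year = RECONSTRUCTION_USD + yearly_storage + S3_LOAD_USD

print(f"Storage for a year: ${yearly_storage / 1e6:.1f}M")
print(f"Total first year:   ${total_first_year / 1e6:.1f}M")
```

Charging twelve months at the full 10 PB overstates storage, since the data would accumulate over the year, but reconstruction dominates the total either way.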

WARNING: this is very much a back-of-the-envelope calculation!!

Comments»

1. Anonymous - March 20, 2008

I don’t think they make any guarantees about 24 hour availability or data loss. You’d need to make some sort of custom insurance plan with them for that, more $$$.

2. Gordon Watts - March 20, 2008

Yes, that is totally true. It costs so much to run the accelerator complex and the detector itself that you want to make sure you always have two copies of the data. And you are, of course, willing to pay for that. But looking at the costs above, it looks like the raw CPU is what is always going to dominate. Of course, the above is not a careful accounting. As I mentioned, it will get worse as you want to analyze the data!

3. MIke M - March 20, 2008

I think that clouds certainly provide a nice solution for simulations, which aren’t constrained by storing/locating any input data, and can stage-out data from the site at the end of the job. I haven’t confirmed it, but I heard a rumor that the STAR experiment runs simulation production on the Amazon cloud from time to time. I’ll try to dig up the details…

4. Adam Kocoloski - March 20, 2008

Yep, there’s an ongoing Xen/EC2 effort that’s on the agenda at most STAR grid meetings. Last I heard we put together a STAR-customized EC2 image and can use it to run simulations, but we hadn’t actually started any full-scale productions on Amazon hardware yet.

5. Nick - March 20, 2008

I’m fascinated by the idea of misusing some of these technologies that were designed for high-traffic web services for HEP data and the like. Not to say that HEP computer scientists didn’t do their homework but how prevalent is the understanding that scaling websites probably runs into the same issues as scaling HEP data?

6. gordonwatts - March 20, 2008

Nick – I think that our data usage is much more in line with some of the big data mining that goes on rather than web site scaling. Data flow and usage patterns are more closely matched to the data mining that Walmart might do when looking at purchase patterns in their nation-wide database…

Adam – yeah, we looked at doing that. But then we realized, looking around, there were all these Windows boxes that were sitting underused around the department. Many are used for undergraduate labs, for example. So we set up a Linux VM and off it went. Soon our computer guy discovered that a friend of his over in the architecture dept. had a bunch of machines sitting idle, so we were running over there. Before you knew it, we had about 100K events processed. It was very nice. Of course, ATLAS refused to accept them as official MC, but it did mean we could get going on our analysis while waiting for the official MC samples to show up!

7. gordonwatts - March 20, 2008

Mike M — I think that is correct. When you don’t need large database access, etc., clouds seem to be better suited to doing simulation. Data requires lots of lookups to process. However, everyone is now moving the database into the cloud too. Amazon, for example, gives you a cut-rate on data transfer between their database and their compute cloud… so no reason not to load-her-up!

8. Jeff Barr - March 22, 2008

Hi Gordon, this is a very interesting analysis. I’m very interested in the use of AWS for scientific purposes and would be happy to talk further. You’ve got my email; feel free to drop me a note if you have any questions or comments.

9. gordonwatts - March 24, 2008

Jeff — thanks. Not yet (thank goodness I don’t coordinate all of that work!). But if anyone wants to contact Jeff about this directly, send me an email and I’ll give you his email address.

10. Thoughts on Google App Engine « Adam Kocoloski - April 9, 2008

[…] In short, Google App Engine looks like a slick solution for web application developers who want to scale up fast. It’s far less flexible than Amazon Web Services — App Engine is definitely not a grid computing solution. They say so right in the introduction, and the significant restrictions Google places on developer codes back that up. In contrast, AWS’s loosely-coupled combination of EC2, S3, and SimpleDB allows for a wider variety of applications with requirements that are much different than those of a traditional web app (say, HEP computing). […]

