[CHEP2006] Is Google Going into Service Computing?
February 26, 2006. Posted by gordonwatts in computers.
Kathragadda, of Google, gave a talk on data mining and how Google does it (sadly, he has not uploaded his slides). The talk was pretty interesting. Google has set up a huge distributed file system to hold all their data (backup, for Google, means keeping 3 copies of everything). They have large farms of computers on which they can run jobs that process this data. They also have infrastructure set up to automatically look at the data, route jobs to where the data is, and recombine the results.
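To make the "route jobs to where the data is and recombine results" idea concrete, here is a toy sketch in Python. It is not Google's code or API; the node names, chunk names, and functions (`pick_node`, `run_job`) are all made up for illustration, and everything runs in one process rather than on a real farm. It assumes each chunk of data lives on three machines, sends each piece of work to the least-busy machine that already holds a copy, and then merges the partial results.

```python
from collections import Counter

# Hypothetical layout: every chunk is stored on three machines (3 copies).
REPLICAS = {
    "chunk-0": ["node-a", "node-b", "node-c"],
    "chunk-1": ["node-b", "node-c", "node-d"],
    "chunk-2": ["node-a", "node-c", "node-d"],
}

# Toy chunk contents standing in for the real data.
CHUNKS = {
    "chunk-0": "muon electron muon",
    "chunk-1": "electron photon",
    "chunk-2": "muon photon photon",
}


def pick_node(chunk_id, load):
    """Pick the least-loaded machine that already holds a copy of the chunk."""
    return min(REPLICAS[chunk_id], key=lambda node: load[node])


def count_words(text):
    """The per-chunk job: count how often each word appears."""
    return Counter(text.split())


def run_job():
    """Place one task per chunk next to its data, then merge partial results."""
    load = Counter()      # number of tasks assigned to each node so far
    placement = {}        # chunk id -> node chosen to process it
    total = Counter()     # recombined result across all chunks
    for chunk_id in sorted(CHUNKS):
        node = pick_node(chunk_id, load)
        load[node] += 1
        placement[chunk_id] = node
        # In a real system this would run remotely on `node`; here it is local.
        total += count_words(CHUNKS[chunk_id])
    return placement, total


if __name__ == "__main__":
    placement, total = run_job()
    print(placement)
    print(dict(total))
```

The point of the sketch is only the shape of the problem: the job moves to the data rather than the other way around, and the final answer is a merge of many small partial answers.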
Several people in the audience noticed this was a lot like what we have to do when processing data in particle physics. We have huge data sets, distributed over many computers. We send out jobs and then try to recombine the results (I suspect I wasn’t the only one who thought "PROOF" while Kathragadda was talking).
Google has done a bunch for open source, and so someone from the audience asked Kathragadda what the chances were that the various tools Google has written would be made open source. A lot of people could benefit. The answer was very interesting. Yes, he said, Google definitely wanted to do that. But of course, the first priority was to keep Google running and making money. As a result they had tuned things very carefully. They have made modifications to the Linux kernel, for example; this isn’t software you could just install on any old computer. You pretty much need a whole system configured top-to-bottom (hardware, OS, hacks, etc.) for it to work. He then said something along the lines that they would rather maintain the hardware and configuration themselves, and then sell time on the computers (he didn’t say exactly those words, of course).
That is an interesting development. I wonder if they will sell access to the Google database too (the data from their web crawler)? I imagine a lot of people would love to analyze patterns, etc., in the Google data. And if Google did this, I wonder if there is some way that HEP could take advantage of it? And could we do it in a cost-effective way (i.e., cheaper than what we are doing now)?