Here we go again…
September 7, 2006
Posted by gordonwatts in computers, physics.
The data format merry-go-round. When DZERO started Run II data analysis it was using the so-called TMB (thumbnail) format. This was a binary, highly compressed format that let us store all of our event data on disk: complete datasets were hundreds of gigabytes rather than terabytes. But the format was so highly compressed that it was slow. Full decompression of the data could limit your analysis to about 10 events per second. For making simple plots, that doesn’t scale.
And so a new data format was created by a grass-roots effort. Back then it was called TMBTree, a combination of TMB and a ROOT TTree. It was stored in a ROOT format, which achieves good compression and very high read speeds. It was larger than the TMB, but you could run through hundreds of events per second. This format was eventually adopted by the experiment (people voted with their feet) and called CAF (Compact Analysis Format). This is what we are using to analyze Run II data.
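To get a feel for what that speed difference means in practice, here is a back-of-the-envelope sketch in Python. The two rates are the ones quoted above (taking 100 events per second as the low end of "hundreds"); the sample size `N_EVENTS` is a number I made up purely for illustration, not a real DZERO dataset size.

```python
def hours_to_process(n_events: int, events_per_second: float) -> float:
    """Wall-clock hours for a single pass over n_events at a fixed read rate."""
    return n_events / events_per_second / 3600.0


# Hypothetical sample size, chosen only for illustration.
N_EVENTS = 1_000_000

# ~10 events/s: the rate quoted for fully decompressed TMB.
tmb_hours = hours_to_process(N_EVENTS, 10)
# 100 events/s: the low end of "hundreds of events per second".
tree_hours = hours_to_process(N_EVENTS, 100)

print(f"TMB-like rate:  {tmb_hours:.1f} h per pass")
print(f"Tree-like rate: {tree_hours:.1f} h per pass")
```

With these assumed numbers, one pass drops from a bit over a day to a few hours, which is the difference between remaking a plot once a day and remaking it several times a day.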
Well, it seems to be happening again. In fact, the same person who started the grass-roots effort for the CAF format is starting a new one, nicknamed “high pT compact data set” (no catchy name yet). This is ultra-compressed (basically 4-vectors and some vague quality information), and the complete DZERO dataset should be around 15 GB.
Why do I write this? It is a lesson. If an experiment (ATLAS/CMS, etc.) does not provide an easy way to access the data, the people will invent their own. This may not seem like a big deal, but the experimental resources required to create and maintain these formats can be significant. And if a set of people or a physics group decides to adopt one of these grass-roots formats as a common format, then storage and CPU must be allocated for it.
I wonder if a single data format can satisfy all the needs of an experiment? To first order the answer is no: a final analyzer has no need for always-on access to the RAW data, for example, so you can have RAW data, reconstructed data, and compressed data. That is the default that every experiment starts with: three data formats. Usually they are based on the same underlying storage technology. But what about other storage formats? When Andy did his thesis he needed to run over multi-jet data. There is a huge amount of that data, so he and others he worked with invented a very small, very compact data format that kept a minimal amount of information. That way he could run on all the data very quickly.
ATLAS’ AOD format, the currently proposed analysis format, falls into the bucket of not-easily-usable-or-inefficient data formats in its current implementation. People are already generating their own custom ROOT-based formats, and it is only a matter of time before they congeal and grow.