
Here we go again… September 7, 2006

Posted by gordonwatts in computers, physics.

The data format merry-go-round. When DZERO started Run II data analysis it was using the so-called TMB (thumbnail) format. This was a binary, highly compressed format for our data that enabled us to store all of our event data on disk. Complete datasets were hundreds of gigabytes rather than terabytes. But the format was so highly compressed it was slow: full de-compression of the data could limit your analysis to running at about 10 events per second. For making simple plots, that rate doesn't scale.

And so a new data format was created by a grass-roots effort. Back then it was called TMBTree: a combination of the TMB and a ROOT TTree. Because it was in ROOT's format, it achieved good compression and very high speed. It was larger than the TMB, but you could run through hundreds of events per second. This format was eventually adopted by the experiment (people voted with their feet) and was named CAF (Compact Analysis Format). This is what we are using to analyze Run II data.
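For readers who haven't used ROOT: the speed comes from the TTree's column-wise layout. Each variable lives in its own branch and is compressed separately, so an analysis loop only decompresses the branches it actually touches. A rough sketch of what an event loop over such a file looks like (the file, tree, and branch names here are invented for illustration, not the actual CAF layout):

#include "TFile.h"
#include "TTree.h"

void loop() {
  // Open a (hypothetical) CAF-style file and grab its event tree.
  TFile *f = TFile::Open("caf_style.root");
  TTree *tree = nullptr;
  f->GetObject("events", tree);

  // Hook up only the branches we care about; everything else stays compressed.
  Int_t   njet = 0;
  Float_t jet_pt[50];
  tree->SetBranchAddress("njet", &njet);
  tree->SetBranchAddress("jet_pt", jet_pt);

  // The tight loop: ROOT decompresses one branch basket at a time, which is
  // why hundreds of events per second is easy here where the TMB managed ~10.
  for (Long64_t i = 0; i < tree->GetEntries(); ++i) {
    tree->GetEntry(i);
    for (Int_t j = 0; j < njet; ++j) {
      if (jet_pt[j] > 20.0) { /* fill a histogram */ }
    }
  }
  f->Close();
}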

Well, it seems to be happening again. In fact, the same person who started the grass-roots effort for the CAF format is starting a new one, nicknamed the "high pT compact data set" (no catchy name yet). This is ultra compressed, basically 4-vectors and some coarse quality information, and the complete DZERO dataset should be around 15 GB.
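Back-of-the-envelope, that level of compression is believable. If each event is reduced to a handful of objects, each just a 4-vector plus a byte or so of quality flags, the per-object record is tiny. A sketch of the kind of record this implies (the field choices are my guess, not the actual layout):

#include <cstdint>
#include <cstdio>

// One reconstructed object, stripped to the bone: a 4-vector in
// (pt, eta, phi, E) plus a coarse quality word.
struct CompactObject {
  float   pt, eta, phi, e;
  uint8_t quality;
};

int main() {
  // About 17 bytes of payload (padded to 20 by the compiler); a few objects
  // per event means tens of bytes per event before ROOT's own compression,
  // which is how a full dataset can plausibly land in the 15 GB range.
  std::printf("sizeof(CompactObject) = %zu bytes\n", sizeof(CompactObject));
  return 0;
}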

Why do I write this? It is a lesson. If an experiment (ATLAS, CMS, etc.) does not provide an easy way to access its data, people will invent their own. This may not seem like a big deal, but the experimental resources required to create and maintain these formats can be significant. And if a set of people or a physics group decides to adopt one of these grass-roots formats as a common format, then storage and CPU must be allocated for it.

I wonder if a single data format can satisfy all the needs of an experiment. To first order the answer is no: there is no need for a final analyzer to have always-on access to the RAW data, for example, so you end up with RAW data, reconstructed data, and compressed data. That is the default every experiment starts with: three data formats, usually based on the same underlying storage technology. But what about other storage formats? When Andy did his thesis he needed to run over multi-jet data. There is a huge amount of that data, so he and the others he worked with invented a very small, very compact data format that kept a minimal amount of information. That way he could run over all the data very quickly.

ATLAS' AOD format, the currently proposed analysis format, falls into the bucket of not-easily-usable-or-inefficient data formats in its current implementation. People are already generating their own custom ROOT-based formats, and it is only a matter of time before those congeal and grow.

Comments»

1. agm - September 8, 2006

The data should be in a compartmentalized format, no? So that you only need to acquire what you need and work with that (such as the access to satellite observations provided by the database hosted at Goddard SFC, where you can pick by satellite, by instrument, and by type of data taken: maybe you want particle counts, maybe you want wave instrument observations, maybe you want electric potentials…).

