Trends in Triggering: Offline to online

Trends in Triggering: Offline to online June 5, 2015

Posted by gordonwatts in ATLAS, LHC, Trigger.
trackback

The recent LHCC open meeting is a great place to look to see the current state of the Large Hadron Collider’s physics program. While watching the talks I had one of those moments. You know – where suddenly you realize something that you’d seen here and there isn’t just something you’d seen here and there, but that it is a trend. It was the LHCb talk that drove it home for me.

There are many reasons this is desirable, which I’ll get to in a second. but the fact that everyone is starting to do it is because it is possible. Moore’s law is at the root of this, along with the fact that we take software more seriously than we used to.

First, some context. Software in the trigger lives in a rather harsh environment. Take the LHC. Every 25 ns a new collision occurs. The trigger must decide if that collision is interesting enough to keep, or not. Interesting, of course, means cool physics like a collision that might contain a Higgs or perhaps some new exotic particle. We can only afford to save about 1000 events per second. Afford, by the way, is the right word here: each collision we wish to save must be written to disk and tape, and must be processed multiple times, spending CPU cycles. It turns out the cost of CPU cycles is the driver here.

Even with modern processors 25 ns isn’t a lot of time. As a result we tend to divide our trigger into levels. Traditionally the first level is hardware – fast and simple – and can make a decision in the first 25 ns. A second level is often a combination of specialized hardware and standard PC’s. It can take a little longer to make the decision. And the third level is usually a farm of commodity PC’s (think GRID or cloud computing). Each level gets to take a longer amount of time and make more careful calculations to make its decision. Already Moore’s law has basically eliminated Level 2. At the Tevatron DZERO had a hardward/PC Level 2; ATL:AS had a PC-only Level 2 the 2011-2012 run of ATLAS, and now even that is gone in the run that just started.

Traditionally the software that ran in the 3rd level trigger (often called a High Level Trigger, or HLT for short) were carefully optimized and custom designed algorithms. Often only a select part of the collaboration wrote these, and there were lots of coding rules involved to make sure extra CPU cycles (time) weren’t wasted. CPU is of utmost importance here, and every additional physics feature must be balanced against the CPU cost. It will find charged particle tracks, but perhaps only ones that can be quickly found (e.g. obvious ones). The ones that take a little more work – they get skipped in the trigger because it will take too much time!

Offline, on the other hand, was a different story. Offline refers to reconstruction code – this is code that runs after the data is recorded to tape. It can take its time – it can carefully reconstruct the data, looking for charged particle tracks anywhere in the detector, applying the latest calibrations, etc. This code is written with physics performance in mind, and traditionally, CPU and memory performance have been secondary (if that). Generally the best algorithms run here – if a charged particle track can be found by an algorithm, this is where that algorithm will reside. Who cares if it takes 5 seconds?

Traditionally, these two code bases have been exactly that: two code bases. But this does cause some physics problems. For example, you can have a situation where your offline code will find an object that your trigger code does not, or vice versa. And thus when it comes time to understand how much physics you’ve actually written to tape – a crucial step in measuring a particle like the Higgs, or searching for something new – the additional complication can be… painful (I speak from experience!).

Over time we’ve gotten much better at writing software. We now track performance in a way we never have before: physics, CPU, and memory are all measured on releases built every night. With modern tools we’ve discovered that… holy cow!… applying well known software practices means we can have our physics performance and CPU and memory performance too! And in the few places that just isn’t possible, there are usually easy knobs we can turn to reduce the CPU requirements. And even if we have to make a small CPU sacrifice, Moore’s law helps out and takes up the slack.

In preparation for Run 2 at the LHC ATLAS went through a major software re-design. One big effort was to more as many of the offline algorithms into the trigger as possible. This was a big job – the internal data structures had to be unified, offline algorithms’ CPU performance was examined in a way it had never been before. In the end ATLAS will have less software to maintain, and it will have (I hope) more understandable reconstruction performance when it comes to doing physics.

LHCb is doing the same thing. I’ve seen discussions about new experiments running offline and writing only that out. Air shower arrays searching for large cosmic-ray showers often do quite a bit of final processing in real-time. All of this made me think these were not isolated occurrences. I don’t think anyone has labeled this a trend yet, but I’m ready to.

By the way, this does not mean offline code and algorithms will disappear. There will always be versions of the algorithm that will use huge amounts of CPU power to get the last 10% of performance. The offline code is not run for several days after the data is taken in order to make sure the latest and greatest calibration data has been distributed. This calibration data is much more fine grained (and recent) than what is available to the trigger. Though as Moore’s law and our ability to better engineer the software improves, perhaps even this will disappear over time.

Comments»

1. Marco Cattaneo - June 5, 2015: Indeed the reconstruction that LHCb now runs in the trigger farm is identical to that which we run offline. The intention during run 2 is that the calibration used in the trigger farm will be “good enough” for most analyses, with no need for further processing offline. We can achieve this by running the calibration in real time on specially selected events from the first few minutes of each LHC fill, and applying the update 9if needed) to the rest of the fill. Only analyses that need the ultimate precision will have to e.g. refit vertices offline. All this is of course a plan: the software and procedures are ready, but we will have to see how it all works out in real life with the 2015 data.

Next logical steps:
– If the online and offline reconstructions are identical, why run it again offline? Why not write out from the HLT farm the reconstruction results (tracks, vertices, PID objects) *instead* of the detector hits?
– If the online and offline reconstruction are identical, why not do the physics preselections online, and write out only the physics objects that will end up in the analysis ntuple? We call this concept the “Turbo Stream” (there were papers at CHEP 2015 on this) and we plan to demonstrate this with a number of important analyses of the 2015 data.

Reply
gordonwatts - June 6, 2015: Marco, thanks for the insider view! What I got I gleaned from the LHCb LPCC talk (which was quite well done). In ATLAS, our calorimeter calibration and our muon alignment (now perhaps our ID alignment due to some bowing that was discussed in the LPCC as a function of temp) mean that we need a few hours to calibrate everything… so we delay reconstruction. I believe (!?) CMS does the same thing. But it leads to the question: since we are getting so close, perhaps we should be doing high statistics physics analysis in the trigger now, rather than waiting – we can do things we couldn’t do before, right? 🙂

Reply

Life as a Physicist