
Understanding an old Level 3 Bug March 17, 2008

Posted by gordonwatts in computers, D0.

About one or two years ago I had to fix a bug in the D0 DAQ Supervisor. The Supervisor is responsible for coordinating the configuration of 400 or 500 farm nodes and about 80 front-end crates that generate the data. It is massively multi-threaded: at its busiest it has over 200 threads running. Most of them exist simply because I was too lazy to do anything but block while trying to send data to the Internet. Back in the day it ran on a slow dual-core machine under Linux, and I did my best to avoid every lock I could in my multi-threaded code, because locking is expensive and the Supervisor needed all the speed it could get back then (on a modern machine it is plenty fast).

My code was basically some initialization like the following:

global_a = 1.0;
global_b = 5.0;
global_inited = true;

Once global_inited was set to true, I assumed it was safe for my other threads to read global_a and global_b:

if (global_inited) {
  // use global_a, global_b …
}

Unfortunately this didn’t always work: sometimes the program behaved as if random values had been entered for global_a and global_b. I was never able to reproduce it on demand, either. It would happen only once in a while, and restarting the Supervisor usually fixed it. Eventually, to fix this bug, I restructured my code so that all the initialization happened before any other thread was started. After that I never saw the bug again. But I never understood why I was seeing the bug!
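For illustration, here is a minimal C++ sketch of that restructuring. It is not the Supervisor's actual code: the function name sum_after_early_init and the worker lambda are made up for the example, and it assumes C++11 std::thread. The point is only that creating a thread is itself a synchronization point, so anything written before the thread is started is visible inside it without any lock or flag.

```cpp
#include <thread>
#include <vector>

// Hypothetical stand-ins for the globals in the post.
double global_a = 0.0;
double global_b = 0.0;

// Initialize everything first, then start the workers.
// Returns the sum of what every worker saw, so the result can be checked.
double sum_after_early_init(int n_workers) {
    // All initialization happens while only one thread exists.
    global_a = 1.0;
    global_b = 5.0;

    std::vector<double> seen(n_workers, 0.0);
    std::vector<std::thread> workers;
    for (int i = 0; i < n_workers; ++i) {
        // Starting a std::thread synchronizes with the start of its
        // function, so each worker sees the values written above.
        workers.emplace_back([&seen, i] { seen[i] = global_a + global_b; });
    }
    for (auto& w : workers) {
        w.join();
    }

    double total = 0.0;
    for (double s : seen) {
        total += s;
    }
    return total;
}
```

Each worker writes to its own slot in seen, so the workers never touch the same memory and no further synchronization is needed until the join.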

A guy who works deep in the stack at Microsoft recently started a blog. One of his first posts explains, possibly, what bit me: the compiler and the CPU (both!!) are allowed to reorder the writes to global_a, global_b, and global_inited, so another thread can see global_inited set to true before a and b have their values!! Since this bug was not reproducible, the reordering was probably done by the CPU, though at the time I never tested that (or ever really figured out what caused it).
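If I were writing the flag version today, one way to rule out that reordering without taking a lock would be to make the flag an atomic with release/acquire ordering. The sketch below assumes C++11 (which did not exist when the Supervisor was written), and the function names init_globals, read_sum_when_ready, and run_demo are invented for the example. It is an illustration of the idea, not a claim about what actually went wrong in the Supervisor.

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-ins for the globals in the post.
double global_a = 0.0;
double global_b = 0.0;
std::atomic<bool> global_inited{false};

// Writer: plain stores to the data, then a release store to the flag.
// The release store keeps the two writes above it from being moved
// after it, by either the compiler or the CPU.
void init_globals() {
    global_a = 1.0;
    global_b = 5.0;
    global_inited.store(true, std::memory_order_release);
}

// Reader: an acquire load on the flag. Once it returns true, the
// writes made before the matching release store are visible here.
double read_sum_when_ready() {
    while (!global_inited.load(std::memory_order_acquire)) {
        std::this_thread::yield();  // spin politely until initialized
    }
    return global_a + global_b;
}

// Runs one reader and one writer concurrently and returns what the
// reader saw.
double run_demo() {
    double result = 0.0;
    std::thread reader([&result] { result = read_sum_when_ready(); });
    std::thread writer(init_globals);
    writer.join();
    reader.join();
    return result;
}
```

With a plain bool flag, as in my original code, nothing in the language stops the reader from seeing the flag before the data; the atomic flag is what creates the ordering guarantee.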

Comments»

1. ovysimon - March 17, 2008

Thx for the post!

