I stumbled across an article on reproducibility recently, “Science is in a reproducibility crisis: How do we resolve it?”, with the following quotes which really caught me off guard:
Over the past few years, there has been a growing awareness that many experimentally established "facts" don’t seem to hold up to repeated investigation.
They made a reference to a 2010 alarmist New Yorker article, The Truth Wears Off (there is a link to a PDF of this article on the website, but I don’t know if it is legal, so I won’t link directly here).
Read that quote carefully: many. That means a lot. It would be all over! Searching on the internet, I stumbled on a Nature report. They looked carefully at a database of medical journal publications and retraction rates. Here is a image of the retraction rates the found as a function of time:
First, watch it for the axes here – multiply the numbers on the left by 10 to the 5th (100000), and numbers on the right by 10 to the –2 (0.01). IN short, the peak rate is 0.01%. This is a tiny number. And, as the report points out, there are two ways to interpret the results:
This conclusion, of course, can have two interpretations, each with very different implications for the state of science. The first interpretation implies that increasing competition in science and the pressure to publish is pushing scientists to produce flawed manuscripts at a higher rate, which means that scientific integrity is indeed in decline. The second interpretation is more positive: it suggests that flawed manuscripts are identified more successfully, which means that the self-correction of science is improving.
The truth is probably a mixture of the two. But this rate is still very very small!
The reason I harp on this is because I’m currently involved in a project that contains reproducibility as one of its possible uses: preserving the data of the DZERO experiment, one of the two general purpose detectors on the now-defunct Tevatron accelerator. Through this I’ve come to appreciate exactly how difficult and potentially expensive this process might be. Especially in my field.
Lets take a very simple example. Say you use Excel to process data for a paper you are writing. The final number comes from this spreadsheet and is copied into the conclusions paragraph of your paper. So you can now upload your excel spreadsheet to the journal along with the draft of the paper. The journal archives it forever. If someone is puzzled by your result, they can go to the journal and download the spreadsheet and see exactly what you did (aka modern economics papers). Win!
Only wait. What if the numbers that you typed into your spreadsheet came from some calculations you ran. Ok. You need to include that. And the inputs to the calculations. And so on and so on. For a medical study you would presumably have to upload the anonymous medical records of each patient, and then everything from there to the conclusion about a drug’s safety or efficacy. Uploading raw data from my field is not feasible – it is petabytes in size. This is all ad-hoc – the tools we use do not track the data as they flow through them.
As an early prof I was involved in a study that was trying to replicate and extend a result from a prior experiment. We couldn’t. The group from the other experiment was forced to resurrect code on a dead operating system, and figure out what they did – reproduce it – so they could ask our questions. The process took almost a year. In the end we found one error in that original paper – but the biggest change was just that modern tools were better had a better model of physics and that was the main reason we could not replicate their results. It delayed the publication of our paper by a long time.
So, clearly, it is useful to have reproducibility. Errors are made. Bias gets involved even with the best of intentions. Sometimes fraud is involved. But these negatives have to be balanced against the cost of making all the analyses reproducible. Our tools just aren’t there yet and it will be both expensive and time consuming to upgrade them. Do we do that? Or measure a new number, rule out a new effect, test a new drug?
Given the rates above, I’d be inclined to select the latter. And have a process of evolution of the tools. No crisis.