Frequent Errors In Scientific Software May Undermine Many Published Results
from the it's-a-bug-not-a-feature dept
It's a commonplace that software permeates modern society. But it's less appreciated that it increasingly permeates many fields of science too. The move from traditional analog instruments to digital ones that run software brings with it a new kind of issue. Although analog instruments can be -- and usually are -- inaccurate to some degree, they don't have bugs in the way that digital ones do. Bugs are much more complex and variable in their effects, and can be much harder to spot. A study in the F1000Research journal by David A. W. Soergel, published as open access using open peer review, tries to estimate just how big an issue that might be. He points out that software bugs are really quite common, especially in hand-crafted scientific software:
It has been estimated that the industry average rate of programming errors is "about 15-50 errors per 1000 lines of delivered code". That estimate describes the work of professional software engineers -- not of the graduate students who write most scientific data analysis programs, usually without the benefit of training in software engineering and testing. The recent increase in attention to such training is a welcome and essential development. Nonetheless, even the most careful software engineering practices in industry rarely achieve an error rate better than 1 per 1000 lines. Since software programs commonly have many thousands of lines of code (Table 1), it follows that many defects remain in delivered code -- even after all testing and debugging is complete.
To take account of the fact that bugs in code may not affect the result meaningfully, and that a scientist might spot an erroneous result before it gets published, Soergel uses the following formula to estimate the scale of the problem:
Number of errors per program execution =
total lines of code (LOC)
* proportion executed
* probability of error per line
* probability that the error meaningfully affects the result
* probability that an erroneous result appears plausible to the scientist.

He then considers some different cases. For what he calls a "typical medium-scale bioinformatics analysis":

we expect that two errors changed the output of this program run, so the probability of a wrong output is effectively 100%. All bets are off regarding scientific conclusions drawn from such an analysis.
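To see how the five factors combine, here is a minimal sketch of the arithmetic. The numbers below are placeholders of my own choosing (they happen to land on the "about two errors" figure quoted above, but they are not taken from the paper), and the Poisson step used to turn an expected error count into a probability is my assumption, not Soergel's:

```python
# Back-of-the-envelope sketch of Soergel's formula. The inputs are
# illustrative placeholders, NOT the values from the paper, and the
# Poisson conversion at the end is my own added assumption.
import math

def expected_result_changing_errors(total_loc, prop_executed, p_error_per_line,
                                     p_meaningful, p_plausible):
    """Expected number of undetected, result-changing errors per run."""
    return (total_loc * prop_executed * p_error_per_line
            * p_meaningful * p_plausible)

expected = expected_result_changing_errors(
    total_loc=100_000,      # lines of code across the whole pipeline
    prop_executed=0.2,      # fraction of those lines actually executed
    p_error_per_line=0.01,  # roughly 10 defects per 1000 lines
    p_meaningful=0.1,       # defect meaningfully changes the result
    p_plausible=0.1,        # the wrong result still looks plausible
)
p_wrong_output = 1 - math.exp(-expected)  # assumes errors occur independently
print(f"expected result-changing errors per run: {expected:.1f}")        # 2.0
print(f"chance of at least one wrong output:     {p_wrong_output:.0%}")  # 86%
```

With these placeholder inputs the expected count is about two errors per run, which is why Soergel treats the probability of a wrong output in such a case as effectively 100%.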
Things are better for what he calls a "small focused analysis, rigorously executed": here the probability of a wrong output is 5%. Soergel freely admits:
The factors going into the above estimates are rank speculation, and the conclusion varies widely depending on the guessed values.
But he rightly goes on to point out:
Nonetheless it is sobering that some plausible values can produce high total error rates, and that even conservative values suggest that an appreciable proportion of results may be erroneous due to software defects -- above and beyond those that are erroneous for more widely appreciated reasons.
That's an important point, and is likely to become even more relevant as increasingly complex code starts to turn up in scientific apparatus, and researchers routinely write even more programs. At the very least, Soergel's results suggest that more research needs to be done to explore the issue of erroneous results caused by bugs in scientific software -- although it might be a good idea not to use computers for this particular work....
Follow me @glynmoody on Twitter or identi.ca, and +glynmoody on Google+
Reader Comments
The people who own these systems love money more than scientists love science, and scientists are known to be terrible about admitting they are wrong, so I can only imagine how buggy scientific software is.
[ link to this | view in thread ]
All bugs are not considered equal.
Bugs that meaningfully change a program's results are easy to find, easy to reproduce and easy to fix. They are also easy to test for, and are considered serious bugs. You generally don't ship code with serious bugs.
The kinds of bugs that generally exist for stable programs are things like the following REAL bug that currently exists in EVERY version of Microsoft Access:
Create an Access database in Access (any version). Open the database in a Chinese version of Microsoft Access, create a new form, add a macro or some VBA code, and save the database. If you now open the database in an English version of Access, it can open the form but not run it. You can rename the form from a Chinese name to an English name, which fixes the form, BUT you cannot change the code that Access generated for you, which is in Chinese and does not work on Windows without Chinese installed. You have to create a new form, copy everything over and delete the old one.
This is a real bug. It is really annoying, but data is never corrupted. These are the kinds of low-level bugs that generally are not fixed: it involves using multiple computers with different languages and only happens in a very specific situation. It can also be worked around easily, while fixing it would be difficult and likely to create new bugs.
Saying that there is a 10% chance that a bug will "meaningfully affects the result" is crazy, because with a little bit of effort you could figure out a more accurate number. Look at Linux, Mozilla or a smaller project, classify the bugs, and find out how many actually change or corrupt the data as opposed to crashing the program.
[ link to this | view in thread ]
Re: All bugs are not considered equal.
I'm not so sure. Some years ago I heard a presentation by an expert on software bugs who recounted the following story:
He took the mission-critical data analysis software of 3 or 4 different major oil/oil-exploration companies and configured each so that they would run exactly the same set of algorithms. He then fed the same input data into each. The agreement was only to one decimal place.
Also remember the design error in the 8087 that made the floating point evaluation stack unusable. I was still getting program crashes caused by this bug in the mid 90's (via more than one different compiler).
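That sort of disagreement doesn't even need a bug. Floating-point addition isn't associative, so two perfectly correct programs that merely accumulate the same numbers in a different order can drift apart. A minimal sketch of the mechanism (my own toy example, nothing to do with the oil-company software in the story above):

```python
# Floating-point addition is not associative: the grouping (and therefore
# the order of accumulation) changes the rounded result.
import math

print((0.1 + 0.2) + 0.3)                        # 0.6000000000000001
print(0.1 + (0.2 + 0.3))                        # 0.6
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False

# Over long sums the effect compounds: a naive running total and a
# compensated summation of the same data give slightly different answers.
data = [0.1] * 10
print(sum(data))        # 0.9999999999999999
print(math.fsum(data))  # 1.0
```

Scale that up across millions of operations, different compilers and different evaluation orders, and divergence in the low-order digits is exactly what you would expect.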
[ link to this | view in thread ]
Re:
The real heavy lifting in scientific code goes on in numerical libraries. Those are written not just by specialist programmers, but by programmers who are specialists in that specific topic. They are heavily scrutinized.
[ link to this | view in thread ]
Not just software
LOTS of fun in large AutoCAD shops...
[ link to this | view in thread ]
Detectability of software defects
Open-source software can at least be inspected by anyone who cares to look. Contrast that with closed-source software, where no such opportunity exists: it's simply not an option. Yet, as we've seen on those few occasions where such source code has been leaked, the defect rates are often much higher than for open-source code. There's absolutely no reason to believe for even a moment that the programmers at Microsoft and Oracle and Adobe and elsewhere are any better at what they do than the programmers working on Mozilla or Wireshark or Apache HTTPD.
So the author of the referenced paper is right, but arguably doesn't go far enough: we don't just have to worry about the code being written by random graduate students; we need to worry about the code being USED by random graduate students. If it's never been independently peer-reviewed, then it can't be trusted.
The irony of this is that the entire scientific research process is founded on the concept of peer review. Yet researchers will uncritically use a piece of software that's essentially an unreviewed black box and trust the results that it gives them.
[ link to this | view in thread ]
So climate change is a hoax!
[ link to this | view in thread ]
NAAA
You think that manual methods are less suspect!
With a computer program at least the same result (correct or erroneous) happens every time the line is executed. With a manual system you get to put in a whole new error every time! I'd say that's a whole lot worse!
[ link to this | view in thread ]
Re: Re:
Sure. But by whom?
Everyone who programs knows that it's really hard to find your own errors: if it was easy, they'd never escape your gaze. Asking your colleague(s) to check your code is better because they're not you...but they're NOT completely independent, even if they make a good-faith effort to be so.
So if review stops there and never goes any further, that is, it never extends to people who are completely independent, then "heavily scrutinized" comes down to "me and the people I work with".
And that's not very thorough.
[ link to this | view in thread ]
Re: Detectability of software defects
At least when they write it themselves it's different every time - so the inconsistencies flag the errors. When they all run the same code the failure will happen the same way every time.
[ link to this | view in thread ]
Re: NAAA
With a computer program at least the same result (correct or erroneous) happens every time the line is executed.
That's not necessarily true. Sometimes flaky code winds up giving different results every time; sometimes it gives a right result most of the time and a wrong one some of the time; sometimes it...well, you get the point.
Repeatable and obvious errors are the easy ones to find and fix. Semi-random weirdness can be difficult to even notice, let alone diagnose and fix.
[ link to this | view in thread ]
Re: Re: NAAA
You are right of course - that kind of bug does exist but my real point was that hand calculation simply creates a much larger number of places where errors can get in.
[ link to this | view in thread ]
That's not a study.
The defects referred to in David Soergel's reference ("Code Complete", written more than 10 years ago) include things such as misaligned output, insufficient error trapping, invalid data input filters, user interface problems, and many other errors that do not cause "wrong answers".
Furthermore, debugging is a major process in software development. There are many, many errors that appear in any non-trivial software project as it is being written and tested. Many of these prevent the software from running in the first place. With testing, these are largely eliminated, particularly those that can significantly affect results.
For example, in a scientific application with a limited number of users, it may be completely acceptable for the application to crash on invalid input. It may take more programming time than it's worth to add elegant error handling. This is a "software defect", yet it has zero effect on the results. Many of the defects referred to in "Code Complete" are of this nature.
In addition, the statistics quoted above (and all over the internet) are mere guesswork. The article even states, "The factors going into the above estimates are rank speculation, and the conclusion varies widely depending on the guessed values" -- and this in the "rigorous analysis"!
In my opinion, this does not merit appearance on Techdirt, and certainly lowers the average quality of this site. It's a typical scare-hype article, all too common today.
[ link to this | view in thread ]
Re: Detectability of software defects
I assume you're referring to Heartbleed? That could actually be written up as a textbook failure of the open-source development process: in the OpenSSL project, the many eyeballs simply weren't there. People who looked into it found that very few people besides the authors were actually doing anything to review the code before news of Heartbleed became public.
[ link to this | view in thread ]
Re: Re: Detectability of software defects
In other words, independent peer review is necessary, but not sufficient. "Sufficient" is TBD, but it probably looks like "serious in-depth audit", e.g., what Truecrypt went through recently. I think that's probably the best that we can do today. I'm hopeful that we'll evolve better methods of coding and review, and that we'll create better automated analysis tools -- and both of those are happening, so maybe, just maybe, we might be asymptotically approaching "quality code".
On a good day. ;)
[ link to this | view in thread ]
needs more meta
[ link to this | view in thread ]
It could easily be off because of a bug.
[ link to this | view in thread ]
Re:
To wander out into digression-land, I had a professor who had once worked at Los Alamos National Labs. He was put on a team that was working on climate models. Funny thing, when testing the models he couldn't help noticing that the climate behaved very much like a nuclear explosion.
[ link to this | view in thread ]
Re: Re: Re: All bugs are not considered equal.
Not at all! There are plenty of other ways to get different results without a single error. It depends on how each programmer interprets the algorithms. Maybe one approximates part of the problem as linear when it's not, because linear problems are easier to solve. Maybe another realizes linear isn't very accurate in this case, so he uses piecewise linear for better results. Maybe yet another uses a quadratic that's even better over part of the range, but much worse outside it.
You don't need bugs to have problems with accuracy. It's one of the things that my engineering classes at the Uni covered. Certain problems cannot be evaluated directly, so you make approximations and then justify those over the range of inputs you expect to receive. Specially designed computers and programs are often used in physics to solve certain problems DIRECTLY, and they can take over a year to solve one problem. That's not acceptable for many folk, so they MUST approximate the solution.
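As a minimal sketch of that point (the function and numbers are my own illustration, not anything from the comment above): two bug-free programs that simply chose different approximation orders for the same quantity will report different results, and neither contains an error in the usual sense.

```python
# Two defect-free implementations of "the same" calculation that disagree
# only because they chose different approximations of exp(x).
import math

def linear_model(x):
    # first-order Taylor expansion of exp(x) around 0
    return 1.0 + x

def quadratic_model(x):
    # second-order Taylor expansion of exp(x) around 0
    return 1.0 + x + x * x / 2.0

for x in (0.1, 0.5, 1.0):
    exact = math.exp(x)
    lin, quad = linear_model(x), quadratic_model(x)
    print(f"x={x}: exact={exact:.4f}  "
          f"linear={lin:.4f} (off by {abs(lin - exact) / exact:.1%})  "
          f"quadratic={quad:.4f} (off by {abs(quad - exact) / exact:.1%})")
```

Both are "correct" for small x; which one is good enough depends entirely on the range of inputs you expect, just as the comment says.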
[ link to this | view in thread ]
Re: Re: Re: Re: All bugs are not considered equal.
Very careful analysis of the information flow is also required beyond just looking for bugs.
[ link to this | view in thread ]
I can write a program to get just about any results I want.
[ link to this | view in thread ]
Re: Re:
The interesting point for me is how much of the modelling being used across scientific fields has the potential (if not the actuality) of not working as it is believed to work. I have seen too many software systems that have appeared to give reasonable results (and people have made major decisions on those results) and yet the models have been flawed.
Don't forget that belief is a strong motivator in accepting the results produced. If you believe the results are correct and they sort of match your expectation then you will believe the software is doing its job.
This is why extensive regression tests are important. This is why extensive review by expert non-involved parties is important. This is why extensive analysis of data and data flows using alternative means is important. This is why extensive analysis of algorithms used and testing of those algorithms is important.
Many of the programs used in the scientific community (particularly the models built up by scientific teams) are piecemeal developments. Accretion of new facilities is a fact of life. As such, these specialists (who are often quite over-optimistic about their programming abilities) haven't a clue that their piece of software is a piece of junk.
[ link to this | view in thread ]
Re: OpenSSL was an extreme outlier though, and hardly typical.
Which is why LibreSSL was forked off.
Certainly other popular open-source projects get a lot of scrutiny. Look at what Coverity does, for example.
[ link to this | view in thread ]
Re: That's not a study.
Misaligned output - I have seen examples where misaligned output is then used as input to another program, and errors are generated because a wrong value is read in at the next stage. This is a real problem for pipes (you know, that feature used on Unix-based machines).
Insufficient error trapping - if an error is not caught properly (or not caught at all), the program can continue as if nothing has happened. For example, if values to be used have a range and there is no out of range error detection, then an out of range value can change the results in significant ways that still appear reasonable but are wrong.
Invalid data input filters - how often do we see this problem arising? SQL injection anyone, incorrect string processing, strings silently converted to floating-point numbers, and so forth.
User interface problems - how many errors have been generated because the user interface doesn't work correctly: allowing wrong data to be entered, giving no feedback that a value is out of range, indicating a process has completed when it hasn't, or indicating that a process hasn't completed when it has, so that the action is initiated a second time, which leads to processing errors.
True, though, one of the most common kinds of error is an incorrect test in a conditional statement. I have seen too many programs where the wrong branch of a conditional is executed and the bulk of the code (which should have been executed) is not. Getting conditions right is actually quite difficult, particularly in highly complex situations, and testing can miss much of this. It is quite tedious to create a complete map of conditional tests, and as a consequence quite easy to miss particular branches. This is why there is a rise in having conditional matching (often called match in the relevant programming languages) analysed by the compiler, which then reports such errors. That is not available in old languages such as C and C++.
If the results are needed for publication and/or further analysis, then it behooves the people in question to ensure that their program doesn't vomit on inelegant errors; it can significantly affect the reliability of those results. How can we trust the results if the system cannot handle errors properly? With regard to software systems, the less reliable you consider one, the more likely you are to challenge its results and check the software for correctness. Much as you might be happy to just believe, I would prefer more work done on checking and testing. Too many software systems today are being used in situations that significantly affect people's lives, whether in banking and finance, medicine and medical research, traffic management, and so on. Software systems are complex and need to be scrutinised more closely than they currently are.
You are entitled to your opinion. However, I believe the opposite about this article. It is certainly not scare-hype, but a reflection of the reality of the systems that are used to create the scare-hype put out by various parties. If we actually did more critical analysis of our software systems, we would be likely to have less scare-hype arising, and we could be more confident of our software systems in the areas of reliability and security.
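As a minimal sketch of the "insufficient error trapping" point made above (the function names and data are hypothetical, invented for this illustration): without a range check, a value that should have been rejected quietly shifts the result, and the output is still just a number; with the check, the bad input is flagged instead.

```python
# Hypothetical example of missing out-of-range detection silently
# distorting a result, versus explicit validation that rejects it.

def mean_unchecked(readings):
    # No validation at all: whatever arrives gets averaged.
    return sum(readings) / len(readings)

def mean_checked(readings, low=0.0, high=50.0):
    # Reject values outside the expected physical range before averaging.
    for r in readings:
        if not low <= r <= high:
            raise ValueError(f"reading {r} outside expected range [{low}, {high}]")
    return sum(readings) / len(readings)

# One reading was mis-entered as 210.0 instead of 21.0 (a hypothetical typo).
readings = [19.8, 20.4, 210.0, 20.1, 19.9]

print(round(mean_unchecked(readings), 2))   # 58.04 -- wrong, but nothing flags it
try:
    print(mean_checked(readings))
except ValueError as err:
    print("rejected:", err)                 # the out-of-range value is caught
```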
[ link to this | view in thread ]
Re: but still had glaring problems that escaped everyone
As opposed to proprietary software, where known problems can go unfixed for years.
I had this peculiar conversation with someone once, who refused to use a well-known piece of actively-developed open-source software because it was “unsupported”, while in the next breath he complained about how his preferred proprietary alternative would keep crashing all the time.
I wondered what his idea of “support” was...
[ link to this | view in thread ]
Re: All bugs are not considered equal.
Create an Access Database in Access (Any version).
Well I think I found the problem.
(couldn't resist)
[ link to this | view in thread ]
Hand crafted
Is there some other kind of software? Software produced on an assembly line maybe? Or are you drawing a distinction between custom made software and "shrink-wrapped" (a less relevant term now that almost everything is distributed electronically but I'm not sure if there's a new term to replace it)?
[ link to this | view in thread ]
Re: Hand crafted
But seriously, haven't you written or even used systems that take specifications or configurations and generate code from them? You don't hand-craft the code; you let the computer generate it for you. There are a myriad of software systems that require only a specification and will automate the software generation for you.
Few people write in binary these days as it is so much harder to get it right. Even LISP and its ilk use macros to generate code.
But I do see your point. The amount of hand crafting can be quite minimal in some environments, though. That is one reason for the growing use of libraries of software, so the various parts don't have to be hand-crafted.
[ link to this | view in thread ]
Re: Re: Hand crafted
Yeah that's true.
Few people write in binary these days as it is so much harder to get it right.
You don't have to write in binary or even assembler for it to be hand crafted.
That is one reason for the growing use of libraries of software, so the various parts don't have to be hand-crafted.
Not exactly. The libraries are mostly hand-crafted. We use them so that functionality doesn't have to be created over and over again.
[ link to this | view in thread ]
Re: Re: Re: Hand crafted
The comparison, I suppose, can be likened to hand-crafting a Damascus steel blade. Is it hand-crafted if you use a power hammer to beat the steel, or do you need to actually use a blacksmith's anvil and hammer to shape the material? Different people will have different views on that, and we cannot really say either is right or wrong. Both have some merit.
[ link to this | view in thread ]
Re: Re: Re: Re: Hand crafted
I don't consider "hand-crafted" to mean *I* made it, it just means *someone* made it. The library was hand-crafted, just not by me.
[ link to this | view in thread ]
No-code-generated bug
We were profiling the code to see which instructions were using the most time, so we could further optimize the routine. (It was already carefully handcrafted in assembler, aligned on cache lines, etc. to be as fast as possible.)
In the bar graph of time used per instruction, there was one instruction which never got executed. The strange part - it was in a section of code with no branches. An impossibility - linear execution of a series of instructions where one of them is never found executing when the ones before and after it are found executing millions of times.
The bug turned out to be a comment in the previous line of code. The comment extended one column too far, into the "continuation" column. This made the comment continue onto the next line, which contained the instruction that never got executed. Of course, even though the instruction was in the listing, it was not in the generated binary shown to the left of each line in the assembler output. In other words it was in the source but never assembled. Nobody looked over there, we were reviewing the code, not the binary instruction stream generated from the code.
That one missing instruction caused all the lost-process problems.
[ link to this | view in thread ]