Using Spreadsheets In Bioinformatics Can Corrupt Data, Changing Gene Names Into Dates
from the careful,-now dept
A few years back, people were rather disturbed to find out about the famous Excel bug, whereby the multiplication of two numbers in Microsoft's spreadsheet gave the wrong number. It turns out there are other circumstances in which Excel (and, to be fair, presumably other spreadsheets) can give incorrect results, but they are unlikely to be encountered in typical everyday tasks. However, in the specialized world of bioinformatics, which uses computers to analyze data about genes and related areas, careless use of spreadsheets can throw up a significant numbers of errors, as this paper in BMC Bioinformatics explains:
Use of one of the research community's most valuable and extensively applied tools for manipulation of genomic data can introduce erroneous names. A default date conversion feature in Excel (Microsoft Corp., Redmond, WA) was altering gene names that it considered to look like dates. For example, the tumor suppressor DEC1 [Deleted in Esophageal Cancer 1] was being converted to '1-DEC.'
Here we have the interesting interaction of two very different fields, where the name of a gene involved in esophageal cancer, DEC1, was interpreted by Excel to mean the date, 1 December. As the paper points out, these kinds of substitution errors are already to be found in key public databases:
DEC1, a possible target for cancer therapy, was incorrectly rendered, and it could potentially be missed in downstream data analysis. The same type of error can infect, and propagate through, the major public data resources. For example, this type of error occurs several times in even the immaculately curated LocusLink database.
As that notes, a gene that might be relevant for treating cancer could well be missed because of this incorrect conversion to a date by Excel. Although it is unlikely that any serious harm has been caused by this -- yet -- it's a useful reminder of the dangers of depending a little too heavily on the results of software without checking for corruption of this kind.
Follow me @glynmoody on Twitter or identi.ca, and +glynmoody on Google+
Thank you for reading this Techdirt post. With so many things competing for everyone’s attention these days, we really appreciate you giving us your time. We work hard every day to put quality content out there for our community.
Techdirt is one of the few remaining truly independent media outlets. We do not have a giant corporation behind us, and we rely heavily on our community to support us, in an age when advertisers are increasingly uninterested in sponsoring small, independent sites — especially a site like ours that is unwilling to pull punches in its reporting and analysis.
While other websites have resorted to paywalls, registration requirements, and increasingly annoying/intrusive advertising, we have always kept Techdirt open and available to anyone. But in order to continue doing so, we need your support. We offer a variety of ways for our readers to support us, from direct donations to special subscriptions and cool merchandise — and every little bit helps. Thank you.
–The Techdirt Team
Filed Under: bioinformatics, data, errors, genes, spreadsheets
Reader Comments
Subscribe: RSS
View by: Time | Thread
[ link to this | view in chronology ]
[ link to this | view in chronology ]
Re:
[ link to this | view in chronology ]
am I doing this right?
[ link to this | view in chronology ]
Re:
[ link to this | view in chronology ]
[ link to this | view in chronology ]
Re:
[ link to this | view in chronology ]
Re: Re:
[ link to this | view in chronology ]
Re: Re: Re:
[ link to this | view in chronology ]
Re:
[ link to this | view in chronology ]
Re: Re:
[ link to this | view in chronology ]
Re: Re:
The wizard doesn't run if you click on a file and select to open it in Excel, or if you set your exporting application to directly open Excel, or if you paste the data into a sheet. It will then autocorrect the data into a date field that, at it's heart, is a formatted number. If you convert that back to text, you'll see the number.
[ link to this | view in chronology ]
Re: Re: Re:
CSV import "just works!"
[ link to this | view in chronology ]
Re: Re:
.
[ link to this | view in chronology ]
Re:
[ link to this | view in chronology ]
This problem occurs in many different data analysis situations using Microsoft spreadsheets
The data had to be pre-processed to insert "" around the relevant fields before loading the data into the spreadsheet. This did stop the spreadsheet conversion routines from touching that specific data. Unfortunately, the data was supplied in varying formats so that a uniform method couldn't be developed at the time in the timeframe required.
David Oliver Graeme Samuel Offenbach
[ link to this | view in chronology ]
Spreadsheets Are Undebuggable
Even important economic research is being called into question because of spreadsheet errors.
[ link to this | view in chronology ]
Re: Spreadsheets Are Undebuggable
[ link to this | view in chronology ]
Re: Spreadsheets Are Undebuggable
[ link to this | view in chronology ]
So instead of spreadsheets...
[ link to this | view in chronology ]
Re: So instead of spreadsheets...
[ link to this | view in chronology ]
Re: Re: So instead of spreadsheets...
[ link to this | view in chronology ]
Re: So instead of spreadsheets...
It's the same as climate scientists stating "If you aren't a climate scientist, you have no authority to comment on our work." But the problem is, most of climate science (computer modelling, statistical analysis) isn't "Climate" Science, it's a mix of Computer Science (programming for computer modelling), Fluid Dynamics (also for computer modelling, you've got to accurately model fluid dynamics to have an accurate atmospheric model), Statistical analysis (all the climate chronologies that are done are based on Statistics) and so on.
When a non-Computer Scientist programs an advanced model, without input from a physicist/Engineer (for fluid dynamics), and picks and uses statistical methods without an in-depth knowledge of statistics to know which method to use when, you are going to have problems.
However, all these users (Climate Scientists, Biologist, Engineer and what have you) go: "Wonderful, I don't need all these specialists (Computer Scientist, programmer, Engineer, Climate Scientist, Biologist), I can just use an off the shelf spreadsheet, whip up a few formulas in it, and presto!".
A Computer Scientist might go: "I'll just write my own program to do this climate modelling thing, I've got a fluid dynamics text book, I'll just read that and implement it in code." The code may be beautiful, concise, with the formula's implemented flawlessly...until a fluid dynamics specialist looks at it and goes "why did you use that formula/paradigm there? That's a rather exotic case, superheated plasmas, it's not very good when you are modelling a layer of air that's at -45C..."
Of course, if you don't have an in-depth knowledge of spreadsheets (hell, a civil servant who lives, breathes, finance and the spreadsheets that are daily created probably knows how to use a spreadsheet program better than a 'scientist', whether that's an Engineer, Computer Scientist, Economist, Climate Scientist what have you), or Fluid Dynamics, or programming or Computer Science, then you'll probably end up with a balls-up.
Unless you have expertise in the other field, you should engage an expert in that field to at least advise on what you are doing. If a Biologist needs to manipulate a lot of genes with a computer...consult a computer scientist on the best way to do what you want done, and have them implement it for you if necessary...and if the implementation requires expertise from another field, say fluid dynamics, then the computer scientist should probably consult a fluid dynamacist, or statistician or whatever else is necessary.
[ link to this | view in chronology ]
Re: Re: So instead of spreadsheets...
[ link to this | view in chronology ]
Re: So instead of spreadsheets...
* NumPy and SciPy
* matplotlib
* The R statistical language
* SAGE
* GNU Octave
... just to name a few.
[ link to this | view in chronology ]
Re: So instead of spreadsheets...
I do not understand this statement. Are you saying that the software in question is efficient at producing erroneous output? How can this be considered efficient when the user then has to expend countless hours finding and correcting the introduced errors. I find this to be grossly inefficient. You would be better off writing your own scripts.
[ link to this | view in chronology ]
Re: Re: So instead of spreadsheets...
Because that will work right the first time?
[ link to this | view in chronology ]
Re: Re: Re: So instead of spreadsheets...
It was an example of how inefficient a poorly designed application is, when doing it yourself is potentially better.
[ link to this | view in chronology ]
Re: Re: Re: Re: So instead of spreadsheets...
So you're saying making mistakes using Excel (because that's what happened here) is inefficient, and it would be a better use of time for that same person (who is a scientist, not a software developer, and can't use Excel properly) to write their own software from scratch?
[ link to this | view in chronology ]
Re: would be a better use of time for that same person (who is a scientist, not a software developer
[ link to this | view in chronology ]
Re: So instead of spreadsheets...
[ link to this | view in chronology ]
It's the same thinking that lead to them adding a virus distribution mechanism to Windows and calling it AutoRun/AutoPlay.
[ link to this | view in chronology ]
Re:
Excel is used by many organizations for many purposes. You need to at least be aware of how the software functions. Otherwise why are you using it?
[ link to this | view in chronology ]
Re: Re:
[ link to this | view in chronology ]
Re: Re:
That should be an option.
I'm sure that there are probably quite a few people who wouldn't mind having all their figures rounded up or down to make things simpler. Should the software just go ahead and do that for everyone?
[ link to this | view in chronology ]
Re: Re: Re:
[ link to this | view in chronology ]
Re: Re: Re:
You can turn this option off by selecting your worksheet and formatting the cells as text. Anything you add from that point forward will be treated as text.
It's not rocket science, but sometimes it helps to RTFM.
[ link to this | view in chronology ]
Re: Re: Re: Re:
But Micro$oft claims their software is intuitive to use, therefore users should not need to RTFM.
/big myth
[ link to this | view in chronology ]
Re:
This. A thousand times over.
[ link to this | view in chronology ]
Re:
Yay for Microsoft bashing. However, this is the DEFAULT behavior of excel when pasting in data from the clipboard that is in certain common formats. Someone actually familiar with the software would have know this was a possibility and could have taken lots of different steps to avoid it - the simplest being to set the format of the column to text before pasting (all of three clicks), or properly using the import features of the software to identify data types correctly.
This particular "issue" is not a software problem.
If I had to select the individual data type and formatting of everything I ever pasted into Excel, I would never get anything done.
[ link to this | view in chronology ]
Re: Re:
Well deserved bashing in this case.
Rather than make excuses and describe workarounds for a poorly designed software application, one might simply use a more sophisticated suite from a different source that is more suited to their purpose. I realize that in many corporate environments obtaining such special treatment is an uphill battle, so good luck with that.
[ link to this | view in chronology ]
Re: Re: Re:
It's not a workaround, it's using the software correctly. Maybe the default behavior isn't right for these users, but that doesn't mean the software is flawed (in this particular way). It's not "Excel for Bioinformatics" after all. If a user doesn't know how to import data correctly, doesn't learn how before doing it, and ends up with incorrect data, why is that Microsoft's fault?
[ link to this | view in chronology ]
Re: Re: Re: Re: Incorrect data
A dialog on first import saying "Excel has normalized some date-like strings." could have alerted the user that the import mangled the results. The burden would then be on the user to go back and reimport the data correctly. A fancier version might offer links to documentation that explains what kind of normalization Excel enables and options to re-run the import in place with those normalizations disabled. A very fancy version could even import the data in non-normalized form, but provide an option to have Excel interact with it (render, search, transform, etc.) in normalized or as-imported form.
[ link to this | view in chronology ]
Re: Re: Re: Re: Re: Incorrect data
Sadly, users rarely read dialogs. Reading documentation is probably even less likely. Given that they're not going to get much help from the user, software designers have to choose what is the best default behavior for the greatest number of users. Perhaps you disagree with the default behavior, but either way it's going to be wrong for someone.
[ link to this | view in chronology ]
Re: Re: Re: Re: Re: Incorrect data
I'm not sure if you have noticed that software has really stopped popping up dialogs like this, but while they used to be standard practice, sometime in the late 80's software companies realized that their customers HATE THOSE THINGS MORE THAN GETTING SOMETHING WRONG NOW AND THEN.
[ link to this | view in chronology ]
Re: Re: Re: Re:
Exactly. People need to remember that Excel isn't marketed for that particular use. The vast VAST majority of users of this software benefit from these sorts of changes. Remember, most will be using it for things like financial data and stuff like that. For those people importing a date that displays in a different format can create issues. In the US we would write Dec 1 but in Europe 1 Dec is more common. I can pretty much guarantee that displaying gene abbreviations correctly is a niche use that NO designer ever had cross their mind. I'd also be surprised if the other packages on the market didn't do the same thing.
[ link to this | view in chronology ]
Re: Re:
I couldn't disagree more. This issue is most definitely a software problem. At the very least, conversion warnings should have been issued.
[ link to this | view in chronology ]
Re: Re: Re:
That would merely be a CYA issue, since generally users would ignore them. It wouldn't have any substantive effect.
[ link to this | view in chronology ]
Re: Re: Re:
They are on by default because that is what most of the users want most of the time. They have gotten years of feedback, complaints, and requests to come up with the defaults - they didn't arbitrarily decide that automatic date detection should be on and automatic fraction detection should be off.
However, the included documentation and the ability to control all of these things - just in case you don't want them on for a specific purpose.
If they had gone your way, you would type in a list of numbers, highlight them, and wonder why it didn't show you the total.
[ link to this | view in chronology ]
Re:
A million times this. This is a consistent problem with Microsoft software.
[ link to this | view in chronology ]
Re: Re:
As mentioned above, the target audience for Excel was never bio-informatics... and yet they are still catered for because steps can be taken to import data correctly, instead of blindly pasting stuff around, not checking the results, then blaming the software for getting it wrong.
There are lots of suggestions that Excel should have warned that it had converted data to a date... but nobody has considered how much they would be pissed off when Excel issued a warning every time they entered a date, or a number - that's just training your users to ignore warnings.
There are lessons to be learned from this story, but blaming it all on Microsoft is choosing not to learn.
[ link to this | view in chronology ]
Re: Re: Re:
I'm aware of that, but catering to what most of its customers want means that they produce software which is dumbed down and allows for stupid errors to be easily overlooked. This effect is one of the reasons I avoid Microsoft software as much as I can.
"blaming it all on Microsoft is choosing not to learn."
Don't misunderstand me -- I'm not blaming it all on Microsoft at all. This type of problem is hard to resolve in a way that pleases everyone. I'm just saying that Microsoft's approach is one that consistently fails to meet my needs.
[ link to this | view in chronology ]
Re: Re: Re:
[ link to this | view in chronology ]
[ link to this | view in chronology ]
Re:
Why do you even think you know what the requirements for the job are? Presumably being able to share the document around is one of the requirements, and databases aren't usually very good at that.
Also, from the article: For example, this type of error occurs several times in even the immaculately curated LocusLink database.... use of spreadsheets doesn't preclude the use of databases.
[ link to this | view in chronology ]
Re: Re:
And no, Excel is not a database application. It's a spreadsheet application that can do some database type work.
Excel could be used for this sort of data work, but Woadan is entirely correct that a database would be the better solution.
[ link to this | view in chronology ]
Re: Re: Re:
a collection of pieces of information that is organized and used on a computer
Excel most certainly falls into the category of database applications - although I would argue that Excel is a database, a user interface, a calculation engine, and programming platform, a reporting tool, and an analysis toolset.
But anyone writing a database has to define the field data types, whereas Excel by default makes assumptions for you.
Absolutely not true. I have worked with databases in the past that were entirely made up of un-typed data, lots of them I have worked with have effectively been defined as nothing but character data or blobs, and Excel makes default assumptions, but they are only defaults and you can override them very easily and it stores everything as typed data.
Woadan is entirely correct that a database would be the better solution
This is the kind of thing that gets me furious. Without any specifications, you cannot possibly make this determination. For all we know, the data we are talking about is a single table with 10 columns and 300 rows. Excel is a great solution for that. Frankly, a "database" would be pretty useless for this "problem" as it was not the data storage engine that set the type and reformatted the values - that was the user interface.
The best solution here would be a user that knows how to use whatever software they were using to import data - and setting the type properly when they did it.
[ link to this | view in chronology ]
Re: Re: Re: Re:
[ link to this | view in chronology ]
Re: Re: Re: Re:
Excel is a spreadsheet app. Yes, like some database apps, it holds data in a tabular format. It also typically displays data in a tabular format, which most databases don't do (unless you run queries - an area where Excel is also weak (but, then, it's a spreadsheet app, not a db app). It can't cope with significantly large amounts of data. In day to day use it relies on the same formulas being duplicated again and again with different inputs. It doesn't create indexes effectively.
A pen and paper can be a database too. So can Word. So can Photoshop. That doesn't mean they're database apps.
And, if you come across databases where a field that requires a specific datatype has not been defined as that data type, then the developer didn't know what they were doing or the spec they were working from was flawed.
And we're talking genome data here. It's not unreasonable to assume that the dataset was very large. Otherwise, this story wouldn't be a story.
[ link to this | view in chronology ]
Once upon a time
I bought a copy (one of my very few software purchases). Twenty-one years later I still use it. It is still hugely better than excel and its clones. It was (is) called Improv.
The fact that it did not take over is a tragedy of market failure. The fact that I have increasing difficulty keeping it going (confined now to a virtual xp box) is a tragedy of copyright.
[ link to this | view in chronology ]
Re: Once upon a time
[ link to this | view in chronology ]
Re: Once upon a time
[ link to this | view in chronology ]
Re: Once upon a time
http://www.vetusware.com/download/Lotus%20Improv%202.1/?id=5797
[ link to this | view in chronology ]
Re: Once upon a time
The thing is that Excel is designed for simple stuff. It can do more complex things but it isn't really designed for it. Excel is great for small forms with a couple of automatically formatted outputs, or a home budget, or keeping track of documents in a small office. It is not designed for advanced statistical analysis of gene structures.
Office is designed with several specialized tools for specific purposes rather than one swiss army knife that tries to do everything. You can make a presentation in Word, a simple spreadsheet in Access, and write a report in Powerpoint. It works. It just doesn't do it well.
There was an easy way to avoid this, which a simple Google search would have revealed...click the upper left box to select the entire sheet, and change the drop down box from "General" to "Text." Problem solved.
I personally use Excel extensively in my own job and am very familiar with both its strengths and its limitations. Is it perfect? Heck, no...I'd love for the ability to write formulas in multiple lines, sort of like how VBA is formatted, to help keep track of coding in a complex sheet when macros aren't an option (usually due to business rules). But overall the quick formatting of data is useful for prototyping and reformatting data, creating useful rosters and graphs, and other simple tasks when your boss asks for a quick analysis of something.
This is like complaining that using the flat side of a power drill to hammer a nail isn't very easy. Why not use a hammer? Well, I want to use my power drill, screw hammers.
Sorry. It's not the drill's fault you don't know how to use a hammer.
[ link to this | view in chronology ]
Journal publisher states the obvious. TD runs it.
Get your copy now!
[ link to this | view in chronology ]
Re:
[ link to this | view in chronology ]
BMC Bioinformatics 2004, 5:80 doi:10.1186/1471-2105-5-80
The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/5/80
Received: 5 March 2004
Accepted: 23 June 2004
Published: 23 June 2004
[ link to this | view in chronology ]
Re:
http://www.theregister.co.uk/2004/07/16/excel_vanishing_dna/
[ link to this | view in chronology ]
[ link to this | view in chronology ]
Re:
https://www.libreoffice.org/
But as a number of posters above have said, nothing beats inputting the correct information and, just as important, make sure cell-formatting is done correctly.
I suppose any spreadsheet / database program will fall flat if these two criteria aren't met.
I remember the good old days of DBASE III. Currently, they can be found at
http://www.dbase.com/
and they are sitting at version 9
Just my - stating the obvious - 2c worth. :-)
[ link to this | view in chronology ]
Idiocy
[ link to this | view in chronology ]
this is exactly the situation, and it's not just microsoft but pretty much all software are certain they know infinitely more about what the user should be doing than the user. for instance, doubling spacing after a sentence-ending period. most websites refuse that even though at times it really aids readability. momma knows what's right for you.
another real pain to me is headers when sorting. about half the time excel declares the top cell to be a header and doesn't include it in the sort. if you aren't vigilant about checking, you may get an error there.
super annoying to have to do handstands to use the stupid software. if that auto-dating feature (which annoys me, too) causes a serious error, i hope the offended party sues the shit out of microsoft.
[ link to this | view in chronology ]
[ link to this | view in chronology ]
Re:
[ link to this | view in chronology ]
We knew better than this in 2nd year stats classes in college. In 3rd year we were using alternative stat programs with more built in functionality. This isn't a failure of Microsoft. This is people not being careful, not doing work how they were supposed to, being cheap, and trying to find a scapegoat for their sloppy work.
[ link to this | view in chronology ]
Okay, which idiot didn't turn this 'default' feature off?
[ link to this | view in chronology ]
bad design really is bad design.
[ link to this | view in chronology ]
some years ago i was employed using the design software catia v4, which was very amendable to make the environment the most useful to a designer that it could be.
one day a fairly new user complained to me that he wished there weren't so many things that could be adjusted. i told him i agreed exactly and that all machines should be set up the way i like mine.
[ link to this | view in chronology ]
Re:
Damned if you do, damned if you don't. In my experience it has always been that each user has their handful of things that should be configurable and "everything else is just clutter that should be removed". And guess what, it was a different handful of things for each user.
[ link to this | view in chronology ]
yes, but it's even worse than that. if excel can't discern a year, it will insert this year into that date. so the spreadsheet you make that error on this year will equate that cell to 41974, but if you go back to a 2013 table, it won't be that number for dec1.
[ link to this | view in chronology ]
i'm pleased, of course, when someone adds heft to my argument, but i always hope that someone realizes it.
[ link to this | view in chronology ]
no, it doesn't. i just tried it.
now, it does if i set the format of a cell or group of cells to be TEXT and enter the value into the one-line input field up top or copy/paste from a text document one item of information into that field.
problem is if i copy/paste a large amount of info, say 30 football players' info, from a document, i can't put it all in that one-line entry field. i have to pick a location in the sheet and dump all that info with respect to the place i picked to put it. when i do that it changes the format of those TEXT cells to CUSTOM and changes those 5-10 football players' heights to 10-May. if i force that format back to either GENERAL or TEXT, i get 41769.
it doesn't work unless i handpick some cells to format and transfer the information one item at a time and be careful to select the correct cell and dump the info into that upper entry field. ONE HEIGHT AT A TIME.
let me spell it: S U C K S.
i tried mass dumping three players' heights and weights into that one-line upper entry field, and it took it, but put all those heights and weights of three players into that one cell, not an array of cells like i need.
like i say, S U C K S.
[ link to this | view in chronology ]
Re:
now, it does if i set the format of a cell or group of cells to be TEXT and enter the value into the one-line input field up top or copy/paste from a text document one item of information into that field.
I don't have Excel to test it, but are you only testing with copy/paste, or are you importing from a file? Because the latter has all kinds of options that the former doesn't.
[ link to this | view in chronology ]
if i have info in a spreadsheet carefully formatted so that anything that excel would misread as a date is handled as TEXT, then i can select a bunch of cells, copy/paste that info anywhere in another spreadsheet, and it will reformat the cells to be TEXT in the new spreadsheet as needed. i don't have to pre-format those cells. so that's not too bad.
it evens transfers the justification changes i had made in the original spreadsheet. in other words, it brings the formatting with the data.
transfers from any other source than another excel spreadsheet probably have to be very carefully handled, but if i painstakingly set up and excel spreadsheet, i can transfer without losing formatting.
[ link to this | view in chronology ]
Would you like help?
* Convert all text into date format.
* Leave all my gene names alone, you fool!
* Yes, I desperately in need of a date! Do you have any attractive single friends who have low enough standards to go on a date with me? (Forever alone!)
[ ] Don't show me this tip again
[ link to this | view in chronology ]