Scientists Forced To Change Names Of Human Genes Because Of Microsoft's Failure To Patch Excel
from the code-is-law dept
Six years ago, Techdirt wrote about a curious issue with Microsoft's Excel. A default date conversion feature was altering the names of genes, because they looked like dates. For example, the tumor suppressor gene DEC1 (Deleted in Esophageal Cancer 1) was being converted to "1-DEC". Hardly a widespread problem, you might think. Not so: research in 2016 found that nearly 20% of 3500 papers taken from leading genomic journals contained gene lists that had been corrupted by Excel's re-interpretation of names as dates. Although there don't seem to be any instances where this led to serious errors, there is a natural concern that it could distort research results. The good news is this problem has now been fixed. The rather surprising news is that it wasn't Microsoft that fixed it, even though Excel was at fault. As an article in The Verge reports:
Help has arrived, though, in the form of the scientific body in charge of standardizing the names of genes, the HUGO Gene Nomenclature Committee, or HGNC. This week, the HGNC published new guidelines for gene naming, including for "symbols that affect data handling and retrieval." From now on, they say, human genes and the proteins they expressed will be named with one eye on Excel's auto-formatting. That means the symbol MARCH1 has now become MARCHF1, while SEPT1 has become SEPTIN1, and so on. A record of old symbols and names will be stored by HGNC to avoid confusion in the future.
So far, 27 genes have been re-named in this way. Modifying gene names in itself is not unheard of. The Verge article notes that, in the past, names that made sense to experts, but which might alarm or offend lay people, are also changed from time to time:
"We always have to imagine a clinician having to explain to a parent that their child has a mutation in a particular gene,” says [Elspeth Bruford, the coordinator of HGNC]. "For example, HECA [a cancer-related human gene] used to have the gene name 'headcase homolog (Drosophila),' named after the equivalent gene in fruit fly, but we changed it to 'hdc homolog, cell cycle regulator' to avoid potential offense."
It is nice to know that we won't need to worry about serious problems flowing from Excel's habit of automatically re-naming cell entries. But it's rather troubling that Microsoft doesn't seem to have thought the problem worthy of its attention or a fix, despite it being known for at least six years. It shows once again how people are being forced to adapt to the software they use, rather than the other way around. Or, as Lawrence Lessig famously wrote: "code is law"·
Follow me @glynmoody on Twitter, Diaspora, or Mastodon.
Thank you for reading this Techdirt post. With so many things competing for everyone’s attention these days, we really appreciate you giving us your time. We work hard every day to put quality content out there for our community.
Techdirt is one of the few remaining truly independent media outlets. We do not have a giant corporation behind us, and we rely heavily on our community to support us, in an age when advertisers are increasingly uninterested in sponsoring small, independent sites — especially a site like ours that is unwilling to pull punches in its reporting and analysis.
While other websites have resorted to paywalls, registration requirements, and increasingly annoying/intrusive advertising, we have always kept Techdirt open and available to anyone. But in order to continue doing so, we need your support. We offer a variety of ways for our readers to support us, from direct donations to special subscriptions and cool merchandise — and every little bit helps. Thank you.
–The Techdirt Team
Filed Under: autoconversion, dates, excel, gene names, genes, spreadsheets
Companies: microsoft
Reader Comments
Subscribe: RSS
View by: Time | Thread
Format
Of course they could just highlight the column or row in question, and pick format cell, and choose plain text.
[ link to this | view in chronology ]
Re: Format
From what I understand, that may not work in all cases.
Depending on how the document is saved, that data may not be preserved. So if somebody else open up that document on another computer, it would change the names to dates before anybody could adjust the column... and changing the column type after the fact won't change the data back.
[ link to this | view in chronology ]
Re: Re: Format
This as well. If it converts to date it's actually saving as a number.
[ link to this | view in chronology ]
Re: Re: Format
You are correct. A very common means of sharing data is the CSV (comma seperated values) file format. The file is structured so that commas seperate columns, and line breaks seperate rows.
The CSV is a very file-size efficient way of sharing data sets, but it stores literally no formatting information.
Excel is usually registered as the default application to open CSVs, and because there is no meta information about the data in the cells, Excel tries to be clever with its parsing of the information.
This leads to all sorts of fun issues - strings being interpreted as dates like in the article above, phone numbers being interpreted as numbers (and consequently losing their leading 0 if they're not in international format), fractional prices losing their precision when being rounded to 2 places, etc.
[ link to this | view in chronology ]
Re: Format
When you have large data sets this may be quite annoying. It should be the other way around, if you want Excel to treat stuff as dates then you tell it and it converts things otherwise it should treat cells with numbers as plain numbers and cells with text as text.
[ link to this | view in chronology ]
Re: Format
Except that Excel will drop that choice whenever possible.
I am an accountant. I use .csv files to upload large transactions into my accounting software, and requires very specific formatting to acomplish.
Notably I need to not use the 'date' format in excel which stores the date as a number instead of the actual mm/dd/yy formatting, and I needed leading zeros without any puncutation in one column.
If I used plain text and saved my CSV file, excel would helpfully save space by removing very critical leading zeros. If I loaded the file, it would strip the dates out of the file and force them into date format. It fought me. I have since switched to Libre office which gives me the option to set formatting before i load a file, and will accept leading zeros being saved to CSV format.
Its an issue with any data set saved as a CSV or other file type that doesn't carry format. Its most critical with large data sets that benefit in size from a lack of formatting. Excel auto format on load kills many of its best uses for me. Just formatting the sheet once doesn't help plain data files like CSV when excel gets its hands on it. All it takes is one intern opening the file and autosave catching the changes for you to lose lots of data.
Its a toggle, Microsoft. a free software company has better CSV auto-format code then you. Get it together.
[ link to this | view in chronology ]
Re: Re: Format
I was just bitching on another social media site today about Excel's poor choices in everything. Today's gripe was about the welcome screen that helpfully shows you all of your recent files but neglects to show the file extension so you can't tell which is which.
Also high on my aggravation list, the ISO-8601 date format that the entire world uses (YYYY-MM-DD) is not a standard date format type and has to be manually added every time you want it to make your dates look like that.
And yes, every time you load any kind of file or copy/paste any data into it, it tries to auto-format the data using what could be best described as the "best guess of a 5th grader currently flunking math and science".
Yeah, this topic makes this developer a bit not happy.
[ link to this | view in chronology ]
Re: Re: Re: Format
Auto-format itself wouldn't be much of a problem, if it weren't for the completely brain-dead handling of the result: Excel tries to auto-format, AND DISCARDS PRE-FORMATTED DATA, with no way to get it back.
You'd think a ~35 years old application whose primary purpose is organizing data wouldn't work so hard on actively destroying said data.
[ link to this | view in chronology ]
Re: Re: Format
Surely you know that every MS product is required to have a totally insane and broken solution to a problem which has either been solved sensibly for decades by everyone else, or which only exists as a problem because of other insane design decisions.its in their deign requirements, right next to the one forbidding compliance with the user interface standards.
[ link to this | view in chronology ]
Re: Re: Format
"Its a toggle, Microsoft. a free software company has better CSV auto-format code then you. Get it together."
Hey, don't give them ideas. MS never managed to create the illusion of "choice" without including that damn paperclip to provide you with "helpful" suggestions. No one wants Clippy back.
[ link to this | view in chronology ]
Re: Re: Re: Format
And the good news is that you need not deal with Clippy. MS offered several options, mostly at about the same annoyance level, for people who did not like Clippy. Just adjust your user options.
Where is Microsoft Bob when you need it?
[ link to this | view in chronology ]
Re: Re: Format
MS and older formats..
Long ago,
the Amiga had a very neat trick. The First part of loading hte file, TOLD the prog/Data what format it was in.
If you know windows and dos, it can have any ext. and Still not be able to decide what the format is, unless its told. with The thought of Loading and excel deciding what to do...its MS..THEY decide what to do.
Iv loaded Standard files into MS office and other MS products and the prog tends to DO ITS OWN THING.. Where If I used an Alt. Program designed for the ORIGINAL Formats, it never had a problem. It was Like files in/exported back and forth to Apple.
Postscript was Wonderful for this..talk about a mess.
[ link to this | view in chronology ]
Re: Format
You've clearly never worked a day of data analytics in your life.
[ link to this | view in chronology ]
Re: Format
Me: spends 8 hours getting everything perfectly formatted
Excel: You meant to save this all in a European accounting format, right? Done!
[ link to this | view in chronology ]
Microsoft: Where do we want you to go Today?
[ link to this | view in chronology ]
I think it's worthwhile to ask what percentage of users type in "SEPT1" and expect it to be interpreted as a date versus the percentage who type it in and don't. I don't know if there's data for that, but I would expect the majority of people typing "SEPT1" to mean September 1st. In which case no, I don't think the correct solution is for Microsoft to modify Excel's behavior in a way that's unexpected and inconvenient for a larger number of users for the sake of a smaller minority, I think it's for the smaller minority to learn how to change the default settings in Excel.
[ link to this | view in chronology ]
Re: what percentage of users
This.
It's hardly Microsoft's fault that geneticists chose names that look like dates.
When you decide what to call things, choose wisely.
Microsoft isn't my favorite firm, but this is hardly their fault.
[ link to this | view in chronology ]
Re: Re: what percentage of users
It's Microsoft's fault their application mangles data in the course of trying to think for you.
[ link to this | view in chronology ]
Re: Re: what percentage of users
Microsoft should make this behaviour turn on and off-able. (And store that state in the spreadsheet, not just the options of the installation)
Then those that need the stability of their data to be maintain can get that behaviour, and those who like the convenience of using the most widely use spreadsheet application in the world can have it hold their hand for them.
[ link to this | view in chronology ]
Re: Re: Re: what percentage of users
Which is a trap, because they end up being limited by what the hand holding allows, rather than learning to get to where they want to be.
[ link to this | view in chronology ]
Re: Re: what percentage of users
"It's hardly Microsoft's fault that geneticists chose names that look like dates"
It's hardly geneticists fault that Microsoft tries to double guess every decision they make and insert their own interpretation without permission. It should be easy to turn off Microsoft's automatic alteration of your data, and it's their fault if they make design choices that ensure that it's not an easy choice to specify.
[ link to this | view in chronology ]
Re:
No, but the correct solution is for Microsoft to incorporate persistence in the .xls file format for retaining preferences set during document creation. If you designated a column as plaintext when creating the spreadsheet, it should REMAIN plaintext when opened on other systems by your coworkers.
[ link to this | view in chronology ]
Re:
Autoconverting to a datatype in a spreadsheet, whatever the type, is a defect, plain and simple. The fact that it's often absurdly aggressive in its conversion makes it even worse. Your example - SEPT1 - should not even be considered valid for conversion into a date as it doesn't have a year. Any assumption as to what year to use - i.e. this year, the next September to occur - is a very strong assumption. Plus Excel will interpret a whole bunch of separators as being an attempt at a date, even those that don't fit the current locale.
A lack of a conversion can always be fixed after the fact. An unwanted conversion often cannot.
Excel isn't alone in this defect, most spreadsheets mimic this defective behaviour. It's an antifeature, and should be purged ASAP, from every piece of software with it.
[ link to this | view in chronology ]
Re: Re:
Attempting to mangle anything vaguely date like into a date if the cell is date or general makes sense when tying in a single cell of input, but doing it to imported data is completely wrong.
[ link to this | view in chronology ]
Re:
Expect somewhere in the neighborhood of epsilon. That is a mighty odd way of typing a date. I would expect users to give a space, and most of them to leave out the `t'.
And remember, we are not actually looking at user keyed input here. We are looking at data from a file. What percentage of useful programs would store a date as "SEPT1"? I would expect roughly zero.
Someone at Microsoft is immune to brain injury.
[ link to this | view in chronology ]
Open Source
They should have gone with LibreOffice.
[ link to this | view in chronology ]
Easy fix
just put a ' in front of it...
[ link to this | view in chronology ]
Yes, certainly--the way Microsoft does it. (Of course "the way Microsoft does it" is, in the programming world, a euphemism for "ostentatiously wrong".) Merely autoconverting would just be wrong with a capital "R". The ostentation consists in refusing not to preserve the error across saves.
Some real programming languages carry the string as entered UNTIL it is used in computation that requires a specific data type. (Javascript does this--and no quibbling about "real": we're comparing to spreadsheets here.) This approach can be made to work--nothing is mangled until it's absolutely necessary to mangle it.
I suspect Microsoft fell into the "premature optimization" trap--which is a temptation even to real programmers: Let's make the buggy program fast (or in small memory), then (maybe) try to fix the bugs. The fix always turns out to be much harder (more likely to break something else) and generally cancels out the speedup of the optimization or worse. (Microsoft is not keen on fixing bugs that the user can theoretically work around--they are considered user errors. (Which isn't WRONG ... if you include the choice of software in the first place.)
[ link to this | view in chronology ]
Microsoft does not have customers.
It has hostages. Everyone, escape while you still can. If more people don't jump on the Linux bandwagon, and soon, Microsoft's incorporation of a Linux subsystem into Windows will wind up swallowing Linux completely, and there will be no escape.
[ link to this | view in chronology ]
Re: Microsoft does not have customers.
I have succeeded using Apple Numbers in place of Excel and it works like a charm.
[ link to this | view in chronology ]
Re: Re: Microsoft does not have customers.
Apple has hostages, too. Apple just goes about the hostage-taking process a bit differently. As has already been mentioned, LibreOffice / Linux works just fine, for free people, and for free.
[ link to this | view in chronology ]
I am reminded of the joke:
Q: why do programmers confuse Halloween and Christmas?
A: because OCT31 is DEC25.
[ link to this | view in chronology ]
Re:
Ouch.
[ link to this | view in chronology ]
Another example of "smart" things being dumb.
[ link to this | view in chronology ]
Just use Apple Numbers app instead of Excel. I finally got away from Microsoft after 30 years and I just chuckle at some of the bullshit they made me do that I did not want to do.
[ link to this | view in chronology ]
Re:
Apple exercises more control over what its users can do than Microsoft. Try Linux where you have a choice of spread sheets, Libreoffice calc, Gnumeric, Caligra sheets amongst others for a spread sheet.
[ link to this | view in chronology ]
Re:
Though in recent years MS has been gradually moving towards Apple-style totalitarian control (and combining it with Google-style adware and spyware, for good measure), I'd say at this point they're still an order of magnitude better than Apple.
That advantage is shrinking, but sadly not because Apple is getting better...
[ link to this | view in chronology ]
Re:
Forgot to say: the problem in the article is Excel insisting it knows what you need better than you do.
Suggesting Apple as a solution for that is... well, ever heard of the phrase "out of the frying pan and into the fire"?
[ link to this | view in chronology ]
Meme I wish I could post.
Venn Diagram
Circle 1 - Excel
Circle 2 - Incel
Overlap - Thinking everything is a date.
[ link to this | view in chronology ]
Re:
That's Beautiful!!
[ link to this | view in chronology ]
Excel is not the problem; your browser is!
I feel one of the most common sources of corrupted data is when you find a cvs file on the internet and want to save it on your computer.
But unfortunately, in order to help you, the browser does not simply save the file locally. Instead it launches Excel and hands it the file. And as a naive user, you immediately save the file from within Excel, not realising you just saved a bunch of autoconverted and possibly corrupted data...
Had the browser just saved the file locally, that data would be fine and you could try to your hearts content to import or open it in a number of applications, formats, with or without any automatic conversion or datatype detection, until you were satisfied that your software of choice had interpreted the data correctly.
Main workaround: don't let the browser open an application to handle csv files.
In fact, don't let the computer in general (I'm looking at you, Explorer) use any 'default' application to open any csv file which attempts to automagically convert or interpret your data! It will probably fail at some point, and possibly without you realising at that time.
[ link to this | view in chronology ]
Re: Excel is not the problem; your browser is!
I'm not following... you're saying that you have your OS set to open CVS files, then Excel corrupts the data, but the fault is the browser's for using your OS specified defaults, and not Excel for corrupting the file?
[ link to this | view in chronology ]
Re: Excel is not the problem; your browser is!
In IT we like to refer to this problem with error codes like --
ID-10T
or
E.R.B.C.A.K.
You can blame Microsoft for a lot of things, but the settings you have on your browser of what to automatically do (like opening an APP instead of saving the file) is totally your problem, not theirs.
[ link to this | view in chronology ]
Found out the hard way
I so hate MS excel. Sadly, my job only allows us to use it with the "SAS" solution thru 365. I could go on a 10-page rant about the never ending ribbon changes.
[ link to this | view in chronology ]
Back in the day, we had the option of using a cloth ribbon instead of a carbon ribbon. You did not have to change it very often, and it was really up to you as to how faint you were willing for your type to be as the ink was transferred from the ribbon to the paper.
Ask your office supply vendor about cloth ribbons. For some people, this may be an upgrade.
[ link to this | view in chronology ]