Once More With Feeling: 'Anonymized' Data Is Not Really Anonymous
from the nothing-to-see-here dept
As companies and governments increasingly hoover up our personal data, a common refrain to keep people from worrying is the claim that nothing can go wrong because the data itself is "anonymized," or stripped of personal detail. But time and time again, we've noted how this really is cold comfort, given that it takes only a little effort to identify a person based on access to other data sets. Yet most companies (including cell phone companies that sell your location data) act as if "anonymizing" your data is iron-clad protection from having it identified. It's simply not true.
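To make the cross-referencing idea concrete, here is a minimal sketch of a linkage attack. Everything in it is an assumption made for illustration: the two toy data sets, the field names (`zip`, `birth`, `sex`, `diagnosis`), and the `link` helper are hypothetical and do not come from any study mentioned in this post. It only shows the general shape of joining a nameless release against a second data set that carries names.

```python
# Hypothetical "anonymized" release: names removed, other attributes kept.
anonymized = [
    {"zip": "10001", "birth": "1985-01-05", "sex": "M", "diagnosis": "asthma"},
    {"zip": "10001", "birth": "1990-07-19", "sex": "F", "diagnosis": "diabetes"},
    {"zip": "94110", "birth": "1985-01-05", "sex": "M", "diagnosis": "migraine"},
]

# Hypothetical second data set (leaked or public) that still carries names.
public = [
    {"name": "Alice Example", "zip": "10001", "birth": "1990-07-19", "sex": "F"},
    {"name": "Bob Example", "zip": "94110", "birth": "1985-01-05", "sex": "M"},
    {"name": "Carol Example", "zip": "60614", "birth": "1972-03-02", "sex": "F"},
]

# Attributes shared by both data sets (assumed for this sketch).
QUASI_IDENTIFIERS = ("zip", "birth", "sex")


def link(anon_rows, named_rows, keys=QUASI_IDENTIFIERS):
    """Map the index of each anonymized row to a name when exactly one
    named row shares the same values for all of the given keys."""
    index = {}
    for row in named_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])

    matches = {}
    for i, row in enumerate(anon_rows):
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match means re-identification
            matches[i] = names[0]
    return matches


reidentified = link(anonymized, public)
for i, name in reidentified.items():
    print(name, "->", anonymized[i]["diagnosis"])
```

In this toy data, two of the three "anonymous" rows pick up a name even though the release never contained one. The first row stays unmatched only because no named record happens to share its attributes.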
The latest case in point: in new research published this week in the journal Nature Communications, data scientists from Imperial College London and UCLouvain found that it wasn't particularly hard for companies (or anybody else) to identify the person behind "anonymized" data using other data sets. More specifically, the researchers developed a machine learning model that was able to correctly re-identify 99.98% of Americans in any anonymised dataset using just 15 characteristics including age, gender and marital status:
"While there might be a lot of people who are in their thirties, male, and living in New York City, far fewer of them were also born on 5 January, are driving a red sports car, and live with two kids (both girls) and one dog," explained study first author Dr Luc Rocher, from UCLouvain.
And fifteen characteristics is actually pretty high for this sort of study. One MIT investigation of "anonymized" credit card data found that users could be correctly "de-anonymized" 90 percent of the time using just four relatively vague points of information. Another study looking at vehicle data found that 15 minutes’ worth of data from brake pedal use alone could let researchers pick the right driver, out of 15 options, 90% of the time.
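The narrowing effect Rocher describes can be sketched in a few lines. This is a toy illustration under stated assumptions, not the researchers' model: the six records, their attribute order, and the `unique_fraction` helper are all hypothetical. It simply counts how many records become one of a kind as more attributes are taken into account.

```python
from collections import Counter

# Hypothetical toy records, invented for this sketch. Attribute order:
# (age band, sex, city, birthday, car, household)
records = [
    ("30s", "M", "NYC", "01-05", "red sports car", "2 girls, 1 dog"),
    ("30s", "M", "NYC", "01-05", "grey sedan", "no kids"),
    ("30s", "M", "NYC", "07-19", "red sports car", "1 boy"),
    ("30s", "M", "NYC", "03-02", "none", "no kids"),
    ("30s", "F", "NYC", "01-05", "none", "2 girls, 1 dog"),
    ("40s", "M", "NYC", "11-30", "grey sedan", "1 boy"),
]


def unique_fraction(rows, n_attrs):
    """Fraction of rows whose first n_attrs attributes match no other row."""
    counts = Counter(row[:n_attrs] for row in rows)
    unique = sum(1 for row in rows if counts[row[:n_attrs]] == 1)
    return unique / len(rows)


# Print how uniqueness grows as each additional attribute is considered.
for n in range(1, len(records[0]) + 1):
    print(f"{n} attribute(s): {unique_fraction(records, n):.0%} unique")
```

With only the age band, a single record in this toy set stands alone. Once birthday and car are included, every record does, which is the same pattern the studies above report at much larger scale.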
The problem, of course, comes when multiple leaked data sets are released in the wild and can be cross-referenced by attackers (state-sponsored or otherwise), de-anonymized, then abused. The researchers in this new study were quick to point out that government and industry assurances of "don't worry, it's anonymized!" are dangerous and inadequate:
"Companies and governments have downplayed the risk of re-identification by arguing that the datasets they sell are always incomplete,” said senior author Dr Yves-Alexandre de Montjoye, from Imperial’s Department of Computing, and Data Science Institute. "Our findings contradict this and demonstrate that an attacker could easily and accurately estimate the likelihood that the record they found belongs to the person they are looking for."
It's not clear how many studies like this we need before we stop using "anonymized" as some kind of magic word in privacy circles, but it's apparently going to need to be a few dozen more.
Filed Under: anonymity, anonymized data, data, privacy
Reader Comments
'Our arguments don't apply to you though, honest...'
The best part of the government making that argument is that at least one of the USG agencies (and several others, I'm sure) knows full well that you can take bits and pieces of otherwise harmless data and combine them into something not so harmless, which would make it extra rich should they try the 'it's just metadata/anonymized data' argument.
[ link to this | view in chronology ]
Project Insight requires insight...
[ link to this | view in chronology ]
Inject false data
Wouldn't the best way to muddy the 'anonymized' data be to irregularly, but often, do something entirely out of character for yourself? Maybe someone could make an app for that?
The hardest part might be finding out what your 'characteristics' are so that doing something 'out of character' can be determined (denial of what 'your' characteristics are will be a problem). The second piece, the list of things that are 'out of character' for you, but still acceptable (which might be a characteristic that in the end 'outs' you) would also be hard. But sending data that you did something you didn't actually do might actually make acceptance irrelevant.
And, since we don't seem to do random very well, making 'random' injections might just provide sufficient information to identify individual 'randomness' and separate that from other identifiable data points.
Oh well. Maybe it would just be best to not let anyone have the data to begin with, but Pandora's box has been opened and I cannot think of any way of putting what has escaped back in the box. That is likely not possible, and banning future collections might just be futile.
[ link to this | view in chronology ]
Anonymity
People used to (and maybe still) call the TSA "Security theater": action designed to give the appearance of security, without actually accomplishing much.
Current anonymization techniques should be called "anonymity theater". They give the appearance of anonymity to the average person, while still allowing companies to correlate the results with their entire history as well as other data gathering services.
[ link to this | view in chronology ]
Personally Identifiable Data
[ link to this | view in chronology ]
Re: Personally Identifiable Data
The more common term that people encounter is "Personally Identifiable Data" (PID) in Terms of Service agreements for various software, apps, services, memberships, etc.
Companies often claim that no PID is collected, stored, or shared ... or at least that it is very strictly secured for highly limited, absolutely necessary purposes.
Such TOS claims are usually false or misleading. Few people read them anyway.
[ link to this | view in chronology ]
This example seems a bit "lying with statistics" to me. Simply because... how often are there only 15 people to pick from in datasets like this? How often could they use this information to choose the right driver out of 15 million options?
[ link to this | view in chronology ]
I think you missed the point
The point wasn't to show you that they could pick out the correct person out of 15 correctly 90% of the time, but that we have personally identifiable characteristics even in as small a task as braking. Now think about all of the other things you do that you now have a sensor and micro-computer attached to that is taking very precise readings about you and giving them to someone else to use. Even anonymized, these readings do a very good job of "fingerprinting" you.
[ link to this | view in chronology ]
Re:
This is why I like to drive in the left lane on the highway while holding my brake pedal just enough to keep my brake lights on.
[ link to this | view in chronology ]
Re:
How often are 15 million people the likely driver of a single vehicle?
The quote you cite is right after a discussion about combining leaked/stolen data pools. That we can use as random a data point as braking pattern to narrow and refine de-anonymization efforts is important. While it's true that a 15-person pool in isolation seems like a restriction that limits the value of this find, I was struck by how likely someone with braking data is to know what car the data came from. I was struck, reading that data point, by how, with enough data, the braking patterns could be used to determine who was driving a car when. Then, if they had access to that car's location from LPR databases, or perhaps the car's LoJack GPS, they would know intimate details about each individual sharing that vehicle.
[ link to this | view in chronology ]
You can do the same thing back to them
You can apply the same logic to find the K Street PR firms that Boeing and Warren Buffett hired to get hate stories on Elon Musk pushed to tech sites.
You can use a VPN service to see how Google presents 15 different versions of Google News biased 15 different bizarre ways depending on what country you are in, instead of just telling the truth. You can watch for patterns in the bias to notice "enhanced news" as I like to call it. You can also see full fake campaigns being used to direct folks to websites that have only existed a day or two. You can see all of this because Google won't mess with their own Trends. So you can see terms and conventions being thrown around by the Media at times that claim to be ancient but have existed for days, not years. You can literally observe the man behind the curtain.
Start using their logic to look where they don't want you to look.
[ link to this | view in chronology ]
Re: You can do the same thing back to them
As noted by Techdirt, search is inherently biased. The whole reason Google got popular over Yahoo was that Google was better at showing you results you actually wanted to see. The algorithm is designed to bias towards current events and larger websites and news sources, because the data it gets from use indicates those results are what people are searching for.
Hell Bing ran its entire initial ad campaign on the idea that it was better than Google at biasing its search results to websites relevant to your request.
I heard the term Zero-Day from Scorpian, thought it was a stupid term, and suddenly everyone was using it. It is a phenomenon that has been described for a long time: seemingly new terms erupt into your awareness and are suddenly everywhere. As well, academic terms often explode into public consciousness very suddenly. They can be old terms that only gained widespread usage recently. Millennial, for instance. A lot of Gen X'ers originally heard the term used by Marketing departments - the uninspired Gen Y - and assumed that is what everyone was using. But academics had coined the term Millennial around the same time. So I have spent 5 years listening to them discover that Millennial describes people born before 2000, not after, and angrily wonder when 'they' changed it. Of course, marketing may have changed the term they used, but it was a term in long use in academic circles. They just assumed who Millennials were, much as Gen X and the MTV generation were being used by baby boomers to describe children born well into the 90s. Just because a term didn't trend doesn't mean it didn't exist. Occam's Razor.
Similarly both Democratic Socialist and Social Democrat have been academic terms for a while, but only recently have become general use terms as we needed a way to differentiate from the socialist boogeyman of the Republican Party and are often misused or misunderstood when compared to their academic roots as most words from Academia are.
Your lack of citation for your claims makes your screed here look more like a conspiracy theory than a genuine claim. Perhaps show me a full fake campaign (what does that even mean?), or a term you think was just created out of thin air by [Google? The Media? It's unclear from your rant] that Google somehow is hiding by letting us view the trend data. Then I have something to sink my teeth into.
[ link to this | view in chronology ]
Re: Re: You can do the same thing back to them
Submitted missing this part:
And none of this really follows from your subject. The closest thing you have to de-anonymizing data sets is your comment about backtracking a PR firm - but not actually backtracking from there to Boeing or Buffett, which would be the actual comparison to make.
[ link to this | view in chronology ]
Re: You can do the same thing back to them
But can you see why kids love Cinnamon Toast Crunch?
[ link to this | view in chronology ]
Re: You can do the same thing back to them
Wait, you're telling me that different countries get different news results! Wow! Where did you discover this incredible conspiracy?
[ link to this | view in chronology ]
Re: You can do the same thing back to them
The idea that presenting different stories to people from different countries is some kind of bizarre and deceitful bias is itself bizarre. People have always cared more about local news. If it's something more sinister than that, you haven't made your point.
[ link to this | view in chronology ]
Re: Re: You can do the same thing back to them
Alternatively, consider what happens if Google and other media outlets only push out one truth.
You can bet your back teeth Zof would piss himself.
[ link to this | view in chronology ]
Re: Re: You can do the same thing back to them
Google does not write news stories, and different countries have different news sources, which report the news in slightly different ways. Also, they have different priorities as to what is most newsworthy.
However what would be frightening is if Google news presented the same stories from the same sources in the same order in every country.
[ link to this | view in chronology ]
Re: Re: Re: You can do the same thing back to them
Certainly it could be useful to let people select their country, and one could argue that it would be more transparent to skip the automatic geolocation and just ask everyone which area they want to see news for. But "Google News uses geolocation!" is just about the weakest Google-conspiracy I've ever seen.
[ link to this | view in chronology ]
Re: Re: Re: You can do the same thing back to them
Almost like if some global broadcaster had every local news firm make the same "local" presentation, and then someone simulcast 100 of them at the same time showing the 'local' presentation from 100 different channels...
I mean nothing 'creepy' about that now is there?
[ link to this | view in chronology ]
Re: You can do the same thing back to them it’s called zoffing
And you can lie 15 different ways and get debunked just as much because you are dangerously addicted to failure.
[ link to this | view in chronology ]
Re: Today’s trending word is: Extremely
“Start using their logic to look where they don't want you to look.”
And that place is a massively popular search tool? Do you realise how stupid and paranoid you sound?
[ link to this | view in chronology ]
Re:
Do tell how providing relevant local search results to users in other countries is somehow a conspiracy.
Weren't you leaving?
[ link to this | view in chronology ]
So, let's see. Facial recognition is only accurate 20% of the time in identifying an individual (leading to lots of false arrests because LEOs don't do any other checking). BUT, anonymized data available on the Internet is accurate 99.98% of the time. Why doesn't law enforcement start using this instead?
[ link to this | view in chronology ]
hmmmm....
if only there were a way to license my private data, including my face and physical image...
[ link to this | view in chronology ]
Re: hmmmm....
There is, if you take the picture (such as a selfie) and don't give away a license to it to the likes of Instagram, Facebook, Google, or anyone else. Then, they would need to license it. However, just walking down the street and letting them take your picture won't do it. You'll need to just start always walking in public wearing a "Guy Fawkes" mask.
[ link to this | view in chronology ]
But we're not supposed to be able to get THIS information, because governments everywhere want us to think we are safe: that no one, in particular governments, can get our data, let alone that any company on the planet can get it and sell it so that money is made by everyone except us, while privacy and safety go straight out the window!
[ link to this | view in chronology ]
Re:
The government has almost nothing to do with this. It's really all about corporate liability. I'm oversimplifying this to make a point, but it goes something like this.
Person: I'm suing big corp for misusing my personal data.
Big Corp: We used industry standard anonymization.
Judge: Industry Standard. Case dismissed.
The problem here is that "everyone else is doing it" is a valid defense against negligence. No negligence means no liability. Most businesses only care about the liability. That's why computer systems are so insecure. So long as a company follows industry standard practices, they really don't have to worry about what actually happens, they won't be liable.
Many companies use things like Symantec. It doesn't matter if the "security" software is so insecure that it causes customer credit card numbers to be exposed. It matters that the company doesn't have to worry about liability, because everyone is doing it.
[ link to this | view in chronology ]
Re: Re:
They also don't have to worry about liability if a third party, and not the business, is responsible for security. Another tactic is for the company to carry insurance that covers the costs if an incident does occur. So as long as the premiums aren't too high and the company is covered, they don't care what happens.
[ link to this | view in chronology ]
I've told many..
about the tricks and problems with Their computer devices..
Enter your name, address, phone, CC#, SS#..
At ANY TIME...and the browser will give it to anyone with clearance. MS sold that ability for $99 per year on Browsers.
Chrome recently discovered a flaw in the incognito mode and is revamping it to WORK better.
Bank keeps asking me to use the INTERNET, to access my accounts, and I tell them I know enough about computers to NOT do it..
People don't get what is happening out there..
Compare sites, and log the names of certain people. In the same groups you will have alt names, but HOW THEY SPELL and form sentences.. will give you a good chance of knowing who they are.
A lot of people use the same name ALL OVER the net.. Hack 1 site he belongs to and the password is probably the same everywhere..(at least 50-60%) and if you can get access to his PERSONAL info sheet on a site...name, address, phone,...
ALL, of this, and in the old days...they would collect the Sales info you would use in mail order.. PCH is the BIG ONE.. order 2-3Mags and end up getting A TON more for others you never heard.. And those lists can be bought. Even NOW,. your credit card is tracking you..and Corps can use that data to sell, MORE info.
Then comes the fun...what can they do?
Depending on the info, open a bank account with your DATA and a fake picture. And then put you in debt... because they got your SS#..
The Odds are, at this time there are 3-4 people in the nation USING your SS# to find work..(I love them doing this for me)..
Think about it..
SS# = 1234-34-2345. It IS NOT supposed to be used by anyone except your bank, your work, and you.. And this has been bypassed, even in the past, so that the Corps use it to track you. Now they also use your bank and credit cards..
AND YOUR PHONE..built in GPS....love it. And a phone that COULD tell them everything about you..
Anyone remember Blue tooth hacking??? it still happens.
[ link to this | view in chronology ]
The crux of the problem of companies breaking user anonymity is users giving personally identifiable data needed to correlate with other anonymous data sets. What many users fail to realize is that most identifiable data that companies request is completely unnecessary to begin with. Sure, sites may want your real name, birthday, phone number and physical address, but do those companies have any actual need for it? I find the exceptions to be increasingly rare. Unless the company or its users actually need to interact with you personally, they don't need any of that. In fact, not only can you fake most of the data provided to most sites, you absolutely should!
The vast majority of exceptions were purchases. Before the rise of third-party Internet payment services such as PayPal, purchases used to require providing each site a credit card number. Since those were inherently insecure, payment processors required various identifiable data in an attempt to verify the buyer's identity and curb fraud. This practice gave many sites a lot of user data, which led to the proliferation of that data and consequently identity theft and fraud. As far as I can tell, PayPal doesn't check any of that data. They use a completely different set of data points to identify the buyer. If the purchase isn't for a physical item, then just about all data on the transaction can be fake as well.
One other exception is if you can't reset an account password or if the account has been compromised: the site or a tech support rep may ask for some of the data you provided in order to verify your ID. While it may seem less inconvenient to use your real data, just in case, identity theft and other breaches of user anonymity can be considerably more inconvenient. Instead, keep an address book of your fake IDs so that you can provide the "correct" one when needed. And of course, secure that book just as well as you do your passwords, as internet hoodlums can use it to compromise your accounts just as if they had the password itself.
[ link to this | view in chronology ]
Corps know..
Corps know MORE about the people in this country than the Gov. ever has..
They COULD(if they were honest) answer almost any question about the Whole populace..
[ link to this | view in chronology ]