Harvard Students Again Show 'Anonymized' Data Isn't Really Anonymous
from the I-know-more-about-you-than-you-do dept
As companies and governments increasingly hoover up our personal data, a common refrain to keep people from worrying is the claim that nothing can go wrong because the data itself is "anonymized" -- or stripped of personal identifiers like social security numbers. But time and time again, studies have shown how this really is cold comfort, given it takes only a little effort to pretty quickly identify a person based on access to other data sets. Yet most companies, many privacy policy folk, and even government officials still like to act as if "anonymizing" your data means something.
A pair of Harvard students have once again highlighted that it very much doesn't.
As part of a class study, two Harvard computer scientists built a tool to analyze the thousands of data sets leaked over the last five years or so, ranging from the 2015 hack of Experian to the countless other privacy scandals that have plagued everyone from social media giants to porn websites. Their tool collected and analyzed all this data, matching records across breaches by email address. What they found, again (surprise!), is that anonymized data is in no way anonymous:
"An individual leak is like a puzzle piece,” Harvard researcher Dasha Metropolitansky told Motherboard. “On its own, it isn’t particularly powerful, but when multiple leaks are brought together, they form a surprisingly clear picture of our identities. People may move on from these leaks, but hackers have long memories."
“We showed that an ‘anonymized’ dataset from one place can easily be linked to a non-anonymized dataset from somewhere else via a column that appears in both datasets,” Metropolitansky said. “So we shouldn’t assume that our personal information is safe just because a company claims to limit how much they collect and store."
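Mechanically, the linkage Metropolitansky describes is nothing more exotic than a database join on the shared column. A minimal sketch of the idea in Python/pandas, using entirely made-up column names and values (this illustrates the technique, not the researchers' actual tool):

```python
import pandas as pd

# An "anonymized" dataset: no names, but it still carries a quasi-identifier
# (here, a hypothetical loyalty-card hash) alongside sensitive attributes.
anonymized = pd.DataFrame({
    "card_hash": ["a1f3", "b7c9", "d02e"],
    "purchases": ["pregnancy test", "whiskey", "antidepressants"],
})

# A separate, non-anonymized leak that happens to contain the same column.
leaked = pd.DataFrame({
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
    "card_hash": ["a1f3", "b7c9", "d02e"],
})

# One inner join on the shared column, and the "anonymous" purchase history
# has an email address attached to it.
reidentified = anonymized.merge(leaked, on="card_hash", how="inner")
print(reidentified[["email", "purchases"]])
```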
For example, one UK study showed how machine learning could correctly identify 99.98% of Americans in an anonymized data set using just 15 characteristics. Another MIT study of "anonymized" credit card user data showed how users could be identified 90% of the time using just four points of information. One German study (pdf) looked at how just 15 minutes of brake pedal data could help researchers identify the right driver, out of 15 potential options, 90% of the time.
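The reason so few characteristics are enough comes down to how quickly attribute combinations become unique. A rough way to measure that on any table you hold is to count how many records share each combination of quasi-identifiers; the sketch below uses hypothetical columns and toy data, not the methodology of the studies above:

```python
import pandas as pd

def uniqueness_rate(df: pd.DataFrame, quasi_identifiers: list[str]) -> float:
    """Fraction of records whose combination of quasi-identifiers is unique."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return (group_sizes == 1).sum() / len(df)

# Toy data: even a few coarse fields combine into near-unique fingerprints.
people = pd.DataFrame({
    "zip_code":   ["02138", "02138", "02139", "02139", "02139"],
    "birth_year": [1961,    1984,    1984,    1990,    1990],
    "gender":     ["F",     "M",     "M",     "F",     "M"],
})
print(uniqueness_rate(people, ["zip_code"]))                          # 0.0
print(uniqueness_rate(people, ["zip_code", "birth_year", "gender"]))  # 1.0
```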
Take that data and fuse it with, say... the location data hoovered up by your cell phone provider, or the smart electricity meter data collected by your local power utility, and it's possible for a hacker, researcher, or corporation to build the kind of detailed profile of your daily movements and habits that even you or your spouse might be surprised by. And since we still don't have even a basic U.S. privacy law for the internet era, nothing really seems to change, and any penalties for abusing the public trust are, well, routinely pathetic.
Yet somehow, every time there's another massive new hack or breach, the involved companies (as we just saw with the Avast antivirus privacy scandal) like to downplay the threat by insisting the data collected was anonymized, and that therefore there's just no way the data could help specifically identify or target individuals. There's simply never been any indication that's actually true.
Filed Under: anonymity, anonymous data, study
Reader Comments
My phone sleeps with my wife's phone.
I am starting to think they do not know the meaning of the word anonymized.
"but hackers have long memories"
They also have lots of storage space & a need to keep things that might be useful in the future.
I mean it's not like I still have that entire dump of ACS from when they screwed up and put the whole server backup online.... er wait.
The world seems to keep working under the assumption, no one would ever do that.
No one would ever combine datasets.
No one would ever scrape every picture they could find.
No one would ever give the names, numbers, emails of everyone they know to a platform.
No one would ever build shadow profiles to help find more links between people.
No one would ever use shopping data to send baby coupons.
No one would ever lie online.
Humanity... The Good Intentions of: No one would ever...
Re:
And the corollary to the above:
"Why would you do that?"
"Because I can!"
Companies running the internet all became billionaires because they hoover up our personal data. The government that could fix this is currently run by money & not the public interest.
Glad we at least have the GDPR.
Sure, but there aren't only 15 drivers out there. Any reasonable sampling of drivers you would want to search through for an individual would include hundreds if not thousands of initial members of the pool.
Re:
"Companies running the internet all became billionaires"
All of 'em huh?
Re:
Yes - and in addition, I doubt that variations in one driver's brake pedal application are unique enough to pick them out of millions.
Re:
Okay, but let's pick just the cars that leave your work parking lot between 5 and 5:30, slow, then brake to a stop for a few minutes at the same time and on the same days you happen to stop in at 7-11 for a post-work Slurpee according to your credit card data, and then drive just enough (in time or distance) to reach your home minutes before your smart power meter shows electricity consumption increasing as you turn on your oven to preheat for dinner. Keep plucking out data points that could escape and you fill in more and more gaps.
the power of correlation
Correlation may not imply causation, but the associations are enough for Google to earn billions per year.
we all know
that online anonymity is a fallacy. It's all about control.
Is tech moving toward or away from allowing one person in power to monitor almost everyone?
Re: the power of correlation
A small correction: the marketeers' belief in correlation and profiling allows Google to earn billions per year.
And yet...
I all but guarantee you that the people/companies putting forth the 'it's harmless data collection, it's been anonymized' line would refuse point-blank were someone to ask them to provide their own 'anonymized' data to pore through.
They know damn well the 'we can't identify people with this data' excuse is a lie, they're just hoping that the people they're talking to don't know that, or have a vested interest in perpetuating the lie.
The anonymized data problem stems from an internet backbone that was designed to not be anonymous.
Re: Re:
Ah, but that's where the point of this article (and similar ones) comes in.
Using the brake pedal data, you segment the dataset of 30 million down into buckets of, say, 10,000.
Now for each of those buckets, you segment by acceleration data, creating unique buckets of size 10.
The chance that there would be a GPS location or ALPR location collision between those 10 people is extremely low. Meaning you can not only fingerprint the vehicle and backtrace where it went over a period of time; you also know, with a very high degree of certainty, who was driving that vehicle. All without any visual confirmation of the face behind the wheel.
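In other words, each telemetry channel is just another filter over the candidate pool, and independent filters multiply. The back-of-the-envelope arithmetic below follows the numbers in the comment above; the bucket counts and the independence assumption are illustrative, not taken from any of the cited studies:

```python
def expected_pool(population: int, *bucket_counts: int) -> float:
    """Expected candidates left after filtering on each signal, assuming the
    signals are roughly independent and the buckets roughly evenly sized."""
    pool = float(population)
    for buckets in bucket_counts:
        pool /= buckets
    return pool

POPULATION = 30_000_000
BRAKE_BUCKETS = 3_000     # ~10,000 drivers share each brake-behaviour bucket
ACCEL_BUCKETS = 1_000     # splits each of those down to ~10 drivers
LOCATION_CELLS = 5_000    # coarse GPS / ALPR location cells

print(expected_pool(POPULATION, BRAKE_BUCKETS))                                 # ~10,000
print(expected_pool(POPULATION, BRAKE_BUCKETS, ACCEL_BUCKETS))                  # ~10
print(expected_pool(POPULATION, BRAKE_BUCKETS, ACCEL_BUCKETS, LOCATION_CELLS))  # well under 1
```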
Re: Re: Re:
Imagine, for example, recovering a stolen vehicle, dumping the telemetry, and with only 3 factors or so being able to ID who stole the vehicle and what they did with it.
There's no central database of vehicle telemetry, so that's not currently possible -- but if you're an insurance agency with wide coverage and favourable terms for people who share their telemetry... you'll get an idea pretty quickly.
Also: if a car is insured for pleasure use only and shares telemetry, it becomes obvious pretty quickly if extra people who aren't on the insurance are using the vehicle, even if the car meets the distance criteria.
Re: And yet...
Actually, one reporter did that, and the researchers were able to determine, without his assistance:
1) where he worked
2) where he lived
3) where he got gas
4) where he shopped for groceries
5) that he was married or in a relationship
and the list of things they were able to determine went on and on.
Re: Re: Re: Re:
And now imagine becoming a serf for the corporations, because they control almost everything you need to live as part of society.
Re: And yet...
Indeed.
Similarly, people who advocate for war should have their children fight on the front lines in said war.