Harvard Students Again Show 'Anonymized' Data Isn't Really Anonymous
from the I-know-more-about-you-than-you-do dept
As companies and governments increasingly hoover up our personal data, a common refrain meant to keep people from worrying is the claim that nothing can go wrong because the data is "anonymized" -- stripped of personal identifiers like social security numbers. But study after study has shown this to be cold comfort, given that it takes only modest effort to identify a person once you have access to a few other data sets. Yet most companies, many privacy policy folks, and even government officials still like to act as if "anonymizing" your data means something.
A pair of Harvard students have once again highlighted that it very much doesn't.
As part of a class study, two Harvard computer science students built a tool to analyze the thousands of data sets leaked over the last five years or so, ranging from the 2015 Experian hack to the countless other privacy scandals that have plagued everyone from social media giants to porn websites. Their tool collected all this data and matched records across breaches by email address. What they found, again (surprise!), is that anonymized data is in no way anonymous:
"An individual leak is like a puzzle piece,” Harvard researcher Dasha Metropolitansky told Motherboard. “On its own, it isn’t particularly powerful, but when multiple leaks are brought together, they form a surprisingly clear picture of our identities. People may move on from these leaks, but hackers have long memories."
"We showed that an 'anonymized' dataset from one place can easily be linked to a non-anonymized dataset from somewhere else via a column that appears in both datasets," Metropolitansky said. "So we shouldn't assume that our personal information is safe just because a company claims to limit how much they collect and store."
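The students' actual tool isn't public, but the linkage attack Metropolitansky describes is easy to sketch. The Python snippet below is a toy illustration with made-up data and column names, not their code: it joins an "anonymized" table to a leaked, non-anonymized one on the columns they share, re-attaching email addresses to supposedly anonymous records.

```python
# A minimal sketch of a linkage attack, NOT the researchers' actual tool.
# All records, email addresses, and column names here are hypothetical.
import pandas as pd

# An "anonymized" dataset: names stripped, but quasi-identifiers remain.
anonymized = pd.DataFrame({
    "zip_code":   ["02138", "02139", "02140"],
    "birth_year": [1994, 1987, 1990],
    "gender":     ["F", "M", "F"],
    "diagnosis":  ["asthma", "diabetes", "hypertension"],
})

# A separate, non-anonymized dataset (say, a leaked mailing list)
# that happens to share the same quasi-identifier columns.
leaked = pd.DataFrame({
    "email":      ["alice@example.com", "bob@example.com"],
    "zip_code":   ["02138", "02139"],
    "birth_year": [1994, 1987],
    "gender":     ["F", "M"],
})

# Joining on the shared columns re-attaches identities to the
# "anonymized" records wherever the combination of values is unique.
reidentified = anonymized.merge(leaked, on=["zip_code", "birth_year", "gender"])
print(reidentified[["email", "diagnosis"]])
```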
It's hardly an isolated finding. One UK study showed that machine learning could correctly re-identify 99.98% of Americans in an anonymized data set using just 15 characteristics. An MIT study of "anonymized" credit card data showed that users could be identified 90% of the time using just four points of information. And one German study (pdf) found that just 15 minutes of brake pedal data was enough to pick the right driver out of 15 candidates 90% of the time.
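To get an intuition for why a handful of attributes is enough, here's a back-of-the-envelope sketch (nothing like the statistical models in the studies above, and using entirely hypothetical data) that counts how many records become unique, and therefore linkable, as you combine more columns.

```python
# A back-of-the-envelope illustration of why a few attributes can single
# people out. Data and column names are entirely hypothetical.
import pandas as pd

def fraction_unique(df, quasi_identifiers):
    """Share of records that are the only one with their particular
    combination of values across the given columns."""
    combo_sizes = df.groupby(list(quasi_identifiers)).size()
    return (combo_sizes == 1).sum() / len(df)

# Hypothetical "anonymized" records: no names, just coarse attributes.
people = pd.DataFrame({
    "zip_code":   ["02138", "02138", "02139", "02139", "02140", "02140"],
    "birth_year": [1990, 1990, 1985, 1990, 1985, 1979],
    "gender":     ["F", "M", "F", "F", "M", "M"],
})

# With one attribute, many people share a value; combine a couple more
# and most records become one-of-a-kind, i.e. re-identifiable.
for cols in [["zip_code"],
             ["zip_code", "birth_year"],
             ["zip_code", "birth_year", "gender"]]:
    print(cols, f"{fraction_unique(people, cols):.0%} unique")
```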
Take that data and fuse it with, say... the location data hoovered up by your cell phone provider, or the smart meter data collected by your local power utility, and it's possible for a hacker, researcher, or corporation to build the kind of detailed profile of your daily movements and habits that might surprise even you or your spouse. And since we still don't have even a basic U.S. privacy law for the internet era, nothing really seems to change, and any penalties for abusing the public trust are, well, routinely pathetic.
Yet somehow, every time there's another massive new hack or breach, the companies involved (as we just saw with the Avast antivirus privacy scandal) like to downplay the threat by insisting the data collected was anonymized, and that there's therefore just no way it could help specifically identify or target individuals. There's simply never been any indication that's actually true.
Filed Under: anonymity, anonymous data, study