from the but-it's-important dept
A few weeks back, news broke that some hackers were offering up personal information on over 500 million Facebook accounts (there had been some earlier reports about this data being available as well). It was even highlighted that some of the "personal information" apparently included Mark Zuckerberg's phone number (though it's unclear whether that data was ever confirmed to be accurate).
After there was more reporting on this data being available, some people noticed that Facebook had not alerted people about it. Various data breach laws require notification to affected users of a data breach, and so Facebook's decision raised some eyebrows. Facebook's explanation for not notifying users is that the company did not consider this to be a breach; rather, the data was collected by scraping information that people put on their own profiles:
Scraping is a common tactic that often relies on automated software to lift public information from the internet that can end up being distributed in online forums like this. The methods used to obtain this data set were previously reported in 2019. This is another example of the ongoing, adversarial relationship technology companies have with fraudsters who intentionally break platform policies to scrape internet services. As a result of the action we took, we are confident that the specific issue that allowed them to scrape this data in 2019 no longer exists. But since there’s still confusion about this data and what we’ve done, we wanted to provide more details here.
This may feel like weaseling out (and in some ways it absolutely is), but the distinction here is actually really important. As explained in a detailed Wired article, no one "hacked" into Facebook, so in some way the data wasn't breached the way many people think of a breach. Instead, it appears that the "import contacts" feature (which is wise to avoid using on basically any service if you can!) had a vulnerability in that malicious actors could basically try to import every possible phone number, and any "match" would give them access to the information those people shared with connections (so, personal info like phone numbers, names, date of birth, email, etc.).
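To make the mechanics concrete, here is a minimal sketch of that kind of enumeration, run against a toy in-memory stand-in rather than any real service. Everything in it is hypothetical and for illustration only: the `USERS` table, the `import_contacts` function, and the made-up number format are assumptions, not Facebook's actual implementation or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    phone: str


# Hypothetical toy user table keyed by phone number (made-up data).
USERS = {
    "5550003": Profile("Alice Example", "5550003"),
    "5550042": Profile("Bob Example", "5550042"),
}


def import_contacts(uploaded_numbers):
    """Toy contact importer: return a profile for every uploaded number
    that matches an existing user. Non-matching numbers are ignored."""
    return [USERS[n] for n in uploaded_numbers if n in USERS]


def enumerate_number_space(prefix, width):
    """Generate every number in a block, e.g. prefix '55500' with
    width 2 yields 5550000 through 5550099."""
    return [f"{prefix}{i:0{width}d}" for i in range(10 ** width)]


# The "attacker" never guesses who is a user; it uploads the whole block
# as if it were an address book and keeps whatever comes back.
candidates = enumerate_number_space("55500", 2)
harvested = import_contacts(candidates)

for profile in harvested:
    print(profile.phone, profile.name)
```

The point of the sketch is that nothing is "broken into": the importer does exactly what it was built to do, just for an address book containing every possible number.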
It appears that data protection regulators around the globe are not happy with Facebook's response, and it would not surprise me at all to see Facebook eventually hit with another fine over this situation.
However, we should be at least somewhat wary of efforts to make data scraping a violation of privacy rules.
Admittedly, Facebook is now trying to lean heavily on the distinction to get out of trouble here. In an email someone from Facebook accidentally sent to a Dutch reporter, it was revealed that Facebook was going to try to "normalize" data scraping efforts as both common and different than data breaches (Facebook later confirmed to Vice that the email was legitimate):
'In the long term we expect more scraping incidents and it is important to frame this as a sector problem and normalize that this happens regularly. To do this, the team proposes a follow-up post in the coming weeks that talks more broadly about our anti-scraping work and provides more transparency around the work we do here. This may reflect much of the scraping activity, we hope this helps normalize the fact that this is ongoing and avoid the criticism that we are not transparent about specific incidents. '
And there is some truth to the fact that this kind of thing happens regularly. There were recent reports on similar "data scrapes" from Clubhouse and LinkedIn.
And here's where things get a bit trickier. If we demonize all data scraping, that would create much bigger problems. Data scraping is actually really important for many non-nefarious situations. Hell, search engines are giant data scrapers, and that's incredibly useful. Similarly, many academic researchers use data scraping to analyze various things online (including the activities of big companies like Facebook). Demonizing all data scraping as if it's functionally equivalent to a data breach would lead to bad results.
Indeed, in many ways, being able to scrape data can help lead to a more competitive internet. As people like Cory Doctorow are pushing for competitive compatibility, some of that may include the ability to scrape data to get it out of a silo like Facebook.
Notably, Facebook and LinkedIn -- both of which have just recently faced public reports of this kind of scraping -- have, in the past, sued companies that scraped their sites, claiming it violated the Computer Fraud and Abuse Act (CFAA) as "unauthorized access." As you may recall, Facebook won its case against Power, blocking Power's attempt to create an out-of-Facebook dashboard that users could use to post content to Facebook and other social networks without having to use Facebook directly.
LinkedIn, on the other hand, lost its very similar case, because of an important distinction that Facebook is now conveniently ignoring. In the LinkedIn case, the court found that LinkedIn couldn't use the CFAA to stop scraping because the information in question was publicly available. The court distinguished it from the Power case by saying that Power was scraping (with permission from the end users) data that was not "publicly available."
And, thus, I think that a lot of people are -- somewhat cynically perhaps -- conflating a bunch of different situations to their own advantage. People who are mocking Facebook for trying to distinguish between data breaches and data scraping are perhaps going too far. There is a difference, and that difference matters. Jumping to the extreme position that data scraping is bad would lead to really bad results. The distinction that people should be making is who has authorization to access the data. So, for example, with the recent Clubhouse data scraping, all of the information that was scraped was publicly available to anyone on Clubhouse. No secret information was revealed.
Similarly, with the LinkedIn/HiQ case, the court rightly found that HiQ should be able to scrape public data from LinkedIn. That was a good ruling because scraping public data is an important factor in how the web works.
But, now let's look at the Facebook situation. In the Power case, which I think was decided incorrectly, Facebook argued that scraping data with the permission of the end user who created that data was bad hacking that violated the law. The courts agreed, but that still seems very, very wrong. The key factor here should have been that the end user gave their authorization, and thus the data scraping should have been allowed.
But with this latest situation with the over 500 million accounts... well... now it gets more complicated. This data was available because of the sloppy way in which Facebook set up its "import contacts" feature. The people who put their own data in did not authorize it to be shared that widely (or if they technically did so through convoluted terms of service, it certainly was not with the intent that that data would then be available in a database anyone could hunt through to find their personal info). Given that -- that this data was accessed without the end user's authorization -- this feels a hell of a lot worse than what was happening with Power. Yet, in Power's case, Facebook ran to the courts to claim a "breach." But with the latest story, Facebook is trying to downplay it as not a breach.
At the very least, Facebook is being cynically hypocritical.
And that worries me. I fear that too many people will not sort through the different issues at play here, and will argue that any kind of data scraping is the equivalent of a data breach. Dangerously, that could play right into Facebook's hands, allowing it to put up even larger walls, and create more impenetrable silos, that make it that much more difficult to extract your data from Facebook and move it to alternative and competing platforms.
So, yes, Facebook is being cynical and opportunistic (and inconsistent) in arguing that this scraping situation is different than a data breach. But there is an underlying kernel of truth that shouldn't be ignored. Not all data scraping is bad. It's often quite good and important. The real issue is what data is public and what is private -- and who is given the authorization to make more private data public.
If the end result of this is for regulators to say that Facebook has to lock down more data, that would be bad, as it would continue to raise the competitive barriers for new entrants, and give Facebook even more power. That doesn't mean anyone should let Facebook off the hook for its stupidly implemented contact importer, but understanding the nuances here is important.
Filed Under: cfaa, competition, data breaches, interoperability, privacy, scraping
Companies: clubhouse, facebook, linkedin