Netflix $1 Million Award Shows The Value Of Collaboration... But Kicks Up New Privacy Questions
from the good...-and-bad dept
Back in July, we wrote about how the Netflix $1 million prize showed how much further research efforts could get by collaborating, rather than hoarding. Now that the official prize has been awarded, we're hearing even more about that point:
The blending of different statistical and machine-learning techniques "only works well if you combine models that approach the problem differently," said Chris Volinsky, a scientist at AT&T Research and a leader of the Bellkor team [which won]. "That's why collaboration has been so effective, because different people approach problems differently."
Indeed. There's plenty of research out there showing the leaps that are made in innovation when people with different approaches collaborate. Yet, with so much of a focus on "patents" representing "innovation," the opposite occurs. The patent system is all about hoarding information and making it harder to collaborate by putting tollbooths in the process. Many of the final "teams" involved a whole bunch of different approaches. Imagine if each one had a patent on their method. Think of how expensive that kind of innovation would be. Then, realize that there are plenty of technologies that face that exact problem today.
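Volinsky's point about blending models that "approach the problem differently" can be illustrated with a toy example. This is only a sketch with made-up ratings, not the Bellkor team's method: the names truth, model_a, model_b, and model_c are hypothetical, and the blend is a plain average. Models A and B make errors in different places, while model C repeats A's errors exactly.

```python
import math

def rmse(predictions, truth):
    """Root-mean-square error between predicted and actual ratings."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, truth)]
    return math.sqrt(sum(squared) / len(squared))

def blend(first, second):
    """Average two models' predictions element by element."""
    return [(a + b) / 2 for a, b in zip(first, second)]

# Made-up ratings for four movies (illustrative only).
truth = [3.0, 4.0, 2.0, 5.0]

# Each model is off by exactly one star on every movie.
model_a = [4.0, 3.0, 3.0, 4.0]  # errors: +1, -1, +1, -1
model_b = [4.0, 5.0, 1.0, 4.0]  # errors: +1, +1, -1, -1 (differs from A)
model_c = list(model_a)         # same errors as A

rmse_a = rmse(model_a, truth)
rmse_ab = rmse(blend(model_a, model_b), truth)  # errors partly cancel
rmse_ac = rmse(blend(model_a, model_c), truth)  # nothing cancels

print(f"A alone: {rmse_a:.3f}, A+B: {rmse_ab:.3f}, A+C: {rmse_ac:.3f}")
```

In this toy setup, averaging A with B lowers the error because their mistakes partly cancel, while averaging A with its twin C changes nothing.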
In the meantime, Paul Ohm is raising some serious questions about people's privacy on the new Netflix Prizes that are being announced. While Netflix claims that the data is anonymized, we've seen before that anonymous datasets are almost never anonymous, and in Netflix's case, the details are pretty bad:
Although I give Netflix a pass for its past privacy breach, I am astonished to learn from the New York Times that the company plans a second act:
The new contest is going to present the contestants with demographic and behavioral data, and they will be asked to model individuals' "taste profiles," the company said. The data set of more than 100 million entries will include information about renters' ages, gender, ZIP codes, genre ratings and previously chosen movies. Unlike the first challenge, the contest will have no specific accuracy target. Instead, $500,000 will be awarded to the team in the lead after six months, and $500,000 to the leader after 18 months.
Netflix should cancel this new, irresponsible contest, which it has dubbed Netflix Prize 2. Researchers have known for more than a decade that gender plus ZIP code plus birthdate uniquely identifies a significant percentage of Americans (87% according to Latanya Sweeney's famous study.) True, Netflix plans to release age not birthdate, but simple arithmetic shows that for many people in the country, gender plus ZIP code plus age will narrow their private movie preferences down to at most a few hundred people. Netflix needs to understand the concept of "information entropy": even if it is not revealing information tied to a single person, it is revealing information tied to so few that we should consider this a privacy breach.
Ohm also points out that this prize almost certainly violates the law:
Because of this, if it releases the data, Netflix might be breaking the law. The Video Privacy Protection Act (VPPA), 18 USC 2710 prohibits a "video tape service provider" (a broadly defined term) from revealing "personally identifiable information" about its customers. Aggrieved customers can sue providers under the VPPA and courts can order "not less than $2500" in damages for each violation. If somebody brings a class action lawsuit under this statute, Netflix might face millions of dollars in damages.
Additionally, the FTC might also decide to fine Netflix for violating its privacy policy as an unfair business practice.
It seems rather surprising that Netflix's lawyers did not consider this.
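The "simple arithmetic" Ohm mentions can be sketched roughly as follows. The numbers here are assumptions for illustration, not figures from Ohm or Netflix: a hypothetical ZIP code of 30,000 residents, two gender values, and ages spread evenly over an 80-year span. Real populations are not uniform, so this only gives an order of magnitude.

```python
def expected_group_size(zip_population, gender_values, age_values):
    """Average number of residents sharing one (gender, ZIP, age) combination,
    assuming residents are spread evenly across all combinations."""
    return zip_population / (gender_values * age_values)

# Hypothetical inputs (assumptions, not data from the article).
ZIP_POPULATION = 30_000
GENDER_VALUES = 2
AGE_SPAN_YEARS = 80

# Releasing age in whole years: one age value per year of the span.
by_age = expected_group_size(ZIP_POPULATION, GENDER_VALUES, AGE_SPAN_YEARS)

# Releasing full birthdate: roughly 365 values per year of the span.
by_birthdate = expected_group_size(
    ZIP_POPULATION, GENDER_VALUES, AGE_SPAN_YEARS * 365
)

print(f"gender + ZIP + age:       about {by_age:.0f} people per group")
print(f"gender + ZIP + birthdate: about {by_birthdate:.2f} people per group")
```

Under these assumptions, age leaves a group of a couple of hundred people, while birthdate leaves less than one person per combination on average, which is consistent with the direction of both Sweeney's result and Ohm's "a few hundred" estimate.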
Filed Under: innovation, netflix prize, privacy
Companies: netflix
Reader Comments
If Blockbuster wanted to do a study on its renters, it would hire an outside firm to crunch all the numbers. It would have to provide the information to the firm. That would not violate the law. However, that firm, like those who received the information from Netflix, would be under the same legal obligations as Blockbuster/Netflix.
And ZIP code + age + gender does not reveal personally identifiable information. It reveals groups of renters. Now ZIP+4 may be a little more specific...
[ link to this | view in chronology ]
Re:
[ link to this | view in chronology ]
Re: Re:
[ link to this | view in chronology ]
Re: Re: Re:
[ link to this | view in chronology ]
Re: Re: Re: Re:
[ link to this | view in chronology ]
Re: Re: Re: Re: Re:
But how bad is it if the data can only be tracked back to "well, it's from one of these five people"? That's obviously nowhere near as bad. In fact, IMHO, there is a quantum leap in the severity of the privacy breach between a unique identification and ANY kind of uncertainty.
But, as Brooks is thinking, a 1/1 match is very bad, a 1/2 match is bad. How much better is a 1/5, etc? At which point is it actually anonymous data?
Anyway, I would argue that by using age instead of birthdate, they have reduced the likelihood of an exact match by a factor of 365 (I know, the precise stats calculations are much more complicated). That goes a long way toward protecting privacy. I think that the suggestion above, that they make it 5-year ranges, would be most acceptable, and would not significantly reduce the predictive value of the movie recommendation solutions.
[ link to this | view in chronology ]
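The 5-year-range suggestion in the comment above can be sketched like this. It is a minimal illustration on made-up records, not Netflix's data or method: the helper names age_bucket and smallest_group are hypothetical, and "smallest group" here means the fewest records sharing one combination of released fields.

```python
from collections import Counter

def age_bucket(age, width=5):
    """Map an exact age to a range label such as '30-34'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def smallest_group(records, key):
    """Size of the smallest set of records that share the same key."""
    counts = Counter(key(record) for record in records)
    return min(counts.values())

# Made-up renters as (gender, ZIP code, age); illustrative only.
records = [
    ("F", "10001", 31), ("F", "10001", 33), ("F", "10001", 34),
    ("M", "10001", 40), ("M", "10001", 42), ("M", "10001", 44),
]

# Exact ages: every renter is alone in their group.
k_exact = smallest_group(records, key=lambda r: r)

# Ages coarsened to 5-year ranges: renters merge into larger groups.
k_bucketed = smallest_group(
    records, key=lambda r: (r[0], r[1], age_bucket(r[2]))
)

print(f"smallest group with exact ages:    {k_exact}")
print(f"smallest group with 5-year ranges: {k_bucketed}")
```

In this toy data the smallest group grows from one person to three once ages are bucketed. Whether that is "anonymous enough" is exactly the open question raised in the thread.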
Re: Yes, but the defense is that the information was not revealed to the public.
[ link to this | view in chronology ]
Maybe there are privacy concerns. But that's a huge red herring that distorts the issue a lot. I mean, if they released credit card numbers and names, it would be a huge issue (but they aren't). Why include a stat and then say it's not relevant?
Me, I'm a lot less bothered by something that "could" be used to reduce anonymity to "a few hundred individuals in some zip codes." Maybe there's an interesting conversation about at what point personally identifiable becomes non-personally identifiable. But my instinct is that this doesn't cross the line, once you parse the article for what's actually happening and not how it would be if different things were happening.
[ link to this | view in chronology ]
However, I don't think that will wash with privacy advocates or, come to think of it, the RIAA's legal team.
[ link to this | view in chronology ]
Re:
First off, you have solid information for anything both people rented.
Second, it might not take a large stretch to figure out which of the two is which based on trends. (For example, someone's profession or hobby might make it easy to pick out the one renting all the music documentaries.)
Lastly, it could still leave a situation where you know someone rented either A or B, where both A and B might both be selections the person wouldn't want publicized.
The ease of identification increases the more of an outlier you are in your community. While a student on a college campus would be very hard to identify, that same student in a small suburban neighbourhood could be uniquely identified without any other info.
(Btw, I have no idea where the RIAA would factor into this.)
[ link to this | view in chronology ]
First, I would like to know if anybody knows of something bad that could actually happen to those who have been identified. Would any employer fire someone who rented "Beverly Hills Chihuahua (2008)", or would so many people make fun of him that he suffers mental distress?
Seriously what are the bad things that could happen in this case?
[ link to this | view in chronology ]
2. What is one to make of the comment in the article attributed to the team finishing second that the vast majority of collaborations were not fruitful? No one doubts that collaboration can be helpful, but by no means should it be viewed as the general rule. Sometimes it helps. Sometimes it does not. It all depends upon the circumstances and the persons involved in the collaboration.
[ link to this | view in chronology ]
Re:
Point missed, huh? The point is that patents make collaboration like this harder, not easier.
What is one to make of the comment in the article attributed to the team finishing second that the vast majority of collaborations were not fruitful? No one doubts that collaboration can be helpful, but by no means should it be viewed as the general rule. Sometimes it helps. Sometimes it does not. It all depends upon the circumstances and the persons involved in the collaboration.
Heh. Have you ever taken statistics? This is a classic statistics error. Just because most collaborations aren't fruitful doesn't mean collaboration isn't fruitful. In fact, just the opposite -- it means you should want to enable EVEN MORE collaboration to allow the good ones to get through.
[ link to this | view in chronology ]
Re: Re:
[ link to this | view in chronology ]
Re: Re:
Yes, Mike, but conversely, just because the winners happened to collaborate does not prove that the collaboration helped them win.
[ link to this | view in chronology ]
Opt Out?
[ link to this | view in chronology ]
Data CAN be further obfuscated - not a 100% solution, but still
[ link to this | view in chronology ]