Netflix $1 Million Award Shows The Value Of Collaboration... But Kicks Up New Privacy Questions

from the good...-and-bad dept

Tue, Sep 22nd 2009 4:09pm — Mike Masnick

Back in July, we wrote about how the Netflix $1 million prize showed how much further research efforts could get by collaborating, rather than hoarding. Now that the official prize has been awarded, we're hearing even more about that point:

The blending of different statistical and machine-learning techniques "only works well if you combine models that approach the problem differently," said Chris Volinsky, a scientist at AT&T Research and a leader of the Bellkor team [which won]. "That's why collaboration has been so effective, because different people approach problems differently."

Indeed. There's plenty of research out there showing the leaps that are made in innovation when people with different approaches collaborate. Yet, with so much of a focus on "patents" representing "innovation," the opposite occurs. The patent system is all about hoarding information and making it harder to collaborate by putting tollbooths in the process. Many of the final "teams" involved a whole bunch of different approaches. Imagine if each one had a patent on their method. Think of how expensive that kind of innovation would be. Then, realize that there are plenty of technologies that face that exact problem today.

In the meantime, Paul Ohm is raising some serious questions about people's privacy on the new Netflix Prizes that are being announced. While Netflix claims that the data is anonymized, we've seen before that anonymous datasets are almost never anonymous, and in Netflix's case, the details are pretty bad:

Although I give Netflix a pass for its past privacy breach, I am astonished to learn from the New York Times that the company plans a second act:
The new contest is going to present the contestants with demographic and behavioral data, and they will be asked to model individuals' "taste profiles," the company said. The data set of more than 100 million entries will include information about renters' ages, gender, ZIP codes, genre ratings and previously chosen movies. Unlike the first challenge, the contest will have no specific accuracy target. Instead, $500,000 will be awarded to the team in the lead after six months, and $500,000 to the leader after 18 months.
Netflix should cancel this new, irresponsible contest, which it has dubbed Netflix Prize 2. Researchers have known for more than a decade that gender plus ZIP code plus birthdate uniquely identifies a significant percentage of Americans (87% according to Latanya Sweeney's famous study.) True, Netflix plans to release age not birthdate, but simple arithmetic shows that for many people in the country, gender plus ZIP code plus age will narrow their private movie preferences down to at most a few hundred people. Netflix needs to understand the concept of "information entropy": even if it is not revealing information tied to a single person, it is revealing information tied to so few that we should consider this a privacy breach.
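
The "simple arithmetic" Ohm mentions can be sketched roughly as follows. This is only an illustration under loud assumptions: the ZIP code populations are hypothetical, residents are assumed to be spread evenly over two genders and an 80-year age span, and the function name `expected_matches` is made up for this example.

```python
def expected_matches(zip_population, age_span_years=80, genders=2, days_per_year=1):
    """Expected number of residents sharing one (ZIP, gender, age) combination.

    Assumes residents are spread evenly across genders and ages, which is a
    simplification. Set days_per_year=365 to model a full birthdate
    instead of an age in years.
    """
    combinations = genders * age_span_years * days_per_year
    return zip_population / combinations


# Hypothetical ZIP code populations, chosen only for illustration.
for population in (5_000, 30_000, 100_000):
    by_age = expected_matches(population)
    by_birthdate = expected_matches(population, days_per_year=365)
    print(f"{population:>7} residents: "
          f"~{by_age:.1f} share gender and age, "
          f"~{by_birthdate:.2f} share gender and birthdate")
```

Under these assumptions, a ZIP code of 30,000 people leaves about 187.5 people per gender-and-age combination, while a full birthdate leaves fewer than one. Real populations are not evenly spread, so outliers (the very old, or residents of small ZIP codes) would fall into much smaller groups.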

Ohm also points out that this prize may well violate the law:

Because of this, if it releases the data, Netflix might be breaking the law. The Video Privacy Protection Act (VPPA), 18 USC 2710 prohibits a "video tape service provider" (a broadly defined term) from revealing "personally identifiable information" about its customers. Aggrieved customers can sue providers under the VPPA and courts can order "not less than $2500" in damages for each violation. If somebody brings a class action lawsuit under this statute, Netflix might face millions of dollars in damages.

Additionally, the FTC might also decide to fine Netflix for violating its privacy policy as an unfair business practice.

It seems rather surprising that Netflix's lawyers did not consider this.

Filed Under: innovation, netflix prize, privacy
Companies: netflix

Reader Comments

Anonymous Coward, 22 Sep 2009 @ 4:28pm

Yes, but the defense is that the information was not revealed to the public. The information, if indeed it would have violated the act, was revealed only to employees and contractors.

If Blockbuster wanted to do a study on its renters, it would hire an outside firm to crunch all the numbers. It would have to provide the information to the firm. That would not violate the law. However, that firm, like those who received the information from Netflix, would be under the same legal obligations as Blockbuster/Netflix.

And zipcode+age+gender is not revealing of personally identifiable information. It is revealing of groups of renters. Now zip+4 may be a little more specific...
- Michael Kirkland, 22 Sep 2009 @ 5:48pm
  
  Re:
  Zipcode+age+gender is enough to personally identify ~90% of Americans, so yes.
  - Brooks (profile), 22 Sep 2009 @ 5:54pm
    
    Re: Re:
    No. Zipcode + gender + birthdate identifies 87% of Americans. Age != Birthdate.
    - Anonymous Coward, 22 Sep 2009 @ 6:07pm
      
      Re: Re: Re:
      I wonder if changing it from age to age group (5-10 year ranges) would be enough to satiate those with privacy concerns... Could probably work out something similar with zip codes
      - Brooks (profile), 22 Sep 2009 @ 6:34pm
        
        Re: Re: Re: Re:
        Yep, but I guess the question is at what point it becomes "anonymous", and that's going to be a matter of opinion. Is it anonymous if I can say it's one out of these 100 people? 1 out of 1,000? 100,000?
        
        Derek Kerton (profile), 24 Sep 2009 @ 12:00am
        
        Re: Re: Re: Re: Re:
        Well, the big problem with Zipcode + gender + birthdate is that it uniquely identifies 87% of people. Uniquely identifying anyone is a problem.
        
        But how bad is it if the data can be tracked back to "well, it's from one of these five people"? That's obviously nowhere near as bad. In fact, IMHO, there is a quantum leap in difference of privacy breach between uniquely identified, and ANY kind of uncertainty.
        
        But, as Brooks is thinking, a 1/1 match is very bad, a 1/2 match is bad. How much better is a 1/5, etc? At which point is it actually anonymous data?
        
        Anyway, I would argue that by using age instead of birthdate, they have reduced the likelihood of an exact match by a factor of 365 (I know, the precise stats calculations are much more complicated.) That goes a long way to protecting privacy. I think that the suggestion above, that they make it 5-year ranges, would be most acceptable, and would not significantly reduce the predictive value of the movie recommendation solutions.
- Haelian, 22 Sep 2009 @ 7:42pm
  
  Re: Yes, but the defense is that the information was not revealed to the public.
  That's not true. The information was available to anyone who wanted to take part in the contest and was easily downloaded via the Netflix website.
Brooks (profile), 22 Sep 2009 @ 5:33pm

So I was totally with this article, and that 87% figure grabbed me -- and then we go on to say that, actually, it's a totally meaningless figure because it's based on information that *won't* be released.

Maybe there are privacy concerns. But that's a huge red herring that distorts the issue a lot. I mean, if they released credit card numbers and names, it would be a huge issue (but they aren't). Why include a stat and then say it's not relevant?

Me, I'm a lot less bothered by something that "could" be used to reduce anonymity to "a few hundred individuals in some zip codes." Maybe there's an interesting conversation about at what point personally identifiable becomes non-personally identifiable. But my instinct is that this doesn't cross the line, once you parse the article for what's actually happening and not how it would be if different things were happening.
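
One way to put a number on the "at what point" question is to measure the smallest group of records that share the same identifying fields, the idea usually called k-anonymity. The sketch below is a minimal illustration, not anything Netflix is known to do: the records are invented, and the helper names `smallest_group`, `exact_age`, and `age_band` are hypothetical.

```python
from collections import Counter


def smallest_group(records, key):
    """Size of the smallest group of records that share key(record).

    This is the k in k-anonymity: every record is indistinguishable
    from at least k - 1 others on the chosen fields.
    """
    counts = Counter(key(record) for record in records)
    return min(counts.values())


def exact_age(record):
    """Identify a record by ZIP code, gender, and age in years."""
    return (record["zip"], record["gender"], record["age"])


def age_band(record, width=5):
    """Identify a record by ZIP code, gender, and a coarse age band."""
    band_start = (record["age"] // width) * width
    return (record["zip"], record["gender"], band_start)


# Invented records, for illustration only.
records = [
    {"zip": "00001", "gender": "F", "age": 31},
    {"zip": "00001", "gender": "F", "age": 33},
    {"zip": "00001", "gender": "F", "age": 34},
    {"zip": "00001", "gender": "M", "age": 40},
    {"zip": "00001", "gender": "M", "age": 42},
    {"zip": "00001", "gender": "M", "age": 44},
]

print("k with exact age:", smallest_group(records, exact_age))
print("k with 5-year bands:", smallest_group(records, age_band))
```

In this toy data every renter is unique on exact age, while 5-year bands put each renter in a group of three. Where the acceptable k lies is still the matter of opinion discussed here; the code only measures it.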
Big Al, 22 Sep 2009 @ 6:33pm

Technically, as soon as the selection criteria are broad enough to encompass two individuals, the information is not personally identifiable in that there is still an element of doubt as to which of the individuals is being referred to.
However, I don't think that will wash with privacy advocates or, come to think of it, the RIAA's legal team.
- scarr (profile), 22 Sep 2009 @ 10:00pm
  
  Re:
  That isn't true.
  
  First off, you have solid information for anything both people rented.
  
  Second, it might not take a large stretch to figure out which of the two is which based on trends. (For example, someone's profession or hobby might make it easy to pick out the one renting all the music documentaries.)
  
  Lastly, it could still leave a situation where you know someone rented either A or B, where both A and B might both be selections the person wouldn't want publicized.
  
  The ease of identification increases the more of an outlier you are in your community. While a student on a college campus would be very hard to identify, that same student in a small suburban neighbourhood could be uniquely identified without any other info.
  
  (Btw, I have no idea where the RIAA would factor into this.)
Anonymous Coward, 22 Sep 2009 @ 7:53pm

I'm not sure this is really a big deal for privacy.
First I would like to know if anybody knows about something bad that could potentially happen to those that have been identified. Would any employer fire someone who rented "Beverly Hills Chihuahua (2008)"? Or would so many people make fun of him that he suffers mental distress?

Seriously what are the bad things that could happen in this case?
Anonymous Coward, 22 Sep 2009 @ 8:46pm

1. This matter has nothing to do with patents. Thus, the gratuitous reference seems misplaced.

2. What is one to make of the comment in the article attributed to the team finishing second that the vast majority of collaborations were not fruitful? No one doubts that collaboration can be helpful, but by no means should it be viewed as the general rule. Sometimes it helps. Sometimes it does not. It all depends upon the circumstances and the persons involved in the collaboration.
- Mike Masnick (profile), 22 Sep 2009 @ 11:11pm
  
  Re:
  This matter has nothing to do with patents. Thus, the gratuitous reference seems misplaced.
  
  Point missed, huh? The point is that patents make collaboration like this harder, not easier.
  
  What is one to make of the comment in the article attributed to the team finishing second that the vast majority of collaborations were not fruitful? No one doubts that collaboration can be helpful, but by no means should it be viewed as the general rule. Sometimes it helps. Sometimes it does not. It all depends upon the circumstances and the persons involved in the collaboration.
  
  Heh. Have you ever taken statistics? This is a classic statistics error. Just because most collaborations aren't fruitful doesn't mean collaboration isn't fruitful. In fact, just the opposite -- it means you should want to enable EVEN MORE collaboration to allow the good ones to get through.
  - Anonymous Coward, 23 Sep 2009 @ 8:26am
    
    Re: Re:
    Since I saw no mention in the article about patents, bringing them up does seem to be a gratuitous reference. Had patents posed a problem I would have expected at least some mention, and yet the article contains nary a word.
  - Griff (profile), 25 Sep 2009 @ 5:21am
    
    Re: Re:
    Just because most collaborations aren't fruitful doesn't mean collaboration isn't fruitful.
    
    Yes, Mike, but conversely, just because the winners happened to collaborate does not prove the collaboration helped them win.
Another AC, 23 Sep 2009 @ 6:01am

Opt Out?
They should allow you to opt out of these studies. Alternatively, all they ask for is your birth year; change that by a year or two and they would never find you.
Mike Orr, 23 Sep 2009 @ 9:50am

Data CAN be further obfuscated - not 100% solution, but still
Netflix can/should maybe use forms of obfuscation to increase the "anonymity factor". For example, while it makes sense that having a common zipcode is an important attribute, it (probably) does not matter WHICH zipcode it is, so all zipcodes can be scrambled in some uniform manner.
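
A minimal sketch of what scrambling zipcodes "in some uniform manner" could look like, assuming a keyed hash is an acceptable scrambler. The function name `scramble_zip`, the secret key, and the 12-character token length are all hypothetical choices for illustration; nothing here reflects what Netflix actually does.

```python
import hashlib
import hmac


def scramble_zip(zip_code: str, secret_key: bytes) -> str:
    """Map a ZIP code to a stable token that does not reveal the ZIP code.

    The same ZIP code always yields the same token under the same key,
    so renters who share a ZIP code still share a token.
    """
    digest = hmac.new(secret_key, zip_code.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]  # shortened for readability (an assumption)


# Hypothetical key; a real one would be random and never published.
key = b"example-secret-key"

print(scramble_zip("00001", key))
print(scramble_zip("00001", key))  # same ZIP code, same token
print(scramble_zip("00002", key))  # different ZIP code
```

This keeps the "common zipcode" signal while hiding which zipcode it is. It does not change how many people share each token, so the group sizes behind the privacy concern stay exactly the same.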