Faux Randomness Strikes Again: How Researchers Realized Research 2000's Daily Kos Data Looked Faked
from the random-ain't-so-random dept
You may have heard by now that the political website Daily Kos has come out and explained that it believes the polling firm it had been using for a while, Research 2000, was faking its data. While it's nice to see a publication come right out and bluntly admit that it had relied on data it now believes was not legit, what's fascinating if you're a stats geek is how a team of stats geeks figured out there were problems with the data. As any good stats nerd knows, the concept of "randomness" isn't quite as random as some people think, which is why faking randomness almost always leaves tell-tale signs that the data was faked or manipulated. For example, one very common test is to apply Benford's Law to the first digits of the numbers in a data set: in naturally occurring data, low leading digits dominate (a 1 shows up as the first digit about 30% of the time), not the even spread people usually expect, and fabricated numbers tend to miss that pattern.

In this case, the three guys who had problems with the data (Mark Grebner, Michael Weissman, and Jonathan Weissman) zeroed in on just a few clues that the data was faked or manipulated. The first thing they noticed was an odd pattern in R2K's polls asking men and women whether they viewed certain politicians or political parties favorably or unfavorably: if the percentage of men rating a politician favorable was an even number, so was the percentage of women, and if the men's number was odd, so was the women's. Yet, as you should know, these are independent variables, not influenced by each other. That 34% of men find a particular politician favorable should have no bearing on whether the percentage of women who do is even or odd. In fact, the parities matched in almost every such poll R2K did, to a degree that is about as close to impossible as you can imagine.
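To get a sense of just how impossible, here's a quick back-of-the-envelope check (my own sketch in Python, not the researchers' code) of the 776-out-of-778 figure quoted below: if the two parities really were independent, each poll would be roughly a fair coin flip, and the chance of that many matches is a simple binomial tail.

from math import comb, log10

n, k = 778, 776  # polls examined, parity matches observed (figures from the quote below)

# P(at least k matches in n fair coin flips)
tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

print(f"P(>= {k} of {n} matches) is about 10^{log10(tail):.0f}")
# prints something on the order of 10^-229, i.e. less than one time in 10^228,
# matching the figure in the analysis quoted below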
Here's how Grebner, Weissman, and Weissman put it:

Common sense says that that result is highly unlikely, but it helps to do a more precise calculation. Since the odds of getting a match each time are essentially 50%, the odds of getting 776/778 matches are just like those of getting 776 heads on 778 tosses of a fair coin. Results that extreme happen less than one time in 10^228. That's one followed by 228 zeros. (The number of atoms within our cosmic horizon is something like 1 followed by 80 zeros.) For the Unf, the odds are less than one in 10^231. (Having some Undecideds makes Fav and Unf nearly independent, so these are two separate wildly unlikely events.)

There is no remotely realistic way that a simple tabulation and subsequent rounding of the results for M's and F's could possibly show that detailed similarity. Therefore the numbers on these two separate groups were not generated just by independently polling them.

The other statistical analysis that I found fascinating was that when you looked at weekly changes in favorability ratings, the R2K data almost always changed a bit. But, if you look at other data, no change is the most common result. As they point out, if you look at, say, Gallup data, you get a nice, typical bell curve of weekly changes centered on zero, while the R2K numbers show a conspicuous gap right at zero:

How do we know that the real data couldn't possibly have many changes of +1% or -1% but few changes of 0%? Let's make an imaginative leap and say that, for some inexplicable reason, the actual changes in the population's opinion were always exactly +1% or -1%, equally likely. Since real polls would have substantial sampling error (about +/-2% in the week-to-week numbers even in the first 60 weeks, more later) the distribution of weekly changes in the poll results would be smeared out, with slightly more ending up rounding to 0% than to -1% or +1%. No real results could show a sharp hole at 0%, barring yet another wildly unlikely accident.

Kos is apparently planning legal action, and so far R2K hasn't responded in much detail other than to claim that its polls were conducted properly. I'm not all that interested in that part of the discussion, however. I just find it neat how the "faux randomness" may have exposed the problems with the data.
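For the stats geeks who want to see that second argument in action, here's a tiny simulation. It's my own sketch with made-up parameters (opinion always moves exactly one point a week, with roughly a point of sampling noise on each weekly estimate), not the researchers' model or R2K's actual data.

import random
from collections import Counter

random.seed(0)
trials = 200_000   # simulated pairs of consecutive weekly polls
noise_sd = 1.0     # assumed per-week sampling error, in percentage points

changes = Counter()
for _ in range(trials):
    true_now = 50.0
    true_next = true_now + random.choice([-1.0, 1.0])            # opinion really moves +/-1 point
    reported_now = round(true_now + random.gauss(0, noise_sd))   # poll it, then round to whole percent
    reported_next = round(true_next + random.gauss(0, noise_sd))
    changes[reported_next - reported_now] += 1

for delta in sorted(changes):
    print(f"{delta:+d} points: {changes[delta] / trials:.1%}")
# Even though the underlying opinion never stays put, a change of 0 is still the
# single most common result in the rounded poll numbers (roughly 21%, vs about
# 19% each for +1 and -1); real tracking data can't show a sharp hole at zero.

Shrink the sampling noise enough and the +/-1 bars can overtake zero, but with the kind of week-to-week error the researchers describe (a couple of points), a sharp hole at exactly 0% just doesn't happen by accident.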
Filed Under: data, faked, random
Companies: dailykos, research 2000
Reader Comments
Look at the poll results
Re: Look at the poll results
Update
http://www.fivethirtyeight.com/2010/06/research-2000-issues-cease-desist.html
So - if you do a statistical analysis of the problems with a company, and publish those results, you may damage "the company's reputation and the company's existing and prospective business relationships." Actually, I would imagine that is exactly what will happen -- but it sure isn't illegal to do that.
Re: Update
Seriously, though - Howrey? R2K is lawyering-up big time. Hope that Kos and Nate have good lawyers of their own.
Re: Re: Update
The fact that you (and R2K) think that hiring expensive lawyers could have any possible effect on the outcome here shows how ridiculous the legal system is.
A lawyer giving a guilty criminal the best possible defence is one thing.
A Lawyer standing up in court to defend the proposition that 2+2=5 is quite another.
One would hope that the expensive lawyers would just tell R2K that they have no chance and refuse to take the case - but of course they won't.
Re: Update
I would think the damage was already done.
LOL
I wouldn't say they are independent variables; after all, both men and women are people and have similar brains (I know, I know, I'm going to get criticized for this one. Men are from Mars and women are from Venus, blah blah blah). Still, if they are really, really close to each other that consistently, it is certainly suspicious.
Re:
The oddness/evenness of the percentages certainly is independent, because it depends on your choice of percentages (base 10, two significant digits) as the mechanism for expressing the data.
That choice (although standard) is certainly independent of the data itself and effectively decouples the two pieces of data from each other in that respect.
Supposing the sample was 100000 of each gender and the actual numbers were 66124 and 26223. If you choose percentages both are even, if you choose parts per thousand (3 digits) one is even and one is odd. If you choose parts per 10000 (4 digits) then both are even again.
Whilst you are correct in saying that the actual data are interdependent, that particular aspect of the statistical expression of the data is independent.
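For what it's worth, a quick check of those hypothetical counts (66,124 and 26,223 out of 100,000) at the three scales bears this out; this is just an illustrative sketch:

men, women, sample = 66124, 26223, 100000

for digits, label in [(2, "percent"), (3, "per thousand"), (4, "per 10,000")]:
    m = round(men / sample * 10 ** digits)
    w = round(women / sample * 10 ** digits)
    parity = "same" if m % 2 == w % 2 else "different"
    print(f"{label}: {m} and {w} ({parity} parity)")
# percent: 66 and 26 (same parity)
# per thousand: 661 and 262 (different parity)
# per 10,000: 6612 and 2622 (same parity)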
not even competent liars
Re: not even competent liars
improper design
Alternative explanation
This would explain both the odd/even coincidence and the data changing by at least 1% per week way too often.
The rest of the story
Re: The rest of the story
Re: The rest of the story
Non-random hysteria
See, now, just like those R2K people, you're throwing facially bad data at us and then aggressively asserting unsupported conclusions supposedly made obvious by that data, and you're expecting us to . . . what? . . . sign on as your Mouseketeers?
No, not all polls are BS. Just today, I asked my two kids if they would enjoy raw oysters for breakfast. They both said no. (Really loudly, too.) Served 'em anyway, and, Lo!, the poll turned out to be completely accurate! And that's all it took to disprove your data.
And as for your conclusion - well, maybe I will, but it's going to depend on things like the gender of the pollster at hand, their relative desirability, their willingness and interest . . . point is, your assertion about "all polls" is not going to have one iota of influence on how part two comes out. (Ooo, bad pun. Sorry.)
Re: Non-random hysteria
Strictly that wasn't actually a poll (in the usual sense of opinion poll) - since your sample was the whole population.
One could argue the meanings of words but...
Re: Re: Non-random hysteria
Lack of zero changes
Re: Lack of zero changes
Re: Lack of zero changes
But the changes are not being rounded - it is the values that are being rounded. The changes are derived from the values that are already rounded. So that explanation does not work.
You don't need to be a statistician to know this was crap. I trained as a chemist and know squat about stats, but had I ever looked at this data (not being a Kossack I am never over there), I'd have seen it in an instant.
All it takes is experience in the real world (in business, investing, whatever) to know that these data are obviously faked (or run through some totally bogus algorithm that forces the odd/even results [though why there were a couple of counterexamples if it was an honest algorithm error is pretty curious]).
Re W Klink @ 28
Markov, anyone?...
Re: Re W Klink @ 28
Re: Re W Klink @ 28
To your question, I think good understanding of probability is really rare. Almost non-existent in the general American populace IMO, and possibly a minority even among geeks.
Smarter crooks