The Problem With Too Much Data: Mistaking The Signal For The Noise
from the quantity-over-quality dept
The NSA can't get enough data, as is evidenced by its shiny, new data center and its multiple efforts to either bypass laws entirely or have them rewritten in its favor. General Alexander, in particular, wants all the data. Everything. And as Mike covered earlier, he's not shy about grabbing the data first and worrying about the legality later.
In his enthusiastic pursuit of more data, Alexander seems to have bypassed any sort of confirmation that adding more data is helpful. Here's one issue the indiscriminate data harvesting raised.
“He had all these diagrams showing how this guy was connected to that guy and to that guy,” says a former NSA official who heard Alexander give briefings on the floor of the Information Dominance Center. “Some of my colleagues and I were skeptical. Later, we had a chance to review the information. It turns out that all [that] those guys were connected to were pizza shops.”
Tons of noise, or rather, tons of dots, the kind intelligence leaders seem to believe we're still short on. Alexander certainly liked connecting dots, but seemed unconcerned if the resulting picture was completely unintelligible.
Under Alexander's leadership, one of the agency's signature analysis tools was a digital graph that showed how hundreds, sometimes thousands, of people, places, and events were connected to each other. They were displayed as a tangle of dots and lines. Critics called it the BAG -- for "big ass graph" -- and said it produced very few useful leads.
When you have tons of data, you have to filter out the noise if you're going to use it in any meaningful way. Alexander may have learned from the previous experience that while many terrorists may purchase pizzas, not everyone who purchases pizza is a terrorist. Hence the first level of "auditing," as Marcy Wheeler points out at emptywheel.
As I noted last month, the NSA’s primary order for the Section 215 program allows for technical personnel to access the data, in unaudited form, before the analysts get to it. They do so to identify “high volume identifiers” (and other “unwanted BR metadata”). As I said, I suspect they’re stripping the dataset of numbers that would otherwise distort contact chaining.
Separating the signal from the noise is the first step for working with any large data set. But the NSA's separation step operates under the assumption that every number with an inordinate number of hits is just noise. If the NSA is now stripping out eateries as possible connectors, it could very well be filtering out links to terrorists. Wheeler goes back through the series of missed connections by intelligence and law enforcement agencies that were uncovered after the Boston bombing.
I suspect a lot of what these technical personnel are doing is stripping numbers — probably things like telemarketer numbers — that would otherwise distort the contact chaining... I used telemarketers, but Alexander himself has used the example of the pizza joint in testimony.
In other words, it appears Alexander learned from his mistake at INSCOM that pizza joints do not actually represent a meaningful connection. His use of the example seems to suggest that NSA now strips pizza joints from their dataset.
I also suspect there may be one gaping hole in the NSA’s data relating to the Tsarnaevs: any calls and connections through Gerry’s Italian Kitchen.
Here's where the NSA's collection activities become a damned-if-you-do, damned-if-you-don't situation. Leave the pizza places in and everyone is linked to terrorists. Take them out and you delete helpful connections. The agency will probably point to the need to access more data, in order to somehow further filter the previously collected data. It has most likely already devoted several million dollars towards solving this conundrum -- more analysts, more tools, more data. The one thing it hasn't considered, apparently, is the simplest solution: targeted collections.
Gerry’s was, if you recall, the pizza joint involved in the 2011 murder in Waltham: the three men were killed sometime between ordering a pizza and its delivery 45 minutes later. I’ve been told both Tsarnaevs had delivered pizza for that restaurant before then and Tamerlan may still have been.
But Gerry’s is also where the brothers disposed of some of their explosives the night of the manhunt, and it may well have been what brought them to Watertown.
So a connection to the brothers going back years when they worked there, a connection to the 2011 murder, and a connection (however tangential) to the manhunt. Yet (I’m guessing here) any ties the brothers had through that pizza joint would not show up in the dragnet collected precisely for that purpose, because such data is purged because normally pizza joints don’t reflect a meaningful relationship.
[B]ecause this was a dragnet, rather than a collection of the brothers’ calls, this pizza connection may have been hidden entirely in the data.
The continuous, ever-increasing flow of data into the NSA's haystacks has just as much of a chance to bury useful connections as it has to bring them to light. Intelligence agencies don't care much for targeted data acquisition, preferring to pick it up in bulk "just in case."
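To make the trade-off concrete, here is a minimal Python sketch of contact chaining over a toy call graph, with and without a "high volume identifier" filter. Everything in it is an illustrative assumption: the function names, the degree cutoff, and the made-up identifiers are hypothetical, and nothing here reflects how the NSA's actual tooling works.

```python
from collections import defaultdict, deque

def build_graph(calls):
    """Build an undirected contact graph from (caller, callee) pairs."""
    graph = defaultdict(set)
    for a, b in calls:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def strip_high_volume(graph, max_degree):
    """Drop every identifier with more than max_degree distinct contacts.

    The cutoff is a hypothetical stand-in for "high volume identifier".
    """
    keep = {n for n, nbrs in graph.items() if len(nbrs) <= max_degree}
    return {n: graph[n] & keep for n in keep}

def contact_chain(graph, seed, max_hops):
    """Return {identifier: hop count} for everything within max_hops of seed."""
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for nbr in graph.get(node, ()):
            if nbr not in hops:
                hops[nbr] = hops[node] + 1
                queue.append(nbr)
    return hops

# Toy data: one pizza shop called by two suspects and three bystanders.
calls = [
    ("suspect_a", "pizza_shop"),
    ("suspect_b", "pizza_shop"),
    ("neighbor_1", "pizza_shop"),
    ("neighbor_2", "pizza_shop"),
    ("neighbor_3", "pizza_shop"),
    ("suspect_a", "friend"),
]
graph = build_graph(calls)

# Pizza shop left in: every customer is two hops from suspect_a.
unfiltered = contact_chain(graph, "suspect_a", max_hops=2)

# Pizza shop stripped: the bystanders vanish, but so does suspect_b.
filtered = contact_chain(strip_high_volume(graph, max_degree=3),
                         "suspect_a", max_hops=2)
```

In this toy graph the unfiltered chain sweeps in all three neighbors, while the filtered chain keeps only `friend` and silently loses the one link between the two suspects. No cutoff value fixes both problems at once.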
It's as though the collection of data is its own end. I suppose the only "fortunate" aspect of this dragnet is that it's occurring in a digital age, thus keeping the NSA's data centers from looking like interior shots of a particularly horrific episode of "Hoarders." The theory is that this will prevent terrorist attacks. But in practice, it keeps looking as if our intelligence agencies could be just as ineffective with half the data.
Filed Under: big data, connections, data, keith alexander, nsa, nsa surveillance, too much data
Reader Comments
Next in the news: FBI marks pizza shops as probable terrorism dens. Along with mosques and pet shops.
[ link to this | view in chronology ]
Re:
Also, since we are spying on people with connections to terrorists, we can't ignore the phone companies as sources of linkages. Terrorists use AT&T, and AT&T has a business relationship with Verizon; therefore, all Verizon customers are linked to terrorists with 3 links or fewer.
I feel dirty just for writing that. I wish it weren't so representative of how the government appears to be thinking.
[ link to this | view in chronology ]
Inspiration..
[ link to this | view in chronology ]
Re: Inspiration..
[ link to this | view in chronology ]
The NSA & too much data
[ link to this | view in chronology ]
graphing the relationship...
With over 6 BILLION people to cross-reference and relate to each other...
well... you'd be better off analyzing a relationship diagram of Ranma 1/2 or Negima! than what the NSA is planning to make...
But one thing is for sure... a lot of people will have an "Annoyed at" relationship with anyone at the NSA right now because of all this mass relationship diagramming...
They should have tried graphing their own ancestry instead of this...
[ link to this | view in chronology ]
Wonderful News!
[ link to this | view in chronology ]
No Restaurants?
The "mob" used Italian restaurants exclusively! I guess no one in the NSA watched The Sopranos.
[ link to this | view in chronology ]
Statistics 101, people!
[ link to this | view in chronology ]
Re: Statistics 101, people!
But that makes as much sense as...
"The absence of evidence is not the evidence of absence."
[ link to this | view in chronology ]
Six Degrees of Separation
Keith Alexander does not appear to appreciate this, and this makes his approach to using data very, very dangerous.
[ link to this | view in chronology ]
Re: Six Degrees of Separation
[ link to this | view in chronology ]
The Pizza Connection is real.
[ link to this | view in chronology ]
Even direct, in-person connections are usually completely innocuous, e.g. a neighbor, college roommate, brother or sister, etc.
At best you may find a "potential" criminal by cross-checking connectivity maps of two or more known criminals (especially those who aren't directly connected to each other). Someone who is closely connected (within 2 jumps, not 3) to multiple known criminals would be rather suspect, and may warrant further (non-intrusive) investigation. I say non-intrusive because we do not (ostensibly) believe in guilt by association. Innocent until proven guilty, and all that good stuff.
Then again, I'm not sure I should be giving the NSA tips on how to use the mountains of data they've illegally (or, at least, unethically) obtained. I've always had a fascination with large data sets, though. In another reality, it's possible I could have been working for the NSA on just that sort of thing (and hopefully have had the courage to pull a Snowden).
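For the curious, the cross-checking idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph, the names, and both helper functions are hypothetical, and a hit from something like this would be a prompt for further (non-intrusive) checking, not evidence of anything.

```python
from collections import Counter

def within_hops(graph, seed, max_hops):
    """Return the nodes reachable from seed in at most max_hops steps (seed excluded)."""
    frontier, seen = {seed}, {seed}
    for _ in range(max_hops):
        frontier = {nbr for node in frontier for nbr in graph.get(node, ())} - seen
        seen |= frontier
    return seen - {seed}

def shared_contacts(graph, seeds, max_hops=2, min_seeds=2):
    """Map each non-seed node to the number of seeds it sits within max_hops of,
    keeping only nodes close to at least min_seeds seeds."""
    counts = Counter()
    for seed in seeds:
        counts.update(within_hops(graph, seed, max_hops) - set(seeds))
    return {node: n for node, n in counts.items() if n >= min_seeds}

# Toy graph: two known criminals who are not directly connected.
graph = {
    "crook_1": {"go_between", "roommate"},
    "crook_2": {"courier"},
    "courier": {"crook_2", "go_between"},
    "go_between": {"crook_1", "courier"},
    "roommate": {"crook_1"},
}
flagged = shared_contacts(graph, ["crook_1", "crook_2"])
```

Here `roommate` is one jump from a single known criminal and is not flagged, while `go_between` and `courier` each sit within two jumps of both. Raising `max_hops` to 3 in a realistic graph would flag a large share of the population, which is the whole problem.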
[ link to this | view in chronology ]
Nintendo FTW
[ link to this | view in chronology ]
/s
[ link to this | view in chronology ]
The patterns between any two events will be subtly different. Those subtle differences will lead to finding more than you can handle or missing what you need.
[ link to this | view in chronology ]
Why we'll never have real artificial intelligence
Hmmm.... Maybe that is a key reason why AI research is stalled right now. "What were you thinking? We didn't mean REAL intelligence! Your research funding has been cut until we get more positive (to us) results."
[ link to this | view in chronology ]
Re: Why we'll never have real artificial intelligence
The first world is the complex, scary world of big data where every "conclusion" is actually just the first piece of a larger puzzle and must be tested several different ways before being given any credibility. Almost everything is just shades of gray.
The second world is that of producing pictures and useless "find bad stuff instantly" tools for executives and tourists who can't be troubled to think and won't accept that this is an impossible goal. The best outcome a good analyst can hope for is that no one treats the pictures or buttons as actionable information.
[ link to this | view in chronology ]
Fixed
In strategy games, playing against the computer AI, I noticed that if I built a closed castle, the opposing armies would bring siege engines to breach the gates or the walls (whichever was weakest). Yet, if I left the gates open and turned the courtyard into a killzone, they'd happily rush their armies in to get mulched.
It turns out that humans (real intelligence) often make this mistake as well.
Similarly, we'll never have artificial intelligence that can discern terrorist activity from benign communication because real intelligence cannot agree which is which, much in the way judges cannot discern when erotic artistic media ends and porn begins.
A crossword compiler for The Daily Telegraph was questioned by British intelligence in 1944 after too many code words from Operation Overlord (coincidentally) turned up in his puzzles.
[ link to this | view in chronology ]
Refer to the spying on Brazil's president, Brazil's largest oil corporation, Bradley Manning, and Edward Snowden's exile in Russia as proof of all three.
There are many more examples, of course. Edward Snowden understood what the oppressive global spying apparatus is really about.
He did his best to steer humanity away from its corrupt iron grip. We should attempt to do the same. Otherwise freedom will be lost, possibly forever.
[ link to this | view in chronology ]
Time and Timing
Everyone in this thread could be seen as connected, even though we've never met, or spoken to each other. We now share this meta connection.
Ditto everyone who has ever clicked a url, say as an entry in the mother of all meta nodes - google search, to read an article printed by this web site.
Then there would be our online 'Trolls' to consider. Trolls could be seen as 'meta nodes'. Anyone who has ever encountered these online trolls will know they often have very wide agendas. Which means as nodes, anyone they target could be treated as if they were connected, just by virtue of who they have in common.
Meantime these trolls as meta nodes and functionaries of this system, would remain invisible as the cause or context for those connections.
Connections alone won't tell the whole story. You would have to look at the frequency of those connections, as well as their context, in order to judge the relevance of that information.
As is often the case, we define x by what we seek. That limited set of attributes defining x could mean we fail to see other aspects of the information which might contradict our conclusions. In other words, our answers are only as good as the questions asked, which are only as good as the attributes of information recorded. A lot of data doesn't mean a lot of useful data. Or, put another way, sometimes you want information in that data that allows you to exclude a particular result.
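As a rough illustration of the frequency point, here is a small Python sketch that counts how often each pair of identifiers shows up and keeps only the repeated links. The record format, the names, and the `min_count` threshold are all assumptions made up for the example, and frequency alone still says nothing about context.

```python
from collections import Counter

def contact_frequency(records):
    """Count how often each unordered pair of identifiers appears."""
    return Counter(frozenset(pair) for pair in records)

def frequent_links(records, min_count):
    """Keep only the pairs seen at least min_count times."""
    freq = contact_frequency(records)
    return {tuple(sorted(pair)): n for pair, n in freq.items() if n >= min_count}

# Hypothetical (caller, callee) records; direction is ignored when counting.
records = [
    ("alice", "bob"), ("bob", "alice"), ("alice", "bob"),
    ("alice", "pizza_shop"),
    ("carol", "pizza_shop"),
]
links = frequent_links(records, min_count=2)
```

With these records only the alice and bob pair survives, with a count of 3; the one-off pizza orders drop out. A repeated link can still be a weekly pizza habit, which is why the context has to be recorded as well.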
[ link to this | view in chronology ]