The Failings of Social Network Analysis

If you watched Britz this week you’ll have seen the MI5 chaps and chappesses poring over really complicated social-network diagrams of the jihadi menace. And this has indeed been a boom academic industry since 2001; after Valdis Krebs’s seminal paper, in which he mapped the relationships of the September 11 plotters, there were all kinds of people hoping to win the war with an interesting PowerPoint presentation. Anyone remember Steven Emerson’s Investigative Project?

In fact, a lot of this wasn’t particularly new; the “new software” usually turned out to be Analyst’s Notebook, which I recall seeing in 1996 or thereabouts, and most of the folk involved outside academia eventually merged into the general wingnutosphere. Anyway, let’s cut to the data.

This diagram is from the Namebase research project, and can be seen in its natural habitat with much more functionality here. It’s a diagram of Viktor Bout’s social network, or at least as much of it as could be deduced from their pile of newspaper articles. And frankly, it’s useless. If you zoom in on it, you’ll notice that Sergei Bout, his brother and the founder of the biggest of his holding companies, CET Aviation of Malabo, is far more distant from Viktor than either Alex Vines, a Global Witness staffer, or Paul Vixie.

Who is this man? Is he a terrorist? If you’re reading this on a Linux, Unix or MacOS machine, he wrote great chunks of the operating system; as well as this, his accomplishments include a huge range of things in the field of Internet engineering and operations, networking theory, and the like. It’s just good news that the Namebase doesn’t scrape blogs, or I’d be right in the fucking middle!

So when we look at the NSA wiretapping project (rather, I should say, its data-mining project, as it involved the statistical analysis of call detail records, or CDRs, rather than the interception of calls) we need to ask ourselves how it could ever possibly work. The plan appears to have been to pull all the CDRs for a list of suspects, then pull all the CDRs for the phone numbers that appeared in them, and then for the numbers in those, then plot the whole huge pile o’files on a network diagram. It sounds convincing until you think of the sheer number of phone calls involved, and the rate at which the noise grows with every extra hop.
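To get a feel for how fast that pile grows, here is a rough back-of-envelope sketch. Every number in it is made up for illustration: the seed list of 100 suspects, the 50 distinct contacts per phone number, and the function name `numbers_swept_up` are all assumptions, not anything known about the actual programme. It also ignores overlap between people's contacts, so it is an upper bound rather than an estimate.

```python
def numbers_swept_up(seed_suspects, contacts_per_number, hops):
    """Upper bound on phone numbers pulled in after each hop.

    Assumes (hypothetically) that every number calls the same count of
    distinct other numbers and that nobody's contacts overlap.
    """
    totals = []
    frontier = seed_suspects   # numbers whose CDRs are pulled at this hop
    reached = seed_suspects    # running total of numbers in the diagram
    for _ in range(hops):
        frontier *= contacts_per_number
        reached += frontier
        totals.append(reached)
    return totals


# Illustrative figures only: 100 suspects, 50 contacts each, three hops.
hop_totals = numbers_swept_up(seed_suspects=100, contacts_per_number=50, hops=3)
for hop, total in enumerate(hop_totals, start=1):
    print(f"after hop {hop}: up to {total:,} numbers")
```

Under those invented figures the third hop alone takes you past twelve million numbers, nearly all of them pizza shops, minicab firms and mums.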

This is a key cognitive bias in anti-terrorism; it feels logical to assume that it doesn’t matter very much if your search results in false positives. It’s worth it to make sure we get as much of a chance as possible of sweeping up the bad guys…right? But consider this example; imagine there are a thousand people, and there is a 1 per cent chance (Dick Cheney’s standard) of any one person being a terrorist. We have a big technical breakthrough, the Terroriser, an algorithm that searches all available databases to look for terrorists and has only a 10 per cent chance of misidentifying a terrorist as a law-abiding citizen. On the downside, it has a 2 per cent false positive rate; or as Capita RAS pitched it to the Home Office, accuracy of 98 per cent.

So if we run everyone through the Terroriser, what are the chances that anyone who is flagged is a terrorist?

Well, there are by definition 10 terrorists, and the Terroriser performs as expected; 9 of them are flagged. But so are about 20 of the 990 law-abiding citizens; we now have 29 people in the cells, and the chance of any given suspect being a terrorist is less than one in three. Oh, and there’s still a terrorist out there. Now, that’s a reasonably good result; a false positive rate of 2 per cent is probably unrealistically low. I count 8 false positives out of 52 names on that diagram for a false positive rate of 15 per cent; it’s worth remembering that there is probably a good reason why the Government’s pathetically inadequate biometrics trial never reported accuracy rates with false positives and negatives broken out.
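If you want to check the sums, or try your own rates, the arithmetic fits in a few lines. This is only a sketch of the example above: the function name `terroriser_results` is invented, and the 10 per cent miss rate is the one implied by 9 of the 10 terrorists being flagged.

```python
def terroriser_results(population, base_rate, miss_rate, false_positive_rate):
    """Return (terrorists flagged, innocents flagged, chance a flagged person is a terrorist)."""
    terrorists = population * base_rate
    innocents = population - terrorists
    true_positives = terrorists * (1 - miss_rate)
    false_positives = innocents * false_positive_rate
    chance = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, chance


# Figures from the example: 1,000 people, 1 per cent terrorists,
# 2 per cent false positives, and a miss rate that lets 1 in 10 slip through.
caught, innocent, chance = terroriser_results(1000, 0.01, 0.10, 0.02)
print(f"terrorists flagged: {caught:.1f}")
print(f"innocents flagged:  {innocent:.1f}")
print(f"chance a flagged person is a terrorist: {chance:.0%}")

# The same machine with the 15 per cent false positive rate counted off the diagram.
_, _, worse = terroriser_results(1000, 0.01, 0.10, 0.15)
print(f"with 15 per cent false positives: {worse:.0%}")
```

With the 2 per cent rate the chance comes out at about 31 per cent, which is the "less than one in three" above; at 15 per cent it drops below 6 per cent, so nineteen cells in twenty hold the wrong person.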
