Category: hacker


I am really impressed both by this OpenNews post about how to tackle a huge pile of documents, and also by the tools recommended. After all:

What I received a month later from Nash County, N.C., were two boxes filled with thousands of printed pages of emails. Double-sided.

One of the problems it solves is that your filesystem is usually very, very good at finding files, on all kinds of criteria, and fast – just look at any unix/linux find examples page – but that presupposes that the information you have is broken out into files whose boundaries map roughly to a logical structure within the underlying data.

One of the best things is also the simplest: Overview has a feature that pulls a randomly selected sample of documents.

The blog is crazy good, too. Interestingly, I remember IBM announcing their big investment in big data the other year and giving “Computational Journalism” as one of the use cases.

Did I say the blog was good? The blog is good.


Look what our reader Dan O’Huiginn is up to!

I’ve always thought a great extension for DocumentCloud would be a plugin that generates a concordance of the documents, as it still strikes me as a big heavy way of just dumping out a lot of scanned-jpeg PDFs, which is what most people do with it.

Project Lobster user stories

OK, a bit of Lobster. Two things have happened recently to up my tuit on the project. First, I learned that Drew Conway of the Zero Intelligence Agents blog has given NetworkX the ability to generate a force-directed graph in d3.js, which you can stick right in a web page. Second, I’ve been reading the Flask docs and falling in love. So now, it’s got a github repo and structure and I have a pretty good idea of how to build it, and I took some decisions.

Lobster has basically 3 user stories. These are:

“Explore and investigate”

I ooh and aah over the network graph. I search for ministers I hate. I look up issues, lobbies, or subjects I am interested in. I drill down to more detail. Then I get angry, and lobby back via WriteToThem etc.

“Rouse the mob”

I notice something outrageous. I customise the data presentation in order to make my point and to see the issue more clearly. I get a URI to this view, and spread it to everyone I know.
for who in whoville.whos:
    tweet(my point)

“Enduring stare/fix in place by surveillance”

I pick lobbies, ministers, or subjects I am interested in. I customise the data presentation to understand the problem better and identify significant events. I register alerts to tell me when something happens.

Obviously, these three elements make up a larger message. Once the mob has been roused, it’s important to monitor the results (the Health & Social Care Bill “listening pause” being a case in point). Further, an alert going off is a cue to investigate, which in turn leads to a call to action.

Anyway, the next to-do is to rework a pile of ugly code from the analytics scrapers. Also, this is a pretty sweet way of plotting data and close to what I want.

Lib Dems: not quite useless

So, Wired writes up three West Point professors and their algorithm to decide which members of a terrorist network to zap. Apparently they implemented it in 30 lines of Python. The paper is here, with some pseudocode and the tantalising hint that they used NetworkX, but no Python. However, even the Wired piece tells us enough to reverse engineer it.

The key idea here is that whacking terrorist leaders is often stupid, because it causes the enemy to adopt a flatter, more decentralised, and therefore less vulnerable network structure. Also, they point out, the leaders are often forces for restraint and points of contact for negotiation.

Being who they were, they decided that they could fix this with a better optimisation. They looked at the network-wide degree centrality, a measurement of the centralisation or otherwise of the whole network which is defined as the fraction of total nodes in the network an average node is connected to. They then asked how this changed when they removed a node from the network. And they reasoned that increasing it was desirable, as it rendered the network overall more fragile and unstable.
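To make that definition concrete, here is a minimal pure-Python sketch of the measure as described above: the average, over all nodes, of the fraction of other nodes each one is connected to, and how it changes when a node is removed. It is not the West Point authors' code; the dict-of-neighbour-sets representation, the function names and the toy star network are all assumptions for illustration.

```python
def network_wide_degree_centrality(adj):
    """Average, over all nodes, of the fraction of other nodes each one touches."""
    n = len(adj)
    if n < 2:
        return 0.0
    return sum(len(nbrs) / (n - 1) for nbrs in adj.values()) / n


def without(adj, node):
    """Copy of the graph with `node` and all its edges removed."""
    return {v: nbrs - {node} for v, nbrs in adj.items() if v != node}


def centrality_change(adj, node):
    """Network-wide centrality after removing `node`, minus the value before."""
    return (network_wide_degree_centrality(without(adj, node))
            - network_wide_degree_centrality(adj))


# Toy star network (made up): one leader connected to four followers.
star = {
    "leader": {"a", "b", "c", "d"},
    "a": {"leader"}, "b": {"leader"}, "c": {"leader"}, "d": {"leader"},
}

before = network_wide_degree_centrality(star)
drop_leader = centrality_change(star, "leader")
drop_follower = centrality_change(star, "a")
```

On this toy network, removing the leader drops the measure from 0.4 to 0, while removing a follower raises it to 0.5.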

Now, the Lobster Project uses weighted betweenness centrality – the fraction of the shortest routes through the network that pass through a given node, with more important nodes being accounted for as such – as its centrality metric. There is no particular reason to think that this would work differently.

So I thought I’d implement it. Their implementation used 30 lines, but I presume that includes the test harness to generate or load a specimen network as well as the analysis. Here goes:

def greedy_fragile(mgraph, month, mini, nodes):
    # Mean centrality of the intact network, from the precomputed values.
    network_wide_centrality = sum(nodes.values()) / float(len(nodes))
    # Recompute on a copy without this minister, so mgraph keeps its edges.
    trimmed = mgraph.copy()
    trimmed.remove_node(mini[0])
    n = centrality_nodes(trimmed)
    nwc = sum(n.values()) / float(len(n))
    return {'Minister': mini[0], 'Title': mini[1]['Title'], 'Department': mini[1]['Department'], 'Date': month, 'Greedy_Fragile': network_wide_centrality - nwc}

mgraph is the NetworkX graph object, month is the month, mini is the minister (or lobby) as a (name, attributes) pair, and nodes is the precomputed dict of nodes and their centrality values. Obviously, if it wasn’t for the weird datastore thing I’d have done this recursively and made it return the values for the whole network rather than calling it for each node.
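For comparison, a whole-network version might look something like the sketch below. It is a plain-Python illustration rather than the Lobster code: the graph is an assumed dict of neighbour sets, degree centrality stands in for weighted betweenness, it loops rather than recurses, and greedy_fragile_all and its helpers are hypothetical names. The sign convention matches the function above (before minus after).

```python
def mean(values):
    """Arithmetic mean, or 0.0 for an empty collection."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def degree_centrality(adj):
    """Per-node fraction of the other nodes it is connected to."""
    n = len(adj)
    return {v: (len(nbrs) / (n - 1) if n > 1 else 0.0)
            for v, nbrs in adj.items()}


def greedy_fragile_all(adj, centrality=degree_centrality):
    """Score every node: network-wide centrality before minus after removal."""
    baseline = mean(centrality(adj).values())
    scores = {}
    for node in adj:
        # Rebuild the graph without this node and without edges pointing at it.
        trimmed = {v: nbrs - {node} for v, nbrs in adj.items() if v != node}
        scores[node] = baseline - mean(centrality(trimmed).values())
    return scores


# Made-up example: a hub with three spokes.
graph = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
scores = greedy_fragile_all(graph)
biggest_drop = max(scores, key=scores.get)
```

Passing a different centrality function (anything that takes the graph and returns a dict of node scores) swaps the metric without touching the loop.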

And it works. The first result was that one particular minister was slightly reducing the overall centralisation (and therefore fragility/instability) of the system as a whole. And he’s Ed Davey. As the point of having Lib Dems is meant to be reducing the centrality of Dave from PR and paddock-boy in the system, this suggests that we shouldn’t get rid of him yet.

Theme: Cyber-Oddness

Well, this is a story. Who hacked the French presidency? The original source of the story is Le Telegramme de Brest, a bit of a surprise but not the first time a really crazy news story got out in the regional press first. It suggests the attack took place at some point during the transition from President Sarkozy to President Hollande, between the 6th and the 15th of May, and the presidential transition was used as a cover story for the clean-up operation.

This piece in L’Express is mostly boilerplate “cyberwar”, but it does give some details of the exploit and points the finger…at the United States. Now, I’ve no idea how they can be so sure, but there is some actual information in there.

Apparently, the exploit consisted of three steps. The first was a version of the now-classic spearphishing attack. Several officials were sent a message on Facebook, presumably crafted for them, inviting them to follow a link, which led to a fake version of their intranet’s login page. This harvested their login credentials. The second step used the logins to deploy the Flame worm to the Elysee’s network. Flame would compromise some of the computers, which could then be searched for interesting information.

The reasoning is, apparently, that Flame was based on Stuxnet and everyone knows Stuxnet was the Israelis and therefore that’s the same as the Americans. I paraphrase a bit. I would argue that, based on what we actually know, it’s a best-of-breed solution, with one element (the spear-phishing) that is stereotypically associated with the Chinese (like so), and another (the code from Stuxnet) that originates with someone who doesn’t like the Iranians, and further work (the development from Stuxnet to Flame) from a third party.

This is completely normal for malware development, as it is for real viruses (how long before we start talking about “genetic” viruses to force the distinction?), and this is why “attribution” is difficult. Oh yes, and don’t distribute links to documents inside the firewall on Facebook!

Meanwhile, it seems someone nicked the entire Greek ID card database, near enough, and then there was the whole crazy-weird GPS timing/NTP bug incident, where the stratum 1 time sources run by the US Naval Observatory (yeah, where Dick Cheney used to live) stopped working, as did NIST’s time source, and NTP servers reacted weirdly differently from the way they’re meant to, and for a while the NIST GPS archive didn’t show any data.

A Project Lobster progress report!

So I completely forgot I needed to register for OKFN’s Open Interests Europe hackathon last weekend, which even had a lobbying track, and just round the corner from the office, too.

I decided to have my own lobbying hackathon, not by eating pizza and caffeine pills and being misogynistic, but by spending my weekend finishing the Lobster Project’s analytics scrapers for ministers and lobbies respectively. I abandoned the plan of generating NetworkX objects and storing them in the database for later use in favour of directly generating them and reading out the metrics, and dealing with the performance hit by writing slightly less horrible code.

Specifically, I decided to optimise for fewer calls to the database API. Memoising the rankings function cuts its usage from two calls a meeting to 82 for the first month, plus any future changes, and storing the cache itself means that only new combinations of ministers and titles generate a query in future runs. Getting all the lobbies for the month in one query, and then processing them in Python using itertools, replaces one query for each meeting with one admittedly complex query per month and a small function.
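Roughly, the two optimisations look like the sketch below. This is an illustration under assumptions, not the scraper code: query_rank is a hypothetical stand-in for the real rankings query, the cache is serialised as JSON only for the sake of the example, and the rows are invented.

```python
import itertools
import json
from operator import itemgetter

queries = []  # log of simulated database calls, so the saving is visible


def query_rank(minister, title):
    """Hypothetical stand-in for the real rankings query."""
    queries.append((minister, title))
    return len(title)  # placeholder value, not a real ranking


class RankCache:
    """Memoise rankings by minister and title; the cache can be stored and reloaded."""

    def __init__(self, stored="{}"):
        self.cache = json.loads(stored)

    def rank(self, minister, title):
        key = minister + "|" + title
        if key not in self.cache:
            self.cache[key] = query_rank(minister, title)
        return self.cache[key]

    def dump(self):
        return json.dumps(self.cache)


def lobbies_by_meeting(rows):
    """Group (meeting_id, lobby) rows from one per-month query by meeting."""
    ordered = sorted(rows, key=itemgetter(0))  # groupby needs sorted input
    return {meeting: [lobby for _, lobby in group]
            for meeting, group in itertools.groupby(ordered, key=itemgetter(0))}


first_run = RankCache()
first_run.rank("A. Minister", "Secretary of State")
first_run.rank("A. Minister", "Secretary of State")   # served from the cache
second_run = RankCache(first_run.dump())               # a later run reloads the cache
second_run.rank("A. Minister", "Secretary of State")  # still no new query

meetings = lobbies_by_meeting([(2, "Lobby X"), (1, "Lobby Y"), (2, "Lobby Z")])
```

Three lookups across two runs cost a single query here, and the grouping turns one flat result set into a per-meeting list without going back to the database.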

This still took far longer than I expected to run, but then I realised there was more data.

Anyway, they work and they are generating results by month, so we will be able to draw nice time series charts, up to September 2011. Unfortunately, the ScraperWiki datastore is doing something quite weird – replacing float values with nulls or zeroes – and although I thought I might have fucked up type declarations, pragma tells me that the column types are what they ought to be. So I’ve got a query outstanding with the ScraperWiki folk.

Netroots UK catchup

Other stuff from Netroots UK.

Having chugged through my official Brown Bag Lunch (which actually included Ribena, in a disturbingly infantilising touch), I went to the open space group on the Leveson inquiry. This ended up merging with the one on the LIBOR scandal. I was able to contribute by knowing how the LIBOR panel was meant to work, although we couldn’t get away from the point that separating investment and retail/commercial banking wouldn’t have helped because BarCap was big enough in its own right to be on the panel.

One point which everyone thought would resonate was that the scandal represented an attack on an institution that had relied on its members’ fair dealing. Exactly what to do with it, though, was harder. Could this support the Co-operative’s claim to buy the branches demerged out of Lloyds? Or a Leveson inquiry, but with banks? Of course there have already been inquiries, but then, the original ideal type of this kind of inquiry, the Pecora Committee, wasn’t the first inquiry or even the second into Wall Street in the 1920s.

What else? I went to one of the more tech-centric workshops, run by Blue State Digital. This was pretty good; I liked the point that Facebook advertising was usually a “hopeless waste of £2.50”, but it did have its uses. Those weren’t anything Facebook would want, though. Specifically, the ad-targeting tool lets you get a quick estimate of the size of a potential audience – input the demographics, locations, and search strings you’re interested in, and it spits out an estimate of your audience.

The other one was using it to bait your enemies. If you had a reasonable amount of information, you could place an ad that your target would have to read every time they logged in. This amused me more than a little.

Everyone, but everyone, loves ScraperWiki.

What else? WhoFundsYou scored thinktanks by the degree to which they are forthcoming about their funding. Astonishingly enough, Respublica, the “Not the Other” TaxPayers’ Alliance, and the Adam Smith Institute (no less) got an E. The very, very serious Centre for Policy Studies and Institute of Economic Affairs, and the somewhat less serious but certainly influential Policy Exchange and Centre for Social Justice got a D. You could have mistaken the score-card for a left-right political spectrum, as IPPR, Progress, Resolution Foundation, NEF, SMF, and Compass all got As, while Demos, Reform, the Fabians, and Policy Network got Bs. CentreForum was, superbly, right in the centre with Civitas and the Smith Institute.

It is telling that the distinction between wanktanks like Respublica and TPA and the Very, Very Serious ASI disappears on this scale.

Owen Jones has a lot of good laugh lines. The BSD people are good but self-satisfied. Clifford Singer is funny. I really regret missing the workshop on shooting better video on smartphones as I have zero video skills (even if their live demo was the traditional fiasco). You can’t hear anyone speaking anywhere in Congress House without using a loud hailer.

Lurking technology

Recently, preparing a case study for a client, I was struck by the idea of a “lurking technology”. (The history of technology is the trade secret of IT consulting, or something.) That’s one that isn’t necessarily obviously linked to the end user, has broad influence, and causes changes to happen. You can make a case that Ethernet was such a thing for the media in the 1980s and 1990s – the new colour print machines, the Apple Macs in the layout department, and the faxes and WAN technologies supporting the reporters are all influential in themselves, but they wouldn’t have worked without good local area networking to tie them together. You could say something similar about finance, and taken together, there’s a case that its influence has been mostly evil:-)

But the one I fixed on was distributed version-control. (I thought of Whitworth’s screw-originating machine, but I felt it might be a bit recondite.) It’s easy enough to see that there are a hell of a lot of Linux/Apache/MySQL/programming language beginning with P servers out there, and an absolutely enormous number of Android devices, to say nothing of BSD Unix-based iPhones. And it’s even easier to crack out 800 shiny radical words on the joy of open source.

However, just remember the last time you circulated a document for comment around a dozen people and what a pain in the arse that was. Now scale up to a million lines of code and several hundred contributors distributed around the world, and require that every change be submitted to automated testing, and imagine just how much pain and trouble and time this is going to involve.

The lurking technology that fixes this, and makes it possible, is distributed version control. Like all lurking technologies, very few people really care, a few more master it as part of a trade, and a bigger pool just assume it’s there. Of course, the people who do care over-care and imagine you could just sling the statute book in Github, and of course they are wrong, as a real expert points out here.

the missing link between slimming tea and tactical electronic warfare

Well, speak of the devil. Peter Foster makes his appearance in the Murdoch scandal and fingers the Sun directly.

He said he then received an email from a Dublin-based private investigator calling himself ”Autarch”, who told Mr Foster he tapped into his mother’s phone in December 2002.

That month, The Sun published the ”Foster tapes”, which featured transcripts of Mr Foster talking about selling the story of his links with Tony Blair’s wife, Cherie. Yesterday, Mr Foster said he had since had a Skype conversation with the investigator in Dublin, in which Autarch described how he tapped into Mr Foster’s mother’s phone.

”He said she was using an analogue telephone which they were able to intercept,” Mr Foster said. Autarch said he discussed the hacking with Sun journalists.

However, this story – at least this version of it – probably isn’t true. It is true that the first-generation analogue mobile phone systems like TACS in the UK and AMPS in the States were unencrypted over the air, and therefore could be trivially intercepted using a scanner. (They were also frequency-division duplex, so you needed to monitor two frequencies at once in order to capture both parties to the call.) It is also true that they were displaced by GSM very quickly indeed, compared to the length of time it is expected to take for the GSM networks to be replaced. In the UK, the last TACS network (O2’s) shut down in December 2000. It took a while longer in the Republic of Ireland, but it was all over by the end of 2001.

So Foster is bullshitting…which wouldn’t be a surprise. Or is he? TACS wasn’t the only analogue system out there. There were also a lot of cordless phones about using a different radio standard. Even some more modern digital cordless phones are notorious for generating masses of radio noise in the 2.4GHz band where your WiFi lives (DECT itself sits around 1.9GHz). It may well be the case that “Autarch” was referring to an analogue cordless phone. Because a lot of these were installed by individual people who bought them off the shelf, there was no guarantee that they would be replaced with newer devices. (Readers of Richard Aldrich’s history of GCHQ will note that his take on the “Squidgygate” tape is that it was probably a cordless intercept.)

This would have required a measure of physical surveillance, but then again so would an attempt to intercept mobile traffic over-the-air as opposed to interfering with voicemail or the lawful intercept system.

The Daily Beast has a further story, which points out that the then editor David Yelland apologised after being censured by the Press Complaints Commission (no wonder he didn’t go further in the Murdoch empire) and makes the point that such an interception was a crime in both the UK and Ireland at the time. They also quote Foster as follows:

According to Foster, the investigator told him that, for four days at the height of Cheriegate, he had been sitting with another detective outside Foster’s mother’s flat in the Dublin suburbs, intercepting and recording the calls to her cordless landline

The Sun hardly made any effort to conceal this – they published what purports to be a transcript, as such.