Category: programming

What if reality was more like software? Visit to a failed smart city

I took these photos of Berlin’s forever-delayed airport terminal this summer. It has just emerged, a few days before the latest attempt to announce an opening date, that the thing is still nowhere near ready after 2000 days of delay and twelve years of construction. Meanwhile, the executives bicker, the stakeholders wrestle for power, and the contractors are on time-and-materials, so at least somebody’s happy.


One thing that came across very strongly on the site is that this could be a preview of the future. Outside, everything is at once brand new and also being overtaken by dereliction. The car parks and access roads’ blacktop is perfect while advanced colonies of pioneer plants take over what they can. Every strand of weldmesh in the security perimeter is present and correct, but it’s also uniformly rusted to the same shade of iron oxide, as if deliberately weathered to be in keeping. Inside, though, the real story doesn’t take long to impose itself.


The interior is gorgeous – its design is supremely austere and the architects chose to banish the shops deeper into the building to provide more generous circulation spaces, and the materials are almost comically superb. But the second thing you notice is that everywhere, access panels into trunks and closets are hanging open and hanks of cabling have been pulled out. You’re in a failed smart city.


The smart city concept, back when it was fashionable, was all about the unplanned protocol layer integration of sensor networks centred on the user, or better, citizen. But in practice it has turned out to be little better than a branding concept for Huawei CCTV systems or your favourite Big Systems Integrator.

Airports are a good test case for this because they are so dense with building services systems – multiple and different telecoms networks, main and failover power supply, very complex air-conditioning, smoke evacuation, CCTV, both own-use and tenant radio systems, baggage handling conveyors, weird retail-sector surveillance – and so politically complex, with dozens of stakeholder groups (airlines, retailers, different police forces, air traffic control, port health, you name it) crammed under the roof. They also represent many of the horrible things about the smart city concept. Surveillance of various kinds is pervasive. Selling is relentless. Meaningful public space is nonexistent. Speech is inescapable, but it’s not free. I was stuck in Stockholm-Arlanda for hours a few weeks ago and the thing that sticks in my mind is the way money seemed to evaporate off me like sweat for every minute I was there.

The problems at BER centre on the extraordinarily complex building services that shoot through every bit of the terminal down to each and every ticket desk and departure board. (My favourite thing there was the departure boards – they are mostly live and showing flights across the airfield at Schönefeld, almost as if to mock the project’s hubris.) The biggest single problem is fire safety, but the systems interact in complicated ways and it might be more accurate to say that the smart element of the whole thing is the problem. During the last-minute sprint to finish the job in 2012, the cable installation was done so badly it turned out to be impossible to say where all the runs were and what was in each trunk. They’re still re-wiring, hence the cables hanging out of walls.

Something nobody seems to discuss about it is software. All those sensors and effectors are linked physically by the cable network, but logically by the decision rules in the controllers. Even if it’s all done in Siemens-Rexroth programmable-logic controllers rather than general purpose digital computers, which seems unlikely in this day and age, this is still software of a kind. And the project is failing in a way that is spookily familiar from failing software projects. It’s not that far from the finish, but the finish always recedes. More money and more people seem to slow it down. The complexity of its management, rather than anything inherent in the task, seems to control its fate in as much as anyone or anything controls it. This is straight out of The Mythical Man-Month and the literature of the software crisis. It’s telling that, out of the contractors being paid on time and materials terms, one of the biggest earners is T-Systems, the IT consulting wing of Deutsche Telekom.

One of the biggest problems is that the systems don’t just need to work, they need to be shown to work. Essentially everything about it is subject to regulatory approval of one kind or another, and as the mayor who started it all, Klaus Wowereit, said: Das ist auch gut so! (“And that’s a good thing too!”) This will be a huge problem for all kinds of Internet of Things projects. If something really matters, like fire protection, the way it works needs to be possible to understand, to demonstrate, and to take responsibility for. I am sympathetic to regulation. But I do wonder if the rural district council of Dahme-Spreewald is really capable of being BER’s regulator. This is their gloriously mundane HQ.

That said, perhaps it might do everyone good if they had to justify their code to a fire protection officer from a small suburban local government. Like this guy.

(Actually he’s a town councillor.) Maybe, if BER is one day ever finished, Dahme-Spreewald should go global and export this. No as a Service.

So, this is your dumb smart future. Projects of all kinds that display the special pathologies of software projects. They are both deeply unaccountable, and also crippled by a profusion of veto actors. In so far as they work, they’re basically horrible environments riddled with creepy surveillance and yelling adverts demanding money, that are probably also insecure and unsafe…does that remind you of anything? The smart city dreamers wanted an urban infrastructure like the Internet, and dammit they’re getting one. BER’s architects wanted to create a single, integrated, monolithic building that would contain everything and be managed in the same way. The Siemens engineers saw it as a showpiece that would sell the systems they built there as a new product for similar projects around the world. Together, they created a monster, a building so soaked with software, marinating in the stuff, it’s literally impossible to finish. The only credible solution I’ve heard of is to strip the thing back to the structural skeleton and start again.

That time I was nearly burned alive by a machine-learning model and didn’t even notice for 33 years

Remember Red Plenty, Francis Spufford’s historical SF novel about the Soviet Union’s efforts to create a real-time planned economy using computers and the ideas of Oskar Lange and Leonid Kantorovich? Sure you do if you’re on this blog. Well, it turns out that it had a dark and twisted 1980s sequel.

We already knew about Operation RYAN, the Yuri Andropov-inspired maximum effort search for intelligence offering strategic warning of a putative Western preventive war against the Soviet Union, and that it intersected dangerously with the war scare of 1983. We also knew that part of it was something to do with an effort to assess the intelligence take using some sort of computer system, but not in any detail. A lot more documents have just been declassified, and it turns out that the computer element was not just a detail, but absolutely central to RYAN.

At the end of the 1970s the USSR was at the zenith of its power, but the KGB leadership especially were anxious about the state of the economy and about the so-called scientific-technological revolution, the equivalent of the Revolution in Military Affairs concept in the US. As a result, they feared that once the US regained a substantial advantage it would attack. The answer was to develop an automated system to predict when this might happen and what the key indicators were.

Model the whole problem as a system of interconnected linear programming problems. They said. Load up the data. They said. Comrades, let’s optimise. They said.

In all, the RYAN model used some 40,000 data points, most of which were collected by greatly increased joint KGB and GRU field activity. It generated a numerical score between 0 and 100. Higher was better – above 70 peace was probable, whereas below 60 it was time to worry. The problem was the weighting applied to each of those parameters. Clearly, they had to train the model against some existing data set, and the one they chose was Nazi Germany in the run-up to Operation BARBAROSSA.

Who needs theory? They said. We’ve got the data. They said. A simple matter of programming. They said.

As Sean Gallagher at Ars wisely points out, this is a case of the problem described here, that gave us those amazing computer dream pictures. The neural network that classifies cat photos must by definition contain enough information to make a random collection of pixels catlike, although uncannily not quite right. Similarly, RYAN picked up a lot of unrelated data and invariably made it vaguely Hitler-y.

The score went through 60 as early as 1981. The Soviets responded by going on higher alert and sending more agents to posts in the West to get more data. Meanwhile, in the West, John Lehman’s maritime strategy was being put into effect, causing the US Navy and its allies to operate progressively closer to the Soviet periphery, which only made things worse. In the autumn of 1983, the score may have fallen below 40, around the time Stanislav Petrov did his thing.

At this point, Communist Party local cadres were being called in to be briefed on the coming war and their duty to prepare the population. Tactical nuclear weapons were released to local control and moved about by helicopter. The Soviet military was on a higher state of alert than even during the Cuban missile crisis. Fortunately, at this crucial juncture, Yuri Andropov resolved the situation by dying and therefore denying the Big Algo that crucial parameter: patronage.

So, when I was reading all that SF as a kid, I had actually narrowly escaped being vaporised with nuclear space rockets by an evil computer that had convinced itself I was Hitler! I had no idea!

Less flippantly, one of the major themes in Red Plenty is the tension between Kantorovich’s vision of a decentralised, instantly responsive socialist economy, and the Party’s discretionary power – between communism and the Communists, if you like. The RYAN story flips this on its head. This time, it wasn’t the bureaucrats’ insistence on clinging to power that was the problem. It was the solution. The computer said “War”; only fundamentally political, human discretion could say “Peace”. As Joseph Weizenbaum put it, a computer can decide but it cannot choose.

Another thing from Red Plenty that comes up here is that the same unvarying forces of Soviet politics worked the same way, computers or no computers. In the end, everything was personal, and settled through the backstairs gift-economy of favours and influence. Only the loss of its patron could stop the machine.

Also, another theme in the book is the future role the actors in it will play in the perestroika years. We have the cadre down in Novocherkassk who refuses to get used to violence. We have the cadre and programmer who may be turning into an embarrassing trendy dad, but has been enduringly influenced by the Czech experience of 1968. We have the economist who has learned the lesson that the system will have to change dramatically, even if this gets put off 20 years. When they reach the peak of their careers, something is going to change.

And of course they were arriving there just in time to “sudo killall -9 ryand” before ryand killed us all.

Basicer Basic

From the open newslist, I’ve been asked to reflect on the 50th anniversary of BASIC.

This came at the same time I read this post on the Light Table blog. I think Jamie Brandon has a point about the distinction between an evaluate-or-die and an edit-and-continue model of computing. He’s also right about the deployment gap – it’s often much harder to deploy code in a useful manner than it is to develop it, especially if it’s not your job. For example, one of the reasons why the Web is so important is that it gets around having to get your app onto all 15 PCs at work that nobody really manages.

Deployment is important because it is the way in which your good idea can actually affect anyone else. The Time piece quotes the current chair of maths at Dartmouth, where BASIC was invented, as saying:

you need some immediacy in the turnaround

He’s talking about debugging, but the two issues are the same thing at different levels of analysis. You run it and debug it on your machine, you deploy it and improve it. When I’ve done things like this, I’ve always used the work BSD server just because I can ssh into it, write python, and well, that’s it.

Working with Symbian was the absolute opposite; the sheer embuggerance of getting to the point where you could even try, and getting back there once you changed anything, was desperate. I never realised why anyone cared about iPhones, in the era when they had crappy radios and no copy-and-paste, until I tried to make an app for S60.

Fortunately, this ought to be the easiest problem to solve in these cloudy days.

On the other hand, I’ve got to stand up for language. Graphical programming tools are usually awful, for two reasons. First of all, they mix aesthetic and logical decisions. Second, language is wonderful, and deeply universal, and text is language. Expressing yourself graphically to the extent you can in speech or writing is a whole other craft. Blogging didn’t happen because the software did the writing for you; it happened because it distributed it.

BASIC itself? Well, I was a BASIC kid. I had a ZX Spectrum and I used to take the handbook to bed and read it under the covers. I remember talking in BASIC with Matty Stockham on the school bus. In many ways the Spectrum-BASIC combo was the ultimate middle-class artefact; somehow educational, totally useless, and rather less fun than it was meant to be unless you’re like me. If the Early Learning Centre had been able to develop a computer, that’s what they would have done. Is it telling in the light of what I’ve just said that the BBC Micro had much better networking support?

But this is beside the point. The point wasn’t that I became a programmer – in fact, I rebelled against computers and the computer geek stereotype to the extent of refusing to have an e-mail address, and as a result I missed out on the A-level computing course Eben Upton took a year ahead of me – it was education in the classic sense. I learned that the digital environment, springing up all around, was one I could understand, criticise, and influence.

From the economy (and asterisk-biz)

VoIP engineer, Linux developer ( Asterisk, 2+ years, Perl, MySQL)

ESSEX BASED – Small but growing hosted VoIP provider requiring experienced engineer to join our team. Looking for a team player who is keen to be involved in new ideas.

The successful candidate is required to have:

-2+ years of VoIP (Asterisk, SER/Kamailio), incl. experience of independently setting up VoIP systems.

-Familiarity with AGI and able to develop in Perl

-Working with large Servers and carrier grade hardware.

-Strong Linux background (RedHat/Centos/Debian)

-Thorough understanding of the principles of VoIP telephony (SIP)

-Working knowledge (experience) of SQL based DBMS, MySQL.

Desired skills:

-Web development experience, HTML/XML/SOAP programming.

Your duties (role) will include:

-Support the company and its client’s Hosted VOIP platform, help in developing new services.

-Looking after existing infrastructure, modifying and updating services.

-Deploy new VOIP platform to new customers.

-Provide hardware maintenance and general technical support for the platform.

-Liaise with other members of both, technical and sales teams, on service planning, capacity forecast, quality issues, etc…

-Report to manager, on all service outages and take proactive steps to minimize downtime.

Tent provided Salary £30,000+ based on experience.

meet Project Lobster

My lobbying project has been entered in the Open Data Challenge! Someone posted this to the MySociety list, with rather fewer than the advertised 36 hours left. I was at a wedding and didn’t read it at the time. After my partner and I had tried to invent a tap routine to the back end of Prince’s “Alphabet Street” and had got up at 8am to make it for the sadistic bed & breakfast breakfast and gone back to help clean up and drink any unaccountably unconsumed champagne, and the only thing left to look forward to was the end of the day, I remembered the message and noted that I had to get it filed before midnight.

So it was filed in the Apps category – there’s an Ideas category but that struck me as pathetic, and after all there is some running code. I pushed on to try and get something out under the Visualisation category but ManyEyes was a bit broken that evening and anyway its network diagram view starts to suck after a thousand or so vertices.

As a result, the project now has a name and I have some thin chance of snagging an actual Big Society cheque for a few thousand euros and a trip to Brussels. (You’ve got to take the rough with the smooth.)

The most recent experiment with the Lobster Project – see, it’s got a name! It’s got you in its grips before you’re born…it lets you think you’re king when you’re really a prawn…whoops, wrong shellfish – was to try out a new centrality metric, networkx.algorithms.centrality.betweenness_centrality. This is defined as the fraction of the shortest paths between all the pairs of nodes in the network that pass through a given node. As you have probably guessed, this is quite an inefficient metric to compute and the T1700 lappy took over a minute to crunch it compared to 7 seconds to complete the processing script without it. Perhaps the new KillPad would do better but the difference is big enough that it’s obviously my fault.

Worth bothering with?

As far as I can see, though, it’s also not very useful. The results are correlated (R^2 = 0.64) with the infinitely faster weighted graph degree. (It also confirms that Francis Maude is the secret ruler of the world, though.)
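To make the cost difference concrete, here is a minimal sketch in plain Python. It does not call NetworkX: it is a hand-rolled version of Brandes's algorithm for an unweighted, undirected graph, set beside a one-pass weighted degree. The edge list and the meeting counts used as weights are invented for illustration.

```python
from collections import deque

# Hypothetical toy data: (lobby, minister, number_of_meetings).
EDGES = [
    ("lobby_a", "minister_x", 1),
    ("lobby_b", "minister_x", 2),
    ("lobby_c", "minister_x", 1),
    ("lobby_c", "minister_y", 3),
]


def build_adjacency(edges):
    """Map each node to the set of its neighbours (weights ignored)."""
    adj = {}
    for u, v, _ in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def weighted_degree(edges):
    """Sum of edge weights at each node: a single pass over the edges."""
    degree = {}
    for u, v, w in edges:
        degree[u] = degree.get(u, 0) + w
        degree[v] = degree.get(v, 0) + w
    return degree


def betweenness(adj):
    """Normalised betweenness: one breadth-first search per node."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}    # number of shortest paths from s
        dist = {v: -1 for v in adj}    # -1 means not reached yet
        preds = {v: [] for v in adj}   # predecessors on shortest paths
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path shares.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    n = len(adj)
    pairs = (n - 1) * (n - 2) / 2  # pairs of other nodes, per node
    # Each undirected pair was counted from both ends, hence the / 2.
    return {v: (c / 2) / pairs if pairs else 0.0 for v, c in score.items()}


SCORES = betweenness(build_adjacency(EDGES))
DEGREES = weighted_degree(EDGES)
```

On this toy graph minister_x sits on five of the six possible pairs' shortest paths, and lobby_c on three. The degree calculation touches each edge once, whereas the betweenness loop runs a full search from every node, which is roughly where the extra minute goes.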

The NX functions I’m really interested in, though, are the ones for clique discovery and blockmodelling. It’s obvious that with getting on for 3,000 links and more to come, any visualisation is going to need a lot of reduction. Blockmodelling basically chops your network into groups of nodes you provide and aggregates the links between those groups – it’s one way, for example, to get department level results.
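As a rough illustration of what blockmodelling does to the data, here is a sketch in plain Python rather than the NetworkX function itself. The group names, node names and weights are all made up; the only idea taken from the text is "chop the network into groups you provide and aggregate the links between them".

```python
from collections import Counter


def blockmodel(edges, groups):
    """Collapse nodes into named groups and sum link weights between groups.

    edges  -- iterable of (node_a, node_b, weight)
    groups -- dict mapping a group name to the list of nodes in it
    """
    block_of = {node: name for name, nodes in groups.items() for node in nodes}
    totals = Counter()
    for u, v, weight in edges:
        # Edges touching a node outside every group are simply dropped.
        if u in block_of and v in block_of:
            pair = tuple(sorted((block_of[u], block_of[v])))
            totals[pair] += weight
    return dict(totals)


# Hypothetical data: lobbies on one side, ministers grouped by department.
edges = [
    ("BP", "minister_1", 2),
    ("Shell", "minister_1", 1),
    ("BP", "minister_2", 1),
    ("TUC", "minister_3", 4),
]
groups = {
    "Oil": ["BP", "Shell"],
    "Cabinet Office": ["minister_1", "minister_2"],
    "Unions": ["TUC"],
    "DEFRA": ["minister_3"],
}
department_level = blockmodel(edges, groups)
```

Four nodes' worth of oil-industry meetings collapse into a single Oil to Cabinet Office link of weight 4, which is the kind of reduction a 3,000-edge diagram needs.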

But I’d be really interested to use empirical clique discovery to feed into blockmodelling – the API for the one generates a python list of cliques, which are themselves lists of nodes, and the other accepts a list of nodes or a list of lists (of nodes). Another interesting option might be to blockmodel by edge attribute, which would be a way of deriving results for the content of meetings via the “Purpose of meeting” field. However, that would require creating a list of unique meeting subjects and then iterating over it creating lists of nodes with at least one edge having that subject, and then shoving the resulting list-of-lists into the blockmodeller.
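The edge-attribute idea can be sketched without NetworkX at all, assuming the edges are available as (lobby, minister, attributes) tuples. The "purpose" key is a hypothetical stand-in for the "Purpose of meeting" field, and the data is invented.

```python
def nodes_by_subject(edges, key="purpose"):
    """Group nodes by the subject of at least one of their edges.

    Returns a dict of subject -> sorted list of nodes; list(result.values())
    is the list-of-lists form described in the text.
    """
    subjects = sorted({attrs[key] for _, _, attrs in edges})
    return {
        subject: sorted(
            {node
             for u, v, attrs in edges
             if attrs[key] == subject
             for node in (u, v)}
        )
        for subject in subjects
    }


# Hypothetical meetings.
edges = [
    ("BP", "minister_1", {"purpose": "Energy policy"}),
    ("Shell", "minister_2", {"purpose": "Energy policy"}),
    ("TUC", "minister_1", {"purpose": "Pensions"}),
]
groups = nodes_by_subject(edges)
list_of_lists = list(groups.values())
```

One caveat: a node that discussed two subjects lands in two groups, so the lists overlap. A blockmodeller that expects each node in exactly one group may reject this input, in which case each subject would need to be processed separately.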

That’s a lorra lorra iteratin’ by anybody’s standards, even if, this being Python, most of it will end up being rolled up in a couple of seriously convoluted list comps. Oddly enough, it would be far easier in a query language or an ORM, but I’ve not heard of anything that lets you do SQL queries against a NX graph.
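One workaround, sketched here under the assumption that the edge list can be exported as plain rows, is to copy it into an in-memory SQLite table and query that instead. This does not query the NX graph itself; the table name, column names and data are all hypothetical.

```python
import sqlite3

# Hypothetical one-edge-per-row export: (lobby, minister, purpose).
meetings = [
    ("BP", "minister_1", "Energy policy"),
    ("Shell", "minister_1", "Energy policy"),
    ("BP", "minister_2", "Procurement"),
    ("TUC", "minister_1", "Introductory meeting"),
    ("BP", "minister_3", "Energy policy"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meetings (lobby TEXT, minister TEXT, purpose TEXT)")
conn.executemany("INSERT INTO meetings VALUES (?, ?, ?)", meetings)

# Per subject: how many distinct lobbies raised it, and in how many meetings.
rows = conn.execute(
    """
    SELECT purpose, COUNT(DISTINCT lobby) AS lobbies, COUNT(*) AS meetings
    FROM meetings
    GROUP BY purpose
    ORDER BY meetings DESC, purpose
    """
).fetchall()
conn.close()
```

The GROUP BY replaces the nested list comprehensions, at the price of keeping a second copy of the data in step with the graph.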

Having got this far, I notice that I’ve managed to blog my enthusiasm back up.

Anyway, I think it’s perhaps time for a meetup on this next week with Who’s Rob-bying.

OpenTech washup, and an amended result

So it was OpenTech weekend. I wasn’t presenting anything (although I’m kicking myself for not having done a talk on Tropo and Phono) but of course I was there. This year’s was, I think, a bit better than last year’s – the schedule filled up late on, and there were a couple of really good workshop sessions. As usual, it was also the drinking conference with a code problem (the bar was full by the end of the first session).

Things to note: everyone loves Google Refine, and I really enjoyed the Refine HOWTO session, which was also the one where the presenter asked if anyone present had ever written a screen-scraper and 60-odd hands reached for the sky. Basically, it lets you slurp up any even vaguely tabular data and identify transformations you need to clean it up – for example, identifying particular items, data formats, or duplicates – and then apply them to the whole thing automatically. You can write your own functions for it in several languages and have the application call them as part of the process. Removing cruft from data is always incredibly time consuming and annoying, so it’s no wonder everyone likes the idea of a sensible way of automating it. There’s been some discussion on the ScraperWiki mailing list about integrating Refine into SW in order to provide a data-scrubbing capability and I wouldn’t be surprised if it goes ahead.

Tim Ireland’s presentation on the political uses of search-engine optimisation was typically sharp and typically amusing – I especially liked his point that the more specific a search term, the less likely it is to lead the searcher to a big newspaper website. Also, he made the excellent point that mass audiences and target audiences are substitutes for each other, and the ultimate target audience is one person – the MP (or whoever) themselves.

The Sukey workshop was very cool – much discussion about propagating data by SMS in a peer-to-peer topology, on the basis that everyone has a bucket of inclusive SMS messages and this beats paying through the nose for Clickatell or MBlox to send out bulk alerts. They are facing a surprisingly common mobile tech issue, which is that when you go mobile, most of the efficient push-notification technologies you can use on the Internet stop being efficient. If you want to use XMPP or SIP messaging, your problem is that the users’ phones have to maintain an active data connection and/or recreate one as soon after an interruption as possible. Mobile networks analogise an Internet connection to a phone call – the terminal requests a PDP (Packet Data Protocol) data call from the network – and as a result, the radio in the phone stays in an active state as long as the “call” is going on, whether any data is being transferred or not.

This is the inverse of the way they handle incoming messages or phone calls – in that situation, the radio goes into a low power standby mode until the network side signals it on a special paging channel. At the moment, there’s no cross-platform way to do this for incoming Internet packets, although there are some device-specific ways of getting around it at a higher level of abstraction. Hence the interest of using SMS (or indeed MMS).

Their other main problem is the integrity of their data – even without deliberate disinformation, there’s plenty of scope for drivel, duplicates, cockups etc to get propagated, and a risk of a feedback loop in which the crap gets pushed out to users, they send it to other people, and it gets sucked up from Twitter or whatever back into the system. This intersects badly with their use cases – it strikes me, and I said as much, that moderation is a task that requires a QWERTY keyboard, a decent-sized monitor, and a shirt-sleeve working environment. You can’t skim-read through piles of comments on a 3″ mobile phone screen in the rain, nor can you edit them on a greasy touchscreen, and you certainly can’t do either while looking out that you don’t get hit over the head by the cops.

Fortunately, there is no shortage of armchair revolutionaries on the web who could actually contribute something by reviewing batches of updates, and once you have reasonably large buckets of good stuff and crap you can use Bayesian filtering to automate part of the process.
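For the Bayesian part, a minimal sketch of a naive Bayes filter with add-one smoothing looks something like this. It is not Sukey's code; the class name, the two labels and the training messages are all invented, and a real filter would want far more data and better tokenising.

```python
import math
from collections import Counter


class UpdateFilter:
    """Tiny naive Bayes classifier over whitespace-separated words."""

    LABELS = ("good", "crap")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Add one human-reviewed message to the bucket for `label`."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def log_score(self, text, label):
        """Log of prior times smoothed word likelihoods for `label`.

        Assumes at least one message has been trained for each label.
        """
        vocab = set().union(*self.word_counts.values())
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        for word in text.lower().split():
            count = self.word_counts[label][word] + 1  # add-one smoothing
            score += math.log(count / (total_words + len(vocab)))
        return score

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


# Hypothetical buckets built up by the armchair reviewers.
f = UpdateFilter()
for msg in ["police line forming at north exit",
            "kettle forming on main street avoid",
            "north exit clear now"]:
    f.train(msg, "good")
for msg in ["lol free pizza", "rt rt rt omg", "free tickets click here"]:
    f.train(msg, "crap")

verdict_1 = f.classify("police kettle at north exit")
verdict_2 = f.classify("free pizza lol")
```

The human reviewers still do the hard part, which is filling the two buckets. The filter only automates the cases that look like something already seen.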

Francis Davey’s OneClickOrgs project is coming along nicely – it automates the process of creating an organisation with legal personality and a constitution and what not, and they’re looking at making it able to set up co-ops and other types of organisation.

I didn’t know that OpenStreetMap is available through multiple different tile servers, so you can make use of Mapquest’s CDN to serve out free mapping.

OpenCorporates is trying to make a database of all the world’s companies (they’re already getting on for four million), and the biggest problem they have is working out how to represent inter-company relationships, which have the annoying property that they are a directed graph but not a directed acyclic graph – it’s perfectly possible and indeed common for company X to own part of company Y which owns part of company X, perhaps through the intermediary of company Z.
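The X, Y, Z case is easy to demonstrate. Below is a minimal sketch of a depth-first search that reports one ownership loop if there is one; the dict-of-lists representation and the company names are assumptions for illustration, not anything OpenCorporates has described.

```python
def find_cycle(owns):
    """Return one ownership cycle as a list of companies, or None.

    owns -- dict mapping a company to the companies it holds shares in.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []  # companies on the current chain of ownership

    def visit(company):
        state[company] = IN_PROGRESS
        path.append(company)
        for child in owns.get(company, ()):
            child_state = state.get(child, UNSEEN)
            if child_state == IN_PROGRESS:
                # We have arrived back at a company already on the chain.
                return path[path.index(child):] + [child]
            if child_state == UNSEEN:
                cycle = visit(child)
                if cycle:
                    return cycle
        path.pop()
        state[company] = DONE
        return None

    for company in owns:
        if state.get(company, UNSEEN) == UNSEEN:
            cycle = visit(company)
            if cycle:
                return cycle
    return None


circular = {"X": ["Y"], "Y": ["Z"], "Z": ["X"]}
tidy = {"A": ["B", "C"], "B": ["C"], "C": []}

loop = find_cycle(circular)
no_loop = find_cycle(tidy)
```

This recursive version is fine for a demonstration, but at several million companies it would run into Python's recursion limit and would need rewriting with an explicit stack.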

OpenTech’s precursor, Notcon, was heavier on the hardware/electronics side than OT usually is, but this year there were quite a few hardware projects. However, I missed the one that actually included a cat.

What else? LinkedGov is a bit like ScraperWiki but with civil servants and a grant from the Technology Strategy Board. Francis Maude is keen. Kumbaya is an encrypted, P2P online backup application which has the feature that you only have to store data from people you trust. (Oh yes, and apparently nobody did any of this stuff two years ago. Time to hit the big brown bullshit button.)

As always, the day after is a bit of an enthusiasm killer. I’ve spent part of today trying to implement monthly results for my lobby metrics project and it looks like it’s much harder than I was expecting. Basically, NetworkX is fundamentally node-oriented and the dates of meetings are edge properties, so you can’t just subgraph nodes with a given date. This may mean I’ll have to rethink the whole implementation. Bugger.
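One way round the node-versus-edge problem is to do the date selection on the edge list first and only then work out which nodes are involved. The sketch below assumes meetings are held as (lobby, minister, ISO date string) tuples, which is an invented layout, and it sidesteps NetworkX entirely.

```python
from collections import Counter, defaultdict


def edges_by_month(meetings):
    """Bucket (lobby, minister) edges under a 'YYYY-MM' key."""
    buckets = defaultdict(list)
    for lobby, minister, when in meetings:
        buckets[when[:7]].append((lobby, minister))
    return dict(buckets)


def monthly_degree(meetings):
    """For each month, count how many meetings each node took part in."""
    result = {}
    for month, edges in edges_by_month(meetings).items():
        counts = Counter()
        for lobby, minister in edges:
            counts[lobby] += 1
            counts[minister] += 1
        result[month] = dict(counts)
    return result


# Hypothetical meetings.
meetings = [
    ("BP", "minister_1", "2010-05-14"),
    ("Shell", "minister_1", "2010-05-20"),
    ("BP", "minister_2", "2010-06-02"),
]
per_month_edges = edges_by_month(meetings)
per_month_degree = monthly_degree(meetings)
```

Each month's edge list could then be used to build a separate small graph, so the node-oriented metrics run once per month rather than needing a date filter on nodes.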

I’m also increasingly tempted to scrape the competition’s meetings database into ScraperWiki as there doesn’t seem to be any way of getting at it without the HTML wrapping. Oddly, although they’ve got the Department of Health’s horrible PDFs scraped, they haven’t got the Scottish Office although it’s relatively easy, so it looks like this wouldn’t be a 100% solution. However, their data cleaning has been much more effective – not surprising as I haven’t really been trying. This has some consequences – I’ve only just noticed that I’ve hugely underestimated Oliver Letwin’s gatekeepership, which should be 1.89 rather than 1.05. Along with his network degree of 2.67 (the eighth highest) this suggests that he should be a highly desirable target for any lobbying you might want to do.

Self-binding note: lobby metrics

Things to get out of the data in this scraper of mine: for each lobby, the monthly meeting counts, degrees in the weighted multigraph, impact factor (i.e. graph degree/meetings to give an idea of productivity), most met ministers, most met departments, topics. For each ministry, meeting counts, most met lobbies, most discussed topics. For each PR agency (Who’s Lobbying had or has a list of clients for some of them), the same metrics as for lobbies. Summary dashboard: top lobbies, top lobbyists, top topics, graph visualisation, top 10 rising and falling lobbies by impact.

Things I’d like to have but aren’t sure how to implement: a metric of gatekeeper-ness for ministers, for example, how often a lobby met a more powerful minister after meeting this one, and its inverse, a metric of how many low-value meetings a minister had. I’ve already done some scripting for this, and NetworkX will happily produce most of the numbers, although the search for an ideal charting solution goes on. Generating the graph and subgraphs is computationally expensive, so I’m thinking of doing this when the data gets loaded up and storing the results, rather than doing the sums at runtime.
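One possible reading of the gatekeeper idea, sketched below, is the share of a minister's lobbies that went on to meet someone more powerful afterwards. This is only an illustration: the power ranking is a made-up input that would have to come from somewhere, and the function is not the calculation behind any gatekeepership figure quoted on this blog.

```python
from collections import defaultdict


def gatekeeper_score(meetings, power, minister):
    """Fraction of lobbies that met a more powerful minister after `minister`.

    meetings -- iterable of (lobby, minister, ISO date string)
    power    -- dict of minister -> rank, higher meaning more powerful
                (a hypothetical input, not something in the data release)
    """
    history = defaultdict(list)
    for lobby, who, when in meetings:
        history[lobby].append((when, who))

    met = escalated = 0
    for visits in history.values():
        visits.sort()  # ISO date strings sort chronologically
        first = next((when for when, who in visits if who == minister), None)
        if first is None:
            continue  # this lobby never met the minister in question
        met += 1
        if any(when > first and power[who] > power[minister]
               for when, who in visits):
            escalated += 1
    return escalated / met if met else 0.0


# Hypothetical data.
meetings = [
    ("lobby_a", "junior", "2010-06-01"),
    ("lobby_a", "senior", "2010-07-01"),
    ("lobby_b", "junior", "2010-06-15"),
    ("lobby_c", "senior", "2010-05-01"),
    ("lobby_c", "junior", "2010-06-20"),
]
power = {"junior": 1, "senior": 2}

junior_score = gatekeeper_score(meetings, power, "junior")
senior_score = gatekeeper_score(meetings, power, "senior")
```

Only lobby_a moved up after seeing the junior minister, so the junior scores one in three. Because the result depends only on the loaded data, it is a natural candidate for computing once at load time and storing.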

Where’s that Django tutorial? Unfortunately it’s 7.05 pm on Sunday and it’s looking unlikely I’ll do it this weekend…

Exactly what is Communication and Strategy Management Ltd?

So I scraped the government meetings data and rescraped it as one-edge-per-row. And then, obviously enough, I tidied it up in a spreadsheet and threw it at ManyEyes as a proof-of-concept. Unfortunately, IBM’s otherwise great web site is broken, so although it will preview the network diagram, it fails to actually publish it to the web. Oh well, ticket opened, etc.
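For anyone wondering what one-edge-per-row means in practice, here is a minimal sketch. It assumes each scraped record lists several organisations in one semicolon-separated field, which is a guess at the layout rather than the real format of the release, and the field names are hypothetical.

```python
def explode(records, separator=";"):
    """Turn one row per meeting into one row per (minister, organisation)."""
    rows = []
    for record in records:
        for org in record["organisations"].split(separator):
            org = org.strip()
            if org:  # skip empty fragments left by trailing separators
                rows.append(
                    (record["minister"], org, record["date"], record["purpose"])
                )
    return rows


# Hypothetical scraped records.
records = [
    {"minister": "minister_1", "date": "2010-05-14",
     "organisations": "BP; Shell", "purpose": "Energy policy"},
    {"minister": "minister_2", "date": "2010-06-02",
     "organisations": "TUC;", "purpose": "Pensions"},
]
edge_rows = explode(records)
```

Each output row is one edge, so it can go straight into a spreadsheet or a network diagram tool that expects a source column and a target column.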

Anyway, I was able to demonstrate the thing to Daniel Davies on my laptop, on the bar of the Nelson’s Retreat pub in Old Street. This impressed him excessively. Specifically, we were interested by an odd outlier on the chart. Before I get into that, though, here are some preliminary findings.

1 – Clegg’s Diary

At first sight, Nick Clegg appears to be unexpectedly influential. His calendar included meetings with NATO, the World Bank, the Metropolitan Police, the Gates Foundation, and oddly enough, Lord Robertson of Port Ellen. Not only that, he had one-to-one meetings with all of them. However, he also got The Elders (i.e. retired politicos playing at shop) and the leader of the Canadian opposition, one Michael Ignatieff, Esq. God help us, is Clegg turning out to be a Decent?

2 – Dave from PR’s surprisingly dull world

The Prime Minister, no less, meets with some remarkably dull people. In fact, he met quite a lot of people who you’d expect to be left to flunkies while leaving quite a lot of important people to Nick Clegg. He did get BP, Shell, Pfizer, Rupert Murdoch, the TUC general secretary, and Ratan Tata (twice!) as one-on-ones, but he also met a surprising number of minor worthies from Cornwall and vacuous photocalls with people from Facebook.

3 – Francis Maude, evil genius of the coalition

Secretary of State for the Cabinet Office and Paymaster-General, Francis Maude MP, is the surprise hit, as far as I can make out. He seems to have a special responsibility for anything that smacks of privatisation – therefore, the monetary value of meeting him is probably high. Of course, if your evil genius is Francis Mediocritus, you’ve got problems. No wonder we’re in such a mess. All these points are also true of Oliver Letwin.

4 – Communication and Strategy Management Ltd

This is our far outlier. Some of the least significant people on the chart appear to be government whips, which is obviously an artefact of the data set. The data release does not cover intra-governmental or parliamentary meetings, nor does it cover diplomatic activity. Whips, of course, are a key institution in the political system. Given their special role with regard to both the government and parliament, it’s not surprising that they appear to be sheltered from external lobbying – access to the Whips’ Office would be such a powerful and occult influence that it must be held closely.

So what on earth is Communication and Strategy Management Ltd., a company which had one-on-one access to the Government Chief Whip, the Rt. Hon. Patrick McLoughlin MP, and which according to Companies House was founded on the 11th of April? It has no web site or perceptible public presence. It is located in what looks like a private house, here, not far from Stratford upon Avon:

Evidently the hub of political influence, but those are the facts. The directors are Elizabeth Ann Murphy and Richard Anthony Cubitt Murphy*, ignoring a company-formation agent who was a director for one day when setting up the company. It’s not as if C&SM Ltd is a constituent of McLoughlin’s – he’s MP for the Derbyshire Dales. Actually, either the directors are related or else there was a cockup, as Murphy’s name on the books was amended from Bromley the day after the company was formed and both were born in 1963. The Companies House filing* doesn’t give any other information – accounts aren’t due for a while – except that the one share issued is held by Norman Younger, who is a partner in the company formation service that was used.

Anyway, the next stop is to learn how this works and put up a nice little dashboard page to help watch the lobbysphere. I’d be happier doing something with Python – such as NodeBox – but the diagram is already too big to be useful without interactivity, and you can’t stick a NodeBox window in a web page.

*Not the Richard Murphy, who is too young.
*WebCheck – it’s not an ugly website, it’s a way of life…

the House of Lords is not just stranger than you think..

This has me thinking one thing – TheyWorkForYou needs to integrate the text-mining tool researchers used to estimate the point at which Agatha Christie’s Alzheimer’s disease set in by analysing her books. We could call it WhatHaveTheyForgotten? Or perhaps HowDrunkIsYourMP? Jakob Whitfield pointed me to the original paper, here. It doesn’t seem that complicated, although I have a couple of methodological questions – for a start, are there enough politicians with a track record in Hansard long enough to provide a good baseline for time-series analysis?

Instead, we could do a synchronic comparison and look at which politicians seem to be diverging from the average. Of course, some might object that this would be a comparison against a highly unusual and self-selected sample. Another objection might be that the whole idea is simply too cruel. Yet a further objection might be the classic one that there are some things man should not know.
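For what it's worth, the synchronic version is only a few lines of Python. This is a rough sketch, not the method from the paper: it assumes a plain type-token ratio (distinct words over total words) as the measure of vocabulary richness, and scores each speaker by how far they sit from the group average. The function names and the 1,000-word sample size are my own inventions for illustration.

```python
import re
from statistics import mean, pstdev

WORD_RE = re.compile(r"[a-z']+")


def vocabulary_richness(text, sample_size=1000):
    """Type-token ratio over the first `sample_size` words.

    Truncating to a fixed sample keeps speakers comparable, because the
    raw ratio falls as a text gets longer.
    """
    words = WORD_RE.findall(text.lower())[:sample_size]
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def divergence_from_average(speeches, sample_size=1000):
    """Map each speaker to a z-score of their richness against the group."""
    scores = {name: vocabulary_richness(text, sample_size)
              for name, text in speeches.items()}
    avg = mean(scores.values())
    spread = pstdev(scores.values())
    if spread == 0:
        # Everyone scored the same, so nobody diverges.
        return {name: 0.0 for name in scores}
    return {name: (score - avg) / spread for name, score in scores.items()}
```

A strongly negative score would flag a speaker whose vocabulary is narrower than their colleagues', which is exactly the comparison against a "highly unusual and self-selected sample" that the objection is about.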

Update: Implemented!

so you want to know who’s lobbying?

So I was moaning about the Government and the release of lists of meetings with external organisations. Well, what about some action? I’ve written a scraper that aggregates all the existing data and sticks it in a sinister database. At the moment, the Cabinet Office, DEFRA, and the Scottish Office have coughed up the files and are all included. I’m going to add more departments as they become available. ScraperWiki seems to be a bit sporky this evening; the whole thing has run to completion, although for some reason you can’t see all the data, and I’ve added the link to the UK Open Government Licence twice without it being saved.

A couple of technical points: to start with, I’d like to thank this guy who wrote an alternative to Python’s csv module’s wonderful DictReader class. DictReader is lovely because it lets you open a CSV (or indeed anything-separated values) file and keep the rows of data linked to their column headers as Python dictionaries. Unfortunately, it won’t handle Unicode strings, or input in any encoding other than UTF-8. Which is a problem if you’re Chinese, or as it happens, if you want to read documents produced by Windows users, as they tend to use Really Strange characters for trivial things like apostrophes (\x92, can you believe it?). This, however, will process whatever encoding you give it and will still give you dictionaries. Thanks!
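For anyone reading this on Python 3, where the csv module works on text rather than bytes, the same effect needs no replacement class: decode first, then hand the text to DictReader. This is a minimal sketch and not the class thanked above; the function name is made up, and the default of cp1252 is an assumption based on the \x92 apostrophe, which is what Windows-1252 uses for a right single quotation mark.

```python
import csv
import io


def read_dict_rows(raw_bytes, encoding="cp1252"):
    """Decode raw CSV bytes and return the rows as dictionaries.

    The encoding default is an assumption: cp1252 is the usual source
    of a \\x92 byte standing in for an apostrophe.
    """
    # newline="" leaves line endings alone so the csv module can
    # handle them itself, as its documentation recommends.
    text = io.StringIO(raw_bytes.decode(encoding), newline="")
    return list(csv.DictReader(text))
```

Given a row whose "Purpose of meeting" cell contains the byte \x92, the dictionary value comes back with a proper ’ character in it.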

I also discovered something fun about ScraperWiki itself. It’s surprisingly clever under the bonnet – I was aware of various smart things with User Mode Linux and heavy parallelisation going on, and I recall Julian Todd talking about his plans to design a new scaling architecture based on lots of SQLite databases in RAM as read-slaves. Anyway, I had kept some URIs in a list, which I was then planning to loop through, retrieving the data and processing it. One of the URIs, DEFRA’s, ended like so: oct2010.csv.

Obviously, I liked the idea of generating the filename programmatically, in the expectation of future releases of data. For some reason, though, the parsing kept failing as soon as it got to the DEFRA page. Weirdly, what was happening was that the parser would run into a chunk of HTML and, obviously enough, choke. But there was no HTML. Bizarre. Eventually I thought to look in the ScraperWiki debugger’s Sources tab. To my considerable surprise, all the URIs were being loaded at once, in parallel, before the processing of the first file began. This was entirely different from the flow of control in my program, and as a result, the filename was not generated before the HTTP request was issued. DEFRA was 404ing, and because the csv module takes a file object rather than a string, I was using urllib.urlretrieve() rather than urlopen() or scraperwiki.scrape(). Hence the HTML: the 404 error page was being saved and handed to the parser as though it were the CSV.
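With hindsight, the cheap defence is to check what came back before letting the csv parser anywhere near it. Here is a sketch of the two pieces, under stated assumptions: that the filenames follow a month-abbreviation-plus-year pattern (I only have the one example, oct2010.csv, to go on), and that an error page can be spotted from its status code, its content type or a leading "<". Both function names are hypothetical.

```python
import csv
import io
from datetime import date

# Spelled out rather than taken from strftime("%b"), which depends on locale.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def release_filename(when):
    """Build a name like 'oct2010.csv' (assumed naming pattern)."""
    return "%s%d.csv" % (MONTHS[when.month - 1], when.year)


def parse_csv_body(status, content_type, body, encoding="cp1252"):
    """Parse a fetched body as CSV, refusing anything that smells of HTML."""
    if status != 200:
        raise ValueError("HTTP status %d, not parsing" % status)
    if "html" in content_type.lower() or body.lstrip()[:1] == b"<":
        raise ValueError("got HTML where a CSV was expected")
    text = io.StringIO(body.decode(encoding), newline="")
    return list(csv.DictReader(text))
```

The arguments would come from whatever did the fetching: with Python 3's urllib.request.urlopen(), for instance, that is response.status, response.headers.get("Content-Type", "") and response.read().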

So, ScraperWiki does a silent optimisation and loads all your data sources in parallel on startup. Quite cool, but I have to say that some documentation of this feature might be nice, as multithreading is usually meant to be voluntary :-)

TODO, meanwhile: at the moment, all the organisations that take part in a given meeting are lumped together. I want to break them out, to facilitate counting the heaviest lobbyists and feeding visualisation tools. Also, I’d like to clean up the “Purpose of meeting” field so as to be able to do the same for subject matter.
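The breaking-out step might look something like the following. It is a sketch under one big assumption, namely that the organisations in a cell are separated by semicolons or commas; the real files may well do something different from department to department, and an organisation with a comma in its name would be split wrongly. The function names are invented, and "Name of External Org" is the column header from the releases.

```python
import re
from collections import Counter


def split_organisations(field, separators=";,"):
    """Split one lumped-together cell into individual organisation names.

    The separators are an assumption about how the cells are written.
    """
    pattern = "[" + re.escape(separators) + "]"
    return [part.strip() for part in re.split(pattern, field) if part.strip()]


def count_lobbyists(meetings, field="Name of External Org"):
    """Count how many meetings each organisation took part in."""
    counts = Counter()
    for row in meetings:
        # set() so an organisation listed twice in one meeting counts once.
        counts.update(set(split_organisations(row.get(field, ""))))
    return counts
```

After that, count_lobbyists(rows).most_common(10) gives the ten organisations that appear in the most meetings.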

Update: Slight return. Fixed the unique keying requirement by creating a unique meeting id.

Update Update: Would anyone prefer if the data output schema was link-oriented rather than event-oriented? At the moment it preserves the underlying structure of the data releases, which have one row for each meeting. It might be better, when I come to expand the Name of External Org field, to have a row per relationship, i.e. edge in the network. This would help a lot with visualisation. In that case, I’d create a non-unique meeting identifier to make it possible to recreate the meetings by grouping on that key, and instead have a unique constraint on an identifier for each link.
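To make the proposal concrete, here is roughly what a link-oriented table could look like in SQLite. It is a sketch rather than the schema I ended up with: the "Minister" and "Date" column names, the semicolon separator and the hash-based identifiers are all assumptions made for the example.

```python
import hashlib
import sqlite3


def make_id(*parts):
    """Stable identifier derived from the given fields."""
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()[:16]


def meeting_to_links(meeting):
    """Expand one meeting row into one row per minister-organisation edge."""
    orgs_field = meeting["Name of External Org"]
    m_id = make_id(meeting["Minister"], meeting["Date"], orgs_field)
    links = []
    for org in orgs_field.split(";"):  # assumed separator
        org = org.strip()
        if org:
            links.append((make_id(m_id, org), m_id,
                          meeting["Minister"], org, meeting["Date"]))
    return links


def store_links(conn, meetings):
    """Create the links table if needed and insert every edge."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS links (
            link_id      TEXT PRIMARY KEY,  -- unique per edge
            meeting_id   TEXT NOT NULL,     -- shared by a meeting's edges
            minister     TEXT,
            organisation TEXT,
            date         TEXT
        )""")
    for meeting in meetings:
        conn.executemany(
            "INSERT OR REPLACE INTO links VALUES (?, ?, ?, ?, ?)",
            meeting_to_links(meeting))
    conn.commit()
```

Recreating the original meetings is then a matter of grouping on the non-unique key, for example SELECT meeting_id, group_concat(organisation) FROM links GROUP BY meeting_id.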

Update Update Update: So I made one.