oiling the steel to sharpen the blade to shave the yak

Progress update on fixing the Vfeed.

Dubai Airport has done something awful to their Web site. Where once flights were organised in table rows with class names like “data-row2”, now exactly half the flights are like that; the listings have been split between separate arrival, departure, and cargo-only pages; each page only shows the latest dozen or so movements; and the rows that aren’t “data-row2” don’t have class attributes at all, just random HTML colours.

And the airline names have disappeared, replaced by their logos as GIFs. Unhelpful, but then, why should they want to help me?

Anyway, I’ve solved the parsing issue with the following horrible hack.
output = [[td.string or td.img["src"] for td in tr.findAll(True) if td.string or td.img] for tr in soup.findAll('tr', bgcolor=lambda value: value == 'White' or value == '#F7F7DE')]

As it happened, I later realised I didn’t need to bother grabbing the logo filenames in order to extract airline identifiers from them after all, so the td.img["src"] bit can be dropped.
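Here’s what the simplified version looks like with the image lookup dropped — a sketch only, assuming the modern bs4 package (where find_all is the new spelling of findAll), and run against an invented stand-in for the DXB markup:

```python
# Sketch of the simplified scrape. Assumes bs4 (Beautiful Soup 4) is
# installed; the HTML below is an invented sample mimicking the DXB rows.
from bs4 import BeautifulSoup

html = """
<table>
  <tr bgcolor="White"><td>EK201</td><td>New York</td></tr>
  <tr bgcolor="#F7F7DE"><td>BA107</td><td>London</td></tr>
  <tr bgcolor="Silver"><td>not flight data</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only rows in the two colours used for flight data, then pull out
# each cell's text; no more poking at img tags.
output = [
    [td.string for td in tr.find_all("td") if td.string]
    for tr in soup.find_all("tr", bgcolor=lambda v: v in ("White", "#F7F7DE"))
]
```

That gives you one list of cell strings per matching row; rows in any other colour fall through the bgcolor filter.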

But it looks like I’m going to need to do the lookup from ICAO or IATA identifiers to airline names myself, which is necessary to avoid having to remake the whitelist, the database, and the stats script. Fortunately, there’s a list on Wikipedia. The good news is that I’ve come up with a way of distinguishing the ICAO from the IATA codes in flight numbers. ICAO codes are always three alphabetical characters; IATA codes are two alphanumeric characters, and aren’t necessarily globally unique. In a flight number, either can be followed by a number of variable length.

But if the third character in the flight number is a digit, the first two characters must be an IATA identifier; if it’s a letter, the first three must be an ICAO identifier.
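As a sketch, that third-character rule comes out as something like this (the flight numbers in the comments are invented examples, and airline_code is a hypothetical name):

```python
def airline_code(flight_number):
    """Split a flight number into (code_type, airline_code, number).

    Third character a digit -> two-character IATA code (e.g. "EK201").
    Third character a letter -> three-character ICAO code (e.g. "UAE201").
    """
    if len(flight_number) > 2 and flight_number[2].isdigit():
        return ("IATA", flight_number[:2], flight_number[2:])
    return ("ICAO", flight_number[:3], flight_number[3:])
```

The rule works because an ICAO code is all letters, so a digit can only ever appear in position three if the prefix is a two-character IATA code.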

5 Comments on "oiling the steel to sharpen the blade to shave the yak"

  1. What’s the language? I have an idea for a web-scraping application (basically a book search tool that checks Amazon/eBay/AbeBooks etc), but no clue as to what to use.

    The last time I did such a thing was 10 years back. I only wanted the data from one site, so I downloaded the entire site, merged the HTML into one file (with a DOS command, I think), stripped the tags, and parsed the data with VB to produce CSV output. Worked, but clunky.

    Can’t do that for multiple sites – I need to scrape the screens. How? Last time I asked you about the airport data you said the code was ‘of my own devising’!


  2. The Vfeed is implemented in Python (like the code snippet above). Python has a fantastic third-party library for parsing HTML and XML documents called Beautiful Soup, which will screenscrape pretty much anything into useful data structures.

    For example, that snippet finds all tr tags that have the bgcolor attribute with either the value “White” or “#F7F7DE”, then finds all td tags within each tr that have either a single string or an img tag as their content, and returns the string or the image filename as a list of Python lists. That you can do this in a one-liner, admittedly a tortuous one, is one of the reasons to use Python – that’s a nested list comprehension with a lambda function passed as a keyword argument.

    If you need a client-side solution, you’d be looking at JavaScript and XPath.

    Depending on how complex the job is, you might be able to get away with a Yahoo Pipe.

    As far as DXB’s motives go, I don’t think so – it looks like the Pointy-Headed Boss wanted the colours changing one morning and they pushed out a really hacky fix.

