scraping the barrel
I’ve finally got around to answering my own question here. The scraper is a work in progress at the moment; the original PDF is rendered by pdftohtml into a tiresomely semi-structured (i.e. worse than no structure) tagpile. I was trying to tackle this through recursion, but I might either try using Python’s continue keyword or perhaps…