I am really impressed both by this OpenNews post about how to tackle a huge pile of documents and by the tools it recommends. After all:

What I received a month later from Nash County, N.C., were two boxes filled with thousands of printed pages of emails. Double-sided.

One of the problems these tools solve is this: your filesystem is usually very, very good at finding files, fast and on all kinds of criteria (just look at any unix/linux find examples page), but that presupposes that the information you have is broken out into files whose boundaries map roughly to a logical structure within the underlying data. Two boxes of printed emails give you no such boundaries.
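To make the "all kinds of criteria" point concrete, here is a rough Python sketch of the sort of search that find does, once the data really is broken out into sensible files. The function name `find_files` and its parameters are my own invention for illustration, not anything from the OpenNews post or its tools.

```python
from pathlib import Path
import time


def find_files(root, suffix=None, min_bytes=0, newer_than_days=None):
    """Yield files under root that match every criterion given.

    Hypothetical helper: a loose analogue of `find` with -name, -size
    and -mtime style filters, not a full reimplementation.
    """
    cutoff = None
    if newer_than_days is not None:
        # Anything modified before this timestamp is considered too old.
        cutoff = time.time() - newer_than_days * 86400

    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        if suffix is not None and path.suffix != suffix:
            continue
        info = path.stat()
        if info.st_size < min_bytes:
            continue
        if cutoff is not None and info.st_mtime < cutoff:
            continue
        yield path
```

Something like `list(find_files("mail", suffix=".eml", newer_than_days=30))` would then list recent messages, assuming a hypothetical `mail` directory with one email per file. That only works because each file is one email, which is exactly the assumption a box of paper breaks.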

One of the best things is also the simplest: Overview has a feature that pulls a randomly selected sample of documents.
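I have no idea how Overview implements this internally, but the idea itself fits in a few lines. A minimal sketch, assuming the documents are already in a list; `sample_documents` and its `seed` parameter are hypothetical names of mine:

```python
import random


def sample_documents(documents, k, seed=None):
    """Return up to k documents chosen uniformly at random, without repeats.

    Hypothetical helper for illustration; this is not Overview's code.
    Passing a seed makes the sample reproducible.
    """
    docs = list(documents)
    rng = random.Random(seed)
    # random.sample raises ValueError if k exceeds the population size,
    # so cap k at the number of documents available.
    return rng.sample(docs, min(k, len(docs)))
```

Reading a random handful gives a quick, unbiased feel for what is in the pile before you commit to reading all of it, and fixing the seed lets someone else pull the same sample.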

The blog is crazy good, too. Interestingly, I remember IBM announcing their big investment in big data the other year and giving “Computational Journalism” as one of the use cases.

Did I say the blog was good? The blog is good.
