A ScraperWiki snippet – The Yorkshire Ranter

OK, so I found out why data was apparently disappearing. Or rather David Jones from ScraperWiki did. It turns out the ScraperWiki datastore works rather differently to what you might expect. I theorise this has something to do with the fact it was a key-value store before it was an SQLite database, as it tastes of both.

When you call scraperwiki.sqlite.save(unique_keys=[], data), you’re expected to provide a list of unique keys and a dict of data (or a list of dicts). Each dict will become a row in the database, and it contains the column names as keys and the values as, well, values. The unique_keys parameter is used to identify if you’re creating a new row or updating an existing one, i.e. differentiating an SQL INSERT from an SQL UPDATE.

Now, if you pass a new column name in the data, the API will automagically alter the table and add the column. So, you might think that if you made a call to save() setting some unique keys and adding data to a column, and then a second call inserting data into a different column, the effect would be equivalent to a single call inserting to both columns.

mydict = {'column1': 'Ed Vaizey', 'column2': 'foo'} scraperwiki.sqlite.save(unique_keys=['column1'], mydict) myseconddict = {'column1': 'Ed Vaizey', 'column3': 'bar'} scraperwiki.sqlite.save(unique_keys=['column1'], myseconddict)

Looks like it might be equivalent to:

mydict.update(myseconddict) scraperwiki.sqlite.save(unique_keys=['column1'], mydict)

And the result would be that the row identified as the amusing government minister would end up with column2 = foo and column3 = bar. After all, if you did:

UPDATE sometable SET column2 = 'foo' WHERE column1 = 'Ed Vaizey'

you’d expect column2 in that row to equal foo and nothing else to change. But if you execute the Python code above in ScraperWiki, you’ll get a nasty surprise, because the columns that you didn’t specify in the data will be overwritten with nulls. What seems to happen is that the values in the dict get mapped onto the database columns by their keys, and then the row is replaced in its entirety. If there aren’t enough values to map all the columns, they default to NULL.

It’s not very much like either an SQL INSERT or an UPDATE. It’s not very much like a Python dict either. Consider:

mydict = {'column1': 'value1', 'column2': 'value2', 'column3': 'value3'} print mydict['column1'] >>> value1 mydict['column1'] = 'foo' # or equivalently mydict.update({'column1': foo}) print mydict >>> {'column1': 'foo', 'column2': 'value2', 'column3': 'value3'} # dicts aren't ordered, so not necessarily in that order, but certainly not... >>> {'column1': 'foo', 'column2': None, 'column3': None}

So, it doesn’t behave like an SQLite database, nor a Python data structure, but rather something else entirely. If you’re going to update an existing row in the SW Datastore, or add more columns to it, or even write to less than a full set of columns…don’t, because weirdness lurks. It is wise to write only full rows. If you need to do an UPDATE, write SQL explicitly.

Leave a Reply Cancel reply