Opening up data seems an inherently good thing, but what about the risks you take on when using the data, particularly in a context it wasn’t collected for?
…assume random error rates of up to 10% on original data entry unless the material was entered and checked by those with a vested interest in its accuracy and with the knowledge and authority to ensure that errors were identified and corrected. We were also told to assume that it would subsequently degrade at about 10% per annum unless actively used and updated by those with the knowledge and ability to update the files.
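To see how quickly that rule of thumb bites, here is a minimal sketch (the function name and the compounding assumption are mine, not from the quoted advice) of what 10% entry errors plus 10% annual decay implies for the fraction of records you can still trust:

```python
# Rough sketch of the rule of thumb quoted above: up to 10% of records
# wrong at entry, with accuracy decaying a further 10% per year when
# nobody with the knowledge and authority to fix errors maintains the data.
# Assumes errors are independent and decay compounds multiplicatively.
def expected_accuracy(years_unmaintained, entry_error=0.10, annual_decay=0.10):
    """Fraction of records still expected to be correct."""
    return (1 - entry_error) * (1 - annual_decay) ** years_unmaintained

for years in range(6):
    print(f"after {years} years: {expected_accuracy(years):.0%} accurate")
```

On those assumptions, an unmaintained dataset is down to roughly half its records being reliable within about five years.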
(This would of course apply to a lot of corporate-collected data too.)
Or, in the words of the Audit Commission:
The priority for local public bodies has often been to ensure the quality of the data needed for top-down performance management. Unwittingly, the requirements of submitting data nationally have sometimes eclipsed the requirements of frontline service delivery and public need. (para 20 of the report)
That implies that some fields (the ones central government are interested in) will be more accurate and consistently recorded than others (everything else). The solution is in the same paragraph:
… Data generation should be a by-product of normal business, not an end in itself. The starting point should be ‘what data does the frontline need to deliver its business well, and for us to know that is happening?’
That’s the way to ensure any errors get fixed immediately.
Maybe the open-data movement needs to start thinking about a way of marking the relative dodginess of a source of data: so that for instance if you know 10% of records may be wrong in one dataset and 5% in another, you can judge what margin to add in when making decisions based on an analysis of the data.
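A hypothetical sketch of what such a marker might buy you: if each dataset declared an error rate, an analysis could propagate a worst-case margin through any result that combines them (the function below is my illustration, assuming errors are independent per record):

```python
# Hypothetical sketch: given declared error rates for each source dataset,
# estimate the chance that a result combining one record from each
# contains at least one error. A combined record is clean only if
# every contributing record is clean (assuming independent errors).
def combined_error_rate(*error_rates):
    """Worst-case fraction of combined records expected to contain an error."""
    clean = 1.0
    for rate in error_rates:
        clean *= (1 - rate)
    return 1 - clean

# The example from the text: one dataset 10% wrong, another 5% wrong.
print(f"{combined_error_rate(0.10, 0.05):.1%}")  # prints "14.5%"
```

So combining the two example datasets above means budgeting for getting on for 15% of joined records being suspect, noticeably worse than either source alone.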
Is there some sort of implicit assumption that opening up the data will force a general clean-up and improvement in quality? Even so, we should never assume the data is perfect.
Update (later on 9 Nov): Coincidentally, Ton Zijlstra has posted a great “Open Gov Data Poster Flow Chart” he developed with James Burke over at his blog. It’s designed to help civil servants “decide if and how it is ok to open up data sets they have available” …the current version doesn’t include an explicit step for thinking about whether the data quality would support (safe) export, but maybe that will change in future.
PS This was triggered by a posting in Emma Mulqueeny’s blog – which I’ve just come across – it seems a great place for tracking what’s happening with open data in the UK government. Something to add to the list.