Opendata + poor quality control = trouble?

Opening up data seems an inherently good thing, but what about the risks you take on when using the data, particularly in a context it wasn’t collected for?

Philip Virgo, discussing an Audit Commission report last week on the quality of the data recorded by government organisations, pointed out that he was taught to:

…assume random error rates of up to 10% on original data entry unless the material was entered and checked by those with a vested interest in its accuracy and with the knowledge and authority to ensure that errors were identified and corrected. We were also told to assume that it would subsequently degrade at about 10% per annum unless actively used and updated by those with the knowledge and ability to update the files.

(This would, of course, apply to a lot of corporately collected data too.)

Or, in the words of the Audit Commission:

The priority for local public bodies has often been to ensure the quality of the data needed for top-down performance management. Unwittingly, the requirements of submitting data nationally have sometimes eclipsed the requirements of frontline service delivery and public need. (para 20 of the report)

That implies that some fields (the ones central government is interested in) will be more accurately and consistently recorded than others (everything else). The solution is in the same paragraph:

… Data generation should be a by-product of normal business, not an end in itself. The starting point should be ‘what data does the frontline need to deliver its business well, and for us to know that is happening?’

That’s the way to ensure any errors get fixed immediately.

Maybe the open-data movement needs to start thinking about a way of marking the relative dodginess of a source of data: so that, for instance, if you know 10% of records may be wrong in one dataset and 5% in another, you can judge what margin to add when making decisions based on an analysis of the data.
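To make that concrete, here is a minimal sketch (in Python; the helper function and the figures are invented for illustration) of how a published error rate could be carried through to a band around any figure derived from the dataset:

```python
# Illustrative only: the helper, the error rate and the figures are invented.
def with_error_margin(value, error_rate):
    """Return a rough (low, high) band around a figure derived from a dataset,
    assuming up to `error_rate` of the underlying records may be wrong."""
    margin = value * error_rate
    return value - margin, value + margin

# Suppose an analysis of a dataset with a ~10% known error rate says
# 4,200 households qualify for a service:
low, high = with_error_margin(4200, 0.10)
print(f"somewhere between {low:.0f} and {high:.0f} households qualify")
# -> somewhere between 3780 and 4620 households qualify
```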

Is there some sort of implicit assumption that opening up the data will force a general clean-up and improvement in quality? Even so, we should never be assuming the data is perfect.

Just wondering…

Update (later on 9 Nov): Coincidentally, Ton Zijlstra has posted a great “Open Gov Data Poster Flow Chart” he developed with James Burke over at his blog. It’s designed to help civil servants “decide if and how it is ok to open up data sets they have available”. The current version doesn’t include an explicit step for thinking about whether the data quality would support (safe) export, but maybe that will change in future.

 


PS This was triggered by a posting on Emma Mulqueeny’s blog – which I’ve just come across – it seems a great place for tracking what’s happening with open data in UK government. Something to add to the list.

 



8 Responses to Opendata + poor quality control = trouble?

  1. Mulqueeny says:

    Funnily enough, that is exactly what is happening. Data already has a star rating in some departments depending on its cleanliness, based on method of capture and so on. Opening data up is going to mean that data produced going forward will have to be created with a view to its reuse.

  2. Thanks Emma, that’s great to hear! I love it when people get there ahead of me.

    Two things I’d like to find out more about:

    1. Whether it would be useful to differentiate the quality of fields within a data set (due to the factors the Audit Commission discuss in their report) – so a dataset may be a mix of 5-star and 2-star quality data, and

    2. If you mash together a 4-star dataset with a 3-star dataset – will the quality of the result be only 2-star? And how do you check, and how many people understand propagation of uncertainty anyway?
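    By propagation I mean something like this rough, back-of-the-envelope sketch (Python; the figures are invented, and it assumes the errors in the two datasets are independent, which they often won’t be):

```python
# Rough sketch only: assumes record errors in the source datasets are
# independent, and that a merged record is only right if every source
# record it draws on is right. Real datasets rarely behave this neatly.
def combined_error_rate(*error_rates):
    clean = 1.0
    for rate in error_rates:
        clean *= (1.0 - rate)
    return 1.0 - clean

# Mashing a dataset with ~10% bad records into one with ~5%:
print(f"{combined_error_rate(0.10, 0.05):.1%}")  # -> 14.5%
```

    So two respectable-looking sources can easily combine into something noticeably worse than either.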

    People will come up with best practice in due course I’m sure 🙂

  3. Pingback: Mashable data quality part 2: Local government in England « Spartakan

  4. guess says:

    Well, it is conceptually very difficult to rate data quality for a particular task that has not been specified, such as merging with other data, and the idea is fundamentally flawed, nice though it sounds.

    Data quality can be defined and measured, whether for individual fields or entire datasets, on its own without any idea of the intended use, except in general terms concerning the design, collection, processing and dissemination processes. For example, there is ONS guidance giving a general explanation of quality at http://www.statistics.gov.uk/about/data/methodology/quality/projects/what_is_quality.asp and more detail on its measurement at http://www.statistics.gov.uk/StatBase/Product.asp?vlnk=13578, detailing a wide range of quality measures and indicators grouped into stages of the statistical production process. There is more general guidance on systems producing statistics within the National Statistician’s guidance on Quality, Methods and Harmonisation at http://www.statisticsauthority.gov.uk/national-statistician/guidance

    There is much other guidance out there on measuring data quality, both within the UK and from international statistical organisations.

    However, combining two data sets of high quality will not automatically result in an outcome of high quality, for all sorts of reasons: definitional differences, processing errors, matching errors, analysis and interpretation errors, or simply because sample sizes are too small in one or other of the original datasets, so that the resulting overlap/combination is too small to provide meaningful results, or because of a lack of understanding of the limitations of the original data or of how it was collected. Equally, it is perfectly possible that a combination of two ‘low quality’ data sets can provide ‘high quality’ information if what is needed is, for example, a rough and ready ‘ballpark’ result with wide error bounds. So it is impossible to generalise.

    Essentially, there is no substitute for assessing ‘fitness for purpose’: identify the purpose first, then use knowledge of the data’s characteristics, including collection methodologies and the quality of particular data fields, together with a sceptical approach to any results that appear surprising or odd, or that are out of line with results from other sources and research, or from new bespoke collection [if any].
    🙂

  5. I couldn’t have put it better myself, and thanks for the supporting links! It’s exactly the error margins in results from unexpected combinations of data that I was worrying about.

    I believe open data is a good thing but I think that data mashers need to be aware of the risks they might be propagating. Maybe the challenge is to find a good way of making the uncertainties more obvious to the people using the data?

    • Anonymous says:

      Providing metadata about data is a perennial problem, and in an ideal world every data field would be well documented, including the data collection instrument and the overall collection methodology. This is well done for many of the survey datasets held at the UK Data Archive, which will usually contain details of questionnaires used and sampling methods, purposes of collection and other technical information such as results of pilot exercises. Nevertheless, analysts should not be treated as idiots: the onus is rightly on them to take the time to find out about any data they plan to use, whether by direct analysis [for example, looking at levels of missing data (and whether these relate to particular parts of the dataset), unusual values/outliers, and the distribution of data values] or by making contact with others doing similar analysis or with the original data producers.
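      For illustration, that sort of direct analysis might look something like this sketch in pandas (the file name and the column are hypothetical):

```python
# A quick illustration of those direct checks using pandas; the file name
# and the "recorded_value" column are hypothetical.
import pandas as pd

df = pd.read_csv("some_open_dataset.csv")

print(df.isna().mean().sort_values(ascending=False))  # share of missing values in each field
print(df.describe())  # ranges and means: a first look at the distribution of values

# Crude outlier check on the hypothetical numeric column:
values = df["recorded_value"]
z_scores = (values - values.mean()) / values.std()
print(df[z_scores.abs() > 3])  # rows more than three standard deviations from the mean
```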

      Explaining uncertainty is a perennial problem for weather forecasters and all other scientists who collect and analyse data. The best training in uncertainty, apart from actually having to learn to collect, process and analyse data and then explain all the sources of potential bias [from conceptual and definitional through collection, processing and interpretation] to a non-technical audience, would probably be to get folk to read ‘How to Lie with Statistics’ by Darrell Huff http://en.wikipedia.org/wiki/How_to_Lie_with_Statistics

      As a shorter alternative, you could get them to read the recent article by David Spiegelhalter in The Times http://www.timesonline.co.uk/tol/comment/columnists/guest_contributors/article6901649.ece

  6. Thanks for that, anonymous! More useful links, too.

    I’d agree that professional analysts are likely to know what they’re doing, and how far to push the data.

    The challenge I see is that opendata, um, opens up the analysis to anyone with some basic techy skills – and that element of common sense or experience might go missing in the search for a groovy visualisation or a big headline.

    It would be a useful lesson if we could find an example where a mashup has produced results that turned out to be an artefact of the data – maybe creating a news story that later had to be retracted?
