
Informative Provenance for Repurposed Data: A Case Study Using Clinical Research Data

Richard Bache, Department of Informatics, Division of Health and Social Care Research, King’s College London

Simon Miles, Department of Informatics, King’s College London

Bolaji Coker, Division of Health and Social Care Research, King’s College London and Biomedical Research Centre, Guy’s and St Thomas’ NHS Foundation Trust

Adel Taweel, Department of Informatics, Division of Health and Social Care Research, King’s College London

Abstract

The task of repurposing heterogeneous, distributed data for research objectives that were not originally intended is a non-trivial problem because the mappings required may not be precise. A particular case is clinical data collected for patient care being used for medical research. The fact that research repositories will record data differently means that assumptions must be made as to how to transform this data. Records of provenance that document how this process has taken place will enable users of the data warehouse to utilise the data appropriately and ensure that future data added from another source is transformed using comparable assumptions. For a provenance-based approach to be reusable and supportable with software tools, the provenance records must use a well-defined model of the transformation process. In this paper, we propose such a model, including a classification of the individual ‘sub-functions’ that make up the overall transformation. This model enables meaningful provenance data to be generated automatically. A case study is used to illustrate this approach, and an initial classification of transformations that alter the information is created.

International Journal of Digital Curation (2013), 8(2), 27–46. http://dx.doi.org/10.2218/ijdc.v8i2.262


Introduction

There are many situations where data collected and stored to perform some routine activity can be repurposed for different uses, such as research. Where this data comes from several heterogeneous sources and a researcher wishes to combine it into a single warehouse with a consistent semantic representation (or indeed into several distributed homogeneous warehouses), a transformation is required. Such a warehouse would typically be designed according to some information model appropriate for the new (research) purpose. There is no guarantee that mapping the data from the repurposed sources will be exact or accurate during the transformation process. Thus, for a researcher to make full and accurate use of this data, its provenance must be made explicit.

A case in point, and the focus of this paper, is clinical data collected from electronic healthcare record (EHR) systems. Data collected for the routine care of individual patients at particular clinical sites (e.g. GP practices or hospitals) can be repurposed for epidemiological research or for identifying and recruiting patients for clinical trials. Yet to be usable, such data need to be transformed into a consistent, meaningful and standardised form suitable for the new purpose. Constructing a common clinical research database from these diverse sources is a challenging activity and requires making a number of non-trivial assumptions about the transformations required, owing to the inconsistent way in which data will be recorded and the fact that data will often be missing or incomplete. Failing to maintain consistency in these assumptions when transforming data from different sources will almost certainly lead to inaccuracies that may, in some cases, invalidate the results of the research.

In this paper, we propose an approach for expressing the provenance of clinical data transformations, whilst recognising its application to other domains where data is transformed from one repository to another. This not only informs users how the data was derived but also makes the transformation process more transparent and thus both consistent and well formulated so that it can be reproduced in a comparable way for other data sources.

The Scope of the Problem

The task of populating the research database here is often referred to as an ETL (Extract-Transform-Load) process. Indeed, ETL is a standard activity when migrating data from one system to another. There are tools such as Talend¹ and SSIS² (SQL Server Integration Services) to support this activity. However, these tools do not utilise provenance to maintain consistency in the transformation assumptions made, which may lead to inconsistencies in the derivations made later from the data.

The problem of constructing a clinical research database from EHR data can be summarised as having some or all of the following characteristics:

1 Talend: http://www.talend.com
2 SQL Server Integration Services: http://msdn.microsoft.com/en-us/library/ms141026.aspx


1. The clinical research database was designed for the purpose of research and may not be directly compatible with all potential sources of data, which may not be known when the research database was designed;
2. The source EHR systems were not designed for the purpose of the research and thus may only contain partial or incomplete data;
3. The source EHR systems vary in both the data they contain, including its meaning, and the way they represent it;
4. Much of the source data is irrelevant to the intended research purpose, e.g. information used for billing or contacting patients;
5. Some data reveals patient identity and so must be either discarded or obfuscated;
6. The source data may have already undergone some transformation for the purpose of standardising data from diverse sources, or been prepared for some other intended research purpose;
7. Some information is not held explicitly in the source data but is instead held either in metadata associated with the source or implied by the context. For example, units of measurement may be defined only in the accompanying documentation or implied by the context.

We shall refer to the intended clinical research warehouse as the ‘target database’. Henceforth, we shall consider a particular source of data used to populate the target as a single ‘source database’, although in hospitals there may actually be more than one database accessed. Without loss of generality, we shall use the concept of a relational database and its related vocabulary throughout, but note that the data may actually be represented as XML files or flat files. The latter could be readily converted to such a single-table database if the fields were suitably delimited. For XML, we will instead refer to elements and attributes but note that the underlying principle would apply. We note also that the databases could be virtual, where instead of the data being physically stored, it is synthesised at runtime in accordance with some abstract information model.

A further dimension to the problem is that patient records often store clinical concepts as categorical data, in which a given attribute may hold one of hundreds or thousands of possible values. Examples are diagnoses, prescribed drugs and procedures. To avoid the ambiguity and synonymy of natural language, as well as the possibility of typographical errors, each of these clinical concepts is given a code according to one of the many clinical coding systems, such as SNOMED³ or ICD-10⁴. Although mapping between coding systems is a non-trivial problem beyond the scope of this paper, difficulties arise when an exact match of concepts does not exist and it is necessary to map a specific concept to a more general one.

Clinical researchers using the target database will have a number of questions concerning the data to which they wish to gain answers:

3 SNOMED: http://www.connectingforhealth.nhs.uk/systemsandservices/data/uktc/snomed
4 International Classification of Diseases (ICD): http://www.who.int/classifications/icd/en/


1. Where did this data come from?
2. How has it been changed from its original form?
3. How accurate/reliable is it?

A simple example here is obfuscation of the date of birth (DOB) to protect confidentiality, so that all patients are deemed born on 1st January of the year in which they were actually born. This biases the patients’ apparent age, making all patients – on average – six months older than they really are; knowing this would inform the use of statistics on such data.
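As a small illustration (with hypothetical dates, not taken from the case study), the bias can be seen by comparing true and obfuscated dates of birth:

```python
from datetime import date

# Hypothetical true DOBs and their obfuscated counterparts (1 January of the same year).
true_dobs = [date(1980, 3, 15), date(1980, 9, 2), date(1980, 12, 30)]
obfuscated = [date(d.year, 1, 1) for d in true_dobs]

# Each obfuscated DOB is earlier than (or equal to) the true one, so any age
# computed from it is inflated; across a population, by roughly half a year.
bias_days = sum((t - o).days for t, o in zip(true_dobs, obfuscated)) / len(true_dobs)
print(f"average apparent age inflation: {bias_days:.0f} days")  # -> 228 days for this sample
```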

The Aims of this Paper

This paper provides an approach by which provenance data may be generated automatically from the data transformation process. This serves two purposes:

1. The end user can use the data appropriately by gaining answers to the questions above;
2. When adding data from another similar source, it is possible to ensure that the transformations are consistent with that data.

In order to represent the provenance as a sequence of distinct transformations, we first need to model the transformation process and provide a classification of the separate transformations that enables us to reason about them. This model not only enables the provenance to be expressed, but also acts to make explicit the required assumptions. Since, when performing ETL on clinical data, there are often many different conventions that could reasonably be adopted, the act of documenting assumptions often leads to those assumptions being made in a considered manner and assures a degree of consistency. Thus the modelling of the transformation process required for provenance is a useful step in its own right.

In this approach, the transformation process is driven by the needs of the intended research and thus consists of populating the research database from the source data as far as is practical. It is ‘target pull’ rather than ‘source push’. There is, therefore, a connection between provenance and the design of the ETL process. The former addresses the question: “Where does this data come from?”; the latter: “How can we capture this data and place it in our warehouse?”

The Structure of this Paper

In this paper, we first review the previous work on provenance. Next, we create a model of clinical research data transformation and two levels of classification over the various mappings used in transferring the information stored as data. We then describe a process for creating provenance data that may be automated, after which we offer a case study of a transformation of data from pre-processed EHR data to a research database. Provenance diagrams are created to illustrate the use of the model and also serve to document the ETL process. Finally, we offer examples of provenance questions that could be answered by this approach and some conclusions.


Background on Provenance

While the need for keeping records and documenting processes has always been evident, the automatic capture and querying of provenance information in software systems is a topic that has been of particular interest over the past decade, with hundreds of papers on the topic in recent years (Moreau, 2010).

In this time, ideas regarding data provenance in geoscience (Di et al., 2013), library studies⁵, bioinformatics and e-science (Miles et al., 2007), and other fields were brought together to allow the definition of generic, reusable models and representations of provenance, and approaches to its capture and access. These generic models are helpful skeletons around which application-specific models, such as the one presented in this paper, can be created.

The research has largely followed two paths. First, researchers with a focus on database systems defined approaches to determining and describing the provenance of database query results (Cheney, Chiticariu and Tan, 2007). The aim is to determine from where in the database the results derive, why those particular values are in the results, and why other values are excluded from the results (Chapman and Jagadish, 2009). In these approaches, the provenance is determined either by analysing the query along with the database after the query execution has completed, or by propagating provenance information during execution from the source data through intermediary results to accompany the query result.

The second path examined how to automatically record steps in a process, and came primarily from those developing workflow systems (Davidson et al., 2007). Here, the provenance of a process is captured as a side effect of its execution, and it is this approach we follow here. These approaches led to common models and representations of provenance, allowing disparate parts of a distributed system to each record their parts of the process and then combine the provenance afterwards. A widely used de facto standard is the Open Provenance Model (Moreau et al., 2011), which had a strong influence on the W3C standard, PROV⁶. Attempts have been made to begin to bridge the gap between the two paths discussed. For example, Acar et al. (2010) develop a data flow calculus that can describe workflow provenance but can also be used to answer questions posed regarding database provenance.

In PROV (and similarly in OPM), provenance is defined in terms of entities (things in the world, such as items of data), activities (processes which use and generate entities) and agents (those responsible for activities, most obviously people). A provenance record links these into a directed graph, documenting where an activity generated an entity, which was then used by another activity, for which an agent was responsible, and so on. In any given application, the vocabulary of PROV needs to be specialised to that of the application’s domain. Ideally, the specialised model is still somewhat general, allowing the model to be reused in other applications in the same domain. In this paper, we take transformation of clinical data as our domain and provide a specialised, reusable model accordingly.
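As a minimal, library-free sketch of this vocabulary (the identifiers are illustrative, not taken from the paper), a provenance record can be held as a small directed graph of entities, activities and agents:

```python
# Hedged sketch of a PROV-style provenance record as a plain Python structure.
# The node names (source:weight_lb, transform:pounds_to_kg, ...) are hypothetical.
graph = {
    "entities":   ["source:weight_lb", "target:weight_kg"],
    "activities": ["transform:pounds_to_kg"],
    "agents":     ["agent:etl_developer"],
    "edges": [
        ("target:weight_kg", "wasGeneratedBy", "transform:pounds_to_kg"),
        ("transform:pounds_to_kg", "used", "source:weight_lb"),
        ("transform:pounds_to_kg", "wasAssociatedWith", "agent:etl_developer"),
    ],
}

# Print each edge of the directed provenance graph.
for subject, relation, obj in graph["edges"]:
    print(f"{subject} --{relation}--> {obj}")
```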

5 For instance, the Dublin Core Metadata Initiative: http://dublincore.org/documents/dcmi-terms/
6 W3C PROV Primer: http://www.w3.org/TR/prov-primer/


The relevance of provenance to healthcare data is recognised, for example, by the GridProvenance project⁷ (Kifor et al., 2006). More detailed work has been done in using provenance to address the issues of confidentiality and unauthorised access to data (Mashima and Ahamad, 2012). Provenance has also been applied to the ETL process, although not specifically in the healthcare domain, by Freitas et al. (2012), who argue for a minimal vocabulary for expressing the ETL transformations. Such a vocabulary is not sufficient for our needs. Specifically, the problem of repurposing data (including its anonymisation) so that data from diverse sources is in a consistent format requires non-exact transformations, which, while altering the data, are necessary for its new purpose. No such provenance model exists for this purpose.

Modelling the Transformation Process

Taking account of the volume of data (often megabytes or gigabytes), the transformation process needs to be automatable so that, once defined, it should require no human intervention. The transformation of the source database to the target may be seen as a pure (i.e. stateless) function. Given that there may be any number of fields of data and considering each field as a dimension, we have a function in which the domain and codomain are infinite-dimensional spaces. To make this manageable, we can think of the function as being composed of many instances of a finite number of sub-functions that operate on specific entries in each database. Indeed, if we could not do this, this process could not be expressed as an algorithm implemented in a programming or scripting language. Thus this modelling process has two distinct steps: decomposing the transformation process into a number of smaller basic steps and then seeking to classify those steps.
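To make the decomposition concrete, the following is a minimal sketch, in Python rather than the Oracle scripting used in the case study, of how an overall transformation can be built as a composition of small sub-functions, each acting on a group of related field values. The function names (pounds_to_kg, round_weight, compose) are hypothetical, not taken from the paper's implementation.

```python
def pounds_to_kg(value, unit):
    """Faithful change of measurement scale: (number, unit) -> (number, unit)."""
    return (value * 0.453592, "kg") if unit == "lb" else (value, unit)

def round_weight(value, unit):
    """Lossy rounding to the nearest 0.1 kg."""
    return (round(value, 1), unit)

def compose(*steps):
    """Chain sub-functions; the overall transformation is their composition."""
    def composed(*fields):
        for step in steps:
            fields = step(*fields)
        return fields
    return composed

weight_transform = compose(pounds_to_kg, round_weight)
print(weight_transform(154.0, "lb"))   # -> (69.9, 'kg')
```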

Defining Sub-Functions

It is worth noting here that for any transformation process to be meaningful and useful, we have to understand the semantics of the data and consider the process as being one of extracting and transforming data in order to preserve its information content and meaning as far as this is possible.

Within each database there will be named tables with named fields (columns) representing the attributes of interest. It makes sense to group certain related fields together and treat them conceptually as a single molecule of data, consisting of atoms that correspond to specific values. For example, a measurement of some physical quantity (e.g. height) may have an explicit unit of measurement (e.g. metres). When converting from one unit to another, any sub-function should map both the number and the unit to another number and unit; mapping the numbers and units in isolation loses their semantic link. Thus we define functions that map one or more fields from one row in the source to one or more fields in the target. However, more complex situations may arise:

1. There will often be a chain of transformations. For the analysis described below, we need to ensure that each basic transformation is made explicit, e.g. converting weight in pounds to kilos or rounding to nearest 0.1 kg.

7 GridProvenance: http://www.gridprovenance.org/applications/EHCRS.html


2. We will often rely on metadata to make sense of the actual data. For example, the unit of measurement may be implicit if it is always the same for some physical quantity. It might be documented elsewhere or obvious from the context (see the sketch below).
3. Given there is unlikely to be a direct correspondence between the database schemata, there may need to be intermediate data structures or tables to process the data.
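As an illustration of point 2, the following hedged sketch shows a sub-function whose unit comes from accompanying documentation rather than from the source row itself; HEIGHT_UNIT and attach_unit are hypothetical names.

```python
# Assumed: the unit below is documented in the source's data dictionary,
# while the source rows carry only a bare number.
HEIGHT_UNIT = "cm"

def attach_unit(value):
    """Turn an implicit-unit atom into an explicit (value, unit) molecule."""
    return (float(value), HEIGHT_UNIT)

print(attach_unit("172"))   # -> (172.0, 'cm')
```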

We thus propose the following algorithm for constructing a model of the transformation process. This algorithm defines the steps that the member of technical staff implementing the ETL needs to follow in order to transform the data and also record the provenance; a sketch of the kind of machine-readable specification it can produce is given after the list. Once implemented, the ETL will run without human intervention.

For each table:

a. Identify the fields in the target database to be populated;

b. Group closely related fields (e.g. a measurement and its unit) together as a molecule;

c. For each molecule:

c.i. Identify the fields in the source database that provide relevant data;

c.ii. Determine the sequence of sub-functions needed to make the transformation from the source database or metadata, with appropriate intermediate data representations;

c.iii. Determine also any external sources of data – i.e. where the target is populated with data not from the source, e.g. a unit of measurement is implied in the source data but made explicit in the accompanying documentation;

c.iv. Determine the sub-function used and the data that acts as its input, recursively, back to the fields of the source database. This allows us to trace the sub-functions from the one whose output is in the target back to the ones whose inputs are in the source;

c.v. Assign a classification to each sub-function (as described below).
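The following is a hypothetical sketch of the kind of machine-readable specification that steps c.i–c.v could produce for a single target molecule; the field names, sub-function names and classifications are illustrative only and continue the earlier weight example.

```python
# One possible output of the modelling algorithm for one target molecule.
weight_molecule_spec = {
    "target_fields": ["observation.value", "observation.unit"],
    "source_fields": ["labs.weight_lb"],
    "external_inputs": {"unit": "lb"},           # implied by source documentation
    "sub_functions": [                            # applied in order
        {"name": "attach_unit",  "class": "presumptive-lossless"},
        {"name": "pounds_to_kg", "class": "conservative-lossless"},
        {"name": "round_weight", "class": "conservative-lossy"},
    ],
}
```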

Primary Classification of Sub-functions

As we stated above, the schemata and semantic interpretation of the source and target databases should be taken as given, so there will inevitably be differences in the way that information is represented in each. The primary classification determines whether information is lost and/or added during transformation. It provides four categories that are mutually exclusive and collectively exhaustive.


There will, of course, be much information from the source that is not used at all because it is not relevant to the research in question or else it compromises the privacy of the patients. However, for information that is actually used there may be situations where information is deliberately lost. For example, if the coding system used in the target database does not record data to the same level of granularity as the source, the sub-function may lose information by mapping to a coarser level of granularity, e.g. ‘type 2 diabetes’ to ‘diabetes’. This is an example of information subtraction, and we can classify any function as being either lossless or lossy. This is analogous to Woodruff and Stonebraker’s (1997) concept of invertible and non-invertible functions. A lossless function is reversible in that it will be possible to retrieve the original information, i.e. the function $f$ has an inverse function $f^{-1}$ where $f^{-1}(f(a)) = a$. If the original information cannot be recaptured, it is lossy.

The most severe case of a lossy function is ‘discard’, where the information is thrown away entirely. Suppose that a new drug has been designated a code in the source coding system but not yet in the target coding system. In this case one option would be to discard the record of that drug being prescribed, along (possibly) with associated data, such as the date prescribed and dosage.

There may be situations where information is actually added. In terms of information addition, we can classify each function as either conservative or presumptive. A function is conservative if no information is added. A presumptive function adds, and in many cases fabricates, information. Informally, we define a conservative mapping as one where nothing can be inferred from $f(a)$ that cannot be inferred from $a$. More formally, we can see a conservative function as a mapping $f : X \to Z$, whereas a presumptive mapping adds information from $Y$, so that $f : X \times Y \to Z$ where $f(x, y_1) \neq f(x, y_2)$ for some distinct $y_1, y_2 \in Y$.

An example of a presumptive function is conversion of a date in the source to a datetime (a type that stores both date and time of day) in the target. If the source records the date of discharge (from hospital) as 23/3/2012, then, to store this as a datetime, a time of day must be added. By convention, this is midnight. However, this leads us to infer that the patient was discharged at midnight, which is almost certainly not true.

Functions may both subtract and add information at the same time (presumptive lossy). An example is the obfuscation of the DOB, used to protect confidentiality. Often each DOB is set to the first day of the year. Such a function gives the patients, with >0.997 (=364/365) probability, the wrong birthday and cannot be reversed to give the correct one. Note that if the DOB were mapped just to the year of birth, this would be conservative lossy. To say that someone born on 22/8/1965 was born in 1965 is true; to say that he was born on 1/1/1965 would be untrue. However, for a database schema that requires a full date and not just a year as an integer, a fabrication is required. For a recipient of the data to use the data correctly, he would need to be aware of this convention and ignore the day and month.
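For concreteness, the following sketch illustrates three of the four classes using the DOB and datetime examples above (the fourth, a faithful copy, is simply the identity); the helper names are hypothetical.

```python
from datetime import date, datetime

def year_only(dob: date) -> int:
    """Conservative lossy: keeps only true information, but cannot be reversed."""
    return dob.year

def obfuscate_dob(dob: date) -> date:
    """Presumptive lossy: fabricates 1 January and discards the real day and month."""
    return date(dob.year, 1, 1)

def date_to_datetime(d: date) -> datetime:
    """Presumptive lossless: adds a fabricated time of day (midnight)."""
    return datetime(d.year, d.month, d.day, 0, 0)

dob = date(1965, 8, 22)
print(year_only(dob), obfuscate_dob(dob), date_to_datetime(dob))
# -> 1965 1965-01-01 1965-08-22 00:00:00
```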

Table 1 shows the four possible types of function, with examples:


| Information addition | Information subtraction: Lossless (invertible) | Information subtraction: Lossy (non-invertible) |
|---|---|---|
| Conservative | Identity mapping; transformation of measurement scale, e.g. kg > g | Loss of granularity, e.g. Lung cancer > Cancer; discard |
| Presumptive | Converting age (in years) to DOB; date to datetime by assuming time of day | Setting all DOB to first day of the year |

Table 1. Classification of functions by information loss/gain.

A function that is both conservative and lossless is faithful. If all functions are faithful then defining the transformation is largely straightforward, but this will seldom be the case. So, although the transformation process is automatic, defining this process requires subjective judgment, and there is a trade-off between losing good information and adding some false information. Considering the example above, we can avoid the (probable) falsehood that the patient was discharged at midnight by discarding the value, but we then also lose the date on which they were discharged. An alternative to a presumptive (and so fabricated) birthday is a conservative transformation that records no DOB at all, in which case we can infer nothing about the age of the patient. In these cases a small amount of fabrication seems justified. However, if no DOB were provided in the source at all, assuming an entirely arbitrary one is perhaps a fabrication too far.

Composition of Sub-Functions

If two or more functions are all lossless, then the composite will also be lossless. However, if any one is lossy, the composite will be lossy. Similarly, if two or more functions are conservative, the composite is conservative. However, if any one of them is presumptive then the composite is also presumptive.
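A minimal sketch of this composition rule, assuming each sub-function carries the classification labels used in the earlier specification sketch:

```python
def classify_composite(sub_function_classes):
    """A composite is lossy if any component is lossy, and presumptive if any
    component is presumptive; otherwise it is lossless/conservative."""
    lossy = any("lossy" in c for c in sub_function_classes)
    presumptive = any("presumptive" in c for c in sub_function_classes)
    return ("presumptive" if presumptive else "conservative",
            "lossy" if lossy else "lossless")

chain = ["presumptive-lossless", "conservative-lossless", "conservative-lossy"]
print(classify_composite(chain))   # -> ('presumptive', 'lossy')
```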

Because presumptive and lossy functions affect any composite function of which they are part, they are of most interest both in making assumptions when designing the ETL process and for provenance.

Addressing the Problems of Discard

It is normal for a database to have tables in which certain fields are mandatory (non-nullable). There is therefore a knock-on effect when a mandatory field is discarded: if one mandatory entry is not given a value, then the entire row must be discarded. Furthermore, if a discarded row has a primary key that is used as a mandatory foreign key in some other table, rows of other tables must also be discarded. Discard, although conservative, can therefore be quite destructive. A degree of presumption can retain data, inevitably adding some false information in order to preserve the true information. The provenance data should prevent the data recipient from drawing false inferences from the fabricated data.
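The following small sketch, with illustrative table and field names, shows this knock-on effect: a missing mandatory DOB discards the subject row, which in turn discards the encounter row that references it.

```python
# Hypothetical rows; neither the tables nor the values come from the case study.
subjects   = [{"id": 1, "dob": "1965-08-22"}, {"id": 2, "dob": None}]
encounters = [{"subject_id": 1, "admit": "2012-03-01"},
              {"subject_id": 2, "admit": "2012-03-05"}]

kept_subjects   = [s for s in subjects if s["dob"] is not None]   # subject 2 discarded
kept_ids        = {s["id"] for s in kept_subjects}
kept_encounters = [e for e in encounters if e["subject_id"] in kept_ids]  # cascade

print(len(kept_subjects), len(kept_encounters))   # -> 1 1
```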


Secondary Classification of Sub-Functions

We now identify a more detailed classification of the sub-functions. This classification was grounded in specific examples used in the case study below and is not intended to be exhaustive; new categories will inevitably be added as other transformation processes are analysed. It is, however, intended to describe the nature of the sub-functions at a more general level. Unlike the primary classification, there is no obvious way of combining these generic transformations where two or more are applied. Table 2 identifies the generic transformations, with examples taken from the case study below. We note that there are no presumptive lossy sub-functions, since these usually occur through the composition of a conservative lossy sub-function with a presumptive lossless one.

| Information | Generic transformation | Examples | Namespace |
|---|---|---|---|
| Faithful | Copy | Copying a patient ID | cwpCopy |
| Faithful | Faithful mapping | (1) Local gender code to SNOMED mapping; (2) converting between two conventions for expressing ICD-10 codes | cwpFaithMap |
| Faithful | Database restructuring | Converting a single row with many diagnoses to many rows with one diagnosis each | cwpDb |
| Faithful | Concatenate data | Combining many data files to create one large file | cwpConcat |
| Faithful | Arithmetic | Calculate DOB interval from age in years and date | cwpArith |
| Lossy conservative | Loss of granularity mapping | OPCS to SNOMED mapping | cwpLossyMap |
| Lossy conservative | Choosing one of a set of possible values | Choosing one episode for age or gender | cwpChoose |
| Lossy conservative | Discard | Where there is no admission date, age is not calculated and the patient is discarded | cwpDiscard |
| Lossless presumptive | Increased precision | All dates – assume midnight to give a datetime | cwpIncPres |
| Lossless presumptive | Converting a (time) interval to a single point | Putting a point date on a diagnosis and encounter | cwpInt2Point |
| Lossless presumptive | Adding a hard-coded value based on implied meta-data | Inserting a notional hospital name | wpInsert |

Table 2. Classification of generic transformations.


Provenance Data from the Transformation Process

For provenance data to be created automatically from the transformation process, the latter needs to be represented in a machine-readable form. This requires both a representation of the flow of information and a labelling of the sub-functions (processes) and molecules (entities). However, the machine-readable form, typically XML, is not human readable and would be difficult to use when attempting to create a set of transformations consistent with earlier ones. Provenance diagrams provide a visual representation of the particular set of transformations used to generate the actual data in the target database. This serves as documentation of the ETL process and can inform the ETL process required for fresh data.

Provenance Diagrams

Provenance diagrams are used to represent the lineage of data graphically. They are always backward looking, so arrows flow from a data item to the process that created it and thence back to the data used by that process. Where there is a defined ETL process, a provenance diagram can be produced statically (i.e. without actually running the ETL process) to give templates of how data would be derived in advance of running the transformation process. The sub-functions are annotated, and colour is also used to indicate whether information addition and/or subtraction takes place. The use of backward-looking provenance diagrams is apposite for the ETL process described here, since the focus is on populating the target database from whatever can be used in the source.

Annotating the Processes and Entities

Any process or entity in a provenance diagram must be given an identifier and a namespace. Namespaces were created for the entities of the source, the intermediate representations and the target. For the processes, each generic transformation was given a separate namespace so that, for any specific sub-function, it is possible to determine which class it belongs to.

From these machine-readable names and the provenance graphs it becomes possible to create the provenance data automatically during the transformation process.
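As a hedged illustration (not the case-study implementation), provenance records could be emitted as each sub-function runs, using the namespacing convention above; the specific source entity names hes:age and hes:admidate are assumptions.

```python
provenance = []   # list of (subject, relation, object) edges, PROV-style

def record(target_entity, process, used_entities):
    """Append the edges created by one application of a sub-function."""
    provenance.append((target_entity, "wasGeneratedBy", process))
    for src in used_entities:
        provenance.append((process, "used", src))

# Example: deriving a target birthtime from an age and admission date.
record("ehr4cr:birthtime",
       "cwpArith:estimateDOB",
       ["hes:age", "hes:admidate"])

for edge in provenance:
    print(edge)
```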

Case Study

A research database schema was designed to hold anonymised patient details so that, when queried, it could estimate the number of patients satisfying a number of eligibility criteria for a clinical trial and thus estimate the number of patients that could potentially be recruited. This was part of the project Electronic Healthcare Records for Clinical Research (EHR4CR). In order to test the system, the database was populated with publicly available data on anonymised hospital admissions – known as hospital episode statistics (HES) – for England⁸. The eventual aim is to extract data from specific hospital EHR systems. This initial population of the database with the HES data provided a useful case study for the approach described above.

8 Hospital Episode Statistics: http://www.hscic.gov.uk/hes

The ETL process was implemented using the Oracle scripting language, and the provenance data was added to the database using this mechanism too. This is further explained later in this paper.

Target Database

The target database schema was based on a subset of the HL7 RIM (Benson, 2009) data model. Specifically, it contained the following tables:

- Subject: Details of DOB, date of death (DOD) and gender for each patient;
- Observation: Diagnoses and findings (e.g. lab tests, vital signs) associated with a date, location and patient;
- Procedures: Procedures associated with a date, location and patient;
- Administration: Each administration or prescription of a drug or other substance associated with a date, location and patient;
- Encounter: A set of observations, procedures and administrations associated with one visit or stay;
- Organisation: The place where each observation, procedure or administration takes place.

Diagnoses are represented using the coding system ICD-10; all other codes used are represented as SNOMED-CT.

Source Database

The HES data is stored in flat files, each one representing three months of data. Standard tools were used first to concatenate the files and then to load them into a single-table database. Both of these transformations were faithful.

Each row in the sole table represents a single hospital episode in which there may be any number of diagnoses and procedures represented in a predetermined number of columns. No finding or prescription/administration data were present, so the observation table contained only diagnoses and the administration table remained empty. Demographic data for the subject table was stored but in quite different formats. Diagnoses are represented in ICD-10, albeit using a slightly different convention. Procedures are represented using OPCS9. Conversion tables from OPCS to SNOMED were publicly available10.

High-Level ETL Process

An initial ETL process was implemented and provenance diagrams were subsequently constructed. This process revealed the assumptions underlying the transformations and

9 OPCS-4 Classification Health: http://www.connectingforhealth.nhs.uk/systemsandservices/data/clinicalcoding/codingstandards/opcs4 10 See the Technology Reference data Update Distribution (TRUD) website: http://www.uktcregistration.nss.cfh.nhs.uk/trud3/user/guest/group/0/home

led to a number of changes. For example, a presumptive transformation was required to map both the date of birth and the date of a diagnosis from an interval to a single point date. For consistency, both use the midpoint; originally, diagnosis used the earliest possible date and birth the midpoint, and this inconsistency was resolved. Thus the set of provenance diagrams shown below were prescriptive rather than descriptive and drove subsequent modifications to the ETL implementation.

Figure 1 shows the high-level ETL process at the level of each table. This high-level diagram is informal, since the entity ovals may represent several entities and the process boxes may represent several processes, with the colour representing the aggregation of the component functions. Later figures give more formal and detailed descriptions. Figure 2 gives the colour scheme used to annotate the various transformations. Note that in this and other provenance diagrams, arrows from ovals to rectangles should be labelled ‘wasGeneratedBy’ and arrows from rectangles to ovals ‘used’; these labels were omitted to avoid cluttering the diagrams. The large number of coloured boxes and lack of white ones is because these boxes represent function composites – if one component process is coloured (not faithful), the whole process is coloured. The more detailed diagrams showing the basic processes are mostly white (faithful).

The junction table and clinical statement table were intermediate representations used to reconcile the different structures of the two databases and unflatten the data. The organisation table actually contained a single entry stating that all patients were treated at a notional KCL (King’s College London) hospital.

Subject Table

Figures 3 and 4 show the provenance where the subject is included or discarded respectively. Since DOB is a mandatory field in the target database, if the information necessary to estimate this is absent, the whole patient is discarded. Of interest here are the determination of gender, DOB and date of death (DOD). It should be noted that patients will have one or more episodes associated with them. Since administrative gender (denoted ‘sex’ in the source) is recorded at each episode, only the first is used and others are discarded, hence it is lossy.

Figure 1. ETL process for entire database.


Figure 2. Colour scheme used in provenance diagrams.

Figure 3. Provenance diagram for subject table where patient is included.

Figure 4. Provenance diagram for subject table where patient is discarded.

DOB is estimated from the age at admission (in whole years for those patients over 1 year old, slightly more complex for those under 1 year) and the date of admission. Only the first episode is used, hence this is lossy. If the admission date is absent, the patient is discarded (lossy). However, an exact date cannot be obtained from this information.

Thus the presumption is made that it is exactly six months from the patient’s birthday, taking the mean of the two limiting cases.

DOD is not recorded explicitly. However, when the patient is discharged there is a code to denote the status at discharge. One such status code denotes death, thus it is possible to determine whether the patient is dead or alive at discharge. This is lossy since we map all ‘living’ statuses onto a single value. The DOD may be derived from the date of discharge, provided that the patient is dead – also lossy. The discharge status is actually stored in the encounter table, so that this function is only locally lossy.
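A hedged sketch of the two derivations just described, in Python rather than the Oracle scripting actually used; the function and field names are illustrative, and leap-day edge cases are ignored.

```python
from datetime import date, timedelta

def estimate_dob(age_years: int, admission: date) -> date:
    """Midpoint presumption: the DOB is taken as the middle of the interval of
    possible birth dates implied by the age (in whole years) at admission."""
    earliest = admission.replace(year=admission.year - age_years - 1) + timedelta(days=1)
    latest = admission.replace(year=admission.year - age_years)
    return earliest + (latest - earliest) // 2

def derive_dod(discharge_status: str, discharge_date: date):
    """DOD is only known if the discharge status code denotes death (lossy)."""
    return discharge_date if discharge_status == "died" else None

print(estimate_dob(46, date(2012, 3, 23)))   # -> 1965-09-22 (midpoint of the possible interval)
print(derive_dod("died", date(2012, 3, 23)))
```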

Figure 5. Provenance diagram for observation table.

Figure 6. Provenance diagram for procedures table.


Figure 7. Provenance diagram for encounter table.

Other Figures

Figures 5, 6 and 7 show the provenance diagrams for the observation, procedures and encounter tables. The examples of presumptive transformations are:

- Assume midnight when converting a date to a datetime;
- When no date is given for a diagnosis, use the mean of the admission and discharge date as an estimate. The same principle is used for the effective time of the encounter;
- Assume all activities take place at the notional KCL hospital, and so this is a reference to the single-row Organisation table.

The one example of a lossy transformation is in the mapping from OPCS to SNOMED for the procedure codes where there is a loss of granularity.

Implementation of Provenance Data

The provenance data is stored in the target database by adding additional tables that capture the graph structure of Figures 1 and 3-7. These provenance tables store data at the level of each clinical fact. For the observation, procedures and encounter data tables there is one fact per row. For the subject table, there are three facts per row: birth, death and gender. The provenance tables are very small by comparison with the other data, since each transformation is used many times. Foreign keys in the data tables (three for subject and one for all other tables) link to one of the provenance tables so that the repeated use of the same set of transformations is normalised. Thus the provenance overhead is minimised.

For a data warehouse holding 200 patients, the provenance increased the database size (in bytes) by 56%, but for one with about 24,000 patients the increase was just 2.1%.
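The normalisation idea can be sketched as follows, with illustrative (not case-study) table contents: each distinct transformation chain is stored once, and data rows reference it by foreign key.

```python
# Hypothetical provenance lookup table: one row per distinct transformation chain.
provenance_rows = {
    1: "cwpArith:estimateDOB <- hes:age, hes:admidate",
    2: "cwpIncPres:assumeMidnight <- hes:admidate",
}

# Hypothetical subject rows, each carrying a foreign key to the provenance row.
subject_rows = [
    {"patient": "P001", "birthtime": "1965-09-22", "birth_prov_id": 1},
    {"patient": "P002", "birthtime": "1950-06-17", "birth_prov_id": 1},  # same provenance reused
]

for row in subject_rows:
    print(row["patient"], "->", provenance_rows[row["birth_prov_id"]])
```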

There is currently no automatic means of producing the provenance graphs from the provenance data, although this is planned for the future. The examples produced in this paper were created manually.

Provenance Questions

To demonstrate the use of the provenance data, we pose some typical questions that the data could answer and show how.

1. How accurate would an estimate of patient DOB be? In Figure 3, following the arrows from the oval ‘EHR4CR:birthtime’ shows that the estimated DOB is calculated from the earliest and latest possible dates (one year apart). Assuming that people are born uniformly throughout the year, this is an unbiased estimate with a standard deviation of three months. These dates are derived from the age at the admission date, and the first admission is chosen. This choice is not a source of bias. We note from Figure 4 that if the admission date is missing, the patient is not recorded in the target database.

2. Can we be sure a patient is still alive or actually dead? Where does this information come from? From Figure 3, we can see by following the arrows from the oval ‘ehr4cr:deceasedTime&Ind’ that the status ‘cwp:deadAlive’ is derived from the discharge status. Thus, if a patient is discharged dead, he is most definitely dead. However, being alive at discharge does not necessarily mean that the patient is still alive.

3. Is the diagnosis the same as originally recorded? From Figure 5, starting at the oval ‘ehr4cr:codedValue’, all the transformations are faithful (i.e. white). The first transformation ‘cwpDb:ManyColumns2ManyRows’ is structural and so does not change the data. The transformation ‘cwpFaithMap:addDot’ changes the format of the (ICD-10) code by inserting dots, so that ‘E101’ (meaning type 1 diabetes with ketoacidosis) becomes ‘E10.1’. Apart from the dots, the diagnosis representation is the same. Note that ICD-10 codes may be recorded using either convention.

4. How accurate is the datetime of a diagnosis? From Figure 5, starting at the oval ‘ehr4cr:effectiveTime’, the transformation takes the mean of the earliest and latest possible dates of diagnosis (admission and discharge). Any time of day assigned should be ignored, since the transformation of a date to a datetime has presumed midnight.

5. Are the procedure codes as originally recorded? From Figure 6, starting at the oval ‘ehr4cr:code’, there is a conservative lossy mapping from OPCS to SNOMED, so detail in the original recording of the procedure is lost.


6. How accurate is the datetime of a procedure? From Figure 6, starting at the oval ‘ehr4cr:effectiveTime’, the date is derived from a single recorded date of procedure (hes:procedureDates). However, the time of day is presumed to be midnight. Thus the date is accurate but the time of day is not.

Each of these questions addresses at least one of the general provenance questions given at the beginning of this paper.
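A hedged sketch of how such questions could be answered programmatically from stored provenance, reusing the illustrative edges from the earlier sketch (the entity and process names remain assumptions):

```python
# Provenance edges recorded during the transformation (illustrative values).
edges = [
    ("ehr4cr:birthtime", "wasGeneratedBy", "cwpArith:estimateDOB"),
    ("cwpArith:estimateDOB", "used", "hes:age"),
    ("cwpArith:estimateDOB", "used", "hes:admidate"),
]

def lineage(entity):
    """Yield every node reachable by walking the provenance edges backwards."""
    frontier = [entity]
    while frontier:
        node = frontier.pop()
        for subj, _, obj in edges:
            if subj == node:
                yield obj
                frontier.append(obj)

print(list(lineage("ehr4cr:birthtime")))
# -> ['cwpArith:estimateDOB', 'hes:age', 'hes:admidate']
```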

Conclusions

Addressing the two purposes set out at the beginning of this paper, firstly, we note that we do now have an automatable means of informing users of the provenance of each molecule of information in the warehouse. The classifications of sub-functions make explicit the modelling assumptions required to perform these transformations and without them it would be difficult to generalise to other situations or identify inconsistencies in the conventions used. We can see from the provenance questions above how we can use the provenance data to reason about the clinical data. Secondly, the description of the precise sub-functions used can be used to ensure that if data were transformed from another source, the same or similar assumptions would be made ensuring comparability of the warehouse data.

The ETL process using the HES data is simpler than a full ETL of a typical EHR. This is because the data had already been partially transformed and because other information, such as findings and medications, was absent. Nevertheless, it provides an illustration of both the problems of determining the transformation process and generating the provenance data.

More generally, the problem of repurposing data from heterogeneous sources into some research databases requires transforming the data and making assumptions in the process of defining this transformation. Although the secondary classification of the sub-functions is ETL specific, the primary classification has a wider application and provides a useful way of categorising imprecise mappings.

Acknowledgements

This work was part funded by the IMI-funded collaborative project Electronic Healthcare Records for Clinical Research (EHR4CR), grant agreement number: 115189. The work was also part funded and supported by the National Institute for Health Research (NIHR) Biomedical Research Centre, based at Guy’s and St Thomas’ NHS Foundation Trust and King’s College London. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health. We would like to acknowledge the work of Mark McGilchrist in providing the schema for the research database used in the case study.


References

Acar, U., Buneman, P., Cheney, J., Van den Bussche, J., Kwasnikowska, N. & Vansummeren, S. (2010). A graph model for data and workflow provenance. Paper presented at the Second Workshop on the Theory and Practice of Provenance. San Jose, US. Retrieved from http://dl.acm.org/citation.cfm?id=1855803

Benson, T. (2009). Principles of health interoperability HL7 and SNOMED. Springer: Berlin.

Chapman, A. & Jagadish, H.V. (2009). Why not? Paper presented at the SIGMOD Conference 2009. doi:10.1145/1559845.1559901

Cheney, J., Chiticariu, L. & Tan, W.C. (2007). Provenance in databases: Why, how, and where. Foundations and Trends in Databases 1, 379-474. doi:10.1561/1900000006

Davidson, S.B., Cohen-Boulakia, S., Eyal, A., Ludäscher, B., McPhillips, T.M., Bowers, S., Kumar Anand, M. & Freire, J. (2007). Provenance in scientific workflow systems. IEEE Data Engineering Bulletin, 30(4).

Di, L., Yue, P., Ramapriyan, H.K. & King, R.L. (2013). Geoscience data provenance: An overview. IEEE Transactions on Geoscience and Remote Sensing, 51(3). doi:10.1109/TGRS.2013.2242478

Freitas, A., Kämpgen, B., Oliveira, J.G., O’Riain, S. & Curry, E. (2012). Representing interoperable provenance descriptions for ETL workflows. Paper presented at the third International Workshop on Role of Semantic Web in Provenance Management (SWPM 2012) Extended Semantic Web Conference (ESWC). Heraklion, Crete.

Kifor, T., Varga, L.Z., Vázquez-Salceda, J., Álvarez, S., Willmott, S., Miles, S. & Moreau, L. (2006). Provenance in agent-mediated healthcare systems. IEEE Intelligent Systems, 21(6), 38-46. doi:10.1109/MIS.2006.119

Mashima, D. & Ahamad, M. (2012). Enabling robust information accountability in e-healthcare systems. Paper presented at the third USENIX Workshop on Health Security and Privacy (HealthSec 2012). Bellevue, WA.

Miles, S., Deelman, E., Groth, P., Vahi, K., Mehta, G. & Moreau, L. (2007). Connecting scientific data to scientific experiments with provenance. Paper presented at the IEEE International Conference on eScience and Grid Computing 2007. Bangalore, India.

Moreau, L. (2010). The foundations for provenance on the web. Foundations and Trends in Web Science 2, 99-241. doi:10.1561/1800000010


Moreau, L., Clifford, B., Freire, J., Futrelle, J., Gil, Y., Groth, P., Kwasnikowska, N., Miles, S., Missier, P., Myers, J., Simmhan, Y., Stephan, E. & Van den Bussche, J. (2011). The open provenance model core specification (v1.1). Future Generation Computer Systems, 27, 743-756. doi:10.1016/j.future.2010.07.005

Woodruff, A. & Stonebraker, M. (1997). Supporting fine-grained data lineage in a database visualization environment. Paper presented at the 13th International Conference on Data Engineering. Birmingham, England. doi:10.1109/ICDE.1997.581742
