Data Quality of Native XML Databases in the Healthcare Domain
Henry Addico

As XML is being widely adopted as a data and object exchange format for both structured and semi-structured data, the need for quality control and measurement is only to be expected. This can be attributed to the increase in the need for data quality metrics in traditional databases over the past decade. The traditional models provide constraint mechanisms and features to control quality defects, but unfortunately these methods are not foolproof. This report reviews work on data quality in both the database and management research areas. The review includes (i) an exploration of the notion of data quality, its definitions, metrics, control and improvement in data and information sets, and (ii) an investigation of the techniques used in traditional databases, such as relational and object databases, where most focus and resources have been directed. In spite of the wide adoption of XML since its inception, the exploration shows not only a huge gap between research on data quality in relational databases and in XML databases, but also how little support database systems provide for measuring the quality of the data they hold. This motivates the need to formulate mechanisms and techniques for embedding data quality control and metrics into XML data sets. The report also presents the viability of a process-based approach to data quality measurement with suitable techniques, applicable in dynamic decision environments with multidimensional data and heterogeneous sources. This will involve modelling the interdependencies and categories of the attributes of data quality, generally referred to as data quality dimensions, and adopting formal means such as process algebra, fuzzy logic or other appropriate approaches. The attempt is contextualised using the healthcare domain, as it bears all the required characteristics.
Categories and Subject Descriptors: H.2.0 [Database Management]: Security, integrity and protection; K.6.4 [System Management]: Quality assurance; J.3 [LIFE AND MEDICAL SCIENCES]: Medical information systems
General Terms: Healthcare, HL7, NXD
Additional Key Words and Phrases: data quality, health care, native XML databases
End of Year Review, October 2006.

1. INTRODUCTION

This research is motivated by three key issues or trends. First, the adoption of XML (eXtensible Markup Language) as a data representation, presentation and exchange format by domains with semistructured data, such as the healthcare domain. Second, the transition from paper-based health records to centralised electronic health records to improve the quality and availability of patient health data nationally. Third, the measurement, control and improvement of data and information quality in information sets. These issues, introduced briefly below, present the need for an integrated formal approach spanning healthcare datasets, information models such as HL7 and openEHR, and data and information quality metrics.

There is high interest and excitement in industry over XML. This is shown by the number of emerging tools and related products and by its incorporation into very important standards in domains such as healthcare. Why, then, are traditional databases and models still heavily used in the healthcare domain in spite of (i) the fact that healthcare data bears most of the characteristics of semistructured data identified in [Abiteboul 1997], and (ii) the semistructuredness that arises when healthcare data is integrated, especially when the disparate data sources, say GP surgeries, do not use a generic or standard schema for data storage? Could it be scepticism towards the use of XML databases, as most initial implementations lacked the essential DBMS features needed by the industrial community?
There has been immense research and commercial activity in XML databases since [Widom 1999] highlighted possible directions for XML database research. Looking ahead, domains that require database management systems (DBMS) supporting flexible structure and high-dimensional data, the healthcare domain for example, should be adopting XML databases, as they are more appropriate than traditional or legacy DBMS. This research began with a general review of XML and XML databases, which is presented in section 2.

Information quality (IQ) or data quality (DQ) is another issue that has become a critical concern of organisations and an active research area in Management Information Systems (MIS) and Database Warehousing, as the availability of information alone is no longer a strategic advantage [Lee et al. 2002; Oliveira et al. 2005; Motro and Rakov 1997]. The database community uses the term DQ and is more concerned with formal approaches involving the models, languages and algorithms developed in its field. The management community uses the term IQ (information being transformed data) and is interested in the abstract demographics, process influences and data cycle (from collection to actual use) of data quality. The management community's approach is useful for understanding the DQ concept, whilst the database approach works towards automating those results with the aim of eliminating data quality problems from datasets. However, the management community goes beyond the intrinsic quality attributes and is generally interested in complete frameworks, generally referred to as Total Data Quality Management (TDQM), for improving data quality across domains. Despite a significant increase in IQ research to meet the needs of organisations, only a few ad hoc techniques are available for measuring, analyzing and improving data quality [Lee et al. 2002].
Ensuring that data in dynamic decision environments is fit for its purpose is therefore still difficult [Shankaranarayan et al. 2003]. Nevertheless, the decision maker or user of these databases requires the data to measure up in order to make quality decisions confidently.

The recent attempts to build a centralized electronic health record (EHR) are in line with the above two issues, as they require the use of heterogeneous databases and widespread adoption of enabling technologies that facilitate data sharing and access from multiple data sources. Aside from the introduction of the EHR, data quality in health care has been of concern over the past decade and remains a key area receiving considerable attention within the healthcare domain, as it is crucial to effective health care [Leonidas Orfanidis 2004].

This work begins with a review of work on XML and XML databases in section 2, as XML database management systems (XDBMS) offer features suitable for managing data and information quality, especially in the healthcare domain. The review continues with data and information quality in the database and management research areas in section 3. Section 4 explores the context for the research by looking at recent trends in health care towards the development of a centralized EHR, particularly in the United Kingdom, with occasional reference to other national implementations. This review of XML and XML databases, data and information quality, and the EHR in the healthcare domain provides the grounding for section 5, which concludes by presenting possible research directions on a process-based approach to incorporating data quality into XML databases.

2. XML AND XML DATABASES

XML (eXtensible Markup Language), a subset of Standard Generalized Markup Language (SGML) with an extensible vocabulary, is now a widely used data and object exchange format for structured and semi-structured data, as well as for the representation of tree-structured data [Jiang et al. 2002; Rys et al. 2005; Shui et al. 2005; Widom 1999; MacKenzie 1998]. It has been fast emerging as the dominant standard for representing data on the World Wide Web since the inception of its specification in 1996 by the World Wide Web Consortium (W3C) [Shanmugasundaram et al. 1999; MacKenzie 1998]. Unlike HTML, an application of SGML that describes how to display a data item, XML describes the data itself [Shanmugasundaram et al. 1999; Widom 1999; MacKenzie 1998]. However trivial this property may seem, its usefulness should not be underestimated. It enables applications to interpret data in multiple ways, filter a document based on its content, or even restructure data on the fly. Furthermore, it provides a natural way to separate information content from presentation, allowing multiple views via the application of the eXtensible Stylesheet Language (XSL) specifications [Widom 1999].

XML is generally presented in documents consisting of ordered or unordered elements that are textually structured by tags with optional attributes (key-value pairs). Elements can be semi-structured by allowing unstructured free text between the start and end tags, as in figure 1. Elements can be nested within other elements, but the document must have one particular element, called the root element, that contains all the others. A Document Type Definition (DTD) describes the structure of the elements, their associated attributes and constraints [Achard et al. 2001]. The DTD can be expressed more elaborately as an XML document called an XML schema.
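The element, attribute and mixed-content structure described above can be illustrated with a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The patient record, its tag names (`patient`, `visit`, `note`, `symptom`) and the helper `has_mixed_content` are assumptions invented for illustration. They are not taken from HL7, openEHR or the figures of this report.

```python
import xml.etree.ElementTree as ET

# Hypothetical patient record, invented purely for illustration.
RECORD = """<patient id="p001">
  <name>Jane Doe</name>
  <visit date="2006-10-01">
    <note>Patient reports <symptom>headache</symptom> since Monday.</note>
  </visit>
  <visit date="2006-10-15">
    <note>No further <symptom>headache</symptom>; mild <symptom>cough</symptom>.</note>
  </visit>
</patient>"""

root = ET.fromstring(RECORD)


def has_mixed_content(element):
    """Return True if free text is interleaved with child elements."""
    if len(element) == 0:
        return False  # no child elements, so the content cannot be mixed
    leading = (element.text or "").strip()
    trailing = any((child.tail or "").strip() for child in element)
    return bool(leading) or trailing


# The root element encloses every other element in the document.
root_tag = root.tag
# Attributes are key-value pairs attached to the start tag.
patient_id = root.attrib["id"]

# Filter on content: dates of the visits whose notes mention a cough.
cough_visits = [
    visit.get("date")
    for visit in root.iter("visit")
    if any(symptom.text == "cough" for symptom in visit.iter("symptom"))
]

# Tags of the elements that hold mixed (semi-structured) content.
mixed_tags = sorted({e.tag for e in root.iter() if has_mixed_content(e)})

print(root_tag, patient_id, cough_visits, mixed_tags)
```

Because the markup describes the data rather than its display, the same document can be filtered by content (the `cough_visits` query) or inspected structurally (the `mixed_tags` query) without any knowledge of how it will be presented.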
A well-formed (syntactically correct) XML document is said to be valid with respect to a DTD or XML schema if it conforms to the logical structures defined in that DTD or XML schema. The following features of XML not only account for its fast adoption and rapid development but also influence the techniques adopted for efficient storage and querying [Kim et al. 2002; Kader 2003]:

—Document-Centric
  —Large amount of mixed unstructured and structured data, which is useful for semi-structured or flexible data applications [figure 1].
  —Supports element and content order.
  —Readily human readable.
  —100% round tripping is essential (see 2.2) [Feinberg 2004]. For example, in the healthcare domain, reconstruction of a patient EHR is important.
—Data-Centric: an alternative to the above document-centric feature
  —Structured data, or very little mixed content of unstructured data [figure 2].