Data Quality of Native XML Databases in the Healthcare Domain

Henry Addico

As XML is being widely adopted as a data and object exchange format for both structured and semi-structured data, the need for quality control and measurement is only to be expected. This can be attributed to the increase in the need for data quality metrics in traditional databases over the past decade. Traditional models provide constraint mechanisms and features to control quality defects, but unfortunately these methods are not foolproof. This report reviews work on data quality in both the database and management research areas. The review includes (i) an exploration of the notion of data quality, its definitions, metrics, control and improvement in data and information sets, and (ii) an investigation of the techniques used in traditional databases, like relational and object databases, where most focus and resource has been directed. In spite of the wide adoption of XML data since its inception, the exploration does not only show a huge gap between research work on data quality in relational databases and in XML databases, but also shows how very little support database systems provide in giving a measure of the quality of the data they hold. This induces the need to formalize mechanisms and techniques for embedding data quality control and metrics into XML data sets. It also presents the viability of a process-based approach to data quality measurement with suitable techniques, applicable in dynamic decision environments with multidimensional data and heterogeneous sources. This will involve modelling the interdependencies and categories of the attributes of data quality, generally referred to as data quality dimensions, and the adoption of a formal means like process algebra, fuzzy logic or any other appropriate approach. The attempt is contextualised using the healthcare domain, as it bears all the required characteristics.

Categories and Subject Descriptors: H.2.0 [Database Management]: Security, integrity and protection; K.6.4 [System Management]: Quality assurance; J.3 [LIFE AND MEDICAL SCIENCES]: Medical information systems
General Terms: Healthcare, HL7, NXD
Additional Key Words and Phrases: data quality, health care, native XML databases

1. INTRODUCTION
This research is motivated by three key issues or trends. First, the adoption of XML (eXtensible Markup Language) as a data representation, presentation and exchange format by domains with semistructured data, like the health care domain. Second, the transition from paper-based health records to centralised electronic health records, to improve the quality and availability of patient health data nationally. And lastly, the measurement, control and improvement of data and information quality in information sets. These issues, which are briefly introduced below, present the need for an integrated formal approach between healthcare datasets, information models like HL7 and openEHR, and data and information quality metrics. There is high interest and excitement in industry over XML. This is shown by the number of emerging tools and related products and by its incorporation into very important standards in domains like health care.

Why are traditional databases and models still heavily used in the healthcare domain in spite of (i) the fact that healthcare data bears most of the characteristics of semistructured data as identified in [Abiteboul 1997] and (ii) the semistructuredness that results from the integration of healthcare data, especially when the disparate data sources, say GP surgeries, do not use a generic or standard schema for data storage? Could it be scepticism towards the use of XML databases, as most initial implementations lacked the essential DBMS features needed by the industrial community? There has been immense research and commercial activity in XML databases since [Widom 1999] highlighted possible directions of XML database research. Prospectively, domains that require database management systems (DBMS) that support flexible structure and high dimensional data, the healthcare domain for example, should be adopting XML databases, as they are more appropriate than other traditional or legacy DBMS. This research began by performing a general review of XML and XML databases, which is presented in section 2.

Information quality (IQ) or data quality (DQ) is another issue which has become a critical concern of organisations and an active research area in Management Information Systems (MIS) and Data Warehousing, as the availability of information alone is no longer a strategic advantage [Lee et al. 2002; Oliveira et al. 2005; Motro and Rakov 1997]. Whilst the database community uses the term DQ and is more concerned with formal approaches involving the models, languages and algorithms developed in its field, the management community uses IQ (information is transformed data) and is interested in the abstract demographics, process influences and data cycle (from collection to actual use) of data quality. The management community's approach is useful when trying to understand the DQ concept, whilst the database approach is geared towards automation of its results with the aim of eliminating data quality problems from datasets. However, the management community goes beyond the intrinsic quality attributes and is generally interested in complete frameworks, generally referred to as Total Data Quality Management (TDQM), to improve data quality across domains. Despite a significant increase in IQ research to meet the needs of organisations, only a few ad hoc techniques are available for measuring, analyzing and improving data quality [Lee et al. 2002]. Ensuring that data in a dynamic decision environment is fit for its purpose is therefore still a difficult task [Shankaranarayan et al. 2003]. Nevertheless, the decision maker or user of these databases requires the data to measure up in order to confidently make quality decisions.

The recent attempts to build a centralized electronic health record (EHR) are in line with the above two issues, as they require the use of heterogeneous databases and the widespread adoption of enabling technologies that facilitate data sharing and access from multiple data sources. Aside from the introduction of the EHR, data quality in health care has been of concern over the past decade and is still a key area within the healthcare domain receiving considerable attention, as it is crucial to effective health care [Leonidas Orfanidis 2004].

This work begins with a review of work on XML and XML databases in section 2, as XDBMS unfurl features suitable for managing data and information quality, especially in the healthcare domain. The review continues with data and information quality in the database and management research areas in section 3. Section 4 explores the context for the research by looking at recent trends in health care towards the development of centralized EHR, particularly in the United Kingdom, sparingly making reference to other national implementations.


This review of XML and XML databases, data and information quality, and the EHR in the healthcare domain provides the grounding to conclude, in section 5, by presenting possible research directions on a process-based approach to incorporating data quality into XML databases.

2. XML AND XML DATABASES
XML (eXtensible Markup Language), a subset of the Standard Generalized Markup Language (SGML) with an extensible vocabulary, is now a widely used data and object exchange format for structured and semi-structured data, as well as a representation of structured data [Jiang et al. 2002; Rys et al. 2005; Shui et al. 2005; Widom 1999; MacKenzie 1998]. It has been fast emerging as the dominant standard for representing data on the World Wide Web since the inception of its specification in 1996 by the World Wide Web Consortium (W3C) [Shanmugasundaram et al. 1999; MacKenzie 1998]. Unlike HTML, another subset of SGML which serves the task of describing how to display a data item, XML describes the data itself [Shanmugasundaram et al. 1999; Widom 1999; MacKenzie 1998]. However trivial this property might seem, its usefulness cannot be overstated. It enables applications to interpret data in multiple ways, filter a document based on its content or even restructure data on the fly. Furthermore, it provides a natural way to separate information content from presentation, allowing multiple views via the application of the eXtensible Stylesheet Language (XSL) specifications [Widom 1999].

XML is generally presented in documents consisting of ordered or unordered elements that are textually structured by tags with optional attributes (key-value pairs). Elements can be semi-structured by allowing unstructured free text between the start and end tags, as in figure 1. Elements can be nested within other elements, but the document must have a particular element, called the root element, that embodies all the other elements. A Document Type Definition (DTD) describes the structure of the elements, their associated attributes and constraints [Achard et al. 2001]. The DTD can be expressed more elaborately as an XML document called an XML schema. A well formed (syntactically correct) XML document is said to be valid with respect to a DTD or XML schema if it conforms to the logical structures defined in that DTD or XML schema (a brief sketch below illustrates the distinction). The following features of XML do not only account for its fast adoption and rapid development but also influence the techniques adopted for efficient storage and querying [Kim et al. 2002; Kader 2003]:

—Document-Centric
—Large amount of mixed unstructured and structured data, which is useful for semi-structured or flexible data applications [figure 1].
—Supports element and content order.
—Readily human readable.
—100% round tripping is essential (see 2.2) [Feinberg 2004]. For example, in the health care domain reconstruction of a patient's EHR is important.
—Data-Centric: an alternative to the above document-centric feature
—Structured or very little mixed content of unstructured data [figure 2].
—Element and content order is insignificant.


Fig. 1. a Document-Centric XML element

Fig. 2. a Data-Centric XML example

—Best consumed by machine.
—100% round tripping is not essential (see 2.2). For example, one might only need the cost of a flight.
—Provision of a family of technologies like XREF linking, XPointer, XPath, XQuery, XSL, AXML, SGML, SOAP, RDF etc.
—Self-documenting format that describes structure and field names as well as specific values.
—Verbose metadata: each text value carries syntactic or semantic metadata.
—Yields higher storage costs.
—Best consumed by machine and promotes interoperability between systems.
—Numerous schema languages with varying features for value, structural and reference constraints, e.g. XML Schema, RELAX NG, Schematron etc.
—Platform-independent.

The document-centric and data-centric features are generally treated as alternatives.

The interests of this research lie in local or distributed XML document stores with traditional database features like efficient storage, querying of data, transactions, indexing etc., in other words an XML document management system (XDBMS). The next section defines XML databases and explores their subcomponents in its subsections.
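Before turning to full database systems, the following is a minimal sketch of the well-formed versus valid distinction introduced above. It assumes the third-party lxml library is available; the patient vocabulary and the DTD are invented purely for illustration and are not taken from any standard.

    # A minimal sketch of well-formed vs. valid XML; the <patient> vocabulary is invented.
    from io import StringIO
    from lxml import etree

    dtd = etree.DTD(StringIO("""
    <!ELEMENT patient (name, dob)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT dob (#PCDATA)>
    """))

    valid_doc = etree.fromstring("<patient><name>Jane Doe</name><dob>1970-01-01</dob></patient>")
    print(dtd.validate(valid_doc))    # True: well formed and conforms to the DTD

    incomplete = etree.fromstring("<patient><name>Jane Doe</name></patient>")
    print(dtd.validate(incomplete))   # False: still well formed, but not valid (dob is missing)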

2.1 Anatomy of an XML Database
An XML database is defined informally by [Cohen et al. 1999] as a list of XML documents matching a given Document Type Definition (DTD) or schema. This is comparable to the relational model, where a database contains lists of tuples conforming to relation schemes [Cohen et al. 1999]. However, as an XML document can exist without a schema, it is more appropriate to define an XML database simply as a list of XML documents [Moro et al. 2005].

Systems for managing XML data are either specialized systems designed exclusively for XML [Rys et al. 2005] or extended ("universal") databases designed to manage XML data amongst other kinds. The first is referred to as a native XML database (NXD) and the latter an XML enabled database (XED). While the NXD is not entirely new from the ground up, as a lot of its techniques are adaptations and adoptions from semi-structured databases, the XED was a relentless effort and a warranted need to tap into the 36 years of investment [Shanmugasundaram et al. 1999] in relational databases, object databases and other traditional databases. The key difference between XED and NXD is how they employ the traditional database concepts. The NXD has a (logical) model for its fundamental unit (the XML document), and stores and retrieves documents according to this model. It can however employ an underlying physical storage model based on a relational, hierarchical or object-oriented database, or use a proprietary storage format such as indexed, compressed files. The XED on the other hand uses the traditional database in its entirety and only provides XML output via transformation of the results. Most implementations of XED and NXD provide more than just efficient storage; they also provide the essential features of a database management system (DBMS), like appropriate querying interfaces, security, multiple user access, transactional support etc. In this paper we will refer to XED and NXD of this calibre, and hence will use XEDBMS and NXDBMS respectively. There are quite a number of XEDBMS, like Berkeley DB XML [Burd and Staken 2005], the DB2 9 Express (no charge) PureXML Hybrid Data Server [IBM 2006] etc., and numerous implementations of NXDBMS, like eXist [Pehcevski et al. 2005], 4Suite [Olson 2000], Sedna [Aznauryan et al. 2006], Xindice [Gabillon 2004; Sattler et al. 2005], TIMBER [Jagadish et al. 2002], Natix [Fiebig et al. 2002] etc.

An attempt to include quality metrics in an NXD requires a thorough understanding of its architecture and interfaces. A general exploration of the essential components of multi-tier databases, architecturally structured into levels with the aim of providing physical and logical data independence, could lead to fruitful possibilities. This exploration begins with storage and data models in sections 2.2 and 2.3. Next, in section 2.4, the exploration continues with indexing, including indexing support for annotations and metadata. Querying interfaces and transactions follow in sections 2.5 and 2.6. Distributed XML database concepts and how they are currently achieved conclude the anatomy exploration in section 2.7, as distribution is a basic requirement in the healthcare domain.


2.2 Storage models
The approach to storing XML documents can be either intact, as whole documents (document-centric), or following a shredding scheme (data-centric) [Feinberg 2004]. The storage may be in off-the-shelf traditional database management systems [Jiang et al. 2002; Fiebig et al. 2002] or in a proprietary storage format. The intact approach involves mapping the document into large object fields or files as a whole document, with an overflow mechanism for very large documents, whilst in the shredded case the contents of the documents are mapped onto an elaborate schema of a traditional database, taking advantage of the presence of a DTD or XML schema [Jiang et al. 2002].

From a document-centric perspective, the storage of an XML document as a singular data item is ideal for handling whole-document queries, but manipulation of fragments requires parsing of the document each time. However, from a data-centric (non-intact) view, where documents are broken down into smaller chunks, handling parts of the document is more efficient, but the extraction of whole XML documents is negatively impacted.

The document-centric and data-centric approaches can be compared from the perspective of their granularity. The intact storage approach has a granularity of addressability [Feinberg 2004] of a whole XML document, except for the special case of immutable documents, where interior addresses or offsets can be used. However, there is the problem of how to address target documents. The two primary mechanisms are by some unique name or generated document id, or by query (XPath or XQuery). Querying over a large collection of intact documents will require indexing, which can be done when parsing documents. The intact approach is useful in applications where collections comprise relatively small documents which tend to be processed as a whole and 100% round tripping is required (the decomposition of the documents into smaller chunks and their re-composition when evaluating whole-document queries is referred to as round tripping). The non-intact approach, on the other hand, has addressability and accessibility at the subdocument level, typically element or node level [Feinberg 2004]. However, concurrency granularity may or may not be finer than the document level. It allows the ability to reference, modify and retrieve an element, a partial document or other objects within a document. It also provides more efficient querying, without whole-document parsing. There are challenges like the degree of round tripping, as decomposition storage results in loss or change of information, such as the reordering of attributes or changes in XML declarations. This is due to the fact that there is not a 1:1 mapping from the XML info-set to the bytes in a document [Feinberg 2004].

Model-mapping, a typical implementation of the non-intact approach, stores the logical form of the XML document instead of the actual XML document. This logical model, generally referred to as the data model, is reviewed in the next section. The mapping is either edge-oriented or node-oriented, and a fixed schema is used for all XML documents without the assistance of a DTD [Jiang et al. 2002]. An example, XParent, is characterized by a four-table schema. It maintains both label paths (sequences of element tags) and data paths (sequences of elements) in two separate but related tables.

There is still a dilemma or conflict when mixing the concepts of a single document being a database and a set of XML documents being a database. This is due to the document-centric and data-centric features of XML. It needs to be resolved, as it might be useful to use both views simultaneously [Widom 1999]. Nevertheless, most of the storage and data models seem to adopt the data-centric path, with only a few considering the document-centric feature. This is not favourable to this research, as the EHR is an aggregational information model for which the document-centric approach would be more appropriate.

In traditional databases the data storage is abstracted away from the data model, providing what is referred to as physical data independence. Most implementations of NXD (especially the intact approaches) however lack this feature and store the data model directly to disk.
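To make the shredded (non-intact) idea concrete, the following is a simplified sketch assuming a single node table rather than the actual four-table XParent schema; the element names and the SQL layout are invented for illustration only.

    # Simplified shredding sketch: each node becomes a row recording its id, parent id,
    # label path and text value. Not the actual XParent schema, which uses four tables.
    import sqlite3
    import xml.etree.ElementTree as ET

    def shred(xml_text):
        rows, counter = [], [0]
        def visit(elem, parent_id, label_path):
            counter[0] += 1
            node_id = counter[0]
            path = f"{label_path}/{elem.tag}"
            rows.append((node_id, parent_id, path, (elem.text or "").strip()))
            for child in elem:
                visit(child, node_id, path)
        visit(ET.fromstring(xml_text), None, "")
        return rows

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node(id INTEGER, parent INTEGER, label_path TEXT, value TEXT)")
    conn.executemany("INSERT INTO node VALUES (?,?,?,?)",
                     shred("<record><gp>Dr Smith</gp><allergy>penicillin</allergy></record>"))
    # A label-path lookup answers a fragment query without re-parsing the whole document.
    print(conn.execute("SELECT value FROM node WHERE label_path = '/record/allergy'").fetchall())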

2.3 XML data models
Above the storage model there needs to be a logical one, referred to as the data model [Feinberg 2004; Widom 1999]. This model can sometimes be closely tied to the querying language. A typical and popular example is the DOM, which is tied to XQuery. The DOM is object oriented, document-centric and forms the basis of most data models. It simply maps the document and its contents to basic and complex data structures [Goldman et al. 1999]. Each data item is modelled as a node; some nodes have names and values, and may have siblings or children, forming a tree structure. Generally the models take the form of a forest of unranked, node-labelled trees, one tree per document [Moro et al. 2005; Kacholia et al. 2005]. A node corresponds to an element, attribute or value, and the edges represent element (parent)-subelement (child) or element-value relationships [Moro et al. 2005]. XPath, another popular example, models XML documents as an ordered tree using 7 types of nodes (root, element, text, attribute, namespace, processing instruction and comment) [Jiang et al. 2002]. TaDOM, a third example, is an extension of the DOM by [Haustein et al. 2005]. It separates attribute nodes into attribute roots for each element, and the attribute values become text nodes. The Lore project's schema-less, self-describing semi-structured data model, the Object Exchange Model (OEM), was ported to an elegant XML data model where every data item is modelled as an element pair of a unique element identifier (eid) and a value, which can be atomic text or a complex value. The complex value bears the element tag name, an ordered list of attribute-name or atomic-value pairs, crosslink subelements and normal subelement types. The following are some of the classical features, with examples of query languages that implement them, if any:

—data typing support: Lore data model, XQuery
—element and attribute ordering: Lore
—inclusion of white space between elements, comments, processing instructions e.g. XPath
—semantic view representing IDREFs as cross-link edges and foreign key relationships: Lore
—temporal support
—concurrency support: taDOM
—metadata and annotations management.


There is not as yet a data model that provides all of these features, and some of them are yet to be included in any XML database model.
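As a rough illustration of the tree view that most of these models do share (one node-labelled tree per document, with elements, attributes and values as distinct node kinds), the sketch below walks a small document using the standard library DOM implementation; the reading fragment is invented and carries no clinical meaning.

    # Illustrative only: print the node-labelled tree behind a tiny document.
    from xml.dom import minidom

    doc = minidom.parseString('<reading type="pulse">72<unit>bpm</unit></reading>')

    def describe(node, depth=0):
        indent = "  " * depth
        if node.nodeType == node.ELEMENT_NODE:
            print(f"{indent}element: {node.tagName}")
            for name, value in node.attributes.items():
                print(f"{indent}  attribute: {name} = {value}")   # attribute nodes hang off the element
        elif node.nodeType == node.TEXT_NODE and node.data.strip():
            print(f"{indent}text: {node.data.strip()}")           # values become text nodes
        for child in node.childNodes:
            describe(child, depth + 1)

    describe(doc.documentElement)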

2.4 Indexing
In order to exploit the content and structure of XML documents, several numbering schemes (index structure mechanisms) are normally embedded in the tree representation model [Moro et al. 2005]. Absolute address index approaches, like the position-based and path-based indexing of [Sacks-Davis et al. 1997], are not very efficient, as updates cause an expensive re-computation which is undesirable in applications with high insert and update frequencies [Kha et al. 2001]. [Kha et al. 2001] present an indexing scheme, called the Relative Region Coordinate, which extends the Absolute Region Coordinate (ARC) used to locate the content of structured documents and which reduces the amount of computation when the value of a leaf node changes. As this scheme has the special property of only affecting the nodes in its region and to its right (a storage scheme called block subtree), it reduces the input-output (I/O) necessary to perform updates. There are also indexing schemes for the structure of the document, data-centric in nature as opposed to the former, which is document-centric. Examples are the interval-based schemes [Li and Moon 2001], where labels are based on the traversal order of the document and its size, and prefix labelling schemes, where nodes inherit their parent's label as a prefix. The above indexing schemes also suffer undesirable computation and modification of other records when multiple elements, referred to as segments, are inserted or updated; [Catania et al. 2005] presented a lazy approach to XML updates where no modification is made to other existing records. The writer also presented data structures and a structural join algorithm based on segments. The DeweyID indexing scheme by [Haustein et al. 2005] is another, which identifies tree nodes, avoids re-labelling even under bulk insertions and deletions, and allows the derivation of all ancestor node identities (IDs) without accessing the actual data.

As most work on quality measurement assumes the need for metadata, there is also a need to provide metadata indexing facilities. In XML databases the metadata can be applied directly to elements or stored as views over the XML data. The work of [Cho et al. 2006] investigates metadata indexing along each location step in XPath queries. The writers derive a full metadata index, which maintains a metadata level for all elements, and an inheritance metadata index, which maintains metadata only for elements for which the metadata is specifically defined. Indexing is an important aspect, as it does not only affect storage and query efficiency but also holds the key for the effective use of metadata for data quality measurement.
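The following is a small sketch of the prefix-labelling idea that DeweyID builds on (not the DeweyID algorithm of [Haustein et al. 2005] itself): each node's label extends its parent's label, so all ancestor labels can be derived from the label alone, without touching the stored data.

    # Assign Dewey-style prefix labels and derive ancestor labels from a label alone.
    import xml.etree.ElementTree as ET

    def label_tree(elem, prefix="1"):
        labels = {prefix: elem.tag}
        for i, child in enumerate(elem, start=1):
            labels.update(label_tree(child, f"{prefix}.{i}"))
        return labels

    def ancestors(label):
        parts = label.split(".")
        return [".".join(parts[:i]) for i in range(1, len(parts))]

    doc = ET.fromstring("<record><visit><note>stable</note></visit><visit/></record>")
    print(label_tree(doc))      # {'1': 'record', '1.1': 'visit', '1.1.1': 'note', '1.2': 'visit'}
    print(ancestors("1.1.1"))   # ['1', '1.1'], derived without accessing the actual data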

2.5 Querying XML Databases
The usefulness and potential of a database are highly dependent on providing an optimal query language and interfaces to store and retrieve data elements. Querying XML data has quite complex requirements, most of which are still unknown and will remain so until a significant number of data-intensive applications are built [Widom 1999]. Most text searches performed on XML-related data seem to be keyword based. An identified problem in this scenario is the efficiency of extracting this information [Kacholia et al. 2005].

One approach, backward expanding search, unfortunately performs poorly when a keyword occurs commonly or when some nodes have a high degree [Kacholia et al. 2005]. A bidirectional search algorithm improves on backward expanding search by also allowing forward search from potential roots towards leaves [Kacholia et al. 2005].

The query language needs to support structural expressiveness, rich qualifiers and recursiveness with optimal query evaluation, providing numerous interfaces via the family of XML technologies. It must also exploit the presence of a DTD or XML schema, the document-centric or data-centric features of XML, and metadata, and allow fuzzy restriction and query ranking. Languages like XPath, and its successor XQuery, employ "tree pattern" queries to select nodes based on their structural characteristics [Moro et al. 2005]. For example the query:

//article[author[@last="DeWitt"]]//proceedings[@conf="VLDB"]

requests all articles with an author whose last name is DeWitt and which have appeared in the VLDB conference [Moro et al. 2005]. The writers [Moro et al. 2005; Amer-Yahia et al. 2005] explain that the query consists of two parts:

—@last="DeWitt" and @conf="VLDB" are value based, as they select elements according to their values, i.e. they are content based.
—//article[author]//proceedings is structural, as it imposes structural restrictions (e.g. a proceedings element must exist under an article with at least one author child).
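As a hedged illustration, the query can be evaluated with an off-the-shelf XPath engine such as the one in lxml; the bibliographic fragment below is invented, and both the value-based and the structural conditions appear in the same expression.

    from lxml import etree

    bib = etree.fromstring(
        "<bib>"
        "<article><author last='DeWitt'/>"
        "<publications><proceedings conf='VLDB'>paper A</proceedings></publications>"
        "</article>"
        "<article><author last='Widom'/><proceedings conf='SIGMOD'>paper B</proceedings></article>"
        "</bib>"
    )

    # author[@last="DeWitt"] and @conf="VLDB" are the value-based parts; the nesting of
    # article and its descendant proceedings is the structural part.
    hits = bib.xpath('//article[author[@last="DeWitt"]]//proceedings[@conf="VLDB"]')
    print([p.text for p in hits])   # ['paper A']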

The value-based conditions can be efficiently evaluated with traditional indexing schemes, whilst the structural conditions are quite a challenge. Set-based algorithms and holistic processing techniques have been developed in recent research work to outperform the conventional methods [Moro et al. 2005]. The conventional method mostly involved query decomposition using binary operators, after which some optimization is applied to produce an efficient query plan [Moro et al. 2005]. The holistic approach, however, requires some pre-processing on the data, or on both the data and the query, but offers an incremental global matching ability [Moro et al. 2005]. Most query languages express order-sensitive queries, which are particularly important in document-centric views, through the ability to address parts of a document in a navigational fashion based on the sequence of nodes in the document tree [Vagena et al. 2004]. View mechanisms and dynamic structural summaries may be useful for query processing and rewriting, and for keyword and proximity search [Widom 1999]. [Amer-Yahia et al. 2005] propose a scoring and ranking method based on both structure and content, instead of the term frequency (tf) and inverse document frequency (idf) techniques from information retrieval, which are approximations for relevance ranking. Languages that support these features are declarative in nature.

W3QL is a database-inspired query language designed for the web around 1995. It is a declarative query language, just like the Structured Query Language (SQL), and applies database techniques to search hypertext and semi-structured data, employing existing indexes as access paths [Konopnicki and Shmueli 2005]. Lorel is another declarative language, designed initially for semi-structured data [Abiteboul et al. 1997] and then migrated to XML. It seems to be the most flexible, with a lot of the features highlighted above. Other languages include XML-QL, a value-based language with structural recursion that uses patterns and allows not only node selection but also transformations via the notion of variable binding [Deutsch et al. 1998].


OQL and Quilt follow a functional paradigm, integrating features from the other declarative languages [Fan and Siméon 2003; Chamberlin et al. 2001].

There is also the need for query languages or approaches which hide away this complexity without compromising the complex grammar features, by providing graphical or form-based interfaces, amongst others, for naive users. Equix is a typical form-based query language which automatically generates result documents and their associated DTD without user intervention [Cohen et al. 1999]. This was an attempt to provide a graphical interface better than the graphical query language XMAS, which facilitates user-friendly query formulation and is very useful for optimization purposes [Baru et al. 1999]. XML-GL is another such graphically oriented language [Ceri et al. 1999].

As with extensions to SQL, like the work of [Parssian and Yeoh 2006] in relational databases, there might be a need to extend one of the numerous XML query languages to cater for data quality queries. This forms a very small proportion of the work on querying, with XPath and XQuery being the most popular; as this area is being very actively researched, a detailed exploration is outside the scope of the aims of this review.

2.6 Transaction Models in XML Databases
Research in the XML database field has moved its focus from single-user to fully fledged multi-user, multi-access databases, with collaboration on XML documents via the Concurrent Versions System (CVS) [Dekeyser et al. 2004] being a thing of the past. As this has been studied extensively in the traditional database systems context, native XML databases must measure up. Processing interactions with XML documents in multi-user environments requires a guarded transactional context assuring the well-known ACID properties [Haustein and Härder 2003]. The work of [Feinberg 2004] on transaction management is a typical example bearing these properties.

The structural properties of XML documents present further challenges for the propagation of locking schemes with respect to authorisation [Finance et al. 2005]. [Dekeyser et al. 2004] argue about the inadequacy of the table locking, predicate locking and hierarchical locking schemes (adaptations of traditional database locking schemes) mainly used in the shredded approach to XML data storage. They then propose two new locking schemes, with two scheduling algorithms, based on path locks which are tightly coupled to the document instance. This area seems to have the least research focus. It is however critically important when it comes to integration and heterogeneous database sets. Most of the key features of transactions depend heavily on the extent of integration between the query and security models.
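The following is only an illustrative sketch of why path-level lock granularity helps; it is not the path-lock protocol of [Dekeyser et al. 2004]. In this naive version, two operations conflict when one target path lies on the other's root-to-node path and at least one of them is a write.

    # Naive conflict test for path-granular locks; real schemes are considerably more refined.
    def conflicts(lock_a, lock_b):
        path_a, mode_a = lock_a
        path_b, mode_b = lock_b
        nested = (path_a == path_b
                  or path_a.startswith(path_b + "/")
                  or path_b.startswith(path_a + "/"))
        return nested and "write" in (mode_a, mode_b)

    held = ("/record/visit[1]", "write")
    print(conflicts(("/record/visit[1]/note", "read"), held))   # True: reading inside a written subtree
    print(conflicts(("/record/visit[2]", "write"), held))       # False: disjoint subtrees proceed concurrently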

2.7 Distributed XML databases
Distributed data management is an important research area for the database community. Most prominent are data warehouses in traditional databases. A data warehouse is generally understood as an integrated, time-varying, often historical collection of data from multiple, heterogeneous, autonomous and distributed information sources. It is primarily used in strategic decision making by means of online analytical processing (OLAP) techniques [Lechtenbörger and Vossen 2003].

A data warehouse (DW) is different from a conventional DBMS in that it operates on aggregated data from disparate sources for analysis purposes, or to answer complex ad hoc queries which the disparate data sources cannot answer in isolation. A DW uses materialised views to provide faster and more efficient answers to real-world queries.

The introduction of standards like WSDL, UDDI and SOAP, using HTTP and XML and especially AXML, has driven a shift from client-server architectures to service-based approaches [Bilykh et al. 2003]. [Abiteboul et al. 2006] conceptualise an XML DW by utilising work from ActiveXML (AXML), a language based on the concept of embedding service calls inside XML documents. These service calls could be calls to other databases or even to tasks or process agents in a localised or distributed environment. It is also quite different from application-based peer-to-peer approaches, as in [Boniface and Wilken 2005], which use mediators to ensure interoperability, since that concern is solved by an information model like HL7 CDA, for example. This is discussed in more depth in section 4.3.3. This DW is of particular interest, as it is a research goal to incorporate quality metrics into such a distributed virtual environment.
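A rough sketch of the embedded-service-call idea follows; the service-call element, its attributes and the resolver are invented for illustration and do not follow AXML's actual vocabulary or the service standards mentioned above.

    # Invented vocabulary: a placeholder element stands in for a remote call and is
    # replaced by the fetched fragment when the document is materialised.
    import xml.etree.ElementTree as ET

    summary = ET.fromstring(
        "<patient-summary>"
        "<demographics>Jane Doe, 1970</demographics>"
        "<service-call endpoint='https://lab.example.org/results' patient='1234'/>"
        "</patient-summary>"
    )

    def materialise(tree, call_service):
        for parent in list(tree.iter()):
            for child in list(parent):
                if child.tag == "service-call":
                    parent.remove(child)
                    parent.append(call_service(child.attrib))  # splice in the fetched fragment
        return tree

    resolved = materialise(summary, lambda attrs: ET.fromstring("<lab-results>pending</lab-results>"))
    print(ET.tostring(resolved, encoding="unicode"))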

3. DATA AND INFORMATION QUALITY
The determination of the quality of resources, products, systems, processes etc. has always been essential to mankind. Being currently in an information age where data is a precious resource, its quality measurement is needed as corporations, governmental organizations, educational institutions, health institutions and research communities maintain and use data intensively. The generation of knowledge to support critical decisions based on these collections of data brings up the long-standing question of "is this data of high quality?". Several attempts have been made to answer this question in two main research communities: database management and information management. Whilst the database community uses the term DQ and is more concerned with formal approaches involving the models, languages and algorithms developed in its field, the management community uses IQ, as its interests lie in the abstract demographics, process influences and the impact of the data cycle (from collection to actual use) on quality. The next section defines DQ and IQ and investigates possible interrelations.

3.1 Defining data quality and information quality
Data quality, as defined in the American National Dictionary for Information Systems, concerns the correctness, timeliness, accuracy and completeness that make data appropriate for use. A more elaborate definition from ISO 8402 (Quality Management and Quality Assurance Vocabulary) reflects data quality as: the totality of characteristics of an entity that bears on its ability to satisfy stated and implied needs [Abate et al. 1998]. This means that a measurable level of quality is reached if a specification (conformity to a set of rules or facts) which reflects its intended use (utility) is satisfied. The writer [Abate et al. 1998] argues that conformity and utility are sufficient quality indicators, and this ties up with other traditional definitions like "fitness for use", "fitness for purpose", "customer satisfaction" or "conformance to requirements".

The above definitions are simple extensions of the definition of quality, and "information" can be substituted to give an appropriate definition of IQ. Many academics use IQ and DQ interchangeably [Bovee et al. 2003]. IQ is however contextual, as information is contextualised data [Stvilia et al. ]. On the other hand, others believe information quality is an issue only after data items have been assembled and interpreted as information. This stems from another key difference between data and information: the presence or absence of relationships between terms (numbers and words) [Pohl 2000]. So, as data is transformed into information, data quality and its attributes (dimensions) are transformed to information quality [Ehikioya 1999]. The notion of a product-based perspective, called data quality, and a service-based perspective, called information quality, as presented in [Price and Shanks 2004], is much preferred. The product-based perspective focuses on the design and internal view of an information system, while the service-based perspective focuses on the information consumer's response to their task-based interactions with the information system (IS). This implies that the processes involved in producing and providing the service are as important as the data itself when determining information quality, and it motivates the development of a process-dependent formal model to measure information quality. Even though information and data quality can be used interchangeably, the above perspectives and definitions show their differences. The following section looks at the attributes of quality, called data quality dimensions.

3.2 Dimensions of IQ
Generally, the assessment of data quality is based on attributes and characteristics that capture specific facets of quality [Scannapieco et al. 2005]. These facets, generally referred to as dimensions of quality, differ with the quality definition (intrinsically or extrinsically defined) or the quality model used in the assessment (theoretical, system-based, product-based, or empirical). The next section looks into these various assessments. There have been several attempts [Wand and Wang 1996; Motro and Rakov 1997; Gertz and Schmitt 1998; Pipino et al. 2002; Ballou et al. 1998] to define and generate lists of these dimensions, as the classical approaches tend to outline new and synonymous variations of existing dimensions. Instead of a long list of these dimensions and their definitions, which would be far from exhaustive, it seems more appropriate to present the smaller set of dimensions identified by [Leonidas Orfanidis 2004; Gendron 2000] in the health care domain.

—Accessibility. EHR data should be available to authorized users, including patients with special needs, care providers, mobile users, emergency services, and members of integrated care teams. Access should be easy to use and fast for both care professionals and patients. Data should be accessible from wherever they are needed, in appropriate forms and amounts. Privacy from unauthorized users should be strictly maintained, and overriding of authorization constraints should be recorded with documented reasons.
—Usability. EHRs should be accessible in different data formats and from different kinds of hardware and networks, to ensure interoperability between different systems. Data held within an EHR should be organized (including chronologically) and presented for ease of retrieval.


—Security and confidentiality. EHRs must be secure and confidential. Patients should be allowed to check who has access to their data and in what circumstances.
—Provenance. EHRs should show the source and the context of data, linked to metadata about the provenance of data. This is really to ensure believability.
—Data validation. The status of the EHR data should be described by metadata, for example, to indicate if data are pending, and times of entry and retention. Patients should be allowed to check the validity of their EHR data.
—Integrity. Data accreditation standards should be established for new data, and inconsistency and duplication should be removed.
—Accuracy and timeliness. The content of an EHR should be as near real-time as possible. Thus, data should be timely, in that they relate to the present. Temporality is also essential, as the EHR needs to cover the lifetime of patients.
—Completeness. The existence of further data should also be indicated, possibly with links to other data.
—Consistency. There should be consistency between items of multiple data from multiple sources. EHRs should comply with the existing relevant standards, such as security, data protection, and communication standards (HL7).

The above itemisation of a few of the data quality needs in health care from [Leonidas Orfanidis 2004] used data quality dimensions defined in other literature (represented in italics). Reviews like [Naumann and Rolker 2000] of the empirical and theoretical experiments which generate these dimensions have shown the limitations of such dimension lists. These limitations are not surprising, as [Abate et al. 1998] has identified the following problems of quality assessment using these dimensions. First of all, generating an exhaustive list of attributes may be difficult and unverifiable. Secondly, interdependencies of attributes make it difficult to define a minimal orthogonal set of these attributes. Last but not least, this results in the isolation of quality attributes, hindering the identification of systematic data quality problems.

The grouping of attributes to identify specific quality problems is a more comprehensive solution which eases the identification of systematic problems. Most categorisation attempts [Naumann and Rolker 2000; Strong et al. 1997; Price and Shanks 2004; Hinrichs 2000] to find a more comprehensive solution have not yet yielded a desirable one. One attempt which has gained some attention is the categorization from [Richard et al. 1994]. The writers grouped together dimensions according to the quality problems that give rise to them. The categories are as follows:

—Intrinsic (Accuracy, Objectivity, Believability and Reputation): a lack of process, or weakness in the current process, for creating data values that correspond to the actual or true values. This implies that the information has quality in its own right.
—Contextual (Value-Added, Relevancy, Timeliness, Completeness and Appropriate Amount of Data): a lack of process, or weakness in the current process, for producing data pertinent to the tasks of the user. It highlights the requirement that IQ be considered within the context of the task at hand.


—Representational (Interpretability, Ease of Understanding, Representational Consistency and Concise Representation): a lack of process, or weakness in the current process, for supplying data that is intelligible and clear. This emphasizes the need for the storage systems to provide access to this information in a way that is interpretable, easy to understand and manipulate, with a concise and consistent representation.
—Accessible (Accessibility and Access Security): a lack of process, or weakness in the current process, for providing readily available and obtainable data. The data must be accessible but secure.

There is no evidence of the benefits of these categories, and there are doubts about their usefulness when improving the data quality of data sets [Lee et al. 2002]. Dimensional analysis is still under active research and there is as yet no agreed consensus. A resolution of the dimensions' interdependencies and synonymous variations is most needed, as they play a very important part in the data quality assessment processes discussed in the next section.

3.3 Data Quality assessments
The simplest approach to data quality assessment, as identified by [Mandl and Porter 1999], involves the quantification of quality indicators (for example completeness and good communication). These indicators are actually the quality dimensions described above, which the research community has been defining over the past decade. Several other attempts, however, seem to follow the model of [Arts et al. 2001] with the following seven steps:

—describe the objectives of the entry
—determine the data items that need to be checked
—define data quality aspects
—select the methods for quality measurement
—determine criteria
—perform quality measurement
—quantify measured data quality

Attempts following the above model have resulted in a paradigm of total quality management frameworks for data quality. These frameworks have been empirical, theoretical, system-oriented, and ontological or domain oriented. [Bovee et al. 2003] developed a conceptual framework with all the essential dimensions or attributes for assessing IQ.

The AIMQ project [Lee et al. 2002] developed an overall model with accompanying assessment instruments for measuring IQ and for comparing these measurements to benchmarks and across stakeholders. Firstly, their model consists of a 2 x 2 framework defining what IQ means to consumers and managers. The framework is made of four quadrants, depending on whether information is considered a product or a service and whether the assessment is formal or customer driven. Secondly, there is a questionnaire for measuring IQ along the dimensions relevant to consumers and managers. The third component consists of analysis techniques for interpreting the assessments against the benchmarks of different stakeholders.


Formal methods for measuring data quality in the relational world have been considered, generally focusing on completeness [Davidson et al. 2004; Scannapieco and Batini 2004; Parssian et al. 2004]. The closest formal approach was the attempt by [Ehikioya 1999] using fuzzy logic. The writer states O = FT(D), where O is the output (information), FT is the transformational unit and D is the data to be processed. This implies that the quality of the information output depends on the transformation function and on the quality of the data unit. The writer then adopts a quality measuring spectrum showing the level of satisfaction of each dimension as it affects the overall quality directly: a confidence measure of 0 means low and 1 means high. As there will be infinitely many points on the information spectrum, fuzzy set and logic theory is most suitable. Even though this is similar to assigning a probability ranging from 0 to 1, the writer believes fuzziness is more flexible.

According to [Wand and Wang 1996], to design information systems that deliver high-quality data, the notion of data quality must be well understood. An ontologically based approach to defining data may be the ticket for success in real-world systems. More recently, [Milano et al. 2005] have been looking at identifying and possibly correcting data quality problems using an ontological approach called Ontology-based XML Cleaning (OXC). It formalises the problem of defining data quality metrics based on an ontological model, as DTDs and XML Schema are not expressive enough to capture additional knowledge and constraints. Furthermore, they lack the formal semantics needed to allow automated reasoning. Using the DTD or schema as a basis, an ontology capturing domain knowledge is designed by a domain expert, including any additional knowledge that has been left out of the conceptual design due to the limits of the schema language or bad design. A mapping between the ontology and the schema is also defined. Together they form a reference world against which data quality dimensions can be defined, and data quality improvement can also be applied. The result of applying the OXC methodology makes an XML document not only schema-valid but also ontology-valid. The paper managed to define the data quality dimension of completeness in terms of value completeness, leaf completeness and parent-child completeness.

[Shankaranarayan et al. 2003] discuss a modelling scheme for managing data quality via data quality dimensions using the Information Product Map (IPMAP). IPMAP fills a void by extending the Information Management System (IMS), which is used in computing the quality of the final product but does not quite measure up to total quality management [Shankaranarayan et al. 2003]. They compare the manufacturing of an IP to the manufacturing of a physical product (PP): raw material, storage, assembly, processing, rework and packaging. Both an IP and a PP may be outsourced to an external agency or organisation which uses different standards and computing resources, and IPs with similar properties and data inputs can be manufactured on the same, or a subset of, the production processes. Quality at source and continuous improvement have been successfully applied in managing data quality [Shankaranarayan et al. 2003]. There is a need to trace a quality problem in an IP to the manufacturing stages that may have caused it and to predict the processes' impact on data quality. The paper introduces an IP framework, described as a set of modelling constructs to systematically represent the manufacture of an IP, called the IPMAP. It then employs a metadata approach which includes data quality dimensions.

The IPMAP helps in visualising the distribution of the data sources, the flow of the data elements and the sequences by which data elements are processed to create an IP. It also enhances the understanding of the processes and the business units involved in the overall production. However, it does not provide functional computability of the data quality during the process.
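As a purely hypothetical sketch of what such functional computability might look like, the fragment below propagates per-stage confidence values in [0, 1] through the manufacturing stages of an information product, combining them with a fuzzy AND (minimum). It is loosely inspired by the fuzzy spectrum of [Ehikioya 1999] and the IPMAP idea, and is not taken from either paper; the numbers are invented.

    # Hypothetical: each stage's output quality is the minimum (fuzzy AND) of its
    # input qualities and its own process quality.
    def stage_quality(input_qualities, process_quality):
        return min(input_qualities + [process_quality])

    gp_extract  = stage_quality([0.95], 0.90)                     # raw GP data through an extraction step
    lab_extract = stage_quality([0.99], 0.97)
    merged_ehr  = stage_quality([gp_extract, lab_extract], 0.92)  # integration stage
    print(merged_ehr)   # 0.9: the weakest stage bounds the quality of the final product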

3.4 Data quality metrics
Statisticians were the first to investigate data quality problems, considering data duplicates in the 1960s. This was followed in the 1980s by data quality control analogous to the quality control and improvement of physical products. It is only at the beginning of the 1990s that measuring quality in datasets, databases and warehouses was considered [Scannapieco et al. 2005]. In order to understand data quality fully, the research community has identified a number of characteristics that capture specific facets of quality, referred to as data quality dimensions. These dimensions can be objective or subjective in nature, and their measurement can vary with the granularity of the dataset, semantically and syntactically. The next few sections look at the dimensions which have received the most attention.

3.4.1 Accuracy. There are quite a few definitions of accuracy: (i) inaccuracy implies that the Information System (IS) represents a Real World (RW) state different from the one that should have been represented; (ii) whether the data available are the true values (correctness, precision, accuracy or validity); (iii) the degree of correctness and precision with which real world data of interest to an application domain are represented in an information system.

The value-based accuracy of data is the distance between a value v and a value v' which is considered correct. Value-based accuracy can be syntactic or semantic. Syntactic accuracy is measured by comparison functions that evaluate the distance between v and v'. Semantic accuracy, however, goes beyond syntactic correctness, considering correctness in terms of fact and truth; it normally requires object identification. Ideally, value accuracy in XML data sets will be similar. The literature also considers accuracy metrics computed at coarser granularities, like the accuracy of a column, a relation or a whole database. There is a further notion related to accuracy, duplication, which arises when an object is stored more than once in a data source. Primary key and integrity constraints are normally useful in this instance, unless non-natural keys are employed. Measuring accuracy at a granularity coarser than values typically uses the ratio between the number of accurate values and the total number of values [Scannapieco et al. 2005].

3.4.2 Completeness. Completeness is defined as the extent to which data is of sufficient breadth, depth and scope for the task at hand [Wang and Madnick 1989]. [Pipino et al. 2002] identifies schema completeness as the degree to which entities and attributes are not missing from the schema (database granularity level), column completeness as a function of the missing values in a column of a table (column or attribute level granularity) and population completeness as the amount of missing values with respect to a reference population (value granularity: value, tuple, attribute or relational completeness). The above notion of completeness fits the relational model, where null values are present, but for the XML data model, where nulls are non-existent, the concept of a reference relation is introduced.

Given a relation r, the reference relation ref(r) is the relation containing all the tuples that satisfy the relational schema of r. However, the reference relation is not always available, and its cardinality, which is time dependent, must be used. Completeness is then expressed as

Completeness = cardinality(r) / cardinality(ref(r)).     (1)

It must be noted that in a model with null values, the presence of a null value generally indicates a missing value. However, in order to characterize completeness there is a need to understand why a value is missing: a value can be unknown, nonexistent or of unknown existence.

3.4.3 Time-related: currency, timeliness and volatility. Data values are either stable or time variable. Even in instances where data values are stable, their time of collection and transformation are relevant, hence the need for time-related dimensions. The time-dependent dimensions are interdependent not only on each other but also on dimensions like completeness and accuracy. The measurement of these dimensions depends on the availability of time-related metadata. Currency is a measure of how promptly data are updated; it is measured with respect to the last-updated metadata. Timeliness is the currency of data relative to a specific task, usage or point in time (meeting a deadline), so there is the concept of data being current but late. Volatility is a measure of the frequency with which data vary in time. [Ballou and Pazer 2003] define currency as

Currency = Age + (DeliveryTime - InputTime),     (2)

where Age measures how old the data unit is when received, DeliveryTime is when the information is delivered and InputTime is when the data was obtained. If volatility is defined as the length of time for which data remain valid, timeliness can be defined as

Timeliness = max(0, 1 - Currency/Volatility).     (3)

3.4.4 Consistency. This dimension captures the violation of semantic rules defined over a set of data units [Scannapieco et al. 2005]. In relational theory, integrity constraints are instantiations of such semantic rules. Consistency is problematic mainly in environments where the semantics cannot be fully expressed in the schema, due to restrictions of the schema language or to the diversity and sheer amount of these constraints, known and unknown. Relational theory presents inter-relational and intra-relational constraints (multiple attribute or domain constraints). To express semantics in non-relational environments, theoretical editing models can be employed.
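The following small sketch applies the metric definitions above (equations (1) to (3)); the timestamps, the reference cardinality and the volatility window are invented for illustration, and hours are used as the time unit throughout.

    from datetime import datetime

    def completeness(size_r, size_ref):
        return size_r / size_ref                                                  # equation (1)

    def currency(age_hours, delivery_time, input_time):
        return age_hours + (delivery_time - input_time).total_seconds() / 3600   # equation (2)

    def timeliness(currency_hours, volatility_hours):
        return max(0.0, 1.0 - currency_hours / volatility_hours)                 # equation (3)

    obtained  = datetime(2006, 10, 1, 9, 0)
    delivered = datetime(2006, 10, 1, 21, 0)

    print(completeness(940, 1000))             # 0.94
    c = currency(6, delivered, obtained)
    print(c)                                   # 18.0 (hours)
    print(timeliness(c, volatility_hours=72))  # 0.75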

3.5 Data quality control in Databases
The terms information quality and data quality have been used to describe mismatches between the view of the world provided by an information or database system and the true state of the world [Parssian et al. 2004]. In the relational model, information products can be viewed as a sequence of relational algebra operations. The quality (accuracy and completeness) of the derived data is a function of the quality attributes of the underlying data from the tables and of the operations. This inspired [Parssian et al. 2004] to develop output quality metrics for the relational operators, i.e. selection, projection and join.

[Mazeika and Böhlen 2006] use a string proximity graph which captures the properties of proper-noun databases with misspellings. This is typically useful in proper-noun databases, like name and address databases, where dictionary model techniques are not adequate. They use a combination of techniques which statistically give an efficient approximation of the selectivity for a given string and edit distance, together with a more precise computation of the centre and border of hyper-spherical clusters of misspellings.

QSQL is an extension of SQL that allows the inclusion of quality metrics in user queries to multiple sources, in order to select the data source that meets the quality requirements of the user. It however assumes a pre-assessment of the quality dimensions using sampling techniques, with the results stored in meta tables. The examples presented did not consider the granularity of the assessments, as the metadata were all table oriented. The work of [Ballou et al. 2006] also estimates the quality of IPs as the output produced by some combination of relational algebraic operations applied to data from base tables. [Ballou et al. 2006] assume that the quality of the base tables is not known at design time and cannot be known with precision, due to the dynamic nature of real world databases. The measure of the quality of an IP is the number of acceptable data units found divided by the total number of data units. The unacceptable data units are estimated from defects in data sampled from the base tables, with the data units defined at a level of granularity sufficient to produce other data units via algebraic operations.

Data quality measurement still remains an open problem. There seems to be a lot of work based on a few dimensions in an isolated context. A true measure will have to incorporate all the dimensions most relevant to the domain in context.
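Returning to the sampling-based estimate described above, the following is a hedged sketch in the spirit of [Ballou et al. 2006]; the defect counts are invented, and the independence assumption used to combine the two base tables is mine, not the paper's.

    # Invented numbers: per-table acceptable fractions come from samples, and a joined
    # data unit is assumed acceptable only if both contributing units are.
    def acceptable_fraction(sampled_defects, sample_size):
        return 1 - sampled_defects / sample_size

    patients_ok   = acceptable_fraction(12, 400)    # 0.97
    admissions_ok = acceptable_fraction(30, 500)    # 0.94
    print(round(patients_ok * admissions_ok, 3))    # 0.912: estimated quality of the joined product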

4. DATA IN THE HEALTHCARE DOMAIN
Health and clinical professionals meet the needs of patients by drawing on the knowledge accumulated by medicine over 5000 years. Clinical practice is a knowledge-based business, requiring clinicians to use vast amounts of information to make critical decisions during patient care. About a third of doctors' time is spent recording and synthesising information from personal and professional communication. Yet most of the information doctors use during clinical practice is locked up in paper-based records and the knowledge reproduced by rote [Smith 1996]. During personal and professional communication with patients and other clinicians, many questions and much information are shared. Managing this information, amongst the other problems listed below, has been the focus of most medical information research.

—Primarily there is a lack of explicit structure. Many of these documents are still textual narrations, and irrelevant search results can be reduced by applying some meaningful structure to these clinical documents [Schweiger et al. 2002]: data quality requirements like usability, consistency etc.
—There is also the problem of flexibility of a document's structure. Clinical data requires flexibility in terms of free-textual descriptions, different structural levels and even individual structures, so that the document must not restrict content [Schweiger et al. 2002]: requirement of an appropriate storage system.
—The emergence of new complex data types and their impact on the perception of the old types, in particular their interrelationships [MacKenzie 1998]: requirement of an appropriate storage system.


—Huge amounts of data, where the typical instance of data growth is exponential [Leonidas Orfanidis 2004; MacKenzie 1998]. Managing this knowledge with its high rate of change is critical to good clinical practice [Smith 1996]: a combination of data quality requirements like appropriate amount of data and an appropriate storage system.
—Diverse and fast-growing knowledge in the information set: requirement of an appropriate storage system and information model.
—Data storage is normally distributed and data duplication is the norm [MacKenzie 1998]. Health care processes involve caregivers in highly mobile functions. Care happens at the GP, at the secondary care hospital, in surgery, at the patient's bed and home, in ambulances etc. [Bilykh et al. 2003].
—Storage formats are as heterogeneous as the actual data [MacKenzie 1998]: a combination of data quality requirements like completeness of data and an appropriate storage system.
—Indexing facilities provided are inadequate [MacKenzie 1998]: requirement of an appropriate storage system.
—Non-integrated databases across hospitals [Lederman 2005]: requirement of an appropriate distributed storage system.
—Widespread use of paper records [Lederman 2005].
—Poor data security [Lederman 2005; Bilykh et al. 2003], and a growing range of people requiring access to clinical information, for example clinical librarians in the information process of evidence-based medicine [Jerome et al. 2001]: data quality requirements like accessibility, availability, security, consent, reuse etc.
—Poor process management of consent to disclosure of patient data [Lederman 2005]: data quality requirements like confidentiality.
—Scale and integration of heterogeneous data like clinical images (X-rays, MRI, PET scans) and speech, to mention a few: requirement of an appropriate storage system.
—Failure to understand consumers' needs [Riain and Helfert 2005]: data quality requirements like integrity of patients.
—Poorly defined information production processes coupled with an unidentified product life cycle [Riain and Helfert 2005]: requirement of an appropriate information model.
—Lack of an information product manager (IPM) [Riain and Helfert 2005], which is crucial if analysis corresponding to quality treatment is the prime query problem [Pedersen and Jensen 1998; Leonidas Orfanidis 2004]: requirement of an appropriate storage system and information model with the integration of quality control.
The development of electronic records and the use of the right storage medium, like XML databases, will either alleviate or reduce the extent of most of the aforementioned problems, apart from the data quality related ones. It will also help build research databases to enhance patient care delivery. Many national healthcare sectors are reforming existing policies, technology and frameworks with the long term goal of keeping a complete record of a patient's care. This complete record will be available to all concerned and is described as complete because it will capture all data generated from the process of care, spanning a patient's life from cradle to grave.

End of Year Review, October 2006. 20 · Henry Addico data generated from the process of care spanning a patient’s cradle to grave. This presents a fundamental change to how health professionals manage patient care data. The next few sections are summaries of its implementation in the UK.

4.1 Introduction: Current trends of the Electronic record for Patient care
The idea of computerizing the patient care record has been around since the early 1960s, when hospitals first started using computers [Grimson et al. 2000]. Its use was initially focused on financial processes and therefore only kept basic data about the patient [Canning 2004]. As hospitals and laboratories became more computerized, test results became available in computerized format and were in consequence integrated with the basic demographic data [Grimson et al. 2000]. Over the years the electronic record has had a variety of names: Computerized Medical Record (CMR), Computerized Patient Record (CPR) and Electronic Medical Record (EMR), to mention a few. More recently, Electronic Patient Record (EPR) and Electronic Health Record (EHR) have been globally used and defined. The EPR describes a record of periodic care mainly provided by an institution, whilst the EHR is a longitudinal record of a patient's health and healthcare from cradle to grave [Department of Health 1998]. The EHR has been embraced and is being actively used in most national healthcare domains [Department of Health 1998; Sherman 2001; Raoul et al. 2005; Riain and Helfert 2005].
The computerisation of the NHS and its patient records started under the National Programme for IT (NPfIT), now the responsibility of the new Department of Health agency NHS Connecting for Health (CfH). The National Programme for IT plans to connect over 30,000 GPs in England to almost 300 hospitals and give patients access to their personal health and care information. An estimate of about 8 billion transactions being handled each year by 2005 has been made according to [Canning 2004]. This will totally transform the way the NHS works by enabling service-wide information sharing of electronic patient records [Sugden et al. 2006]. The NHS CfH, which was launched in April 2005, is responsible for delivering the national IT programme along with business critical systems. The programme started with the implementation of the N3 national networking service, followed by the NHS care record service functionality. This came in two parts: firstly the national procurement, which included the Spine, electronic booking and the electronic transfer of prescriptions, and secondly the "Cluster" level procurement, which included patient administration systems (PAS), electronic ordering and browsing of tests, picture archiving and communication systems as well as clinical decision support systems. The rest of these sections investigate the current state of the above systems and services, in order to gain an overall understanding of the developments made by the programme towards the creation of the national patient care record, by discussing some of the above elements in the following order [CfH 2005; Canning 2004; Sugden et al. 2006].
—NHS Care record service
—Choose and Book
—Electronic Prescription service (EPS)
—N3 national network for the NHS
—NHSmail national email and directory service


—Picture Archiving and Communications Systems (PACS)
This will help in discussing the challenges affecting the programme's successful development, particularly as the implementation of such large scale health service IT projects in a number of national health sectors has proved difficult [Hendy et al. 2005].

4.2 The Programme
4.2.1 NHS Care record service. The NHS Care record service (NCRS) is central to the strategy for the creation of the unified electronic record [Moogan 2006]. The electronic record which will be the output of this service will be the basis for the implementation of the other services and systems mentioned above. It will enable the booking of appointments based on the patient's choice (the Choose and Book service), the automatic transfer of complete records between General Practitioners when the patient changes address, instant access to patient medical records when needed for emergency care, and improved health by giving people access to this information, as well as improving care through better safety and outcomes by ensuring that information and knowledge are available when needed. It will replace the old Patient Administration System (PAS) [Hendy et al. 2005].
The NCRS is composed of the following two elements, with a third patient access feature [CfH 2006]. Firstly, the detailed care record, which comprises data generated during episodes of care in a particular institution combined with several others from other organisations providing care to the same patient. These organisations include NHS acute trusts, mental health trusts, general practices and their wider primary care team (pharmacy, community nursing team, dental surgery, opticians, social care team and so on). Secondly, the Summary Care record, with less complexity, containing information to facilitate cross-organisation patient care. This needs to be carefully controlled so that it serves its purpose as the summary of the detailed record. It should contain aspects like major diagnoses, current and regular prescriptions, allergies, adverse reactions etc. which are deemed significant, whilst still keeping its complexity to a minimum. The summary record will have components maintained or obtained from general practices, maintained summaries of the patient record, discharge summaries from hospitals, mental health trusts etc. Lastly, the access feature, HealthSpace, which not only ensures the availability of the summaries of the electronic record to the patient via the secure HealthSpace website, but also lets patients add comments, treatment preference notes, important facts like religion and a record of self medications, and update details about their weight, blood pressure and so on. HealthSpace will eventually replace the NHS Direct website [Cross 2006a]. A typical example from the central Hampshire electronic health record pilot project shows the typical elements of the electronic record as follows.
—General practice systems
  —General practice clinical record (coded items)
  —General practice prescription record
—Hospital systems
  —inpatient record (hospital episode statistics extract)
  —discharge letter


  —Pathology requests
  —Radiology requests
  —inpatient drug prescriptions and administrations
  —outpatient attendances
  —attendances at the accident and emergency department and discharge letter
  —Maternity discharge letter
  —waiting list details
—NHS Direct
  —Call summary
  —Advice given
—Ambulance
  —Patient details
  —Observation details
  —Intervention details
—Social services
  —Client details
  —Residential care record
  —Non-residential care record
The following illustration of a patient record, taken from [Moogan 2006], should help give a better picture. Mrs Ross's general practitioner records all consultations with Mrs Ross, as does the practice nurse. Together these form part of the general practice component of Mrs Ross's Detailed Care Record. So when she sees a diabetologist, to whom she is referred after being diagnosed with diabetes mellitus, a junior doctor and, separately, a nurse in her local hospital can see the detailed care record contributed by the hospital and, with Mrs Ross's consent, portions of the general practitioner's contribution, with a supporting knowledge base that offers decision support and guidance, whilst a record of the access trail is kept progressively. Mrs Ross then sees her pharmacist to get her medication, her optician to have a check for diabetic eye disease and her podiatrist for foot care. Each will be able to see their own records and, if she consents, important entries in other parts of her detailed record that will help deliver the best care. A few weeks later Mrs Ross visits her hospital diabetic clinic. They record, among other things, her new blood test, change her medication, and note that she has developed the first signs of diabetic eye disease. These are all significant enough to be in her summary record. She will have access to part of this information through HealthSpace and can even record things like her blood pressure, her weight changes etc.
A summary of the patient EHR will be held in a national database, known as the 'Spine', ensuring that particularly vital information is always accessible. The more in-depth record will be kept locally at the patient's main GP or hospital. Issues of data confidentiality are very important and a lot of measures have been taken; this is discussed further in the next section, where it is more appropriate. The NCRS will start in the middle of 2006 and will develop iteratively in improvement cycles until 2010 [Moogan 2006; CfH 2006]. This summer (2006) every household in England should receive a leaflet explaining the NHS plan to make their healthcare accessible electronically [Cross 2006a].
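As an illustration of how such an aggregated record could be represented in the XML setting this report advocates, the following is a minimal, hypothetical sketch in Python that assembles a summary-record fragment from a few of the sources listed above. The element names (summaryRecord, gpSummary, dischargeSummary, allergy) are invented for illustration and are not taken from the NCRS, HL7 or openEHR specifications.

import xml.etree.ElementTree as ET

# Hypothetical fragments contributed by different care settings.
gp_summary = ET.Element("gpSummary", attrib={"practice": "ExampleSurgery"})
ET.SubElement(gp_summary, "diagnosis", attrib={"code": "DM2"}).text = "Diabetes mellitus type 2"
ET.SubElement(gp_summary, "allergy").text = "Penicillin"

discharge = ET.Element("dischargeSummary", attrib={"hospital": "ExampleTrust"})
ET.SubElement(discharge, "medicationChange").text = "Metformin dose increased"

# The summary care record aggregates the significant items from each source.
summary = ET.Element("summaryRecord", attrib={"nhsNumber": "0000000000"})
summary.extend([gp_summary, discharge])

print(ET.tostring(summary, encoding="unicode"))

The nested output mirrors the tree-like composition of the record from its contributing organisations, which is exactly the structural property later chapters argue XML databases can exploit.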


4.2.2 Choose and Book. The Choose and Book system allows a General Practitioner to refer a patient for elective treatment at another centre via a computer system, enabling the General Practitioner to perform the booking directly from the surgery [Cross 2006b; Canning 2004; CfH 2006]. So, in the above illustration, Mrs Ross will have a choice of which diabetologist to see as well as of the time to book the initial appointment. The patient will have up to a choice of four hospitals or clinics [CfH 2006; Coombes 2006]. The information detailing the services which local centres offer is compiled by the Department of Health (DoH) and will include performance statistics like waiting times, rates of methicillin resistant infection and cancelled operations. However, the commissioning is left to local primary care trusts, placing no restriction on whether the commissions go to NHS hospitals or independent treatment centres [Coombes 2006]. The Choose and Book system will effectively change the access rights to the patient EHR, so that a clinician from the elective treatment centre will have authorisation to all or sections of the record. This raises questions like "will this involve transfer of sections or all of the record between the two treatment centres?", "how secure will the transfer between the two centres be?" and "will patient confidentiality be maintained?". The answers will largely depend on the structure of the data warehouse.
4.2.3 N3 national network for the NHS. N3, the new network replacing NHSnet (a private NHS network), will run nationally. It is a world class networking service vital to the creation, delivery and support of the national IT programme. It has sufficient, secure connectivity and broadband capacity to meet the current and future networking needs of the NHS [?]. It will serve both as the broadband network for the Spine and as a telephony network. It will offer rapid transmission of X-ray films and other images [CfH 2006]. There will be private virtual networks to allow transmission between GPs and trusts [Hendy et al. 2005]. NHS Connecting for Health delegates the responsibility for integrating and managing the service to an appointed N3 Service Provider (N3SP), with British Telecom being the main service provider. Implementation of N3 began in April 2004. The planned timeline was to connect at least 6,000 sites in the first year, that is by 31 March 2005, and then 6,000 in each following year for two years, the estimate being that all 18,000 sites in the NHS would be connected by 31 March 2007. By August 2005 over 10,000 sites had received their N3 connections, including more than 75 per cent of GP practices [?]. The total number of sites in England connected to N3 by the 17th February was 13,559, and in Scotland 1,148 [N3 2006]. Commendable progress has been made, with 98% of general practices along with essential components of the records service having been in place since the end of February 2006 [Cross 2006a]. A critical look at technological needs like wireless systems, voice enabled devices, handwriting recognition devices etc. and their support is however out of the scope of this report. Nevertheless the implications and effects on the implementation of the EHR are well understood and a list of these is presented in section 4.3.1.
4.2.4 Electronic Prescription service. The Electronic Prescription Service (EPS) will allow the transmission of a patient's prescription from the General Practitioner to a pharmacy of the patient's choice [CfH 2006]. The Electronic Prescription Service will streamline the current time-consuming process for dealing with repeat prescriptions by alleviating the need for patients to visit their GP just to collect a repeat prescription.

It will provide a more efficient way of dealing with the 1.3 million prescriptions currently being issued. It will be implemented in two phases. The first stage, which allows existing prescription forms to be printed with a unique barcode, had been rolled out at the time of writing this report [NCFH 2006b]. The patient can then present this form at a pharmacy, which retrieves the record using the barcode. The second stage will however totally replace the paper based prescription with an electronic one [NCFH 2006a]. The electronic prescriptions will include an electronic signature of the prescribing health professional, and access to the prescription will be controlled by smartcards. In the longer term, the EPS will be integrated with the NHS care record system [NCFH 2006b]. This service started in 2005 and is expected to be rolled out fully by the end of 2007 [CfH 2006; NCFH 2006b].

4.2.5 Picture Archiving and Communications Systems (PACS). PACS will enable the storage and transmission of X-ray and scan data in electronic format, linked with the patient's detailed record [CfH 2006]. PACS technology allows for a near film-less process, with all the flexibility of digital systems. This eliminates the costs associated with hard film and releases valuable space currently used for storage. PACS will deal with a wide range of specialties, including radiotherapy, CT, MRI, nuclear medicine, angiography, cardiology, fluoroscopy, ultrasound, dental and symptomatic mammography. In due course NHS PACS will be tightly integrated with the NHS Care Record Service (CRS) described above, removing the traditional barrier between images and other patient records and providing a unified source for the clinical electronic record of a patient. PACS will be delivered at NHS locations including strategic health authorities, acute trusts and any location where pictures of a medical nature are required for the purposes of NHS diagnosis or treatment, such as military hospitals, the homes of NHS radiologists and specialists, and new diagnostic treatment centres [NCFH 2005]. It will be available nationally during 2007 [CfH 2006].

4.3 Challenges of the Electronic Health Record and its quality
4.3.1 Technological Issues. Most interaction between clinicians comprises narrative (free text). Narrative contains more information than isolated or coded words. Most electronic records, however, rely on structured data entry, when fundamentally health care data is not structured but rather semi-structured. The semi-structured nature of healthcare data calls for the consideration of XML databases, XML now being a widely used data and object exchange format for both structured and semi-structured data [Shanmugasundaram et al. 1999; Widom 1999; MacKenzie 1998]. However, database container solutions which support the required interoperability are still under intense research.
Making sure that the data is in electronic format requires heavy investment in wireless, voice enabled, handwriting recognition and touch screen software and hardware, to mention a few. Handwriting, for example, is automatic, but for most people entering data into a computer via typing is not. Handwriting also potentially allows more thought for focusing on the diagnosis and the management of a patient's illness. However, tools that will stimulate cognitive reasoning, such as differential diagnosis, prompting, reminders, mnemonics, algorithms, references, risk calculators, decision trees, and best evidence resources, are difficult to develop [Walsh 2004].

Speech is another, easier means of data entry [Zue 1999]:

Speech is natural, we know how to speak before knowing how to read and write. Speech is also efficient: most people can speak about five times faster than they type and probably ten times faster than they can write. And speech is flexible: we do not have to touch or see anything to carry on a conversation.

The overhead in terms of investment is huge in both time and capital. The success of electronic health care records is so dependent on such technological advancements that any solution without these technologies will not match expectations. The shortcomings have an adverse effect on controlling the quality defects introduced into the data while clinicians deliver care to patients.
4.3.2 Confidentiality and Patient Acceptance. Access to the electronic record will only be available to trained NHS professionals, who will require a smart card with a security chip together with a Personal Identification Number (PIN) [CfH 2006]. Access to the electronic record will be role based. Each user will have a role and will also belong to a group, which is associated with specific privileges and rights to specific sections of the electronic record. An access trail is recorded and is monitored by a privacy officer [CfH 2006]. These measures are restrictive but unfortunately not foolproof, as card and identity fraud still prevails in similar systems for personal banking. This raises an alarm which patients cannot ignore, and they are therefore not generally convinced about the safety of their records. The availability of the information over the internet is another issue of concern which affects the implementation with regard to gaining patients' trust; the use of virtual private networks and encryption of the data, especially with an algorithm which involves the patient's unique NHS number, is not adequate on its own. However, patients might accept that their data confidentiality and security are by far better than when the data was kept in pieces, incomplete, and with very little control over which NHS personnel had access to it, as the security is far better than that previously used to control both the paper based record and the partial electronic record. There is evidently better quality of security, audits and fraud detection [CfH 2006]. Generally patients are often upset to discover the sharing of their records, and the inclusion of the social care record decreases public confidence, as only 23% of people would be willing for their NHS records to be shared with social care staff [Cross 2006a]. Measures like informing patients of their right to opt out and providing a sealed envelope service to control parts of specific sections should help. Patients can limit their participation, allowing access for emergency use only, partial access to the summary of the records, or no summary care record at all. Patients can also limit which NHS professionals have access to their record, based on grouped roles or individual roles [Moogan 2006]. This challenge is of most interest as it is a symptom of the data quality dimensional category of accessibility discussed earlier. Any attempt to deal with this category will require a process model integrated with a suitable security model for the health domain. It will have to provide both assessment and control, as it is one thing for the model to provide outstanding quality in terms of accessibility and another to provide a measure of its level.
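To make the role based access scheme described above more concrete, the following is a minimal, hypothetical sketch in Python of how a request to read a section of the record could be checked against a group's permissions and a patient-defined sealed envelope list, with every decision written to an access trail. The role, group and section names and the policy structure are invented for illustration and do not reflect the actual NCRS access control design.

from dataclasses import dataclass, field

# Hypothetical mapping of groups to the record sections they may read.
GROUP_PERMISSIONS = {
    "gp_team": {"gp_summary", "prescriptions", "allergies"},
    "emergency_care": {"summary"},
    "social_care": {"social_care_record"},
}

@dataclass
class AccessRequest:
    user: str
    group: str
    section: str

@dataclass
class PatientPolicy:
    sealed_sections: set = field(default_factory=set)  # sealed envelope
    share_with_social_care: bool = False

access_trail = []  # audit log monitored by a privacy officer

def can_read(req: AccessRequest, policy: PatientPolicy) -> bool:
    # Group permissions grant access; patient preferences can withdraw it.
    allowed = req.section in GROUP_PERMISSIONS.get(req.group, set())
    if req.section in policy.sealed_sections:
        allowed = False
    if req.group == "social_care" and not policy.share_with_social_care:
        allowed = False
    access_trail.append((req.user, req.group, req.section, allowed))
    return allowed

policy = PatientPolicy(sealed_sections={"mental_health"})
print(can_read(AccessRequest("nurse01", "gp_team", "allergies"), policy))            # True
print(can_read(AccessRequest("sw02", "social_care", "social_care_record"), policy))  # False

The point of the sketch is that accessibility, as a quality dimension, is produced by the interaction of role permissions, patient consent and audit, which is why a measure of accessibility has to take the whole policy into account rather than any single mechanism.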


4.3.3 Standardization. The lack of standardization in the keeping of electronic records, both within and between local primary care trusts, seems to hamper the progress of the programme. A key requirement for making an interoperable electronic record is preserving its meaning and protecting its confidentiality and sensitivity across systems. This is unachievable without appropriately defined standards [Cross 2006a; Kalra and Ingram 2006; Blobel 2006]. As [Dolin 1997] argues,
Data can be nested to varying degrees (e.g. a data table storing laboratory results must accommodate urine cultures growing one or more than one organism, each with its own set of antibiotic sensitivities). Data can be highly interrelated (e.g. a provider may wish to specify that a patient's renal insufficiency is due both to diabetes mellitus and to hypertension, and is also related to the patient's polyuria and malaise). Data can be heterogeneous (e.g. test results can be strictly numeric, alpha-numeric, or composed of digital images and signals) ... a computerized health record must be able to accommodate unforeseen data.
The development of a generic cross-institutional record architecture and infrastructure started with the Synapses and SynEx research projects, an attempt to develop a generic approach to applying internet technologies for viewing and sharing healthcare records integrated with existing health care computing environments [Grimson et al. 2001; Joachim Bergmann et al. 2006]. Synapses and SynEx built upon the considerable research efforts by the database community into federated (collections of autonomous, heterogeneous databases to which integrated access is required) or interoperable database systems. Synapses aimed to equip clients with the ability to request complete or incomplete EHCRs from connected information systems referred to as feeder systems. The European Union's Telematics Framework Programmes have supported research over the past decade to create standards like the Good European Health Record (GEHR) from standards like the CEN standards and ENV 13606 [Kalra and Ingram 2006].
The European Pre-standard ENV 13606:2000 (Electronic Healthcare Record Communication), which was the result of an earlier standard, ENV 12265, is a message based standard for the exchange of EHRs [Christophilopoulos 2005; Grimson et al. 2001]. Its revision in 2001 took into consideration the adaptation of techniques from other standards like OpenEHR. OpenEHR was initiated under an EU research programme (Good European Health Record, continued as Good Electronic Health Record). Unlike ENV 13606:2000 it followed a multi-level methodology as opposed to a single-level one [Christophilopoulos 2005]. The first level is a generic reference model for healthcare containing a few classes (e.g. role, act, entity, participation etc.) ensuring stability over time, whilst the other level considers healthcare application concepts modelled as "archetypes" (the key concept of OpenEHR). "Archetypes" are reusable elements which facilitate interoperability and re-use and which have the capability of evolving with medical knowledge and practice [Christophilopoulos 2005; Rector et al. 2003]. Medical Markup Language (MML) is another set of standards, developed in 1995 to allow the exchange of medical data between different medical information providers by the "Electronic Health Record Research Group", a special interest group of the Japan Association for Medical Informatics. It was based on Standard Generalized Markup Language (SGML) at its inception and was later ported to XML [Guo et al. 2004].


However, the recent version 3.0 is based on the HL7 Clinical Document Architecture (CDA). The Patient Record Architecture (PRA) was described by [Boyer and Alschuler 2000], who draw its semantics from the HL7 RIM:
The HL7 Patient Record Architecture (PRA) is a document representation standard designed to support the delivery and documentation of patient care. A PRA is a defined and persistent object. It is a multimedia object which can include text, images and sounds.
One key feature of HL7 is that it allows refinement of models by specification restrictions, and its vocabulary is replaceable [Glover 2005]. The HL7 standard version three (HL7 v3), which includes the Reference Information Model (RIM) and the Data Type specification (both ANSI standards), spans all healthcare domains. Unlike the previous versions, which provided syntactic interoperability (exchange of information between systems without the guarantee of meaning consistency across the systems), HL7 v3 aims at semantic interoperability (which ensures unambiguous meaning across systems) through its support for data types. This version does not quite measure up, as stakeholders expect computable semantic interoperability (CSI); this is however a limitation, as XML cannot support CSI [Jones and Mead 2005]. HL7 Version 3 includes a formal methodology for binding the common structures to domain-specific concept codes. This enables separation of common structures from domain-specific terminologies, such as the vocabularies used in the Systematized Nomenclature of Medicine (SNOMED), Digital Imaging and Communications in Medicine (DICOM), the Medical Dictionary for Regulatory Activities (MedDRA), Minimum Information About a Microarray Experiment (MIAME)/Micro Array and Gene Expression (MAGE) and Logical Observation Identifiers, Names and Codes (LOINC) [Jones and Mead 2005].
The problems highlighted above stem from data quality, a problem which has gained considerable attention from the research community. Healthcare and most other domains reliant on data have considered the effect of poor data quality. Its measurement and control remain difficult and an open problem. The best standards for the EHR will have to incorporate data quality control if the problems are to be nipped in the bud. Any developed system which does not incorporate the aforementioned standards will only result in a less future proof solution.
4.3.4 Scale and rising patient expectations. The National Health Service in England alone handles 1 million admissions and 37 million outpatient attendances per annum, requiring high quality and efficient communications between 2,500 hospitals and 10,000 general practices [Moogan 2006]; this scale puts a huge amount of stress and pressure on the implementation of the programme, considering the risks of migration and of integration with the legacy systems gradually developed over time. Expectations of the benefits of the electronic records are so high that they may become an obstacle to accelerating its development, especially with frequent implementation failures and the fact that health data growth is exponential [MacKenzie 1998]. Health care in any national setting is also quite a complex enterprise [Goldschmidt 2005]. It is associated with such immense, boundless knowledge that it is overwhelming, and yet the integration of this knowledge is a critical need [Kalra and Ingram 2006].

The fact that a centralized approach is being adopted is not future proof, considering the sheer scale. A distributed approach like that of [Abiteboul et al. 2004] would be far better, as the data would be kept at the location where it is most frequently used, so that GPs would hold the data of their patients and could securely retrieve other aspects of the record from secondary care if required. All that is required is appropriate references in the record to its other segments.
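As an illustration of such a distributed approach, the following is a minimal, hypothetical sketch in Python of a record held locally at a GP whose segments reference material held by other care settings and are fetched only when required. The segment registry, reference format and fetch function are invented for illustration; no real NHS service or peer-to-peer protocol is implied.

# Hypothetical registry standing in for remote care settings in a
# distributed (e.g. peer-to-peer) EHR architecture.
REMOTE_SEGMENTS = {
    "hospital:discharge/123": {"site": "ExampleTrust", "text": "Discharge summary ..."},
    "mental_health:assessment/77": {"site": "ExampleMHT", "text": "Assessment ..."},
}

def fetch_segment(reference: str) -> dict:
    """Resolve a segment reference against the holding site (stubbed here)."""
    return REMOTE_SEGMENTS[reference]

# The GP holds its own data plus references to segments held elsewhere.
gp_record = {
    "nhs_number": "0000000000",
    "gp_entries": [{"date": "2006-05-01", "note": "Routine diabetes review"}],
    "references": ["hospital:discharge/123"],
}

def assemble_record(record: dict) -> dict:
    """Materialise a full view of the record by resolving its references on demand."""
    full = dict(record)
    full["resolved"] = [fetch_segment(ref) for ref in record["references"]]
    return full

print(assemble_record(gp_record)["resolved"][0]["site"])  # ExampleTrust

The design choice here is that the authoritative copy of each segment stays with the organisation that produced it, and the "complete" record is only ever a view assembled from references, which avoids the scaling problems of a single central store.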

4.3.5 Other Factors. [Midgley 2005] believes that the effect of handling health care information technology as a compulsory element in political processes is not a positive one. New campaigns have been launched to persuade clinicians to support the multimillion pound computerisation programme of health care in England. This was a result of the underperforming launch of elements of the programme, in particular the Choose and Book system [Cross 2006b]. Even though the survey conducted in January by Medix shows scepticism on the side of NHS professionals (68% regarded the performance of the programme as poor), the fact that 59% of GPs and 66% of other doctors still believe electronic records will improve patient care should drive the programme to succeed [Cross 2006b]. The requirement to limit health costs and to maximize resource utilisation cannot be ignored, especially considering the extent of NHS expenditure. The effect of Government policies controlling and maintaining evidence-based and quality assured care cannot be ignored either, as their integration in the process is paramount [Kalra and Ingram 2006]. Communication between the NHS CfH and the NHS care centres is poor, with a lack of clarity about future developments [Hendy et al. 2005]. The financial circumstances of some trusts are also reported to slow the transition to electronic records. Major releases have failed during delivery because the legacy IT systems which most trusts have procured over a long period of time, under long term contracts, are not compatible. Trusts might need new contracts to have these systems replaced. The timing of changing these systems is of concern, as it will have a negative impact on the development of the electronic record. International research has highlighted the clinical, ethical and technical requirements needed to effect the transition to electronic records [Kalra and Ingram 2006]. It is too early to predict whether the NHS programme for electronic records will succeed or fail [Cross 2006a]. However, as the world looks at the EHR, with standards like HL7, openEHR etc. emerging and fusing, there is a glimpse of success ahead. According to [Goldschmidt 2005], by 2020, based on present projections, approximately 50% of health care practitioners will be using some form of a functional EHR. However it needs vast investment, and massive expenditure must be made in the short term while most benefits can only be realized in the long term. Despite long-standing claims, and data from recent studies, there is still relatively little real-world evidence that widespread adoption of the electronic record will save money overall.
The introduction of the EHR will eliminate most of the data problems in the healthcare domain identified in section 4, leaving the data quality related ones. However, as EHR models are aggregational in nature, they aggravate the quality problems: apart from carrying these problems and difficulties along, they also compound their severity. Nevertheless, the expectations of quality from the users of this new information model will only rise.

This is why an attempt to address these quality problems is paramount in this new setting.

5. RESEARCH DIRECTIONS

Relational modelling and data management have gained the most ground. They involve the breaking down of complex units into typed atomic units; relations are then created between these units to mimic the initial conceptual complex units. This is normally not enough to support most online decisions and requires the generation of views for different purposes which readily answer the questions of interest (a data warehouse with materialised views). Data warehouses serve these purposes, as they are designed to efficiently store data and answer online queries. This is typical of the healthcare domain, and the proposed solution for the centralised EHR in the UK is a distributed data warehouse solution.
Data warehouse management processes employ means to manage both data integration and its quality. This however is inadequate when determining a measure for service based IQ, as it considers tasks and transformations performed mainly by system users like the programmer, designer etc. Hence the results do not reflect the right level or measure of data quality from the perspective of other stakeholders. The process of data collection, organisation into storage, reorganisation, processing, reinterpretation, and summarization for presentation and for other data applications forms the process of manufacture of information, in which users play the roles of:
—Data producers: people or groups who generate data during the data production process.
—Data custodians: people who provide and manage computing resources for storing and processing data and carry responsibility for the security of the data.
—Data consumers: people or groups who use the data; the people that utilize, aggregate and integrate the data.
—Data managers: responsible for managing data quality.
The quality of data is affected by this data flow within a domain, and a true measure will need to consider the influence of the above processes and the effect of these roles on data quality systematically. Apart from [Hinrichs 2000] and the IP approach [Davidson et al. 2003] (cross organisation and department oriented), there is still no well established process model for managing data quality. Most assessments, including the ones which have later been applied in a domain like health care, focus on cross domain data quality concepts and reuse, despite the fact that data quality is domain dependent [Gendron et al. 2004]. There has been no attempt as yet to restrict the solution to the healthcare domain first and then tackle the abstraction later; this, on the other hand, might lead to fruitful possibilities.
The assessment and measurement of healthcare data quality needs to go beyond the quantification of several of the quality dimensions of relevance to the domain. It requires an integrated performance measurement approach for service quality and clinical effectiveness. This can only be achieved with appropriate information system architectures which are process oriented, providing meta-information about the processes and their quality effects via appropriate indicators, which will provide a better basis for automating most aspects of the process of quality management, including monitoring and improving healthcare processes.

It will also enhance the tracking, management and audit relationships within the healthcare network, as more detailed metadata can be captured from participants and service groups [Helfert et al. 2005].
A process based approach will require a tight integration of the workflow and the information model in a particular domain. The emergence of new information models for complex multidimensional data like healthcare data, coupled with the adoption of XML as a data exchange format and data storage, presents new opportunities to transform the management of such data, localised or distributed. Such models include the Clinical Document Architecture (CDA) of HL7 version 3, OpenEHR archetypes and the XML DTD for health care by ASTM E31.25 (a subcommittee of ASTM healthcare informatics), amongst others. These attempts to provide standard and complete models aggregating the disparate health information about patients are a transition, comparable to the time after Codd's [Codd 1970] introduction of relational algebra: the period of building the essential models which are the backbone of today's legacy traditional DBMS, a time when integrity constraints were thought to be enough to control the quality of input data into databases and a time when metadata and data quality issues were ignored [Stephens 2003]. However, the concept of data quality and the need to incorporate quality metrics in databases is now well understood [Motro and Rakov 1997]. The depth of incorporation of these metrics into the above information models is unfortunately quite poor and needs rigorous consideration.
Instead of following the data decomposition approach of breaking down the complex units into atomic types, an attempt to manage the above information models as whole complex data types is much preferred. This will however require a standard means of transforming these information models, using some sort of object modelling construct, into a data model. This should be cross domain, with an appropriate language. As this is a direct transformation, it should result in a mapping of information quality expectations from the model into data quality policy constraints. This will require the adoption of a metadata data model to manage the data quality control.
It follows from this review that the information production process is a combination of data manipulation queries involving selections, inserts and deletions. Assume that an aggregation of such operations is informally referred to as a process. Tasks performed in a domain will be based on a process that may or may not be allowed to alter data. Typically, processes need to be identifiable, so as to allow only a capable user to run them, or to be used by means of some access control policy. They should, in particular, be able to define the impact they have on the data quality dimensions or their categorisation. This aggregation model, like the information models mentioned in the last paragraph, has a lower granularity. However this level of granularity is more appropriate when automatic generation of semantic metadata is needed, as there is the possibility of considering the tasks involved in the workflows or practices of the domain together with all their associated data. The nested nature of the information models forms a tree structure similar to that of an XML document.
XML databases offer most of the features required to manage these information models and the process approach. Their use will enhance the exploitation of the structural properties of an EHR, the implementation of flexible policy files, the referencing of other policy files, and the tying of processes to sub-trees or particular access paths.

An attempt to implement a relational database which manages data, accuracy and lineage (i.e. data and some of its quality attributes) by [Widom 2005] is a key motivation for this research. However, this work will differ, as it will not involve the creation of a new XML database system from scratch, but will rather derive ways to incorporate data quality measurement in a distributed XML database environment as described in section 2.7. This work will also go beyond accuracy and lineage, considering as many healthcare related dimensions as possible, in particular non-objective quality attributes like accessibility. This research aims to:
—Define the notion of processes formally for data quality measurement. This will include the exploration of process finiteness per domain, and of process behaviour and effect on data quality. Even though the generation of an exhaustive process basis for a particular domain will be difficult, it is critical, as the existence of unknown processes affects the overall quality and influences the ability to trace quality defects. This will require a formal construct like a process calculus and the application of fuzzy logic instead of simple ratios and probability theory.
—Define operators for the creation of new processes, the division of processes into smaller units and the combination of processes, with their data quality effects in context (a sketch of this process view follows the list).
—Formalise a data quality policy scheme under which each process will need to operate. This will provide a means of specifying constraints which are implementable in a distributed environment. It will cater for all relevant dimensions and will follow a tiered approach structured around the categories of dimensions.
—Incorporate accessibility in particular, as it has had very little focus in research despite being essential in the EHR setting. An adaptation of the security policy model of [Anderson 1996], which is comparable to Bell-LaPadula (a military security policy) and Clark-Wilson (a banking security policy), with an appropriate process based path indexing, is needed.
—Refine the policy model, resolving redundancy issues surrounding processes with equal, child or ancestor access paths. This will also investigate the administration of data quality policy files in a distributed environment, especially one following the peer to peer architecture mentioned in section 2.7.
—Test the model using a dataset from a hospital about the obesity of patients.
—Derive a domain independent solution which will be more appropriate and reusable.
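As an early, non-authoritative illustration of the process view proposed above, the following Python sketch models a process as a named unit that declares its effect on a few quality dimensions, together with a composition operator that combines those effects. The dimension names, the multiplicative combination rule and the example processes are all assumptions made for illustration; the research will replace them with a formal construct such as a process calculus and fuzzy logic.

from dataclasses import dataclass

@dataclass(frozen=True)
class Process:
    """A named aggregation of data manipulation operations and its declared
    effect (a multiplier in [0, 1]) on selected data quality dimensions."""
    name: str
    effects: dict  # e.g. {"completeness": 0.98, "accuracy": 0.95}

def compose(p: Process, q: Process) -> Process:
    """Sequential composition: effects on shared dimensions are multiplied,
    a deliberately simple stand-in for the formal operator to be defined."""
    dims = set(p.effects) | set(q.effects)
    combined = {d: p.effects.get(d, 1.0) * q.effects.get(d, 1.0) for d in dims}
    return Process(f"{p.name};{q.name}", combined)

# Hypothetical healthcare processes.
register_patient = Process("register_patient", {"completeness": 0.99, "accuracy": 0.97})
record_discharge = Process("record_discharge", {"completeness": 0.95, "timeliness": 0.90})

pathway = compose(register_patient, record_discharge)
print(pathway.name)     # register_patient;record_discharge
print(pathway.effects)  # combined per-dimension effects

Even in this toy form, the sketch shows why the exhaustiveness of the process basis matters: any task carried out by a process that is not modelled contributes no effect to the combined measure, which silently overstates the quality of the resulting data.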

REFERENCES Abate, M. L., Diegert, K. V., and Allen, H. W. 1998. A Hierarchical Approach to Improving Data Quality. Data Quality Journal 4, 1 (september), 365–9. Abiteboul, S. 1997. Querying semi-structured data. In ICDT. 1–18. Abiteboul, S., Alexe, B., Benjelloun, O., Cautis, B., Fundulaki, I., Milo, T., and Sahuguet, A. 2004. An Electronic Patient Record ”on Steroids”: Distributed, Peer-to-Peer, Secure and Privacy-conscious. In VLDB. 1273–1276. Abiteboul, S., Manolescu, I., and Taropa, E. 2006. A Framework for Distributed XML Data Management. In EDBT, Y. E. Ioannidis, M. H. Scholl, J. W. Schmidt, F. Matthes, M. Hat-


zopoulos, K. B¨ohm, A. Kemper, T. Grust, and C. B¨ohm, Eds. Lecture Notes in Computer Science, vol. 3896. Springer, 1049–1058. Abiteboul, S., Quass, D., McHugh, J., Widom, J., and Wiener, J. L. 1997. The Lorel Query Language for Semistructured Data. International Journal on Digital Libraries 1, 1, 68–88. Achard, F., Vaysseix, G., and Barillot, E. 2001. XML Bioinformatics And Data Integration Bioinformatics. Bioinformatics Review 17, 2, 115–125. Amer-Yahia, S., Koudas, N., Marian, A., Srivastava, D., and Toman, D. 2005. Structure And Content Scoring For XML. In VLDB ’05: Proceedings Of The 31st International Conference On Very Large Data Bases. VLDB Endowment, Secaucus, NJ, USA, 361–372. Anderson, R. 1996. A Security Policy Model for Clinical Information Systems. BMA Report ISBN 0-7279-1048-5, British Medical Association. Arts, D., de Keizer, N., and de Jonge, E. 2001. Data Quality Measurement and Assurance in Medical Registries. In MEDINFO 2001: Proceedings of the 10th World Congress on Medical Informatics. IOS Press, IMIA, 404. London:. Aznauryan, N. A., Kuznetsov, S. D., Novak, L. G., and Grinev, M. N. 2006. SLS: A numbering scheme for large XML documents. Programing Computer Software 32, 1, 8–18. Ballou, D., Wang, R., Pazer, H., and Tayi, G. K. 1998. Modeling Information Manufacturing Systems to Determine Information Product Quality. Management Science 44, 4, 462–484. Ballou, D. P., Chengalur-Smith, I. N., and Wang, R. Y. 2006. Sample-Based Quality Estima- tion of Query Results in Relational Database Environments. IEEE Transactions on Knowledge and Data Engineering 18, 5, 639–650. Ballou, D. P. and Pazer, H. L. 2003. Modeling Completeness Versus Consistency Tradeoffs in Information Decision Contexts. IEEE Transactions on Knowledge and Data Engineering 15, 1, 240–243. Baru, C., Chu, V., Gupta, A., Ludascher,¨ B., Marciano, R., Papakonstantinou, Y., and Velikhov, P. 1999. XML-based information mediation for digital libraries. In DL ’99: Pro- ceedings of the fourth ACM conference on Digital libraries. ACM Press, New York, NY, USA, 214–215. Bilykh, I., Bychkov, Y., Dahlem, D., Jahnke, J. H., McCallum, G., Obry, C., Onabajo, A., and Kuziemsky, C. 2003. Can GRID Services Provide Answers to the Challenges of National Health Information Sharing? In CASCON ’03: Proceedings of the 2003 conference of the Centre for Advanced Studies on Collaborative research. IBM Press, 39–53. Blobel, B. G. 2006. Advanced EHR Architectures–Promises or Reality. Methods Inf Med. 45, 1, 95–101. Boniface, M. and Wilken, P. 2005. ARTEMIS: Towards a Secure Interoperability Infrastructure for Healthcare Information Systems. In HEALTHGRID. Bovee, M., Srivastava, R. P., and Mak, B. 2003. A Conceptual Framework and Belief-function Approach to Assessing Overall Information Quality. International Journal of Intelligent Sys- tems 18, 1 (January), 51–74. Boyer, S. and Alschuler, S. 2000. HL7 Patient Record Architecture Update. XMLEu- rope2000 http://www.gca.org/papers/xmleurope2000/papers, accessed 05-2005, 5. Burd, G. and Staken, K. 2005. Use a Native XML Database for Your XML Data. XML Journal May edition, http://xml.sys–con.com/read/90126.htm. Canning, C. SPRING 2004. The Relevance of the National Programme for Information Technol- ogy to Ophthalmology. FOCUS 29, 2. Catania, B., Ooi, B. C., Wang, W., and Wang, X. 2005. Lazy XML Updates: Laziness as a Virtue, of Update and Structural Join Efficiency. In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data. 
ACM Press, New York, NY, USA, 515–526. Ceri, S., Comai, S., Damiani, E., Fraternali, P., Paraboschi, S., and Tanca, L. 1999. XML- GL: A Graphical Language for Querying and Restructuring XML Documents. In Sistemi Evoluti per Basi di Dati. 151–165. CfH. 2005. Delivering IT for a modern, efficient NHS. Tech. rep., Connecting for Health. May.


CfH. accessed 2006. Health Care Records Whenver and Whereever You Need Them. Tech. rep., NHS Connecting For Health. accessed May. Chamberlin, D., Robie, J., and Florescu, D. 2001. Quilt: An XML Query Language for Heterogeneous Data Sources. Lecture Notes in Computer Science 1997, 1. Cho, S., Koudas, N., and Srivastava, D. 2006. Meta-data Indexing for XPath Location Steps. In SIGMOD Conference. 455–466. Christophilopoulos, E. 2005. ARTEMIS (IST-1-002103-STP): A Semantic Web Service-based P2P Infrastructure for the Interoperability of Medical Information Systems. InnoFire Medical Cooperation Network Newsletter. Codd, E. F. 1970. A Relational Model of Data for Large Shared Data Banks. Communications of ACM 13, 6, 377–387. Cohen, S., Kanza, Y., Kogan, Y. A., Nutt, W., Sagiv, Y., and Serebrenik, A. 1999. EquiX Easy Querying in XML Databases. In ACM International Workshop on the Web and Databases (WebDB’99). 43–48. Coombes, R. 2006. Patients Get Four Choices for NHS Treatments. BMJ 332, 7532, 8. Cross, M. 2006a. Keeping the NHS Electronic Spine on Track. BMJ 332, 7542, 656–658. Cross, M. 2006b. New Campaign to Encourage use of National IT Programme Begins. BMJ 332, 7534, 139–a–. Davidson, B., Lee, Y. W., and Wang, R. Y. 2003. ”Developing Data Production Maps: Meeting Patient Discharge Submission Requirements. International Journal of Healthcare Technology and Management, 6, .2, 87–103. Davidson, I., Grover, A., Satyanarayana, A., and Tayi, G. K. 2004. A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms. In KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, New York, NY, USA, 794–798. Dekeyser, S., Hidders, J., and Paredaens, J. 2004. A Transaction Model for XML Databases. World Wide Web 7, 1, 29–57. Department of Health, N. E. 1998. Information for Health: An In- formation Strategy for the Modern NHS 1998-2005, series A1103,. http://www.dh.gov.uk/PublicationsAndStatistics/Publications,Crown copyright Publica- tionsPolicyAndGuidance, 123 p. Deutsch, A., Fernandez, M., Florescu, D., Levy, A., and Suciu, D. 1998. XMLQL: A Query Language for XML. In WWW The Query Language Workshop (QL). Cambridge, MA. Dolin, R. Jan 1997-Feb 1997. Outcome Analysis: Considerations for an Electronic Health Record. MD Computing. 14, 1, 50–6. Ehikioya, S. A. 1999. A characterization of information quality using fuzzy logic. In Fuzzy In- formation Processing Society NAFIPS. 18th International Conference of the North American. 635–639. Fan, W. and Simeon,´ J. 2003. Integrity Constraints for XML. J. Comput. Syst. Sci. 66, 1, 254–291. Feinberg, G. 2004. Anatomy of a Native XML Database. In XML 2004 Conference And Exibition. SchemaSoft. Fiebig, T., Helmer, S., Kanne, C.-C., Moerkotte, G., Neumann, J., Schiele, R., and West- mann, T. 2002. Anatomy of a Native XML Base Management System. The VLDB Jour- nal 11, 4, 292–314. Finance, B., Medjdoub, S., and Pucheral, P. 2005. The Case For Access Control on XML Rela- tionships. In CIKM ’05: Proceedings of the 14th ACM international conference on Information and knowledge management. ACM Press, New York, NY, USA, 107–114. Gabillon, A. 2004. An Authorization Model for XML Databases. In SWS ’04: Proceedings of the 2004 workshop on Secure web service. ACM Press, New York, NY, USA, 16–28. Gendron, M., Shanks, G., and Alampi, J. 2004. Next Steps in Understanding Information Quality and Its Effect on Decision Making and Organizational Effectiveness. 
In 2004 IFIP In- ternational Conference on Decision Support Systems, G. Widmeyer, Ed. PRATO, TUSCANY.


Gendron, M. S. 2000. Data Quality in the Healthcare Industry. Ph.D. thesis, State University of New York at Albany.
Gertz, M. and Schmitt, I. 1998. Data Integration Techniques Based on Data Quality Aspects. In 3rd National Workshop on Federated Databases. Magdeburg, Germany: Shaker Verlag. ISBN: 3-8265-4522-2.
Glover, H. 2005. An Introduction to HL7 Version 3 and the NPfIT Message Implementation Manual. In HL7 UK Conference: HL7 and its Key Role in NPfIT and Existing Systems Integration.
Goldman, R., McHugh, J., and Widom, J. 1999. From Semistructured Data to XML: Migrating the Lore Data Model and Query Language. In Proceedings of the 2nd International Workshop on the Web and Databases (WebDB '99). Philadelphia, Pennsylvania.
Goldschmidt, P. G. 2005. HIT and MIS: Implications of Health Information Technology and Medical Information Systems. Communications of the ACM 48, 10 (October), 69–74.
Grimson, J., Grimson, W., and Hasselbring, W. 2000. The SI Challenge in Health Care. Communications of the ACM 43, 6, 48–55.
Grimson, J., Stephens, G., Jung, B., Grimson, W., Berry, D., and Pardon, S. 2001. Sharing Health-Care Records over the Internet. IEEE Internet Computing 5, 3, 49–58.
Guo, J., Takada, A., Tanaka, K., Sato, J., Suzuki, M., Suzuki, T., Nakashima, Y., Araki, K., and Yoshihara, H. 2004. The Development of MML (Medical Markup Language) Version 3.0 as a Medical Document Exchange Format for HL7 Messages. Journal of Medical Systems 28, 6 (December).
Haustein, M. P. and Härder, T. 2003. taDOM: A Tailored Synchronization Concept with Tunable Lock Granularity for the DOM API. In Advances in Databases and Information Systems, Lecture Notes in Computer Science, vol. 2798. Springer, Berlin/Heidelberg, 88–102. ISBN: 978-3-540-20047-5.
Haustein, M. P., Härder, T., Mathis, C., and Wagner, M. 2005. DeweyIDs – The Key to Fine-Grained Management of XML Documents. In 20th Brazilian Symposium on Databases. 85–99.
Helfert, M., Henry, P., Leist, S., and Zellner, G. 2005. Healthcare Performance Indicators: Preview of Frameworks and an Approach for Healthcare Process Development. In Soliman, K. S. (ed.): Information Management in Modern Enterprise: Issues & Solutions – Proceedings of the 2005 International Business Information Management Conference. Lisbon, Portugal, 371–378. ISBN: 0-9753393-3-8.
Hendy, J., Reeves, B. C., Fulop, N., Hutchings, A., and Masseria, C. 2005. Challenges to Implementing the National Programme for Information Technology (NPfIT): A Qualitative Study. BMJ 331, 7512, 331–336.
Hinrichs, H. 2000. CLIQ – Intelligent Data Quality Management. In Fourth International Baltic Workshop on Databases and Information Systems: Doctoral Consortium. Vilnius, Lithuania.
IBM. 2006. DB2 9 for Linux, UNIX and Windows: pureXML and Storage Compression. IBM Software website, http://www-306.ibm.com/software/data/db2/9/. Accessed July 2006.
Jagadish, H. V., Al-Khalifa, S., Chapman, A., Lakshmanan, L. V. S., Nierman, A., Paparizos, S., Patel, J. M., Srivastava, D., Wiwatwattana, N., Wu, Y., and Yu, C. 2002. TIMBER: A Native XML Database. The VLDB Journal 11, 4, 274–291.
Jerome, R. N., Giuse, N. B., Gish, K. W., Sathe, N. A., and Dietrich, M. S. 2001. Information Needs of Clinical Teams: Analysis of Questions Received by the Clinical Informatics Consult Service. Bulletin of the Medical Library Association 89, 2 (April), 177–185.
Jiang, H., Lu, H., Wang, W., and Yu, J. X. 2002. Path Materialization Revisited: An Efficient Storage Model for XML Data. In CRPITS '02: Proceedings of the Thirteenth Australasian Conference on Database Technologies. Australian Computer Society, Inc., Darlinghurst, Australia, 85–94.
Bergmann, J., Bott, O. J., Pretschner, D. P., and Haux, R. 2006. An e-Consent-Based Shared EHR System Architecture for Integrated Healthcare Networks. International Journal of Medical Informatics 2305, 7.
Jones, T. M. and Mead, C. N. 2005. The Architecture of Sharing. Healthcare Informatics online, 35–40.

Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., and Karambelkar, H. 2005. Bidirectional Expansion for Keyword Search on Graph Databases. In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, Secaucus, NJ, USA, 505–516.
Kader, Y. A. 2003. An Enhanced Data Model and Query Algebra for Partially Structured XML Database. Tech. Rep. CS-03-08, Department of Computer Science, University of Sheffield.
Kalra, D. and Ingram, D. 2006. Electronic Health Records. In Information Technology Solutions for Healthcare. Springer-Verlag London Ltd, 5–102.
Kha, D. D., Yoshikawa, M., and Uemura, S. 2001. An XML Indexing Structure with Relative Region Coordinate. In ICDE '01: Proceedings of the 17th International Conference on Data Engineering. IEEE Computer Society, Washington, DC, USA, 313.
Kim, S. W., Shin, P. S., Kim, Y. H., Lee, J., and Lim, H. C. 2002. A Data Model and Algebra for Document-Centric XML Document. In ICOIN '02: Revised Papers from the International Conference on Information Networking, Wireless Communications Technologies and Network Applications – Part II. Springer-Verlag, London, UK, 714–723.
Konopnicki, D. and Shmueli, O. 2005. Database-Inspired Search. In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, Secaucus, NJ, USA, 2–12.
Lechtenbörger, J. and Vossen, G. 2003. Multidimensional Normal Forms for Data Warehouse Design. Inf. Syst. 28, 5, 415–434.
Lederman, R. 2005. Managing Hospital Databases: Can Large Hospitals Really Protect Patient Data? Health Informatics Journal 13, 3, 201–210.
Lee, Y. W., Strong, D. M., Kahn, B. K., and Wang, R. Y. 2002. AIMQ: A Methodology for Information Quality Assessment. Inf. Manage. 40, 2, 133–146.
Orfanidis, L., Bamidis, P. D., and Eaglestone, B. 2004. Data Quality Issues in Electronic Health Records: An Adaptation Framework for the Greek Health System. Health Informatics Journal 10, 1, 23–36.
Li, Q. and Moon, B. 2001. Indexing and Querying XML Data for Regular Path Expressions. In VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 361–370.
MacKenzie, D. 1998. New Language Could Meld the Web Into a Seamless Database. Science 280, 5371, 1840–1841.
Mandl, K. and Porter, S. 1999. Data Quality and the Electronic Medical Record: A Role for Direct Parental Data Entry. In American Medical Informatics Association Three Year Cumulative Symposium Proceedings.
Mazeika, A. and Böhlen, M. H. 2006. Cleansing Databases of Misspelled Proper Nouns. In CleanDB.
Midgley, A. K. 2005. "Choose and Book" Does Not Solve Any Problems. BMJ 331, 7511, 294.
Milano, D., Scannapieco, M., and Catarci, T. 2005. Using Ontologies for XML Data Cleaning. In OTM Workshop on Inter-organizational Systems and Interoperability of Enterprise Software and Applications (MIOS+INTEROP).
Moogan, P. 2006. The Clinical Development of the NHS Care Record Service. Connecting for Health, http://www.connectingforhealth.nhs.uk/crbd/docs/. Accessed 05/2006.
Moro, M. M., Vagena, Z., and Tsotras, V. J. 2005. Tree-Pattern Queries on a Lightweight XML Processor. In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, Secaucus, NJ, USA, 205–216.
Motro, A. and Rakov, I. 1997. Not All Answers Are Equally Good: Estimating the Quality of Database Answers. 1–21.
N3. 2006. Delivery Update. N3 Bulletin News 17, 1.
Naumann, F. and Rolker, C. 2000. Assessment Methods for Information Quality Criteria. In IQ. 148–162.
NCFH. 2005. Communications Toolkit 4 – PACS. Tech. Rep. 3234, NHS Connecting for Health, http://www.connectingforhealth.nhs.uk/publications/toolkitaugust05/. August.

NCFH. 2006a. Electronic Prescription Service Passes One Million Mark. Tech. rep., NHS Connecting for Health. May. Accessed May 2006.
NCFH. 2006b. Strategic Health Authority Communications Toolkit – Electronic Prescription Service (EPS). Tech. Rep. 2091, NHS Connecting for Health, http://www.connectingforhealth.nhs.uk/publications/toolkitaugust05/. January.
Oliveira, P., Rodrigues, F., and Henriques, P. 2005. A Formal Definition of Data Quality Problems. In IQ. MIT.
Olson, M. 2000. 4Suite: An Open-Source Platform for XML and RDF Processing. http://4suite.org/index.xhtml. Accessed July 2006.
Parssian, A., Sarkar, S., and Jacob, V. S. 2004. Assessing Data Quality for Information Products: Impact of Selection, Projection, and Cartesian Product. Management Science 50, 7, 967–982.
Parssian, A. and Yeoh, W. 2006. QSQL: An Extension to SQL for Queries on Information Quality. In 1st Australasian Workshop on Information Quality (AusIQ 2006). University of South Australia, Adelaide, Australia.
Pedersen, T. B. and Jensen, C. S. 1998. Research Issues in Clinical Data Warehousing. In Proceedings of the Tenth International Conference on Statistical and Scientific Database Management. IEEE Computer Society, 43–52.
Pehcevski, J., Thom, J. A., and Vercoustre, A.-M. 2005. Hybrid XML Retrieval: Combining Information Retrieval and a Native XML Database. Inf. Retr. 8, 4, 571–600.
Pipino, L. L., Lee, Y. W., and Wang, R. Y. 2002. Data Quality Assessment. Commun. ACM 45, 4, 211–218.
Pohl, J. 2000. Transition from Data to Information. Tech. rep., Collaborative Agent Design Research Center. November.
Price, R. J. and Shanks, G. 2004. A Semiotic Information Quality Framework. In IFIP WG8.3 International Conference on Decision Support Systems (DSS2004). 658–672.
Kamadjeu, R., Tapang, E., and Moluh, R. 2005. Designing and Implementing an Electronic Health Record System in Primary Care Practice in Sub-Saharan Africa: A Case Study from Cameroon. Informatics in Primary Care 13, 3 (November), 179–186.
Rector, A., Rogers, J., Taweel, A., Ingram, D., Kalra, D., Milan, J., Singleton, P., Gaizauskas, R., Hepple, M., Scott, D., and Power, R. 2003. CLEF: Joining up Healthcare with Clinical and Post-Genomic Research. CLEF Industrial Forum, CLEF, Sheffield.
Riain, C. O. and Helfert, M. 2005. An Evaluation of Data Quality Related Problem Patterns in Healthcare Information Systems. In IADIS Virtual Multi Conference on Computer Science and Information Systems. 189–193.
Wang, R., Strong, D., and Guarascio, L. 1994. An Empirical Investigation of Data Quality Dimensions: A Data Consumer's Perspective. Tech. rep., MIT TDQM Research Program, 50 Memorial Drive, Cambridge, MA 02139.
Rys, M., Chamberlin, D., and Florescu, D. 2005. XML and Relational Database Management Systems: The Inside Story. In SIGMOD '05: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. ACM Press, New York, NY, USA, 945–947.
Sacks-Davis, R., Dao, T., Thom, J. A., and Zobel, J. 1997. Indexing Documents for Queries on Structure, Content and Attributes. In Proceedings of the International Symposium on Digital Media Information Base (DMIB). 236–245.
Sattler, K.-U., Geist, I., and Schallehn, E. 2005. Concept-Based Querying in Mediator Systems. The VLDB Journal 14, 1, 97–111.
Scannapieco, M. and Batini, C. 2004. Completeness in the Relational Model: A Comprehensive Framework. In 9th International Conference on Information Quality.
Scannapieco, M., Missier, P., and Batini, C. 2005. Data Quality at a Glance. Datenbank-Spektrum 14, 1–23.
Schweiger, R., Hoelzer, S., Altmann, U., Rieger, J., and Dudeck, J. 2002. Plug-and-Play XML: A Health Care Perspective. Journal of the American Medical Informatics Association 9, 1, 37–48.

Shankaranarayan, G., Ziad, M., and Wang, R. Y. 2003. Managing Data Quality in Dynamic Decision Environments: An Information Product Approach. Journal of Database Management 14, 4, 14–32.
Shanmugasundaram, J., Tufte, K., Zhang, C., He, G., DeWitt, D. J., and Naughton, J. F. 1999. Relational Databases for Querying XML Documents: Limitations and Opportunities. In VLDB. Morgan Kaufmann, 302–314.
Sherman, G. 2001. Toward Electronic Health Records. Office of Health and the Information Highway, Health Canada. http://www.hc-sc.gc.ca. Accessed 2005.
Shui, W., Lam, F., Fisher, D. K., and Wong, R. K. 2005. Querying and Maintaining Ordered XML Data Using Relational Databases. In Sixteenth Australasian Database Conference (ADC2005), H. E. Williams and G. Dobbie, Eds. CRPIT, vol. 39. ACS, Newcastle, Australia, 85–94.
Smith, R. 1996. What Clinical Information Do Doctors Need? British Medical Journal 313, 7064 (October), 1062–1068.
Stephens, R. T. 2003. Metadata and XML: Will History Repeat Itself? Columns 7553, RTodd.com, http://www.dmreview.com/article_sub.cfm?articleId=7553. October. Visited 20/08/2006.
Strong, D. M., Lee, Y. W., and Wang, R. Y. 1997. Data Quality in Context. Commun. ACM 40, 5, 103–110.
Stvilia, B., Gasser, L., Twidale, M. B., and Smith, L. C. A Framework for Information Quality Assessment. http://www.isrl.uiuc.edu/~gasser/papers/stvilia_IQFramework.pdf.
Sugden, B., Wilson, R., and Cornford, J. 2006. Re-configuring the Health Supplier Market: Changing Relationships in the Primary Care Supplier Market in England. Technical Report Series CS-TR-951, University of Newcastle upon Tyne, Computing Science, Newcastle upon Tyne.
Vagena, Z., Moro, M. M., and Tsotras, V. J. 2004. Efficient Processing of XML Containment Queries Using Partition-Based Schemes. In IDEAS. 161–170.
Walsh, S. H. 2004. The Clinician's Perspective on Electronic Health Records and How They Can Affect Patient Care. BMJ 328, 7449, 1184–1187.
Wand, Y. and Wang, R. Y. 1996. Anchoring Data Quality Dimensions in Ontological Foundations. Communications of the ACM 39, 11, 86–95.
Wang, Y. R. and Madnick, S. E. 1989. The Inter-Database Instance Identification Problem in Integrating Autonomous Systems. In Proceedings of the Fifth International Conference on Data Engineering, February 6–10, 1989, Los Angeles, California, USA. IEEE Computer Society, 46–55.
Widom, J. 1999. Data Management for XML: Research Directions. IEEE Data Engineering Bulletin, Special Issue on XML 22, 3 (September), 44–52.
Widom, J. 2005. Trio: A System for Integrated Management of Data, Accuracy, and Lineage. In Conference on Innovative Data Systems Research. 262–276.
Zue, V. 1999. Talking with Your Computer. Scientific American, 40.
