An Implementation of Xml Data Integration
Total Page:16
File Type:pdf, Size:1020Kb
AN IMPLEMENTATION OF XML DATA INTEGRATION Weidong Pan, Jixue Liu and Jiashen Tian School of Computer and Information Science, University of South Australia Mawson Lakes, Adelaide, SA 5095, Australia Keywords: XML, data Integration, enterprise infiormation sytsem, DTD, document, data transformation. Abstract: Data integration is essential for building modern enterprise information systems. This paper investigates the implementation of XML data integration through transforming XML data from different data sources into a common global schema. Following the work our research group has done earlier, the paper is focused on the implementation of XML data transformation. First, the proposed methodology to realize XML data integration is sketched. Then, the representation of a DTD and document and other relevant concepts for transforming XML data are presented. The transforming operations defined by a set of operators are outlined focusing on the required functionality for data integration. Building upon these, the implementation of the data transformation operations is investigated. The current implementation is reported with a simplified example illustrating how the methodology can be applied for practical enterprise information integration. 1 INTRODUCTION query processing, users retrieve data from a document and then build another document using the retrieved Data integration is essential for building modern en- data. W3C’s XQuery (Boag et al., 2007) and XSLT terprise information systems since data is normally (Kay, 2007) are two typical query languages. Data distributed on different platforms. In order to inte- conversion can also be implemented by executing a grate these data together to provide a unique view sequence of data transformation operations defined by for users, there is a need for technologies to imple- a set of operators. Data transformation operators have ment the conversion of data from different sources to been proposed in a number of previous work, some a common unified schema. Since XML has been in- are similar to XML algebra expressions derived from creasingly used for data representation and exchange relational data model (Zamboulis, 2004). In general, across the Internet, it is of specific importance to in- they just provide operators for XML document update tegrate XML data from multiple data sources into a and have not provided a systematic set of transforma- global schema with a unique structure. This paper tion operators with a clear semantic. Although a few investigates the implementation of XML data inte- has considered DTD transformation when document gration, more specifically, it aims to realize the inte- is being transformed (Erwig, 2003), none has covered gration through converting XML data from different the full DTD syntax. In addition, to the best of our schemas into a unified global schema. knowledge, the implementation issue of the operators A number of earlier research has devoted to XML has not been well addressed and accordingly their per- data integration. The techniques proposed can be formance has not been deeply studied from the view- summarized into two main procedures, scheme map- points of practical applications. ping and data conversion. The former aims to de- To address these problems, our research group has velop techniques to map XML data from different proposed a set of XML data transformation operators sources to a common global schema, through which to realize XML data integration (Liu et al., 2006). XML data distributed at different sources can be rep- Compared with other data transformation operators, resented in a unique view; the latter aims to, based our operators include two types of transformation: 1) on the mapping between an original and a destina- DTD transformation, and 2) document transforma- tion schema, develop techniques to convert XML data tion. This can ensure the output documents of the from different sources into a destination platform that operators always conform to the output DTDs, which conforms to an integrated global schema. Data con- is critical to the semantics of output data. Our op- version can be done through query processing or ex- erators are defined with the consideration of the full ecuting a sequence of transformation operations. By syntax of DTDs and documents, especially the nested 111 Pan W., Liu J. and Tian J. (2008). AN IMPLEMENTATION OF XML DATA INTEGRATION. In Proceedings of the Tenth International Conference on Enterprise Information Systems - DISI, pages 111-116 DOI: 10.5220/0001677401110116 Copyright c SciTePress ICEIS 2008 - International Conference on Enterprise Information Systems brackets in element type definitions, multiplicity con- Build a Transform the Restore the tuple expression DTD four-tuple transformed DTD straints attached to those nested brackets, and disjunc- by executing a tuple file expression file tion. They are complete for the transformation of from the series of expression DTD file transformation into a DTD DTDs and documents. In addition, they are semanti- operations file cally traceable when being used to realize XML data conforms to supplies conforms to information integration; each has a particular intention and creates Transform the Build Restore the document trees a particular semantic effect towards the overall goal. XML XML document by executing a transformed file file Up to this point, we have completed a preliminary im- trees from series of document plementation of these operators using Java and JSP the XML transformation trees into an Source file operations XML file Destination techniques. This paper is to report our recent research in XML data integration, focusing on the implemen- tation of the operators, as the performance analysis Figure 1: Framework for XML data transformation. for the operators has been presented in another paper (Tian et al., 2008). fined by a set of operators. After the transformation, The paper is organized as follows. In Section 2, an the new tuple expression is restored into a new DTD overview of the proposed methodology is presented. file, achieving a conversion from the previous DTD Section 3 illustrates the representation of XML data, to a new one. The transformation of an XML doc- providing some essential concepts and terms. Section ument from a particular source to a targeted schema 4 describes the data transformation operations defined has a similar process except that it requires the sup- by the data transformation operators. Section 5 looks port from the transformation of its conformed DTD into the implementation of the operators. Section 6 re- to ensure the output document has a compatible se- ports the current implementation and presents a sim- mantic structure. plified example to illustrate how the approach is used In the sections that follow, we will elaborate the in practical enterprise information integration appli- methodology outlined here. cations. The final section summarizes the paper and indicates the further work. 3 XML DATA REPRESENTATION 2 OVERALL DESCRIPTION OF 3.1 DTD Representation OUR METHODOLOGY TO REALIZE XML DATA A DTD defines the structure of its corresponding doc- uments by a list of element type declarations. In the INTEGRATION proposed approach, this information is represented by a four-tuple D = (EN;G;b;r), where EN is the set of As stated in the previous section, what we aim to the element names in the DTD; G is the set of type achieve is XML data integration by converting XML constructors which define the elements; b is the set of data to a unified schema from different sources with- the functions connecting an element with its type con- out losing valuable semantic information. It is com- structor; and r is the root of the DTD. Figure 2 shows plicated by the flexible XML syntax that allows differ- a simplified DTD example and its corresponding tu- ent ways to represent the data of same semantics. We ple expression. It is extracted from a DTD used in a realize XML data conversion using a set of transfor- publishing company for publication information. mation operators that supply enough transformation Note, to save space, the headers and the string type power and preserve the data semantics to a certain de- #PCDATA in the DTD are intentionally left out. An gree. element in G can be a single element, or a component The proposed methodology to achieve XML data in- including multiple elements in conjunctive or disjunc- tegration is through data transformation. All the tive sequences. For example, if g 2 G, then g can be in DTDs and the conforming documents from different the following forms: g = e (e 2 EN), g = Str (Str is a data sources are converted to a unified XML format symbol denoting #PCDATA), or g = g1;g2 or g1jg2 or c that conforms to the integrated global schema. As [g] , where g1 and g2 are recursively defined, and c 2 illustrated in Figure 1, in the conversion of a DTD f‘?’, ‘1’, ‘+’, ‘∗’g. The multiplicity c defines the mul- from a particular source, a four-tuple expression is tiplicity constraints of g. Transforming a DTD also built based on the given DTD file. Then, the tuple ex- involves the transformation of the multiplicities. For pression is transformed to a new one through execut- handling multiplicities, ‘?’, ‘1’, ‘+’ and ‘∗’ are rep- ing a sequence of data transformation operations, de- resented by the intervals [0;1], [1;1], [1;n] and [0;n], 112 AN IMPLEMENTATION OF XML DATA INTEGRATION <!ELEMENT root (name,publ)*> <!ELEMENT publ (year,(book|article)+)*> constraints of a structure can be checked. <!ELEMENT book (title,ISBN, price)>