AN IMPLEMENTATION OF XML INTEGRATION

Weidong Pan, Jixue Liu and Jiashen Tian School of Computer and Information Science, University of South Australia Mawson Lakes, Adelaide, SA 5095, Australia

Keywords: XML, , enterprise infiormation sytsem, DTD, document, .

Abstract: Data integration is essential for building modern enterprise information systems. This paper investigates the implementation of XML data integration through transforming XML data from different data sources into a common global schema. Following the work our research group has done earlier, the paper is focused on the implementation of XML data transformation. First, the proposed methodology to realize XML data integration is sketched. Then, the representation of a DTD and document and other relevant concepts for transforming XML data are presented. The transforming operations defined by a set of operators are outlined focusing on the required functionality for data integration. Building upon these, the implementation of the data transformation operations is investigated. The current implementation is reported with a simplified example illustrating how the methodology can be applied for practical enterprise information integration.

1 INTRODUCTION query processing, users retrieve data from a document and then build another document using the retrieved Data integration is essential for building modern en- data. W3C’s XQuery (Boag et al., 2007) and XSLT terprise information systems since data is normally (Kay, 2007) are two typical query languages. Data distributed on different platforms. In order to inte- conversion can also be implemented by executing a grate these data together to provide a unique view sequence of data transformation operations defined by for users, there is a need for technologies to imple- a set of operators. Data transformation operators have ment the conversion of data from different sources to been proposed in a number of previous work, some a common unified schema. Since XML has been in- are similar to XML algebra expressions derived from creasingly used for data representation and exchange relational data model (Zamboulis, 2004). In general, across the Internet, it is of specific importance to in- they just provide operators for XML document update tegrate XML data from multiple data sources into a and have not provided a systematic set of transforma- global schema with a unique structure. This paper tion operators with a clear semantic. Although a few investigates the implementation of XML data inte- has considered DTD transformation when document gration, more specifically, it aims to realize the inte- is being transformed (Erwig, 2003), none has covered gration through converting XML data from different the full DTD syntax. In addition, to the best of our schemas into a unified global schema. knowledge, the implementation issue of the operators A number of earlier research has devoted to XML has not been well addressed and accordingly their per- data integration. The techniques proposed can be formance has not been deeply studied from the view- summarized into two main procedures, scheme map- points of practical applications. ping and data conversion. The former aims to de- To address these problems, our research group has velop techniques to map XML data from different proposed a set of XML data transformation operators sources to a common global schema, through which to realize XML data integration (Liu et al., 2006). XML data distributed at different sources can be rep- Compared with other data transformation operators, resented in a unique view; the latter aims to, based our operators include two types of transformation: 1) on the mapping between an original and a destina- DTD transformation, and 2) document transforma- tion schema, develop techniques to convert XML data tion. This can ensure the output documents of the from different sources into a destination platform that operators always conform to the output DTDs, which conforms to an integrated global schema. Data con- is critical to the semantics of output data. Our op- version can be done through query processing or ex- erators are defined with the consideration of the full ecuting a sequence of transformation operations. By syntax of DTDs and documents, especially the nested

111 Pan W., Liu J. and Tian J. (2008). AN IMPLEMENTATION OF XML DATA INTEGRATION. In Proceedings of the Tenth International Conference on Enterprise Information Systems - DISI, pages 111-116 DOI: 10.5220/0001677401110116 Copyright c SciTePress ICEIS 2008 - International Conference on Enterprise Information Systems

brackets in element type definitions, multiplicity con- Build a Transform the Restore the tuple expression DTD four-tuple transformed DTD straints attached to those nested brackets, and disjunc- by executing a tuple file expression file tion. They are complete for the transformation of from the series of expression DTD file transformation into a DTD DTDs and documents. In addition, they are semanti- operations file cally traceable when being used to realize XML data conforms to supplies conforms to information integration; each has a particular intention and creates Transform the Build Restore the document trees a particular semantic effect towards the overall goal. XML XML document by executing a transformed file file Up to this point, we have completed a preliminary im- trees from series of document plementation of these operators using Java and JSP the XML transformation trees into an Source file operations XML file Destination techniques. This paper is to report our recent research in XML data integration, focusing on the implemen- tation of the operators, as the performance analysis Figure 1: Framework for XML data transformation. for the operators has been presented in another paper (Tian et al., 2008). fined by a set of operators. After the transformation, The paper is organized as follows. In Section 2, an the new tuple expression is restored into a new DTD overview of the proposed methodology is presented. file, achieving a conversion from the previous DTD Section 3 illustrates the representation of XML data, to a new one. The transformation of an XML doc- providing some essential concepts and terms. Section ument from a particular source to a targeted schema 4 describes the data transformation operations defined has a similar process except that it requires the sup- by the data transformation operators. Section 5 looks port from the transformation of its conformed DTD into the implementation of the operators. Section 6 re- to ensure the output document has a compatible se- ports the current implementation and presents a sim- mantic structure. plified example to illustrate how the approach is used In the sections that follow, we will elaborate the in practical enterprise information integration appli- methodology outlined here. cations. The final section summarizes the paper and indicates the further work. 3 XML DATA REPRESENTATION

2 OVERALL DESCRIPTION OF 3.1 DTD Representation OUR METHODOLOGY TO REALIZE XML DATA A DTD defines the structure of its corresponding doc- uments by a list of element type declarations. In the INTEGRATION proposed approach, this information is represented by a four-tuple D = (EN,G,β,ρ), where EN is the set of As stated in the previous section, what we aim to the element names in the DTD; G is the set of type achieve is XML data integration by converting XML constructors which define the elements; β is the set of data to a unified schema from different sources with- the functions connecting an element with its type con- out losing valuable semantic information. It is com- structor; and ρ is the root of the DTD. Figure 2 shows plicated by the flexible XML syntax that allows differ- a simplified DTD example and its corresponding tu- ent ways to represent the data of same semantics. We ple expression. It is extracted from a DTD used in a realize XML data conversion using a set of transfor- publishing company for publication information. mation operators that supply enough transformation Note, to save space, the headers and the string type power and preserve the data semantics to a certain de- #PCDATA in the DTD are intentionally left out. An gree. element in G can be a single element, or a component The proposed methodology to achieve XML data in- including multiple elements in conjunctive or disjunc- tegration is through data transformation. All the tive sequences. For example, if g ∈ G, then g can be in DTDs and the conforming documents from different the following forms: g = e (e ∈ EN), g = Str (Str is a data sources are converted to a unified XML format symbol denoting #PCDATA), or g = g1,g2 or g1|g2 or c that conforms to the integrated global schema. As [g] , where g1 and g2 are recursively defined, and c ∈ illustrated in Figure 1, in the conversion of a DTD {‘?’, ‘1’, ‘+’, ‘∗’}. The multiplicity c defines the mul- from a particular source, a four-tuple expression is tiplicity constraints of g. Transforming a DTD also built based on the given DTD file. Then, the tuple ex- involves the transformation of the multiplicities. For pression is transformed to a new one through execut- handling multiplicities, ‘?’, ‘1’, ‘+’ and ‘∗’ are rep- ing a sequence of data transformation operations, de- resented by the intervals [0,1], [1,1], [1,n] and [0,n],

112 AN IMPLEMENTATION OF XML DATA INTEGRATION

constraints of a structure can be checked. M. Fox (a) A Simplified DTD example 2006 EN = {root,name, publ,year,book,article,title,ISBN, ABC-345 50 DEF-302 120 price, journal,issue, page} 2005 G = {Str,[name, publ]∗,[year,[book|article]+]∗,[title, XYZ-145 180 ?

FGHJ1255-58 ISBN, price],[title, journal,issue , page]}
β(root) = [name, publ]∗,β(publ) = [year,[book|article]+]∗, 2004 ,
XXXJ220-24
β(book) = [title,ISBN, price]
T8J128-15 β(article) = [title, journal,issue?, page],
β(year) = β(name) = β(title) = β(ISBN) = β(price) = K. Page β( journal) = β(issue) = β(page) = Str. (b) The tuple expression of the DTD example 2006 YYY-452 200 2004 Figure 2: A DTD example and its tuple expression. ZZZ-223 220 2003
GGJ2 75-80
TTTT-243180 respectively. The operations of two multiplicities c1
and c2 are conducted relying on the operations of their intervals. Thus, c1 ⊕ c2(= c1c2) has the semantics of ( a ) A simplified document example the multiplicity whose interval encloses the intervals T = (root: T1 T2) T1 = ((name: “M. Fox”) (publ: (T3 T4 T5))), T2 = ((name: “K. Page”) ··· ) of c1 and c2, e.g., +? = ∗ and 1? =?. Similarly, c1 c2 T3 = ((year: “2006”) (book: ···)) is the multiplicity whose interval equals to the inter- T4 = ((year: “2005”) ···), T5 = ((year: “2004”) ···) ······ val of c1 taking that of c2 and adding that of 1, e.g., ( b ) The document trees for the example = ∗ + = ? ? 1 and ?. Figure 3: A document example and its tree expression. 3.2 Document Representation

A well-formed XML document is a textual rep- 4 XML DATA resentation of data and is composed of elements with hierarchically nested structures as defined in TRANSFORMATION its corresponding DTD. In our methodology, a doc- OPERATIONS ument is represented by a series of trees T = (e : val T1 T2 ··· Tm), where T1 T2 ··· Tm are recursively In the proposed methodology, the transformation of defined child trees of T, e : val is the root of T and XML data is implemented through executing a series the parent of T1 T2 ··· Tm, and e ∈ EN, val is a text of data transformation operations against the tuple ex- string if the root node contains a value, otherwise, it pression and the document trees. The data transfor- is omitted. Figure 3 is an example showing an XML mation operations are defined by a set of operators. document represented by such trees. Because of the syntax differences between DTD and document, each operator has defined two parts: one 3.3 Hedge and Hedge Conformation for transforming the DTD and the other for transform- ing its conforming documents. The formal definition A hedge H is a sequence of trees under one of each operator has been presented in (Liu et al., node in a document. For instance, in the doc- 2006). This section will provide an overall descrip- ument shown in Figure 3, T, T1T2, T3T4T5, and tion of the data transformation operations defined by (title:ABC)(ISBN:-345)(Price:50) are four hedges. A those operators. hedge may contain smaller hedges. Here our in- The DTD transformation operation that each opera- terest is in which child trees of a node belong to tor performs is listed in Table 1. Because the doc- a hedge conforming to a specific type construc- ument transformation operation of each operator is tor. For example, let g+ = [A,[B,C?]∗,D?]+, β(e) = to convert a given document into one with a struc- g+ and T=(e((A)(B)(B)(C)(A)(B)(C)(D))), then hedge ture that conforms to the transformed DTD, the ta- (A) conforms to [A], hedge (B)(B)(C) conforms to ble also reveals the information for what transfor- [B,C?]∗, hedge (A)(B)(B)(C) conforms to g, and hedge mation operation will be carried out on the con- (A)(B)(B)(C)(A)(B)(C)(D) conforms to g+. A hedge H forming document by each operator. For instance, g + conforms to g is denoted by H . unnest operator converts the DTD β(e) = [g1,g2 ] + By using the hedge notation, the child trees of a into a new DTD β1(e) = [g1,g2] . The operator node can be logically split and thus, the cardinality also transforms the document by converting the hedge

113 ICEIS 2008 - International Conference on Enterprise Information Systems

0 Table 1: The DTD transformation operation of each opera- an equivalent schema βs(e). By this process, βs(e) 0 tor. is converted to βs(e), which agrees with the structure −1 Operator DTD transformation operations defined in βt (e). Note all the operations in Φ are c1 cn c c1 ? cn ? c ? opin opin([g1 ,··· ,gn ] ) −→ [g1 ,··· ,gn ] , respectively the inverse operations in Φ, e.g., if min where c ⊇? is executed in Φ, then the corresponding operation is c1 cn c c1 ? cn ? c? opout opout([g1 ,··· ,gn ] ) −→ [g1 ,··· ,gn ] , −1 where i = 1,··· ,n(ci ⊇?) mout in Φ . 0 c1 ci cn c c1 min(cc,i,[g1 |···|gi |···|gn ] ) −→ g = [g1 0 0 ci cn c cc min |···|gi |···|gn ] , where c ⊇ cc and if i = 0: 0 ∀ j = 1,··· ,n(c j = c j cc); else i ∈ [1,··· ,n]: 0 0 ci = cicc ∧ ∀ j 6= i(c j = c j ) c0 5 IMPLEMENTATION OF THE c1 ci cn c 1 mout(cc,i,[g1 |···|gi |···|gn ] ) −→ g = [g1 | 0 0 ci cn ccc ···|gi |···|gn ] , where if i = 0 : j = 1,··· ,n XML DATA 0 mout (c j ⊇ cc ∧ c j = c j cc); else i ∈ [1,··· ,n]: 0 0 ci ⊇ cc ∧ ci = ci cc and ∀ j 6= i(c j = c j ) TRANSFORMATION rename rename(e,e1) −→ e1, where e1 6∈ parent(e) shi ft Let g = g1,g2, shift(g1,g2) −→ g = (g2,g1) OPERATIONS group group(g) −→ [g]1 ungroup ungroup([g]1) −→ g expand Let ge ∈ g ∧ e 6∈ parent(ge), then 5.1 Main Challenges expand(ge,e) −→ g = e ∧ β(e) = ge c c collapse Let gp = e ∧ β(e) = [ge] 1 and ge ∩ parent(e) = φ cc then collapse(e) −→ gp = [ge] 1 c2 c There are a number of challenges in the development nest Let g = [g1,g2 ] ∧ c ⊇ +, c2 + c nest(g2) −→ g = [g1,g2 ] of the XML data transformation operators when is- c2 c unnest Let g = [g1,g2 ] ∧ c2 = +|∗, c2 + c+ sues such as information preservation, nested brack- unnest(g2) −→ g = [g1,g2 ] 1 1 c f act Let g = [[g0,g1] |···|[g0,gh] ] , then ets, and mutually nested conjunction and disjunction 1 1 1 c fact(g0) −→ g = [g0,[[g1] |···|[gh] ] ] 1 1 1 c are considered. Obviously these issues must be ade- de f act Let g = [g0,[[g1] |···|[gh] ] ] , then 1 1 c defact(g0) −→ g = [[g0,g1] |···|[g0,gh] ] quately addressed for practical enterprise information c c Let g = e 1 ,··· ,ecm and g = e¯¯1 ,··· ,e¯c¯m , 1 1 m 2 1 m integration applications. merg where ∀ i = 1,··· ,m(β(ei) = β(e¯i)), c1 c¯1 cm c¯m + then merg(g1,g2) −→ [e1 × ··· ×em ] It is essential to preserve the semantics of data c c1 cm c Let g = [e1 ,··· ,em ] , c = ∗|+, then split(g) −→ c¯1 c¯2 c¯h when carrying out a data transformation. The rela- g1 ,g2 ,··· ,gh , where h ≥ 2 andc ¯1 = c +,c ¯2 c1 cm split = ··· = c¯h−1 =?,c¯h = ∗ and g1 = [e11,··· ,em1], tionships between data elements must be preserved. g = [ec1 ,··· ,ecm ], ···, g = [ec1 ,··· ,ecm ] and 2 12 m2 h 1h mh For example, before and after a data transformation, ∀ i = 1,··· ,m(ei1 = ei ∧ ∀ j = 2,··· ,h(β(ei j ) = β(ei))) and all ei j (i = 1,··· ,m; j = 2,···h) are it is desired that the title and ISBN of a book are put distinct and are not in parent(g) c1 cm cg c¯1 c¯w c f under the same element. This derives a requirement Let g = [e1 ,··· ,em ] , f = [e¯1 ,··· ,e¯w ] , ∀ ei ∈ g(β(ei) = Str), ∀ e¯j ∈ f (β(e¯j ) = Str), that the data transformation operations must be able pro j c ,··· ,c ,c ,··· ,c ∈ [ , ] c ⊇c ∀ ec¯j ∈ 1 m ¯1 ¯w 1 ? , f g and ¯j to be reversed. Although our operators have been de- ci f ( j = 1,··· ,w)( if ∃ ei ∈ g so that ei = e¯j then c¯j ⊇ci, elsec ¯j =?), then proj(g, f ) −→ f fined by carefully considering this requirement, it is very complicated to realize the reverse due to the con- g g sideration of the full XML syntax. As an example, let Hg1 H 2 ··· H 2 which conforms to β(e) into a hedge 1 m β (e) = [g?|g ]∗, if we do min(?,0, [g?|g ]∗), we’ll get g1 g2 g1 g2 1 1 2 1 2 H H1 ··· H Hm that conforms to β1(e). ?? ? ∗ ? ? ? + β2(e) = [g1 |g2] , that is [g1|g2] because ?? =? and The operations that transform a DTD and its conform- ∗ ? = +. Clearly we cannot go back to the origi- ing documents into a target schema from its original ? ? + nal β1(e) by doing mout(?,0,[g1|g2] ) because in that schema progress in two recognized phases. In the first ∗ case, we attain β3(e) = [g1|g2] . A solution to such phase, the XML data is converted into a flat form; problems is not to carry out the multiplicity opera- and in the second phase, the flat form is converted tion immediately but just keep the information. That into the target schema. A flat form is defined by means we should store the operation ?? attached to g c1 cn c 1 β(e) = [g |···|gn ] , where n ≥ 1 and ∀i = 1,··· ,n(ci = 1 in the β2(e) in a particular data structure, rather than ci1 cimi 1|? ∧ gi = [e ,··· ,e ]),∀ j = 1,··· ,mi(ei j ∈ EN ∧ ci j = i1 imi getting their operation result immediately. This leads 1|? ∧ β(ei j) = Str). Obviously such a flat form has the to the data structure and the algorithm implementing minimum layers of brackets to keep the semantics of the operations more complicated. disjunctions and conjunctions. It can help to simplify the comparison of two data schemas. Here the idea is 5.2 Overall Implementation to respectively convert the original schema βs(e) and the target schema βt (e) into the flat form Bs(e) and Framework Bt (e), and then convert Bs(e) into Bt (e) through a se- ries of projection operations. Suppose the operation The overall framework for implementing a data trans- sequence from βt (e) to Bt (e) is Φ, then carry out the formation operator can be briefly described as fol- −1 revered sequence Φ against Bt (e) to convert it into lows. Based on the element e to be operated, the β(e)

114 AN IMPLEMENTATION OF XML DATA INTEGRATION

is located from the tuple expression built from a given 6 THE OPERATOR PACKAGE DTD, and then its output part is updated. This real- AND ITS APPLICATIONS FOR izes the conversion from β(e) to β1(e), the latter will determine the transformed DTD of the operator. The ENTERPRISE INFORMATION β(e) is also provided for the document transforma- INTEGRATION tion. The node e in the document is found through parsing the document tree, then all the hedges con- To apply the proposed methodology to realize enter- forming to β(e) under the node are converted into prise information integration, we have developed a the new ones conforming to β1(e). The conversion Complete XML Data Transformation System. Fig- requires considerable complicated operations against ure 4 shows its entry interface. From the interface, the hedges, including comparison, group, modifica- users can, across the Internet, perform various data tion, re-structure, etc. transformation to realize enterprise information inte- gration. They can carry out the data transformation 5.3 The Implementation of Document in a step-by-step mode or ask the system to automati- cally accomplish a series of transformation operations Transformation for them. To transform a document from its original schema into a targeted one by using our operators, five tasks must be performed: 1) construct document trees from the given XML file; 2) locate the nodes to be operated in the document trees; 3) identify the hedges requir- ing conversion under each of the nodes; 4) transform the hedges into a new format according to the opera- tor; and 5) store the modified document trees into an XML file. Some of the tasks can be accomplished with the help of the XML DOM parser (Maruyama, 2002). It is invoked to traverse the document tree, and when the node e to be operated is found, it transmits all its child trees to a procedure. The latter identifies the hedges that conform to the β(e) provided by the DTD Figure 4: The entry interface of the system. transformation. If such a hedge is identified, then task 4 will start and the hedge will be converted according In the following, we will briefly illustrate the appli- to the definition of the operator. At the last, an XML cations of the method in practical enterprise informa- file is built by storing the new hedges to it, which can tion integration via an example. Recently, publica- be accomplished by the parser. tions are no longer just the distribution of the printed An algorithm has been developed to identify the works, e.g. book or journal. They now include vari- hedges that conform to a particular type definition ous electronic media, e.g. CD, web, etc. In different from the child trees of a node. It splits the child trees publishing companies, the publication information is into logic groups using the hedge notations and then normally encoded in different XML formats. It thus checks their conformation to a given type definition. requires techniques to integrate these information to- Its input arguments include: 1) a type definition ex- gether to provide users with publication information pression gg to which the child trees conform; 2) a type in a unified format. Suppose an enterprise informa- constructor gg1 to which the hedges to be identified tion system adopts the DTD shown in Figure 5 to must conform; 3) the child trees of a node; and 4) an provide users with publication information. Note the index from which to start the search in the child trees. headers and the string type #PCDATA of the DTD are In the algorithm, gg is used as the rule to parse the not included in the figure to save space. Clearly in child trees, gg1 is used as the criterion to check if a order to provide users with publication information hedge is the one to be identified. using such a structure, all the relevant publication in- The algorithm produces two indexes as its output. formation encoded in other formats, including the one The elements between them in the child trees form the shown in Figures 2 and 3, must be converted to match hedge identified by the algorithm. Due to the space the structure. limit, we are not able to provide the algorithm in this Our methodology can be used to realize the publica- paper. The interested readers can contact us to get it. tion information integration. The basic process is, by

115 ICEIS 2008 - International Conference on Enterprise Information Systems

using the data transformation operators, publication Erwig, M. (2003). Toward the automatic derivation of information encoded in other format is converted to a transformations. LNCS 2814 - ER 2003 Workshops ECOMO, IWCMQ, AOIS, and XSDM Proceedings, flat form, and then further converted into an equiva- pages 342–354. lent structure which conforms to the integrated global Kay, M. (2007). XSL Transformations (XSLT), Version schema. 2.0. http://www.w3.org/TR/xslt20/. Liu, J., Park, H., Vincent, M., and Liu, C. (2006). A for- malism of XML Restructuring Operations. LNCS - ASWC, pages 342–342. Maruyama, H. (2002). XML and Java: developing Web ap- plications. Addison-Wesley, Boston, Massachusetts. Tian, J., Liu, J., Pan, W., Vincent1, M., and Liu, C. (2008). ...... Performance Analysis and Improvement for Transfor- mation Operators in XML Data Integration. APWeb Figure 5: A common global DTD structure. 08, pages 214–226. Zamboulis, L. (2004). Xml data integration by graph re- structuring. BNCOD, pages 57–71. 7 SUMMARY AND FURTHER WORK

This paper has presented a framework for implement- ing XML data integration via converting XML data from different data sources to an integrated global schema. The representation of DTDs and documents, and the transforming operations of XML data, de- fined by a set of operators, have been illustrated. The implementation of the transformation operations has been investigated and a package of the operators has been reported. In the proposed method, the full XML syntax has been covered, including nested brackets in element type definitions, multiplicity constraints at- tached to those nested brackets, and disjunction, for which other work has not provided sufficient support. Using the methodology, the transformed documents always conform to the transformed DTDs, which is a property that is not possessed by any query languages and algebras where users require detailed program- ming of the transformation procedure. With our work, users are freed from using complex language syntaxes and they just need to combine operators for achieving their desired XML data integration. Our next work includes the refinement of the oper- ators and the automated XML data conversion using the operators according to a sequence of operations defined based on the mapping between the original and target schema. The latter is also the one we will continue to investigate.

REFERENCES

Boag, S., Chamberlin, D., Fernandez, M. F., Florescu, D., Robie, J., and Simeon, J. (2007). Xquery 1.0: An xml query language. 1st International Conference on Template Production.

116