Schema Matching For Structured Document Transformations

THÈSE Nº 3108 (2004)

PRÉSENTÉE À LA FACULTÉ INFORMATIQUE ET COMMUNICATIONS

Institut des systèmes informatiques et multimédias

SECTION D’INFORMATIQUE

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

POUR L’OBTENTION DU GRADE DE DOCTEUR ÈS SCIENCES

PAR

AIDA BOUKOTTAYA

Diplôme National d’Ingénieur en Informatique, Ecole Nationale des Sciences de l’Informatique Tunis et de nationalité tunisienne

Acceptée sur proposition du jury:

Prof. Giovanni Coray, co-directeur de thèse
Dr. Christine Vanoirbeek, co-directrice de thèse
Prof. Gilles Falquet, rapporteur
Prof. Martin Rajman, rapporteur
Dr. Vincent Quint, rapporteur

Lausanne, EPFL 2004

ACKNOWLEDGMENTS

This thesis is the outcome of three years' work at the Theoretical Computer Science Laboratory (LITH) at EPFL. Many people helped me in this work, either directly by working on projects with me or indirectly by giving me vision and support.

I would like to express my deep gratitude to my advisors Prof. Giovanni Coray and Dr. Christine Vanoirbeek, who offered me the great opportunity of working with them. Thank you for the support you gave me through your continuous encouragement, your constructive suggestions, your focus and vision on my research, and your precious help in reviewing my PhD dissertation.

I would like to thank Prof. Gilles Falquet for the long discussions we had, for his constructive remarks, his availability and his precious feedback during the course of this research and the writing of my PhD dissertation.

I would like to thank the members of my examining committee, Prof. Gilles Falquet, Prof. Martin Rajman and Dr. Vincent Quint, who honoured me by accepting to review and evaluate my thesis. Special thanks to Prof. Roger D. Hersch, who kindly accepted to be the president of my PhD jury.

I would like to thank all the students who worked hard with me and contributed effectively to the development of the proposed solutions.

I would like to thank all the members of the LITH laboratory for their support and their friendship.

I would like to thank Frederic for his presence, his invaluable encouragement and help, and most of all for lending me an ear when I needed it the most.

I would like to thank all my friends, especially Dali, who has always supported me with his humane attitude.

Last but not least, I would like to express my deep gratitude to my family: my parents Chedly and Aziza, who taught me how to succeed (thank you for your love, your patience and your help), and my brothers Hichem and Mohamed for their encouragement, their love and their financial support.

Without the help of the above people, this work would never have been completed, nor even begun.


ABSTRACT

This dissertation studies the problem of structured document content reuse. In structured document content reuse, a document (or a part of a document) structured under one schema must be restructured and translated into an instance of a different schema. A notion closely tied to the structured document reuse problem is therefore that of structure transformation. In practice, this is typically achieved by writing translators encoded on a case-by-case basis using specific transformation languages. Writing and managing complex transformation programs is time consuming and generally requires programming skills. Many solutions have been proposed to simplify and, as much as possible, automate the specification and execution of structured document transformations. Several simpler, highly declarative transformation languages and graphical tools for transformation specification have been introduced as solutions to avoid programming. These languages and tools are very useful for describing and specifying transformations. However, they still require developers to manually indicate mappings for each source and target pair. The manual generation of mappings is an extremely labor-intensive and error-prone process.

To shield users from manually performing this task, we advocate the use of a schema matching process which (semi-)automatically finds semantic correspondences, so-called mappings, between two heterogeneous schemas. Based on such mappings, a transformation generator automatically produces the translation program. In this dissertation, we propose a framework for solving the XML schema matching problem. We rely on the extraction of semantic information nested within XML structures. Semantics is first captured by making the meanings of element names explicit, exploiting several element characteristics, including the analysis of the XML Schema designer's point of view expressed by the logical organisation of XML content, and additional semantic information given by means of features such as datatypes, element constraints and inheritance mechanisms.

The proposed framework provides a generic view of the overall matching process and formalizes it. Our proposed solution allows us to (1) semi-automatically discover an efficient sequence of operations for transforming a source XML schema into a target XML schema using schema matching techniques, (2) model the discovered mappings, and (3) automatically generate the transformation script. Note that the purpose of our work is not to replace programming languages for XML data translation, but rather to complement them. The experiments we have conducted show that we have been able to achieve good performance in the generation of source-to-target mappings.

RÉSUMÉ

This thesis is concerned with content reuse in the context of structured documents. Such reuse consists in restructuring a document, or a fragment of a structured document constrained by a schema, in order to produce a new document constrained by a different schema. This notion is therefore closely tied to the problem of structure transformation. Writing a transformation program is a time-consuming process that requires considerable programming knowledge. Several solutions have been proposed to simplify or automate this task. Some use declarative languages and graphical tools to work around the problems caused by programming. These languages and tools are very useful for specifying transformations. Nevertheless, they still require the explicit intervention of developers to manually indicate the correspondences between source structures and target structures. Specifying these correspondences is extremely complex: it is not only time consuming, but also a delicate cognitive process subject to many errors.

To alleviate these difficulties, we propose the use of a semi-automatic process that establishes semantic correspondences between two heterogeneous schemas. Based on these correspondences, a transformation generator then automatically produces a transformation program. Our work concretely addresses XML documents and the problem of matching several heterogeneous XML schemas. We rely on the information contained in XML structures. This information is captured by making the meaning of element names explicit, by exploiting certain characteristics of the elements, notably their logical organisation, and by taking advantage of information on typing, constraints on elements and the inheritance mechanisms described in XML schemas.

We propose a formalisation of the whole schema matching process, which makes it possible to compute, semi-automatically, a relevant sequence of operations for transforming an instance of a source XML schema into an instance of a target XML schema. We begin by (1) applying methods that search for equivalences between the schemas. Once established, these equivalences make it possible to (2) build the mapping rules that serve to (3) automatically produce the description of the transformation. This work does not aim at replacing transformation languages, but on the contrary at complementing their capabilities. The experiments we have conducted show satisfactory results, in particular regarding the relevance of the matches established between heterogeneous schemas.


Table of contents

Introduction
  1. Motivation
  2. Goals of the dissertation
  3. Contributions and road map

Part I: State of the Art

Chapter 1: Structured Document Reuse
  1.1 Why structuring documents?
  1.2 XML Fundamentals
    1.2.1 The XML data model
    1.2.2 Document Type Definition
      1.2.2.1 Element Type declaration
      1.2.2.2 Attribute Type declaration
      1.2.2.3 Discussion
    1.2.3 XML: Further notions
      1.2.3.1 XML Accessories
      1.2.3.2 XML Transducers
      1.2.3.3 XML Applications
  1.3 Structured document reuse
  1.4 Conclusion

Chapter 2: Structured Document Transformations
  2.1 Transformation of structured documents
  2.2 Tree transformation methods
    2.2.1 Syntax Directed Translation (SDT)
    2.2.2 Tree transformation grammar (TT grammar)
    2.2.3 Tree pattern matching and replacement
  2.3 SGML and XML Transformation Languages
    2.3.1 XSLT
      2.3.1.1 XPath
      2.3.1.2 XSLT template declaration
      2.3.1.3 XSLT template application
      2.3.1.4 Generating the output tree
      2.3.1.5 XSLT: further notions
        2.3.1.5.1 Branching elements of XSLT
        2.3.1.5.2 XSLT variables
    2.3.2 XML transformation languages
      2.3.2.1 Streaming Transformations for XML (STX)
      2.3.2.2 HaXML
      2.3.2.3 XDuce
  2.4 Summary and discussion
  2.5 Automating document transformations
    2.5.1 Declarative transformation specification languages
    2.5.2 Schema matching
  2.6 Conclusion

Chapter 3: XML Schema Matching
  3.1 Schema matching: complications and challenges
  3.2 Application domains
    3.2.1 Schema integration
    3.2.2 Data integration
    3.2.3 Data warehousing
    3.2.4 Data transformation
    3.2.5 Peer-to-peer data management
    3.2.6 Ontology matching
  3.3 Schema matching solutions
    3.3.1 Learner-based solutions
    3.3.2 Rule-based solutions
    3.3.3 Metadata-based solutions
    3.3.4 Learner-based solutions vs. rule-based solutions
  3.4 Matching methods
    3.4.1 Terminological matching
    3.4.2 Constraint-based matching
    3.4.3 Structural matching
  3.5 XML Schema matching
    3.5.1 Cupid (Microsoft Research)
    3.5.2 Learning source description (Univ. of Washington)
    3.5.3 Similarity Flooding (Stanford Univ. and Univ. of Leipzig)
    3.5.4 Clio (IBM Almaden and Univ. of Toronto)
  3.6 Conclusion

Part II: A framework for XML schema matching and automatic generation of transformation scripts

Chapter 4: Problem formalisation
  4.1 XML schema matching problem
    4.1.1 Semantic matching vs. syntactic matching
    4.1.2 Input information for the matching process
    4.1.3 Output solution for the matching process
  4.2 Formal definitions
  4.3 Modelling XML Schema
    4.3.1 Features of XML Schema
      4.3.1.1 XML Schema data types
      4.3.1.2 Attribute and element declaration
      4.3.1.3 Complex types
      4.3.1.4 Element and type substitutability
      4.3.1.5 Abstract types and abstract elements
      4.3.1.6 Integrity constraints
    4.3.2 XML Schema graph
      4.3.2.1 Schema graph nodes
      4.3.2.2 Schema graph edges
      4.3.2.3 Schema graph constraints
        4.3.2.3.1 Constraints over an edge
        4.3.2.3.2 Constraints over a set of edges
        4.3.2.3.3 Constraints over a node
  4.4 Source-to-target mapping algebra
  4.5 Conclusion

Chapter 5: Automating XML Schema matching
  5.1 Matching process: the big picture
  5.2 Terminological matching
  5.3 Data type compatibility
  5.4 Designer type hierarchy
  5.5 Structural matching
    5.5.1 Node context definition
    5.5.2 Path resemblance measure
      5.5.2.1 Longest common subsequence
      5.5.2.2 Average positioning
      5.5.2.3 LCS with minimum gaps
      5.5.2.4 Length difference
    5.5.3 Node context similarity
      5.5.3.1 Ancestor context similarity
      5.5.3.2 Child context similarity
      5.5.3.3 Leaf context similarity
      5.5.3.4 Node similarity
    5.5.4 Discovery of node and edge matches
  5.6 User feedback to filter the matching result
    5.6.1 Validating the mapping result
    5.6.2 Constraint filtering
      5.6.2.1 Data type compatibility
      5.6.2.2 Constraints compatibility
  5.7 Evaluation of the XML schema matching process
    5.7.1 Evaluation technique
    5.7.2 Real-world examples
    5.7.3 Comparative study
  5.8 Conclusion

Chapter 6: Automating XML document transformations
  6.1 Global architecture
  6.2 Structuring the mapping result
    6.2.1 Mapping structure
    6.2.2 Generation of the mapping structure
    6.2.3 Example of mapping result structuring
  6.3 XSLT generation
    6.3.1 The XSLT generator
    6.3.2 Example of XSLT generation
  6.4 Implementation issues
    6.4.1 Conceptualization toolkit
    6.4.2 Matcher engine
    6.4.3 Execution engine
  6.5 Conclusion

Chapter 7: Conclusions and future directions
  7.1 Key contributions
  7.2 Future directions
    7.2.1 Terminological and constraint-based matching
    7.2.2 Efficient user interaction
    7.2.3 Mapping maintenance
    7.2.4 Performance evaluation

Bibliography

Appendixes
  Appendix A: Terminological matching (Hirst and St-Onge algorithm)
  Appendix B: Top-down translation algorithm
  Appendix C: Detailed example

List of figures

Figure 1: DTD for available data and DTD for required data
Figure 1-a: A valid XML document example
Figure 1-b: The document reuse process
Figure 2-a: Example of tree transformation
Figure 4-a: Semantic matching vs. syntactic matching
Figure 4-b: A schema graph example
Figure 4-c: A join operation example
Figure 5-a: The matching process
Figure 5-b: Semantic relationships classification in WordNet
Figure 5-c: University bibliographic schema graph
Figure 5-d: The context of a schema element
Figure 5-e: Source and target schema graphs after context construction
Figure 5-f: Comparing real matches and derived matches
Figure 5-g: Comparative study with Cupid and SF
Figure 6-a: An XML schema defining the mapping result structure
Figure 6-b: A structured mapping example
Figure 6-c: Prototype system
Figure 6-d: The schema graph visualizer
Figure 6-e: User interface for the matcher engine
Figure A-a: Allowable and non-allowable paths in the Hirst and St-Onge algorithm

List of tables

Table 1.1: Examples of W3C XML-related recommendations and working drafts
Table 2.1: Properties of some transformation systems
Table 2.2: Properties of some XML/SGML transformation languages
Table 3.1: Characteristics of some existing XML Schema matching tools
Table 5.1: Evaluation parameters
Table 5.2: Characteristics of tested schemas
Table 5.3: Results of real-world examples
Table A.1: Classification of WordNet relations into directions

Introduction

1. Motivation

An important issue in distributed information systems is providing support for data exchange and reuse between autonomous and heterogeneous applications. Heterogeneity arises in general from the fact that each organization or application creates its own data according to specific requirements. These requirements are most often specified within abstract data models, so-called schemas (such as relational schemas, object-oriented schemas and, more recently, XML Schemas). The need for developing methods and tools that support data exchange and reuse is not new; it has been a focus of the research community for several decades [Shu 77], [Miller 00], [Fagin 03]. It has increased over the years, especially with the proliferation of Web data sources deploying a variety of information models and data encoding syntaxes. XML (eXtensible Markup Language) [W3C 98a] has clearly emerged as the most relevant standardization effort for document and data representation on the Web; it leverages a promising consensus on the encoding syntax for both human- and machine-readable information.

The work described in this dissertation addresses the problem of reusing XML documents constrained by a model. In order to specify precisely the objectives of the research and make clear the underlying assumptions, it is worthwhile to briefly summarize the evolution of the use of tagged documents and thus emphasize the features of XML documents that are taken into consideration in our framework.

Evolution of mark-up languages

The use of mark-up, i.e. text added to the data of a document, has its origins in text processing systems, within which such additional information was interpreted as formatting instructions to guide the rendering of a document, either on printers or on screens. The SGML (Standard Generalized Markup Language) [ISO 86] standard promoted this practice, encouraging declarative mark-up (versus procedural mark-up) that takes into account the hierarchical structure of the document and identifies the logical components without specifying procedures to be applied to them. Moreover, it has to be pointed out that SGML does not define a language; it is a meta-language that permits the creation of adapted descriptive mark-up languages. This is achieved through the SGML Document Type Definition (DTD) mechanism, which offers the possibility to define tags and their organisation for a given class of documents.

Despite the fact that SGML is basically not directed towards a particular goal, it has mainly been used for publishing purposes. Two well-known SGML applications are Docbook [Walsh 01] (a DTD particularly well suited to books and papers about computer hardware and software) and the Text Encoding Initiative (TEI, www.tei-c.org), an international and interdisciplinary standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching. However, the mark-up approach has also been progressively adopted by the hypertext community. The HyTime ISO standard [DeRose 94], an SGML application, defines a language and underlying model for the representation of "hyperdocuments". SGML has never been used massively; a number of obstacles, such as the complexity of developing tools (editors, parsers, etc.) based on a powerful but syntactically complex language, meant that the use of SGML remained concentrated in some major companies dealing with large volumes of documents.

The World Wide Web (WWW), initiated in 1990 by T. Berners-Lee, boosted the use of marked-up documents at a larger scale. The initial simplicity of HTML (Hyper Text Markup Language), an SGML application used to represent documents and make them accessible over the planet through a basic document location system (URL: Uniform Resource Locator), clearly helped in starting a new global communication medium. Several years later, the "evolving" HTML language, adopted not only by individuals but also by industrial stakeholders, revealed its limitations. Under the supervision of the W3 Consortium (created in 1994 to coordinate standards of the Web), the XML standard was published in 1998. XML has greatly facilitated the development of tools, but it also introduced a notion that significantly altered the use of marked-up information: it distinguishes between well-formed documents (documents respecting the XML syntax) and valid documents (documents validated against a document model definition).

As a consequence, a lot of existing XML documents are documents that use XML for data transport, i.e. structured data flows, mainly generated from databases. Such a situation is easily understandable: using XML instead of proprietary formats for exchanging information between applications relies on an established standard that offers a variety of tools for dealing with such data (especially for publishing purposes).

Finally, another Web standard, XML Schema, published in 2001 and aiming at 'defining the structure, content and semantics of XML documents', played an important role and definitely influenced the nature of XML documents. In order to enhance the "processability" of XML documents, the XML Schema standard proposed an XML language to define abstract document models. It is largely inspired from the design of database schemas and the object-oriented paradigm; in particular, it introduces the notion of datatypes (in comparison with DTDs, which are limited to textual data).

As a result, it turns out that XML, which was first designed for the document world, is now much more used in the data world. Therefore, we can distinguish between two types of documents with different requirements: document-centric and data-centric documents. Data-centric documents are documents that use XML for data transport. Although XML is human readable, data-centric documents are designed for machine consumption. They are characterized by a fairly regular structure, fine-grained data, and mostly no mixed content. The order in which elements occur is generally not significant. Document-centric documents are documents that are designed for a human reader. They are characterized by a less regular structure, coarse-grained data and a lot of mixed content. The order in which elements occur is almost always significant, particularly when the document is read serially by a human being.
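For illustration, hypothetical fragments of the two kinds (with invented element names) might look as follows. A data-centric fragment, with a regular structure and fine-grained data:

<order id="42">
  <customer>Acme</customer>
  <item quantity="2">bolt</item>
</order>

A document-centric fragment, with mixed content whose order matters:

<para>The <emph>second</emph> delivery arrived <emph>late</emph>, as noted above.</para>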

Reusing marked-up documents

SGML offered the possibility to represent documents in a standard base, guaranteeing a platform-independent storage format, and has been adopted by several communities of users. In this sense, it contributed to improving document exchange between text processing systems. However, dealing with structured documents constrained by a model (described via a DTD) also has some drawbacks. Reusing structured documents within users' environments raises a number of fundamental problems in transforming or adapting their intrinsic structure. Numerous transformation languages and systems have been proposed in the literature for transforming structured documents [Tirri 94], [Berger 96], [Bonhomme 98], [Harbo 93], [ISO 91]. Such languages and systems range from descriptive and simple ones to more powerful functional languages allowing complex transformations.

Reusing structured document content is typically attained by writing translators (often manually encoded on a case-by-case basis using specific transformation languages). Currently the best known and most widely adopted language for transforming structured documents is XSLT [W3C 99b]. XSLT, a recommendation of the World Wide Web Consortium, is a language, itself written in XML, with powerful computing capabilities for encoding transformations of XML documents.

An XSLT program (called a stylesheet) is a set of template rules, each of which has two parts: a pattern that is matched against nodes in the source document and a template that can be instantiated to form the result document. The process of transformation is generally divided into three main phases [Kuikka 96], [Murata 98]. The first one aims at understanding the structures and semantics of the source and target schemas. The goal of the second phase is to discover schema mappings by means of inter-schema correspondences, capturing the input and output constraints imposed on the documents. In the third phase, the discovered mappings are translated into an appropriate sequence of operations, in a given transformation language, over the input document to produce the required output document. Example 1.1 roughly illustrates the process a human goes through in order to generate a transformation script.
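For illustration, a minimal template rule with hypothetical element names pairs a match pattern with a result template; applied to every title node, it produces an h1 element containing the processed content:

<xsl:template match="title">
  <h1><xsl:apply-templates/></h1>
</xsl:template>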

The transformation process has long been known to be extremely laborious and error-prone at several levels. We distinguish two levels, which correspond respectively to the third and the second phases described above: XSLT program writing and mapping discovery.

XSLT is a powerful transformation language; however, it has several drawbacks. It appears to be a complex language, and even simple transformations require the user to write a program, which needs complete mastery of the XSLT language and thus requires non-trivial programming skills. In most cases, users might not be willing to invest their time and effort in writing transformation programs. The latter effort quickly becomes untenable if the transformation of several different data sources is required.

Moreover, a serious obstacle to translating directly between two XML sources is that a mapping between the source and target XML representations needs to be carefully specified by a human expert. Mapping two schemas is a very difficult problem. In fact, schemas are typically mapped according to their syntactic clues. Examples of such clues include element names, types, schema structure, and integrity constraints. However, such clues are often unreliable (e.g., two elements sharing the same name may refer to two different real-world entities). Manual mapping is time consuming and thus especially unacceptable for applications where the information sources change frequently. Moreover, since XML schemas can be very diverse, the mappings created by the expert are often complex. This complexity makes them hard to maintain when the original XML schemas change.

Example 1.1: Let us consider the following scenario: an application to evaluate the publication activities of researchers accepts XML data according to DTD 2 shown in Figure 1, which requires publications to be clustered by author. A user of this system finds a large repository of bibliographic data, which is given in the format of DTD 1 shown in Figure 1. In order to reuse these data, the user generally goes through a long process. First, he has to understand the structure and semantics of DTD 1. Second, he has to discover correspondences between schema elements. Such correspondences, also called mappings, identify which data can be useful for his application. Examples of such mappings are: element "author" corresponds to element "writer"; author "name" can be obtained by the concatenation of elements "FirstName" and "LastName"; element "Article" corresponds to element "JournalArticle"; etc. Finally, the user has to choose the transformation language and a set of adequate operations to translate the discovered mappings into a transformation script. The user has to transform "reference" elements into a list of "author" elements, each containing a sub-list of their publications, and restrict this list to the elements "book" and "JournalArticle".

DTD 1 (available data):

<!ELEMENT reference (book|JournalArticle|ProceedingArticle)*>
<!ELEMENT JournalArticle (title, (writer)*, journal)>
<!ELEMENT ProceedingArticle (title, writer*, proceeding)>

DTD 2 (required data):

<!ELEMENT author (name, (book|article)*)>

Figure 1: DTD for available data and DTD for required data
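As a rough sketch of the third phase, the mapping concat(FirstName, LastName) → name of Example 1.1 could be rendered as an XSLT fragment of the following form (a simplified sketch: the clustering of publications by author is omitted, and the FirstName and LastName children of writer are assumed from the example's description):

<xsl:template match="writer">
  <author>
    <name>
      <xsl:value-of select="concat(FirstName, ' ', LastName)"/>
    </name>
  </author>
</xsl:template>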

Faced with the complexity of structure transformation, several attempts to simplify this task as much as possible are underway. Solutions to structured document transformation fall roughly into two groups:

• The first group proposes highly declarative specification languages, generally with the support of graphical user interfaces, in order to help the user in the task of transformation specification [Tang 01], [Pietriga 01], [XSLWIZ 01], [Vernet 02], [Mapforce 04], and thus tends to hide the complexity of the XSLT language. However, although such languages and tools shield the user from the programming effort, they require that a mapping between the source and target XML representations be carefully specified. As the previous example suggests, manual mapping is time consuming and requires non-negligible human effort.

• The second group of solutions seeks to automate the process of mapping discovery (identified as the schema matching process) and to automatically deduce from such mappings the transformation script which can rearrange and modify the associated data [Su 01], [Leinonen 03], [Milo 98]. Schema matching is the task of finding correspondences between two heterogeneous schemas. Schema matching is not a new problem and has been the focus of the database, artificial intelligence, e-commerce and semantic web communities [Clifton 97], [Chalupsky 00], [McGuinness 00], [Mork 01], [Madhavan 01], [Castano 99], [Castano 03], [Chawathe 94], [Do 02b], [Doan 00], [Haas 99], [Melnik 02], [Miller 01]. A recent survey of automatic schema matching algorithms [Rahm 01a] classifies the approaches developed by the database community with respect to the information (element naming, structure, data types, integrity constraints, domain-specific common terminologies or thesauri, etc.) used to discover schema similarities. Two schema matching techniques can be distinguished: learning-based matching and rule-based matching. The first essentially relies on clues from a representative set of data instances, using machine learning techniques [Doan 02b]. The second utilises schema information (labels, structure, datatypes, etc.) to discover schema similarities. In the context of document reuse (contrary to data integration applications), the assumption that a set of representative data instances is available does not always hold. For this reason we essentially focus on rule-based matching techniques. A lot of rule-based matching algorithms have been proposed for matching relational schemas. Recently, some algorithms were proposed to deal with the hierarchical structure of XML schemas; however, they suffer from several serious shortcomings when applied in the context of XML document transformations. The two basic problems we faced when trying to apply existing schema matching algorithms in the context of document transformations can be summarized as follows:

o Up to now, very few projects have considered XML's structure in their schema matching methods. This is because most schema matching systems were developed by the database community and thus deal essentially with relational schemas. Moreover, the rare XML schema matching tools deal essentially with DTDs (which have more limited expressiveness than the current W3C XML Schemas), taking into consideration only parent-child relationships [Melnik 02], [Madhavan 01].

o The second fundamental problem concerns the mapping result itself. Generally, current schema matching algorithms focus only on discovering 1-1 mappings, also called direct mappings. The output result is a confidence score (ranging over [0,1]) between schema elements. Such a result is insufficient to perform transformations. First, complex mappings make up a significant portion of discovered mappings in practice. Second, to generate a transformation script we need to specify transformation operations more precisely. With a mapping result of the form (First-name, name, 0.8) and (Last-name, name, 0.8), the problem is only partially solved. What is needed is a mapping specifying that the concatenation of First-name and Last-name is similar to name with a confidence score of 0.8: (concat(First-name, Last-name), name, 0.8).

Faced with the limitations of current schema matching solutions, the vast majority of schema mappings in the context of structure transformations are still created manually today. The slow and expensive manual acquisition of mappings has now become a serious bottleneck [Doan 02b]. Hence, the development of semi-automatic solutions to schema matching is now truly crucial to building a structure transformation system.

2. Goals of the dissertation

The importance of the structured document paradigm and the increasing use of XML by several Web communities as a standard for document representation and exchange have made a large amount of data-centric XML documents available in distinct heterogeneous sources, stored according to different structures. In fact, for the same kind of data, independent developers often design XML schemas that have little in common [Melnik 00], [Halevy 03a].

In this dissertation, we essentially focus on the reuse and exchange of data-centric XML documents between web applications (throughout this dissertation, the term XML document refers to a data-centric XML document). The scenario we refer to consists of a number of heterogeneous Web sources of XML documents able to exchange and reuse document contents among each other. Each source stores its documents according to a specific schema.

In our work, XML data reuse is the problem of restructuring and translating data structured under one XML schema (which we call the source schema) into an instance of a different XML schema (which we call the target schema). Our focus is on solving the structured document reuse problem using schema matching techniques.

Two significant problems have to be addressed:

(1) Given a source schema and a target schema, we want to design a semi-automatic solution that produces a mapping between both schemas. Specifically, our main goals here are as follows:

• Develop a formal framework for defining the XML Schema matching problem. This framework should specify exactly the input information for the matching problem, the required output solution and the different assumptions made.

• Maximize the accuracy of the matching process by exploiting a wide range of available information. First, since any single type of syntactic clue is generally insufficient to judge the degree of schema similarity, the solution should combine multiple syntactic clues in a manner that achieves high matching accuracy. Second, because a matching process cannot be fully automated, the solution should be able to efficiently incorporate user feedback into the matching process.


• The proposed solution should be able to provide a mapping result that incorporates a set of transformation functions allowing the automatic generation of transformation scripts.

• In a dynamic environment such as the Web, data sources tend to change not only their data but often also their schemas. In this context, the proposed solution should provide a formal structure for the mapping result that facilitates its evolution. The goal is to avoid reapplying the entire mapping process every time the schemas change.

(2) Based on the result of the matching process, data instances valid against the source schema have to be translated into instances of the target schema. Two main goals have to be achieved:

• Ensure that the produced data instance is valid against the target schema. This means that the constraints imposed by the target schema (datatypes, integrity constraints, etc.) have to be respected.

• The generation of the transformation script should be transparent to the user. However, the solution should give the user the possibility to add traditional constraints (such as selection, projection, etc.) on the reused source data (e.g., select only books that have a specific title) in a simple syntax that the user can easily understand, as illustrated below.
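For instance, such a selection constraint might be expressed as a simple XPath-like predicate (the syntax and the title value are purely illustrative):

book[title = "Some Title"]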

3. Contributions and road map

In the following we sketch an overview of this dissertation, which is organized in two parts. The first, including chapters 1, 2, and 3, details the state of the art in structured documents, structure transformations and schema matching, respectively. The second part, including chapters 4, 5 and 6, outlines our proposed solutions for the XML schema matching problem and the automatic generation of transformation scripts. We finally conclude the dissertation by summarizing the main contributions and discussing future work.

Part I: State of the art

Chapter 1: Structured Document Reuse

In this chapter we give an overview of the basic concepts of structured documents through a description of the XML markup language. We essentially focus on two main issues. The first deals with the benefits to be gained from structuring documents. Such benefits include the automation of a wide range of document processing activities and the increase of document reusability. The second deals with the structured document reuse problem. We try to answer the question: what exactly can be reused? Two levels of reuse are then identified: (1) design-oriented reuse, where structures and related applications are reused; (2) authoring-oriented reuse, where document content is reused. In this PhD, we focus on authoring-oriented reuse.

Chapter 2: Structured document transformations

A notion that is tied to structured document content reuse is that of structure transformation. In chapter 2, we first detail the state of the art in structure transformation, outlining the complexity behind such a procedure. Second, we show how schema matching techniques can be used to automate structured document transformations.

Chapter 3: XML Schema matching

Rule-based matching techniques rely on the weighted combination of several matching criteria. The three most frequently used methods are terminological, constraint-based and structural matching. Terminological matching finds matched pairs based on the similarity of their names, while structural matching relies on the analysis of the context in which schema elements appear. Constraint-based matching finds matching pairs based on their constraints (datatypes, referential integrity, etc.). While terminological and constraint-based matching techniques are widely developed and applicable for matching XML schemas, the proposed structural matching methods remain insufficient and very limited (they deal only with DTDs and exploit only parent-child relationships). In this chapter, we detail the state of the art in schema matching in general and, more specifically, describe recent research in XML schema matching. We essentially present two structural matching algorithms, Cupid [Madhavan 01] and Similarity Flooding [Melnik 02], through which we point out the current limitations of XML schema matching algorithms.

Part II: A framework for XML schema matching and automatic generation of transformation scripts

Chapter 4: Problem formalisation

Schema matching has attracted much attention and several algorithms have been proposed. However, existing solutions lack a formal definition of the problem they are addressing. In this chapter, we suggest a formal framework for defining the XML schema matching problem, making it clear to the user what exactly a mapping means and under which assumptions the solution is produced. Our definitions for XML schema matching were inspired by the works in [Madhavan 02] and [Do 02b]. The latter present a formal framework for solving schema matching in order to integrate data described according to a variety of representations (relational, DTDs, ontologies) using machine learning techniques. In contrast to these studies, we state different assumptions for the XML schema matching problem in the context of structured document reuse. The basic differences are in the needed input information (we require only schemas, while they make use of data instances) and in the required output solution (although they also provide complex mappings, they do not state a set of needed operations). To formalise the matching problem, we rely on two notions:

• A formal model for XML schemas. This model summarizes the main features of the XML Schema language that we consider within the matching process. It also helps the user in understanding XML schema semantics without dealing with syntax issues. The model is based on a directed labelled graph with constraint sets. Nodes represent XML schema elements and attributes, while edges represent the different relationships between elements (containment, associations, etc.).

• A source-to-target matching algebra that extends the standard relational algebra. Our defined set of operations includes union, selection, merge, split, join, apply and rename. To define such operations, we have essentially been inspired by research in the field of generating virtual views in data integration systems [Biskup 03], [Dobre 03], [Xu 03b].

Chapter 5: Automating XML Schema matching

This chapter details our XML schema matching approach based on the assumptions made in chapter 4. We essentially propose:

• Three different matching methods for computing element similarities:

o Terminological matching, using Hirst and St-Onge's algorithm for semantic distance computation based on WordNet [Hirst 98]. The algorithm has been modified to produce both a numerical similarity coefficient and semantic relationships (equivalent, broader than, narrower than, etc.).

o Constraint-based matching, dealing essentially with the analysis of XML schema datatypes. For atomic nodes (leaves), we make use of the XML Schema type hierarchy [W3C 01b] to derive a datatype compatibility coefficient, while for intermediate nodes we use features such as type inheritance, substitution groups and abstract types to find complex mappings.

o Structural matching: contrary to current structural matching algorithms, we emphasise the notion of the context of an element. The context of an element is given by the combination of its ancestor context, its child context and its leaf context. For comparing such contexts, we introduce the notion of path comparison, using algorithms from dynamic programming [Myers 86] and path query answering [Carmel 02].


• A method to derive direct as well as complex matches (with their associated transformation operations) from the computed element similarities.

• A filtering method for the mapping result that incorporates user feedback.

• An evaluation of our XML schema matching techniques using the quality measures defined in [Do 02a] and a real-world application: bibliographic data that reflects the main characteristics of data-centric XML documents. For our evaluation, we rely on a set of XML schemas that use different granularities and abstraction levels (flat schemas versus highly nested schemas) to describe the same real-world concepts. We also draw a comparative study with two existing structural matchers and show that our solution produces fairly reasonable results for both direct and complex matches.

Chapter 6: Automating Structure transformations

This chapter describes how a source instance can be transformed into a target one using the established semantic relationships between a pair of source and target schemas. We begin by introducing the chosen structure of the mapping results. Structuring the mapping result is essential for two reasons. First, it is easier to manipulate structured mappings in order to automatically generate transformation scripts. Second, structuring the mapping result greatly increases its reusability, especially when schemas evolve. We use the XML Schema language to structure the mapping result. We then describe how, based on such a structured mapping result, we can automatically generate a transformation script that translates a source XML instance into an XML instance that is valid against the target schema.


Part I

State of the Art


Chapter 1

Structured Document Reuse

In this chapter, we first define the basic concepts of structured documents; we show the benefits to be expected when structuring documents and present the XML markup language as the current standard for structuring documents. Second, we expose the structured document reuse problem, focusing on structured document content reuse. Structured document content reuse is the problem of restructuring and translating data structured under a source schema into an instance of a target schema, better known as the structure transformation problem. Finally, we show that structure transformation is central to resolving heterogeneities between schemas and thus to enabling structured document content reuse.

1.1 Why structuring documents?

A document is considered to be structured if it explicitly contains extra information that describes its logical, hierarchical structure. In contrast to semi-structured data, where the structure is often irregular, partial, unknown, or implicit in the data, a structured document refers to a document conforming to a pre-defined grammar or schema that describes the permissible document components and their logical organization [Abiteboul 00]. Since the distinction between schema and data is often blurred, semi-structured data is called "schemaless" or "self-describing" [Abiteboul 97b]. The models discussed in [Buneman 97], [Fernandez 98], [Hugh 97], [Abiteboul 97c] and [Suciu 97] are examples of semi-structured data. Documents conforming to a context-free grammar or forest-regular grammar, such as those found in [Furuta 87], [Murata 97], [Gecseg 84], are examples of structured documents. SGML (Standard Generalized Markup Language) [ISO 86] documents conforming to some Document Type Definition (DTD) and XML (Extensible Markup Language) documents conforming to a DTD or, more recently, to an XML Schema are considered structured documents. Figure 1-a illustrates an XML document with its DTD.

The document structure can be utilized to facilitate several activities such as document authoring, document publishing, document querying and browsing, etc. Recent research has shown that as much as half of the time spent authoring documents goes into formatting. A basic principle of structured documents is to separate the structure from its presentation; this shields authors from formatting tasks, allowing them to worry only about the content. The writing of documents can then be guided by prompting for the required structural parts, and by interactively validating the resulting structure. Another advantage of the structure is that the authoring tool may help the user with powerful commands. Based on the structure, the tool can automatically update cross-references and establish a table of contents or an index [Cole 90], [Furuta 88].

Based on structure, it is also possible to automate the generation of different layout formats such as Word (Doc/RTF), HTML (for Web sites), PDF (printed documentation) and WML (for wireless devices). This greatly facilitates the publishing of structured documents and saves effort. Structuring documents clearly facilitates their later processing. In fact, producing a class of documents that conform to a unique grammar typically enhances the specification of appropriate processing operations on documents belonging to that class, since processing is uniformly defined for a set of documents and not for each individual document. Several tasks can be automated, such as the above-mentioned transformation of documents into different formats, enabling automatic publication of printed copies or of online or CD-ROM versions.

In structured documents, mark-up is also used for identifying meaningful parts of a document, and thus facilitates document querying and browsing [Ludäscher 02], [Hardt 02]. Structured documents may be stored in document bases, such as the Tamino database [Tamino], which are typically enriched with editing, query, workflow, and versioning facilities. Finally, a very important aspect of structured mark-up is that the documents are software- and system-independent, which enables interchange between different environments.

In this respect, a number of research works are dedicated to the analysis of raw or semi-structured documents in order to structure or re-structure them. In [Frankhauser 93], the authors proposed the MarkItUp system, designed to recognize the structure of untagged electronic documents; their approach is based on learning by example to gradually build recognition grammars. [Belaïd 97] used a constraint propagation method to extract the logical structure of library references. A lot of effort is dedicated to the analysis of document images, such as PostScript files [Bapst 98], in order to deduce document structure. Finally, we may also cite work performed to interactively restructure HTML documents, an approach based on the use of a transformation language [Bonhomme 96].

1.2 XML Fundamentals

XML is a markup language for presenting information as structured documents [W3C 98a]. The design of XML was driven by the idea of having a generic SGML-based language for representing data in a self-describing way. The latter is achieved by structuring the contents using tags. Every XML application defines its own tags, which are described in a DTD or, more recently, using an XML Schema.

1.2.1 The XML data model

An XML instance (an example of an XML instance is given in Figure 1-a, lines 12-29) is the linearization of a tree structure (see www.w3c.org/XML/Datamodel.html). The tree that an XML document represents has a number of different types of nodes, among them a document node (the entry point of the tree), element nodes (the inner tree nodes), text nodes (leaves), and attribute nodes. Every XML instance is associated with a unique document node which serves for accessing the XML instance. The document node has a unique child which is the root node of the document.

• Element nodes (including the root node) have a name and are delimited by a start tag and an end tag, or, for empty elements, by an empty-element tag. The text between the start tag and the end tag is called the element's content. The element content is an ordered list of children, i.e., element nodes and text nodes. Additionally, elements may have a set of attribute nodes located in their associated start tag. In Figure 1-a (lines 12 and 29), the element Bibliography is delimited by the start tag <Bibliography> and the end tag </Bibliography>. The content of the element Bibliography is a list of author elements, each of which is a list of a name element and book or article elements.

• Text nodes contain only textual content and have no children. The contents of elements name, title and journal (lines 14, 16 and 20) are examples of such nodes.

• Attribute nodes have a name and a value. The value can be either atomic or a set of atomic values. The order of attribute specifications in a start tag is not significant. On line 20, the start tag of element journal contains the attribute year, whose value is 2004.

An XML document is well-formed if the following holds: the document meets all the well-formedness constraints given in the XML specification; each of the parsed entities referenced directly or indirectly within the document is well-formed; and the document contains one or more elements nested properly within each other. The nesting is proper if (1) there is exactly one root element, no part of which appears in the content of any other element, and (2) for elements other than the root the following rule applies: if the start tag is in the content of an element, then the end tag is in the content of the same element. A well-formed XML document may in addition be valid if it meets the requirements defined in a Document Type Definition (DTD) or, more recently, an XML schema. The XML document in Figure 1-a (lines 12 to 29) is valid and conforms to the DTD described from line 2 to line 11.
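For instance, the fragment <a><b></a></b> is not well-formed because the two elements overlap, whereas <a><b/></a> is well-formed.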


This dissertation studies XML documents that have an underlying schema. In terms of XML, we limit ourselves to valid documents rather than merely well-formed ones.

1.2.2 Document Type Definition

A document type definition (DTD) consists of a grammar that describes:

• Elements that are allowed in the document, and their content (element types, order, cardinality),

• Attributes that are associated with these elements,

• In addition, entities may be defined.

1.  <?xml version="1.0"?>
2.  <!DOCTYPE Bibliography [
3.  <!ELEMENT Bibliography (author)*>
4.  <!ELEMENT author (name, (book|article)*)>
5.  <!ELEMENT book (title)>
6.  <!ELEMENT article (title, journal)>
7.  <!ELEMENT name (#PCDATA)>
8.  <!ELEMENT title (#PCDATA)>
9.  <!ELEMENT journal (#PCDATA)>
10. <!ATTLIST journal year CDATA #IMPLIED>
11. ]>
12. <Bibliography>
13.   <author>
14.     <name>Author 1</name>
15.     <book>
16.       <title>Title 1</title>
17.     </book>
18.     <article>
19.       <title>Title 2</title>
20.       <journal year="2004">Journal 1</journal>
21.     </article>
22.   </author>
23.   <author>
24.     <name>Author 2</name>
25.     <book>
26.       <title>Title 3</title>
27.     </book>
28.   </author>
29. </Bibliography>

Figure 1-a: A valid XML document example

1.2.2.1 Element Type declaration

An element type declaration has the form: <!ELEMENT name (content model)>. We distinguish several kinds of content models for elements:

• EMPTY: elements that do not have any content (but may have attributes),

• #PCDATA: elements that have only text content and may have attributes; lines 7, 8 and 9 are declarations for text elements.

• Regular expression over element names: elements that contain other elements. The constructs "," (sequence) and "|" (exclusive-or) are used to describe the organization of such sub-elements. The cardinality of elements is specified using the following occurrence indicators: "*" (zero or more), "+" (one or more), and "?" (zero or one). Lines 3 to 6 of Figure 1-a are such examples.

• Mixed content: elements that may contain both text and other elements, in arbitrary ordering and nesting.

• Any: elements that have arbitrary content.

1.2.2.2 Attribute Type declaration

Element attributes are described by their name, their type and a cardinality constraint. We distinguish four kinds of cardinality constraints for attributes:

• #REQUIRED: the attribute is mandatory for each instance of the element type,

• #IMPLIED: the attribute is optional,

• #FIXED value: the attribute has this value for all instances of the element type.

• Default: the attribute will implicitly be present for each instance of the element type. If a value is given in the document, it is used; otherwise the default is used.
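For illustration, a hypothetical attribute list declaration combining the four kinds of cardinality constraints might read:

<!ATTLIST article id      ID    #REQUIRED
                  status  CDATA #IMPLIED
                  version CDATA #FIXED "1.0"
                  lang    CDATA "en">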

1.2.2.3 Discussion

DTDs are a heritage from the SGML specification for describing generic document structure. Essentially, a DTD can be viewed as an extended context-free grammar or regular tree grammar [Murata 01] that simply specifies a set of element names and their possible nesting structure. In particular, DTDs are poor in terms of data types and conceptual abstractions. Some typical modelling issues are not covered:

• Datatypes: the only literal types are CDATA/PCDATA and NMTOKEN(S).

• Non-ordered subelements: A significant weakness is that it is very difficult to specify that an element must contain some subelements in an arbitrary order.

• Cardinalities: the constructors ("," and "|") and occurrence indicators ("*", "?", "+") do not offer many possibilities for constraining element occurrences.

• Conceptual abstractions: DTDs lack conceptual abstractions (generalization/specialization relationships, association relationships, etc.) such as those found in ER or UML models.

• DTD Syntax: Another problem with DTDs is that they are not in XML syntax, which prevents the use of several XML tools (editors, parsers, etc.,) and XML technologies (XML transformation languages, XML query languages, etc.).


To overcome the limitations of DTDs, the W3C XML Schema working group provides an XML schema definition language for specifying XML document structure that introduces more sophisticated typing mechanisms (user-defined datatypes, more built-in datatypes, type inheritance) and itself uses the XML syntax, as the sketch below illustrates. Other languages for writing XML schemas have also been proposed, such as DSD [Klarlund 00], RelaxCore [Murata 00], TREX [Clark 01] and Schematron [Jelliffe 00]. The authors in [Lee 00] propose a comparative analysis of such schema languages. In this dissertation, we essentially consider the W3C XML Schema language.
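For illustration, a small W3C XML Schema fragment in the spirit of Figure 1-a (the gYear datatype for the year attribute is an assumption made for this example) shows built-in datatypes, a user-defined type and the XML syntax of the language:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="yearType">
    <xs:restriction base="xs:gYear"/>
  </xs:simpleType>
  <xs:element name="journal">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute name="year" type="yearType"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>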

1.2.3 XML: Further notions

The XML specification was completed in early 1998 by the World Wide Web Consortium (W3C). Since then, a great deal of new XML-related specifications (often called the XML family) and applications have been developed and published. Several resources, including the W3C Web pages (http://www.w3c.org), the Robin Cover and OASIS Web site (http://xml.coverpages.org), the XMLINFO Web pages (http://www.xmlinfo.com), and the O'REILLY Web pages (http://www.xml.com), are dedicated to the XML family. In the following, we introduce some recommendations and working drafts of the W3C that are designed to accompany the use of XML in practice. Three main categories are identified: XML Accessories, XML Transducers, and XML Applications [Salminen 04].

1.2.3.1 XML Accessories

XML Accessories are languages which are intended to extend the capabilities specified in XML. An example of an XML accessory is the XML Schema language, extending XML DTDs. Another example is the XPath language [W3C 99c], which defines how to address parts of XML documents. XPointer [W3C 03a] extends XPath to allow addressing points and ranges as well as whole nodes, locating information by string matching, and using addressing expressions in URI references as fragment identifiers. The XML Linking language (XLink) [W3C 01d] provides mechanisms which extend the basic ID/IDREF linking mechanism inside a document instance.
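For illustration, evaluated against the document of Figure 1-a, the following XPath expression selects the title elements of all articles:

/Bibliography/author/article/title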

1.2.3.2 XML Transducers XML Transducers are languages which are intended for translating some input XML data into some output format. Examples of XML transducers are the Cascading Style Sheets (CSS) and XSL. CSS-level-1 [W3C 99d] was intended to attach style to XML and HTML documents e.g., control the colour, font, borders, spacing, indenting, etc. CSS-level-2 [W3C 98b] extends CSS level 1 with media-specific stylesheets, content positioning, downloadable fonts, tables, automatic counters and properties related to user interfaces. The Extensible Stylesheet Language (XSL) [W3C 01e] is a stylesheet

4 http://www.w3c.org
5 http://xml.coverpages.org
6 http://www.xmlinfo.com
7 http://www.xml.com

language especially designed for XML documents using the XML syntax. In the early working drafts, XSL consisted of three components:

• XSL Patterns as an addressing and selection mechanism for XML trees (which was later separated out into XPath).

• XSL-FO (Formatting Objects) which provides elements describing formatting and layout markup for XML documents.

• XSLT (XSL Transformations) which serves as a functional programming language for transforming XML documents.

1.2.3.3 XML Applications XML Applications are languages which define constraints for a class of XML data for some special application area. XML applications are divided into four subcategories. The first subcategory consists of languages intended for non-textual forms of data: mathematical data, multimedia data, animation, vector graphics, and voice. Examples include the SMIL language [W3C 01f] for integrating a set of independent multimedia objects into a synchronized multimedia presentation and the SVG language [W3C 03b] for describing two-dimensional vector and mixed vector/raster graphics in XML. The second subcategory consists of languages intended for web publishing, to replace HTML; these include XHTML [W3C 02a] and XFrames [W3C 02b]. The third subcategory includes languages for the semantic web. Examples include RDF [W3C 99a], RDF Schema [W3C 04a] and OWL [W3C 04b], the semantic markup language for publishing and sharing ontologies on the web. Finally, the fourth subcategory consists of XML applications related to web communication and services, such as WSDL [W3C 03c], intended to describe Web services. Another language, WSCL [W3C 02c] (Web Service Conversation Language), is used in conjunction with WSDL to describe the abstract conversations supported by Web services. Table 1.1 summarises examples of W3C-related recommendations and working drafts.

1.3 Structured document reuse

Reuse has been recognized as an important factor in improving productivity and reducing the investment of time and effort in document-related activities. This issue has been widely addressed in several areas, including programming languages, software engineering and, more recently, document engineering. Object-oriented programming languages offer an easy way to modify pieces of programs and reuse them for other purposes and within other environments. The scope of reuse has rapidly evolved from the simple reuse of code to encompass all software-related components, such as specifications, design diagrams, architectures and documentation. Document reuse has been defined as “the process of producing new documents and new versions of old documents by reusing pieces of previously existing documents” [Levy 93]. The process of document reuse (Figure 1-b) involves finding the reusable components in document collections


(e.g., libraries, document databases, Web, etc.), modifying them as needed and combining them with new material to build a new document.

It is widely accepted that document reuse facilitates and speeds up document production in many application areas. For instance, in the educational domain, instead of producing courses from scratch, a group of teachers could maintain a course material pool consisting of examples, definitions, theorems and their proofs, exercises, book chapters and examinations. When producing a new course, a teacher could reuse this existing educational material, which considerably reduces time and effort. In this respect, several projects, such as ARIADNE described in [ARIADNE 00] and SEMUSDI described in [Delestre 98], aim to provide distributed shared databases of pedagogical material. A survey of techniques for educational material reuse is given in [Boukottaya 03].

Document reuse is as old as document production itself and has concerned all kinds of documents, including paper documents, electronic documents and, more recently, structured and multimedia documents. Paper documents have been widely reused: the photocopier allows copying portions of documents or entire documents and combining them to obtain new documents [Levy 93]. The growing use of electronic documents makes the issue of their reuse very challenging. Electronic documents can obviously no longer be considered as simple representations of their paper counterparts. They become dynamic components whose content may be modified according to users' interactions, but also in reaction to modifications in the user's environment [Quint 94], [Quint 95]. Several efforts have addressed electronic document reuse. Research on document recognition and scanning was the first step in the evolution of the reusability of paper documents. Text and graphic editors, such as Microsoft Word, offer different possibilities for reuse. [Levy 93] identifies at least three: (1) Replication: from a single document, several presentations can be produced. (2) Redaction: new versions of a document can be made by editing its electronic representation. (3) Extraction: portions of one document can be taken and moved to another by means of the now popular “Cut&Paste” command.

Reusing documents in any of these three ways is straightforward. However, when reuse requires crossing system and application boundaries, several problems arise owing to the heterogeneity of such systems. One response to these problems is to structure documents using markup languages such as XML. The advent of structured documents established a promising consensus on the encoding syntax for machine-processable information, thus resolving several issues such as parsing and character encoding recognition. XML was rapidly adopted as a data representation and exchange format in applications from areas well beyond document management.

However, XML documents are no longer considered only as an information delivery format for information exchange between heterogeneous systems and environments, but also as a major component within global information systems. An XML document becomes a rich component that includes semantic and structural information (described in the content of the XML document and in its related schema) and may be accompanied by several related applications involving management, querying and processing (examples of such applications were described in section 1.2.3). To respond to this new role of documents, the issue of reuse has to be revisited. Although one of the strengths of XML is to allow several representations and formats to be generated from the same source document, reuse of structured documents appears to be of much wider interest. Rekik [Rekik 01] identifies two levels of reuse: design-oriented reuse and authoring-oriented reuse. In design-oriented reuse, structures and document-related applications are reused, while in authoring-oriented reuse, the content of XML instances is reused.

In this dissertation, we essentially focus on content reuse. Structured document content reuse is the problem of restructuring and translating data structured under a source schema into an instance of a target schema. Although increasing use of XML has simplified data exchange, the problem of structured document reuse remains largely unsolved due to the heterogeneities of such documents.

[Figure: a document collection (resources) is queried and browsed; reusable pieces are selected and extracted, then transformed and combined, with iteration, into a new document.]

Figure 1-b: The Document Reuse Process

A significant amount of literature discusses schema heterogeneity. Classifications or taxonomies of such heterogeneities appear in [Goh 97], [Kashyap 95], dealing essentially with relational and object-oriented databases. More recently, [Lee 01] and [Durand 02] describe conflicts that may occur between two XML schemas. The key commonality underlying these studies is that they classify schema heterogeneities into two categories: schematic heterogeneity and semantic heterogeneity. Schematic heterogeneities [Kim 91], [Krishnamurthy 91] and semantic heterogeneities [Sheth 92], [Naiman 95], [Garcia 96], [Hammer 95] have been well documented in the literature


with a consensus on what each encompasses. In most cases, the distinction between the two can be characterized by differences in structure (how are the data logically structured?) and interpretation (what do the data mean?). This distinction, however, is not always clear, since the logical organization of data often conveys semantic information. [Durand 02] distinguishes several types of heterogeneity between XML schemas:

• Semantic-based heterogeneity: refers to heterogeneity of natural semantics (the linguistic meaning of element and attribute names). Examples include the attribution of different names for semantically equivalent elements or attributes, the attribution of the same name for semantically different elements or attributes, etc.

• Granularity-based heterogeneity: includes the number and nesting depth of elements each schema uses to encode a given content and the amount of information added via attributes. Examples include the use of a different number of elements to describe equivalent content (e.g., a single name element versus nested firstName and lastName elements).

• Constraint-based heterogeneity: refers to differences in positional constraints, element cardinalities, domain values and datatypes. Examples include different orderings imposed on equivalent content, different type constraints (e.g., strings versus integers), etc.

• Data-based heterogeneity: involves differences in the format of data content. Examples include the use of different formats for dates or units.

• Inheritance-based heterogeneity: is considered a consequence of another type of heterogeneity. The failure to successfully translate an attribute in schema A into an equivalent attribute in schema B means that any elements which inherited that attribute's distributed property in A will no longer have it in B, which causes a loss of information.
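As a concrete illustration of semantic-, granularity- and constraint-based heterogeneity, consider two hypothetical DTD fragments (all names invented) describing the same bibliographic content:

    <!-- Schema A: flat content, author as a single text element -->
    <!ELEMENT book   (title, author)>
    <!ELEMENT author (#PCDATA)>

    <!-- Schema B: different names, finer granularity, different order -->
    <!ELEMENT book   (writer, title)>
    <!ELEMENT writer (firstName, lastName)>

Here author and writer differ in name (semantic heterogeneity), writer splits the name into two subelements (granularity heterogeneity), and the two content models impose different orders on equivalent content (constraint heterogeneity).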

Resolving such heterogeneities requires structure transformations, a notion closely tied to structured document reuse. Numerous syntax-directed translation schemas and tree-transformation-based languages have been described in the literature, e.g., SIMON [Feng 93], ALCHEMIST [Tirri 94], XDuce [Hosoya 00], HARMONIA [Boshernitsan 01]. The best known and most widely adopted language for transforming structured documents is XSLT. [Jayavelu 00] shows the ability of XSLT to resolve different types of schema heterogeneity (primarily structural, semantic and data-related).


XML Accessories
  XML Namespace    A mechanism for qualifying element and attribute names.
  XML Schema       A schema specification language extending the properties of DTDs.
  XPath            Addressing parts of an XML document.
  XPointer         A language for pointing into XML resources.
  XLink            A language to create and describe links.

XML Transducers
  CSS              A stylesheet language for rendering XML documents.
  XSL              A versatile stylesheet and document transformation language.
  XSLT             The transformation language of XSL.
  XQuery           A query language for XML.

XML Applications
  SMIL             A language for creating interactive multimedia presentations.
  SVG              A language for creating vector graphics.
  XHTML            A reformulation of HTML as an XML application.
  XFrames          An XML application providing the functionality of HTML frames.
  RDF/RDF Schema   A framework for representing metadata about Web resources.
  OWL              A semantic markup language for publishing and sharing ontologies on the web.
  WSDL/WSCL        Languages to describe Web services.

Table 1.1: Examples of W3C XML-related recommendations and working drafts


1.4 Conclusion

The importance of the structured document paradigm has made a large amount of heterogeneous and distributed XML documents widely available. In this context, the reuse of structured document content becomes a crucial issue. Heterogeneity arises from the fact that applications define and constrain their data using different structures. This implies that making new XML data available to an application requires that the data be either transformed into the system's specific data structure or acquired anew. Acquiring new XML information is very time consuming and generally requires human intervention and/or a large investment in technical infrastructure. In this context, sharing and reusing XML documents across applications, systems and users is of major concern and helps overcome the problems of new data acquisition. However, in order to establish efficient XML data sharing, two key issues have to be addressed:

• First, relevant and potentially reusable XML data has to be identified and located. Finding suitable XML information for a given task has been largely addressed in the areas of information retrieval and information filtering and will not be addressed in this work.

• Second, the reused XML data has to be transformed into the system's specific data structure. We essentially focus on this issue. Chapter 2 gives an overview of existing transformation languages and systems for structured documents.


Chapter 2

Structured Document Transformations

This chapter studies structured document transformations as a means for manipulating and reusing the content of structured documents. First, an overview of the literature in this field, ranging from simple and descriptive languages/systems to more complex and expressive ones, is presented. Through this description of existing work, we show that transforming structured documents is not a trivial problem; rather, it appears as a complex process that requires manual effort and programming skills. Second, several attempts to automate structured document transformations are described; however, none of them encompasses the overall transformation process. Finally, we argue that schema matching techniques could be a suitable solution to help automate structured document transformations.

2.1 Transformation of structured documents

Trees are very common data structures in many systems and applications. Tree transformations have been the focus of varied research, and an extensive literature (including theories for graphs and trees, tree editors, as well as research on tree pattern matching and replacement methods) discusses their formal properties and their implementations. Since it is widely accepted that document structure is often represented as a tree, tree transformation methods have been largely used for transforming structured documents. Parsing is important when applying tree transformations to structured documents: a parser for structured documents plays the same role as in data analysis or compiling. It can check whether a document is consistent with a given grammar, and it constructs a parse tree from the document, a data structure representing the document. Once the document has been parsed, its transformation can begin. [Linden 97] argues that a parsed document offers more opportunities for transformation, since it provides more details about the document itself.


The common point between transformation systems for structured documents is the transformation process itself, which can be divided into three main phases [Kuikka 96], [Murata 98]:

• Structure Identification: This phase aims at understanding the input and output structural constraints imposed on the documents by analysing schemas (e.g. grammars, DTDs, XML Schemas, etc.) of input and output documents in order to precisely identify all significant components within a document instance, each correctly interpreted in its context.

• Transformation specification: This phase deals with the specification of mappings by means of inter-schema correspondences, capturing input and output constraints imposed on the documents. This process identifies the parts to be transformed in the input document and describes how to place them in appropriate contexts within the output document so that the result is valid against the target schema.

• Transformation generation: In this phase, mapping specifications are translated into an appropriate sequence of operations (in a given transformation language) over the input document, to produce an output document satisfying the constraints and structure of the output schema.

In section 2.2, we first give an overview of tree transformation methods; we then present some transformation languages and systems designed for transforming SGML and XML documents using such methods.

2.2 Tree transformation methods

Parse tree transformations are generally defined with the use of paired grammar productions and can be implemented by syntax directed translations, TT grammars, or pattern matching and replacement techniques.

2.2.1 Syntax Directed Translation (SDT) Syntax directed translation was one of the first and simplest techniques used to transform structured documents constrained by a context-free grammar [Kilpelainen 90], [Kuikka 91]. A transformation is defined by a syntax directed translation schema (SDTS) [Irons 61], [Lewis 68]. An SDTS is a quintuple (N, Σ, ∆, R, S), where

• N is a finite set of non-terminals, shared by both input and output grammars, one of which is the start symbol S,

• Σ is a finite input alphabet,

• ∆ is a finite output alphabet, and


• R is a set of paired grammar rules of the form A → α, β, where α ∈ (N ∪ Σ)* and β ∈ (N ∪ ∆)*. α and β share the same non-terminals, and the non-terminals in the output rule β are a permutation of the non-terminals in the input rule α.

Suppose that we want to reorder the children of a given element Article so that the journal element, which follows the title element in the source, appears before it in the target. The following paired grammar rule is then used: Article → title journal, journal title. An SDTS thus defines transformations which only allow one to change the order of the children of a given node.
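On a document instance, this rule has the following effect (element contents invented for the example):

    Source:  <Article><title>XML Reuse</title><journal>TCS</journal></Article>
    Target:  <Article><journal>TCS</journal><title>XML Reuse</title></Article>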

SDTSs have been widely used in multiple-view editors such as the Helsinki Structured Text database (HST) system [Kilpelainen 90], which provides users with multiple views of a document. The logical view is described through a context-free grammar, and the user can modify it according to the rules of a syntax directed translation schema. Another transformation system that relies on SDT is the Integrated Chameleon Architecture (ICA) [Mamrak 93]. The ICA system relies on user intervention to define an intermediate representation that lies between the source and the target grammars, where source and target grammars are described by reordering the non-terminals. Aho and Ullman [Aho 72] provide an algorithm for the automatic transformation of a parse tree via an SDTS. The algorithm processes a parse tree from the root to the leaves by a depth-first traversal of nodes; each translation step is applied to a node and all its children.

Syntax directed translation techniques have very limited expressive power. The input and output grammars have to be very similar: they must contain non-terminals with identical names, and there must be a corresponding target production for each source production. Several extensions have been made to SDTSs in order to overcome these limitations. [Kuikka 96] extends SDTSs with the possibility of renaming the paired non-terminals and of adding or deleting some non-terminals. Because SDTSs and their extensions require that the input and output grammars be strictly paired, an SDT schema only works on one level of the source tree, and it is thus impossible to describe hierarchical changes such as introducing internal sub-structure or moving nodes up or down in the parse tree.

2.2.2 Tree transformation grammar (TT grammar) To overcome SDTS limitations, tree transformation grammars (TT grammars) were introduced. TT grammars extend SDTSs by allowing users to specify explicitly the associations between input and output grammar rules; this implies that an input non-terminal can be associated with an output non-terminal with a different name and at a different structural level. Formally, a TT grammar is a sextuple (Gs, Gt, Ss, St, PA, SA) where:

• Gs and Gt are the source and target grammars, both are context-free grammars,

• Ss and St are sets of source and target sub-grammars,


• PA is a set of production group associations, and

• SA is a set of symbol associations.

A production group association is a pair of a source sub-grammar Ss and a target sub-grammar St. A symbol association relates a symbol in a source sub-grammar to a symbol in a target sub-grammar. Consider the transformation example illustrated by Figure 2-a, where we want to transform the sub-grammar on the left side into the sub-grammar on the right side. Such a transformation can be formulated using a TT grammar as follows:

Gs:    Reference ::= (book | Journalarticle | Proceedingarticle)*
Ss[1]: book ::= title writer
       writer ::= FirstName LastName

Gt:    Bibliography ::= (author)*
St[1]: author ::= book name
       book ::= title
       name ::= FirstName LastName

PA: {(Ss[1], St[1])}
SA: {(Reference, Bibliography), (writer, author)}

[Figure: on the left, the source tree Reference → book → (title, writer → (FirstName, LastName)); on the right, the target tree Bibliography → author → (book → title, name → (FirstName, LastName)).]

Figure 2-a: Example of tree transformation
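Anticipating section 2.3.1, the same restructuring could be expressed in XSLT; the following is a minimal sketch, assuming the element names of Figure 2-a:

    <xsl:template match="Reference">
      <Bibliography>
        <xsl:apply-templates select="book"/>
      </Bibliography>
    </xsl:template>

    <xsl:template match="book">
      <author>
        <book><xsl:copy-of select="title"/></book>
        <name>
          <xsl:copy-of select="writer/FirstName"/>
          <xsl:copy-of select="writer/LastName"/>
        </name>
      </author>
    </xsl:template>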

The formalism of TT grammars has been widely used in the implementation of transformation generators for structured documents [Keller 84]. The authors of [Tirri 94] propose the ALCHEMIST tool, based on TT grammars, to express more powerful and complex structure changes. ALCHEMIST relies on context-free grammars for representation specification and is mainly used for producing transformations between two differently structured documents, without an explicit intermediate format between the representations. A transformation relies on three modules: a parser, a mapper and an unparser. The parser reads the source document and produces the corresponding parse tree; the mapper transforms the parse tree into a target parse tree; finally, the unparser writes the frontier of the target parse tree into the target document. TT grammars are more powerful than SDTSs; however, they cannot specify contextual

conditions, which are fundamental for expressing more complex structural transformations [Murata 98].

2.2.3 Tree pattern matching and replacement Transformations using tree pattern matching and replacement consist of two phases. The first phase, pattern matching, locates substructures by comparing them to a given pattern. The second phase, pattern replacement, takes a substructure and replaces it with a new one. Tree-to-tree transformations are generally specified as a set of pattern {action} rules, where each pattern matches a certain subtree and its corresponding {action} specifies the extraction or transformation of the match.

Scrimshaw [Arnon 93] extends to trees the familiar notions of regular expressions, pattern matching and pattern replacement for strings. It serves both as a structured document query language and as a language for expressing document transformations. A rule consists of a matching part and a construction part. The transformation process matches substructures in the parse tree and assigns some of the structure to variables, which are used in the output rules that decide how the matched pattern is replaced. Another transformation language based on tree pattern matching and replacement is TXL (Turing eXtender Language) [Cordy 90]. A TXL program takes as input a parse tree conforming to a given input grammar and transforms it into an output parse tree by applying the corresponding pattern matching rules. A basic pattern matching rule in TXL looks like:

    rule name
        replace [type]
            pattern
        by
            replacement
    end rule

where name is a rule identifier, type is the non-terminal designating the root of the input parse tree, pattern is a pattern which the input parse tree must match, and replacement is the result of the corresponding transformation. A TXL transformation consists of three submodules: the parser is based on the source language grammar but also takes notice of the differing target language features; the transformer transforms a parse tree according to the rules; and the deparser writes out the target program.


2.3 SGML and XML Transformation Languages

XML and SGML transformation languages are basically divided into two categories: event-based languages and tree-based languages. Both use a parser program to assist in transforming documents. In event-based languages, the parser returns a list of tokens found in the source (such as start and end tags or data elements), defining the source document structure for the transformation program, or responds via an API to the transformation program's requests concerning the events the parser encounters. Tree-based languages, instead, construct a parse tree of the source document, and the transformation program can navigate the document structure, simultaneously defining the transformation from the source document to the output document.

Examples of event-based transformation languages include OmniMark [Exo 93] and Cost [Harbo 93], which work as syntax directed translators. The Metamorphosis system [MID 95], instead, builds the parse tree of the SGML document, and the user specifies how each node in the parse tree should be modified. Transid [Jaakkola 97] is an SGML transformation language based on tree transformations; it requires a specification of the input representation, in that the input SGML document must have a DTD, but no output DTD is used. Transid in its first version contained the usual tree transformation operations and was later extended with string operations and regular expressions. Balise [Berger 96] provides event-based and tree-based transformations for both XML and SGML documents. For SGML documents, specific languages for transformation and formatting are defined in the Document Style Semantics and Specification Language (DSSSL) standard [ISO 91]. A transformation is defined by a set of transformation specifications and is carried out by generating a new structure in the output tree for every node of the input tree.

Several languages have been proposed for XML document transformation. XSLT is the best known XML transformation language. The XSLT 1.0 W3C recommendation [W3C 99b] was published in 1999 and has been widely implemented since then. In the following, we present the XSLT language in detail and give a brief overview of other XML transformation languages sharing the same capabilities. Such languages rely on several open source parsers8. Many of the open source parsers offer an API (application programming interface) conforming to the DOM or SAX APIs, which allows communication through these APIs independently of the programming language the parser was written in.

2.3.1 XSLT An XSLT program (also called a stylesheet) is a set of template rules, each of which has two parts: a pattern that is matched against nodes in a source tree, and a template that can be instantiated to form part of a result tree. Every XSLT stylesheet is a valid XML document, embedding XPath expressions as attribute values and text contents for addressing nodes. In order to illustrate XSLT functionalities, we consider the example

8 See for example W3C list of XML software available at http://www.w3c.org/XML/#software

of Figure 1 (cf. introduction), where we want to transform an instance of DTD 1 into an instance of DTD 2. The following summarizes the main features of the XPath and XSLT languages.

2.3.1.1 XPath XPath [W3C 99c] defines the basic addressing mechanism in XML documents, which is employed by most XML query languages. The expressions defined by XPath are called location paths. Every location path declaratively addresses a set of nodes in a given XML document. XPath is based on navigation through the XML tree by path expressions of the form /step/step/.../step. Formally, the input to every location step is a node set, called the context (the input to the first step is the set containing only the document node). From this set, a new node set (called the result set) is computed, which then serves as input for the next step. For this computation, the input node set is processed by evaluating the location step for every node in it, appending its result set to the overall result, and proceeding with the next node.

Every single step is of the form axis::nodetest[filter]*, which specifies that navigation proceeds along the given axis in the XML document. The axis specifies the tree relationship between the nodes selected by the location step and the current context node. Along the chosen axis, the nodetest specifies the node type and the name of the nodes to be selected. From these, the ones which satisfy the given filters (predicates over XPath expressions) are returned.
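The abbreviated notation used in the examples below expands to explicit steps of this form; for instance:

    //Writer            is short for   /descendant-or-self::node()/child::Writer
    Book[price > 100]   is short for   child::Book[child::price > 100]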

Examples of XPath expressions are given in the following:

- The absolute location path /Reference/Book//Writer/Firstname selects all Firstname children of Writer descendants of any Book child of any Reference child of the root.

- /Reference/Book//Writer/Firstname/text() selects the text contents of these elements.

- /Reference/Book[name = "book1"]//Writer/Firstname/text() and /Reference/Book[name/text() = "book1"]//Writer/Firstname/text() do the same, but only for Writers of "book1".

- //Book[price > 100]/title/text() selects the titles of all books whose prices exceed 100.

2.3.1.2 XSLT template declaration A template declaration in XSLT is of the form:

    <xsl:template match="match-expression">
      content
    </xsl:template>


where match-expression is an XPath expression that specifies the elements to which the template applies, and content is a sequence of XSLT elements containing the operations for transforming the selected nodes and inserting nodes and text contents into the result tree. Such operations include the textual writing of values, copying nodes and values from the input document, and generating new elements and attributes.

Example 2.1 (XSLT: copying nodes from the input tree)

This example illustrates a simple template which applies to journal and title elements and copies them unchanged to the result tree:

    <xsl:template match="journal|title">
      <xsl:copy-of select="."/>
    </xsl:template>

Example 2.2 (XSLT: creating a new node in the output tree)

This example illustrates a template which applies to writer elements in the input tree; it creates a new element name in the output tree whose value is the concatenation of the values of the input elements firstName and lastName. A white space is also added by means of a text element:

    <xsl:template match="writer">
      <name>
        <xsl:value-of select="firstName"/>
        <xsl:text> </xsl:text>
        <xsl:value-of select="lastName"/>
      </name>
    </xsl:template>
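Applied to a hypothetical input fragment, this template behaves as follows:

    Input:   <writer><firstName>Anna</firstName><lastName>Smith</lastName></writer>
    Output:  <name>Anna Smith</name>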

2.3.1.3 XSLT template application A template is applied by means of the xsl:apply-templates and xsl:call-template (used for named templates) elements in the content part of xsl:template elements:

Example 2.3 (XSLT: Template Application)

    <xsl:template match="book">
      <xsl:apply-templates select="title"/>
    </xsl:template>

In this example, we select all title subelements of the current context node (book) and apply the appropriate template to each of them. The template whose match attribute is title will be applied (the template of example 2.1, for instance).

Example 2.4 (XSLT: Named Templates)

    <xsl:template name="TemplateForTitle">
      <xsl:copy-of select="title"/>
    </xsl:template>

    <xsl:call-template name="TemplateForTitle"/>

In this example, we use a named template TemplateForTitle that copies the title elements of the input tree. The template is applied by means of xsl:call-template.

2.3.1.4 Generating the Output Tree The execution of an XSLT stylesheet on an XML instance starts by applying the template which matches the outermost element of the instance. Then, recursively, the processing is controlled by xsl:apply-templates and xsl:call-template, where the template executions contribute to the result tree generation:

• Elements and attributes in the content of templates are added to the result tree.

• Nodes are copied from the input tree to the output tree by means of <xsl:copy-of select="expression"/>, and literal values are copied by <xsl:value-of select="expression"/>.

• Attribute values can be assigned by specifying an XPath expression: <xsl:attribute name="name"><xsl:value-of select="expression"/></xsl:attribute>

• New elements can be generated by specifying XPath expressions:

    <xsl:element name="name"> content </xsl:element>


2.3.1.5 XSLT: further notions

2.3.1.5.1 Branching elements of XSLT Three XSLT elements are used for branching: xsl:if, xsl:choose and xsl:for-each. The first two are much like if and case statements in programming languages. The xsl:for-each element serves to select a set of nodes and do some processing for each of them. Conditional execution is provided by tests or case-splits having the form:

    <xsl:if test="expression"> contents </xsl:if>

    <xsl:choose>
      <xsl:when test="expression1"> contents1 </xsl:when>
      <xsl:when test="expression2"> contents2 </xsl:when>
      ...
      <xsl:otherwise> contentsn+1 </xsl:otherwise>
    </xsl:choose>

2.3.1.5.2 XSLT Variables The xsl:variable element allows users to store a value and associate it with a name. It can be used in three ways. The simplest form creates a new variable whose value is an empty string: <xsl:variable name="x"/>. A variable value can also be given by the select attribute: <xsl:variable name="x" select="expression"/>. The third way is to put content inside the element: <xsl:variable name="x"> contents </xsl:variable>. Variables are then used either in attribute value templates of the form "{$x}" for defining attribute values, or by elements of the form <xsl:value-of select="$x"/>, which converts the variable's value into a string, or by <xsl:copy-of select="$x"/>, which adds the variable contents to the result tree.

2.3.2 XML transformation languages Besides XSLT, the document and functional programming communities have proposed several XML transformation languages sharing the same computational capabilities as XSLT.

2.3.2.1 Streaming Transformations for XML (STX) STX [STX 02], [Becker 02] is a transformation language designed for transformations that preserve the order of the incoming data, such as renaming elements and/or attributes, providing a view or a subset of the data, and local transformations where the structural changes are confined to a small amount of data. The idea behind STX was born from the insight that XSLT may be too complex for such transformation tasks. The STX initiative began in February 2002. At the moment, two


open source implementations for STX are available: the Perl-based processor XML::STX9 and the Java-based processor Joost10.

An STX program contains the element stx:transform, which encloses a set of templates. As in XSLT, a template must have a match attribute with an associated pattern. The set of allowed patterns is almost the same as in XSLT, and many STX instructions were adopted directly from XSLT, for example stx:param, stx:value-of, stx:if, stx:choose, stx:element, stx:attribute, and others.

STX reuses many XSLT features; however, the main difference between XSLT and STX is that the latter is an event-based transformation language that works on a stream of SAX events, while XSLT is a tree-based language. Thus the order of instructions in STX is determined by the input (especially the document order of the input nodes), while an XSLT transformation may process a set of nodes from the input tree in parallel. Another difference is that STX uses the stx:process-children instruction as the counterpart to XSLT's xsl:apply-templates. Since STX processes its input as a stream of SAX events, the transformation must run in document order, and the ability of xsl:apply-templates to select arbitrary nodes by means of an additional select attribute cannot be carried over. The STX language also introduces the notion of changeable variables, a well-known concept in procedural programming languages: while XSLT is functional and stateless, STX maintains state and works in a more imperative way.
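A minimal STX sketch, renaming writer elements to author while streaming, might look as follows (element names invented; the namespace URI is the one used by the STX working draft and is given here as an assumption):

    <stx:transform xmlns:stx="http://stx.sourceforge.net/2002/ns" version="1.0">
      <stx:template match="writer">
        <author>
          <stx:process-children/>
        </author>
      </stx:template>
    </stx:transform>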

2.3.2.2 HaXML HaXML [HaXML] brings modern functional languages to XML processing. The system provides two approaches to transforming XML documents. The first offers a combinator library for generating, selecting and transforming XML documents in a generic setting; the syntax used is plain Haskell. HaXML uses the notion of filters for selecting matching nodes. A filter takes a fragment of the content of an XML document and returns a computed sequence of content. Two kinds of filters are provided: selection filters and construction filters.

Selection filters select parts of documents and are thus similar to the processing model of XPath. Filters can be thought of as predicates; examples of such selection filters are elm, which returns just the item if it is an element and nothing otherwise; children, which returns the children of an element if it has any; and attval (a,v), which returns the item only if it is an element containing the attribute a with the value v.

Construction filters produce the output document. A construction filter is built from a set of functions, including the function literal s, which produces a text content containing just the string s. The function mkElem t fs builds a content element with the tag t; the argument fs is a list of filters, each of which is applied to the current item, and all their results are collected as children of the new element. The function mkElemAttr t avs fs is

9 http://stx.gingerall.cz/stx/xml-stx/
10 http://joost.sourceforge.net/


just like mkElem, except that its extra parameter avs is a list of attribute values to be attached to the tag.

Because Haskell syntax is not well suited to describing XML fragments, in the second approach DTDs are translated into datatypes of the functional language. The advantage is that the static type checking of the language can be used for the validation of XML documents. This approach can be taken only when the DTDs of the input and the output are known and a tool for translating DTDs into datatypes is provided. Furthermore, no generic support for XML processing can be provided in this approach. A language using the DTDs-to-datatypes approach is XMLambda [Meijer 99], a small functional language that treats XML documents as its basic datatypes.

2.3.2.3 XDuce XDuce (pronounced "transduce") [Hosoya 00] is a tree transformation language, similar in spirit to mainstream functional languages, for XML documents. Values in XDuce are XML documents and can be specified either in XML syntax or in a native syntax. XDuce introduces features like regular expression pattern matching. Regular expression types, used as a generalization of DTDs, describe structures in XML documents using regular expression operators (i.e., *, ?, |, etc.). Pattern matching in XDuce is similar to ML's (or Haskell's) but somewhat more powerful, since it includes the use of regular expression types to dynamically match values of those types. However, there is no further support for specifying the context and the structure of a match: it is not possible, for example, to match trees of variable depth as permitted by // in XPath.

The XDuce design is finished, but extensions adding standard features from functional programming, such as higher-order functions, as well as support for the object-oriented features found in the XML Schema specification, are under investigation.

2.4 Summary and discussion

Table 2.1 summarises the transformation systems described in this chapter. The method column describes the transformation method used; transformation methods are usually based on a formal technique such as syntax-directed translation (SDT), TT grammars (TT) or tree pattern matching and replacement (Patt). Given source and target grammars, the generation column indicates whether the transformation is constructed automatically (auto), manually (man) or semi-automatically (semi) with some user intervention. Table 2.2 summarises the SGML and XML transformation languages: one column specifies whether the transformation is event-based or tree-based, and the other specifies whether the transformation language is intended for SGML and/or XML documents.

As can be seen from these two summary tables, several transformation languages and systems have been proposed in the context of structured documents. Techniques

such as syntax-directed translation and its extended form, the TT (tree transformation) grammar, have been borrowed from source code transformation in various programming languages to specify structured document transformations. These languages and systems are descriptive but have limited expressive power for specifying complex structure transformations. As an alternative, tree pattern matching and replacement techniques were used and gave rise to more than one functional programming language with powerful computational capabilities, such as XSLT.

However, such languages are usually complex, and even simple transformations require the user to write a program, which demands non-trivial programming skills. Such languages are not appropriate in situations where many data transformations are needed or where transformations have to be generated frequently. For this reason, several solutions to either simplify or automate the transformation process have been proposed; section 2.5 summarizes them.

System/Language    Transformation Method    Transformation Generation

HST                SDT                      Semi
ICA                SDT                      Semi
ALCHEMIST          TT                       Man
Scrimshaw          Patt                     Man

Table 2.1: Properties of some transformation systems


Language         Transformation Method    Markup Language

OmniMark         Event                    SGML
Cost             Event                    SGML
Metamorphosis    Tree                     SGML
Transid          Tree                     SGML
Balise           Tree/Event               XML/SGML
DSSSL            Tree                     SGML
XSLT             Tree                     XML
STX              Event                    XML
HaXML            Tree                     XML
XDuce            Tree                     XML

Table 2.2: Properties of some XML/SGML transformation languages

2.5 Automating Document Transformations

Currently, to perform document transformations, the burden falls on the human to first analyse both the structure and the semantics of the source and target documents, and then to manually code the transformations. Many solutions have been proposed to simplify and automate structured document transformations. Approaches can be distinguished along two dimensions: declarative transformation specification and schema matching.

2.5.1 Declarative Transformation Specification Languages Faced with the complexity of current structure transformation languages, several simpler and highly declarative transformation languages have been introduced. These languages try to keep a manageable balance between complexity and expressiveness. They are in general extensions of SDTs or TT grammars designed to overcome their limitations in expressing hierarchical changes. The authors of [Tang 01] propose a new language called Paired SynTrees, which extends TT grammars with XPath expressions and a set of Boolean conditions (including existence-testing expressions and function

constraints) in order to localize nodes in a tree. Special graphical tools have also been proposed to assist in the specification of transformations [Pietriga 01], [XSLWIZ 01]. [Vernet 02] gives an overview of existing transformation languages and tools. These languages and tools are very useful for describing and specifying transformations. However, they still require developers to manually indicate mappings for each source and target pair.

2.5.2 Schema Matching Schema matching is the task of semi-automatically finding correspondences between two heterogeneous schemas. Research on XML transformations specifically, and on data translation in general, has mostly focused on translation languages rather than on automating the creation of the transformation programs expressed in these languages [Abiteboul 97a], [Atzeni 97], [Cluet 98], [Davidson 97], [Hull 90]. In addition, several works address translating data between data models (essentially from the XML model to the relational model and vice versa). These works create a translated schema and do not consider translation into a fixed target that may not match (or may only partially match) the source.

In this section, we describe the few available systems that use a schema matching approach to automate data translation. The TranScm system [Milo 98] was one of the first systems to consider such issues. It identifies matches (or mappings) between schemas based on a set of predefined rules describing how to match schema elements. The matching is performed node by node, starting at the root, and rules are checked in a fixed order based on their priorities. However, since TranScm is designed as a generic approach, issues such as which XML-specific matching rules should be added to the rule base, and how priorities should be assigned to such rules, would need to be solved before the approach could be applied in the XML context. The Clio system [Miller 01] was proposed to automate data translation. It presents a novel framework for mapping between XML and relational schemas, in which a mapping is translated into meaningful queries that transform source data into the target representation. The Clio system requires considerable user intervention and does not exploit all the structural features of XML.

More specific work focusing on automating XML document transformations has been proposed recently. Examples include the work in [Leinonen 03], where the authors propose a syntax-directed approach for automating transformations between two grammars based on a finite-state tree transducer. The idea behind this work is to generate a transformation semi-automatically once the user defines a matching between the elements containing the text of the document (i.e., the leaves). The transformation occurs in two phases. In the first, the correspondence between the labels of interior non-terminals, called label association, is determined given the label association between the labels of terminals. The concept of label association is used to capture semantic similarity between the corresponding structure elements of two documents. The label association alone does not determine transformations, but it helps to find matching substructures between the source and target grammars. In the second phase, source substructures are matched


with target substructures. The number of matching target substructures is restricted using heuristic functions, whose parameters are updated as associations and subtrees are generated. In cases where several matching target substructures are found, the user is invited to choose one among the candidates. After the user has selected the corresponding pairs, the system can automatically generate an XSLT transformation script.

This approach presents several limitations. First, it works only if the two grammars have common parts, which restricts its scope to local transformations; thus it does not cover most real-world examples. Moreover, this approach is unable to resolve all the heterogeneities that may occur between structured documents: the authors restrict themselves to transformations in which certain types of structure elements in the source document are transformed into the same types of elements in the target document. For example, a list of repeating nodes in the source document can only be transformed into a list that contains the same number of repeating elements.

The authors of [Su 01] propose an approach for automating the transformation of XML documents in which they focus on two fundamental problems. First, they address the problem of how to automate the identification of semantic relationships between XML documents. To this end, they define a set of DTD transformation operations that establish the semantic relationships between two DTDs. The approach is based on a tree matching algorithm, called DMatch (DTD Match), which automatically discovers a sequence of operations (i.e., a transformation script) that transforms a source DTD into a target DTD. The matching process is based on provided auxiliary semantic information (synonym dictionary, domain ontology, etc.) and on a cost model based on heuristic functions for choosing a transformation script among multiple alternatives. The resulting transformation script is then used to automatically generate an XSLT script.

This approach also presents several limitations. First, the matching algorithm is only able to discover one-to-one correspondences between DTDs and does not deal with many-to-many matches. Second, the matching algorithm requires additional information (beyond the DTDs) to work correctly, which limits the scope of its application, since such semantic information is not always available. Finally, the matching algorithm is inspired by work on tree matching and is unable to deal with the current XML Schema model.

2.6 Conclusion

Transforming a document that conforms to a source structure into an instance document that conforms to a target structure is a challenging problem. Users first have to manually establish semantic correspondences between the source and target schemas, and then manually code such correspondences in a transformation language such as XSLT. We believe that schema matching techniques are a suitable solution for simplifying and automating this process. The schema matching algorithms developed in the

literature for automating structure transformations are very limited. In parallel to these algorithms, the database and artificial intelligence communities have widely considered the schema matching problem in many application domains, such as data integration, peer-to-peer data management, data translation and data warehousing. Several schema matching algorithms and tools have been proposed (either generic or domain-specific). Recently, the XML schema matching problem has been a major focus of several researchers. In chapter 3, we give an overview of such studies and examine their applicability in the context of XML data transformations.


Chapter 3

XML Schema Matching

As mentioned in chapter 2, some work has been done on automating the schema matching process. This chapter draws a comparative study of existing schema matching algorithms and points out the limitations encountered when they are applied to the structured document transformation problem.

3.1 Schema Matching: Complications and Challenges

Schema matching is a schema manipulation process that takes as input two heterogeneous schemas and possibly some auxiliary information, and returns a set of dependencies, so-called mappings, that identify semantically related schema elements [Rahm 01a], [Rahm 01b]. In practice, schema matching is done manually by domain experts, usually with the support of a graphical tool [Miller 00], [Nam 02], [Mapforce 04], and it is time consuming and error prone. As a result, much effort has been devoted to automating the schema matching process. This is challenging for several fundamental reasons:

• According to [Drew 93], schema elements are matched based on their semantics. Semantics can be embodied in a few information sources, including designers, documentation, schemas and data instances. It is widely accepted that it is difficult for the schema designer, when available, to remember all schema details, and generally available documentation tends to be incorrect, outdated or inaccessible. Hence, the schema matching process typically relies on purely syntactic clues in the schema and data instances [Doan 00].

• Schemas developed for different applications are heterogeneous in nature i.e. although the data they describe are semantically similar, the structure and the employed syntax may differ significantly.

• To resolve schematic and semantic conflicts, schema matching often relies on element names, element datatypes, structure definitions, integrity constraints,


and data values. However, such clues are often unreliable and incomplete. For example, the same labels may be used for schema elements having totally different meanings, while two elements with different labels can refer to the same real-world entity. Datatypes are often imprecise (e.g., the use of type "string" instead of "date"), and constraints are often incomplete. Under such conditions, the main challenge is not only to determine the existing relations between schema elements, but also to make sure that the matching process does not discover incorrect mappings.

• In its simplest form, a mapping is one-to-one: it binds a source schema element directly to a target schema element. This simple mapping form, however, is rarely sufficient. In practice, a single element in one schema may correspond to multiple elements in another schema, obtained by applying one or several operations (e.g., a target element "name" corresponds to the concatenation of two source elements "firstName" and "lastName"), and multiple elements in one schema may even correspond to multiple elements in another. Such mappings are called complex or indirect and are usually infeasible to infer from just the schema and instances, generally requiring human intervention. Discovering complex matchings is challenging because the matching process must not only find such mappings but also identify the required operations, such as the concatenation of two elements or the merging or splitting of data values.

• Because schema matching cannot be fully automated and thus requires user intervention, it is important that the matching process not only do as much as possible automatically, but also identify when user input is necessary and exploit it maximally.

3.2 Application Domains

Schema matching appears as a critical step in several applications where (1) the data they manipulate are structured under specified models (such as relational schemas, object-oriented schemas, XML DTDs, XML Schema, etc.) and (2) the schemas they employ are heterogeneous. Schema matching enables schema manipulation [Bernstein 00a], [Andritsos 02], data translation and query answering across heterogeneous schemas.

Several applications relying on schema matching have arisen and have been widely studied by the database and AI communities [Rahm 01b]. Examples include schema and data integration, data warehousing, data translation, peer-to-peer data management and ontology integration. These applications play important roles in emerging domains such as e-commerce and bioinformatics [Mork 01], [UDB].


3.2.1 Schema Integration Most research on schema matching has been motivated by the problem of schema integration, which refers to merging autonomous and heterogeneous schemas into a global, so-called mediated, schema. The schema integration problem has been a major focus of the database community over the last two decades [Parent 98], [Elmagarmid 90], [Sheth 90]. Since schemas are autonomous and independently developed, they are often heterogeneous, presenting different terminologies and structures. The integration process requires identifying inter-schema dependencies [Batini 86], which is a schema matching process. Once they are identified, matching elements are unified under a coherent mediated schema.

3.2.2 Data integration Data integration systems aim to provide the user with a uniform query interface over a multitude of data sources [Cali 02], [Halevey 01], [Ives 99], [Abiteboul 99], [Bergamaschi 99]. Two main components constitute the architecture of a data integration system: wrappers and mediators. A wrapper wraps an information source and models it using a source schema. A mediator maintains a global schema and mappings between the global schema and the source schemas. Whenever a user poses a query against the global schema, the mediator uses these mappings to reformulate the global query into a set of sub-queries that can be executed against the source schemas, so that the mediator can collect the returned answers from the sources and combine them into the answer to the query.

A critical problem in building a data integration system, therefore, is to correctly provide the mappings between the global and source schemas. Currently, there are two main approaches to supplying such mappings: Global As View (GAV) [Molina 97], [Papakonstantinou 96], [Adali 96] and Local As View (LAV) [Levy 96], [Duschka 97]. In the first approach (GAV), the mediated schema is defined in terms of the sources' schemas, while in the LAV approach the descriptions of the sources are given in the opposite direction. The main advantage of the GAV approach is that query reformulation is very simple; however, adding sources to the mediated schema is non-trivial. In contrast, in the LAV approach, query reformulation is harder, but adding new sources is quite straightforward. [Friedman 99] and [Lenzerini 02] suggest the GLAV approach, which combines the expressive power of GAV and LAV to integrate heterogeneous data sources.

3.2.3 Data warehousing

A variation of the data integration problem is that of materializing integrated data sources into a centralized repository, called a data warehouse [Stohr 99], [Rodriguez 01]. This process requires transforming data from the source format into the data warehouse format. The authors of [Bernstein 00b] show that schema matching is useful for performing such transformations: given a data source, one appropriate method for creating the transformation is to identify those elements of the source that are present in the data warehouse. The main advantage of data warehousing is the performance in answering queries (since queries are applied directly to the warehoused data). However, it requires the warehouse to be updated when data changes, which is not appropriate when handling a large number of sources or when the sources change frequently. A framework supporting integrated views through a combination of virtual and data warehousing approaches is proposed in [Hull 96].

3.2.4 Data Transformation

In recent years, the explosive growth of online information in distinct heterogeneous sources, stored under different formats, has given rise to another application requiring schema matching: data exchange. Data exchange, often called data translation or transformation, is the problem of translating the contents of a data source into an instance of a target schema that reflects the source data as accurately as possible. Although the data exchange problem presents several similarities with the data integration problem, there are important differences between the two [Fagin 03]. In data exchange scenarios, the target schema is generally created independently and has its own constraints, whereas in data integration the global schema is a reconciled, virtual schema with no predefined constraints. Another significant difference is that in data exchange we have to materialize a target instance that best reflects the given source instance, while in data integration no exchange of data is required. Automating the data translation process requires schema matching in order to detect similarities between the source and target schemas, which is a critical step in producing an appropriate translation program [Milo 98], [Popa 02], [Su 01].

3.2.5 Peer-to-Peer data management

An important application of schema matching is peer data management, which is a natural extension of data integration. In contrast to a data integration environment, a peer data management system does not rely on the notion of a mediated schema, but allows an arbitrary number of peers to query and retrieve data directly from each other [Nejdl 02]. To enable information processing and content retrieval among a multitude of autonomous peers, appropriate matching techniques are required to determine mappings between semantically related concepts of different peers [Halevy 03a], [Halevy 03b], [Bouquet 03].

3.2.6 Ontology Matching

An ontology is defined as a conceptualization of a domain in terms of concepts and relations. With the advent of the Semantic Web vision, ontologies are recognized as an essential tool for allowing knowledge sharing among distributed and heterogeneous applications, and much work has been done around ontologies, covering the entire ontology life cycle from design to deployment and reuse. The Semantic Web vision envisages the current Web enriched with several domain ontologies, which specify the meaning of data [Heflin 01], [Berners-Lee 01]. In order to make the Web interoperable, appropriate matching techniques are required between autonomous ontologies to determine semantic mappings between semantically related concepts of different ontologies [Aberer 03], [Doan 02a], [Castano 03], [Silva 02].


3.3 Schema Matching Solutions

Many solutions to the schema matching problem have been developed. In this section, we review these solutions and describe their different features.

3.3.1 Learner based solutions

Machine learning is defined as the ability of a machine to exploit previously obtained results in order to improve its performance, especially for automating time-consuming and expensive processes. Recently, several works have employed machine learning techniques to perform schema matching. Examples include [Li 00], [Clifton 97], [Berlin 01], [Berlin 02] in databases and [Noy 01], [Ryutaro 01], [Lacher 01] in Artificial Intelligence. Schema matching tools based on machine learning generally consist of a number of modules, called learners, together with a specific module, the meta-learner, that coordinates them. Each learner exploits a different kind of information present in structure definitions and/or in data sources. Once the learners have been trained, the matching tool is able to find mappings for a new data source by applying the learners and then combining their results using the meta-learner. The SemInt system [Li 00] uses a neural-network learning approach for matching schema elements based on both element characteristics and statistics over data content. The LSD system [Doan 01] applies a meta-learning strategy to compose several base matchers, which consider either data instances or schema information.

3.3.2 Rule-based solutions

The majority of current schema matching tools employ rules to match heterogeneous schemas. Examples include [Palopoli 98], [Castano 99], [Madhavan 01], [Melnik 02] in databases and [Chalupsky 00], [McGuinness 00], [Noy 00] in Artificial Intelligence. Rule-based solutions exploit a variety of schema information such as element names, data types, element constraints, and structure hierarchy. The CUPID system [Madhavan 01], the Similarity Flooding (SF) system [Melnik 02] and the ARTEMIS system [Castano 99] employ rules that compute the similarity between two schemas as a weighted sum of the similarities of element names, data types, and structural positions.

3.3.3 Metadata based solutions

Another solution to the schema matching problem found in the literature is the metadata based solution. Metadata based solutions tend to exploit additional semantic information, generally provided by domain experts, such as RDF metadata or domain ontologies. In general, such a solution tries to automatically map all fields of each data source (or schema) to a predefined domain ontology and then performs schema matching at the ontology level. This approach does not solve the problem; it merely shifts it from mapping data sources and schemas to mapping ontologies.


3.3.4 Learner based solutions Vs Rule based solutions

Rule-based solutions are generally inexpensive and fairly fast, since they do not require a training phase and typically operate only on schemas, ignoring data instances. Rule-based solutions also seem well suited to capturing user knowledge about the domain. On the other hand, they cannot exploit instance information effectively and have serious problems with schema elements for which no effective rules can be found; heuristics for such cases must be provided.

Learning based solutions require abundant and well-chosen training examples. Furthermore, they cannot work correctly if an initial mapping (generally provided by the user) and a representative set of data sources are not available, which considerably limits the applicability of such methods. Since learning based solutions rely essentially on data sources and on mappings between the leaf nodes of two trees, they do not really exploit hierarchical structure, which is essential in the context of structured documents. Finally, many learner-based approaches employ only a single learner, and thus have limited accuracy and applicability.

3.4 Matching Methods

Whether rule-based or learner-based solutions are employed, all matching tools try to exploit element names, datatypes, constraints and structure definitions. Based on the research projects carried out in the field of schema matching, and on the surveys by Larson et al. [Larson 89] and Rahm & Bernstein [Rahm 01a], we classify the most effective schema matching methods into three categories: terminological matching, constraint-based matching and structural matching. A schema matching tool generally uses a combination of such methods to run the matching process.

3.4.1 Terminological matching

Terminological matching relies essentially on the names of schema components in order to find matches between schemas. To work, terminological matching generally requires that descriptive names be employed. As mentioned in the Rahm & Bernstein survey, similarity of names is measured in various ways, including the equality of names, the equality of canonical name representations after stemming and other processing, similarity of names based on common substrings, edit distance and soundex. Similarity of names is also measured based on synonymy, hypernymy, hyponymy and holonymy relations between words; such relations are obtained by using dictionaries and thesauri.

Results given by terminological matching often cannot be trusted, for several reasons. For example, two elements with the same name, say title, may refer to completely different things, such as a book's title and a journal article's title respectively. Conversely, based on common substrings and soundex, elements such as university_address and university_departments can be wrongly matched. Moreover, terminological matching has to deal with special cases such as the use of abbreviations and acronyms, names containing prepositions, etc. Despite these problems, terminological matching is easy to implement and can provide an initial mapping that may be confirmed or rejected by other matching methods.

3.4.2 Constraint-based matching

Schema constraints (if they are properly set) can give an interesting clue about how schema elements should be matched and thus reduce the number of candidate matches (obtained, for example, from terminological matching) [Larson 89]. Schema constraints include data types, value ranges, uniqueness and integrity constraints, etc. A problem that might arise with using datatypes is that they may not be specific enough (for example, a field that should have been declared as an integer is declared as a string). Examples of systems using datatype compatibility include Cupid [Madhavan 01] and Artemis [Castano 99]. The XML Schema recommendation provides a set of refined datatypes; from this perspective, datatypes can help produce candidate matches. Comparison of numerical value ranges and/or character patterns can also be very useful in deriving candidate matches, for instance by allowing phone numbers, ZIP codes, and addresses to be identified. The SemInt system, described in [Li 00], uses such features to draw a mapping between relational schemas. Uniqueness and integrity constraints have also been used to help the matching process; an example of a matching system relying on such clues is described in [Popa 02].
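For instance, the facets of two simple type definitions may alone suggest a candidate match between differently named elements; a sketch (the type names and pattern facets are illustrative):

  <!-- source schema: the pattern facet suggests a phone number -->
  <xs:simpleType name="tel">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{3}-[0-9]{4}"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- target schema: a comparable pattern makes (tel, phoneNumber) a candidate match -->
  <xs:simpleType name="phoneNumber">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{3}-[0-9]{4}"/>
    </xs:restriction>
  </xs:simpleType>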

3.4.3 Structural matching

Structural information (especially hierarchical structure) is very useful in drawing conclusions on the semantics of schema elements. Structural similarity is a measure of the similarity of the contexts in which elements occur in the two schemas being compared. The structural similarity in the SemMa system [Sun 03], a tool for matching relational schemas, consists of table name similarity and the sum of the similarities of the fields in the table. While extensive literature is available on terminological and constraint-based matching, and these methods are largely applicable to XML DTDs and XML Schemas, up to now very few projects have considered XML structure in their schema matching methods. The fundamental reason is that most schema matching systems were developed by the database community and thus deal essentially with relational schemas; relational databases have very little structural data, in contrast with XML data. Most work on relational schema structural matching proceeds by following foreign keys in order to find relationships between the data.

Some recent projects have begun to focus on how to exploit XML structures within the matching process. The Xyleme project [Cobena 02] computes the structural similarity between two versions of the same XML document based on ancestor and descendant relationships between nodes. Xyleme is a tool for detecting version changes between XML documents; it therefore assumes that the source and target documents are highly similar, an assumption that does not hold in the schema matching process.


3.5 XML Schema matching

With the growing use of XML, several matching tools take the hierarchical structure of XML into consideration. However, most of them do not fully exploit XML structure and deal essentially with DTDs. In the following, we present some examples of recent schema matching algorithms that incorporate XML structural matching.

3.5.1 Cupid (Microsoft Research)

Cupid is a hybrid matcher combining several matching methods [Madhavan 01]. It is intended to be generic across data models and has been applied to XML and relational data sources. Cupid is based on schema comparison without the use of instances; the matching process has three phases. The first phase is terminological matching. Element names are first parsed into tokens based on delimiters and expanded in order to identify abbreviations and acronyms, using a thesaurus with both common language and domain-specific references. Cupid then clusters schema elements into categories based on their datatypes and linguistic content, in order to reduce the number of one-to-one comparisons of schema elements (only schema elements that belong to similar categories in the two schemas are compared). Cupid then calculates a terminological similarity coefficient between datatype- and linguistic-content-compatible pairs, based on substring matching and a thesaurus with synonymy and hypernymy relationships. The result of this phase is a table of terminological similarity coefficients in the range [0,1] between the elements of the two schemas.

The second phase transforms the original schemas into trees and then performs a bottom-up structure matching. The basic assumption behind the structure matching phase of Cupid is that much of the information content is represented in the leaves, and that leaves show less variation between schemas than internal structures. The elements of the two trees being compared are enumerated in post-order. The similarity of internal nodes is based on their terminological similarity and the similarity of their leaf sets. If the similarity exceeds a threshold, the leaf set similarity is incremented, the rationale being that leaves with highly similar ancestors occur in similar contexts. The structural similarity computation thus has a mutually recursive flavour: two elements are similar if their respective leaf sets are similar, and the similarity of leaves is influenced by the similarity of intermediate nodes. Phase two concludes by calculating a weighted mean of the terminological and structural similarities. The third phase decides on the final mapping between schema elements based on the coefficients obtained in phase two. This phase is regarded as application dependent and is thus not emphasized in the algorithm.

Schema structure in Cupid is used as a matching constraint, that is, the more similar the structures of two nodes, the more similar the two nodes. For this reason, Cupid faces problems in the case of equivalent concepts occurring in completely different structures, and of completely independent concepts that belong to isomorphic structures. Despite these problems, Cupid is the only notable matching algorithm for XML data that has gone beyond the classical tree representation to handle XML schemas (or any rooted graph) with shared types and referential constraints. The idea is based on the conversion of such a rooted graph into a tree, for two reasons: to reuse the structure matching algorithm for trees, and to cope with context-dependent mappings. In the case of shared types, for each element Cupid adds a schema tree node whose successors are the nodes corresponding to elements reachable via any number of IsTypeOf relationships followed by a single containment. Trees are also augmented with nodes that model referential constraints, which are interpreted as potential join views: for each foreign key, a new node representing the join of the participating elements (tables in relational schemas) is introduced. Despite these extensions, Cupid does not exploit all XML Schema features, such as substitution groups, abstract types, etc., that could give significant clues for solving the XML schema matching problem.

3.5.2 Learning source description (Univ. of Washington)

The LSD (Learning Source Description) system [Doan 01] uses machine-learning techniques to match a new data source against a previously defined global schema. LSD is based on the combination of several match results obtained by independent learners. Those learners are trained during a pre-processing phase. Given an initial mapping (generally provided by the user) between a given data source and the global schema, the pre-processing phase analyses instances from that data source to train the different learners, discovering characteristic instance patterns and matching rules. These patterns and rules are then used to match new data sources to the global schema. None of the base learners (name learner, content learner, etc.) can handle the hierarchical structure of XML, because they "flatten out" all structure in each input instance and use as tokens only the words in the instance. For this reason, LSD introduced an XML-specific learner to help improve its matching results. It is similar to a Naive-Bayes learner in that it uses word frequencies to find possible matches, but it considers not only element names as tokens but also the edges between elements, thus manipulating two types of tokens: node tokens and edge tokens. Each non-root node in the tree forms a node token, and each node with its child node forms an edge token.

Even though this helps improve the results, the approach presents several limitations, since it does not fully exploit XML structure. For example, no distinction is made between elements and attributes (attributes are considered part of an element, so they are not treated at all). Besides, the only structural relationship considered within the LSD system is the parent-child relationship, which is not sufficient to describe the context of an element and limits the scope of the system to similar structures with the same depth. More structural information has to be taken into account, such as siblings, ancestor-descendant relationships, integrity constraints, etc.

3.5.3 Similarity Flooding (Stanford Univ. and Univ. of Leipzig)

In [Melnik 02], the authors present a structure matching algorithm called Similarity Flooding (SF). The SF algorithm is implemented as part of a generic schema manipulation tool that supports, in addition to the structural SF matcher, a name matcher, schema converters and a number of filters for choosing the best match candidates from the list of ranked map pairs returned by the SF algorithm. A typical matching process starts by converting the two input schemas into a labelled graph representation. The name matcher is then used to provide an initial mapping, which is fed to the structural SF matcher. Finally, various filters are applied to select the relevant mappings.

The structural matching of SF is based on the following idea: first, the schemas to be matched are converted into directed labelled graphs; second, an iterative fixpoint computation is used to determine the match candidates between the nodes of the graphs. For computing node similarities, SF relies on the intuition that elements of two distinct graphs are similar when their adjacent nodes are similar. The spreading of similarities through the matched models is reminiscent of the way IP packets flood a network in broadcast communication. As a first step, a similarity propagation graph is constructed: an auxiliary data structure derived from the two schemas being matched, say A and B. To derive the similarity propagation graph, a pairwise connectivity graph is computed from A and B. Each element of the connectivity graph is an element, called a map pair, from A×B. Two map pairs, for example (a, b) and (a1, b1), are connected based on the intuition that if a and b are similar, then their immediate descendants a1 and b1 are probably also similar. The pairwise connectivity graph is then used to propagate similarities between nodes through an iterative process, in which at every iteration the similarity of a map pair is incremented by the similarities of its neighbours. The iteration process is stopped after a fixed maximal number of iterations.

SF has been applied to several input formats, including relational, XML and RDF. The algorithm works for directed labelled graphs only; it degrades when the labelling is uniform or undirected. An important assumption behind the algorithm is that adjacency contributes to similarity propagation; thus, the algorithm will perform unexpectedly in cases where adjacency information is not preserved. As in the LSD system, the only structural information used is the parent-child relationship; attributes and ancestor-descendant relationships are ignored. Furthermore, SF ignores all types of constraints while performing structural matching: constraints like typing and integrity constraints are used at the end of the process to filter mapping pairs with the help of the user.

3.5.4 Clio (IBM Almaden and Univ. of Toronto)

In [Miller 01], the authors present the Clio tool for mapping among XML and relational schemas, in which a high-level mapping is translated into queries that transform source data into the target representation. The Clio system works in two phases. In the first phase, a high-level mapping (a set of attribute-to-attribute correspondences) is converted into a logical mapping that captures the schemas' design choices (for example, the hierarchical structure in the case of XML schemas). The second phase translates this logical mapping into a meaningful query that produces data satisfying the constraints and structure of the target schema.

The high-level mapping could be obtained from a terminological matcher comparing element names. The logical mapping calculation is performed in two steps. First, the logical access paths that exist in each schema are enumerated. Such path definitions are based essentially on two kinds of relationships: key and foreign key definitions, and parent-child relationships. To do this, the classical relational chase method is used to compute all the joins in nested schemas and consequently enumerate all logical relations in such schemas. Second, a mapping algorithm looks at each pair of source logical relation and target logical relation, and computes for each pair a source-to-target dependency. The computation of the dependency is driven by all the correspondences that are relevant for the given source and target logical relations. The Clio tool is efficient in discovering all types of matches, even complex joins involving keys, but it requires a lot of user intervention (confirmation whenever a match pair is detected). Moreover, only the parent-child relationship is used within the mapping process, and there are no details about the use of features like typing.

3.6 Conclusion

As we have seen in section 3.5, several schema matching algorithms have begun to focus on XML structures. Table 3.1 summarizes the main characteristics of these algorithms. While terminological and constraint-based matching techniques are widely developed and applicable for matching XML schemas, the proposed structural matching methods remain insufficient and very limited (they deal only with DTDs and exploit only parent-child relationships). Convinced that the structural organisation of XML documents conveys some of the data's semantics and represents the designer's point of view, we argue that a solution to the XML schema matching problem should exploit this information in a manner that increases matching accuracy. In XML, in addition to keys, the context of an element is given in simple cases by its position in the document tree: its sequence relationships and its containment relationships with other elements. Additional relationships can also participate in the definition of this context; a simple and common case is the use of mechanisms like inheritance and substitution groups (in the case of XML Schema). Such information helps considerably in drawing conclusions about match candidates. The basic challenge here is to identify the structural features useful for solving the XML schema matching problem.

By analysing the existing schema matching algorithms, we also notice that they lack a formal definition of the problem they are solving, and that the assumptions made in such works are frequently neither clear nor precise. What is significantly lacking is a formal framework that defines the schema matching problem and makes clear the input information of the problem and the required output solution under a set of precise assumptions. Moreover, the key commonality among the proposed XML schema matching algorithms is that the output solution is given as a confidence score, also called a similarity coefficient, ranging in [0,1]. When addressing the problem of automating document transformation, such output is often insufficient. We should therefore keep in mind, when considering the XML schema matching problem, what kind of matching output we need in order to generate transformation programs automatically. In chapter 4, we propose a formal definition of the XML matching problem and the different ingredients needed to solve it. Those ingredients will then be considered in depth in subsequent chapters.

The four tools compared are Cupid [Madhavan 01], LSD [Doan 00], [Doan 01], SF [Melnik 02] and Clio [Miller 01], [Popa 02].

Schema types.
  Cupid: generic (applied to XML and relational).
  LSD: XML.
  SF: relational, XML, RDF.
  Clio: relational, XML.

Terminological matching.
  Cupid: name equality, thesaurus (synonymy, hypernymy, homonymy), abbreviations.
  LSD: name matcher (name equality and synonyms).
  SF: string matcher comparing common prefixes and suffixes of literals.
  Clio: not specified.

Constraint-based matching.
  Cupid: datatype and domain compatibility, referential constraints.
  LSD: —
  SF: constraints are used as filters after the matching process.
  Clio: referential constraints.

Structural matching.
  Cupid: matching of sub-trees weighted by leaves.
  LSD: XML classifier for matching non-leaf elements, parent-child relationships (instance-based).
  SF: iterative fixpoint computation, parent-child relationships.
  Clio: comparison of logical access paths, parent-child relationships and referential relations via keys and foreign keys.

Mapping result.
  Cupid: coefficient ranging in [0,1].
  LSD: coefficient ranging in [0,1].
  SF: coefficient ranging in [0,1].
  Clio: dependency expressions, transformation queries.

User intervention.
  Cupid: user can adjust threshold weights.
  LSD: user provides initial match and training source examples.
  SF: post-matching user feedback (checks and, if necessary, adjusts the results).
  Clio: requires user confirmation for every detected match pair.

W3C XML Schema additional features.
  Cupid: shared types.
  LSD: —
  SF: —
  Clio: —

Matching cardinality.
  Cupid: 1:1 and n:1.
  LSD: 1:1 and 1:n.
  SF: 1:1 local mapping and m:n global mapping.
  Clio: not specified.

Remarks.
  Cupid: works on similar structures.
  LSD: limited structural information; depends on the quality of training examples.
  SF: degrades when labelling is uniform or undirected; performs unexpectedly when adjacency information is not preserved.
  Clio: only parent-child relationships are used; constraints are limited to referential integrity constraints.

Table 3.1: Characteristics of some existing XML Schema matching tools


Part II

A Framework for XML Schema Matching and Automatic Generation of Transformation Scripts


Chapter 4

Problem formalisation

One central task in automating structured document transformations is schema matching. XML Schema matching has attracted much attention and several matching solutions exist, but they usually do not provide a prior formal definition of the problem they are solving: they do not make clear what is meant by "similar elements", and they do not state the implicit assumptions they are making. In this chapter, we provide a framework for defining the XML Schema matching problem by specifying the ingredients of its solutions and the assumptions that underlie our proposed solution.

4.1 XML Schema Matching Problem

4.1.1 Semantic matching Vs Syntactic matching

Schema matching is defined as the task of finding similarities between schemas by relying on the "semantics" inferred from those schemas. Schema matching and semantic schema matching are used interchangeably in the literature without a precise definition of what "semantics" means. We begin with a discussion of the meaning of "semantics" when trying to automate a schema matching process, and then give a formal definition of the matching problem. The only work in the field of schema matching that provides a formal definition of semantic matching is described in [Madhavan 02]; a restricted definition is also used in [Doan 02b]. Our definition of semantics in the schema matching problem is inspired by these contributions.

The majority of research work in schema matching makes the assumption that the user is able to decide whether a given mapping is correct. This capability suggests that the user understands the meaning of the schemas' content. In fact, each user has in his mind a representation of the Universe in terms of concepts and semantic relationships among them, capturing his "understanding" of the Universe. Such representations exist only "in the mind" of users and are not concrete. In contrast, a syntactic schema is a formal, i.e., syntax-based, concrete representation of a set of semantic concepts and their relationships (examples of syntactic schemas include XML Schemas, relational schemas, etc.). Semantic matching tends to abstract the mental process through which the user finds how two (or more) semantic concepts are similar in his representation of the Universe, based on some subjective similarity function.

While analysing the similarity between two concrete syntactic elements, the user generally first maps them to some semantic concepts in his mind (which corresponds to the process of understanding syntactic schemas) and then performs semantic matching. Figure 4-a illustrates the process of semantic matching, in which a user A tries to understand how similar two syntactic constructions a and b, belonging to two XML schemas S and T, are. User A first maps a and b in his mind to some semantic concepts α and β, and then deduces the similarity between a and b from the similarity of α and β. When trying to automate the discovery of similarity between two syntactic schemas, the matching tool does not have access to human understanding of the Universe; it has access only to syntactic clues in the syntactic schemas and data. Thus, in solving the automatic schema matching problem, the proposed solution must approximate semantic similarity as closely as possible using syntactic clues, making the following assumption:

Assumption 1: The more similar the syntactic clues of two schemas, the more semantically similar the two schemas are.

Figure 4-a: Semantic matching Vs Syntactic matching

Several attempts have been made to design "good" XML schemas reflecting human understanding of the Universe. The authors of [Routledge 02] suggested conceptually modelling XML Schemas on the basis of the Unified Modeling Language (UML); an essential part of static UML was used to design XML Schemas. Another approach argues in favour of object-oriented methods for conceptually designing XML Schemas [Xio 01]. The key point underlying these works is that they provide XML schemas where each element represents a semantic entity (e.g., publication, book, author, etc.) and each XML schema relation represents a semantic relationship (e.g., the semantic relation "a book has an author" is represented by a containment relationship between elements book and author, and the semantic relation "a book is a kind of publication" is represented by an inheritance relationship between the book type and the publication type). A path in the XML tree also represents a composite semantic relationship. Thus, each XML schema reflects the schema designer's understanding of the Universe. However, while such methods provide modelling foundations for designing XML schemas, they do not specify how properties and additional constraints like datatypes or cardinalities are assigned to schema elements; the latter issue depends on the designer's "best practice". From this perspective, heterogeneities between schemas reflect, first, the designers' different understandings of the Universe and, second, differences in how the same real-world entities are conceptually modelled and constrained. In order to compare two XML schemas, we need to make both issues explicit for the matching process.

For this, we propose to model an XML schema as a directed labelled graph, a so-called schema graph (cf. section 4.3), where nodes and the edges relating them reflect the schema designer's understanding of the Universe, and constraints over nodes and edges reflect the constraints the designer imposed on schema entities. The proposed model has two main advantages. The first is to make explicit the XML schema features used within the matching process. The second is to normalize XML schema languages into a uniform representation that hides syntactic differences, thus facilitating XML schema understanding.

4.1.2 Input Information for the matching process

To solve the automated semantic matching problem, we first need to identify precisely the type of information that the problem takes as input and the type of output that we require the solution to produce. In general, the input information of a matching problem includes any type of knowledge about the schemas to be matched, their instances and their domains. In this section, we describe the input information that we consider for solving the XML schema matching problem:

• Schema information: syntactic constructions such as elements, their names, their structural organisation, relationships among the elements, schema constraints, etc. As already mentioned in chapter 3, few existing XML schema matching algorithms focus on structural matching that exploits all XML schema features. Our proposed schema graph makes such features explicit for the matching process.

• Data instances: Generally, only learner-based matching approaches take the analysis of data instance content into consideration. As we showed in chapter 3, rule-based matching approaches (which focus on schema analysis to perform the matching task) can be complemented by the analysis of data instance content, when it is available. However, in the context of content reuse it is not guaranteed that a set of representative data instances is available; we therefore do not consider clues present in data instances.

• User feedback: It is widely accepted that the matching process cannot be fully automated and user intervention is always required. In most existing matching algorithms, user input is requested pre-match (to provide an initial matching) or post-match (just to validate the matching output). The purpose of our work is to populate the target schema with source data as desired by the user. We can thus assume that the user knows what is wanted in the target instance and can therefore decide whether a match is correct. Moreover, he can decide, for example as a result of seeing a potential loss of data from the source, that the target should be altered.

Assumption 2: The user knows the target schema in detail and is able to decide whether a mapping is correct for producing the desired transformation. Otherwise, the user can either reject the data or modify and adjust the target schema.

• Auxiliary information: Several matching algorithms rely on thesauri and dictionaries in order to perform terminological matching. In this dissertation, the only auxiliary information we use is WordNet [Miller 95], [Fellbaum 98], an electronic lexical database in which relations such as homonymy are available to relate word meanings. We have ignored the problems, described in chapter 3, that arise from the use of abbreviations and acronyms, names containing articles or prepositions, etc. This choice is based on the fact that these problems have been widely studied in the string matching field and several algorithms are available to solve them11. We thus assume that names used in XML schemas are descriptive (they do not include abbreviations or acronyms and do not contain prepositions). We consider only English words, including compound words; examples include Location, Address, Journal, Journal-article. Other domain-specific dictionaries could be used (for example when the words used do not belong to WordNet). We do not treat this issue, since we are interested in how to exploit a given dictionary for the purpose of schema matching, not in proposing specific dictionaries. Techniques for treating abbreviations, acronyms, punctuation, etc., together with the adoption of a domain-specific dictionary giving more precise indications of what the names used mean, would be a natural extension to our work.

• Matching Criteria: In section 4.1.1, we identified that the key component for automating semantic schema matching is the similarity function (also called matching criteria) that simulates human reasoning about similarities. Most schema matching algorithms rely on the combination of several heuristics (called hints or clues). An example of a hint is: "elements with the same names or with synonymous names are more likely to be semantically similar". The choice of such hints affects the quality of the matching process. We must therefore make precise the hint functions we use, specifying their parameters, their output, and how they are combined to provide a final mapping.

11 The Cupid algorithm, for example, treats the case of abbreviations, acronyms, prepositions, articles and conjunctions.

4.1.3 Output Solution for the matching process

Most existing schema matching systems provide mappings with confidence scores (also called similarity coefficients) ranging in [0,1]. Since our aim is not just to provide a matching algorithm for XML schemas but also to use the matching result (mappings) in automating document transformations, we require that a solution to the XML schema matching problem produce semantic mappings with both confidence scores and transformation functions. An example of such a mapping could be (address, location, 0.8, rename), which says that the element address matches the element location with confidence 0.8; moreover, to transform an instance of element address (over the source schema) into an instance of element location (over the target schema), we just need to rename element address into location. In addition to one-to-one mappings, the proposed solution should be able to discover complex mappings such as ({FirstName, LastName}, name, 0.8, concat), which specifies that the concatenation of the source elements FirstName and LastName matches the target element name with confidence 0.8. A set of operations over source elements allowing the automatic generation of transformation functions should then be defined.

In the case of multiple mapping results, we require the solution to produce a list of the k best mappings, ranked according to their confidence scores. If k is small, the user can easily choose the correct mapping from such a list. This approach is used, for example, in the LSD system, whose authors show that the cognitive load of finding the correct mapping within a small list is negligible. Thus, by requiring a solution to return a small list instead of a single mapping, the accuracy of the solution is increased without imposing an additional burden on the user. Finally, we require the mapping result to be structured in a way that facilitates its validation by the user and its adaptation (for example when schemas evolve); in such a case, we do not repeat the matching process but just update the mappings between the elements involved in the source and target schema modifications.
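To fix ideas, the complex mapping above could be serialized as follows; this XML representation is only an illustrative sketch (no particular format is imposed at this point), but it shows how the source elements, the target element, the confidence score and the transformation operation can be kept together for later validation and adaptation:

  <mapping confidence="0.8">
    <source>
      <element>FirstName</element>
      <element>LastName</element>
    </source>
    <target>
      <element>name</element>
    </target>
    <!-- operation drawn from the set of predefined source-element operations -->
    <operation>concat</operation>
  </mapping>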

4.2 Formal definitions

Above, we introduced an informal definition of XML schema matching and specified the requirements for a possible solution to this problem. In this section, we provide a formal framework for defining the XML schema matching problem. We first give some fundamental definitions.

Definition 1 (Semantic Matching). Given a human representation of the Universe U and a subjective semantic similarity function over U's concepts, denoted S_U, semantic matching is the process through which a user identifies that two syntactic constructions a and b are semantically similar, based on how similar their respective mappings in U are with respect to S_U.

Definition 2 (Automated Semantic Matching). Automated semantic matching is defined as the process in which an objective syntactic similarity function, called the matching criteria (applied to syntactic schemas), is used to approximate as closely as possible the results of a user's semantic matching, based on Assumption 1.

Definition 3 (A Formal Framework for the XML Matching Problem). A formal framework for defining and solving the matching problem under the assumptions we fix throughout this chapter should incorporate:

• A model for XML schemas specifying what XML Schemas’ features are used within the matching process,

• An algebra for source-to-target mappings including a set of operators that can be applied to the elements of source schemas according to a set of predefined rules,

• A set of matching criteria able to produce semantic mappings based on the operators defined by the above algebra, together with a specification of how such criteria are implemented and combined,

• A method to incorporate user feedback.

In the following, we detail the proposed model for XML schemas and the algebra for source-to-target mappings. The matching criteria used and the incorporation of user feedback are described in chapter 5.

4.3 Modelling XML Schema

4.3.1 Features of XML Schema

The World Wide Web Consortium (W3C) began work on XML Schema in 1998, and the first version became official in May 2001 [W3C 01a], [W3C 01b], [W3C 01c]. The intent was to create a schema language more expressive than DTDs. The structure of an XML document is defined in an XML Schema in terms of hierarchical relationships between XML elements and/or attributes, on which specific constraints concerning, for example, ordering and cardinality are imposed. In this section, we present the main features of the XML Schema language that we take into consideration when defining the XML Schema matching problem.


4.3.1.1 XML Schema Data Types

An XML Schema datatype (using the terminology of [W3C 01c]) is a 3-tuple consisting of a) a set of distinct values, called its value space (domain), b) a set of lexical representations, called its lexical space, and c) a set of facets that characterize properties of the value space, individual values or lexical items. Types in XML Schema are either simple or complex. Simple types allow character data content, but no child elements or attributes. Complex types, on the other hand, allow child elements and attributes.

The XML Schema recommendation defines 44 built-in simple types that represent commonly used data types. They include string types, numeric types (e.g., float, decimal, integer) and date and time types (e.g., date, duration, time). XML Schema also offers users the possibility to derive their own types from the built-in types by applying facets. Examples of facets are those used to restrict the legal range of numerical values by giving maximal/minimal values, and to limit the length of string values.

Example 4.1 (XML Schema: Derived Simple Types by restriction)

The following example derives a restricted atomic datatype from string for an email address and specifies that each email address should have the form "string@string".
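A declaration along these lines (the type name and the exact pattern facet are illustrative) is:

  <xs:simpleType name="EmailAddress">
    <xs:restriction base="xs:string">
      <!-- the pattern facet constrains the lexical space to "string@string" -->
      <xs:pattern value=".+@.+"/>
    </xs:restriction>
  </xs:simpleType>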

Example 4.2 (XML Schema: Derived Simple Types by List)

XML Schema offers two other ways to define derived simple types: derivation by list and derivation by union. The value space of a list datatype is a set of finite-length sequences of atomic values. The atomic datatype that participates in the definition of a list datatype is known as the itemType of that list datatype. The following example defines a phoneNumber type as a list of unsignedBytes.
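A minimal declaration to this effect is:

  <xs:simpleType name="phoneNumber">
    <!-- a phoneNumber is a whitespace-separated sequence of unsignedByte values -->
    <xs:list itemType="xs:unsignedByte"/>
  </xs:simpleType>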


Example 4.3 (XML Schema: Derived Simple Types by Union)

The following example defines a technicalMemo type which is the union of the built-in datatype string and the simple type phoneNumber.
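Such a union can be declared as:

  <xs:simpleType name="technicalMemo">
    <!-- a technicalMemo value is either a string or a phoneNumber -->
    <xs:union memberTypes="xs:string phoneNumber"/>
  </xs:simpleType>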

4.3.1.2 Attribute and element declaration

Attribute and element definitions are allowed both globally and locally. Globally, they are defined as immediate children of the <schema> element, and may then be referenced from arbitrary elements. Locally, they are defined inside a <complexType> element. Attribute definitions are given by <attribute> elements, which specify the attribute's name, its type (which is always a simple type), its minimal (optional vs. required) and maximal cardinality, and possibly a default or fixed value. An element declaration is an association of a name with a type definition, either simple or complex, an (optional) default value and a (possibly empty) set of identity-constraint definitions. The association is either global or scoped to a containing complex type definition.
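The contrast between global and local declarations can be sketched as follows (element and type names are illustrative):

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <!-- global declarations: immediate children of <schema>,
         referenceable from arbitrary content models -->
    <xs:element name="Title" type="xs:string"/>
    <xs:attribute name="lang" type="xs:string"/>

    <xs:complexType name="BookType">
      <xs:sequence>
        <xs:element ref="Title"/>
        <!-- local declaration: scoped to BookType -->
        <xs:element name="Publisher" type="xs:string"/>
      </xs:sequence>
      <!-- optional attribute with a default value -->
      <xs:attribute ref="lang" use="optional" default="en"/>
    </xs:complexType>

  </xs:schema>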

4.3.1.3 Complex Types

In contrast to simple types, complex types allow child element and attribute definitions. These datatypes are then used for defining element types. This is one of the main differences from DTDs: elements (roughly speaking, the tags that are then used in the document) may be defined separately from datatypes (defined as complex types). Complex datatypes can also be derived from existing datatypes, either by restricting another complex datatype (in its components or in its structure) or by extending a simple or complex datatype.

A complex type definition is specified by the following properties:

• name,

• base type and derivation method (if it is derived type),

• attribute declarations (given by <attribute> subelements, as described above),

• content type (elementOnly, empty, mixed, or simpleType),

• content model (in the case of elementOnly): contains subelements which declare the structure of element contents. The element contents may further be nested into <sequence>, <choice>, <all>, and <group> elements, which allow for specifying properties similar to those of DTD content models.

Example 4.4 (XML Schema: derived complex type by restricting a complex type)

The following example illustrates how the complex type Limited-Contacts restricts the complex type Contacts by limiting the maximal cardinality of the element person.
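A sketch of the two type definitions (the type of person and the cardinality bound are assumed):

  <xs:complexType name="Contacts">
    <xs:sequence>
      <xs:element name="person" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="Limited-Contacts">
    <xs:complexContent>
      <!-- restriction: the occurrence range of person is narrowed -->
      <xs:restriction base="Contacts">
        <xs:sequence>
          <xs:element name="person" type="xs:string" minOccurs="0" maxOccurs="5"/>
        </xs:sequence>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>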

Example 4.5 (XML Schema: derived complex type by Extension)

The following example illustrates how the complex type Book extends the complex type Publication by adding two elements, ISBN and Publisher.
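A sketch of this derivation (the children of Publication are assumed):

  <xs:complexType name="Publication">
    <xs:sequence>
      <xs:element name="Title" type="xs:string"/>
      <xs:element name="Year" type="xs:gYear"/>
      <xs:element name="Author" type="xs:string" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="Book">
    <xs:complexContent>
      <!-- extension: ISBN and Publisher are appended to the inherited content -->
      <xs:extension base="Publication">
        <xs:sequence>
          <xs:element name="ISBN" type="xs:string"/>
          <xs:element name="Publisher" type="xs:string"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>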

4.3.1.4 Element and Type Substitutability

XML Schema provides a mechanism, called substitution groups, that allows elements (types) to be substituted for other elements (types). More specifically, elements can be assigned to a special group of elements that are said to be substitutable for a particular named element, called the head element. In example 4.6, we declare the element monograph and assign it to a substitution group whose head element is book, so that monograph can be used wherever the element book can be used (cf. publication1.xml and publication2.xml). Elements in a substitution group must have the same type as the head element, or a type derived from the head element's type.

Similarly to element substitutability, XML Schema offers the possibility of type substitutability, i.e., the ability to substitute the content of an element's declared type with the content of a derived type. The principle of type substitutability is that a base type can be substituted by any derived type. Example 4.7 illustrates this. The types Book and Journal are derived from the type Publication (cf. the XML schema Publications.xsd). If we declare an element, pub, to be of type Publication (the base type), then in the instance document pub's content can be of type Publication, Book or Journal. In the XML instance, this is done by using the special attribute xsi:type, whose value is one of the derived types (cf. the instance document Publication.xml). The use of the attribute xsi:type within the XML instance is an exception to the rule that XML schema information is not part of XML instances.

Example 4.6 (XML Schema: element substitution)
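The substitution group declaration described above might be written as follows (the head element's type name is assumed; monograph, declared without an explicit type, takes the type of its head element book):

  <xs:element name="book" type="Publication"/>
  <xs:element name="monograph" substitutionGroup="book"/>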

Publication1.xml

….

Publication2.xml (An alternative instance document)

….

4.3.1.5 Abstract Types and Abstract Elements

XML Schema introduces the notions of abstract types and abstract elements. As usual in object-oriented modelling, abstract types may not have direct instances, but their concrete subtypes may. An abstract type is a template/placeholder type: if an element is declared to be of an abstract type, then in an XML instance document the content model of that element may not be that of the abstract type, but must be that of one of the types derived from it. The notion of abstractness also applies to elements. If an element is declared abstract, then that element may not appear in an XML instance document; however, elements that belong to the abstract element's substitution group may appear in its place.

Example 4.7 (XML Schema: Type Substitution)

Publications.xsd

……

Publications.xml

  <Publications>
    <!-- reconstructed instance fragment; element names other than pub
         and the xsi:type values are assumed -->
    <pub xsi:type="Book">
      <Title>TCP-IP Illustrated</Title>
      <Year>1998</Year>
      <Author>aaaaa</Author>
      <Price>10.23</Price>
    </pub>
    <pub xsi:type="Journal">
      <Title>TCP-IP Illustrated</Title>
      <Year>1998</Year>
      <Author>aaaaa</Author>
    </pub>
  </Publications>


Example 4.8 (XML Schema: Abstract Type)

Let us consider the schema of example 4.7 revisited. We just declare the type Publication to be abstract, as follows:
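  <!-- as in the sketch for example 4.5, the child elements are assumed;
       the essential change is the abstract="true" attribute -->
  <xs:complexType name="Publication" abstract="true">
    <xs:sequence>
      <xs:element name="Title" type="xs:string"/>
      <xs:element name="Year" type="xs:gYear"/>
      <xs:element name="Author" type="xs:string" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>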

The instance document Publications.xml considered in example 4.7 remains the same. Since the element pub has the abstract type Publication, instances of that element can only indirectly be of type Publication; the direct type has to be one of the concrete subtypes of Publication, e.g., Book or Journal. In the XML instance, this is again done by using the special attribute xsi:type, whose value is one of the concrete subtypes.

4.3.1.6 Integrity Constraints

XML Schema supports identity constraints and referential integrity constraints, known from the relational model, by means of unique, key, and keyref. Unique/key specifies a list of properties which must uniquely identify each item among a set of nodes addressed by a selector (a restricted XPath expression). This mechanism is more powerful than the ID/IDREF concept of DTDs. XML Schema's identity and referential constraints are much enhanced:

• element content can be defined to be unique.

• non-ID attributes can be declared to be unique or keys.

• combination of element content and attributes can be declared to be unique or as keys, that is, not only unique, but always present and non-nillable.

• XML Schema distinguishes between unique and key.

• the range of the document over which something is unique can be declared.

• The comparison between keyref fields and key or unique fields is by value equality, not by string equality.

Example 4.9 (XML Schema: Uniqueness constraint)

The following example specifies that each element book has a unique ISBN.
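One way to express this (the enclosing Library element and the book type are assumed):

  <xs:element name="Library">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" type="BookType" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
    <!-- identity constraints are attached to an element declaration -->
    <xs:unique name="uniqueISBN">
      <xs:selector xpath="book"/>
      <xs:field xpath="ISBN"/>
    </xs:unique>
  </xs:element>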


Example 4.10 (XML Schema: Referential constraint)

The following example illustrates that every Book has an ISBN, and that this ISBN must be unique. Moreover, each ISBN of a Book that an Author signs must refer to one of the ISBN elements in the collection defined by the PK key.
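A sketch of the key/keyref pair (only the name PK comes from the example; the enclosing element and the selector paths are assumed):

  <xs:element name="Collection" type="CollectionType">
    <xs:key name="PK">
      <xs:selector xpath=".//Book"/>
      <xs:field xpath="ISBN"/>
    </xs:key>
    <!-- each ISBN signed by an Author must match an ISBN in the PK collection -->
    <xs:keyref name="signedBook" refer="PK">
      <xs:selector xpath=".//Author"/>
      <xs:field xpath="ISBN"/>
    </xs:keyref>
  </xs:element>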

4.3.2 XML Schema Graph

We model an XML schema as a directed labelled graph with constraint sets. A schema graph consists of a set of nodes connected to each other through directed labelled edges. In addition, constraints can be defined over nodes and edges. Figure 4-b illustrates a schema graph example.

4.3.2.1 Schema graph nodes

We categorize nodes into atomic nodes and complex nodes. Atomic nodes have no edges emanating from them; they are the leaf nodes of the schema graph. Complex nodes are the internal nodes of the schema graph. Each complex node has one or more directed edges emanating from it, each associated with a label and each going to another node. Each atomic node has a simple content, which is either an atomic value from the domain of the basic datatypes (e.g., string, integer, date, etc.) or a constructional value, meaning a list value or a union value. The content of a complex node, called complex content, refers to other nodes through directed labelled edges. In Figure 4-b, the nodes University and Library are complex nodes, while the nodes Name and Location are atomic nodes.


4.3.2.2 Schema graph edges

Each edge in the schema graph links two nodes, capturing the structural aspects of XML schemas. We distinguish three kinds of edges, namely containment, of-property, and association relationships.

• A Containment relationship, denoted c, is a composite relationship, in which a composite node (“whole”) consists of some component nodes (“parts”).

• The of-property relationship, denoted p, specifies the subsidiary attribute of a node.

• Explicit relationships defined in XML Schema by means of key/keyref pairs or substitution groups are modelled using so-called association edges, denoted a. An association between two nodes is a structural relationship specifying that both nodes are conceptually at the same level. Such a relationship is defined in XML Schema by specifying the related schema components and possibly a predicate function (e.g., a join condition). Association relationships are generally bidirectional; to preserve compatibility with containment and of-property edges, they are represented using a pair of reverse edges.

In Figure 4-b, a containment edge links the two nodes University and Library. An of-property edge links the node University to its attribute Location. Two association relationships are also represented: the first, between the nodes Book and Monograph, models the substitution group relation between them, while the second, between Journal-article and Journal, specifies a key/keyref relation. Visually, association edges are depicted as dashed lines.
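For illustration, a fragment of the kind of schema source that gives rise to such edges is sketched below (LibraryType and the atomic types are assumed):

  <xs:element name="University">
    <xs:complexType>
      <xs:sequence>
        <!-- containment edges (c): University contains Name and Library -->
        <xs:element name="Name" type="xs:string"/>
        <xs:element name="Library" type="LibraryType"/>
      </xs:sequence>
      <!-- of-property edge (p): Location is an attribute of University -->
      <xs:attribute name="Location" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>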

4.3.2.3 Schema graph constraints

Different constraints can be specified with the XML Schema language. These constraints can be defined over both nodes and edges. They express various requirements such as domain, cardinality, order, exclusion, uniqueness, and referential integrity.

4.3.2.3.1 Constraints over an edge

Typical constraints over an edge are cardinality constraints. Cardinality constraints over a containment edge specify the cardinality of a child with respect to its parent. Typical cardinality constraints include "one" [1..1], "one-or-more" [1..N], "zero-or-one" [0..1], and "zero-or-more" [0..N]. Cardinality constraints can also be specified with any arbitrary interval of the form [minOccurs, maxOccurs]. Cardinality constraints over an of-property edge express, for example, an optional or mandatory attribute for a given node. For example, each University in Figure 4-b must have exactly one Name; thus the cardinality constraint is [1..1] on the link University→Name. A University must also have an attribute Location, denoted by [1..1] on the corresponding of-property edge University→Location. The default cardinality specification in Figure 4-b is [1..1].


4.3.2.3.2 Constraints over a set of edges

We essentially distinguish three kinds of constraints:

• Ordered Composition: Such a constraint is defined essentially over a set of containment relationships. A "whole" node may be composed of its different "part" nodes in a particular order. Taking the complex node Author as an example, its Name and Address should appear in this order. This constraint is used for modelling XML Schema "sequences". We can also make explicit that a set of nodes may appear in any order, each node appearing at most once, which models the XML Schema "all".

• Exclusive Disjunction: Such a constraint over a set of edges (relationships) indicates that exactly one of the relationships applies at a time. This is used for modelling the XML Schema "choice". The exclusive disjunction constraint applies to containment edges.

• Referential Integrity: XML schemas offer the possibility to express integrity constraints. Sometimes the content of one node (or a set of nodes) is linked to the content of another node (or a set of nodes). The referential constraint on a node requires that there exist another corresponding referential node; both linked nodes must have the same content for the referential node. This constraint applies to association edges and is generally modelled through a join predicate. An example is the association edge between Article and Journal with the predicate join(Article, Journal) = (Article/JournalRef = Journal/Name) as a typical join condition.

4.3.2.3.3 Constraints over a node

Constraints applied to a node include:

• Uniqueness: When a node may appear several times, the uniqueness constraint requires each of these appearances to have unique node content.

• Domain Constraints: Domain constraints are very broad. They essentially concern the content of atomic nodes. They can restrict the legal range of numerical values by giving maximal/minimal values, limit the length of string values, or constrain the patterns of string values. For example, a pattern may prescribe the number of characters or the format of allowable components that comprise a valid string value, or specify the number of members or permissible members for a constructional content value (e.g., a list value). Domain constraints can also specify that a given atomic or complex node is of "any type", which implies that no constraints concerning its content are attached to it.


Figure 4-b: A schema graph example

Based on the above observations, we introduce the following definitions:

Definition 4 (Schema Graph) A schema graph G is a 4-tuple (NG, EG, Label, Const) where:

• NG is a nonempty finite set of nodes,

• EG is a finite set of edges representing relationships between nodes,

• Label is a finite set of labels denoting different relationships between nodes,

• Const is a set of constraints over nodes and edges.

Definition 5 (Node) A node n ∈ NG is a 3-tuple (nname, ncategory, ncontent) consisting of a node name nname associated with a namespace, a node category ncategory indicating whether the node n is atomic or complex, and a node content ncontent specifying simple content for atomic nodes and complex content for complex nodes. Let NA and NC be respectively the sets of atomic nodes and complex nodes; NA and NC form a partition over NG, that is NA ∩ NC = ∅ and NA ∪ NC = NG.

Definition 6 (Edge) An edge e ∈ EG is a 3-tuple (elabel, esource, etarget), consisting of a label elabel ∈ Label stating the link type, the source node of the edge esource ∈ NG, and the target node of the edge etarget ∈ NG. We distinguish three kinds of edges: let Ec, Ep and Ea be respectively the sets of containment edges, of-property edges and association edges. Ec, Ep and Ea are pairwise disjoint and form a partition over EG, that is Ec ∪ Ep ∪ Ea = EG.

Example: In Figure 4-b, Title and Publisher are atomic nodes, each of which has a simple content drawn from the atomic datatype string. The complex node University is connected to the nodes Name, Location, and Library through three outgoing edges, denoting respectively a containment relationship, an of-property relationship and a containment relationship. Thus the complex content of node University is (<c(Name)>, <c(Library)>, <p(Location)>).

Definition 7 (Constraint) A constraint c ∈ Const is a 4-tuple (Cname, Ccategory, Capply-to, Cvalue), where Cname is a constraint name (such as uniqueness or cardinality), Ccategory is a constraint category (node constraint, edge constraint, constraint over a set of edges), Capply-to defines the components to which the constraint applies (e.g., a node name in the case of a node constraint) and Cvalue is the value of the constraint (if it exists; otherwise Cvalue takes the default value ∅).

Example: The ordered composition between edges (c, University, Name) and (c, University, Library) is represented as follows: (Order-composition, set-edges-constraint, ((c, University, Name), (c, University, Library)), ∅). The referential constraint between nodes Journal-article and Journal is defined as follows: (Referential-integrity, set-nodes-constraint, (Article, Journal), join(Article, Journal) = (Article/JournalRef = Journal/Name)).

We do not need to check the well-formedness of our schema graph (e.g., if an edge label is of-property, check that the source node is a complex node and the target node is an atomic node), because we construct our schema graph from already valid XML schemas. Based on the notions of nodes and edges, we further introduce the concepts of path and original path as follows:

Definition 8 (Path) A path from node n1 to node nk is a sequence n1, n2, …, nk, where n1, n2, …, nk ∈ NG, and for any two consecutive nodes ni and ni+1 (1 ≤ i ≤ k-1), there exists an edge ei(l, ni, ni+1) ∈ EG \ Ea. A path thus contains only containment and of-property edges. We call n1 and nk respectively the starting node and ending node of the path. The path is said to go through node ni (1 ≤ i ≤ k). The length of a path is the total number of nodes that the path goes through, that is, k for the path P = (n1, n2, …, nk), denoted length(P) = k.

Definition 9 (Original Path) A path from node n1 to node nk is an original path if and only if the starting node n1 has no incoming edges (meaning that the starting node is the root of the schema graph).
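To make these definitions concrete, the following is a minimal sketch, in Python, of how the schema graph structures of Definitions 4 to 6 and the original path of Definition 9 might be represented; all class and method names are illustrative, not part of the formalism.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    """A schema graph node (Definition 5): name, category, content."""
    name: str
    category: str            # "atomic" or "complex"
    content: object = None   # simple content (a datatype) or complex content

@dataclass(frozen=True)
class Edge:
    """A schema graph edge (Definition 6)."""
    label: str    # "c" = containment, "p" = of-property, "a" = association
    source: str   # name of the source node
    target: str   # name of the target node

@dataclass
class SchemaGraph:
    """A schema graph (Definition 4): nodes, edges and constraints."""
    nodes: dict = field(default_factory=dict)      # name -> Node
    edges: list = field(default_factory=list)      # Edge instances
    constraints: list = field(default_factory=list)

    def children(self, name, labels=("c", "p")):
        """Names of nodes reachable through one containment/of-property edge."""
        return [e.target for e in self.edges
                if e.source == name and e.label in labels]

    def original_path(self, name):
        """The original path (Definition 9): node names from the root to
        `name`, following containment and of-property edges only."""
        path = [name]
        while True:
            parents = [e.source for e in self.edges
                       if e.target == path[0] and e.label in ("c", "p")]
            if not parents:              # no incoming edge: we reached the root
                return path
            path.insert(0, parents[0])   # expanded graph: one parent per node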

The concept of schema graph as described above does not include features like type inheritance and abstract types. Such issues will be discussed in chapter 5. Notice that the schema graph is constructed by expanding shared types in the XML schema. For example, in Figure 4-b, element Author (where Author is a child of Article) and element Author (where Author is a child of Book) are represented twice even though they have the same type. Type names are not represented in the schema graph; they are kept in a type table, and a binding from schema graph nodes to types is established.

Based on the schema graph formalism, we further introduce definitions for what we mean by a mapping, which provides the basis for transforming source constructions into target ones.

A mapping element relates, in its simplest form, a construction12 from S to a semantically similar construction from T (respectively the source and target schema graphs). Although a source may not have a construction that directly corresponds to a target one, target constructions may nevertheless be derived from source constructions by applying a set of predefined operations. This mechanism is similar to virtual view creation in data integration, where a virtual view over the source has to match a mediated schema (e.g., the concatenation of source elements FirstName and LastName gives rise to a virtual element Name that matches the target element Name). Virtual views are also applicable to edges (e.g., the two edges Author→FirstName and Author→LastName are merged into a virtual edge Author→Name' that matches the edge Author→Name in the target schema). Based on this observation, we borrow the notion of virtual view to formally define mappings between source and target schemas. We were inspired by the definitions given in [Xu 03a] for mappings using virtual views in a data integration system.

Definition 10 (Schema alphabet) Given a schema K (conforming to the schema graph formalism), we define the alphabet of K, ΣK, as the union of nodes and edges in K's schema graph: ΣK = NK ∪ EK.

Definition 11 (Virtual View) Given a source schema S and its alphabet ΣS, a virtual view over S, υS, is defined as a derivation from the alphabet ΣS, applying a set of predefined operations O = {o1, …, on}.

Definition 12 (Mapping element, direct mapping, complex mapping) Let VS denote the set of possible virtual views constructed over ΣS. A mapping element is a function mst: ΣS ∪ VS → ΣT that associates an element s in ΣS ∪ VS with an element t in ΣT. A mapping element is a direct mapping if it binds an element in ΣS (⊆ VS) to an element in ΣT, and a complex mapping if it binds a virtual element υS ∈ VS to a target element in ΣT through a mapping expression defined over O.

12 Construction refers to nodes and edges in the schema graph.


4.4 Source-to-target mapping Algebra

Since the number of virtual elements and relationships that can be derived over a source schema may be unbounded, we restrict ourselves to a set of possible derivations over source schemas. For this, we define a bounded set of applicable operations over source constructions. Several works (essentially in the data integration field) describe operations for creating virtual views over schemas. The authors of [Biskup 03] define a theoretical framework to extract information from heterogeneous data sources based on ontologically specified target views; they specify a set of operations used as query transformations to integrate source information into the target view. An implementation that extends this work has been proposed in [Xu 03a] for performing query reformulation in a data integration system. A similar work proposed in [Dobre 03] defines a set of mapping operators for both schema entities and attributes intended to support a schema integration system. These works deal with relational schemas and are not fully applicable to XML schemas.

Works in the area of tree matching [Shasha 90] [Zhang 95] address the change detection problem for labelled trees. They propose essentially three edit operations for matching trees: delete, insert, and relabel, each of which is assigned a cost. The "cheapest" edit script for transforming the source tree into a target tree is then obtained by combining such operations. However, in these approaches node labels are not significant: the relabel operation is considered cheaper than a delete followed by an insert. This assumption does not hold in the XML context. In fact, node labels are generally the names of XML tags, which carry semantic information; relabelling one node into a semantically unrelated node causes an undesirable matching in our context. To motivate our choice of operations, we list the following problems we must face in matching two XML schemas:

• Frequently, schema designers tend to qualify semantically similar concepts using different names (in general synonyms). For this we consider a rename operation defined as follows:

Rename: t = rename(s) generates a construction t that is the same as the construction s but with a different name. Example: writer = rename(author) indicates that a source instance author is renamed to a target instance writer.

• For a target node, a particular source may have (1) a proper subset of the desired values or (2) a proper superset of the desired values. For example, consider a schema 1 where an Author has as children both nodes Paper and Book describing his publications, whereas an Author in a given schema 2 has a single node Publication as child. In schema 1, the contents of nodes Paper and Book are both subsets of the content of Publication in schema 2, and the relationships Author→Paper and Author→Book are both specializations of the relationship Author→Publication in schema 2. Thus, if schema 2 is the target, we may need the union of the contents of Paper and Book and of the relationships Author→Paper and Author→Book. If schema 1 is the target, we need a way to select the paper and book contents, and the relationships between author and papers and between author and books, from those between author and publication. For such cases, we propose two operations, Union and Selection, defined as follows:

Union: t = union(s1, s2) generates a construction t whose content is the union of the values of s1 and s2.

Selection: t = selectP(s), where P is a predicate, generates a construction t whose content is the part of the content of s that satisfies P.

• Schema designers do not always choose to represent values at the same level of atomicity. For example, an author name is represented in a given schema 1 using a single element Name, while in a schema 2 it is separated into First-Name and Last-Name. If schema 1 is the target, the values of First-Name and Last-Name have to be concatenated; if schema 2 is the target, we need to split the values of Name elements into First-Name and Last-Name values. Another example is the address description: while in one schema the address is separated into Street, City and Zip values, these are concatenated into a single Address element in another schema. For such cases, we define two operations, Merge and Split, as follows:

Merge: t = merge(s1, …, si) generates a construction t whose value is obtained by concatenating the values of s1, …, si.

Split: (t1, …, ti) = splitcriteria(s), where t1, …, ti are obtained by splitting a construction s with respect to a separation criterion. An example of separation criterion is "white space" in the case of strings.

• A target construction may be obtained by applying a natural join (as in relational schemas) to source constructions. Consider for example schema 1 and schema 2 in Figure 4-c. The relations between Journal-Paper and its children in schema 2 are obtained by a natural join between Journal-Paper (and its relationships) and Journal (and its relationships) in schema 1 under the predicate Journal-ref = Journal-name. We define a join operation as follows:

Join: t = joinP(S1, S2) generates a target construction t which is the natural join of S1 and S2 under the predicate P.


[Figure: schema 1 shows Journal-Paper (Title, Author, Journal-ref) and a separate Journal (Journal-name, Editor); schema 2 shows Journal-Paper (Title, Author, Editor, Journal-name).]

Figure 4-c: A join operation example

• Frequently we need specific functions to transform source values into target values. Such functions include, for example, unit conversion, date format transformation, and mathematical functions such as min, avg, div, etc. For such cases, we introduce a new operation apply, defined as follows:

Apply: t = applyf(s1, …, si), where f is a function that takes the values of s1, …, si and returns t, whose value corresponds to f(s1, …, si).

• For the case where we have a one-to-one matching without any modification, we provide an operation called connect that takes as parameter a construction s and returns the same construction unmodified.

Connect: t = connect(s) generates a construction t which has the same content and label as s.
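For illustration only, the operations above can be sketched in Python over a toy representation in which a construction is a (name, list-of-values) pair; this representation and the helper signatures are assumptions made for the example, not part of the algebra itself.

# A construction is modelled as a (name, values) pair, values being a list.

def rename(s, new_name):
    """t = rename(s): same content, different name."""
    _, values = s
    return (new_name, values)

def union(s1, s2):
    """t = union(s1, s2): content is the union of both value lists."""
    name, v1 = s1
    _, v2 = s2
    return (name, v1 + v2)

def select(s, predicate):
    """t = select_P(s): the part of s's content satisfying P."""
    name, values = s
    return (name, [v for v in values if predicate(v)])

def merge(*constructions, sep=" "):
    """t = merge(s1, ..., si): concatenate values position-wise."""
    grouped = zip(*(values for _, values in constructions))
    return ("merged", [sep.join(parts) for parts in grouped])

def split(s, sep=" "):
    """(t1, ..., ti) = split_criteria(s): split each value on a separator."""
    _, values = s
    return list(zip(*(v.split(sep) for v in values)))

def join(s1, s2, predicate):
    """t = join_P(s1, s2): natural-join-like combination under P,
    with values modelled as dicts for this sketch."""
    name, v1 = s1
    _, v2 = s2
    return (name, [{**a, **b} for a in v1 for b in v2 if predicate(a, b)])

def apply_f(f, *constructions):
    """t = apply_f(s1, ..., si): apply f position-wise to the values."""
    grouped = zip(*(values for _, values in constructions))
    return ("applied", [f(*vals) for vals in grouped])

def connect(s):
    """t = connect(s): identity mapping."""
    return s

For instance, merge(("First-Name", ["Ada"]), ("Last-Name", ["Lovelace"])) yields ("merged", ["Ada Lovelace"]), and split recovers the two value sequences again.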

4.5 Conclusion

We believe that providing a formal framework for schema matching is important because it makes clear to users what a solution means by a mapping and under which assumptions. This helps the user to evaluate the applicability of a solution to a given matching scenario. Based on the analysis of the input and output information an XML schema matching solution requires, we identify four components within our matching framework. First, it is necessary to define precisely the main features of the XML Schema language that a matching algorithm can exploit. For this, we define a model for XML schemas, called the schema graph. The schema graph formalism describes an XML schema as a set of nodes related by three kinds of relationships: containment, referring to parent-child relationships; of-property, referring to the relation between nodes and their attributes; and association, describing referential integrity and substitution group relations between nodes. A set of constraints is also defined over nodes and edges. Second, since we want to use the schema matching solution to generate a transformation program, we need to define what transformation operations it should discover. For this aim, we propose a source-to-target mapping algebra that incorporates operations such as rename, connect and apply to generate direct mappings, and union, selection, merge and split to express complex mappings. Third, since the core of a schema matching solution is similarity computation, matching criteria have to be specified, as well as their implementation and combination. Fourth, since it is impossible to fully automate the matching process, the required user feedback has to be carefully specified. The last two points are discussed in chapter 5.


Chapter 5

Automating XML Schema matching

This chapter first discusses the different matching methods we use, as well as their combination to obtain similarity measures between schema entities. We essentially focus on how to maximize the use of structural information to derive mappings between source and target XML schemas. For this, we adapt several existing algorithms from different fields (dynamic programming, data integration, and query answering) to help solve the XML schema matching problem. We then show how we generate both direct and indirect mappings based on such similarity measures. We also detail how user feedback can be incorporated into the mapping result. Finally, we present an evaluation study of the proposed solution relying on a real-world application, bibliographic data, which reflects the main characteristics of data-centric XML documents. This is done in two parts: the first evaluates the solution in terms of precision and recall; the second draws a comparative study with two structural matchers, Cupid and Similarity Flooding.

5.1 Matching process: The big picture

To match schema graphs, we make use of four basic matching criteria: (1) terminological matching, (2) datatype compatibility, (3) designer type hierarchy and (4) structural matching. The matching process (Figure 5-a) proceeds in several phases:

• Terminological matching: the aim of this phase is to compute the similarity between schema nodes based on the similarity of their labels. The proposed matching method takes as input node labels in both source and target schema graphs, and produces as output a matrix of terminological similarity coefficients (ranging in [0,1]) between source and target nodes, as well as semantic relationships (such as Equivalent, Related-to, Broader-than). Terminological similarity uses WordNet as auxiliary information.

• Datatype compatibility: this phase aims at computing the similarity between source and target nodes based on their domain constraints. It concerns atomic nodes that are declared similar by the terminological matching phase. Datatype compatibility uses the type hierarchy of built-in datatypes in XML Schema and the domain constraints imposed on source and target schema atomic nodes to update the terminological similarity measure.

• Designer type hierarchy analysis: in this phase, we make use of (a) XML Schema abstract types, type inheritance and substitution group mechanisms, and (b) the semantic relationships produced by the terminological matching, in order to derive a set of complex mappings.

• Structural matching: this phase aims at computing the similarity of source and target schema nodes using their structural characteristics. It relies on the notion of node context. We perform structural matching in two steps. The first assigns a context to each node in the source and target schemas; these contexts are then compared to produce a node similarity coefficient that reflects the structural similarity between schema nodes. The second step uses the produced node similarities to derive a set of direct and complex mapping elements. In this step, we also validate the complex mappings generated by the designer type hierarchy phase.

• User validation: in this phase, the mapping result produced by the structural matching is validated by the user in order to produce the final mapping that serves to generate transformation scripts.

In the following sections, we further detail each phase of the matching process.


[Figure: the matching process flow over the source and target schema graphs. (1) Terminological matching (Hirst and St-Onge algorithm, WordNet) produces semantic relationships and terminological similarity coefficients; (2) datatype compatibility (XML Schema type hierarchy) updates the terminological similarity coefficients; (3) designer type hierarchy analysis produces complex mappings; (4) structural matching (ancestor, child and leaf contexts) computes node context similarity and discovers node and edge matches, yielding direct and complex mappings; (5) user validation produces the final mapping result.]

Figure 5-a: The matching process


5.2 Terminological matching

To perform terminological matching, we make explicit the meaning of used element names and establish semantic relationships between them based on WordNet [Miller 95] [Fellbaum 98], an electronic lexical database for English words.

WordNet makes the commonly accepted distinction between conceptual semantic relations, which link concepts (meanings), and lexical relations, which link individual words [Evens 88]. We essentially consider conceptual semantic relations. The most ambitious feature of WordNet is its attempt to organize lexical information in terms of word meanings rather than word forms. Word meanings (senses) are given through the notion of synset (synonym set). In WordNet, a word form may be associated with many synsets, each corresponding to a different sense of the word. WordNet is organized by semantic relations. Since a semantic relation is a relation between meanings, and since meanings can be represented by synsets, it is natural to think of semantic relations as pointers between synsets. WordNet makes use of several semantic relations such as hypernymy, hyponymy, and meronymy.

Most schema matching algorithms suggest the use of WordNet for terminological matching, but generally give few, if any, details about how they exploit it. Our terminological matching is inspired essentially by Hirst and St-Onge's work [Hirst 98]. When attempting to find a relation between two words, each synset (set of synonyms representing a sense associated with a word) of the first word must be considered with each synset of the second word, looking for a possible connection between some synset of the first word and some synset of the second. WordNet links can have three possible directions: upward, downward, and horizontal. An upward direction corresponds to a generalization of the context (connecting more specific concepts to more general ones). Similarly, a downward link corresponds to a specialization of the context. Horizontal links are very specific specializations and are less frequent than upward and downward links. Table A.1 in Appendix A gives examples of such links.

The Hirst and St-Onge measure has three levels of relatedness: extra-strong, strong, and medium. An extra-strong relation corresponds to a word repetition (e.g., author and authors, foot and feet); extra-strong relations have the highest weight. Two words have a strong relation in three scenarios, to which we restrict our algorithm: (1) when there is a synset common to the two words, (2) when there is a horizontal link (e.g., similarity, see-also) between synsets of each word, (3) when one of the words is a compound word, the other word is a part of the compound, and there is any kind of link between synsets associated with each word. A strong relation has a lower weight than an extra-strong relation. Unlike extra-strong and strong relations, medium relations have different weights depending on the path that exists between a pair of synsets.


Nodes in this graph represent word synsets. Semantic relationships between words are classified into four categories:

• Equivalent: when one or more horizontal links between word synsets exist (case 3).

• Broader-than: when one or more upward links exist between two synsets (case 1), or when upward links are followed by horizontal links (case 5).

• Related-to: when an upward link is followed by a downward link, possibly with horizontal links (the two words have a common hypernym) (case 4).

• Narrower-than: when one or more downward links occur between two synsets (case 2), or when a downward link is followed by horizontal links (cases 6 and 7).

Figure 5-b: Semantic relationships classification in WordNet

The idea behind Hirst and St-Onge's measure of semantic relatedness is that two concepts are semantically close if their WordNet synsets are connected by a path that is not too long and that does not change direction too often. A set of allowable paths has then been defined. A path is said to be allowable if the links that compose it respect the three following conditions:

(1) No other direction precedes an upward link. (Once the context is specialized using a downward or a horizontal link, it is not permitted to enlarge the context by using an upward link.)

(2) No more than one change of direction is allowed. (Direction changes constitute semantic changes in the context and thus have to be limited.) This rule has one exception: a horizontal direction can be used to make the transition from an upward to a downward direction.

(3) Path length is limited to a certain threshold (generally taken to be equal to four or five).

Figure A-a in Appendix A illustrates examples of allowable paths.

To compute the weight of a medium relation, the shortest path between a pair of synsets is first identified. If the path is not allowable, then the weight assigned to the relation is NULL (i.e., there is no relation). Otherwise, the weight of the path, and thus the terminological similarity, is given by the formula:

weight = c - path length - k × number of changes of direction

where c and k are constants13. By normalizing the latter measure, we obtain a terminological coefficient lsim, ranging in [0,1], where 0 is given when no semantic relation exists between two words and 1 is given for very similar words. The lsim coefficient for Address and Location is 1, while it is equal to 0.81 between First-name and Name and to 0.79 between Publication and Book.

Moreover, based on the classification of allowable paths, we identify four kinds of semantic relations between words, namely equivalent (≡), broader-than (⊇), narrower-than (⊆), and related-to (∼) (cf. Figure 5-b). The detailed Hirst and St-Onge algorithm is given in Appendix A. The same algorithm is also applied to type names. To simplify the comprehension of our approach, we assume that nodes have the same names as their types.

Terminological matching may produce high scores based on the WordNet hierarchy even though the nodes do not semantically correspond to each other. As we shall see, other techniques can detect and eliminate such anomalies.

5.3 Data type compatibility

Expected values appearing in a set of data provide another significant clue as to which elements match. Such a technique has been used in [Xu 03a] [Xu 03b], where, for a specific application area, the authors specify a domain ontology including a set of concepts and semantic relations between those concepts. To each ontology concept is associated a set of regular expressions that match the values expected to appear for the concept. The ontology is parsed to generate a database scheme and rules for matching constants and keywords [Embley 99]. Given a set of data instances associated with the source and target schemas, the authors are able to associate each element in the source and target schemas with a data value pattern, based on the regular expressions declared for a given application ontology. The obtained data value patterns, together with the domain ontology, help to find both direct and complex matches for schema elements. For example, the application domain ontology could specify that a concept address potentially consists of state, city and zip concepts. Based on such a specification and the recognition of expected data values, complex matchings using merge or split operators can be deduced. However, such a technique presents two serious limitations. First, it closely depends on the existence of an ontology covering the application domain; such an ontology is not always available, and it is often infeasible for a given ontology to cover an application area [Halevy 03a]. Second, recognizing data-value patterns from data instances requires that a set of instances be available and descriptive enough, which does not always hold true.

13 The values of c and k were chosen by running experiments based on examples provided in [Budanitsky 00], where the authors compare several semantic distance measures. We take c = 8 and k = 1; thus, the longer the path and the more changes of direction, the lower the weight.

The XML Schema recommendation provides many different datatypes and regular expressions. It is probably the ideal set of datatypes for the purpose of schema matching, since it is refined enough: XML Schema allows the definition of very specific datatypes. We make use of the built-in datatype hierarchy [W3C 01b] in order to compute a datatype compatibility coefficient, denoted dsim. XML Schema datatypes are classified in multiple categories (called primitive datatypes) including, for example, Duration, Boolean, String and Decimal. Each category has several derived datatypes (e.g., integer is derived from Decimal, and non-positive-integer and long are derived from integer). Two datatypes are considered similar if they belong to the same category, and their compatibility depends on their respective positions in the XML Schema datatype hierarchy. Based on this hierarchy, we construct a datatype compatibility table, such as the one used in [Madhavan 01], that gives a similarity coefficient between two given datatypes. For example, the dsim coefficient between types int and short is equal to 0.8 (since type short is derived from type int).
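A sketch of such a compatibility table in Python; apart from the int/short coefficient quoted above, the entries are illustrative placeholders, not values from the actual table.

# Illustrative datatype-compatibility table: only the int/short coefficient
# (0.8) comes from the text; the remaining entries are placeholders.
DSIM_TABLE = {
    frozenset(["int", "short"]): 0.8,
    frozenset(["int", "long"]): 0.9,
    frozenset(["decimal", "integer"]): 0.9,
    frozenset(["string", "token"]): 0.9,
}

def dsim(t1, t2):
    """Datatype compatibility: 1.0 for identical types, a table lookup for
    related types, 0.0 for types from different primitive categories."""
    if t1 == t2:
        return 1.0
    return DSIM_TABLE.get(frozenset([t1, t2]), 0.0)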

Moreover, we also make a modest use of the constraints imposed on datatypes (expressed by means of facets) in order to refine the datatype compatibility coefficient: two datatypes belonging to the same category and presenting similar sets of constraints are more likely to be similar. When available, such constraints can be very helpful for the matching process. For example, two string datatypes (or string-derived datatypes) having similar length constraints, or two integers having similar numerical value ranges, are more likely to represent similar real-world entities. Datatype compatibility is essentially defined for atomic nodes to measure the similarity between their typing constraints. We define a set of heuristics and rules inspired by [Li 00] to draw such measures.14 We limit the scope of datatype compatibility to atomic nodes that are already similar according to the terminological matching method. Finally, we update the terminological similarity coefficient lsim of atomic nodes by including their datatype compatibility, as described in algorithm 1.

14 Although XML Schema offers specific datatypes, these are usually not exact, and constraints are often incomplete, since they are not a necessity but merely a convenience for the schema designer. Our use of datatype constraints is restricted to some facets; for example, we do not consider pattern comparison. Works such as [Embley 99] and [Xu 03a] could be used to extend the datatype compatibility measure.


Algorithm 1 TERMINOLOGICAL AND DATA-TYPE COMPATIBILITY

Input: NS, NT (nodes in source and target schema graphs), th
Output: SimMatrix (similarity matrix)
Begin
  NAS ← atomic-nodes(NS)   // source atomic nodes
  NAT ← atomic-nodes(NT)   // target atomic nodes
  for each s in NS
    for each t in NT do
      lsim(s,t) ← compute-terminological-similarity(s,t)
  for each s in NAS
    for each t in NAT
      if lsim(s,t) > th then do
        dsim(s,t) ← compute-datatype-compatibility(s,t)
        lsim(s,t) ← ωl × lsim(s,t) + ωt × dsim(s,t)
  for each s in NS
    for each t in NT
      SimMatrix ← SimMatrix ∪ (s,t,lsim(s,t))
  return SimMatrix
End

The algorithm takes as input the source and target nodes described respectively in the source and target schema graphs, and returns a similarity matrix SimMatrix storing similar node pairs and their lsim similarity coefficients. For complex nodes, lsim is computed using the Hirst and St-Onge algorithm (described in section 5.2). For atomic nodes whose lsim exceeds a threshold th, lsim is updated using the datatype compatibility dsim: a weighted combination of lsim and dsim is returned15.
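A minimal Python transcription of algorithm 1, assuming lsim and dsim functions such as the sketches above and node objects carrying name and category attributes; the weights and threshold are parameters.

def similarity_matrix(source_nodes, target_nodes, lsim, dsim,
                      th=0.5, w_l=0.5, w_t=0.5):
    """Algorithm 1 (sketch): lsim for every node pair; for atomic pairs
    whose lsim exceeds th, blend in the datatype compatibility dsim."""
    sim = {}
    for s in source_nodes:
        for t in target_nodes:
            sim[(s.name, t.name)] = lsim(s, t)
    for s in source_nodes:
        for t in target_nodes:
            if (s.category == "atomic" and t.category == "atomic"
                    and sim[(s.name, t.name)] > th):
                sim[(s.name, t.name)] = (w_l * sim[(s.name, t.name)]
                                         + w_t * dsim(s, t))
    return sim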

5.4 Designer Type hierarchy

XML Schema features concerning subtyping, abstract types and substitution group mechanisms reflect the designer's point of view and can be used as metadata to help the matching process discover both direct and complex mappings16. The schema graph formalism introduced in chapter 4 covers such features only partially: in particular, there are no indications about complex type extension or abstract types and elements.

15 The two parameters ωl and ωt, which weight respectively the terminological similarity of element names and the characteristics of their data values, are application dependent. If the schema elements are descriptive and the typing is not carefully set, we give more importance to ωl. If element names are not descriptive and data value characteristics are carefully set, we assign a higher value to ωt. Otherwise we give the same priority to terminological and datatype analysis (ωl = ωt = 0.5).

16 Such an approach is more realistic than the use of a "miracle" ontology as in [Xu 03a]. It describes the designer's conceptual modelling point of view and is specific to a given schema.

In XML Schema, restricting a type only adds constraints on its content; this is represented by the constraint set of an XML schema graph. In contrast, extending a type means adding elements and/or attributes, which always results in a complex type. At a conceptual level, this corresponds to a generalisation/specialisation relationship. Similarly, the substitution group mechanism is used to specify that two elements are conceptually at the same level. In the following, we present some examples of such features and show how they can be used to deduce match candidates:

• Abstract Types: Let us consider a source schema represented by the schema graph of Figure 5-c, where two elements Journal-Article and Proceeding-Article are declared respectively of types JOURNAL-ARTICLE and PROCEEDING-ARTICLE (we use the same name for elements and types to simplify the example). Assume that these two types are subtypes of an abstract type ARTICLE. Consider a target schema where only an element Article of type ARTICLE is present. Based on the fact that JOURNAL-ARTICLE and PROCEEDING-ARTICLE are subsets of type ARTICLE in the source schema, and that the type ARTICLE in the source schema matches the type ARTICLE in the target schema, we can deduce the following complex mapping: the union of the source elements Journal-Article and Proceeding-Article matches the target element Article. This kind of hint may also produce wrong matches: keeping the same source schema but taking as target the schema represented in Figure 4-b (in chapter 4), the element Article in the target schema corresponds to the element Journal-Article in the source schema, and not to the union of the elements Journal-Article and Proceeding-Article. Such wrong matches can be corrected by the structural matching techniques described in section 5.5.

• Complex Type Extension: Conceptually, extension between two types (in the XML Schema sense) indicates that one type is a subset of another. This information can successfully help the matching algorithm infer complex matches. Let us consider a source schema where publication, book and article are schema elements of types PUBLICATION, BOOK and ARTICLE, respectively. Consider also a target schema having an element publication of type PUBLICATION. Terminological matching finds that the source element publication is equivalent to the target element publication and that the elements book and article are narrower than publication. The fact that BOOK and ARTICLE are subtypes of PUBLICATION in the source schema allows us to infer a complex matching: the union of the elements book and article in the source matches the element publication in the target.

• Substitution Group: Two substitutable elements are conceptually at the same level. Let us consider a source schema where the elements book and monograph are substitutable, and a target schema where the element publication is similar (by terminological matching) to the source element book. Since in the source schema the elements book and monograph are substitutable, a direct match between monograph and the target element publication can be inferred, and lsim(monograph, publication) is assigned the value of lsim(book, publication).

The result of this step is a set of complex mappings essentially involving the Union/Selection operators. Such complex mappings will be kept or rejected using either structural matching techniques or user intervention. Further complex mappings will be discovered through the structural matching method. In the case where such information is not available in the schema, we can also rely on the semantic relationships discovered by the terminological matching to derive complex matches. However, we give priority to the designer type hierarchy since it reflects the designer's point of view.

Figure 5-c: University bibliographic schema graph

5.5 Structural matching

The matching techniques described in sections 5.2 and 5.3 compare only nodes between a source schema and a target schema, and may provide incorrect match candidates. Structural matching is used to correct such match candidates based on their structural context. For example, assume that we let the schema graph in Figure 4-b be the source schema graph, denoted S, and the schema graph in Figure 5-c be the target schema graph, denoted T. Based on the terminological matching and datatype compatibility techniques, we obtain a match between the node University/Address in T and the node Author/Address in S, while the first is a university address and the second is an author address. Structural matching compares the contexts in which nodes appear and can deduce that the two Address nodes do not match; instead, the node Location in the source schema matches the node Address in the target schema, and the target relationship between University and Address matches the source relationship between University and Location. Structural matching relies on the notion of node context.

5.5.1 Node context definition

The aim of structural matching is to compare the structural contexts in which nodes in the schema graph appear. Thus, we need a precise definition of what we mean by node context. We distinguish three kinds of node contexts, depending on the node's position in the schema graph, as illustrated in Figure 5-d.

• The ancestor-context: An ancestor context of a node n is defined as the original path (as defined in chapter 4) having n as its ending node and the root of the document tree as its starting node. Example: the ancestor-context of node Publication in Figure 5-c is given by the path University/Researcher/Publication indicating that the node publication describes the publications of a researcher belonging to a university. The ancestor-context of the root node is empty and it is assigned a NULL value.

• The child-context: The child-context of a node n includes its attributes and its immediate subelements. The child-context of a node reflects its basic structure and its local composition. Example: the child-context of node University in the schema graph of Figure 5-c is (Name, Address, Researcher). The child-context of an atomic node is assigned a NULL value. For keyrefs, we include the referential nodes in the child-context. For example, the child-context of the node Article in the schema graph of Figure 4-b is composed of the nodes Title, Address, Uri, Abstract and Journal-ref; since Journal-ref is a keyref node, we also include the referential node Journal in the child-context of Article.

• The leaf-context: Leaves in the XML tree represent the atomic data that the schema describes. The leaf-context of a node n includes the leaves of the subtrees rooted at n. Example: The leaf-context of node Publication in the schema graph of Figure 5-c is given by the set (Abstract, Journal, Editor, Abstract, Volume, Price, Title, Publisher). The leaf-context of an atomic node is assigned a NULL value.

The context of a node is defined as the union of its ancestor-context, its child-context and its leaf-context. Two nodes are structurally similar if they have similar contexts. The notion of context similarity has been used in Cupid and SF; however, neither of them relies on all three kinds of contexts: Cupid uses essentially leaf-context similarity and SF child-context similarity. To measure the structural similarity between two nodes, we compute respectively the similarity of their ancestor, child and leaf contexts. In the following we describe the basis needed to compute such similarity.

Figure 5-d: The context of a schema element (its ancestor-context, child-context and leaf-context)

5.5.2 Path resemblance measure

Each path in the target schema graph represents a certain context in which target nodes may appear (the notion of context was detailed in section 5.5.1). For each target path, we want to find the most similar source path.

Let us consider two paths P1(s1, d1) = (s1, ni1, …, nim, d1) and P2(s2, d2) = (s2, nj2, …, njl, d2) conforming to the path definition given in chapter 4. A mapping between P1 and P2 is an assignment function µ: P1 → P2 that associates a node in P1 with a node in P2. An assignment µ is a strong matching if it satisfies the two following conditions:

• Root constraint: Source nodes in P1 and P2 are similar. Two nodes are considered similar if their similarity exceeds a specified threshold with respect to a predefined similarity function.

• Edge constraint: For each directed edge u → v, u, v ∈ (s1, ni1, …, nim, d1), there exists a directed edge u' → v', where u', v' ∈ (s2, nj2, …, njl, d2), such that u, u' are similar nodes and v, v' are similar nodes.

The definition of strong matching is reminiscent of the classical view of a conjunctive query and an answer to it. Under such conditions, paths such as Author/Publication and Publication/Author are not matched although they convey the same semantics. Other paths unmatchable under these conditions are Author/Contact/Address and Author/Address. Based on such observations, it is more appropriate to go beyond strong matching by relaxing the above conditions. One can think of several ways of relaxing strong matching, for example allowing paths to match even when their nodes are not embedded in the same manner or in the same order. Several works in query answering have proposed relaxations to approximate the answering of queries (including path queries) [Amer 02]. Inspired by the work of [Carmel 02] on answering XML queries, we make the following relaxations:

• Root constraint relaxation: Paths can be matched even if their source nodes do not match, for example Author/Publications may match staff/Authors/Author/Publications.

• Edge relaxation: Paths can be matched even if their nodes appear in a different order (e.g., Author/Publication and Publication/Author). Paths can also be matched even if there are additional nodes within the path (e.g., Author/Contact/Address matches Author/Address), meaning that the parent-child edge constraint is relaxed into an ancestor-descendant constraint.

Relaxations may give rise to multiple match candidates. For this reason, the authors of [Carmel 02] define a path resemblance measure between a given path query Q and a path in the source tree; such a measure is used for ranking match candidates. We extend these definitions by allowing two elements within each path to be matched even if they are not identical, provided their terminological similarity exceeds a fixed threshold. We define a path resemblance measure, denoted pr, which determines the similarity between two given paths. The values of pr range between 0 and 1. Match candidates can then be ranked according to the pr measure.

Consider two paths P1 and P2 being matched (where P1 is a target path and P2 is a source path); P2 is the best match candidate for P1 if it fulfils the following criteria:

• The path P2 includes most of the nodes of P1 in the right order.

• The occurrences of the P1 nodes are closer to the beginning of P2 than to the tail, meaning that the optimal matching corresponds to the leftmost alignment.

• The occurrences of the P1 nodes in P2 are close to each other, meaning that a minimum number of intermediate non-matched nodes in P2 is desired.

• If several match candidates that match exactly the same nodes in P1 exist, P2 is the shortest one.

To calculate pr(P1, P2), we first represent each path as a sequence of string elements, each element representing a node name (e.g., the path Author/Publication is a sequence composed of the two string elements Author and Publication). We use the four scores established in [Carmel 02] and borrowed from dynamic programming for string comparison, each of which corresponds to one of the above criteria.


5.5.2.1 Longest common subsequence

As paths may be considered as strings, we use a classical dynamic programming algorithm to compute the longest common subsequence (LCS) between P1 and P2. The greater the length of the longest common subsequence, the more P2 includes the nodes of P1 in the right order.

A word w is a longest common subsequence of x and y if w is a subsequence of x, a subsequence of y, and its length is maximal. Two words x and y can have several different longest common subsequences. The set of the longest common subsequences of x and y is denoted LCS(x,y), and the (unique) length of its elements is denoted lcs(x,y). For comparing two words x and y of sizes m and n respectively, we reuse a classical dynamic programming algorithm that relies on a two-dimensional table T of size (m+1)×(n+1) (algorithm 2). We then exhibit the longest common subsequence by tracing back in table T from T[m-1,n-1] to T[-1,-1] (algorithm 3).

Finally, to obtain a score in [0,1], we normalize the length of the longest common subsequence by the length of the target path P1 as follows:

lcsn(P1, P2) = |lcs(P1, P2)| / |P1|

Example 5.1: Consider P1 to be Publication/Book/Author and P2 to be Author/Publication/Book. The longest common subsequence of the two paths is Publication/Book, so |lcs(P1, P2)| = 2 and lcsn = 2/3 = 0.66.


Algorithm 2 LONGEST-COMMON-SUBSEQUENCE (LCS)

Input: P1, P2, m, n (m and n are respectively the lengths of P1 and P2)
Output: T
Begin
  for i ← -1 to m-1 do T[i,-1] ← 0
  for j ← -1 to n-1 do T[-1,j] ← 0
  for i ← 0 to m-1 do
    for j ← 0 to n-1 do
      if P1[i] = P2[j]
        then T[i,j] ← T[i-1,j-1] + 1
        else T[i,j] ← max(T[i,j-1], T[i-1,j])
  return T
End

Algorithm 3 TRACE-BACK (TB)

Input: P1, P2, m, n, T
Output: LCS
Begin
  i ← m-1
  j ← n-1
  k ← T[m-1,n-1] - 1
  while i ≥ 0 and j ≥ 0 do
    if T[i,j] = T[i-1,j-1] + 1 and P1[i] = P2[j] then
      wk ← P1[i]
      i ← i-1
      j ← j-1
      k ← k-1
    elseif T[i-1,j] > T[i,j-1] then
      i ← i-1
    else
      j ← j-1
  return w
End
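Algorithms 2 and 3 translate directly into Python; the sketch below pads the table with an extra row and column instead of using the pseudocode's -1 indices, and matches on element equality during the trace-back.

def lcs_table(p1, p2):
    """Algorithm 2: dynamic-programming table; row 0 and column 0 are the
    padding that plays the role of the pseudocode's index -1."""
    m, n = len(p1), len(p2)
    T = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if p1[i - 1] == p2[j - 1]:
                T[i][j] = T[i - 1][j - 1] + 1
            else:
                T[i][j] = max(T[i][j - 1], T[i - 1][j])
    return T

def lcs_traceback(p1, p2, T):
    """Algorithm 3: recover one longest common subsequence from T."""
    i, j, w = len(p1), len(p2), []
    while i > 0 and j > 0:
        if p1[i - 1] == p2[j - 1]:
            w.append(p1[i - 1])
            i, j = i - 1, j - 1
        elif T[i - 1][j] > T[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return w[::-1]

# Example 5.1: yields ['Publication', 'Book'], so lcsn = 2/3.
p1 = ["Publication", "Book", "Author"]
p2 = ["Author", "Publication", "Book"]
assert lcs_traceback(p1, p2, lcs_table(p1, p2)) == ["Publication", "Book"]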


5.5.2.2 Average positioning

To address the second criterion, we first compute, according to lcs(P1, P2), the average positioning of the optimal matching of P1 within P2. The optimal matching is the match that starts at the beginning of P2 and continues without gaps. Consider P1 = Author/Publication/Book and P2 = Staff/Author/Publication/Book; since the optimal matching corresponds to the leftmost alignment, the average optimal position, denoted Av-Optimal-Position, is (1+2+3)/3 = 2. We then evaluate, using the LCS algorithm, the actual average positioning (AP); AP takes the value 3 in our example ((2+3+4)/3). Finally, we compute the pos coefficient, indicating how far the actual positioning is from the optimal one, using the following formula:

pos(P1, P2) = 1 - ((AP - Av-Optimal-Position) / (|P2| - 2 × Av-Optimal-Position + 1))

5.5.2.3 LCS with minimum gaps

To address the third criterion, we use another version of the LCS algorithm in order to capture the LCS alignment with minimum gaps [Myers 86]. If P1 = Person/Address and P2 = Person/Contact/Address, we count a gap of length 1 between the two paths, thus gaps = 1. To ensure that we obtain a score less than 1, we normalize the obtained gap using the following formula:

gap(P1, P2) = gaps / (gaps + lcs(P1, P2))

5.5.2.4 Length difference

Finally, in order to give higher values to source paths whose length is similar to that of the target path, we compute the length difference ld between the source path P2 and lcs(P1, P2), normalized by the length of P2, as follows:

ld(P1, P2) = (|P2| - lcs(P1, P2)) / |P2|

To obtain the path resemblance score, all the above metrics are combined as follows:

pr(P1, P2) = α lcsn(P1, P2) + ß pos(P1, P2) - λ gap(P1, P2) - δ ld(P1, P2)

where α, ß, λ and δ are positive parameters ranging between 0 and 1 that represent the comparative importance of each factor. They can be tuned but must satisfy α + ß = 1, so that pr(P1, P2) = 1 in case of a perfect match; λ and δ must be chosen small enough so that pr cannot take a negative value. Algorithm 4 summarizes the computation of the path resemblance measure using the above formulas.


Example 5.2: Let P1 = Author/Book/Title and P2 = University/Author/Publications/Book/Description/Title/Subtitle. We have |lcs(P1, P2)| = 3, AP = (2+4+6)/3 = 4, gaps = 2, and ld = (7-3)/7 = 4/7. Taking α = 0.75, ß = 0.25, λ = 0.25, and δ = 0.2, we obtain a path resemblance score equal to 0.68.

5.5.3 Node Context similarity

5.5.3.1 Ancestor context similarity

The ancestor context similarity, ancestor-ctx-sim, captures the similarity between two nodes based on their ancestor contexts. Since the ancestor context of a given node n is described by its original path (the path from the root to n), computing ancestor context similarity amounts to comparing two paths; thus the path resemblance measure (algorithm 4) defined in section 5.5.2 can be used. The ancestor-ctx-sim of two nodes n1 and n2 is given by the path resemblance measure between the two paths (root, n1) and (root, n2), weighted by the terminological similarity between n1 and n2:

ancestor-ctx-sim(n1, n2) ← pr((root, n1), (root, n2)) × lsim(n1, n2)

5.5.3.2 Child-context similarity

The child-context similarity (child-ctx-sim) is obtained by comparing the nodes' sets of immediate descendants (children), including attributes and subelements. Given a node n1 having n immediate children (n11, …, n1n) and a node n2 having m immediate children (n21, …, n2m), to compute the similarity between these two sets, we:

• Compute the terminological similarity between each pair of children in the two sets,

• Select the matching pairs with maximum similarity values,

• Take the average of best similarity values.

Algorithm 5 illustrates how we compute the child-context similarity.


Algorithm 4 PATH-RESEMBLANCE MEASURE (PRM)

Input: P1, P2, α, ß, λ, δ
Output: pr(P1, P2)
Begin
  // computation of the longest common subsequence
  lcs(P1, P2) ← TB(LCS(P1, P2))
  lcsn(P1, P2) ← |lcs(P1, P2)| / |P1|
  // computation of average positioning
  pos(P1, P2) ← 1 - ((AP(P1, P2) - Av-Optimal-Position(P1, P2)) / (|P2| - 2 × Av-Optimal-Position + 1))
  // computation of LCS with minimum gaps
  gap(P1, P2) ← gaps / (gaps + lcs(P1, P2))
  // computation of length difference
  ld(P1, P2) ← (|P2| - lcs(P1, P2)) / |P2|
  // computation of path resemblance
  pr(P1, P2) ← α lcsn(P1, P2) + ß pos(P1, P2) - λ gap(P1, P2) - δ ld(P1, P2)
  return pr
End
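Putting the four scores together, here is a Python sketch of the path resemblance computation under the assumptions of the LCS sketch above; the positioning and gap computations are straightforward readings of sections 5.5.2.2 and 5.5.2.3, and exact string equality stands in for the terminological-similarity test on path elements, so the result for Example 5.2 comes out close to, but not necessarily identical with, the quoted 0.68.

def path_resemblance(p1, p2, alpha=0.75, beta=0.25, lam=0.25, delta=0.2):
    """Sketch of the pr computation. p1 is the target path and p2 the
    source path, both given as lists of node names."""
    w = lcs_traceback(p1, p2, lcs_table(p1, p2))   # Algorithms 2 and 3
    k = len(w)
    if k == 0:
        return 0.0
    lcsn = k / len(p1)
    # 1-based positions of the matched nodes in p2 (leftmost occurrences)
    positions, start = [], 0
    for name in w:
        start = p2.index(name, start) + 1
        positions.append(start)
    ap = sum(positions) / k                        # actual average position
    av_opt = (k + 1) / 2                           # leftmost, gap-free match
    denom = len(p2) - 2 * av_opt + 1
    pos = 1.0 if ap == av_opt else 1 - (ap - av_opt) / denom
    gaps = (positions[-1] - positions[0] + 1) - k  # unmatched nodes inside
    gap = gaps / (gaps + k)
    ld = (len(p2) - k) / len(p2)
    return alpha * lcsn + beta * pos - lam * gap - delta * ld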


5.5.3.3 Leaf context similarity

Since the effective content of a node is often captured by the leaf nodes of the subtree rooted at that node, we compute the leaf context similarity of two nodes n1 and n2 by comparing their respective leaf sets, leaves(n1) and leaves(n2). Each schema may represent different levels of abstraction and different granularities; thus, to compute the similarity between two leaves l1 ∈ leaves(n1) and l2 ∈ leaves(n2), we propose to compare the contexts in which these leaves appear.

If a leaf node l ∈ leaves(n1), then the context of l is given by the path from n1 to l. The context similarity of two leaves is obtained by comparing such paths using the path resemblance measure. The similarity between two leaf nodes is obtained by combining their context similarity and their terminological similarity as follows:

Leaf-sim(l1, l2) = pr((n1, l1), (n2, l2)) × lsim(l1, l2)

The leaf context similarity of two nodes n1 and n2 is then obtained by:

• computing the leaf similarity between each pair of leaves in the two leaf sets,

• selecting the matching pairs with maximum similarity values,

• taking the average of the best similarity values.

Algorithm 6 illustrates how we compute the leaf-context similarity.

Algorithm 5 CHILD-CONTEXT-SIM

Input: n1, n2, SimMatrix   // n1 and n2 having respectively n and m children
Output: Child-ctx-sim
Begin
  Best-pairs ← select pairs (n1k, n2h, sim) where
    sim = max i∈[1,n], j∈[1,m] {(n1i, n2j, lsim) ∈ SimMatrix}
  Child-ctx-sim ← ( Σ (n1i,n2j,sim)∈Best-pairs sim ) / max(m, n)
  return Child-ctx-sim
End
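A Python sketch of algorithm 5; it assumes a sim dictionary such as the one produced by the algorithm 1 sketch, and it pairs each child of n1 greedily with its best-scoring counterpart among the children of n2, which is one plausible reading of the "best pairs" selection.

def child_context_sim(children1, children2, sim):
    """Algorithm 5 (sketch): average, over max(m, n), of the best lsim
    value found for each child of n1 among the children of n2.
    `sim` maps (name1, name2) pairs to lsim coefficients."""
    if not children1 or not children2:
        return 0.0
    best = [max(sim.get((c1, c2), 0.0) for c2 in children2)
            for c1 in children1]
    return sum(best) / max(len(children1), len(children2))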


Algorithm 6 LEAF-CONTEXT-SIM

Input: n1, n2   // having respectively n and m leaves
Output: Leaf-ctx-sim
Begin
  for each l1i ∈ leaves(n1)
    for each l2j ∈ leaves(n2)
      leafSim(l1i, l2j) ← pr((n1, l1i), (n2, l2j)) × lsim(l1i, l2j)
      SimMatrix ← SimMatrix ∪ (l1i, l2j, leafSim(l1i, l2j))
  Best-pairs ← select pairs (l1k, l2h, leafSim) where
    leafSim = max i∈[1,n], j∈[1,m] {(l1i, l2j, leafSim) ∈ SimMatrix}
  Leaf-ctx-sim ← ( Σ (l1i,l2j,leafSim)∈Best-pairs leafSim ) / max(m, n)
  return Leaf-ctx-sim
End

5.5.3.4 Node similarity

In this section, we compute the similarity of two nodes belonging respectively to the source and target schema graphs by combining all the previous similarity measures (terminological similarity, datatype compatibility and context similarity). Algorithm 7 illustrates the node similarity computation. We distinguish three cases:

• Case 1: The two nodes being compared are atomic nodes (leaves); their respective child contexts and leaf contexts are then assigned the NULL value. The similarity between two atomic nodes is given by the similarity of their respective ancestor contexts weighted by their terminological similarity (lines 6, 7 and 8). Example: the similarity between the two atomic nodes Name (Figure 5-c) and Name (Figure 4-b) is equal to pr((University/Name), (University/Name)) × lsim(Name, Name) = 1.

• Case 2: One of the two nodes being compared is an atomic node, say n1, and the other is a complex node, say n2. Since for the atomic node n1 the child-context and the leaf-context are assigned NULL values, the similarity between n1 and n2 is obtained by first computing their ancestor-context similarity (line 10). Then, since the content of an atomic node is captured by the node itself, while the content of a complex node is captured by its leaf-context, we calculate the average of the terminological similarity between n1 and the nodes belonging to the leaf-context of n2 (lines 11, 12 and 13). The similarity between the two nodes is then obtained as the weighted sum of their ancestor and leaf context similarities (line 14).


• Case 3: Both nodes are complex nodes; their similarity is then the weighted sum of their ancestor context similarity, their child-context similarity and their leaf context similarity (lines 17 to 24).

Once node similarity is computed (algorithm 7), it can be used to correct indirect matches and to set the appropriate operations, as we describe in section 5.5.4.


Algorithm 7 NODE-SIMILARITY

1. Input: n1, n2,α,β,γ 2. Output: sim(n1, n2) 3. Begin 4. 5. Case 1: n1 and n2 are atomic nodes

6. ancestor-ctx-sim(n1, n2)← pr ((root, n1), (root, n2)) × lsim (n1, n2) 7. sim(n1, n2)← ancestor-ctx-sim(n1, n2) 8. return sim(n1, n2)

9. Case 2: n1 is an atomic node and n2 is a non atomic node

10. ancestor-ctx-sim(n1, n2)← pr ((root, n1), (root, n2)) × lsim (n1, n2)

11.   for each l2i ∈ leaves(n2)
12.     compute lsim(l2i, n1)
13.   Leaf-ctx-sim(n1, n2) ← ( ∑ l2i ∈ leaves(n2) lsim(l2i, n1) ) / |leaves(n2)|

14.   sim(n1, n2) ← α × ancestor-ctx-sim(n1, n2) + β × Leaf-ctx-sim(n1, n2)
15.   return sim(n1, n2)

16. Case 3: n1 and n2 are non atomic nodes

17. Step 1: Compute leaf context similarity

18. Leaf-ctx-sim(n1 ,n2)← LEAF-CONTEXT-SIM (n1 ,n2)

19. Step 2: Compute ancestor context similarity

20. ancestor-ctx-sim(n1, n2)← pr ((root, n1), (root, n2)) × lsim (n1, n2)

21. Step 3: Compute child context similarity

22. Child-ctx-sim(n1, n2) ← CHILD-CONTEXT-SIM (n1 ,n2)

23. Step 4: Compute element similarity of n1 and n2

24.   sim(n1, n2) ← α × ancestor-ctx-sim(n1, n2) + β × Leaf-ctx-sim(n1, n2) + γ × Child-ctx-sim(n1, n2)

25. Step 5: Generate the node similarity matrix
26.   for each s in NS
27.     for each t in NT
28.       compute sim(s, t)
29.       Matrixsim ← Matrixsim ∪ (s, t, sim(s, t))
30.   return Matrixsim
31. End
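Stripped of the graph traversal, the scoring in Algorithm 7 is a weighted combination of the context similarities. A minimal Java sketch of Cases 1 and 3 (names and signatures are illustrative, not the prototype's API):

class NodeScore {
    // Case 1: two atomic nodes; ancestor context weighted by terminological similarity.
    static double atomicSim(double pathResemblance, double lsim) {
        return pathResemblance * lsim;
    }

    // Case 3: two complex nodes; alpha, beta and gamma are expected to sum to 1
    // (0.35, 0.35 and 0.3 in Table 5.1).
    static double complexSim(double ancestorCtx, double leafCtx, double childCtx,
                             double alpha, double beta, double gamma) {
        return alpha * ancestorCtx + beta * leafCtx + gamma * childCtx;
    }
}

With the Table 5.1 weights, a node pair with perfect ancestor and leaf contexts but completely dissimilar children would still score 0.35 + 0.35 = 0.7.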


5.5.4 Discovery of node and edge matches

Most schema matching algorithms produce similarity scores between source and target schema nodes, such as the ones we produce in section 5.5.3; however, such a mapping result only partially solves the problem. First, the similarities produced between individual nodes are not enough to produce access paths for retrieving data from the available sources. For example, assume that we let schema 1 of Figure 5-c be the target schema and schema 2 of Figure 4-b be the source schema. Based on the previous matching techniques, we obtain a set of node matches such as the match between University and University, between Researcher and Author, or between Book and Book. Without matches between edges, however, it may be impossible to distinguish authors that wrote books from authors that wrote articles. Intuitively, a source-to-target mapping should describe all the access paths needed to retrieve data facts from the source side of matching nodes. Second, all the produced mappings are one-to-one mappings; complex mappings identified using the type hierarchy have to be incorporated in the matching result, and further complex mappings have to be discovered. For this we proceed in four steps:

Step 1: Compatible nodes identification

While generating mapping elements, we apply a top-down strategy17. At the top level, we establish correspondences between complex nodes of the target and source schemas. Matched complex nodes are called compatible nodes. In Figure 4-b, the complex node University is considered a compatible node since it is matched to node University in the schema graph of Figure 5-c, while node Library is not a compatible node. Then, at the bottom level, guided by the compatible nodes between the target and source schemas, we establish the finer-level correspondences between node and edge sets. Figure 5-e illustrates compatible nodes; visually, compatible nodes are depicted as coloured boxes and dashed lines.

Step 2: Context generation for compatible nodes

After identifying compatible nodes, we proceed to construct a context for each compatible node (the notion of context here differs from the context we defined in section 5.5.1). By taking the edges around a complex node n into account, we cluster a set of nodes and edges with a complex node as a conceptual component in the schema graph. We call this the context of n. The context of a compatible node n consists of a set of nodes and a set of edges among those nodes. For a given compatible node n, we construct such a context as follows (a sketch of the traversal follows the list below):

• include all atomic nodes directly related to the compatible node n,

17 We use the same top-down strategy as in [Xu 03b]. The difference is that in [Xu 03b] this technique is used to discover structural similarity; in our approach, it is used only for mapping generation, the structural matching having already been performed.


• include all non-compatible nodes directly connected to n, with their connected atomic nodes and connected non-compatible nodes; we continue this procedure until we find a compatible node,

• if a directly connected compatible node is also similar to an atomic node, it is also included in the context of n,

• include all nodes having an association relationship with n and their respective context,

• include all containment relationships between nodes in the context of n.
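The first two rules amount to a traversal that expands through non-compatible complex nodes and stops at compatible ones. The following hedged Java sketch covers only these two rules; the Node interface and the set of compatible nodes are assumed abstractions over the schema graph, not the prototype's actual classes.

import java.util.*;

interface Node {
    List<Node> children();
    boolean isAtomic();
}

class ContextBuilder {
    static Set<Node> buildContext(Node n, Set<Node> compatibleNodes) {
        Set<Node> context = new HashSet<Node>();
        Deque<Node> frontier = new ArrayDeque<Node>(n.children());
        while (!frontier.isEmpty()) {
            Node c = frontier.pop();
            if (context.contains(c)) continue;        // already clustered
            if (c.isAtomic()) {
                context.add(c);                       // rule 1: directly related atomic nodes
            } else if (!compatibleNodes.contains(c)) {
                context.add(c);                       // rule 2: non-compatible complex nodes...
                frontier.addAll(c.children());        // ...are expanded further
            }                                         // compatible nodes stop the expansion
        }
        return context;
    }
}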

Example 5.3: Figure 5-e (a) and (b) illustrate respectively the schema graphs of Figure 4-b and Figure 5-c after context construction. In Figure 5-e (b), the context of the compatible node University includes the atomic nodes Name and Location and the non-compatible node Library. The context of the compatible node Article includes the referential node Journal and its context. In Figure 5-e (a), the context of node University includes the compatible node Address, since Address is similar to a leaf node (Location) belonging to the context of the matched node University.

Step 3: Node mappings generation

At this point, we have finished the top-level comparison between source and target schema graphs. We are now ready to detect node and edge matches at the bottom level. For each matching pair (nT, nS) representing two compatible nodes, we make use of the node similarity generated in section 5.5.3 to settle node matches. The following gives examples of how we proceed:

Example 5.4: Let the schema in Figure 5-e (a) be the target schema and the schema in Figure 5-e (b) be the source schema. Considering the two compatible nodes target University (UniversityT) and source University (UniversityS), we first settle the node-set matches between the source and target contexts that hold with the highest node similarity score. As an example, we settle the match pair (NameT, NameS) using a connect operation.

Example 5.5: The target node AddressT is similar to both the source nodes University/Location and Author/Address, with approximately the same score. This is due, in the case of University/Location, to the fact that the ancestor context similarity is high and, in the case of Author/Address, to the fact that the leaf context similarity is high. Since target node University/Address and source node Author/Address belong to non-compatible complex nodes, while target element University/Address and source element University/Location belong to two compatible contexts, a match is derived between source node University/Location and target node University/Address. Moreover, since we decided to map a non-leaf node to a leaf node, a complex mapping with a split operation can be deduced.


Figure 5-e: Source and target schema graphs after context construction

Example 5.6: Assume that we have already discovered that the union of target nodes Journal-article and Proceeding-article matches the source node Article, based on designer type hierarchy analysis. Such a mapping can be confirmed or rejected by the system after analysing the contexts of compatible nodes. In fact, the context of source node Article includes the referential node Journal. Moreover, based on node similarity, the target node Journal-article is compatible with both source nodes Article and Journal. By analysing the context of source node Article, we discover that it more likely matches the target node Journal-article. The complex mapping is then removed and a new mapping is settled between source node Article and target node Journal-article


using a join operation. Note that if node Journal were not present in the source schema graph, the discovered complex mapping would be accepted and a selection operation assigned to it.

Step 4: Access paths generation

With the available correspondences between nodes in the source and target schemas, we further discover matches between edges. The recognition of edge matches starts by locating an edge set et in T. Then, based on the nodes Nt connected by et, we locate a set of nodes that correspond to Nt in S, from which we either locate or derive an edge set es that corresponds to et.

We essentially focus on the discovery of access paths in order to retrieve source data when performing the transformation. Appendix C summarizes the final mapping elements. For each target element t, we first define the access path indicating where the matched source elements are located, then the discovered transformation operation, and finally the conditions under which the mapping element holds true.

5.6 User feedback to filter matching result

It is widely accepted that the matching process cannot be fully automated. Thus it is necessary to incorporate user feedback into the matching task. We limit ourselves to two cases: validating the mapping result and constraint filtering.

5.6.1 Validating mapping result

As we saw in section 5.5.4, we discover complex mappings based on semantic WordNet relationships or on the designer type hierarchy. However, such information may lead to incorrect complex mappings that could be accepted or rejected by the structural matching. In some cases, we notice that structural matching fails to correct complex mappings. Imagine, for example, that the node Article in Figure 4-b is related to an optional atomic node Volume and that Journal-ref is also optional. This means that Article in Figure 4-b is semantically equivalent to the union of Journal-article and Proceeding-article in Figure 5-c. Our structural matching will fail to detect such a case, since the node Journal is included in the context of Article and is terminologically similar to Journal-article. As in most schema matching systems, the user is invited to validate the final mapping result. He has the possibility to add, delete or update mapping elements. The accuracy of a matching system considerably reduces post-matching user effort.

5.6.2 Constraint filtering

Up to now, we have established mappings without taking into account all the constraints imposed on the data (described by the set of constraints in schema graphs). For example, a mapping between a source element Book/Price and a target element Book/Price is deduced and validated. However, the type of the source node could be string, while the type of the target node could be integer. Such situations require user intervention to filter the mapping results (assumption 2 in chapter 4). Since source schema constraints are fixed, to reuse data the user can either reject the mapping or alter (relax) the target schema constraints. As in [Biskup 03], we distinguish two cases that guide the user's involvement in schema mapping while translating source data into the target: (1) datatype compatibility and (2) constraint compatibility, where the first deals with datatype constraints and the second with other constraints such as uniqueness, domain ranges and cardinality.

5.6.2.1 Data type compatibility

We require the following basic restriction for type compatibility:

Let f be a mapping from a target schema T to a source schema S. If (a, b) is a mapping pair of f, the type of a is type(a) and the type of b is type(b), then there must exist an agreed-on (possibly trivial) conversion function c such that c converts values of type(b) to values of type(a).

Such a requirement ensures that we can extract values from source nodes and load them into the corresponding target nodes. To aid in satisfying this requirement, we assume that default coercion routines exist (or can be created when needed by users). Based on such assumptions, we distinguish four cases:

Case 1 (type (a) = type (b)): This case appears to be trivial: since the types are the same, we can simply load the source values into the target values. However, this is not always the case. For example, imagine that we have a mapping between source and target elements Price, both of type integer. It may happen that the source currency is Euros, while the target currency is Swiss Francs. The system could propose a set of data examples extracted from the source to the user, who should decide whether a conversion function is needed. In the case where such a function is needed, the user is invited to propose a conversion function. The mapping is then updated by using an apply operator.

Case 2 (type (a) ⊃ type (b)): The target type has greater discriminating power than the source type. In this case, coercion routines will add arbitrary additional discriminating information to source values. If this is not acceptable, a different source should be found.

Case 3 (type (a) ⊂ type (b)): The target type has weaker discriminating power than the source type. The coercion for this case, when loading from source to target, may or may not be natural. We can truncate strings and round off reals to integers. Since a user may know a better way to do this, the system should give him the possibility to specify his own routines.

Case 4 (type (a) ≠ type (b)): The user may choose either to reject the mapping or to propose alternative solutions.
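The four cases can be read as a dispatch over the relation between type(a) and type(b). The following Java sketch only illustrates that decision logic; the enum and the suggested actions paraphrase the cases above and are not part of the prototype.

class TypeCompatibility {
    enum Relation { EQUAL, TARGET_BROADER, TARGET_NARROWER, INCOMPATIBLE }

    static String decide(Relation r) {
        switch (r) {
            case EQUAL:           // case 1: same type, but a conversion (e.g. currency) may still apply
                return "load directly; ask the user whether a conversion function is needed";
            case TARGET_BROADER:  // case 2: coercion must add discriminating information
                return "coerce with arbitrary additional information, or find another source";
            case TARGET_NARROWER: // case 3: truncate strings, round reals, or use a user routine
                return "apply a default coercion unless the user supplies a better one";
            default:              // case 4: incompatible types
                return "reject the mapping or ask the user for an alternative";
        }
    }
}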


5.6.2.2 Constraints compatibility

The last step checks constraint compatibility between two matched nodes. For this, we could adopt the same four cases outlined in [Biskup 03]:

Case 1: The constraints on the source and target schemas are equivalent; schema entities are matched and nothing further needs to be done.

Case 2: The constraints of the source schema imply the constraints of the target schema but not vice versa; entities are matched and nothing further needs to be done.

Case 3: The constraints of the target schema imply the constraints of the source schema but not vice versa. In this case, we can select only the source elements that satisfy the target constraint.

Case 4: Neither do the constraints of the target schema imply the constraints of the source schema, nor vice versa. This is a combination of cases 2 and 3 and thus, since there is nothing to do for case 2, we act as explained in case 3.

For the moment, we only integrate user validation within our matching prototype system (described in chapter 6). Constraint filtering still needs to be integrated; the cases described above can be considered a first step to this end.

5.7 Evaluation of XML schema matching process

5.7.1 Evaluation technique

In [Do 02a], a number of schema matching systems are compared with respect to the techniques they use for result evaluation. One of those techniques is based on the computation of precision and recall. A schema matching system tries to approximate human reasoning on the similarity of semantic concepts. Thus, the best way to evaluate a schema matcher's output is to compare that output to the real match result obtained by a human. Comparing the automatically derived matches with the real matches results in the sets shown in Figure 5-f. We distinguish two sets: the set of correct matches detected by a human, denoted A, and the set of mappings generated by the automatic matching system, denoted C. It is assumed that the set A is complete, i.e., no semantically correct mappings exist outside of this set, and no semantically incorrect mappings are included in it. Precision and recall, which originate from the information retrieval field, can be computed as quality measures and are calculated as follows:

Precision: the ratio between the number of correct mappings generated by the system and the total number of mappings in C. It gives an indication of how many incorrect mappings have been discovered by the matching system.

precision = |C ∩ A| / |C|


Recall: the ratio between the number of correct mappings generated by the matching system and the total number of correct mappings (i.e., mappings discovered by a human). It gives an indication of how many correct mappings are missed by the matcher.

recall = |C ∩ A| / |A|

In order to evaluate the quality of a matching algorithm, both precision and recall have to be considered. Several combined measures have been proposed, in particular:

F-measure: also originating from the information retrieval field [Van 79], it combines precision and recall, giving them equal importance. It is the most common variant of a more generic combining function, F-measure(α), that parameterizes the relative importance of precision and recall.

F-measure = 2 × (precision × recall) / (precision + recall)

Overall: a combined measure of mapping quality introduced in [Melnik 02], taking into account the post-match effort needed both to remove wrong matches and to add missed ones.

overall = recall × (2 − 1/precision)

(Two overlapping sets: the real matches A and the derived matches C; their intersection C ∩ A contains the correctly derived matches.)

Figure 5-f: Comparing real matches and derived matches


5.7.2 Real world examples

We considered one real-world application, bibliographic data description, in order to evaluate our matching process. The characteristics of the XML schemas used are summarized in Table 5.2, which gives some indication of the complexity of the test schemas. We chose schemas that differ in the number of nodes (schema size) and in their depth (the manner of node nesting). The test schemas present terminological and structural heterogeneities. For running our tests, we let any one of the schema graphs for the considered application be the target and any other schema graph be the source. Different granularities and abstraction levels are used to describe the same real-world concepts. We also consider both flat schemas (maximum depth between 3 and 4) and highly nested schemas (maximum depth between 8 and 10). The test schemas require several indirect matches involving merge/split, union/select and join operators. Table 5.1 summarises the values of the different parameters used within the matching process. We essentially rely on the experimental studies in [Budanitsky 00] and [Carmel 02].

To compute the real matches, two different users were involved, and the average number of matches discovered by the users was considered. The total number of real matches (A) was 1382 (1102 direct matches and 280 complex matches). Our matching algorithm discovered 1312 matches (C), including 1281 correct matches and 31 incorrect matches. The incorrectly classified mapping elements include 18 direct mappings and 13 complex mappings. The correctly recognized mapping elements include 1082 direct matches and 199 indirect matches. For direct matches, precision, recall, F-measure and overall achieved 99%, 94%, 97%, and 93%. For complex matches, precision, recall, F-measure and overall achieved 98%, 71%, 82%, and 70%. Over all matches, the matching algorithm reached 98%, 92%, 95%, and 90% for precision, recall, F-measure and overall respectively.
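As a check, one can plug the figures for all matches into the measures of section 5.7.1:

precision = 1281 / 1312 ≈ 0.976
recall = 1281 / 1382 ≈ 0.927
F-measure = 2 × (0.976 × 0.927) / (0.976 + 0.927) ≈ 0.951
overall = 0.927 × (2 − 1/0.976) ≈ 0.904

which round to the reported 98%, 92%, 95% and 90%.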

Our process successfully found all the complex matches related to merged/split values and join relationships. However, for the union/selection problem, our matching algorithm correctly found 80 of 93 union/selection matches and incorrectly declared 13 extra union/selection operators. For discovering union/selection operators, we essentially rely on type hierarchy analysis, if available; otherwise we make use of WordNet semantic relationships. Among the 80 discovered union/selection relations, 22 were discovered using WordNet. The experimental results show that the combination of terminological and structural matching produces fairly reasonable results, even if the schemas are structurally highly heterogeneous.


Parameter values and descriptions:

Terminological matching: C = 8, K = 1 (experimental values from [Budanitsky 00]).

Data type compatibility: ωl = 0.65, ωt = 0.35 (we give more priority to terminological matching, since datatype compatibility is limited: we do not use all typing constraints).

Path resemblance measure: α = 0.75, β = 0.25, λ = 0.25, δ = 0.2 (experimental values from [Carmel 02]).

Node similarity: α = 0.35, β = 0.35, γ = 0.3 (we give the same importance to ancestor, child and leaf contexts).

Table 5.1: Evaluation parameters

Domain Bibliographic Data

Schemas #1 #2 #3 #4 #5 #6 #7 #8

# Nodes 31 40 54 39 28 38 19 22

# Paths 30 39 53 38 26 37 18 20

Max Depth 4 8 10 6 6 8 4 3

Table 5.2: Characteristics of tested schemas


Quality measures Direct-matches Indirect-matches All matches

Precision 99% 98% 98%

Recall 94% 71% 92%

F-measure 97% 82% 95%

Overall 93% 70% 90%

Table 5.3: Results of real world examples

5.7.3 Comparative study

We made some preliminary comparisons between XML schema matching systems in chapter 3. In this section, we run evaluation comparisons between our proposed solution, Cupid and the Similarity Flooding (SF) system. Cupid, SF and our solution are fairly comparable because they deal with XML structure, they are all schema-based, and they all utilize terminological and structural matching techniques; they only differ in the specific matching techniques they use and in how they combine them. We ran the evaluation in a similar manner as in section 5.7.2, but made use of only four XML schemas. Figure 5-g illustrates the obtained results. From the point of view of the quality of the matching results, our proposed solution outperforms the other systems. We explain this result from two perspectives: first, the discovery of direct matches, and second, the discovery of complex matches.

Direct matches: Most existing schema matching systems focus on generating direct matches. To discover direct matches, rule-based techniques essentially rely on schema information. In this dissertation, we focus on a structural matching technique for XML schemas. Unlike other schema-based approaches, SF does not exploit terminological relationships in an external dictionary but relies entirely on string similarity between element names. The terminological matching of SF was thus unable to discover even synonyms; since our basic goal is to compare the structural matching capabilities of each system, we used the results of our terminological matching algorithm as an initial mapping for both Cupid and SF.

The structure matching algorithms in Cupid and SF motivated our structure matching technique. However, Cupid fails to properly handle two schemas that present large structural differences, and faces real problems when semantically related concepts do not belong to isomorphic structures. This is due to the fact that the structural matching method behind Cupid relies essentially on leaf-context similarity.

SF fails in cases where adjacency information is not preserved, since the basic idea behind the algorithm is that node adjacency contributes to similarity propagation.


Given schemas with varying levels of detail, such as address (city, state, zip) and address, both Cupid and SF return a relatively low similarity measure. The reason is that Cupid is biased towards the similarity of leaf nodes, and SF towards the similarity of adjacent nodes.

When matching schema elements with different contexts, such as researcher (name, address) and Researcher (name, supervisor (name, address)), both Cupid and SF fail to distinguish the researcher name from the supervisor name. Overall, our solution is able to obtain correct mappings because we maximize the use of structure by taking into consideration ancestor-context, child-context and leaf-context similarities.

(Two bar charts, on a 0 to 1 scale, plotting Precision, Recall, F-Measure and Overall for our solution, SF and Cupid.)

Figure 5-g: Comparative study with Cupid and SF

Complex matches: Up to now, most matching approaches have focused only on direct matches; they do not consider complex mappings. In [Do 02b], the authors extend the learner-based matching prototype LSD to discover complex matches. Specifically, for each schema element in a source schema S, they search the space of direct matches (found by running LSD) to find a small set of best mapping candidates, then add these candidates to the target schema T as additional composite elements. Finally, they apply LSD again to match the source schema S and the expanded target schema T. If an element in S matches a composite element in the expanded schema, a complex match is derived. However, this approach is very expensive, since several different combinations must be tested to form the correct composite element and thus the LSD technique must be performed several times. This requires training data for each matchable element for the base matchers and the meta-learner. Another approach that discovers complex matches is described in [Xu 03b], where the authors rely on the manual construction of an application-specific domain ontology and on training over a set of representative data. This approach is unrealistic, since ontology construction is not a trivial problem.

Few rule-based matching approaches have considered complex match discovery. In Cupid, if a leaf node s in the source schema is highly similar to a leaf node t in the target schema, a mapping element between s and t is returned. The resulting mapping may be 1:n, since a source element may map to many target elements. This simple scheme to compute global 1:n mappings is very limited: first, because only splitting values are considered (no n:1 mappings are discovered); second, because this technique frequently leads to wrong complex matches. Imagine that we have a source schema with a node S.University having a child node Address (S.Address), and target nodes T.University and T.Author both having a child node Address. If the similarity between S.University.Address and T.University.Address approximates the similarity between S.University.Address and T.Author.Address, a wrong complex match will be discovered. This is avoided in our approach by considering only compatible nodes' contexts. Finally, no union/select or join complex mappings are considered in Cupid.

As in Cupid, the discovery of complex matches in SF is done after the structural matching. SF makes use of several filters to deduce the list of match candidates from a list of ranked matching pairs. The filtering can be characterized by providing a set of constraints and a selection function that picks the best subset of the multi-mapping under a given selection metric, with a specific threshold as the filtering criterion. For a given similarity threshold, SF selects a subset of the multi-mapping in which all map pairs carry a similarity value at least equal to the threshold. Contrary to Cupid, SF can generate global m:n mappings; however, similarly to Cupid, several wrong mappings are generated and no specific operations are discovered. Our matching solution gives reasonably correct complex mappings because we limit the search scope to compatible nodes' contexts and rely on structure to discover such mappings.

5.8 Conclusion

We presented a framework for automatically discovering both direct mappings and many indirect mappings between source and target schema nodes. Multiple techniques were used, each contributing in a combined way to produce the final mapping result. The techniques considered include terminological matching, datatype compatibility, designer type hierarchy analysis and structural matching. In the proposed framework, we are able to detect complex matches for Join, Selection, Union, Merge, and Split. Such operations are mainly discovered using the designer type hierarchy as well as structural matching.

The proposed structural matching technique is based on the notion of node context. We define three kinds of contexts for a given node: the ancestor context, the child context


and the leaf context. We show through a comparative study with Cupid and Similarity Flooding that the combination of these contexts highly improves the structural matching. Additional operations, such as type conversion and arithmetic computations, are left for future work. We also plan to better incorporate user feedback through constraint filtering, and to test our approach first on a broader set of real-world applications and second by varying the parameters used, to see their impact on the mapping result. As always, there is more work to do, but the results of our approach for both direct and complex mappings are encouraging, yielding about 95% in both recall and precision.


Chapter 6

Automating XML document transformations

This chapter describes how a transformation script can be generated from the previously established mappings in order to transform a source instance into a target one. We begin by introducing the chosen structure of the mapping result. Then we describe how we automatically generate a transformation script based on such a structured mapping result. Finally, we detail implementation issues.

6.1 Global architecture

The major goal of our work is to propose an approach for automating the transformation of XML documents. We focus on two fundamental problems. First, we address the problem of how to automate the identification of semantic relationships between XML schemas. To this end, we have proposed (cf. chapter 5) a matching framework that incorporates several matching criteria and detects direct mappings as well as complex mappings. Second, given such mappings, we need to perform the actual transformation of XML documents from a source schema to a different, yet related, target schema. To this end, we essentially proceed in two steps:

• Structuring the mapping result: Structuring the mapping result is essential for two reasons. First, it is easier to manipulate a structured mapping result, either to modify it or to automatically generate transformation scripts. Second, structuring the mapping result in a standardized (i.e., system-independent) form greatly increases its reusability, especially when schemas evolve.

• Generating XSLT: The second step is concerned with data translation, i.e., implementing the specification given by the mapping result. The result of this step is a transformation script expressed in the XSLT language. For each mapping element, the generated transformation rule has two roles: retrieve and insert. First, it retrieves data instances from the source by performing the required operations. Second, it must correctly insert elements and attributes in their actual places in the target schema, in order to generate a valid XML document. To ensure the validity of the produced XML document, the XSLT generator traverses both the target schema graph and the mapping result in a depth-first manner and generates template rules for each target node. Since we do not assume that the source and target schemas represent the same data with the same constraints, there may be data in the target that is not represented in the source. In some cases, user interaction is required to produce new values for undetermined nodes (i.e., target nodes that cannot be null and for which no match candidates are given). Additional values can also be added automatically in order to ensure the consistency of the target data (e.g., keys and keyrefs are added to ensure target integrity constraints).

In the following, we detail these two steps.

6.2 Structuring mapping result

6.2.1 Mapping structure

The role of the mapping result is to semantically relate facts from the source and target schemas by encapsulating all the information necessary to transform instances of a source schema into instances of a target schema. The nature of the mapping result may be understood by considering different dimensions, each describing one particular aspect. We have been inspired by the only work done on structuring mapping results [Maedche 01], where the authors focus on mapping distributed ontologies. We restrict ourselves to the following five dimensions of a mapping result:

• Entity dimension: specifies schema entities involved in a mapping element.

• Cardinality dimension: This dimension determines the cardinality of a mapping element, ranging from direct mappings (1:1) to complex mappings (m:n). However, as in [Maedche 01], we have found that m:n mappings are not common, so we limit ourselves to 1:n and m:1 mappings. Even when m:n mappings are encountered, they may often be decomposed into m 1:n mapping elements.

• Structural dimension: This dimension reflects the way elementary mapping elements may be combined into more complex mapping elements. Currently, we distinguish two structural relations between mapping elements:

o Composition: specifies that a mapping element is composed of other mapping elements.

o Alternatives: specifies a set of mutually exclusive mapping elements.


• Transformation dimension: This dimension reflects how instances of the source schema are transformed during the mapping process. It includes the identification of transformation operations.

• Constraint dimension: The constraint dimension controls the execution of a mapping element. Constraints act as conditions that must hold in order for the transformation operations to be applied to the instances of the source schema to produce instances of the target schema. As we have shown in chapter 5, some constraints are discovered within the matching process (e.g., JournalArticle in the target schema graph maps to Article in the source schema graph under the condition that the ancestor node Researcher of the target node JournalArticle is the same as the child node Author of the source node Article). Such constraints deal essentially with structural issues, and are called structural constraints. However, users may wish to add more constraints on the data values they want to reuse; we call such constraints instance constraints. We achieve this by allowing the user to adjust the mapping by introducing boolean conditions associated with schema graph nodes. Boolean conditions are labelled predicate expressions. We were inspired by [Tang 01], where the authors define a set of allowable constraint expressions to describe a high-level specification language for structured document transformations. Predicates can take only one of the following forms:

o Existence testing expressions for a node selected by an XPath expression [W3C 99c], [W3C 01g].

o Relation testing expressions between schema graph node sets selected by an XPath expression; e.g., the mapping element is evaluated and the related transformation operations are executed only if the value of the source instances matches a certain value (University/@name = “EPFL”).

o Function constraints to be satisfied by the associated nodes. Currently these may be either distinct or sort.

o Expressions using boolean operators (not, and, or) to combine expressions of the above three types.

As in [Tang 01], XPath expressions appearing in the boolean conditions must be localized, which means the selected node should be within the local context of the current node. Furthermore, “//” is not allowed in the path expression, because it has the potential to refer to a node at an arbitrary distance from the current node and makes the task of efficient translation to an operational program more difficult.
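Such a relation-testing predicate can be evaluated with any standard XPath engine. A minimal Java sketch using the javax.xml.xpath API (the instance file name is hypothetical, and the expression is written as an absolute path for standalone evaluation):

import javax.xml.xpath.*;
import org.xml.sax.InputSource;

class ConstraintCheck {
    public static void main(String[] args) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // relation-testing constraint from the example above
        Boolean holds = (Boolean) xpath.evaluate("/University/@name = 'EPFL'",
                new InputSource("source.xml"), XPathConstants.BOOLEAN);
        System.out.println("constraint holds: " + holds);
    }
}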

Within our approach, a mapping element is created for the different types of relationships between schema graph nodes. A specification of all available mapping elements is given in a mapping schema expressed in the W3C XML Schema language (Figure 6-a).


To actually relate given source and target schema graphs, the mapping process generates an instance of the mapping schema containing a set of mapping elements, each of which encapsulates all the information needed to transform instances of source nodes into instances of target nodes. Based on the five dimensions described above, a mapping result is described as a sequence of mapping elements, each of which has (a data-structure sketch follows the list below):

• a unique identifier (ID).

• a set of source entities involved in the mapping element.

• a set of target entities involved in the mapping element. The number of source and target entities specifies the type of the mapping element (one-to-one, one-to-many, or many-to-one).

• a transformation element including a transformation operation (identified in section 4.4 of chapter 4). As mentioned in chapter 5, sometimes external operations (generally user defined) are further needed. For this, each transformation operation has an optional child element indicating the external resources (if used) and their fundamental characteristics such as name, interface (number and type of arguments) and location.

• a set of constraints that represent the conditions that should be verified in order to execute the transformation operations.

• a set of HasMappings elements allowing the current mapping element to aggregate any number of mapping elements. Those mapping elements are then called one by one and processed in the context of the former.

• a set of AltMappings elements grouping a set of mutually exclusive mapping elements. Those mapping elements are checked and the first mapping element whose conditions hold is executed.
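Read as a data structure, a mapping element can be sketched as the following Java class; the field names are illustrative, mirroring the dimensions above, and do not reflect the prototype's actual classes.

import java.util.*;

class MappingElement {
    String id;                                            // unique identifier
    List<String> sourcePaths = new ArrayList<String>();   // access paths of the source entities
    List<String> targetPaths = new ArrayList<String>();   // target entities; sizes give 1:1, 1:n or n:1
    String operation;                                     // connect, split, merge, join, union/select, ...
    String externalResource;                              // optional user-defined routine (name, interface, location)
    List<String> constraints = new ArrayList<String>();   // localized XPath conditions guarding execution
    List<MappingElement> hasMappings = new ArrayList<MappingElement>(); // composition
    List<MappingElement> altMappings = new ArrayList<MappingElement>(); // mutually exclusive alternatives
}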

6.2.2 Generation of mapping structure

Based on the mapping between source and target schemas established in chapter 5, a mapping generator generates a structured mapping conforming to the structure introduced in section 6.2.1. For each matching node pair, the mapping generator traverses the target schema graph in a depth-first manner and generates a new mapping element. The following describes how mapping elements are generated based on the target node (a condensed sketch follows the steps below):

(1) If the node is the first matchable node encountered, then generate a top-level mapping element, which will serve as the entry point for the translation program.

(2) For each node that is mapped, create a new mapping element (conforming to the structure of Figure 6-a)


• Check the mapping rules (generated in chapter 5). If the node is involved in a direct match, then generate a one-to-one mapping element; if it is involved in a complex match, then generate either a one-to-many or a many-to-one mapping element.

• Assign an ID to the created mapping element18.

• Assign to the source element the access path of the corresponding mapping rule.

• Assign to the target element the actual target node.

• Assign the operation discovered in the mapping rule as the transformation operation.

• Assign the discovered or user-defined constraints.

(3) If the mapped node n is a complex node then:

• For each set of edges starting at n and having an ordered composition, create a HasMappings element for each matchable node connected to n.

• For each set of edges starting at n and having an exclusive disjunction, create an AltMappings element for each matchable node connected to n.

(4) If the node is not mapped to any source node, create a Null mapping element.

18 Mapping element IDs are generated in an incremental, automatic manner.
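A condensed Java sketch of steps (1) to (4), reusing the MappingElement sketch of section 6.2.1; MatchRule and the Node helpers (path(), isAtomic(), children()) are assumed abstractions, not the prototype's API.

import java.util.*;

interface MatchRule { List<String> accessPaths(); String operation(); List<String> constraints(); }

class MappingGenerator {
    private static int counter = 0;

    static MappingElement generate(Node target, Map<Node, MatchRule> rules) {
        MappingElement m = new MappingElement();
        m.id = "M" + (++counter);                      // step (2): incremental IDs (footnote 18)
        MatchRule rule = rules.get(target);
        if (rule == null) return m;                    // step (4): Null mapping element
        m.sourcePaths.addAll(rule.accessPaths());      // access paths from the mapping rule
        m.targetPaths.add(target.path());              // the actual target node
        m.operation = rule.operation();                // discovered transformation operation
        m.constraints.addAll(rule.constraints());      // discovered or user-defined constraints
        if (!target.isAtomic())                        // step (3): recurse over the children;
            for (Node child : target.children())       // exclusive disjunctions would feed
                m.hasMappings.add(generate(child, rules)); // altMappings instead of hasMappings
        return m;
    }
}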


Figure 6-a: An XML schema defining mapping result structure


1. <Mapping sourceSchema="source.xsd" targetSchema="target.xsd">
2.   <TopLevelMapping>
3.     <MappingRef ref="M1"/>
4.   </TopLevelMapping>
5.   <OneToOneMapping ID="M1">
6.     <SourceEntity path="/University"/>
7.     <TargetEntity path="/University"/>
8.     <Transformation>
9.       <Operation name="connect"/>
10.     </Transformation>
11.     <Constraints/>
12.     <HasMappings ref="M2"/>
13.     <HasMappings ref="M3"/>
14.     <HasMappings ref="M4"/>
15.   </OneToOneMapping>
16.   <!-- mapping for the child Name -->
17.   <OneToOneMapping ID="M2">
18.     <SourceEntity path="/University/@name"/>
19.     <TargetEntity path="/University/@Name"/>
20.     <Transformation>
21.       <Operation name="connect"/>
22.     </Transformation>
23.     <Constraints>
24.     </Constraints>
25.   </OneToOneMapping>
26.   <OneToManyMapping ID="M3">
27.     <SourceEntity path="/University/Location"/>
28.     <TargetEntity path="/University/Address/City"/>
29.     <TargetEntity path="/University/Address/State"/>
30.     <TargetEntity path="/University/Address/Zip"/>
31.     <Transformation>
32.       <Operation name="split"/>
33.     </Transformation>
34.     <Constraints/>
35.   </OneToManyMapping>
36.   <OneToOneMapping ID="M4">
37.   …..
38. </Mapping>

Figure 6-b: A structured mapping example

6.2.3 Example of mapping result structuring

Let us again consider the example of Figure 4-b and Figure 5-c, where the schema graph in Figure 5-c represents the target schema and the schema graph in Figure 4-b the source schema. The goal of this example is to specify a mapping between the source and target schemas using the developed mapping schema. A mapping result structure represented according to the mapping schema arranges mapping elements in a hierarchical way (Figure 6-b).

First, the mapping must define the two schemas being mapped. Additionally, one may specify top-level mapping elements which serve as entry points for the translation. In this case, the mapping generator starts by specifying that the mapping element between source node University and target node University is the top-level mapping element (lines 1 to 4). A new mapping element is then created to describe the relation between source node University and target node University (line 5). Since the target node University is a complex node described as a sequence of three children elements Name, Address, and Researcher, the mapping generator adjusts the above mapping element by inserting three HasMappings elements (lines 12 to 14) in order to prepare the processing of the children nodes. For each child node, a new mapping element is created. Lines 17 to 25 describe the mapping between source node Name and target node Name, while lines 26 to 36 describe the one-to-many mapping between source node Location and target node Address. The complete mapping example is detailed in Appendix C.

6.3 XSLT generation

As mentioned in chapter 2, an XSLT program typically relies on XPath expressions to navigate a source tree. Two techniques are essentially used: the pull and push techniques [W3C 99b]. Push means emitting output whenever some conditions are satisfied by source nodes. The pull technique usually refers to a process that walks through an output template and retrieves data from input sources [Tang 01]. An example of the push technique is the use of “match” and “apply-templates” to generate the output by further processing all the children of the matched node. An example of the pull technique is the use of “select” to query the source instance and extract the value of the selected source node. We essentially generate an XSLT program that relies on these two techniques. An XSLT template generally takes the following form:

<xsl:template match="pattern" name="name" mode="mode">
  <!-- do some construction work, during which possibly call/apply other templates -->
</xsl:template>

Basically, we make use of three kinds of XSLT templates: the pattern template, which does not need a name or mode attribute; the named template, which must have a name attribute but does not require a pattern; and the mode template, which must have a mode attribute and requires a pattern as well. Pattern templates can be called by an xsl:apply-templates element without a mode or name attribute. Mode templates can only be called by an xsl:apply-templates element with a mode attribute; they can be used to enforce a particular construction phase by restricting processing to a set of templates that will be called during that phase. Named templates can only be called by xsl:call-template with a matching name attribute. Named templates give the flexibility to call a specific template whenever necessary at any construction phase.

6.3.1 The XSLT generator

An XSLT generator is designed to generate XSLT stylesheets from the structured mapping established in section 6.2. For each matching node pair, the XSLT generator traverses both the target schema graph and the mapping result in a depth-first manner and generates template rules. The following describes the general algorithm used by the XSLT generator in order to translate a mapping result specification into


XSLT templates. Our algorithm takes as input the target schema graph, source and target instances, and the mapping result specification and produces an XSLT stylesheet consisting of a series of template rules, implementing the transformation. The XSLT generator proceeds in three steps:

Step 1: Initialize the translation

In this step, the XSLT generator:

• Tries to locate the first node of the target schema graph having a match candidate (belonging to a non Null mapping element).

• For non-mapped nodes, it acts as follows: if a non-mapped target node is not mandatory (e.g., minOccurs = 0, or optional in the case of attribute nodes), nothing is generated; otherwise just the element tags are generated, with a warning message indicating that a value needs to be added in order to ensure the validity (against the target schema) of the produced instance. Once the first mapped node is localized, a Nodes-To-Process queue is initialized and the template rules generation can begin.

Step 2: Traverse the target schema graph and the mapping result specification in a depth-first manner

• Generate a construction template for the current node by using the current mapping element.

• For each HasMappings child of the current mapping element, adjust the above template by inserting more construction or apply-templates rules whenever necessary.

• For each AltMapping child of the current mapping element, choose the first applicable mapping element and adjust the above template by inserting more construction or apply-template rules whenever necessary.

• For the case of an atomic node, insert a new construction rule; no further processing is needed for this node.

• Add adjusted templates into XSLT Stylesheet.

• If a non mapped node is encountered, act as in step 1.2.

• If the Nodes-To-Process queue is non-empty, extract one node as the current node and loop back to step 2.1; else continue to step 3.

Step 3: Return the generated XSLT Stylesheet.


The detailed algorithm is given in Appendix B. This algorithm implements a tree walker that descends from the root of the target schema graph, processing each target node exactly once.
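The walk can be pictured as a recursive descent over the mapping structure. The following simplified Java sketch reuses the MappingElement sketch of section 6.2.1; constructionTemplate, warningTags and applicable stand in for the real rule-emitting logic and are stubbed here.

class XsltGeneratorSketch {
    static void emit(MappingElement m, StringBuilder stylesheet) {
        if (m.sourcePaths.isEmpty()) {                  // Null mapping element (step 1)
            stylesheet.append(warningTags(m));          // bare tags plus a warning comment
            return;
        }
        stylesheet.append(constructionTemplate(m));     // construction template (step 2)
        for (MappingElement child : m.hasMappings)      // compositions: process each child
            emit(child, stylesheet);
        for (MappingElement alt : m.altMappings)        // alternatives: first applicable wins
            if (applicable(alt)) { emit(alt, stylesheet); break; }
    }

    static String constructionTemplate(MappingElement m) { return ""; } // stub
    static String warningTags(MappingElement m) { return ""; }          // stub
    static boolean applicable(MappingElement m) { return true; }        // stub
}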

6.3.2 Example of XSLT generation

Let us consider the same source and target schema graphs of Figure 4-b and Figure 5-c and the mapping result of Figure 6-b. The two root elements S.University and T.University19 match each other; thus a match template will be defined for deriving instances of T.University from instances of S.University. The XSLT generator traverses the target schema graph; when node T.University is visited, line 4 is generated, specifying that the template will be instantiated when a node satisfying the pattern (i.e., an XPath expression) “University” is encountered. Meanwhile, lines 5 and 9 are generated, specifying that these markups will appear literally in the output document. Next, since the current mapping element incorporates HasMappings elements (meaning that the target element University is a complex node), the XSLT generator prepares for the further construction of the children nodes by generating apply-templates instructions using the XPath expressions in the corresponding mapping elements (lines 6 to 8).

1. <?xml version="1.0"?>
2. <xsl:stylesheet version="1.0"
3.     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
4. <xsl:template match="University">
5.   <University>
6.     <xsl:apply-templates select="@name"/>
7.     <xsl:apply-templates select="Location"/>
8.     <xsl:apply-templates select="Library/Book/Author | Library/Article/Author"/>
9.   </University>
10. </xsl:template>

The XSLT generator continues to process the root’s child node T.Name. T.Name is an attribute, thus the following template is generated.

11. <xsl:template match="@name">
12.   <xsl:attribute name="Name">
13.     <xsl:value-of select="."/>
14.   </xsl:attribute>
15. </xsl:template>

Next, the node T.Address is processed. Since T.Address matches split values (city, state, and zip) of S.Location, the needed enclosing element tags are generated and XSLT instructions that mimic the splitting of a string are introduced (lines 17 to 27).

19 S and T are used to differentiate respectively source and target schema nodes


16. <xsl:template match="Location">
17.   <Address>
18.     <City>
19.       <xsl:value-of select="substring-before(., ',')"/>
20.     </City>
21.     <State>
22.       <xsl:value-of select="substring-before(substring-after(., ','), ',')"/>
23.     </State>
24.     <Zip>
25.       <xsl:value-of select="substring-after(substring-after(., ','), ',')"/>
26.     </Zip>
27.   </Address>
28. </xsl:template>

The algorithm iteratively generates more templates to achieve the remaining constructions for the T.Researcher element and its children. Since the child Publication has no match, the Publication element tags are generated and three apply-templates instructions are generated for Publication's children (T.JournalArticle, T.ProceedingArticle, and T.Book), incorporating the condition that we need the publications of the current author (lines 33, 34, 35).

29. <xsl:template match="Author">
30.   <Researcher>
31.     <xsl:apply-templates select="Name"/>
32.     <Publication>
33.       <xsl:apply-templates select="//Article[Journal-ref][Author/Name = current()/Name]"/>
34.       <xsl:apply-templates select="//Article[not(Journal-ref)][Author/Name = current()/Name]"/>
35.       <xsl:apply-templates select="//Book[Author/Name = current()/Name]"/>
36.     </Publication>
37.   </Researcher>
38. </xsl:template>
39.

To generate JournalArticle elements, the XSLT generator should respect the join condition within the source instance, which leads to the following XSLT template.

40. <xsl:template match="Article[Journal-ref]">
41.   <JournalArticle>
42.     <xsl:apply-templates select="Title"/>
43.     <!-- join: fetch the Journal whose key matches the Journal-ref value -->
44.     <xsl:apply-templates select="//Journal[@id = current()/Journal-ref]"/>
45.   </JournalArticle>
46. </xsl:template>
47.


Assume that the produced output should group publications by their respective writers (Researchers), which means that a distinct constraint is added to Researcher nodes. The following templates are generated in order to implement the distinct function.

<!-- index every Author by its Name -->
<xsl:key name="researcher" match="Author" use="Name"/>

<!-- process only the first Author carrying each distinct Name -->
<xsl:template match="University">
  <University>
    <xsl:apply-templates select="@name"/>
    <xsl:apply-templates select="Location"/>
    <xsl:apply-templates select="(Library/Book/Author | Library/Article/Author)[generate-id() = generate-id(key('researcher', Name)[1])]"/>
  </University>
</xsl:template>

6.4 Implementation issues

In this section, we give an overview of the prototype implementation of the matching process and the XSLT generator. The entire software is written in Java, which makes the system portable to various platforms. The prototype system we have developed incorporates the ideas discussed in the present dissertation. It consists of three modules, the conceptualization toolkit, the matcher engine, and the XSLT generator, corresponding to the phases that we consider fundamental in our work. Additional modules run along the entire matching and XSLT generation process, interacting with the former three modules. The entire prototype system is outlined in Figure 6-c.


(Architecture diagram: a conceptualization toolkit (schema graph generator and schema graph visualizer, accepting W3C XML Schema, DTD, Relax, DSD, …), a matcher engine (mapping discovery and mapping structuring), and an XSLT generator (XSLT script generation and transformation execution), supported by additional modules for evolution, user interfaces and domain knowledge.)
Figure 6-c: Prototype system

6.4.1 Conceptualization toolkit

For the generation of schema graphs from XML schemas, we developed a conceptualization toolkit composed essentially of two tools: the schema graph generator and the schema graph visualizer.

To overcome the limitations of DTDs in modelling XML data, several languages, called schema languages, were introduced for describing the logical structure of XML documents. In our work, we were first interested in the features of the W3C schema language, but we noticed that other proposed schema languages share similar features. For example, DSD 1.0 [Klarlund 00] introduces user-defined types, unordered sequences, etc., and SOX 2.0 [Davidson 99] allows simple type restriction and complex type extension. Each schema language offers


a set of constructs to describe XML information (e.g., W3C XML Schema uses the grouping construct 〈all〉 to specify an unordered sequence; in [Frankston 98], the 〈order='many'〉 attribute specifies that sub-elements can appear in any order). Although they share the same features, these schema languages are heterogeneous at the syntactic level. The schema graph notion is able to normalize such schemas to a uniform representation, thus eliminating syntax differences and making structural and semantic differences between source and target schemas more apparent.

Figure 6-d: The schema graph visualizer

The schema graph generator uses the XML parser Crimson (or Xerces) to generate schema graphs. The generated schema graphs are then displayed graphically using the user interface outlined in Figure 6-d. Such a graphical representation has three major advantages. First, it helps the user (who is not necessarily familiar with the syntax of the XML schema language) to understand what both the source and target schemas describe. Second, since we assume that the user knows the semantics of the target schema, based on such a graphical representation he can either add meta-information that could help the matching process or even establish an initial mapping. Finally, this graphical


representation could be used later in the mapping validation phase, especially when the user is invited to change the target schema (relax some constraints in order to make data reuse possible). Schema changes could then be made graphically, without dealing with XML schema language syntax.

6.4.2 Matcher engine

We developed a matcher engine to perform the matching process described in chapter 5. The matcher engine uses additional modules: an interface for querying WordNet and a graphical user interface that allows users to validate (or modify) the mappings generated by the system (Figure 6-e). We also provide the possibility to interact with a mapping evolution module. This module will store the generated mappings and synchronize them with changes in the source and target schemas. Tracking mapping evolution allows mappings to be reused; the goal is to avoid reapplying the mapping process every time the schemas change. The implementation of such an evolution module is one of our priority future directions.

6.4.3 Execution engine

The execution engine is responsible for parsing the mapping result and automatically generating XSLT transformation scripts. It also permits the translation of data instances (XML files) valid against a source schema into instances valid against a target schema. Many open-source implementations around XML specifications have been developed in Java, and they are available for download over the Internet. We designed an XML format to represent the mapping result so that we can benefit from the availability of XML parsers. Because we need an in-memory representation for the target schema and the mapping result, we chose JDOM to load our specifications into JDOM trees. JDOM is not compatible with the W3C DOM specifications, but it is optimized for Java, so it avoids the heavy memory footprint due to DOM's language-neutrality. Our target language is XSLT; we use Xalan as our XSLT engine.
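Executing the generated stylesheet then reduces to standard JAXP calls, with Xalan picked up as the TransformerFactory implementation when it is on the classpath. A minimal sketch (the file names are hypothetical):

import javax.xml.transform.*;
import javax.xml.transform.stream.*;

class TransformRunner {
    public static void main(String[] args) throws TransformerException {
        // compile the stylesheet produced by the XSLT generator
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("generated.xsl"));
        // translate a source instance into a target instance
        t.transform(new StreamSource("source.xml"), new StreamResult("target.xml"));
    }
}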

6.5 Conclusion

We presented a framework for structuring the mapping result and automatically generating XSLT programs. The proposed mapping result consists of several mapping elements, each of which covers five dimensions: the entity dimension, specifying the source and target schema entities involved in the mapping element; the cardinality dimension, specifying the nature of the mapping element (direct or complex); the transformation dimension, specifying the transformation operations used; the structural dimension, describing the combination of mapping elements; and the constraint dimension, specifying both structural-level and instance-level constraints. Based on such a structure, we showed that we are able to automatically generate an XSLT program that translates a source instance into a target one.


We further developed a prototype system that incorporates a conceptualization toolkit for generating and graphically representing schema graphs for W3C XML Schema. In the future, we plan to use this toolkit to integrate other XML schema languages, which could enlarge the scope of data exchange and reuse between XML-enabled Web applications. We also provide a graphical user interface in order to support the matching process and its validation. In the future, we plan to improve this user interface to better incorporate user feedback.

Figure 6-e: User interface for the matcher engine


Chapter 7

Conclusions and future directions

Schema matching is a critical step in structured document transformations. Manual matching is expensive and error-prone; it is therefore important to develop techniques to automate the matching process. With the rapid proliferation and growing size of Web applications, especially with the intensive use of the Internet and the adoption of XML technologies, automatic schema matching becomes ever more important. In this dissertation, we contributed both to understanding the matching problem in the context of structured document transformations and to developing a matching tool whose output serves as the basis for the automatic generation of transformation scripts. In this chapter, we enumerate the key contributions and discuss directions for future research.

7.1 Key Contributions

The main contributions of this dissertation include an approach that exploits schema-level data to semi-automatically generate source-to-target mappings between source and target XML schemas, and another approach that uses such mappings to automatically generate transformation scripts able to translate instances valid against the source schema into instances valid against the target schema.

XML schema matching has attracted much attention and several matching systems exist. Matching solutions were developed using several kinds of heuristics, but usually without a prior formal definition of the problem they are solving. In this dissertation, our first contribution is a framework that formally defines the XML schema matching problem and explains the workings of the subsequently developed matching algorithms. We have combined and extended the existing related research and defined a framework that incorporates:

• A model for XML schemas. We have proposed a directed labelled graph with constraint sets, where nodes are connected to each other through directed labelled edges and constraints are defined over nodes and edges. This model serves to make clear the XML schema features used within the matching


process. In addition, our model can be used to normalize XML schema languages to a uniform representation, hiding syntax differences and making structural and semantic heterogeneities more apparent. Finally, we proposed a graphical representation for schema graphs that supports the user in better understanding the content of the source and target XML schemas without facing syntactic issues.

• A set of similarity functions. Similarity functions relate nodes in the source and target schema graphs. The proposed similarity functions include terminological, constraint and structural similarity.

• A source-to-target mapping algebra. We extended the relational algebra so as to formally define a set of operators over the source and target schemas' nodes that can be used to combine nodes into complex mapping expressions.

• A set of assumptions that underlie the proposed solution. The first assumption relates the innate semantic similarity of nodes to their syntactic similarity. The second assumption requires that the user knows the target schema requirements and is able to decide whether a mapping is correct. The third assumption states that schema labels are descriptive names.

• A formal representation of a source-to-target mapping. The representation of a mapping element in a source-to-target mapping clearly declares both the semantic correspondences and the access paths used to access and load data from the source into the target schema.

• A set of auxiliary information sources: we limit ourselves to the use of WordNet.
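As a minimal sketch of the schema model described in the first item above (illustrative names, not the prototype's API), a schema graph can be encoded as follows:

from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class SchemaNode:
    label: str                       # element or attribute name
    datatype: str = "xs:anyType"     # XML Schema datatype
    constraints: Set[str] = field(default_factory=set)   # e.g. {"minOccurs=1"}

@dataclass
class SchemaGraph:
    nodes: Dict[str, SchemaNode] = field(default_factory=dict)
    # directed labelled edges (source, label, target); edge labels such as
    # "child-of", "attribute-of" or "association" are illustrative
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def children(self, name: str) -> List[str]:
        # all nodes reachable from `name` through a "child-of" edge
        return [dst for (src, label, dst) in self.edges
                if src == name and label == "child-of"]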

We show that most types of input and output of the XML schema matching problem can be explained in terms of the above notions. A formal framework is important because it makes clear to users what exactly we mean by a matching solution and consequently helps them evaluate the applicability of our solution to a given matching scenario.

The second major contribution of the dissertation is a solution to semi-automatically create semantic mappings. We have proposed a framework to discover mapping elements between source and target schemas' nodes. The framework we built to generate source-to-target mappings is extensible: as new techniques for schema matching become available, the framework can incorporate them alongside the techniques discussed in this dissertation. The key innovations we made in developing this solution can be summarized as follows:

• We proposed a terminological similarity measure that relates schema nodes based on the meaning inferred from their names. Most current schema matching algorithms suggest the use of WordNet to evaluate semantic relatedness between schema nodes, but generally give few, if any, details about how they exploit the WordNet hierarchy. Generally speaking, two words are semantically related if their respective synsets are related by a path within the WordNet hierarchy. To make the output more precise and significant, we adapted the work of Hirst and St-Onge, which limits the semantic relatedness between two words by only allowing paths that are not too long and that do not change direction too often. Furthermore, we do not limit ourselves to the computation of a numerical similarity: we also extract semantic relationships such as equivalent, narrower than, broader than and related to. Such relationships can then either be used to derive complex mappings or to help the user in the mapping validation phase (a simplified sketch follows this list).

• We proposed a constraint similarity measure that relates schema nodes based on their respective constraints. We limit ourselves to the use of datatypes. To compare datatypes, we use the XML Schema type hierarchy in such a way that the compatibility of two types depends on their respective positions within the hierarchy. We also make a modest use of type constraints expressed by facets.

• We proposed a structural similarity measure that relates schema nodes based on the similarity of the structural context in which they appear. As mentioned throughout this dissertation, while terminological and constraint-based matching are well developed (essentially by the database community) and are applicable to matching XML schemas, the proposed structural matching methods remain insufficient. We first outlined the limitations of current solutions through the study of the Cupid and Similarity Flooding systems. We then proposed a structural matching technique that considers the context of schema nodes (defined by their ancestors and descendants in the schema graphs). The idea behind our solution is to represent each node context as a path and then rely on a path resemblance measure to compare such contexts. To achieve this, we relax the strong matching notion frequently used in solving the query answering problem. To compute the path resemblance measure, we use dynamic programming algorithms (a sketch of one such measure follows this list).

• We proposed an algorithm that combines all the above similarity measures and produces a mapping result that clearly defines the source and target mapped entities, the required transformation operations, and the conditions under which the mapping can be executed. For generating such a mapping result, we proposed a top-down strategy. At the top level, we establish correspondences between abstract components, so-called compatible components, of the source and target schemas. Each component is composed of a set of nodes and the relationships among these nodes. The algorithm determines the composition of abstract components for the target and source schemas based on the similarity measures described above. Then, at the bottom level, we deduce finer-grained correspondences between nodes and relationships.


• We showed that our solution can also naturally handle complex matches, the types of matches that are common in practice but have not always been addressed by previous work. We detect indirect matches for joins through the use of association relationships. We showed that the use of either the designer's type hierarchy in XML schemas (abstract types, substitution groups and type inheritance) or WordNet semantic relationships allows discovering indirect matches for union and selection. Context analysis and multiple match candidates allow discovering indirect matches for merging and splitting.

• We presented an empirical evaluation of the approach to automating schema mapping. We used a data set from a real-world application (bibliographic data) to evaluate our schema matching solution. For both direct and complex matches, our approach yields about 95% for both recall and precision. We further compared our work with the Cupid and Similarity Flooding algorithms and showed that it outperforms them.
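As announced in the terminological item above, the following simplified Python sketch uses NLTK's WordNet interface to detect the extra-strong and strong relations of the Hirst and St-Onge scheme and to label the relationship found; the medium-relation path search is given in full in Appendix A, and this function is illustrative rather than the dissertation's implementation.

from nltk.corpus import wordnet as wn   # requires nltk and its 'wordnet' corpus

def strong_relation(word1, word2):
    # Detect extra-strong and strong relations between two schema labels
    # and return a labelled relationship, or None if no direct WordNet
    # link exists.  The medium-relation path search (Appendix A) is
    # deliberately omitted here.
    if word1 == word2:
        return "equivalent"              # extra-strong: literal repetition
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            if s1 == s2:
                return "equivalent"      # the two labels share a synset
            if s2 in s1.hypernyms():
                return "narrower-than"   # word1 is more specific than word2
            if s2 in s1.hyponyms():
                return "broader-than"    # word1 is more general than word2
            if s2 in s1.also_sees() or s2 in s1.similar_tos():
                return "related-to"      # horizontal link
    return None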
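For the structural item above, one standard dynamic-programming realization of a path resemblance measure is a longest-common-subsequence ratio. The sketch below assumes node contexts are already represented as lists of labels; it is one possible instantiation, not necessarily the dissertation's exact measure.

def path_resemblance(p1, p2):
    # Compare two node contexts represented as label paths using the
    # longest common subsequence (dynamic programming).  A score of 1.0
    # means one path embeds the other in order; 0.0 means no shared
    # labels.  This relaxes strong (exact) path matching: gaps are allowed.
    m, n = len(p1), len(p2)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if p1[i - 1] == p2[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return lcs[m][n] / max(m, n) if max(m, n) else 1.0

# e.g. path_resemblance(["University", "Library", "Book", "Author"],
#                       ["University", "Researcher"])  ->  0.25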

The third contribution of this dissertation is an approach to automatically generate XSLT programs. We first proposed a model for structuring the mapping result according to a mapping schema that describes the five defined dimensions of a mapping result: the entity, cardinality, structural, transformation and constraint dimensions. The proposed model is flexible in the sense that it allows the user both to validate the mapping result and to add further constraints in a transparent manner. We then proposed an algorithm that automatically generates XSLT scripts based on the above mapping structure. For each matching node pair, the algorithm traverses the target schema graph in a depth-first manner and progressively generates the required template rules. A minimal sketch of this generation step follows.
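To give the flavour of the generation step, the following Python sketch (with illustrative names) emits one template rule for a direct mapping element; the real generator (Appendix B) additionally handles conditions, transformation operations and the recursive depth-first descent over the target schema graph.

def template_rule(source_path, target_name):
    # Build one XSLT template for a direct mapping: construct the target
    # element and copy the source value into it.
    return (f'<xsl:template match="{source_path}">\n'
            f'  <{target_name}>\n'
            f'    <xsl:value-of select="."/>\n'
            f'  </{target_name}>\n'
            f'</xsl:template>')

print(template_rule("University/Name", "Name"))

Here University/Name and Name are taken from the bibliographic example of Appendix C.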

7.2 Future directions

We have made significant steps toward understanding and developing solutions for the XML schema matching problem; however, substantial work remains toward the goal of achieving a comprehensive matching solution. In the future, we would like to extend the research in this dissertation; the following lists the directions we would like to pursue.

7.2.1 Terminological and constraint-based matching

Throughout this dissertation, we essentially focused on structural matching; however, much work remains to be done in both terminological and constraint-based matching.

Concerning terminological matching, we made the assumption that element names are descriptive and belong to WordNet as a source of lexical information. However, this assumption does not always hold in practice. Similar schema elements in different schemas often have names that differ due to the use of abbreviations, acronyms, punctuation, etc. Techniques to solve this issue have to be integrated into the terminological similarity module. The authors of [Madhavan 01] propose a normalization step that solves this problem. As part of the normalization step, they perform tokenization (parsing names into tokens based on punctuation, case, etc.), expansion (identifying abbreviations and acronyms) and elimination (discarding prepositions, articles, etc.). In each of these steps they use a thesaurus that can contain both common-language and domain-specific references. Other approaches rely on soundex (an encoding of names based on how they sound rather than how they are spelled) [Bell 01]. A small sketch of such a normalization step follows.
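A minimal Python sketch of the three normalization steps, with a hypothetical abbreviation table standing in for the thesaurus of [Madhavan 01]:

import re

# Hypothetical abbreviation table; a real system would use a thesaurus
# with common-language and domain-specific entries.
ABBREVIATIONS = {"univ": "university", "addr": "address", "no": "number"}
STOPWORDS = {"of", "the", "a", "an", "in", "for"}

def normalize(name):
    # tokenization: split on camelCase, underscores, hyphens and digits
    tokens = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    tokens = [t.lower() for t in tokens]
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]   # expansion
    return [t for t in tokens if t not in STOPWORDS]     # elimination

# e.g. normalize("UnivAddr_No") -> ['university', 'address', 'number']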

Moreover, exploiting synonym and hypernym relationships requires the use of thesauri or dictionaries such as WordNet. In addition, terminological matching can also make use of domain- or enterprise-specific dictionaries containing common names, synonyms, abbreviations, etc. Such specific dictionaries require a substantial effort to build in a consistent way; however, this effort is well worth the investment, especially for schemas with a relatively flat structure, where dictionaries provide the most valuable matching hints [Rahm 01b]. The authors of [Chen 95], for example, propose an algorithm for the automatic generation of thesauri that can serve as online search aids for scientific databases or electronic community systems. Furthermore, such specific dictionaries can also support schema creation by enabling common names to be accessed and used, for instance within a schema editor.

In constraint-based matching, we limit ourselves to the analysis of datatype compatibility. However, several constraints such as uniqueness and integrity constraints could give additional hints about matching candidates. Techniques such as the ones described in [Miller 01] and [Li 00] could be used. Moreover, our datatype constraint analysis is limited to some facets; for example, we do not consider pattern comparison. Research in regular expressions and pattern matching [Embley 99] could be a good candidate to extend our work. The sketch below illustrates the datatype compatibility analysis.
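For illustration, datatype compatibility based on relative position in the XML Schema type hierarchy can be sketched as below; the hierarchy fragment and the numeric weights are assumptions, not the dissertation's calibration.

# Fragment of the XML Schema datatype derivation hierarchy (child -> parent)
XSD_PARENT = {
    "xs:integer": "xs:decimal",
    "xs:long": "xs:integer",
    "xs:int": "xs:long",
    "xs:decimal": "xs:anySimpleType",
    "xs:normalizedString": "xs:string",
    "xs:token": "xs:normalizedString",
    "xs:string": "xs:anySimpleType",
}

def datatype_compatibility(t1, t2):
    # Score two datatypes by their relative position in the hierarchy:
    # 1.0 if identical, 0.8 if one derives from the other, else 0.0.
    def ancestors(t):
        seen = []
        while t in XSD_PARENT:
            t = XSD_PARENT[t]
            seen.append(t)
        return seen
    if t1 == t2:
        return 1.0
    if t1 in ancestors(t2) or t2 in ancestors(t1):
        return 0.8
    return 0.0

# e.g. datatype_compatibility("xs:int", "xs:decimal") -> 0.8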

7.2.2 Efficient user interaction

Experience suggests that fully automated schema matching is infeasible, especially for complex matches that involve transformation operations. Schema matching solutions must interact with the user in order to arrive at final, correct mappings. One of the most important open problems for schema matching is therefore efficient user interaction. Specific user input could, for example, be requested interactively at critical points where it is maximally useful, not just at pre- and/or post-match; this makes post-match editing much easier, since bad guesses made without user guidance need not be corrected and do not propagate. Moreover, the rapid growth of Web data-sharing systems will further exacerbate the user interaction problem. In fact, even if a near-perfect matching solution existed, the user would still have to validate the huge number of produced mapping results. The key is then to discover how to minimize user interaction while maximizing the impact of the feedback.


7.2.3 Mapping maintenance

In dynamic environments like the Web, data sources may change not only their data but also their schemas and their semantics. Such changes must be reflected in the mappings. Mappings left inconsistent by a schema change have to be detected and updated. Manually maintaining mappings is expensive and not scalable; it is therefore important to develop techniques for automatically adapting mappings as schemas evolve. Structuring mappings can be considered a first step to this end. Despite the importance of this problem, it has not been widely addressed in the literature. Recently, the authors of [Velegrakis 03] proposed a mapping adaptation technique, but they essentially deal with relational schema changes.

7.2.4 Performance evaluation

We would like to evaluate our schema matching solution using a broad set of applications and data. We measured the accuracy of the proposed matching solution in terms of precision and recall. Such measures are important for two reasons: (1) the higher the accuracy, the greater the reduction in user effort, and (2) they facilitate the comparative study of multiple schema matching solutions. As also suggested in [Doan 02a], it is important to quantify the reduction in user involvement that a matching solution achieves. Moreover, we plan to run further experimental studies to show the impact of our chosen parameters on the matching process. The sketch below recalls the definitions used.
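For reference, precision and recall are the standard retrieval measures [Van 79] applied here to mapping elements; a minimal sketch:

def precision_recall(returned, correct):
    # returned: set of mapping elements produced by the matcher;
    # correct:  set of mapping elements in the reference (gold) mapping.
    true_positives = len(returned & correct)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall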


Bibliography

[Aberer 03] K. Aberer, P. Cudre-Mauroux, and M. Hauswirth. The chatty web: Emergent semantics through gossiping. In Proceedings of the Twelfth International World Wide Web Conference, (WWW2003), Budapest, Hungary, May 2003.

[Abiteboul 97a] S. Abiteboul, S. Cluet, and T. Milo. Correspondence and Translation for heterogeneous data. In Proceedings of the International Conference on Database Theory (ICDT), pages 351-363, 1997.

[Abiteboul 97b] S. Abiteboul. Querying Semi-Structured Data. In International Conference on Database Theory (ICDT), LNCS, no. 1186, pages 1-18. Springer, 1997.

[Abiteboul 97c] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J.L. Wiener. The Lorel Query Language for Semistructured Data. International Journal on Digital Libraries (JODL), 1(1):68-88, 1997.

[Abiteboul 99] S. Abiteboul, S. Cluet, T. Milo, P. Mogilevsky, J. Simeon, and S. Zohar. Tools for Data Translation and Integration. IEEE Data Engineering Bulletin, March 1999.

[Abiteboul 00] S. Abiteboul, P. Buneman, D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann Publishers, 2000.

[Adali 96] S. Adali, K. Candan, Y. Papakonstantinou, and V. Subrahmanian. Query caching and optimization in distributed mediator systems. In Proceedings of ACM SIGMOD Conference on management of Data, Montreal, Canada, 1996.

[Aho 72] A.V. Aho and J.D.Ullman. The theory of parsing, translation and compiling, Volume 1: Parsing. Prentice-Hall, 1972.


[Andritsos 02] P. Andritsos, R. Fagin, A.Fuxman, L.M.Haas, M.A. Hernandez, C-T.Ho, A. Kementsietsidis, R.J.Miller, F.Naumann, L.Popa, Y.Velegrakis, C.Vilarem, L-L.Yan. Schema Management. IEEE Data Eng. Bull. 25(3): 32-38, 2002.

[Amer 02] S. Amer-Yahia, S. Cho, D. Srivastava. Tree Pattern Relaxation. In EDBT'02, 2002.

[ARIADNE 00] Alliance of Remote Instructional Authoring and Distribution Networks for Europe (Ariadne). Available at http://ariadne.unil.ch/

[Arnon 93] D.S. Arnon. Scrimshaw: a language for document queries and transformations. Electronic Publishing, 6 (4), pages 385-396, 1993.

[Atzeni 97] P. Atzeni and R. Torlone. MDM: A Multiple-Data Model Tool for the Management of Heterogeneous Database Schemes. In ACM SIGMOD Conference, pages 528-531, May 1997.

[Bapst 98] F. Bapst, R. Bruegger, A. Zramdini, R.Ingold. Integrated Multi- Agent Architecture for Assisted Document Recognition. In Document Analysis Systems II, Series in Machine Perception and Artificial Intelligence, vol. 29, World Scientific, pages 301-317, 1998.

[Batini 86] C. Batini, M. Lenzerini, S.B Navathe. A Comparative Analysis of Methodologies for Database Schema Integration. ACM Computing Surveys, Vol.18, No. 4, December 1986.

[Belaïd 97] A. Belaïd, Y. Chenevoy. Constraint Propagation vs Syntactical Analysis for the Logical Structure Recognition of Library References. N. A. Murshed and F. Bortolozzi (Eds.), Lecture Notes in Computer Science 1339, BSDIA'97, Springer, pages 153- 164, Curitiba, Brazil, November 2-5, 1997.

[Bell 01] G.S. Bell, A. Sethi. Matching records in a national medical patient index. CACM, 44(9):83-88, 2001.

[Becker 02] O. Becker. Transforming XML on the fly. XML Europe 2003. http://www.idealliance.org/papers/dx_xmle03/papers/04-02-02/04-02-02.

[Bergamaschi 99] S. Bergamaschi, S. Castano, and M. Vicini. Semantic integration of semi-structured and structured data sources. ACM SIGMOD Record, 28(1):54-59, 1999.


[Berger 96] Berger-Levrault/AIS. Balise Reference Manual, 1996.

[Berlin 01] J. Berlin and A. Motro. Autoplex: Automated discovery of content for virtual databases. In Proceedings of the Conference on Cooperative Information Systems (CoopIS), 2001.

[Berlin 02] J. Berlin and A. Motro. Database schema matching using machine learning with feature selection. In Proceedings of the Conference on Advanced Information Systems Engineering (CAiSE), 2002.

[Berners-Lee 01] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, May 2001.

[Bernstein 00a ] P. Bernstein, A. Halevy and R.Pottinger. A vision for management of complex models. ACM Sigmod Record, 29(4): 55- 63, 2000.

[Bernstein 00b] P.A. Bernstein, E. Rahm. Data warehouse scenarios for model management. In Proceedings of the 19th International Conference on Entity Relationship Modeling. Lecture Notes in Computer Science, Vol. 1920, Springer, pages 1-15, 2000.

[Biskup 03] J. Biskup, D.W.Embley. Extracting information from heterogeneous information sources using ontologically specified target views. Information systems, 28 (3), pages 169-212, 2003.

[Bonhomme 96] S. Bonhomme, C. Roisin. Interactively Restructuring HTML Documents. Computer Network and ISDN Systems, vol. 28, num. 7-11, pages 1075-1084, May 1996.

[Bonhomme 98] S. Bonhomme. Transformation des documents structurés : une combinaison des approches explicite et automatique. PhD Thesis, Université Joseph Fourier, Grenoble, 1998.

[Boshernitsan 01] M. Boshernitsan. HARMONIA: a flexible framework for constructing interactive language-based programming tools. Technical report CSD-01-1149, University of California, Berkeley, 2001.

[Boukottaya 03] A. Boukottaya. Enhancing Course Reusability through XML Schemas Integration. Revue "Document Numérique", thématique : "Les nouvelles facettes du document numérique dans l'éducation", 2003.


[Bouquet 03] P. Bouquet, L. Serafini, and S. Zanobini. Semantic coordination: a new approach and an application. In Proceedings of the 2nd International Semantic Web Conference (ISWC), Sundial Resort, October 2003.

[Budanitsky 00] A. Budanitsky and G. Hirst. Semantic distance in WordNet: an experimental, application-oriented evaluation of five measures, 2003.

[Buneman 97] P. Buneman. Semistructured Data (invited tutorial). In ACM Symposium on Principles of Database Systems (PODS), pages 117-121, Tucson, Arizona, 1997.

[Cali 02] A.Cali, D.Calvanese, G.Giacomo and M.Lenzerini. On the expressive power of data integration systems. In Proceedings of 21st International Conference on Conceptual Modeling (ER 2002), pages 338-350, Tampere, Finland, 2002.

[Carmel 02] D.Carmel, N. Efraty, G.M.Landau, Y.S.Maarek, and Y.Mass. An Extension of the vector space model for querying XML documents via XML fragments. Second Edition of the XML and IR Workshop, In SIGIR Forum, Volume 36 Number 2, Fall 2002.

[Castano 99] S. Castano and V. De Antonellis. A Schema Analysis and Reconciliation Tool Environment for Heterogeneous Databases. In Proceedings of IDEAS'99 International Database Engineering and Applications Symposium, Montreal, Canada, August 1999.

[Castano 03] S. Castano, A. Ferrara, and S. Montanelli. h-match: an Algorithm for Dynamically Matching Ontologies in Peer-based Systems. First International Workshop on Semantic Web and DataBases (SWDB), 2003.

[Chalupsky 00] H. Chalupsky. Ontomorph: A translation system for symbolic knowledge. In Principles of KnowledgeRepresentation and Reasoning, 2000.

[Chawathe 94] S. Chawathe, H. García-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS Project: Integration of heterogeneous information sources. In 16th meeting of the IPSJ, pages 7-18, Tokyo, Japan, 1994.

[Chen 95] H. Chen, T. Yim, D. Fye and B. Schatz. Automatic thesaurus generation for an electronic community system. Journal of the American Society for Information Science, volume 46, pages 175- 193, 1995.


[Clark 01] J. Clark. ``TREX - Tree Regular Expressions for XML", 2001. http://www.thaiopensource.com/trex.

[Clifton 97] C. Clifton, E. Housman, and A. Rosenthal. Experience with a combined approach to attribute-matching across heterogeneous databases. In Proceedings of the IFIP Working Conference on Data Semantics (DS-7), 1997.

[Cluet 98] S. Cluet, C. Delobel, J. Simeon, and K. Smaga. Your mediators need data conversion. In ACM SIGMOD Conference, pages 177- 188, 1998.

[Cobena 02] G. Cobena, S. Abiteboul, and A. Marian. Detecting changes in XML Documents. In ICDE, 2002.

[Cole 90] F.C.Cole and H.Brown. Editing a structured document with classes. University of Kent Computing Laboratory Report No. 73, 1990.

[Cordy 90] J.R. Cordy, and E. Promislow. Specification and automatic prototype implementation of polymorphic objects in TURING using the dialect processor. In Proceedings of IEEE International Conference on Computer Languages, New Orleans, 1990.

[Davidson 97] S. Davidson and A. Kosky. WOL: A language for database transformations and constraints. In Proceedings of the International Conference on Data Engineering., pages 55-56, April 1997.

[Davidson 99] A. Davidson, M. Fuchs, M. Hedin. Schema for Object-Oriented XML 2.0, W3C. 1999. Available at (http://www.w3.org/TR/NOTE-SOX)

[Delestre 98] N. Delestre and B. Rumpler. Architecture d'un Serveur Multimédia pour les Sciences de l'Ingénieur. In Proceedings of NTICF'98, INSA de Rouen, France, 18-20 November 1998.

[DeRose 94] S.J.DeRose and D. G. Durand. Making Hypermedia Work. A Users’s Guide to Hytime. Kluwer Academic Publishers, 1994.

[Do 02a] H.H. Do, S. Melnik and E.Rahm. Comparison of schema matching evaluations. In Proceedings of the second International Workshop on Web Databases, 2002.

[Do 02b] H.H. Do, E. Rahm. COMA: A system for flexible combination of schema matching approaches. In VLDB, 2002.


[Doan 00] A.H. Doan, P. Domingos, A. Halevy. Learning source descriptions for data integration. In Proceedings of the WebDB Workshop, pages 81-92, 2000.

[Doan 01] A.H. Doan, P. Domingos, A. Halevy. Reconciling schemas of disparate data sources: a machine learning approach. In Proceedings of the ACM SIGMOD Conference, pages 509-520, 2001.

[Doan 02a] A. Doan, J. Madhavan, P. Domingos, and A. Halevy. Learning to map between ontologies on the semantic web. In Proceedings of the Eleventh International World Wide Web Conference, (WWW2002), Honolulu, Hawaii, USA, May 2002.

[Doan 02b] A.H. Doan. Learning to map between structured representations of data. PhD thesis, University of Washington, 2002.

[Dobre 03] In Proceedings of the 22nd International Conference on Conceptual Modeling (ER2003), Springer Verlag, pages 534-547, Chicago, Illinois, 2003.

[Drew 93] P. Drew, R. King, D. McLeod, M. Rusinkiewicz, and A. Silberschatz. Report of the Workshop on Semantic Heterogeneity and Interoperation in Multidatabase Systems. SIGMOD Record, 22(3):47-56, 1993.

[Durand 02] D. Durand and P. Caton. Semantic Heterogeneity Among Document Encoding Schemes. Final Report for NIST Federal Assistance Contract 60NANB0D0115. Scholarly Technology Group. Brown University, 2002.

[Duschka 97] O.M. Duschka and M.R. Genesereth. Answering recursive queries using views. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), Tucson, Arizona, 1997.

[Elmagarmid 90] AK. Elmagarmid and C.Pu. Guest editors’ introduction to the special issue on heterogeneous databases. ACM Computing Survey, 22(3):175-178, 1990.

[Embley 99] D.W. Embley, D.M.Campbell, Y.S. Jiang, S.W.Liddle, D.W. Lonsdale, Y.K.Ng and R.D.Smith. Conceptual-model-based data extraction from multiple record Web pages. Data and Knowledge Engineering, 31(3), pages 227-251, 1999.

[Evens 88] M.Evens. Relational Models of the lexicon. Cambridge University Press, 1988.


[Exo 93] Exoterica Corporation. OmniMark Programmer’s Guide, 1993.

[Fagin 03] R. Fagin, P.G. Kolaitis, L.Popa. Data Exchange: Getting the core. In Proceedings of the twenty-second ACM SIGMOD-SIGACT- SIGART symposium on Principles of Database Systems, pages 90–101, California, 2003.

[Friedman 99] M. Friedman, A. Levy, and T. Millstein. Navigational plans for data integration. In Proceedings of the National Conference on Artificial Intelligence, 1999.

[Fellbaum 98] C. Fellbaum. WordNet : An Electronic Lexical Database. MIT press, Cambridge, 1998.

[Feng 93] A. Feng and T. Wakayama. SIMON: a grammar-based transformation system for structured documents. Electronic Publishing, 6 (4), pages 361-372, 1993.

[Fernandez 98] M. Fernandez, D. Florescu, J. Kang, A. Levy, and Dan Suciu. Catching the boat with strudel: experience with a web-site management system. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages 414-425 (1998).

[Frankhauser 93] P. Fankhauser and Y. Xu. MarkItUp! An incremental approach to document structure recognition. Electronic Publishing, December 1993.

[Frankston 98] C. Frankston, H. S. Thompson. XML-Data reduced, Internet Document. Available at http://www.ltg.ed.ac.uk/~ht/XMLData- Reduced.htm , 1998.

[Furuta 87] R. Furuta. Concepts and Models for Structured Documents. In Proceedings of the workshop "structure for documents", Savoie, France, January 1987.

[Furuta 88] R. Furuta, V.Quint and J.Andre. Interactively editing structured documents. Electronic Publishing, Volume 1, pages 19-44, 1988.

[Furuta 90] R. Furuta, editor. Proceedings of the International Conference on Electronic Publishing, Document Manipulation and Typography. The Cambridge Series on Electronic Publishing, Cambridge University Press, Gaithersburg, Maryland, 1990.


[Garcia 96] M. Garcia-Solaco, F. Saltor and M. Castellanos. Semantic heterogeneity in multidatabase systems. In Bukhres, O.A and Elmagarmid, A.K., editors, Object Oriented Multidatabase Systems: A Solution for Advanced Applications, chapter 5, pages 129-202. Prentice-Hall, 1996.

[Gecseg 84] Ferenc Gécseg, Magnus Steinby. Tree automata. Akadémiai Kiadó, Budapest, 1984.

[Goh 97] C. H. Goh. Representing and Reasoning about Semantic Conflicts in Heterogeneous Information Systems. Ph.D. Thesis, MIT Sloan School of Management, 1997.

[Haas 99] L. Haas, R. Miller, B. Niswonger, M.T. Roth, P. Schwarz, and E. Wimmers. Transforming Heterogeneous Data with Database Middleware: Beyond Integration. IEEE Data Engineering Bulletin, 22(1), March 1999.

[Halevey 01] A.Y. Halevy. Answering queries using views: A survey. VLDB Journal, 10(4):270-294, 2001.

[Halevy 03a] A.Y. Halevy, Z.G. Ives, P. Mork, I. Tatarinov. Piazza: Data Management Infrastructure for semantic Web Application. In Proceedings of the WWW Conference, May 2003.

[Halevy 03b] A. Halevy, Z. Ives, D. Suciu, and I. Tatarinov. Schema mediation in peer data management systems. In Proceedings of ICDE’03, Bangalore, India, March 2003.

[Hammer 95] J. Hammer, D. McLeod. On the resolution of representational diversity, 1995.

[Harbo 93] K. Harbo. CoST Version 0.2. Copenhagen SGML Tool. Technical report. Department of Computer Science, University of Copenhagen, 1993.

[Hardt 02] M. Gross-Hardt. Querying Concepts: an approach to retrieve XML data by means of their data types. In 17. Workshop Logische Programmierung (WLP), TU Dresden, 2002.

[HaXML] XML transformations in Haskell. Available at http://www.fact- index.com/h/ha/haxml.html

[Heflin 01] J.Heflin and J.Hendler. A portrait of the Semantic Web in action. IEEE Intelligent System, 16:54–59, May 2001.


[Hirst 98] G. Hirst and D. St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. In Christiane Fellbaum (editor), WordNet: An electronic lexical database, Cambridge, MA: The MIT Press, 1998.

[Hosoya 00] H. Hosoya and B.C. Pierce. XDuce: a typed XML processing language (preliminary report). In Proceedings of International Workshop on the Web and Databases, 2000.

[Hugh 97] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: a database management system for semistructured data. SIGMOD Record, 26(3):54-66, 1997.

[Hull 90] R. Hull and M. Yoshikawa. ILOG: Declarative Creation and Manipulation of Object Identifiers. In Proceedings of the International Conference VLDB, pages 455-468, August 1990.

[Hull 96] R.Hull and G.Zhou. A framework for supporting Data Integration Using the Materialized and Virtual Approaches. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 481-492, 1996.

[Irons 61] E.T. Irons. A syntax directed compiler for ALGOL 60. Communications of the ACM. 4 (1), pages 51-55, 1961.

[ISO 86] ISO, Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML), ISO 8879:1986, International Organisation for Standardisation, Geneva, 1986.

[ISO 91] ISO/IEC, Information Technology - Text and Office Systems - Document Style Semantics and Specification Language (DSSSL). ISO/IEC DIS 10179, 1991.

[Ives 99] Z.Ives, D. Florescu, M. Friedman, A. Levy, and D.Weld. An adaptive query execution system for data integration. In Proceedings of SIGMOD, 1999.

[Jaakkola 97] J. Jaakkola, P. Kilpelainen, and G. Linden. TranSID: an SGML tree transformation language. In Proceedings of the fifth Symposium on Programming Languages and Software Tools, pages 72-83, 1997.

[Jayavelu 00] S.V. Jayavelu, Kailavani and Saravanan. XML Schema Interoperability Using XSLT. Unpublished online report. Available at http://www.cs.tamu.edu/course-info/cpsc608/hihin/fall00/projects /NASA-team3/projreport.htm


[Jelliffe 00] R. Jelliffe, ``Schematron'', Internet Document, May 2000. (http://www.ascc.net/xml/resource/schematron/)

[Kashyap 95] V. Kashyap and A. Sheth. Semantic and Schematic Similarities between Objects in Databases: A context based approach. 1995.

[Keller 84] S.E.Keller, J.A. Perkins, T. F. Payton and S. P. Mardinly. Tree transformation techniques and experiences. In Proceedings of the ACM SIGPLAN’ 84 Symposium on Compiler Construction, 19(6), pages 190-201, 1984.

[Kilpelainen 90] P. Kilpelainen, G. Linden, H. Mannila and E. Nikunen. A structured document database system. In Furuta [Furuta 90], pages 139-151, 1990.

[Kim 91] C. Kim and J. Seo. Classifying schematic and data heterogeneity in multidatabase systems, IEEE Computer, 24 (12): 12-18, 1991.

[Klarlund 00] N. Klarlund, A. Moller, M. I. Schwatzbach. DSD: A Schema Language for XML. In Proceedings of the 3rd ACM Workshop on Formal Methods in Software Practice, 2000.

[Krishnamurthy 91] R. Krishnamurthy, W. Litwin and W. Kent. Language features for interoperability of databases with schematic discrepancies. In Proceedings of the ACM SIGMOD Conference, pages 40-49, 1991.

[Kuikka 91] E. Kuikka and M. Penttonen. Designing a syntax-directed text processing system. In Proceedings of the Second Symposium on Programming Languages and Software Tools, Pirkkala, Finland. Technical Report A-1991-5, pages 191-204, University of Tampere, 1991.

[Kuikka 96] E. Kuikka. Transformation of Structured Documents. Processing of Structured Documents Using a Syntax-directed Approach. PH.D. thesis, Computer Science and Applied Mathematics, University of Kuopio, 1996.

[Lacher 01] M. Lacher and G. Groh. Facilitating the exchange of explicit knowledge through ontology mappings. In Proceedings of the 14th International FLAIRS Conference, 2001.

[Larson 89] J.A. Larson, S.B. Navathe, R. ElMasri. A Theory of attribute equivalence in databases with application to schema integration. IEEE Transactions on Software Engineering, 16(4):449-463, 1989.


[Lee 00] D. Lee, W. Chu. Comparative analysis of six XML schema languages. SIGMOD Record (ACM Special Interest Group on Management of Data), Vol 29, num 3, pages 76-87, 2000.

[Lee 01] K-H. Lee, M-H. Kim, K.C. Lee, B-S. Kim, M-Y. Lee. Conflict Classification and Resolution in Heterogeneous Information Integration based on XML Schema. 2001.

[Leinonen 03] P. Leinonen. Automating XML Document Structure Transformations. In Proceedings of the ACM Symposium on Document Engineering, France, 2003.

[Lenzerini 02] M. Lenzerini. Data integration: A Theoretical Perspective. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS), pages 233-246, 2002.

[Levy 93] D. M. Levy. Document reuse and document systems. Electronic publishing, vol. 6(4), pages 339-348, December, 1993.

[Levy 96] A.Y. Levy, A. Rajaraman, and J.J. Ordille. Querying heterogeneous information sources using source descriptions. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB’96), pages 251-262, 1996.

[Lewis 68] P.M. Lewis and R. E. Stearns. Syntax-directed transduction. Journal of the ACM, 15 (3), pages 465-488, 1968.

[Li 00] W. Li and C. Clifton. SEMINT: A tool for identifying attribute correspondence in heterogeneous databases using neural networks. Data and Knowledge Engineering, 33:49–84, 2000.

[Linden 97] G. Linden. Structured document transformations. Report A-1997- 2. CS Department of University of Helsinki, Finland, 1997.

[Ludäscher 02] B. Ludäscher, I. Altintas, A. Gupta. Time to Leave the Trees: From Syntactic to Conceptual Querying of XML. In International Workshop on XML Data Management, Prague, Czech Republic, LNCS 2490, Springer, pages 148-168, March 2002.

[Madhavan 01] J. Madhavan, P.A. Bernstein, and E. Rahm. Generic schema matching with Cupid. In Proceedings of the International Conference on Very Large Databases (VLDB), 2001.

[Madhavan 02] J. Madhavan, P.A. Bernstein, P.Domingos and A.Y. Halevy. Representing and reasoning about mappings between domain models. In Proceedings of the eighteenth National Conference on Artificial Intelligence, pages 80-86, AAAI Press, 2002.


[Maedche 01] A. Maedche, B. Motik, N. Silva, and R. Volz. MAFRA: A MApping FRAmework for distributed ontologies. 2001.

[Mamrak 93] S.A. Mamrak, J. Barnes and C.S. O'Connell. Benefits of automating data translation. IEEE Software, 10(4), pages 82-88, 1993.

[Mapforce 04] http://www.altova.com/products_mapforce.html

[McGuinness 00] D. McGuinness, R. Fikes, J. Rice, and S. Wilder. The Chimaera ontology environment. In Proceedings of the 17th National Conference on Artificial Intelligence, 2000.

[Meijer 99] E. Meijer and M. Shields. A functional language for constructing and manipulating XML documents. Available at http://www.cse.ogi.edu/~mbs/pub/xmlambda/

[Melnik 00] S. Melnik and S. Decker. A Layered Approach to Information Modeling and Interoperability on the Web, 2000.

[Melnik 02] S.Melnik, H.Garcia-Molina, E.Rahm. Similarity Flooding: A versatile Graph Matching Algorithm and its Application to Schema Matching. In Proceedings of the 18th International Conference on Data Engineering, 2002.

[MID 95] MID/Information Logistics Group. MetaMorphosis Reference Manual, 1995.

[Miller 95] G.A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11), pages 39-41, 1995.

[Miller 00] R. Miller, L. Haas, and M.A. Hernandez. Schema mapping as query discovery. In VLDB, pages 77-88, 2000.

[Miller 01] R. Miller. The Clio Project: managing heterogeneity. ACM SIGMOD Record 30(1):78-83, 2001.

[Milo 98] T. Milo and S. Zohar. Using Schema Matching to Simplify Heterogeneous Data Translation. In Proceedings of the International Conference VLDB’ 98, pages 122-133, 1998.

[Molina 97 ] H. García-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, and J. Widom, "The TSIMMIS approach to mediation: Data models and Languages". In Journal of Intelligent Information Systems, 1997.


[Mork 01] P. Mork, A. Halevy, and P. Tarczy-Hornoch. A model for data integration system of biomedical data applied to online genetic databases. In Proceedings of the Symposium of the American Medical Informatics Association, 2001.

[Murata 97] Makoto Murata. Transformation of documents and schemas by patterns and contextual conditions. Lecture Notes in Computer Science, 1293:153-169, 1997.

[Murata 98] Makoto Murata. Data model for document transformation and assembly (extended abstract). Principle on Digital Document Processing, 1998.

[Murata 00] M. Murata. ``RELAX (REgular LAnguage description for XML)'', Aug. 2000. http://www.xml.gr.jp/relax.

[Murata 01] M. Murata, D. Lee, and M. Mani. Taxonomy of XML Schema Languages using Formal Language Theory. In Extreme Markup Languages, Canada 2001.

[Myers 86] E.W. Myers. Incremental alignment algorithms and their applications. TR 86-22, Department of Computer Science, University of Arizona, 1986.

[Naiman 95] C.F. Naiman and A.M. Ouskel. A classification of semantic conflicts in heterogeneous database systems. Journal of Organizational Computing, 5(2): 167-193, 1995.

[Nam 02] Y.K.Nam, J.Goguen and G.Wang. A Metadata Integration Assistant Generation for Heterogeneous Databases. In Proceedings Conference on Ontologies, Databases, and Applications of Semantics for Large Scale Information Systems, Springer, Lecture Notes in Computer Science, 2519, pages 1332- 1344, 2002.

[Nejdl 02] W. Nejdl et al. EDUTELLA: a P2P networking infrastructure based on RDF. In Proceedings of the Eleventh International World Wide Web Conference (WWW2002), Honolulu, Hawaii, USA, May 2002.

[Noy 00] N.F. Noy and M.A. Musen. PROMPT: Algorithm and tool for automated ontology merging and alignment. In Proceedings of the National Conference on Artificial Intelligence (AAAI), 2000.

[Noy 01] N.F. Noy and M.A. Musen. Anchor-PROMPT: Using non-local context for semantic Matching. In Proceedings of the Workshop on Ontologies and Information Sharing at the International Joint Conference on Artificial Intelligence (IJCAI), 2001.


[Palopoli 98] L. Palopoli, D. Sacca, and D. Ursino. Semi-automatic, semantic discovery of properties from database schemes. In Proceedings of the International Database Engineering and Applications Symposium (IDEAS-98), pages 244–253, 1998.

[Papakonstantinou 96] Y. Papakonstantinou, S. Abiteboul, H. García-Molina. Object fusion in mediator systems. In Proceedings of the International Conference on Very Large Data Bases (VLDB), Bombay, India, 1996.

[Parent 98] C.Parent, S. Spaccapietra. Issues and approaches of database integration. Communications of the ACM, 41(5):166-178, 1998.

[Pietriga 01] E. Pietriga, J-Y. Vion-Dury, and V. Quint. VXT: a visual approach to XML transformations. In Proceedings of the ACM Symposium on Document Engineering, 2001.

[Popa 02] L. Popa, Y. Velegrakis, R. J. Miller, M. A. Hernandez, and R. Fagin. Translating Web Data. In Proceedings VLDB’ 02, pages 598-609, 2002.

[Quint 94] V. Quint, I. Vatton. Making structured documents active. Electronic Publishing - Origination, Dissemination and Design, 7(2):55-74, June 1994.

[Quint 95] V. Quint, I. Vatton, J. Paoli. Active Structured Documents as User Interfaces, vol. User Interfaces for Symbolic Computations, Springer Verlag, September, 1995.

[Rahm 01a] E. Rahm and P.A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 10:334-350, 2001.

[Rahm 01b] E. Rahm and P.A. Bernstein. On matching schema automatically. Microsoft Research Publications, 2001. Available at http://www.research.microsoft.com/pubs.

[Rekik 01] Y.A. Rekik. Modélisation et manipulation des documents structurés. PhD Thesis No. 2396, Swiss Federal Institute of Technology (EPFL), 2001.

[Rodriguez 01] P. Rodriguez-Gianolli. Data Integration for XML based on Schematic Knowledge. PhD thesis, 2001.

[Routledge 02] N. Routledge, L. Bird and A. Goodchild. UML and XML Schema, ADC'2002, 2002.


[Ryutaro 01] I. Ryutaro, T. Hideaki, and H. Shinichi. Rule induction for concept hierarchy alignment. In Proceedings of the 2nd Workshop on Ontology Learning at the 17th International Joint Conference on AI (IJCAI), 2001.

[Salminen 04] A. Salminen. XML Family of Languages: Overview and Classification of W3C Specifications. Available at http://www.cs.jyu.fi/~airi/xmlfamily.html

[Shasha 90] D. Shasha, J. Wang, K. Zhang, and F. Shih. Fast algorithms for the unit cost editing distance between trees. In Journal of Algorithms, pages 581-621, 1990.

[Sheth 90] A.P.Sheth, J.A. Larson. Federated database systems for managing distributed, heterogeneous and autonomous databases. ACM Computer Survey, 22(3):183-236, 1990.

[Sheth 92] A. Sheth and V. Kashyap. So far (schematically) yet so near (semantically). In Hsiao, D.K., Neuhold, E.J., and Sacks-Davis, R., editors, Proceedings of the IFIP WG 2.6 Database Semantics Conference on Interoperable Database Systems (DS-5), pages 283-312, Lorne, Victoria, Australia, North Holland, 1992.

[Shu 77] N.C. Shu, B.C. Housel, R.W. Taylor, S.P. Ghosh, and V.Y. Lum. EXPRESS: A Data EXtraction, Processing, and REStructuring System. ACM TODS, 2(2), pages 134-174, 1977.

[Silva 02] N. Silva and J. Rocha. Ontology mapping using multiple dimension approach. In Proceedings of the International Conference on Fuzzy Systems and Soft Computational Intelligence in Management and Industrial Engineering, Istanbul, Turkey, 2002.

[Stohr 99] Stohr, T., Muller, R. and Rahm, E.. An integrative and uniform model for metadata management in data warehousing environments. In Proceedings of the International Workshop on Design and Management of Data Warehouses, Germany, June 1999.

[STX 02] Streaming Transformations for XML. Available at http://www.fact-index.com/s/st/stx.html

[Su 01] H.Su, H.Kuno, E.A.Rundensteiner. Automating the transformation of XML Documents. In Proceedings of the ACM Symposium on Document Engineering, 2001.


[Suciu 97] D. Suciu, ed. In Proceedings of the Workshop on Management of Semi-Structured Data (in conjunction with SIGMOD/PODS), Tucson, Arizona, 1997. Available at http://www.research.att.com/~suciu/workshop-papers.html.

[Sun 03] X.L. Sun and E. Rose. Automated Schema Matching Techniques: An Exploratory Study, 2003. Available at http://iims.massey.ac.nz/research/letters.

[Tamino] Tamino Database. Available at http://www.softwareag.com/tamino.

[Tang 01] X. Tang and F. Tompa. Specifying transformations for structured documents. In Proceedings of the 4th International Workshop on Web and Databases (WebDB 2001), pages 67-72, 2001.

[Tirri 94] H. Tirri and G. Linden. ALCHIMIST: an object-oriented tool to build transformations between heterogeneous data representations. In Proceedings of the Twenty-Seventh Annual Hawaii International Conference on System Sciences, Vol II, pages 226-235, 1994.

[UDB] The Unified database for human genome computing. Available at http://bioinformatics.weizmann.ac.il/udb.

[Van 79] C.J. Van Rijsbergen. Information retrieval. Second Edition, 1979.

[Velegrakis 03] Y. Velegrakis, R. J. Miller, L. Popa, and J. Mylopoulos. ToMAS: Mapping Adaptation under Evolving Schemas. In 2nd Hellenic Symposium for Data Management (HDMS), 2003.

[Vernet 02] A. Vernet. XML transformation languages. Available at http://www.scdi.org/~avernet/misc/xml-transformation

[W3C 98a] Extensible Markup Language (XML) 1.0, W3C Recommendation, 1998. Available at http://www.w3.org/TR/REC-XML

[W3C 98b] Cascading Style Sheets, level 2, W3C Recommendation, 1998. Available at http://www.w3c.org/TR/REC-CSS2.

[W3C 99a] Resource Description Framework (RDF), W3C Recommendation, 1999. Available at http://www.w3.org/TR/1999/REC-rdf-syntax- 19990222


[W3C 99b] XSL Transformations (XSLT) 1.0, W3C Recommendation, 1999. Available at http://www.w3.org/TR/1999/REC-xslt-19991116

[W3C 99c] XML Path Language (XPATH) 1.0, W3C Recommendation, 1999. Available at http://www.w3.org/TR/1999/REC-xpath- 19991116

[W3C 99d] Cascading Style Sheets, level 1, W3C Recommendation, 1998. Available at http://www.w3c.org/TR/REC-CSS1.

[W3C 01a] XML Schema Part 0: Primer, W3C Recommendation, 2001. Available at http://www.w3.org/TR/2001/REC-xmlschema-0- 20010502/

[W3C 01b] XML Schema Part 1: Structures, W3C Recommendation, 2001. Available at http://www.w3.org/TR/2001/REC-xmlschema-1- 20010502/.

[W3C 01c] XML Schema Part 2: Datatypes, W3C Recommendation, 2001. Available at http://www.w3.org/TR/2001/REC-xmlschema-2- 20010502/

[W3C 01d] XML Linking Language (XLINK) 1.0, W3C Recommendation, 2001. Available at http://www.w3.org/TR/2001/REC-xlink- 20010627/

[W3C 01e] Extensible Stylesheet Language (XSL) 1.0, W3C Recommendation, 2001. Available at http://www.w3.org/TR/2001/REC-xsl-20011015/

[W3C 01f] Synchronized Multimedia Integration Language (SMIL) 2.0, W3C Recommendation, 2001. Available at http://www.w3.org/TR/2001/REC-smil20-20010807/

[W3C 01g] XML Path Language (XPATH) 2.0, W3C Working draft 2.0, 2001. Available at http://www.w3.org/TR/xpath20

[W3C 02a] The Extensible Hypertext Markup Language (XHTML): A reformulation of HTML 4 in XML 1.0. W3C Recommendation, 2002. Available at http://www.w3.org/TR/2002/REC-xhtml1- 20020801/

[W3C 02b] Xframes, W3C working draft, 2002. Available at http://www.w3.org/TR/2002/WD-xframes-20020806/


[W3C 02c] Web Services Conversation language (WSCL) 1.0. W3C Working Draft, 2002. Available at http://www.w3.org/TR/2002/NOTE- wscl10-20020314/

[W3C 03a] XPointer Framework, W3C Recommendation, 2003. Available at http://www.w3.org/TR/2003/REC-xptr-framework-20030325/

[W3C 03b] Scalable Vector Graphics (SVG) 1.1, W3C Recommendation, 2003. Available at http://www.w3.org/TR/2003/REC-SVG11-20030114/

[W3C 03c] Web services Description Language (WSDL) 1.2, part 1: core. W3C working draft, 2003. Available at http://www.w3.org/TR/2003/WD-wsdl20-20031110/

[W3C 04a] RDF Vocabulary Description Language 1.0 (RDF Schema), W3C Recommendation, 2004. Available at http://www.w3.org/TR/2004/REC-rdf-schema-20040210/

[W3C 04b] OWL Web Ontology Language Overview, W3C Recommendation, 2004. Available at http://www.w3.org/TR/2004/REC-owl-features-20040210/

[Walsh 01] N.Walsh, and L. Muellner. DocBook: The Definitive Guide. O'Reilly & Associates, Inc., ISBN : 2-84177-091-5, 2001.

[Xio 01] R. Xio, T. Dillon, E. Chang and L. Feng. Modeling and Transformation of Object Oriented Conceptual Models into XML Schema. DEXA 2001, LNCS 2113, pages 795-804, 2001.

[Xu 03a] L. Xu. Source Discovery and Schema Mapping for Data Integration. PhD Dissertation, July 2003.

[Xu 03b] L. Xu and D.W. Embley. Discovering Direct and Indirect Matches for Schema Elements. In Proceedings of the 8th International Conference on Database Systems for Advanced Applications (DASFAA'03), 2003.

[XSLWIZ 01] XSLWIZ. Available at http://www.induslogic.com/products/xslwiz.html

[Zhang 95] K.Zhang, J. Wang, and D.Shasha. On the editing distance between undirected acyclic graphs. International journal of Foundations of Computer Science, pages 395-407, 1995.


Appendixes


Appendix A

Terminological matching (Hirst and St-Onge algorithm)

The terminological matching technique proposed in chapter 5 makes use of three kinds of relations: the extra-strong relation, which holds between a word and its literal repetition; the strong relation, which holds in one of the three scenarios explained in section 5.2 of chapter 5; and the medium relation, which occurs when there is an allowable path connecting the synsets associated with each word. The definition of allowable paths uses a classification of WordNet synset relations into upward, downward, and horizontal links. Table A.1 gives examples of WordNet synset relations and their respective directions. Figure A-a further illustrates examples of allowable and non-allowable paths.

Relation       Direction
Also see       Horizontal
Attribute      Horizontal
Cause          Down
Entailment     Down
Holonymy       Down
Hypernymy      Up
Hyponymy       Down
Meronymy       Up
Pertinence     Horizontal
Similarity     Horizontal

Table A.1: Classification of WordNet relations into directions
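For concreteness, Table A.1 can be encoded directly as a small lookup table to drive the allowable-path check of the algorithm below; the Python encoding and relation keys are illustrative.

# Table A.1 as a lookup table: WordNet synset relation -> link direction
WORDNET_DIRECTION = {
    "also_see": "horizontal",
    "attribute": "horizontal",
    "cause": "down",
    "entailment": "down",
    "holonymy": "down",
    "hypernymy": "up",
    "hyponymy": "down",
    "meronymy": "up",
    "pertinence": "horizontal",
    "similarity": "horizontal",
}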



Figure A-a: Allowable and non-allowable paths in the Hirst and St-Onge algorithm. (a) Patterns of paths between synsets that are allowable in medium relations and (b) patterns of paths that are not allowable. (Each arrow denotes one or more synset relations in the same direction.)

Based on the definition of allowable paths, and taking the maximum length of allowable paths to be four, we distinguish seven cases, as illustrated in Figure 5-b of chapter 5. In the following, we present our algorithm for terminological matching, detailing each of these cases:


// Extra-strong relation
if (word1 = word2)
    return 24;

// Strong relation
S1 = synsetsOf(word1);
S2 = synsetsOf(word2);
foreach s1 in S1
    H1 = horizontalSynsets(s1); U1 = upwardSynsets(s1); D1 = downwardSynsets(s1);
    foreach s2 in S2
        H2 = horizontalSynsets(s2); U2 = upwardSynsets(s2); D2 = downwardSynsets(s2);
        if (s1 = s2) return 16;                       // shared synset
        if (s1 isIn H2 or s2 isIn H1) return 16;      // horizontal link
        if (s1 isIn U2 or s1 isIn D2) return 16;      // s2 points up or down to s1
        if (s2 isIn U1 or s2 isIn D1) return 16;      // s1 points up or down to s2

// Medium relation: search allowable paths from each synset of word1
// towards the synsets of word2; medStrong fills listOfWeight
foreach s1 in S1
    medStrong(0, 0, 0, s1, S2);
return max(listOfWeight);

medStrong(state, distance, chdir, from, To)
    if (from isIn To and distance > 1)            // path found
        listOfWeight.add(8 - distance - chdir);
        return true;
    if (distance >= 5)                            // path too long
        return false;
    if (state = 0)   // no move made yet
        U = upwardSynsets(from); D = downwardSynsets(from); H = horizontalSynsets(from);
        retU = retD = retH = false;
        foreach u in U  retU = retU or medStrong(1, distance+1, 0, u, To);   // try upward (state 1)
        foreach d in D  retD = retD or medStrong(2, distance+1, 0, d, To);   // try downward (state 2)
        foreach h in H  retH = retH or medStrong(3, distance+1, 0, h, To);   // try horizontally (state 3)
        return retU or retD or retH;
    if (state = 1)   // first move was up: may continue up, or change direction down or horizontally
        U = upwardSynsets(from); D = downwardSynsets(from); H = horizontalSynsets(from);
        retU = retD = retH = false;
        foreach u in U  retU = retU or medStrong(1, distance+1, 0, u, To);   // again upward (state 1)
        foreach d in D  retD = retD or medStrong(4, distance+1, 1, d, To);   // downward (state 4)
        foreach h in H  retH = retH or medStrong(5, distance+1, 1, h, To);   // horizontally (state 5)
        return retU or retD or retH;
    if (state = 2)   // already moved down: may continue down or go horizontally
        D = downwardSynsets(from); H = horizontalSynsets(from);
        retD = retH = false;
        foreach d in D  retD = retD or medStrong(2, distance+1, 0, d, To);   // downward (state 2)
        foreach h in H  retH = retH or medStrong(6, distance+1, 0, h, To);   // horizontally (state 6)
        return retD or retH;
    if (state = 3)   // already moved horizontally: may go down or continue horizontally
        D = downwardSynsets(from); H = horizontalSynsets(from);
        retD = retH = false;
        foreach d in D  retD = retD or medStrong(7, distance+1, 0, d, To);   // downward (state 7)
        foreach h in H  retH = retH or medStrong(3, distance+1, 0, h, To);   // horizontally (state 3)
        return retD or retH;
    if (state = 4)   // already moved up and down: may only continue horizontally
        H = horizontalSynsets(from);
        retH = false;
        foreach h in H  retH = retH or medStrong(4, distance+1, 0, h, To);   // horizontally (state 4)
        return retH;
    if (state = 5)   // already moved up and horizontally: may go down or continue horizontally
        D = downwardSynsets(from); H = horizontalSynsets(from);
        retD = retH = false;
        foreach d in D  retD = retD or medStrong(4, distance+1, 2, d, To);   // downward (state 4)
        foreach h in H  retH = retH or medStrong(5, distance+1, 1, h, To);   // horizontally (state 5)
        return retD or retH;
    if (state = 6)   // already moved down and horizontally: may only continue horizontally
        H = horizontalSynsets(from);
        retH = false;
        foreach h in H  retH = retH or medStrong(6, distance+1, 1, h, To);   // horizontally (state 6)
        return retH;
    if (state = 7)   // already moved horizontally and down: may only continue horizontally
        H = horizontalSynsets(from);
        retH = false;
        foreach h in H  retH = retH or medStrong(7, distance+1, 1, h, To);   // horizontally (state 7)
        return retH;


Appendix B

Top-down translation algorithm

In the following, we present the top-down translation algorithm used by the XSLT generator in order to automatically generate an XSLT program from a mapping specification:

// Proceed to process the first matchable target node
TopDownProcess(xsltStylesheet, TargetSchemaGraph.root) {
    // obtain all the necessary information from the mapping specification
    ContextBinding currentBindings ← setCurrentBindings(TargetSchemaGraph.root);

    // generate a template for the current node
    XSLTTemplate template = generateTemplate(currentBindings);
    Queue queue = new Queue();

    // adjust the template by inserting more construction or apply-template
    // rules if necessary
    while (!currentBindings.Hasmap[].isEmpty()) {
        foreach (childNode in currentOutputNode.getChildren()) {
            mapping ← childNode.getMapping();
            inputNodes[] ← mapping.getInputNodes();
            for (int i = 0; i < inputNodes.length; i++) {
                // this loop body is truncated in the original manuscript;
                // presumably each input node contributes an apply-templates
                // rule and the child node is queued for later processing
                queue.enQueue(childNode);
            }
        }
    }

    // go ahead with children mappings if necessary
    while (!queue.isEmpty()) {
        childNode ← queue.deQueue();
        // recursively build up more templates for descendants whenever possible
        TopDownProcess(xsltStylesheet, childNode);
    }
}

// generate a set of bindings for a given output node
ContextBinding setCurrentBindings(oNode) {
    ContextBinding binding = new ContextBinding();
    binding.mapping   ← oNode.getMapping();
    binding.oNode     ← oNode;
    binding.iNodes[]  ← binding.mapping.getInputNodes();
    binding.condition ← binding.mapping.getCondition();
    binding.operation ← binding.mapping.getOperation();
    binding.Hasmap[]  ← binding.mapping.getHasmappings();
    binding.Altmap[]  ← binding.mapping.getAltmappings();
    return binding;
}

XSLTTemplate generateTemplate(bindings) {
    XSLTTemplate template = new XSLTTemplate();
    template.insertConstructionTags(bindings.oNode);
    for (int i = 0; i < bindings.iNodes.length; i++) {
        // truncated in the original manuscript; presumably a value-of or
        // apply-templates instruction is inserted for each input node
    }
    return template;
}


Appendix C

Detailed Example

In this appendix, we first present the matching rules obtained by applying our matching algorithm to the examples in Figure 4-b and Figure 5-c; we then present the structured mapping result, and finally the generated XSLT script.


Matching Rules:

Target                        Source access path(s)                                      Operation              Condition

University                    University                                                 connect                -
Name                          University/Name                                            connect                -
Address                       University/Location                                        split ws(Location)     -
City                          University/Location                                        split ws(Location)[1]  -
State                         University/Location                                        split ws(Location)[2]  -
Zip                           University/Location                                        split ws(Location)[3]  -
Researcher                    University/Library/Article/Author,                         union                  -
                              University/Library/Book/Author,
                              University/Library/Monograph/Author
First-name                    Author/Name                                                split ws(Name)[1]      -
Last-name                     Author/Name                                                split ws(Name)[2]      -
Address                       Author/Address/City, Author/Address/State,                 merge                  -
                              Author/Address/Zip
Publication                   -                                                          -                      -
Journal-Article               University/Library/Article,                                join                   University/Library/Article/Journal-ref =
                              University/Library/Journal                                                        University/Library/Journal/Name
Abstract                      University/Library/Article/Abstract                        connect                -
Journal                       University/Library/Journal/Name                            connect                -
Editor                        University/Library/Journal/Editor                          connect                -
Proceeding-Article            -                                                          -                      -
Abstract                      -                                                          -                      -
Volume                        -                                                          -                      -
Book                          University/Library/Book | University/Library/Monograph     connect                -
Price                         University/Library/Book/Price,                             connect                -
                              University/Library/Monograph/Price
Title                         University/Library/Book/Title,                             connect                -
                              University/Library/Monograph/Title
Publisher                     University/Library/Book/Publisher,                         connect                -
                              University/Library/Monograph/Publisher
Researcher Publication/       University/Library/Article/Author                          -                      University/Researcher/Author =
Journal-Article                                                                                                 University/Library/Article/Author
Researcher Publication/Book   University/Library/Book/Author                             -                      University/Researcher/Author =
                                                                                                                University/Library/Book/Author


Structured mapping example:



Generated XSLT example:


CURRICULUM VITAE

Born in 1977, Aida Boukottaya completed her undergraduate studies and obtained her Computer Science Engineering Diploma from the National School of Computer Sciences (ENSI, Tunis, Tunisia) in 2000.

In 2000, she obtained a master's degree in Computer Science (Systems and Communications) from the School of Computer and Communication Sciences at the Swiss Federal Institute of Technology in Lausanne (EPFL).

In 2001, she began her PhD at LITH/EPFL, where she joined the Media (Models and Environments for Document based Interaction and Authoring) research group and worked on the topic of structured document reuse.

In the framework of her PhD, she was involved in various scientific collaborations and teaching activities. She has several international publications related to her research interests: structured document reuse, structured document transformations, and schema matching.
