CERN-THESIS-2001-036 // 2001

DISSERTATION

Data Integration against Multiple Evolving Autonomous Schemata

carried out for the purpose of obtaining the academic degree of a Doktor der technischen Wissenschaften, under the supervision of

o. Univ.-Prof. Dr. Robert Trappl
Institut für medizinische Kybernetik und Artificial Intelligence, Universität Wien

and

Universitätslektor Dipl.-Ing. Dr. Paolo Petta
Institut für medizinische Kybernetik und Artificial Intelligence, Universität Wien

submitted to the Technische Universität Wien, Fakultät für Technische Naturwissenschaften und Informatik

by

Christoph Koch
E9425227
A-1030 Wien, Beatrixgasse 26/70

Wien, 2001
Summary
Research in the area of data integration has given rise to directions such as federated and multidatabases, mediation, data warehousing, global information systems, and model management/schema matching. From an architectural point of view, one can distinguish between approaches that integrate against a single global schema and those that do not. On the level of inter-schema semantics, most previous research can be classified into the so-called global-as-view and local-as-view approaches. These approaches differ, in part considerably, in their individual characteristics. Federated databases have proven useful in environments in which several information systems need to exchange data with one another, while each of these information systems has its own schema and is autonomous as far as the design of that schema is concerned. In practice, however, this approach inconveniently does not support the maintenance of evolving schemata. Other well-known approaches, which integrate against a “global” schema, in turn do not support the design autonomy of information systems: when schema changes become necessary, this kind of autonomy often leads to schemata against which the desired inter-schema semantics can be expressed neither by global-as-view nor by local-as-view approaches. This problem is the subject of this dissertation, in which a new approach to data integration is proposed that combines ideas from model management, mediation, and local-as-view integration. Our approach enables the modeling of (partial) mappings between schemata that exhibit a desirable robustness against change. The motivation for the presented results grew out of an extended stay of the author at CERN, during which the goals and needs of large scientific collaborations concerning their information infrastructure were studied.

Our approach rests on two central foundations. The first is query rewriting, i.e. the rewriting of queries, under very expressive “symmetric” inter-schema dependencies, namely inclusion dependencies between so-called conjunctive queries, which we call conjunctive inclusion dependencies (cind’s). We address a very general form of the source integration problem in which several schemata may coexist, each of which may contain both genuine database entities, for which data are actually available, and purely logical or “virtual” entities, against which dependencies on other schemata can be defined with the help of cind’s. The query rewriting problem then aims at rewriting a query, which may be posed over both logical and genuine entities of one schema, into a query that uses only genuine database entities – if necessary, from all schemata known to the integration system. More precisely, under the classical logical semantics, a conjunctive query is rewritten, with the help of a set of cind’s, into a maximal logically contained positive query. Queries rewritten in this way can be answered using well-known techniques from the field of distributed databases.
For theoretical reasons discussed in detail in this dissertation, we restrict ourselves – for data integration – to sets of cind’s whose dependency graph is acyclic with respect to the inclusion direction of the cind’s. Regarding the query rewriting problem, we first present semantics and theoretical properties. We then present algorithms and optimizations that build on database techniques and that have been implemented in a prototype. For this prototype we also provide suitable benchmarks, which are intended to show that our approach performs well enough to be of practical relevance.

Our approach scales very well to large amounts of data, since the data integration problem is solved exclusively on the level of schemata and queries, never on the level of data. A further strength is the high expressiveness of our dependencies (cind’s), which permits much flexibility in the modeling of inter-schema relationships; for instance, both local-as-view and global-as-view integration are special cases of our approach. As will also be shown, this flexibility makes it possible to create mappings that are robust against change, since cind’s can be kept largely independent of one another, so that necessary changes usually remain locally confined. Query rewriting with cind’s clearly also makes it possible to deal with a very large class of disparities between concepts, since a pair of corresponding (to be exact, mutually containing) concepts is expressed by two conjunctive queries placed in relation to each other.

The second foundation is model management with cind’s. Under the model management approach, schemata and mappings are managed as objects with identity, to which a number of powerful maintenance and manipulation operations can be applied. This dissertation defines such operations, suited to managing mappings in a way that keeps frequent changes manageable. A methodology for the management of schema evolution is presented as well.

In combination, the technical contributions of this dissertation bring a marked improvement in openness and flexibility to the model management and federated database approaches to data integration, and constitute the first practical solution to the data integration problems encountered in the context of complex, autonomous, and changing information landscapes such as those of large scientific collaborations.
Abstract
Research in the area of data integration has resulted in approaches such as federated and multidatabases, mediation, data warehousing, global information systems, and the model management/schema matching approach. Architecturally, approaches can be categorized into those that integrate against a single global schema and those that do not, while on the level of inter-schema constraints, most work can be classified either as so-called global-as-view or as local-as-view integration. These approaches differ widely in their strengths and weaknesses.

Federated databases have been found applicable in environments in which several autonomous information systems coexist – each with their individual schemata – and need to share data. However, this approach does not provide sufficient support for dealing with change of schemata and requirements. Other approaches to data integration which are centered around a single “global” integration schema, on the other hand, cannot handle design autonomy of information systems. Under evolution, this type of autonomy eventually leads to schemata between which neither the global-as-view nor the local-as-view approaches to source integration can be used to express the inter-schema semantics.

In this thesis, this issue is addressed with a novel approach to data integration which combines techniques from model management, mediation, and local-as-view integration. It allows for the design of inter-schema mappings that are more robust when change occurs. The work has been motivated by the requirements of large scientific collaborations in high-energy physics, as encountered by the author during his stay at CERN.

The approach presented here is based on two foundations. The first is query rewriting with very expressive symmetric inter-schema constraints, called conjunctive inclusion dependencies (cind’s). These are containment relationships between conjunctive queries. We address a very general form of the source integration problem, in which several schemata may coexist, each of them containing a number of purely logical as well as a number of source entities. For the source entities, the information system that belongs to the schema holds data, while the logical entities are meant to allow schema entities from other information systems to be integrated against. The query rewriting problem now aims at rewriting a query over (possibly) both source and logical schema entities of one schema into source entities only, which may be part of any of the schemata known. Under the classical logical semantics, and given a conjunctive input query, we address the problem of finding maximally contained positive rewritings under a set of cind’s. Such rewritten queries can then be optimized and efficiently answered using classical distributed database techniques. For the purpose of data integration and the sake of computability, we require the dependency graph of a set of cind’s to be acyclic with respect to inclusion direction.

Regarding the query rewriting problem, we first present semantics and main theoretical properties. Subsequently, algorithms and optimizations based on techniques from database theory are presented, which have been implemented in a research prototype. Finally, experimental results based on this prototype are presented, which demonstrate the practical feasibility of our approach. Reasoning is done exclusively over schemata and queries, and is independent of data volumes, which renders it highly scalable.
Apart from that, this flavor of query rewriting has another important strength. The expressiveness of the constraints allows for much freedom and flexibility in modeling the peculiarities of a mapping problem. For instance, both global-as-view and local-as-view integration are special cases of the query rewriting problem addressed in this thesis. As will be shown, this flexibility allows one to design mappings that are robust with respect to change, as principles such as the decoupling of inter-schema dependencies can be implemented. It is furthermore clear that query rewriting with cind’s also permits dealing with concept mismatch in a very wide sense, as each pair of corresponding concepts in two schemata can be modeled as conjunctive queries.

The second foundation is model management based on cind’s as inter-schema constraints. Under the model management approach to data integration, schemata and mappings are treated as first-class citizens in a repository, on which model management operations can be applied. This thesis proposes definitions of schemata and mappings, as well as an array of powerful operations, which are well suited for designing and maintaining mappings between information systems when change is an issue. To complete this work, we propose a methodology for dealing with evolving schemata as well as changing integration requirements.

The combination of the contributions of this thesis brings a practical improvement of openness and flexibility to the federated database and model management approaches to data integration, and a first practical integration architecture to large, complex, and evolving computing environments such as those encountered in large scientific collaborations.

Acknowledgments
Most of the work on this thesis was carried out during a 30-month stay at CERN, which was sponsored by the Austrian Federal Ministry of Education, Science and Culture under the CERN Austrian Doctoral Student Program.

I would like to thank the two supervisors of my thesis, Robert Trappl of the Department of Medical Cybernetics and Artificial Intelligence of the University of Vienna and Jean-Marie Le Goff of CERN / ETT Division and the University of the West of England, for their continuous support. This thesis would not have been possible without their help. Paolo Petta of the Austrian Research Institute for Artificial Intelligence took over much of the day-to-day supervision, and I am indebted to him for countless hours of discussions, proofreading of draft papers, and feedback of any kind.

I would like to thank Enrico Franconi of the University of Manchester for provoking my interest in local-as-view integration during his short visit at CERN in early 2000, which has influenced this thesis. I am also indebted to Richard McClatchey and Norbert Toth of the University of the West of England and CERN for valuable comments on parts of an earlier version of this thesis. All remaining mistakes are, of course, entirely mine.

Contents
1 Introduction
 1.1 A Brief History of Data Integration
 1.2 The Problem
 1.3 Use Case: Large Scientific Collaborations
 1.4 Contributions of this Thesis
 1.5 Relevance
 1.6 Overview

2 Preliminaries
 2.1 Query Languages
 2.2 Query Containment
 2.3 Dependencies
 2.4 Global Query Optimization
 2.5 Complex Values and Object Identities

3 Data Integration
 3.1 Definitions and Overview
 3.2 Federated and Multidatabases
 3.3 Data Warehousing
 3.4 Information Integration in AI
  3.4.1 Integration against Ontologies
  3.4.2 Capability Descriptions and Planning
  3.4.3 Multi-agent Systems
 3.5 Global-as-view Integration
  3.5.1 Mediation
  3.5.2 Integration by Database Views
  3.5.3 Systems
 3.6 Local-as-view Integration
  3.6.1 Answering Queries using Views
  3.6.2 Algorithms
  3.6.3 Bibliographic Notes
 3.7 Description Logics-based Information Integration
  3.7.1 Description Logics
  3.7.2 Description Logics as a Database Paradigm
  3.7.3 Hybrid Reasoning Systems
 3.8 The Model Management Approach
 3.9 Discussion of Approaches

4 Reference Architecture
 4.1 Architecture
 4.2 Mediating a Query
 4.3 Research Issues

5 Query Rewriting
 5.1 Outline
 5.2 Preliminaries
 5.3 Semantics
  5.3.1 The Classical Semantics
  5.3.2 The Rewrite Systems Semantics
  5.3.3 Equivalence of the two Semantics
  5.3.4 Computability
  5.3.5 Complexity of the Acyclic Case
 5.4 Implementation
 5.5 Experiments
  5.5.1 Chain Queries
  5.5.2 Random Queries
 5.6 Discussion

6 Model Management
 6.1 Model Management Repositories
 6.2 Managing Change
  6.2.1 Decoupling Mappings
  6.2.2 Merging Schemata
 6.3 Managing the Acyclicity of Constraints

7 Outlook
 7.1 Physical Data Independence
  7.1.1 The Classical Problem
  7.1.2 Versions of Logical Schemata
 7.2 Rewriting Recursive Queries

8 Conclusions

Bibliography

Curriculum Vitae

List of Figures
1.1 Mappings in LAV (left) and GAV (right)
1.2 The space of objects that can be shared using symmetric mappings given true concept mismatch between entities of source and integration schemata
1.3 Data flow between information systems that manage the steps of an experiment’s lifecycle. (Cylinders represent databases or – more generally – information systems.)
1.4 ER diagrams for Example 1.3.1: Electronics database (left) and product-data management system (right)
1.5 Concept mismatch between PCs of the electronics database and parts of the product-data management system of “Project1”
1.6 Architecture of the information infrastructure
3.1 Artist’s impression of source integration
3.2 Federated 5-layer schema architecture
3.3 Data warehousing architecture and process
3.4 MAS architectures for the intelligent integration of information. Arrows between agents depict exemplary communication flows. Numbers denote logical time stamps of communication flows.
3.5 A mediator architecture
3.6 MiniCon descriptions of the query and views of Example 3.6.1
3.7 Comparison of global-as-view and local-as-view integration
3.8 Comparison of Data Integration Architectures
4.1 Reference Architecture
5.1 Hypertile of size i ≥ 2 (left) and all nine possible overlapping hypertiles of size i − 1 that can be inscribed into it (right)
5.2 Experiments with chain queries and nonlayered chain cind’s
5.3 Experiments with chain queries and two layers of chain cind’s
5.4 Experiments with chain queries and five layers of chain cind’s
5.5 Experiment with random queries
6.1 Operations on schemata
6.2 Operations on mappings
6.3 Complex model management operations
6.4 Data integration infrastructure of Example 6.2.1. Schemata are visualized as circles and elementary mappings as arrows.
6.5 The lifecycle of the mappings of a legacy integration schema
6.6 Merging auxiliary integration schemata to improve maintenance
6.7 A clustered auxiliary schema. Schemata are displayed as circles and mappings as arrows.
7.1 A cind as an inter-schema constraint (A) compared to a data transformation procedure (B). Horizontal lines depict schemata and small circles depict schema entities. Mappings are shown as thin arrows.
7.2 ER diagram (extended with is-a relationships) of the university domain (initial version)
7.3 ER diagram (with is-a relationships) of the university domain (second version)
7.4 Fixpoint of the bottom-up derivation of Example 7.2.1

Chapter 1
Introduction
The integration of heterogeneous databases and information systems is an area of high practical importance. The very success of information systems and data management technology in a short period of time has caused the virtual omnipresence of stand-alone systems that manage data – “islands of information” – that by now have grown too valuable not to be shared. However, this sharing, and with it the resolution of heterogeneity between systems, entails interesting and nontrivial problems, which have received much research interest in recent years. Ongoing research activity, however, is evidence of the fact that many questions remain unanswered.
1.1 A Brief History of Data Integration
Given a number of heterogeneous information systems, in practice it is not always desirable or even possible to completely reengineer and reimplement them to create one homogeneous information system with a single schema (schema integration [BLN86, JLVV00]). Instead, it is often necessary to perform data integration [JLVV00], where schemata of heterogeneous information systems are left unchanged and integration is carried out by transforming queries or data. To realize such transformations, some flavor of mappings (either procedural code or declarative inter-schema constraints) between information systems is required. If the data integration reasoning is entirely effected on the level of queries and schema-level descriptions, this is usually called query rewriting, while the term data transformation refers to heterogeneous data themselves being classified, transformed and fused to appear homogeneous under some integration schema. Most previous work on data integration can be classified into two major directions by the method by which inter-schema mappings used for integration are expressed (see e.g. [FLM98, Ull97]). These are called local-as-view (LAV) [LMSS95, YL87, LRO96, GKD97, AK92, TSI94, CKPS95] and global-as-view (GAV) [GMPQ+97, ACPS96, CHS+95, FRV95] integration.
The more traditional paradigm is global-as-view integration, where mappings – often called mediators after [Wie92] – are defined as follows. Mediators implement virtual entities (concepts, relations, or classes, depending on nomenclature and data model used) exported by their interfaces as views over the heterogeneous sources, specifying how to combine their data to resolve some (or all) of the experienced heterogeneity. Such mediators can be (generalizations of) simple database views (e.g. CREATE VIEW constructs in SQL) or can be implemented by some procedural code. Global-as-view integration has been used in multidatabases [SL90], data warehousing [JLVV00], and recently for the integration of multimedia sources [ACPS96, CHS+95] and as a fertile testbed for semistructured data models and technologies [GMPQ+97].

In the local-as-view paradigm, inter-schema constraints are defined in strictly the opposite way1. Queries over a purely logical “global” mediated schema are answered by treating sources as if they were materialized views over the mediated schema, where only these materialized views may be used to answer the query – after all, the mediated schema does not directly represent any data. Query answering then reduces to the so-called problem of answering queries using views, which has been intensively studied by the database community [LMSS95, DGL00, AD98, BLR97, RSU95] and is related to the query containment problem [CM77, CV92, Shm87, CDL98a]. Local-as-view integration has not only been applied to and shown to be well suited for data integration in global information systems [LRO96, GKD97, AK92], but also in related applications beyond data integration, such as query optimization [CKPS95] and the maintenance of physical data independence [TSI94].

An important distinction is to be made between data integration architectures that are centered around a single “global” integration schema against which all sources are integrated (this is the case, for instance, for data warehouses and global information systems, and is intrinsic to the local-as-view approach) and others that are not, such as federated and multidatabases. The lack of a single global integration schema in the data integration architecture has a problematic consequence: each source may need to be mapped against each of the integration schemata, leading to a large number of mappings that need to be created and managed. In architectures such as those of federated database systems, where each component database may be a source and a consumer of integrated data at once, a quadratic number of mappings may be required.

The globality of integration schemata is usually judged by their role in an integration architecture. Global schemata are singletons that occupy a very central role in the architecture, and are unique, consistent, and homogeneous world views against which all other schemata in the system (usually considered the “sources”) are to be integrated.

1 At first sight, this may appear unintuitive, but is not. For instance, the local-as-view approach can be motivated by AI planning for information gathering using content descriptions of sources in terms of a global world model (as “planning operators”) [AK92, KW96].

Figure 1.1: Mappings in LAV (left) and GAV (right).
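To make the contrast concrete, consider a hypothetical source relation src.pc_loc(Pc, LocName) and a mediated schema with entities pc(Pc) and located_at(Pc, LocName); these relation names are invented here purely for illustration. Under GAV, a mediator defines a mediated entity as a view over the sources, e.g.

    located_at(Pc, LocName) ← src.pc_loc(Pc, LocName).

Under LAV, the source is instead described as a (materialized) view over the mediated schema, e.g.

    src.pc_loc(Pc, LocName) ⊆ {⟨Pc, LocName⟩ | pc(Pc) ∧ located_at(Pc, LocName)}.

In the first case, answering a query over the mediated schema amounts to unfolding view definitions; in the second, it requires solving the problem of answering queries using views.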
There is globality in integration schemata on a different level as well. We want to consider integration schemata as designed at will while taking a global perspective if
• they are artifacts specifically created for the resolution of some heterogeneity, and

• the entirety of sources in the system that have any relevance to those heterogeneity problems addressed have been taken into account in the design process.
Thus, in such “global” schemata, a global perspective has been taken when designing them. However, they do not have to be monolithic homogeneous world views. This qualifies the collection of logical entities exported by mediators in a global-as-view integration system as a specifically designed global integration schema, although such a schema is not necessarily homogeneous.

An important characteristic of data integration approaches is how well concept mismatch occurring between source and integration schemata can be bridged. We have pointed out that both GAV and LAV use a flavor of views for the mapping between sources and integration schemata. In Figure 1.1, we compare the local-as-view and global-as-view paradigms by visualizing (by Venn diagrams) the spaces of tuples (in relational queries) or objects that can be expressed by queries over source and integration schemata.

Views as inter-schema constraints are strongly asymmetric. One single atomic schema entity appearing in a schema on one side of the invisible conceptual border line between integration schemata and source schemata is always defined by a query or (as the general idea of mediation permits) by some procedural code which computes the entity’s extent over the schemata on the other side of that border line. As a consequence, both LAV and GAV are restricted in how well they can deal with concept mismatch2. This restriction is theoretical, because in both LAV and GAV it is always implicitly assumed that sources are integrated against integration schemata that have been freely designed with no other constraints imposed than the current integration requirements3. However, when data need to be integrated against schemata of information systems that have design autonomy, or when integration schemata have a legacy4 burden that an integration approach has to be able to deal with, both LAV and GAV fail.

Note that views are not the only imaginable way of mapping schemata in data integration architectures. For mappings that are not expressible as views, it may be possible to relate the spaces of objects expressible by complex logical expressions – say, queries – over the concepts of the schemata (see Figure 1.2). One is faced with “legacy” integration schemata when
• there is no central design authority providing “global” schemata,

• future integration requirements or changes to schemata of information systems cannot be appropriately predicted,

• existing integration schemata cannot be amended when integration requirements or the nature of sources to be made available change in an unforeseen way, or

• the creation of “global” schemata is infeasible because of the size and complexity of the problem domain and modeling task5 [MKW00].
2 See Example 1.3.1 and [Ull97].
3 This makes the option of a change of requirements or of the nature of sources after the design of the integration schemata has been finished hover over such architectures like Damocles’ sword.
4 We do not refer to the legacy systems issue here, though. In principle, legacy systems are operational systems that in some aspect of their design differ from what they ideally should be like; they use at least one technology that is no longer part of the current overall IT strategy in some enterprise or collaborative environment [AS99]. In practice, information systems are usually referred to as legacy in the context of data integration if they are not even based on a modern data management technology, usually making it necessary to treat them monolithically and to “wrap” them [GMPQ+97, RS97] by software that makes them appear to respond to data requests under a state-of-the-art data management paradigm.
5 This may make the Semantic Web effort of the World Wide Web Consortium [Wor01] seem to be threatened by another very sharp blade hanging by an amazingly fragile thread.

Figure 1.2: The space of objects that can be shared using symmetric mappings given true concept mismatch between entities of source and integration schemata.

Recent work in the area has resulted in two new approaches that do not center around a single “global” integration schema and where inter-schema constraints do not necessarily have the strictly asymmetric syntax encountered in LAV and GAV. The first uses expressive description logics systems with symmetric constraints for data integration [CDL98a, CDL+98b, Bor95]. Constraints can be defined as containment relationships between complex concepts that represent (path) queries. The main drawback is that integration has to be carried out as ABox reasoning [CDL99], i.e. the classification of data in a (hybrid) description logics system [Neb89]. This does not scale well to large data volumes. Furthermore, such an approach is not applicable when sources have restricted interfaces (as is often the case on the Web) and it is not possible to import all data of a source into the reasoning system.

The second approach, model management [BLP00, MHH+01], treats schemata and mappings between schemata as first-class objects that can be stored in a repository and manipulated with cleanly defined model management operations. This direction is still at an early stage, and no convergence on clean, widely usable semantics has occurred yet. Mappings are often defined as lines between concepts (e.g. relations or classes in schemata) using an array of semantics that are often not very expressive. While such approaches allow for neat graphical visualization and editing of mappings, they do not provide the mechanisms and expressive semantics to support the design and modeling actions that make evolving schemata manageable.
1.2 The Problem
The problem addressed in this thesis is the following. We aim at an approach to data integration that satisfies three requirements.
• Individual information systems may have design autonomy for their schemata. In general, no global schemata can be built. Each individual schema may have been defined before integration requirements were completely known, and be ill-suited for a particular integration task.
• Individual schemata may evolve independently. Even the best-designed integration schemata may end up with concept mismatch that cannot be dealt with through view-based mappings.

• The third requirement concerns the scalability of the approach. The data integration problem has to be solved entirely on the level of queries and descriptions of information systems (i.e., query rewriting) rather than on the level of reasoning over the data, to ensure the independence of the approach from the amount of data managed.

Since the number of mappings in data integration architectures with autonomous component systems may be quadratic in the number of schemata and thus very large, and since schemata and integration requirements may change, a way of managing schemata and mappings is needed that is simple and for which many tasks can be automated. This requires support for managing mappings and their change, and for reusing mappings both actively, in the actions performed for managing schemata and mappings, and passively, through the transitivity of their semantics6.

The work presented in this thesis has been carried out in the context of a very large international scientific collaboration in the area of high-energy physics. We will have a closer look at the problem of providing interoperability of information systems in that domain in Section 1.3.

6 That is, given that we have defined a mapping from schema A to schema B and a mapping from schema B to schema C, we assume that we automatically arrive at a mapping from schema A to schema C.
1.3 Use Case: Large Scientific Collaborations
Large scientific collaborations are becoming more and more common, since nowadays cutting-edge scientific research in areas such as high-energy physics, the human genome, or aerospace has become extremely expensive. Data integration is an issue since many of the individual information systems operated in such an environment require integrated data to be provided from other information systems in order to work. As we will point out in this section, the main sources of difficulty related to source integration in the information infrastructures of such collaborations are the design autonomy of information systems, change of requirements and evolution of schemata, and large data sets.

A number of issues stand in the way of building a single unified “global” logical schema (as they exist for data warehouses or global information systems) for a large science project. We will summarize them next.

Heterogeneity. Heterogeneity is pervasive in large scientific research collaborations, as there are existing legacy systems as well as largely autonomous groups that build more such systems that quickly become legacy.
Scientific collaborations consist of a number7 of largely autonomous institutes that independently develop and maintain their individual information systems8. This lack of central control fosters creativity and is necessary for political and organizational reasons. However, it leads to problems when it comes to making information systems interoperate. In such a setting, heterogeneity arises for many reasons. Firstly, no two designers would conceptualize a given problem situation in the same way. Furthermore, distinct groups of researchers have fundamentally different ways of dealing with bodies of knowledge, due to different (human) languages, professional background, community or project jargon9, teacher and curriculum, or “school of thought”. Several subcommunities independently develop and use similar but distinct software for the same tasks. As a consequence, one can assume similar but slightly different schemata10. In an environment such as the Large Hadron Collider (LHC) project at CERN [LHC] and huge experiments such as CMS [CMS95] currently under preparation, potentially hundreds of individual information systems will be involved with the project during its lifetime, some of them commercial products, others homegrown efforts of possibly several hundred person-years. This is the case because, even for the same task, sub-collaborations or individual institutes working on different subprojects independently build systems.

Regarding the types of heterogeneity that may be encountered in such an environment, it has to be remarked that beyond heterogeneity due to discrepancies in the conceptualizations of human designers (including polysemy, terminological overlap, and misalignment), there is also heterogeneity that is intrinsic to the domain. For example, in the environment of high-energy physics experiments (say, a particle detector), detector parts will necessarily be conceptualized differently depending on the kind of information system in which they are represented. For instance, in a CAD system that is used for designing the particle detector, parts will be spatial structures; in a construction management system, they will have to be represented as tree-like structures modeling compositions of parts and their sub-parts; and in simulation and experimental data taking, parts have to be aggregated by associated sensors (readout channels), with respect to which an experiment becomes a topological structure largely distinct from the one of the design drawing. We believe that such differences also lead to different views on the knowledge level, and certainly lead to different database schemata.

Hardness of Modeling. Apart from the notion of intrinsic heterogeneity introduced in the previous paragraph, there are a number of other issues that contribute to the hardness of modeling in a scientific domain.
7 In large collaborations, they may amount to hundreds.
8 The requirements presented here closely relate to classifications of component autonomy in federated databases [HM85]. (See also Section 3.2.)
9 Such jargon may have developed over time in previous projects on which a group of people may have worked together.
10 Unfortunately, it is often trickier to deal with subtle than with great mismatch.
Figure 1.3: Data flow between information systems that manage the steps of an experiment’s lifecycle. (Cylinders represent databases or – more generally – information systems.)
Firstly, overall agreement on a conceptualization of a large real-world domain cannot be achieved. Whenever new requirements are discovered or a better understanding of a domain is achieved, there will be an incentive to change the current schema. Such change may go beyond pure extension. Instead, existing parts of schemata will have to be revisited, invalidating mappings for data integration that rely on these schemata. Global modeling also fails because of the sheer size of such a scientific domain. In fact, in a project that involves the collaboration of several thousand researchers and engineers, to be able to model the domain would require access to all the knowledge in the heads of all the people involved, and for this knowledge to be stable. This, however, is an unrealistic conjecture, all the more so in an experimental research environment.

The Project Lifecycle. It is important to note that large science projects have a lifecycle much like industrial projects; that is, they go through stages such as design, simulation, construction, testing, calibration, deployment, decommissioning, and many more11. Such steps have some temporal overlap in practice, but there is a gross ordering. Large science projects persist for large time spans12.
11 See Figure 1.3 for an example of data flows that may need to occur between (heterogeneous) information systems for the various activities in the lifecycle, all requiring data integration.
12 For example, the LHC project is expected to be carried on for at least 15 years.
Figure 1.4: ER diagrams for Example 1.3.1: Electronics database (left) and product-data management system (right).

As a consequence, the information systems for some steps of the lifecycle will not be built until other information systems have already been in existence for years. In such an experimental setting, full understanding of the requirements for subsequent information systems can often only be achieved once the information systems for the current work have been implemented. Nevertheless, since some information systems are already in need of data integration, one either has to build a global logical schema today which might become invalid later, leading to serious maintenance problems for the information infrastructure (that is, the logical views that map sources), or an approach has to be followed that goes without such a schema. Since it is impossible to preview all the requirements of a complex system far into the future, one cannot avoid the need for change through proper a priori design.

Concept Mismatch. It is clear from the above observations that concept mismatch between schemata relevant to data integration may occur in the domain of high-energy physics research.
Example 1.3.1 Assume there are two information systems, the first of which is a database holding data on electronics components13 of an experiment under construction, with the relational schema
R1 = {pc_cpu(Pc, Cpu), pc_location(Pc, LocId), location(LocId, LocName)}
The database represents information about PCs and their CPUs as well as the locations where these parts are currently to be found. Locations have a name and an identifier.
13 To make the example more easily accessible, we speak of personal computers as the sole electronics parts represented. Of course, personal computers are not representative building blocks of high-energy physics experiments.
Figure 1.5: Concept mismatch between PCs of the electronics database and parts of the product-data management system of “Project1”.

The second system is a product-data management system for a subproject “Project1”, with the schema
R2 = {part_of(Part1, Part2), part_location(Part, LocId), location(LocId, LocName)}
(see also Figure 1.4). The second database schema represents an assembly tree of “Project1” by the relation part_of, and again the locations of parts. Let us now assume that the first information system (the electronics database) holds data that should be shared with the second. We assume that while the names of the locations are the same in both information systems, the domains of the location ids must be assumed to be distinct and cannot be shared.

We thus experience two kinds of complications with this integration problem. The distinct key domains for locations in the two information systems entail that a correspondence has to be established between (derived) concepts of the two schemata that are both defined by queries14. Furthermore, we observe concept mismatch. The first schema only contains electronics parts, but may do so for other projects besides “Project1” as well. In the second schema, only parts of “Project1” are to be represented, but those parts are not restricted to electronics parts (Figure 1.5).

As a third complication in this example, we assume some granularity mismatch. Assume that the second information system is to hold a more detailed model of the project “Project1” than the first, and shall represent CPUs as parts of mainboards of PCs and those in turn as parts of PCs, rather than just CPUs as parts of PCs. Of course, we have no information on mainboards in the electronics database, but this information could be obtained from another source.
14 Thus, this correspondence could be expressed neither in GAV nor in LAV.
We could encode this by the following semantic constraint, expressing a mapping between schemata by a containment relationship between two queries:
{⟨Pc, Cpu, LocName⟩ | ∃Mb, LocId : R2.part_of(Mb, Pc) ∧ R2.part_of(Cpu, Mb) ∧
        R2.location(LocId, LocName) ∧ R2.part_location(Pc, LocId)}
    ⊇
{⟨Pc, Cpu, LocName⟩ | ∃LocId : R1.pc_cpu(Pc, Cpu) ∧ R1.belongs_to(Pc, “Project1”) ∧
        R1.location(LocId, LocName) ∧ R1.pc_location(Pc, LocId)}
Informally, one may read this constraint as:

    PCs together with their CPUs and locations which are marked as belonging to “Project1” in the first information system should be part of the answers to queries over parts and their locations in the second information system, where CPUs should be known as parts two levels below PCs in the assembly hierarchy represented by the part_of relation.

We do not provide any formal semantics of constraints like the one shown above at this point, but rely on the intuition that such a containment constraint between two queries expresses the desired inter-schema dependency and allows, given appropriate reasoning algorithms (if they exist), to perform data integration in the presence of concept mismatch in a wide sense.
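As a preview of how such a constraint can be exploited, consider the following hypothetical query over the second schema, asking for parts and the names of their locations (this is merely a sketch; the formal rewriting semantics is only developed in Chapter 5):

    q(Part, LocName) ← R2.part_location(Part, LocId), R2.location(LocId, LocName).

The constraint guarantees that every PC with a CPU that is marked as belonging to “Project1” in the electronics database appears, under a location of the same name, in R2.part_location and R2.location. Hence

    q(Pc, LocName) ← R1.pc_cpu(Pc, Cpu), R1.belongs_to(Pc, “Project1”),
                     R1.location(LocId, LocName), R1.pc_location(Pc, LocId).

is a contained rewriting of the query that draws its answers exclusively from the relations of the first information system.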
Large Data Sets. Scientific computing has always been known for manipulating very large amounts of data. Data volumes in information systems related to the construction of LHC experiments are expected to be in the terabyte range, and experimental data collected during the lifetime of LHC will amount to dozens of petabytes. For scalability reasons, information integration has to be carried out on the level of queries (query rewriting) rather than data (data transformation).
1.4 Contributions of this Thesis
This thesis is, to the best of our knowledge, the first to actually address the problem of data integration with multiple unsophisticated evolving autonomous integration schemata. Each such schema may consist of both source relations that hold data and logical relations that do not. Schemata may be designed without taking other schemata or data integration considerations into account. Each query over a schema is rewritten into a query exclusively over source relations of information systems in the environment, using a number of schema mappings.

We propose an approach to data integration (see Figure 1.6) based on model management and query rewriting with expressive constraints within a federated architecture.
Figure 1.6: Architecture of the information infrastructure
Our flavor of query rewriting is based on constraints with clean, expressive semantics. It allows for mappings between schemata that are generalizations of both the LAV and GAV paradigms.

Regarding query rewriting, we first provide characterizations of two different semantics for query rewriting with symmetric constraints: a classical logical one, and one that is motivated by rewrite systems [DJ90]. The rewrite systems semantics is based on the intuitions of local-as-view rewriting and generalizes from them. We formally outline both semantics as well as algorithms for both which, given a conjunctive query, enumerate the maximally contained rewritings15. We discuss various relevant aspects of query rewriting in our context, such as minimality and nonredundancy of conjunctive queries in the rewritings. Next we compare the two semantics and argue that the second is more intuitive and may better fit the expectations of human users of data integration systems than the first. Following the philosophy of that semantics, rewritings can be computed by making use of database techniques such as query optimization and of ideas from, e.g., algorithms developed for the problem of answering queries using views. We believe that in a practical information integration context there are certain regularities (such as sets of predicates – schemata – from which predicates are used together in queries, while there are few queries that combine predicates from several schemata) that render this approach more efficient in practice. Surprisingly, however, it can be shown that the two semantics coincide. We then present a scalable algorithm for the rewrite systems semantics (based on previous work such as [PL00]), which we have implemented in a practical system16, CindRew. We evaluate it experimentally against other algorithms for the same problem.
It turns out that our implementation, which we make available for download, scales to thousands of constraints and realistic applications. We conclude with a discussion of how our query rewriting approach fits into state-of-the-art data integration and model management systems.

Regarding model management, we present definitions of data models, schemata, mappings, and a set of expressive model management operations for the management of schemata in a data integration setting. We argue that our approach can overcome the problems related to “unsophisticated” legacy integration schemata, and provide a sketch of a methodology for managing evolving mappings.

15 The notion of maximally contained rewritings is the one that usually most appropriately describes the intuitive idea of “best rewritings possible” in a data integration context.
16 This system can be checked out at http://home.cern.ch/∼chkoch/cindrew/
1.5 Relevance
As we discuss a framework for data integration that is based on very weak assumptions, this thesis is relevant to a large number of applications in which other approaches eventually fail. These include networks of autonomous virtual enterprises having different deployment lifecycles or standards for their information systems, the information infrastructure of large international collaborations (e.g., in science), and large enterprises that face the integration of several existing heterogeneous data warehouses after mergers or acquisitions or a major change of business model. More generally, our work is applicable in virtually any environment in which anything less than full commitment exists towards far-ranging reengineering of information systems to bring all information systems that roam its environment under a single common enterprise model. Obviously, our work may also allow federated databases [HM85, SL90] to deal more successfully with schema evolution.

Let us reconsider the point of design autonomy for schemata of information systems in the case of companies and e-commerce. For many good reasons, companies nowadays want to have their information systems interoperate; however, there is no sufficiently strong trend towards agreeing on schemata. While there is clearly much work done towards standardization, large players in IT have an incentive to propose competing “standards” and bodies of meta-data. Asking for common schemata beyond enterprise boundaries today is hardly realistic. Instead, even the integration of the information systems inside a single large enterprise is a problem almost too hard to solve17, and motivates some independence of the information infrastructure of horizontal or vertical business units, again leading to the legacy integration schema problem that we want to address here.
That mentioned, the work in this thesis is highly relevant to business-to-business e-commerce and the management of the extended supply chain and virtual enterprises.

Data warehouses that have been the results of large and very expensive design and reengineering efforts customized to a specific enterprise really are legacy systems from the day their design phase ends. Similarly, when companies merge, the schemata of the data warehouses that the former entities created are again bound to feature a substantial degree of heterogeneity. This can be approached in two ways: either by considering these schemata legacy, or by creating a new, truly global information system (almost) from scratch.

17 This of course excludes the issue of data warehouses, which, although they have a global scope w.r.t. the enterprise, address only a small part of the company data (in terms of schema complexity, not volume) – such as sales information – that are usually well understood and where requirements are not expected to change much in the future.
1.6 Overview
The remainder of this thesis is structured as follows. In Chapter 2, some preliminary notions from database theory, computability theory, and complexity theory are presented. Chapter 3 discusses previous work on data integration. We start with definitions in Section 3.1 and consecutively discuss federated and multidatabases, data warehousing, mediator systems, information integration in AI, global-as-view and local-as-view integration18, the description logics-based and model management approaches to data integration, and finally, in Section 3.9, we discuss the various approaches by maintainability and other aspects. In Chapter 4, we present our reference architecture for data integration and discuss its building blocks, which will be treated in more detail in consecutive chapters. Chapter 5 presents our approach to query rewriting with expressive symmetric constraints. Chapter 6 first discusses our flavor of schemata, mappings, and model management operations, and then provides some thoughts on how to guide the modeling process for mappings such that the integration infrastructure can be managed as easily as possible. We discuss some advanced issues of query rewriting, notably extensions of query languages such as recursion and sources with binding patterns, in Chapter 7. We also discuss another application of our work on query rewriting with symmetric constraints: the maintenance of physical data independence under schema evolution. Chapter 8 concludes with a final discussion of the practical implications of this thesis.
18 Local-as-view integration is presented at some length, since its theory will be highly relevant to our work of Chapter 5.

Chapter 2
Preliminaries
This chapter discusses some preliminaries which mainly stem from database theory and which will be needed in later chapters. It is beyond the scope of this thesis to give a detailed account of computability theory and complexity theory. We refer to [HU79, Sip97, GJ79, Joh90, Pap94, DEGV] for introductory texts in these areas. We also assume a basic understanding of databases, schemata, and query languages, notably SQL (for an introductory work on this see [Ull88]). Finally, we presume basic understanding of mathematical logic and automated theorem proving, including concepts such as resolution and refutation, and notions such as predicates, atoms, terms, Skolem functions, Horn clauses, and unit clauses, which are used in the standard way (see e.g. [RN95, Pap94]).

We define the following access functions for later use. Given a Horn clause c, Head(c) returns c’s head atom and Body(c) returns the ordered list of its body atoms. Bodyi(c) returns the i-th body atom. Pred(a) returns the predicate name of atom a, while Preds(Body(c)) returns the predicate names of the atoms in the body of clause c. Vars(a) returns the set of variables appearing in atom a, and Var(Body(c)) returns the variables in the body of the clause c.

We will mainly focus on the relational data model and relational queries [Cod70, Ull88, Ull89, Kan90] under a set-based rather than bag-based semantics (that is, answers to queries are sets, while they are bags in SQL).
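For instance, for the clause c = (q(X, Y) ← p(X, Z), r(Z, Y)), these access functions yield Head(c) = q(X, Y), Body(c) = [p(X, Z), r(Z, Y)], Body2(c) = r(Z, Y), Pred(Body1(c)) = p, Preds(Body(c)) = {p, r}, Vars(Body2(c)) = {Z, Y}, and Var(Body(c)) = {X, Y, Z}.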
2.1 Query Languages
Let dom be a countably infinite domain of atomic values. A relation schema R is a relation name together with a sort, which is a tuple1 of attribute names, and an arity, i.e.
1 Relation schemata are usually defined as sets of attributes. However, we choose the tuple, as we will use the unnamed calculus perspective widely throughout this work.
sort(R) = ⟨A1, ..., An⟩        arity(R) = n
A (relational) schema R is a set of relation schemata. A relation I is a finite set of tuples, I ⊆ dom^n. A database instance I is a set of relations. A relational query Q is a function that maps each instance I over a schema R and dom to another instance J over a different schema R′. Relational queries can be seen from at least two perspectives, an algebraic and a calculus viewpoint. Relational algebra ALG is based on the following basic algebraic operations (see [Cod70] or [Ull88, AHV95]):
• Set-based operations (intersection ∩, union ∪, and difference \) over relations of the same sort (that is, arity, as we assume a single domain dom of atomic values).

• Tuple-based operations (projection π, which eliminates or renames columns of relations, and selection σ, which filters the tuples of a relation according to a predicate built by conjunction of equality atoms, which are statements of the form A = B, where A, B are relational attributes).

• The Cartesian product × as a constructive operation that, given two relations R1 and R2 of arities n and m, respectively, produces a new relation of arity n + m which contains a tuple ⟨t1, t2⟩ for each pair of tuples t1, t2 with t1 ∈ R1 and t2 ∈ R2. (A small executable sketch of these operators follows the list.)
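To make the behavior of these operators concrete, the following is a minimal illustrative sketch in Python rather than part of the formalism itself: relations are modeled as sets of tuples, columns are addressed by position (the unnamed perspective), and selection is simplified to a single equality atom between two column positions.

    # Relations as sets of tuples; columns addressed by position.

    def select(rel, i, j):
        # sigma_{i=j}: keep tuples whose i-th and j-th components are equal
        return {t for t in rel if t[i] == t[j]}

    def project(rel, cols):
        # pi_{cols}: keep only the listed column positions, in the given order
        return {tuple(t[c] for c in cols) for t in rel}

    def product(r1, r2):
        # Cartesian product: concatenate each pair of tuples from r1 and r2
        return {t1 + t2 for t1 in r1 for t2 in r2}

    # Example: R has arity 2, S has arity 1.
    R = {("a", "b"), ("b", "b")}
    S = {("c",)}
    assert select(R, 0, 1) == {("b", "b")}
    assert project(R, [0]) == {("a",), ("b",)}
    assert product(R, S) == {("a", "b", "c"), ("b", "b", "c")}

The set-based operations ∩, ∪, and \ are simply Python’s built-in set operations on such relations.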
Other operations (e.g., various kinds of joins) can be defined from these. There are various subtleties, such as named and unnamed perspectives of ALG, for which we refer to [AHV95]. Queries in the first-order relational domain calculus CALC are of the form
{⟨X̄⟩ | Φ(X̄)}

where X̄ is a tuple of variables (called “unbound” or “distinguished”) and Φ is a first-order formula (using ∀, ∃, ∧, ∨, and ¬) over relational predicates p_i.

An important desirable property of well-behaved database queries is domain independence. Let the set of all atomic values appearing in a database I be called the active domain (adom). A CALC query Q over a schema R is domain independent iff, for any possible database I over R, Q_dom(I) = Q_adom(I).
Example 2.1.1 The CALC query {⟨x, y⟩ | p(x)} is not domain independent, as the variable y is free to bind to any member of the domain. Clearly, such a query does not satisfy the intuitions of well-behaved database queries.
Unfortunately, the domain independence property is undecidable for CALC. An alternative, purely syntactic property is safety or range restriction. We refer to [AHV95] for a treatment of the safe-range calculus CALCsr, which is necessarily somewhat lengthy. It can be shown that ALG, the domain independent relational calculus, and CALCsr are all (language) equivalent.

We refer to the class of ∀, ¬-free queries as the positive relational calculus queries, and to the queries that only use ∃ and ∧ to build formulae as the conjunctive queries. By default, conjunctive queries may contain constants but no built-in arithmetic comparison operators.

Conjunctive queries can be written as function-free Horn clauses, called datalog notation. A conjunctive query {⟨X̄⟩ | ∃Ȳ : p₁(X̄₁) ∧ … ∧ pₙ(X̄ₙ)} is written as a datalog rule
q(X̄) ← p₁(X̄₁), …, pₙ(X̄ₙ).

Furthermore, conjunctive queries have to be safe. Safety is quite simple to define in the case of conjunctive queries: a conjunctive query is safe iff each variable in the head also appears somewhere in the atoms built from database predicates in the body, i.e., X̄ ⊆ X̄₁ ∪ … ∪ X̄ₙ. Throughout this thesis, we choose between the set-theoretic notation for conjunctive queries shown above and the datalog notation, whichever is more convenient to support the presentation. Conjunctive queries correspond to select-from-where queries in SQL in which the where clause only uses equality (=) as comparison operator and conjunction ("and") to compose constraints.
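As an illustration, here is one possible Python representation of conjunctive queries in datalog notation, together with the safety test just defined; the names Var, Atom, and Rule are our own and merely a sketch.

from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str          # e.g. Var("Pc"); anything that is not a Var is a constant

@dataclass(frozen=True)
class Atom:
    pred: str          # the predicate name, Pred(a)
    args: tuple        # terms: variables (Var) or constants

@dataclass(frozen=True)
class Rule:
    head: Atom         # Head(c)
    body: tuple        # Body(c), the ordered tuple of body atoms

def vars_of(atom):
    # Vars(a): the set of variables appearing in atom a.
    return {t for t in atom.args if isinstance(t, Var)}

def is_safe(rule):
    # A conjunctive query is safe iff every head variable also occurs
    # in some body atom.
    return vars_of(rule.head) <= set().union(*(vars_of(a) for a in rule.body))

# Part of the query of Example 2.1.2 below:
q = Rule(Atom("q", (Var("Pc"), Var("Cpu"))),
         (Atom("pc_cpu", (Var("Pc"), Var("Cpu"))),
          Atom("belongs_to", (Var("Pc"), "Project1"))))
assert is_safe(q)
assert not is_safe(Rule(Atom("q", (Var("X"),)), ()))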
Example 2.1.2 The subsumed query from Example 1.3.1 (a conjunctive query) can be written as a select-from-where query in SQL
select pc, cpu, lname
from pc_cpu, belongs_to, loc, pc_loc
where pc_cpu.pc = belongs_to.pc and pc_cpu.pc = pc_loc.pc
  and pc_loc.lid = loc.lid and belongs_to.org_entity = "Project1";

or equivalently

q(Pc, Cpu, LName) ← pc_cpu(Pc, Cpu), belongs_to(Pc, "Project1"),
                    loc(LId, LName), pc_loc(Pc, LId).

in datalog rule notation or
π_{Pc,Cpu,LName}(pc_cpu ⋈ σ_{Org_Entity="Project1"}(belongs_to) ⋈ pc_loc ⋈ loc)

as an ALG query.
Queries with inequality constraints (i.e., ≠, <, ≤, also called arithmetic comparison predicates or built-in predicates) are outside of ALG and CALC in principle, but extensions can be defined without much difficulty². A conjunctive query with inequalities is a clause of the form
q(X̄) ← p₁(X̄₁), …, pₙ(X̄ₙ), x_{i₁,1} θ₁ x_{i₁,2}, …, x_{i_m,1} θ_m x_{i_m,2}.

where the x_{i_j,k} are variables in X̄₁, …, X̄ₙ and θ_j ∈ {≠, <, ≤}.

A datalog program is a set of datalog rules. The dependency graph of a datalog program P is the directed graph G = ⟨V, E⟩ where V is the set of predicate names in P and E contains an arc from predicate pᵢ to predicate pⱼ iff there is a datalog rule in P such that pᵢ is its head predicate and pⱼ appears in the body of that same rule. A datalog program is recursive iff its dependency graph is cyclic.

Positive queries (select-from-where-union queries in SQL) can be written as nonrecursive datalog programs. Since conjunctive queries are closed under composition, all positive queries can also be transformed into equivalent sets of conjunctive queries (with the head atoms over the same "query" predicate). The size of these sets can be exponentially larger than that of the corresponding nonrecursive datalog programs. The process of transforming a nonrecursive datalog program into a set of conjunctive queries is akin to translating a logical formula into Disjunctive Normal Form (DNF) and is called query unfolding.
Example 2.1.3 The nonrecursive datalog program

q(x, y, z, w) ← a(x, y, z, w).
a(x, y, z, 1) ← b(x, y, z).
a(x, y, z, 2) ← b(x, y, z).
b(x, y, 1) ← c(x, y).
b(x, y, 2) ← c(x, y).
c(x, 1) ← d(x).
c(x, 2) ← d(x).

with 2 · 3 + 1 = 7 rules is equivalent to the following set

q(x, 1, 1, 1) ← d(x).    q(x, 1, 1, 2) ← d(x).
q(x, 1, 2, 1) ← d(x).    q(x, 1, 2, 2) ← d(x).
q(x, 2, 1, 1) ← d(x).    q(x, 2, 1, 2) ← d(x).
q(x, 2, 2, 1) ← d(x).    q(x, 2, 2, 2) ← d(x).

of 2³ conjunctive queries.
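The unfolding process itself can be sketched in code. The following simplified Python implementation (our own; it assumes, as in the example above, that all terms are variables or constants, with no function symbols) resolves intensional atoms against the program's rules until only extensional (database) predicates remain.

import itertools

fresh = itertools.count()

def find(t, s):
    # Follow variable bindings in substitution s (variables are strings,
    # constants are anything else, e.g. the integers 1 and 2 above).
    while isinstance(t, str) and t in s:
        t = s[t]
    return t

def rename_apart(rule):
    # Rename a rule's variables apart with fresh names.
    (hp, ha), body = rule
    m = {}
    rn = lambda t: m.setdefault(t, f"{t}_{next(fresh)}") if isinstance(t, str) else t
    return (hp, tuple(rn(t) for t in ha)), \
           tuple((p, tuple(rn(t) for t in a)) for p, a in body)

def unify(xs, ys, s):
    # Most general unifier of two flat term lists (no function symbols).
    for x, y in zip(xs, ys):
        x, y = find(x, s), find(y, s)
        if x != y:
            if isinstance(x, str):   s[x] = y
            elif isinstance(y, str): s[y] = x
            else:                    return None   # two distinct constants clash
    return s

def subst(s, atom):
    p, a = atom
    return (p, tuple(find(t, s) for t in a))

def unfold(query, program, edb):
    # Resolve the first intensional atom against all matching rules;
    # yield the query once only edb predicates remain.
    head, body = query
    for i, (p, a) in enumerate(body):
        if p not in edb:
            for rule in program:
                (rp, ra), rb = rename_apart(rule)
                if rp == p:
                    s = unify(a, ra, {})
                    if s is not None:
                        nb = tuple(subst(s, at) for at in body[:i] + rb + body[i+1:])
                        yield from unfold((subst(s, head), nb), program, edb)
            return
    yield query

# The program of Example 2.1.3 unfolds into 2^3 = 8 conjunctive queries:
prog = [(("a", ("x", "y", "z", k)), (("b", ("x", "y", "z")),)) for k in (1, 2)] \
     + [(("b", ("x", "y", k)), (("c", ("x", "y")),)) for k in (1, 2)] \
     + [(("c", ("x", k)), (("d", ("x",)),)) for k in (1, 2)]
q = (("q", ("x", "y", "z", "w")), (("a", ("x", "y", "z", "w")),))
assert len(list(unfold(q, prog, edb={"d"}))) == 8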
Relational algebra and calculus are far from representing all computable queries over relational databases. For example, not even the transitive closure of binary relations can be expressed using the first-order queries³. Much has been said on categories and hierarchies of relational query languages; examples of languages strictly more expressive than relational algebra and calculus are, for instance, datalog with negation (under various semantics) or the while queries. We refer to [CH82, Cha88, Kan90, AHV95] for more on these issues. Treatments of complexity and expressiveness of relational query languages can be found in [Var82, CH82, Cha88, AHV95]. We leave these issues to the related literature and remark only that the positive relational calculus queries are (expression-)complete in PSPACE [Var82]. The decision problem whether an unfolding of a conjunctive query with a nonrecursive datalog program (with constants) exists that uses only certain relational predicates – a problem related to the approach to data integration developed later in this thesis – is equally PSPACE-complete and thus presumably computationally hard.

²There are, however, a few subtle issues, such as the question whether the domain is totally ordered, with its impact on data independence [CH80, CH82], that are important for the theory of queries. Since we only touch on queries with inequalities briefly, we leave this aside.

³However, transitive closure can of course be expressed in datalog.
2.2 Query Containment
The problem of deciding whether a query Q₁ is contained in a query Q₂ (denoted Q₁ ⊆ Q₂), possibly under a number of constraints describing a schema, is that of deciding whether, for any possible database satisfying the constraints, each tuple in the result of Q₁ is also contained in the result of Q₂. Two queries are called equivalent, denoted Q₁ ≡ Q₂, iff Q₁ ⊆ Q₂ and Q₁ ⊇ Q₂.

The containment problem quickly becomes undecidable for expressive query languages. Already for relational algebra and calculus, the problem is undecidable [SY80, Kan90]. In fact, the problem is co-r.e. but not recursive (under the assumption that databases are finite but the domain is not): noncontainment can be established by exhibiting a single finite counterexample database, while checking containment would require a noncontainment check for every finite database over dom.

For conjunctive queries, the containment problem is decidable and NP-complete [CM77]. Since queries tend to be small, query containment can be practically used, for instance in query optimization or data integration [CKPS95, YL87]. It is usually formalized using the notion of containment mappings (homomorphisms) [CM77].
Definition 2.2.1 Let Q₁ and Q₂ be two conjunctive queries. A containment mapping θ is a function from the variables and constants of Q₁ into the variables and constants of Q₂ that is
• the identity on the constants of Q₁,
• such that θ(Head_i(Q₁)) = Head_i(Q₂) for every position i of the head,
• and for which, for every atom p(x₁, …, xₙ) ∈ Body(Q₁),

p(θ(x₁), …, θ(xₙ)) ∈ Body(Q₂).
It can be shown that for two conjunctive queries Q₁ and Q₂, the containment Q₁ ⊆ Q₂ holds iff there is a containment mapping from Q₂ into Q₁ [CM77].
Example 2.2.2 [AHV95] The two conjunctive queries

q₁(x, y, z) ← p(x₂, y₁, z), p(x, y₁, z₁), p(x₁, y, z₁), p(x, y₂, z₂), p(x₂, y₂, z).

and

q₂(x, y, z) ← p(x₂, y₁, z), p(x, y₁, z₁), p(x₁, y, z₁).

are equivalent. For q₁ ⊆ q₂, the containment mapping is the identity. Clearly, since Body(q₂) ⊂ Body(q₁) and the heads of the two queries match, q₁ ⊆ q₂ must hold. For the other direction, we have θ(x) = x, θ(y) = y, θ(z) = z, θ(x₁) = x₁, θ(y₁) = y₁, θ(z₁) = z₁, θ(x₂) = x₂, θ(y₂) = y₁, and θ(z₂) = z₁.
An alternative way [Ull97] of deciding whether a conjunctive query Q₁ is contained in a second, Q₂, is to freeze the variables of Q₁ into new constants (i.e., constants that do not appear in either of the two queries) and to evaluate Q₂ on the canonical database created from the frozen body atoms of Q₁. Q₁ is then contained in Q₂ if and only if the frozen head of Q₁ appears in the result of Q₂ over the canonical database.
Example 2.2.3 Consider again the two queries of Example 2.2.2. The canonical database for q₂ is
I = {p(a_{x₂}, a_{y₁}, a_z), p(a_x, a_{y₁}, a_{z₁}), p(a_{x₁}, a_y, a_{z₁})}
where a_x, a_y, a_z, a_{x₁}, a_{y₁}, a_{z₁}, a_{x₂} are constants. We have
q₁(I) = {⟨a_{x₂}, a_{y₁}, a_z⟩, ⟨a_{x₂}, a_{y₁}, a_{z₁}⟩, ⟨a_x, a_{y₁}, a_z⟩, ⟨a_x, a_{y₁}, a_{z₁}⟩,
         ⟨a_x, a_y, a_z⟩, ⟨a_x, a_y, a_{z₁}⟩, ⟨a_{x₁}, a_{y₁}, a_{z₁}⟩, ⟨a_{x₁}, a_y, a_{z₁}⟩}
Since the frozen head of q₂ is ⟨a_x, a_y, a_z⟩ and ⟨a_x, a_y, a_z⟩ ∈ q₁(I), q₂ is contained in q₁.
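This freezing technique translates almost directly into code. The following is a rough Python sketch (in our own notation: queries are (head, body) pairs of (predicate, args) atoms, variables are strings, and constants are anything else).

def evaluate(query, db):
    # Naive conjunctive query evaluation by backtracking over the
    # body atoms; db maps predicate names to sets of facts (tuples).
    head, body = query
    results = set()

    def match(i, s):
        if i == len(body):
            results.add(tuple(s.get(t, t) for t in head[1]))
            return
        pred, args = body[i]
        for fact in db.get(pred, ()):
            s2 = dict(s)
            if all((s2.setdefault(a, v) == v) if isinstance(a, str) else (a == v)
                   for a, v in zip(args, fact)):
                match(i + 1, s2)

    match(0, {})
    return results

def contained_in(q1, q2):
    # Q1 ⊆ Q2 iff the frozen head of Q1 appears in the result of Q2
    # over the canonical database built from Q1's frozen body atoms.
    freeze = lambda t: ("a", t) if isinstance(t, str) else t   # frozen constants a_v
    db = {}
    for pred, args in q1[1]:
        db.setdefault(pred, set()).add(tuple(freeze(t) for t in args))
    return tuple(freeze(t) for t in q1[0][1]) in evaluate(q2, db)

# The queries of Example 2.2.2:
q1 = (("q", ("x", "y", "z")),
      (("p", ("x2", "y1", "z")), ("p", ("x", "y1", "z1")),
       ("p", ("x1", "y", "z1")), ("p", ("x", "y2", "z2")),
       ("p", ("x2", "y2", "z"))))
q2 = (("q", ("x", "y", "z")),
      (("p", ("x2", "y1", "z")), ("p", ("x", "y1", "z1")),
       ("p", ("x1", "y", "z1"))))
assert contained_in(q1, q2) and contained_in(q2, q1)   # q1 ≡ q2

Note that this naive evaluation takes exponential time in the size of the queries in the worst case, in line with the NP-completeness of the problem.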
The containment of positive queries Q₁, Q₂ can be checked by transforming them into sets of conjunctive queries Q₁′, Q₂′. Q₁ is of course contained in Q₂ iff each member query of Q₁′ is individually contained in some member query of Q₂′.
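Reusing contained_in from the sketch above, the corresponding test for sets of conjunctive queries is then a small matter (again only an illustrative sketch):

def positive_contained_in(qs1, qs2):
    # Each member query of Q1' must be contained in some member of Q2'.
    return all(any(contained_in(q, q_) for q_ in qs2) for q in qs1)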
Bibliographic Notes

The containment problem for conjunctive queries is NP-complete, as mentioned. The problem can be solved efficiently for two queries if neither query contains more than two atoms of the same relational predicate [Sar91]; in that case, a very efficient algorithm exists that runs in time linear in the size of the queries. Another polynomial-time case is encountered when the so-called hypergraph of the query to be tested for subsumption is acyclic [YO79, FMU82, AHV95]. For that class of queries, the technique of Example 2.2.3 can be combined with the polynomial-time expression complexity of the candidate subsumer query.

If arithmetic comparison predicates⁴ are permitted in conjunctive queries [Klu88], checking query containment becomes harder and jumps to the second level of the polynomial hierarchy [vdM92].

The containment of datalog queries is undecidable [Shm87]. This remains true even for some very restricted classes of single-rule programs (sirups) [Kan90]. Containment of a conjunctive query in a datalog query is EXPTIME-complete – this problem can be solved with the method of Example 2.2.3, but then consumes the full expression complexity of datalog [Var82] (i.e., EXPTIME). The opposite direction, i.e., containment of a datalog program in a conjunctive query, is still decidable but highly intractable (it is 2-EXPTIME-complete [CV92, CV94, CV97]).

Other interesting recent work addresses the containment of so-called regular path queries – which have found much research interest in the field of semistructured databases – under constraints [CDL98a], and the containment of a class of queries over databases with complex objects [LS97] (see also Section 2.5).

⁴Such queries satisfy the real-world need of asking queries where an attribute is to be, for instance, of value greater than a certain constant.
2.3 Dependencies
Dependencies are used in database design to add semantics and integrity constraints to a schema, with which database instances have to comply. Two particularly important classes of dependencies are functional dependencies (abbreviated fd's) and inclusion dependencies (ind's).

A functional dependency R : X → Y over a relation schema R (where X and Y are sets of attribute names of R⁵) has the following semantics: it enforces that for each relation instance over R and each pair t₁, t₂ of tuples in the instance, if t₁ and t₂ agree on the value of each attribute in X, then they must also agree on the value of each attribute in Y. Primary keys are special cases of functional dependencies where X ∪ Y contains all attributes of R.

⁵Under the unnamed perspective sufficient for conjunctive queries in datalog notation, we will refer to the i-th attribute position in R by $i instead of an attribute name.
Example 2.3.1 Let R ⊆ dom³ be a ternary relation with two functional dependencies R : $1 → $2 $3 (i.e., the first attribute is a primary key for R) and R : $3 → $2. Consider an instance I = {⟨1, 2, 3⟩}. The attempt to insert a new tuple ⟨1, 2, 4⟩ into R would violate the first fd, while the attempt to do the same with ⟨5, 6, 3⟩ would violate the second.
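A brute-force check of these fd semantics is straightforward. The following hypothetical Python fragment (our own encoding) represents an fd as a pair of 0-based attribute position lists, so R : $1 → $2 $3 becomes ((0,), (1, 2)).

from itertools import combinations

def satisfies_fd(instance, fd):
    # Violated iff two tuples agree on all lhs positions but
    # disagree on some rhs position.
    lhs, rhs = fd
    return not any(all(t1[i] == t2[i] for i in lhs) and
                   any(t1[j] != t2[j] for j in rhs)
                   for t1, t2 in combinations(instance, 2))

fd1, fd2 = ((0,), (1, 2)), ((2,), (1,))   # R: $1 -> $2 $3 and R: $3 -> $2
I = {(1, 2, 3)}
assert satisfies_fd(I, fd1) and satisfies_fd(I, fd2)
assert not satisfies_fd(I | {(1, 2, 4)}, fd1)   # first insertion of Example 2.3.1
assert not satisfies_fd(I | {(5, 6, 3)}, fd2)   # second insertion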
Informally, inclusion dependencies are containment relationships between queries of the form π_γ(R), i.e., the attributes of a single relation R may be reordered or projected out. Foreign key constraints, which require that a foreign key stored in one tuple must also exist in the key attribute position of some tuple of a (usually different) relation, are inclusion dependencies.

Dependencies as database semantics are, notably, valuable in query optimization and make it possible to enforce the integrity of database updates.
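In the same style as the fd sketch above, an ind R[γ] ⊆ S[δ] between attribute position lists γ of R and δ of S can be checked by comparing two projections (an illustrative sketch; the foreign-key sample data is invented).

def satisfies_ind(r, gamma, s, delta):
    # R[gamma] ⊆ S[delta]: the projection of r onto the positions in
    # gamma must be a subset of the projection of s onto delta.
    proj = lambda rel, pos: {tuple(t[i] for i in pos) for t in rel}
    return proj(r, gamma) <= proj(s, delta)

# Foreign-key flavor: every department referenced in emp exists in dept.
emp  = {("ann", "d1"), ("bob", "d2")}
dept = {("d1", "sales"), ("d2", "hr"), ("d3", "it")}
assert satisfies_ind(emp, (1,), dept, (0,))
assert not satisfies_ind(emp | {("eve", "d9")}, (1,), dept, (0,))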
2.4 Global Query Optimization
Modern database systems rely on the idea of a separation of physical storage structures and logical schemata in order to simplify their use [TK78, AHV95]. This, together with the declarative flavor of many query languages, leads to the need to optimize queries such that they can be evaluated quickly.

In the general case of the relational queries (i.e., ALG or the relational calculus), global optimization is not computable. For conjunctive queries, and on the logical level, where physical cost-based metrics can be left out of consideration, global optimality (that is, minimality) can be achieved. A conjunctive query Q is minimal if there is no equivalent conjunctive query Q′ such that Q′ has fewer atoms (subgoals) in its body than Q. This notion of optimality is justified because joins of relations are usually among the most expensive relational (algebra) operations carried out by a relational database system during query execution. Minimality is of interest in data integration as well.

Computing a minimal equivalent conjunctive query is strongly related to the query containment problem (see Section 2.2); the associated decision problem is again NP-complete. Minimal queries can be computed using the following fact [CM77]: given a conjunctive query Q, there is a minimal query Q′ (with Q ≡ Q′) such that Head(Q) = Head(Q′) and Body(Q′) ⊆ Body(Q), i.e., the heads are equal and the body of Q′ contains a subset of the subgoals of Q, without any changes to variables or constants. Conjunctive queries can thus be optimized by checking all queries created by dropping body atoms from Q while preserving equivalence, and searching for the smallest such query.
Example 2.4.1 Take the queries q₁ and q₂ from Example 2.2.2. By checking all subsets of Body(q₂), it can be seen that q₂ is already minimal. In fact, q₂ is also a minimal query for q₁, as Body(q₂) is the smallest subset of Body(q₁) such that q₂ and q₁ remain equivalent.

Global optimization of conjunctive queries under a number of dependencies (e.g., fd's) can be carried out using a folklore technique called the chase [ABU79, MMS79], for which we refer to the literature (see also [AHV95]).
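The minimization procedure just described can be sketched by reusing contained_in from the containment sketch in Section 2.2 (queries again as (head, body) pairs; this exhaustive search is of course exponential, in line with the NP-completeness noted above).

from itertools import combinations

def equivalent(qa, qb):
    return contained_in(qa, qb) and contained_in(qb, qa)

def minimize(query):
    # By [CM77], some subset of the body atoms (with the head kept
    # unchanged) is a minimal equivalent query; try small bodies first.
    head, body = query
    for k in range(1, len(body) + 1):
        for sub in combinations(body, k):
            if equivalent((head, sub), query):
                return (head, sub)
    return query

# For q1 of Example 2.2.2, minimization drops the two redundant atoms:
q1 = (("q", ("x", "y", "z")),
      (("p", ("x2", "y1", "z")), ("p", ("x", "y1", "z1")),
       ("p", ("x1", "y", "z1")), ("p", ("x", "y2", "z2")),
       ("p", ("x2", "y2", "z"))))
assert len(minimize(q1)[1]) == 3   # Body(q2) remains; q2 is minimal for q1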
2.5 Complex Values and Object Identities
Among the principal additional features of the object-oriented data model [BM93, Kim95, CBB+97], compared to the relational model, are object identifiers [AK98], objects that have complex ("nested") values [AB88], IS-A hierarchies, and behavior attributed to classes of objects, usually via (mostly) imperative methods. For the purpose of querying and data integration under the object-oriented data model, the notions of object identifiers and complex objects deserve some consideration.

Research on complex values in database theory started by giving up the requirement that relations may only contain atomic values of the domain (non-first-normal-form databases). The complex value model, theoretically very elegant, is a strict generalization of the relational data model. Values are created inductively from set and tuple constructors; the relational data model is thus the special case of the complex value model where each relation is a set of tuples over the domain. For instance,
{⟨A : dom, B : dom, C : {⟨A : dom, B : {dom}⟩}⟩}

is a valid sort in the complex value model and
{⟨a, b, {⟨c, {}⟩, ⟨d, {e, g}⟩}⟩, ⟨e, f, {}⟩}

is a value of this sort, where a, b, c, d, e, f, g are constants of dom.

As for the relational data model, algebra- and calculus-based query languages can be specified and equivalences established. Informally, in the algebraic perspective, the set-based operations (union, intersection, and difference), which are required to operate over sets of the same sorts, and the simple tuple-based operations (such as projection) known from the relational model are extended by a more expressive selection operation, which may have conditions such as set membership and equality of complex values, by the powerset operation, and furthermore by tuple- and set-creation and destruction operations (see [AHV95]). Other operations such as renaming, join, and nesting and unnesting can be defined in terms of these.

The complex-value algebra (ALGcv) has hyperexponential complexity. When the powerset operation is replaced by nesting and unnesting operations, we arrive at the so-called nested relation algebra ALGcv−. All queries in ALGcv− can be executed efficiently (relative to the size of the data), which has motivated commercial object-oriented database systems such as O2 [LRV88] and standards such as ODMG's OQL [CBB+97] to closely adopt it.

Interestingly, it can be shown that all ALGcv− queries over relational databases have equivalent relational queries [AB88, AHV95]. This is due to the fact that the unnested values in a tuple always represent keys for the nested tuples; nestings are thus purely cosmetic. Furthermore, every complex value database can be transformed (in polynomial time relative to the size of the complex value database) into a relational one [AHV95] (this, however, requires keys that identify nested tuples as objects, i.e., object identifiers). The nested relation model – and with it a large class of object-oriented queries – is thus just "syntactic sugaring" over the relational data model, with keys as supplements for object identifiers. From the query-only standpoint of data integration, where structural integration can take care of inventing object identifiers in the canonical transformation between data models, we can thus develop techniques in terms of relational queries, which can then be straightforwardly applied to object-oriented databases as well⁶.

We also make a comment on the calculus perspective. Differently from the relational model, in the complex value calculus CALCcv variables may represent and be quantified over complex values. We are thus operating in a higher-order predicate calculus with a finite model semantics. The generalization of range restriction (the safe-range calculus) from the relational calculus to the complex value calculus is straightforward but verbose (see [AHV95]). It can be shown that ALGcv and the safe-range calculus CALCcv (which represents exactly the domain independent complex value calculus queries) are equivalent. Furthermore, if set inclusion is disallowed but set membership as the analog of nesting remains permitted, the so-called strongly safe-range calculus CALCcv− is attained, which is equivalent to ALGcv−. The conjunctive nested relation algebra – in which set union and difference have been removed from ALGcv− – is thus equivalent to the conjunctive relational queries.
Example 2.5.1 Consider an instance Parts, which is a set of complex values of the following sort. A part (in a product-data management system) is a tuple of a barcode B, a name N, and a set of characteristics C. A characteristic is a tuple of a name N and a set of data elements D. A data element is a tuple of a name N, a unit of measurement U, and a value V⁷. The sort can thus be written as

⟨B : dom, N : dom, C : {⟨N : dom, D : {⟨N : dom, U : dom, V : dom⟩}⟩}⟩
⁶Some support for object-oriented databases is a requirement in the use case of Section 1.3, as scientific repositories increasingly make use of object-oriented data models.

⁷For simplicity, we assume that all atomic values are of the same domain dom. This is not an actual restriction unless arithmetic comparison operators (<, ≤) are allowed in the query language.
Suppose that we pose the following query in nested relation algebra ALGcv−:
π_{N,B,D}(unnest_C(π_{B,C}(Parts)))

which asks for transformed complex values of sort
⟨N : dom, B : dom, D : {⟨N : dom, U : dom, V : dom⟩}⟩

and can be formulated in the strongly safe-range calculus CALCcv− as
{x : ⟨N, B, D : {⟨N, U, V⟩}⟩ | ∃y, z, z′, w, w′, u, u′ :
  y : ⟨B, N, C : {⟨N, D : {⟨N, U, V⟩}⟩}⟩ ∧
  z : {⟨N, D : {⟨N, U, V⟩}⟩} ∧ z′ : ⟨N, D : {⟨N, U, V⟩}⟩ ∧
  w : {⟨N, U, V⟩} ∧ w′ : ⟨N, U, V⟩ ∧ u : {⟨N, U, V⟩} ∧ u′ : ⟨N, U, V⟩ ∧
  x.B = y.B ∧ y.C = z ∧ z′ ∈ z ∧ z′.N = x.N ∧ z′.D = w ∧
  w′ ∈ w ∧ x.D = u ∧ u′ ∈ u ∧ u′ = w′}
Let us map the collection Parts to a flat relational database with schema

R = {Part(Poid, B, N), Char(Coid, N, Poid), DataElement(N, U, V, Coid)}

where the attributes Poid and Coid stand for object identifiers which must be invented when flattening the data. The above query can now be equivalently asked in relational algebra as
π_{N,B,Dn,U,V}((π_{Poid,B}(Part) ⋈ Char) ⋈ π_{N→Dn,U,V,Coid}(DataElement))
The greatest challenge here is the elimination or renaming of the three different name attributes N. The same query has the following equivalent in the (conjunctive) relational calculus
{⟨x, y, z, u, v⟩ | ∃i₁, i₂, d : Part(i₁, x, d) ∧ Char(i₂, y, i₁) ∧ DataElement(z, u, v, i₂)}
After executing the query, the results can be nested to get the correct result for the nested relational algebra or calculus query.
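To illustrate the flattening with invented object identifiers, here is a small hypothetical Python sketch following the schema R of this example; the sample part data is made up.

import itertools

oid = itertools.count(1)

def flatten(parts):
    # Flatten nested Parts values into Part, Char, and DataElement
    # relations, inventing object identifiers Poid and Coid.
    part, char, data_element = set(), set(), set()
    for b, n, cs in parts:               # a part ⟨B, N, C⟩
        poid = next(oid)
        part.add((poid, b, n))
        for cn, ds in cs:                # a characteristic ⟨N, D⟩
            coid = next(oid)
            char.add((coid, cn, poid))
            for dn, u, v in ds:          # a data element ⟨N, U, V⟩
                data_element.add((dn, u, v, coid))
    return part, char, data_element

parts = [("4711", "bolt",
          [("torque", [("max", "Nm", "12"), ("min", "Nm", "3")])])]
part, char, de = flatten(parts)

# The conjunctive calculus query of this example, as a comprehension:
result = {(x, y, z, u, v)
          for (i1, x, d) in part
          for (i2, y, j1) in char if j1 == i1
          for (z, u, v, j2) in de if j2 == i2}
assert result == {("4711", "torque", "max", "Nm", "12"),
                  ("4711", "torque", "min", "Nm", "3")}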
Chapter 3

Data Integration
This chapter briefly surveys several research areas related to data integration. We proceed by first presenting two established architectures, federated and multidatabases in Section 3.2 and data warehouses in Section 3.3. Next, in Section 3.4, we discuss information integration in AI. Several research areas of AI that are relevant to this thesis are surveyed, including ontology-based global information systems, capability description and planning, and multi-agent systems as a further integration architecture. Then we discuss global-as-view integration (together with an integration architecture, mediator systems) in Section 3.5 and local-as-view integration in Section 3.6. In Sections 3.7 and 3.8 we arrive at recent data integration approaches. Section 3.9 discusses management and maintainability issues in large and evolving data integration systems and compares the different approaches presented according to various qualitative aspects. First, however, we start with some definitions.
3.1 Definitions and Overview
Source integration [JLVV00] refers to the process of integrating a number of sources (e.g., databases) into one greater common entity. The term usually denotes one part of a larger, more encompassing process, as perceived in the data warehousing setting, where source integration is usually followed by aggregation and online analytical processing (OLAP). There are two forms of source integration, schema integration and data integration. Schema integration [BLN86] refers to a software engineering or knowledge engineering approach, the process of reverse-engineering information systems and reengineering schemata in order to obtain a single common "integrated" schema – which we will not address in more detail in this thesis. While the terms data and information are of course not to be confused, data integration and information integration are normally used synonymously (e.g., [Wie96, Wie92]).

Data integration is the area of research that addresses problems related to
[Figure 3.1: Artist's impression of source integration. Source integration splits into schema integration and data integration; data integration in turn comprises data reconciliation, structural integration, and semantic integration.]
the provision of interoperability to information systems by the resolution of heterogeneity between systems on the level of data. This distinguishes the problem from the wider aim of cooperative information systems [Coo], where more advanced concepts such as workflows, business processes, and supply chains also come into play, and where problems related to the coordination and collaboration of subsystems are studied which go beyond the techniques required and justified for the integration of data alone.

The data integration problem can be decomposed into several subproblems. Structural integration (e.g., wrapping [GK94, RS97]) is concerned with the resolution of structural heterogeneity, i.e., the heterogeneity of data models, query and data access languages, and protocols¹. This problem is particularly interesting when it comes to legacy systems, which are systems that in general have some aspect that would be changed in an ideal world but in practice cannot be [AS99]. In practice, this often refers to out-of-date systems in which parts of the code base or subsystems cannot be adapted to new requirements and technologies because they are no longer understood by the current maintainers or because the source code has been lost.

Semantic integration refers to the resolution of semantic mismatch between schemata. Mismatch of concepts appearing in such schemata may be due to a number of reasons (see e.g. [GMPQ+97]), and may be a consequence of differences in the conceptualizations in the minds of different knowledge engineers. Mismatch may not only occur on the level of schema entities (relations in a relational database or classes in an object-oriented system), but also on the level of data. The associated problem, called data reconciliation [JLVV00], includes object identification (i.e., the problem of determining correspondences between objects represented by different heterogeneous data sources) and the handling of mistakes that happened during the acquisition of data (e.g., typos), which is usually referred to as data cleaning. An overview of this classification of source integration is given in Figure 3.1.

Since, for this thesis, the main problem among those discussed in this section is the resolution of semantic mismatch, we will also put an emphasis on this problem in the following discussion and comparison of research related to data integration.

¹We experience structural heterogeneity if we need to make a number of databases interoperable of which, for example, some are relational and others object-oriented, or if among the relational databases some are only queryable using SQL while others are only queryable using QUEL [SHWK76]. Other kinds of structural heterogeneity are encountered when two database systems use different models for managing transactions, or when they lack middleware compatible with both which allows them to communicate queries and results.
3.2 Federated and Multidatabases
The data integration problem was addressed early on by work on multidatabase systems. Multidatabase systems are collections of several (distributed) databases that may be heterogeneous and need to share and exchange data. According to the classification² of [SL90], federated database systems [HM85] are a subclass of multidatabase systems. Federated databases are collections of collaborating but autonomous component database systems. Nonfederated multidatabase systems, on the other hand, may have several heterogeneous schemata but lack any other kind of autonomy; they have only one level of management, and all data management operations are performed uniformly for all component databases. Federated database systems can be categorized as loosely or tightly coupled systems. Tightly coupled systems are administered as one common entity, while in loosely coupled systems this is not the case and component databases are administered independently [SL90].

Component databases of a federated system may be autonomous in several senses. Design autonomy permits the creators of component databases to make their own design choices with respect to the data models and query languages, the data managed and the schemata used for managing them, and the conceptualizations and semantic interpretations of the data applied. Other kinds of component autonomy that are of less interest to this thesis but still deserve to be mentioned are communication autonomy, execution autonomy, and association autonomy [SL90, HM85]. Autonomy is often in conflict with the need to share data within a federated database system. Thus, one or even several kinds of autonomy may have to be relaxed in practice to be able to provide interoperability.
²There is some heterogeneity in the nomenclature of this area, and a cautionary note is due at this point: many of the terms in this chapter have been used heterogeneously by the research community. Certain choices had to be made in this thesis to allow a uniform presentation, which are hopefully well documented.
[Figure 3.2: Federated 5-layer schema architecture. From bottom to top: local schemata, component schemata, export schemata, federated schemata, and external schemata.]
Modern database systems successfully use a three-tier architecture [TK78] which separates the physical (also called internal) representation from the logical one, and the logical schema in turn from possibly multiple user or application perspectives (provided by views). In federated database systems, these three layers are considered insufficient, and a five-layer schema architecture has been proposed (e.g. [SL90] and Figure 3.2). Under this architecture, there are five types of schemata between which queries are translated:
• Local schemata. The local schema of a component database corresponds to the logical schema in the classical three-layer architecture of centralized database systems.
• Component schemata. The component schema of a database is a version of its local schema translated into the data model and representation formalism shared across the federated database system.
• Export schemata. An export schema contains only the part of a component schema that is relevant to one integrated federated schema.
• Federated schemata³. A federated schema is an integrated homogeneous view of the federation, against which a number of export schemata are mapped (using data integration technology). There may be several such federated schemata inside a federation, providing different integrated views of the available data.

³These are also known as import schemata or global schemata [SL90].
"Data Cube" Data Marts Data (MDDBS) Analysis
Extraction &