CERN-THESIS-2001-036 // 2001

DISSERTATION

Data Integration against Multiple Evolving Autonomous Schemata

carried out for the purpose of obtaining the academic degree of Doktor der technischen Wissenschaften, under the supervision of

o. Univ.-Prof. Dr. Robert Trappl
Institut für medizinische Kybernetik und Artificial Intelligence, Universität Wien

and

Universitätslektor Dipl.-Ing. Dr. Paolo Petta
Institut für medizinische Kybernetik und Artificial Intelligence, Universität Wien

submitted to the Technische Universität Wien, Fakultät für Technische Naturwissenschaften und Informatik

by

Christoph Koch
E9425227
A-1030 Wien, Beatrixgasse 26/70

Inhaltsangabe (Summary)

Research in the field of data integration has produced directions such as federated and multidatabases, mediation, data warehousing, global information systems, and model management / schema matching. From an architectural standpoint, one can distinguish approaches that integrate against a single global schema from those that do not. At the level of inter-schema semantics, most previous work can be classified into the so-called global-as-view and local-as-view approaches. These approaches differ, in part considerably, in their individual properties. Federated databases have proven useful in environments in which several information systems must exchange data while each of them has its own schema and is autonomous with respect to the design of that schema. In practice, however, this approach unfortunately does not support the maintenance of evolving schemata. Other well-known approaches, which integrate against a "global" schema, in turn do not support the design autonomy of information systems. When schema changes become necessary, this kind of autonomy often leads to schemata against which the desired inter-schema semantics can be expressed neither by global-as-view nor by local-as-view approaches.

This problem is the subject of this dissertation, which proposes a new approach to data integration that combines ideas from model management, mediation, and local-as-view integration. Our approach makes it possible to model (partial) mappings between schemata that are favorably robust against change. The motivation for the results presented here stems from an extended stay of the author at CERN, during which the goals and needs of large scientific collaborations with respect to their information infrastructure were studied.

Our approach rests on two central foundations. The first is query rewriting under very expressive "symmetric" inter-schema dependencies, namely inclusion dependencies between conjunctive queries, which we call conjunctive inclusion dependencies (cind's). We address a very general form of the source integration problem in which several schemata may coexist, each of which may contain both genuine database entities, for which data are available, and purely logical or "virtual" entities, against which dependencies on other schemata can be defined by means of cind's. The query rewriting problem then aims at rewriting a query, which may be posed over both logical and genuine entities of one schema, into a query that uses only genuine database entities, if necessary from all schemata known to the integration system. More precisely, under the classical logical semantics, a conjunctive query is rewritten, with the help of a set of cind's, into a maximal logically contained positive query. Queries rewritten in this way can then be answered using well-known techniques from the field of distributed databases.

For theoretical reasons discussed in this dissertation, we restrict ourselves – for data integration – to sets of cind's whose dependency graph is acyclic with respect to the inclusion direction of the cind's. Regarding the query rewriting problem, we first present semantics and theoretical properties. We then present algorithms and optimizations, built on database techniques, which have been implemented in a prototype. We also provide benchmarks for this prototype, which are intended to show that our approach performs well enough to be of practical relevance.

Our approach scales very well to large data volumes, since the data integration problem is solved exclusively at the level of schemata and queries, not at the level of data. A further strength is the high expressiveness of our dependencies (cind's), which allows much flexibility in modeling inter-schema relationships; for example, both local-as-view and global-as-view integration are special cases of our approach. As will also be shown, this flexibility makes it possible to create mappings that are robust against change, since it allows cind's to be kept largely independent of one another, so that necessary changes usually remain locally confined. Query rewriting with cind's clearly also makes it possible to deal with a very large class of concept disparities, since pairs of corresponding (to be exact, one containing the other) concepts are expressed by two conjunctive queries put in relation to each other.

The second foundation is model management with cind's. In the model management approach, schemata and mappings are managed as objects with identity, to which a number of powerful maintenance and manipulation operations can be applied. This dissertation defines such operations, suited to managing mappings in such a way that frequent changes remain manageable. A methodology for managing schema evolution is also presented.

Taken together, the technical contributions of this dissertation bring a substantial improvement in openness and flexibility to the model management and federated database approaches to data integration, and constitute the first practical solution to the data integration problems encountered in the context of complex, autonomous, and changing information landscapes such as large scientific collaborations.

Abstract

Research in the area of data integration has resulted in approaches such as federated and multidatabases, mediation, data warehousing, global information systems, and the model management/schema matching approach. Architecturally, approaches can be categorized into those that integrate against a single global schema and those that do not, while on the level of inter-schema constraints, most work can be classified either as so-called global-as-view or as local-as-view integration. These approaches differ widely in their strengths and weaknesses.

Federated databases have been found applicable in environments in which several autonomous information systems coexist – each with their individual schemata – and need to share data. However, this approach does not provide sufficient support for dealing with change of schemata and requirements. Other approaches to data integration which are centered around a single "global" integration schema, on the other hand, cannot handle design autonomy of information systems. Under evolution, this type of autonomy eventually leads to schemata between which neither the global-as-view nor the local-as-view approaches to source integration can be used to express the inter-schema semantics.

In this thesis, this issue is addressed with a novel approach to data integration which combines techniques from model management, mediation, and local-as-view integration. It allows for the design of inter-schema mappings that are more robust when change occurs. The work has been motivated by the requirements of large scientific collaborations in high-energy physics, as encountered by the author during his stay at CERN.

The approach presented here is based on two foundations. The first is query rewriting with very expressive symmetric inter-schema constraints, called conjunctive inclusion dependencies (cind's). These are containment relationships between conjunctive queries. We address a very general form of the source integration problem, in which several schemata may coexist, each of them containing a number of purely logical as well as a number of source entities. For the source entities, the information system that belongs to the schema holds data, while the logical entities are meant to allow schema entities from other information systems to be integrated against. The query rewriting problem now aims at rewriting a query over (possibly) both source and logical schema entities of one schema into source entities only, which may be part of any of the schemata known. Under the classical logical semantics, and given a conjunctive input query, we address the problem of finding maximally contained positive rewritings under a set of cind's. Such rewritten queries can then be optimized and efficiently answered using classical distributed database techniques. For the purpose of data integration and the sake of computability, we require the dependency graph of a set of cind's to be acyclic with respect to inclusion direction.

Regarding the query rewriting problem, we first present semantics and main theoretical properties. Subsequently, algorithms and optimizations based on techniques from database theory are presented, which have been implemented in a research prototype. Finally, experimental results based on this prototype are presented, which demonstrate the practical feasibility of our approach. Reasoning is done exclusively over schemata and queries, and is independent of data volumes, which renders it highly scalable.
Apart from that, this flavor of query rewriting has another important strength. The expressiveness of the constraints allows for much freedom and flexibility in modeling the peculiarities of a mapping problem. For instance, both global-as-view and local-as-view integration are special cases of the query rewriting problem addressed in this thesis. As will be shown, this flexibility makes it possible to design mappings that are robust with respect to change, as principles such as the decoupling of inter-schema dependencies can be implemented. It is furthermore clear that query rewriting with cind's also makes it possible to deal with concept mismatch in a very wide sense, as each pair of corresponding concepts in two schemata can be modeled as conjunctive queries.

The second foundation is model management based on cind's as inter-schema constraints. Under the model management approach to data integration, schemata and mappings are treated as first-class citizens in a repository, on which model management operations can be applied. This thesis proposes definitions of schemata and mappings, as well as an array of powerful operations, which are well suited for designing and maintaining mappings between information systems when change is an issue. To complete this work, we propose a methodology for dealing with evolving schemata as well as changing integration requirements.

The combination of the contributions of this thesis brings a practical improvement of openness and flexibility to the federated database and model management approaches to data integration, and a first practical integration architecture for large, complex, and evolving computing environments such as those encountered in large scientific collaborations.

Acknowledgments

Most of the work on this thesis was carried out during a 30-month stay at CERN, which was sponsored by the Austrian Federal Ministry of Education, Science and Culture under the CERN Austrian Doctoral Student Program.

I would like to thank the two supervisors of my thesis, Robert Trappl of the Department of Medical Cybernetics and Artificial Intelligence of the University of Vienna and Jean-Marie Le Goff of CERN / ETT Division and the University of the West of England, for their continuous support. This thesis would not have been possible without their help. Paolo Petta of the Austrian Research Institute for Artificial Intelligence took over much of the day-to-day supervision, and I am indebted to him for countless hours of discussions, proofreading of draft papers, and feedback of any kind.

I would like to thank Enrico Franconi of the University of Manchester for provoking my interest in local-as-view integration during his short visit at CERN in early 2000, which has influenced this thesis. I am also indebted to Richard McClatchey and Norbert Toth of the University of the West of England and CERN for valuable comments on parts of an earlier version of this thesis. Any remaining mistakes are, of course, entirely mine.

Contents

1 Introduction 13
  1.1 A Brief History of Data Integration . . . . . . 13
  1.2 The Problem . . . . . . 17
  1.3 Use Case: Large Scientific Collaborations . . . . . . 18
  1.4 Contributions of this Thesis . . . . . . 23
  1.5 Relevance . . . . . . 25
  1.6 Overview . . . . . . 26

2 Preliminaries 27
  2.1 Query Languages . . . . . . 27
  2.2 Query Containment . . . . . . 31
  2.3 Dependencies . . . . . . 33
  2.4 Global Query Optimization . . . . . . 34
  2.5 Complex Values and Object Identities . . . . . . 35

3 Data Integration 39
  3.1 Definitions and Overview . . . . . . 39
  3.2 Federated and Multidatabases . . . . . . 41
  3.3 Data Warehousing . . . . . . 43
  3.4 Information Integration in AI . . . . . . 44
    3.4.1 Integration against Ontologies . . . . . . 44
    3.4.2 Capability Descriptions and Planning . . . . . . 45
    3.4.3 Multi-agent Systems . . . . . . 47
  3.5 Global-as-view Integration . . . . . . 50
    3.5.1 Mediation . . . . . . 50
    3.5.2 Integration by Database Views . . . . . . 51
    3.5.3 Systems . . . . . . 52
  3.6 Local-as-view Integration . . . . . . 53
    3.6.1 Answering Queries using Views . . . . . . 54
    3.6.2 Algorithms . . . . . . 56
    3.6.3 Bibliographic Notes . . . . . . 60
  3.7 Description Logics-based Information Integration . . . . . . 62
    3.7.1 Description Logics . . . . . . 62


    3.7.2 Description Logics as a Database Paradigm . . . . . . 63
    3.7.3 Hybrid Reasoning Systems . . . . . . 65
  3.8 The Model Management Approach . . . . . . 65
  3.9 Discussion of Approaches . . . . . . 66

4 Reference Architecture 71
  4.1 Architecture . . . . . . 71
  4.2 Mediating a Query . . . . . . 73
  4.3 Research Issues . . . . . . 73

5 Query Rewriting 75
  5.1 Outline . . . . . . 75
  5.2 Preliminaries . . . . . . 76
  5.3 Semantics . . . . . . 78
    5.3.1 The Classical Semantics . . . . . . 78
    5.3.2 The Rewrite Systems Semantics . . . . . . 82
    5.3.3 Equivalence of the two Semantics . . . . . . 84
    5.3.4 Computability . . . . . . 88
    5.3.5 Complexity of the Acyclic Case . . . . . . 90
  5.4 Implementation . . . . . . 93
  5.5 Experiments . . . . . . 94
    5.5.1 Chain Queries . . . . . . 94
    5.5.2 Random Queries . . . . . . 97
  5.6 Discussion . . . . . . 98

6 Model Management 99
  6.1 Model Management Repositories . . . . . . 99
  6.2 Managing Change . . . . . . 102
    6.2.1 Decoupling Mappings . . . . . . 103
    6.2.2 Merging Schemata . . . . . . 107
  6.3 Managing the Acyclicity of Constraints . . . . . . 108

7 Outlook 111
  7.1 Physical Data Independence . . . . . . 113
    7.1.1 The Classical Problem . . . . . . 113
    7.1.2 Versions of Logical Schemata . . . . . . 117
  7.2 Rewriting Recursive Queries . . . . . . 123

8 Conclusions 127

Bibliography 129

Curriculum Vitae 151

List of Figures

1.1 Mappings in LAV (left) and GAV (right). . . . . . . 15
1.2 The space of objects that can be shared using symmetric mappings given true concept mismatch between entities of source and integration schemata. . . . . . . 17
1.3 Data flow between information systems that manage the steps of an experiment’s lifecycle. (Cylinders represent databases or – more generally – information systems.) . . . . . . 20
1.4 ER diagrams for Example 1.3.1: Electronics database (left) and product-data management system (right). . . . . . . 21
1.5 Concept mismatch between PCs of the electronics database and parts of the product-data management system of “Project1”. . . . . . . 22
1.6 Architecture of the information infrastructure . . . . . . 24

3.1 Artist’s impression of source integration. . . . . . . 40
3.2 Federated 5-layer schema architecture . . . . . . 42
3.3 Data warehousing architecture and process. . . . . . . 43
3.4 MAS architectures for the intelligent integration of information. Arrows between agents depict exemplary communication flows. Numbers denote logical time stamps of communication flows. . . . . . . 48
3.5 A mediator architecture . . . . . . 51
3.6 MiniCon descriptions of the query and views of Example 3.6.1. . . . . . . 58
3.7 Comparison of global-as-view and local-as-view integration. . . . . . . 68
3.8 Comparison of Data Integration Architectures. . . . . . . 69

4.1 Reference Architecture . . . . . . 72

5.1 Hypertile of size i ≥ 2 (left) and all nine possible overlapping hypertiles of size i − 1 that can be inscribed into it (right). . . . . . . 91
5.2 Experiments with chain queries and nonlayered chain cind’s. . . . . . . 95
5.3 Experiments with chain queries and two layers of chain cind’s. . . . . . . 96
5.4 Experiments with chain queries and five layers of chain cind’s. . . . . . . 96
5.5 Experiment with random queries. . . . . . . 97

6.1 Operations on schemata. . . . . . . 100


6.2 Operations on mappings. . . . . . . 100
6.3 Complex model management operations. . . . . . . 101
6.4 Data integration infrastructure of Example 6.2.1. Schemata are visualized as circles and elementary mappings as arrows. . . . . . . 104
6.5 The lifecycle of the mappings of a legacy integration schema. . . . . . . 106
6.6 Merging auxiliary integration schemata to improve maintenance. . . . . . . 107
6.7 A clustered auxiliary schema. Schemata are displayed as circles and mappings as arrows. . . . . . . 108

7.1 A cind as an inter-schema constraint (A) compared to a data transformation procedure (B). Horizontal lines depict schemata and small circles depict schema entities. Mappings are shown as thin arrows. . . . . . . 112
7.2 ER diagram (extended with is-a relationships) of the university domain (initial version). . . . . . . 115
7.3 ER diagram (with is-a relationships) of the university domain (second version). . . . . . . 118
7.4 Fixpoint of the bottom-up derivation of Example 7.2.1. . . . . . . 124

Chapter 1

Introduction

The integration of heterogeneous databases and information systems is an area of high practical importance. The very success of information systems and data management technology in a short period of time has caused the virtual omnipresence of stand-alone systems that manage data – “islands of information” – that by now have grown too valuable not to be shared. However, this sharing, and with it the resolution of heterogeneity between systems, entails interesting and nontrivial problems, which have received much research interest in recent years. Ongoing research activity, however, is evidence of the fact that many questions remain unanswered.

1.1 A Brief History of Data Integration

Given a number of heterogeneous information systems, in practice it is not always desirable or even possible to completely reengineer and reimplement them to create one homogeneous information system with a single schema (schema integration [BLN86, JLVV00]). Instead, it is often necessary to perform data integration [JLVV00], where schemata of heterogeneous information systems are left unchanged and integration is carried out by transforming queries or data. To realize such transformations, some flavor of mappings (either procedural code or declarative inter-schema constraints) between information systems is required. If the data integration reasoning is entirely effected on the level of queries and schema-level descriptions, this is usually called query rewriting, while the term data transformation refers to heterogeneous data themselves being classified, transformed and fused to appear homogeneous under some integration schema.

Most previous work on data integration can be classified into two major directions by the method by which inter-schema mappings used for integration are expressed (see e.g. [FLM98, Ull97]). These are called local-as-view (LAV) [LMSS95, YL87, LRO96, GKD97, AK92, TSI94, CKPS95] and global-as-view (GAV) [GMPQ+97, ACPS96, CHS+95, FRV95] integration.


The more traditional paradigm is global-as-view integration, where mappings – often called mediators after [Wie92] – are defined as follows. Mediators implement virtual entities (concepts, relations or classes, depending on nomenclature and data model used) exported by their interfaces as views over the heterogeneous sources, specifying how to combine their data to resolve some (or all) of the experienced heterogeneity. Such mediators can be (generalizations of) simple database views (e.g. CREATE VIEW constructs in SQL) or can be implemented by some procedural code. Global-as-view integration has been used in multidatabases [SL90], data warehousing [JLVV00], and recently for the integration of multimedia sources [ACPS96, CHS+95] and as a fertile testbed for semistructured data models and technologies [GMPQ+97].

In the local-as-view paradigm, inter-schema constraints are defined in strictly the opposite way1. Queries over a purely logical “global” mediated schema are answered by treating sources as if they were materialized views over the mediated schema, where only these materialized views may be used to answer the query – after all, the mediated schema does not directly represent any data. Query answering then reduces to the so-called problem of answering queries using views, which has been intensively studied by the database community [LMSS95, DGL00, AD98, BLR97, RSU95] and is related to the query containment problem [CM77, CV92, Shm87, CDL98a]. Local-as-view integration has not only been applied to and shown to be well-suited for data integration in global information systems [LRO96, GKD97, AK92], but also in related applications beyond data integration, such as query optimization [CKPS95] and the maintenance of physical data independence [TSI94].

An important distinction is to be made between data integration architectures that are centered around a single “global” integration schema against which all sources are integrated (this is the case, for instance, for data warehouses and global information systems, and is intrinsic to the local-as-view approach) and others that are not, such as federated and multidatabases. The lack of a single global integration schema in the data integration architecture has a problematic consequence. Each source may need to be mapped against each of the integration schemata, leading to a large number of mappings that need to be created and managed. In architectures such as those of federated database systems, where each component database may be a source and a consumer of integrated data at once, a quadratic number of mappings may be required.
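To make the contrast between the two paradigms concrete, consider a small schematic illustration (our own, using hypothetical relation names, not taken from the systems cited above). Suppose a mediated schema contains a relation flight(Airline, From, To). A GAV mapping defines this mediated entity as a view over source relations, e.g.

    flight(Airline, From, To) ← src1_schedule(Airline, From, To, DepTime).

A LAV mapping instead describes a source relation as a view over the mediated schema, e.g. for a source that only lists connections operated by a single airline:

    src2_lh_routes(From, To) ← flight("LH", From, To).

In the GAV case, a query over flight is answered by simply unfolding the view definition; in the LAV case, the source descriptions must be inverted, which is the problem of answering queries using views mentioned above.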

1At first sight, this may appear unintuitive, but is not. For instance, the local-as-view approach can be motivated by AI planning for information gathering using content descriptions of sources in terms of a global world model (as “planning operators”) [AK92, KW96].

Figure 1.1: Mappings in LAV (left) and GAV (right).

The globality of integration schemata is usually judged by their role in an integration architecture. Global schemata are singletons that occupy a very central role in the architecture, and are unique, consistent, and homogeneous world views against which all other schemata in the system (usually considered the “sources”) are to be integrated. There is globality in integration schemata on a different level as well. We want to consider integration schemata as designed at will while taking a global perspective if

• they are artifacts specifically created for the resolution of some heterogeneity and

• the entirety of sources in the system that have any relevance to those heterogeneity problems addressed have been taken into account in the design process.

Thus in such “global” schemata, a global perspective has been taken when designing them. However, they do not have to be monolithic homogeneous world views. This qualifies the collection of logical entities exported by mediators in a global-as-view integration system as a specifically designed global integration schema, although such a schema is not necessarily homogeneous.

An important characteristic of data integration approaches is how well concept mismatch occurring between source and integration schemata can be bridged. We have pointed out that both GAV and LAV use a flavor of views for the mapping between sources and integration schemata. In Figure 1.1, we compare the local-as-view and global-as-view paradigms by visualizing (by Venn diagrams) the spaces of tuples (in relational queries) or objects that can be expressed by queries over source and integration schemata.

Views as inter-schema constraints are strongly asymmetric. One single atomic schema entity appearing in a schema on one side of the invisible conceptual border line between integration schemata and source schemata is always defined by a query or (as the general idea of mediation permits) by some procedural code which computes the entity’s extent over the schemata on the other side of that border line. As a consequence, both LAV and GAV are restricted in how well they can deal with concept mismatch2. This restriction remains merely theoretical as long as, as is implicitly assumed in both LAV and GAV, sources are integrated against integration schemata that have been freely designed with no other constraints imposed than the current integration requirements3. However, when data need to be integrated against schemata of information systems that have design autonomy, or when integration schemata have a legacy4 burden that an integration approach has to be able to deal with, both LAV and GAV fail. Note that views are not the only imaginable way of mapping schemata in data integration architectures. For mappings that are not expressible as views, it may be possible to relate the spaces of objects expressible by complex logical expressions – say queries – over the concepts of the schemata (see Figure 1.2). “Legacy” integration schemata are faced when

• there is no central design authority providing “global” schemata,

• future integration requirements or changes to schemata of information systems cannot be appropriately predicted,

• existing integration schemata cannot be amended when integration requirements or the nature of sources to be made available change in an unforeseen way, or

• the creation of “global” schemata is infeasible because of the size and complexity of the problem domain and modeling task5 [MKW00].

Recent work in the area has resulted in two new approaches that do not center around a single “global” integration schema and where inter-schema constraints do not necessarily have that strictly asymmetric syntax encountered in LAV and GAV. The first uses expressive description logics systems with symmetric constraints for data integration [CDL98a, CDL+98b, Bor95].

2See Example 1.3.1 and [Ull97].
3This makes the option of the change of requirements or the nature of sources after the design of the integration schemata has been finished hover over such architectures like Damocles’ sword.
4We do not refer to the legacy systems issue here, though. In principle, legacy systems are operational systems that in some aspect of their design differ from what they ideally should be like; they use at least one technology that is no longer part of the current overall IT strategy in some enterprise or collaborative environment [AS99]. In practice, information systems are usually referred to as legacy in the context of data integration if they are not even based on a modern data management technology, usually making it necessary to treat them monolithically, and “wrap” them [GMPQ+97, RS97] by software that makes them appear to respond to data requests under a state-of-the-art data management paradigm.
5This may make the Semantic Web effort of the World Wide Web Consortium [Wor01] seem to be threatened by another very sharp blade hanging by an amazingly fragile thread.

Figure 1.2: The space of objects that can be shared using symmetric mappings given true concept mismatch between entities of source and integration schemata.

Constraints can be defined as containment relationships between complex concepts that represent (path) queries. The main drawback is that integration has to be carried out as ABox reasoning [CDL99], i.e. the classification of data in a (hybrid) description logics system [Neb89]. This does not scale well to large data volumes. Furthermore, such an approach is not applicable when sources have restricted interfaces (as is often the case on the Web) and it is not possible to import all data of a source into the reasoning system.

The second approach, model management [BLP00, MHH+01], treats schemata and mappings between schemata as first-class objects that can be stored in a repository and manipulated with cleanly defined model management operations. This direction is still in an early stage and no convergence towards clean, widely usable semantics has occurred yet. Mappings are often defined as lines between concepts (e.g. relations or classes in schemata) using an array of semantics that are often not very expressive. While such approaches allow for neat graphical visualization and the editing of mappings, they do not provide the mechanisms and expressive semantics to support design and modeling actions to make evolving schemata manageable.

1.2 The Problem

The problem addressed in this thesis is the following. We aim at an approach to data integration that satisfies three requirements.

• Individual information systems may have design autonomy for their schemata. In general, no global schemata can be built. Each individual schema may have been defined before integration requirements were completely known, and be ill-suited for a particular integration task.

• Individual schemata may evolve independently. Even the best-designed integration schemata may end up with concept mismatch that cannot be dealt with through view-based mappings.

• The third requirement concerns the scalability of the approach. The data integration problem has to be solved entirely on the level of queries and descriptions of information systems (i.e., query rewriting) rather than the level of reasoning over the data, to ensure the independence of the approach from the amount of data managed.

Given that the number of mappings in data integration architectures with autonomous component systems may be quadratic in the number of schemata and thus very large, the possibility that schemata and integration requirements change makes it necessary to manage schemata and mappings in a way that is simple and for which many tasks can be automated. This requires support for managing mappings and their change and reusing mappings both actively, in the actions performed for managing schemata and mappings, and passively, through the transitivity of their semantics6.

The work presented in this thesis has been carried out in the context of a very large international scientific collaboration in the area of high-energy physics. We will have a closer look at the problem of providing interoperability of information systems in that domain in Section 1.3.

1.3 Use Case: Large Scientific Collaborations

Large scientific collaborations are becoming more and more common because nowadays cutting-edge scientific research in areas such as high energy physics, the human genome or aerospace has become extremely expensive. Data integration is an issue since many of the individual information systems being operated in such an environment require integrated data to be provided from other information systems in order to work. As we will point out in this section, the main sources of difficulty related to source integration in the information infrastructures of such collaborations are the design autonomy of information systems, change of requirements and evolution of schemata, and large data sets.

A number of issues stand in the way of building a single unified “global” logical schema (as they exist for data warehouses or global information systems) for a large science project. We will summarize them next.

Heterogeneity. Heterogeneity is pervasive in large scientific research collaborations, as there are existing legacy systems as well as largely autonomous groups that build more such systems that quickly become legacy.

6That is, given that we have defined a mapping from schema A to schema B and a mapping from schema B to schema C, we assume that we automatically arrive at a mapping from schema A to schema C.

Scientific collaborations consist of a number7 of largely autonomous institutes that independently develop and maintain their individual information systems8. This lack of central control fosters creativity and is necessary for political and organizational reasons. However, it leads to problems when it comes to making information systems interoperate. In such a setting, heterogeneity arises for many reasons. Firstly, no two designers would conceptualize a given problem situation in the same way. Furthermore, distinct groups of researchers have fundamentally different ways of dealing with bodies of knowledge, due to different (human) languages, professional background, community or project jargon9, teacher and curriculum, or “school of thought”. Several subcommunities independently develop and use similar but distinct software for the same tasks. As a consequence, one can assume similar but slightly different schemata10. In an environment such as the Large Hadron Collider (LHC) project at CERN [LHC] and huge experiments such as CMS [CMS95] currently under preparation, potentially hundreds of individual information systems will be involved with the project during its lifetime, some of them commercial products, others homegrown efforts of possibly several hundred person-years. This is the case because even for the same task, sub-collaborations or individual institutes working on different subprojects independently build systems.

When it comes to types of heterogeneity that may be encountered in such an environment, it has to be remarked that beyond heterogeneity due to discrepancies in conceptualizations of human designers (including polysemy, terminological overlap and misalignment), there is also heterogeneity that is intrinsic to the domain. For example, in the environment of high-energy physics experiments (say, a particle detector), detector parts will necessarily be conceptualized differently depending on the kind of information system in which they are represented. For instance, in a CAD system that is used for designing the particle detector, parts will be spatial structures; in a construction management system, they will have to be represented as tree-like structures modeling compositions of parts and their sub-parts; and in simulation and experimental data taking, parts have to be aggregated by associated sensors (readout channels), with respect to which an experiment becomes a topological structure largely distinct from the one of the design drawing. We believe that such differences also lead to different views on the knowledge level, and certainly lead to different database schemata.

Hardness of Modeling. Apart from the intrinsic heterogeneity discussed in the previous paragraph, there are a number of other issues that contribute to the hardness of modeling in a scientific domain.

7In large collaborations, they may amount to hundreds.
8The requirements presented here closely relate to classifications of component autonomy in federated databases [HM85]. (See also Section 3.2.)
9Such jargon may have developed over time in previous projects on which a group of people may have worked together.
10Unfortunately, it is often trickier to deal with subtle than with great mismatch.

Figure 1.3: Data flow between information systems that manage the steps of an experiment’s lifecycle. (Cylinders represent databases or – more generally – information systems.)

Firstly, overall agreement on a conceptualization of a large real-world domain cannot be achieved. Whenever new requirements are discovered or a better understanding of a domain is achieved, there will be an incentive to change the current schema. Such change may go beyond pure extension. Instead, existing parts of schemata will have to be revisited, invalidating mappings for data integration that rely on these schemata. Global modeling also fails because of the sheer size of such a scientific domain. In fact, in a project that involves the collaboration of several thousand researchers and engineers, modeling the domain would require access to all the knowledge in the heads of all the people involved, and for this knowledge to be stable. This, however, is an unrealistic assumption, all the more so in an experimental research environment.

The Project Lifecycle. It is important to note that large science projects have a lifecycle much like industrial projects; that is, they go through stages such as design, simulation, construction, testing, calibration, deployment, decommissioning, and many more11. Such steps have some temporal overlap in practice, but there is a gross ordering. Large science projects persist for large time spans12.

11See Figure 1.3 for an example of data flows that may need to occur between (heterogeneous) information systems for the various activities in the lifecycle, all requiring data integration.
12For example, the LHC project is expected to be carried on for at least 15 years.

Figure 1.4: ER diagrams for Example 1.3.1: Electronics database (left) and product-data management system (right).

As a consequence, the information systems for some steps of the lifecycle will not be built until other information systems have already been in existence for years. In such an experimental setting, full understanding of the requirements for subsequent information systems can often only be achieved once the information systems for the current work have been implemented. Nevertheless, since some information systems are already in need of data integration, one either has to build a global logical schema today which might become invalid later, leading to serious maintenance problems of the information infrastructure (that is, the logical views that map sources), or an approach has to be followed that goes without such a schema. Since it is impossible to preview all the requirements of a complex system far into the future, one cannot avoid the need for change through proper a priori design.

Concept Mismatch. It is clear from the above observations that concept mismatch between schemata relevant to data integration may occur in the domain of high energy physics research.

Example 1.3.1 Assume there are two information systems, the first of which is a database holding data on electronics components13 of an experiment under construction, with the relational schema

R1 = {pc_cpu(Pc, Cpu), pc_location(Pc, LocId), location(LocId, LocName)}

The database represents information about PCs and their CPUs as well as the location where these parts currently are to be found.

13To make the example more easily accessible, we speak of personal computers as sole electronics parts represented. Of course, personal computers are not representative building blocks of high-energy physics experiments.

Figure 1.5: Concept mismatch between PCs of the electronics database and parts of the product-data management system of “Project1”.

Locations have a name and an identifier. The second system is a product data management system for a subproject “Project1” with the schema

R2 = {part_of(Part1, Part2), part_location(Part, LocId), location(LocId, LocName)}

(see also Figure 1.4). The second database schema represents an assembly tree of “Project1” by the relation “part_of” and again the locations of parts. Let us now assume that the first information system (the electronics database) holds data that should be shared with the second. We assume that while the names of the locations are the same in the second as in the first information system, the domains of the location ids in the two information systems must be assumed to be distinct, and cannot be shared.

We thus experience two kinds of complications with this integration problem. The distinct key domains for locations in the two information systems entail that a correspondence must be established between (derived) concepts in the two schemata that are both defined by queries14. Furthermore, we observe concept mismatch. The first schema only contains electronics parts but may do so for other projects besides “Project1” as well. In the second schema only parts of “Project1” are to be represented, but those parts are not restricted to electronics parts (Figure 1.5).

As a third complication in this example, we assume some granularity mismatch. Assume that the second information system is to hold a more detailed model of the project “Project1” than the first and shall represent CPUs as parts of mainboards of PCs and those in turn as parts of PCs, rather than just CPUs as parts of PCs. Of course, we have no information on mainboards in the electronics database, but this information could be obtained from another source.

14Thus, this correspondence could neither be expressed in GAV nor in LAV.

We could encode this by the following semantic constraint expressing a mapping between schemata by a containment relationship between two queries:

{⟨Pc, Cpu, LocName⟩ | ∃Mb, LocId : R2.part_of(Mb, Pc) ∧ R2.part_of(Cpu, Mb) ∧
                                    R2.location(LocId, LocName) ∧ R2.part_location(Pc, LocId)}
    ⊇
{⟨Pc, Cpu, LocName⟩ | ∃LocId : R1.pc_cpu(Pc, Cpu) ∧ R1.belongs_to(Pc, “Project1”) ∧
                                R1.location(LocId, LocName) ∧ R1.pc_location(Pc, LocId)}

Informally, one may read this constraint as

    PCs together with their CPUs and locations which are marked as belonging to “Project1” in the first information system should be part of the answers to queries over parts and their locations in the second information system, where CPUs should be known as parts two levels below PCs in the assembly hierarchy represented by the part_of relation.

We do not provide any formal semantics of constraints like the one shown above at this point, but rely on the intuition that such a containment constraint between two queries expresses the desired inter-schema dependency and allows, given appropriate reasoning algorithms (if they exist), data integration to be performed in the presence of concept mismatch in a wide sense. □
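To convey the intuition of how such a constraint can be put to use, consider the following sketch (our own illustration, anticipating the formal treatment of query rewriting in Chapter 5). Suppose a user asks the second information system for parts together with the names of their locations:

    q(Part, LocName) ← R2.part_location(Part, LocId), R2.location(LocId, LocName).

The constraint guarantees that for every PC marked as belonging to “Project1” in the electronics database, matching part_location and location facts – sharing a (possibly unknown) location identifier – must hold in the second schema. Consequently, the query

    q(Pc, LocName) ← R1.pc_cpu(Pc, Cpu), R1.belongs_to(Pc, “Project1”), R1.pc_location(Pc, LocId), R1.location(LocId, LocName).

is a contained rewriting of the user query: every answer it computes over the electronics database is also an answer to the original query in every database satisfying the constraint.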

Large Data Sets. Scientific computing has always been known for manipulating very large amounts of data. Data volumes in information systems related to the construction of LHC experiments are expected to be in the Terabyte range, and experimental data collected during the lifetime of LHC will amount to dozens of Petabytes. For scalability reasons, information integration has to be carried out on the level of queries (query rewriting) rather than data (data transformation).

1.4 Contributions of this Thesis

This thesis is, to the best of our knowledge, the first to address the problem of data integration with multiple unsophisticated evolving autonomous integration schemata. Each such schema may consist of both source relations that hold data and logical relations that do not. Schemata may be designed without taking other schemata or data integration considerations into account. Each query over a schema is rewritten into a query exclusively over source relations of information systems in the environment, using a number of schema mappings.

Figure 1.6: Architecture of the information infrastructure

We propose an approach to data integration (see Figure 1.6) based on model management and query rewriting with expressive constraints within a federated architecture. Our flavor of query rewriting is based on constraints with clean, expressive semantics. It allows for mappings between schemata that are generalizations of both the LAV and GAV paradigms.

Regarding query rewriting, we first provide characterizations of two different semantics for query rewriting with symmetric constraints, a classical logical one and one that is motivated by rewrite systems [DJ90]. The rewrite systems semantics is based on the intuitions of local-as-view rewriting and generalizes from them. We formally outline both semantics as well as algorithms for both which, given a conjunctive query, enumerate the maximally contained rewritings15. We discuss various relevant aspects of query rewriting in our context, such as minimality and nonredundancy of conjunctive queries in the rewritings. Next we compare the two semantics and argue that the second is more intuitive and may better fit the expectations of human users of data integration systems than the first. Following the philosophy of that semantics, rewritings can be computed by making use of database techniques such as query optimization and ideas from e.g. algorithms developed for the problem of answering queries using views. We believe that in a practical information integration context there are certain regularities (for example, predicates that are used together in queries tend to come from the same schema, while few queries combine predicates from several schemata) that render this approach more efficient in practice. Surprisingly, however, it can be shown that the two semantics coincide. We then present a scalable algorithm for the rewrite systems semantics (based on previous work such as [PL00]), which we have implemented in a practical system16, CindRew. We evaluate it experimentally against other algorithms for the same problem. It turns out that our implementation, which we make available for download, scales to thousands of constraints and realistic applications. We conclude with a discussion of how our query rewriting approach fits into state-of-the-art data integration and model management systems.

Regarding model management, we present definitions of data models, schemata, mappings, and a set of expressive model management operations for the management of schemata in a data integration setting. We argue that our approach can overcome the problems related to “unsophisticated” legacy integration schemata, and provide a sketch of a methodology for managing evolving mappings.

15The notion of maximally contained rewritings is the one that usually most appropriately describes the intuitive idea of “best rewritings possible” in a data integration context.
16This system can be checked out at http://home.cern.ch/∼chkoch/cindrew/

1.5 Relevance

As we discuss a framework for data integration that is based on very weak assumptions, this thesis is relevant to a large number of applications in which other approaches eventually fail. These include networks of autonomous virtual enterprises having different deployment lifecycles or standards for their information systems, the information infrastructure of large international collaborations (e.g., in science), and large enterprises that face the integration of several existing heterogeneous data warehouses after mergers or acquisitions or a major change of business model. More generally, our work is applicable in virtually any environment in which there is anything less than full commitment to far-ranging reengineering of information systems, to bring all information systems in that environment under a single common enterprise model. Obviously, our work may also allow federated databases [HM85, SL90] to deal more successfully with schema evolution.

Let us reconsider the point of design autonomy for schemata of information systems in the case of companies and e-commerce. For many good reasons, companies nowadays want to have their information systems interoperate; however, there is no sufficiently strong trend towards agreeing on schemata. While there is clearly much work done towards standardization, large players in IT have an incentive to propose competing “standards” and bodies of meta-data. Asking for common schemata beyond enterprise boundaries today is hardly realistic. Instead, even the integration of the information systems inside a single large enterprise is a problem almost too hard to solve17, and motivates some independence of the information infrastructure of horizontal or vertical business units, again leading to the legacy integration schema problem that we want to address here.

17This of course excludes the issue of data warehouses, which, although they have a global scope w.r.t. the enterprise, address only a small part of the company data (in terms of schema complexity, not volume) – such as sales information – that are usually well understood and where requirements are not expected to change much in the future.

That said, the work in this thesis is highly relevant to business-to-business e-commerce and the management of the extended supply chain and virtual enterprises.

Data warehouses that have been the result of large and very expensive design and reengineering efforts customized to a specific enterprise really are legacy systems from the day their design phase ends. Similarly, when companies merge, the schemata of those data warehouses that the former entities created are again bound to feature a substantial degree of heterogeneity. This can be approached in two ways, either by considering these schemata legacy or by creating a new, truly global information system (almost) from scratch.

1.6 Overview

The remainder of this thesis is structured as follows. In Chapter 2, some preliminary notions from database theory, computability theory, and complexity theory are presented. Chapter 3 discusses previous work on data integration. We start with definitions in Section 3.1 and discuss in turn federated and multidatabases, data warehousing, mediator systems, information integration in AI, global-as-view and local-as-view integration18, and the description logics-based and model management approaches to data integration; finally, in Section 3.9, we compare the various approaches with respect to maintainability and other aspects. In Chapter 4, we present our reference architecture for data integration and discuss its building blocks, which will be treated in more detail in subsequent chapters. Chapter 5 presents our approach to query rewriting with expressive symmetric constraints. Chapter 6 first discusses our flavor of schemata, mappings and model management operations, and then provides some thoughts on how to guide the modeling process for mappings such that the integration infrastructure can be managed as easily as possible. We discuss some advanced issues of query rewriting, notably extensions of query languages such as recursion and sources with binding patterns, in Chapter 7. We also discuss another application of our work on query rewriting with symmetric constraints, the maintenance of physical data independence under schema evolution. Chapter 8 concludes with a final discussion of the practical implications of this thesis.

18Local-as-view integration is presented at some length, since its theory will be highly relevant to our work of Chapter 5.

Chapter 2

Preliminaries

This chapter discusses some preliminaries which mainly stem from database theory and which will be needed in later chapters. It is beyond the scope of this thesis to give a detailed account of computability theory and complexity theory. We refer to [HU79, Sip97, GJ79, Joh90, Pap94, DEGV] for introductory texts in these areas. We also assume a basic understanding of databases, schemata, and query languages, and notably SQL (for an introductory work on this see [Ull88]). Finally, we presume basic understanding of mathematical logic and automated theorem proving, including concepts such as resolution and refutation, and notions such as predicates, atoms, terms, Skolem functions, Horn clauses, and unit clauses, which are used in the standard way (see e.g. [RN95, Pap94]).

We define the following access functions for later use: Given a Horn clause c, Head(c) returns c’s head atom and Body(c) returns the ordered list of its body atoms. Body_i(c) returns the i-th body atom. Pred(a) returns the predicate name of atom a, while Preds(Body(c)) returns the predicate names of the atoms in the body of clause c. Vars(a) returns the set of variables appearing in atom a and Var(Body(c)) returns the variables in the body of the clause c.

We will mainly focus on the relational data model and relational queries [Cod70, Ull88, Ull89, Kan90] under a set-based rather than bag-based semantics (that is, answers to queries are sets, while they are bags in the original relational model [Cod70] and in SQL).
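As a small illustration of these access functions (our own example), consider the Horn clause

    c :  q(X, Y) ← r(X, Z), s(Z, Y).

Then Head(c) = q(X, Y), Body(c) = (r(X, Z), s(Z, Y)), Body_2(c) = s(Z, Y), Pred(Body_2(c)) = s, Preds(Body(c)) = {r, s}, Vars(Head(c)) = {X, Y}, and Var(Body(c)) = {X, Y, Z}.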

2.1 Query Languages

Let dom be a countably infinite domain of atomic values. A relation schema R is a relation name together with a sort, which is a tuple1 of attribute names, and an arity, i.e.

1Relation schemata are usually defined as sets of attributes. However, we choose the tuple, as we will use the unnamed calculus perspective widely throughout this work.


sort(R) = ⟨A1, . . . , An⟩        arity(R) = n

A (relational) schema R is a set of relation schemata. A relation I is a finite set of tuples, I ⊆ domⁿ. A database instance I is a set of relations. A relational query Q is a function that maps each instance I over a schema R and dom to another instance J over a different schema R’. Relational queries can be seen from at least two perspectives, an algebraic and a calculus viewpoint. Relational algebra ALG is based on the following basic algebraic operations (see [Cod70] or [Ull88, AHV95]):

• Set-based operations (intersection ∩, union ∪, and difference \) over relations of the same sort (that is, arity, as we assume a single domain dom of atomic values).

• Tuple-based operations (projection π, which eliminates or renames columns of relations, and selection σ, which filters tuples of a relation according to a predicate built by conjunction of equality atoms, which are statements of the form A = B, where A, B are relational attributes).

• The Cartesian product × as a constructive operation that, given two relations R1 and R2 of arities n and m, respectively, produces a new relation of arity n + m which contains a tuple ⟨t1, t2⟩ for each distinct pair of tuples t1, t2 with t1 ∈ R1 and t2 ∈ R2.
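As a small worked illustration (our own), these primitives already suffice to express joins: for two binary relations R1 and R2, the query “pairs ⟨a, c⟩ such that R1 relates a to some b and R2 relates that same b to c” can be written in the unnamed perspective as

    π_{1,4}(σ_{2=3}(R1 × R2)),

where 2 = 3 equates the second and third columns of the product and the projection keeps its first and fourth columns.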

Other operations (e.g., various kinds of joins) can be defined from these. There are various subtleties, such as named and unnamed perspectives of ALG, for which we refer to [AHV95]. Queries in the first-order relational domain calculus CALC are of the form

{⟨X̄⟩ | Φ(X̄)}

where X̄ is a tuple of variables (called “unbound” or “distinguished”) and Φ is a first-order formula (using ∀, ∃, ∧, ∨, and ¬) over relational predicates pi. An important desirable property of well-behaved database queries is domain independence. Let the set of all atomic values appearing in a database I be called the active domain (adom). A CALC query Q over a schema R is domain independent iff, for any possible database I over R, Q_dom(I) = Q_adom(I).

Example 2.1.1 The CALC query {⟨x, y⟩ | p(x)} is not domain independent, as the variable y is free to bind to any member of the domain. Clearly, such a query does not satisfy the intuitions of well-behaved database queries. □

Unfortunately, the domain independence property is undecidable for CALC. An alternative, purely syntactic property is safety or range restriction. We refer to [AHV95] for a treatment of the safe-range calculus CALCsr, which is necessarily somewhat lengthy. It can be shown that ALG, the domain independent relational calculus, and CALCsr are all (language) equivalent. We refer to the class of ∀, ¬-free queries as the positive relational calculus queries and to the queries that only use ∃ and ∧ to build formulae as the conjunctive queries. By default, conjunctive queries may contain constants but no built-in arithmetic comparison operators.

Conjunctive queries can be written as function-free Horn clauses, called datalog notation. A conjunctive query {⟨X̄⟩ | ∃Ȳ : p1(X̄1) ∧ . . . ∧ pn(X̄n)} is written as a datalog rule

q(X̄) ← p1(X̄1), . . . , pn(X̄n).

Furthermore, conjunctive queries have to be safe. Safety in the case of conjunctive queries is quite simple to define: a conjunctive query is safe iff each variable in the head also appears somewhere in the atoms built from database predicates in the body, i.e., X̄ ⊆ X̄1 ∪ . . . ∪ X̄n (reading the tuples as sets of variables). Throughout this thesis, we choose between the set-theoretic notation for conjunctive queries shown above and the datalog notation, whichever is most convenient to support the presentation. Conjunctive queries correspond to select-from-where clauses in SQL where constraints in the where clause only use equality (=) as comparison operator and conjunction ("and") to compose constraints.
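The safety condition is easy to check mechanically; the following small sketch (our own code, with atoms represented as predicate/argument pairs) tests it for a conjunctive query given in datalog notation.

    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    def is_safe(head, body):
        """head: (pred, args); body: list of (pred, args).
        Safe iff every head variable occurs in some body atom."""
        body_vars = {t for (_, args) in body for t in args if is_var(t)}
        return all(t in body_vars for t in head[1] if is_var(t))

    # q(X, Y) <- p(X, Z) is unsafe: Y does not occur in the body.
    print(is_safe(("q", ("X", "Y")), [("p", ("X", "Z"))]))   # False
    print(is_safe(("q", ("X",)),     [("p", ("X", "Z"))]))   # True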

Example 2.1.2 The subsumed query from Example 1.3.1 (a conjunctive query) can be written as a select-from-where query in SQL

select pc, cpu, lname
from pc_cpu, belongs_to, loc, pc_loc
where pc_cpu.pc = belongs_to.pc and pc_cpu.pc = pc_loc.pc
  and pc_loc.lid = loc.lid and belongs_to.org_entity = “Project1”;

or equivalently

q(Pc, Cpu, LName) ← pc_cpu(Pc, Cpu), belongs_to(Pc, “Project1”), loc(LId, LName), pc_loc(Pc, LId).

in datalog rule notation or

πPc,Cpu,LName(pc_cpu ⋈ σOrg_Entity=“Project1”(belongs_to) ⋈ pc_loc ⋈ loc)

as an ALG query. □

Queries with inequality constraints (i.e., ≠, <, ≤, also called arithmetic comparison predicates or builtin predicates) are outside of ALG or CALC in principle, but extensions can be defined without much difficulty². A conjunctive query with inequalities is a clause of the form

q(X̄) ← p1(X̄1), . . . , pn(X̄n), xi1,1 θ1 xi1,2, . . . , xim,1 θm xim,2.

where the xij,k are variables in X̄1, . . . , X̄n and θj ∈ {≠, <, ≤}.

A datalog program is a set of datalog rules. The dependency graph of a datalog program P is the directed graph G = ⟨V, E⟩ where V is the set of predicate names in P and E contains an arc from predicate pi to predicate pj iff there is a datalog rule in P such that pi is its head predicate and pj appears in the body of that same rule. A datalog program is recursive iff its dependency graph is cyclic. Positive queries (select-from-where-union queries in SQL) can be written as nonrecursive datalog programs. Since conjunctive queries are closed under composition, all positive queries can also be transformed into equivalent sets of conjunctive queries (with the head atoms over the same "query" predicate). The size of these sets can be exponentially larger than the corresponding nonrecursive datalog programs. The process of transforming a nonrecursive datalog program into a set of conjunctive queries is akin to translating a logical formula into Disjunctive Normal Form (DNF) and is called query unfolding.

Example 2.1.3 The nonrecursive datalog program

q(x, y, z, w) ← a(x, y, z, w).
a(x, y, z, 1) ← b(x, y, z).
a(x, y, z, 2) ← b(x, y, z).
b(x, y, 1) ← c(x, y).
b(x, y, 2) ← c(x, y).
c(x, 1) ← d(x).
c(x, 2) ← d(x).

with 2 ∗ 3 + 1 = 7 rules is equivalent to the following set

q(x, 1, 1, 1) ← d(x).
q(x, 1, 1, 2) ← d(x).
q(x, 1, 2, 1) ← d(x).
q(x, 1, 2, 2) ← d(x).
q(x, 2, 1, 1) ← d(x).
q(x, 2, 1, 2) ← d(x).
q(x, 2, 2, 1) ← d(x).
q(x, 2, 2, 2) ← d(x).

of 2³ conjunctive queries. □
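A minimal sketch of query unfolding for nonrecursive datalog programs follows (our own code; the predicate names reproduce Example 2.1.3). It repeatedly replaces an atom whose predicate is defined in the program by each defining rule's body, after renaming the rule's variables apart and unifying the rule head with the atom.

    import itertools

    # Terms: variables are uppercase strings; anything else is a constant.
    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    _fresh = itertools.count()

    def rename_apart(head_args, body):
        # Rename a rule's variables so they cannot clash with the query's variables.
        n = next(_fresh)
        ren = lambda t: f"{t}_{n}" if is_var(t) else t
        return tuple(map(ren, head_args)), [(p, tuple(map(ren, a))) for p, a in body]

    def resolve(s, t):
        while t in s:
            t = s[t]
        return t

    def unify(xs, ys, s):
        # Unify two flat argument tuples under substitution s; return s or None.
        for x, y in zip(xs, ys):
            x, y = resolve(s, x), resolve(s, y)
            if x == y:
                continue
            if is_var(x):
                s[x] = y
            elif is_var(y):
                s[y] = x
            else:
                return None
        return s

    def subst(s, atoms):
        return [(p, tuple(resolve(s, t) for t in args)) for p, args in atoms]

    def unfold(query, program):
        """query: (head, body); program: dict mapping an intensional predicate to a
        list of rules (head_args, body). Returns the equivalent set of conjunctive
        queries whose bodies mention only predicates not defined in the program."""
        results, agenda = [], [query]
        while agenda:
            (qp, qargs), body = agenda.pop()
            i = next((k for k, (p, _) in enumerate(body) if p in program), None)
            if i is None:                        # only database predicates left
                results.append(((qp, qargs), body))
                continue
            pred, args = body[i]
            for rule_head, rule_body in program[pred]:
                rh, rb = rename_apart(rule_head, rule_body)
                s = unify(args, rh, {})
                if s is not None:
                    new_body = subst(s, body[:i] + rb + body[i + 1:])
                    new_head = (qp, tuple(resolve(s, t) for t in qargs))
                    agenda.append((new_head, new_body))
        return results

    # The nonrecursive program of Example 2.1.3:
    program = {
        "a": [(("X", "Y", "Z", 1), [("b", ("X", "Y", "Z"))]),
              (("X", "Y", "Z", 2), [("b", ("X", "Y", "Z"))])],
        "b": [(("X", "Y", 1), [("c", ("X", "Y"))]),
              (("X", "Y", 2), [("c", ("X", "Y"))])],
        "c": [(("X", 1), [("d", ("X",))]),
              (("X", 2), [("d", ("X",))])],
    }
    query = (("q", ("X", "Y", "Z", "W")), [("a", ("X", "Y", "Z", "W"))])
    print(len(unfold(query, program)))   # 8, i.e. 2^3 conjunctive queries over d

Since the program is nonrecursive, the agenda process terminates; the exponential blow-up of the example is visible in the output size.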

Relational algebra and calculus are far from representing all computable queries over relational databases. For example, not even the transitive closure of binary relations can be expressed using the first-order queries³. Much has been said on categories and hierarchies of relational query languages, and examples of languages strictly more expressive than relational algebra and calculus are, for instance, datalog with negation (under various semantics) or the while queries. We refer to [CH82, Cha88, Kan90, AHV95] for more on these issues. Treatments of complexity and expressiveness of relational query languages can be found in [Var82, CH82, Cha88, AHV95]. We leave these issues to the related literature and remark only that the positive relational calculus queries are (expression-)complete in PSPACE [Var82]. The decision problem whether an unfolding of a conjunctive query with a nonrecursive datalog program (with constants) exists that uses only certain relational predicates – which is related to the approach to data integration developed later in this thesis – is equally PSPACE-complete and thus presumably a computationally hard problem.

² There are, however, a few subtle issues, such as the question whether the domain is totally ordered and its impact on data independence [CH80, CH82], that are important for the theory of queries. Since we only touch on queries with inequalities briefly, we leave this aside.
³ However, transitive closure can of course be expressed in datalog.

2.2 Query Containment

The problem of deciding whether a query Q1 is contained in a query Q2 (denoted Q1 ⊆ Q2), possibly under a number of constraints describing a schema, is the problem of deciding whether, for every possible database satisfying the constraints, each tuple in the result of Q1 is also contained in the result of Q2. Two queries are called equivalent, denoted Q1 ≡ Q2, iff Q1 ⊆ Q2 and Q1 ⊇ Q2.

The containment problem quickly becomes undecidable for expressive query languages. Already for relational algebra and calculus, the problem is undecidable [SY80, Kan90]. In fact, the problem is co-r.e. but not recursive (under the assumption that databases are finite but the domain is not): establishing containment would require a check over every finite database over dom, while noncontainment is witnessed by a single counterexample database.

For conjunctive queries, the containment problem is decidable and NP-complete [CM77]. Since queries tend to be small, query containment can be practically used, for instance in query optimization or data integration [CKPS95, YL87]. It is usually formalized using the notion of containment mappings (homomorphisms) [CM77].

Definition 2.2.1 Let Q1 and Q2 be two conjunctive queries. A containment mapping θ is a function from the variables and constants of Q1 into the variables and constants of Q2 that is

• the identity on the constants of Q1

• such that θ(Headi(Q1)) = Headi(Q2) for each variable Headi(Q1) in the head of Q1,


• and for which for every atom p(x1, . . . , xn) ∈ Body(Q1),

p(θ(x1), . . . , θ(xn)) ∈ Body(Q2) 

It can be shown that for two conjunctive queries Q1 and Q2, the containment Q1 ⊆ Q2 holds iff there is a containment mapping from Q2 into Q1 [CM77].
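The following backtracking sketch (our own code) searches for a containment mapping between two conjunctive queries, directly implementing Definition 2.2.1; its worst-case behavior is exponential, in line with the NP-completeness of the problem.

    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    def extend(theta, src_args, dst_args):
        """Try to extend mapping theta so that each source term maps to the destination term."""
        theta = dict(theta)
        for s, d in zip(src_args, dst_args):
            if is_var(s):
                if theta.setdefault(s, d) != d:
                    return None
            elif s != d:                 # constants must map to themselves
                return None
        return theta

    def containment_mapping(q_from, q_to):
        """Search for a containment mapping from q_from into q_to (Definition 2.2.1)."""
        (hp_f, ha_f), body_f = q_from
        (hp_t, ha_t), body_t = q_to
        if hp_f != hp_t:
            return None
        start = extend({}, ha_f, ha_t)   # heads must be mapped onto each other

        def search(i, theta):
            if theta is None:
                return None
            if i == len(body_f):
                return theta
            p, args = body_f[i]
            for p2, args2 in body_t:     # try to map the i-th body atom somewhere
                if p == p2:
                    result = search(i + 1, extend(theta, args, args2))
                    if result is not None:
                        return result
            return None

        return search(0, start)

    def contained(q1, q2):
        """q1 ⊆ q2 iff there is a containment mapping from q2 into q1."""
        return containment_mapping(q2, q1) is not None

    # The two queries of Example 2.2.2 below:
    q1 = (("q", ("X", "Y", "Z")),
          [("p", ("X2", "Y1", "Z")), ("p", ("X", "Y1", "Z1")), ("p", ("X1", "Y", "Z1")),
           ("p", ("X", "Y2", "Z2")), ("p", ("X2", "Y2", "Z"))])
    q2 = (("q", ("X", "Y", "Z")),
          [("p", ("X2", "Y1", "Z")), ("p", ("X", "Y1", "Z1")), ("p", ("X1", "Y", "Z1"))])
    print(contained(q1, q2), contained(q2, q1))   # True True: the queries are equivalent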

Example 2.2.2 [AHV95] The two conjunctive queries

q1(x, y, z) ← p(x2, y1, z), p(x, y1, z1), p(x1, y, z1), p(x, y2, z2), p(x2, y2, z).

and

q2(x, y, z) ← p(x2, y1, z), p(x, y1, z1), p(x1, y, z1).

are equivalent. For q1 ⊆ q2, the containment mapping is the identity. Clearly, since Body(q2) ⊂ Body(q1), and the heads of the two queries match, q1 ⊆ q2 must hold. For the other direction, we have θ(x) = x, θ(y) = y, θ(z) = z, θ(x1) = x1, θ(y1) = y1, θ(z1) = z1, θ(x2) = x2, θ(y2) = y1, and θ(z2) = z1. □

An alternative way [Ull97] of deciding whether a conjunctive query Q1 is contained in a second one, Q2, is to freeze the variables of Q1 into new constants (i.e., constants that do not appear in either of the two queries) and to evaluate Q2 on the canonical database created from the frozen body atoms of Q1. Q1 is then contained in Q2 if and only if the frozen head of Q1 appears in the result of Q2 over the canonical database.

Example 2.2.3 Consider again the two queries of Example 2.2.2. The canonical database for q2 is

I = {p(ax2, ay1, az), p(ax, ay1, az1), p(ax1, ay, az1)}

where ax, ay, az, ax1, ay1, az1, ax2 are constants. We have

q1(I) = {⟨ax2, ay1, az⟩, ⟨ax2, ay1, az1⟩, ⟨ax, ay1, az⟩, ⟨ax, ay1, az1⟩,
         ⟨ax, ay, az⟩, ⟨ax, ay, az1⟩, ⟨ax1, ay1, az1⟩, ⟨ax1, ay, az1⟩}

Since the frozen head of q2 is ⟨ax, ay, az⟩ and ⟨ax, ay, az⟩ ∈ q1(I), q2 is contained in q1. □
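The freezing technique can likewise be prototyped in a few lines (our own sketch; the frozen constant names are assumed not to occur elsewhere): freeze the body of the candidate subsumee into a canonical database, evaluate the other query over it by naive backtracking, and look for the frozen head.

    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    def freeze(query):
        """Freeze a conjunctive query: its variables become new constants and its
        body becomes the canonical database."""
        (_, head_args), body = query
        frz = lambda t: ("a_" + t.lower()) if is_var(t) else t
        facts = {(p, tuple(frz(t) for t in args)) for p, args in body}
        return facts, tuple(frz(t) for t in head_args)

    def evaluate(query, facts):
        """Naive evaluation of a conjunctive query over a set of ground facts."""
        (_, head_args), body = query
        answers = set()

        def search(i, binding):
            if i == len(body):
                answers.add(tuple(binding.get(t, t) for t in head_args))
                return
            p, args = body[i]
            for p2, vals in facts:
                if p2 != p:
                    continue
                b, ok = dict(binding), True
                for t, v in zip(args, vals):
                    if is_var(t):
                        if b.setdefault(t, v) != v:
                            ok = False
                            break
                    elif t != v:
                        ok = False
                        break
                if ok:
                    search(i + 1, b)

        search(0, {})
        return answers

    def contained(q1, q2):
        """q1 ⊆ q2 iff the frozen head of q1 is among q2's answers over q1's canonical database."""
        facts, frozen_head = freeze(q1)
        return frozen_head in evaluate(q2, facts)

    # Small check: q_b(X) <- r(X, X) is contained in q_a(X) <- r(X, Y), but not vice versa.
    q_a = (("q", ("X",)), [("r", ("X", "Y"))])
    q_b = (("q", ("X",)), [("r", ("X", "X"))])
    print(contained(q_b, q_a), contained(q_a, q_b))   # True False

Applied to q1 and q2 of Example 2.2.2, freezing q2 yields exactly the canonical database I above, and ⟨a_x, a_y, a_z⟩ is found among q1's answers, confirming q2 ⊆ q1.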

The containment of positive queries Q1, Q2 can be checked by transforming them into sets of conjunctive queries Q1′, Q2′. Q1 is of course contained in Q2 iff each member query of Q1′ is individually contained in a member query of Q2′.

Bibliographic Notes

The containment problem for conjunctive queries is NP-complete, as mentioned. The problem can be efficiently solved for two queries if neither query contains more than two atoms of the same relational predicate [Sar91]. In that case, a very efficient algorithm exists that runs in time linear in the size of the queries. Another polynomial-time case is encountered when the so-called hypergraph of the query to be tested for subsumption is acyclic [YO79, FMU82, AHV95]. For that class of queries, the technique of Example 2.2.3 can be combined with the polynomial-time expression complexity of the candidate subsumer query. If arithmetic comparison predicates⁴ are permitted in conjunctive queries [Klu88], the complexity of checking query containment is harder and jumps to the second level of the polynomial hierarchy [vdM92].

The containment of datalog queries is undecidable [Shm87]. This remains true even for some very restricted classes of single-rule programs (sirups) [Kan90]. Containment of a conjunctive query in a datalog query is EXPTIME-complete – this problem can be solved with the method of Example 2.2.3, but then consumes the full expression complexity of datalog [Var82] (i.e., EXPTIME). The opposite direction, i.e. containment of a datalog program in a conjunctive query, is still decidable but highly intractable (it is 2-EXPTIME-complete [CV92, CV94, CV97]).

Other interesting recent work has been on the containment of so-called regular path queries – which have found much research interest in the field of semistructured databases – under constraints [CDL98a] and on containment of a class of queries over databases with complex objects [LS97] (see also Section 2.5).

2.3 Dependencies

Dependencies are used in database design to add semantics and integrity constraints to a schema, with which database instances have to comply. Two particularly important classes of dependencies are functional dependencies (abbreviated fd's) and inclusion dependencies (ind's).

A functional dependency R : X → Y over a relation schema R (where X and Y are sets of attribute names of R⁵) has the following semantics. It enforces that for each relation instance over R, for each pair t1, t2 of tuples in the instance, if for each attribute name in X the values in t1 and t2 are pairwise equal, then the values for the attributes in Y must be equal as well. Primary keys are special cases of functional dependencies where X ∪ Y contains all attributes of R.

⁴ Such queries satisfy the real-world need of asking queries where an attribute is to be, for instance, of value greater than a certain constant.
⁵ Under the unnamed perspective sufficient for conjunctive queries in datalog notation, we will refer to the i-th attribute position in R by $i, instead of an attribute name.

Example 2.3.1 Let R ⊆ dom³ be a ternary relation with two functional dependencies R : $1 → $2 $3 (i.e., the first attribute is a primary key for R) and R : $3 → $2. Consider an instance I = {⟨1, 2, 3⟩}. The attempt to insert a new tuple ⟨1, 2, 4⟩ into R would violate the first fd, while the attempt to do the same for ⟨5, 6, 3⟩ would violate the second. □
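A functional dependency check over an instance is a direct transcription of this definition; the following sketch (our own code) uses attribute positions $1, $2, . . . as 1-based indices.

    def satisfies_fd(instance, lhs, rhs):
        """instance: set of equal-length tuples; lhs, rhs: attribute positions (1-based).
        True iff whenever two tuples agree on lhs they also agree on rhs."""
        seen = {}
        for t in instance:
            key = tuple(t[i - 1] for i in lhs)
            val = tuple(t[i - 1] for i in rhs)
            if seen.setdefault(key, val) != val:
                return False
        return True

    I = {(1, 2, 3)}
    # fd $1 -> $2 $3 (first attribute is a key) and fd $3 -> $2, as in Example 2.3.1:
    print(satisfies_fd(I | {(1, 2, 4)}, [1], [2, 3]))   # False: violates the first fd
    print(satisfies_fd(I | {(5, 6, 3)}, [3], [2]))      # False: violates the second fd
    print(satisfies_fd(I, [1], [2, 3]))                 # True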

Informally, inclusion dependencies are containment relationships between queries of the form πγ(R), i.e., attributes of a single relation R may be reordered or projected out. Foreign key constraints, which require that a foreign key stored in one tuple must also exist in the key attribute position of some tuple of a (usually different) relation, are inclusion dependencies. Dependencies as database semantics, notably, are valuable in query optimization and make it possible to enforce the integrity of database updates.

2.4 Global Query Optimization

Modern database systems rely on the idea of a separation of physical storage structures and logical schemata in order to simplify their use [TK78, AHV95]. This, together with the declarative flavor of many query languages, leads to the need to optimize queries such that they can be evaluated quickly.

In the general case of the relational queries (i.e., ALG or the relational calculus), global optimization is not computable. For conjunctive queries, and on the logical level, where physical cost-based metrics can be left out of consideration, though, global optimality (that is, minimality) can be achieved. A conjunctive query Q is minimal if there is no equivalent conjunctive query Q′ such that Q′ has fewer atoms (subgoals) in its body than Q. This notion of optimality is justified because joins of relations are usually among the most expensive relational (algebra) operations carried out by a relational database system during query execution. Minimality is of interest in data integration as well.

Computing a minimal equivalent conjunctive query is strongly related to the query containment problem (see Section 2.2). The associated decision problem is again NP-complete. Minimal queries can be computed using the following fact [CM77]. Given a conjunctive query Q, there is a minimal query Q′ (with Q ≡ Q′) such that Head(Q) = Head(Q′) and Body(Q′) ⊆ Body(Q), i.e. the heads are equal and the body of Q′ contains a subset of the subgoals of Q, without any changes to variables or constants. Conjunctive queries can thus be optimized by checking all queries created by dropping body atoms from Q while preserving equivalence, and searching for the smallest such query.

Example 2.4.1 Take the queries q1 and q2 from Example 2.2.2. By checking all subsets of Body(q2), it can be seen that q2 is already minimal. In fact, q2 is also

a minimal query for q1, as Body(q2) is the smallest subset of Body(q1) such that q2 and q1 remain equivalent. □

Global optimization of conjunctive queries under a number of dependencies (e.g., fd's) can be carried out using a folklore technique called the chase [ABU79, MMS79], for which we refer to the literature (see also [AHV95]).
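Combining the fact above with a containment test yields a simple (if naive) minimization procedure; the following sketch (our own code, with a compact brute-force equivalence check based on containment mappings, cf. Section 2.2) repeatedly drops body atoms as long as equivalence is preserved.

    def is_var(t): return isinstance(t, str) and t[:1].isupper()

    def _extend(theta, src, dst):
        theta = dict(theta)
        for s, d in zip(src, dst):
            if is_var(s):
                if theta.setdefault(s, d) != d: return None
            elif s != d: return None
        return theta

    def _hom(q_from, q_to):
        """Containment mapping from q_from into q_to (cf. Section 2.2), or None."""
        (_, ha_f), body_f = q_from
        (_, ha_t), body_t = q_to
        def search(i, theta):
            if theta is None: return None
            if i == len(body_f): return theta
            p, args = body_f[i]
            for p2, args2 in body_t:
                if p == p2:
                    r = search(i + 1, _extend(theta, args, args2))
                    if r is not None: return r
            return None
        return search(0, _extend({}, ha_f, ha_t))

    def equivalent(q1, q2):
        return _hom(q1, q2) is not None and _hom(q2, q1) is not None

    def minimize(query):
        """Drop body atoms while the query stays equivalent; heads and the remaining
        atoms are kept unchanged, following [CM77]."""
        head, body = query
        body, changed = list(body), True
        while changed:
            changed = False
            for i in range(len(body)):
                candidate = body[:i] + body[i + 1:]
                if equivalent((head, body), (head, candidate)):
                    body, changed = candidate, True
                    break
        return head, body

    # q1 from Example 2.2.2 minimizes to (a copy of) q2's three-atom body:
    q1 = (("q", ("X", "Y", "Z")),
          [("p", ("X2", "Y1", "Z")), ("p", ("X", "Y1", "Z1")), ("p", ("X1", "Y", "Z1")),
           ("p", ("X", "Y2", "Z2")), ("p", ("X2", "Y2", "Z"))])
    print(minimize(q1)[1])   # three atoms remain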

2.5 Complex Values and Object Identities

Among the principal additional features of the object-oriented data model [BM93, Kim95, CBB+97], compared to the relational model, we have object identifiers [AK98], objects that have complex ("nested") values [AB88], IS-A hierarchies, and behavior attributed to classes of objects, usually via (mostly) imperative methods. For the purpose of querying and data integration under the object-oriented data model, the notions of object identifiers and complex objects deserve some consideration.

Research on complex values in database theory started by giving up the requirement that values in relations may only contain atomic values of the domain (non-first normal form databases). The complex value model, theoretically very elegant, is strictly a generalization of the relational data model. Values are created inductively from set and tuple constructors. The relational data model is thus the special case of the complex value model where each relation is a set of tuples over the domain. For instance,

{⟨A : dom, B : dom, C : {⟨A : dom, B : {dom}⟩}⟩}

is a valid sort in the complex value model and

{⟨a, b, {⟨c, {}⟩, ⟨d, {e, g}⟩}⟩, ⟨e, f, {}⟩}

is a value of this sort, where a, b, c, d, e, f, g are constants of dom. As for the relational data model, algebra and calculus-based query languages can be specified, and equivalences be established. Informally, in the algebraic perspective, the set-based operations (union, intersection and difference), which are required to operate over sets of the same sorts, and simple tuple-based operations (such as projection) known from the relational model are extended by a more expressive selection operation, which may have conditions such as set membership and equality of complex values, and the powerset operation, furthermore tuple- and set-creation and destruction operations (see [AHV95]). Other operations such as renaming, join, and nesting and unnesting can be defined in terms of these.

The complex-value algebra (ALGcv) has hyperexponential complexity. When the powerset operation is replaced by nesting and unnesting operations, we arrive at the so-called nested relation algebra ALGcv−. All queries in ALGcv− can be executed efficiently (relative to the size of the data), which has motivated commercial object-oriented database systems such as O2 [LRV88] and standards such as ODMG's OQL [CBB+97] to closely adopt it.

Interestingly, it can be shown that all ALGcv− queries over relational databases have equivalent relational queries [AB88, AHV95]. This is due to the fact that unnested values in a tuple always represent keys for the nested tuples; nestings are thus purely cosmetic. Furthermore, every complex value database can be transformed (in polynomial time relative to the size of the complex value database) into a relational one [AHV95] (this, however, requires keys that identify nested tuples as objects, i.e., object identifiers). The nested relation model – and with it a large class of object-oriented queries – is thus just "syntactic sugaring" over the relational data model with keys as supplements for object identifiers. From the query-only standpoint of data integration, where structural integration can take care of inventing object identifiers in the canonical transformation between data models, we can thus develop techniques in terms of relational queries, which can then be straightforwardly applied to object-oriented databases as well⁶.

We also make a comment on the calculus perspective. Differently from the relational model, in the complex value calculus CALCcv variables may represent and be quantified over complex values. We are thus operating in a higher-order predicate calculus with a finite model semantics. The generalization of range restriction (called safe-range calculus) from the relational calculus to the complex value calculus is straightforward but verbose (see [AHV95]). It can be shown that ALGcv and the safe-range calculus CALCcv (which represents exactly the domain independent complex value calculus queries) are equivalent. Furthermore, if set inclusion is disallowed but set membership as the analog of nesting remains permitted, the so-called strongly safe-range calculus CALCcv− is attained, which is equivalent to ALGcv−. Conjunctive nested relation algebra – in which set union and difference have been removed from ALGcv− – is thus equivalent to the conjunctive relational queries.
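To make the "syntactic sugaring" argument concrete, the following sketch (our own code; all names and the sample data are invented) flattens a set of nested part values into flat relations, inventing object identifiers for the nested tuples. It anticipates the mapping used in Example 2.5.1 below.

    from itertools import count

    def flatten(parts):
        """Flatten nested part values into the relations Part, Char and DataElement,
        inventing object identifiers (Poid, Coid) for the nested tuples."""
        oid = count(1)
        Part, Char, DataElement = set(), set(), set()
        for barcode, name, characteristics in parts:
            poid = next(oid)
            Part.add((poid, barcode, name))
            for char_name, data_elements in characteristics:
                coid = next(oid)
                Char.add((coid, char_name, poid))
                for de_name, unit, value in data_elements:
                    DataElement.add((de_name, unit, value, coid))
        return Part, Char, DataElement

    # Invented sample data: (barcode, name, {(char_name, {(de_name, unit, value)})})
    parts = [
        ("4711", "gasket", [("geometry", [("diameter", "mm", "42")]),
                            ("material", [])]),
        ("4712", "bolt",   []),
    ]
    Part, Char, DataElement = flatten(parts)
    print(sorted(Part))         # invented Poids identify the parts
    print(sorted(Char))         # invented Coids identify characteristics, keyed by Poid
    print(sorted(DataElement))  # data elements reference their characteristic's Coid

The invented identifiers play exactly the role of the keys mentioned above: the original nested value can be reassembled from the three flat relations by joining on Poid and Coid and re-nesting.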

Example 2.5.1 Consider an instance Parts, which is a set of complex values of the following sort. A part (in a product-data management system) is a tuple of a barcode B, a name N, and a set of characteristics C. A characteristic is a tuple of a name N and a set of data elements D. A data element is a tuple of a name N, a unit of measurement U, and a value V⁷. The sort can thus be written as

⟨B : dom, N : dom, C : {⟨N : dom, D : {⟨N : dom, U : dom, V : dom⟩}⟩}⟩

⁶ Some support for object-oriented databases is a requirement in the use case of Section 1.3, as scientific repositories increasingly make use of object-oriented data models.
⁷ For simplicity, we assume that all atomic values are of the same domain dom. This is not an actual restriction unless arithmetic comparison operators (<, ≤) are allowed in the query language.

Suppose that we pose the following query in nested relation algebra ALGcv−:

πN,B,D(unnestC(πB,C(Parts)))

which asks for transformed complex values of sort

⟨N : dom, B : dom, D : {⟨N : dom, U : dom, V : dom⟩}⟩

and can be formulated in strongly safe-range calculus CALCcv− as

{x : ⟨N, B, D : {⟨N, U, V⟩}⟩ | ∃y, z, z′, w, w′, u, u′ :
  y : ⟨B, N, C : {⟨N, D : {⟨N, U, V⟩}⟩}⟩ ∧ z : {⟨N, D : {⟨N, U, V⟩}⟩} ∧
  z′ : ⟨N, D : {⟨N, U, V⟩}⟩ ∧ w : {⟨N, U, V⟩} ∧ w′ : ⟨N, U, V⟩ ∧
  u : {⟨N, U, V⟩} ∧ u′ : ⟨N, U, V⟩ ∧
  x.B = y.B ∧ y.C = z ∧ z′ ∈ z ∧ z′.N = x.N ∧ z′.D = w ∧ w′ ∈ w ∧
  x.D = u ∧ u′ ∈ u ∧ u′ = w′}

Let us map the collection Parts to a flat relational database with schema

R = {Part(Poid, B, N), Char(Coid, N, Poid), DataElement(N, U, V, Coid)}

where the attributes Poid and Coid stand for object identifiers which must be invented when flattening the data. The above query can now be equivalently asked in relational algebra as

πN,B,Dn,U,V((πPoid,B(Part) ⋈ Char) ⋈ πN→Dn,U,V,Coid(DataElement))

The greatest challenge here is the elimination or renaming of the three different name attributes N. The same query has the following equivalent in the (conjunctive) relational calculus

{⟨x, y, z, u, v⟩ | ∃i1, i2, d : Part(i1, x, d) ∧ Char(i2, y, i1) ∧ DataElement(z, u, v, i2)}

After executing the query, the results can be nested to get the correct result for the nested relational algebra or calculus query. □

Chapter 3

Data Integration

This chapter briefly surveys several research areas related to data integration. We proceed by first presenting two established architectures, federated and multidatabases in Section 3.2 and data warehouses in Section 3.3. Next, in Section 3.4, we discuss information integration in AI. Several research areas of AI that are relevant to this thesis are surveyed, including ontology-based global information systems, capability description and planning, and multi-agent systems as a further integration architecture. Then we discuss global-as-view integration (together with an integration architecture, mediator systems) in Section 3.5 and local-as-view integration in Section 3.6. In Sections 3.7 and 3.8 we arrive at recent data integration approaches. Section 3.9 discusses management and maintainability issues in large and evolving data integration systems and compares the different approaches presented according to various qualitative aspects. First, however, we start with some definitions.

3.1 Definitions and Overview

Source integration [JLVV00] refers to the process of integrating a number of sources (e.g. databases) into one greater common entity. The term is usually used as part of a greater, more encompassing process, as perceived in the data warehousing setting, where source integration is usually followed by aggregation and online analytical processing (OLAP). There are two forms of source integration, schema integration and data integration. Schema integration [BLN86] refers to a software engineering or knowledge engineering approach, the process of reverse-engineering information systems and reengineering schemata in order to obtain a single common "integrated" schema – which we will not address in more detail in this thesis. While the terms data and information are of course not to be confused, data integration and information integration are normally used synonymously (e.g., [Wie96, Wie92]).

Data integration is the area of research that addresses problems related to the provision of interoperability to information systems by the resolution of heterogeneity between systems on the level of data. This distinguishes the problem from the wider aim of cooperative information systems [Coo], where also more advanced concepts such as workflows, business processes, and supply chains come into play, and where problems related to coordination and collaboration of subsystems are studied which go beyond the techniques required and justified for the integration of data alone.

[Figure 3.1: Artist's impression of source integration. Source integration divides into schema integration and data integration; data integration comprises structural integration, semantic integration, and data reconciliation.]

The data integration problem can be decomposed into several subproblems. Structural integration (e.g., wrapping [GK94, RS97]) is concerned with the resolution of structural heterogeneity, i.e. the heterogeneity of data models, query and data access languages, and protocols¹. This problem is particularly interesting when it comes to legacy systems, which are systems that in general have some aspect that would be changed in an ideal world but in practice cannot be [AS99]. In practice, this often refers to out-of-date systems in which parts of the code base or subsystems cannot be adapted to new requirements and technologies because they are no longer understood by the current maintainers or because the source code has been lost.

Semantic integration refers to the resolution of semantic mismatch between schemata. Mismatch of concepts appearing in such schemata may be due to a number of reasons (see e.g. [GMPQ+97]), and may be a consequence of differences in conceptualizations in the minds of different knowledge engineers.

Mismatch may not only occur on the level of schema entities (relations in a relational database or classes in an object-oriented system), but also on the level of data. The associated problem, called data reconciliation [JLVV00], includes object identification (i.e., the problem of determining correspondences of objects represented by different heterogeneous data sources) and the handling of mistakes that happened during the acquisition of data (e.g. typos), which is usually referred to as data cleaning. An overview of this classification of source integration is given in Figure 3.1. Since, for this thesis, the main problem among those discussed in this section is the resolution of semantic mismatch, we will also put an emphasis on this problem in the following discussion and comparison of research related to data integration.

¹ We experience structural heterogeneity if we need to make a number of databases interoperable of which, for example, some are relational and others object-oriented, or if among the relational databases some are only queryable using SQL while others are only queryable using QUEL [SHWK76]. Other kinds of structural heterogeneity are encountered when two database systems use different models for managing transactions, or when they lack middleware compatible with both that allows queries and results to be communicated.

3.2 Federated and Multidatabases

The data integration problem has been addressed early on by work on multidatabase systems. Multidatabase systems are collections of several (distributed) databases that may be heterogeneous and need to share and exchange data. According to the classification² of [SL90], federated database systems [HM85] are a subclass of multidatabase systems. Federated databases are collections of collaborating but autonomous component database systems. Nonfederated multidatabase systems, on the other hand, may have several heterogeneous schemata but lack any other kind of autonomy. Nonfederated multidatabase systems have one level of management only, and all data management operations are performed uniformly for all component databases. Federated database systems can be categorized as loosely or tightly coupled systems. Tightly coupled systems are administered as one common entity, while in loosely coupled systems, this is not the case and component databases are administered independently [SL90].

Component databases of a federated system may be autonomous in several senses. Design autonomy permits the creators of component databases to make their own design choices with respect to data models and query languages, data managed and schemata used for managing them, and the conceptualizations and semantic interpretations of the data applied. Other kinds of component autonomy that are of less interest to this thesis but still deserve to be mentioned are communication autonomy, execution autonomy and association autonomy [SL90, HM85]. Autonomy is often in conflict with the need for sharing data within a federated database system. Thus, one or even several kinds of autonomy may have to be relaxed in practice to be able to provide interoperability.

² There is some heterogeneity in the nomenclature of this area. A cautionary note is due at this point: many of the terms in this chapter have been used heterogeneously by the research community. Certain choices had to be made in this thesis to allow a uniform presentation, which are hopefully well documented.

[Figure 3.2: Federated 5-layer schema architecture. From bottom to top: local schemata, component schemata, export schemata, federated schemata, and external schemata.]

Modern database systems successfully use a three-tier architecture [TK78] which separates physical (also called internal) from logical representation, and the logical schema in turn from possibly multiple user or application perspectives (provided by views). In federated database systems, these three layers are considered insufficient, and a five-layer schema architecture has been proposed (e.g. [SL90] and Figure 3.2). Under this architecture, there are five types of schemata between which queries are translated. These five types of schemata are

• Local schemata. The local schema of a component database corresponds to the logical schema in the classical three-layer architecture of centralized database systems.

• Component schemata. The component schema of a database is a version of its local schema translated into the data model and representation formalism shared across the federated database system.

• Export schemata. An export schema contains only the part of the schema relevant to one integrated federated schema.

• Federated schemata³. This schema is an integrated homogeneous view of the federation, against which a number of export schemata are mapped (using data integration technology). There may be several such federated schemata inside a federation, providing different integrated views of the available data.

³ These are also known as import schemata or global schemata [SL90].

"Data Cube" Data Marts Data (MDDBS) Analysis

Extraction & Aggregation Data Warehouse

Data Mediator Reconciliation & Integration

Wrapper Wrapper Wrapper Wrapper Wrapper

Figure 3.3: Data warehousing architecture and process.

• External Schemata provide application or user-specific views of the feder- ated schemata, as in the classical three-layer architecture. This five-layer architecture is believed to provide better support for the inte- gration and management of heterogeneous autonomous databases than the clas- sical three-layer architecture [HM85, SL90].

3.3 Data Warehousing

Data Warehousing (Figure 3.3) is a somewhat interdisciplinary area of research whose scope goes beyond pure data integration. The goal is usually, in an enterprise environment, to collect data from a number of distributed sites⁴ (e.g., grocery stores), clean and integrate them, and put them into one large central store, the corporate data warehouse. Data warehousing is also about performing aggregation of relevant data (e.g. sales data). Data may then be extracted and transformed according to schemata customized for particular users or analysis tools (Online Analytical Processing, OLAP) [JLVV00].

Since the data manipulated are in practice often highly mission-critical to enterprises and may be very large, special technologies have been developed for dealing with the aggregation of data (e.g. the summarization of sales data according to criteria such as categories of products sold, regions, and time spans), such as multidimensional databases (MDDBMS) or data cubes.

⁴ The point of this is not just the resolution of heterogeneity but also to have distinct systems for Online Transaction Processing (OLTP) and data analysis for decision support, which usually access data in very different ways and also need differently optimized schemata. (In OLTP, transactions are usually short and occur at a high density, while in OLAP, transactions are few but long and put emphasis on querying.)

As data integrated against a warehouse are usually materialized there, the data warehousing literature often makes a distinction between mediation, which is confined to data integration on demand, i.e. when a query against the warehouse occurs (also called "virtual" integration or the lazy approach [Wid96] by data warehouse researchers), and materialized data integration (the eager approach [Wid96]). The materialized approach to data integration in fact adds problems related to dynamic aspects (e.g., the view update and view maintenance problems). These problems are not yet well understood, and known theoretical results are often quite negative [AHV95].

Data Warehousing has received considerable interest in industry, and there are several commercial implementations, such as those by Informix and MicroStrategy [JLVV00]. Two well-known research systems are WHIPS [GMLY98] and SQUIRREL [ZHKF95b, ZHKF95a].

3.4 Information Integration in AI

There has traditionally been much cross-fertilization between the artificial intelligence and information systems areas, and the intelligent integration of information [Wie96] is not an exception. It is particularly worthwhile to take note of research on ontologies, capability description, planning, knowledge-based systems, and multi-agent systems. Another important area is description logics, which we leave to their own section (Section 3.7). Work in these areas has – sometimes indirectly – had much influence on data integration.

3.4.1 Integration against Ontologies

There is an ongoing discussion among Formal Ontologists and AI researchers on how to define ontologies [GN87, Gru, GG95, Gua94, HS97]. One definition that has been particularly well argued for refers to ontologies as partial accounts of specifications of conceptualizations [GG95]. Ontologies are logical theories of parts of conceptualizations (to be found in the mind of some knowledge engineer) of a problem domain. As such, ontologies may consist of more than taxonomical knowledge but may be representations of virtually any kind of knowledge. In practice we are interested in work on ontologies in the context of information integration as information models of AI information systems, powerful analogs to database schemata.

Ontological engineering [Gua97, Gru92, Gru93a, Gru93b, CTP00] concerns itself with the design and maintenance of large ontologies. Several research projects on tools [DSW+99] for ontological engineering, such as the Ontolingua server

[FFR96], have been carried out. One problem also addressed is that of reengineering and merging existing ontologies, which is in many ways similar to schema integration [BLN86]. Experiences show much similarity with developments in object-oriented software engineering and information systems research. Designing and maintaining large ontologies has been found to lead to problems (see the Cyc experience [LGP+90]), and research has followed approaches such as applying the idea of design patterns [GHJV94] to ontological engineering [CTP00], or the use of libraries of micro-ontologies, which are small building blocks that can be composed to create domain ontologies on demand.

AI data integration systems are usually based on an architecture in which there is one well-designed "global" domain ontology (as a theory of the world represented) against which a number of wrapped data sources are integrated. Such systems fall into the category of global information systems. For instance, the influential Carnot system [SCH+97] of MCC mapped databases against the large and well-known Cyc ontology [LGP+90, Cyc] using a deductive database language called LDL [Zan96]. For other similar interesting work see e.g. the OBSERVER project [MKSI96, MIKS00], SIMS [AK92, HK93] and InfoSleuth [NU97, NBN99, BBB+97, FNPB99, NPU98].

It has been claimed (e.g. in [MIKS00]) that global information systems based on ontologies are a substantial step forward compared to systems that integrate against database schemata, because ontologies allow information content in data repositories to be described independently of the underlying syntactic representation of the data. The rationale behind this is that ontologies are defined as artifacts on the knowledge level [New82, New93] rather than the symbol level and should be independent of syntactic considerations. The above claim of a practical advantage can be comfortably challenged, however. Apart from the necessary choice of some vocabulary for naming the concepts, ontological commitments have to be made on how to interrelate concepts (onto)logically (e.g. by part-of, is-a, and instance-of relationships) as much as they are needed in database schema design. Research in Formal Ontology such as [Bra83, GW00b, GW00a] aims at determining guidelines for ontological commitments. It is highly questionable if such work could ever keep humans from intuitively disagreeing on such issues. However, until such consensus is reached, it would be misleading to make the above claim in the pragmatic context of data integration. Note also that the OBSERVER system of [MIKS00] uses the CLASSIC description logics system for representing ontologies, a system that is even considered by its designers to provide a symbol-level data model [BBMR89a, Bor95] (see also Section 3.7).

3.4.2 Capability Descriptions and Planning

Planning as a particularly important application of problem solving has been among the core topics of interest in Artificial Intelligence ever since the influential STRIPS planning system [FN71] established it as a research area in its own right, with its own specific theoretical results and algorithms⁵ [Wel99].

Planning problems in STRIPS-like planners are described by an initial state of the world, a goal state, and a number of planning operators ("actions"), described by pre- and postconditions and invariants⁶. A solution to a planning problem is then a (possibly only partially ordered) sequence of operator applications that transforms the world from the given initial state to the desired goal state. The need for capability description, which is strongly related to such operator descriptions, in systems that use planning has resulted in a number of interesting capability description languages [WT98], e.g. LARKS, the capability description formalism of Retsina [SLK98], description-logics based formalisms [BD99], and capability description languages for problem-solving methods in knowledge-based systems (e.g. EXPECT [SGV99]).

Planning for information gathering has received much recent interest because of its role in intelligent information systems for dealing with the information overload of the World Wide Web [Etz96, Mae94]. Since planning for information gathering is a quite special case of planning in general (for instance, information gathering operations do not change the world in the sense actions in a physical world do), special techniques have been developed for this problem [KW96, AK92, GEW96]. The data integration problem can be formulated as a planning problem as well, with reasoning being done based on the capability descriptions of data sources. Interestingly, this leads to mappings between data sources and global ontologies that are the inverse of the classical method of, for instance, Carnot, or of that conventionally used in federated and multidatabase systems, data warehouses, and mediator systems. In the classical method, "destination" concepts that are part of the "global" integration schemata are described as views over the data sources (conceptually speaking; in practice, these mappings are often encoded as some procedural transformation code that does the job). This conventional method of data integration is thus termed global-as-view (GAV) integration.

Data integration by planning, on the other hand, proceeds by having the contents of data sources described as capabilities in terms of the global world model. Queries are answered by building a plan that uses the given data sources as described in the capability descriptions to extract and combine their data, and executing that plan. This kind of data integration, where mappings are expressed as descriptions of "local" sources in terms of the global ontology, is thus called local-as-view (LAV) integration⁷. Notable AI research that follows this route includes the OCCAM planning algorithm [KW96] and the SIMS system [AK92, HK93] for dealing with heterogeneous information sources, which is based on the LOOM knowledge representation and reasoning system [MB87] (using its description logic for expressing contents of data sources) and the Prodigy planner [CKM91].

⁵ Consider, for instance, partial-order planners [PW92, RN95] and more recently SAT planning [KS92] and Graphplan [BF97].
⁶ In STRIPS, operators were described by preconditions and so-called add- and delete-lists for logical statements about the world that are changed by executing an action.
⁷ We will address the GAV and LAV issues in more depth in dedicated sections, 3.5 and 3.6.

3.4.3 Multi-agent Systems

Multi-agent systems (MAS) are, by their very conceptualization, cooperative information systems par excellence. We avoid touching the unsettled issue of trying to define software agents here and refer to [Nwa96, WJ95, Wei99] or the extensive community discussion of that issue in the UMBC agents mailing list archives [Age]. MAS for information integration follow a heavy agent metaphor, in which agents

• have an explicit logical model of their environment and other agents.

• need to reason over their knowledge and over the states of other agents.

• need to plan, both for information gathering (i.e., as a part of the data integration problem analogous to query rewriting) and possibly for multi-agent coordination (e.g. Partial Global Planning [DL91], GPGP [DL92, DL95]).

• communicate in expressive agent communication languages. These usually provide elementary building blocks⁸ for protocols and knowledge exchange formats (e.g. KIF [GF92]).

Furthermore, agents in the information integration setting are usually designed to be cooperative rather than self-interested [SL95]. Apart from being a welcome testbed and melting pot for various areas of AI research, the field has its own interesting and still largely unresolved challenges. The coordination problem in MAS revolves around much more than just providing languages for communication and knowledge exchange. The collaboration of agents requires coordination whose provision is not yet sufficiently understood. Much research has centered around providing coordination algorithms and protocols (e.g. the Contract Net Protocol [Smi80] and (Generalized) Partial Global Planning [DL91, DL92, DL95]), research frameworks (e.g. TÆMS [Dec95]), abstractions of protocols (e.g. conversation policies [SCB+98, GHB99]), social rules [COZ00], pragmatics [HGB99], and game-theoretic considerations [PWC95]. For further interesting work on coordination see [WBLX00, COZ00, Cro94, KJ99]. Another important problem is to establish multi-agent systems as a software engineering paradigm – agent-oriented software engineering [Sho93, Jen99, JW00].

⁸ These building blocks are sometimes called performatives and at times have been motivated by speech act theory [Sea69], as in the case of KQML [FFMM94, FL97].

[Figure 3.4: MAS architectures for the intelligent integration of information, with user agents, wrapper agents, data analysis agents, a matchmaker, and a mediator. Arrows between agents depict exemplary communication flows; numbers denote logical time stamps of communication flows.]

Intelligent Information Integration has been a popular application of MAS. Due to their approach of seeking interoperability of several highly autonomous units (the agents), multi-agent systems are almost by definition performing an integration task. Several systems thus have addressed information integration, e.g. Retsina [SLK98], InfoSleuth [NU97, NBN99, BBB+97, FNPB99, NPU98], KRAFT [PHG+99] and BOND [TBM99]. Such systems are particularly interesting for their contributions to structural integration⁹ and have been less groundbreaking with respect to semantic integration, where usually techniques in the tradition of those discussed elsewhere in this chapter are used¹⁰.

⁹ In principle they constitute the promise of the most open, hot-pluggable middleware infrastructure possible.
¹⁰ Surprisingly, systems such as InfoSleuth and KRAFT follow the global-as-view paradigm for integration, as planning is not employed on the level of data integration as is the case in SIMS and OCCAM.

A generic MAS architecture for information integration is depicted in Figure 3.4. Such cooperative multi-agent systems are networks of collaborating agents of a number of categories, some of which we list next.

• Wrapper agents connect data sources (possibly legacy systems) to the system by advertising contents to other agents and listening for and answering data gathering requests of other agents on behalf of the wrapped sources.

• Middle agents [DSW97, GK94] or facilitators aim at solving the connection problem [DS83], i.e., the problem of enabling providers and requesters in a multi-agent system to initially meet. Middle agents support interoperability and cooperation by matching agents with others that may be helpful in solving their integration problems. Such agents may have varying degrees of "intelligence" and proactivity, and one notably distinguishes between matchmakers, brokers and mediators. Matchmakers are advanced yellow pages services with varying degrees of sophistication that allow agents to advertise their services as well as to inquire for services of other agents (e.g. [SLK98]). Broker agents [RZA95] can be explained as analogous to real-life stock market or real estate brokers. Brokers solve the connection problem by matching agents, but may (and usually do) also act as intermediaries in the subsequent problem solving process. This may, for instance, allow agents communicating via a broker to remain anonymous. Mediators (e.g. [ABD+96]) add additional value by acting as intermediaries between agents collaborating to achieve some common goal and employing their own capabilities to support the problem solving process. More precisely, mediators do not only attack the connection problem on the level of finding matches, but often also resolve semantic heterogeneity between agents in a heterogeneous system. Note, however, that there is substantial terminological heterogeneity regarding this issue. Particularly facilitators called brokers have had different roles from the one described above in some systems for information integration [NBN99, PHG+99].

• Data analysis and processing agents provide some value-adding reasoning functionality to the other agents in the system.

• User agents represent the interests of users and gather information from the system on their behalf.

In Figure 3.4, arrows between agents depict two exemplary communication flows, one involving a matchmaker and one involving a mediator agent. The arrows describe the directions of messages sent, and are attributed with logical time stamps. The main difference between the two types of middle agents that this figure is meant to clarify is that matchmakers may be consulted for services but requester agents are then left to themselves for the problem solving task, while mediators are usually highly involved throughout this process. The matchmaker of Figure 3.4 checks back whether the agents that it plans to propose to the requester are able to provide the requested service. This goes beyond a simple yellow pages service.

For influential work on multi-agent systems architectures following the heavy agent metaphor in general, we refer to ARCHON [CJ96, JCL+96], the Retsina infrastructure [SPVG01], and KAoS [BDBW97]. In conclusion, it is necessary to remark that MAS as cooperative information systems have gone much further than just information integration, for instance to managing and integrating the business processes and supply chains of enterprises [PGR98, JNF98, JFJ+96].

3.5 Global-as-view Integration

The global-as-view way by which mappings between schemata are defined – by describing "global" integrated schemata in terms of the sources¹¹ – has been used in most of the architectures discussed so far. This includes multidatabase systems, the data warehouse architecture where we had one component called the "mediator" which performed data integration, and various AI approaches as discussed in Section 3.4.

In this section, we will first discuss the mediator architecture of [Wie92], which has been seminal to information systems research¹². Then, we approach global-as-view integration in a simplistic way, through classical database views. (One may presume, however, that this is the approach taken most often in industrial practice.) Finally, we briefly discuss some research systems related to this area.

¹¹ This is also, implicitly, the method followed by any procedural code that transforms data adhering to one schema into data adhering to another.
¹² Note that the term mediation has experienced substantial overload, and we have used it so far in three different contexts and with four slightly different meanings. Differently from the mediator concept in our data warehouse architecture [JLVV00], this fourth mediator concept has a smaller granularity. Differently from mediator agents in AI, Wiederhold's mediators are far removed from aspects of multi-agent cooperation and are not meant to be "intelligent" [Wie92]. Mediation as the "lazy" approach to data integration mentioned earlier in the context of data warehousing closely coincides with this fourth concept.

3.5.1 Mediation

Mediators are components of an information system that address a particular heterogeneity problem in the system and provide a pragmatic "solution" to it. A mediator is a "black box" that assumes a number of sources with some exported schema each (these can be, for instance, wrapped databases or other mediators). Mediators export some interface (some schema) against which data are integrated. The integration problem is then left to a domain expert to address a certain aspect of heterogeneity and to implement the mediator. Each mediator thus encapsulates a particular integration problem. An overview of types of integration problems ("mediation functions") is given in [Wie92]. Such mediation functions include the transformation and subsetting of databases, the merging of multiple heterogeneous databases, the abstraction and generalization of data, and methods for dealing with uncertain data as well as incomplete or mismatched sources.

[Figure 3.5: A mediator architecture: wrapped data sources at the bottom, with layers of mediators built on top of the wrappers and of other mediators.]

A typical architecture of a mediator system is shown in Figure 3.5. For structural integration, data sources are usually wrapped to permit a single way of accessing sources in terms of data models and query languages. Mediators are pieces of code encapsulating some operational knowledge of a domain expert, implementing mediation functions that add value to and remove heterogeneity from the data provided by the sources.

3.5.2 Integration by Database Views

Let us assume a relational database context. In the global-as-view approach, global relations are expressed as views (e.g. SQL views) in terms of source relations. Given a global relation p(X̄) and source relations p1, . . . , pn, p might be expressed as a (finite) set of conjunctive views¹³ of the form

p(X̄) ← p1(X̄1), . . . , pn(X̄n).

Given a query posed in terms of view predicates, the query answering process is simple, as it reduces to simple conjunctive query unfolding (see Section 2.1).

¹³ For the record, such a view is logically equivalent to a declarative constraint of the form
{⟨X̄⟩ | p(X̄)} ⊇ {⟨X̄⟩ | ∃Ȳ : p1(X̄1) ∧ . . . ∧ pn(X̄n)}
in set-theoretic notation, where X̄, Ȳ are tuples of variables, X̄1, . . . , X̄n are tuples of variables and constants, and Ȳ = (X̄1 ∪ . . . ∪ X̄n) − X̄.

Example 3.5.1 Suppose we have four sources of information about books:

acm_proceedings(Title, ISBN)
book1(ISBN, Title, Author, Publisher)
product(Name, Category, Producer, Price)
book_price(ISBN, Price)

We can create a database view (by a positive query, i.e., a set of conjunctive queries) providing an integrated interface to book information as follows. Let us assume we are only interested in titles, publishers and prices of books.

book(Title, Publisher, Price) ← book1(ISBN, Title, Author, Publisher), book_price(ISBN, Price).
book(Title, Publisher, Price) ← product(Title, “Book”, Publisher, Price).
book(Title, “ACM Press”, Price) ← acm_proceedings(Title, ISBN), book_price(ISBN, Price).

Queries asked over the relation “book” can be answered by unfolding them with the views. □
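The effect of this unfolding step can be illustrated with a small sketch (our own code; the source instances below, including the ISBN strings, are invented sample data and not taken from the thesis). It materializes the integrated relation book by evaluating the three views of Example 3.5.1 over the sources and then answers a query posed against book.

    # Invented sample instances for the four sources of Example 3.5.1:
    acm_proceedings = {("Proc. Example Conf.", "isbn-1")}
    book1 = {("isbn-2", "A Database Book", "Some Author", "Some Press")}
    product = {("Databases Explained", "Book", "Another Press", 50)}
    book_price = {("isbn-1", 45), ("isbn-2", 60)}

    def gav_book():
        """Materialize the integrated relation book(Title, Publisher, Price) by
        evaluating the three conjunctive views of Example 3.5.1 over the sources."""
        result = set()
        # book(T, P, Pr) <- book1(I, T, A, P), book_price(I, Pr).
        for isbn, title, _author, publisher in book1:
            for isbn2, price in book_price:
                if isbn == isbn2:
                    result.add((title, publisher, price))
        # book(T, P, Pr) <- product(T, "Book", P, Pr).
        for name, category, producer, price in product:
            if category == "Book":
                result.add((name, producer, price))
        # book(T, "ACM Press", Pr) <- acm_proceedings(T, I), book_price(I, Pr).
        for title, isbn in acm_proceedings:
            for isbn2, price in book_price:
                if isbn == isbn2:
                    result.add((title, "ACM Press", price))
        return result

    # The query q(T, Pr) <- book(T, "ACM Press", Pr), posed over the integrated
    # relation, unfolds into source queries; evaluating it over the view gives:
    print({(t, pr) for (t, p, pr) in gav_book() if p == "ACM Press"})
    # {('Proc. Example Conf.', 45)}

In an actual GAV system the unfolding would of course be performed symbolically on the query rather than by materializing the view, but the answers coincide.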

3.5.3 Systems

Research systems in this area have usually aimed at providing toolkits and description languages for automating the generation of mediators as much as possible. Three notable research systems in this area, TSIMMIS [GMPQ+97], HERMES [ACPS96], and Garlic [CHS+95], have been no exception. Since global-as-view integration in its simplest (and relational) form is quite straightforward, research systems also have put emphasis on advanced aspects such as multimedia data integration. In the following, we will have a somewhat closer look at the approach taken in TSIMMIS.

TSIMMIS

TSIMMIS [GMPQ+97] (“The Stanford-IBM Manager of Multiple Information Sources”) is a well-known research prototype that provides generators for mediators and wrappers. The generation of mediators and wrappers is a widely proposed technique for leveraging the practical usefulness of the mediator approach. In this system, integration is based on the Object Exchange Model (OEM) [PGMW95] of the Stanford Database Group, a simple semistructured data model. It has also been used in other projects of that group, such as LORE [AQM+97]. The Mediator Specification Language (MSL) uses a syntax similar to datalog but which has been extended to the semistructured paradigm [ABS00, TMD92, PGMW95].

Mediator definitions are declaratively specified and can then be compiled down to mediators. Of course, such mediator definitions can only be changed (or new mediators added) offline, that is, changes require the definitions to be recompiled. The semistructured data model and query language used in TSIMMIS also allows for data sources that only supply data for some of the attributes in a mediator interface, which we cannot appropriately match with relational database views in the spirit of Example 3.5.1. For instance, it is possible to define mappings of two sources s1, s2 against a mediated relation r by

∀x, y ∃z : s1(x, y) → r(x, y, z)
∀x, z ∃y : s2(x, z) → r(x, y, z)

which would not satisfy the range restriction requirement when expressed as a pair of conjunctive logical views. Given knowledge that the first attribute of r functionally determines the other two (as object identifiers in OEM of course do), the two views could nevertheless be used to answer a query such as q(x) ← r(x, y, z) by compiling the above mappings into a mediator for the view

r(x, y, z) ← s1(x, y), s2(x, z).

More Research Systems

The HERMES system [ACPS96] is another mediator toolkit that aims at the wider goal of providing a complete methodology for source integration. In the design of the system, special care has been taken to permit the integration of multimedia sources. The system supports parameterized procedure calls that may be defined for accessing restricted sources and that are then used by HERMES mediators to answer queries. The Garlic system [CHS+95] is a research prototype that, similarly to HERMES, aims at integrating multimedia sources. Other systems that clearly fall into the global-as-view category and that we touched on briefly earlier are CARNOT and multi-agent systems such as KRAFT [PHG+99]. Much of the work on formalisms for database transformation (e.g. WOL [KDB98]) also falls into the category of global-as-view approaches. Note that there, data integration is not seen as the primary goal but rather as one application among others in the database transformations research area.

3.6 Local-as-view Integration and the Problem of Answering Queries Using Views

Local-as-view integration (LAV) is strongly related to the database-theoretic problem of answering (rewriting) queries using views [YL87, LMSS95, DGL00, AD98, BLR97, RSU95, PV99, SDJL96, PL00, CDLV00a], which will be discussed in more detail in this section.

Within data integration, the local-as-view approach is applied in global information systems architectures. Influential LAV data integration systems include the Information Manifold [LRO96], InfoMaster [GKD97], and SIMS [AK92, HK93]. Beyond data integration, the problem of answering queries using views has also been found relevant for query optimization [CKPS95] (where previously materialized queries are used to answer similar queries14), the maintenance of physical data independence [TSI94], and Web-site management systems [FFKL98].

3.6.1 Answering Queries using Views

The local-as-view approach is based on the notion of a "global" mediated schema, that is, a specially designed integration schema. The content of "local" sources is described by logical views in terms of the predicates of the "global" schema (thus the term local-as-view). Given "global" predicates p1, . . . , pn and a source v, a LAV view is defined by a conjunctive query

v(X̄) ← p1(X̄1), . . . , pn(X̄n).

Assuming a query over global predicates p1, . . . , pm, this query can be automatically rewritten by the system to contain only source predicates (such as v) instead of the global predicates.

For the purpose of data integration, we consider only the case where one searches for complete rewritings, which are rewritings in which all global predicates have been replaced by views. We aim at producing rewritings that are minimal. A conjunctive query Q is minimal if there is no conjunctive query Q′ such that Q ≡ Q′ and Q′ has fewer subgoals than Q (see Section 2.4). For the minimality of positive queries (as sets of conjunctive queries) we furthermore require that conjunctive member queries are pairwise nonredundant, i.e. for a positive query {Q1, . . . , Qn}, we require Qi ⊈ Qj and Qj ⊈ Qi for each pair i, j ∈ {1, . . . , n}.

One can either attempt to find equivalent rewritings or maximally contained rewritings. Given a conjunctive query Q and a set of conjunctive views V, an equivalent rewriting Q′ – if it exists – is a conjunctive query Q′ that only uses the view predicates in V and which, when expanded (unfolded) with the views, is equivalent to Q. Given a conjunctive query Q and a set of conjunctive views V, Q′ is a maximally contained rewriting15 (w.r.t. the positive queries16) if and only if each member query is, when expanded using the views, contained in Q and there is no conjunctive query Q″ such that, when expanded with the views, it is contained in Q but Q″ is not contained in any of the member queries of Q′. In general, it is not always possible to find an equivalent rewriting, and the maximally contained rewriting – as a set of conjunctive queries – may be empty. Equivalent rewritings require that views be complete, as is usually the case for true materialized database views. In a data integration setting, it is usually appropriate to consider sources to be possibly incomplete.

14 The problem of answering queries using views is thus indirectly important to global-as-view data integration approaches such as data warehousing as well.
15 Note that our definition of maximally contained rewritings is different from Levy's [PL00], where a rewriting is only maximally contained if it has the properties we enumerate and there is at least one database for which the result of the original query is strictly larger than the result of the rewriting. Under our definition, however, equivalent rewritings are also maximally contained.
16 Maximally contained rewritings need to be defined relative to a query language.

Example 3.6.1 (From [Ull97]) Suppose we have a global schema with a virtual predicate p (“parent of”), a query

q(x, y) ← p(x, u), p(u, v), p(v, y).

and two sources s1 ("grandparent of") and s2 ("parent of someone who is also a parent"). We can define the following logical views

s1(x, z) ← p(x, y), p(y, z).

s2(x, y) ← p(x, y), p(y, z).

Let us first assume that the two views are complete, i.e. that they logically correspond to the constraints

{⟨x, z⟩ | s1(x, z)} ≡ {⟨x, z⟩ | ∃y : p(x, y) ∧ p(y, z)}

{⟨x, y⟩ | s2(x, y)} ≡ {⟨x, y⟩ | ∃z : p(x, y) ∧ p(y, z)}

There is an equivalent rewriting of q:

q′(x, z) ← s2(x, y), s1(y, z).
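(Unfolding q′ with the two view definitions yields q′(x, z) ← p(x, y), p(y, z′), p(y, w), p(w, z) with fresh variables z′ and w, which is easily checked to be equivalent to q.)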

If we assume that our views are incomplete sources in a data integration system, they correspond to the logical constraints

{⟨x, z⟩ | s1(x, z)} ⊆ {⟨x, z⟩ | ∃y : p(x, y) ∧ p(y, z)}

{⟨x, y⟩ | s2(x, y)} ⊆ {⟨x, y⟩ | ∃z : p(x, y) ∧ p(y, z)}

meaning that s1 is a source of grandparent relationships and s2 is a source of parent relationships where the children are themselves parents, but both sources do not necessarily provide all such relationships (although they only provide such relationships). The implication direction of the conjunctive views shown above is thus somewhat misleading, while the constraints based on set-theoretic notation employed above are exact.

It is possible to show that the following positive query is a maximally contained rewriting (as a set of conjunctive queries) of q that only uses the (incomplete) views s1 and s2:

q′(x, z) ← s1(x, y), s2(y, z).
q′(x, z) ← s2(x, y), s1(y, z).
q′(w, z) ← s2(w, x), s2(x, y), s2(y, z).

Note that this rewriting is also nonredundant and minimal in the sense that we cannot remove any member queries or subgoals and retain a maximally contained rewriting. □

It can be shown that if both q and the views in V are conjunctive queries without arithmetic comparison predicates, then it is sufficient to consider only rewritings with at most as many subgoals (views) as the original query [LMSS95] as candidates for both equivalent and maximally contained rewritings. (See also [Ull97].) A naive algorithm for finding an equivalent rewriting is thus to guess an arbitrary rewriting Q′ of Q with at most as many subgoals as in Q which uses only the views in V and then to check if Q′ is equivalent to Q. For maximally contained positive rewritings, one can incrementally build a maximal set of rewritings by searching the whole space of such rewritings (which is finite). The problem of answering queries using logical views is NP-complete already in the simple case of conjunctive queries without arithmetic comparison predicates [CM77, LMSS95]. Thus this is a presumably hard reasoning problem. However, it spares the human designer from having to carry out the rewriting task by hand17. For more expressive classes of query languages, the problem is harder or undecidable [vdM92, SDJL96, CV92, Shm87].
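The test at the heart of this naive procedure – checking whether the expansion of a candidate rewriting is contained in Q – amounts to searching for a containment mapping [CM77]. The following fragment is a brute-force illustration of that check in Python (an illustrative sketch of our own, not the implementation of any of the systems discussed here; predicate and variable names are made up):

from itertools import product

def is_var(t):
    # in this sketch, variables are lower-case strings; anything else is a constant
    return isinstance(t, str) and t[:1].islower()

def contained_in(q2, q1):
    """q1, q2: (head_args, body) with body a list of (pred, args) atoms.
    True iff q2 is contained in q1, i.e. there is a containment mapping
    (a homomorphism) from q1 to q2."""
    head1, body1 = q1
    head2, body2 = q2
    atoms2 = {(p, tuple(a)) for p, a in body2}
    vars1 = sorted({t for _, args in body1 for t in args if is_var(t)}
                   | {t for t in head1 if is_var(t)})
    terms2 = sorted({t for _, args in body2 for t in args} | set(head2))
    for image in product(terms2, repeat=len(vars1)):      # all candidate mappings
        h = dict(zip(vars1, image))
        f = lambda t: h.get(t, t)
        if tuple(map(f, head1)) != tuple(head2):
            continue                                      # heads must match exactly
        if all((p, tuple(map(f, args))) in atoms2 for p, args in body1):
            return True                                   # every q1 atom has an image in q2
    return False

# Example 3.6.1: the expansion of q'(x, z) <- s2(x, y), s1(y, z) with the views,
# using fresh variables z1 and w, is contained in (indeed equivalent to) q.
q   = (("x", "y"), [("p", ("x", "u")), ("p", ("u", "v")), ("p", ("v", "y"))])
exp = (("x", "z"), [("p", ("x", "y")), ("p", ("y", "z1")),
                    ("p", ("y", "w")), ("p", ("w", "z"))])
print(contained_in(exp, q))   # True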

3.6.2 Algorithms

Several improvements over the naive query rewriting algorithm have been proposed, among them the Bucket algorithm of the Information Manifold [LRO96], the Inverse Rules algorithm [DG97] of the InfoMaster System [GKD97], the MiniCon algorithm [PL00], OCCAM [KW96], and the Unification-join algorithm [Qia96]. We will discuss three of these algorithms in more detail.

The Bucket algorithm uses the following simple optimization over the naive algorithm. For each of the subgoals of a given query, each of the views is independently checked if it is possibly relevant to the process of replacing that subgoal. Such candidate views are collected in "buckets", one for each subgoal. Exhaustive search is then carried out in the Cartesian product of the buckets. Thus the necessary search space required for combining the views in the buckets is pruned compared to the naive algorithm.
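For the query and views of Example 3.6.1, for instance, each of the three p-subgoals of q can potentially be covered by both s1 and s2, so all three buckets contain both views and only the 2 · 2 · 2 = 8 combinations of bucket entries need to be checked for containment, instead of all syntactically possible rewritings with up to three subgoals over s1 and s2.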

17 In global-as-view integration, on the other hand, mediators have to be specially designed in order to be able to answer a certain repertoire of queries.

The inverse rules algorithm first transforms the views into Horn clauses. The queries can then be answered by executing the combination of the query and the Horn clauses representing the views as a logic program, in a bottom-up fashion.

Example 3.6.2 In the inverse rules algorithm, the views of Example 3.6.1 (under the incomplete views semantics) correspond to the Horn Clauses

p(x, f1(x, z)) ← s1(x, z).
p(f1(x, z), z) ← s1(x, z).
p(x, y) ← s2(x, y).
p(y, f2(x, y)) ← s2(x, y).

Given instances s1 = {⟨a, c⟩, ⟨b, d⟩} and s2 = {⟨a, b⟩, ⟨b, c⟩}, we can derive

p(a, f1(a, c)), p(f1(a, c), c), p(b, f1(b, d)), p(f1(b, d), d), p(a, b), p(b, f2(a, b)), p(b, c), p(c, f2(b, c))

and finally q(a, d) as the answer to the query of Example 3.6.1. □

Such a logic program can be transformed into an equivalent (function-free) nonrecursive datalog program, which can be unfolded into a set of conjunctive queries using a simple transformation [DG97].

The MiniCon algorithm uses information about variables occurring in queries for finding maximally contained rewritings. The MiniCon algorithm is based on the notion of MiniCon descriptions (MCD)18.

Definition 3.6.3 Given a conjunctive query Q and a set of views V, an MCD m is a tuple ⟨Vkm, hm, Gm, φm⟩ of

• A view Vkm ∈ V.

• A head homomorphism19 hm on the view Vkm.

• A set Gm ⊆ Body(Q) of subgoals of Q.

• A function φm : Vars(Gm) → Vars(Vkm) that maps the variables in the subgoals Gm of Q into the variables of Vkm.

18 Informally speaking, an MCD represents a fragment of a containment mapping from the query to its rewriting encompassing only the application of a single view and which is in a sense atomic.
19 A head homomorphism h : Vars(V) → Vars(V) is a mapping of variables that is the identity h(v) = v on variables not in the head of the view and maps head variables to head variables; more exactly, a head variable v ∈ Vars(Head(V)) is either mapped to itself (h(v) = v) or to another head variable for which h is the identity, i.e.

h(v) = w, w ∈ Vars(Head(V)), h(w) = w.

[Figure 3.6: MiniCon descriptions of the query and views of Example 3.6.1 – the query q(x, y) ← p(x, u), p(u, v), p(v, y) and the five MCDs m1–m5 over the views s1 and s2, with the variable mappings φ drawn between query and view subgoals.]

An MCD must additionally satisfy the following properties:

• For each g ∈ Gm, there is a subgoal of our view such that φm(g) ∈ Body(hm(Vkm)). (Gm is not necessarily the largest such set of subgoals of Q.)

• For each variable v ∈ Vars(Head(Q)) we have φm(v) ∈ Head(hm(Vkm)).

• For each variable v ∈ Vars(Q) for which φm(v) ∈ (Vars(hm(Vkm)) − Vars(Head(hm(Vkm)))) (i.e., φm(v) is among the existentially quantified variables20 of the head homomorphism applied to the view), all other subgoals in Q that contain v are in Gm.

• m is minimal in the sense that there is no proper subset of Gm such that the previous property remains true.

• hm is the least restrictive head homomorphism necessary in order to allow the view and query subgoals to be unified.

□

This is best explained with an example.

20 In the data integration setting, these are thus the attributes that were projected out in the materialized views. Data for them are not available, and the variables bound to these attributes not only cannot be bound to head variables of the query but also must not occur in any subgoals of Q left to be covered by other MCDs to produce a rewriting, as this would require a join of two source views by attributes that are "not available".

Example 3.6.4 For the query and the views of Example 3.6.1, there are five MCDs (see Figure 3.6). For all MCDs and variables, their head homomorphism is the identity id(v) = v for all variables in the respective view, so we do not explicitly state it. For brevity, let g1, g2, g3 denote the three subgoals of q.

m1 = ⟨s1, id, G1 = {g1, g2}, φ1⟩ with φ1(x) = x, φ1(u) = y, φ1(v) = z.
m2 = ⟨s1, id, G2 = {g2, g3}, φ2⟩ with φ2(u) = x, φ2(v) = y, φ2(y) = z.
m3 = ⟨s2, id, G3 = {g1}, φ3⟩ with φ3(x) = x, φ3(u) = y.
m4 = ⟨s2, id, G4 = {g2}, φ4⟩ with φ4(u) = x, φ4(v) = y.
m5 = ⟨s2, id, G5 = {g3}, φ5⟩ with φ5(v) = x, φ5(y) = y. □

Given the set M of all MiniCon descriptions for a query Q and a set of views V, all conjunctive queries that have to be considered for a maximally contained positive rewriting of Q can be constructed from combinations m1, . . . , mk of elements of M for which the sets Gm1, . . . , Gmk are a disjoint n-partition of the set of all subgoals in Q, i.e. Gm1 ∪ . . . ∪ Gmk = Body(Q) and Gmi ∩ Gmj = ∅ for each pair i ≠ j in 1 . . . k. Note that one also does not have to compute any containment mappings as needed in the Bucket algorithm anymore, as this is already implicit in the combination of the hi and φi.
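The enumeration of such combinations can be sketched as follows (an illustrative Python fragment of our own, not the MiniCon implementation of [PL00]; it only searches for disjoint exact covers of the query's subgoals and omits the construction of the MCDs themselves):

def disjoint_covers(subgoals, mcds):
    """mcds: list of (name, frozenset of covered subgoals). Yields the
    combinations whose G_m sets partition the given set of subgoals."""
    def rec(uncovered, chosen):
        if not uncovered:
            yield list(chosen)
            return
        pivot = min(uncovered)                  # next subgoal still to be covered
        for name, g in mcds:
            if pivot in g and g <= uncovered:   # g may not overlap anything chosen so far
                yield from rec(uncovered - g, chosen + [name])
    yield from rec(frozenset(subgoals), [])

mcds = [("m1", frozenset({"g1", "g2"})), ("m2", frozenset({"g2", "g3"})),
        ("m3", frozenset({"g1"})), ("m4", frozenset({"g2"})), ("m5", frozenset({"g3"}))]
for cover in disjoint_covers({"g1", "g2", "g3"}, mcds):
    print(cover)    # ['m1', 'm5'], then ['m3', 'm2'], then ['m3', 'm4', 'm5']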

Example 3.6.5 Let M be the set of five MCDs of the previous example. There are three n-partitions of the three subgoals {g1, g2, g3} of our query q using G1, . . . , G5, namely {G1, G5}, {G3, G2}, and {G3, G4, G5}. The rewritings producible from these partitions are those of Example 3.6.1. □

We arrive at the maximally contained rewriting of Q by transforming each of the partitions in the following way. Let {m1, . . . , mn} be such a partition. We apply φi⁻¹(hi(Vki)) for each MCD mi and combine the transformed views by conjunction into conjunctive queries. For those variables of a view for which φ⁻¹ is undefined, i.e. variables that only appear in subgoals of the view that are not matched with any of the subgoals of the query in the MCD, new variable names need to be invented.

Note that none of the three algorithms that we have discussed directly produces rewritings that are guaranteed to be minimal, so results have to be separately optimized to obtain this property. The Inverse Rules algorithm in its original formulation produces a datalog rewriting, and rewritten views are kept separate from queries. In the case of the rewriting of conjunctive queries, the rewriting process thus defers part of the activity carried out by the other two algorithms to the time of query execution. To compare this algorithm with the others, it is therefore necessary to unfold the datalog program produced by the Inverse Rules algorithm using the transformation of [DL97b] (also discussed in Section 7.2) or include query execution into the performance consideration.

Given moderately sophisticated techniques for executing datalog queries, the Inverse Rules algorithm performs better than the brute-force Bucket algorithm. The MiniCon algorithm, which takes into account more problem-specific knowledge and thus reduces the amount of redundant computations, however, in practice even outperforms an altered form of the Inverse Rules algorithm that unfolds the rewritings into sets of conjunctive queries [PL00].

3.6.3 Bibliographic Notes

The theory of answering queries using views is surveyed in [Hal00] and [Lev00]. It is strongly related to the query containment problem, and is usually at least as hard. The exception is the problem of answering datalog queries using conjunctive views, which is efficiently solvable [DG97], while the related containment problem is undecidable [Shm87]. On the other hand, the solution proposed in [DG97] does not apply query rewriting in the strong sense21.

The query rewriting problem in the presence of arithmetic comparison predicates in the query and views has been addressed in [LMSS95] for the case of equivalent rewritings. For the case of maximally contained rewritings, it is known that no complete algorithm can exist, not even one that produces a recursive rewriting [AD98]. A sound algorithm that covers many practically important cases, however, is presented in [PL00].

Queries with aggregation are addressed in [SDJL96]. The problem of answering queries using views in object-oriented databases and OQL [CBB+97] has been addressed in [FVR96]. The same problem in the case of regular path queries in semistructured data models is discussed in [CDLV99, CDLV00b, CDLV00a]. The problem of answering queries using views with functional dependencies (over the global predicates) has been addressed in [LMSS95] for the case of equivalent rewritings, where the bound on the maximal number of subgoals only needs to be slightly extended to the sum of the number of the subgoals in the original query plus the sum of the arities of the subgoals. Maximally contained rewritings for the same case may need to be recursive [DL97b, DGL00].

Binding Patterns (Adornments)

The problem of answering queries using views with binding patterns derives its relevance from the fact that many sources in data integration systems have restricted query interfaces. This is the case for legacy systems as well as for screen-scraping Web interfaces where certain chunks of information may need to be provided such that queries can be executed (e.g. book titles in online book stores). These restrictions can be conveniently modeled using binding patterns22.

21 We will apply this technique for rewriting recursive queries in Section 7.2.
22 Binding patterns or adornments have been used elsewhere, for instance in the theory of optimizing recursive queries [Ull89].

A binding pattern is a mark telling, for each argument position of the predicate, whether it is bound or free. At query execution time, variables in argument positions marked "bound" have to be bound to constants before the extent of the predicate is accessed (i.e., the source is queried).

Example 3.6.6 Consider the query

q^{b,f}(x, z) ← p(x, y), p(y, z).

which requires (and guarantees) that the variable x will be bound to a constant when executed. Furthermore, we have a view

v^{b,f}(x, y) ← p(x, y).

for a source that can only answer queries when provided "input" in its first attribute position. The query can be rewritten into

q^{b,f}(x, z) ← v(x, y), v(y, z).

However, the query q^{f,b}(x, y) ← p(x, y). cannot be rewritten because the only available source does not allow access to p tuples without providing input for the first attribute position. □

Binding patterns allow the integration of procedural data transformation code into the rewriting process, where input arguments of such procedures are modeled as "bound" and output arguments as "free". For the problem of computing equivalent rewritings given sources with binding patterns, the search space is somewhat larger than in the case of the problem of answering queries using views without binding patterns [RSU95], but the problem remains NP-complete. Maximally contained rewritings may not be expressible as finite sets of conjunctive queries, but can be encoded as recursive datalog programs [KW96, DL97b, DGL00].

Algorithms and results bounding the search for equivalent rewritings have been presented in [RSU95]. Earlier, queries with "foreign functions" were considered in the context of query optimization in [CS93]. The Information Manifold [LRO96], a system for integrating Web sources, supports source descriptions with binding patterns that permit the specification of input and output attributes. They are meant to facilitate the integration of sources that do not have full relational query capabilities, such as legacy sources or screen-scraping Web interfaces.
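A simple executability test of the kind implied by this example can be sketched as follows (an illustrative Python fragment of our own, not taken from any of the cited systems; constants are omitted for brevity, so every argument is a variable): a conjunctive rewriting over views with binding patterns is executable if its subgoals can be ordered so that every "bound" argument position of the next subgoal is already bound when that subgoal is reached.

def executable(bound_at_call, subgoals, patterns):
    """subgoals: list of (view, args); patterns[view] is a string such as 'bf'."""
    known = set(bound_at_call)                  # variables bound when the query is called
    remaining = list(subgoals)
    while remaining:
        for sg in remaining:
            view, args = sg
            if all(a in known for a, adorn in zip(args, patterns[view]) if adorn == "b"):
                known.update(args)              # executing sg binds all of its arguments
                remaining.remove(sg)
                break
        else:
            return False                        # no remaining subgoal can be executed next
    return True

patterns = {"v": "bf"}
# q^{b,f}(x, z) <- v(x, y), v(y, z): x is bound at call time, so this is executable.
print(executable({"x"}, [("v", ("x", "y")), ("v", ("y", "z"))], patterns))   # True
# q^{f,b}(x, y) <- v(x, y): only y is bound, but v needs input on its first position.
print(executable({"y"}, [("v", ("x", "y"))], patterns))                      # False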

Answering Queries using Views under the Closed World Assumption

Note that we have so far discussed the problem of answering queries using views in the light of an open-world assumption, which is appropriate in the context of data integration and the assumption that sources may provide incomplete information. It is also possible to approach the problem under a closed-world semantics, centered around the notion of certain answers [AD98].

Example 3.6.7 Consider a query q(x, y) ← p(x, y). and sources v1(x) ← p(x, y). and v2(y) ← p(x, y). Under the open-world assumption, this query cannot be answered. Let the extents of v1 and v2 now be v1 = {⟨a⟩} and v2 = {⟨b⟩}. Under the closed-world assumption, we have the certain answer ⟨a, b⟩ to the query, because the projections of the tuples in the extent of p are complete and entail that certain answer. The original database from which the two views v1 and v2 were computed must have been I = {⟨a, b⟩}. □

This problem and its complexity are discussed in [AD98, MLF00]. Note that the problem of answering queries using views under the closed-world assumption has the practical disadvantage that reasoning can only be done relative to the data rather than the query (with high complexity relative to the size of the data), thus leading to a scalability problem.

3.7 Description Logics-based Information Integration

3.7.1 Description Logics

Description logics23 (DL), also known as terminological logics or concept languages, are structured logical languages that are based on a well-designed tradeoff between expressive power and complexity. The main goal is the design of languages that allow one to conveniently express a large number of practical problems related to concepts and objects while still remaining decidable24. They can be motivated by semantic networks, frame languages, terminological reasoning, and semantic and object-oriented data models [RN95].

Description logics are usually constructed from unary relational predicates (called concepts or concept classes) and binary relations (roles or attributes). Instances of concepts are usually called individuals. Description logics are defined by a fixed set of logical constructors, such as concept intersection C1 ⊓ C2, union C1 ⊔ C2 and negation ¬C, all-quantification of roles with qualification ∀R.C (denoting the concept {x | ∀y : R(x, y) → C(y)}), existential quantification, which may (∃R.C, denoting {x | ∃y : R(x, y) ∧ C(y)}) or may not (∃R) support qualification, the conjunction and union of roles, the concatenation of roles R1 ◦ R2, number restrictions on roles ((≤ n R) and (≥ n R), where n is a constant integer), and others25. More complex concepts and roles are defined inductively from atomic concepts and roles using the provided constructors.

Constraints are of the form C1 ⊑ C2 or C1 ≡ C2, where C1 and C2 are concepts. Constraints are subsumption (logical "containment" of the extents of the expressions) relationships between concepts. For instance, the subsumption relationship C1 ⊑ C2 expresses the logical constraint ∀x : C1(x) → C2(x). The semantics of the languages is the straightforward classical logical one applied to the special syntactical peculiarities of such languages. The syntax of most description logics languages differs from the classical syntax of first-order logics because constraints in such concept languages usually can be expressed in an intuitive variable-free form.

The main reasoning problems in description logics systems are subsumption and classification. Subsumption is the logical implication problem in description logics languages on the level of concepts. Given a set of constraints Σ in a DL language, subsumption is the problem of deciding whether Σ implies the truth of the logical formula corresponding to an additional constraint C1 ⊑ C2. In other words, this is the problem of deciding whether Σ implies that concept C1 is contained in C2. The classification problem is to decide whether a certain individual belongs to a given concept class.

23 We restrict the presentation of description logics to a short overview. For a more detailed introduction to this area see e.g. [DLNS96] or [Fra99].
24 However, the ancestor of description logics systems, KL-ONE [BS85], was found not to have this property [SS89]. The culprit was the same-as constructor, which allows one to express concepts of the form ∀y1, y2 : (R1(x, y1) ∧ R2(x, y2)) → y1 = y2. This constructor makes description logics lose the tree-model property [Var97] and their correspondence with modal logics, and causes already the simplest and most restricted description logics to become undecidable (see e.g. [DLNS96]). This problem was fixed in the successor system CLASSIC [BPS94, BBMR89b] by a slight change of the semantics of extents (a "hack"). Note also that the LOOM system [MB87], which is often listed among description logics systems, provides an incomplete reasoning service over a very expressive logical language.
25 These constructors are motivated by the ALC family of languages [SSS91, DLNS96]. See [PSS93] for a standardization effort.

3.7.2 Description Logics as a Database Paradigm

Description logics systems have been discussed as database systems before, e.g. in the context of CLASSIC [BBMR89a, Bor95] and DLR [CDL98a, CDL+98b, CDL99]. Description logics are relevant to data integration for two reasons. Given that queries are expressed as concepts and constraints express inter-schema relationships such as views,

• concept subsumption can be used to decide query containment under constraints and

• the classification of individuals (the objects of a database or a set of heterogeneous databases) can be used for answering queries in heterogeneous databases.

Apart from that, description logics have been used to verify the consistency of schemata [FN00]. Let us consider description logics subsumption and classification as a way of performing data integration.

Example 3.7.1 Consider the following set of three constraints.

GrandparentOrNoParent ≡ Person ⊓ ∀child.(∃child.⊤)
ParentOfFerrariDriver ≡ Person ⊓ ∃child.(∃drives.Ferrari)
Ferrari ⊑ ItalianCar

Given our data integration setting, let GrandparentOrNoParent and ParentOfFerrariDriver be database relations with an extent (data sources). "child" and "drives" are roles. The first constraint describes individuals of the class GrandparentOrNoParent as persons whose children, if they have children, are parents themselves. The name of the second source speaks for itself. In the third constraint, we define Ferraris as Italian cars (that is, the concept class Ferrari is a subclass of the class of Italian cars). Now let us ask a query for all persons who have children that drive Ferraris and have children themselves.

Person ⊓ ∃child.((∃child.⊤) ⊓ (∃drives.Ferrari))

This query can be answered by attempting to classify all the individuals known to the system. The answer will be the set of individuals that belong to both the classes GrandparentOrNoParent and ParentOfFerrariDriver. Our constraints imply

Person ⊓ ∃child.((∃child.⊤) ⊓ (∃drives.Ferrari)) ≡ GrandparentOrNoParent ⊓ ParentOfFerrariDriver

We can determine that our set of constraints implies the subsumption

Person ⊓ ∃child.((∃child.⊤) ⊓ (∃drives.ItalianCar)) ⊒ GrandparentOrNoParent ⊓ ParentOfFerrariDriver

but not equivalence. □

Note that the constraints of the previous example clearly follow a local-as-view pattern26. In general, however, constraints in description logics are truly symmetric (in constraints of the form C1 ⊑ C2 or C1 ≡ C2, both C1 and C2 may be complex composed concept definitions representing queries), allowing to combine global-as-view and local-as-view integration.

Recently, two kinds of extensions to the ALC-style languages (for which decidability is of course preserved) have been proposed. Firstly, there has been work on defining concepts using fixpoints for e.g. transitive roles (µALCQ [DL97a], [HM00]) and on languages that allow to express general regular path expressions, as they are important in the context of queries over semistructured databases (for instance, see the expressive description logic DLR [CDL98a, CDL+98b, CDL99]). Secondly, description logics (e.g., again, DLR) have dropped the requirement that roles be binary relations. Instead, arbitrary relations may be used but have to be projected down to binary before being used for defining concepts.

The restrictions and drawbacks of description logics systems used in data integration are

26 Note that description logics-based data integration is sometimes considered a case of local-as-view integration. We kept the discussion separate to leave the work on the problem of answering queries using views to its own section.

• Description logics provide two kinds of reasoning, query answering [CDL99] by the classification of data and the verification of query containment by subsumption. They do not lend themselves to query rewriting, however. While it is possible to check, given a rewriting, if it is contained in the input query, there is in general no way of finding such a rewriting given only the input query. Query answering, however, is impractical, as it requires all the data available in the system to be imported into the description logics system, where each data object has to be independently classified for membership in the concept class described by the query. This does not scale to large databases and may not be feasible because data sources may have restricted (e.g. screen-scraping) interfaces or may be legacy systems, rendering it impossible to extract “all” their data.

• Query languages are restricted to tree-style queries without any circularities. (Consider our earlier comment on same-as constraints and the entailed undecidability.) For instance, this excludes simple queries such as

q(x) ← parent(x, y), employer(x, y).

3.7.3 Hybrid Reasoning Systems

For efficiency reasons, recent description logics systems (e.g. KRIS [BH91], BACK [vLNPS87, NvL88], KRYPTON [BPGL85] and FaCT [Hor98]) have separated the reasoning with concepts (TBox reasoning) from the reasoning with individuals (ABox reasoning [HM00]), using different techniques for the two problems and creating hybrid reasoning systems [Neb89].

Hybrid knowledge representation systems have also been built by combining description logics reasoning with deductive databases and nonmonotonic reasoning [DLNS98, Ros99] or local-as-view integration using database techniques [LR96]. The Information Manifold [LRO96, BLR97], a local-as-view system with query rewriting based on the Bucket algorithm, uses the description logic CARIN [LR96] to constrain concepts used in source descriptions (views)27.

3.8 The Model Management Approach

The vision of the model management approach is to represent schemata and inter-schema mappings as first-class objects in a repository28 [BLP00].

This approach allows one to define powerful operations on schemata and mappings, such as the unfolding (concatenation) of mappings and the application of mappings to schemata in order to transform them.

Model management permits the computer-aided manipulation of such meta-data using easy-to-use graphical user interfaces, as demonstrated by both research systems (e.g. Clio [MHH+01], ONION [MKW00]) and commercial systems such as Microsoft Repository [BB99]. The OBSERVER system [MIKS00] manages several heterogeneous ontologies and mappings between them in a repository and may be considered to be another pursuer of the model management approach.

Schema matching techniques [MHH+01, MZ98, MKW00] have been used in such systems for defining mappings between schemata. Most work in this area is based on the definition of correspondences between schema objects (e.g. classes, attributes, or relationships), often graphically, by drawing lines between them [MKW00, MZ98, BLP00, MHH+01]. The formalisms for defining mappings have often been quite restrictive, and agreed-upon semantics have not yet developed. Systems such as Clio [MHH+01] propose several alternative semantics for such correspondences for users to choose among. Schema matching has also been used for XML data transformation [MZ98]. For data integration, these approaches have the drawback that the integration problem is solved by processing the data rather than transforming the queries, thus leading to a scalability problem.

27 This is an alternative role of description logics systems in data integration.
28 This relates to interesting research on logical languages for reasoning about schemata (e.g. F-Logic [KL89], HiLog [CKW89], and Telos/ConceptBase [JGJ+95]) and meta-data query languages [LSS99, RVW99].

3.9 Discussion of Approaches

Quality Factors of Data Integration Architectures

In this chapter, we have encountered a number of data integration architectures. Given the integration problem motivated in Chapter 1, some of the main questions regarding the quality of data integration architectures are

• Does the approach apply query rewriting or query answering? This is important because if the output of the data integration process is a query which can be independently optimized and reformulated, performance improvements are possible that otherwise would not be attainable. The separation is also important because in some approaches, the complexity of integration by data transformation is much harder than just executing queries arriving at the same results, if such queries exist and can be computed. Finally, such a separation allows to select the best implementations for both problems – core data integration and query evaluation – independently.

• Does the approach use a global schema against which all sources are integrated, or may there be several different schemata against which data integration is carried out? The first approach may be preferable from a

standpoint of managing mappings. If there is only a single integration schema, fewer mappings may be needed than if there are many. Note that given m integration schemata and n sources, of the order of m · n mappings may be needed to integrate them. (That makes m² mappings in a federated database system, where every schema is integrated against every other.) Clearly, a global integration schema (m = 1) is usually preferable over a quadratically growing number of mappings.
However, the integration problem may require support for multiple autonomous integration schemata, which may evolve independently. Change of requirements may lead to the evolution of schemata against which data are integrated. If there are many mostly independent integration problems, it may be preferable to avoid the creation of a single global schema. If there are several smaller schemata and only one of them needs to be changed, one can expect that fewer mappings will be affected.

• How stable and reusable are mappings when change occurs? Given a large information infrastructure that needs to be managed, one does not want changes to propagate through the system further than absolutely necessary, invalidating other components that then need to be changed as well. Subsystems should be largely decoupled, making changes manageable. Alternatively, if changes do need to occur, it should be possible to automate them as much as possible. There are two kinds of changes that we want to differentiate between, the change of sources and the change of integration requirements (or the evolution of an integration schema or "global" schema).

• How well does the approach support the mapping of sources and integration schemata that show serious concept mismatch? As we will show later in this section, procedural approaches as well as simple view-based approaches have their restrictions with respect to this issue, which are more severe than it may appear at first sight. Declarative approaches with symmetric constraints are the most desirable and complete in this respect.

Global-as-view versus Local-as-view Integration

Let us first compare local-as-view and global-as-view integration. A major advantage of local-as-view mappings is their maintainability (Figure 3.7). When sources are added or removed, change remains local to those logical views that define these sources. GAV mediators may require a major global redesign when sources change, which may propagate through many mediators. Once a global integration schema has been defined for LAV, this schema allows for good decoupling between sources and the global information system, which is essential if ease of change is an issue. However, designing an appropriate global schema for local-as-view integration is hard, and requires a good understanding of the domain.

[Figure 3.7: Comparison of global-as-view and local-as-view integration.
Management of change to sources – Global-as-view/procedures: problematic, the change of a single source may require the redesign of (many) mediators/procedures; Local-as-view: good.
Management of change of requirements – Global-as-view/procedures: problematic, coupling of mediator interfaces; Local-as-view: problematic, change of the global schema requires a global redesign of the views.]

Furthermore, the application of the local-as-view approach is only reasonable if the overall goals and requirements of the global information system do not change. Otherwise, the global schema as well as all defined logical views may quickly become invalid and require complete redesign. The interfaces that GAV mediators export, on the other hand, often follow quite straightforwardly and naturally from the sources that have to be combined.

LAV has sometimes been called a declarative method, and GAV procedural. The "schemata" that global-as-view mediators export are usually more restrictive as to what kinds of queries can be asked than in LAV, where less knowledge about how queries are answered is put into the views at design time and more is decided at runtime. Indeed, LAV takes a more global perspective when answering queries than GAV (the overall integration schema becomes a mediated schema [Lev00]).

As pointed out earlier in Chapter 1, both the local-as-view and the global-as-view approach make a very important assumption. It is supposed that the "global schemata" resp. interfaces exported by mediators29 can be designed at will for the special purpose of integrating a number of sources. Either approach fails if this assumption does not hold. For instance, consider the case of Example 3.6.1. We cannot build a GAV mediator that answers any queries using the given sources if we are required to export a "parents"(-only) interface. Conversely, imagine source relations containing attributes that have no analog in the global logical schema in the case of LAV. In that case, one cannot define the sources as views over the global schema.

Comparison of Architectures

Now consider Figure 3.8, which compares the data integration architectures discussed in this chapter.

[Figure 3.8: Comparison of Data Integration Architectures.
                       Query rewriting?   Global schema?   "Declarative" approach?   Symmetric constraints
Federated Databases    no (?)             no               no                        no
Data Warehousing       yes/no             yes              no                        no
Mediator Systems       no                 (yes)            no                        no
Global Inf. Systems    yes                yes              yes/no                    no
Description Logics     no                 no               yes                       yes
Model Management       yes/no             no               no (?)                    no]

29 These are in a sense "global" as well, because if they are not general enough, they will have to be redesigned when further sources are added to a mediator.

• Federated databases support the autonomy of component databases. There is thus no central "global" schema in the architecture. Traditionally, data have been translated procedurally between schemata, although this is in principle not a necessity.

• In the data warehousing architecture, sources are integrated against a single sophisticated global warehouse schema. Integration is usually global-as-view and procedural.

• Mediators à la [Wie92] apply query answering in a procedural manner. Mediators in systems such as TSIMMIS [GMPQ+97] are specified declaratively; these specifications are subsequently compiled down into software components that answer queries on the level of data. Global-as-view integration by database views is based on query rewriting. However, mediators do not take a global perspective with respect to the schema as known from local-as-view and description logics integration. Although GAV views can be considered as constraints under a declarative semantics as well, no global reasoning under this semantics will lead to more complete results than just using the views independently.
Mediators independently export interfaces according to which they can provide integrated data. Mediators making use of the services of other mediators are strongly coupled via their interfaces (see Figure 3.7). While the mediator architecture at first sight does not rely on a global schema, this coupling entails the usual disadvantages of global schemata, namely that changes of requirements may lead to the need of a global, very work-intensive redesign of many components (mediators) of the system.

• Global information systems may either use GAV or LAV integration. The first case is not substantially different from the mediator approach just discussed. The latter case, local-as-view integration, has been discussed in sufficient detail earlier.

• Description logics systems use a declarative approach with symmetric constraints, permitting the encoding of both mappings usually considered local-as-view and mappings usually considered global-as-view. The designer may effectively define a global schema against which all sources are integrated, but is free to do otherwise. Unfortunately, the approach not only rules out query rewriting; worse, there is usually a high data complexity for answering queries, compromising scalability.

• The model management approach at the core leaves open which integration technology is to be used. While state-of-the-art research often uses very restrictive mappings with a somewhat declarative flavor, one is free to make other choices. Since integration schemata are just objects among many, no global schema strategy can be observed.

We are now in a position to apply the lessons learned from previous work to our problem of Section 1.2.

Chapter 4

A Short Sightseeing Tour through our Reference Architecture

4.1 Architecture

The data integration architecture of Figure 4.1 will be made our reference for the presentation of the contributions of this thesis. It contains a number of component information systems that retain design autonomy for their schemata, data models, and query languages. Each component information system may contain a number of databases and processes which access and manipulate local data. For simplicity, but without loss of generality, we assume the information systems to contain a single database over a single consistent schema each. Other cases are handled by either using distributed database techniques locally or splitting one information system up into several systems that are considered independent for data integration purposes. Schemata may contain both true "source" entities for which the local database holds data and logical entities over which local queries can be executed as well, but for which it is the data integration system's task to gather and provide mediated data from other information systems.

Component information systems may be structurally heterogeneous. In order to make integration possible, the overall information infrastructure of Figure 4.1 is assumed to have a "global" data model, query language, and format for communicating data (results to queries). Component information systems may each differ in their choices of such structural factors.

The data integration architecture contains a model management repository, which stores "proxies", copies of each schema in an information system in the infrastructure, as first-class objects subject to manipulation in the repository. These proxy schemata are of course expressed in the global data model1 used in the repository.

1In this thesis, the relational data model will occupy this role.


[Figure 4.1: Reference Architecture – a model management repository (holding proxy schemata, mappings, and a schema editor) and a global mediator (query rewriting, physical plan generation, query plan execution), connected via translation to the component information systems, each of which contains a mediator proxy and a local query facility over its own schema and data.]

Mappings (as sets of symmetric inter-schema constraints) are stored in the repository and accessed by the data integration reasoning services (which will be referred to as the mediator in the tradition of [JLVV00]). The reasoning services are assumed to have been implemented only once, "globally", for the "global" data model and query language. Locally, inside the component information systems, there are mediator "proxies", which accept queries using the local query language, relative to the schema over the local data model, but delegate their answering, after translation to the global data model and query language, to the mediator. Mediated queries can be issued either inside an information system using the local data model and query language or directly against the global mediator.

The most common vehicles of structural integration used throughout the data integration approaches of Chapter 3 are wrappers [GMPQ+97, RS97, GK94]. The use of wrappers is appropriate for the structural integration of (legacy) information systems that act as sources to some global information system only. The metaphor of wrappers is insufficient in architectures with several heterogeneous information systems that each may need access to integrated data. We propose a different (and bi-directional) mechanism for structural integration, which may be conceptualized in analogy with the cell membranes of living organisms. In our context, heterogeneous information systems each are enclosed by some translation membrane, which transforms incoming queries and data from the global data model and query language to the local one, and does the opposite for outgoing queries, data, and schema information2. If the structural design choices of some component information system have been the same as those of the global data integration infrastructure, such a membrane is of course not needed. In the case of component information systems that do not need to access integrated data from other information systems, one may revert to the simpler wrapping approach.

2 Information that may be on its way into the model management repository.

4.2 Mediating a Query

In general, queries are answered as follows. Initially, a query Q is issued against one of the mediator proxies inside a component information system IS. This query is then sent to an instance of the mediator. When crossing the boundary of IS, Q is translated into a query Q′ in the "global" query language over the proxy schema of IS, which is a citizen of the model management repository. The mediator first rewrites Q′ into a query Q″ over source predicates only, using schema information and inter-schema constraints from the repository. This query is then decomposed into an executable (distributed) query plan, which may be optimized using cost-based metrics and special evaluation techniques known from the distributed database field [MY95, OV99, Ull89]. To execute Q″, the queries over component databases specified in the distributed query plan are sent off to the individual information systems containing those databases.

While traversing the translation membrane surrounding component information systems, the queries Qi are translated into queries Q′i over the local query languages and modified to use the schemata over the local data models. These queries are then passed on to the local query facilities, which execute them and return data in formats relative to the local data models.

On the way back "out" of the component information systems and to the mediator, the data are translated to correspond to their schemata over the "global" data model and are passed on to the mediator. There the data are combined into one consistent result for Q″. This is then passed on to IS. On the way through the component information system's membrane, the result is reformulated to comply with Q and the local data model of that component information system.

4.3 Research Issues

The following chapters will address the two main voids left open by our proposed approach, which are query rewriting and the management of mappings under change. Although much of what we have discussed in this chapter relates to structural integration3, the problems related to it have been seen before and are sufficiently well understood [GMPQ+97, RS97, GK94]. Similarly, distributed query execution is quite well understood once a logical query plan exists [MY95, OV99, Ull89]. Data integration encompasses various aspects of data reconciliation that we will, as simplifying assumptions, assume to be implicit in the query rewriting problem or simply excluded from consideration.

3 This was done so that structural integration is covered and we can subsequently focus on semantic integration.

For instance, object identification [JLVV00] is the issue of matching objects from different databases which may be identified by keys from distinct domains, or which may have no keys at all. This problem has spurred some research of its own (e.g. [ZHKF95b]), but to a degree such problems may be dealt with in our framework, as shown in the example in Section 1.3. Another argument in favor of this stance also applies to a related data reconciliation problem, data cleaning [JLVV00]. In fact, much of the intricacies of these problems are related to mismatching erroneous data, inconsistencies that often arise in the context of manually-acquired data. However, in our high-energy physics use case of Section 1.3, for example, such data are rare. Data are usually also well-identified by cleanly thought-through domains of identifiers.

The rationale behind the first main contribution, query rewriting with symmetric inter-schema constraints (Chapter 5), on the other hand, is the following. Expressive constraints are required for two reasons.

• The need to deal with concept mismatch which results from schemata being integrated against others that have not been conceived for data integration, and which may be a consequence of schema evolution.

• The need for flexibility that allows to anticipate future change of schemata and requirements in the design of mappings. This includes the need for expressiveness that allows to prepare mappings for the merging of schemata, and to emulate local-as-view integration even when sources cannot be declared as views over the logical entities of the integration schemata.

The information infrastructure that has been outlined in this chapter can be seen from a federated database perspective. There are several databases that have design autonomy for their schemata (as well as for data models and query languages), and each need to share data. As is well known for federated databases, the lack of a "global" schema for data integration leads to the uncomfortable situation that given N schemata, N² mappings between them need to be created and managed. Given our requirement that schemata and integration requirements may change, it is clear that the management task is difficult.

A surprising breakthrough on the management front is not to be expected. Similar issues have been studied in various contexts by a large number of researchers in the fields of software engineering, database (schema) design, and ontological engineering. The solutions that have been developed all center around common ideas: the treatment of the artifacts to be managed as first-class citizens on which cleanly defined and powerful operations are developed that can be used to manipulate them with the greatest possible amount of automation and computer support, as well as the use of design patterns, best practices, and design heuristics. We thus propose exactly such a solution, a model management approach in combination with a methodology for managing mappings and their change (Chapter 6).

Chapter 5

Query Rewriting with Symmetric Constraints

5.1 Outline

In this chapter, we address the query rewriting problem of data integration in a very general setting. To start somewhere, we take the common approach of restricting ourselves to the relational data model and conjunctive queries. We drop the assumption of the existence of a single coherent global integration model over which queries may be asked, which are then rewritten into queries in terms of source predicates. Given a conjunctive (or positive) relational query over (possibly) both virtual and source predicates, we attempt to find a maximally contained rewriting in terms of only source predicates under a given appropriate semantics and a set of constraints, and the positive queries as a query language (i.e., the output is a set of conjunctive queries). We support symmetric constraints in the form of what we call Conjunctive Inclusion Dependencies (cind's), containment relationships between conjunctive queries. We propose two alternative justifiable semantics, the classical logical and a straightforward rewrite systems semantics1. Under both, the problem is a proper generalization of the local-as-view as well as the global-as-view approaches.

In many real-life situations where neither source relations can be defined as views over a given set of virtual relations nor a virtual relation as a view over a number of sources, a satisfactory containment relationship between conjunctive queries can be formulated using cind's. Apart from that, our type of constraints allows to map schemata in a model management context using a clean and expressive semantics or to "patch" local-as-view or global-as-view integration systems when sources need to be integrated whose particularities have not been foreseen when designing the integration schemata.

1Informally speaking, the intuition of this second semantics is that given a conjunctive query Q, a subexpression E of Q, and a cind Q1 ⊇ Q2, if we can produce a contained rewriting under the semantics of the problem of answering queries using views where we take E as query and Q1 as logical view, we can replace (while applying the respective variable mappings) E in Q by Q2 to produce a rewriting that is again “contained” in Q.

The problem may also be relevant for maintaining physical data independence under schema evolution (see Section 7.1).

Unfortunately (as is immediately clear for the classical semantics), such positive rewritings may be infinite and the major decision problems (such as the nonemptiness or boundedness of the result) are undecidable. However, given that the predicate dependency graph (with respect to the inclusion direction) of a set of constraints is acyclic, we can guarantee to find the maximally contained rewritings under both semantics, which are finite. We will argue that for obtaining maximally contained rewritings in the data integration context, we can require the constraints to be acyclic without much inconvenience.

As contributions of this chapter, we first provide characterizations of both semantics as well as algorithms which, given a conjunctive query, enumerate the maximally contained rewritings. We discuss various relevant aspects of query rewriting in our context, such as the minimality and nonredundancy of conjunctive queries in the rewritings. Next we compare the two semantics and argue that the second is more intuitive and may fit better the expectations of human users of data integration systems than the first. Following the philosophy of that semantics, rewritings can be computed by making use of database techniques such as query optimization and ideas from e.g. algorithms developed for the problem of answering queries using views. We believe that in a practical data integration context there are certain regularities (such as sets of predicates – schemata – from which predicates are used together in queries, while there are few queries that combine predicates from several schemata) that render query rewriting following the intuitions of the second semantics more efficient in practice. Surprisingly, however, it can be shown that the two semantics coincide.

We then present a scalable algorithm for the rewrite systems semantics (based on previous work such as [PL00]), which we have implemented in a practical system, CindRew. We evaluate it experimentally against other algorithms for the same and for the classical logical semantics. It turns out that our implementation, which we make available for download, scales to thousands of constraints and realistic applications. We conclude with a discussion of how our query rewriting approach fits into state-of-the-art data integration systems.

5.2 Preliminaries

We define a conjunctive inclusion dependency (cind) as a constraint of the form Q1 ⊆ Q2 where Q1 and Q2 are conjunctive queries (without arithmetic comparisons, but possibly with constants) of the form

¯ ¯ {hx1, . . . , xni | ∃xn+1 . . . xm :(p1(X1) ∧ ... ∧ pk(Xk))} 5.2. PRELIMINARIES 77

2 with a set of distinct unbound variables x1, . . . , xn. The containing query Q2 is called the subsumer, and the contained query Q1 may be called the subsumee. We may write {Q1 ≡ Q2} as a short form of {Q1 ⊆ Q2,Q1 ⊇ Q2}. The normalization of a set Σ of cind’s is a set of Horn clauses, the set of cind’s taken as a logical formula transformed into (implication) normal form. These Horn clauses are of a simple pattern. Every cind σ of the form Q1 ⊆ Q2 with

¯ ¯ Q1 = {hx1, . . . , xni | ∃xn+1 . . . xm : v1(X1) ∧ ... ∧ vk(Xk)} ¯ ¯ Q2 = {hy1, . . . , yni | ∃yn+1 . . . ym0 : p1(Y1) ∧ ... ∧ pk0 (Yk0 )}

0 ¯ ¯ ¯ translates to k Horn clauses pi(Zi) ← v1(X1) ∧ ... ∧ vk(Xk)). where each zi,j ¯ of Zi is determined as follows: If zi,j is a variable yh with 1 ≤ h ≤ n, replace it 0 with xh. If zi,j is a variable yh with n < h ≤ m , replace it with Skolem function fσ,yh (x1, . . . , xn) (the subscript assures that the Skolem functions are unique for a given constraint and variable).

Example 5.2.1 The normalization of the cind

σ : {⟨y1, y2⟩ | ∃y3 : p1(y1, y3) ∧ p2(y3, y2)} ⊇ {⟨x1, x2⟩ | ∃x3 : v1(x1, x2) ∧ v2(x1, x3)}

is

p1(x1, fσ,y3(x1, x2)) ← v1(x1, x2) ∧ v2(x1, x3).
p2(fσ,y3(x1, x2), x2) ← v1(x1, x2) ∧ v2(x1, x3).
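To make this normalization step concrete, the following Python sketch (with a hypothetical term representation of our own; the function normalize_cind and all data structures are made up for illustration and are not the CindRew implementation) reproduces exactly the two Horn clauses of Example 5.2.1. For brevity, all terms in the cind are assumed to be variables.

    def normalize_cind(name, subsumer, subsumee):
        """Translate the cind 'subsumer ⊇ subsumee' into Horn clauses.

        Each subsumer atom becomes the head of one clause whose body is the
        body of the subsumee; existential variables of the subsumer are
        replaced by Skolem terms over the distinguished variables."""
        sup_head, sup_body = subsumer
        sub_head, sub_body = subsumee
        # Map the subsumer's distinguished variables onto the subsumee's.
        subst = dict(zip(sup_head, sub_head))
        # Every remaining subsumer variable is existential: Skolemize it.
        for atom in sup_body:
            for t in atom[1:]:
                if t not in subst:
                    subst[t] = ("f_%s_%s" % (name, t),) + tuple(sub_head)
        return [((pred,) + tuple(subst.get(a, a) for a in args), list(sub_body))
                for pred, *args in sup_body]

    # Example 5.2.1:
    # {<y1,y2> | ∃y3: p1(y1,y3) ∧ p2(y3,y2)} ⊇ {<x1,x2> | ∃x3: v1(x1,x2) ∧ v2(x1,x3)}
    sigma = normalize_cind(
        "sigma",
        (["y1", "y2"], [("p1", "y1", "y3"), ("p2", "y3", "y2")]),
        (["x1", "x2"], [("v1", "x1", "x2"), ("v2", "x1", "x3")]))
    for head, body in sigma:
        print(head, "<-", body)

Running the sketch prints one clause per subsumer atom, with the Skolem term f_sigma_y3(x1, x2) in place of the existential variable y3, matching the two clauses shown above.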

Whenever a cind translates into a function-free clause in normal form, we will write it in datalog notation. This is the case for cind’s of the form

{⟨X̄⟩ | p(X̄)} ⊇ Q

i.e., the subsumer query is an ∃-free single-literal query.

The dependency graph of a set C of Horn clauses is the directed graph constructed by taking the predicates of C as nodes and adding, for each clause in C, an edge from each of the body predicates to the head predicate. The diameter of a directed acyclic graph is the length of the longest directed path occurring in it. The dependency graph of a set of cind's is the dependency graph of its normalization. A set of cind's is cyclic if its dependency graph is cyclic. An acyclic set Σ of cind's is called layered if the predicates appearing in Σ can be partitioned into n disjoint sets P1, . . . , Pn such that for each cind σ : Q1 ⊆ Q2 ∈ Σ there is an index i such that Preds(Body(Q1)) ⊆ Pi and Preds(Body(Q2)) ⊆ Pi+1, and such that Sources = P1.
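These graph notions are straightforward to operationalize. The following Python sketch (dependency_graph and is_cyclic are hypothetical names; the clause representation is the same toy encoding as in the previous sketch) builds the dependency graph of a set of normalized clauses and tests it for cycles.

    def dependency_graph(clauses):
        """Edges run from each body predicate to the head predicate."""
        edges = set()
        for head, body in clauses:
            for atom in body:
                edges.add((atom[0], head[0]))
        return edges

    def is_cyclic(edges):
        """Depth-first cycle test on the predicate dependency graph."""
        succ = {}
        for a, b in edges:
            succ.setdefault(a, set()).add(b)
        WHITE, GREY, BLACK = 0, 1, 2
        color = {}
        def visit(v):
            color[v] = GREY
            for w in succ.get(v, ()):
                c = color.get(w, WHITE)
                if c == GREY or (c == WHITE and visit(w)):
                    return True
            color[v] = BLACK
            return False
        nodes = {v for e in edges for v in e}
        return any(visit(v) for v in nodes if color.get(v, WHITE) == WHITE)

    # The two clauses of Example 5.2.1 yield an acyclic graph (v1, v2 -> p1, p2):
    clauses = [
        (("p1", "x1", "F"), [("v1", "x1", "x2"), ("v2", "x1", "x3")]),
        (("p2", "F", "x2"), [("v1", "x1", "x2"), ("v2", "x1", "x3")]),
    ]
    print(is_cyclic(dependency_graph(clauses)))   # False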

2Note that if we did not require unbound variables in constituent queries to be distinct, the transformation into normal form would result in Horn clauses with equality atoms as heads.

The problem that we want to address in this chapter is the following:

Definition 5.2.2 (Query rewriting under symmetric constraints.) Given disjoint sets of so-called "source" (materialized) and "virtual" predicates, a conjunctive (or positive) query Q over possibly both source and virtual predicates, and a set Σ of cind's, find the maximally contained positive query Q′ exclusively over source predicates under a given semantics.

Later in this chapter we will discuss two such semantics for this problem. The maximally contained rewritings under these semantics will be defined analogously to the case of the problem of answering queries using views. Note that we do not require that the input query Q only contains virtual predicates. Furthermore, we do not by default have any special restrictions regarding a set of cind’s Σ, apart from the following. For simplicity, we assume that no source predicates appear in any heads of Horn clauses created by normalization of the cind’s. This does not cause any loss of generality. We can always replace a source predicate that violates this assumption by a new virtual predicate in all cind’s and then add a cind that maps the source predicate to that new virtual predicate.

5.3 Semantics

We discuss two alternative semantics for query rewriting, first the classical logical and later a straightforward rewrite systems semantics.

5.3.1 The Classical Semantics

Let us begin with a straightforward remark on the containment problem for conjunctive queries under a set of cind's, which, since cind's are themselves containment relationships between conjunctive queries, is the implication problem for this type of constraint. If we want to check a containment

{⟨X̄⟩ | ∃Ȳ : φ(X̄, Ȳ)} ⊇ {⟨X̄⟩ | ∃Z̄ : ψ(X̄, Z̄)}

of two conjunctive queries under a set Σ of cind's by refutation (without loss of generality, we assume Ȳ and Z̄ to be disjoint and the unbound variables in the two queries above to be the same3, X̄), we have to show

Σ, ¬(∀X̄ : (∃Ȳ : φ(X̄, Ȳ)) ← (∃Z̄ : ψ(X̄, Z̄))) ⊨ ⊥

i.e., the inconsistency of the constraints and the negation of the containment taken together. In normal form, ψ becomes a set of ground facts where all variables have been replaced one-to-one by new constants, and φ becomes a clause with an empty head, where all distinguished variables xi have been replaced by the constants also used for ψ.

3In the remainder of this chapter, we will implicitly (whenever we do not sacrifice clarity by this) assume that variables from different clauses are distinct, or in different "name spaces", even if several instances of the same clause interfere with each other during unification or unfolding, and that new variables are automatically introduced where necessary to assure this.

Example 5.3.1 For proving the containment

{⟨x1, x2⟩ | ∃x3 : (p1(x1, x3) ∧ p2(x3, x2))} ⊇ {⟨y1, y2⟩ | ∃y3 : (r1(y1, y3) ∧ r2(y3, y2))}

we have to translate it into

← p1(α1, x3) ∧ p2(x3, α2).

r1(α1, α3) ← .
r2(α3, α2) ← .

where α1, α2, α3 are constants not appearing elsewhere.
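This "freezing" translation is mechanical. A small Python sketch of it (an illustration only, using the toy representation of the earlier sketches; the function freeze and the constant names are made up) reproduces the facts and goal of Example 5.3.1:

    from itertools import count

    fresh = count(1)

    def freeze(subsumee_head, subsumee_body, subsumer_head, subsumer_body):
        """Turn the subsumee into ground facts and the subsumer into a goal,
        sharing frozen constants for the distinguished variables."""
        const = {v: "alpha%d" % next(fresh) for v in subsumee_head}
        facts = []
        for pred, *args in subsumee_body:
            # Existential variables of the subsumee are frozen as well.
            for a in args:
                const.setdefault(a, "alpha%d" % next(fresh))
            facts.append((pred,) + tuple(const[a] for a in args))
        # In the goal, only distinguished variables become constants;
        # existential variables of the subsumer stay variables.
        head_map = dict(zip(subsumer_head, [const[v] for v in subsumee_head]))
        goal = [(pred,) + tuple(head_map.get(a, a) for a in args)
                for pred, *args in subsumer_body]
        return facts, goal

    facts, goal = freeze(
        ["y1", "y2"], [("r1", "y1", "y3"), ("r2", "y3", "y2")],
        ["x1", "x2"], [("p1", "x1", "x3"), ("p2", "x3", "x2")])
    print(facts)   # r1/r2 ground facts over alpha constants
    print(goal)    # the goal <- p1(alpha1, x3) ∧ p2(x3, alpha2)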

We have now transformed our original problem into a set of equivalent Horn clauses, and can treat it as a logic program. We can take the single clause with the empty head above (the body of the subsumer query) and use it as a goal for refutation.

Definition 5.3.2 Under the classical semantics, a maximally contained rewriting of a conjunctive query Q is equivalent to the set of all conjunctive queries Q′ over source predicates for which Σ ⊨ Q′ ⊆ Q.

We can obtain such a maximally contained rewriting in the following way. Given a conjunctive query Q4 and the normalization C of a set of cind's, we add a unit clause (with a tuple of distinct variables) for each source atom5. Then we try to refute the body of Q. (Differently from what we do for containment, we do not freeze any variables.) If we have found a refutation with a most general unifier θ, we collect the unit clauses used and create a Horn clause with θ(Head(Q)) as head and the application of θ to the copies of the unit clauses involved in the proof as body. If this clause is function-free, we output it; after that, we continue as if we had not found a "proof", in order to compute more rewritings. Given, e.g., a breadth-first strategy, it is easy to see that this method computes a maximally contained rewriting of Q in terms of multisets of conjunctive queries, in the sense that for each conjunctive query contained in Q, a subsumer will eventually be produced. See Example 5.3.10 for query rewriting by an altered refutation proof. Equivalent rewritings can be computed by interleaving the computation of contained rewritings with checks of whether Q is contained in any of the already computed rewritings.

4The results in this chapter generalize to positive input queries in a straightforward manner.
5We still assume that source predicates do not appear in any heads in C.

Unfortunately, since we allow arbitrary conjunctive queries as subsumees in cind's, we cannot make any guarantees regarding the minimality or nonredundancy of rewritings. While it is of course possible to minimize conjunctive queries when they are produced, it is impractical to require that the result be nonredundant. It can, for instance, easily be seen that we can encode arbitrary (recursive) datalog programs as sets of cind's. Query rewriting may then produce an infinite result, and the boundedness problem (that is, telling whether the result of query rewriting can be a finite set of conjunctive queries) is undecidable. Thus, if an incomplete result is acceptable in such cases, it is more appropriate to output rewritings as soon as they are found, and not to eliminate redundancies.

We next present an alternative algorithm for computing maximally contained rewritings which proceeds in a bottom-up fashion. The intuition behind this procedure can be used to unfold constraints early on where appropriate, which may allow us to avoid recomputing certain intermediate results many times. It also needs only a restricted kind of unification, which we want to look at in more detail.

Algorithm 5.3.3 (Bottom-up query rewriting).
Input: The normalization C of a set of cind's that do not contain source predicates in the subsuming query, a conjunctive query Q, and a set of source predicates S.
Output: A (multi-)set of conjunctive queries X exclusively over source predicates.

X := {c ∈ C | Preds(Body(c)) ⊆ S}; C := C \ X;
forever {
    choose some clause c ∈ C ∪ {Q};
    let n = |Body(c)|; θ := ∅;
    choose some tuple ⟨c1, . . . , cn⟩ with c1, . . . , cn ∈ X ∪ {ε}, (ci = ε) iff Pred(Bodyi(c)) ∈ S;
    for each 1 ≤ i ≤ n with ci ≠ ε do θ := unify(θ, Bodyi(c), Head(ci));
    if θ ≠ fail then {
        c′ := unfold(c, θ, ⟨c1, . . . , cn⟩);
        if (c ≠ Q) ∧ (Body(c′) is function-free) then X := X ∪ {c′};
        else if (c = Q) ∧ (c′ is function-free) then print c′;
    }
    if no new query or clause for X can be found then exit;
}

We will now have a closer look at the functions "unify" and "unfold" which we have used above and which we will meet again in that form later.

"unify" takes a most general unifier θ and two atoms a and b and produces a most general unifier θ′ of a and b which is consistent with θ in the usual way, if one exists. Otherwise the function returns fail (and we assume the variables in the two atoms to be from two distinct name spaces). We assume that a is always from the body of a clause "higher up" and that b is the head of a clause whose body is to replace that former atom. Unification here is simpler than in general because we have the following restrictions: (1) Body(c) is always function-free, which simplifies the implementation of unification. (2) Since the body of each valid query must be function-free, and since a clause that contains a function term in its body cannot recover from that state, we can exclude the possibility that a function term from Head(ci) gets unified with a variable from Head(cj). For the same reason, we can forbid that function terms get unified with variables that appear in atoms a ∈ Body(c) where Pred(a) is a source. Furthermore, when two function terms from Head(ci), Head(cj) get unified with the same variable in c, they must be equal except for variable renamings, because otherwise again subterms would get unified with variables from some ck. (3) If a variable from c gets unified with a function term, it cannot unify with any other variable. (4) If c is a query to be rewritten, we can block all variables in Head(c) early on from being unified with function terms, as this could again not lead to a function-free rewriting.

The function "unfold" accepts a Horn clause c with |Body(c)| = n, a unifier θ, and a tuple of n Horn clauses or ε's such that if ci ≠ ε, θ unifies Bodyi(c) with Head(ci). It produces a new Horn clause from c by replacing each of its non-source body atoms Bodyi(c), if ci ≠ ε, by θ(Body(ci)), i.e., after applying the substitutions from the unifier. If ci = ε, then Bodyi(c′) = θ(Bodyi(c)).

If the clauses c1, . . . , cn are from the normalization of a set of cind's rather than from the unfolding of constraints (as produced by Algorithm 5.3.3), we may avoid producing redundancies in the result by not including substituted bodies if another body from the same cind was already included and this occurred under the same substitution of all distinguished variables of that cind. A special case6 that is particularly easy to implement is when a variable of c has been unified with a function term. In that case, only one body atom that contains this variable needs to be substituted; all others can be dropped. This is the case because the normalization of a cind will only produce function terms that contain all the distinguished variables of the cind in a uniform manner. Therefore, when the unification of a variable from c with two function terms with the same function symbol succeeds, all the variables in a pair of function terms unified with the same variable of c have been pairwise unified themselves.
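A possible rendering of such a restricted unification in Python (a sketch under assumptions of our own: variables are strings starting with an uppercase letter, function terms are tuples, constants are lowercase strings, and the occurs check is omitted for brevity; this is not the CindRew code). Variables that must never receive function terms, such as the head variables of a query or variables already occurring in source atoms, are passed in as a blocked set:

    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    def walk(t, theta):
        """Follow variable bindings in the substitution theta."""
        while is_var(t) and t in theta:
            t = theta[t]
        return t

    def unify(theta, a, b, blocked=frozenset()):
        """Unify atoms a and b under theta; return the extended unifier or None.

        Variables in `blocked` may never be bound to function terms (tuples),
        which enforces function-freeness of the eventual rewriting early."""
        if theta is None or a[0] != b[0] or len(a) != len(b):
            return None
        theta = dict(theta)
        pairs = list(zip(a[1:], b[1:]))
        while pairs:
            s, t = pairs.pop()
            s, t = walk(s, theta), walk(t, theta)
            if s == t:
                continue
            if is_var(s):
                if isinstance(t, tuple) and s in blocked:
                    return None      # would bind a blocked variable to f(...)
                theta[s] = t
            elif is_var(t):
                if isinstance(s, tuple) and t in blocked:
                    return None
                theta[t] = s
            elif isinstance(s, tuple) and isinstance(t, tuple):
                if s[0] != t[0] or len(s) != len(t):
                    return None
                pairs.extend(zip(s[1:], t[1:]))
            else:
                return None          # distinct constants
        return theta

    # Body atom b(X, Y) against clause head b(f(Z), Z):
    print(unify({}, ("b", "X", "Y"), ("b", ("f", "Z"), "Z"), blocked={"X"}))  # None (X blocked)
    print(unify({}, ("b", "X", "Y"), ("b", ("f", "Z"), "Z")))  # binds X to f(Z) and Y to Z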

6This case is analogous to a technique that is part of the MiniCon algorithm [PL00], where views only need to be included into rewritings once for each application of an MCD.

5.3.2 The Rewrite Systems Semantics

The rewrite systems semantics is best defined using the notion of MiniCon descriptions (see Definition 3.6.3). We adapt this notion to our framework based on query rewriting with Horn clauses.

Definition 5.3.4 (Inverse MiniCon Description). Let Q be a conjunctive query with n = |Body(Q)| and C be the normalization of a set of cind's. An (inverse) MiniCon description for Q is a tuple ⟨c1, . . . , cn⟩ ∈ (C ∪ {ε})^n that satisfies the following two conditions. (1) For the most general unifier θ ≠ fail arrived at by unifying all the ci ≠ ε with Bodyi(Q), the unfolding of Q and ⟨c1, . . . , cn⟩ under θ is function-free, and (2) there is no tuple ⟨c1′, . . . , cn′⟩ ∈ {c1, ε} × . . . × {cn, ε} with fewer entries different from ε than in ⟨c1, . . . , cn⟩ such that the unfolding of Q with ⟨c1′, . . . , cn′⟩ is function-free.

Note that the inverse MiniCon descriptions of Definition 5.3.4 exactly coincide with the MCDs of Definition 3.6.3. The algorithm for computing maximally contained rewritings shown below can easily be reformulated so as to use the standard MCDs of [PL00]. That way, one can even escape the need to transform cind's into Horn clauses and can reason completely without the introduction of function terms7. However, to support the presentation of our results (particularly the equivalence proof of the following section), we do not follow this path in this chapter.

Algorithm 5.3.5 (Query rewriting with MCDs).
Input: A conjunctive query Q, the normalization C of a set of cind's, and a set S of source predicates.
Output: A maximally contained rewriting of Q.

Qs := [Q];
while Qs is not empty do {
    [Q, Qs] := Qs;
    if Preds(Q) ⊆ S then output Q;
    else {
        M := compute inverse MCDs for Q, C;
        for each ⟨c1, . . . , cn⟩ ∈ M do {
            θ := ∅;
            for each 1 ≤ i ≤ n do θ := unify(θ, Bodyi(Q), ci);
            Q′ := unfold(Q, θ, ⟨c1, . . . , cn⟩);
            Qs := [Qs, Q′];
        }
    }
}

7Note that Algorithm 5.3.5 then becomes similar to a chase procedure for cind's.

In Algorithm 5.3.5, maximally contained rewritings of a conjunctive query Q are computed by iteratively unfolding queries with single MiniCon descriptions8 until a rewriting contains only source predicates in its body. "unify" is the restricted kind of unification that we discussed in the previous section, with the additional constraint that now all function terms are of depth one (that is, there are no function terms that have function terms as subterms).

Definition 5.3.6 (Rewrite Systems Semantics). Let Q be a conjunctive query, S a set of source predicates, and Σ a set of cind's. Then, Algorithm 5.3.5 computes the maximally contained positive rewriting of Q under Σ in terms of S under the rewrite systems semantics.

Example 5.3.7 ("Coffee Can Problem" [DJ90]) Consider the rewrite system

black white → black
white black → black
black black → white

with symbols "white" and "black" and the input word

w = (white white black black white white black black)

where the goal is to repeatedly replace sequences of symbols of that word that match the left-hand side of one of the three productions listed above, so as to produce a rewriting that is as short as possible. One such sequence of replacements is

(0) white white black black white [white black] black
(1) white white [black black] white black black
(2) white white white [white black] black
(3) white white [white black] black
(4) white [white black] black
(5) [white black] black
(6) [black black]
(7) white

The pair of occurrences of the symbols "black" or "white" replaced in the next step is shown in brackets. Thus, the input string can be rewritten into a word with a single symbol, "white".

We can simulate such behavior using query rewriting under the rewrite systems semantics. Let us search for one-symbol rewritings. We model an n-symbol word w ∈ {black, white}^n as a query of the form

q(x1) ← start_end(x1, xn+1), p1(x1, x2), . . . , pi(xi, xi+1), . . . , pn(xn, xn+1).

where pi is either "black" or "white", equal to the i-th symbol of w, and x1 . . . xn+1 are variables. The above input word is thus represented as

q(x1) ← start_end(x1, x9), white(x1, x2), white(x2, x3), black(x3, x4), black(x4, x5), white(x5, x6), white(x6, x7), black(x7, x8), black(x8, x9).

8In this respect, the rewrite systems semantics differs from the MiniCon algorithm for the problem of answering queries using views.

The rewrite system can be encoded as a set of cind's

{⟨x, y⟩ | ∃z : black(x, z) ∧ white(z, y)} ⊇ {⟨x, y⟩ | black(x, y)}
{⟨x, y⟩ | ∃z : white(x, z) ∧ black(z, y)} ⊇ {⟨x, y⟩ | black(x, y)}    (?)
{⟨x, y⟩ | ∃z : black(x, z) ∧ black(z, y)} ⊇ {⟨x, y⟩ | white(x, y)}

Furthermore, we define two source predicates w_src and b_src and define cind's responsible for making the rewrite process terminate with "success" (i.e., a contained rewriting in terms of the source predicates is found).

{⟨x⟩ | ∃y : start_end(x, y) ∧ white(x, y)} ⊇ {⟨x⟩ | w_src(x)}
{⟨x⟩ | ∃y : start_end(x, y) ∧ black(x, y)} ⊇ {⟨x⟩ | b_src(x)}

It can be verified by applying the above algorithm (although this is a quite work-intensive task) that the maximally contained rewriting under the rewrite systems semantics is q′(x1) ← w_src(x1).

In fact, the seven-step sequence of replacements shown above can easily be used to create a proof in our rewrite systems semantics that q′ is in the maximally contained rewriting. For the first replacement of that sequence, the tuple ⟨c1, . . . , cn⟩ ∈ (C ∪ {ε})^n of Algorithm 5.3.5 would equal ⟨ε, . . . , ε, cσ2,1, cσ2,2, ε⟩, where cσ2,1 and cσ2,2 are the first and second Horn clause created by normalizing our second cind (?). We can conclude that the above rewrite system cannot result in a one-symbol rewriting "black" for the given input word.
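The reduction of the input word can also be checked mechanically. The following small Python sketch (independent of the cind encoding; the function names are ours) exhaustively applies the three productions and confirms that "white" is the only one-symbol word reachable from the input:

    RULES = [(("black", "white"), ("black",)),
             (("white", "black"), ("black",)),
             (("black", "black"), ("white",))]

    def successors(word):
        """All words obtainable by one application of a production."""
        for i in range(len(word) - 1):
            for lhs, rhs in RULES:
                if tuple(word[i:i + 2]) == lhs:
                    yield word[:i] + list(rhs) + word[i + 2:]

    def reachable_one_symbol_words(word):
        seen, stack, result = set(), [tuple(word)], set()
        while stack:
            w = stack.pop()
            if w in seen:
                continue
            seen.add(w)
            if len(w) == 1:
                result.add(w[0])
            for s in successors(list(w)):
                stack.append(tuple(s))
        return result

    w = "white white black black white white black black".split()
    print(reachable_one_symbol_words(w))   # {'white'}

Since every production shortens the word by one symbol, the search is finite; it finds the derivation shown above and no derivation ending in "black".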

5.3.3 Equivalence of the two Semantics

Theorem 5.3.8 Let Q be a conjunctive query, Σ be a set of cind's, and S be a set of "source" predicates. Then, the maximally contained rewriting under the classical logical semantics and Σ in terms of S and its analog under the rewrite systems semantics coincide.

For showing this, we first establish the following auxiliary result.

Lemma 5.3.9 Let P be a resolution proof establishing a logically contained rewriting of a conjunctive query Q under a set of cind's Σ. Then, there is always a proof P′ establishing the same contained rewriting such that each intermediate rewriting is function-free.

Proof. Let us assume that each new subgoal a derived using resolution receives an identifying index idx(a). Then, given the proof P, there is a unique next premise cidx(a) to be applied, out of the Horn clauses in the normalization of Σ, for each subgoal a. This is the Horn clause from our constraint base that will be unfolded with a to resolve it in P. Note that the proof P is fully described by the indexes of the subgoals in the body of the original query Q, some unique indexing of the subgoals created later on in the proof (while we do not need to know the atoms themselves), the clauses cidx(a), and a specification of which indexes the subgoals in the bodies of these clauses are attributed with when they are unfolded with subgoals.

In our original proof P, each subgoal a of a goal is rewritten with cidx(a) in each step, transforming g0, the body of Q and at the same time the initial goal, via g1, . . . , gn−1 to gn, the body of the resulting rewriting. We maintain the head of Q separately across resolution steps and require that variables in the head are not unified with function terms, but apply other unifications effected on the variables in the goals in parallel with the rewriting process. Already P must assure at every step that no variable from the head of Q is unified with a function term, as otherwise no conjunctive query can result. We know that resolution remains correct no matter in which order the next due resolution steps cidx(a) are applied to the subgoals, and that we may even unfold, given e.g. a goal with two atoms, the first subgoal and then a subgoal from the unfolding of that first subgoal (and may do so any finite number of times) before we unfold our second original subgoal.

Coming back to deriving a function-free proof starting from P, all we now have to show is that at any intermediate step of a resolution proof with cind's, a nonempty set of subgoals S = {ai1, . . . , aik} ⊆ gi of the function-free intermediate goal gi exists such that, when only these subgoals are unfolded with their next due premises to be applied cidx(ai1), . . . , cidx(aik), the overall new goal gi+1 produced will be function-free9. The emphasis here lies on finding a nonempty such set S, as the empty set automatically satisfies this condition. If we can guarantee that such a nonempty set always exists until the function-free proof has been completed, our lemma is shown.

Let there be a dependency graph Ggi = ⟨V, E⟩ for each (intermediate) goal gi, with the subgoals as vertices and a directed edge ⟨a, b⟩ ∈ E iff a contains a variable v that is unified with a function term f(X̄) in Head(cidx(a)) and v appears in b and is unified with a variable (rather than a function term with the same function symbol) in Head(cidx(b)). (Intuitively, if there is an edge ⟨a, b⟩ ∈ E, then b must be resolved before a if a proof is to be obtained in which all intermediate goals are function-free.) As mentioned, query heads are guaranteed to remain function-free by the correctness of P. For instance, the dependency graph of the goal

← a(x)(0), b(x, y)(1), c(y, z)(2), d(z, w)(3).

9The correctness of the proof P alone assures that the query head will be function-free as well.

with

c0 : a(x) ← a′(x).        c1 : b(f(x), x) ← b′(x).
c2 : c(x, x) ← c′(x).     c3 : d(g(x), x) ← d′(x).

would be G = ⟨{0, 1, 2, 3}, {⟨1, 0⟩, ⟨3, 2⟩}⟩, i.e., the first subgoal must be resolved before the second, and the third before the fourth. We can now show that such a dependency graph G is always acyclic. In fact, if it were not, P could not be a valid proof, because unification would fail when trying to unify a variable in such a cycle with a function term that contains that variable. This is easy to see because, given our construction used for obtaining Horn clauses from cind's, each function term contains all variables appearing in that same (head) atom. Consider for instance

q(x) ← a(x, y), a(y, z), b(w, z), b(z, y).

{⟨x, y⟩ | ∃z : a(x, z) ∧ a(z, y)} ⊇ {⟨x, y⟩ | s(x, y)}
{⟨x, y⟩ | ∃z : b(x, z) ∧ b(z, y)} ⊇ {⟨x, y⟩ | s(x, y)}

where s is a source. There is no rewriting under our two semantics, because the dependency graph of our above construction is cyclic already for our initial goal, the body of q. However, since G is acyclic given a proof P, we can unfold a nonempty set of atoms (those unreachable from other subgoals in graph G) with our intermediate goals until the proof has been completed.

Proof of Theorem 5.3.8. It is easy to see that the rewriting process for finding maximally contained rewritings under the rewrite systems semantics is equivalent to resolution where only some of the subgoals of a goal may be rewritten in a single step and each intermediate rewriting has to be function-free. Assume that a proof establishing a single contained conjunctive query is known for the rewrite systems semantics. Then, this is also a proof for the classical semantics, and inclusion in this direction is shown. The other direction follows from Lemma 5.3.9: given a resolution proof P that a conjunctive query Q′ is a contained rewriting of Q, we can always construct from P an analogous proof for the rewrite systems semantics. From this equivalence of resolution proofs and proofs with function-free intermediate steps we conclude that the overall search process for maximally contained rewritings under both semantics is guaranteed to lead to equal results.

Example 5.3.10 Given a boolean conjunctive query q ← b(x, x, 0). and the following set of Horn clauses, which, as is easy to see, are equivalent to and the normalization of a set of cind's that we do not show in order to reduce redundancy:

b(x′, y′, 0) ← a(x, y, 2), e(x, x′), e1(y, y′).      c0
b(x′, y′, 2) ← a(x, y, 0), e1(x, x′), e0(y, y′).     c4, c10, c11
b(x′, y′, 0) ← a(x, y, 1), e0(x, x′), e(y, y′).      c12, c18, c19
b(x′, y′, 1) ← a(x, y, 0), e1(x, x′), e1(y, y′).     c20, c25
e(x, x) ← v(x).                                      c2, c17
e1(x, f1(x)) ← v(x).                                 c3, c8, c23, c24
e0(x, f0(x)) ← v(x).                                 c9, c16
v(x) ← b(x, y, s).                                   c5, c13, c21
v(y) ← b(x, y, s).                                   c6, c14
a(x, y, s) ← b(x, y, s).                             c1, c7, c15

where x, y, x′, y′, s are variables. Let P be the resolution proof

(0) ← b(x, x, 0)(0).
(1) ← a(x, y, 2)(1), e(x, z)(2), e1(y, z)(3).
(2) ← b(f1(y), y, 2)(4), v(f1(y))(5), v(y)(6).
(3) ← a(x1, y1, 0)(7), e1(x1, f1(y))(8), e0(y1, y)(9), b(f1(y), v1, 2)(10), b(v2, y, 2)(11).    †10, †11
(4) ← b(f0(y1), y1, 0)(12), v(f0(y1))(13), v(y1)(14).
(5) ← a(x2, y2, 1)(15), e0(x2, f0(y1))(16), e(y2, y1)(17), b(f0(y1), v1, 0)(18), b(v2, y1, 0)(19).    †18, †19
(6) ← b(y1, y1, 1)(20), v(y1)(21).
(7) ← a(x, x, 0)(22), e1(x, f1(x))(23), e1(x, f1(x))(24), b(y1, v1, 1)(25).    †25
(8) ← a(x, x, 0)(22), v(x)(26).

which rewrites our query into q ← a(x, x, 0), v(x). and in which we have superscribed each subgoal with its assigned index. To keep things short, we have eliminated subgoals (marked with a dagger † and their index) that are redundant with a different branch of the proof. As claimed in our theorem, P can be transformed into the following proof in which each intermediate step is function-free.

(0) ← b(x, x, 0)(0).
(1) ← a(x, y, 2)(1), e(x, z)(2), [e1(y, z)(3)].
(2) ← b(x, y, 2)(4), v(x)(5), [e1(y, x)(3)].
(3) ← a(x1, y1, 0)(7), e1(x1, x)(8), e0(y1, y)(9), b(x, v1, 2)(10), [e1(y, x)(3)].    †10
(4) ← a(x1, y1, 0)(7), e1(x1, x)(8), [e0(y1, y)(9)], e1(y, x)(3).
(5) ← b(y, y1, 0)(12), v(y)(14), [e0(y1, y)(9)].
(6) ← a(x2, y2, 1)(15), e0(x2, y)(16), e(y2, y1)(17), b(y, v1, 0)(18), e0(y1, y)(9).    †18
(7) ← b(y1, y1, 1)(20), v(y1)(21).
(8) ← a(x3, y3, 0)(22), e1(x3, y1)(23), e1(y3, y1)(24), b(y1, v1, 1)(25).    †25
(9) ← a(x3, x3, 0)(22), v(x3)(26).

The subgoals that we have marked with brackets [ ] had been blocked at a certain step to keep the proof function-free; a formerly blocked subgoal appears without brackets in the step at which it is unblocked and unfolded again.

Of course, this correspondence between function-free and general resolution proofs does not hold for Horn clauses in general.

Example 5.3.11 Consider the boolean query q ← a1(u, v), b1(u, v). and the Horn clauses

a1(f(x), y) ← a2(x, y).        a2(x, g(y)) ← a3(x, y).
b1(x, g(y)) ← b2(x, y).        b2(f(x), y) ← b3(x, y).

These entail q ← a3(x, y), b3(x, y). although one cannot arrive at a function-free intermediate rewriting by first unfolding either the left subgoal (which would result in q ← a2(x, y), b1(f(x), y).) or the right subgoal (which would result in q ← a1(x, g(y)), b2(x, y).) of our query, nor by unfolding both at once (resulting in q ← a2(x, g(y)), b2(f(x), y).).

5.3.4 Computability

Theorem 5.3.12 Let Σ be a set of cind's and Q and Q′ be conjunctive queries. Then the following problems are undecidable:

• Σ ⊨ Q ⊆ Q′, the containment problem.

• ∃Q′ : Σ ⊨ Q ⊇ Q′, i.e., it is undecidable whether the maximally contained rewriting of a conjunctive query Q under the classical logical semantics is nonempty (that is, whether it contains at least one conjunctive query)10.

We also give an intuition for the undecidability results of Theorem 5.3.12. Post's Correspondence Problem (PCP, see e.g. [HU79]), a simple and well-known undecidable problem, is defined as follows. Given nonempty words x1, . . . , xn and y1, . . . , yn over the alphabet {0, 1}, the problem is to decide whether there are indexes i1, . . . , ik (with k > 0) such that xi1 xi2 . . . xik = yi1 yi2 . . . yik. Pairs of words ⟨xi, yi⟩ are also called dominos [Sip97]. In the following example, we show an encoding of PCP in terms of our query rewriting problem.

10By Theorem 5.3.8, this is equivalent to the following problem: Given a conjunctive query Q, is the maximally contained rewriting under the rewrite systems semantics nonempty?

Example 5.3.13 Given are a source s, a boolean query q ← inc(0, 0). and the following five cind’s

{⟨x, y⟩ | dec(x, y)} ⊆ {⟨x, y⟩ | ∃x1, y1 : zero(x, x1) ∧ zero(y, y1) ∧ inc(x1, y1)}    (1)
{⟨x, y⟩ | dec(x, y)} ⊆ {⟨x, y⟩ | ∃x1, y1 : zero(x, x1) ∧ zero(y, y1) ∧ dec(x1, y1)}    (2)
{⟨x, y⟩ | dec(x, y)} ⊆ {⟨x, y⟩ | ∃x1, y1 : one(x, x1) ∧ one(y, y1) ∧ inc(x1, y1)}      (3)
{⟨x, y⟩ | dec(x, y)} ⊆ {⟨x, y⟩ | ∃x1, y1 : one(x, x1) ∧ one(y, y1) ∧ dec(x1, y1)}      (4)
dec(0, 0) ← s.                                                                         (5)

that constitute the core encoding, and two constraints

inc(x, y) ← one(x, x1), zero(x1, x2), one(x2, x3), one(y, y1), inc(x3, y1).            (6)
inc(x, y) ← one(x, x1), zero(y, y1), one(y1, y2), one(y2, y3), one(y3, y4), zero(y4, y5), inc(x1, y5).    (7)

that stand for a PCP problem instance with two pairs of words,

I = {⟨x1 = 101, y1 = 1⟩, ⟨x2 = 1, y2 = 01110⟩}

The constraints (1)–(4) can be considered to play the role of "guessing" a solution to the PCP problem, constraints (6) and (7) the role of "checking" the solution, and constraint (5) the role of "terminating" when the search was successful.

To show the PCP instance satisfiable, one can compute a contained rewriting by applying the constraints in the following order (we describe only the proof, not the dead-end branches): (guess phase) (6), (7), (6), (check phase) (3), (2), (4), (4), (4), (2), (4), (termination) (5). The maximally contained rewriting is nonempty because there is a solution to this particular PCP instance, x1x2x1 = y1y2y1 = 1011101.

In fact, Example 5.3.10 already presented an encoding for PCP that shows the undecidability of query rewriting with cind's. The PCP instance

I = {⟨x1 = 10, y1 = 1⟩, ⟨x2 = 1, y2 = 01⟩}

is encoded in the first four Horn clauses, which can be viewed as realizing a nondeterministic automaton that accepts two words xi1 . . . xik and yi1 . . . yik if they are equal. In the start state s0, a domino ⟨xi, yi⟩ out of I is chosen. The symbols in xi and yi are then accepted one by one. If one of the two words xi, yi is longer than the other one, the shorter one is padded with ε symbols. We return to the state s0 no sooner than all symbols of a domino have been accepted. For the instance of Example 5.3.10, we thus have an automaton with three states.

[Figure: the three-state nondeterministic automaton realized by the first four Horn clauses of Example 5.3.10, with transitions labeled by pairs of accepted symbols.]
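Both PCP instances used in this section indeed have solutions, which is easy to verify directly (a trivial check, independent of the query rewriting encoding; the helper function is ours):

    def check_pcp(dominoes, indexes):
        """Concatenate the chosen dominoes and compare top and bottom words."""
        top = "".join(dominoes[i - 1][0] for i in indexes)
        bottom = "".join(dominoes[i - 1][1] for i in indexes)
        return top == bottom, top

    # Instance of Example 5.3.13: indexes (1, 2, 1) give 1011101 on both sides.
    print(check_pcp([("101", "1"), ("1", "01110")], (1, 2, 1)))   # (True, '1011101')
    # Instance of Example 5.3.10: indexes (1, 2) give 101 on both sides.
    print(check_pcp([("10", "1"), ("1", "01")], (1, 2)))          # (True, '101')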

The encoding, while more complicated than the one presented in Example 5.3.13, allows us to show the undecidability of query rewriting11 as well as the undecidability of query containment under a set of cind's12. The encoding of Example 5.3.10 can also be adapted without much difficulty to prove the undecidability of equivalence and of the existence of equivalent rewritings.

5.3.5 Complexity of the Acyclic Case

For the important case that Σ is acyclic, the two above problems are decidable (and NEXPTIME-complete). We first establish the following auxiliary result.

Lemma 5.3.14 Let Σ be an acyclic set of cind's and Q and Q′ be conjunctive queries. Then the containment problem Σ ⊨ Q ⊆ Q′ and the problem of deciding whether the maximally contained rewriting of Q (as a set of conjunctive queries) is nonempty are NEXPTIME-hard.

Proof. NEXPTIME-hardness follows from a slightly altered form of the encoding of the NEXPTIME-complete TILING problem (see e.g. [Pap94]) used in [DV97] to show NEXPTIME-hardness of the SUCCESS problem for nonrecursive logic programming, i.e., the problem of deciding whether a nonrecursive logic program treated as a database query will return a nonempty result.

TILING is the problem of tiling the square of size 2^n × 2^n by tiles (squares of size 1 × 1) of k types. There are two binary relations on and to defined on the tiles. Tiles ti and tj are said to be horizontally compatible if ⟨ti, tj⟩ ∈ to holds and are called vertically compatible if ⟨ti, tj⟩ ∈ on. A tiling of the square of size 2^n × 2^n is a function f : {1, . . . , 2^n} × {1, . . . , 2^n} → {t1, . . . , tk} such that vertically and horizontally neighboring tiles are compatible, i.e.

⟨f(i, j), f(i + 1, j)⟩ ∈ to   for all 1 ≤ i < 2^n, 1 ≤ j ≤ 2^n, and
⟨f(i, j), f(i, j + 1)⟩ ∈ on   for all 1 ≤ i ≤ 2^n, 1 ≤ j < 2^n.

The TILING problem is defined as follows. Given a set {t1, . . . , tk} of tiles, compatibility relations on and to, and a number n written in unary notation, decide whether there exists a tiling f of the square of size 2^n × 2^n with a distinguished tile type, say t1, at the top left corner (i.e., f(1, 1) = t1).

We describe a reduction that transforms any instance of the tiling problem into an instance of the containment problem of conjunctive queries under an acyclic

11A PCP instance is satisfiable iff the maximally contained rewriting of q ← b(x, x, 0). under Σ is nonempty.
12A PCP instance is satisfiable iff Σ ⊨ {⟨⟩ | ∃x : v(x) ∧ a(x, x, 0)} ⊆ {⟨⟩ | ∃x : b(x, x, 0)}.


Figure 5.1: Hypertile of size i ≥ 2 (left) and all nine possible overlapping hypertiles of size i − 1 that can be inscribed into it (right).

set of cind's, and which requires only polynomial time relative to the size of the problem instance. We define hypertiles as follows. Each composition of 2 × 2 tiles or hypertiles is a hypertile if the component tiles satisfy the compatibility constraints. Obviously, all hypertiles are of size 2^i × 2^i for some i ≥ 1 [DV97]. In our encoding, we define hypertiles of level 1 by the following cind

{⟨x1, x2, x3, x4⟩ | ∃xf : til1(xf, x1, x2, x3, x4, x1)} ⊇ {⟨x1, x2, x3, x4⟩ | to(x1, x2) ∧ to(x3, x4) ∧ on(x1, x3) ∧ on(x2, x4)}

Fortunately, for hypertiles of level i ≥ 2, it is not necessary to enforce that all the compatibility constraints are satisfied on the level of tiles. Instead, it is sufficient to verify that all of the nine possible (overlapping) constituent hypertiles of the next-smaller level i − 1 (see Figure 5.1) satisfy the compatibility constraints. We define hypertiles of level greater than one by

{⟨xf, yf, zf, uf, t⟩ | ∃f : tili+1(f, xf, yf, zf, uf, t)} ⊇
{⟨xf, yf, zf, uf, t⟩ | ∃ x1, . . . , x4, y1, . . . , y4, z1, . . . , z4, u1, . . . , u4, d1, . . . , d13 :
    tili(xf, x1, x2, x3, x4, t) ∧ tili(yf, y1, y2, y3, y4, d1) ∧
    tili(zf, z1, z2, z3, z4, d2) ∧ tili(uf, u1, u2, u3, u4, d3) ∧
    tili(d4, x2, y1, x4, y3, d5) ∧ tili(d6, x4, y3, z2, u1, d7) ∧
    tili(d8, z2, u1, z4, u3, d9) ∧ tili(d10, x3, x4, z1, z2, d11) ∧
    tili(d12, y3, y4, u1, u2, d13)}

Let bot be a nullary predicate. To complete our encoding, we add cind's on(ti, tj) ← bot. for each ⟨ti, tj⟩ ∈ on and to(ti, tj) ← bot. for each ⟨ti, tj⟩ ∈ to, where ti and tj are constants identifying pairs out of the k given tile types.

Let us consider the encoding shown above as a logic program (that we obtain by normalizing the cind's). The existential variables in the subsumer queries of the tili cind's will be transformed into function terms aggregating the 4 hypertiles of the next-smaller size. (In fact, the variables for the top left corner tiles t will also be aggregated in the function terms, but this does not alter the correctness of the encoding.) The cind for til1 is transformed into the Horn clause

til1(f1(x1, x2, x3, x4), x1, x2, x3, x4, x1) ← to(x1, x2), to(x3, x4), on(x1, x3), on(x2, x4).

and the cind's for tili≥2 are normalized as Horn clauses with heads

tili(fi(x1, x2, x3, x4, t), x1, x2, x3, x4, t)

During bottom-up evaluation of such a logic program, the function terms constructed using fi correspond exactly to the valid hypertiles constructible from the given k tile types, if the fifth arguments of function terms with symbols fi≥2 are ignored. It is quite easy to see that there is a solution for the TILING problem iff the constraints in our encoding entail

{⟨⟩ | bot} ⊆ {⟨⟩ | tilm(f, x, y, z, u, 1)}

Equally, there is a solution to the TILING problem exactly if the maximally contained rewriting of {⟨⟩ | tilm(f, x, y, z, u, 1)} in terms of the "source predicate" bot is nonempty. Thus, these two problems are NEXPTIME-hard.

Theorem 5.3.15 Let Σ be an acyclic set of cind's and Q and Q′ be conjunctive queries. Then the containment problem Σ ⊨ Q ⊆ Q′ and the query rewriting problem for conjunctive queries (under acyclic sets of cind's) are NEXPTIME-complete.

Proof. As pointed out in Section 5.3.1, the query containment problem under an acyclic set of cind's can be solved by proving the unsatisfiability of the negation of the containment, which decomposes into a set of ground facts (in analogy with the canonical database of the "freezing trick" of Example 2.2.3) and a goal. This is a special case of the SUCCESS problem for nonrecursive logic programs [DV97, VV98]. The problem of deciding whether query rewriting produces a nonempty set of conjunctive queries can be reduced to the SUCCESS problem by introducing unit clauses si(x1, . . . , xni) ←. (where x1, . . . , xni are distinct variables) for each "source" predicate si of arity ni. As both problems are known to be NEXPTIME-hard from Lemma 5.3.14, completeness in NEXPTIME has been shown.

This result shows that by restricting ourselves to acyclic sets of cind's we have nevertheless retained all the expressive power for decision-making (modulo polynomial transformations) of nonrecursive logic programming. It also shows that no efficient (polynomial-time) algorithm for this problem can exist.

5.4 Implementation

Our implementation is based on Algorithm 5.3.5, but makes use of several optimizations. Every time an MCD m is unfolded with a query to produce an intermediate rewriting Q, we compute a query Q′ (a partial rewriting) as follows.

Body(Q′) := {Bodyi(Q) | mi ≠ ε}
Head(Q′) := ⟨X̄⟩ s.t. each xi ∈ Vars(Head(Q)) ∩ Vars(Body(Q′))

Q′ is thus created from the new subgoals of the query that have been introduced using the MCD. If Q′ contains non-source predicates, the following check is performed: we check whether our rewriting algorithm produces a nonempty rewriting on Q′. This is carried out in depth-first fashion. If the set of cind's is cyclic, we use a maximum lookahead distance to assure that the search is finite. If Q′ is not further rewritable, Q does not need to be processed further and can be dropped. Subsequently, (intermediate) rewritings produced by unfolding queries with MiniCon descriptions are simplified using tableau minimization.

Directly after parsing, Horn clauses whose head predicates are unreachable from the predicates of the query are filtered out. The same is done with clauses not in the set X computed by

X := ∅;
do
    X := X ∪ {c ∈ C | Preds(Body(c)) ⊆ Sources ∪ {Pred(Head(c′)) | c′ ∈ X}};
while X changed;
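This fixpoint is easy to compute. A possible Python rendering (an illustrative sketch with made-up names, not the CindRew code) is:

    def rewritable_clauses(clauses, sources):
        """Keep only clauses whose bodies can transitively be reduced to
        source predicates: the fixpoint X from above."""
        # clauses: list of (head_atom, body_atoms); atoms are (pred, args...) tuples.
        x, derivable = [], set(sources)
        changed = True
        while changed:
            changed = False
            for c in clauses:
                head, body = c
                if c not in x and all(atom[0] in derivable for atom in body):
                    x.append(c)
                    derivable.add(head[0])
                    changed = True
        return x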

We have implemented the simple optimizations known from the Bucket Algorithm [LRO96] and the Inverse Rules Algorithm [GKD97] for answering queries using views, which are used to reduce the branching factor in the search process. Beyond that, MiniCon descriptions are computed with an intelligent backtracking method that always chooses to cover those subgoals first for which this can be done deterministically (i.e., for which the number of Horn clauses that are candidates for unfolding with a particular subgoal can be reduced to one), thereby reducing the amount of branching.

Our unification algorithm allows us to pre-specify variables that may in no case be unified with a function term (e.g., head variables of queries or variables in atoms already over source predicates). This makes it possible to detect the impossibility of creating a function-free rewriting as early as possible. In the implementation of the deterministic component of our algorithm for generating MiniCon descriptions, we first check whether the corresponding pairs of terms of two atoms to be matched unify independently before doing full unification. This allows us to detect most violations with very low overhead. Given an appropriate implementation, it is possible to check this property in logarithmic or even constant time.

An important performance issue in Algorithm 5.3.5 is the fact that MCDs are only applied one at a time, which leads to redundant rewritings (since, e.g., the same MCDs may be applicable in different orders, as is also true for the classical problem of answering queries using views, a special case) and thus to a search space that may be larger than necessary. We use dependency-graph-based optimizations to check whether a denser packing of MCDs is possible. For the experiments with layered sets of cind's reported on in Section 5.5 (Figures 5.3 and 5.4), MCDs are packed exactly as densely13 as in the MiniCon algorithm of [PL00].

Distribution

The implementation of our query rewriter (with algorithms for both semantics presented) consists of about 9000 lines of C++ code. Binaries for several platforms, as well as examples and a Web demonstrator that allows limited-size problems to be run online, are made available on the Web at

http://cern.ch/chkoch/cindrew/

5.5 Experiments

A number of experiments have been carried out to evaluate the scalability of our implementation. These were executed on a 600 MHz dual Pentium III machine running Linux. A benchmark generator was implemented that randomly generated example queries and sets of cind's. This program created chain as well as random queries (and cind's). In all experiments, the queries had 10 subgoals, and we averaged timings over 50 runs. Sets of cind's were always acyclic. This was ascertained by the use of predicate indexes such that the predicates in the subsumer query of a cind only used indexes greater than or equal to a random number determined for each cind, and subsumed queries only used indexes smaller than that number. Times for parsing the input were excluded from the diagrams, and redundant rewritings were not eliminated14. The diagrams relate reasoning times on the (logarithmic-scale) vertical axis to the problem size, expressed by the number of cind's, on the horizontal axis.
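The acyclicity-by-index construction for chain cind's can be sketched as follows (a hypothetical re-rendering of the benchmark generator's idea in Python; all names and parameters are made up and the actual generator may differ):

    import random

    def random_chain_query(pred_indexes, length, prefix="p"):
        """Chain of binary atoms p_i(x_j, x_{j+1}); returns (head_vars, body)."""
        body = [("%s%d" % (prefix, random.choice(pred_indexes)), "x%d" % j, "x%d" % (j + 1))
                for j in range(1, length + 1)]
        return (["x1", "x%d" % (length + 1)], body)

    def random_chain_cind(num_preds):
        """Acyclicity is ensured by a cut-off index: the subsumer only uses
        predicate indexes >= cut, the subsumee only indexes < cut, so every
        dependency edge goes from a lower to a strictly higher index."""
        cut = random.randint(1, num_preds - 1)
        subsumer = random_chain_query(range(cut, num_preds), random.randint(3, 6))
        subsumee = random_chain_query(range(0, cut), random.randint(3, 6))
        return (subsumer, subsumee)    # subsumer ⊇ subsumee

    random.seed(0)
    benchmark = [random_chain_cind(16) for _ in range(1000)]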

5.5.1 Chain Queries

Chain queries are conjunctive queries of the form

q(x1, xn+1) ← p1(x1, x2), p2(x2, x3), . . . , pn−1(xn−1, xn), pn(xn, xn+1).

13See Section 3.6.2.
14Note that CindRew can optionally make rewritings nonredundant and minimal. However, for these experiments, these options were not active.

[Plot: rewriting time in seconds (logarithmic scale) versus number of cind's (0 to 3000), for p = 8, 12, and 16 predicates, plus the classical-semantics algorithm at p = 16; chain queries, unlayered cind's, 3-6 subgoals per query.]

Figure 5.2: Experiments with chain queries and nonlayered chain cind’s.

Thus, chain queries are constructed by connecting binary predicates via variables to form chains, as shown above. In our experiments, the distinguished (head) variables were the first and the last. The chain cind's had between 3 and 6 subgoals in both the subsuming and the subsumed queries. We report on three experiments with chain queries.

The first diagram (Figure 5.2) shows timings for chain queries. The steep line on the left reports on an alternative query rewriting algorithm that we have implemented, which follows the classical semantics and a traditional resolution strategy, where we unfold certain clauses where this is deemed appropriate, as described in Algorithm 5.3.3. This is particularly effective with acyclic sets of constraints that are as densely packed as is the case here. The experiment reported on here was carried out with 16 predicates. This algorithm is compared to and clearly outperformed by CindRew (with three different numbers of predicates: 8, 12, and 16). Since the constraints get sparser as more predicates become available, a larger number of predicates renders the query rewriting process simpler.

In the second diagram (Figure 5.3), we report on CindRew's execution times with cind's that were generated with an implicit layering of predicates (with 2 layers). This experiment is in principle very similar to local-as-view rewriting with p/2 global predicates and p/2 source predicates (where the subsumer queries of cind's correspond to logical views in the problem of answering queries using views), followed by simple view unfolding to account for the subsumed queries of cind's. We again report timings for three different numbers of predicates.

[Plot: rewriting time in seconds (logarithmic scale) versus number of cind's (0 to 3000), for p = 8, 12, and 16 predicates; chain queries, 3-6 predicates per query, two layers of predicates.]

Figure 5.3: Experiments with chain queries and two layers of chain cind’s.

[Plot: rewriting time in seconds (logarithmic scale, 10^-5 to 10^1) versus number of cind's (0 to 6000), for p = 20 and p = 40 predicates; chain queries, 3-6 predicates per query, five layers of predicates.]

Figure 5.4: Experiments with chain queries and five layers of chain cind's.

[Plot: rewriting time in seconds (logarithmic scale) versus number of cind's (0 to 2500); random queries, predicate arity 2-3, subsumers with 3-4 subgoals, subsumees with 2 subgoals, 1-2 distinguished variables, five layers of predicates.]

Figure 5.5: Experiment with random queries.

In the third diagram, the same problem of finding a maximally contained rewriting is solved for 20 and 40 predicates, which are grouped into a stack of five layers of 4 and 8 predicates each, respectively. Of the five sets of predicates, one constitutes the sources and one the "schema" over which queries are asked, and four equally sized sets of cind's bridge between these layers15. As can be seen by comparing the second and third diagrams with the first, the hardness of the layered problems is more homogeneous. Particularly in Figure 5.2 and Figure 5.3, one can also observe subexponential performance. Note that in the experiment of Figure 5.4, timings were taken in steps of 20 cind's, while in the other experiments, this step length was 100.

5.5.2 Random Queries

The random queries had either three or four subgoals in the subsumer and two subgoals in the subsumed query. Predicates had either arity two or three, and the number of distinguished variables was either one or two. The number of existential variables was two to three times as high, in order to reduce the number of correct solutions. For the experiments carried out with random queries, the number of solution rewritings quickly got very large, so we report on computing at most 100 rewritings16. Figure 5.5 shows the timings for random queries as described earlier, with five predicates and five layers (i.e., one predicate per layer). We report only on the case with five layers because, e.g., in the cases with no layers or fewer layers, computing the first 100 solutions was too easy.

15See Section 5.2 for our definition of layered sets of cind's.
16In the runs with chain queries (and constraints), we of course computed all rewritings.

5.6 Discussion

This chapter has addressed the query rewriting problem in data integration from a fresh perspective. Expressive symmetric constraints are used, which we have called Conjunctive Inclusion Dependencies. The problem of computing the maximally contained rewritings was studied under two justifiable semantics. We have discussed their main theoretical properties and have shown that they coincide. We have presented the second semantics, motivated by rewrite systems, as a valuable alternative to the classical logical one. This semantics makes it possible to apply time-tested (e.g., tableau minimization) as well as more recent (e.g., the MiniCon algorithm) techniques and algorithms from the database field to the query rewriting problem.

There are several advantages to algorithms following the philosophy of the rewrite systems semantics for query rewriting. Under this semantics, intermediate results are (function-free) queries and can immediately be made subject to query optimization techniques known in the database field. As a consequence, further query rewriting may start from simpler queries, leading to an increase in performance and fewer redundant results that have to be found and later eliminated. Thus, it is often possible to detect dead ends early. As a trade-off (as can be seen in Algorithm 5.3.5), an additional degree of nondeterminism is introduced compared to resolution-based algorithms under the classical semantics.

In the context of data integration, there are usually a number of regularities in the way constraints are implemented and queries are posed. Usually we expect to have a number of schemata, each containing a number of predicates. Between the predicates of one schema, no constraints for data integration uses are defined. Moreover, we expect inter-schema constraints usually to be of the form Q1 ⊆ Q2 where most (or all) predicates in Q1 belong to one and the same schema, while the predicates of Q2 belong to another. Queries issued against the system are usually formulated in terms of a single schema. Given these assumptions, we suspect that algorithms following the rewrite systems semantics, which apply optimization techniques from the database area to intermediate results, have a performance advantage over classical resolution-based algorithms, which do not exploit such layering heuristics.

Clearly, the noncomputability of rewritings in general is an important problem, which will be addressed in the following chapter. In particular, we will argue that one can often avoid cyclic definitions of cind's in a system for finding maximally contained rewritings17. We give an outline of additional computable cases of query rewriting beyond the acyclicity requirement in Section 7.1.2.

17Note that it is only reasonable to speak of equivalent rewritings if cind's may be cyclic.

Chapter 6

Model Management

The previous chapter gave a detailed presentation of the query rewriting problem with conjunctive inclusion dependencies. Such inter-schema constraints are highly relevant to data integration not only because of their ability to deal with concept mismatch that requires symmetric constraints (see Example 1.3.1). This class of constraints also supports the construction of mappings that are robust with respect to change. This robustness stems from the expressiveness of the formalism, which allows LAV and GAV mappings to be combined and thus offers substantial modeling power.

This chapter starts with the definition of a very simple model for managing relational schemata and mappings based on cind's in a repository. Schemata are simply sets of relations, without any additional semantics such as those commonly encoded as functional dependencies or inclusion dependencies (e.g., unique keys and foreign key constraints). This restriction, together with the decision to confine this presentation to a purely relational rather than a semantic or object-oriented data model, makes this study a mostly theoretical one. However, it allows us to concentrate on the main issues of supporting maintainability in a concise way. Furthermore, such extensions are reasonably simple to realize1.

In the second section of this chapter we discuss the problem of designing mappings that are robust and do not fail completely when the data integration requirements change. Finally, we attack the problem of arriving at collections of inter-schema constraints that are acyclic, a property that Chapter 5 has shown to be desirable.

6.1 Model Management Repositories

A model management repository is a pair ⟨R, M⟩ of a set of relational schemata R and a set of mappings M. For simplicity, we consider schemata as simple sets of relation schemata without dependencies (which of course could be added to

1See Section 2.5 on the issue of applying our work to object-oriented queries.


1. Add a schema R to R.

2. Copy the predicates of a schema R ∈ R into a new schema R′. Mappings of R are not linked to R′.

3. Add a predicate to schema R. Predicates are identified by name within a schema and have a fixed arity.

4. Rename a predicate p of a schema R, as well as all of its occurrences in mappings.

5. Change the arity of a predicate p in a schema.
(a) To add an attribute to p at position i, each of its appearances in cind's is augmented with a new existential variable for that attribute.
(b) An attribute may only be removed from p if, for every appearance of p in a cind, there is a variable in this attribute position which is existentially quantified and not used in a join.

6. Delete a predicate p from a schema. This requires that p is not used in any mapping.

7. Delete a schema. This is only allowed if no predicate from the schema is used in any mapping.

8. Import a schema (from DDL, IDL, and DTD files, relational databases, spreadsheet layouts, ...)

Figure 6.1: Operations on schemata.

1. Add an elementary mapping Σ to M.

2. Add a cind Q1 ⊇ Q2 to an elementary mapping Σ from schemata R1, . . . , Rn to schema R. Q1 must be over predicates in R and Q2 over predicates in R1, . . . , Rn.

3. Remove a cind from an elementary mapping Σ.

4. Delete a mapping. In case of a composite mapping, all of its constituents are deleted, including auxiliary schemata.

Figure 6.2: Operations on mappings.

1. UNFOLDM(R, ΣGAV) Rewrite a schema R using a set of GAV views ΣGAV to achieve a finer granularity of entities contained. For each view

p(X̄) ← p1(X̄1), . . . , pn(X̄n)

in ΣGAV, let p be a predicate in R and p1, . . . , pn be new predicate names. p is replaced in R by {p1, . . . , pn} and all subsumer or subsumee queries of cind’s in M that contain p are unfolded with the GAV view.

2. MERGEM(R1, R2, R′)
Merge two schemata R1 and R2 into a new schema R′. This can be done if R1 and R2 do not contain relations of the same name but with different arities, and if there are no dependencies (via mappings) between any of the predicates in R1 and R2. Predicates from R1 and R2 with the same names fall together. All predicates from R1 and R2 occurring in mappings in M are replaced by the corresponding predicates from R′.

3. SPLITR,M(R, {R1, . . . , Rm}, {Rm+1, . . . , Rn}, R′)
Distribute the role of a schema R in the data integration infrastructure across R and a new schema R′. Let {R1, . . . , Rm}, {Rm+1, . . . , Rn} be a partition of all schemata against which R is mapped in M (i.e., {R1, . . . , Rm, Rm+1, . . . , Rn} is the set of schemata {X ∈ R | ∃M ∈ M : R ∈ from(M) and to(M) = X}). Copy R to a new schema R′. Copy all the mappings M with to(M) = R and replace all occurrences of predicates in R by their copies in R′. For all mappings in M against schemata Rm+1, . . . , Rn, replace the predicates from R by their copies in R′. This operation is close to being the inverse of the MERGE operation.

4. Eliminate an auxiliary schema R by unfolding the mappings from R with the mappings against R, if all the constraints thereby created are cind’s. This condition is guaranteed to hold if all mappings are GAV.

5. COMPOSEM(A) Create a composition of (existing) mappings around a (now auxiliary) schema A, as described in the definition of composite mappings.

6. Ungroup a composite mapping. This is needed when an auxiliary schema has matured and is to be (re-)used outside the mapping.

Figure 6.3: Complex model management operations.

the mechanism but which we leave out for simplicity). Each relational predicate is either marked "source" or "logical". A schema is called purely logical if it does not contain source predicates. Relational attributes may be named and typed. If they are unnamed, we refer to them by their index. Relational predicates are unique across all schemata: they are identified by their schema id in combination with their predicate name.

A mapping M maps from a set of possibly several schemata R1, . . . , Rn ∈ R (denoted from(M) = {R1, . . . , Rn}) against a single schema R (denoted to(M) = R). We require that to(M) ∉ from(M). Mappings are either elementary or composite. An elementary mapping Σ is a set of cind's where the subsumer sides of the constraints only use logical predicates from R and the subsumed sides only use predicates from R1 ∪ . . . ∪ Rn. The dependency graph of an elementary mapping thus has a diameter of one. A composite mapping M can be created from a schema A, a mapping M0, and a set of mappings {M1, . . . , Mn} if

1. A ∈ from(M0),

2. A = to(Mi) for each 1 ≤ i ≤ n,

3. A is purely logical and

4. A is not used in any other mapping in M besides M0,..., Mn.

A is called an auxiliary schema. We require the cind's in the union of all elementary mappings of a composite mapping to be acyclic. We do not provide an exhaustive list of all model management operations imaginable. Figure 6.1 and Figure 6.2 list operations for the manipulation of schemata and mappings, respectively. Figure 6.3 shows some of the more interesting complex operations. Clearly, model management software can do a very useful job in supporting a human expert in the manipulation tasks. For instance, when a new attribute is added to a relation, all of its occurrences in cind's can be automatically expanded with a new existentially quantified variable.
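To make the repository structures just described concrete, the following is a minimal Python sketch – purely illustrative, with class and function names that are not taken from the thesis or from the CindRew implementation – of schemata predicates, elementary mappings and cind's, together with the automatic expansion of cind's when a new attribute is added to a relation:

from dataclasses import dataclass, field
from itertools import count

_fresh = count()                     # supply of fresh variable names

@dataclass
class Atom:
    pred: str                        # globally unique: "<schema id>.<relation name>"
    args: list                       # variable names

@dataclass
class Query:                         # a conjunctive query
    head: list                       # distinguished variables
    body: list                       # list of Atom; body variables not in head are existential

@dataclass
class Cind:
    subsumer: Query                  # over the target schema to(M)
    subsumed: Query                  # over the source schemata from(M)

@dataclass
class Mapping:                       # an elementary mapping
    sources: set                     # from(M): schema ids
    target: str                      # to(M): a single schema id
    cinds: list = field(default_factory=list)

def add_attribute(mappings, pred, position):
    # When a new attribute is added to relation `pred` at argument index
    # `position`, expand every occurrence of `pred` in a cind with a fresh,
    # existentially quantified variable at that position.
    for m in mappings:
        for cind in m.cinds:
            for query in (cind.subsumer, cind.subsumed):
                for atom in query.body:
                    if atom.pred == pred:
                        atom.args.insert(position, f"_X{next(_fresh)}")

Under such a representation, adding an attribute at, say, position 0 of some relation R.book would insert a fresh existentially quantified variable into every cind that mentions R.book, leaving the constraints otherwise untouched.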

6.2 Managing the Change of Schemata and Requirements

The lack of a global schema against which sources can be integrated leads to the problem that the number of mappings grows with the square of the number of schemata. Together with the prospect that schemata may evolve, this leads to a serious management and maintenance problem. We approach this problem using two principal techniques. These are the decoupling of dependencies between mappings with respect to change (Section 6.2.1) and the merging and clustering of design artifacts of the data integration architecture wherever possible to reduce redundancy and the number of such artifacts to be managed (Section 6.2.2).

6.2.1 Decoupling Mappings

Given a number of schemata and mappings expressing dependencies between them, a risk exists that some minor modification to a mapping (which may be complex and work-intensive to design) renders its complete redesign necessary. Similarly, the change of a schema or the data integration requirements regarding a schema may invalidate several mappings. Two goals are immediate consequences of this:

• Firstly, change of a set of views should remain as local as possible. Whenever a source is added, we only want to add a single (or a few) logical views, and should not have to carry out a major redesign of existing mappings. It has been observed that local-as-view integration supports the simple addition and removal of source mappings (see Section 3.9).

• Secondly, mappings should decouple source and integration schemata from each other in the sense that the change of an integration schema (i.e., schema evolution) or the change of a schema's data integration requirements has only a minor impact on the "other end" of a mapping, the part of the description that is responsible for integrating sources.

Composite mappings as defined in the previous section permit the design of layers of inter-schema constraints that may be attributed different roles. The high expressiveness of our query rewriting formalism allows for such layers, for instance, to be either sets of LAV or GAV views. The resulting design potential enables us to create mappings that make intuitions regarding likely future change explicit in the mappings and to prepare for this change. We can attribute dedicated integration roles to individual layers, as shown in the following example.

Example 6.2.1 Let there be a fixed integration schema R with a single relation R.book, against which we would like to integrate four sources S1, S2, S3, S4 and five source relations S1.book, S2.book, S3.book, S4.sales and S4.categories. We define a composite mapping between R and its sources that consists of three layers (created using the COMPOSE operation of Figure 6.3 twice) and two auxiliary schemata, A1 and A2. We use three auxiliary predicates, A1.book, A1.sales, and A2.s4′. The outermost, a GAV mapping from S4 to A2, takes over the task of pre-filtering sources. The middle mapping from A2 to A1 follows the local-as-view approach and takes over the main source integration role. The innermost mapping, again GAV, projects from our well-designed auxiliary schema A1 to R. Consider the following constraints. (See also Figure 6.4.)

Figure 6.4: Data integration infrastructure of Example 6.2.1. Schemata are visualized as circles and elementary mappings as arrows.

M3: “Pre-filtering” (GAV); from(M3) = {S4}, to(M3) = A2

A2.s4′(Name, Producer, Price, Sales, Units) ←
    S4.sales(CategoryId, Name, Producer, Price, Sales, Units),
    S4.categories(CategoryId, "Books").

M2: “Source integration” (LAV); from(M2) = {S1, S2, S3, A2}, to(M2) = A1

{⟨Isbn, Name, Author⟩ | S1.book(Isbn, Name, Author)} ⊆
{⟨Isbn, Name, Author⟩ | ∃Price, Publisher : A1.book(Isbn, Name, Author, Price, Publisher)}

{⟨Isbn, Name, Publisher⟩ | S2.book(Isbn, Name, Publisher)} ⊆
{⟨Isbn, Name, Publisher⟩ | ∃Author, Price : A1.book(Isbn, Name, Author, Price, Publisher)}

{⟨Name, Author, Sales, Units⟩ | S3.book(Name, Author, Sales, Units)} ⊆
{⟨Name, Author, Sales, Units⟩ | ∃Isbn, Price, Publisher : A1.book(Isbn, Name, Author, Price, Publisher), A1.sales(Isbn, Sales, Units)}

{⟨Name, Publisher, Price, Sales, Units⟩ | A2.s4′(Name, Publisher, Price, Sales, Units)} ⊆
{⟨Name, Publisher, Price, Sales, Units⟩ | ∃Isbn, Author : A1.book(Isbn, Name, Author, Price, Publisher), A1.sales(Isbn, Sales, Units)}

M1: "Customizing" (GAV); from(M1) = {A1}, to(M1) = R

R.book(Name, Author, Price, Publisher) ← A1.book(Isbn, Name, Author, Price, Publisher).

We have created the GAV view A2.s4′ assuming that CategoryId is only used in that source, and have anticipated that no other future sources will provide it, making it easier to leave the schema against which the LAV views are mapped unchanged. On the other hand, ISBN codes are or will be provided by several sources and are relevant to integration, although our legacy integration schema does not know them. As a consequence, we have created an auxiliary integration schema, and provide a GAV mapping between the auxiliary and the legacy integration schema. We have also added a "sales" predicate to it, assuming that many sources will provide sales information and our action will save us from creating many GAV views that project these attributes out. □

Example 6.2.1 has used a three-layer (GAV-LAV-GAV) integration strategy, where the dedicated roles (1) customizing, (2) source integration and (3) pre-filtering were assigned to the three layers M1, M2, and M3. The LAV layer M2 takes over most of the source integration work. If sources have to be integrated against an information system whose schema lacks properties necessary for LAV integration, the LAV layer integrates against an auxiliary schema (schema A1 in Example 6.2.1) that extends the integration schema by these properties. The first (GAV) layer M1 maps the predicates of the auxiliary schema against the (legacy) integration schema. The third layer M3 may be used to filter out data or project out attributes that are irrelevant for the integration purpose at hand, such that the (auxiliary) integration schema and with it the LAV views do not have to be changed more often than absolutely necessary. Intuitively, this strategy should allow for convenient and maintainable data integration in a large number of scenarios. The LAV layer provides locality of change when sources are added (or deleted), and the entirety of these three layers facilitates decoupling when an integration schema changes. Changes to an integration schema can often be absorbed by the pre-filtering GAV views of mappings from such a schema and the customizing GAV views of mappings against such a schema. Thus, changes to the data integration infrastructure usually remain local and reasonably simple to manage.

Adding Sources

This motivates the following steps for adding sources² (see Figure 6.5 for the development stages of the set of views of a given legacy integration schema):

• Initially, we attempt to use LAV to integrate the sources against the integration schema.

²Of course, the rules given here should be followed less strictly if the designer of mappings anticipates some future change and designs a more sophisticated auxiliary integration schema that deviates more from the legacy integration schema.

Figure 6.5: The lifecycle of the mappings of a legacy integration schema.

• If there are source attributes that do not exist in the integration schema³, make a choice depending on whether these source attributes are likely to occur in many other sources or not. If the answer is yes, copy the predicates⁴ to which they should most naturally be added, and add the attributes. Use the altered auxiliary schema for LAV while at the same time providing GAV views from the altered predicates to the original versions in the integration schema, essentially just projecting out the added attributes. This is a nonlocal change. However, all the pre-existing logical views that use changed predicates can be altered automatically (a simple dummy attribute has to be introduced at the right position). Otherwise, add a GAV view before the LAV stage that projects out these attributes. Auxiliary schemata for LAV integration can also be generalized using the UNFOLD operator of Figure 6.3.

• If some pre-filtering of data available through sources (see Example 6.2.1) is needed, decide whether the predicates of other future sources are likely to be more general, in similar ways, than the current schema against which LAV integration is carried out. If so, generalize the auxiliary integration schema (if LAV integration is carried out against the legacy schema, copy it first) and provide proper GAV views. Otherwise, add a GAV view between a source and the LAV views.

Of course there is a varying degree of intuition that can be put into auxiliary integration schemata in order to facilitate future maintenance. On the parsimonious side, auxiliary integration schemata are only changed when this is really needed. At the other extreme, we may attempt to design a kind of "global" integration schema, making it possible to combine the source integration of several similar information systems that subscribe to similar sources. This is discussed in more detail in the following section.

³For instance, this is the case for the Isbn attribute of several sources in Example 6.2.1.
⁴That is, create an auxiliary integration schema that is equal to the integration schema apart from a number of predicates that are adapted to be able to map the sources in question.

Figure 6.6: Merging auxiliary integration schemata to improve maintenance.

6.2.2 Merging Schemata

The second main technique for simplifying the management of schemata and mappings in our architecture is based on the attempt to merge (auxiliary) schemata in the tradition of [BLN86, KDB98] (using the MERGE operation of Figure 6.3) whenever possible, or even to develop global schemata of limited scope⁵ that are well designed and prepared for kinds of future change that are likely to occur.

Reusing Auxiliary Schemata

It may be reasonable to use the predicates of an auxiliary integration schema of an information system rather than its legacy schema as sources to yet another information system. This is particularly appropriate if the intuitively perceived quality of the former is much higher than the quality of the latter. Another reason may be that the GAV views mapping the auxiliary integration schema against the legacy integration schema filter out relevant data. This leads us to the possibility of reusing auxiliary integration schemata, which may eliminate redundant work and greatly simplify the maintenance task. Such a step may be justified if several information systems have similar integration requirements (need similar information from sources) and if the adjustments that will be needed are expected to correlate heavily when it comes to change of sources. If this is the case, auxiliary integration schemata can be merged into one. The schema merging task can for example be carried out by defining a suitable "more global" auxiliary schema for the given auxiliary integration schemata, defining appropriate GAV views to map the predicates of the old schemata against the new one, and then generalizing these schemata and their mappings by unfolding (using the UNFOLD operation of Figure 6.3).

⁵These are similar to export schemata in federated databases [SL90].
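As a small illustration of the precondition behind the MERGE operation of Figure 6.3 used in this merging step, the following Python sketch – illustrative only; the arity dictionaries and the restriction to direct mappings are simplifications, and none of the names are taken from the thesis implementation – checks whether two schemata may be merged:

def can_merge(id1, rels1, id2, rels2, mappings):
    # rels1, rels2: dicts mapping relation names to arities for the schemata
    # with ids id1 and id2; mappings: objects with `sources` (from(M), a set
    # of schema ids) and `target` (to(M), a schema id).
    # Relations of the same name must agree on arity (they "fall together").
    for name in rels1.keys() & rels2.keys():
        if rels1[name] != rels2[name]:
            return False
    # No dependencies via mappings between the two schemata. The sketch only
    # rejects direct mappings; a full check would follow mapping chains
    # transitively.
    for m in mappings:
        if (id1 in m.sources and m.target == id2) or \
           (id2 in m.sources and m.target == id1):
            return False
    return True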

Figure 6.7: A clustered auxiliary schema. Schemata are displayed as circles and mappings as arrows.

Clustering

For instance, consider again the case of the LHC project (see Section 1.3). There are groups of information systems that, although they are based on different schemata, satisfy similar needs (are in the same stage of the project lifecycle) for different subprojects. For such clusters, it may be wise to create a "global" information system or data warehouse (from which the individual information systems basically receive their data through a simple GAV mapping) whose aim is restricted to that particular step of the lifecycle (as noted, building a global schema for the whole lifecycle may not be possible), and which concentrates source integration against its global schema. Figure 6.7 depicts such a shared auxiliary schema. Even if data integration is carried out on demand (i.e. using the "lazy approach" to data integration [Wid96]), one can think of such an approach as an analogy to data warehouses (the clustered schemata) and data marts (the individual integration schemata). The SPLIT operation of Figure 6.3 makes it possible to take back clustering decisions if integration schemata making use of such "global" schemata evolve in different ways and the clusters become unsustainable. The creation of a "global" auxiliary integration schema for several similar information systems also simplifies the task of avoiding circularities in definitions of constraints caused by information systems mutually using each other's virtual predicates.

6.3 Managing the Acyclicity of Constraints

It is clearly a goal to have the set of all cind's in a data integration system be acyclic, as that property guarantees the computability of rewritings. Cyclic sets of cind's mean a self-referential definition of the source-to-integration predicate relationships, and rewritings produced using the results of Chapter 5 (as sets of conjunctive queries) may in general be of infinite size. Termination of the rewriting process can nevertheless be ensured in several ways.

• To attain that, one could give up the completeness requirement and produce rewritings that are guaranteed to be sound but may be incomplete, simply by setting a threshold on processing time or on the number of constraints used. Our intuition is that in practice, when real-world constraints for data integration are encoded, the rewriting process will terminate with a complete result in most cases.

• Alternatively, the query rewriting tool could, given a query, cut away e.g. those cind's whose directed edges in the dependency graph are most distant from the predicates in the query and which occur in a cycle. This can be justified by viewing the query rewriting process again from a more procedural perspective, where rewriting is carried out using successive operator applications (i.e., the unfolding of queries with MCDs) using exclusively mappings that are directed at the integration schema in question.

• If the process of designing mappings between schemata is computer-supported, a system can help to enforce already at design time that mappings remain acyclic (a sketch of such a check is given below). When acyclicity is enforced automatically all through the design process of mappings from the start, it should not be perceived as too restrictive.

The clustering of auxiliary schemata combining logical predicates that represent integrated sources and which are to be connected to several "subscriber" schemata clearly supports the goal of avoiding cyclicity. In the extreme case, one could aim at defining auxiliary schemata that are commonly used by all information systems requiring access to certain resources, while making sure that none of the mappings against these resources share any of the logical predicates used in the earlier mappings.
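A design-time acyclicity check of the kind suggested above can be sketched as follows (illustrative only: cind's are reduced to the sets of predicate names on their subsumer and subsumed sides, and the function names are not taken from the thesis implementation):

from collections import defaultdict

def has_cycle(edges):
    # Depth-first cycle detection on a directed graph {node: set of successors}.
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def visit(u):
        color[u] = GREY
        for v in edges.get(u, ()):
            if color[v] == GREY or (color[v] == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False

    return any(color[u] == WHITE and visit(u) for u in list(edges))

def would_close_cycle(constraint_base, new_cind):
    # Each cind is represented as a pair (subsumer_preds, subsumed_preds) of
    # sets of predicate names. Edges run from subsumed-side to subsumer-side
    # predicates; the orientation is irrelevant for detecting cycles.
    edges = defaultdict(set)
    for subsumer_preds, subsumed_preds in list(constraint_base) + [new_cind]:
        for p in subsumed_preds:
            edges[p] |= set(subsumer_preds)
    return has_cycle(edges)

A repository could run such a check whenever the "add a cind" operation of Figure 6.2 is invoked and warn about, or reject, constraints that would close a cycle.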

Another computable case of the query rewriting problem is discussed in Section 7.1.

Chapter 7

Outlook

This chapter first presents the problem of providing physical data independence under schema evolution in Section 7.1. This is another realistic application of query rewriting with cind's, outside of data integration. It generalizes the classical problem of maintaining physical data independence in the same way that data integration by query rewriting with cind's generalizes data integration via the problem of answering queries using views. In the remainder of the chapter, we discuss extensions of query rewriting with cind's (which has so far only been considered in the context of relational conjunctive queries) that are analogous to those that have been proposed for the problem of answering queries using views. A few issues worth considering are:

• Recursive queries. We address the query rewriting problem with recursive (datalog) queries and nonrecursive sets of cind's in Section 7.2. This problem can be solved easily as a generalization of the work in [DG97].

• Sources with binding patterns within the data integration architecture presented in Chapter 4 are relevant for two reasons. Firstly, this feature may be required for the integration of sources with restricted query interfaces such as legacy systems. Secondly, it makes it possible to include procedural code for transforming data. This may permit a gateway to different approaches to data integration that may coexist in a heterogeneous data integration infrastructure. Another application may be procedures that implement complex data transformations. Of course, it has been observed that most practical database queries are of a very simple nature, and that very restricted query languages (with their favorable theoretical properties) cover most practical needs, particularly of non-expert users. This, however, does not always remain true. Some queries that are needed in the real world are beyond the expressive power of the query language supported by the data integration platform and the underlying reasoning method. This is particularly true in engineering environments, in which data often have complex structure, such as in our use case of Section 1.3.


Figure 7.1: A cind as an inter-schema constraint (A) compared to a data transformation procedure (B). Horizontal lines depict schemata and small circles depict schema entities. Mappings are shown as thin arrows.

The solution to this problem is to encapsulate advanced data transformations in a "procedure", that is, a construct that, for the purposes of data integration and query rewriting, is only described externally, by its interface. The procedure itself may contain a query in a highly expressive query language or a piece of code in a high-level programming language. The tradeoff made is the following: Query rewriting reasoning is simplified and often only made possible, and certain complicated queries may be hardwired in efficient, problem-specific code. On the downside, the completeness of rewriting compared to queries that are not just externally described is lost when procedures are used. If such data transformation procedures are embedded in the data integration architecture in the sense that they read out (possibly integrated) data from information systems that are inside the infrastructure as well, one may specify constraints that hold between interfaces and schemata of accessed data (see Figure 7.1) using e.g. a description logics formalism such as in [BD99]. Constraints of this kind can be used to bound the query rewriting process and eliminate irrelevant rewritings. Such a hybrid approach of query rewriting and description logics reasoning would be highly interesting, though necessarily incomplete. The query rewriting problem with binding patterns in the case of acyclic sets of cind's can be reduced to the problem of answering queries using views with binding patterns by maximally contained rewritings addressed

in [DGL00] by the transformation described in Section 7.2.

• Object-oriented and semistructured schemata and queries. We have discussed the equivalence of (the range-restricted versions of) nested relation calculus and relational calculus in Section 2.5. Given this, the rewriting of conjunctive nested relation calculus queries and analogous constraints can be simulated by the relational case by a simple syntactic transformation (see e.g. Example 2.5.1). This covers a practically relevant class of queries in the complex object model that can be mapped straightforwardly to object-oriented data models (see also [LS97]). Semistructured data models (e.g. OEM [AQM+97] or ACeDB [BDHS96]) have recently received much interest due to the vision of considering the World Wide Web as a single large database [AV97a, FLM98], and the rise of XML-related technologies as a major standard for data exchange [ABS00]. The semistructured case can to a certain extent be seen as a special case of the object-oriented one. However, a special case of recursive queries – regular path queries – is an important aspect of semistructured database queries [CM90, Abi97, AV97b]. We address the rewriting of recursive queries under cind's in Section 7.2, as mentioned. For local-as-view integration in the semistructured context, particularly with regular path views, see [PV99, CDLV99, CDLV00b, CDLV00a].

• Conjunctive queries with inequalities. Although practically relevant, this issue is left open for future research. A special case is discussed in Footnote 4 in Section 7.1. Query rewriting with cind’s and functional dependencies is another topic of future research.

7.1 Physical Data Independence under Schema Evolution

7.1.1 The Classical Problem

Database systems are based on the assumption of a separation between a logical schema and a physical storage layout, which represents an important factor supporting their popularity. In fact, however, this independence between the logical and the physical schema is not really given in state-of-the-art database systems. This is at least true for relational database systems, where relations are usually really stored as files, which are quite straightforward serializations of the data under the logical schema. For object-oriented schemata, the physical and logical schemata in practice do not coincide that closely; otherwise, there would be too much redundancy. Still, there is usually a fixed canonical relationship between physical and logical schemata.

True physical data independence would be worthwhile, as it would permit the definition of a logical schema according to design and logical application requirements and a physical schema optimized for performance. Currently, the coupling between physical and logical schemata does not permit this, requiring one to depart from schemata that follow domain conceptualizations in order to attain satisfactory performance.

Work on improving this situation (in particular, GMAP [TSI94]) has defined physical storage structures as materialized views over the logical schema. That way, answering queries requires local-as-view query rewriting (which we know is NP-complete in the size of the query [LMSS95], and thus usually acceptable, because queries tend to be small), and the database update problem is comparatively simple¹. This task would be substantially more complicated if the relationship between the logical schema and the physical storage structures were defined the other way round, i.e., the logical relations as views over the physical. In that case, the view update problem [BS81, FC85] would have to be solved. The approach of [TSI94] also makes it possible to improve performance for classes of similar queries that are often asked, simply by adding further storage structures that are defined as views similar to those queries.

Example 7.1.1 We use the popular university domain that has been previously used to communicate the essentials of the maintenance of physical data independence [TSI94, Lev00]. Consider the logical schema of Figure 7.2². This translates into the following relational schema. Primary key attributes are underlined.

v1.student(StudId, Name)
v1.masters_student(StudId, SecondPeriod)
v1.phd_student(StudId, ResearchArea, Advisor)
v1.professor(Name, Leads_DeptId)
v1.faculty(Name)
v1.course(CourseId, Name, RequiredExam_CourseId, CurriculumName)
v1.teaches(Name, CourseId)
v1.exam_taken(StudId, CourseId, Date, Grade)
v1.department(DeptId, Name, Address)
v1.works_in(FacName, DeptId)

¹It is the view maintenance problem [AHV95], concerned with propagating changes to base tables incrementally to views, such that views do not need to be fully refreshed whenever a change occurs.
²The schema is presented as an Entity Relationship (ER) diagram [Che76] that we have extended by is-a relationships as in EER [TYF86]. is-a relationships are drawn as arrows with white triangular heads. For instance, PhD students inherit the attributes of the entity student, "st_id" and "name".

Figure 7.2: ER diagram (extended with is-a relationships) of the university domain (initial version).

All students are either masters or PhD students. Full professors are managed separately from other faculty (e.g. research or teaching assistants). Each professor leads a department. Faculty may work in possibly several departments. Full names of professors and other faculty are assumed to be unique in the combined domain of such names³. Courses are taught by professors or other faculty, have an id number, and may require up to one other course for which students must have successfully passed the exam to be admitted. If a course has no such requirement, a NULL value is stored for the attribute "RequiredExam_CourseId", rather than a course id. PhD students have a professor as their advisor and an assigned area of research. Masters students are either in their first or second period of their studies, and this state is stored as a boolean flag "second_period". Let us now have the following physical storage structures, which are defined as views over the logical schema.

m1(StudId, StudName, Area, Advisor, DeptId, DeptName, DeptAddress) ←
    v1.student(StudId, StudName),
    v1.phd_student(StudId, Area, Advisor),
    v1.works_in(StudName, DeptId),
    v1.department(DeptId, DeptName, DeptAddress).

³We intentionally outline a less-than-perfect schema.

m2(Name, LeadsDeptId, DeptName, DeptAddress) ←
    v1.professor(Name, LeadsDeptId),
    v1.department(LeadsDeptId, DeptName, DeptAddress).

m3(StudId, StudName, CourseId) ←
    v1.student(StudId, StudName),
    v1.course(CourseId, CourseName, Req, Curriculum),
    v1.exam_taken(StudId, Req, Date, Grade).

Now consider the following query, which asks for names of PhD students who work (e.g. as teaching assistants) in departments not led by their advisors⁴.

q(StudName) ←
    v1.professor(AdvisorName, LDeptId),
    v1.department(LDeptId, LDeptName, LDeptAddress),
    v1.student(StudId, StudName),
    v1.phd_student(StudId, Area, AdvisorName),
    v1.works_in(StudName, SDeptId),
    v1.department(SDeptId, SDeptName, SDeptAddress),
    LDeptAddress ≠ SDeptAddress.

In this context, materialized views – the physical storage structures – are assumed complete and up-to-date. Thus, view m2, for instance, has the meaning

{⟨Name, LeadsDeptId, DeptName, DeptAddress⟩ | m2(Name, LeadsDeptId, DeptName, DeptAddress)} ≡
{⟨Name, LeadsDeptId, DeptName, DeptAddress⟩ | v1.professor(Name, LeadsDeptId) ∧ v1.department(LeadsDeptId, DeptName, DeptAddress)}

⁴This query contains an inequality. Since the constraints (views) do not contain any inequalities, the query may be decomposed into

q(StudName) ← q′(StudName, LDeptAddress, SDeptAddress), LDeptAddress ≠ SDeptAddress.

q′(StudName, LDeptAddress, SDeptAddress) ←
    v1.professor(AdvisorName, LDeptId),
    v1.department(LDeptId, LDeptName, LDeptAddress),
    v1.student(StudId, StudName),
    v1.phd_student(StudId, Area, AdvisorName),
    v1.works_in(StudName, SDeptId),
    v1.department(SDeptId, SDeptName, SDeptAddress).

Thus, our algorithms from Chapter 5 are sufficient for finding maximally contained rewritings of conjunctive queries with inequalities under sets of cind's without inequalities. Of course, the closure under composition of conjunctive queries is preserved when inequalities are introduced. As a consequence, the rewriting of q′ can be unfolded with q to obtain a maximally contained positive rewriting with inequalities.

By solving the problem of answering queries using views, the following equivalent rewriting of the input query can be found, in which all predicates of the logical schema have been replaced by materialized views.

q(StudName) ←
    m1(SId, StudName, A, ProfName, SDId, SDName, SDeptAddress),
    m2(ProfName, LDId, LDName, LDeptAddress),
    LDeptAddress ≠ SDeptAddress.

Note that the physical storage structures m1, m2 and m3 are not sufficient to fully cover the logical schema; for instance, faculty other than PhD students are not represented and "teaches" relationships are nowhere stored. Thus, additional physical structures would be needed in practice. □

It is easy to see that the problem of providing physical data independence is of wide practical importance. Note that in [TSI94], each physical storage structure is indexed over either a relation attribute or a ROW id (as a relational equivalent of object identifiers; since the work is presented in the light of a semantic data model, the term object id is used there as is, however). The query rewriting problem thus becomes the problem of answering queries using views with binding patterns, as discussed in Section 3.6. Binding patterns, however, are considered in a weak form – if no rewriting can be produced, binding patterns are ignored (equivalent to ignoring an index and scanning the whole relation or materialized view).

7.1.2 Versions of Logical Schemata

Let us now assume that logical schemata may evolve. For several reasons, it may be desirable not to rebuild storage structures each time schemata evolve.

• Physical storage structures (currently) need to be designed manually for optimizing performance⁵. This requires expert work, which often is not justified for a minor schema change that does not greatly affect the appropriateness of current physical storage structures.

• Materialized views may be very large and be accessed rarely, such that the cost of rebuilding physical structures relative to the cost of accessing them must not be assumed zero. This is for instance the case in very large (terabyte or petabyte) scientific repositories that are written only once – to tertiary storage, e.g. tape robots – and where individual data records are subsequently only accessed very sparingly. In that case it is worthwhile to leave physical structures unchanged whenever possible and define new versions of logical schemata relative to existing logical schema versions (as well as to the physical structures).

⁵According to [Lev00], this is an important area of future database research.

Figure 7.3: ER diagram (with is-a relationships) of the university domain (second version).

• Sometimes, data in physical storage structures must not be lost when new logical schema versions do not make use of them anymore. Reasons for that may be that a database may still be addressed under the old logical schema by certain applications or that there is reason to expect future schema versions to make use of these data again.

• Physical storage structures may be read-only or replicas of databases that are offline (e.g. in mobile or wide-area distributed applications).

Given that concepts in different schema versions may experience true shifts of meaning (concept mismatch), cind's present themselves as appropriate for encoding such inter-schema dependencies. We next give an example showing why query rewriting with cind's may be relevant in this context. A number of serious problems are left open, however, and are briefly summarized after this example, at the end of the section. The main assumption made is that queries over the logical schema may be translated into maximally contained positive queries (rather than equivalent conjunctive queries⁶) over the storage structures.

⁶Note that a query equivalent to a conjunctive query under a set of cind's must itself be a conjunctive query.

Example 7.1.2 Let us now define the following alterations to the logical schema v1 of Example 7.1.1. Professors are now members of the faculty. The university changes from a pure graduate school to also accommodate undergraduate students. Both masters and PhD students are replaced by a new category, graduate students. The two periods of masters studies cease to exist, but there is a new field, "major", for undergraduates. PhD research areas are represented by a logical relation research_interest, which is also used for managing the research areas of faculty. There is a new relation phd_program, which has its own key referenced by a new advises relationship with a professor. Not every professor leads a department anymore, so there is a new relation leads. The schema is again shown as an EER diagram in Figure 7.3.

v2.student(StudId, Name)
v2.undergraduate_student(StudId, Major)
v2.graduate_student(StudId)
v2.phd_program(Id, StudId)
v2.research_interest(Name, Area)
v2.advises(Advisor, PhdProgramId)
v2.faculty(Name)
v2.professor(Name)
v2.course(CourseId, Name, RequiredExam_CourseId, CurriculumName)
v2.teaches(Name, CourseId)
v2.exam_taken(StudId, CourseId, Date, Grade)
v2.department(DeptId, Name, Address)
v2.leads(ProfName, DeptId)
v2.works_in(FacName, DeptId)

We define the following cind’s and leave cind’s that map predicates whose meanings do not change from v1 to v2 as a (very) simple exercise for the reader.

{⟨StudId, StudName, ProfName, Area⟩ | ∃PhDProgramId :
    v2.student(StudId, StudName) ∧ v2.graduate_student(StudId) ∧
    v2.research_interest(StudName, Area) ∧ v2.advises(ProfName, PhDProgramId) ∧
    v2.phd_program(PhDProgramId, StudId)}
⊇ {⟨StudId, StudName, Advisor, Area⟩ | v1.phd_student(StudId, Area, Advisor) ∧ v1.student(StudId, StudName)}

{⟨Name, DeptId⟩ | v2.professor(Name) ∧ v2.faculty(Name) ∧ v2.works_in(Name, DeptId) ∧ v2.leads(Name, DeptId)}
⊇ {⟨Name, Leads_DeptId⟩ | v1.professor(Name, Leads_DeptId)}

v2.graduate_student(StudId) ← v1.masters_student(StudId, SecondPeriod).

With this second version of the logical schema, it is also necessary to define additional physical storage structures to accommodate new data such as undergraduate majors:

m4(StudId, StudName, Major) ←
    v2.student(StudId, StudName),
    v2.undergraduate_student(StudId, Major).

A subsequent third version of the logical schema could be defined using cind's relating to predicates of the previous versions as well as the physical storage structures. □

As mentioned, we have left a number of important aspects of the problem of maintaining physical data independence under schema evolution out of consideration. In the context of this problem, query rewriting usually aims at producing equivalent rather than maximally contained rewritings. If no equivalent one can be found, no rewriting at all is produced. Rewritings over physical storage structures are usually assumed to return the same results as the original queries over the logical schema. The problem of finding equivalent rewritings over cind's, however, entails cyclic sets of such constraints, for which we know that neither maximally contained nor equivalent rewritings can be computed in general.

There are two pragmatic solutions to this problem, apart from the obvious one of searching for an equivalent rewriting up to a time or memory consumption threshold. Firstly, one could define maximal containedness as the "correct" semantics. That way, results will be complete in the case that an equivalent rewriting exists, and logically still justified otherwise⁷. Alternatively, one could first compute the maximally contained rewriting of a query (over an acyclic set of cind's composed of containment rather than equivalence constraints) and then reverse the containment relationships in the cind's and test whether any of the conjunctive queries in the maximally contained rewriting contains the input query. This would be a sound but theoretically incomplete approach to producing maximally contained rewritings. In practice, however, it would probably coincide well with users' expectations. Note that this requires that each cind in the constraint base individually expresses an equivalence relationship, and positive queries such as seen in Example 7.1.2 (e.g. the disjoint partition of PhD and masters students) cannot be expressed⁸.

Another problem is related to propagating updates that are stated in terms of the logical schema into the appropriate storage structures. In the classical approach to maintaining physical data independence, where physical storage structures are defined as views over the logical schema, updating these structures is simple, as it reduces to simply refreshing the materialized views (the physical storage structures). Under our problem definition, however, a generalized version of the much more involved problem of updating views [BS81, FC85, AHV95] is faced (using the update information that is available in terms of the logical schema).

⁷Certainly, design flaws in the physical storage structures – which do not permit inserting data or answering certain queries although this should be possible from the point of view of the logical schema – are harder to debug if maximally contained rewritings still return nonempty results in cases where no equivalent rewritings exist.
⁸This would require a major change of framework.
Finally, an issue that we have left out of consideration is that it may be useful to have storage structures defined using binding patterns (weak, however, in the sense that if no rewriting obeying them can be found, the best such rewriting – according to some cost metric – that can be found should be chosen). That way, indexes are special cases of such storage structures where index keys are defined as bound [TSI94].

Equivalent Rewritings

An interesting technique for obtaining equivalent rewritings with cind's has not been discussed so far. It is based on the idea of reversing the process of computing the rewritings (i.e., "bottom-up" rather than "top-down" computation as in Algorithm 5.3.5). In the method for computing equivalent rewritings proposed in Chapter 5, one first attempts to obtain a contained rewriting and then to prove it equivalent. Alternatively, one could try to obtain a subsuming rewriting first and subsequently prove it to be contained in the input query. This is done as follows. Let Q be the conjunctive input query and C the set of Horn clauses obtained by normalizing the cind's. First, Q is frozen into a canonical database I in the tradition of Example 3.6.1. Next, the consequences of the logic program I ∪ C (where I is taken as a set of facts) are determined by bottom-up computation. If this computation reaches a fixpoint and an equivalent rewriting exists, then such a rewriting is among the queries that can be constructed from the frozen head of Q and subsets of the facts over source predicates in the fixpoint of the bottom-up computation, by undoing the freezing process⁹. An equivalent rewriting can be determined by another bottom-up derivation (this time in the "opposite" direction), as described in Example 5.3.1.

Example 7.1.3 Consider the query q(x, y) ← a(x, y). and the cind

{⟨x, y⟩ | a(x, y)} ≡ {⟨x, z⟩ | ∃y : b(x, y) ∧ c(y, z)}

and the source schema S = {b, c}. We freeze q into the facts base {a(αx, αy)} and combine it with the three Horn clauses that result from the normalization of the above cind. Bottom-up derivation results in the fixpoint

{a(αx, αy), b(αx, f(αx, αy)), c(f(αx, αy), αy)}

⁹That is, variables frozen into constants are again replaced by new variables, and so are function terms.

Only one query which satisfies the safety requirement can be constructed from the head of q and a subset of the fixpoint over predicates in S, which is

q′(x, y) ← b(x, z), c(z, y).

(z is the variable which replaces the function term f(αx, αy).) Thus, q′ ⊇ q. By the technique of Example 5.3.1, we discover that also q′ ⊆ q. Thus, q′ is an equivalent rewriting of q. □
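The bottom-up derivation of this example can be replayed with a small Python sketch. It is illustrative only: the two Skolemized Horn clauses of the a → b, c direction of the cind's normalization are hard-coded, the naive fixpoint loop is adequate only because this constraint set is acyclic, and none of the names come from the thesis implementation.

# Terms are constant names ("ax", "ay") or tuples ("f", t1, t2) for function terms.

def bottom_up(facts, rules):
    # Naive fixpoint computation: `rules` maps a body predicate to functions
    # that build a new fact from the argument tuple of a matching fact.
    db = set(facts)
    changed = True
    while changed:
        changed = False
        for pred, args in list(db):
            for make in rules.get(pred, []):
                new = make(args)
                if new not in db:
                    db.add(new)
                    changed = True
    return db

# Horn clauses from normalizing {<x,y> | a(x,y)} == {<x,z> | exists y: b(x,y) & c(y,z)}
# in the a -> b, c direction (f is the Skolem function for the existential y):
rules = {
    "a": [
        lambda a: ("b", (a[0], ("f", a[0], a[1]))),   # b(x, f(x,z)) <- a(x,z).
        lambda a: ("c", (("f", a[0], a[1]), a[1])),   # c(f(x,z), z) <- a(x,z).
    ]
}

# Freeze q(x,y) <- a(x,y) into the canonical database {a(ax, ay)}.
fixpoint = bottom_up({("a", ("ax", "ay"))}, rules)
# fixpoint now holds a(ax,ay), b(ax, f(ax,ay)), c(f(ax,ay), ay)

# Undo the freezing: constants become variables again and the function term
# becomes the new variable z; the facts over S = {b, c} form the body of q'.
unfreeze = {"ax": "x", "ay": "y", ("f", "ax", "ay"): "z"}
body = sorted((p, tuple(unfreeze[t] for t in args))
              for p, args in fixpoint if p in {"b", "c"})
print(body)   # [('b', ('x', 'z')), ('c', ('z', 'y'))], i.e. q'(x,y) <- b(x,z), c(z,y).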

If we can guarantee for a restricted class of queries and cind's that fixpoints are always reached for bottom-up derivations, we have a complete algorithm for computing equivalent rewritings that is guaranteed to terminate. One can guarantee this for instance by giving a sufficient condition that function terms with some top-level function symbol f computed during bottom-up derivation may never contain a subterm with the same symbol f. Such a condition is presented next. We first need to define typed conjunctive queries (see e.g. [AHV95]). Typed conjunctive queries follow the named perspective of relational algebra, i.e., each attribute of a relation has a name unique inside the relation. Formally, we assume that each relational schema R is defined using a set of attribute names A = {A1, ..., Am}, where the sort of each relation is a tuple of distinct attributes in A. Typed conjunctive queries are only allowed to contain equijoins, i.e., only joins between relations by attributes with the same name. If calculus notation is used for typed conjunctive queries, i.e., queries are of the form {⟨X̄⟩ | ∃Ȳ : ψ(X̄, Ȳ)}, variables functionally determine to which attributes in A they are bound throughout the entire query. Thus, given such a query Q, we can define a function Attr : Vars(Q) → A that maps each variable in Q to a single attribute name. For constants appearing in queries, one has to require an analogous typedness property. There is a distinct domain domi for each attribute Ai. For instance, note that in the boolean query q ← a(1, 1). the two constants must be assumed to be from different domains and thus different (as their attributes are of different types). The condition that we next state is in general too restrictive for data integration, but mirrors quite closely the natural semantics of schema evolution. We make use of the particularities of schema evolution, such as the intuitive notion of "adding" and "removing" relational attributes.
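The typedness condition just described can be checked mechanically. The following minimal Python sketch (illustrative names only; constants and their domains are ignored) verifies that every variable of a conjunctive query body is bound to a single attribute name throughout the query:

def is_typed(sorts, body):
    # sorts: dict mapping a predicate name to its tuple of attribute names;
    # body: list of (predicate, argument-variables) atoms.
    attr_of = {}
    for pred, args in body:
        for var, attr in zip(args, sorts[pred]):
            if attr_of.setdefault(var, attr) != attr:
                return False   # the variable joins two differently named attributes
    return True

For instance, assuming a hypothetical binary relation e with attributes ("From", "To"), the body [("e", ("x", "y")), ("e", ("y", "z"))] is rejected, because y would be bound both to To and to From.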

Proposition 7.1.4 Let Q be a typed conjunctive query and Σ a set of cind's of the form Q1 θ Q2 with θ ∈ {⊆, ⊇}, where both Q1 and Q2 are typed conjunctive queries. Furthermore, let Σ be layered, i.e., the predicates in Σ are partitioned into {R1, ..., Rn} such that for each Q1 θ Q2 in Σ, there is a 1 ≤ i ≤ n − 1 s.t. Preds(Q1) ∈ Ri and Preds(Q2) ∈ Ri+1. Let Preds(Q) ∈ Rn and R1 be

the sources. Furthermore, there is a partial one-to-one mapping φi : Ai → Ai+1 over attributes between each pair of adjacent schema layers Ri and Ri+1 which is defined as follows.

• For each pair of attributes A, B ∈ Ai for which φi(A) and φi(B) are defined, if φi(A) = φi(B) then A = B. Thus, φi⁻¹ is a partial function.

• Let R(X̄) ← R1(X̄1), ..., Rn(X̄n). be a Horn clause c from the normalization of Σ with R ∈ Ri and R1, ..., Rn ∈ Rj. Furthermore, let HAttrc(x) return the attribute to which a variable x is bound in the head of c (i.e., according to schema Ri) and let BAttrc(x) return the attribute to which x is bound in the body of c. Then, for each variable x ∈ X̄ (i.e., this excludes function terms containing variables), if j = i + 1 then φi(HAttrc(x)) = BAttrc(x), and if j = i − 1 then φi(BAttrc(x)) = HAttrc(x).

• Existential variables in cind's coincide exactly with the added or removed attributes, i.e., for a cind Q1 θ Q2 with Q1 over Ri and Q2 over Ri+1, for each attribute A of an existential variable in Q1, φi(A) is undefined, and for each attribute B of an existential variable in Q2, there is no attribute C s.t. φi(C) = B.

Then, bottom-up derivation of the normalization of Σ and the canonical database obtained by freezing Q is guaranteed to reach a fixpoint. Furthermore, the queries producible following our bottom-up algorithm for finding equivalent rewritings are again typed¹⁰. □

φi(A) is undefined iff attribute A ∈ Ai has been "removed" in the evolution from schema version Ri to schema version Ri+1. φi⁻¹(A) is undefined exactly if attribute A ∈ Ai+1 has been "added" in the evolution from Ri to Ri+1. For instance, Example 7.1.3 satisfies the property of Proposition 7.1.4. The schemata of Examples 7.1.1 and 7.1.2 are cyclic and thus cannot be typed. The above proposition can be easily shown by a dependency graph-based argument (using a dependency graph with attribute names as nodes) demonstrating that function terms with some top-level function symbol f constructed during bottom-up derivation may never contain a subterm with function symbol f, and are thus bounded in size.

7.2 Rewriting Recursive Queries

We have shown earlier (by Theorem 5.3.12) that the case of query rewriting with cyclic sets of cind's is undecidable. The case of finding a maximally contained rewriting of a recursive (datalog) query with respect to an acyclic set of cind's, on the other hand, can be solved in a straightforward way.

¹⁰Thus we also have an algorithm for proving the complementary containment result.

Figure 7.4: Fixpoint of the bottom-up derivation of Example 7.2.1.

The result is again a recursive datalog program. We use the technique from [DG97], originally defined for the problem of answering recursive queries using views, in a minor generalization – we work with an acyclic set of cind's rather than a single flat "layer" of views. We use the fact that for acyclic sets of cind's, function terms cannot grow beyond a certain finite depth during bottom-up derivation starting from the database. This depth is bounded by the total number of function symbols available. There is a unique finite set of all those Horn clauses whose head predicates appear in the recursive query to be rewritten and that only have subgoals that are materialized "source" predicates for which data are available¹¹. Let us, however, first take the perspective of query answering by bottom-up derivation, considering the combination of a set of (acyclic) cind's and a recursive query as a logic program. Clearly, large intermediate results are created (which

are constructed using function terms) that we want to avoid for efficiency reasons.

¹¹This set is computed by Algorithm 5.3.3 if we omit the part that tries to rewrite the input query with the unfolded Horn clauses that have been computed.

Example 7.2.1 Let there be the recursive query

q(x, y) ← e(x, y).
q(x, z) ← e(x, y), q(y, z).

which computes the transitive closure of the graph

⟨V = {v1 | ∃v2 : e(v1, v2)} ∪ {v2 | ∃v1 : e(v1, v2)}, E = e⟩ and the cind's

Σ = { {⟨x, z⟩ | ∃y : e(x, y) ∧ e(y, z)} ⊇ {⟨x, z⟩ | t(x, z)},
      {⟨u, w⟩ | ∃v : t(u, v) ∧ t(v, w)} ⊇ {⟨u, w⟩ | s(u, w)} }

where t logically represents chains of two edges and s is a source of chains of four edges. Assume now that we have the database I = {s(α, β), s(β, γ)}, where α, β, γ are constants, the nodes of our graph. By transforming Σ into normal form and performing bottom-up derivation, we obtain the fixpoint shown as a directed graph in Figure 7.4. There is a tuple in q for each arc in the graph¹². Those arcs that only belong to q are drawn as dotted lines. The result of the query is the set of arcs between non-function-term nodes, i.e. {⟨α, β⟩, ⟨β, γ⟩, ⟨α, γ⟩}. □
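For concreteness, the following Python sketch replays this bottom-up derivation (illustrative only: the normalized, Skolemized Horn clauses of Σ and the two rules of q are hard-coded in a step function, and the names are not from the thesis implementation). Starting from I = {s(α, β), s(β, γ)} it derives the t, e and q facts and then restricts q to the constants, reproducing {⟨α, β⟩, ⟨β, γ⟩, ⟨α, γ⟩}.

# Facts are triples (predicate, arg1, arg2); Skolem terms are nested tuples.

def step(db):
    new = set(db)
    for pred, a, b in db:
        if pred == "s":   # t(u, fv(u,w)) <- s(u,w).   t(fv(u,w), w) <- s(u,w).
            new |= {("t", a, ("fv", a, b)), ("t", ("fv", a, b), b)}
        if pred == "t":   # e(x, fy(x,z)) <- t(x,z).   e(fy(x,z), z) <- t(x,z).
            new |= {("e", a, ("fy", a, b)), ("e", ("fy", a, b), b)}
        if pred == "e":   # q(x,y) <- e(x,y).
            new.add(("q", a, b))
    for p1, a, b in db:   # q(x,z) <- e(x,y), q(y,z).
        for p2, c, d in db:
            if p1 == "e" and p2 == "q" and b == c:
                new.add(("q", a, d))
    return new

db = {("s", "alpha", "beta"), ("s", "beta", "gamma")}
while True:
    nxt = step(db)
    if nxt == db:
        break
    db = nxt

constants = {"alpha", "beta", "gamma"}
print(sorted((a, b) for p, a, b in db if p == "q" and a in constants and b in constants))
# [('alpha', 'beta'), ('alpha', 'gamma'), ('beta', 'gamma')]

The large number of intermediate facts over function terms that this naive evaluation builds up is exactly the overhead that the transformation described next avoids.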

It is possible to rewrite the cind's and the query into a single datalog query such that no function terms have to be introduced during query execution. This method is a straightforward generalization of the algorithm in [DG97] to Horn clauses that are the unfoldings of the normalized acyclic cind's, using Algorithm 5.3.3.

Example 7.2.2 Consider again q and Σ of the previous example. The unfolding of the normal form of Σ relative to the only EDB predicate e of the query is

e(x, fy(x, fv(x, y))) ← s(x, y).
e(fy(x, fv(x, y)), fv(x, y)) ← s(x, y).
e(fv(x, y), fy(fv(x, y), y)) ← s(x, y).
e(fy(fv(x, y), y), y) ← s(x, y).

We transform these into

e⟨1,fy(2,fv(3,4))⟩(x, x, x, y) ← s(x, y).
e⟨fy(1,fv(2,3)),fv(4,5)⟩(x, x, y, x, y) ← s(x, y).
e⟨fv(1,2),fy(fv(3,4),5)⟩(x, y, x, y, y) ← s(x, y).
e⟨fy(fv(1,2),3),4⟩(x, y, y, y) ← s(x, y).

12To save the figure from overload, the “q” arcs are not named, unlike the other arcs. 126 CHAPTER 7. OUTLOOK where the structure of the function terms produced is moved into the predicate names (e.g. eh1,fy(2,fv(3,4))i), where integers denote the index of the variable or constant in the head atom that corresponds to the position in the function term. The query is now transformed bottom-up, across possibly several iterations. The result of the first iteration is

q⟨1,fy(2,fv(3,4))⟩(x1, x2, x3, x4) ← e⟨1,fy(2,fv(3,4))⟩(x1, x2, x3, x4).
q⟨fy(1,fv(2,3)),fv(4,5)⟩(x1, x2, x3, x4, x5) ← e⟨fy(1,fv(2,3)),fv(4,5)⟩(x1, x2, x3, x4, x5).
q⟨fv(1,2),fy(fv(3,4),5)⟩(x1, x2, x3, x4, x5) ← e⟨fv(1,2),fy(fv(3,4),5)⟩(x1, x2, x3, x4, x5).
q⟨fy(fv(1,2),3),4⟩(x1, x2, x3, x4) ← e⟨fy(fv(1,2),3),4⟩(x1, x2, x3, x4).

for the first rule of q and

q⟨fy(fv(1,2),3),fy(4,fv(5,6))⟩(x1, x2, x3, x5, x6, x7) ←
    e⟨fy(fv(1,2),3),4⟩(x1, x2, x3, x4),
    q⟨1,fy(2,fv(3,4))⟩(x4, x5, x6, x7).

q⟨1,fv(2,3)⟩(x1, x5, x6) ←
    e⟨1,fy(2,fv(3,4))⟩(x1, x2, x3, x4),
    q⟨fy(1,fv(2,3)),fv(4,5)⟩(x2, x3, x4, x5, x6).

q⟨fy(1,fv(2,3)),fy(fv(4,5),6)⟩(x1, x2, x3, x6, x7, x8) ←
    e⟨fy(1,fv(2,3)),fv(4,5)⟩(x1, x2, x3, x4, x5),
    q⟨fv(1,2),fy(fv(3,4),5)⟩(x4, x5, x6, x7, x8).

q⟨fv(1,2),3⟩(x1, x2, x6) ←
    e⟨fv(1,2),fy(fv(3,4),5)⟩(x1, x2, x3, x4, x5),
    q⟨fy(fv(1,2),3),4⟩(x3, x4, x5, x6).

for the second rule. The latter four rules combine the four function-free rewritings of the unfolded Horn clauses with the rewritings of the first rule of q. In the subsequent iterations, the results of the previous iterations are combined. It would consume too much space to write down the full rewriting, which contains 8 more rules for q. A single one of them is the new top-level query goal,

q⟨1,2⟩(x1, x5) ← e⟨1,fy(2,fv(3,4))⟩(x1, x2, x3, x4), q⟨fy(1,fv(2,3)),4⟩(x2, x3, x4, x5).

Clearly, a number of optimizations over this naive transformation are possible¹³, for which we refer to [DG97]. □

This transformation can be easily automated and is applicable to all datalog queries.
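One central step of such an automation – moving the structure of the function terms in a clause head into the predicate name, as done in Example 7.2.2 – can be sketched as follows (illustrative only: terms are encoded as nested tuples, and the function name is not from the thesis implementation).

def flatten_head(pred, args):
    # Replace every variable position in the head by a running index and
    # collect the variables, in that order, as the new argument list.
    new_args, counter = [], [0]

    def walk(term):
        if isinstance(term, tuple):                    # function term ("fy", t1, ...)
            return term[0] + "(" + ",".join(walk(t) for t in term[1:]) + ")"
        counter[0] += 1                                # a variable position
        new_args.append(term)
        return str(counter[0])

    skeleton = ",".join(walk(a) for a in args)
    return pred + "<" + skeleton + ">", new_args

# e(x, fy(x, fv(x, y))) <- s(x, y).   becomes   e<1,fy(2,fv(3,4))>(x, x, x, y) <- s(x, y).
print(flatten_head("e", ["x", ("fy", "x", ("fv", "x", "y"))]))
# ('e<1,fy(2,fv(3,4))>', ['x', 'x', 'x', 'y'])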

¹³After all, this query is equivalent to {q(x, y) ← s(x, y). q(x, z) ← s(x, y), q(y, z).} Note that only four of the q⟨...⟩ predicates (q⟨1,2⟩, q⟨fy(1,fv(2,3)),4⟩, q⟨fv(1,2),3⟩, q⟨fy(fv(1,2),3),4⟩) created using this naive transformation are – taking a top-down perspective – reachable from the goal predicate q⟨1,2⟩, and rules containing others may be eliminated outright.

Chapter 8

Conclusions

The approach to data integration that has been proposed in this thesis has the following features:

• The infrastructure does not rely on a "global" integration schema as under LAV. Rather, several information systems each may need access to integrated data from other information systems. Integration schemata may lack sophistication or even any special preparation for source integration.

• Integration schemata may contain both materialized database relations and purely logical predicates, for which data have to be provided by means of data integration.

• Our approach provides good support for the creation and maintenance of mappings between information systems under frequent change. This includes good decoupling of information systems through the mappings used for integration, such that the workload imposed on the knowledge engineer who maintains mappings when change occurs is as small as possible. The approach at the same time permits mappings to be designed in a reasonably natural way, thus simplifying the modeling and mapping work and at the same time enabling the designer to express intuitions that may be useful for anticipating future change.

• The data integration reasoning is carried out globally, declaratively, and uses an intuitive and accessible semantics. Mappings between several information systems are transitive, which reduces the amount of redundant mapping work that has to be done.

• Conjunctive inclusion dependencies as inter-schema constraints make it possible to deal with concept mismatch in a wide sense. This is a necessary condition for being able to deal with autonomous and changing integration schemata.


We have pointed out that data integration with multiple unsophisticated evolving integration schemata is a problem of high relevance¹ that has been insufficiently addressed so far. None of the previous work seems to be directly suitable. Apart from management problems with respect to schemata and mappings similar to those known from federated and multidatabases, we are confronted with kinds of schema mismatch that require very expressive interschema constraints.

We have presented an approach based on model management and query rewriting with expressive constraints and have discussed an architecture (Chapter 4), model management operations (Chapter 6), and the issue of query rewriting (Chapter 5), a problem at the core of data integration. We have argued that our approach supports the management of the integration infrastructure by allowing the modeling of mappings in a natural way and the decoupling of schemata and mappings such that maintenance under change is simplified.

The practical feasibility of our approach has been shown in part by the implementation of the CindRew system based on the results of this thesis, and by the benchmarks of Section 5.5. For the other part – model management – our presentation was based on elementary intuitions of managing large systems that have been widely verified and have permeated mainstream computer science thinking.

Much recent work in data integration has focussed either on procedural or on highly structured declarative approaches meant to combine sufficient expressive power with decidability (which we cannot guarantee for our approach in its most general form). We have taken another direction, encoding a highly intuitive class of constraints² and providing theoretical results and an implementation for sound best-effort query rewriting, with the intuition that practical data integration problems will often be completely solved. We have also discussed a very important class (acyclic sets of constraints) for which we can guarantee completeness. We believe this work may be of quite immediate practical usefulness.

Plenty of material for further research has been provided in Chapter 7. A successor project to the research that led to this thesis could be an effort to develop an integrated model management and query rewriting system based on the results presented here, but built on an object-oriented data model. Such a system could be of immediate usefulness to scientific communities such as that of high energy physics. Our query rewriting approach has an acceptability advantage compared to other data integration approaches applicable to the setting of large scientific collaborations (see Section 1.3). This is particularly true when it comes to data integration on the Grid [FK98] with its extremely large data volumes. Having stated this, we deem this work also a practical success, with a clear benefit to the host of this PhD program, CERN.

1 The relevance of this work has been sufficiently argued for in Section 1.3 and Section 1.5, and we will not reiterate this here.
2 We use conjunctive queries both in constraints and as targets for rewriting. When put into a syntax such as select-from-where queries or tableau queries [Ull88, Ull89, AHV95], conjunctive queries can be mastered by many non-expert users.

Bibliography

[AB88] Serge Abiteboul and Catriel Beeri. “On the Power of Languages for the Manipulation of Complex Objects”. Technical Report TR 846, INRIA, 1988.

[ABD+96] Daniel E. Atkins, William P. Birmingham, Edmund H. Durfee, Eric J. Glover, Tracy Mullen, Elke A. Rundensteiner, Elliot Soloway, José M. Vidal, Raven Wallace, and Michael P. Wellman. “Toward Inquiry-Based Education Through Interacting Software Agents”. IEEE Computer, 29(5):69–76, May 1996.

[Abi97] Serge Abiteboul. “Querying Semistructured Data”. In Proc. ICDT’97, Delphi, Greece, 1997.

[ABS00] Serge Abiteboul, Peter Buneman, and Dan Suciu. Data on the Web. Morgan Kaufmann Publishers, 2000.

[ABU79] Alfred V. Aho, Catriel Beeri, and Jeffrey D. Ullman. “The Theory of Joins in Relational Databases”. ACM Transactions on Database Systems, 4(3):297–314, 1979.

[ACPS96] Sibel Adali, K. Selçuk Candan, Yannis Papakonstantinou, and V. S. Subrahmanian. “Query Caching and Optimization in Distributed Mediator Systems”. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD’96), pages 137–146, Montreal, Canada, June 1996.

[AD98] Serge Abiteboul and Oliver M. Duschka. “Complexity of Answering Queries Using Materialized Views”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 1998, pages 254–263, 1998.

[Age] UMBC Agents Mailing List Archive http://agents.umbc.edu/agentslist/archive/.

[AHV95] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995.

129 130 BIBLIOGRAPHY

[AK92] Yigal Arens and Craig A. Knoblock. “Planning and Reformulating Queries for Semantically-Modeled Multidatabase Systems”. In Pro- ceedings of the First International Conference on Information and Knowledge Management (CIKM’92), Baltimore, MD USA, 1992.

[AK98] Serge Abiteboul and Paris C. Kanellakis. “Object Identity as a Query Language Primitive”. Journal of the ACM, 45(5):798–842, September 1998.

[AQM+97] Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and Janet L. Wiener. “The Lorel Query Language for Semistruc- tured Data”. International Journal on Digital Libraries, 1(1):68–88, 1997.

[AS99] Albert Alderson and Hanifa Shah. “Viewpoints on Legacy Sys- tems”. Communications of the ACM, 42(3):115–116, 1999.

[AV97a] Serge Abiteboul and Victor Vianu. “Queries and Computation on the Web”. In Proc. ICDT’97, 1997.

[AV97b] Serge Abiteboul and Victor Vianu. “Regular Path Queries with Constraints”. In Proceedings of the ACM SIGACT-SIGMOD- SIGART Symposium on Principles of Database Systems, May 11– 15, 1997, Tucson, AZ USA, 1997.

[BB99] Philip A. Bernstein and Thomas Bergstraesser. “Meta-Data Support for Data Transformations Using Microsoft Repository”. IEEE Data Engineering Bulletin, 22(1):9–14, March 1999.

[BBB+97] Roberto J. Bayardo Jr., William Bohrer, Richard S. Brice, Andrzej Cichocki, Jerry Fowler, Abdelsalam Helal, Vipul Kashyap, Tomasz Ksiezyk, Gale Martin, Marian H. Nodine, Mosfeq Rashid, Marek Rusinkiewicz, Ray Shea, C. Unnikrishnan, Amy Unruh, and Darrell Woelk. “InfoSleuth: Agent-Based Semantic Integration of Information in Open and Dynamic Environments”. In J. Peckham, editor, Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data (SIGMOD’97), pages 195–206, Tucson, AZ USA, May 1997. ACM Press.

[BBMR89a] Alexander Borgida, Ronald J. Brachman, Deborah L. McGuinness, and Lori A. Resnick. “CLASSIC: A Structural Data Model for Objects”. In Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data (SIGMOD’89), pages 59–67, June 1989.

[BBMR89b] Ronald J. Brachman, Alex Borgida, Deborah L. McGuinness, and Lori A. Resnick. “The CLASSIC Knowledge Representation System, or, KL-ONE: The Next Generation”. In Preprints of Workshop on Formal Aspects of Semantic Networks, Santa Catalina Island, CA USA, February 1989.

[BD99] Alex Borgida and Prem Devanbu. “Adding more DL to IDL: Towards more Knowledgeable Component Inter-operability”. In Proc. of ICSE’99, 1999.

[BDBW97] Jeffrey M. Bradshaw, Stewart Dutfield, Pete Benoit, and John D. Woolley. “KAoS: Toward an Industrial-strength Open Agent Architecture”. In Jeffrey M. Bradshaw, editor, Software Agents, pages 375–418. AAAI/MIT Press, 1997.

[BDHS96] Peter Buneman, Susan Davidson, Gerd Hillebrand, and Dan Suciu. “A Query Language and Optimization Techniques for Unstructured Data”. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD’96), 1996.

[BF97] Avrim L. Blum and Merrick L. Furst. “Fast Planning Through Planning Graph Analysis”. Artificial Intelligence, 90:281–300, 1997.

[BH91] Franz Baader and Bernhard Hollunder. “KRIS: Knowledge Representation and Inference System”. SIGART Bulletin, 2(3):8–14, 1991.

[BLN86] Carlo Batini, Maurizio Lenzerini, and Shamkant B. Navathe. “A Comparative Analysis of Methodologies for Database Schema Integration”. ACM Computing Surveys, 18:323–364, 1986.

[BLP00] Philip A. Bernstein, Alon Y. Levy, and Rachel A. Pottinger. “A Vision for Management of Complex Models”. Technical Report 2000-53, Microsoft Research, 2000.

[BLR97] Catriel Beeri, Alon Y. Levy, and Marie-Christine Rousset. “Rewriting Queries Using Views in Description Logics”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 11–15, 1997, Tucson, AZ USA, pages 99–108, 1997.

[BM93] Elisa Bertino and Lorenzo Martino. Object-oriented Database Systems - Concepts and Architectures. Addison-Wesley, 1993.

[Bor95] Alexander Borgida. “Description Logics in Data Management”. IEEE Transactions on Knowledge and Data Engineering, 7(5):671–682, October 1995.

[BPGL85] Ronald J. Brachman, V. Pigman Gilbert, and Hector J. Levesque. “An Essential Hybrid Reasoning System: Knowledge and Symbol Level Accounts in KRYPTON”. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’85), pages 532– 539, 1985.

[BPS94] Alexander Borgida and Peter F. Patel-Schneider. “A Semantics and Complete Algorithm for Subsumption in the CLASSIC Description Logic”. Journal of Artificial Intelligence Research, 1:277–308, 1994.

[Bra83] Ronald J. Brachman. “What IS-A is and isn’t: An Analysis of Taxonomic Links in Semantic Networks”. IEEE Computer, 16(10), October 1983.

[BS81] François Bancilhon and Nicolas Spyratos. “Update Semantics of Relational Views”. ACM Transactions on Database Systems, 6(4):557–575, December 1981.

[BS85] Ronald J. Brachman and James G. Schmolze. “An Overview of the KL-ONE Knowledge Representation System”. Cognitive Science, 9(2):171–216, 1985.

[CBB+97] R. G. G. Cattell, Douglas K. Barry, Mark Berler, Jeff Eastman, David Jordan, Craig Russell, Olaf Schadow, Torsten Stanienda, and Fernando Velez. The Object Database Standard: ODMG 2.0. Mor- gan Kaufmann, 1997.

[CDL98a] Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. “On the Decidability of Query Containment under Constraints”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 1998, pages 149–158, 1998.

[CDL+98b] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Daniele Nardi, and Riccardo Rosati. “Information Integration: Con- ceptual Modeling and Reasoning Support”. In Proc. CoopIS’98, pages 280–291, 1998.

[CDL99] Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. “Answering Queries using Views in Description Logics”. In Proc. of the 1999 Description Logic Workshop (DL’99), CEUR Workshop Proceedings, Vol. 22, pages 9–13, 1999.

[CDLV99] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. “Rewriting of Regular Expressions and Regular Path Queries”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 1999, pages 194–204, Philadelphia, PA USA, 1999.

[CDLV00a] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. “Answering Regular Path Queries Using Views”. In Proceedings of the IEEE International Conference on Data En- gineering (ICDE 2000), pages 389–398, 2000.

[CDLV00b] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. “View-Based Query Processing for Regular Path Queries with Inverse”. In Proceedings of the ACM SIGACT- SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 2000, pages 58–66, Dallas, TX USA, 2000.

[CH80] Ashok K. Chandra and David Harel. “Computable Queries for Relational Data Bases”. Journal of Computer and System Sciences, 21(2):156–178, 1980.

[CH82] Ashok K. Chandra and David Harel. “Structure and Complexity of Relational Queries”. Journal of Computer and System Sciences, 25(1):99–128, 1982.

[Cha88] Ashok K. Chandra. “Theory of Database Queries”. In Proceed- ings of the 7th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’88), pages 1–9. ACM Press, 1988.

[Che76] Peter Pin-Shan Chen. “The Entity-Relationship Model – Toward a Unified View of Data”. ACM Transactions on Database Systems, 1(1):9–36, March 1976.

[CHS+95] Michael J. Carey, Laura M. Haas, Peter M. Schwarz, Manish Arya, William F. Cody, Ronald Fagin, Myron Flickner, Allen W. Lu- niewski, Wayne Niblack, Dragutin Petkovic, John Thomas, John H. Williams, and Edward L. Wimmers. “Towards Heterogeneous Mul- timedia Information Systems: The Garlic Approach”. In Proceed- ings of the Fifth International Workshop on Research Issues in Data Engineering: Distributed Object Management (RIDE-DOM’95), 1995.

[CJ96] D. Cockburn and Nicholas R. Jennings. “ARCHON: A Distributed Artificial Intelligence System for Industrial Applications”. In G. M. P. O’Hare and N. R. Jennings, editors, Foundations of Distributed Artificial Intelligence, pages 319–344. Wiley, 1996.

[CKM91] Jaime G. Carbonell, Craig A. Knoblock, and Steven Minton. “PRODIGY: An Integrated Architecture for Planning and Learn- ing”. In Kurt VanLehn, editor, Architectures for Intelligence, pages 241–278. Lawrence Erlbaum, Hillsdale, NJ USA, 1991.

[CKPS95] Surajit Chaudhuri, Ravi Krishnamurthy, Spyros Potamianos, and Kyuseok Shim. “Optimizing Queries with Materialized Views”. In Proceedings of the 11th IEEE International Conference on Data En- gineering (ICDE’95), 1995.

[CKW89] Weidong Chen, Michael Kifer, and David S. Warren. “HiLog: A Foundation for Higher-Order Logic Programming”. Technical re- port, Dept. of CS, SUNY at Stony Brook, 1989.

[CM77] Ashok K. Chandra and Philip M. Merlin. “Optimal Implementation of Conjunctive Queries in Relational Data Bases”. In Conference Record of the Ninth Annual ACM Symposium on Theory of Com- puting (STOC’77), pages 77–90, Boulder, CO USA, May 1977.

[CM90] Mariano P. Consens and Alberto O. Mendelzon. “GraphLog: a Visual Formalism for Real Life Recursion”. In Proceedings of the 9th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’90), 1990.

[CMS95] “CMS Technical Proposal”, January 1995.

[Cod70] E. F. Codd. “A Relational Model of Data for Large Shared Data Banks”. Communications of the ACM, 13(6):377–387, June 1970.

[Coo] International Conferences on Cooperative Information Systems, 1996–2001.

[COZ00] Paolo Ciancarini, Andrea Omicini, and Franco Zambonelli. “Multi- agent System Engineering: the Coordination Viewpoint”. In Intel- ligent Agents VI – Proceedings of the 6th International Workshop on Agent Theories, Architectures, and Languages (ATAL’99), LNAI Series, Vol. 1767. Springer Verlag, February 2000.

[Cro94] Kevin Crowston. “A Taxonomy Of Organizational Dependencies and Coordination Mechanisms”. Technical Report 174, MIT Centre for Coordination Science, Cambridge, MA USA, 1994.

[CS93] Surajit Chaudhuri and Kyuseok Shim. “Query Optimization in the Presence of Foreign Functions”. In Proceedings of the 19th International Conference on Very Large Data Bases (VLDB’93), Dublin, Ireland, 1993.

[CTP00] Peter Clark, J. Thompson, and Bruce Porter. “Knowledge Pat- terns”. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning (KR’2000), 2000.

[CV92] Surajit Chaudhuri and Moshe Y. Vardi. “On the Equivalence of Re- cursive and Nonrecursive Datalog Programs”. In Proceedings of the 11th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’92), pages 55–66, 1992.

[CV94] Surajit Chaudhuri and Moshe Y. Vardi. “On the Complexity of Equivalence between Recursive and Nonrecursive Datalog Pro- grams”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 1994, pages 107–116, Minneapolis, MN USA, May 1994.

[CV97] Surajit Chaudhuri and Moshe Y. Vardi. “On the Equivalence of Re- cursive and Nonrecursive Datalog Programs”. Journal of Computer and System Sciences, 54(1):61–78, 1997.

[Cyc] Cycorp. “Features of CycL”. http://www.cyc.com/cycl.html.

[Dec95] Keith Decker. “TAEMS: A Framework for Environment Centered Analysis and Design of Coordination Mechanisms”. In G. O’Hare and Nicholas Jennings, editors, Foundations of Distributed Artificial Intelligence, chapter 16, pages 429–448. Wiley Inter-Science, 1995.

[DEGV] Evgeny Dantsin, Thomas Eiter, Georg Gottlob, and Andrei Voronkov. “Complexity and Expressive Power of Logic Program- ming”. To appear in ACM Computing Surveys.

[DG97] Oliver M. Duschka and Michael R. Genesereth. “Answering Recursive Queries using Views”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 11–15, 1997, Tucson, AZ USA, 1997.

[DGL00] Oliver M. Duschka, Michael R. Genesereth, and Alon Y. Levy. “Re- cursive Query Plans for Data Integration”. Journal of Logic Pro- gramming, 43(1):49–73, 2000.

[DJ90] Nachum Dershowitz and Jean-Pierre Jouannaud. “Rewrite Sys- tems”. In Jan van Leeuwen, editor, Handbook of Theoretical Com- puter Science, volume 2, chapter 6, pages 243–320. Elsevier Science Publishers B.V., 1990.

[DL91] Edmund H. Durfee and Victor R. Lesser. “Partial Global Planning: A Coordination Framework for Distributed Hypothesis Formation”. IEEE Transactions on Systems, Man, and Cybernetics (Special Issue on Distributed Sensor Networks), 21(5):1167–1183, September 1991.

[DL92] Keith Decker and Victor Lesser. “Generalizing The Partial Global Planning Algorithm”. International Journal on Intelligent Cooper- ative Information Systems, 1(2):319–346, 1992.

[DL95] Keith Decker and Victor Lesser. “Designing a Family of Coordina- tion Algorithms”. In Proceedings of the First International Confer- ence on Multiagent Systems (ICMAS’95), San Francisco, CA USA, June 1995. AAAI Press.

[DL97a] Giuseppe De Giacomo and Maurizio Lenzerini. “A Uniform Frame- work for Concept Definitions in Description Logics”. Journal of Artificial Intelligence Research (JAIR), 6:87–110, 1997.

[DL97b] Oliver M. Duschka and Alon Y. Levy. “Recursive Plans for Infor- mation Gathering”. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI’97), Nagoya, Japan, August 1997.

[DLNS96] Francesco Donini, Maurizio Lenzerini, Daniele Nardi, and Andrea Schaerf. “Reasoning in Description Logics”. In G. Brewka, editor, Principles of Knowledge Representation and Reasoning, Studies in Logic, Language and Information, pages 193–238. CLSI Publica- tions, 1996.

[DLNS98] Francesco M. Donini, Maurizio Lenzerini, Daniele Nardi, and An- drea Schaerf. “AL-log: Integrating Datalog and Description Log- ics”. Journal of Intelligent Information Systems, 10:227–252, 1998.

[DS83] Randall Davis and Reid G. Smith. “Negotiation as a Metaphor for Distributed Problem Solving”. Artificial Intelligence, 20(1):63–109, January 1983.

[DSW97] Keith Decker, Katia Sycara, and Mike Williamson. “Middle-Agents for the Internet”. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI’97), Nagoya, Japan, 1997.

[DSW+99] A. J. Duineveld, R. Stoter, M. R. Weiden, B. Kenepa, and V. R. Benjamins. “Wondertools? A Comparative Study of Ontological Engineering Tools”. In Proc. Twelfth Workshop on Knowledge Acquisition, Modeling and Management (KAW’99), Banff, Alberta, Canada, October 1999.

[DV97] Evgeny Dantsin and Andrei Voronkov. “Complexity of Query Answering in Logic Databases with Complex Values”. In LFCS’97, LNCS 1234, pages 56–66, 1997.

[Etz96] Oren Etzioni. “Moving Up the Information Food Chain: Deploying Softbots on the World Wide Web”. In Proc. AAAI’96, 1996.

[FC85] Antonio L. Furtado and Marco A. Casanova. “Updating Relational Views”. In W. Kim, D.S. Reiner, and D.S. Batory, editors, Query Processing in Database Systems. Springer-Verlag, Berlin, 1985.

[FFKL98] Mary Fernandez, Daniela Florescu, Jaewoo Kang, and Alon Levy. “Catching the Boat with Strudel: Experiences with a Web-Site Management System”. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD’98), pages 414–425, Seattle, WA USA, June 1998.

[FFMM94] Tim Finin, Richard Fritzson, Don McKay, and Robin McEntire. “KQML as an Agent Communication Language”. In Proceedings of the Third International Conference on Information and Knowledge Management (CIKM’94). ACM Press, November 1994.

[FFR96] Adam Farquhar, Richard Fikes, and James Rice. “The Ontolingua Server: a Tool for Collaborative Ontology Construction”. In Proceedings of 10th Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW96), Banff, Canada, 1996.

[FK98] Ian Foster and Carl Kesselman, editors. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, San Francisco, CA USA, July 1998.

[FL97] Tim Finin and Yannis Labrou. “A Proposal for a new KQML Specification”. Technical Report CS-97-03, Computer Science and Electrical Engineering Department, University of Maryland Baltimore County, Baltimore, MD 21250, February 1997.

[FLM98] Daniela Florescu, Alon Levy, and Alberto Mendelzon. “Database Techniques for the World-Wide Web: A Survey”. SIGMOD Record, 27(3):59–74, 1998.

[FMU82] Ronald Fagin, Alberto O. Mendelzon, and Jeffrey D. Ullman. “A Simplified Universal Relation Assumption and its Properties”. ACM Transactions on Database Systems, 7(3):343–360, 1982.

[FN71] Richard Fikes and Nils J. Nilsson. “STRIPS: A new Approach to the Application of Theorem Proving to Problem Solving”. Artificial Intelligence, 2(3/4), 1971.

[FN00] Enrico Franconi and Gary Ng. “The ICOM Tool for Intelligent Conceptual Modelling”. In Proc. 7th Intl. Workshop on Knowledge Representation meets Databases (KRDB’00), Berlin, Germany, Au- gust 2000.

[FNPB99] Jerry Fowler, Marian Nodine, Brad Perry, and Bruce Bargmeyer. “Agent-based Semantic Interoperability in InfoSleuth”. SIGMOD Record, 28(1):60–67, 1999.

[Fra99] Enrico Franconi, 1999. Description Logics Course Web Page. Avail- able at http://www.cs.man.ac.uk/∼franconi/dl/course/.

[FRV95] Daniela Florescu, Louiqa Raschid, and Patrick Valduriez. “Using Heterogeneous Equivalences for Query Rewriting in Multidatabase Systems”. In Proc. CoopIS’95, pages 158–169, 1995.

[FVR96] Daniela Florescu, Patrick Valduriez, and Louiqa Raschid. “An- swering Queries Using OQL View Expressions”. In Workshop on Materialized Views in Cooperation with ACM SIGMOD, 1996.

[GEW96] Keith Golden, Oren Etzioni, and Dan Weld. “Planning with Exe- cution and Incomplete Information”. Technical Report UW-CSE- 96-01-09, Department of Computer Science and Engineering, Uni- versity of Washington, Seattle, February 1996.

[GF92] Michael R. Genesereth and Richard E. Fikes. “Knowledge Inter- change Format, Version 3.0 Reference Manual”. Technical Re- port Logic-92-1, Computer Science Department, Stanford Univer- sity, 1992.

[GG95] Nicola Guarino and Pierdaniele Giaretta. “Ontologies and Knowl- edge Bases: Towards a Terminological Clarification”. In N. J. I. Mars, editor, Towards Very Large Knowledge Bases. IOS Press, 1995.

[GHB99] Mark Greaves, Heather Holmback, and Jeffrey M. Bradshaw. “What is a Conversation Policy?”. In Mark Greaves and Jeffrey M. Bradshaw, editors, Proceedings of the Autonomous Agents’99 Work- shop on Specifying and Implementing Conversation Policies, pages 1–9, Seattle, WA USA, May 1999.

[GHJV94] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns. Elements of Reusable Object-Oriented Software. Addison Wesley Professional Computing Series, October 1994.

[GJ79] Michael R. Garey and David S. Johnson. Computers and Intractabil- ity: A Guide to the Theory of NP-Completeness. W.H. Freeman & Co., 1979.

[GK94] Michael R. Genesereth and Steven P. Ketchpel. “Software Agents”. Communications of the ACM, 37(7):48–53, 1994.

[GKD97] Michael R. Genesereth, Arthur M. Keller, and Oliver M. Duschka. “Infomaster: An Information Integration System”. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data (SIGMOD’97), pages 539–542, 1997.

[GMLY98] Hector Garcia-Molina, Wilburt Labio, and Jun Yang. “Expiring Data in a Warehouse”. In Proceedings of the 1998 International Conference on Very Large Data Bases (VLDB’98), 1998. Extended version as Technical Report 1998-35, Stanford Database Group.

[GMPQ+97] Hector Garcia-Molina, Yannis Papakonstantinou, Dallan Quass, Anand Rajaraman, Yehoshua Sagiv, Jeffrey D. Ullman, Vasilis Vas- salos, and Jennifer Widom. “The TSIMMIS Approach to Mediation: Data Models and Languages”. Journal of Intelligent Information Systems, 8(2):117–132, 1997.

[GN87] Michael R. Genesereth and Nils J. Nilsson. Logical Foundations of Artificial Intelligence. Morgan Kaufmann Publishers, 1987.

[Gru] Thomas R. Gruber. “What is an Ontology?”. http://www-ksl.stanford.edu/kst/what-is-an-ontology.html.

[Gru92] Thomas R. Gruber. “Ontolingua: A Mechanism to Support Portable Ontologies”. Technical Report KSL-91-66, Stanford Uni- versity, Knowledge Systems Laboratory, March 1992.

[Gru93a] Thomas R. Gruber. “A Translation Approach to Portable Ontology Specifications”. Technical Report KSL-92-71, Stanford University, Knowledge Systems Laboratory, April 1993.

[Gru93b] Thomas R. Gruber. “Toward Principles for the Design of Ontolo- gies Used for Knowledge Sharing”. Technical Report KSL 93-04, Knowledge Systems Laboratory, Stanford University, 1993.

[Gua94] Nicola Guarino. “The Ontological Level”. In B. Smith, R. Casati, and G. White, editors, Philosophy and the Cognitive Sciences, Vienna. Hölder-Pichler-Tempsky, 1994. Invited paper presented at IV Wittgenstein Symposium, Kirchberg, 1993.

[Gua97] Nicola Guarino. “Understanding, Building, and Using Ontologies. A Commentary to ‘Using Explicit Ontologies in KBS Development’, by van Heijst, Schreiber, and Wielinga”. International Journal of Human and Computer Studies, 46(2/3):293–310, 1997.

[GW00a] Nicola Guarino and Christopher A. Welty. “Identity, Unity, and Individuality: Towards a Formal Toolkit for Ontological Analysis”. In Proceedings of the European Conference on Artificial Intelligence (ECAI-2000). IOS Press, August 2000.

[GW00b] Nicola Guarino and Christopher A. Welty. “Ontological Analysis of Taxonomic Relationships”. In International Conference on Con- ceptual Modeling (ER 2000), pages 210–224, 2000.

[Hal00] Alon Y. Halevy. “Theory of Answering Queries Using Views”. SIGMOD Record, 29(4), December 2000.

[HGB99] Heather Holmback, Mark Greaves, and Jeffrey Bradshaw. “Agent A, Can You Pass the Salt? The Role of Pragmatics in Agent Communications”, May 1999. Submitted to Autonomous Agents’99.

[HK93] Chun-Nan Hsu and Craig A. Knoblock. “Reformulating Query Plans for Multidatabase Systems”. In Proc. of the Second Inter- national Conference on Information and Knowledge Management (CIKM’93), pages 423–432, Washington, DC USA, 1993.

[HM85] Dennis Heimbigner and Dennis McLeod. “A Federated Architecture for Information Management”. ACM Transactions on Office Information Systems, 3(3):253–278, July 1985.

[HM00] Volker Haarslev and Ralf Möller. “Expressive ABox Reasoning with Number Restrictions, Role Hierarchies, and Transitively Closed Roles”. In Fausto Giunchiglia and Bart Selman, editors, Proceedings of Seventh International Conference on Principles of Knowledge Representation and Reasoning (KR’2000), Breckenridge, CO USA, April 2000.

[Hor98] Ian Horrocks. “Using an Expressive Description Logic: FaCT or Fiction?”. In A. G. Cohn, L. Schubert, and S. C. Shapiro, editors, Principles of Knowledge Representation and Reasoning: Proceed- ings of the Sixth International Conference (KR’98), pages 636–647. Morgan Kaufmann Publishers, June 1998.

[HS97] M. Huhns and M. P. Singh. “Ontologies for Agents”. E-commerce, IEEE Internet Computing, 1(6):81–83, November–December 1997.

[HU79] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, Reading, MA USA, 1979.

[JCL+96] N. R. Jennings, J. Corera, I. Laresgoiti, E. H. Mamdani, F. Perriolat, P. Skarek, and L. Z. Varga. “Using ARCHON to Develop Real-world DAI Applications for Electricity Transportation Management and Particle Accelerator Control”. IEEE Expert, 11(6), 1996.

[Jen99] Nicholas R. Jennings. “Agent-based Computing: Promise and Perils”. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’99), Stockholm, Sweden, 1999. Morgan Kaufmann Publishers.

[JFJ+96] N. R. Jennings, P. Faratin, M. J. Johnson, T. J. Norman, P. O’Brien, and M. E. Wiegand. “Agent-based Business Process Management”. International Journal of Cooperative Information Systems, 5(2 and 3):105–130, 1996.

[JGJ+95] Matthias Jarke, Rainer Gallersdörfer, Manfred A. Jeusfeld, Martin Staudt, and Stefan Eherer. “ConceptBase – A Deductive Object Base for Meta Data Management”. Journal of Intelligent Information Systems, Special Issue on Advances in Deductive Object-Oriented Databases, 4(2):167–192, 1995.

[JLVV00] Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, and Panos Vassiliadis. Fundamentals of Data Warehouses. Springer-Verlag, 2000.

[JNF98] Nicholas R. Jennings, Timothy J. Norman, and Peyman Faratin. “ADEPT: An Agent-based Approach to Business Process Management”. ACM SIGMOD Record, 27(4):32–39, 1998.

[Joh90] David S. Johnson. “A Catalog of Complexity Classes”. In Jan van Leeuwen, editor, Handbook of Theoretical Computer Science, volume 1, chapter 2, pages 67–161. Elsevier Science Publishers B.V., 1990.

[JW00] Nicholas R. Jennings and Michael Wooldridge. “Agent-Oriented Software Engineering”. In Jeffrey Bradshaw, editor, Handbook of Agent Technology. AAAI/MIT Press, 2000.

[Kan90] Paris C. Kanellakis. “Elements of Relational Database Theory”. In Jan van Leeuwen, editor, Handbook of Theoretical Computer Science, volume 2, chapter 17, pages 1074–1156. Elsevier Science Publishers B.V., 1990.

[KDB98] Anthony Kosky, Susan Davidson, and Peter Buneman. “Semantics of Database Transformations”. In L. Libkin and B. Thalheim, edi- tors, Semantics of Databases. Springer LNCS 1358, February 1998.

[Kim95] Won Kim, editor. Modern Database Systems: The Object Model, Interoperability, and Beyond. Addison-Wesley, 1995.

[KJ99] Susan Kalenka and Nicholas R. Jennings. “Socially Responsible Decision Making by Autonomous Agents”. In K. Korta, E. Sosa, and X. Arrazola, editors, Cognition, Agency and Rationality, pages 135–149. Kluwer, 1999.

[KL89] Michael Kifer and Georg Lausen. “F-Logic: A Higher-Order Lan- guage for Reasoning about Objects, Inheritance, and Scheme”. In Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data (SIGMOD’89), pages 134–146, Portland, OR USA, 1989.

[Klu88] Anthony Klug. “On Conjunctive Queries Containing Inequalities”. Journal of the ACM, 35(1):146–160, January 1988.

[KS92] Henry Kautz and Bart Selman. “Planning as Satisfiability”. In Pro- ceedings of the 10th European Conference on Artificial Intelligence (ECAI’92), Vienna, Austria, August 1992.

[KW96] Chung T. Kwok and Daniel S. Weld. “Planning to Gather Informa- tion”. In Proc. AAAI’96, Portland, OR USA, August 1996.

[Lev00] Alon Y. Levy. “Answering Queries Using Views: A Survey”, 2000. Submitted for publication.

[LGP+90] Douglas B. Lenat, Ramanathan V. Guha, Karen Pittman, Dexter Pratt, and Mary Shepherd. “Cyc: Toward Programs with Common Sense”. Communications of the ACM, 33(8):30–49, 1990.

[LHC] http://lhc.web.cern.ch/lhc/.

[LMSS95] Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, and Divesh Srivastava. “Answering Queries Using Views”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 1995, San Jose, CA USA, 1995.

[LR96] Alon Y. Levy and Marie-Christine Rousset. “CARIN: A Representation Language Combining Horn Rules and Description Logics”. In Proc. 12th European Conference of Artificial Intelligence, 1996.

[LRO96] Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. “Querying Heterogeneous Information Sources Using Source Descriptions”. In Proceedings of the 1996 International Conference on Very Large Data Bases (VLDB’96), pages 251–262, 1996.

[LRV88] Christophe Lécluse, Philippe Richard, and Fernando Velez. “O2, an Object-oriented Data Model”. In Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data (SIGMOD’88), pages 424–433, Chicago, IL USA, June 1988.

[LS97] Alon Y. Levy and Dan Suciu. “Deciding Containment for Queries with Complex Objects (Extended Abstract)”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 11–15, 1997, Tucson, AZ USA, pages 20–31, 1997.

[LSS99] Laks V. S. Lakshmanan, Fereidoon Sadri, and Subbu N. Subrahmanian. “On Efficiently Implementing SchemaSQL and a SQL Database System”. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB’99), Edinburgh, Scotland, 1999.

[Mae94] Pattie Maes. “Agents that Reduce Work and Information Overload”. Communications of the ACM, 37(7), July 1994.

[MB87] Robert MacGregor and Raymond Bates. “The LOOM Knowledge Representation Language”. Technical Report ISI/RS-97-188, USC/ISI, 1987.

[MHH+01] Renee J. Miller, Mauricio A. Hernandez, Laura M. Haas, Ling Ling Yan, C. T. Howard Ho, Ron Fagin, and Lucian Popa. “The Clio Project: Managing Heterogeneity”. SIGMOD Record, 30(1), March 2001.

[MIKS00] Eduardo Mena, Arantza Illarramendi, Vipul Kashyap, and Amit Sheth. “OBSERVER: An Approach for Query Processing in Global Information Systems based on Interoperation across Pre-existing Ontologies”. International Journal of Distributed and Parallel Databases (DAPD), 8(2):223–271, 2000.

[MKSI96] Eduardo Mena, Vipul Kashyap, Amit P. Sheth, and Arantza Illarramendi. “OBSERVER: An Approach for Query Processing in Global Information Systems based on Interoperation across Pre-existing Ontologies”. In Proceedings First IFCIS International Conference on Cooperative Information Systems (CoopIS’96), pages 14–25, Brussels, Belgium, June 1996. IEEE Computer Society Press.

[MKW00] Prasenjit Mitra, Martin Kersten, and Gio Wiederhold. “A Graph- Oriented Model for Articulation of Ontology Interdependencies”. In Proceedings of the 7th International Conference on Extending Database Technology (EDBT 2000), Konstanz, Germany, March 2000. Springer Verlag.

[MLF00] Todd Millstein, Alon Levy, and Marc Friedman. “Query Contain- ment for Data Integration Systems”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 2000, Dallas, Texas, May 2000.

[MMS79] David Maier, Alberto O. Mendelzon, and Yehoshua Sagiv. “Test- ing Implications of Data Dependencies”. ACM Transactions on Database Systems, 4(4):455–469, 1979.

[MY95] Weiyi Meng and Clement Yu. “Query Processing in Multidatabase Systems”. In Won Kim, editor, Modern Database Systems: The Ob- ject Model, Interoperability, and Beyond, pages 551–572. Addison- Wesley, 1995.

[MZ98] Tova Milo and Sagit Zohar. “Using Schema Matching to Simplify Heterogeneous Data Translation”. In Proceedings of the 1998 Inter- national Conference on Very Large Data Bases (VLDB’98), August 1998.

[NBN99] Marian Nodine, William Bohrer, and Anne Ngu. “Semantic Bro- kering over Dynamic Heterogeneous Data Sources in InfoSleuth”. In Proceedings of the 15th IEEE International Conference on Data Engineering (ICDE’99), 1999.

[Neb89] Bernhard Nebel. “What is Hybrid in Hybrid Representation Sys- tems?”. In F. Gardin, G. Mauri, and M. G. Filippini, editors, Proceedings of the International Symposium on Computational In- telligence’89, pages 217–228, Amsterdam, The Netherlands, 1989. North-Holland.

[New82] Allen Newell. “The Knowledge Level”. Artificial Intelligence, 18:87– 127, 1982.

[New93] Allen Newell. “Reflections on the Knowledge Level”. Artificial In- telligence, 59:31–38, 1993.

[NPU98] Marian Nodine, Brad Perry, and Amy Unruh. “Experience with the InfoSleuth Agent Architecture”. In Proceedings of the AAAI-98 Workshop on Software Tools for Developing Agents, 1998.

[NU97] Marian Nodine and Amy Unruh. “Facilitating Open Communication in Agent Systems: the InfoSleuth Infrastructure”. In Proceedings of ATAL-97, 1997.

[NvL88] Bernhard Nebel and Kai von Luck. “Hybrid reasoning in BACK”. In Z. W. Ras and L. Saitta, editors, Proceedings of the Third International Symposium on Methodologies for Intelligent Systems, pages 260–269, Amsterdam, The Netherlands, 1988. North-Holland.

[Nwa96] Hyacinth S. Nwana. “Software Agents: An Overview”. Knowledge Engineering Review, 11(3):1–40, September 1996.

[OV99] M. Tamer Özsu and Patrick Valduriez. Principles of Distributed Database Systems. Prentice Hall, 1999.

[Pap94] Christos H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.

[PGMW95] Yannis Papakonstantinou, Hector Garcia-Molina, and Jennifer Widom. “Object Exchange Across Heterogeneous Information Systems”. In Proceedings of the 11th IEEE International Conference on Data Engineering (ICDE’95), March 1995.

[PGR98] Charles Petrie, Sigrid Goldmann, and Andreas Raquet. “Agent-Based Process Management”. In Proc. of the International Workshop on Intelligent Agents in CSCW, Deutsche Telekom, Dortmund, Germany, pages 1–17, September 1998.

[PHG+99] A. Preece, K. Hui, A. Gray, P. Marti, T. Bench-Capon, D. Jones, and Z. Cui. “The KRAFT Architecture for Knowledge Fusion and Transformation”. In Proceedings of the Nineteenth SGES International Conference on Knowledge Based Systems and Applied Artificial Intelligence (ES’99), Cambridge, UK, 1999.

[PL00] Rachel Pottinger and Alon Y. Levy. “A Scalable Algorithm for Answering Queries Using Views”. In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB’2000), 2000.

[PSS93] Peter F. Patel-Schneider and William Swartout. “Description Logic Knowledge Representation System Specification from the KRSS Group of the ARPA Knowledge Sharing Effort”, November 1993.

[PV99] Yannis Papakonstantinou and Vasilis Vassalos. “Query Rewriting for Semistructured Data”. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD’99), 1999.

[PW92] J. Scott Penberthy and Daniel S. Weld. “UCPOP: A Sound, Com- plete, Partial-Order Planner for ADL”. In Third International Conference on Knowledge Representation and Reasoning (KR-92), Cambridge, MA USA, October 1992.

[PWC95] C. Petrie, T. Webster, and M. Cutkowsky. “Using Pareto Optimality to Coordinate Distributed Agents”. AIEDAM, 9:269–281, 1995.

[Qia96] Xiaolei Qian. “Query Folding”. In Proceedings of the 12th IEEE International Conference on Data Engineering (ICDE’96), pages 48–55, New Orleans, LA USA, 1996.

[RN95] Stuart Russell and Peter Norvig. Artificial Intelligence - A Modern Approach. Prentice Hall, NJ, 1995.

[Ros99] Riccardo Rosati. “Towards Expressive KR Systems Integrating Dat- alog and Description Logics: Preliminary Report”. In Proc. DL’99, 1999.

[RS97] Mary Tork Roth and Peter Schwarz. “Don’t Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources”. In Proceedings of the 1997 International Conference on Very Large Data Bases (VLDB’97), 1997.

[RSU95] Anand Rajaraman, Yehoshua Sagiv, and Jeffrey D. Ullman. “An- swering Queries Using Templates with Binding Patterns”. In Pro- ceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 1995, pages 105–112, 1995.

[RVW99] C. M. Rood, D. Van Gucht, and F. I. Wyss. “MD-SQL: A Language for Meta-data Queries over Relational Databases”. Technical Report TR528, Dept. of CS, Indiana University, 1999.

[RZA95] Paul Resnick, Richard Zeckhauser, and Chris Avery. “Roles for Electronic Brokers”. In G. W. Brock, editor, Toward a Competitive Telecommunication Industry, pages 289–304. Lawrence Erlbaum As- sociates, Mahwah, NJ, 1995.

[Sar91] Y. Saraiya. “Subtree Elimination Algorithms in Deductive Data- bases”. PhD thesis, Department of Computer Science, Stanford University, January 1991.

[SCB+98] Ira A. Smith, Philip R. Cohen, Jeffrey M. Bradshaw, Mark Greaves, and Heather Holmback. “Designing Conversation Policies using Joint Intention Theory”. In Proc. International Joint Conference on Multi-Agent Systems (ICMAS-98), Paris, France, July 1998.

[SCH+97] Munindar P. Singh, Philip Cannata, Michael N. Huhns, Nigel Jacobs, Tomasz Ksiezyk, KayLiang Ong, Amit P. Sheth, Chris- tine Tomlinson, and Darrell Woelk. “The Carnot Heterogeneous Database Project: Implemented Applications”. Distributed and Parallel Databases, 5(2):207–225, 1997.

[SDJL96] Divesh Srivastava, Shaul Dar, H. V. Jagadish, and Alon Y. Levy. “Answering Queries with Aggregation Using Views”. In Proceedings of the 1996 International Conference on Very Large Data Bases (VLDB’96), pages 318–329, 1996.

[Sea69] John R. Searle. Speech Acts: An Essay in the Philosophy of Lan- guage. Cambridge University Press, Cambridge, 1969.

[SGV99] William Swartout, Yolanda Gil, and Andre Valente. “Represent- ing Capabilities of Problem Solving Methods”. In Proc. IJCAI- 99 Workshop on Ontologies and Problem-Solving Methods: Lessons Learned and Future Trends, Stockholm, Sweden, August 1999.

[Shm87] Oded Shmueli. “Decidability and Expressiveness Aspects of Logic Queries”. In Proceedings of the 6th ACM SIGACT- SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’87), pages 237–249, 1987.

[Sho93] Yoav Shoham. “Agent-Oriented Programming”. Artificial Intelligence, 60(1):51–92, 1993.

[SHWK76] Michael Stonebraker, Gerald Held, Eugene Wong, and Peter Kreps. “The Design and Implementation of INGRES”. ACM Transactions on Database Systems, 1(3):189–222, 1976.

[Sip97] Michael F. Sipser. Introduction to the Theory of Computation. PWS Publishing, 1997.

[SL90] Amit P. Sheth and James A. Larson. “Federated Database Systems for Managing Distributed, Heterogeneous and Autonomous Databases”. ACM Computing Surveys, 22(3), September 1990.

[SL95] Tuomas Sandholm and Victor Lesser. “Issues in Automated Ne- gotiation and Electronic Commerce: Extending the Contract Net Framework”. In 1st International Conference on Multiagent Sys- tems (ICMAS), pages 328–335, San Francisco, CA USA, 1995.

[SLK98] Katia Sycara, J. Lu, and Matthias Klusch. “Interoperability among Heterogeneous Software Agents on the Internet”. Technical report, Carnegie-Mellon University, Pittsburgh, USA, 1998.

[Smi80] Reid G. Smith. “The Contract Net Protocol: High-Level Com- munication and Control in a Distributed Problem Solver”. IEEE Transactions on Computers, 29(12):1104–1113, December 1980.

[SPVG01] K. Sycara, M. Paolucci, M. Van Velsen, and J. A. Giampapa. “The RETSINA MAS Infrastructure”. Technical Report CMU-RI-TR- 01-05, Robotics Institute, Carnegie Mellon University, March 2001.

[SS89] Manfred Schmidt-Schauss. “Subsumption in KL-ONE is Undecid- able”. In Proceedings of the 1st International Conference on Prin- ciples of Knowledge Representation and Reasoning (KR’89), pages 421–431. Morgan Kaufmann, 1989.

[SSS91] Manfred Schmidt-Schauss and Gert Smolka. “Attributive Concept Descriptions with Complements”. Artificial Intelligence, 48(1):1– 26, 1991.

[SY80] Yehoshua Sagiv and Mihalis Yannakakis. “Equivalences Among Relational Expressions with the Union and Difference Operators”. Journal of the ACM, 27(4):633–655, 1980.

[TBM99] P. Tsompanopoulou, L. Bölöni, and D. C. Marinescu. “The Design of Software Agents for a Network of PDE Solvers”. In Proc. Workshop on Autonomous Agents in Scientific Computing at Autonomous Agents 1999, pages 57–68, 1999.

[TK78] Dionysios Tsichritzis and Anthony Klug. “The ANSI/X3/SPARC DBMS Framework”. Information Systems, 3(4), 1978.

[TMD92] Jean Thierry-Mieg and Richard Durbin. “Syntactic Definitions for the ACeDB Data Base Manager”. Technical Report MRC-LMB xx.92, MRC Laboratory for Molecular Biology, Cambridge, UK, 1992.

[TSI94] Odysseas G. Tsatalos, Marvin H. Solomon, and Yannis E. Ioannidis. “The GMAP: A Versatile Tool for Physical Data Independence”. In Proceedings of the 1994 International Conference on Very Large Data Bases (VLDB’94), 1994.

[TYF86] Toby J. Teorey, Dongqing Yang, and James P. Fry. “A Logical Design Methodology for Relational Databases using the Extended Entity-Relationship Model”. ACM Computing Surveys, 18(2):197– 222, 1986.

[Ull88] Jeffrey D. Ullman. Principles of Database & Knowledge-Base Systems Vol. 1. Computer Science Press, December 1988.

[Ull89] Jeffrey D. Ullman. Principles of Database & Knowledge-Base Systems Vol. 2: The New Technologies. Computer Science Press, 1989.

[Ull97] Jeffrey D. Ullman. “Information Integration Using Logical Views”. In Proc. ICDT’97, pages 19–40, 1997.

[Var82] Moshe Y. Vardi. “The Complexity of Relational Query Languages”. In Proc. 14th Annual ACM Symposium on Theory of Computing (STOC’82), pages 137–146, San Francisco, CA USA, May 1982.

[Var97] Moshe Y. Vardi. “Why is Modal Logic so Robustly Decidable”. In DIMACS Series in Discrete Mathematics and Theoretical Computer Science 31, American Math. Society, pages 149–184, 1997.

[vdM92] Ron van der Meyden. “The Complexity of Querying Indefinite Information about Linearly Ordered Domains”. In Proceedings of the 11th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’92), pages 331–345, San Diego, June 1992. ACM Press.

[vLNPS87] Kai von Luck, Bernhard Nebel, Christof Peltason, and Albrecht Schmiedel. “The Anatomy of the BACK System”. Technical Report 41, KIT (Künstliche Intelligenz und Textverstehen), Technical University of Berlin, January 1987.

[VV98] Sergei Vorobyov and Andrei Voronkov. “Complexity of Nonrecursive Logic Programs with Complex Values”. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS) 1998, 1998.

[WBLX00] Thomas Wagner, Brett Benyo, Victor Lesser, and Ping Xuan. “Investigating Interactions Between Agent Conversations and Agent Control Components”. In Frank Dignum and Mark Greaves, editors, Issues in Agent Communication, Lecture Notes in Computer Science. Springer-Verlag, Berlin, April 2000.

[Wei99] Gerhard Weiss. Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence. MIT Press, 1999.

[Wel99] Daniel S. Weld. “Recent Advances in AI Planning”. AI Magazine, 20(2):93–123, 1999.

[Wid96] Jennifer Widom. “Integrating Heterogeneous Databases: Lazy or Eager?”. ACM Computing Surveys, 28A(4), December 1996.

[Wie92] Gio Wiederhold. “Mediators in the Architecture of Future Information Systems”. IEEE Computer, 25(3):38–49, March 1992.

[Wie96] Gio Wiederhold, editor. Intelligent Integration of Information. Kluwer Academic Publishers, Boston, MA USA, July 1996.

[WJ95] Michael J. Wooldridge and Nicholas R. Jennings. “Intelligent Agents: Theory and Practice”. Knowledge Engineering Review, 10(2), June 1995.

[Wor01] World Wide Web Consortium. Semantic Web Activity Home Page, 2001. http://www.w3.org/2001/sw/.

[WT98] Gerhard Wickler and Austin Tate. “Capability Representa- tions for Brokering: A Survey”, November 1998. Available at http://www.aiai.ed.ac.uk/∼oplan/cdl/.

[YL87] H. Z. Yang and Per-Åke Larson. “Query Transformation for PSJ-Queries”. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB’87), pages 245–254, Brighton, England, 1987.

[YO79] Clement T. Yu and Z. Meral Özsoyoglu. “An Algorithm for Tree-Query Membership of a Distributed Query”. In Proc. IEEE COMPSAC, pages 306–312, 1979.

[Zan96] Carlo Zaniolo. “A Short Overview of LDL++: A Second-Generation Deductive Database System”. Computational Logic, 3(1):87–93, December 1996.

[ZHKF95a] Gang Zhou, Richard Hull, Roger King, and Jean-Claude Franchitti. “Supporting Data Integration and Warehousing Using H2O”. IEEE Data Engineering, 18(2):29–40, June 1995.

[ZHKF95b] Gang Zhou, Richard Hull, Roger King, and Jean-Claude Franchitti. “Using Object Matching and Materialization to Integrate Heterogeneous Databases”. In S. Laufmann, S. Spaccapietra, and T. Yokoi, editors, Proc. of the 3rd Int. Conf. on Cooperative Information Systems (CoopIS’95), pages 4–18, Vienna, Austria, May 1995.

Christoph Koch
Beatrixgasse 26/70
A-1030 Wien
AUSTRIA

Curriculum Vitae

July 11th, 1975 Born in Vienna

Education:

1981 – 1985 Four years of primary school

1985 – 1993 Eight years of secondary school (with distinction in every year).

June 22nd, 1993 Graduation from secondary school (with distinction).

Oct. 1993 – May 1994 Mandatory military service, signals intelligence regiment (FMAR), Austrian army.

Oct. 1994 – June 1998 Student of Computer Science at Technische Universität Wien, Vienna, Austria. Main areas of interest: Software Engineering, Databases, Information Retrieval, and Artificial Intelligence.

Oct. 1997 – Apr. 1998 Master's work and English language master's thesis in the areas of Information Retrieval and Natural Language Processing.

June 9th, 1998 Graduation examination. Finished graduate program after 4 years (official minimum duration of the Computer Science graduate program until graduation at TU Wien: 5 years, average duration: 7.5 years). Average score: ∼1.5 (US-style GPA: ∼3.5, GPA of graduation examination: 4.0).

Summer 1998 – June 2001 PhD program in Computer Science at TU Wien, Austria. Research performed at CERN, Geneva, Switzerland. PhD thesis “Data Integration against Multiple Evolving Autonomous Schemata” submitted May 22nd, 2001.

Professional Experience:

June and July 1994 Raiffeisenbank Wien RegGenmbH, Vienna. Summer internship.

August 1994 Bank Austria AG, Vienna. Summer internship.

July 1995 – May 1997 Osterreichische¨ Blindenwohlfahrt, Vienna. Software development; later database administra- tion and technical support.

Oct. 1995 – June 1997 TU Wien, Institut f. Softwaretechnik, Vienna. Teaching assistant in the Software Engineering course for four terms; advised groups of students working together on small software projects.

July 18th, 1996 Obtained Austrian trade license for IT services.

July 1995 – Sept. 1998 Via GesmbH, Vienna. Implemented a statistical management information system for pharmaceutical companies (in C++). Developed an information system for pharmaceutical companies which keeps track of the actions of representatives, using Oracle RDBMS and Oracle development tools. Work done includes requirements analysis, design, implementation, quality assurance, and project management.

Oct. 1996 – Summer 2000 TU Wien, Institut f. Informationssysteme, Abt. Datenbanksysteme, Vienna. Took part in a research project on answer set programming systems; implemented parts of the well-known DLV system; co-authored several publications. Received various grants.

Dec. 1998 – June 2001 EP Division, European Organization for Nuclear Research (CERN), Geneva, Switzerland. PhD research; implemented a query execution engine for a commercial persistent-object management system; developed data integration middleware.

See http://home.cern.ch/∼chkoch/ for a list of publications.

Dipl.-Ing. Christoph Koch
Beatrixgasse 26/70
A-1030 Wien
AUSTRIA

Curriculum Vitae

July 11th, 1975 Born in Vienna.

Education:

1981 – 1985 Four years of primary school.

1985 – 1993 Eight years of secondary school at the Akademisches Gymnasium Wien 1 (with distinction in every year).

June 22nd, 1993 Graduation from secondary school (Matura) with distinction.

Oct. 1993 – May 1994 Eight months of mandatory military service.

Oct. 1994 – June 1998 Studies of Computer Science at Technische Universität Wien. Main areas of interest: Software Engineering, Databases, Information Retrieval, and Artificial Intelligence.

Oct. 1997 – Apr. 1998 Master's thesis in the areas of Information Retrieval and Natural Language Processing.

June 9th, 1998 Graduation examination. Finished the Computer Science program after eight semesters (official minimum duration: 10 semesters; average duration in Computer Science: 14.5 semesters).

Summer 1998 – June 2001 Doctoral studies in Computer Science at Technische Universität Wien, carried out at CERN (Geneva).

May 22nd, 2001 Submission of the dissertation (title “Data Integration against Multiple Evolving Autonomous Schemata”) to the Dekanat für Technische Naturwissenschaften und Informatik of Technische Universität Wien.

Professional Experience:

June and July 1994 Raiffeisenbank Wien RegGenmbH. Summer internship.

August 1994 Bank Austria AG, Vienna. Summer internship.

July 1995 – May 1997 Österreichische Blindenwohlfahrt, Vienna. Development of software for the administration of residential homes for the blind; database administration and technical support.

Oct. 1995 – June 1997 TU Wien, Institut für Softwaretechnik. Teaching assistant in the Software Engineering I and II lab courses; advised groups of students working on small software projects.

July 18th, 1996 Obtained the Austrian trade license for data processing.

July 1995 – Sept. 1998 Via GesmbH, Vienna. Implemented a statistical management information system for pharmaceutical companies and an information system for managing the field representatives of pharmaceutical companies (in Oracle). Work included requirements specifications, database design, implementation of software applications, quality assurance, and project management.

Oct. 1996 – Summer 2000 TU Wien, Institut für Informationssysteme, Abt. Datenbanksysteme. Took part in a research project in the area of nonmonotonic reasoning; implemented parts of the successful answer set programming system DLV.

Dec. 1998 – June 2001 EP Division, European Organization for Nuclear Research (CERN), Geneva, Switzerland. Research in the course of the doctoral studies; implemented middleware for data integration.

A current list of publications can be found at http://home.cern.ch/∼chkoch/.
