Transforming Databases with Recursive Data Structures

TRANSFORMING DATABASES WITH RECURSIVE DATA STRUCTURES

Anthony Kosky

A DISSERTATION in COMPUTER AND INFORMATION SCIENCE

Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy.

1996

Susan Davidson, Supervisor of Dissertation
Peter Buneman, Supervisor of Dissertation
Peter Buneman, Graduate Group Chairperson

© Copyright 2003 by Anthony Kosky

To my parents.

WARRANTY

Congratulations on your acquisition of this dissertation. In acquiring it you have shown yourself to be a computer scientist of exceptionally good taste with a true appreciation for quality. Each proof, algorithm or definition in this dissertation has been carefully checked by hand to ensure correctness and reliability. Each word and formula has been meticulously crafted using only the highest quality symbols and characters. The colours of inks and paper have been carefully chosen and matched to maximize contrast and readability.

The author is confident that this dissertation will provide years of reliable and trouble free service, and offers the following warranty for the lifetime of the original owner: If at any time a proof or algorithm should be found to be defective or contain bugs, simply return your dissertation to the author and it will be repaired or replaced (at the author’s choice) free of charge. Please note that this warranty does not cover damage done to the dissertation through normal wear-and-tear, natural disasters or being chewed by family pets. This warranty is void if the dissertation is altered or annotated in any way.

Concepts described in this dissertation may be new and complicated. The author accepts no liability for any confusion or damage incurred during the reading and contemplation of the dissertation. Children under the age of five should not attempt to read this dissertation without proper adult supervision.
Comments, suggestions and personal abuse are all welcome and should be sent to the author via electronic mail.

ACKNOWLEDGMENTS

This dissertation marks the end of six years which I spent engaged in studies and research at the Department of Information and Computer Science of the University of Pennsylvania. Though only a part of that time was spent directly on the work described in this dissertation, it nevertheless reflects many influences, both from my time at Penn and from my studies prior to that in England. There are many people to thank, both for their direct contributions to this work, and also for their roles in developing my understanding and appreciation of theoretical computer science, databases, programming languages, and many other subjects of relevance.

Firstly I would like to thank my advisors, Peter Buneman and Susan Davidson, for their help, suggestions, support, advice and encouragement, and for introducing me to the subject of databases. Peter was also responsible for giving me the opportunity to enroll in a PhD program at Penn. I would like to thank my committee members, Tim Griffin, Victor Markowitz, Carl Gunter, Val Tannen and Chris Overton, for their comments and advice.

This work has also been influenced greatly by the discussions of the “Tuesday afternoon group”, including Leonid Libkin, Limsoon Wong, Dan Suciu, Rona Machlin, Wenfei Fan and Kyle Hart. I would especially like to thank Leonid for his many helpful comments and advice, and for his thorough reading of the proposal for this dissertation. Barbara Eckman and Carmem Hara did much of the work on the trials of the prototype transformation system described in part IV. Barbara also helped to explain the Molecular Biology Databases and the database problems that inspired much of this work. I am also grateful to Catriel Beeri, Jan Van den Bussche and Serge Abiteboul for their comments on my other papers related to this work. Edward T. Bear gave consistent support and encouragement, and helped with some of the more technically difficult proofs in this dissertation.

One of the most enjoyable aspects of my research at Penn was the collaboration with members of the computational biology group, not only because it gave me an opportunity to look at some practical applications for my work, but also because it gave me a chance to learn a little about the fascinating subjects of molecular biology and genetics. I would like to thank Chris Overton and David Searls for sharing their enthusiasm for these subjects, and for their many impromptu biology lessons.

There are also many people who have contributed to my development first as a mathematician and then as a computer scientist. I would like to thank the lecturers of the Department of Mathematics at the University of Kent at Canterbury, in particular John Earl, who helped me to develop an appreciation for the beauty of pure mathematics. My introduction to computer science came when I did a Master's degree at the Department of Computing at Imperial College of Science and Technology. In particular I was introduced to the subjects of formal methods and functional programming by the lectures of Samson Abramsky, Mike Smyth, Steve Vickers, Pete Harrison, Chris Hankin and others. Samson Abramsky also supervised my master's thesis and recommended me as a possible PhD student at the University of Pennsylvania, for which I am especially grateful. My knowledge and appreciation of theoretical computer science has been extended further while at Penn, through the lectures of Val Tannen, Carl Gunter, Scott Weinstein, Peter Freyd and others.

Many of the staff at the University of Pennsylvania have helped me in dealing with bureaucracy and various administrative details. I would particularly like to thank Mike Felker, whose help allowed me to finish off and co-ordinate this PhD while working in California. I would also like to thank Karen Carter, Nan Blitz, Susan Deysher, Elaine Benedetto and Jackie Caliman, and members of the computing staff Mark Foster, Mark-Jason Dominus and Alex Garthwaite.

There are also many people who helped in making my time at Penn enjoyable, and in helping me to maintain a semblance of sanity. I would like to thank the Old Quaker Computer Scientists for some very bizarre and amusing times, the Penn Magic play-testers, the Saturday-morning Reading Terminal crowd, and all at Bicycle Therapy for keeping my bikes running nicely.

Finally, but most importantly of all, I would like to thank my parents and my family. Their love, support and encouragement have been a constant comfort to me in spite of the long distances between us, and I could not have achieved any of this without them.

ABSTRACT

TRANSFORMING DATABASES WITH RECURSIVE DATA STRUCTURES

Anthony Kosky

Advisors: Susan Davidson and Peter Buneman.

This thesis examines the problems of performing structural transformations on databases involving complex data-structures and object-identities, and proposes an approach to specifying and implementing such transformations. We start by looking at various applications of such database transformations, and at some of the more significant work in these areas. In particular we will look at work on transformations in the area of database integration, which has been one of the major motivating areas for this work. We will also look at various notions of correctness that have been proposed for database transformations, and show that the utility of such notions is limited by the dependence of transformations on certain implicit database constraints. We draw attention to the limitations of existing work on transformations, and argue that there is a need for a more general formalism for reasoning about database transformations and constraints.
We will also argue that, in order to ensure that database transformations are well-defined and meaningful, it is necessary to understand the information capacity of the data-models being transformed. To this end we give a thorough analysis of the information capacity of data-models supporting object identity, and will show that this is dependent on the operations supported by a query language for comparing object identities.

We introduce a declarative language, WOL, based on Horn-clause logic, for specifying database transformations and constraints. We also propose a method of implementing transformations specified in this language, by manipulating their clauses into a normal form which can then be translated into an underlying database programming language. Finally we will present a number of optimizations and techniques necessary in order to build a practical implementation based on these proposals, and will discuss the results of some of the trials that were carried out using a prototype of such a system.

Contents

Acknowledgements
Abstract
Foreword
  1.1 A Roadmap
  1.2 Some Comments on the Mathematical Approach and Assumptions

I Database Transformations
2 Introduction
  2.1 Methods of Implementing Database Transformation
3 Transformations in Database Integration
  3.1 Database Integration: An Example
  3.2 Resolving Structural Conflicts in Database Integration
  3.3 Schema Integration Techniques
  3.4 Merging Data
4 Data Models for Database Transformations
5 Information Dominance in Transformations
  5.1 Hull’s Hierarchy of Information Dominance Measures
  5.2 Information Capacity and Constraints

II Observable Properties of Models for Recursive Data-Structures
6 Introduction
7 A Data-Model with Object Identities and Extents
  7.1 Types and Schemas
  7.2 Database Instances
8 A Query Language Based on Structural Recursion
  8.1 Queries and the Language SRI(=)
  8.2 Indistinguishable Instances in SRI(=)
9 Bisimulation and Observational Equivalence without Equality
  9.1 Bisimulation and Correspondence Relations
  9.2 Distinguishing Instances without Equality on Identities
10 Observable Properties of Object Identities with Keys
  10.1 A Data-Model with Keys
