
Approximate key and foreign key discovery in relational databases

by

Charlotte Vilarem

A thesis submitted in conformity with the requirements

for the degree of Master of Science

Graduate Department of Computer Science

University of Toronto

© Copyright by Charlotte Vilarem

Abstract

Approximate key and foreign key discovery in relational databases

Charlotte Vilarem

Master of Science

Graduate Department of Computer Science

University of Toronto

Organizations have been storing huge amounts of data for years, but over time the knowledge about these legacy databases may be lost or corrupted.

In this thesis, we study the problem of discovering approximate keys and foreign keys satisfied in a relational database of which we assume no previous knowledge.

The main ideas underlying our method for tackling this problem are not new; however, we propose two novel optimizations. First, we extract only key-based unary inclusion dependencies (UINDs), instead of all UINDs. Second, we add a pruning pass to the foreign key discovery, performed on a small data sample; it reduces the number of candidates tested against the database.

We validate our contributions on both real-life and synthetic data. We investigate the influence of input parameters on the performance of the pruning pass, and provide guidance on how to best set these parameters.

Acknowledgements

I would like to express my gratitude to a number of people without whose advice and encouragement this thesis would not have been possible.

First, I would like to thank my supervisor, Professor Renee Miller, for her guidance, advice, and patience during my research. I would also like to thank my second reader, Professor Ken Sevcik, for his time and helpful comments.

I would like to thank all my friends in Toronto, especially Travis, Vlad, and Tristan, for the great time I had with you throughout my program at the University of Toronto.

Finally, my deepest gratitude goes to my family and boyfriend, for their support and encouragement, and for their involvement in this work. I would never have made it without them.

Contents

1 Introduction
    Motivation
    Contributions of the thesis
    Scope of the research
    Outline of the thesis

2 Integrity constraints in relational databases
    Background
        Preliminaries
        Functional and inclusion dependencies
        Dependency inference problem
        Approximate dependency inference problem
        Keys and foreign keys
    Review of literature
        Levelwise algorithm
        Key and functional dependency discovery
        Foreign key and inclusion dependency discovery

3 Algorithms and complexity
    General architecture
    Finding the keys
        Testing the key candidates
        Generating key candidates
    Finding the foreign keys
        Generating foreign key candidates
        Pruning pass over the foreign key candidates
        Testing the remaining foreign key candidates
        Discovering the unary inclusion dependencies
    Complexity analysis
        Complexity of key discovery
        Complexity of foreign key discovery

4 Experiments
    Goals of the tests
    Experimental setup
        Metrics used to evaluate the impact of the pruning pass
        Datasets used in the experiments
        Setting of the algorithms' input parameters
    Key-based unary inclusion dependencies
    Efficiency of the pruning pass
        General behavior of the pruning pass in relation to the pruning parameter
        Influence of dependency size
        Influence of input approximation thresholds on the benefit of the pruning pass
        Influence of data size

5 Conclusion

Bibliography

List of Algorithms

    General extraction of keys and foreign keys
    ExtractKeys
    GenerateNextCand
    ExtractForeignKeys
    ExtractUINDs

List of Figures

    Levelwise algorithm
    Unfolding of Tane on the database of the example
    Unfolding of DepMiner on the database of the example
    Unfolding of FUN on the database of the example
    Time savings
    Efficiency of the pruning pass for various ε_fk values
    Running time on synthetic data with given PgParam, ε_key, and ε_fk

Chapter 1

Introduction

For more than a decade now, organizations have been storing huge amounts of data in relational databases. During the design phase of the database, a conceptual representation of the data is established, for instance an Entity-Relationship model, which describes the organization of the data in the database. This representation allows one to understand a dataset, and to develop applications using the data.

More precisely, a conceptual representation is used to explain the data: data relationships, data semantics, and data constraints. As an example, consider a university database. Its representation might specify that students are represented by a student number, a name, and a first name; that courses are represented by a course number and a course title; and that students take several courses for their program, i.e., there is a many-to-many relationship between students and courses. It might also associate attribute names with their meaning, for instance, that SNUM stands for student number.

In the relational data model, data are stored in flat tables, and each row represents an object or relationship in the real world. But the mere definition of tables is not enough to represent reality accurately: we need to add integrity constraints


to ensure that the data stored in the database reflects accurately the real-world restrictions, for instance that a student number is associated with only one student.

The most important integrity constraints, also called data dependencies, are functional and inclusion dependencies [CFP]. Functional dependencies are conditions stating that for any value of a set of attributes, there is at most one value for a set of target attributes; for instance, given a room, a date, and a time, there is at most one talk being held in there. Inclusion dependencies state that values occurring in certain columns of one relation must also occur in columns of another relation; for example, a manager is also an employee. In practice, most of the time, only specializations of these two types of dependencies are used: keys and foreign keys. Keys uniquely identify objects, or tuples, in a relation; e.g., a student number uniquely identifies a student. Foreign keys are inclusion dependencies that refer to a key; they express relationships between objects in the database. As an example, the student number column of a table enrolled-in is a foreign key referencing the key column student number in the table student.

But over time, the conceptual representation may be lost, due to multiple updates, extensions, loss of documentation, or turnover of domain experts. Hence these legacy databases may no longer be usable, and the knowledge they contain is wasted. Such a situation calls for a database reverse engineering process, in order to build a new model which reflects accurately the current state of the database. But because the data in legacy databases is likely to be noisy or corrupted, the true data semantics may not hold exactly in the database. Indeed, an exact model would incorporate the noise in its representation of the dataset, while what we really want is a model that reflects the data, independent of the inconsistencies introduced over time. Therefore, we may want to accommodate the noise with error thresholds, so that the model found almost holds in the database, and fits the data without the noise.

However, determining dynamically the best error thresholds for a database is a difficult problem, so in this thesis we focus on a simpler one: the discovery of approximate keys and foreign keys holding in a database instance, for given approximation thresholds.

Motivation

Reverse engineering is the process of identifying a system's components and their relationships, and creating representations of the system in another form or at a higher level of abstraction [CI]. Its aim is to redocument, convert, restructure, maintain, or extend legacy systems [Hai]. In the case of database reverse engineering, this process can be divided into two distinct steps [PTBK]: (1) eliciting the data semantics from the system, and (2) expressing the extracted semantics with a high-level data model. Many approaches to database reverse engineering overlook the first step by assuming that some of the data semantics is known, but such assumptions may not hold for legacy databases. (For a classification of database reverse engineering methods according to their input requirements, see [dJS].) Indeed, old versions of database management systems do not support the definition of foreign keys (e.g., foreign key definitions were only introduced in a later version of Oracle). Therefore, identifying the keys and foreign keys holding in a database instance can be part of the first step of a database reverse engineering process.

Integrity constraints, especially keys and foreign keys, can play an important role in query optimization. The role of query optimization is to find an efficient way of processing a query, using all sorts of information about the data stored in the database, such as statistics and integrity constraints (keys and foreign keys). It is a three-stage process [Dat]. First, the query optimizer rewrites the query into a more efficient form, so that the performance of the query does not depend on how the user chose to write it. Then the optimizer generates possible execution plans for the new representation of the query. And then it computes cost estimations for each of these execution plans, and chooses the most efficient of them.

Keys and foreign keys are used both in the query rewrite phase and to predict cost estimations. For example, if dept_num is a key of relation department, then the query "select distinct dept_num from department" is equivalent to the more efficient "select dept_num from department". Keys and foreign keys also help in predicting the result size of joins, which is used in cost estimations.

Semantic query optimization, i.e., query optimization using a broader range of integrity constraints, has long been a subject of research, but most of the proposed techniques have not been implemented in commercial optimizers [CGK]. Cheng et al. and Godfrey et al. address this issue [CGK, GGXZ], investigating how some semantic query optimizations using keys, foreign keys, and other types of integrity constraints can be integrated in traditional query optimizers.

Integrity constraints can also be useful in schema mapping [MHH]. The goal in schema mapping is to map a source (legacy) database into a fixed target schema, through a set of queries on the source database. These mapping queries are derived from the correspondences between a target value and a set of source values. But when the value correspondences involve several relations of the source database, one needs a way of joining the tuples of these relations. The joins are determined from foreign key paths. So these joins require knowledge of the integrity constraints (keys and foreign keys) holding in the source database.

Contributions of the thesis

The contributions of this thesis are the following:

We propose a method for approximate key and foreign key discovery in relational databases. Even though the underlying ideas are not new, this is the first proposal of a concrete algorithm for extracting foreign keys from a database of which we assume no previous knowledge.

We implemented two novel optimizations for foreign key discovery. The more important is a pruning pass over the foreign key candidates, performed on a small data sample, which can reduce dramatically the number of accesses to the database. The other optimization consists of extracting solely key-based unary inclusion dependencies, rather than all unary inclusion dependencies as is usually done, since only these are used in the foreign key candidate generation. This has a great impact in practice on the running time of the algorithm.

We evaluate the performance of our algorithm on both publicly available real-life and synthetic datasets. This allows us to study the dataset characteristics that affect the algorithm.

Scope of the research

We provide in this thesis a general framework for the discovery of keys and foreign keys (FKs). We chose, for simplicity, to keep the data inside the DBMS and to process it through SQL queries. But our architecture could easily be adapted to more efficient treatments of the data.

The attributes whose types are large, such as long strings (e.g., long varchars), were not considered for key or FK discovery, because they cannot appear in SQL statements of the form SELECT DISTINCT, which are needed to test keys and FKs. In DB2 and Oracle, this limit is reached beyond a certain data type size. This limitation should not be too restrictive, though, because attributes with large types are generally not intended as keys or FKs. Indeed, large values in keys or FKs mean slower searches and useless replication of data, resulting in poor performance.

Outline of the thesis

In Chapter 2, we give some background and survey the literature about integrity constraints and their discovery.

In Chapter 3, we present our general architecture for approximate key and foreign key discovery, and analyze the complexity of our approach.

In Chapter 4, we describe the testing of our algorithm on both real-life and synthetic data, and study how input parameters and data characteristics influence the performance of our application.

Finally, we conclude in Chapter 5.

Chapter 2

Integrity constraints in relational databases

Background

Preliminaries

First of all, we define the concepts used throughout this thesis. For more details, see the books by Levene and Loizou [LL] or Date [Dat].

Definition (Basic relational database concepts). In the following, uppercase letters from the end of the alphabet, such as X, Y, Z (which may be subscripted), will be used to denote sets of attributes, while those from the beginning of the alphabet, such as A, B, C, will be used to denote single attributes.

A relation schema R is a finite set of attributes. The domain of an attribute A, denoted by Dom(A), is the set of all possible values of A. A tuple over a relation schema R = {A_1, ..., A_m} is a member of the Cartesian product Dom(A_1) × ... × Dom(A_m). A relation r over R is a finite set of tuples over R. The cardinality of a set X is denoted by |X|.

If X ⊆ R is an attribute set and t a tuple over R, we denote by t[X] the restriction of t to X. The projection of a relation r over R onto X is defined by π_X(r) = {t[X] | t ∈ r}.

A database schema R is a finite set of relation schemas R_i. A database d over R is a set of relations r_i, one over each R_i ∈ R.

Informally, relations are represented by tables, attributes by columns, and tuples by rows. The cardinality of a relation, |r|, is the number of rows of the table; we will commonly denote it by p throughout this thesis.

We now define a running example used in the rest of this chapter.

Example (MovieDB). Consider the following database, called MovieDB, representing information about movies and the theaters in which they are showing.

Table Movies
    Title           Length   Genre
    The mummy       … min    Action
    In the bedroom  … min    Drama
    Life of Brian   … min    Comedy
    Atanarjuat      … min    Drama
    Monster's ball  … min    Drama

Table Theaters
    Name        Location
    ByTowne     Rideau Street
    South Keys  Bank Street
    Mayfair     Bank Street

Table Schedule
    Theater     Movie           Time
    ByTowne     In the bedroom  … pm
    ByTowne     Monster's ball  … pm
    South Keys  In the bedroom  … pm
    South Keys  The mummy       … pm
    South Keys  Monster's ball  … pm
    Mayfair     Life of Brian   … pm

The database MovieDB is composed of three relations: Movies, with schema {Title, Length, Genre}; Theaters, with schema {Name, Location}; and Schedule, with schema {Theater, Movie, Time}. The database schema is the set {Movies, Theaters, Schedule}.

Functional and inclusion dependencies

The schemas alone are not sufficient to describe a database. Data semantics in a relational database are expressed by integrity constraints, and the most important of these constraints are functional and inclusion dependencies [CFP, LL]. Thus integrity constraints define the set of allowable states of a database.

Functional dependencies (FDs) have long been studied in the context of database normalization [LV]. Normalization is the process of designing a database satisfying a set of integrity constraints efficiently, and in order to avoid inconsistencies when manipulating the database [LL]. In the presence of only functional dependencies, the problems that may arise are update anomalies (after an update, the data is no longer consistent with the functional dependencies) and redundancy problems (parts of the data are unnecessarily duplicated); to prevent them, the database schema should be in Boyce-Codd Normal Form (BCNF) [Cod]. A database schema is in BCNF if, in each relation of the database, all FD left-hand sides are keys or supersets of keys of the relation. In the presence of both functional and inclusion dependencies (INDs), Levene and Vincent [LV] identify additional problems, and propose an inclusion dependency normal form (IDNF) to prevent them. A database is in IDNF if it is in BCNF with respect to the functional dependencies, all inclusion dependencies are really foreign keys, and the set of inclusion dependencies is acyclic. (The authors employ different terms: they state that a database is in IDNF if it is in BCNF w.r.t. the FDs, and if the set of inclusion dependencies is noncircular and key-based.) This normal form implies no interaction between the FDs and the INDs, prevents the problem of attribute redundancy, and satisfies a generalised form of entity integrity. These normal forms are the goal in database design. As a result, most FDs and INDs occurring in practice are based on keys and foreign keys.

Definition (Functional dependency). A functional dependency (FD) over a relation schema R is a statement of the form R: X → Y, where X, Y ⊆ R are sets of attributes.

An FD R: X → Y is satisfied in a relation r over R if whenever two tuples have equal X values, they also have equal Y values. Formally, a functional dependency R: X → Y is satisfied (or holds) in a relation r over R, denoted by r ⊨ R: X → Y, if ∀ t_1, t_2 ∈ r, t_1[X] = t_2[X] ⟹ t_1[Y] = t_2[Y].

An FD R: X → Y is minimal in r if Y is not functionally dependent on any proper subset of X, i.e., if Z → Y does not hold in r for any Z ⊂ X.

In the MovieDB example above, Schedule: {Theater, Movie} → Time holds. This means that in the theaters described in the MovieDB database, a movie is shown once a day, always at the same time.
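To make the satisfaction test concrete, the following is a minimal sketch in Python (our illustration, not part of the cited material), with a relation represented as a list of dictionaries mapping attribute names to values:

    def holds_fd(rows, x_cols, y_cols):
        """Check r |= X -> Y: tuples with equal X values must have equal Y values."""
        y_of = {}
        for row in rows:
            x = tuple(row[c] for c in x_cols)
            y = tuple(row[c] for c in y_cols)
            if y_of.setdefault(x, y) != y:  # a second Y value for this X value
                return False
        return True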

Definition (Inclusion dependency). An inclusion dependency (IND) over a database schema R is a statement of the form R_i[X] ⊆ R_j[Y], where R_i, R_j ∈ R, and X, Y are sequences of attributes such that X ⊆ R_i, Y ⊆ R_j, and |X| = |Y|. A unary inclusion dependency (UIND) is an IND such that |X| = |Y| = 1. An IND R_i[X] ⊆ R_j[Y] is of size i if |X| = |Y| = i.

Let d be a database over a database schema R, where r_i, r_j ∈ d are relations over relation schemas R_i, R_j ∈ R. An inclusion dependency R_i[X] ⊆ R_j[Y] is satisfied (or holds) in a database d over R, denoted by d ⊨ R_i[X] ⊆ R_j[Y], if ∀ t_i ∈ r_i, ∃ t_j ∈ r_j such that t_i[X] = t_j[Y]. Equivalently, d ⊨ R_i[X] ⊆ R_j[Y] whenever π_X(r_i) ⊆ π_Y(r_j).

Note that X and Y are sequences of attributes: the ordering of the attributes within X and Y matters. Indeed, the two following INDs are not equivalent: Manager[Firstname, Lastname] ⊆ Employee[Firstname, Lastname] is different from Manager[Lastname, Firstname] ⊆ Employee[Firstname, Lastname].

In the MovieDB example, Schedule[Theater] ⊆ Theaters[Name] and Schedule[Movie] ⊆ Movies[Title] hold. The first inclusion dependency describes the fact that all theaters that show movies (i.e., the theaters in Table Schedule) appear in the list of the town's theaters (i.e., are listed in Table Theaters). The second IND describes the fact that information such as length and genre (i.e., information displayed in Table Movies) is available for all movies currently playing (i.e., the movies in Table Schedule).
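The satisfaction test for an IND likewise reduces to a containment check between projections. A minimal sketch (same illustrative representation as above), checked here on a fragment of the MovieDB example:

    def holds_ind(lhs_rows, rhs_rows, x_cols, y_cols):
        """Check d |= R_i[X] <= R_j[Y]: pi_X(r_i) is contained in pi_Y(r_j)."""
        rhs_values = {tuple(row[c] for c in y_cols) for row in rhs_rows}
        return all(tuple(row[c] for c in x_cols) in rhs_values for row in lhs_rows)

    schedule = [{"Theater": "ByTowne"}, {"Theater": "Mayfair"}]
    theaters = [{"Name": "ByTowne"}, {"Name": "South Keys"}, {"Name": "Mayfair"}]
    assert holds_ind(schedule, theaters, ["Theater"], ["Name"])  # Schedule[Theater] <= Theaters[Name]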

Dependency inference problem

Much work has been done on the implication problem of functional and inclusion dependencies [CFP, LL], but this is not our focus in this thesis. The implication problem for functional and inclusion dependencies is the problem of deciding, for a given set S of FDs and INDs, whether S logically implies a dependency σ, where σ is an FD or an IND.

Rather, we are interested in the inference problem of data dependencies: given a database d, find a small cover of all data dependencies satisfied in d, where the data dependencies can be functional dependencies [KM] or inclusion dependencies [MLP]. A cover for a set F of dependencies is a set G of dependencies such that F and G are equivalent, i.e., any database that satisfies all the dependencies of F also satisfies all the dependencies of G, and vice versa [KM]. A cover F is minimum if there is no cover G of F such that G is a subset of F and G has fewer dependencies than F. In the inference problem, the goal is to find a small, but not necessarily minimum, cover. Deducing a minimum cover out of a small cover could be added as a final step, independent of the data.

Most of the work in this research area has been devoted to only functional dependencies, both exact [HKPT, NC, LPL, BMT, BB, Bel, FS] and approximate [HKPT, KM, Bou], while inclusion dependencies have raised little interest [MLP, KMRS, PT, LPT, BB, Bou].

Extracting functional or inclusion dependencies from a database instance has been shown to be hard in the worst case [MR, KMRS, MR].

Mannila and Räihä proved that there are small relations where the number of functional dependencies is exponential with respect to the number of attributes of the relations [MR]. The theorem states that for each n there exists a relation r over R such that n = |R|, |r| = O(n), and each cover of all functional dependencies has a number of dependencies exponential in n. Therefore, the inference problem for FDs is inherently exponential with respect to the number of attributes.

Assuming two n-attribute relations r_1, r_2, there are more than n! possible nonequivalent inclusion dependencies from r_1 to r_2 [KMRS]. The factorial comes from the fact that sequences, and not sets, of attributes are considered in INDs.

Kantola et al. proved that what they call FULL-IND-EXISTENCE is an NP-complete problem in the size of the schema [KMRS]. They define FULL-IND-EXISTENCE as follows: let R and S be relation schemas, and X a sequence consisting of the attributes of R in some order. Given relations r and s over R and S respectively, decide whether there exists a sequence Y, consisting of distinct attributes of S in some order, such that the dependency R[X] ⊆ S[Y] holds in the database {r, s}. Therefore, it is not always possible to check the existence of long INDs quickly.

Although our work is not about the implication problem, we can use inference rules for dependency implication in the inference problem, in order to speed up dependency discovery. We now give inference rules for FDs and INDs.

The following inference rules for the FD implication problem, known as Armstrong's axiom system [Arm], form a sound and complete axiomatization of FDs:

1. reflexivity: if Y ⊆ X ⊆ R, then X → Y
2. augmentation: if X → Y and W ⊆ R, then XW → YW
3. transitivity: if X → Y and Y → Z, then X → Z

If X is a key for R, then the functional dependency X → R holds. So, by the augmentation rule, the FDs XW → R also hold for all W ⊆ R, and thus there is no need to check these FDs against the database.

Casanova et al. give a sound and complete axiomatization of INDs, composed of the following inference rules [CFP]:

1. reflexivity: R[X] ⊆ R[X], if X is a sequence of distinct attributes of the relation schema R
2. projection and permutation: if R[A_1, ..., A_m] ⊆ S[B_1, ..., B_m], then R[A_{i_1}, ..., A_{i_k}] ⊆ S[B_{i_1}, ..., B_{i_k}] for each sequence i_1, ..., i_k of distinct integers from {1, ..., m}
3. transitivity: if R[X] ⊆ S[Y] and S[Y] ⊆ T[Z], then R[X] ⊆ T[Z]

Thus, if the IND Manager[Firstname, Lastname] ⊆ Employee[Firstname, Lastname] holds, then there is no need to test Manager[Lastname, Firstname] ⊆ Employee[Lastname, Firstname], since we know from Rule 2 that it also holds. Again, if S[Y] ⊆ T[Z] and R[X] ⊈ T[Z], then by Rule 3, R[X] ⊈ S[Y]. We can also use the contrapositive of Rule 3 to avoid some tests: if R[X] ⊆ S[Y] and R[X] ⊈ T[Z], then S[Y] ⊈ T[Z].

Approximate dependency inference problem

The approximate dependency inference problem is defined as the dependency inference problem in which the results do not need to be completely accurate: the dependencies found hold exactly, or almost hold, in the database [KM]. We allow some false positives in the result, but no false negative. This means that some dependencies that do not hold exactly but almost hold are included in the result, and no dependency that holds exactly is missing from the result.

Kivinen and Mannila [KM] concentrate on functional dependencies. To define formally a functional dependency that almost holds, they present three measures to compute the error of an FD. The error measure retained by subsequent work on approximate functional dependency inference [HKPT, Bou] is the so-called g_3 measure, which is defined below.

The error measure g_3 of an FD X → Y in a relation r is the minimal fraction of tuples to remove from r for X → Y to hold [KM]:

    g_3(X → Y, r) = (|r| − max{|s| : s ⊆ r and s ⊨ X → Y}) / |r|

Definition (Approximate functional dependency) [KM]. An approximate FD X → Y holds in a relation r with respect to an error threshold ε if and only if g_3(X → Y, r) ≤ ε.

When the dependency holds exactly, the error g_3 is equal to 0, and when the dependency is violated in many tuples, the error tends to 1.

Computing the error for a given FD can be done in O(|r|) time by using special data structures such as hash tables [HKPT], or in O(|r| log |r|) time by sorting the data. But the computation is not always necessary, since the error can be estimated through upper and lower bounds.

Note that the error measure g_3 is a parameter of this definition of approximate FD; hence the definition remains the same even if we change the error measure.
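As a sketch of the hash-based computation (our illustration; the names are ours): group the tuples by their X value, and within each group keep only the tuples carrying the most frequent Y value, since a maximal subset satisfying X → Y keeps exactly one Y value per X value:

    from collections import Counter, defaultdict

    def g3_fd(rows, x_cols, y_cols):
        """g3 error of X -> Y: minimal fraction of tuples to remove for it to hold."""
        groups = defaultdict(Counter)
        for row in rows:
            x = tuple(row[c] for c in x_cols)
            y = tuple(row[c] for c in y_cols)
            groups[x][y] += 1
        # within each X-group, only the tuples of one Y value can be kept
        kept = sum(max(counts.values()) for counts in groups.values())
        return 1 - kept / len(rows)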

The error measure g_3 defined for FDs has been adapted to the case of INDs in two different ways.

First, g_3 becomes the proportion of rows to remove from the left-hand-side relation for the IND to hold in a database d [Bou, PT]:

    g_3(R_i[X] ⊆ R_j[Y], d) = (|r_i| − max{|s| : s ⊆ r_i and (d ∖ {r_i}) ∪ {s} ⊨ R_i[X] ⊆ R_j[Y]}) / |r_i|

Lopes et al. [LPT] point out that this definition does not respect a natural property of INDs, namely that, for an IND holding in a database, the number of distinct values of the left-hand side is smaller than the number of distinct values of the right-hand side. This is due to the fact that the definition of INDs is based on distinct tuples, while the error measure g_3 is not. Thus they propose a new error measure for INDs, g_3', which is the proportion of tuples with distinct X values (independently of their occurrences) to remove from the left-hand-side relation for the IND to hold:

    g_3'(R_i[X] ⊆ R_j[Y], d) = (|π_X(r_i)| − max{|π_X(s)| : s ⊆ r_i and (d ∖ {r_i}) ∪ {s} ⊨ R_i[X] ⊆ R_j[Y]}) / |π_X(r_i)|
                             = |π_X(r_i) ∖ π_Y(r_j)| / |π_X(r_i)|

This error measure is also used by Shen et al., in the form of a fitness criterion equal to 1 − g_3' [SZWA].

We adopt in this thesis the second error measure, g_3', which better conveys the semantics of inclusion dependencies.

Definition (Approximate inclusion dependency). An approximate IND R_i[X] ⊆ R_j[Y] holds in a database d with respect to an error threshold ε if and only if g_3'(R_i[X] ⊆ R_j[Y], d) ≤ ε.

Here again, we could replace the error measure g_3' by any other measure.
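With this measure, the error computation reduces to comparing sets of distinct projected values; a minimal sketch (our illustration, same representation as before):

    def g3_prime_ind(lhs_rows, rhs_rows, x_cols, y_cols):
        """g3' error of R_i[X] <= R_j[Y]: fraction of distinct X values of r_i
        that are absent from pi_Y(r_j)."""
        xs = {tuple(row[c] for c in x_cols) for row in lhs_rows}
        ys = {tuple(row[c] for c in y_cols) for row in rhs_rows}
        return len(xs - ys) / len(xs)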



Keys and foreign keys

In this thesis we concentrate on keys and foreign keys, which are specializations of FDs and INDs, respectively.

A superkey is a set of attributes that determines all the attributes in a relation schema, and a key is a superkey whose set of attributes is minimal.

Definition (Key and approximate key). A set of attributes K ⊆ R is a superkey for R if K functionally determines all other attributes of the relation, i.e., R: K → R. A key (or minimal key) is a superkey such that no subset of it is a superkey.

K is a superkey in a relation r if no two tuples in r agree on K; thus K identifies uniquely each tuple in the relation.

Because a key is a special case of FD, the error measure for keys is the same as for FDs: the minimal fraction of tuples to remove for the dependency to hold in r. This simplifies into the fraction of tuples with duplicate K values:

    g_3(K, r) = (|r| − |π_K(r)|) / |r|

where |π_K(r)| is the number of distinct K values in r.

An approximate key K for r w.r.t. error threshold ε_key is a key such that g_3(K, r) ≤ ε_key.

At the schema level, a key is a constraint, but at the relation level, a key is a property derived from the relation. This corresponds to the distinction between a functional dependency constraint and the satisfaction of an FD in a relation (a property).

For the specific MovieDB database of the example above (that is, at the relation level), Title is a key for Movies, Name is a key for Theaters, and {Theater, Movie} is a key for Schedule. Movie is an approximate key for Schedule w.r.t. error threshold 1/3, and Time is an approximate key for Schedule w.r.t. a suitable error threshold.

The number of keys in a relation can be exponential in the size of the schema.

Theorem [LL]. The number of keys for a schema R such that |R| = n is at most the binomial coefficient

    C(n, ⌊n/2⌋) = O(2^n / √n)

and there exists a relation r over R such that there exist C(n, ⌊n/2⌋) keys that are satisfied in r.

Definition (Foreign key and approximate foreign key). Let R be a database schema, let R_1, R_2 ∈ R be relation schemas, and let K be a key of R_2. In addition, let d = {r_1, r_2, ..., r_n} be a database over R.

A foreign key constraint is a specification of a set of attributes X ⊆ R_1 and a key K of R_2. The set of attributes X is called a foreign key (FK) of R_1, and it is said to reference the key K of R_2. The foreign key constraint that X references K is satisfied in d if the following condition holds: for all tuples t_1 ∈ r_1, there exists t_2 ∈ r_2 such that t_1[X] = t_2[K]. A foreign key is an IND whose right-hand side is a key. It is also called a referential integrity constraint, or key-based inclusion dependency.

The error measure for foreign keys is thus the same as for INDs, g_3'. An approximate foreign key R_1[X] ⊆ R_2[K] is satisfied (or holds) in d w.r.t. error threshold ε if g_3'(R_1[X] ⊆ R_2[K], d) ≤ ε.

In the MovieDB example, the following foreign keys hold: Schedule[Movie] ⊆ Movies[Title], Theaters[Name] ⊆ Schedule[Theater], and Schedule[Theater] ⊆ Theaters[Name].

Review of literature

First of all, we present the levelwise algorithm, used in a wide variety of data mining applications, from association rules to functional and inclusion dependencies. Then we review the methods used in functional dependency or key discovery, and finally the approaches for inclusion dependency or foreign key discovery.

Levelwise algorithm

One of the basic problems in knowledge discovery in databases (KDD) is to find all potentially interesting sentences from a dataset. Then a user can select the really interesting ones. For example, these interesting sentences can be association rules, or functional or inclusion dependencies. Mannila and Toivonen [MT] study thoroughly a breadth-first, or levelwise, algorithm, also called the generic data mining algorithm, for finding all potentially interesting sentences. Their paper includes a complexity analysis, as well as some applications, including functional and inclusion dependency discovery. Algorithms for discovering functional and inclusion dependencies within this framework are also described elsewhere [Bou]. The levelwise algorithm has been used, among other applications, for discovering association rules [AS, MTV], for discovering functional dependencies [HKPT, LPL, NC], and for discovering inclusion dependencies [MLP]. In this thesis, we use it to extract keys.

We now define more precisely the KDD framework of Mannila and Toivonen. Given a database d, define a language L for expressing properties of the data, and a selection predicate q. The selection predicate q evaluates whether a sentence φ ∈ L defines a potentially interesting property of d. The task is to find a theory of d with respect to L and q, i.e., the set Th(L, d, q) = {φ ∈ L | q(d, φ) is true}. Consider the example of inclusion dependency discovery: the language L consists of all inclusion dependencies, and the predicate q is simply the satisfaction predicate.

To apply the levelwise algorithm, we need to define a specialization relation ≼ between sentences, i.e., a partial order, which is also monotonic w.r.t. q. That is, if φ ≼ θ (i.e., θ is more specific than φ) and q(d, θ) is true, then q(d, φ) is also true. This means that if a sentence is potentially interesting w.r.t. q, then all more general sentences are also potentially interesting w.r.t. q.

In our example of IND discovery, the specialization relation is defined as follows: if φ = R[X] ⊆ S[Y] and θ = R'[X'] ⊆ S'[Y'], we have φ ≼ θ if R = R', S = S', and the sequences X and Y are contained in X' and Y' respectively. For instance, R[A] ⊆ S[D] ≼ R[A, B, C] ⊆ S[D, E, F].

Then the idea in the levelwise algorithm is to start from the most general sentences, and to try to generate and evaluate more and more specific sentences, but without evaluating sentences that we know from previous stages will not be interesting. This corresponds to pruning the search space. In the case of IND discovery, we are interested in INDs with the longest attribute sequences, so we start from INDs with small sequences, and step by step we add attributes until the longest ones are found. But when, for instance, R[A] ⊆ S[D] is not satisfied, we know that R[A, B, C] ⊆ S[D, E, F] cannot be satisfied, so we do not evaluate it. For some languages, such as functional and inclusion dependencies, it is possible to compute the interesting sentence candidates at one level uniquely from the candidates and interesting sentences at the previous level, thus saving memory.

This framework, when applied as such to key discovery, searches for superkeys starting from the large ones and moving towards small ones, eventually finding the keys. But as keys tend to be of small size, this algorithm is highly inefficient. Another algorithm, also based on the principles of exploring a search space level by level, reusing the results of previous levels, and pruning the search space as soon as possible, has been proposed by Huhtala et al. [HKPT]. But unlike the framework described above, this algorithm does not rely on a specialization relation to direct the search. It is more suitable for FD/key discovery, since it searches for FDs from small to large. Therefore, we applied this algorithm to key discovery in this thesis. We describe informally in the figure below the levelwise algorithm, alongside the algorithm applied to key discovery.

Note that this algorithm also works for approximate inference: it suffices that the selection predicate q evaluates to true when a sentence almost holds.

To summarize, the main advantages of this algorithm are twofold: reducing the computation at each level by reusing the results of previous levels, and efficiently pruning the search space.

Levelwise algorithm for finding all potentially interesting sentences.

Input: a database d (consisting of only one relation r over R for the key discovery example), a language L with specialization relation ≼, and a selection predicate q.
Output: Th(L, d, q).

Method (generic algorithm, with the key discovery instance in brackets):

    level := 1
    C(1) := most general candidates in L w.r.t. ≼       [C(1) := {A | A ∈ R}]
    while C(level) ≠ ∅ do
        F(level) := {φ ∈ C(level) | q(d, φ)}            [Keys(level) := {X ∈ C(level) | |π_X(r)| = |r|}]
        C(level+1) := genNextCand(C(level), F(level))   [C(level+1) := {X | |X| = level+1 and
                                                         X is not a superset of a key in Keys(level)}]
        level := level + 1
    done
    output F(level) for each level                      [output Keys(level) for each level]

Figure: Levelwise algorithm.
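The following sketch instantiates the key discovery column of the figure for an in-memory relation (Python; our illustration, not the thesis implementation, which keeps the data in the DBMS). An approximate variant would replace the exactness test with g_3(X, r) ≤ ε_key:

    def levelwise_keys(rows, attributes):
        """Levelwise search for the minimal keys of one relation."""
        n = len(rows)
        keys = []
        candidates = [(a,) for a in sorted(attributes)]
        while candidates:
            non_keys = []
            for x in candidates:
                distinct = {tuple(row[a] for a in x) for row in rows}
                if len(distinct) == n:      # X identifies every tuple: X is a key
                    keys.append(x)
                else:
                    non_keys.append(x)
            # next level: join two non-keys sharing a prefix, pruning
            # any candidate that is a superset of an already-found key
            nxt = set()
            for i, x in enumerate(non_keys):
                for y in non_keys[i + 1:]:
                    if x[:-1] == y[:-1]:
                        z = x + (y[-1],)
                        if not any(set(k) <= set(z) for k in keys):
                            nxt.add(z)
            candidates = sorted(nxt)
        return keys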


Key and functional dependency discovery

The most efficient methods to determine approximate minimal functional dependencies, Tane [HKPT], DepMiner [LPL], and FUN [NC], each use an instance of the levelwise algorithm. All of these approaches work with a much smaller representation of a relation, called a stripped partition database, from which the FDs can be derived efficiently. A partition of a relation r under an attribute set X is a set of equivalence classes, where an equivalence class is the set of tuples in r sharing the same value for the attributes in X. A stripped partition is a partition in which the equivalence classes of size one have been removed. The idea underlying this removal is that when an equivalence class consists of a unique tuple, this tuple does not share the value of the considered attributes with any other tuple, and therefore this tuple cannot break any functional dependency. The new representation of r, the stripped partition database r̂, is the union of the stripped partitions for all attributes in the relation schema. Processing the data by using stripped partitions allows these methods to perform linearly w.r.t. the number of tuples in the relation. However, constructing the partitions in the first place requires special data structures, such as hash tables, in order to be done linearly w.r.t. the number of tuples |r|; or it can be performed by sorting the database, which is in O(|r| log |r|).
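As an illustration of the representation (with hash tables, as above; the code is ours), the stripped partition under one attribute can be built in a single pass:

    from collections import defaultdict

    def stripped_partition(rows, attr):
        """Equivalence classes of tuple indices sharing a value on attr,
        with singleton classes stripped out."""
        classes = defaultdict(list)
        for i, row in enumerate(rows):
            classes[row[attr]].append(i)
        return [c for c in classes.values() if len(c) > 1]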

The three methods cited above differ in their characterization of FDs, i.e., in how they generate FD candidates and evaluate them. We now describe each of these methods in more detail, using the database of the following example.

Example. Consider the database composed of the following relation r only, taken from [LPL]. The attribute names have been assigned a single letter for briefness, and will be designated by this letter in the rest of this section.

    Tuple No   A (empnum)   B (depnum)   C (year)   D (depname)        E (mgr)
    1          …            …            …          Biochemistry       …
    2          …            …            …          Admission          …
    3          …            …            …          Computer Science   …
    4          …            …            …          Computer Science   …
    5          …            …            …          Geophysics         …
    6          …            …            …          Biochemistry       …
    7          …            …            …          Admission          …

From now on, we represent a tuple by its tuple number (the value in the column Tuple No).

The partition of r under A contains one two-tuple equivalence class, the remaining classes being singletons; the stripped partition of r under A, ĉ_A, thus consists of that single two-tuple class. Similarly, we obtain ĉ_B, ĉ_D, and ĉ_E, each containing three equivalence classes (for instance, ĉ_D = {{1, 6}, {2, 7}, {3, 4}}, the groups of tuples sharing a depname value), and ĉ_C, containing a single class. The stripped partition database r̂ is the union of these stripped partitions.

Tane

The goal in Tane [HKPT] is to find all minimal approximate FDs holding in a relation. The algorithm searches for FDs of the form X∖{A} → A, by examining attribute sets X of increasing sizes (first sets of size 1, then of size 2, and so on), using an instance of the levelwise algorithm. For each attribute set X considered, it constructs a collection of potential right-hand sides C(X). (The algorithm actually uses a more powerful version of these sets C(X), with more pruning, but to give the idea of the algorithm we limit ourselves to the simple version.) An attribute A is in C(X) provided that A is not determined by any proper subset of X which does not contain A; this enforces the minimality of the FD. The set C(X) can be computed from the candidate sets of X's subsets, as the intersection of the sets C(X∖{A}) for all A ∈ X. Then the algorithm tests the potential FDs X∖{A} → A for all A ∈ X ∩ C(X). When the FD is satisfied, it prunes A from the set C(X), to preserve the minimality of future FDs. When all attribute sets X of size s have been processed, the algorithm generates the attribute sets of size s+1 to be considered.

In Tane, testing FDs is based on partition refinement: the dependency X∖{A} → A holds if the partition π_{X∖{A}} refines π_{A}, i.e., if every equivalence class in π_{X∖{A}} is a subset of some equivalence class in π_{A}. Practically, the authors use the equation e(X∖{A}) = e(X) to test whether the FD is satisfied, where e(X) is the difference between the sum of the sizes of the equivalence classes in the stripped partition ĉ_X and the number of such equivalence classes. This equation allows them to deal with stripped partitions, and to limit the amount of memory used by the algorithm, since they exploit results from the previous level only.

On the database of the example, the algorithm would unfold as shown in the figure below.

    X     C(X)              ĉ_X              e(X)   FDs X∖{A} → A tested
    ∅     {A, B, C, D, E}
    A     {A, B, C, D, E}   {{…}}            …      ∅ → A
    B     {A, B, C, D, E}   {{…}, {…}, {…}}  …      ∅ → B
    C     {A, B, C, D, E}   {{…}}            …      ∅ → C
    D     {A, B, C, D, E}   {{…}, {…}, {…}}  …      ∅ → D
    E     {A, B, C, D, E}   {{…}, {…}, {…}}  …      ∅ → E
    AB    {A, B, C, D, E}   …                …      A → B, B → A
    AC    {A, B, C, D, E}   …                …      A → C, C → A
    BD    {A, B, C, D, E}   {{…}, {…}, {…}}  …      B → D, D → B
    ABC   {A, B, C, D, E}   …                …      AB → C, AC → B, BC → A

Figure: Unfolding of Tane on the database of the example.

DepMiner

In DepMiner [LPL], the underlying idea is based on the concept of agree set, which groups all the attributes having the same value for a given pair of tuples. From these sets, the authors derive maximal sets: the maximal sets for some attribute A are the largest possible sets of attributes not determining A. Then, from the complements of these maximal sets, they derive the left-hand sides of FDs, using a levelwise algorithm: for each attribute A, it searches for left-hand sides X such that r ⊨ X → A, by increasing the size of X. The only step that requires accessing the database, or rather the stripped partition database, is the computation of agree sets.

The authors avoid computing agree sets for all pairs of tuples by limiting themselves to the tuples within MC, the set of maximal equivalence classes of the stripped partition database: MC = max_⊆ {c ∈ π̂ | π̂ ∈ r̂}; in our example, MC contains four equivalence classes. An attribute A is included in the agree set ag(t_1, t_2) of tuples t_1 and t_2 if t_1 and t_2 belong to the same equivalence class in the stripped partition ĉ_A. In our example, the agree set of one pair of tuples is {A}; similarly, the other pairs yield the agree sets {B, D, E}, {E}, and {C, E}, so the agree sets of r are ag(r) = {{A}, {B, D, E}, {E}, {C, E}}.

The authors derive the maximal sets from the agree sets as follows: for an attribute A, the maximal set max(A, r) = max_⊆ {X ∈ ag(r) | A ∉ X, X ≠ ∅}. In our example, we have max(A, r) = {{B, D, E}, {C, E}}, and the complement of the maximal set of A is cmax(A, r) = {{A, C}, {A, B, D}}.

Finally, they derive the left-hand sides of FDs from these complements of maximal sets. The algorithm is based on the characterization of the left-hand sides of FDs of the form LHS → A as the set of minimal transversals of the simple hypergraph cmax(A, r). (A collection H of subsets of R is a simple hypergraph if X ∈ H implies X ≠ ∅, and X, Y ∈ H and X ⊆ Y imply X = Y. A transversal T of H is a subset of R such that T ∩ E ≠ ∅ for every E ∈ H.) For instance, we have cmax(A, r) = {{A, C}, {A, B, D}}; the set {B, C} is a transversal of cmax(A, r), since {B, C} ∩ {A, C} ≠ ∅ and {B, C} ∩ {A, B, D} ≠ ∅. In a levelwise manner, the algorithm proceeds by increasing sizes of candidate transversals. If a candidate T of size s is a transversal, then no superset of T is allowed among the candidates of size s+1. The candidates of size 1 are the attributes appearing in cmax(A, r).

The authors found their approach to outperform Tane in all data configurations tested in the experiments.

On the example, the algorithm unfolds as shown in the figure below.

    RHS   cmax(RHS, r)        Size-1 candidates   Size-1 LHS   Size-2 candidates        Size-2 LHS
    A     {AC, ABD}           A, B, C, D          A            BC, BD, CD               BC, CD
    B     {BCDE, ABD, ABCD}   A, B, C, D, E       B, D         AC, AE, CE               AC, AE
    C     {BCDE, AC, ABCD}    A, B, C, D, E       C            AB, AD, AE, BD, BE, DE   AB, AD, AE
    D     {BCDE, ABD, ABCD}   A, B, C, D, E       B, D         AC, AE, CE               AC, AE
    E     {BCDE}              B, C, D, E          B, C, D, E

Figure: Unfolding of DepMiner on the database of the example (the LHS columns list the transversals found at each size).

So the nontrivial FDs satisfied in r are the following: BC → A, CD → A, D → B, AC → B, AE → B, AB → C, AD → C, AE → C, B → D, AC → D, AE → D, B → E, C → E, and D → E.

FUN

In FUN [NC], the authors describe their approach at a general level only, without detailing the optimizations due to the stripped partition database representation, even though they use this representation in their implementation.

Their characterization of FDs is based on the concept of free sets. A free set X is a set of attributes such that removing an attribute from X decreases the number of X's distinct values. (Formally, X is a free set if there is no Y ⊂ X such that |π_Y(r)| = |π_X(r)|.) Free sets correspond to left-hand sides of FDs.

To characterize the right-hand sides of FDs, they define the closure and quasi-closure of an attribute set X. The closure of X, denoted by X⁺, is the union of the attributes in X and the additional attributes which can be added to X without increasing the number of X's distinct values: X⁺ = X ∪ {A ∈ R ∖ X : |π_{XA}(r)| = |π_X(r)|}. The quasi-closure of X, denoted X°, is the union of X and the closures of X's maximal subsets. While all attributes in X⁺ are determined by X, only those not in X's quasi-closure yield minimal FDs. In summary, the minimal FDs satisfied in r are the FDs of the form X → A, where X is a free set and A ∈ X⁺ ∖ X°. This approach relies heavily on counting the number of distinct values of attribute sets, which requires either a sort of the data or special data structures; but while the latter solution is used by the authors, they do not detail how they perform the counting operation efficiently.

In a levelwise manner, the algorithm searches for free sets of increasing sizes. At the level corresponding to FDs with a left-hand side of size s, the algorithm knows from the previous level the free sets of size s and their quasi-closures, as well as the collection of candidate free sets of size s+1. It first computes the closure of the free sets of size s, and displays the FDs of the form X → A, where X is a free set of size s and A ∈ X⁺ ∖ X°. Then it computes the quasi-closure of the candidate free sets of size s+1, using the closures of the free sets of size s. Then it prunes the candidate free sets X of size s+1 that are not free sets, based on the number of distinct values of X and of its maximal subsets that are free sets. (These two steps, computing quasi-closures and pruning candidate free sets, are presented in this order in the paper. However, it seems more efficient to perform them in the reverse order, in order to avoid computing the quasi-closures of candidate free sets that are not free sets.) Finally, it generates the candidate free sets of size s+2 from the free sets of size s+1. The authors found that their approach outperforms Tane in all configurations investigated, which they explain by the fact that the number of FDs tested in their approach is less than in Tane.

On the example, the algorithm would unfold as shown in the figure below.

    X    |π_X(r)|   X°     X⁺      Minimal FDs X → A
    A    …          A      A       –
    B    …          B      BDE     B → DE
    C    …          C      CE      C → E
    D    …          D      BDE     D → BE
    E    …          E      E       –
    AB   key        ABDE   ABCDE   AB → C
    AC   key        ACE    ABCDE   AC → BD
    AD   key        ABDE   ABCDE   AD → C
    AE   key        AE     ABCDE   AE → BCD
    BC   key        BCDE   ABCDE   BC → A
    BD   not a free set
    BE   not a free set
    CD   key        BCDE   ABCDE   CD → A
    CE   not a free set
    DE   not a free set

Figure: Unfolding of FUN on the database of the example.

Key discovery

The approaches focusing only on key discovery [KA, SZWA] also use the levelwise algorithm. Shen et al. [SZWA] implemented an algorithm with the data being stored in a DBMS, thus querying the database to test the validity of key candidates. However, they do not provide the details of the algorithm they implemented. Knobbe and Adriaans [KA] search for keys of increasing sizes. An attribute set X is a key if the number of distinct values of X is equal to the number of tuples in the relation. When X is not a key, they record the set Black(r, X) of tuples having duplicate X values. Then, when they test whether XA is a key, they only need to sort the tuples in Black(r, X) w.r.t. XA, and check that they all have different XA values. In such a case, XA is a key; otherwise they record Black(r, XA) and proceed to the next level of the algorithm. As the sets Black(r, X) can have a size in O(|r|), the amount of memory required for these structures makes their method impractical for large databases.

Our approach for key discovery is adapted from Tane. We dropped the right-hand-side candidate sets, and perform key testing via database queries, instead of using the stripped partition database. But the rest is similar to Tane: we search for keys in a levelwise manner, by increasing size. At each level of the algorithm, we test the key candidates, then prune the search space, and then generate the candidates for the next level.

In this thesis, we do not need elaborate data structures, since we keep the data inside the DBMS. Thus, even though our algorithms may not be as efficient w.r.t. the number of tuples as those of FD discovery presented above, the amount of memory used by the algorithm is kept very low.
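For instance, a key candidate and its error can be tested with two counting queries. A minimal sketch using SQLite (illustrative only; the thesis works against a full DBMS, and the identifiers interpolated here are assumed trusted):

    import sqlite3

    def key_error(conn, table, attrs):
        """g3 error of a key candidate: fraction of rows with duplicate attrs values."""
        cols = ", ".join(attrs)
        total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        distinct = conn.execute(
            f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table})"
        ).fetchone()[0]
        return 1 - distinct / total

Here, attrs is a key of table exactly when key_error(...) = 0, and an approximate key w.r.t. ε_key when key_error(...) ≤ ε_key.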

Foreign key and inclusion dependency discovery

Unlike functional dependencies, extracting inclusion dependencies or foreign keys has raised little interest in the database community. De Marchi et al. [MLP] explain this disregard by the complexity of the problem (see the complexity results earlier in this chapter) and by a lack of popularity.

Due to the huge number of possible INDs [PT], it is impractical to enumerate all inclusion dependency candidates and try to evaluate them on a database. Existing work focuses on how to prune the number of inclusion dependency candidates considered.

One approach [LPT, PT] consists of selecting a much reduced initial set of inclusion dependency candidates, by considering only the attributes appearing together in equijoins from a workload of query statements. Thus, this approach requires such a workload of queries to the database. The authors also claim that this criterion allows them to find only interesting dependencies, since they were used in practice. But their method is not complete: they determine only a subset of all dependencies, which depends on some past queries to the database.

Another approach is to start with unary inclusion dependencies. Indeed, there are only a polynomial number of them (quadratic in the number n of attributes) in a database with n attributes. So the complexity of extracting the unary INDs, for a database with n attributes and p tuples, is polynomial with respect to both parameters: it is O(n² · p log p) [MR, KMRS]. Moreover, unary INDs are a necessary condition for non-unary, or composite, INDs: for R[AB] ⊆ S[CD] to hold, we must have R[A] ⊆ S[C] and R[B] ⊆ S[D]. We chose this approach because it does not rely on any prior knowledge of the database.

The problem of efficiently extracting unary inclusion dependencies has been studied in two different contexts: (1) all data processing is done via database queries [BB]; or (2) a new representation of the data is generated, and the extracting step deals only with this new representation [MLP].

In the first context, Bell and Brockhausen [BB] optimize their algorithms to minimize the number of accesses to the database. They distinguish two generic types, number and string, and determine UINDs for attributes of each type independently. To restrict the number of UIND candidates, they use value restrictions, i.e., upper and lower bounds on the attribute domains, as follows: the bounds of a left-hand side must lie in the interval made by the bounds of the right-hand side. When testing the UIND candidates against the database, they exploit the transitivity of UINDs to reduce the number of queries, using the two following rules:

    A_i ⊆ A_j and A_k ⊈ A_j ⟹ A_k ⊈ A_i
    A_i ⊆ A_j and A_i ⊈ A_k ⟹ A_j ⊈ A_k

Depending on the ordering of the attributes, the transitivity properties can save more or fewer accesses to the database; but as long as there is at least one valid UIND in the database, their algorithm saves at least one database query.

In the second context, De Marchi et al. [MLP] also determine UINDs for each data type independently. For a given data type t, they build a binary relation associating, with each value of type t appearing in the database, the attributes in which it appears. Thus, the amount of memory required is proportional to the number of distinct values in the database. Note that all these binary relations can be generated in one pass over the database. Then they deduce the UINDs from a binary relation B as follows: if A is included in C, then each value taken by A must be associated in the binary relation with C. Therefore, the right-hand sides of A are given by the intersection, for each value v of A, of the attribute sets associated in B with v. Here again, computing the intersections for all attributes with type t can be achieved in one pass over the binary relation for the type t.
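A sketch of this construction for one data type (Python, in memory; it assumes, as discussed below, that the binary relation fits in main memory, and the names are ours):

    from collections import defaultdict
    from functools import reduce

    def unary_inds(columns):
        """columns: attribute name -> list of values, all of one data type.
        Returns, for each attribute A, the attributes B such that A <= B."""
        attrs_of = defaultdict(set)          # binary relation: value -> attributes
        for attr, values in columns.items():
            for v in values:
                attrs_of[v].add(attr)
        rhs = {}
        for attr, values in columns.items():
            # right-hand sides of A: attributes associated with every value of A
            common = reduce(lambda acc, v: acc & attrs_of[v], set(values), set(columns))
            rhs[attr] = common - {attr}
        return rhs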

De Marchi et al. compared both approaches on six synthetic databases of small or medium size, so the binary relations fit in main memory. Their approach was found to outperform Bell and Brockhausen's. However, the binary relations may not always fit in memory.

We now consider how the knowledge of UINDs can be used to find foreign keys or general inclusion dependencies.

Knobbe and Adriaans [KA] propose to use UINDs to identify foreign keys. They first extract the keys, then the UINDs, then form the FK candidates, and then test these candidates against the database. They use key-based UINDs to form the foreign key candidates F_1, ..., F_k for a given key K_1, ..., K_k, by taking each F_i in the left-hand sides of the UINDs having K_i as a right-hand side. When there are several FK candidates, they suggest the use of heuristics to choose among them.
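A sketch of this candidate formation (our illustration; uind_lhs_by_rhs maps each key attribute K_i to the set of left-hand sides of the UINDs referencing it). Following the foreign key definition, the full method would additionally require all the F_i of one candidate to belong to the same relation:

    from itertools import product

    def fk_candidates(key, uind_lhs_by_rhs):
        """Composite FK candidates (F_1, ..., F_k) for a key (K_1, ..., K_k):
        each F_i is drawn from the LHSs of the UINDs whose RHS is K_i."""
        pools = [uind_lhs_by_rhs.get(k, set()) for k in key]
        for combo in product(*pools):
            if len(set(combo)) == len(combo):  # attributes within a candidate differ
                yield combo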

After having extracted UINDs by using binary relations as explained above (instead of database queries), De Marchi et al. [MLP] use an instance of the levelwise algorithm to determine the INDs, in which the testing of INDs is performed via database queries. This instance of the levelwise algorithm devoted to INDs was described by Mannila and Toivonen [MT] and Boulicaut [Bou], but De Marchi et al. detail in their paper [MLP] how to generate the IND candidates of size i+1 from the INDs of size i.

Chapter 3

Algorithms and complexity

We present in this chapter a framework for discovering approximate keys and foreign keys holding in a database instance of which we assume no previous knowledge. The problem tackled can be divided into two main modules: (1) identifying the approximate keys, and (2) using the keys found to extract the approximate foreign keys. We also analyze the complexities of the algorithms proposed.

For simplicity reasons, we implemented the algorithms on top of a DBMS, but our framework could easily be adapted to use direct processing of the data.

General architecture

Our general architecture for discovering the approximate keys and foreign keys from a database of which we assume no prior knowledge is outlined in Algorithm 1 below. The inputs of the algorithm are a database d and the approximation thresholds for keys and foreign keys. The algorithm outputs the keys and foreign keys which almost hold in d w.r.t. the approximation thresholds.

Algorithm 1: General extraction of keys and foreign keys

Input: a database d; the maximum approximation degrees for keys and foreign keys, ε_key and ε_fk respectively; and the pruning parameter PgParam, i.e., the fraction of tuples involved in the pruning pass.
Output: the most interesting keys, and their associated foreign keys, holding in d.

    // Preliminary step
    Extract statistics about the database
    foreach table tab in the database do
        size := 1
        while StoppingCriterion(tab, size) = false do
            // Find all the keys of size size
            Keys(size) := ExtractKeys(tab, size, ε_key)
            foreach key ∈ Keys(size) do
                // Find the foreign keys and their degree of approximation
                key.ForeignKeys := ExtractForeignKeys(key, ε_fk, PgParam)
            end
            Sort Keys(size) by interestingness and merge it with the sorted key list of tab
            size := size + 1
        end
        Output the key list of tab, including for each key its foreign keys
    end


As a preliminary step, we extract the following statistics about the database: the number of tuples of each table, and the type and number of distinct values of each attribute. These statistics will be used to optimize the different steps of the general algorithm.

Then we process each table in the database in turn. For a given table, we determine the integrity constraints by increasing size: first the keys and foreign keys (FKs) of size 1, then the keys and FKs of size 2, working towards larger sizes. To identify the integrity constraints of a given size s, we first extract the keys of size s, as detailed in Algorithm ExtractKeys, and then process them in turn to find their FKs, as detailed in Algorithm ExtractForeignKeys.

A complete algorithm, finding all the keys and foreign keys holding in the database, would process each table until all its keys and respective FKs are found. In such a case, the processing of a table would only stop when the maximal key size (the number of attributes of the relation) has been reached, or when there are no more candidates. So the stopping criterion of the general algorithm (Algorithm 1) evaluates to true if and only if size is equal to the number of attributes of tab, or there is no key candidate.

But in most cases we are only interested in a small subset of all integrity con

straints ICs Typically the most useful keys are referenced bymany FKs have

small size and approximation degree and their FKs also have a small appro ximation

degree Dep ending on the application for whichkeys and foreign keys are discovered

wemaywanttotakeinto account the numb er of dierent relations containing foreign

keys referencing a given key rather than just the numb er of foreign keys Indeed we


may prefer a key with two foreign keys in different relations to a key whose foreign keys all lie in the same relation. This criterion would favor integrity constraints that connect the different relations of a database, which would be useful in schema mapping [MHH].

These different criteria defining the interestingness of ICs are not independent. For example, there is a tradeoff between the number of FKs and the size, so we may prefer a larger key referenced by many FKs to a smaller key with few FKs. Similarly, we may prefer a larger key holding exactly in the dataset to a smaller approximate key.

Thus, in order to achieve the best ordering of the ICs, we need to extract them all, and only then can we order them by degree of interestingness. But finding them all when we only want a small subset of them is highly inefficient. In addition, the best ordering is very subjective and is hard to define accurately independently of the set of elements we want to order. So we decided to implement a heuristic algorithm that would find a good enough subset with a reasonable amount of work. We chose to mimic the complete algorithm on a smaller scale: we extract all keys and FKs of size bounded by a constant. This constraint reflects the fact that, in practice, for efficiency reasons, referenced keys tend to be small; indeed, otherwise large parts of the data are duplicated. Then, because there may be many ICs within the bound chosen, we want to highlight the best of these. So we propose an order relation on integrity constraints which reflects the general usefulness of such constraints as keys and foreign keys. This ordering was used by Shen et al. [SZWA].

Definition (Integrity constraint ordering by interestingness). The elements for the ordering are pairs formed of an approximate key K and the set FKset (possibly empty) of approximate foreign keys referencing it. We denote by approx(K) (resp. approx(F)) the approximation degree of a key K (resp. foreign key F). To compare two sets of FKs, we define the average approximation degree of a FK set FKset, denoted by average_approx(FKset), as the average of the approximation degrees of all FKs included in FKset: average_approx(FKset) = avg{approx(F) | F ∈ FKset}.

Consider elements e1 = (K1, FKset1) and e2 = (K2, FKset2). We define the ordering between them according to the number of FKs attached to the keys, the sizes of the keys, the approximation degrees of the keys, and the average approximation degrees of the foreign keys, as follows. e1 precedes e2:

    if |FKset1| > |FKset2|;
    else if |FKset1| = |FKset2| and size(K1) < size(K2);
    else if |FKset1| = |FKset2| and size(K1) = size(K2) and approx(K1) < approx(K2);
    else if |FKset1| = |FKset2| and size(K1) = size(K2) and approx(K1) = approx(K2) and average_approx(FKset1) < average_approx(FKset2).
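To make this ordering concrete, here is a minimal sketch in Python (the thesis implementation was in C with embedded SQL; the record types and field names below are illustrative assumptions, not the thesis code). Sorting with this key lists the most interesting (key, FK set) pairs first.

import collections

# Illustrative record types: a key with its size and approximation degree,
# and a foreign key with its approximation degree.
Key = collections.namedtuple("Key", ["name", "size", "approx"])
FK = collections.namedtuple("FK", ["name", "approx"])

def interestingness(element):
    """Sort key for a (key, fk_set) pair; Python's ascending sort then
    places the most interesting elements first."""
    key, fk_set = element
    avg_approx = sum(f.approx for f in fk_set) / len(fk_set) if fk_set else 0.0
    # Priority order of the definition: more FKs first, then smaller key size,
    # smaller key approximation degree, smaller average FK approximation degree.
    return (-len(fk_set), key.size, key.approx, avg_approx)

elements = [
    (Key("AB", 2, 0.0), [FK("CD", 0.01)]),
    (Key("E", 1, 0.02), [FK("F", 0.0), FK("G", 0.05)]),
]
elements.sort(key=interestingness)  # the key with two FKs ranks first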

For their tests, Shen et al. [SZWA] fixed an upper bound on the size; we chose a smaller upper bound, arbitrarily, because our test databases have a smaller schema than theirs.

To adapt the general architecture so that we only determine the most interesting (in the sense of the definition above) integrity constraints of bounded size, we only have to change the stopping criterion and add a possible pruning step after the sorting step. This pruning step would be used to limit the number of integrity constraints discovered at each stage of the algorithm, by removing the least interesting of them. For example, if we want to determine primary keys, as did Shen et al. [SZWA], we can prune all ICs but one (the best) at each step; when there are several best, we can choose one randomly. Finally, the stopping criterion evaluates to true when the number and quality of the integrity constraints discovered up to that point match the specifications of the user, or when the size of the keys/FKs mined has reached an upper bound (in our experiments, the upper bound is fixed).

Finding the keys

We want to extract the approximate keys of a given size holding in the database. The inputs of the algorithm are a relation r, the key size, and the approximation threshold ε_key. The algorithm outputs all approximate keys having the specified size that hold in r w.r.t. approximation threshold ε_key.

In the general algorithm, we search for keys by increasing sizes, starting with keys of size 1 and going towards larger sizes. Identifying the keys of a given size is done via a call to the function ExtractKeys, shown below. As soon as a key is found, we prune the search space so that we do not extract superkeys: if X is a key, then we discard from the search space all supersets of X.

Since keys are just a special case of functional dependencies, we adapted Tane's architecture [HKPT] to our problem of finding the minimal keys, and to the fact that, unlike in Tane, we keep the data inside the database management system, which limits how we can process it. The resulting method is shown in the ExtractKeys algorithm below.

For each size, the key candidates are stored in CandKey[size]. We test them against the database and insert the candidates that were found to be keys in Key[size]. The rest of the candidates are inserted in CandNotKey[size]. We then compute from CandNotKey[size] the set CandKey[size + 1] of key candidates for the next level of


Algorithm: ExtractKeys
Input: a relation r with n attributes A1, ..., An; size ≤ n, the number of attributes of the keys to extract; the approximation threshold for keys, ε_key
Output: the set Keys[size] containing all the keys with size attributes of r

    if size = 1 then
        /* Initialization */
        CandKey[1] ← {A1, ..., An}
        Key[1] ← {A ∈ CandKey[1] : |A| ≥ (1 − ε_key)·|r|}
        Keys[1] ← Key[1]
        CandNotKey[1] ← CandKey[1] − Key[1]
        CandKey[2] ← GenerateNextCand(CandNotKey[1])
    else
        /* CandKey[size] has been computed by ExtractKeys(r, size − 1, ε_key) */
        foreach X = X1...Xsize in CandKey[size] do
            /* card(X) has been initialized when generating the candidate X */
            if card(X) ≥ (1 − ε_key)·|r| then
                card(X) ← SELECT COUNT(*)
                          FROM (SELECT DISTINCT X1, ..., Xsize FROM r)
                if card(X) ≥ (1 − ε_key)·|r| then
                    /* X is an approximate key */
                    Key[size] ← Key[size] ∪ {X}
                end
            end
        end
        CandNotKey[size] ← CandKey[size] − Key[size]
        CandKey[size + 1] ← GenerateNextCand(CandNotKey[size])
    end
    Output Keys[size]


the algorithm, at size + 1.

The initial key candidates, of size 1, are the attributes of the relation. We treat size 1 as a special case because we do not need to query the database to determine the keys: we use the statistics gathered in the preliminary step.

Testing the key candidates

An attribute set X is an approximate key if it has no more than ε_key·|r| duplicate values, where ε_key is the approximation threshold and |r| the number of rows of the relation, or, equivalently, if it has at least (1 − ε_key)·|r| distinct values. So, to test this inequality, we query the database to retrieve X's number of distinct values, stored in card(X) (for cardinality). In some cases, we are able to determine that X cannot be a key without querying the database: when an upper bound on X's cardinality does not pass the test. This upper bound is computed when generating the candidate X (see the next section) and is used as the initial value of card(X). When X does not pass the test, card(X) keeps its initial value, i.e., the upper bound on X's cardinality, but when X does pass the test, card(X) becomes X's real cardinality. The upper bound is based on the cardinalities of X's maximal subsets: max{|A|·card(X∖{A}) | A ∈ X}.

In Tane [HKPT], the use of bounds to avoid accessing the data is limited to potential FDs, because these are the focus.
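As an illustration, the following is a minimal sketch of this test, assuming a Python connection to the DBMS (sqlite3 here, purely for the example) and hypothetical names; the thesis implementation issued the equivalent embedded SQL from C.

import sqlite3

def is_approx_key(conn, table, attrs, n_rows, eps_key, card_upper_bound):
    """Test whether the attribute set attrs is an approximate key of table,
    i.e., has at least (1 - eps_key) * n_rows distinct values."""
    threshold = (1 - eps_key) * n_rows
    if card_upper_bound < threshold:
        # The bound computed at generation time already fails the test:
        # X cannot be a key, and no database query is needed.
        return False, card_upper_bound
    cols = ", ".join(attrs)
    card = conn.execute(
        "SELECT COUNT(*) FROM (SELECT DISTINCT %s FROM %s)" % (cols, table)
    ).fetchone()[0]
    return card >= threshold, card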

Generating key candidates

First of all, we show that we can compute the key candidates at iteration size + 1 only from CandNotKey[size], the set of key candidates that were found not to be keys at iteration size.


Definition. An attribute set X is a key candidate if no proper subset of X is a key (otherwise X would not be a minimal key but a superkey):

CandKey[s] = {X s.t. |X| = s and ∀Y ⊂ X, Y is not a key}.

This definition is the basis for the following proposition.

Proposition. An attribute set X of size s is a key candidate if all its subsets of size s − 1 belong to CandNotKey[s − 1].

We adapted the function which generates the functional dependency candidates for the next level from Tane [HKPT] to our particular case of keys.

In the function GenerateNextCand, we generate the key candidates of size s + 1 in two steps: first, we construct attribute sets of size s + 1, as detailed below; second, we check that all their subsets of size s are included in CandNotKey[s].

In the first step, rather than generating all possible sets of size s + 1, we start from attribute sets in CandNotKey[s] and combine them to form sets of size s + 1. Thus part of the subset tests are already verified, and therefore it is more likely that the generated attribute sets will be key candidates. To combine two sets Y, Z of size s in order to form a set X of size s + 1, it is necessary that Y and Z share exactly s − 1 attributes. Consider X = X1...X(s+1), where each Xi is a single attribute, to be sorted in increasing order, i.e., X1 < ... < X(s+1) for some ordering, say the position of the attributes in the database table corresponding to r (i.e., Xi < Xj if Xi is to the left of Xj). The set X is a key candidate if all its subsets of size s belong to CandNotKey[s], and in particular the subsets X1...X(s−1)Xs and X1...X(s−1)X(s+1). Therefore, we chose to construct a set X by combining two sets Y, Z of CandNotKey[s] that share a prefix of size s − 1, such that X is the union of Y and Z and X's attributes are ordered. This is done in the GenerateNextCand algorithm below: given that Y = X1...X(s−1)Ys and Z = X1...X(s−1)Zs are ordered,


Algorithm: GenerateNextCand
/* For simplicity, we denote here by s the size of keys; s corresponds to size in ExtractKeys */

    CandKey[s + 1] ← ∅
    foreach PrefSet in PrefixCands(CandNotKey[s]) do
        foreach Y = X1...X(s−1)Ys, Z = X1...X(s−1)Zs ∈ PrefSet, with Ys < Zs do
            X = X1...X(s−1)YsZs ← Y ∪ Z
            card(X) ← max(|Zs|·card(Y), |Ys|·card(Z))
            foreach A ∈ X, A ≠ Ys, A ≠ Zs do
                if X∖{A} ∈ CandNotKey[s] then
                    card(X) ← max(card(X), |A|·card(X∖{A}))
                else
                    card(X) ← 0
                    exit the for loop
                end
            end
            if card(X) ≠ 0 then
                CandKey[s + 1] ← CandKey[s + 1] ∪ {X}
            end
        end
    end


i.e., X1 < ... < X(s−1) < Ys, Zs, and given that Ys < Zs, then X = X1...X(s−1)YsZs is ordered as well, by construction. Since attribute sets of size 1 are ordered, by induction all constructed key candidates are ordered. This property is used in the function PrefixCands, which partitions CandNotKey[s] into sets of attribute sets sharing a same prefix of size s − 1. Since the key candidates in CandNotKey[s] are ordered sets of attributes, we can sort CandNotKey[s] in lexicographic order: for X, Y ∈ CandNotKey[s], X = X1...Xs, Y = Y1...Ys, we have X < Y if and only if Xi < Yi at the first position i ≤ s where they differ. Then we group the key candidates by prefix of length s − 1 in one pass over the sorted set CandNotKey[s]. The function PrefixCands is also adapted from Tane [HKPT].

When generating a key candidate X, we also compute an upper bound on its number of distinct values, stored in card(X), which is used when testing whether X is a key (see the previous section). This upper bound is based on the results of the previous level: it is the maximum among all |A|·card(X∖{A}) for A ∈ X. A sketch of this generation step, bounds included, follows.
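The following is a minimal Python sketch of the generation step, under the stated conventions: candidates are ordered tuples of attribute names, and card maps candidates (and single attributes, as 1-tuples) to their cardinalities or bounds. The function mirrors GenerateNextCand, but it is an illustration, not the thesis implementation.

from itertools import groupby

def generate_next_cand(cand_not_key, card):
    """Build the key candidates of size s+1 from the non-keys of size s,
    together with an upper bound on their number of distinct values."""
    not_key = set(cand_not_key)
    next_cands = {}
    # PrefixCands: group the lexicographically sorted candidates by their
    # prefix of size s-1, in one pass over the sorted set.
    for _, pref_set in groupby(sorted(cand_not_key), key=lambda c: c[:-1]):
        pref_set = list(pref_set)
        for i, y in enumerate(pref_set):
            for z in pref_set[i + 1:]:       # Ys < Zs holds by the sort order
                x = y + (z[-1],)             # X = X1..X(s-1) Ys Zs, still ordered
                bound = max(card[(z[-1],)] * card[y], card[(y[-1],)] * card[z])
                is_cand = True
                for a in x[:-2]:             # check the remaining subsets of size s
                    sub = tuple(b for b in x if b != a)
                    if sub not in not_key:
                        is_cand = False      # some subset is a key or was pruned
                        break
                    bound = max(bound, card[(a,)] * card[sub])
                if is_cand:
                    next_cands[x] = bound
    return next_cands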

Finding the foreign keys

We extract the foreign keys referencing a given key Key = K1...Kk in three steps: (1) we generate foreign key candidates; (2) we make a pruning pass over these candidates on a small set of sample rows from the database; and (3) we finally test the remaining candidates against the database.

In all the literature we are aware of, the final test follows the candidate generation directly, without any pruning.

We now describe the three steps in more detail. The steps are summarized in the ExtractForeignKeys algorithm below.


Algorithm: ExtractForeignKeys
Input: a database d; the key Key = K1...Kk for which we want to extract foreign keys; the approximation threshold for foreign keys, ε_fk
Output: the foreign keys of Key

    FKset ← ∅
    foreach relation r_F in d do
        /* Step 1: generate the foreign key candidates */
        CollectionUINDs ← ExtractUINDs(Key, r_F, ε_fk)
        FKCands(Key) ← GenerateFKCands(Key, CollectionUINDs)
        foreach F = F1...Fk ∈ FKCands(Key) do
            /* Select the number of distinct values of F */
            card(F) ← SELECT COUNT(DISTINCT F1, ..., Fk) FROM r_F
            /* Step 2: pruning pass on a sample of the database */
            if passedSampleTest(F, Key) then
                /* Step 3: real test */
                card(F ⋈ Key) ← SELECT COUNT(*) FROM
                                    (SELECT DISTINCT F1, ..., Fk
                                     FROM r_K, r_F
                                     WHERE r_K.K1 = r_F.F1 AND ... AND r_K.Kk = r_F.Fk)
                if card(F) − card(F ⋈ Key) ≤ ε_fk · card(F) then
                    FKset ← FKset ∪ {F ⊆ Key}
                end
            end
        end
    end
    Output FKset


Generating foreign key candidates

It is impractical to enumerate all sets of size k as potential foreign keys, since there are (n choose k) of them, where n is the number of attributes. A commonly used approach to form foreign key candidates is based on previously extracted unary inclusion dependencies [KA, KMRS]. Another approach is described by Petit et al. [PTBK], who tackle the more general problem of discovering inclusion dependencies: the inclusion right- and left-hand side candidates are generated from sets of attributes appearing together in equijoins within a workload of SQL queries over the database considered. This approach therefore requires prior knowledge of SQL queries over the database. Because we assume no prior knowledge of the database, we apply the former method, using unary inclusion dependencies. While Knobbe and Adriaans [KA] propose determining all unary inclusion dependencies, making this a preliminary step of the general algorithm (together with the gathering of statistics), we limit ourselves to key-based UINDs. Therefore, we have to identify the keys first, so we incorporate the discovery of UINDs into the algorithm for extracting the foreign keys. We describe the discovery of UINDs in a later section.

Using unary inclusion dependencies to form foreign key candidates

Let us start with an intuitive example.

Example. Consider a database with two relations r1(A, B) and r2(C, D, E). Assume that AB is a key of r1 and that the following UINDs hold: C ⊆ A, D ⊆ A, and E ⊆ B. Then the foreign key candidates for AB are CE and DE. But if the inclusion E ⊆ B does not hold anymore, then AB has no foreign key candidate. Or, if E belongs to another relation r3, then AB has no foreign key candidates.

Definition. A foreign key candidate of key K = K1...Kk is composed of k attributes F1...Fk such that:

1. all Fi belong to the same relation;
2. each of them is included in the corresponding attribute of the key: ∀i, Fi ⊆ Ki;
3. the foreign key candidate is not equal to the key: F1...Fk ≠ K1...Kk, even though some of the Fi may be equal to some Kj.

To address the first condition, we process one relation at a time: we extract key-based UINDs from relation r, then use them to generate FK candidates, and then proceed to the next relation. Generating FK candidates is straightforward from the definition: we create all possible combinations of k attributes F1...Fk where Fi ⊆ Ki and F1...Fk ≠ K1...Kk. If there exists an attribute Ki such that no attribute in r references Ki, then the key has no foreign key candidate within the relation r. A sketch of this generation step follows.
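This is a minimal sketch of the candidate generation, assuming UINDs are given as (left-hand side, right-hand side) pairs over one relation; the names are illustrative, not the thesis code.

from itertools import product

def generate_fk_cands(key, uinds):
    """key: tuple (K1, ..., Kk); uinds: (lhs, rhs) pairs whose rhs is a key
    attribute and whose lhs belongs to one relation r. Returns the FK
    candidates F1...Fk of r."""
    # For each key attribute Ki, collect the attributes of r included in it.
    lhs_per_pos = [sorted(l for (l, rhs) in uinds if rhs == k) for k in key]
    if any(not lhs for lhs in lhs_per_pos):
        return []  # some Ki is referenced by no attribute of r: no candidate
    return [f for f in product(*lhs_per_pos) if f != key]  # exclude the key itself

# The example above: key AB of r1, UINDs C ⊆ A, D ⊆ A, E ⊆ B over r2:
# generate_fk_cands(("A", "B"), [("C", "A"), ("D", "A"), ("E", "B")])
# returns [("C", "E"), ("D", "E")]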

Pruning pass over the foreign key candidates

This step is intended to eliminate quickly the candidates that are the farthest from being foreign keys. The underlying idea is that a unary inclusion dependency may hold accidentally, while only very few values of the combined left-hand side attributes are included in the values of the key.

Example. Let K = K1K2K3 be a key and F = F1F2F3 a foreign key candidate, with the following values:

    K1  K2  K3        F1  F2  F3
    a   b   c         aa  b   c
    a   bb  c         aa  bb  c
    aa  b   cc        aa  bb  cc
    a   b   cc
    a   bb  cc

In this example, none of F's values are included in the set of K's values. So we can exclude F from the foreign key candidates just by looking at one of its values.

Proposition. In database d, an FK candidate F is an approximate foreign key referencing key K w.r.t. error threshold ε_fk if the number of distinct values of F that are not included in K's values is at most ε_fk·card(F), where card(F) is the number of distinct values of F.

This is straightforward from the definitions of an approximate foreign key and of the error measure for FKs (see Chapter 2).

The pruning pass over foreign key candidate F is done in the function passedSampleTest(F, Key) of the ExtractForeignKeys algorithm. It is based upon the proposition above. We extract at random a small fraction of F's distinct values, SampleValues. We call this fraction the pruning parameter, PgParam. Then, for each value v in SampleValues, we check whether v also appears in the set of Key's values, which we denote by KeyValues. This can be implemented with an SQL query,

    SELECT COUNT(*) FROM (SELECT * FROM r_Key WHERE Key = v)

or by querying an inverted index built on the whole database. We keep a counter of the number of F's distinct values processed so far that are not included in Key's values; let us call this counter numMisses. Whenever v is not included in KeyValues, we increment numMisses. As soon as numMisses exceeds ε_fk·card(F), we know from the proposition that F is not an approximate FK, so the function passedSampleTest(F, Key) returns false. (For exact foreign key discovery, only one value v that is not in KeyValues is enough to prune a candidate.) If the bound ε_fk·card(F) is not reached during the processing of the values in SampleValues, then we cannot conclude whether the dependency F ⊆ Key is satisfied in d based solely on the data sample, so the function returns true and the candidate will be tested against the database. The sketch below illustrates this test.
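The following is a minimal Python sketch of passedSampleTest, assuming for simplicity that F's distinct values are available as a list and the key's values as a set (the thesis retrieves them through SQL queries or an inverted index).

import random

def passed_sample_test(f_distinct_values, key_values, eps_fk, pg_param):
    """Pruning pass: sample a fraction pg_param of F's distinct values and
    count those missing from key_values. Return False as soon as the misses
    exceed the budget eps_fk * card(F) allowed by the proposition; return
    True when the sample is inconclusive (the full test is still required)."""
    card_f = len(f_distinct_values)
    budget = eps_fk * card_f
    num_misses = 0
    for v in random.sample(f_distinct_values, int(pg_param * card_f)):
        if v not in key_values:     # in the thesis: an SQL probe or index lookup
            num_misses += 1
            if num_misses > budget:
                return False        # provably not an approximate FK
    return True                     # inconclusive: test against the database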

Remark. As an important side effect, allowing approximation requires PgParam to be at least equal to ε_fk.

Testing the remaining foreign key candidates

We test the foreign key candidates that passed the pruning step, based upon the proposition above. The number of distinct values of F that are not included in K's values is equal to the difference between the number of distinct values of F and the number of distinct values of F that are also values of Key. The former is card(F), and the latter can be computed by joining the two relations on F = Key. We store in card(F ⋈ Key) the number of distinct values of the join on F = Key between r_F and r_Key. Then we check whether the difference card(F) − card(F ⋈ Key) is smaller than ε_fk·card(F): if it is, then F is an approximate FK for key Key in database d.

Discovering the unary inclusion dependencies

Discovering the unary inclusion dependencies is described in the ExtractUINDs algorithm below. In the following, we represent a unary inclusion dependency as a pair (LHS, RHS), where the left-hand side LHS is included in the right-hand side RHS.


We are interested here in finding the left-hand sides, belonging to a relation r of the database, of all unary inclusion dependencies having a fixed right-hand side K. The brute-force algorithm solving this problem consists of testing against the database all pairs (L, K), where L is an attribute of r different from K, and such that L and K have the same type (e.g., text and text, or numeric and numeric). We improved on it with a simple optimization based on the statistics gathered by the first step of the general algorithm: if the number of distinct values of L is strictly greater than the number of distinct values of K, then the UIND (L, K) cannot hold. With this simple optimization, we prune the left-hand side candidates for right-hand side K, as in the following sketch.
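A one-function sketch of this candidate pruning, with assumed dictionaries standing in for the attribute statistics gathered in the preliminary step:

def cand_lhs(attrs, key_attr, card, attr_type):
    """Left-hand side candidates for right-hand side key_attr: same type, and
    no more distinct values than key_attr (otherwise the UIND cannot hold)."""
    return [a for a in attrs
            if a != key_attr
            and attr_type[a] == attr_type[key_attr]
            and card[a] <= card[key_attr]]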

Instead of using the attribute cardinalities, Bell and Brockhausen [BB] extract value restrictions, which are upper and lower bounds on the attribute domains. They use these value restrictions to prune the candidate UINDs as follows: if upp(L), upp(K), low(L), low(K) are the upper and lower bounds on the domains of attributes L and K, then L ⊆ K is a possible UIND if and only if upp(L) ≤ upp(K) and low(L) ≥ low(K). Value restrictions generate a more efficient pruning than simple cardinality restrictions. We only use the latter, because the attribute cardinalities are available in the DBMS catalogs, while the former necessitate processing the data. The authors also exploit the transitivity of UINDs to minimize the number of accesses to the database. We found out that, in practice, however, this is not a very useful optimization when extracting key-based UINDs.

Unlike Bell and Brockhausen [BB], whose goal is to extract all unary inclusion dependencies, we are really interested in only key-based UINDs, whose right-hand sides belong to a key, since only these will be used in generating foreign key candidates. Therefore our search space is reduced, so we did not implement all the optimizations they propose.

Moreover, our goal is to provide a general framework for key and foreign key discovery, so we did not concentrate on all possible optimizations. However, it would be easy to integrate the ideas of Bell and Brockhausen into our algorithms: in the first step, we could include value restrictions among the statistics we gather and then use these to prune the left-hand side candidates; and we could add, after the candidate generation, statements of the following form to exploit the UIND transitivity:

    if Aj ⊆ Ki holds and Ak ⊆ Ki does not hold, then Ak ⊆ Aj cannot hold;
    if Aj ⊆ Ki does not hold and Aj ⊆ Ak holds, then Ak ⊆ Ki cannot hold.

De Marchi et al. [MLP] proposed an efficient way to discover all UINDs which does not use SQL queries: they generate a compact representation of the data, and from this new representation they extract the UINDs. With their method, the restriction to key-based UINDs does not save computation. Here again, we could integrate their method into ours, but because we concentrate on providing a general framework for key and foreign key discovery, we implemented all algorithms using SQL queries, for the sake of uniformity and simplicity.

To test the inclusion of a left-hand side candidate A into K, we use the same method as for foreign keys: we generate left-hand side candidates, then we perform a pruning pass over these candidates, and finally we test the remaining ones against the database. The pruning pass is exactly the one described in the previous section, with F = Aj and Key = Ki, which corresponds to the case where the size is equal to one. And, at last, to test the inclusion of a remaining left-hand side candidate A into K, we compute the number of distinct values that A has in common with K, by joining the two relations on A = K and counting the number of distinct values of the result. The number of distinct values of the join is always at most the cardinality of A, but if the difference is smaller than ε_fk·card(A), meaning that the fraction of erroneous values is below the approximation threshold ε_fk, then the inclusion holds.


Algorithm: ExtractUINDs
Input: a relation r with n attributes A1, ..., An; the number of distinct values of each attribute, card(Ai), 1 ≤ i ≤ n; the key Key = K1...Kk for which we want to extract unary inclusion dependencies; the approximation threshold for inclusions, ε_fk
Output: collectionUIND, the set of unary inclusion dependencies whose right-hand side belongs to Key and whose left-hand side belongs to r

    collectionUIND ← ∅
    for i = 1..k, when the UINDs having Ki as a right-hand side are not known, do
        /* Extract the UINDs having Ki as a right-hand side */
        CandLHS(Ki) ← {Aj | 1 ≤ j ≤ n, Aj ≠ Ki, |Aj| ≤ |Ki|, Dom(Aj) = Dom(Ki)}
        foreach Aj ∈ CandLHS(Ki) do
            /* Pruning pass on a sample of the database */
            if passedSampleTest(Aj, Ki) then
                /* Real test */
                card(Aj ⋈ Ki) ← SELECT COUNT(DISTINCT r.Aj)
                                FROM r, r_Ki WHERE r.Aj = r_Ki.Ki
                /* If r.Aj ⊆ r_Ki.Ki almost holds */
                if card(Aj) − card(Aj ⋈ Ki) ≤ ε_fk · card(Aj) then
                    collectionUIND ← collectionUIND ∪ {r.Aj ⊆ r_Ki.Ki}
                end
            end
        end
    end


Complexity analysis

Since all the data is kept in a DBMS, space complexity is not an issue. Therefore, we concentrate in this section on the time complexity, which is measured in terms of database accesses.

We first examine the complexities of the database operations used in our algorithms.

- We use projection, in a SELECT DISTINCT ... FROM r type of query. This operation requires a sort of the tuples on the attributes selected, and then a scan over all tuples, so its complexity is O(|r| log |r|). The size of the resulting set of tuples ranges from 1 (all values equal) to |r| (all values distinct).

- We use selection, in a SELECT ... FROM r WHERE ... type of query. This operation just requires a scan over all tuples, so its complexity is O(|r|).

- We use counting, in a SELECT COUNT(*) FROM r type of query. This operation also is just a scan over all tuples, so its complexity is O(|r|).

- We use equijoins (joins whose join conditions are equalities), in a SELECT ... FROM r1, r2 WHERE r1.X1 = r2.Y1 AND ... AND r1.Xk = r2.Yk type of query. The complexity of this operation depends on how it is executed: when executed via a nested loop, it is O(|r1|·|r2|); when r1 and r2 are sorted first, it is O(|r| log |r|), where |r| = max(|r1|, |r2|); when an index on X or Y exists, the complexity is O(|r| log |r|) if the index is a B-tree, or O(|r|) if the index is a hash table. For the rest of the complexity analysis, we assume that no index exists. The size of the resulting set of tuples ranges from 0 to min(|r1|, |r2|).

The algorithms often use compositions of these operations, so we adapt the complexities accordingly. As the join and projection operations are the most expensive among the four, the complexity of compositions will be dominated by their complexities. When both operations occur, the total complexity is generally dominated by the complexity of the join operation.

Example. In the ExtractForeignKeys algorithm, the following query combines the three operations presented above:

    SELECT COUNT(*) FROM
        (SELECT DISTINCT F1, ..., Fk FROM r_K, r_F
         WHERE r_K.K1 = r_F.F1 AND ... AND r_K.Kk = r_F.Fk)

Let m = |r_K| and n = |r_F|, and assume m ≤ n. The complexity of the equijoin ranges from O(n log n) to O(mn). Because the size of the set of tuples returned by the equijoin is at most m, the complexity of the projection is O(m log m). Similarly, the complexity of the counting is O(m). Therefore, the overall complexity is dominated by the equijoin, and is O(n log n) when the relations are sorted, or O(mn) when they are not.

We define in the table below the notation used in the rest of this section. We denote by p the maximum number of tuples over all tables, and consider that all tables have p tuples, to simplify the analysis. We denote by join(p) the complexity of the equijoin of two relations: join(p) = O(p log p) or O(p²), depending on the implementation of the join operation.

We now study the complexity of the key discovery module, and then that of the foreign key discovery module, separately.


    nAtts               number of attributes in the database
    p                   number of tuples
    join(p)             complexity of a join operation for two relations
                        with at most p tuples
    PgParam             fraction of rows involved in the pruning pass
    nKeys               number of keys found
    nKeyCands           number of key candidates
    nUINDs              number of key-based unary inclusion dependencies
    nUindCands          number of key-based UIND candidates before the pruning pass
    nUindCandsPruned    number of key-based UIND candidates after the pruning pass
    nFKs                number of foreign keys found
    nFKCands            number of foreign key candidates before the pruning pass
    nFKCandsPruned      number of foreign key candidates after the pruning pass

Table: Definition of the notations used.


Complexity of key discovery

In the key discovery module (the ExtractKeys algorithm), we access the database to retrieve the number of distinct values of each key candidate. This corresponds to the counting operation applied to the result of a projection operation. The complexity is dominated by the projection: O(p log p) for one key candidate, and O(nKeyCands · p log p) overall:

    Complexity(ExtractKeys) = O(nKeyCands · p log p).

The number of key candidates is smaller in our approach than in the approaches for functional dependency discovery, because of an extra pruning mechanism: the cardinality upper-bound test in ExtractKeys. This mechanism, based on an estimate of the number of distinct values of a candidate, allows us to avoid some tests against the database.

Complexity of foreign key discovery

The main tasks of FK discovery are the following: generating FK candidates, making a pruning pass over the candidates, and testing the remaining candidates.

In the pruning pass over a FK candidate F, we first extract the number of F's distinct values, which takes O(p log p). Then, for a fraction PgParam of F's distinct values, we check that they also appear in the key (i.e., the number of F's distinct values used in the pruning pass is PgParam · p). The check can be done either linearly, by a scan of the key values, or in log p, by probing an index; we consider it to be in O(p). So the complexity of the pruning pass is quadratic in the number of tuples:

    Complexity(Pruning) = O(nFKCands · PgParam · p²).


We investigated experimentally the interaction between the efficiency of the pruning and the pruning parameter PgParam (see the next chapter).

Testing the remaining candidates is done through the query of the example above, whose complexity is O(join(p)):

    Complexity(TestFKCands) = O(nFKCandsPruned · join(p)).

Finally, generating foreign key candidates requires determining key-based unary inclusion dependencies (the ExtractUINDs algorithm). The process for UINDs is similar to that for general FKs, and so is the complexity:

    Complexity(ExtractUINDs) = O(nUindCands · PgParam · p² + nUindCandsPruned · join(p)).

The overall complexity is the sum of the three equations above.

Influence of data characteristics on the complexity

The data characteristics influencing the complexities vary greatly from one database to another. The complexities also depend on input parameters (error thresholds and number of values involved in the pruning pass), which we discuss in the next section.

We saw earlier that the numbers of keys and key candidates, nKeys and nKeyCands, can be exponential in the number of attributes. In addition to the fact that interesting keys tend to be small in practice, we chose to limit the maximal key size also to avoid this exponential explosion.

The key-based unary inclusion dependencies depend on the number of attributes included in keys, since these are the right-hand sides of the UINDs. While in practice the right-hand sides form a small subset of all attributes, in the worst case the union of the keys covers all attributes in the database, so that there are nAtts right-hand sides. Also, even though domains generally are of different types (e.g., integers, strings, dates) and therefore are disjoint, it may occur that all of them are of the same type and that they are all included in the domains of key attributes; then there are up to nAtts left-hand sides per right-hand side. So, in the worst case, nUinds = nUindCands = nUindCandsPruned = O(nAtts²).

The FK candidates are generated from key-based UINDs, so they depend on the number of keys and key-based UINDs. We saw that, in the worst case, there can be a great number of them.

Influence of input parameters on the complexity

The pruning pass over the foreign key candidates is useful only when the difference between the number of candidates before and after pruning justifies the additional cost of the pruning step.

Without the pruning pass, the complexity of testing the foreign key candidates is

    O((nUindCands + nFKCands) · join(p)).

With it, the complexity becomes

    O((nUindCandsPruned + nFKCandsPruned) · join(p) + (nUindCands + nFKCands) · PgParam · p²).

So the pruning step is worthwhile only when

    (nUindCands + nFKCands) · PgParam · p² < ((nUindCands − nUindCandsPruned) + (nFKCands − nFKCandsPruned)) · join(p).

A rough way to evaluate this condition is sketched below.
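As an illustration, the condition can be evaluated with the coarse cost model of this section (in practice the post-pruning counts are only known after the fact, so they would have to be estimated):

def pruning_worthwhile(n_uind_cands, n_uind_pruned, n_fk_cands, n_fk_pruned,
                       pg_param, p, join_cost):
    """Compare the extra cost of the pruning pass with the full tests it
    saves; join_cost stands for join(p), and p is the number of tuples."""
    extra_cost = (n_uind_cands + n_fk_cands) * pg_param * p * p
    savings = ((n_uind_cands - n_uind_pruned) +
               (n_fk_cands - n_fk_pruned)) * join_cost
    return extra_cost < savings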

When the fraction PgParam of tuples involved in the pruning pass is small, the pruning step should be less efficient than when PgParam is large. On the other hand, as PgParam increases, the cost of the pruning pass grows, so the number of candidates after pruning has to be significantly smaller than the number of initial candidates for the pruning step to be worthwhile.

The intuitive idea is that the pruning step should only be used to discard the candidates that are very far from being foreign keys, e.g., which have just a couple of tuples in common with the key. These candidates do not require testing on many rows, so a small PgParam is enough to eliminate them, and the pruning step is inexpensive. For the remaining candidates, we use the full test. Indeed, if we wanted to discard them during the pruning pass, we would have to process many tuples, because they share many values with the key, so we would need a large PgParam, and the pruning step might become too costly relative to its efficacy. Unfortunately, the tradeoff between the cost of the pruning step and its performance depends on the dataset.

The other input parameters (other than PgParam) are the error thresholds for approximate keys and FKs. Nonzero thresholds increase the number of key-based UINDs, keys, FKs, and all candidates.

When the approximation threshold ε_fk for the foreign keys increases, the fraction PgParam of tuples involved in the pruning pass has to increase as well, since we must have PgParam ≥ ε_fk (see the remark in the pruning pass section). Hence, for high thresholds, the pruning step may be too expensive relative to its benefit, and may have to be skipped. But such high thresholds would probably give coarse, not very interesting, results.

We investigate in the next chapter the effect of the input parameters on the performance of the algorithm, through experimentation.

Chapter 4

Experiments

Goals of the tests

- We evaluate the gain, in terms of database accesses, of extracting only key-based UINDs versus all UINDs.

- We evaluate the efficiency of the pruning pass, both on synthetic and real-life datasets, with various amounts of noise.

- We examine the general behavior of the pruning pass efficiency in relation to the pruning parameter PgParam. In particular, we study how to set PgParam to minimize the running time of the algorithm.

- We then analyze the influence of the dependency size on the pruning pass, that is, whether the pruning pass behaves differently on unary and composite dependencies. In addition, we investigate the influence of the approximation threshold ε_fk for FKs.

- Finally, we verify experimentally that the pruning time is quadratically dependent on the data size, i.e., the number of tuples.


Experimental setup

All the datasets used in the experiments were stored in IBM DB2 Universal Database for Solaris. We implemented the algorithms described in the previous chapter in C, with embedded SQL to access the databases. Because of the randomness in our algorithm (within the pruning pass), we ran it repeatedly on the whole set of databases and took the average over the runs for each dataset.

First of all, we define the metrics used to evaluate the effectiveness of the pruning pass. Then we present the datasets used in the experiments. Finally, we describe the setting of the input parameters of the algorithm, such as the approximation thresholds for keys and FKs.

Metrics used to evaluate the impact of the pruning pass

Recall that our approach for identifying the foreign keys of a given key consists of three steps:

1. generating FK candidates:
   - for FKs of size 1, these are the key-based UIND candidates;
   - for FKs of larger sizes, these are generated from key-based UINDs;

2. pruning the FK candidates;

3. testing the remaining FK candidates.

The pruning pass (Step 2) aims at minimizing the number of FK candidates involved in Step 3, because this last step is the most expensive. Thus, there is a tradeoff between the extra cost of the pruning pass and the savings it provides. The pruning pass is useful only when the savings exceed the cost.

We present now the metrics used to evaluate whether the pruning pass is worthwhile, and how efficient it is. These metrics are based upon the following statistics, recorded during the execution of the algorithm for each arity (i.e., number of attributes) of the key/FK:

- nFKCands, the number of FK candidates. In each pass, this is recorded after Step 1. Note that FK candidates of size 1 are far more numerous than for other sizes, because the generating process is different.

- nFKCandsPd, the number of FK candidates after the pruning of Step 2.

- nFKs, the number of FKs. This is the number of candidates that passed the test of Step 3.

In addition, we gather statistics about the whole set of key-based UINDs, not only those associated with keys of size 1.

Degree of pruning. We measure how effective the pruning pass is in reducing the number of FK candidates involved in the full test of Step 3 to the set of actual FKs. For instance, assume that a certain number of FKs hold in a database instance and that our algorithm generates more candidates than that (nFKCands > nFKs). Then the pruning pass is most effective when it prunes all the candidates that are not actual FKs, i.e., when nFKCandsPd = nFKs.

We define the degree of pruning as follows:

    DegPg = (nFKCands − nFKCandsPd) / (nFKCands − nFKs).


When nFKCands = nFKs, the pruning pass is useless and does not result in any improvement over the algorithm without the pruning pass; therefore, we chose to define DegPg = 1 in this case. The degree of pruning lies in the interval [0, 1], where 1 corresponds to a perfectly effective pruning, and 0 to a totally ineffective pruning.

We chose to use this measure rather than the number of candidates pruned over the number of candidates,

    DegPg2 = (nFKCands − nFKCandsPd) / nFKCands,

because DegPg2 reflects only the quantity of pruning done, without taking into account its quality, i.e., how close the candidates remaining after pruning are to the final FKs. Indeed, in our previous example, even when the pruning is most effective, DegPg2 stays below 1, while DegPg expresses the intuitive idea that the pruning was perfect.

The degree of pruning measures the decrease, due to pruning, of the number of candidates tested against the database that are not FKs. Consider a case where nFKCands is only slightly larger than nFKs, and nFKCandsPd = nFKs + 2: the degree of pruning is low; however, we only test two superfluous candidates, which is rather good. Thus, we introduce another measure, to compute the number of excess tests against the database.

Pruning ratio and FK-cand ratio. We measure how the number of candidates tested against the database in Step 3 compares to the number of actual FKs. To evaluate the usefulness of the pruning pass, we record this ratio both when there is a pruning pass (we call it pruning ratio) and when there is not (we call it FK-cand ratio), i.e., with and without Step 2 of the algorithm:

    PgRatio = nFKs / nFKCandsPd,
    FKcandRatio = nFKs / nFKCands.

Here again, both ratios lie in the interval [0, 1], where 1 corresponds to no excess test against the database, while 0 corresponds to the case where there are no FKs but some candidates are tested.

While these measures express how good the results after pruning are, they do not account for the cost necessary to obtain them: the extra time needed to execute the pruning pass. Thus, to know whether the pruning pass was worthwhile, we introduce a last measure, called time ratio.

Time ratio. We measure the running time of the algorithm without the pruning pass, TimeNoPg, and with it, TimeWithPg. Then

    TimeRatio = TimeWithPg / TimeNoPg.

When TimeRatio < 1, the pruning pass was worth it; otherwise, it is faster to use the version without pruning. These four measures are collected in the sketch below.
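A compact sketch collecting the four measures (the names are direct transcriptions of the definitions above):

def pruning_metrics(n_fk_cands, n_fk_cands_pd, n_fks, time_no_pg, time_with_pg):
    """Degree of pruning, pruning ratio, FK-cand ratio, and time ratio."""
    deg_pg = (1.0 if n_fk_cands == n_fks
              else (n_fk_cands - n_fk_cands_pd) / (n_fk_cands - n_fks))
    return {
        "DegPg": deg_pg,                         # 1: perfect pruning, 0: useless
        "PgRatio": n_fks / n_fk_cands_pd if n_fk_cands_pd else 1.0,
        "FKcandRatio": n_fks / n_fk_cands,       # closer to 1: fewer excess tests
        "TimeRatio": time_with_pg / time_no_pg,  # < 1: the pruning pass paid off
    }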

All the measures defined above depend on the fraction PgParam of distinct tuples involved in the pruning pass: the larger PgParam is, the more effective the pruning will be, but the more expensive the pruning pass is. Therefore, there is a tradeoff between the gains and the cost. In the following sections, we study the influence of PgParam on the different measures and discuss what value is the best choice for PgParam.


Datasets used in the experiments

The goal of the algorithm presented in this thesis is to determine the data model (keys and FKs) which best fits the database considered. More precisely, we can view a database as the sum of a clean database and some noise, and the model to discover corresponds to that of the clean database. In our experiments, we used a few real-life, nearly noiseless databases whose data models are known, and we added various amounts of noise to them to simulate noisy databases. We used the given data models to check whether those output by our algorithm are correct.

So our noisy databases are all synthetically generated. But as there is no general model of noise in databases (i.e., the characteristics of the noise in two databases can be totally different), we consider that our synthetic noisy data are as appropriate for the experiments as real-life noisy data would be. Moreover, using synthetic noisy data allows us to control the amount of noise in our experiments.

Our real-life clean databases are of medium size, w.r.t. both their schema and their number of tuples. They were all carefully designed; thus most of the foreign keys are of size 1 or 2, which is the case for most databases in practice. We generated synthetic data with different characteristics (i.e., different key and foreign key sizes) to test our algorithm on other configurations.

Real-life datasets. These databases are all publicly available on the Internet.

- Amalgam is a bibliographic database, with its attributes distributed over several tables. We used the schema defined by Miller et al. [MFH].

- Genex [Gen] is a repository of gene expression data, with its attributes distributed over several tables.

- Janus Core Log, which we call simply Janus in the following [Oce], is part of the Janus database, containing tables of Ocean Drilling Program marine geoscience data. We used a sample of the Core Log subset of the Janus database.

- Mondial [May] is a geographical database.

Synthetic data. The synthetic datasets are small, w.r.t. both their schema and their number of rows. They were generated to test worst cases of the algorithms: the domains all have the same type and many common values, thus overlapping much more than in the real-life datasets we used. Besides, the synthetic datasets present different characteristics (i.e., different key and foreign key sizes).

We used constraint programming [Lab] to generate data respecting conditions on the number of keys, their sizes, and their numbers of FKs, together with the sizes of these FKs. The classical synthetic data generators (e.g., those for classification or association rules) are not suitable for our purpose. The best tool we are aware of is ToXgene [BMKL], but we did not use it because it requires knowing the data model in advance, while our tool constructs one which fits our conditions. However, unlike ToXgene, our tool is not scalable, which explains why our synthetic datasets have small schemas. This is due to the fact that the problem of generating a database satisfying a set of integrity constraints specified by a user is a very difficult one. Bry et al. explain that the set of integrity constraints can be expressed by a set of first-order logical formulas, and the goal is then to find a finite model (because a database is finite) for this set of first-order logical formulas [BEST]. But the question of whether a set of first-order logical formulas is finitely satisfiable is semi-decidable, and the question of whether a set of first-order logical formulas is unsatisfiable is semi-decidable, so the question of whether a set of first-order logical formulas is satisfiable (though not finitely) is not even semi-decidable. (A language L is semi-decidable if there exists a Turing machine M that accepts every string in L and does not accept any string not in L: M either rejects or goes into an infinite loop.)

Noisy data. We describe here how we generated approximate data from our clean datasets. Given an approximation (or noise) degree noiseDeg, expressed as a percentage, we proceed as follows:

- for each table, we select at random noiseDeg percent of the tuples, t1, ..., tk;

- for each ti = (v1, ..., vn), we either duplicate ti or modify some values in ti, choosing between the two at random. When modifying ti, each value is replaced by a random value of the same type with some fixed probability, and by a NULL value with some other fixed probability, so that most values are left unmodified.

Duplicating tuples adds noise to keys only, not to FKs, while modifying some values in tuples adds noise to both keys and FKs.

We created various noisy versions of our datasets, using several values of noiseDeg. A sketch of the procedure follows.
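A minimal sketch of this noise-injection procedure; the probabilities p_dup, p_random, and p_null are placeholders for the (elided) values used in the experiments, and types maps each column index to a generator of random values of the right type.

import random

def add_noise(tuples, noise_deg, types, p_dup=0.5, p_random=0.25, p_null=0.25):
    """Return a noisy copy of the table: a fraction noise_deg of the tuples
    is either duplicated (noise on keys only) or modified value by value
    (noise on both keys and FKs)."""
    noisy = list(tuples)
    victims = random.sample(range(len(tuples)), int(noise_deg * len(tuples)))
    for i in victims:
        if random.random() < p_dup:
            noisy.append(tuples[i])          # duplicate the tuple
        else:
            t = list(tuples[i])
            for j in range(len(t)):
                r = random.random()
                if r < p_random:
                    t[j] = types[j]()        # random value of the same type
                elif r < p_random + p_null:
                    t[j] = None              # NULL value
            noisy[i] = tuple(t)              # other values stay unchanged
    return noisy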

Setting of the algorithm's input parameters

We study which values of ε_key and ε_fk, the input approximation thresholds for keys and FKs respectively, give a good enough model for all our datasets. Of course, the threshold between good enough and unacceptable models is subjective and can vary according to the applications. In this thesis, we consider that a model is good enough when few dependencies are missing and, if possible, when those missing dependencies are of minor importance in the original model. For instance, we consider that it is acceptable to miss the key of a small table which is not referenced; however, dependencies expressing relationships between tables, such as FKs and referenced keys, are of major importance. Because we do not know in advance how noisy the database is, we want to find a pair (ε_key, ε_fk) that works well on most databases, clean as well as noisy, and then we will use these values as input parameters for our experiments.

For this experiment, we chose two of our real-life datasets: Mondial, because its data model has the characteristics of most well-designed databases, with keys of size 1 and 2; and Janus, because its data model has large composite keys. However, the large composite keys in the Janus data model are actually superkeys in the Janus data, so we replaced them in the data model by their smallest subset that is a key in the data (if there are several such subsets, we choose one at random). For instance, the key of the table Core in Janus' original data model is {Leg, Site, Hole, Core, Core_type}, but {Site, Core, Core_type} is a key in the Janus data, so we used the latter in Janus' updated data model. We summarize in the table below the keys and foreign keys of both the Mondial and Janus data models.

[Table: Summary of keys and foreign keys in the Mondial and Janus data models: for each database, the total number of keys and of FKs, and the number of each of size 1.]

Finding ε_key. We tested our algorithm on each original dataset (corresponding to noiseDeg = 0) as well as on its noisy versions. We fixed the input parameter ε_fk, since FKs are of no concern in this particular study, and we experimented with various values for ε_key. Then we compared the keys output by our algorithm to those of the data model describing the real-life dataset. We recorded in the table below the number of keys present in the data model that are missed by our algorithm.

[Table: Number of keys present in the model but not found by our algorithm, on Mondial and Janus, for each noise degree noiseDeg and each value of ε_key.]

We can see in this table that, as ε_key increases, the number of keys missed by our algorithm decreases. We found a small nonzero ε_key to be a good choice: the number of missing keys decreases to zero for all noise degrees but one on the Mondial database, and in that case the two missing keys are of minor importance in the original model (they are not referenced). Moreover, as we consider datasets with small noise degrees, this value should cover most of the noise on keys.

Finding ε_fk. We repeat the previous experiment, except that this time the parameter varied is ε_fk, and ε_key is fixed to the value we just chose. We construct a table similar to the one used for finding ε_key.

[Table: Number of FKs present in the model but not found by our algorithm, on Mondial and Janus, with ε_key fixed; the details per size are given inside the parentheses.]

Our original Janus data is not perfectly clean: one FK is missing even when noiseDeg = 0 and ε_fk = 0. This FK has only a few distinct values, including a noisy one, so a large ε_fk would be needed to recover it. No value of ε_fk allows us to recover all the missing FKs. We chose a small value of ε_fk for which the number of missing FKs has decreased significantly.

We found that the chosen values of ε_key and ε_fk worked reasonably well on all of our datasets, clean as well as noisy, so we use these values as input parameters in the rest of our experiments. Even though these values allow us to recover most of the data models for our real-life datasets, whatever the noise degree, there may be other databases for which these values are not suitable. We feel that, in general, they should allow us to discover most of the keys and foreign keys of the data model, provided that the data is not very noisy. For very dirty data, no technique is capable of finding the right dependencies.

Finally, the last input parameter of our algorithm is the pruning parameter: the fraction PgParam of tuples involved in the pruning pass. We will study how to best set PgParam in a later section.

Key-based unary inclusion dependencies

We evaluate in this section the gain, in terms of database accesses, of extracting only key-based UINDs instead of all UINDs.

For each of our real-life databases, with no noise added, we record in the table below the number of UIND candidates, both when we extract all UINDs and when we extract only key-based UINDs, as well as the percentage of attributes that are included in a key. We also express the number of key-based UIND candidates as a percentage of all UIND candidates. For this experiment, the stopping criterion used in the algorithm is the limited-size criterion, i.e., we search for all integrity constraints up to the size bound.

[Table: Extracting all UINDs versus only key-based UINDs, on Amalgam, Genex, Mondial, and Janus: the number of UIND candidates, the number of key-based UIND candidates (also as a percentage of all UIND candidates), and the percentage of attributes included in a key.]

We found that, on these four datasets, key-based candidates account for only part of all UIND candidates; limiting ourselves to key-based UINDs therefore saves between one third and one half of the database queries. The raw numbers show that these savings are significant: there are far fewer key-based UIND candidates than UIND candidates, and each candidate requires a database query in order to check whether the UIND actually holds in the database. The percentage of attributes included in keys is slightly smaller than the percentage of UIND candidates that are key-based, due to our candidate generation process. Indeed, an attribute L is a left-hand side candidate for a given attribute R provided that L and R have the same type and that |L| ≤ |R|, where |L| is the number of distinct values of L. Because attributes in keys have more distinct values than attributes not in keys, there are more UIND candidates with a key attribute as the right-hand side.


Efficiency of the pruning pass

First of all, we show that the pruning pass can be very effective in some cases. We recorded the running time of our algorithm, both with and without the pruning pass, on our four real-life datasets without noise added. We computed the fraction of the running time without pruning that we avoided by performing a pruning pass; this fraction measures the benefit of the pruning pass. For this experiment, we used the limited-size stopping criterion, the thresholds ε_key and ε_fk chosen above, and, for the fraction PgParam of distinct values involved in the pruning pass, the best value for each dataset.

[Table: Running times of our algorithm with and without the pruning pass on the original data, for Amalgam, Mondial, Genex, and Janus: PgParam, running time without pruning, running time with pruning, and the gain due to the pruning pass as a percentage of the running time without pruning.]

From this table, we see that the pruning pass allows us to save a substantial fraction of the running time, so it can be very effective. This justifies the more detailed study that follows, where we examine the effectiveness of the pruning pass and which parameters influence it.


General behavior of the pruning pass in relation to the pruning parameter

We study in this section the general behavior of the pruning pass, and how this behavior depends on the pruning parameter PgParam.

For a general analysis, we experimented on most of the datasets described earlier: we ran our algorithm on our real-life datasets (Mondial, Genex, Janus) and on synthetic data, as well as on noisy versions of Mondial and Janus. We used the input approximation thresholds ε_key and ε_fk chosen previously.

First of all, we examine whether the pruning pass is worthwhile, and for which value of the pruning parameter the pruning is most effective. As the pruning pass requires that PgParam ≥ ε_fk, we experimented with PgParam values ranging upwards from ε_fk. The figure below shows the ratio between the running time with pruning and without pruning, versus the pruning parameter. All these graphs have the same shape: the time ratio increases as the pruning parameter grows. However, their positions with respect to the line of equation y = 1 differ according to the data: on synthetic data, the time ratio is bigger than 1 for all tested values of PgParam, while on Janus the time ratio is smaller than 1 for these same values. Finally, on Mondial, the time ratio crosses the line y = 1 at a small value of PgParam.

So the pruning pass (PP) is most effective for the smallest values of the pruning parameter in our experiments. But this optimal value of PgParam can correspond to a worthwhile or a detrimental pruning pass, depending on the dataset. Note that a detrimental PP is such that TimeWithPg > TimeNoPg.

We now investigate why the time ratio increases with PgParam, and why the

[Figure: Time savings. TimeWithPg/TimeNoPg versus the pruning parameter, for synthetic data, Janus, and Mondial, at several noise degrees, with the thresholds ε_key and ε_fk fixed.]


position of the curve with respect to the line of equation y = 1 changes according to the data. We measure the performance of the pruning over all FK candidates with the metrics described earlier, namely: the FK candidate and pruning ratios, FKcandRatio and PgRatio, which measure the number of superfluous candidate tests against the database without and with pruning, respectively (the closer to 1, the fewer superfluous tests there are); and the degree of pruning, DegPg, which measures how many candidates have been discarded by the pruning pass. For conciseness, we limited our investigation to one version of each real-life database (Mondial, Janus, and Genex, each at a single noise degree), but the results are similar for other noise degrees. We present the results in the table below.

Without the pruning pass, the algorithm performs many unnecessary database queries, as shown by very low FK candidate ratios across the datasets, from Janus to synthetic data. Recall that FKcandRatio = nFKs / nFKCands, and that each candidate is tested against the database via a query. In the ideal case, the algorithm tests only the candidates for which the inclusion holds, which corresponds to the number of candidates being equal to the number of FKs, and FKcandRatio = 1. When there are more candidates than FKs, i.e., when FKcandRatio < 1, superfluous database queries are performed.

Introducing a small amount of pruning (the smallest PgParam tested) greatly reduces this number of unnecessary database queries: on Mondial, Janus, and Genex, the pruning ratios are several times bigger than the FK candidate ratios. Hence the pruning is very effective on these datasets, as shown by high degrees of pruning. But increasing the pruning parameter brings little improvement to the effectiveness of the pruning pass, as demonstrated by the slowly increasing degrees of pruning and pruning ratios on Mondial, Janus, and Genex. Even when the performance of the PP remains relatively steady while PgParam grows, the cost of performing the pruning


[Table: Efficiency of the pruning pass on all candidates, with input parameters ε_key and ε_fk as chosen above: FKcandRatio, PgRatio, and DegPg for Mondial, Genex, Janus (each at one noise degree) and synthetic data, for several values of the pruning parameter (in %).]


still increases, so the cost of the PP increases faster than the benefits it generates. This explains why the gain from pruning decreases as PgParam grows.

The behavior of the PP is slightly different on synthetic data. Indeed, the pruning pass is not as effective for small PgParam: the pruning ratio is nearly equal to the FK candidate ratio for the smallest PgParam, and the degree of pruning is small, as opposed to the much higher degrees reached on real-life data.

Overall, the best compromise between benefit and cost is reached for the smallest PgParam, and from there the effectiveness of the PP decreases on all datasets, because the cost of pruning grows faster than its benefits. For real-life datasets, pruning with the smallest PgParam is relatively inexpensive, and it greatly reduces the number of superfluous database queries, so the optimal value of the pruning parameter corresponds to a worthwhile PP. On synthetic data, however, the optimal value of PgParam corresponds to a detrimental PP, where the cost always surpasses the benefits.

We now study why, on real-life data, the PP is so effective for a small PgParam, and why its performance increases very little for larger PgParam values. The candidates can be classified into two groups: the accidental candidates, which have very few values in common with their right-hand side, and the expected candidates, which share more than a few values with their right-hand side. To discard an accidental candidate, we have to process only a couple more tuples than required by ε_fk, i.e. a very small pruning parameter suffices. On the other hand, to be able to discard some of the expected candidates, we need to examine many more tuples; that is, we need a large PgParam.
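As an illustration of why a pruning parameter barely above ε_fk suffices for accidental candidates, here is a minimal Python sketch of the discard test; it assumes ε_fk is a fraction of violating tuples tolerated over the whole table, and the names are hypothetical, not the thesis code.

    def can_prune(sample, rhs_values, eps_fk, table_size):
        """Discard an FK candidate if the sample alone already proves the
        approximate inclusion cannot hold.

        sample     -- candidate values from a PgParam fraction of the tuples
        rhs_values -- set of values of the right-hand-side key
        eps_fk     -- fraction of violating tuples tolerated in the whole table
        table_size -- number of tuples in the candidate's table
        """
        budget = eps_fk * table_size  # violations allowed in the entire relation
        misses = 0
        for v in sample:
            if v not in rhs_values:
                misses += 1
                if misses > budget:   # an accidental candidate gets here after
                    return True       # only a few tuples more than the budget
        return False

An accidental candidate, which shares almost no values with its key, exceeds the budget after roughly ⌈ε_fk · table_size⌉ + 1 sampled tuples, which is why a PgParam only slightly above ε_fk is enough to discard it.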

We investigated how many values are shared by the accidental and expected candidates and their right-hand side (RHS). We chose only one real-life dataset, Mondial, and one synthetic dataset for this experiment: Mondial is representative of real-life datasets, and the synthetic database corresponds to a special case where all attributes share several values. We computed the average percentage of common values between candidates and RHSs for two PgParam values: the smallest value used in the previous tests, and a larger one. The candidates that are discarded in the PP correspond to the accidental candidates, and the others are the expected candidates. We present the results in the table below.

Table: Average percentage of common values between the candidate left-hand sides and their right-hand sides, for pruned (accidental) and not-pruned (expected) candidates, on Mondial and the synthetic data at a fixed noiseDeg, for two values of the pruning parameter (in %).

On Mondial, for the smallest PgParam, the accidental candidates share only a small percentage of their values with their RHSs, while the expected candidates share most of their values with their RHSs.
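The percentages in the table can be computed with a straightforward set intersection; a minimal sketch in Python, with illustrative names:

    def pct_common_values(lhs_values, rhs_values):
        """Percentage of the candidate's distinct values that also occur
        in its right-hand side."""
        lhs, rhs = set(lhs_values), set(rhs_values)
        return 100.0 * len(lhs & rhs) / len(lhs) if lhs else 0.0

    # Averaged separately over pruned (accidental) and surviving (expected)
    # candidates to obtain the two columns of the table.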

Recall that candidates are either accidental or expected. As accidental and expected candidates are defined in relation to the PP, these two sets vary with the pruning parameter: when PgParam increases, some of the candidates that were previously expected become accidental; thus the average percentage of common values between accidental candidates and their RHSs increases, as can be seen in the table above.

The high average number of common values between accidental candidates and their RHSs on synthetic data explains why the performance of the PP is so poor for small PgParam and why it increases slowly with PgParam. The percentages of common values between accidental candidates and their RHSs are much larger for synthetic data than for Mondial. This explains why the performance of the PP is lower on synthetic data than on real-life data: the pruning ratio is much lower on synthetic data than the values it reaches on Janus and Mondial (see the first efficiency table above).

These results confirm the observations made in the complexity analysis, namely that the PP is heavily influenced by the degree of overlap between the different domains. We generated the synthetic data to test our algorithm on worst-case data, so the high number of common values between candidates and RHSs on synthetic data is deliberate. The experiments corroborate that overlapping domains indeed represent a bad case for our algorithm: the time ratio TimeWithPg/TimeNoPg is always greater than 1 for synthetic data, so it is better not to perform a PP on these data.

Overall, we saw that the best value for the pruning parameter is very close to the FK approximation threshold ε_fk (the optimal PgParam was barely above ε_fk in our experiments). But even for the best value of the pruning parameter, the PP may be detrimental, and one should avoid performing it in some cases. This happens in particular when the different attribute domains overlap too much.

Influence of dependency size

We study in this section the influence of dependency size on the effectiveness of the pruning pass. We used the same set of experiments as in the previous section, except that here we detail the performance results per dependency size. We found that the pruning pass has the same effects for each of the sizes larger than one; hence we distinguish only between unary and composite dependencies in this section. The results are presented in the two tables below, for key-based UINDs and for composite dependencies respectively.

Table: Efficiency of the pruning pass on key-based UINDs (FKcandRatio, PgRatio, and DegPg for Mondial, Genex, and Janus at a fixed noiseDeg, and for the synthetic data), per value of the pruning parameter (in %), with input parameters ε_key and ε_fk fixed.

First of all, notice that the figures in this table are nearly the same as in the table over all candidates, so the general behavior of the pruning pass is actually shaped by its behavior on unary candidates. Because there are far more unary candidates than composite candidates, the effectiveness of the PP is dominated by how it fares on unary candidates. For most unary candidates, the inclusion dependency does not hold in the database, as shown by the very low values of the FK candidate ratio (lowest for Janus, highest for Mondial).

Table: Efficiency of the pruning pass on composite FK candidates (FKcandRatio, PgRatio, and DegPg for Mondial, Genex, and Janus at a fixed noiseDeg, and for the synthetic data), per value of the pruning parameter (in %), with input parameters ε_key and ε_fk fixed.

This is due to the fact that our candidate-generating process is not very constraining: an attribute L is a left-hand-side candidate for a RHS K provided that L and K have the same type and L has fewer distinct values than K. Thus an attribute L can be a candidate even if it has no value in common with the right-hand side K. So there is a huge number of left-hand-side candidates for each right-hand side, and most of them are accidental candidates.
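A minimal sketch of this permissive generation step (in Python; the dict-based representation and the non-strict comparison are assumptions for illustration, the thesis defines the exact test):

    def unary_fk_candidates(attributes, keys):
        """Generate unary FK candidates: L is a left-hand-side candidate
        for a key K whenever they have the same type and L does not have
        more distinct values than K.  Both arguments map an attribute
        name to a (type, set_of_distinct_values) pair."""
        candidates = []
        for lname, (ltype, lvals) in attributes.items():
            for kname, (ktype, kvals) in keys.items():
                if lname != kname and ltype == ktype and len(lvals) <= len(kvals):
                    candidates.append((lname, kname))  # value overlap is not checked
        return candidates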

The pruning pass is very effective on unary candidates: the degree of pruning is high for Mondial, Genex, and Janus, so a large portion of the candidates are discarded during the pruning pass. However, there still remain candidates for which the inclusion dependency does not hold in the database: the pruning ratio stays below 1 for Mondial, Genex, and Janus.

Unlike for key-based UINDs, the pruning pass discards few composite candidates: for Mondial and Genex the pruning ratio is equal or nearly equal to the FKcand ratio, and for Janus PgRatio is barely above FKcandRatio (see the composite table above). Moreover, the FKcand ratio is high (lowest for Janus, highest for Mondial), so even without a pruning pass there are not many superfluous tests against the database. These differences between the composite and unary cases come from the fact that most unary candidates are accidental while most composite candidates are expected. Indeed, a composite FK candidate F_1 … F_k for a key K_1 … K_k is such that F_i ⊆ K_i for all 1 ≤ i ≤ k, so the candidate is likely to share many values with the key. Therefore the probability of pruning a composite candidate is very low. We computed this probability on an example taken from Janus: a composite candidate whose distinct values overlap those of its key almost entirely. Write N for the number of the candidate's distinct values, s for the number of those values shared with its key, m for the minimal number of values not common with the key that must be seen to discard the candidate (of the order of ⌈ε_fk · N⌉), and n for the number of values the pruning pass may examine (of the order of ⌈PgParam · N⌉). The probability of discarding the candidate is then the hypergeometric tail

    ProbaPg = P{at least m of the n examined values are not common with the key}
            = Σ_{j=m..n} C(N − s, j) · C(s, n − j) / C(N, n).

Note that this probability increases as ε_fk decreases, so the pruning pass can be effective on composite candidates when ε_fk is very small.
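The tail sum above is easy to evaluate; a small Python sketch, with made-up numbers standing in for the Janus example:

    from math import comb

    def proba_pg(N, s, n, m):
        """Probability of drawing at least m of the N - s values not shared
        with the key, when n of the candidate's N distinct values are
        examined (hypergeometric tail)."""
        return sum(comb(N - s, j) * comb(s, n - j)
                   for j in range(m, min(n, N - s) + 1)) / comb(N, n)

    # Made-up numbers: 1000 distinct values, 950 shared with the key,
    # 50 values examined, at least 10 misses needed to discard.
    print(proba_pg(1000, 950, 50, 10))  # tiny: the candidate is almost never pruned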

The composite candidates discarded during the pruning pass are accidental candidates, which share few values with their key. These accidental candidates often belong to the same table as the key, and also have some attributes in common with the key. For instance, F = (Firstname, Gender, Lastname) can be an FK candidate for the key K = (Firstname, Mid_init, Lastname), but F and K probably have very few common values.

The effectiveness of the pruning pass on composite candidates remains constant when the pruning parameter increases (see the composite table above). Indeed, the few accidental candidates are discarded with a very small PgParam, but the expected candidates share so many values with their keys that a very large PgParam would be required to prune them. Therefore, if the pruning pass is to be performed on composite candidates, it is best to use very small PgParam values, because the cost of pruning is then minimized without a loss of performance. However, as the benefit due to pruning is low, performing a PP may be more costly than its gains.

We saw that the general effectiveness of the PP, as studied earlier, is actually shaped by its effectiveness on key-based unary candidates, because a large part of the total number of candidates are unary. So for unary candidates, it is most effective to use a small PgParam value, close to ε_fk, because it brings good pruning effectiveness at low cost; in our experiments, the optimal value was barely above ε_fk. We also saw that the PP is not very effective on composite candidates, so it may be too costly to perform relative to its benefits on most databases (e.g. on Mondial).

Influence of input approximation thresholds on the benefit of the pruning pass

We study in this section how the FK approximation threshold ε_fk affects the efficiency of the PP. We ran our algorithm on Mondial and Janus with several ε_fk values, and compare the corresponding shapes of the curves representing the time ratio versus PgParam. The curves are displayed in the figure below.

The curves associated with the different ε_fk values have the same shape. Smaller values of ε_fk allow us to process fewer tuples during the pruning pass, so that it is even faster and less costly to discard the accidental candidates. For a fixed ε_key, the time ratio associated with the optimal value of the pruning parameter gets smaller when ε_fk decreases. We cannot compare the curves corresponding to different values of ε_key for Janus, because the number of keys is significantly larger for one ε_key value than for the other.

We can see on Mondial that as ε_fk grows, the left part of the curve is more and more truncated (recall that PgParam must exceed ε_fk). When ε_fk becomes large enough, the only part of the curve that remains is the part for which the PP is detrimental.

So the FK approximation threshold influences the effectiveness of the pruning pass insofar as it provides a lower bound for the pruning parameter. We saw that the PP is most effective for small values of PgParam; therefore, when a large ε_fk value is input to our algorithm, it may be more effective not to perform the PP.

Figure: Efficiency of the pruning pass for various ε_fk values (time ratio TimeWithPg/TimeNoPg versus the pruning parameter), on Mondial and Janus at a fixed noiseDeg and ε_key.


Influence of data size

We found in the complexity analysis that the cost of the pruning pass is in O(PgParam · p²), where p is the maximal number of tuples per table in the database. We now attempt to verify experimentally that the pruning pass is indeed quadratic with respect to the data size. We chose one of the synthetic datasets for this experiment, because it was easy to increase the size of the data without altering the characteristics of the dependencies. The figure below plots the running time of the algorithm versus the number of tuples in a table.

Figure: Running time (in minutes) versus the number of tuples, on synthetic data with ε_key, ε_fk, and PgParam fixed.

The graph in the figure above is consistent with the shape of a polynomial of degree 2.
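The quadratic trend can be checked numerically by comparing polynomial fits of degree 1 and 2; a sketch with made-up measurements (the real ones are in the figure above), assuming numpy is available:

    import numpy as np

    # Hypothetical (number of tuples, running time in minutes) measurements.
    sizes = np.array([10_000, 20_000, 40_000, 80_000], dtype=float)
    times = np.array([0.5, 1.9, 7.8, 31.0])

    # If the running time is quadratic in the number of tuples, the residual
    # of the degree-2 fit should be far smaller than that of the linear fit.
    for degree in (1, 2):
        residuals = np.polyfit(sizes, times, degree, full=True)[1]
        print(f"degree {degree}: residual {residuals[0]:.3f}")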

Conclusion

In this thesis, we studied the problem of discovering approximate keys and foreign keys holding in a relational database of which we assume no previous knowledge. The key discovery part of our topic benefits from a large body of work on functional dependencies: efficient algorithms (e.g. TANE [HKPT]) have been designed for extracting FDs from a database. On the other hand, identifying foreign keys, or more generally inclusion dependencies, is a relatively new data mining area. We proposed an integrated architecture for determining the approximate keys and foreign keys holding in a database. To the best of our knowledge, this is the first concrete proposal of an algorithm for this problem, even though the underlying ideas are not new. Our algorithm is designed to work with the data stored in a DBMS, but it could easily be adapted to be memory resident.

We presented two optimizations to this integrated architecture:

1. extracting only key-based unary inclusion dependencies, instead of all unary INDs as is usually done;

2. adding a pruning pass to the foreign key discovery. The goal of this extra step, performed on a small data sample, is to reduce the number of foreign key candidates tested against the database.

The pruning pass (PP) is useful for discarding accidental candidates, i.e. candidates that share very few values with their right-hand side. Indeed, processing a few tuples is enough to make the decision to prune such a candidate, so the PP is very effective, at low cost, on this type of candidate. For instance, the pruning pass is worthwhile when extracting key-based UINDs, because the candidate-generating process is not very constraining and creates many accidental candidates. However, when the candidates share more than a few values with their right-hand side, the cost of pruning is too high with regard to the savings in database queries it generates; thus the PP is disadvantageous. This happens for composite FKs, or when the attribute domains overlap too much.

Two input parameters of our algorithm have a major influence on the effectiveness of the pruning pass: the approximation threshold for foreign keys ε_fk, which controls the degree of approximation allowed in the FKs discovered, and the pruning parameter PgParam, which defines the fraction of tuples involved in the pruning pass. These two parameters are constrained by the inequality PgParam > ε_fk, which must be satisfied for the PP to function. The PP is most effective for small PgParam values and becomes unfavorable when PgParam grows; so when ε_fk is too large, the PP can be detrimental.

In summary, the pruning parameter should be set to a value slightly greater than ε_fk, and the pruning should occur only on unary candidates. Besides, if the attribute domains overlap too much, or if ε_fk is large, it is often faster to avoid the PP altogether.

The work presented in this thesis can be extended in several ways. One could investigate heuristic optimizations for the pruning pass. A possible improvement would be to decide dynamically whether a pruning pass should be used for a given foreign key candidate, depending on its domain and the domain of the key. When the domains are of type string, because there are so many possible values, it is very unlikely to find the same strings in both the key and the FK candidate unless the candidate really is a FK. When the domains are of type number, on the other hand, it would not be surprising to find the same values in both the key and the FK candidate, even in cases where they are not linked at all.
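As a sketch of what such a dynamic decision could look like (illustrative only; the thesis leaves this as future work):

    def should_run_pruning_pass(domain_type: str) -> bool:
        """Decide per candidate whether the pruning pass is worth running.

        String domains have so many possible values that accidental overlap
        is rare, so a small sample is informative and the PP is likely
        worthwhile.  Numeric domains collide easily even between unrelated
        attributes, so a sample is less conclusive and the PP is likely
        wasted cost."""
        return domain_type == "string"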

Bibliography

[Arm] William Armstrong. Dependency structures of database relationships. In Proc. of the IFIP Congress, Geneva, Switzerland.

[AS] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Proc. Int. Conf. on Very Large Data Bases (VLDB). Morgan Kaufmann.

[BB] Siegfried Bell and Peter Brockhausen. Discovery of constraints and data dependencies in databases (extended abstract). In European Conference on Machine Learning, Lecture Notes in Artificial Intelligence. Springer Verlag. Full version appeared as a technical report, ftp://ftp-ai.informatik.uni-dortmund.de/pub/Reports/.

[Bel] Siegfried Bell. Discovery and maintenance of functional dependencies by independencies. In Knowledge Discovery and Data Mining.

[BEST] François Bry, Norbert Eisinger, Heribert Schütz, and Sunna Torge. SIC: satisfiability checking for integrity constraints. In DDLP.

[BMKL] Denilson Barbosa, Alberto O. Mendelzon, John Keenleyside, and Kelly A. Lyons. ToXgene: an extensible template-based data generator for XML. In SIGMOD Conference.

[BMT] Dina Bitton, Jeffrey Millman, and Solveig Torgersen. A feasibility and performance study of dependency inference. In Proceedings of the Fifth International Conference on Data Engineering. IEEE Computer Society.

[Bou] Jean-François Boulicaut. A KDD framework for database audit. International Journal of Information Technology and Management, special issue on Information Technologies in Support of Business Processes.

[CFP] Marco A. Casanova, Ronald Fagin, and Christos H. Papadimitriou. Inclusion dependencies and their interaction with functional dependencies. Journal of Computer and System Sciences.

[CGK] Qi Cheng, Jarek Gryz, Fred Koo, T. Y. Cliff Leung, Linqi Liu, Xiaoyan Qian, and Berni Schiefer. Implementation of two semantic query optimization techniques in DB2 Universal Database. In The VLDB Journal.

[CI] Elliot J. Chikofsky and James H. Cross II. Reverse engineering and design recovery: a taxonomy. IEEE Software.

[Cod] E. F. Codd. Recent investigations in relational database systems. In Proceedings of the IFIP Congress.

[Dat] C. J. Date. An Introduction to Database Systems. Addison-Wesley.

[dJS] Lurdes Pedro de Jesus and Pedro Sousa. Selection of reverse engineering methods for relational databases. In Third International Conference on Software Maintenance and Reengineering, Amsterdam.

[FS] Peter A. Flach and Iztok Savnik. Database dependency discovery: a machine learning approach. AI Communications.

[Gen] Genex project. http://www.ncgr.org/genex/index.html.

[GGXZ] Parke Godfrey, Jarek Gryz, Haoping Xu, and Calisto Zuzarte. Exploiting constraint-like data characterizations in query optimization. SIGMOD Record.

[Hai] Jean-Luc Hainaut. Database reverse engineering. In Proceedings of the Conf. on ER Approach, San Mateo, CA.

[HKPT] Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka, and Hannu Toivonen. Efficient discovery of functional and approximate dependencies using partitions. In Proceedings of the Fourteenth International Conference on Data Engineering. IEEE Computer Society.

[KA] A. J. Knobbe and P. W. Adriaans. Discovering foreign key relations in relational databases. In The Thirteenth European Meeting on Cybernetics and Systems Research, volume II of Cybernetics and Systems.

[KM] Jyrki Kivinen and Heikki Mannila. Approximate dependency inference from relations. In International Conference on Database Theory (ICDT), Lecture Notes in Computer Science. Springer.

[KMRS] M. Kantola, H. Mannila, K.-J. Räihä, and H. Siirtola. Discovering functional and inclusion dependencies in relational databases. Journal of Intelligent Systems.

[Lab] F. Laburthe. Choco: an object-oriented constraint propagation kernel over finite domains. http://www.choco-constraints.net/index.html.

[LL] M. Levene and G. Loizou. A Guided Tour of Relational Databases and Beyond. Springer-Verlag.

[LL] Mark Levene and George Loizou. Guaranteeing no interaction between functional dependencies and tree-like inclusion dependencies. Theoretical Computer Science.

[LPL] Stéphane Lopes, Jean-Marc Petit, and Lotfi Lakhal. Efficient discovery of functional dependencies and Armstrong relations. In Advances in Database Technology (EDBT), International Conference on Extending Database Technology, Lecture Notes in Computer Science. Springer.

[LPT] S. Lopes, J.-M. Petit, and F. Toumani. Discovery of interesting data dependencies from a workload of SQL statements. In Principles of Data Mining and Knowledge Discovery, Third European Conference, Lecture Notes in Computer Science. Springer.

[LPT] Stéphane Lopes, Jean-Marc Petit, and Farouk Toumani. Discovering interesting inclusion dependencies: application to logical database tuning. Information Systems.

[LV] Mark Levene and Millist W. Vincent. Justification for inclusion dependency normal form. IEEE Transactions on Knowledge and Data Engineering.

[May] Wolfgang May. Information extraction and integration with Florid: the Mondial case study. Technical report, Universität Freiburg, Institut für Informatik. Available from http://www.informatik.uni-freiburg.de/~may/Mondial.

[MFH] Renée J. Miller, Daniel Fisla, Mary Huang, David Kalmuk, Fei Ku, and Vivian Lee. The Amalgam schema and data integration test suite. http://www.cs.toronto.edu/~miller/amalgam.

[MHH] Renée J. Miller, Laura M. Haas, and Mauricio A. Hernández. Schema mapping as query discovery. In Proceedings of the International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann.

[MLP] Fabien De Marchi, Stéphane Lopes, and Jean-Marc Petit. Efficient algorithms for mining inclusion dependencies. In International Conference on Extending Database Technology (EDBT), Lecture Notes in Computer Science. Springer.

[MR] H. Mannila and K.-J. Räihä. Dependency inference. In Proceedings of the Thirteenth International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann.

[MR] H. Mannila and K.-J. Räihä. The Design of Relational Databases. Addison-Wesley.

[MT] Heikki Mannila and Hannu Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery.

[MTV] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Efficient algorithms for discovering association rules. In AAAI Workshop on Knowledge Discovery in Databases (KDD), Seattle, Washington.

[NC] Noël Novelli and Rosine Cicchetti. FUN: an efficient algorithm for mining functional and embedded dependencies. In Proceedings of the International Conference on Database Theory (ICDT), Lecture Notes in Computer Science. Springer.

[Oce] Ocean Drilling Program. Janus database. http://www-odp.tamu.edu/database.

[PT] Jean-Marc Petit and Farouk Toumani. Discovering inclusion and approximate dependencies in relational databases. In Proc. Journées Bases de Données Avancées.

[PTBK] Jean-Marc Petit, Farouk Toumani, Jean-François Boulicaut, and Jacques Kouloumdjian. Towards the reverse engineering of denormalized relational databases. In Proceedings of the Twelfth International Conference on Data Engineering. IEEE Computer Society.

[SZWA] W.-M. Shen, W. Zhang, X. Wang, and Y. Arens. Model construction with key identification. In Proceedings of the SPIE Symposium on Data Mining and Knowledge Discovery: Theory, Tools, and Technology. SPIE.