
Approximate key and foreign key discovery in relational databases

by

Charlotte Vilarem

A thesis submitted in conformity with the requirements

for the degree of Master of Science

Graduate Department of Computer Science

University of Toronto

© Copyright by Charlotte Vilarem

Abstract

Approximate key and foreign key discovery in relational databases

Charlotte Vilarem

Master of Science

Graduate Department of Computer Science

University of Toronto

Organizations have been storing huge amounts of data for years, but over time the knowledge about these legacy databases may be lost or corrupted.

In this thesis, we study the problem of discovering approximate keys and foreign keys satisfied in a relational database of which we assume no previous knowledge.

The main ideas underlying our method for tackling this problem are not new; however, we propose two novel optimizations. First, we extract only key-based unary inclusion dependencies (UINDs), instead of all UINDs. Second, we add a pruning pass to the foreign key discovery, performed on a small data sample; it reduces the number of candidates tested against the database.

We validate our contributions on both real-life and synthetic data. We investigate the influence of input parameters on the performance of the pruning pass, and provide guidance on how to best set these parameters.

Acknowledgements

I would like to express my gratitude to a number of people without whose advice and encouragement this thesis would not have been possible.

First, I would like to thank my supervisor, Professor Renee Miller, for her guidance, advice, and patience during my research. I would also like to thank my second reader, Professor Ken Sevcik, for his time and helpful comments.

I would like to thank all my friends in Toronto, especially Travis, Vlad, and Tristan, for the great time I had with you throughout my program at the University of Toronto.

Finally, my deepest gratitude goes to my family and boyfriend, for their support and encouragement, and for their involvement in this work. I would never have made it without them.

Contents

1 Introduction
    Motivation
    Contributions of the thesis
    Scope of the research
    Outline of the thesis

2 Integrity constraints in relational databases
    Background
        Preliminaries
        Functional and inclusion dependencies
        Dependency inference problem
        Approximate dependency inference problem
        Keys and foreign keys
    Review of literature
        Levelwise algorithm
        Key and functional dependency discovery
        Foreign key and inclusion dependency discovery

3 Algorithms and complexity
    General architecture
    Finding the keys
        Testing the key candidates
        Generating key candidates
    Finding the foreign keys
        Generating foreign key candidates
        Pruning pass over the foreign key candidates
        Testing the remaining foreign key candidates
        Discovering the unary inclusion dependencies
    Complexity analysis
        Complexity of key discovery
        Complexity of foreign key discovery

4 Experiments
    Goals of the tests
    Experimental setup
        Metrics used to evaluate the impact of the pruning pass
        Datasets used in the experiments
        Setting of the algorithms' input parameters
    Key-based unary inclusion dependencies
    Efficiency of the pruning pass
        General behavior of the pruning pass in relation to the pruning parameter
        Influence of dependency size
        Influence of input approximation thresholds on the benefit of the pruning pass
        Influence of data size

5 Conclusion

Bibliography

List of Algorithms

    General extraction of keys and foreign keys
    ExtractKeys
    GenerateNextCand
    ExtractForeignKeys
    ExtractUINDs

List of Figures

    Levelwise algorithm
    Unfolding of Tane on the database of the example
    Unfolding of DepMiner on the database of the example
    Unfolding of FUN on the database of the example
    Time savings
    Efficiency of the pruning pass for various ε_fk values
    Running time on synthetic data with given PgParam, ε_key, and ε_fk

Chapter 1

Introduction

For more than a decade now, organizations have been storing huge amounts of data in relational databases. During the design phase of the database, a conceptual representation of the data is established, for instance an Entity-Relationship model, which describes the organization of the data in the database. This representation allows one to understand a dataset, and to develop applications using the data.

More precisely, a conceptual representation is used to explain the data: data relationships, data semantics, and data constraints. As an example, consider a university database. Its representation might specify that students are represented by a student number, a name, and a first name; that courses are represented by a course number and a course title; and that students take several courses for their program, i.e., there is a many-to-many relationship between students and courses. It might also associate attribute names with their meaning, for instance, that SNUM stands for student number.

In the relational data model, data are stored in flat tables, and each row represents an object or relationship in the real world. But the mere definition of tables is not enough to represent reality accurately: we need to add integrity constraints


to ensure that the data stored in the database reflects accurately the real-world restrictions, for instance that a student number is associated with only one student.

The most important integrity constraints, also called data dependencies, are functional and inclusion dependencies [CFP]. Functional dependencies are conditions stating that for any value of a set of attributes, there is at most one value for a set of target attributes; for instance, given a room, a date, and a time, there is at most one talk being held in there. Inclusion dependencies state that values occurring in certain columns of one relation must also occur in columns of another relation; for example, a manager is also an employee. In practice, most of the time, only specializations of these two types of dependencies are used: keys and foreign keys. Keys uniquely identify objects, or tuples, in a relation; e.g., a student number uniquely identifies a student. Foreign keys are inclusion dependencies that refer to a key; they express relationships between objects in the database. As an example, the student number column of a table enrolled-in is a foreign key referencing the key column student number in the table student.

But over time, the conceptual representation may be lost, due to multiple updates, extensions, loss of documentation, or turnover of domain experts. Hence these legacy databases may no longer be usable, and the knowledge they contain is wasted. Such a situation calls for a database reverse engineering process, in order to build a new model which reflects accurately the current state of the database. But because the data in legacy databases is likely to be noisy or corrupted, the true data semantics may not hold exactly in the database. Indeed, an exact model would incorporate the noise in its representation of the dataset, while what we really want is a model that reflects the data, independent of the inconsistencies introduced over time. Therefore, we may want to accommodate the noise with error thresholds, so that the model found almost holds in the database, and fits the data without the noise.

However, determining dynamically the best error thresholds for a database is a difficult problem, so in this thesis we focus on a simpler one: the discovery of approximate keys and foreign keys holding in a database instance, for given approximation thresholds.

Motivation

Reverse engineering is the process of identifying a system's components and their relationships, and creating representations of the system in another form or at a higher level of abstraction [CI]. Its aim is to redocument, convert, restructure, maintain, or extend legacy systems [Hai]. In the case of database reverse engineering, this process can be divided into two distinct steps [PTBK]: (1) eliciting the data semantics from the system, and (2) expressing the extracted semantics with a high-level data model. Many approaches to database reverse engineering overlook the first step by assuming that some of the data semantics is known, but such assumptions may not hold for legacy databases. (For a classification of database reverse engineering methods according to their input requirements, see [dJS].) Indeed, old versions of database management systems do not support the definition of foreign keys (e.g., foreign key definitions were only introduced in a later version of Oracle). Therefore, identifying the keys and foreign keys holding in a database instance can be part of the first step of a database reverse engineering process.

Integrity constraints, especially keys and foreign keys, can play an important role in query optimization. The role of query optimization is to find an efficient way of processing a query, using all sorts of information about the data stored in the database, such as statistics and integrity constraints (keys and foreign keys). It is a three-stage process [Dat]. First, the query optimizer rewrites the query into a more efficient form, so that the performance of the query does not depend on how the user chose to write it. Then the optimizer generates possible execution plans for the new representation of the query. And then it computes cost estimations for each of these execution plans, and chooses the most efficient of them.

Keys and foreign keys are used both in the query rewrite phase and to predict cost estimations. For example, if dept_num is a key of relation department, then the query "select distinct dept_num from department" is equivalent to the more efficient "select dept_num from department". Keys and foreign keys also help in predicting the result size of joins, which is used in cost estimations.

Semantic query optimization, i.e., query optimization using a broader range of integrity constraints, has long been a subject of research, but most of the proposed techniques have not been implemented in commercial optimizers [CGK]. Cheng et al. and Godfrey et al. address this issue [CGK, GGXZ], investigating how some semantic query optimizations using keys, foreign keys, and other types of integrity constraints can be integrated in traditional query optimizers.

Integrity constraints can also be useful in schema mapping [MHH]. The goal in schema mapping is to map a source (legacy) database into a fixed target schema, through a set of queries on the source database. These mapping queries are derived from the correspondences between a target value and a set of source values. But when the value correspondences involve several relations of the source database, one needs a way of joining the tuples of these relations. The joins are determined from foreign key paths. So these joins require knowledge of the integrity constraints (keys and foreign keys) holding in the source database.

Contributions of the thesis

The contributions of this thesis are the following:

We propose a method for approximate key and foreign key discovery in relational databases. Even though the underlying ideas are not new, this is the first proposal of a concrete algorithm for extracting foreign keys from a database of which we assume no previous knowledge.

We implemented two novel optimizations for foreign key discovery. The more important is a pruning pass over the foreign key candidates, performed on a small data sample, which can reduce dramatically the number of accesses to the database. The other optimization consists of extracting solely key-based unary inclusion dependencies, rather than all unary inclusion dependencies as is usually done, since only these are used in the foreign key candidate generation. This has a great impact in practice on the running time of the algorithm.

We evaluate the performance of our algorithm on both publicly available real-life and synthetic datasets. This allows us to study the dataset characteristics that affect the algorithm.

Scope of the research

We provide in this thesis a general framework for the discovery of keys and foreign keys (FKs). We chose, for simplicity, to keep the data inside the DBMS and to process it through SQL queries. But our architecture could easily be adapted to more efficient treatments of the data.

The attributes whose types are large, such as long strings (e.g., long varchars), were not considered for key or FK discovery, because they cannot appear in SQL statements of the form SELECT DISTINCT, which are needed to test keys and FKs. In DB2 and Oracle, this limit is reached beyond a certain data type size. This limitation should not be too restrictive, though, because attributes with large types are generally not intended as keys or FKs. Indeed, large values in keys or FKs mean slower searches and useless replication of data, resulting in poor performance.

Outline of the thesis

In Chapter 2, we give some background and survey the literature about integrity constraints and their discovery.

In Chapter 3, we present our general architecture for approximate key and foreign key discovery, and analyze the complexity of our approach.

In Chapter 4, we describe the testing of our algorithm on both real-life and synthetic data, and study how input parameters and data characteristics influence the performance of our application.

Finally, we conclude in Chapter 5.

Chapter 2

Integrity constraints in relational databases

Background

Preliminaries

First of all, we define the concepts used throughout this thesis. For more details, see the books by Levene and Loizou [LL] or Date [Dat].

Definition (Basic relational database concepts). In the following, uppercase letters from the end of the alphabet, such as X, Y, Z (which may be subscripted), will be used to denote sets of attributes, while those from the beginning of the alphabet, such as A, B, C, will be used to denote single attributes.

A relation schema R is a finite set of attributes. The domain of an attribute A, denoted by Dom(A), is the set of all possible values of A. A tuple over a relation schema R = {A_1, ..., A_m} is a member of the Cartesian product Dom(A_1) × ... × Dom(A_m). A relation r over R is a finite set of tuples over R. The cardinality of a set X is denoted by |X|.

If X ⊆ R is an attribute set and t a tuple over R, we denote by t[X] the restriction of t to X. The projection of a relation r over R onto X is defined by π_X(r) = {t[X] | t ∈ r}.

A database schema R is a finite set of relation schemas R_i. A database d over R is a set of relations r_i, one over each R_i ∈ R.

Informally, relations are represented by tables, attributes by columns, and tuples by rows. The cardinality of a relation, |r|, is the number of rows of the table; we will commonly denote it by p throughout this thesis.

We now define a running example used in the rest of this chapter.

Example (MovieDB). Consider the following database, called MovieDB, representing information about movies and the theaters in which they are showing.

Table Movies
    Title           Length   Genre
    The mummy       … min    Action
    In the bedroom  … min    Drama
    Life of Brian   … min    Comedy
    Atanarjuat      … min    Drama
    Monster's ball  … min    Drama

Table Theaters
    Name        Location
    ByTowne     Rideau Street
    South Keys  Bank Street
    Mayfair     Bank Street

Table Schedule
    Theater     Movie           Time
    ByTowne     In the bedroom  … pm
    ByTowne     Monster's ball  … pm
    South Keys  In the bedroom  … pm
    South Keys  The mummy       … pm
    South Keys  Monster's ball  … pm
    Mayfair     Life of Brian   … pm

The database MovieDB is composed of three relations: Movies, with schema {Title, Length, Genre}; Theaters, with schema {Name, Location}; and Schedule, with schema {Theater, Movie, Time}. The database schema is the set {Movies, Theaters, Schedule}.

Functional and inclusion dependencies

The schemas alone are not sufficient to describe a database. Data semantics in a relational database are expressed by integrity constraints, and the most important of these constraints are functional and inclusion dependencies [CFP, LL]. Thus integrity constraints define the set of allowable states of a database.

Functional dependencies (FDs) have long been studied in the context of database normalization [LV]. Normalization is the process of designing a database satisfying a set of integrity constraints efficiently, and in order to avoid inconsistencies when manipulating the database [LL]. In the presence of only functional dependencies, the problems that may arise are update anomalies (after an update, the data is no longer consistent with the functional dependencies) and redundancy problems (parts of the data are unnecessarily duplicated); to prevent them, the database schema should be in Boyce-Codd Normal Form (BCNF) [Cod]. A database schema is in BCNF if, in each relation of the database, all FD left-hand sides are keys or supersets of keys of the relation. In the presence of both functional and inclusion dependencies (INDs), Levene and Vincent [LV] identify additional problems, and propose an inclusion dependency normal form (IDNF) to prevent them. A database is in IDNF if it is in BCNF with respect to the functional dependencies, all inclusion dependencies are really foreign keys, and the set of inclusion dependencies is acyclic. (The authors employ different terms: they state that a database is in IDNF if it is in BCNF w.r.t. the FDs, and if the set of inclusion dependencies is noncircular and key-based.) This normal form implies no interaction between the FDs and the INDs, prevents the problem of attribute redundancy, and satisfies a generalised form of entity integrity. These normal forms are the goal in database design. As a result, most FDs and INDs occurring in practice are based on keys and foreign keys.

Definition (Functional dependency). A functional dependency (FD) over a relation schema R is a statement of the form R: X → Y, where X, Y ⊆ R are sets of attributes.

An FD R: X → Y is satisfied in a relation r over R if whenever two tuples have equal X values, they also have equal Y values. Formally, a functional dependency R: X → Y is satisfied (or holds) in a relation r over R, denoted by r ⊨ R: X → Y, if ∀ t_1, t_2 ∈ r, t_1[X] = t_2[X] ⟹ t_1[Y] = t_2[Y].

An FD R: X → Y is minimal in r if Y is not functionally dependent on any proper subset of X, i.e., if Z → Y does not hold in r for any Z ⊂ X.

In the MovieDB example above, Schedule: {Theater, Movie} → Time holds. This means that in the theaters described in the MovieDB database, a movie is shown once a day, always at the same time.
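To make the satisfaction test concrete, the following is a minimal sketch in Python (our illustration, not part of the cited material), with a relation represented as a list of dictionaries mapping attribute names to values:

    def holds_fd(rows, x_cols, y_cols):
        """Check r |= X -> Y: tuples with equal X values must have equal Y values."""
        y_of = {}
        for row in rows:
            x = tuple(row[c] for c in x_cols)
            y = tuple(row[c] for c in y_cols)
            if y_of.setdefault(x, y) != y:  # a second Y value for this X value
                return False
        return True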

Definition (Inclusion dependency). An inclusion dependency (IND) over a database schema R is a statement of the form R_i[X] ⊆ R_j[Y], where R_i, R_j ∈ R, and X, Y are sequences of attributes such that X ⊆ R_i, Y ⊆ R_j, and |X| = |Y|. A unary inclusion dependency (UIND) is an IND such that |X| = |Y| = 1. An IND R_i[X] ⊆ R_j[Y] is of size i if |X| = |Y| = i.

Let d be a database over a database schema R, where r_i, r_j ∈ d are relations over relation schemas R_i, R_j ∈ R. An inclusion dependency R_i[X] ⊆ R_j[Y] is satisfied (or holds) in a database d over R, denoted by d ⊨ R_i[X] ⊆ R_j[Y], if ∀ t_i ∈ r_i, ∃ t_j ∈ r_j such that t_i[X] = t_j[Y]. Equivalently, d ⊨ R_i[X] ⊆ R_j[Y] whenever π_X(r_i) ⊆ π_Y(r_j).

Note that X and Y are sequences of attributes: the ordering of the attributes within X and Y matters. Indeed, the two following INDs are not equivalent: Manager[Firstname, Lastname] ⊆ Employee[Firstname, Lastname] is different from Manager[Lastname, Firstname] ⊆ Employee[Firstname, Lastname].

In the MovieDB example, Schedule[Theater] ⊆ Theaters[Name] and Schedule[Movie] ⊆ Movies[Title] hold. The first inclusion dependency describes the fact that all theaters that show movies (i.e., the theaters in Table Schedule) appear in the list of the town's theaters (i.e., are listed in Table Theaters). The second IND describes the fact that information such as length and genre (i.e., information displayed in Table Movies) is available for all movies currently playing (i.e., the movies in Table Schedule).
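The satisfaction test for an IND likewise reduces to a containment check between projections. A minimal sketch (same illustrative representation as above), checked here on a fragment of the MovieDB example:

    def holds_ind(lhs_rows, rhs_rows, x_cols, y_cols):
        """Check d |= R_i[X] <= R_j[Y]: pi_X(r_i) is contained in pi_Y(r_j)."""
        rhs_values = {tuple(row[c] for c in y_cols) for row in rhs_rows}
        return all(tuple(row[c] for c in x_cols) in rhs_values for row in lhs_rows)

    schedule = [{"Theater": "ByTowne"}, {"Theater": "Mayfair"}]
    theaters = [{"Name": "ByTowne"}, {"Name": "South Keys"}, {"Name": "Mayfair"}]
    assert holds_ind(schedule, theaters, ["Theater"], ["Name"])  # Schedule[Theater] <= Theaters[Name]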

Dependency inference problem

Much work has been done on the implication problem of functional and inclusion dependencies [CFP, LL], but this is not our focus in this thesis. The implication problem for functional and inclusion dependencies is the problem of deciding, for a given set S of FDs and INDs, whether S logically implies a dependency σ, where σ is an FD or an IND.

Rather, we are interested in the inference problem of data dependencies: given a database d, find a small cover of all data dependencies satisfied in d, where the data dependencies can be functional dependencies [KM] or inclusion dependencies [MLP]. A cover for a set F of dependencies is a set G of dependencies such that F and G are equivalent, i.e., any database that satisfies all the dependencies of F also satisfies all the dependencies of G, and vice versa [KM]. A cover F is minimum if there is no cover G of F such that G is a subset of F and G has fewer dependencies than F. In the inference problem, the goal is to find a small, but not necessarily minimum, cover. Deducing a minimum cover out of a small cover could be added as a final step, independent of the data.

Most of the work in this research area has been devoted to only functional dependencies, both exact [HKPT, NC, LPL, BMT, BB, Bel, FS] and approximate [HKPT, KM, Bou], while inclusion dependencies have raised little interest [MLP, KMRS, PT, LPT, BB, Bou].

Extracting functional or inclusion dependencies from a database instance has been shown to be hard in the worst case [MR, KMRS, MR].

Mannila and Räihä proved that there are small relations where the number of functional dependencies is exponential with respect to the number of attributes of the relations [MR]. The theorem states that for each n there exists a relation r over R such that n = |R|, |r| = O(n), and each cover of all functional dependencies has a number of dependencies exponential in n. Therefore, the inference problem for FDs is inherently exponential with respect to the number of attributes.

Assuming two n-attribute relations r_1, r_2, there are more than n! possible nonequivalent inclusion dependencies from r_1 to r_2 [KMRS]. The factorial comes from the fact that sequences, and not sets, of attributes are considered in INDs.

Kantola et al. proved that what they call FULL-IND-EXISTENCE is an NP-complete problem in the size of the schema [KMRS]. They define FULL-IND-EXISTENCE as follows: let R and S be relation schemas, and X a sequence consisting of the attributes of R in some order. Given relations r and s over R and S respectively, decide whether there exists a sequence Y, consisting of distinct attributes of S in some order, such that the dependency R[X] ⊆ S[Y] holds in the database {r, s}. Therefore, it is not always possible to check the existence of long INDs quickly.

Although our work is not about the implication problem, we can use inference rules for dependency implication in the inference problem, in order to speed up dependency discovery. We now give inference rules for FDs and INDs.

The following inference rules for the FD implication problem, known as Armstrong's axiom system [Arm], form a sound and complete axiomatization of FDs:

1. reflexivity: if Y ⊆ X ⊆ R, then X → Y
2. augmentation: if X → Y and W ⊆ R, then XW → YW
3. transitivity: if X → Y and Y → Z, then X → Z

If X is a key for R, then the functional dependency X → R holds. So, by the augmentation rule, the FDs XW → R also hold for all W ⊆ R, and thus there is no need to check these FDs against the database.

Casanova et al. give a sound and complete axiomatization of INDs, composed of the following inference rules [CFP]:

1. reflexivity: R[X] ⊆ R[X], if X is a sequence of distinct attributes of the relation schema R
2. projection and permutation: if R[A_1, ..., A_m] ⊆ S[B_1, ..., B_m], then R[A_{i_1}, ..., A_{i_k}] ⊆ S[B_{i_1}, ..., B_{i_k}] for each sequence i_1, ..., i_k of distinct integers from {1, ..., m}
3. transitivity: if R[X] ⊆ S[Y] and S[Y] ⊆ T[Z], then R[X] ⊆ T[Z]

Thus, if the IND Manager[Firstname, Lastname] ⊆ Employee[Firstname, Lastname] holds, then there is no need to test Manager[Lastname, Firstname] ⊆ Employee[Lastname, Firstname], since we know from Rule 2 that it also holds. Again, if S[Y] ⊆ T[Z] and R[X] ⊈ T[Z], then by Rule 3, R[X] ⊈ S[Y]. We can also use the contrapositive of Rule 3 to avoid some tests: if R[X] ⊆ S[Y] and R[X] ⊈ T[Z], then S[Y] ⊈ T[Z].

Approximate dependency inference problem

The approximate dependency inference problem is defined as the dependency inference problem in which the results do not need to be completely accurate: the dependencies found hold exactly, or almost hold, in the database [KM]. We allow some false positives in the result, but no false negative. This means that some dependencies that do not hold exactly but almost hold are included in the result, and no dependency that holds exactly is missing from the result.

Kivinen and Mannila [KM] concentrate on functional dependencies. To define formally a functional dependency that almost holds, they present three measures to compute the error of an FD. The error measure retained by subsequent work on approximate functional dependency inference [HKPT, Bou] is the so-called g_3 measure, which is defined below.

The error measure g_3 of an FD X → Y in a relation r is the minimal fraction of tuples to remove from r for X → Y to hold [KM]:

    g_3(X → Y, r) = (|r| − max{|s| : s ⊆ r and s ⊨ X → Y}) / |r|

Definition (Approximate functional dependency) [KM]. An approximate FD X → Y holds in a relation r with respect to an error threshold ε if and only if g_3(X → Y, r) ≤ ε.

When the dependency holds exactly, the error g_3 is equal to 0, and when the dependency is violated in many tuples, the error tends to 1.

Computing the error for a given FD can be done in O(|r|) time by using special data structures such as hash tables [HKPT], or in O(|r| log |r|) time by sorting the data. But the computation is not always necessary, since the error can be estimated through upper and lower bounds.

Note that the error measure g_3 is a parameter of this definition of approximate FD; hence the definition remains the same even if we change the error measure.
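As a sketch of the hash-based computation (our illustration; the names are ours): group the tuples by their X value, and within each group keep only the tuples carrying the most frequent Y value, since a maximal subset satisfying X → Y keeps exactly one Y value per X value:

    from collections import Counter, defaultdict

    def g3_fd(rows, x_cols, y_cols):
        """g3 error of X -> Y: minimal fraction of tuples to remove for it to hold."""
        groups = defaultdict(Counter)
        for row in rows:
            x = tuple(row[c] for c in x_cols)
            y = tuple(row[c] for c in y_cols)
            groups[x][y] += 1
        # within each X-group, only the tuples of one Y value can be kept
        kept = sum(max(counts.values()) for counts in groups.values())
        return 1 - kept / len(rows)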

The error measure g_3 defined for FDs has been adapted to the case of INDs in two different ways.

First, g_3 becomes the proportion of rows to remove from the left-hand-side relation for the IND to hold in a database d [Bou, PT]:

    g_3(R_i[X] ⊆ R_j[Y], d) = (|r_i| − max{|s| : s ⊆ r_i and (d ∖ {r_i}) ∪ {s} ⊨ R_i[X] ⊆ R_j[Y]}) / |r_i|

Lopes et al. [LPT] point out that this definition does not respect a natural property of INDs, namely that, for an IND holding in a database, the number of distinct values of the left-hand side is smaller than the number of distinct values of the right-hand side. This is due to the fact that the definition of INDs is based on distinct tuples, while the error measure g_3 is not. Thus they propose a new error measure for INDs, g_3', which is the proportion of tuples with distinct X values (independently of their occurrences) to remove from the left-hand-side relation for the IND to hold:

    g_3'(R_i[X] ⊆ R_j[Y], d) = (|π_X(r_i)| − max{|π_X(s)| : s ⊆ r_i and (d ∖ {r_i}) ∪ {s} ⊨ R_i[X] ⊆ R_j[Y]}) / |π_X(r_i)|
                             = |π_X(r_i) ∖ π_Y(r_j)| / |π_X(r_i)|

This error measure is also used by Shen et al., in the form of a fitness criterion equal to 1 − g_3' [SZWA].

We adopt in this thesis the second error measure, g_3', which better conveys the semantics of inclusion dependencies.

Definition (Approximate inclusion dependency). An approximate IND R_i[X] ⊆ R_j[Y] holds in a database d with respect to an error threshold ε if and only if g_3'(R_i[X] ⊆ R_j[Y], d) ≤ ε.

Here again, we could replace the error measure g_3' by any other measure.
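With this measure, the error computation reduces to comparing sets of distinct projected values; a minimal sketch (our illustration, same representation as before):

    def g3_prime_ind(lhs_rows, rhs_rows, x_cols, y_cols):
        """g3' error of R_i[X] <= R_j[Y]: fraction of distinct X values of r_i
        that are absent from pi_Y(r_j)."""
        xs = {tuple(row[c] for c in x_cols) for row in lhs_rows}
        ys = {tuple(row[c] for c in y_cols) for row in rhs_rows}
        return len(xs - ys) / len(xs)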



Keys and foreign keys

In this thesis we concentrate on keys and foreign keys, which are specializations of FDs and INDs, respectively.

A superkey is a set of attributes that determines all the attributes in a relation schema, and a key is a superkey whose set of attributes is minimal.

Definition (Key and approximate key). A set of attributes K ⊆ R is a superkey for R if K functionally determines all other attributes of the relation, i.e., R: K → R. A key (or minimal key) is a superkey such that no subset of it is a superkey.

K is a superkey in a relation r if no two tuples in r agree on K; thus K identifies uniquely each tuple in the relation.

Because a key is a special case of FD, the error measure for keys is the same as for FDs: the minimal fraction of tuples to remove for the dependency to hold in r. This simplifies into the fraction of tuples with duplicate K values:

    g_3(K, r) = (|r| − |π_K(r)|) / |r|

where |π_K(r)| is the number of distinct K values in r.

An approximate key K for r w.r.t. error threshold ε_key is a key such that g_3(K, r) ≤ ε_key.

At the schema level, a key is a constraint, but at the relation level, a key is a property derived from the relation. This corresponds to the distinction between a functional dependency constraint and the satisfaction of an FD in a relation (a property).

For the specific MovieDB database of the example above (that is, at the relation level), Title is a key for Movies, Name is a key for Theaters, and {Theater, Movie} is a key for Schedule. Movie is an approximate key for Schedule w.r.t. error threshold 1/3, and Time is an approximate key for Schedule w.r.t. a suitable error threshold.

The number of keys in a relation can be exponential in the size of the schema.

Theorem [LL]. The number of keys for a schema R such that |R| = n is at most the binomial coefficient

    C(n, ⌊n/2⌋) = O(2^n / √n)

and there exists a relation r over R such that there exist C(n, ⌊n/2⌋) keys that are satisfied in r.

Definition (Foreign key and approximate foreign key). Let R be a database schema, let R_1, R_2 ∈ R be relation schemas, and let K be a key of R_2. In addition, let d = {r_1, r_2, ..., r_n} be a database over R.

A foreign key constraint is a specification of a set of attributes X ⊆ R_1 and a key K of R_2. The set of attributes X is called a foreign key (FK) of R_1, and it is said to reference the key K of R_2. The foreign key constraint that X references K is satisfied in d if the following condition holds: for all tuples t_1 ∈ r_1, there exists t_2 ∈ r_2 such that t_1[X] = t_2[K]. A foreign key is an IND whose right-hand side is a key. It is also called a referential integrity constraint, or key-based inclusion dependency.

The error measure for foreign keys is thus the same as for INDs, g_3'. An approximate foreign key R_1[X] ⊆ R_2[K] is satisfied (or holds) in d w.r.t. error threshold ε if g_3'(R_1[X] ⊆ R_2[K], d) ≤ ε.

In the MovieDB example, the following foreign keys hold: Schedule[Movie] ⊆ Movies[Title], Theaters[Name] ⊆ Schedule[Theater], and Schedule[Theater] ⊆ Theaters[Name].

Review of literature

First of all, we present the levelwise algorithm, used in a wide variety of data mining applications, from association rules to functional and inclusion dependencies. Then we review the methods used in functional dependency or key discovery, and finally the approaches for inclusion dependency or foreign key discovery.

Levelwise algorithm

One of the basic problems in knowledge discovery in databases (KDD) is to find all potentially interesting sentences from a dataset. Then a user can select the really interesting ones. For example, these interesting sentences can be association rules, or functional or inclusion dependencies. Mannila and Toivonen [MT] study thoroughly a breadth-first, or levelwise, algorithm, also called the generic data mining algorithm, for finding all potentially interesting sentences. Their paper includes a complexity analysis, as well as some applications, including functional and inclusion dependency discovery. Algorithms for discovering functional and inclusion dependencies within this framework are also described elsewhere [Bou]. The levelwise algorithm has been used, among other applications, for discovering association rules [AS, MTV], for discovering functional dependencies [HKPT, LPL, NC], and for discovering inclusion dependencies [MLP]. In this thesis, we use it to extract keys.

We now define more precisely the KDD framework of Mannila and Toivonen. Given a database d, define a language L for expressing properties of the data, and a selection predicate q. The selection predicate q evaluates whether a sentence φ ∈ L defines a potentially interesting property of d. The task is to find a theory of d with respect to L and q, i.e., the set Th(L, d, q) = {φ ∈ L | q(d, φ) is true}. Consider the example of inclusion dependency discovery: the language L consists of all inclusion dependencies, and the predicate q is simply the satisfaction predicate.

To apply the levelwise algorithm, we need to define a specialization relation ≼ between sentences, i.e., a partial order, which is also monotonic w.r.t. q. That is, if φ ≼ θ (i.e., θ is more specific than φ) and q(d, θ) is true, then q(d, φ) is also true. This means that if a sentence is potentially interesting w.r.t. q, then all more general sentences are also potentially interesting w.r.t. q.

In our example of IND discovery, the specialization relation is defined as follows: if φ = R[X] ⊆ S[Y] and θ = R'[X'] ⊆ S'[Y'], we have φ ≼ θ if R = R', S = S', and the sequences X and Y are contained in X' and Y' respectively. For instance, R[A] ⊆ S[D] ≼ R[A, B, C] ⊆ S[D, E, F].

Then the idea in the levelwise algorithm is to start from the most general sentences, and to try to generate and evaluate more and more specific sentences, but without evaluating sentences that we know from previous stages will not be interesting. This corresponds to pruning the search space. In the case of IND discovery, we are interested in INDs with the longest attribute sequences, so we start from INDs with small sequences, and step by step we add attributes until the longest ones are found. But when, for instance, R[A] ⊆ S[D] is not satisfied, we know that R[A, B, C] ⊆ S[D, E, F] cannot be satisfied, so we do not evaluate it. For some languages, such as functional and inclusion dependencies, it is possible to compute the interesting sentence candidates at one level uniquely from the candidates and interesting sentences at the previous level, thus saving memory.

This framework, when applied as such to key discovery, searches for superkeys starting from the large ones and moving towards small ones, eventually finding the keys. But as keys tend to be of small size, this algorithm is highly inefficient. Another algorithm, also based on the principles of exploring a search space level by level, reusing the results of previous levels, and pruning the search space as soon as possible, has been proposed by Huhtala et al. [HKPT]. But unlike the framework described above, this algorithm does not rely on a specialization relation to direct the search. It is more suitable for FD/key discovery, since it searches for FDs from small to large. Therefore, we applied this algorithm to key discovery in this thesis. We describe informally in the figure below the levelwise algorithm, alongside the algorithm applied to key discovery.

Note that this algorithm also works for approximate inference: it suffices that the selection predicate q evaluates to true when a sentence almost holds.

To summarize, the main advantages of this algorithm are twofold: reducing the computation at each level by reusing the results of previous levels, and efficiently pruning the search space.

Levelwise algorithm for finding all potentially interesting sentences.

Input: a database d (consisting of only one relation r over R for the key discovery example), a language L with specialization relation ≼, and a selection predicate q.
Output: Th(L, d, q).

Method (generic algorithm, with the key discovery instance in brackets):

    level := 1
    C(1) := most general candidates in L w.r.t. ≼       [C(1) := {A | A ∈ R}]
    while C(level) ≠ ∅ do
        F(level) := {φ ∈ C(level) | q(d, φ)}            [Keys(level) := {X ∈ C(level) | |π_X(r)| = |r|}]
        C(level+1) := genNextCand(C(level), F(level))   [C(level+1) := {X | |X| = level+1 and
                                                         X is not a superset of a key in Keys(level)}]
        level := level + 1
    done
    output F(level) for each level                      [output Keys(level) for each level]

Figure: Levelwise algorithm.
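The following sketch instantiates the key discovery column of the figure for an in-memory relation (Python; our illustration, not the thesis implementation, which keeps the data in the DBMS). An approximate variant would replace the exactness test with g_3(X, r) ≤ ε_key:

    def levelwise_keys(rows, attributes):
        """Levelwise search for the minimal keys of one relation."""
        n = len(rows)
        keys = []
        candidates = [(a,) for a in sorted(attributes)]
        while candidates:
            non_keys = []
            for x in candidates:
                distinct = {tuple(row[a] for a in x) for row in rows}
                if len(distinct) == n:      # X identifies every tuple: X is a key
                    keys.append(x)
                else:
                    non_keys.append(x)
            # next level: join two non-keys sharing a prefix, pruning
            # any candidate that is a superset of an already-found key
            nxt = set()
            for i, x in enumerate(non_keys):
                for y in non_keys[i + 1:]:
                    if x[:-1] == y[:-1]:
                        z = x + (y[-1],)
                        if not any(set(k) <= set(z) for k in keys):
                            nxt.add(z)
            candidates = sorted(nxt)
        return keys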


Key and functional dependency discovery

The most efficient methods to determine approximate minimal functional dependencies, Tane [HKPT], DepMiner [LPL], and FUN [NC], each use an instance of the levelwise algorithm. All of these approaches work with a much smaller representation of a relation, called a stripped partition database, from which the FDs can be derived efficiently. A partition of a relation r under an attribute set X is a set of equivalence classes, where an equivalence class is the set of tuples in r sharing the same value for the attributes in X. A stripped partition is a partition in which the equivalence classes of size one have been removed. The idea underlying this removal is that when an equivalence class consists of a unique tuple, this tuple does not share the value of the considered attributes with any other tuple, and therefore this tuple cannot break any functional dependency. The new representation of r, the stripped partition database r̂, is the union of the stripped partitions for all attributes in the relation schema. Processing the data by using stripped partitions allows these methods to perform linearly w.r.t. the number of tuples in the relation. However, constructing the partitions in the first place requires special data structures, such as hash tables, in order to be done linearly w.r.t. the number of tuples |r|; or it can be performed by sorting the database, which is in O(|r| log |r|).
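As an illustration of the representation (with hash tables, as above; the code is ours), the stripped partition under one attribute can be built in a single pass:

    from collections import defaultdict

    def stripped_partition(rows, attr):
        """Equivalence classes of tuple indices sharing a value on attr,
        with singleton classes stripped out."""
        classes = defaultdict(list)
        for i, row in enumerate(rows):
            classes[row[attr]].append(i)
        return [c for c in classes.values() if len(c) > 1]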

The three methods cited above differ in their characterization of FDs, i.e., in how they generate FD candidates and evaluate them. We now describe each of these methods in more detail, using the database of the following example.

Example. Consider the database composed of the following relation r only, taken from [LPL]. The attribute names have been assigned a single letter for briefness, and will be designated by this letter in the rest of this section.

    Tuple No   A (empnum)   B (depnum)   C (year)   D (depname)        E (mgr)
    1          …            …            …          Biochemistry       …
    2          …            …            …          Admission          …
    3          …            …            …          Computer Science   …
    4          …            …            …          Computer Science   …
    5          …            …            …          Geophysics         …
    6          …            …            …          Biochemistry       …
    7          …            …            …          Admission          …

From now on, we represent a tuple by its tuple number (the value in the column Tuple No).

The partition of r under A contains one two-tuple equivalence class, the remaining classes being singletons; the stripped partition of r under A, ĉ_A, thus consists of that single two-tuple class. Similarly, we obtain ĉ_B, ĉ_D, and ĉ_E, each containing three equivalence classes (for instance, ĉ_D = {{1, 6}, {2, 7}, {3, 4}}, the groups of tuples sharing a depname value), and ĉ_C, containing a single class. The stripped partition database r̂ is the union of these stripped partitions.

Tane

The goal in Tane [HKPT] is to find all minimal approximate FDs holding in a relation. The algorithm searches for FDs of the form X∖{A} → A, by examining attribute sets X of increasing sizes (first sets of size 1, then of size 2, and so on), using an instance of the levelwise algorithm. For each attribute set X considered, it constructs a collection of potential right-hand sides C(X). (The algorithm actually uses a more powerful version of these sets C(X), with more pruning, but to give the idea of the algorithm we limit ourselves to the simple version.) An attribute A is in C(X) provided that A is not determined by any proper subset of X which does not contain A; this enforces the minimality of the FD. The set C(X) can be computed from the candidate sets of X's subsets, as the intersection of the sets C(X∖{A}) for all A ∈ X. Then the algorithm tests the potential FDs X∖{A} → A for all A ∈ X ∩ C(X). When the FD is satisfied, it prunes A from the set C(X), to preserve the minimality of future FDs. When all attribute sets X of size s have been processed, the algorithm generates the attribute sets of size s+1 to be considered.

In Tane, testing FDs is based on partition refinement: the dependency X∖{A} → A holds if the partition π_{X∖{A}} refines π_{A}, i.e., if every equivalence class in π_{X∖{A}} is a subset of some equivalence class in π_{A}. Practically, the authors use the equation e(X∖{A}) = e(X) to test whether the FD is satisfied, where e(X) is the difference between the sum of the sizes of the equivalence classes in the stripped partition ĉ_X and the number of such equivalence classes. This equation allows them to deal with stripped partitions, and to limit the amount of memory used by the algorithm, since they exploit results from the previous level only.

On the database of the example, the algorithm would unfold as shown in the figure below.

    X     C(X)              ĉ_X              e(X)   FDs X∖{A} → A tested
    ∅     {A, B, C, D, E}
    A     {A, B, C, D, E}   {{…}}            …      ∅ → A
    B     {A, B, C, D, E}   {{…}, {…}, {…}}  …      ∅ → B
    C     {A, B, C, D, E}   {{…}}            …      ∅ → C
    D     {A, B, C, D, E}   {{…}, {…}, {…}}  …      ∅ → D
    E     {A, B, C, D, E}   {{…}, {…}, {…}}  …      ∅ → E
    AB    {A, B, C, D, E}   …                …      A → B, B → A
    AC    {A, B, C, D, E}   …                …      A → C, C → A
    BD    {A, B, C, D, E}   {{…}, {…}, {…}}  …      B → D, D → B
    ABC   {A, B, C, D, E}   …                …      AB → C, AC → B, BC → A

Figure: Unfolding of Tane on the database of the example.

DepMiner

In DepMiner [LPL], the underlying idea is based on the concept of agree set, which groups all the attributes having the same value for a given pair of tuples. From these sets, the authors derive maximal sets: the maximal sets for some attribute A are the largest possible sets of attributes not determining A. Then, from the complements of these maximal sets, they derive the left-hand sides of FDs, using a levelwise algorithm: for each attribute A, it searches for left-hand sides X such that r ⊨ X → A, by increasing the size of X. The only step that requires accessing the database, or rather the stripped partition database, is the computation of agree sets.

The authors avoid computing agree sets for all pairs of tuples by limiting themselves to the tuples within MC, the set of maximal equivalence classes of the stripped partition database: MC = max_⊆ {c ∈ π̂ | π̂ ∈ r̂}; in our example, MC contains four equivalence classes. An attribute A is included in the agree set ag(t_1, t_2) of tuples t_1 and t_2 if t_1 and t_2 belong to the same equivalence class in the stripped partition ĉ_A. In our example, the agree set of one pair of tuples is {A}; similarly, the other pairs yield the agree sets {B, D, E}, {E}, and {C, E}, so the agree sets of r are ag(r) = {{A}, {B, D, E}, {E}, {C, E}}.

The authors derive the maximal sets from the agree sets as follows: for an attribute A, the maximal set max(A, r) = max_⊆ {X ∈ ag(r) | A ∉ X, X ≠ ∅}. In our example, we have max(A, r) = {{B, D, E}, {C, E}}, and the complement of the maximal set of A is cmax(A, r) = {{A, C}, {A, B, D}}.

Finally, they derive the left-hand sides of FDs from these complements of maximal sets. The algorithm is based on the characterization of the left-hand sides of FDs of the form LHS → A as the set of minimal transversals of the simple hypergraph cmax(A, r). (A collection H of subsets of R is a simple hypergraph if X ∈ H implies X ≠ ∅, and X, Y ∈ H and X ⊆ Y imply X = Y. A transversal T of H is a subset of R such that T ∩ E ≠ ∅ for every E ∈ H.) For instance, we have cmax(A, r) = {{A, C}, {A, B, D}}; the set {B, C} is a transversal of cmax(A, r), since {B, C} ∩ {A, C} ≠ ∅ and {B, C} ∩ {A, B, D} ≠ ∅. In a levelwise manner, the algorithm proceeds by increasing sizes of candidate transversals. If a candidate T of size s is a transversal, then no superset of T is allowed among the candidates of size s+1. The candidates of size 1 are the attributes appearing in cmax(A, r).

The authors found their approach to outperform Tane in all data configurations tested in the experiments.

On the example, the algorithm unfolds as shown in the figure below.

    RHS   cmax(RHS, r)        Size-1 candidates   Size-1 LHS   Size-2 candidates        Size-2 LHS
    A     {AC, ABD}           A, B, C, D          A            BC, BD, CD               BC, CD
    B     {BCDE, ABD, ABCD}   A, B, C, D, E       B, D         AC, AE, CE               AC, AE
    C     {BCDE, AC, ABCD}    A, B, C, D, E       C            AB, AD, AE, BD, BE, DE   AB, AD, AE
    D     {BCDE, ABD, ABCD}   A, B, C, D, E       B, D         AC, AE, CE               AC, AE
    E     {BCDE}              B, C, D, E          B, C, D, E

Figure: Unfolding of DepMiner on the database of the example (the LHS columns list the transversals found at each size).

So the nontrivial FDs satisfied in r are the following: BC → A, CD → A, D → B, AC → B, AE → B, AB → C, AD → C, AE → C, B → D, AC → D, AE → D, B → E, C → E, and D → E.

FUN

In FUN [NC], the authors describe their approach at a general level only, without detailing the optimizations due to the stripped partition database representation, even though they use this representation in their implementation.

Their characterization of FDs is based on the concept of free sets. A free set X is a set of attributes such that removing an attribute from X decreases the number of X's distinct values. (Formally, X is a free set if there is no Y ⊂ X such that |π_Y(r)| = |π_X(r)|.) Free sets correspond to left-hand sides of FDs.

To characterize the right-hand sides of FDs, they define the closure and quasi-closure of an attribute set X. The closure of X, denoted by X⁺, is the union of the attributes in X and the additional attributes which can be added to X without increasing the number of X's distinct values: X⁺ = X ∪ {A ∈ R ∖ X : |π_{XA}(r)| = |π_X(r)|}. The quasi-closure of X, denoted X°, is the union of X and the closures of X's maximal subsets. While all attributes in X⁺ are determined by X, only those not in X's quasi-closure yield minimal FDs. In summary, the minimal FDs satisfied in r are the FDs of the form X → A, where X is a free set and A ∈ X⁺ ∖ X°. This approach relies heavily on counting the number of distinct values of attribute sets, which requires either a sort of the data or special data structures; but while the latter solution is used by the authors, they do not detail how they perform the counting operation efficiently.

In a levelwise manner, the algorithm searches for free sets of increasing sizes. At the level corresponding to FDs with a left-hand side of size s, the algorithm knows from the previous level the free sets of size s and their quasi-closures, as well as the collection of candidate free sets of size s+1. It first computes the closure of the free sets of size s, and displays the FDs of the form X → A, where X is a free set of size s and A ∈ X⁺ ∖ X°. Then it computes the quasi-closure of the candidate free sets of size s+1, using the closures of the free sets of size s. Then it prunes the candidate free sets X of size s+1 that are not free sets, based on the number of distinct values of X and of its maximal subsets that are free sets. (These two steps, computing quasi-closures and pruning candidate free sets, are presented in this order in the paper. However, it seems more efficient to perform them in the reverse order, in order to avoid computing the quasi-closures of candidate free sets that are not free sets.) Finally, it generates the candidate free sets of size s+2 from the free sets of size s+1. The authors found that their approach outperforms Tane in all configurations investigated, which they explain by the fact that the number of FDs tested in their approach is less than in Tane.

On the example, the algorithm would unfold as shown in the figure below.

    X    |π_X(r)|   X°     X⁺      Minimal FDs X → A
    A    …          A      A       –
    B    …          B      BDE     B → DE
    C    …          C      CE      C → E
    D    …          D      BDE     D → BE
    E    …          E      E       –
    AB   key        ABDE   ABCDE   AB → C
    AC   key        ACE    ABCDE   AC → BD
    AD   key        ABDE   ABCDE   AD → C
    AE   key        AE     ABCDE   AE → BCD
    BC   key        BCDE   ABCDE   BC → A
    BD   not a free set
    BE   not a free set
    CD   key        BCDE   ABCDE   CD → A
    CE   not a free set
    DE   not a free set

Figure: Unfolding of FUN on the database of the example.

Key discovery

The approaches focusing only on key discovery [KA, SZWA] also use the levelwise algorithm. Shen et al. [SZWA] implemented an algorithm with the data being stored in a DBMS, thus querying the database to test the validity of key candidates. However, they do not provide the details of the algorithm they implemented. Knobbe and Adriaans [KA] search for keys of increasing sizes. An attribute set X is a key if the number of distinct values of X is equal to the number of tuples in the relation. When X is not a key, they record the set Black(r, X) of tuples having duplicate X values. Then, when they test whether XA is a key, they only need to sort the tuples in Black(r, X) w.r.t. XA, and check that they all have different XA values. In such a case, XA is a key; otherwise they record Black(r, XA) and proceed to the next level of the algorithm. As the sets Black(r, X) can have a size in O(|r|), the amount of memory required for these structures makes their method impractical for large databases.

Our approach for key discovery is adapted from Tane. We dropped the right-hand-side candidate sets, and perform key testing via database queries, instead of using the stripped partition database. But the rest is similar to Tane: we search for keys in a levelwise manner, by increasing size. At each level of the algorithm, we test the key candidates, then prune the search space, and then generate the candidates for the next level.

In this thesis, we do not need elaborate data structures, since we keep the data inside the DBMS. Thus, even though our algorithms may not be as efficient w.r.t. the number of tuples as those of FD discovery presented above, the amount of memory used by the algorithm is kept very low.
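For instance, a key candidate and its error can be tested with two counting queries. A minimal sketch using SQLite (illustrative only; the thesis works against a full DBMS, and the identifiers interpolated here are assumed trusted):

    import sqlite3

    def key_error(conn, table, attrs):
        """g3 error of a key candidate: fraction of rows with duplicate attrs values."""
        cols = ", ".join(attrs)
        total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        distinct = conn.execute(
            f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table})"
        ).fetchone()[0]
        return 1 - distinct / total

Here, attrs is a key of table exactly when key_error(...) = 0, and an approximate key w.r.t. ε_key when key_error(...) ≤ ε_key.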

Foreign key and inclusion dependency discovery

Unlike functional dependencies, extracting inclusion dependencies or foreign keys has raised little interest in the database community. De Marchi et al. [MLP] explain this disregard by the complexity of the problem (see the complexity results earlier in this chapter) and by a lack of popularity.

Due to the huge number of possible INDs [PT], it is impractical to enumerate all inclusion dependency candidates and try to evaluate them on a database. Existing work focuses on how to prune the number of inclusion dependency candidates considered.

One approach [LPT, PT] consists of selecting a much reduced initial set of inclusion dependency candidates, by considering only the attributes appearing together in equijoins from a workload of query statements. Thus, this approach requires such a workload of queries to the database. The authors also claim that this criterion allows them to find only interesting dependencies, since they were used in practice. But their method is not complete: they determine only a subset of all dependencies, which depends on some past queries to the database.

Another approach is to start with unary inclusion dependencies. Indeed, there are only a polynomial number of them (quadratic in the number n of attributes) in a database with n attributes. So the complexity of extracting the unary INDs, for a database with n attributes and p tuples, is polynomial with respect to both parameters: it is O(n² · p log p) [MR, KMRS]. Moreover, unary INDs are a necessary condition for non-unary, or composite, INDs: for R[AB] ⊆ S[CD] to hold, we must have R[A] ⊆ S[C] and R[B] ⊆ S[D]. We chose this approach because it does not rely on any prior knowledge of the database.

The problem of efficiently extracting unary inclusion dependencies has been studied in two different contexts: (1) all data processing is done via database queries [BB]; or (2) a new representation of the data is generated, and the extracting step deals only with this new representation [MLP].

In the first context, Bell and Brockhausen [BB] optimize their algorithms to minimize the number of accesses to the database. They distinguish two generic types, number and string, and determine UINDs for attributes of each type independently. To restrict the number of UIND candidates, they use value restrictions, i.e., upper and lower bounds on the attribute domains, as follows: the bounds of a left-hand side must lie in the interval made by the bounds of the right-hand side. When testing the UIND candidates against the database, they exploit the transitivity of UINDs to reduce the number of queries, using the two following rules:

    A_i ⊆ A_j and A_k ⊈ A_j ⟹ A_k ⊈ A_i
    A_i ⊆ A_j and A_i ⊈ A_k ⟹ A_j ⊈ A_k

Depending on the ordering of the attributes, the transitivity properties can save more or fewer accesses to the database; but as long as there is at least one valid UIND in the database, their algorithm saves at least one database query.

In the second context, De Marchi et al. [MLP] also determine UINDs for each data type independently. For a given data type t, they build a binary relation associating, with each value of type t appearing in the database, the attributes in which it appears. Thus, the amount of memory required is proportional to the number of distinct values in the database. Note that all these binary relations can be generated in one pass over the database. Then they deduce the UINDs from a binary relation B as follows: if A is included in C, then each value taken by A must be associated in the binary relation with C. Therefore, the right-hand sides of A are given by the intersection, for each value v of A, of the attribute sets associated in B with v. Here again, computing the intersections for all attributes with type t can be achieved in one pass over the binary relation for the type t.
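A sketch of this construction for one data type (Python, in memory; it assumes, as discussed below, that the binary relation fits in main memory, and the names are ours):

    from collections import defaultdict
    from functools import reduce

    def unary_inds(columns):
        """columns: attribute name -> list of values, all of one data type.
        Returns, for each attribute A, the attributes B such that A <= B."""
        attrs_of = defaultdict(set)          # binary relation: value -> attributes
        for attr, values in columns.items():
            for v in values:
                attrs_of[v].add(attr)
        rhs = {}
        for attr, values in columns.items():
            # right-hand sides of A: attributes associated with every value of A
            common = reduce(lambda acc, v: acc & attrs_of[v], set(values), set(columns))
            rhs[attr] = common - {attr}
        return rhs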

De Marchi et al. compared both approaches on six synthetic databases of small or medium size, so the binary relations fit in main memory. Their approach was found to outperform Bell and Brockhausen's. However, the binary relations may not always fit in memory.

We now consider how the knowledge of UINDs can be used to find foreign keys or general inclusion dependencies.

Knobbe and Adriaans [KA] propose to use UINDs to identify foreign keys. They first extract the keys, then the UINDs, then form the FK candidates, and then test these candidates against the database. They use key-based UINDs to form the foreign key candidates F_1, ..., F_k for a given key K_1, ..., K_k, by taking each F_i in the left-hand sides of the UINDs having K_i as a right-hand side. When there are several FK candidates, they suggest the use of heuristics to choose among them.
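A sketch of this candidate formation (our illustration; uind_lhs_by_rhs maps each key attribute K_i to the set of left-hand sides of the UINDs referencing it). Following the foreign key definition, the full method would additionally require all the F_i of one candidate to belong to the same relation:

    from itertools import product

    def fk_candidates(key, uind_lhs_by_rhs):
        """Composite FK candidates (F_1, ..., F_k) for a key (K_1, ..., K_k):
        each F_i is drawn from the LHSs of the UINDs whose RHS is K_i."""
        pools = [uind_lhs_by_rhs.get(k, set()) for k in key]
        for combo in product(*pools):
            if len(set(combo)) == len(combo):  # attributes within a candidate differ
                yield combo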

After having extracted UINDs by using binary relations as explained above (instead of database queries), De Marchi et al. [MLP] use an instance of the levelwise algorithm to determine the INDs, in which the testing of INDs is performed via database queries. This instance of the levelwise algorithm devoted to INDs was described by Mannila and Toivonen [MT] and Boulicaut [Bou], but De Marchi et al. detail in their paper [MLP] how to generate the IND candidates of size i+1 from the INDs of size i.

Chapter 3

Algorithms and complexity

We present in this chapter a framework for discovering approximate keys and foreign keys holding in a database instance of which we assume no previous knowledge. The problem tackled can be divided into two main modules: (1) identifying the approximate keys, and (2) using the keys found to extract the approximate foreign keys. We also analyze the complexities of the algorithms proposed.

For simplicity reasons, we implemented the algorithms on top of a DBMS, but our framework could easily be adapted to use direct processing of the data.

General architecture

Our general architecture for discovering the approximate keys and foreign keys from a database of which we assume no prior knowledge is outlined in Algorithm 1 below. The inputs of the algorithm are a database d and the approximation thresholds for keys and foreign keys. The algorithm outputs the keys and foreign keys which almost hold in d w.r.t. the approximation thresholds.

Algorithm 1: General extraction of keys and foreign keys

Input: a database d; the maximum approximation degrees for keys and foreign keys, ε_key and ε_fk respectively; and the pruning parameter PgParam, i.e., the fraction of tuples involved in the pruning pass.
Output: the most interesting keys, and their associated foreign keys, holding in d.

    // Preliminary step
    Extract statistics about the database
    foreach table tab in the database do
        size := 1
        while StoppingCriterion(tab, size) = false do
            // Find all the keys of size size
            Keys(size) := ExtractKeys(tab, size, ε_key)
            foreach key ∈ Keys(size) do
                // Find the foreign keys and their degree of approximation
                key.ForeignKeys := ExtractForeignKeys(key, ε_fk, PgParam)
            end
            Sort Keys(size) by interestingness and merge it with the sorted key list of tab
            size := size + 1
        end
        Output the key list of tab, including for each key its foreign keys
    end


As a preliminary step, we extract the following statistics about the database: the number of tuples of each table, and the type and number of distinct values of each attribute. These statistics will be used to optimize the different steps of the general algorithm.

Then we process each table in the database in turn. For a given table, we determine the integrity constraints by increasing size: first the keys and foreign keys (FKs) of size 1, then the keys and FKs of size 2, working towards larger sizes. To identify the integrity constraints of a given size s, we first extract the keys of size s, as detailed in Algorithm ExtractKeys, and then process them in turn to find their FKs, as detailed in Algorithm ExtractForeignKeys.

A complete algorithm, finding all the keys and foreign keys holding in the database, would process each table until all its keys and respective FKs are found. In such a case, the processing of a table would only stop when the maximal key size (the number of attributes of the relation) has been reached, or when there are no more candidates. So the stopping criterion of the general algorithm (Algorithm 1) evaluates to true if and only if size is equal to the number of attributes of tab, or there is no key candidate.

But in most cases we are only interested in a small subset of all integrity con

straints ICs Typically the most useful keys are referenced bymany FKs have

small size and approximation degree and their FKs also have a small appro ximation

degree Dep ending on the application for whichkeys and foreign keys are discovered

wemaywanttotakeinto account the numb er of dierent relations containing foreign

keys referencing a given key rather than just the numb er of foreign keys Indeed we


may prefer a key with two foreign keys in different relations to a key whose foreign keys all lie in the same relation. This criterion would favor integrity constraints that connect the different relations of a database, which would be useful in schema mapping [MHH].

These different criteria defining the interestingness of ICs are not independent. For example, there is a tradeoff between the number of FKs and the size, so we may prefer a larger key referenced by many FKs to a smaller key with few FKs. Similarly, we may prefer a larger key holding exactly in the dataset to a smaller approximate key.

Thus, in order to achieve the best ordering of the ICs, we need to extract them all, and only then can we order them by degree of interestingness. But finding them all when we only want a small subset of them is highly inefficient. In addition, the best ordering is very subjective and is hard to define accurately independently of the set of elements we want to order. So we decided to implement a heuristic algorithm that would find a good enough subset with a reasonable amount of work. We chose to mimic the complete algorithm on a smaller scale: we extract all keys and FKs of size bounded by a constant. This constraint reflects the fact that, in practice, for efficiency reasons, referenced keys tend to be small; indeed, otherwise large parts of the data are duplicated. Then, because there may be many ICs within the bound chosen, we want to highlight the best of these. So we propose an order relation on integrity constraints which reflects the general usefulness of such constraints as keys and foreign keys. This ordering was used by Shen et al. [SZWA].

Definition (Integrity constraint ordering by interestingness). The elements for the ordering are pairs formed of an approximate key K and the set FKset (possibly empty) of approximate foreign keys referencing it. We denote by approx(K) (resp. approx(F)) the approximation degree of a key K (resp. foreign key F). To compare two sets of FKs, we define the average approximation degree of a FK set FKset, denoted by average_approx(FKset), as the average of the approximation degrees of all FKs included in FKset: average_approx(FKset) = avg{approx(F) | F ∈ FKset}.

Consider elements e1 = (K1, FKset1) and e2 = (K2, FKset2). We define the ordering between them according to the number of FKs attached to the keys, the sizes of the keys, the approximation degrees of the keys, and the average approximation degrees of the foreign keys, as follows. e1 precedes e2:

    if |FKset1| > |FKset2|;
    else if |FKset1| = |FKset2| and size(K1) < size(K2);
    else if |FKset1| = |FKset2| and size(K1) = size(K2) and approx(K1) < approx(K2);
    else if |FKset1| = |FKset2| and size(K1) = size(K2) and approx(K1) = approx(K2) and average_approx(FKset1) < average_approx(FKset2).
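To make this ordering concrete, here is a minimal sketch in Python (the thesis implementation was in C with embedded SQL; the record types and field names below are illustrative assumptions, not the thesis code). Sorting with this key lists the most interesting (key, FK set) pairs first.

import collections

# Illustrative record types: a key with its size and approximation degree,
# and a foreign key with its approximation degree.
Key = collections.namedtuple("Key", ["name", "size", "approx"])
FK = collections.namedtuple("FK", ["name", "approx"])

def interestingness(element):
    """Sort key for a (key, fk_set) pair; Python's ascending sort then
    places the most interesting elements first."""
    key, fk_set = element
    avg_approx = sum(f.approx for f in fk_set) / len(fk_set) if fk_set else 0.0
    # Priority order of the definition: more FKs first, then smaller key size,
    # smaller key approximation degree, smaller average FK approximation degree.
    return (-len(fk_set), key.size, key.approx, avg_approx)

elements = [
    (Key("AB", 2, 0.0), [FK("CD", 0.01)]),
    (Key("E", 1, 0.02), [FK("F", 0.0), FK("G", 0.05)]),
]
elements.sort(key=interestingness)  # the key with two FKs ranks first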

For their tests, Shen et al. [SZWA] fixed an upper bound on the size; we chose a smaller upper bound, arbitrarily, because our test databases have a smaller schema than theirs.

To adapt the general architecture so that we only determine the most interesting (in the sense of the definition above) integrity constraints of bounded size, we only have to change the stopping criterion and add a possible pruning step after the sorting step. This pruning step would be used to limit the number of integrity constraints discovered at each stage of the algorithm, by removing the least interesting of them. For example, if we want to determine primary keys, as did Shen et al. [SZWA], we can prune all ICs but one (the best) at each step; when there are several best, we can choose one randomly. Finally, the stopping criterion evaluates to true when the number and quality of the integrity constraints discovered up to that point match the specifications of the user, or when the size of the keys/FKs mined has reached an upper bound (in our experiments, the upper bound is fixed).

Finding the keys

We want to extract the approximate keys of a given size holding in the database. The inputs of the algorithm are a relation r, the key size, and the approximation threshold ε_key. The algorithm outputs all approximate keys having the specified size that hold in r w.r.t. approximation threshold ε_key.

In the general algorithm, we search for keys by increasing sizes, starting with keys of size 1 and going towards larger sizes. Identifying the keys of a given size is done via a call to the function ExtractKeys, shown below. As soon as a key is found, we prune the search space so that we do not extract superkeys: if X is a key, then we discard from the search space all supersets of X.

Since keys are just a special case of functional dependencies, we adapted Tane's architecture [HKPT] to our problem of finding the minimal keys, and to the fact that, unlike in Tane, we keep the data inside the database management system, which limits how we can process it. The resulting method is shown in the ExtractKeys algorithm below.

For each size, the key candidates are stored in CandKey[size]. We test them against the database and insert the candidates that were found to be keys in Key[size]. The rest of the candidates are inserted in CandNotKey[size]. We then compute from CandNotKey[size] the set CandKey[size + 1] of key candidates for the next level of


Algorithm: ExtractKeys
Input: a relation r with n attributes A1, ..., An; size ≤ n, the number of attributes of the keys to extract; the approximation threshold for keys, ε_key
Output: the set Keys[size] containing all the keys with size attributes of r

    if size = 1 then
        /* Initialization */
        CandKey[1] ← {A1, ..., An}
        Key[1] ← {A ∈ CandKey[1] : |A| ≥ (1 − ε_key)·|r|}
        Keys[1] ← Key[1]
        CandNotKey[1] ← CandKey[1] − Key[1]
        CandKey[2] ← GenerateNextCand(CandNotKey[1])
    else
        /* CandKey[size] has been computed by ExtractKeys(r, size − 1, ε_key) */
        foreach X = X1...Xsize in CandKey[size] do
            /* card(X) has been initialized when generating the candidate X */
            if card(X) ≥ (1 − ε_key)·|r| then
                card(X) ← SELECT COUNT(*)
                          FROM (SELECT DISTINCT X1, ..., Xsize FROM r)
                if card(X) ≥ (1 − ε_key)·|r| then
                    /* X is an approximate key */
                    Key[size] ← Key[size] ∪ {X}
                end
            end
        end
        CandNotKey[size] ← CandKey[size] − Key[size]
        CandKey[size + 1] ← GenerateNextCand(CandNotKey[size])
    end
    Output Keys[size]


the algorithm, at size + 1.

The initial key candidates, of size 1, are the attributes of the relation. We treat size 1 as a special case because we do not need to query the database to determine the keys: we use the statistics gathered in the preliminary step.

Testing the key candidates

An attribute set X is an approximate key if it has no more than ε_key·|r| duplicate values, where ε_key is the approximation threshold and |r| the number of rows of the relation, or, equivalently, if it has at least (1 − ε_key)·|r| distinct values. So, to test this inequality, we query the database to retrieve X's number of distinct values, stored in card(X) (for cardinality). In some cases, we are able to determine that X cannot be a key without querying the database: when an upper bound on X's cardinality does not pass the test. This upper bound is computed when generating the candidate X (see the next section) and is used as the initial value of card(X). When X does not pass the test, card(X) keeps its initial value, i.e., the upper bound on X's cardinality, but when X does pass the test, card(X) becomes X's real cardinality. The upper bound is based on the cardinalities of X's maximal subsets: max{|A|·card(X∖{A}) | A ∈ X}.

In Tane [HKPT], the use of bounds to avoid accessing the data is limited to potential FDs, because these are the focus.
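As an illustration, the following is a minimal sketch of this test, assuming a Python connection to the DBMS (sqlite3 here, purely for the example) and hypothetical names; the thesis implementation issued the equivalent embedded SQL from C.

import sqlite3

def is_approx_key(conn, table, attrs, n_rows, eps_key, card_upper_bound):
    """Test whether the attribute set attrs is an approximate key of table,
    i.e., has at least (1 - eps_key) * n_rows distinct values."""
    threshold = (1 - eps_key) * n_rows
    if card_upper_bound < threshold:
        # The bound computed at generation time already fails the test:
        # X cannot be a key, and no database query is needed.
        return False, card_upper_bound
    cols = ", ".join(attrs)
    card = conn.execute(
        "SELECT COUNT(*) FROM (SELECT DISTINCT %s FROM %s)" % (cols, table)
    ).fetchone()[0]
    return card >= threshold, card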

Generating key candidates

First of all, we show that we can compute the key candidates at iteration size + 1 only from CandNotKey[size], the set of key candidates that were found not to be keys at iteration size.


Definition. An attribute set X is a key candidate if no proper subset of X is a key (otherwise X would not be a minimal key but a superkey):

CandKey[s] = {X s.t. |X| = s and ∀Y ⊂ X, Y is not a key}.

This definition is the basis for the following proposition.

Proposition. An attribute set X of size s is a key candidate if all its subsets of size s − 1 belong to CandNotKey[s − 1].

We adapted the function which generates the functional dependency candidates for the next level from Tane [HKPT] to our particular case of keys.

In the function GenerateNextCand, we generate the key candidates of size s + 1 in two steps: first, we construct attribute sets of size s + 1, as detailed below; second, we check that all their subsets of size s are included in CandNotKey[s].

In the first step, rather than generating all possible sets of size s + 1, we start from attribute sets in CandNotKey[s] and combine them to form sets of size s + 1. Thus part of the subset tests are already verified, and therefore it is more likely that the generated attribute sets will be key candidates. To combine two sets Y, Z of size s in order to form a set X of size s + 1, it is necessary that Y and Z share exactly s − 1 attributes. Consider X = X1...X(s+1), where each Xi is a single attribute, to be sorted in increasing order, i.e., X1 < ... < X(s+1) for some ordering, say the position of the attributes in the database table corresponding to r (i.e., Xi < Xj if Xi is to the left of Xj). The set X is a key candidate if all its subsets of size s belong to CandNotKey[s], and in particular the subsets X1...X(s−1)Xs and X1...X(s−1)X(s+1). Therefore, we chose to construct a set X by combining two sets Y, Z of CandNotKey[s] that share a prefix of size s − 1, such that X is the union of Y and Z and X's attributes are ordered. This is done in the GenerateNextCand algorithm below: given that Y = X1...X(s−1)Ys and Z = X1...X(s−1)Zs are ordered,


Algorithm: GenerateNextCand
/* For simplicity, we denote here by s the size of keys; s corresponds to size in ExtractKeys */

    CandKey[s + 1] ← ∅
    foreach PrefSet in PrefixCands(CandNotKey[s]) do
        foreach Y = X1...X(s−1)Ys, Z = X1...X(s−1)Zs ∈ PrefSet, with Ys < Zs do
            X = X1...X(s−1)YsZs ← Y ∪ Z
            card(X) ← max(|Zs|·card(Y), |Ys|·card(Z))
            foreach A ∈ X, A ≠ Ys, A ≠ Zs do
                if X∖{A} ∈ CandNotKey[s] then
                    card(X) ← max(card(X), |A|·card(X∖{A}))
                else
                    card(X) ← 0
                    exit the for loop
                end
            end
            if card(X) ≠ 0 then
                CandKey[s + 1] ← CandKey[s + 1] ∪ {X}
            end
        end
    end


i.e., X1 < ... < X(s−1) < Ys, Zs, and given that Ys < Zs, then X = X1...X(s−1)YsZs is ordered as well, by construction. Since attribute sets of size 1 are ordered, by induction all constructed key candidates are ordered. This property is used in the function PrefixCands, which partitions CandNotKey[s] into sets of attribute sets sharing a same prefix of size s − 1. Since the key candidates in CandNotKey[s] are ordered sets of attributes, we can sort CandNotKey[s] in lexicographic order: for X, Y ∈ CandNotKey[s], X = X1...Xs, Y = Y1...Ys, we have X < Y if and only if Xi < Yi at the first position i ≤ s where they differ. Then we group the key candidates by prefix of length s − 1 in one pass over the sorted set CandNotKey[s]. The function PrefixCands is also adapted from Tane [HKPT].

When generating a key candidate X, we also compute an upper bound on its number of distinct values, stored in card(X), which is used when testing whether X is a key (see the previous section). This upper bound is based on the results of the previous level: it is the maximum among all |A|·card(X∖{A}) for A ∈ X. A sketch of this generation step, bounds included, follows.
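The following is a minimal Python sketch of the generation step, under the stated conventions: candidates are ordered tuples of attribute names, and card maps candidates (and single attributes, as 1-tuples) to their cardinalities or bounds. The function mirrors GenerateNextCand, but it is an illustration, not the thesis implementation.

from itertools import groupby

def generate_next_cand(cand_not_key, card):
    """Build the key candidates of size s+1 from the non-keys of size s,
    together with an upper bound on their number of distinct values."""
    not_key = set(cand_not_key)
    next_cands = {}
    # PrefixCands: group the lexicographically sorted candidates by their
    # prefix of size s-1, in one pass over the sorted set.
    for _, pref_set in groupby(sorted(cand_not_key), key=lambda c: c[:-1]):
        pref_set = list(pref_set)
        for i, y in enumerate(pref_set):
            for z in pref_set[i + 1:]:       # Ys < Zs holds by the sort order
                x = y + (z[-1],)             # X = X1..X(s-1) Ys Zs, still ordered
                bound = max(card[(z[-1],)] * card[y], card[(y[-1],)] * card[z])
                is_cand = True
                for a in x[:-2]:             # check the remaining subsets of size s
                    sub = tuple(b for b in x if b != a)
                    if sub not in not_key:
                        is_cand = False      # some subset is a key or was pruned
                        break
                    bound = max(bound, card[(a,)] * card[sub])
                if is_cand:
                    next_cands[x] = bound
    return next_cands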

Finding the foreign keys

We extract the foreign keys referencing a given key Key = K1...Kk in three steps: (1) we generate foreign key candidates; (2) we make a pruning pass over these candidates on a small set of sample rows from the database; and (3) we finally test the remaining candidates against the database.

In all the literature we are aware of, the final test follows the candidate generation directly, without any pruning.

We now describe the three steps in more detail. The steps are summarized in the ExtractForeignKeys algorithm below.


Algorithm: ExtractForeignKeys
Input: a database d; the key Key = K1...Kk for which we want to extract foreign keys; the approximation threshold for foreign keys, ε_fk
Output: the foreign keys of Key

    FKset ← ∅
    foreach relation r_F in d do
        /* Step 1: generate the foreign key candidates */
        CollectionUINDs ← ExtractUINDs(Key, r_F, ε_fk)
        FKCands(Key) ← GenerateFKCands(Key, CollectionUINDs)
        foreach F = F1...Fk ∈ FKCands(Key) do
            /* Select the number of distinct values of F */
            card(F) ← SELECT COUNT(DISTINCT F1, ..., Fk) FROM r_F
            /* Step 2: pruning pass on a sample of the database */
            if passedSampleTest(F, Key) then
                /* Step 3: real test */
                card(F ⋈ Key) ← SELECT COUNT(*) FROM
                                    (SELECT DISTINCT F1, ..., Fk
                                     FROM r_K, r_F
                                     WHERE r_K.K1 = r_F.F1 AND ... AND r_K.Kk = r_F.Fk)
                if card(F) − card(F ⋈ Key) ≤ ε_fk · card(F) then
                    FKset ← FKset ∪ {F ⊆ Key}
                end
            end
        end
    end
    Output FKset


Generating foreign key candidates

It is impractical to enumerate all sets of size k as potential foreign keys, since there are (n choose k) of them, where n is the number of attributes. A commonly used approach to form foreign key candidates is based on previously extracted unary inclusion dependencies [KA, KMRS]. Another approach is described by Petit et al. [PTBK], who tackle the more general problem of discovering inclusion dependencies: the inclusion right- and left-hand side candidates are generated from sets of attributes appearing together in equijoins within a workload of SQL queries over the database considered. This approach therefore requires prior knowledge of SQL queries over the database. Because we assume no prior knowledge of the database, we apply the former method, using unary inclusion dependencies. While Knobbe and Adriaans [KA] propose determining all unary inclusion dependencies, making this a preliminary step of the general algorithm (together with the gathering of statistics), we limit ourselves to key-based UINDs. Therefore, we have to identify the keys first, so we incorporate the discovery of UINDs into the algorithm for extracting the foreign keys. We describe the discovery of UINDs in a later section.

Using unary inclusion dependencies to form foreign key candidates

Let us start with an intuitive example.

Example. Consider a database with two relations r1(A, B) and r2(C, D, E). Assume that AB is a key of r1 and that the following UINDs hold: C ⊆ A, D ⊆ A, and E ⊆ B. Then the foreign key candidates for AB are CE and DE. But if the inclusion E ⊆ B does not hold anymore, then AB has no foreign key candidate. Or, if E belongs to another relation r3, then AB has no foreign key candidates.

Definition. A foreign key candidate of key K = K1...Kk is composed of k attributes F1...Fk such that:

1. all Fi belong to the same relation;
2. each of them is included in the corresponding attribute of the key: ∀i, Fi ⊆ Ki;
3. the foreign key candidate is not equal to the key: F1...Fk ≠ K1...Kk, even though some of the Fi may be equal to some Kj.

To address the first condition, we process one relation at a time: we extract key-based UINDs from relation r, then use them to generate FK candidates, and then proceed to the next relation. Generating FK candidates is straightforward from the definition: we create all possible combinations of k attributes F1...Fk where Fi ⊆ Ki and F1...Fk ≠ K1...Kk. If there exists an attribute Ki such that no attribute in r references Ki, then the key has no foreign key candidate within the relation r. A sketch of this generation step follows.
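This is a minimal sketch of the candidate generation, assuming UINDs are given as (left-hand side, right-hand side) pairs over one relation; the names are illustrative, not the thesis code.

from itertools import product

def generate_fk_cands(key, uinds):
    """key: tuple (K1, ..., Kk); uinds: (lhs, rhs) pairs whose rhs is a key
    attribute and whose lhs belongs to one relation r. Returns the FK
    candidates F1...Fk of r."""
    # For each key attribute Ki, collect the attributes of r included in it.
    lhs_per_pos = [sorted(l for (l, rhs) in uinds if rhs == k) for k in key]
    if any(not lhs for lhs in lhs_per_pos):
        return []  # some Ki is referenced by no attribute of r: no candidate
    return [f for f in product(*lhs_per_pos) if f != key]  # exclude the key itself

# The example above: key AB of r1, UINDs C ⊆ A, D ⊆ A, E ⊆ B over r2:
# generate_fk_cands(("A", "B"), [("C", "A"), ("D", "A"), ("E", "B")])
# returns [("C", "E"), ("D", "E")]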

Pruning pass over the foreign key candidates

This step is intended to eliminate quickly the candidates that are the farthest from being foreign keys. The underlying idea is that a unary inclusion dependency may hold accidentally, while only very few values of the combined left-hand side attributes are included in the values of the key.

Example. Let K = K1K2K3 be a key and F = F1F2F3 a foreign key candidate, with the following values:

    K1  K2  K3        F1  F2  F3
    a   b   c         aa  b   c
    a   bb  c         aa  bb  c
    aa  b   cc        aa  bb  cc
    a   b   cc
    a   bb  cc

In this example, none of F's values are included in the set of K's values. So we can exclude F from the foreign key candidates just by looking at one of its values.

Proposition. In database d, an FK candidate F is an approximate foreign key referencing key K w.r.t. error threshold ε_fk if the number of distinct values of F that are not included in K's values is at most ε_fk·card(F), where card(F) is the number of distinct values of F.

This is straightforward from the definitions of an approximate foreign key and of the error measure for FKs (see Chapter 2).

The pruning pass over foreign key candidate F is done in the function passedSampleTest(F, Key) of the ExtractForeignKeys algorithm. It is based upon the proposition above. We extract at random a small fraction of F's distinct values, SampleValues. We call this fraction the pruning parameter, PgParam. Then, for each value v in SampleValues, we check whether v also appears in the set of Key's values, which we denote by KeyValues. This can be implemented with an SQL query,

    SELECT COUNT(*) FROM (SELECT * FROM r_Key WHERE Key = v)

or by querying an inverted index built on the whole database. We keep a counter of the number of F's distinct values processed so far that are not included in Key's values; let us call this counter numMisses. Whenever v is not included in KeyValues, we increment numMisses. As soon as numMisses exceeds ε_fk·card(F), we know from the proposition that F is not an approximate FK, so the function passedSampleTest(F, Key) returns false. (For exact foreign key discovery, only one value v that is not in KeyValues is enough to prune a candidate.) If the bound ε_fk·card(F) is not reached during the processing of the values in SampleValues, then we cannot conclude whether the dependency F ⊆ Key is satisfied in d based solely on the data sample, so the function returns true and the candidate will be tested against the database. The sketch below illustrates this test.
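The following is a minimal Python sketch of passedSampleTest, assuming for simplicity that F's distinct values are available as a list and the key's values as a set (the thesis retrieves them through SQL queries or an inverted index).

import random

def passed_sample_test(f_distinct_values, key_values, eps_fk, pg_param):
    """Pruning pass: sample a fraction pg_param of F's distinct values and
    count those missing from key_values. Return False as soon as the misses
    exceed the budget eps_fk * card(F) allowed by the proposition; return
    True when the sample is inconclusive (the full test is still required)."""
    card_f = len(f_distinct_values)
    budget = eps_fk * card_f
    num_misses = 0
    for v in random.sample(f_distinct_values, int(pg_param * card_f)):
        if v not in key_values:     # in the thesis: an SQL probe or index lookup
            num_misses += 1
            if num_misses > budget:
                return False        # provably not an approximate FK
    return True                     # inconclusive: test against the database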

Remark. As an important side effect, allowing approximation requires PgParam to be at least equal to ε_fk.

Testing the remaining foreign key candidates

We test the foreign key candidates that passed the pruning step, based upon the proposition above. The number of distinct values of F that are not included in K's values is equal to the difference between the number of distinct values of F and the number of distinct values of F that are also values of Key. The former is card(F), and the latter can be computed by joining the two relations on F = Key. We store in card(F ⋈ Key) the number of distinct values of the join on F = Key between r_F and r_Key. Then we check whether the difference card(F) − card(F ⋈ Key) is smaller than ε_fk·card(F): if it is, then F is an approximate FK for key Key in database d.

Discovering the unary inclusion dependencies

Discovering the unary inclusion dependencies is described in the ExtractUINDs algorithm below. In the following, we represent a unary inclusion dependency as a pair (LHS, RHS), where the left-hand side LHS is included in the right-hand side RHS.


We are interested here in finding the left-hand sides, belonging to a relation r of the database, of all unary inclusion dependencies having a fixed right-hand side K. The brute-force algorithm solving this problem consists of testing against the database all pairs (L, K), where L is an attribute of r different from K, and such that L and K have the same type (e.g., text and text, or numeric and numeric). We improved on it with a simple optimization based on the statistics gathered by the first step of the general algorithm: if the number of distinct values of L is strictly greater than the number of distinct values of K, then the UIND (L, K) cannot hold. With this simple optimization, we prune the left-hand side candidates for right-hand side K, as in the following sketch.
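A one-function sketch of this candidate pruning, with assumed dictionaries standing in for the attribute statistics gathered in the preliminary step:

def cand_lhs(attrs, key_attr, card, attr_type):
    """Left-hand side candidates for right-hand side key_attr: same type, and
    no more distinct values than key_attr (otherwise the UIND cannot hold)."""
    return [a for a in attrs
            if a != key_attr
            and attr_type[a] == attr_type[key_attr]
            and card[a] <= card[key_attr]]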

Instead of using the attribute cardinalities, Bell and Brockhausen [BB] extract value restrictions, which are upper and lower bounds on the attribute domains. They use these value restrictions to prune the candidate UINDs as follows: if upp(L), upp(K), low(L), low(K) are the upper and lower bounds on the domains of attributes L and K, then L ⊆ K is a possible UIND if and only if upp(L) ≤ upp(K) and low(L) ≥ low(K). Value restrictions generate a more efficient pruning than simple cardinality restrictions. We only use the latter, because the attribute cardinalities are available in the DBMS catalogs, while the former necessitate processing the data. The authors also exploit the transitivity of UINDs to minimize the number of accesses to the database. We found out that, in practice, however, this is not a very useful optimization when extracting key-based UINDs.

Unlike Bell and Brockhausen [BB], whose goal is to extract all unary inclusion dependencies, we are really interested in only key-based UINDs, whose right-hand sides belong to a key, since only these will be used in generating foreign key candidates. Therefore our search space is reduced, so we did not implement all the optimizations they propose.

Moreover, our goal is to provide a general framework for key and foreign key discovery, so we did not concentrate on all possible optimizations. However, it would be easy to integrate the ideas of Bell and Brockhausen into our algorithms: in the first step, we could include value restrictions among the statistics we gather and then use these to prune the left-hand side candidates; and we could add, after the candidate generation, statements of the following form to exploit the UIND transitivity:

    if Aj ⊆ Ki holds and Ak ⊆ Ki does not hold, then Ak ⊆ Aj cannot hold;
    if Aj ⊆ Ki does not hold and Aj ⊆ Ak holds, then Ak ⊆ Ki cannot hold.

De Marchi et al. [MLP] proposed an efficient way to discover all UINDs which does not use SQL queries: they generate a compact representation of the data, and from this new representation they extract the UINDs. With their method, the restriction to key-based UINDs does not save computation. Here again, we could integrate their method into ours, but because we concentrate on providing a general framework for key and foreign key discovery, we implemented all algorithms using SQL queries, for the sake of uniformity and simplicity.

To test the inclusion of a left-hand side candidate A into K, we use the same method as for foreign keys: we generate left-hand side candidates, then we perform a pruning pass over these candidates, and finally we test the remaining ones against the database. The pruning pass is exactly the one described in the previous section, with F = Aj and Key = Ki, which corresponds to the case where the size is equal to one. And, at last, to test the inclusion of a remaining left-hand side candidate A into K, we compute the number of distinct values that A has in common with K, by joining the two relations on A = K and counting the number of distinct values of the result. The number of distinct values of the join is always at most the cardinality of A, but if the difference is smaller than ε_fk·card(A), meaning that the fraction of erroneous values is below the approximation threshold ε_fk, then the inclusion holds.


Algorithm: ExtractUINDs
Input: a relation r with n attributes A1, ..., An; the number of distinct values of each attribute, card(Ai), 1 ≤ i ≤ n; the key Key = K1...Kk for which we want to extract unary inclusion dependencies; the approximation threshold for inclusions, ε_fk
Output: collectionUIND, the set of unary inclusion dependencies whose right-hand side belongs to Key and whose left-hand side belongs to r

    collectionUIND ← ∅
    for i = 1..k, when the UINDs having Ki as a right-hand side are not known, do
        /* Extract the UINDs having Ki as a right-hand side */
        CandLHS(Ki) ← {Aj | 1 ≤ j ≤ n, Aj ≠ Ki, |Aj| ≤ |Ki|, Dom(Aj) = Dom(Ki)}
        foreach Aj ∈ CandLHS(Ki) do
            /* Pruning pass on a sample of the database */
            if passedSampleTest(Aj, Ki) then
                /* Real test */
                card(Aj ⋈ Ki) ← SELECT COUNT(DISTINCT r.Aj)
                                FROM r, r_Ki WHERE r.Aj = r_Ki.Ki
                /* If r.Aj ⊆ r_Ki.Ki almost holds */
                if card(Aj) − card(Aj ⋈ Ki) ≤ ε_fk · card(Aj) then
                    collectionUIND ← collectionUIND ∪ {r.Aj ⊆ r_Ki.Ki}
                end
            end
        end
    end


Complexity analysis

Since all the data is kept in a DBMS, space complexity is not an issue. Therefore, we concentrate in this section on the time complexity, which is measured in terms of database accesses.

We first examine the complexities of the database operations used in our algorithms.

- We use projection, in a SELECT DISTINCT ... FROM r type of query. This operation requires a sort of the tuples on the attributes selected, and then a scan over all tuples, so its complexity is O(|r| log |r|). The size of the resulting set of tuples ranges from 1 (all values equal) to |r| (all values distinct).

- We use selection, in a SELECT ... FROM r WHERE ... type of query. This operation just requires a scan over all tuples, so its complexity is O(|r|).

- We use counting, in a SELECT COUNT(*) FROM r type of query. This operation also is just a scan over all tuples, so its complexity is O(|r|).

- We use equijoins (joins whose join conditions are equalities), in a SELECT ... FROM r1, r2 WHERE r1.X1 = r2.Y1 AND ... AND r1.Xk = r2.Yk type of query. The complexity of this operation depends on how it is executed: when executed via a nested loop, it is O(|r1|·|r2|); when r1 and r2 are sorted first, it is O(|r| log |r|), where |r| = max(|r1|, |r2|); when an index on X or Y exists, the complexity is O(|r| log |r|) if the index is a B-tree, or O(|r|) if the index is a hash table. For the rest of the complexity analysis, we assume that no index exists. The size of the resulting set of tuples ranges from 0 to min(|r1|, |r2|).

The algorithms often use compositions of these operations, so we adapt the complexities accordingly. As the join and projection operations are the most expensive among the four, the complexity of compositions will be dominated by their complexities. When both operations occur, the total complexity is generally dominated by the complexity of the join operation.

Example. In the ExtractForeignKeys algorithm, the following query combines the three operations presented above:

    SELECT COUNT(*) FROM
        (SELECT DISTINCT F1, ..., Fk FROM r_K, r_F
         WHERE r_K.K1 = r_F.F1 AND ... AND r_K.Kk = r_F.Fk)

Let m = |r_K| and n = |r_F|, and assume m ≤ n. The complexity of the equijoin ranges from O(n log n) to O(mn). Because the size of the set of tuples returned by the equijoin is at most m, the complexity of the projection is O(m log m). Similarly, the complexity of the counting is O(m). Therefore, the overall complexity is dominated by the equijoin, and is O(n log n) when the relations are sorted, or O(mn) when they are not.

We define in the table below the notation used in the rest of this section. We denote by p the maximum number of tuples over all tables, and consider that all tables have p tuples, to simplify the analysis. We denote by join(p) the complexity of the equijoin of two relations: join(p) = O(p log p) or O(p²), depending on the implementation of the join operation.

We now study the complexity of the key discovery module, and then that of the foreign key discovery module, separately.


    nAtts               number of attributes in the database
    p                   number of tuples
    join(p)             complexity of a join operation for two relations
                        with at most p tuples
    PgParam             fraction of rows involved in the pruning pass
    nKeys               number of keys found
    nKeyCands           number of key candidates
    nUINDs              number of key-based unary inclusion dependencies
    nUindCands          number of key-based UIND candidates before the pruning pass
    nUindCandsPruned    number of key-based UIND candidates after the pruning pass
    nFKs                number of foreign keys found
    nFKCands            number of foreign key candidates before the pruning pass
    nFKCandsPruned      number of foreign key candidates after the pruning pass

Table: Definition of the notations used.


Complexity of key discovery

In the key discovery module (the ExtractKeys algorithm), we access the database to retrieve the number of distinct values of each key candidate. This corresponds to the counting operation applied to the result of a projection operation. The complexity is dominated by the projection: O(p log p) for one key candidate, and O(nKeyCands · p log p) overall:

    Complexity(ExtractKeys) = O(nKeyCands · p log p).

The number of key candidates is smaller in our approach than in the approaches for functional dependency discovery, because of an extra pruning mechanism: the cardinality upper-bound test in ExtractKeys. This mechanism, based on an estimate of the number of distinct values of a candidate, allows us to avoid some tests against the database.

Complexity of foreign key discovery

The main tasks of FK discovery are the following: generating FK candidates, making a pruning pass over the candidates, and testing the remaining candidates.

In the pruning pass over a FK candidate F, we first extract the number of F's distinct values, which takes O(p log p). Then, for a fraction PgParam of F's distinct values, we check that they also appear in the key (i.e., the number of F's distinct values used in the pruning pass is PgParam · p). The check can be done either linearly, by a scan of the key values, or in log p, by probing an index; we consider it to be in O(p). So the complexity of the pruning pass is quadratic in the number of tuples:

    Complexity(Pruning) = O(nFKCands · PgParam · p²).


We investigated experimentally the interaction between the efficiency of the pruning and the pruning parameter PgParam (see the next chapter).

Testing the remaining candidates is done through the query of the example above, whose complexity is O(join(p)):

    Complexity(TestFKCands) = O(nFKCandsPruned · join(p)).

Finally, generating foreign key candidates requires determining key-based unary inclusion dependencies (the ExtractUINDs algorithm). The process for UINDs is similar to that for general FKs, and so is the complexity:

    Complexity(ExtractUINDs) = O(nUindCands · PgParam · p² + nUindCandsPruned · join(p)).

The overall complexity is the sum of the three equations above.

Influence of data characteristics on the complexity

The data characteristics influencing the complexities vary greatly from one database to another. The complexities also depend on input parameters (error thresholds and number of values involved in the pruning pass), which we discuss in the next section.

We saw earlier that the numbers of keys and key candidates, nKeys and nKeyCands, can be exponential in the number of attributes. In addition to the fact that interesting keys tend to be small in practice, we chose to limit the maximal key size also to avoid this exponential explosion.

The key-based unary inclusion dependencies depend on the number of attributes included in keys, since these are the right-hand sides of the UINDs. While in practice the right-hand sides form a small subset of all attributes, in the worst case the union of the keys covers all attributes in the database, so that there are nAtts right-hand sides. Also, even though domains generally are of different types (e.g., integers, strings, dates) and therefore are disjoint, it may occur that all of them are of the same type and that they are all included in the domains of key attributes; then there are up to nAtts left-hand sides per right-hand side. So, in the worst case, nUinds = nUindCands = nUindCandsPruned = O(nAtts²).

The FK candidates are generated from key-based UINDs, so they depend on the number of keys and key-based UINDs. We saw that, in the worst case, there can be a great number of them.

Influence of input parameters on the complexity

The pruning pass over the foreign key candidates is useful only when the difference between the number of candidates before and after pruning justifies the additional cost of the pruning step.

Without the pruning pass, the complexity of testing the foreign key candidates is

    O((nUindCands + nFKCands) · join(p)).

With it, the complexity becomes

    O((nUindCandsPruned + nFKCandsPruned) · join(p) + (nUindCands + nFKCands) · PgParam · p²).

So the pruning step is worthwhile only when

    (nUindCands + nFKCands) · PgParam · p² < ((nUindCands − nUindCandsPruned) + (nFKCands − nFKCandsPruned)) · join(p).

A rough way to evaluate this condition is sketched below.
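As an illustration, the condition can be evaluated with the coarse cost model of this section (in practice the post-pruning counts are only known after the fact, so they would have to be estimated):

def pruning_worthwhile(n_uind_cands, n_uind_pruned, n_fk_cands, n_fk_pruned,
                       pg_param, p, join_cost):
    """Compare the extra cost of the pruning pass with the full tests it
    saves; join_cost stands for join(p), and p is the number of tuples."""
    extra_cost = (n_uind_cands + n_fk_cands) * pg_param * p * p
    savings = ((n_uind_cands - n_uind_pruned) +
               (n_fk_cands - n_fk_pruned)) * join_cost
    return extra_cost < savings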

When the fraction PgParam of tuples involved in the pruning pass is small, the pruning step should be less efficient than when PgParam is large. On the other hand, as PgParam increases, the cost of the pruning pass grows, so the number of candidates after pruning has to be significantly smaller than the number of initial candidates for the pruning step to be worthwhile.

The intuitive idea is that the pruning step should only be used to discard the candidates that are very far from being foreign keys, e.g., which have just a couple of tuples in common with the key. These candidates do not require testing on many rows, so a small PgParam is enough to eliminate them, and the pruning step is inexpensive. For the remaining candidates, we use the full test. Indeed, if we wanted to discard them during the pruning pass, we would have to process many tuples, because they share many values with the key, so we would need a large PgParam, and the pruning step might become too costly relative to its efficacy. Unfortunately, the tradeoff between the cost of the pruning step and its performance depends on the dataset.

The other input parameters (other than PgParam) are the error thresholds for approximate keys and FKs. Nonzero thresholds increase the number of key-based UINDs, keys, FKs, and all candidates.

When the approximation threshold ε_fk for the foreign keys increases, the fraction PgParam of tuples involved in the pruning pass has to increase as well, since we must have PgParam ≥ ε_fk (see the remark in the pruning pass section). Hence, for high thresholds, the pruning step may be too expensive relative to its benefit, and may have to be skipped. But such high thresholds would probably give coarse, not very interesting, results.

We investigate in the next chapter the effect of the input parameters on the performance of the algorithm, through experimentation.

Chapter 4

Experiments

Goals of the tests

- We evaluate the gain, in terms of database accesses, of extracting only key-based UINDs versus all UINDs.

- We evaluate the efficiency of the pruning pass, both on synthetic and real-life datasets, with various amounts of noise.

- We examine the general behavior of the pruning pass efficiency in relation to the pruning parameter PgParam. In particular, we study how to set PgParam to minimize the running time of the algorithm.

- We then analyze the influence of the dependency size on the pruning pass, that is, whether the pruning pass behaves differently on unary and composite dependencies. In addition, we investigate the influence of the approximation threshold ε_fk for FKs.

- Finally, we verify experimentally that the pruning time is quadratically dependent on the data size, i.e., the number of tuples.


Experimental setup

All the datasets used in the experiments were stored in IBM DB2 Universal Database for Solaris. We implemented the algorithms described in the previous chapter in C, with embedded SQL to access the databases. Because of the randomness in our algorithm (within the pruning pass), we ran it repeatedly on the whole set of databases and took the average over the runs for each dataset.

First of all, we define the metrics used to evaluate the effectiveness of the pruning pass. Then we present the datasets used in the experiments. Finally, we describe the setting of the input parameters of the algorithm, such as the approximation thresholds for keys and FKs.

Metrics used to evaluate the impact of the pruning pass

Recall that our approach for identifying the foreign keys of a given key consists of three steps:

1. generating FK candidates:
   - for FKs of size 1, these are the key-based UIND candidates;
   - for FKs of larger sizes, these are generated from key-based UINDs;

2. pruning the FK candidates;

3. testing the remaining FK candidates.

The pruning pass (Step 2) aims at minimizing the number of FK candidates involved in Step 3, because this last step is the most expensive. Thus, there is a tradeoff between the extra cost of the pruning pass and the savings it provides. The pruning pass is useful only when the savings exceed the cost.

We present now the metrics used to evaluate whether the pruning pass is worthwhile, and how efficient it is. These metrics are based upon the following statistics, recorded during the execution of the algorithm for each arity (i.e., number of attributes) of the key/FK:

- nFKCands, the number of FK candidates. In each pass, this is recorded after Step 1. Note that FK candidates of size 1 are far more numerous than for other sizes, because the generating process is different.

- nFKCandsPd, the number of FK candidates after the pruning of Step 2.

- nFKs, the number of FKs. This is the number of candidates that passed the test of Step 3.

In addition, we gather statistics about the whole set of key-based UINDs, not only those associated with keys of size 1.

Degree of pruning. We measure how effective the pruning pass is in reducing the number of FK candidates involved in the full test of Step 3 to the set of actual FKs. For instance, assume that a certain number of FKs hold in a database instance and that our algorithm generates more candidates than that (nFKCands > nFKs). Then the pruning pass is most effective when it prunes all the candidates that are not actual FKs, i.e., when nFKCandsPd = nFKs.

We define the degree of pruning as follows:

    DegPg = (nFKCands − nFKCandsPd) / (nFKCands − nFKs).


When nFKCands = nFKs, the pruning pass is useless and does not result in any improvement over the algorithm without the pruning pass; therefore, we chose to define DegPg = 1 in this case. The degree of pruning lies in the interval [0, 1], where 1 corresponds to a perfectly effective pruning, and 0 to a totally ineffective pruning.

We chose to use this measure rather than the number of candidates pruned over the number of candidates,

    DegPg2 = (nFKCands − nFKCandsPd) / nFKCands,

because DegPg2 reflects only the quantity of pruning done, without taking into account its quality, i.e., how close the candidates remaining after pruning are to the final FKs. Indeed, in our previous example, even when the pruning is most effective, DegPg2 stays below 1, while DegPg expresses the intuitive idea that the pruning was perfect.

The degree of pruning measures the decrease, due to pruning, of the number of candidates tested against the database that are not FKs. Consider a case where nFKCands is only slightly larger than nFKs, and nFKCandsPd = nFKs + 2: the degree of pruning is low; however, we only test two superfluous candidates, which is rather good. Thus, we introduce another measure, to compute the number of excess tests against the database.

Pruning ratio and FK-cand ratio. We measure how the number of candidates tested against the database in Step 3 compares to the number of actual FKs. To evaluate the usefulness of the pruning pass, we record this ratio both when there is a pruning pass (we call it pruning ratio) and when there is not (we call it FK-cand ratio), i.e., with and without Step 2 of the algorithm:

    PgRatio = nFKs / nFKCandsPd,
    FKcandRatio = nFKs / nFKCands.

Here again, both ratios lie in the interval [0, 1], where 1 corresponds to no excess test against the database, while 0 corresponds to the case where there are no FKs but some candidates are tested.

While these measures express how good the results after pruning are, they do not account for the cost necessary to obtain them: the extra time needed to execute the pruning pass. Thus, to know whether the pruning pass was worthwhile, we introduce a last measure, called time ratio.

Time ratio. We measure the running time of the algorithm without the pruning pass, TimeNoPg, and with it, TimeWithPg. Then

    TimeRatio = TimeWithPg / TimeNoPg.

When TimeRatio < 1, the pruning pass was worth it; otherwise, it is faster to use the version without pruning. These four measures are collected in the sketch below.
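A compact sketch collecting the four measures (the names are direct transcriptions of the definitions above):

def pruning_metrics(n_fk_cands, n_fk_cands_pd, n_fks, time_no_pg, time_with_pg):
    """Degree of pruning, pruning ratio, FK-cand ratio, and time ratio."""
    deg_pg = (1.0 if n_fk_cands == n_fks
              else (n_fk_cands - n_fk_cands_pd) / (n_fk_cands - n_fks))
    return {
        "DegPg": deg_pg,                         # 1: perfect pruning, 0: useless
        "PgRatio": n_fks / n_fk_cands_pd if n_fk_cands_pd else 1.0,
        "FKcandRatio": n_fks / n_fk_cands,       # closer to 1: fewer excess tests
        "TimeRatio": time_with_pg / time_no_pg,  # < 1: the pruning pass paid off
    }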

All the measures defined above depend on the fraction PgParam of distinct tuples involved in the pruning pass: the larger PgParam is, the more effective the pruning will be, but the more expensive the pruning pass is. Therefore, there is a tradeoff between the gains and the cost. In the following sections, we study the influence of PgParam on the different measures and discuss what value is the best choice for PgParam.


Datasets used in the experiments

The goal of the algorithm presented in this thesis is to determine the data model (keys and FKs) which best fits the database considered. More precisely, we can view a database as the sum of a clean database and some noise, and the model to discover corresponds to that of the clean database. In our experiments, we used a few real-life, nearly noiseless databases whose data models are known, and we added various amounts of noise to them to simulate noisy databases. We used the given data models to check whether those output by our algorithm are correct.

So our noisy databases are all synthetically generated. But as there is no general model of noise in databases (i.e., the characteristics of the noise in two databases can be totally different), we consider that our synthetic noisy data are as appropriate for the experiments as real-life noisy data would be. Moreover, using synthetic noisy data allows us to control the amount of noise in our experiments.

Our real-life clean databases are of medium size, w.r.t. both their schema and their number of tuples. They were all carefully designed; thus most of the foreign keys are of size 1 or 2, which is the case for most databases in practice. We generated synthetic data with different characteristics (i.e., different key and foreign key sizes) to test our algorithm on other configurations.

Real-life datasets. These databases are all publicly available on the Internet.

- Amalgam is a bibliographic database, with its attributes distributed over several tables. We used the schema defined by Miller et al. [MFH].

- Genex [Gen] is a repository of gene expression data, with its attributes distributed over several tables.

- Janus Core Log, which we call simply Janus in the following [Oce], is part of the Janus database, containing tables of Ocean Drilling Program marine geoscience data. We used a sample of the Core Log subset of the Janus database.

- Mondial [May] is a geographical database.

Synthetic data. The synthetic datasets are small, w.r.t. both their schema and their number of rows. They were generated to test worst cases of the algorithms: the domains all have the same type and many common values, thus overlapping much more than in the real-life datasets we used. Besides, the synthetic datasets present different characteristics (i.e., different key and foreign key sizes).

We used constraint programming [Lab] to generate data respecting conditions on the number of keys, their sizes, and their numbers of FKs, together with the sizes of these FKs. The classical synthetic data generators (e.g., those for classification or association rules) are not suitable for our purpose. The best tool we are aware of is ToXgene [BMKL], but we did not use it because it requires knowing the data model in advance, while our tool constructs one which fits our conditions. However, unlike ToXgene, our tool is not scalable, which explains why our synthetic datasets have small schemas. This is due to the fact that the problem of generating a database satisfying a set of integrity constraints specified by a user is a very difficult one. Bry et al. explain that the set of integrity constraints can be expressed by a set of first-order logical formulas, and the goal is then to find a finite model (because a database is finite) for this set of first-order logical formulas [BEST]. But the question of whether a set of first-order logical formulas is finitely satisfiable is semi-decidable, and the question of whether a set of first-order logical formulas is unsatisfiable is semi-decidable, so the question of whether a set of first-order logical formulas is satisfiable (though not finitely) is not even semi-decidable. (A language L is semi-decidable if there exists a Turing machine M that accepts every string in L and does not accept any string not in L: M either rejects or goes into an infinite loop.)

Noisy data. We describe here how we generated approximate data from our clean datasets. Given an approximation (or noise) degree noiseDeg, expressed as a percentage, we proceed as follows:

- for each table, we select at random noiseDeg percent of the tuples, t1, ..., tk;

- for each ti = (v1, ..., vn), we either duplicate ti or modify some values in ti, choosing between the two at random. When modifying ti, each value is replaced by a random value of the same type with some fixed probability, and by a NULL value with some other fixed probability, so that most values are left unmodified.

Duplicating tuples adds noise to keys only, not to FKs, while modifying some values in tuples adds noise to both keys and FKs.

We created various noisy versions of our datasets, using several values of noiseDeg. A sketch of the procedure follows.
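A minimal sketch of this noise-injection procedure; the probabilities p_dup, p_random, and p_null are placeholders for the (elided) values used in the experiments, and types maps each column index to a generator of random values of the right type.

import random

def add_noise(tuples, noise_deg, types, p_dup=0.5, p_random=0.25, p_null=0.25):
    """Return a noisy copy of the table: a fraction noise_deg of the tuples
    is either duplicated (noise on keys only) or modified value by value
    (noise on both keys and FKs)."""
    noisy = list(tuples)
    victims = random.sample(range(len(tuples)), int(noise_deg * len(tuples)))
    for i in victims:
        if random.random() < p_dup:
            noisy.append(tuples[i])          # duplicate the tuple
        else:
            t = list(tuples[i])
            for j in range(len(t)):
                r = random.random()
                if r < p_random:
                    t[j] = types[j]()        # random value of the same type
                elif r < p_random + p_null:
                    t[j] = None              # NULL value
            noisy[i] = tuple(t)              # other values stay unchanged
    return noisy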

Setting of the algorithm's input parameters

We study which values of ε_key and ε_fk, the input approximation thresholds for keys and FKs respectively, give a good enough model for all our datasets. Of course, the threshold between good enough and unacceptable models is subjective and can vary according to the applications. In this thesis, we consider that a model is good enough when few dependencies are missing and, if possible, when those missing dependencies are of minor importance in the original model. For instance, we consider that it is acceptable to miss the key of a small table which is not referenced; however, dependencies expressing relationships between tables, such as FKs and referenced keys, are of major importance. Because we do not know in advance how noisy the database is, we want to find a pair (ε_key, ε_fk) that works well on most databases, clean as well as noisy, and then we will use these values as input parameters for our experiments.

For this experiment, we chose two of our real-life datasets: Mondial, because its data model has the characteristics of most well-designed databases, with keys of size 1 and 2; and Janus, because its data model has large composite keys. However, the large composite keys in the Janus data model are actually superkeys in the Janus data, so we replaced them in the data model by their smallest subset that is a key in the data (if there are several such subsets, we choose one at random). For instance, the key of the table Core in Janus' original data model is {Leg, Site, Hole, Core, Core_type}, but {Site, Core, Core_type} is a key in the Janus data, so we used the latter in Janus' updated data model. We summarize in the table below the keys and foreign keys of both the Mondial and Janus data models.

[Table: Summary of keys and foreign keys in the Mondial and Janus data models: for each database, the total number of keys and of FKs, and the number of each of size 1.]

Finding ε_key. We tested our algorithm on each original dataset (corresponding to noiseDeg = 0) as well as on its noisy versions. We fixed the input parameter ε_fk, since FKs are of no concern in this particular study, and we experimented with various values for ε_key. Then we compared the keys output by our algorithm to those of the data model describing the real-life dataset. We recorded in the table below the number of keys present in the data model that are missed by our algorithm.

[Table: Number of keys present in the model but not found by our algorithm, on Mondial and Janus, for each noise degree noiseDeg and each value of ε_key.]

We can see in this table that, as ε_key increases, the number of keys missed by our algorithm decreases. We found a small nonzero ε_key to be a good choice: the number of missing keys decreases to zero for all noise degrees but one on the Mondial database, and in that case the two missing keys are of minor importance in the original model (they are not referenced). Moreover, as we consider datasets with small noise degrees, this value should cover most of the noise on keys.

Finding ε_fk. We repeat the previous experiment, except that this time the parameter varied is ε_fk, and ε_key is fixed to the value we just chose. We construct a table similar to the one used for finding ε_key.

[Table: Number of FKs present in the model but not found by our algorithm, on Mondial and Janus, with ε_key fixed; the details per size are given inside the parentheses.]

Our original Janus data is not perfectly clean: one FK is missing even when noiseDeg = 0 and ε_fk = 0. This FK has only a few distinct values, including a noisy one, so a large ε_fk would be needed to recover it. No value of ε_fk allows us to recover all the missing FKs. We chose a small value of ε_fk for which the number of missing FKs has decreased significantly.

We found that the chosen values of ε_key and ε_fk worked reasonably well on all of our datasets, clean as well as noisy, so we use these values as input parameters in the rest of our experiments. Even though these values allow us to recover most of the data models for our real-life datasets, whatever the noise degree, there may be other databases for which these values are not suitable. We feel that, in general, they should allow us to discover most of the keys and foreign keys of the data model, provided that the data is not very noisy. For very dirty data, no technique is capable of finding the right dependencies.

Finally, the last input parameter of our algorithm is the pruning parameter: the fraction PgParam of tuples involved in the pruning pass. We will study how to best set PgParam in a later section.

Key-based unary inclusion dependencies

We evaluate in this section the gain, in terms of database accesses, of extracting only key-based UINDs instead of all UINDs.

For each of our real-life databases, with no noise added, we record in the table below the number of UIND candidates, both when we extract all UINDs and when we extract only key-based UINDs, as well as the percentage of attributes that are included in a key. We also express the number of key-based UIND candidates as a percentage of all UIND candidates. For this experiment, the stopping criterion used in the algorithm is the limited-size criterion, i.e., we search for all integrity constraints up to the size bound.

[Table: Extracting all UINDs versus only key-based UINDs, on Amalgam, Genex, Mondial, and Janus: the number of UIND candidates, the number of key-based UIND candidates (also as a percentage of all UIND candidates), and the percentage of attributes included in a key.]

We found that, on these four datasets, key-based candidates account for only part of all UIND candidates; limiting ourselves to key-based UINDs therefore saves between one third and one half of the database queries. The raw numbers show that these savings are significant: there are far fewer key-based UIND candidates than UIND candidates, and each candidate requires a database query in order to check whether the UIND actually holds in the database. The percentage of attributes included in keys is slightly smaller than the percentage of UIND candidates that are key-based, due to our candidate generation process. Indeed, an attribute L is a left-hand side candidate for a given attribute R provided that L and R have the same type and that |L| ≤ |R|, where |L| is the number of distinct values of L. Because attributes in keys have more distinct values than attributes not in keys, there are more UIND candidates with a key attribute as the right-hand side.


Efficiency of the pruning pass

First of all, we show that the pruning pass can be very effective in some cases. We recorded the running time of our algorithm, both with and without the pruning pass, on our four real-life datasets without noise added. We computed the fraction of the running time without pruning that we avoided by performing a pruning pass; this fraction measures the benefit of the pruning pass. For this experiment, we used the limited-size stopping criterion, the thresholds ε_key and ε_fk chosen above, and, for the fraction PgParam of distinct values involved in the pruning pass, the best value for each dataset.

[Table: Running times of our algorithm with and without the pruning pass on the original data, for Amalgam, Mondial, Genex, and Janus: PgParam, running time without pruning, running time with pruning, and the gain due to the pruning pass as a percentage of the running time without pruning.]

From this table, we see that the pruning pass allows us to save a substantial fraction of the running time, so it can be very effective. This justifies the more detailed study that follows, where we examine the effectiveness of the pruning pass and which parameters influence it.


General behavior of the pruning pass in relation to the pruning parameter

We study in this section the general behavior of the pruning pass, and how this behavior depends on the pruning parameter PgParam.

For a general analysis, we experimented on most of the datasets described earlier: we ran our algorithm on our real-life datasets (Mondial, Genex, Janus) and on synthetic data, as well as on noisy versions of Mondial and Janus. We used the input approximation thresholds ε_key and ε_fk chosen previously.

First of all, we examine whether the pruning pass is worthwhile, and for which value of the pruning parameter the pruning is most effective. As the pruning pass requires that PgParam ≥ ε_fk, we experimented with PgParam values ranging upwards from ε_fk. The figure below shows the ratio between the running time with pruning and without pruning, versus the pruning parameter. All these graphs have the same shape: the time ratio increases as the pruning parameter grows. However, their positions with respect to the line of equation y = 1 differ according to the data: on synthetic data, the time ratio is bigger than 1 for all tested values of PgParam, while on Janus the time ratio is smaller than 1 for these same values. Finally, on Mondial, the time ratio crosses the line y = 1 at a small value of PgParam.

So the pruning pass (PP) is most effective for the smallest values of the pruning parameter in our experiments. But this optimal value of PgParam can correspond to a worthwhile or a detrimental pruning pass, depending on the dataset. Note that a detrimental PP is such that TimeWithPg > TimeNoPg.

We now investigate why the time ratio increases with PgParam, and why the

[Figure: Time savings. TimeWithPg/TimeNoPg versus the pruning parameter, for synthetic data, Janus, and Mondial, at several noise degrees, with the thresholds ε_key and ε_fk fixed.]


position of the curve with respect to the line of equation y = 1 changes according to the data. We measure the performance of the pruning over all FK candidates with the metrics described earlier, namely: the FK candidate and pruning ratios, FKcandRatio and PgRatio, which measure the number of superfluous candidate tests against the database without and with pruning, respectively (the closer to 1, the fewer superfluous tests there are); and the degree of pruning, DegPg, which measures how many candidates have been discarded by the pruning pass. For conciseness, we limited our investigation to one version of each real-life database (Mondial, Janus, and Genex, each at a single noise degree), but the results are similar for other noise degrees. We present the results in the table below.

Without the pruning pass, the algorithm performs many unnecessary database queries, as shown by very low FK candidate ratios across the datasets, from Janus to synthetic data. Recall that FKcandRatio = nFKs / nFKCands, and that each candidate is tested against the database via a query. In the ideal case, the algorithm tests only the candidates for which the inclusion holds, which corresponds to the number of candidates being equal to the number of FKs, and FKcandRatio = 1. When there are more candidates than FKs, i.e., when FKcandRatio < 1, superfluous database queries are performed.

Introducing a small amount of pruning (the smallest PgParam tested) greatly reduces this number of unnecessary database queries: on Mondial, Janus, and Genex, the pruning ratios are several times bigger than the FK candidate ratios. Hence the pruning is very effective on these datasets, as shown by high degrees of pruning. But increasing the pruning parameter brings little improvement to the effectiveness of the pruning pass, as demonstrated by the slowly increasing degrees of pruning and pruning ratios on Mondial, Janus, and Genex. Even when the performance of the PP remains relatively steady while PgParam grows, the cost of performing the pruning


[Table: Efficiency of the pruning pass on all candidates, with input parameters ε_key and ε_fk as chosen above: FKcandRatio, PgRatio, and DegPg for Mondial, Genex, Janus (each at one noise degree) and synthetic data, for several values of the pruning parameter (in %).]


still increases, so the cost of the PP increases faster than the benefits it generates. This explains why the gain from pruning decreases as PgParam grows.

The behavior of the PP is slightly different on synthetic data. Indeed, the pruning pass is not as effective for small PgParam: the pruning ratio is nearly equal to the FK candidate ratio for the smallest PgParam, and the degree of pruning is small, as opposed to the much higher degrees reached on real-life data.

Overall, the best compromise between benefit and cost is reached for the smallest PgParam, and from there the effectiveness of the PP decreases on all datasets, because the cost of pruning grows faster than its benefits. For real-life datasets, pruning with the smallest PgParam is relatively inexpensive, and it greatly reduces the number of superfluous database queries, so the optimal value of the pruning parameter corresponds to a worthwhile PP. On synthetic data, however, the optimal value of PgParam corresponds to a detrimental PP, where the cost always surpasses the benefits.

We now study why, on real-life data, the PP is so effective for a small PgParam, and why its performance increases very little for larger PgParam values. The candidates can be classified into two groups: the accidental candidates, which have very few values in common with their right-hand side, and the expected candidates, which share more than a few values with their right-hand side. To discard an accidental candidate, we have to process only a couple more tuples than required by ε_fk, i.e. a very small pruning parameter suffices. On the other hand, to be able to discard some of the expected candidates, we need to examine many more tuples; that is, we need a large PgParam.
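As an illustration of why a pruning parameter barely above ε_fk suffices for accidental candidates, here is a minimal Python sketch of the discard test; it assumes ε_fk is a fraction of violating tuples tolerated over the whole table, and the names are hypothetical, not the thesis code.

    def can_prune(sample, rhs_values, eps_fk, table_size):
        """Discard an FK candidate if the sample alone already proves the
        approximate inclusion cannot hold.

        sample     -- candidate values from a PgParam fraction of the tuples
        rhs_values -- set of values of the right-hand-side key
        eps_fk     -- fraction of violating tuples tolerated in the whole table
        table_size -- number of tuples in the candidate's table
        """
        budget = eps_fk * table_size  # violations allowed in the entire relation
        misses = 0
        for v in sample:
            if v not in rhs_values:
                misses += 1
                if misses > budget:   # an accidental candidate gets here after
                    return True       # only a few tuples more than the budget
        return False

An accidental candidate, which shares almost no values with its key, exceeds the budget after roughly ⌈ε_fk · table_size⌉ + 1 sampled tuples, which is why a PgParam only slightly above ε_fk is enough to discard it.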

We investigated how many values are shared by the accidental and expected candidates and their right-hand side (RHS). We chose only one real-life dataset, Mondial, and one synthetic dataset for this experiment: Mondial is representative of real-life datasets, and the synthetic database corresponds to a special case where all attributes share several values. We computed the average percentage of common values between candidates and RHSs for two PgParam values: the smallest value used in the previous tests, and a larger one. The candidates that are discarded in the PP correspond to the accidental candidates, and the others are the expected candidates. We present the results in the table below.

Table: Average percentage of common values between the candidate left-hand sides and their right-hand sides, for pruned (accidental) and not-pruned (expected) candidates, on Mondial and the synthetic data at a fixed noiseDeg, for two values of the pruning parameter (in %).

On Mondial, for the smallest PgParam, the accidental candidates share only a small percentage of their values with their RHSs, while the expected candidates share most of their values with their RHSs.
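The percentages in the table can be computed with a straightforward set intersection; a minimal sketch in Python, with illustrative names:

    def pct_common_values(lhs_values, rhs_values):
        """Percentage of the candidate's distinct values that also occur
        in its right-hand side."""
        lhs, rhs = set(lhs_values), set(rhs_values)
        return 100.0 * len(lhs & rhs) / len(lhs) if lhs else 0.0

    # Averaged separately over pruned (accidental) and surviving (expected)
    # candidates to obtain the two columns of the table.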

Recall that candidates are either accidental or expected. As accidental and expected candidates are defined in relation to the PP, these two sets vary with the pruning parameter: when PgParam increases, some of the candidates that were previously expected become accidental; thus the average percentage of common values between accidental candidates and their RHSs increases, as can be seen in the table above.

The high average number of common values between accidental candidates and their RHSs on synthetic data explains why the performance of the PP is so poor for small PgParam and why it increases slowly with PgParam. The percentages of common values between accidental candidates and their RHSs are much larger for synthetic data than for Mondial. This explains why the performance of the PP is lower on synthetic data than on real-life data: the pruning ratio is much lower on synthetic data than the values it reaches on Janus and Mondial (see the first efficiency table above).

These results confirm the observations made in the complexity analysis, namely that the PP is heavily influenced by the degree of overlap between the different domains. We generated the synthetic data to test our algorithm on worst-case data, so the high number of common values between candidates and RHSs on synthetic data is deliberate. The experiments corroborate that overlapping domains indeed represent a bad case for our algorithm: the time ratio TimeWithPg/TimeNoPg is always greater than 1 for synthetic data, so it is better not to perform a PP on these data.

Overall, we saw that the best value for the pruning parameter is very close to the FK approximation threshold ε_fk (the optimal PgParam was barely above ε_fk in our experiments). But even for the best value of the pruning parameter, the PP may be detrimental, and one should avoid performing it in some cases. This happens in particular when the different attribute domains overlap too much.

Influence of dependency size

We study in this section the influence of dependency size on the effectiveness of the pruning pass. We used the same set of experiments as in the previous section, except that here we detail the performance results per dependency size. We found that the pruning pass has the same effects for each of the sizes larger than one; hence we distinguish only between unary and composite dependencies in this section. The results are presented in the two tables below, for key-based UINDs and for composite dependencies respectively.

Table: Efficiency of the pruning pass on key-based UINDs (FKcandRatio, PgRatio, and DegPg for Mondial, Genex, and Janus at a fixed noiseDeg, and for the synthetic data), per value of the pruning parameter (in %), with input parameters ε_key and ε_fk fixed.

First of all, notice that the figures in this table are nearly the same as in the table over all candidates, so the general behavior of the pruning pass is actually shaped by its behavior on unary candidates. Because there are far more unary candidates than composite candidates, the effectiveness of the PP is dominated by how it fares on unary candidates. For most unary candidates, the inclusion dependency does not hold in the database, as shown by the very low values of the FK candidate ratio (lowest for Janus, highest for Mondial).

Table: Efficiency of the pruning pass on composite FK candidates (FKcandRatio, PgRatio, and DegPg for Mondial, Genex, and Janus at a fixed noiseDeg, and for the synthetic data), per value of the pruning parameter (in %), with input parameters ε_key and ε_fk fixed.

This is due to the fact that our candidate-generating process is not very constraining: an attribute L is a left-hand-side candidate for a RHS K provided that L and K have the same type and L has fewer distinct values than K. Thus an attribute L can be a candidate even if it has no value in common with the right-hand side K. So there is a huge number of left-hand-side candidates for each right-hand side, and most of them are accidental candidates.
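A minimal sketch of this permissive generation step (in Python; the dict-based representation and the non-strict comparison are assumptions for illustration, the thesis defines the exact test):

    def unary_fk_candidates(attributes, keys):
        """Generate unary FK candidates: L is a left-hand-side candidate
        for a key K whenever they have the same type and L does not have
        more distinct values than K.  Both arguments map an attribute
        name to a (type, set_of_distinct_values) pair."""
        candidates = []
        for lname, (ltype, lvals) in attributes.items():
            for kname, (ktype, kvals) in keys.items():
                if lname != kname and ltype == ktype and len(lvals) <= len(kvals):
                    candidates.append((lname, kname))  # value overlap is not checked
        return candidates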

The pruning pass is very effective on unary candidates: the degree of pruning is high for Mondial, Genex, and Janus, so a large portion of the candidates are discarded during the pruning pass. However, there still remain candidates for which the inclusion dependency does not hold in the database: the pruning ratio stays below 1 for Mondial, Genex, and Janus.

Unlike for key-based UINDs, the pruning pass discards few composite candidates: for Mondial and Genex the pruning ratio is equal or nearly equal to the FKcand ratio, and for Janus PgRatio is barely above FKcandRatio (see the composite table above). Moreover, the FKcand ratio is high (lowest for Janus, highest for Mondial), so even without a pruning pass there are not many superfluous tests against the database. These differences between the composite and unary cases come from the fact that most unary candidates are accidental while most composite candidates are expected. Indeed, a composite FK candidate F_1 … F_k for a key K_1 … K_k is such that F_i ⊆ K_i for all 1 ≤ i ≤ k, so the candidate is likely to share many values with the key. Therefore the probability of pruning a composite candidate is very low. We computed this probability on an example taken from Janus: a composite candidate whose distinct values overlap those of its key almost entirely. Write N for the number of the candidate's distinct values, s for the number of those values shared with its key, m for the minimal number of values not common with the key that must be seen to discard the candidate (of the order of ⌈ε_fk · N⌉), and n for the number of values the pruning pass may examine (of the order of ⌈PgParam · N⌉). The probability of discarding the candidate is then the hypergeometric tail

    ProbaPg = P{at least m of the n examined values are not common with the key}
            = Σ_{j=m..n} C(N − s, j) · C(s, n − j) / C(N, n).

Note that this probability increases as ε_fk decreases, so the pruning pass can be effective on composite candidates when ε_fk is very small.
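The tail sum above is easy to evaluate; a small Python sketch, with made-up numbers standing in for the Janus example:

    from math import comb

    def proba_pg(N, s, n, m):
        """Probability of drawing at least m of the N - s values not shared
        with the key, when n of the candidate's N distinct values are
        examined (hypergeometric tail)."""
        return sum(comb(N - s, j) * comb(s, n - j)
                   for j in range(m, min(n, N - s) + 1)) / comb(N, n)

    # Made-up numbers: 1000 distinct values, 950 shared with the key,
    # 50 values examined, at least 10 misses needed to discard.
    print(proba_pg(1000, 950, 50, 10))  # tiny: the candidate is almost never pruned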

The composite candidates discarded during the pruning pass are accidental candidates, which share few values with their key. These accidental candidates often belong to the same table as the key, and also have some attributes in common with the key. For instance, F = (Firstname, Gender, Lastname) can be an FK candidate for the key K = (Firstname, Mid_init, Lastname), but F and K probably have very few common values.

The effectiveness of the pruning pass on composite candidates remains constant when the pruning parameter increases (see the composite table above). Indeed, the few accidental candidates are discarded with a very small PgParam, but the expected candidates share so many values with their keys that a very large PgParam would be required to prune them. Therefore, if the pruning pass is to be performed on composite candidates, it is best to use very small PgParam values, because the cost of pruning is then minimized without a loss of performance. However, as the benefit due to pruning is low, performing a PP may be more costly than its gains.

We saw that the general effectiveness of the PP, as studied earlier, is actually shaped by its effectiveness on key-based unary candidates, because a large part of the total number of candidates are unary. So for unary candidates, it is most effective to use a small PgParam value, close to ε_fk, because it brings good pruning effectiveness at low cost; in our experiments, the optimal value was barely above ε_fk. We also saw that the PP is not very effective on composite candidates, so it may be too costly to perform relative to its benefits on most databases (e.g. on Mondial).

Influence of input approximation thresholds on the benefit of the pruning pass

We study in this section how the FK approximation threshold ε_fk affects the efficiency of the PP. We ran our algorithm on Mondial and Janus with several ε_fk values, and compare the corresponding shapes of the curves representing the time ratio versus PgParam. The curves are displayed in the figure below.

The curves associated with the different ε_fk values have the same shape. Smaller values of ε_fk allow us to process fewer tuples during the pruning pass, so that it is even faster and less costly to discard the accidental candidates. For a fixed ε_key, the time ratio associated with the optimal value of the pruning parameter gets smaller when ε_fk decreases. We cannot compare the curves corresponding to different values of ε_key for Janus, because the number of keys is significantly larger for one ε_key value than for the other.

We can see on Mondial that as ε_fk grows, the left part of the curve is more and more truncated (recall that PgParam must exceed ε_fk). When ε_fk becomes large enough, the only part of the curve that remains is the part for which the PP is detrimental.

So the FK approximation threshold influences the effectiveness of the pruning pass insofar as it provides a lower bound for the pruning parameter. We saw that the PP is most effective for small values of PgParam; therefore, when a large ε_fk value is input to our algorithm, it may be more effective not to perform the PP.

Figure: Efficiency of the pruning pass for various ε_fk values (time ratio TimeWithPg/TimeNoPg versus the pruning parameter), on Mondial and Janus at a fixed noiseDeg and ε_key.


Influence of data size

We found in the complexity analysis that the cost of the pruning pass is in O(PgParam · p²), where p is the maximal number of tuples per table in the database. We now attempt to verify experimentally that the pruning pass is indeed quadratic with respect to the data size. We chose one of the synthetic datasets for this experiment, because it was easy to increase the size of the data without altering the characteristics of the dependencies. The figure below plots the running time of the algorithm versus the number of tuples in a table.

Figure: Running time (in minutes) versus the number of tuples, on synthetic data with ε_key, ε_fk, and PgParam fixed.

The graph in the figure above is consistent with the shape of a polynomial of degree 2.
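The quadratic trend can be checked numerically by comparing polynomial fits of degree 1 and 2; a sketch with made-up measurements (the real ones are in the figure above), assuming numpy is available:

    import numpy as np

    # Hypothetical (number of tuples, running time in minutes) measurements.
    sizes = np.array([10_000, 20_000, 40_000, 80_000], dtype=float)
    times = np.array([0.5, 1.9, 7.8, 31.0])

    # If the running time is quadratic in the number of tuples, the residual
    # of the degree-2 fit should be far smaller than that of the linear fit.
    for degree in (1, 2):
        residuals = np.polyfit(sizes, times, degree, full=True)[1]
        print(f"degree {degree}: residual {residuals[0]:.3f}")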

Conclusion

In this thesis, we studied the problem of discovering approximate keys and foreign keys holding in a relational database of which we assume no previous knowledge. The key discovery part of our topic benefits from a large body of work on functional dependencies: efficient algorithms (e.g. TANE [HKPT]) have been designed for extracting FDs from a database. On the other hand, identifying foreign keys, or more generally inclusion dependencies, is a relatively new data mining area. We proposed an integrated architecture for determining the approximate keys and foreign keys holding in a database. To the best of our knowledge, this is the first concrete proposal of an algorithm for this problem, even though the underlying ideas are not new. Our algorithm is designed to work with the data stored in a DBMS, but it could easily be adapted to be memory resident.

We presented two optimizations to this integrated architecture:

1. extracting only key-based unary inclusion dependencies, instead of all unary INDs as is usually done;

2. adding a pruning pass to the foreign key discovery. The goal of this extra step, performed on a small data sample, is to reduce the number of foreign key candidates tested against the database.

The pruning pass (PP) is useful for discarding accidental candidates, i.e. candidates that share very few values with their right-hand side. Indeed, processing a few tuples is enough to make the decision to prune such a candidate, so the PP is very effective, at low cost, on this type of candidate. For instance, the pruning pass is worthwhile when extracting key-based UINDs, because the candidate-generating process is not very constraining and creates many accidental candidates. However, when the candidates share more than a few values with their right-hand side, the cost of pruning is too high with regard to the savings in database queries it generates; thus the PP is disadvantageous. This happens for composite FKs, or when the attribute domains overlap too much.

Two input parameters of our algorithm have a major influence on the effectiveness of the pruning pass: the approximation threshold for foreign keys ε_fk, which controls the degree of approximation allowed in the FKs discovered, and the pruning parameter PgParam, which defines the fraction of tuples involved in the pruning pass. These two parameters are constrained by the inequality PgParam > ε_fk, which must be satisfied for the PP to function. The PP is most effective for small PgParam values and becomes unfavorable when PgParam grows; so when ε_fk is too large, the PP can be detrimental.

In summary, the pruning parameter should be set to a value slightly greater than ε_fk, and the pruning should occur only on unary candidates. Besides, if the attribute domains overlap too much, or if ε_fk is large, it is often faster to avoid the PP altogether.

The work presented in this thesis can be extended in several ways. One could investigate heuristic optimizations for the pruning pass. A possible improvement would be to decide dynamically whether a pruning pass should be used for a given foreign key candidate, depending on its domain and the domain of the key. When the domains are of type string, because there are so many possible values, it is very unlikely to find the same strings in both the key and the FK candidate unless the candidate really is a FK. When the domains are of type number, on the other hand, it would not be surprising to find the same values in both the key and the FK candidate, even in cases where they are not linked at all.
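As a sketch of what such a dynamic decision could look like (illustrative only; the thesis leaves this as future work):

    def should_run_pruning_pass(domain_type: str) -> bool:
        """Decide per candidate whether the pruning pass is worth running.

        String domains have so many possible values that accidental overlap
        is rare, so a small sample is informative and the PP is likely
        worthwhile.  Numeric domains collide easily even between unrelated
        attributes, so a sample is less conclusive and the PP is likely
        wasted cost."""
        return domain_type == "string"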

Bibliography

[Arm] William Armstrong. Dependency structures of database relationships. In Proc. of the IFIP Congress, Geneva, Switzerland.

[AS] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Proc. Int. Conf. on Very Large Data Bases (VLDB). Morgan Kaufmann.

[BB] Siegfried Bell and Peter Brockhausen. Discovery of constraints and data dependencies in databases (extended abstract). In European Conference on Machine Learning, Lecture Notes in Artificial Intelligence. Springer Verlag. Full version appeared as a technical report, ftp://ftp-ai.informatik.uni-dortmund.de/pub/Reports/.

[Bel] Siegfried Bell. Discovery and maintenance of functional dependencies by independencies. In Knowledge Discovery and Data Mining.

[BEST] François Bry, Norbert Eisinger, Heribert Schütz, and Sunna Torge. SIC: satisfiability checking for integrity constraints. In DDLP.

[BMKL] Denilson Barbosa, Alberto O. Mendelzon, John Keenleyside, and Kelly A. Lyons. ToXgene: an extensible template-based data generator for XML. In SIGMOD Conference.

[BMT] Dina Bitton, Jeffrey Millman, and Solveig Torgersen. A feasibility and performance study of dependency inference. In Proceedings of the Fifth International Conference on Data Engineering. IEEE Computer Society.

[Bou] Jean-François Boulicaut. A KDD framework for database audit. International Journal of Information Technology and Management, special issue on Information Technologies in Support of Business Processes.

[CFP] Marco A. Casanova, Ronald Fagin, and Christos H. Papadimitriou. Inclusion dependencies and their interaction with functional dependencies. Journal of Computer and System Sciences.

[CGK] Qi Cheng, Jarek Gryz, Fred Koo, T. Y. Cliff Leung, Linqi Liu, Xiaoyan Qian, and Berni Schiefer. Implementation of two semantic query optimization techniques in DB2 Universal Database. In The VLDB Journal.

[CI] Elliot J. Chikofsky and James H. Cross II. Reverse engineering and design recovery: a taxonomy. IEEE Software.

[Cod] E. F. Codd. Recent investigations in relational database systems. In Proceedings of the IFIP Congress.

[Dat] C. J. Date. An Introduction to Database Systems. Addison-Wesley.

[dJS] Lurdes Pedro de Jesus and Pedro Sousa. Selection of reverse engineering methods for relational databases. In Third International Conference on Software Maintenance and Reengineering, Amsterdam.

[FS] Peter A. Flach and Iztok Savnik. Database dependency discovery: a machine learning approach. AI Communications.

[Gen] Genex project. http://www.ncgr.org/genex/index.html.

[GGXZ] Parke Godfrey, Jarek Gryz, Haoping Xu, and Calisto Zuzarte. Exploiting constraint-like data characterizations in query optimization. SIGMOD Record.

[Hai] Jean-Luc Hainaut. Database reverse engineering. In Proceedings of the Conf. on ER Approach, San Mateo, CA.

[HKPT] Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka, and Hannu Toivonen. Efficient discovery of functional and approximate dependencies using partitions. In Proceedings of the Fourteenth International Conference on Data Engineering. IEEE Computer Society.

[KA] A. J. Knobbe and P. W. Adriaans. Discovering foreign key relations in relational databases. In The Thirteenth European Meeting on Cybernetics and Systems Research, volume II of Cybernetics and Systems.

[KM] Jyrki Kivinen and Heikki Mannila. Approximate dependency inference from relations. In International Conference on Database Theory (ICDT), Lecture Notes in Computer Science. Springer.

[KMRS] M. Kantola, H. Mannila, K.-J. Räihä, and H. Siirtola. Discovering functional and inclusion dependencies in relational databases. Journal of Intelligent Systems.

[Lab] F. Laburthe. Choco: an object-oriented constraint propagation kernel over finite domains. http://www.choco-constraints.net/index.html.

[LL] M. Levene and G. Loizou. A Guided Tour of Relational Databases and Beyond. Springer-Verlag.

[LL] Mark Levene and George Loizou. Guaranteeing no interaction between functional dependencies and tree-like inclusion dependencies. Theoretical Computer Science.

[LPL] Stéphane Lopes, Jean-Marc Petit, and Lotfi Lakhal. Efficient discovery of functional dependencies and Armstrong relations. In Advances in Database Technology (EDBT), International Conference on Extending Database Technology, Lecture Notes in Computer Science. Springer.

[LPT] S. Lopes, J.-M. Petit, and F. Toumani. Discovery of interesting data dependencies from a workload of SQL statements. In Principles of Data Mining and Knowledge Discovery, Third European Conference, Lecture Notes in Computer Science. Springer.

[LPT] Stéphane Lopes, Jean-Marc Petit, and Farouk Toumani. Discovering interesting inclusion dependencies: application to logical database tuning. Information Systems.

[LV] Mark Levene and Millist W. Vincent. Justification for inclusion dependency normal form. IEEE Transactions on Knowledge and Data Engineering.

[May] Wolfgang May. Information extraction and integration with Florid: the Mondial case study. Technical report, Universität Freiburg, Institut für Informatik. Available from http://www.informatik.uni-freiburg.de/~may/Mondial.

[MFH] Renée J. Miller, Daniel Fisla, Mary Huang, David Kalmuk, Fei Ku, and Vivian Lee. The Amalgam schema and data integration test suite. http://www.cs.toronto.edu/~miller/amalgam.

[MHH] Renée J. Miller, Laura M. Haas, and Mauricio A. Hernández. Schema mapping as query discovery. In Proceedings of the International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann.

[MLP] Fabien De Marchi, Stéphane Lopes, and Jean-Marc Petit. Efficient algorithms for mining inclusion dependencies. In International Conference on Extending Database Technology (EDBT), Lecture Notes in Computer Science. Springer.

[MR] H. Mannila and K.-J. Räihä. Dependency inference. In Proceedings of the Thirteenth International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann.

[MR] H. Mannila and K.-J. Räihä. The Design of Relational Databases. Addison-Wesley.

[MT] Heikki Mannila and Hannu Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery.

[MTV] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Efficient algorithms for discovering association rules. In AAAI Workshop on Knowledge Discovery in Databases (KDD), Seattle, Washington.

[NC] Noël Novelli and Rosine Cicchetti. FUN: an efficient algorithm for mining functional and embedded dependencies. In Proceedings of the International Conference on Database Theory (ICDT), Lecture Notes in Computer Science. Springer.

[Oce] Ocean Drilling Program. Janus database. http://www-odp.tamu.edu/database.

[PT] Jean-Marc Petit and Farouk Toumani. Discovering inclusion and approximate dependencies in relational databases. In Proc. Journées Bases de Données Avancées.

[PTBK] Jean-Marc Petit, Farouk Toumani, Jean-François Boulicaut, and Jacques Kouloumdjian. Towards the reverse engineering of denormalized relational databases. In Proceedings of the Twelfth International Conference on Data Engineering. IEEE Computer Society.

[SZWA] W.-M. Shen, W. Zhang, X. Wang, and Y. Arens. Model construction with key identification. In Proceedings of the SPIE Symposium on Data Mining and Knowledge Discovery: Theory, Tools, and Technology. SPIE.