
University of Nebraska - Lincoln
DigitalCommons@University of Nebraska - Lincoln

Computer Science and Engineering: Theses, Dissertations, and Student Research
Computer Science and Engineering, Department of

5-1998

Implementation of a Database System with Boolean Algebra Constraints

András Salamon
University of Nebraska - Lincoln

Follow this and additional works at: https://digitalcommons.unl.edu/computerscidiss Part of the Computer Engineering Commons

Salamon, András, "Implementation of a Database System with Boolean Algebra Constraints" (1998). Computer Science and Engineering: Theses, Dissertations, and Student Research. 141. https://digitalcommons.unl.edu/computerscidiss/141

This Article is brought to you for free and open access by the Computer Science and Engineering, Department of at DigitalCommons@University of Nebraska - Lincoln. It has been accepted for inclusion in Computer Science and Engineering: Theses, Dissertations, and Student Research by an authorized administrator of DigitalCommons@University of Nebraska - Lincoln.

Implementation of a Database System with Boolean Algebra Constraints

by

Andras Salamon

A THESIS

Presented to the Faculty of
The Graduate College at the University of Nebraska
In Partial Fulfillment of Requirements
For the Degree of Master of Science

Major: Computer Science

Under the Supervision of Professor Peter Z. Revesz

Lincoln, Nebraska

May 1998

Implementation of a Database System with Boolean Algebra Constraints

Andras Salamon, M.S.

University of Nebraska-Lincoln

Advisor: Peter Z. Revesz

This thesis describes an implementation of a constraint database system with constraints over a Boolean algebra. The system allows, within the input database as well as the queries, equality, subset-equality and monotone inequality constraints between Boolean algebra terms built up using the operators of union, intersection and complement. Hence the new system extends the earlier DISCO system, which only allowed equality and subset-equality constraints between Boolean algebra variables and constants.

The new system allows Datalog with Boolean algebra constraints as the query language. The implementation includes an extension of the Naive and Semi-Naive evaluation methods for Datalog programs, and algebraic optimization techniques for relational algebra formulas.

The thesis also includes three example applications of the new system, in the areas of family tree genealogy, genome assembly, and two-player game analysis. In each of these three cases the optimization provides a significant improvement in the running time of the queries.

ACKNOWLEDGMENTS

I would like to express my deepest appreciation to my advisor, Dr. Peter Revesz, for his guidance and encouragement on this project. I also thank Dr. Jean-Camille Birget and Dr. Spyros Magliveras for serving on my advisory committee and for their time.

I would like to dedicate my thesis to the memory of my grandfathers, Karoly Nemeth and Sandor Salamon.

Contents

Datalog with Boolean Algebra Constraints
  Boolean Algebra
  Syntax of Datalog Queries with Boolean Constraints
  Quantifier elimination
    Elimination method for equality constraints
    Elimination method for precedence and monotone inequality constraints
  Naive and Semi-naive evaluation methods
    Naive method
    Semi-Naive method
    EVAL function
    EVAL INCR function

Implementation
  The implemented Boolean Algebra
    Sets
  Implemented quantifier elimination methods
  Hardware and Software
  Java program
    Packages

Relational Algebra Formulas
  Relational Algebra Formulas
    Converting a Rule
    Converting a Relation
  Optimization of Relational Algebra Formulas
    Algebraic Manipulation
      Laws
      Principles
      The Algorithm
    Calculating multiple joins

User's Manual
  Switches
  Commands
  Input file format

Examples
  Ancestors example
    The problem
    The input database
    The Datalog program
    The output database
    Execution complexity
    Runtime results
  Genome Map
    The problem
    Solution
    Concrete examples
  Unavoidable Sets
    Problem
    Solution
    Example

Multisets
  Introduction
  Extension of the program
  An example
    The problem
    The solution
    Concrete example

Further Work

Introduction

Although relational database systems are currently the most widespread database systems, constraint databases are a very promising approach to replace them.

Constraint databases can contain quantifier-free first-order formulas. With the help of these formulas, constraint databases are able to express more than traditional relational databases. For instance, one constraint tuple can represent an infinite number of traditional tuples.

Constraint database systems can be categorized according to the type of the constraints. Some well-known constraint types are linear constraints, polynomial constraints, and gap constraints.

In this system the constraints are Boolean constraints, hence the name of the system is Datalog with Boolean Constraint. This system extends the possibilities of a previous system, Datalog with Integer Set COnstraints (DISCO), which was implemented in the department. The DISCO system allows only subset-equality constraints between Boolean algebra variables and constants; this system allows subset-equality, equality and monotone inequality constraints between Boolean algebra terms. A description of the DISCO system can be found in [?].

The next chapter gives a theoretical overview based on [?]. It is followed by a chapter about the current implementation: what kind of Boolean algebra is implemented, and the main structure of the program.

Because one really important part of the program is related to relational algebra, the chapter on Relational Algebra Formulas describes these formulas and how to store, convert and optimize them.

The name of the implemented program is GreenCoat. The User's Manual chapter gives information about the user interface of the program. The predecessor of this program was implemented in the previous semester by Song Liu and me.

The Examples chapter describes some examples and tries to demonstrate the possibilities of the system.

The system also supports the use of some multiset operators. The Multisets chapter contains more information about multisets.

Chapter

Datalog with Boolean Algebra Constraints

In this chapter I give an overview of Datalog with Boolean Algebra Constraints. It was introduced by Kanellakis et al. [?] and extended by Peter Z. Revesz [?].

First I present the basic definitions necessary to understand the concept. Later I define the syntax and the evaluation methods.

Boolean Algebra

The following definition is taken from [?]; more information about Boolean algebras can be found in [?].

A Boolean algebra is a sextuple $\langle B, \wedge, \vee, ', 0, 1 \rangle$, where $B$ is the domain set, $\wedge$ and $\vee$ are binary operators, $'$ is a unary operator, and $0$ and $1$ are two special elements of the domain. They are also called the zero and one elements. Every Boolean algebra satisfies the following axioms for all $x, y, z \in B$:

$$x \vee y = y \vee x \qquad x \wedge y = y \wedge x$$
$$x \vee (y \wedge z) = (x \vee y) \wedge (x \vee z) \qquad x \wedge (y \vee z) = (x \wedge y) \vee (x \wedge z)$$
$$x \vee x' = 1 \qquad x \wedge x' = 0$$
$$x \vee 0 = x \qquad x \wedge 1 = x$$

Boolean term: All the elements of $B$, including $0$ and $1$, are Boolean terms. All the elements of $V$ (the set of variables) and all the elements of $C$ (the set of constants other than $0$ and $1$) are Boolean terms. If $t_1$ and $t_2$ are both Boolean terms, then $t_1 \wedge t_2$, $t_1 \vee t_2$ and $t_1'$ are also Boolean terms.

Precedence constraint: If a constraint has the form $x \wedge y' = 0$, where $x, y \in V \cup C$, then we call it a precedence constraint and denote it by $x \le y$.

Monotone Boolean function: A Boolean function $g$ is monotone if $x_i \le y_i$ for all $1 \le i \le n$ implies $g(x_1, \ldots, x_n) \le g(y_1, \ldots, y_n)$.

Monotone inequality constraint: $g(x_1, \ldots, x_n) \ne 0$ is a monotone inequality constraint if $g$ is a monotone Boolean function.

The following is a well-known fact for Boolean terms.

Proposition: Every Boolean term $t$ can be converted to disjunctive normal form (DNF):

$$t(z_1, z_2, \ldots, z_n) = \bigvee_{a \in \{0,1\}^n} t(a_1, a_2, \ldots, a_n) \wedge z_1^{a_1} \wedge z_2^{a_2} \wedge \cdots \wedge z_n^{a_n}$$

where $z^0$ denotes $z'$ and $z^1$ denotes $z$.
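The expansion in the proposition can be checked mechanically on a small Boolean algebra. The following is a minimal sketch, assuming the powerset algebra of a three-element universe as the Boolean algebra; the names `U`, `comp`, `subsets`, `dnf_value` and the example term `t` are illustrative and not part of the implemented system.

```python
from itertools import combinations, product

U = frozenset({1, 2, 3})   # plays the role of the element 1
ZERO = frozenset()         # plays the role of the element 0

def comp(x):
    """Complement x' within U."""
    return U - x

def subsets(u):
    """All elements of the powerset algebra over u."""
    items = sorted(u)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def dnf_value(t, zs):
    """Evaluate the DNF expansion of the term t at the point zs."""
    result = ZERO
    for a in product((0, 1), repeat=len(zs)):
        # Coefficient t(a_1, ..., a_n), with each a_i being 0 or 1.
        coeff = t(*[U if bit else ZERO for bit in a])
        minterm = U
        for bit, z in zip(a, zs):
            minterm &= z if bit else comp(z)   # z^1 = z, z^0 = z'
        result |= coeff & minterm
    return result

def t(x, y):
    """Example term: (x AND y') OR (x' AND y)."""
    return (x & comp(y)) | (comp(x) & y)
```

Comparing `dnf_value(t, (x, y))` with `t(x, y)` for every pair of subsets exercises the identity on all 64 points of this small algebra.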

Syntax of Datalog Queries with Boolean Constraints

The following basic definitions can also be found in [?]. Every Datalog program contains a set of facts (constraint tuples) and a set of rules. The facts can be seen as special rules as well. The general form of the facts is

$$R(x_1, \ldots, x_k) \;\text{:-}\; f(x_1, \ldots, x_k) = 0,\; g_1(x_1, \ldots, x_k) \ne 0,\; \ldots,\; g_l(x_1, \ldots, x_k) \ne 0$$

where $f$ and $g_i$ ($1 \le i \le l$) are Boolean terms.

The general form of the rules is

$$R(x_1, \ldots, x_k) \;\text{:-}\; R_1(x_{1,1}, \ldots, x_{1,k_1}),\; \ldots,\; R_n(x_{n,1}, \ldots, x_{n,k_n}),\; f(\bar{x}) = 0,\; g_1(\bar{x}) \ne 0,\; \ldots,\; g_l(\bar{x}) \ne 0$$

where $R, R_1, \ldots, R_n$ are relation symbols (not necessarily distinct), the arguments $x_i$ and $x_{i,j}$ are elements of $V \cup C$, $\bar{x}$ is the set of variables in the rule, and $f$ and $g_i$ ($1 \le i \le l$) are Boolean terms.

It is not a real restriction that one side of each constraint is always $0$:

$$f = g \iff (f \wedge g') \vee (f' \wedge g) = 0$$

hence we can convert all the constraints to this form. Without loss of generality we can also assume that we have only one equality constraint, because

$$f_1 = 0 \wedge \cdots \wedge f_n = 0 \iff f_1 \vee \cdots \vee f_n = 0$$

therefore we can combine several equality constraints into one constraint.

Quantifier elimination

A quantifier elimination method is an equivalence between an existentially quantified formula and a quantifier-free formula.

Quantifier elimination is used for variables on the right-hand side of a rule which do not occur as variables on the left-hand side.

There are three elimination methods described in [?]. The correctness proofs of the elimination methods can also be found in that article.

Elimination method for equality constraints

The first elimination method, which originates with George Boole, can be used for equality constraints:

$$\exists x\, \big(f(x, y_1, \ldots, y_k) = 0\big) \iff f(0, y_1, \ldots, y_k) \wedge f(1, y_1, \ldots, y_k) = 0$$
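As an illustration, the equivalence can be compared against a brute-force search for a witness `x` on a small algebra. This is only a sketch, assuming the powerset algebra of a three-element universe; the terms `f`, `g`, `h` and the helper names are hypothetical examples, not code from the system.

```python
from itertools import combinations

U = frozenset({1, 2, 3})   # the element 1
ZERO = frozenset()         # the element 0

def subsets(u):
    """All elements of the powerset algebra over u."""
    items = sorted(u)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def f(x, y):
    """(x AND y') OR (x' AND y); equals 0 exactly when x = y."""
    return (x & (U - y)) | ((U - x) & y)

def g(x, y):
    """x OR y; equals 0 exactly when x = y = 0."""
    return x | y

def h(x, y):
    """x AND y."""
    return x & y

def exists_x(term, y):
    """Left-hand side: search all x for a witness of term(x, y) = 0."""
    return any(term(x, y) == ZERO for x in subsets(U))

def eliminated(term, y):
    """Right-hand side: term(0, y) AND term(1, y) = 0, with no quantifier."""
    return (term(ZERO, y) & term(U, y)) == ZERO
```

The quantifier-free side needs only two evaluations of the term, while the brute-force side enumerates every element of the algebra, which is not possible at all once the algebra is infinite.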

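Elimination method for precedence and monotone inequality constraints

The other lemma can be used for precedence ($x \le y$) and monotone inequality constraints:

$$\exists x\, \big(\, z_1 \le x, \ldots, z_m \le x,\;\; x \le y_1, \ldots, x \le y_k,\;\; w_1 \le u_1, \ldots, w_s \le u_s,$$
$$g_1(x, v_{1,1}, \ldots, v_{1,n_1}) \ne 0,\; \ldots,\; g_l(x, v_{l,1}, \ldots, v_{l,n_l}) \ne 0 \,\big)$$

is equivalent to

$$z_1 \le y_1, \ldots, z_1 \le y_k,\; z_2 \le y_1, \ldots, z_m \le y_k,\;\; w_1 \le u_1, \ldots, w_s \le u_s,$$
$$g_1(y_1 \wedge y_2 \wedge \cdots \wedge y_k, v_{1,1}, \ldots, v_{1,n_1}) \ne 0,\; \ldots,\; g_l(y_1 \wedge y_2 \wedge \cdots \wedge y_k, v_{l,1}, \ldots, v_{l,n_l}) \ne 0$$

where the $z_i$, $y_i$, $w_i$, $u_i$ and $v_{i,j}$ are variables or constants. Although they are not necessarily distinct symbols, they are all different from $x$.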

Naive and Semi-naive evaluation methods

If the input Datalog program does not contain recursive rules, then the evaluation of the program is simple: it is enough to evaluate every rule only once, using a standard algorithm for the ordering of the rules. This algorithm is described, for example, in [?]. If the input program contains recursive rules, then the rules have to be evaluated more than once.

Naive method

Example. The example is taken from [?]. We have an input database which contains parents-children pairs, and our goal is to find the ancestors of a specific person. The Datalog program is the following. There is a longer description of this example in the Ancestors example section.

    children(P, C) :- P == {parent1, parent2}, C == {person, brother}
    children(P, C) :- P == {gparent1, gparent2}, C == {parent1, uncle}
    children(P, C) :- P == {gparent3, gparent4}, C == {parent2, aunt}
    children(P, C) :- P == {ggparent1, ggparent2}, C == {gparent1}
    AAncestor(P) :- children(P, C), {person} /\ C != @
    AAncestor(P) :- children(P, C), AAncestor(P2), C /\ P2 != @

After evaluating all of the rules once we get the parents (parent1, parent2) of the specific person. If we evaluate the rules again we get the parents and the grandparents (gparent1, gparent2, gparent3, gparent4) of the person. After the third evaluation we get the great-grandparents (ggparent1, ggparent2) too. After the fourth evaluation the method does not give new ancestors, hence we can stop.

The algorithm used above is called the Naive method. The pseudocode of the method is the following (taken from [?]):

    for i := 1 to m do
        P_i := ∅;
    repeat
        for i := 1 to m do
            Q_i := P_i;
        for i := 1 to m do
            P_i := EVAL(i, Q_1, ..., Q_m);
    until P_i = Q_i for all i, 1 <= i <= m;

Here P_i is the set of tuples of the ith relation, and Q_i is the set of tuples of the ith relation in the previous step. At the beginning we erase all tuples. Then we repeat the steps of the algorithm until the results of the last two steps are identical. During one step we first store the tuples (Q_i := P_i), then calculate the new tuples using the tuples calculated in the previous steps. The calculation is done by the EVAL function. In the function, i denotes the index of the current relation, and Q_1, ..., Q_m denote the tuples of the relations which can be used by EVAL.
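The loop above can be sketched in a few lines. This is a simplified illustration, assuming ordinary ground tuples instead of constraint tuples and a cut-down version of the ancestors data; `CHILDREN`, `eval_children`, `eval_ancestor` and `naive` are hypothetical names, and each `eval_*` function stands in for EVAL on one relation.

```python
# Simplified ground facts: (parent, child) pairs.
CHILDREN = {
    ("parent1", "person"), ("parent2", "person"),
    ("gparent1", "parent1"), ("ggparent1", "gparent1"),
}

def eval_children(rels):
    """EVAL for the children relation: it consists of facts only."""
    return set(CHILDREN)

def eval_ancestor(rels):
    """EVAL for the ancestor relation, using the previous snapshot."""
    children, ancestor = rels
    base = {p for p, c in children if c == "person"}
    step = {p for p, c in children if c in ancestor}
    return base | step

def naive(eval_fns):
    """Naive evaluation: recompute every relation until nothing changes."""
    m = len(eval_fns)
    P = [set() for _ in range(m)]            # P_i := empty set
    while True:
        Q = [set(p) for p in P]              # Q_i := P_i
        P = [eval_fns[i](Q) for i in range(m)]
        if P == Q:                           # until P_i = Q_i for all i
            return P
```

Calling `naive([eval_children, eval_ancestor])` returns the final tuples of both relations; every pass recomputes all ancestors found so far, which is the redundancy the Semi-Naive method removes.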

Semi-Naive method

The main disadvantage of the Naive method is that it recalculates the same tuples in every iteration. In the previous example, during the first step the algorithm calculates the parents; during the second step the parents and the grandparents; during the third and fourth steps the parents, grandparents and great-grandparents. Therefore the algorithm calculates the parents four times. If the number of steps is greater (and in a real application it is several times greater), then this disadvantage is also greater. The main idea of the Semi-Naive method is to omit these recalculations.

If during the calculation we use only old tuples (tuples which were calculated before the previous step), then we only recalculate some older tuples. Therefore, if we want to calculate new tuples, we should use at least one new tuple (a tuple which was calculated during the previous step).

Naturally the first step is an exception, because there are no new tuples before the first step; hence the first steps of the Semi-Naive and Naive methods are identical.

The pseudocode of the Semi-Naive evaluation (also taken from [?]):

    for i := 1 to m do begin
        ΔP_i := EVAL(p_i, ∅, ..., ∅);
        P_i := ΔP_i;
    end;
    repeat
        for i := 1 to m do
            ΔQ_i := ΔP_i;
        for i := 1 to m do begin
            ΔP_i := EVAL_INCR(i, P_1, ..., P_m, ΔQ_1, ..., ΔQ_m);
            ΔP_i := ΔP_i - P_i;
        end;
        for i := 1 to m do
            P_i := P_i ∪ ΔP_i;
    until ΔP_i = ∅ for all i, 1 <= i <= m;

Here P_i denotes the tuples of the ith relation, ΔP_i the new tuples of the relation in the current step, and ΔQ_i the new tuples of the relation in the previous step. In the first step we use the EVAL function to calculate the tuples. Then we repeat the steps of the algorithm until there are no new tuples in the last step. In a step we first store the new tuples (ΔQ_i := ΔP_i), then calculate the new tuples using the EVAL INCR function. After this we check whether the new tuples are really new (ΔP_i := ΔP_i - P_i). At the end of the step we update the value of P_i (P_i := P_i ∪ ΔP_i). The EVAL INCR function has more parameters than the EVAL function, because EVAL INCR needs not only all the tuples but the new tuples as well.
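The same simplified ancestors data can be used to sketch the Semi-Naive loop. As before this assumes ground tuples rather than constraint tuples; `incr_children` and `incr_ancestor` are hypothetical stand-ins for EVAL INCR, written so that every derivation uses at least one tuple from the previous step's new tuples.

```python
CHILDREN = {
    ("parent1", "person"), ("parent2", "person"),
    ("gparent1", "parent1"), ("ggparent1", "gparent1"),
}

def eval_children(rels):
    return set(CHILDREN)

def eval_ancestor(rels):
    children, ancestor = rels
    return ({p for p, c in children if c == "person"}
            | {p for p, c in children if c in ancestor})

def incr_children(P, dQ):
    """children has only facts, so it never produces new tuples later."""
    return set()

def incr_ancestor(P, dQ):
    """Each clause uses at least one relation from the new tuples dQ."""
    (children, ancestor), (d_children, d_ancestor) = P, dQ
    return ({p for p, c in d_children if c == "person"}
            | {p for p, c in d_children if c in ancestor}
            | {p for p, c in children if c in d_ancestor})

def semi_naive(eval_fns, incr_fns):
    m = len(eval_fns)
    empty = [set() for _ in range(m)]
    dP = [eval_fns[i](empty) for i in range(m)]   # first step uses EVAL
    P = [set(d) for d in dP]
    while any(dP):                                # until all deltas are empty
        dQ = [set(d) for d in dP]
        dP = [incr_fns[i](P, dQ) - P[i] for i in range(m)]
        P = [P[i] | dP[i] for i in range(m)]
    return P
```

In this sketch each ancestor is derived in exactly one round, in contrast to the Naive loop, which derives the parents again in every round.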

EVAL function

This function calculates new tuples from the previously known tuples. Every relation is converted to relational algebra formulas, and these formulas are optimized.

By using these formulas, the EVAL function can easily calculate the new tuples. Every formula is a tree, and the leaves of the formulas are the relations. If we substitute the relations with the tuples of the relations and execute the relational algebra operators in the nodes, then the root of the tree will contain the new tuples.

EVAL INCR function

This function is similar to the EVAL function. The difference is that EVAL INCR should use at least one new tuple during the calculation. To achieve this, we clone every rule as many times as relations occur on the right-hand side of the rule. In the ith clone we put a Δ before the ith relation. In the ancestors example, the clones of the rules of relation AAncestor are:

    AAncestor(P) :- Δchildren(P, C), {person} /\ C != @
    AAncestor(P) :- Δchildren(P, C), AAncestor(P2), C /\ P2 != @
    AAncestor(P) :- children(P, C), ΔAAncestor(P2), C /\ P2 != @

Instead of the original rules, we convert these rules to relational algebra formulas and optimize them. With this we have the possibility to calculate EVAL INCR.

There is another possibility to simplify these rules. In the ancestors example the children relation has only facts, hence Δchildren is always empty. Therefore all the rules which contain Δchildren can be eliminated. If we eliminate these rules, we have only one rule left:

    AAncestor(P) :- children(P, C), ΔAAncestor(P2), C /\ P2 != @

More generally, if all the rules of relation R contain only facts, then we can eliminate every rule which contains ΔR in it.
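The cloning step can be sketched as follows. This is a minimal illustration, assuming a rule body is represented simply as a list of relation names; `delta_clones` and the `"delta "` marker are hypothetical, not the representation used by the system.

```python
def delta_clones(head, body_relations, fact_only=frozenset()):
    """Return one clone per body relation, with that relation marked as delta.

    Clones whose marked relation has only facts are dropped, because the
    delta of such a relation is always empty.
    """
    clones = []
    for i, rel in enumerate(body_relations):
        if rel in fact_only:
            continue
        marked = ["delta " + r if j == i else r
                  for j, r in enumerate(body_relations)]
        clones.append((head, marked))
    return clones
```

For the two AAncestor rules this yields three clones in total when no relation is declared fact-only, and a single clone once `children` is passed in `fact_only`.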

Although it is not implemented in the system, sometimes we can eliminate other rules too.

Example. Assume that we have mother and father relations in our input database, and our goal is to find the ancestors of a particular person. First we can define a parent relation, and after that the solution is the same as in the ancestors example. Here is a part of the program:

    mother(M, C) :- M == {mother}, C == {child1, child2}
    father(F, C) :- F == {father}, C == {child1, child2}
    parent(P, C) :- mother(P, C)
    parent(P, C) :- father(P, C)

Although the parent relation has two rules and neither of them is a fact, we calculate the tuples of the parent relation at the beginning of the evaluation, and after this step no new tuples will be added to this relation. Hence there is only one step in which Δparent is not empty. Therefore it would be possible, for some of the relations, to calculate the number of steps after which ΔR will always be empty, and to eliminate the appropriate rules from then on.

Chapter

Implementation

The implemented Boolean Algebra

The previous chapter gave a short overview of the basics of Datalog with Boolean Constraints. All the definitions and lemmas work with every possible Boolean algebra. Although most of the system also works with every possible Boolean algebra, only one Boolean algebra is implemented.

Sets

In the first Boolean algebra the domain $B$ contains the sets of integers and strings. Because of storage restrictions, $B$ contains only the finite sets and the sets whose complement is finite. This is not a strict restriction, because the length of the input files is finite, hence the user can define only these sets, and all the operators are closed on them. The Boolean algebra operators are defined in the following way: $\wedge$ is intersection ($\cap$), $\vee$ is union ($\cup$), and $'$ is the complement set. We should define the $0$ and $1$ elements as well: $0 = \{\}$ (the empty set) and $1$ is the complete set (the set of all integers and strings).
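One way to store such sets is to keep a finite set of elements together with a flag telling whether the set or its complement is meant. The sketch below follows that idea; it is an assumption made for illustration, and the class `FSet` and its method names are hypothetical rather than the system's actual Java classes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FSet:
    """A finite set, or (if cofinite is True) the complement of a finite set."""
    elems: frozenset = frozenset()
    cofinite: bool = False

    def complement(self):
        return FSet(self.elems, not self.cofinite)

    def meet(self, other):
        """Intersection; the result is again finite or cofinite."""
        if self.cofinite and other.cofinite:
            # A' AND B' = (A OR B)'
            return FSet(self.elems | other.elems, True)
        if not self.cofinite and not other.cofinite:
            return FSet(self.elems & other.elems)
        fin, cof = (other, self) if self.cofinite else (self, other)
        # A AND B' = A minus B
        return FSet(fin.elems - cof.elems)

    def join(self, other):
        """Union, obtained from intersection and complement (De Morgan)."""
        return self.complement().meet(other.complement()).complement()

ZERO = FSet()               # the empty set
ONE = FSet(cofinite=True)   # the set of all integers and strings
```

Because both operands always reduce to a finite `elems` set, every operation terminates even though `ONE` and other cofinite sets are infinite.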

Implemented quantifier elimination methods

In [?], P. Revesz describes three elimination methods. One of them works only with atomless Boolean algebras. Because the implemented Boolean algebra (see Sets above) is not an atomless Boolean algebra, that method is not implemented. The other two methods, described earlier under Quantifier elimination, are implemented in this system. The elimination method for precedence and monotone inequality constraints can be used only if all the inequality constraints are monotone. The system does not check monotonicity; it assumes that all the inequality constraints are monotone inequality constraints.

Hardware and Software

The system is implemented in the Java language. Originally the system was implemented under IRIX using the JDK (hardware: SGI R-series CPU), although it was also tested under Windows NT (hardware: Pentium and Pentium II processors) using Microsoft Visual J++. Because one of the main properties of the Java language is portability, the system should work on most well-known systems, even without recompilation.

The parser was implemented using a pre-release version of the Java Compiler Compiler (JavaCC).

Java program

Packages

The Java language supports the use of packages (collections of similar classes). The packages of the system and their functions can be seen in the table below.

    Name              Function
    storage           Storage of Datalog programs
    relalg            Storage of Relational Algebra Operators
    relalg.optimize   RA optimization methods
    elimination       Elimination methods
    evaluation        Naive, Semi-Naive evaluation
    parser            The parser
    util              Miscellaneous classes

    Table: Packages of the Java program

relalg package

relalg and its subpackage relalg.optimize contain classes related to relational algebra formulas. The chapter on Relational Algebra Formulas contains more information about them. The optimization methods (see Optimization of Relational Algebra Formulas) are implemented in the relalg.optimize package.

elimination package

The quantifier elimination methods (see Quantifier elimination) are implemented in this package. Because two methods are implemented and the system does not know in advance which one can be used, a new quantifier elimination method is implemented. This method is only a container of other quantifier elimination methods (right now two methods): it tries to execute the first method and, if that is not possible, the following one, until one of the methods succeeds or all of them have failed.

evaluation package

This package implements the generic code of the Naive and Semi-Naive evaluation methods. The eval and eval incr functions are also defined in this package.

parser package

The function of this package is to parse the input files and the user commands. The User's Manual chapter contains more information about the user commands and the input file format. The Java source files in this package are created by JavaCC from a grammar description file (.jj).

storage package

This package stores the Datalog programs. The hierarchy among the classes can be seen in the figure below. This is not a superclass-subclass hierarchy: every class shown in the picture contains one or more instances of the classes shown below it. At the top of this hierarchy is the Database class, which contains our database. A database is a set of Relations. Every relation has one or more Rules. Every rule has a head, which is represented by a RelationTitle, and a body. A body can contain other relation names (RelationTitles) and Constraints. A Constraint can be an equality or an inequality constraint. Every constraint has the form Boolean Term = 0 or Boolean Term ≠ 0. A Boolean term is represented by a Term. Because every Boolean term can be transformed to DNF, every term is stored as an array of basic conjunctions (Conjunction). A basic conjunction is a conjunction of literals, which can be stored as an array of Literals. A literal can be a Variable, an element of the domain (Element), or a constant (Constant). Constants are not implemented in the current version of the system, but it is worth mentioning the possibility of integrating constants into the system.

    Database
      Relation
        Rule
          RelationTitle
          Constraint
            Term
              Conjunction
                Literal
                  Variable
                  Constant

    Figure: The hierarchy among the classes of the storage package

Element is an abstract class; it can contain the elements of every possible Boolean algebra. It defines the methods which have to be implemented to represent a concrete Boolean algebra. ElementSet is a subclass of Element; it can store the elements of every possible set-typed Boolean algebra. The only non-abstract subclass of ElementSet is ElementFSet, which implements the sets described in the Sets section. The figure below shows the superclass-subclass hierarchy among these classes. In the original figure an angled rectangle indicates that the class is not abstract, while a rounded rectangle indicates that the class is abstract.

    Element (abstract)
      ElementSet (abstract)
        ElementFSet

    Figure: The subclasses of Element

Chapter

Relational Algebra Formulas

Relational Algebra Formulas

Because relational algebra formulas are described in several textbooks (for example in [?]), in this part I describe only the differences between general relational algebra and the relational algebra formulas used in this program.

In this system there are four relational algebra operators: join ($\bowtie$), project ($\Pi$), union ($\cup$) and select ($\sigma$). Because the cross-product can be seen as a special join, in this system join represents both of them. The other main difference is that join and union are originally binary operators, hence their number of operands is always two. In this system the number of operands is greater than or equal to two. For instance, if we want to represent the join of four relations, originally we need three join operators; the new system needs only one.

The system stores relational algebra formulas as trees. This makes it easy to change a formula and to represent an operation with more than two operands. It is also very useful when we want to visualize a formula.

The input from the user contains Datalog rules, hence it is necessary to convert these rules to relational algebra formulas. A conversion method is described in Ullman's book. However, that method works with pure Datalog rules, hence it needs to be extended.

Converting a Rule

A general Datalog with Boolean Algebra rule has the following form:

$$R(X_1, \ldots, X_k) \;\text{:-}\; Q_1(Y_{1,1}, \ldots, Y_{1,l_1}),\; \ldots,\; Q_m(Y_{m,1}, \ldots, Y_{m,l_m}),\; \sigma_1, \ldots, \sigma_n$$

where $m \ge 0$ is the number of relations on the right-hand side and $n \ge 0$ is the number of selections on the right-hand side. Although either $m$ or $n$ can be zero, they cannot be zero at the same time ($n + m > 0$).

If $m > 1$, then first we need to join the relations on the right-hand side. After that we can issue the selections one after the other. It would be possible to combine the selections into one selection, but the optimization method works better if we do not combine them.

If $\{Y_{1,1}, \ldots, Y_{1,l_1}, \ldots, Y_{m,1}, \ldots, Y_{m,l_m}\} = \{X_1, \ldots, X_k\}$, then the right-hand side contains only variables which can be found on the left-hand side, therefore it is not necessary to use a projection. Otherwise we need a projection as well: $\Pi_{X_1, \ldots, X_k}$.

Converting a Relation

First the algorithm converts all the rules of the relation. If the number of rules is greater than one, then the algorithm connects the formulas with a union.

Example. If relation R has the following two Datalog rules

    R(x, y) :- C(x, y)
    R(x, y) :- A(x, z), B(z, y), D(y), z != {1, 2, 3}

then the algorithm first converts the first rule, and we get the formula $C(x, y)$. Next the second rule is converted, yielding

$$\Pi_{x,y}\big(\sigma_{z \ne \{1,2,3\}}(A(x, z) \bowtie B(z, y) \bowtie D(y))\big)$$

and finally the two formulas are joined together with a union operator (see the figure):

$$C(x, y) \;\cup\; \Pi_{x,y}\big(\sigma_{z \ne \{1,2,3\}}(A(x, z) \bowtie B(z, y) \bowtie D(y))\big)$$

[Figure: The formula of relation R. The root is a union node; its children are C(X, Y) and Π X,Y, which stands above σ Z != {1, 2, 3}, which stands above the join of A(X, Z), B(Z, Y) and D(Y).]
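The two conversion steps can be sketched with nested tuples as tree nodes. This is an illustrative sketch only, assuming every rule has at least one relation in its body and that conditions are kept as opaque strings; `convert_rule`, `convert_relation` and the tuple layout are hypothetical and do not mirror the classes of the relalg package.

```python
def convert_rule(head_vars, atoms, constraints):
    """Build a formula tree for one rule.

    atoms: list of (relation name, tuple of variables), at least one entry.
    constraints: list of selection conditions, kept as strings here.
    """
    leaves = [("rel", name, tuple(vs)) for name, vs in atoms]
    # A single n-ary join node covers all relations of the body.
    node = leaves[0] if len(leaves) == 1 else ("join", leaves)
    # Selections are issued one after the other, not combined.
    for cond in constraints:
        node = ("select", cond, node)
    body_vars = {v for _, vs in atoms for v in vs}
    # A projection is needed only if the body has extra variables.
    if body_vars != set(head_vars):
        node = ("project", tuple(head_vars), node)
    return node

def convert_relation(rules):
    """Convert every rule and connect the formulas with a union."""
    formulas = [convert_rule(*rule) for rule in rules]
    return formulas[0] if len(formulas) == 1 else ("union", formulas)
```

Applied to the two rules of relation R from the example, this produces a union node over `C(x, y)` and a projection-selection-join chain, matching the shape of the formula above.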

Optimization of Relational Algebra Formulas

Although we are able to use the formulas right after converting the Datalog rules to relational algebra, it is better to optimize them first. Using optimization methods, we calculate a new formula from the original formula. The new formula should be equivalent to the original one (if we evaluate it, the result should be the same), and it should be faster to evaluate. No algorithm can improve all formulas. Usually the optimization improves the majority of the formulas and leaves some formulas unaltered, or even makes them worse.

There are many optimization methods. The algebraic manipulation method used in our system is described in the next subsection.

Algebraic Manipulation

This method is also described in [?]. In this method we use some equations between formulas. These equations are also called laws. First we give a list of these laws, and later an algorithm which can change the original formula using these laws. After the changes, the new formula can usually be evaluated faster than the original.

We have to optimize only a subset of the possible formulas, because our formulas originate from Datalog programs.

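Laws

We use the following laws from [?]:

Commutative law for join

$$R_1 \bowtie R_2 \equiv R_2 \bowtie R_1$$

Associative law for joins

$$(R_1 \bowtie R_2) \bowtie R_3 \equiv R_1 \bowtie (R_2 \bowtie R_3)$$

Cascade of projections

$$\Pi_{A_1, A_2, \ldots, A_k}\big(\Pi_{B_1, B_2, \ldots, B_l}(R)\big) \equiv \Pi_{A_1, A_2, \ldots, A_k}(R)$$

if $\{A_1, A_2, \ldots, A_k\} \subseteq \{B_1, B_2, \ldots, B_l\}$.

Cascade of selections

$$\sigma_{F_1 \wedge F_2}(R) \equiv \sigma_{F_1}\big(\sigma_{F_2}(R)\big) \equiv \sigma_{F_2}\big(\sigma_{F_1}(R)\big)$$

Commuting selections and projections

If the set of attributes in condition $F$ is a subset of $\{A_1, A_2, \ldots, A_k\}$:

$$\Pi_{A_1, A_2, \ldots, A_k}\big(\sigma_F(R)\big) \equiv \sigma_F\big(\Pi_{A_1, A_2, \ldots, A_k}(R)\big)$$

If the set of attributes in $F$ is $\{A_{i_1}, A_{i_2}, \ldots, A_{i_m}\} \cup \{B_1, B_2, \ldots, B_l\}$:

$$\Pi_{A_1, A_2, \ldots, A_k}\big(\sigma_F(R)\big) \equiv \Pi_{A_1, A_2, \ldots, A_k}\Big(\sigma_F\big(\Pi_{A_1, A_2, \ldots, A_k, B_1, B_2, \ldots, B_l}(R)\big)\Big)$$

Commuting selection with join

If all the attributes of $F$ are attributes of $R_1$:

$$\sigma_F(R_1 \bowtie R_2) \equiv \sigma_F(R_1) \bowtie R_2$$

If $F = F_1 \wedge F_2$, the attributes of $F_1$ are only in $R_1$, and the attributes of $F_2$ are only in $R_2$, then

$$\sigma_F(R_1 \bowtie R_2) \equiv \sigma_{F_1}(R_1) \bowtie \sigma_{F_2}(R_2)$$

If $F = F_1 \wedge F_2$, the attributes of $F_1$ are only in $R_1$, but the attributes of $F_2$ are in both $R_1$ and $R_2$:

$$\sigma_F(R_1 \bowtie R_2) \equiv \sigma_{F_2}\big(\sigma_{F_1}(R_1) \bowtie R_2\big)$$

Commuting a projection with a join

If $\{A_1, A_2, \ldots, A_k\} = \{B_1, B_2, \ldots, B_l\} \cup \{C_1, \ldots, C_m\}$, where the $B_i$ are attributes of $R_1$ and the $C_i$ are attributes of $R_2$:

$$\Pi_{A_1, A_2, \ldots, A_k}(R_1 \bowtie R_2) \equiv \Pi_{B_1, B_2, \ldots, B_l}(R_1) \bowtie \Pi_{C_1, C_2, \ldots, C_m}(R_2)$$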

Principles

There are three main principles of algebraic query optimization:

Perform selections as early as possible.

Perform projections as early as possible.

Combine sequences of unary operations.

The Algorithm

The steps of the algorithm:

For each selection, use the law for commuting selection with join to move the selection down.

Move projections down using the projection laws. If possible, delete projections.

Use the cascade of selections law to combine cascades of selections into one selection.

In this step our goal is to move selections as down as p ossible Originally we have a

set of selections S S S and a set of relations R R R In the original

1 2 k 1 2 l

execution order we connect the relations with a join op erator if the number of the

op erations is greater than zero than calculate the selections one after the other

During the optimization we rst check which relations and selections have com

mon variables Let V b e the set of relations which have common variables with S

i i

More formally

V fR j v ar iabl es in R v ar iabl es in S g

i n n i

A selection S can b e executed if the join of all the relations mentioned in V is

i i

already calculated The join can contain other relations also

If V is empty or contains all the relations then the selection is executed only after

i

we join all the relations Therefore we should nd the place of the other selections

If ij V V then S will b e executed b efore all the other selections If such

i j i

an index i do es not exist then the program chooses any index which has a small

size V Next we mo dify the V j i sets

i j

V n V S if V V

j i i i j

V

j

V if V V

j i j

As we can see V contains not only relations but selections as well

j

The previously describ ed metho d is one step of the optimization This step should

b e rep eated until all the selections are chosen If there are some relations which are

not used during the optimization no selection contains any variables of the relation

then a nal join should connect these relations and the selections
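The ordering procedure can be sketched as follows. This is a minimal illustration under stated assumptions: relations and selections are described only by their variable sets, ties between equally small sets are broken by input order, and selections whose set is empty or covers all relations are simply placed last. The function name `order_selections` is hypothetical.

```python
def order_selections(rel_vars, sel_vars):
    """Return the order in which the selections can be issued.

    rel_vars: dict mapping a relation name to its set of variables.
    sel_vars: dict mapping a selection name to its set of variables.
    """
    all_rels = set(rel_vars)
    # V_i: relations sharing at least one variable with selection S_i.
    V = {s: {r for r, vs in rel_vars.items() if vs & svars}
         for s, svars in sel_vars.items()}
    # Empty V_i or V_i = all relations: execute after the final join.
    last = [s for s in sel_vars if not V[s] or V[s] == all_rels]
    pending = {s: V[s] for s in sel_vars if s not in last}
    order = []
    while pending:
        names = list(pending)
        # Prefer a V_i that is a subset of every other pending V_j.
        chosen = next((s for s in names
                       if all(pending[s] <= pending[t] for t in names)), None)
        if chosen is None:
            chosen = min(names, key=lambda s: len(pending[s]))
        vi = pending.pop(chosen)
        order.append(chosen)
        for t in pending:
            if pending[t] & vi:
                # V_j := (V_j minus V_i) plus {S_i}
                pending[t] = (pending[t] - vi) | {chosen}
    return order + last
```

With the relations A(X, Z), B(V, Y), C(X), D(V, Y) and four selections over {X, Y}, {Y}, {X} and {X} (named S1 to S4), the function returns S2, S3, S4, S1.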

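[Figure: The formula before the optimization. From the root down: Π X,Y; σ X /\ Y != @; σ Y /\ {4} == @; σ X /\ {6,8} == @; σ X /\ {2,4} == @; and the join of A(X, Z), B(V, Y), C(X), D(V, Y).]

Example. Assume we have the following Datalog program:

    R(X, Y) :- A(X, Z), B(V, Y), C(X), D(V, Y), X /\ {2,4} == @,
               X /\ {6,8} == @, Y /\ {4} == @, X /\ Y != @

The corresponding relational algebra formula (see the figure) is

$$\Pi_{x,y}\Big(\sigma_{x \wedge y \ne \emptyset}\big(\sigma_{y \wedge \{4\} = \emptyset}\big(\sigma_{x \wedge \{6,8\} = \emptyset}\big(\sigma_{x \wedge \{2,4\} = \emptyset}(A(x, z) \bowtie B(v, y) \bowtie C(x) \bowtie D(v, y))\big)\big)\big)\Big)$$

$\sigma_{y \wedge \{4\} = \emptyset}$, $\sigma_{x \wedge \{6,8\} = \emptyset}$ and $\sigma_{x \wedge \{2,4\} = \emptyset}$ each contain only one variable ($y$, $x$ and $x$ respectively). The variables in $\sigma_{x \wedge y \ne \emptyset}$ are $x$ and $y$. $\sigma_{x \wedge y \ne \emptyset}$ has common variables with relations A, B, C and D, therefore $V_1 = \{A, B, C, D\}$. Similarly $V_2 = \{B, D\}$, $V_3 = \{A, C\}$ and $V_4 = \{A, C\}$.

There exists no $V_i$ which is a subset of all the other $V_j$s, hence the algorithm chooses a selection which has one of the smallest $V_i$. In this example the algorithm can choose $V_2$, $V_3$ or $V_4$; assume that it chooses $V_2$. As we issue $S_2$ we get the following formula: $\sigma_{y \wedge \{4\} = \emptyset}(B(v, y) \bowtie D(v, y))$.

The new value of $V_1$ is $V_1 = (\{A, B, C, D\} \setminus \{B, D\}) \cup \{S_2\} = \{A, C, S_2\}$. $V_3$ and $V_4$ are unchanged, because $V_3$ and $V_4$ have no common elements with $V_2$: $V_3 = \{A, C\}$, $V_4 = \{A, C\}$.

Now $V_3 \subseteq V_1$ and $V_3 \subseteq V_4$, hence we can issue $S_3$ and get $\sigma_{x \wedge \{6,8\} = \emptyset}(C(x) \bowtie A(x, z))$.

The new values of $V_1$ and $V_4$ are $V_1 = (\{A, C, S_2\} \setminus \{A, C\}) \cup \{S_3\} = \{S_2, S_3\}$ and $V_4 = (\{A, C\} \setminus \{A, C\}) \cup \{S_3\} = \{S_3\}$.

Now $V_4 \subseteq V_1$, so the algorithm can issue $S_4$, and we get $\sigma_{x \wedge \{2,4\} = \emptyset}\big(\sigma_{x \wedge \{6,8\} = \emptyset}(C(x) \bowtie A(x, z))\big)$.

Finally we issue $S_1$ and get (see the figure after the first step of optimization)

$$\Pi_{x,y}\Big(\sigma_{x \wedge y \ne \emptyset}\big(\sigma_{x \wedge \{2,4\} = \emptyset}\big(\sigma_{x \wedge \{6,8\} = \emptyset}(C(x) \bowtie A(x, z))\big) \bowtie \sigma_{y \wedge \{4\} = \emptyset}(B(v, y) \bowtie D(v, y))\big)\Big)$$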

Moving Projections Down

In this step our goal is to move the projections as far down as possible. Originally we have a projection, below that possibly some selections, and finally a join (see the figure of the variables of the formula). If it is possible, we evaluate the projection before the join. Usually this is not possible, but we can eliminate at least some of the variables before the join.

Denote by PV the set of variables in the projection. Denote by SV the set of variables in the selections. Denote by $V_i$ the variables of the ith branch of the join. None of the variables in PV or SV can be eliminated before the selections.

[Figure: The formula after the first step of optimization. From the root down: Π X,Y; σ X /\ Y != @; a join of two branches. The first branch is σ X /\ {2,4} == @ above σ X /\ {6,8} == @ above the join of C(X) and A(X, Z). The second branch is σ Y /\ {4} == @ above the join of B(V, Y) and D(V, Y).]

If a variable occurs in only one branch and the variable is not in PV or in SV, then this variable can easily be eliminated before the join. For instance, if our original formula is $\pi_x(A(x) \bowtie B(x,y))$, as shown in Figure, then $y$ occurs only in the second branch, therefore we can eliminate $y$ before the join. The optimized formula is $A(x) \bowtie \pi_x B(x,y)$, as shown in Figure.

If there exists no variable which occurs in only one branch, then the algorithm chooses a variable which occurs in the fewest branches. If all the variables occur in all the branches, then the algorithm cannot eliminate any variable. If more than one variable occurs in the same branches, then the algorithm eliminates these variables at the same time.

Figure: The variables of the formula. A projection $\pi_{PV}$ is at the top, the selections $\sigma$ (with variables SV) are below it, and below them is a join of the branches $B_1, \ldots, B_l$ with variable sets $V_1, \ldots, V_l$.

After this step the branches of the join have changed, therefore the algorithm should recalculate the values of the $V_i$. This recalculation is similar to the recalculation during the first step (moving selections down) of the algorithm. The algorithm runs until we cannot find any eliminable variables.

Because one branch of the join can contain other join operators, after the algorithm moves a projection below the join we should check whether it is possible to move the projection even further, below the other join. To achieve this the algorithm calls itself recursively. The new instance of the algorithm works only on a subtree (subformula) of the original tree (formula).
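The choice of variables to eliminate before a join can be sketched as follows. This is only an illustrative Python sketch, not the code of the system: the function name `eliminable_group` and the representation of branches as Python sets are assumptions made for the example, and ties between equally good variables are broken alphabetically, which the text does not specify.

```python
def eliminable_group(pv, sv, branches):
    """Pick the variables that can be projected out before the join.

    pv, sv   -- variables of the projection and of the selections
    branches -- list of sets, the variables V_i of each join branch
    Returns (variables, branch indexes) or None if nothing can be eliminated.
    """
    protected = set(pv) | set(sv)          # these must survive until the selections
    occurs_in = {}
    for index, variables in enumerate(branches):
        for var in variables:
            if var not in protected:
                occurs_in.setdefault(var, set()).add(index)

    # A variable occurring in all the branches cannot be eliminated.
    candidates = {v: b for v, b in occurs_in.items() if len(b) < len(branches)}
    if not candidates:
        return None

    # Choose a variable occurring in the fewest branches (ties: alphabetical,
    # an assumption of this sketch), then take every variable that occurs in
    # exactly the same branches so they are eliminated at the same time.
    chosen = min(candidates, key=lambda v: (len(candidates[v]), v))
    target = candidates[chosen]
    group = sorted(v for v, b in candidates.items() if b == target)
    return group, sorted(target)


# The upper join of the running example: V_1 = {X, Z}, V_2 = {V, Y}.
print(eliminable_group({"X", "Y"}, {"X", "Y"}, [{"X", "Z"}, {"V", "Y"}]))
```

Calling the function again after removing the returned variables from their branches mimics the loop that runs until no eliminable variable remains.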

Figure: Optimization input, $\pi_X(A(X) \bowtie B(X,Y))$. Figure: Optimization output, $A(X) \bowtie \pi_X B(X,Y)$.

Example. After the first step of the optimization we got the formula shown in Figure. The variables of the projection are X and Y, so PV = {X, Y}; the variables in the selection are also X and Y, so SV = {X, Y}. PV ∪ SV = {X, Y}, therefore we cannot eliminate X and Y. The upper join has two branches. The variables in the first branch are X and Z, therefore $V_1$ = {X, Z}. Similarly $V_2$ = {V, Y}. Z and V are local variables, because Z occurs only in the first branch and V occurs only in the second branch. Therefore we can eliminate Z from the first branch before the join, and V from the second one. The variables in the first branch are {X, Z}. If we eliminate Z we have only X, hence we should issue a $\pi_X$ below the join. In the same way we should issue a $\pi_Y$ below the join for the second branch. Our new formula is shown in Figure.

Finally the algorithm tries to move the projections even further, below the other joins. In the second branch Y cannot be eliminated, because Y is a variable in the projection and also in the selection; V cannot be eliminated, because it occurs in all the branches. In the first branch X cannot be eliminated, because X is a variable in the projection. Z is a local variable of the second branch, hence it can be eliminated before the join. Therefore the algorithm can move the projection below the join, and we get our new formula, which is shown in Figure.

Figure: The formula during the second step of optimization: $\sigma_{X \wedge Y \neq \emptyset}\bigl(\pi_X\,\sigma_{X \wedge \{2,4\} = \emptyset}\,\sigma_{X \wedge \{6,8\} = \emptyset}(C(X) \bowtie A(X,Z)) \bowtie \pi_Y\,\sigma_{Y \wedge \{4\} = \emptyset}(B(V,Y) \bowtie D(V,Y))\bigr)$

Connecting Selections

This is the final and the easiest step of the optimization. In this step the algorithm combines cascades of selections into one selection. The algorithm simply checks every edge in the tree, and if both vertices of the edge are selections, then it merges the two vertices and erases the edge.
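The merging of cascaded selections can be illustrated with a small Python sketch. The `Node` class and the function name `connect_selections` are hypothetical and are not taken from the system; the sketch only assumes that a selection vertex carries a list of conditions and has exactly one child.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                   # "select", "project", "join" or "relation"
    label: list = field(default_factory=list)   # conditions, variables or relation name
    children: list = field(default_factory=list)


def connect_selections(node):
    """Merge every parent/child pair of selections into a single selection."""
    # Work bottom-up so that a cascade of any length collapses into one vertex.
    node.children = [connect_selections(child) for child in node.children]
    if (node.kind == "select" and len(node.children) == 1
            and node.children[0].kind == "select"):
        child = node.children[0]
        return Node("select", node.label + child.label, child.children)
    return node


# The pair of selections from the running example, above the join of C and A.
tree = Node("select", [r"X /\ {2,4} == @"], [
    Node("select", [r"X /\ {6,8} == @"], [
        Node("join", [], [Node("relation", ["C"]), Node("relation", ["A"])]),
    ]),
])
merged = connect_selections(tree)
print(merged.label)
```

Storing the conditions as a list keeps the merged vertex equivalent to the original cascade, since a conjunction of selections does not depend on their order.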

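Example. In our example (Figure) there is only one pair of selections which can be connected: $\sigma_{x \wedge \{2,4\} = \emptyset}$ and $\sigma_{x \wedge \{6,8\} = \emptyset}$. After this step we get our final formula, which can be seen in Figure.

Figure: The formula after the second step of optimization: $\sigma_{X \wedge Y \neq \emptyset}\bigl(\sigma_{X \wedge \{2,4\} = \emptyset}\,\sigma_{X \wedge \{6,8\} = \emptyset}(C(X) \bowtie \pi_X A(X,Z)) \bowtie \pi_Y\,\sigma_{Y \wedge \{4\} = \emptyset}(B(V,Y) \bowtie D(V,Y))\bigr)$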

Calculating multiple joins

When we have to join more than two relations, the simplest way to join them is to choose one tuple from each relation, create a new tuple, and write the result to the output relation. For instance, we have four relations A, B, C, D and we want to calculate

$A(X,Y) \bowtie B(Y,Z) \bowtie C(Z,V) \bowtie D(V,W)$

Assume that there are tuples in the relations In this case we have to create

4 8

tuples

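Figure: The formula after the optimization: $\sigma_{X \wedge Y \neq \emptyset}\bigl(\sigma_{X \wedge \{2,4\} = \emptyset,\; X \wedge \{6,8\} = \emptyset}(C(X) \bowtie \pi_X A(X,Z)) \bowtie \pi_Y\,\sigma_{Y \wedge \{4\} = \emptyset}(B(V,Y) \bowtie D(V,Y))\bigr)$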

A better solution is to join two relations, then join the third to the result of the previous join, and finally join the fourth to the last result. For instance if we calculate

2 4

AX Y 1 B Y Z than the size of the result usually is less than Let us

assume that the results always contain tuples In this case we have to calculate

2 4

tuples which is less than p ercent of the original calculation

Of course we do not know the size of the result relation before we create the result. If the two relations have no common variables then the join is a cross product, so the size of the result relation is the product of the sizes of the original relations. In other cases we only know that the size of the result is not greater than the product of the sizes of the original relations.

If the algorithm is able to estimate the size of the resulting relation, then it is possible to join the relations in a good order, and so the algorithm is able to decrease the cost of a multiple join. The good order can be determined using dynamic programming. Because the algebraic manipulation considerably decreases the occurrence of multiple joins and makes it more difficult to estimate the size of the resulting relation, this system does not change the execution order of the joins but simply executes the joins from left to right.
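The difference between combining one tuple from every relation at once and joining the relations pairwise from left to right can be made concrete with a short calculation. The sketch below is illustrative only: the relation size of 100 tuples and the assumption that every intermediate result also has 100 tuples are chosen for the example, and the two function names are hypothetical.

```python
def all_at_once_cost(sizes):
    """Number of tuple combinations when one tuple is taken from every relation."""
    cost = 1
    for size in sizes:
        cost *= size
    return cost


def left_to_right_cost(sizes, result_size):
    """Number of tuple pairs examined when joining pairwise, left to right.

    result_size is the assumed size of every intermediate result.
    """
    cost = 0
    current = sizes[0]
    for size in sizes[1:]:
        cost += current * size      # pairs examined by this binary join
        current = result_size       # assumed size of its result
    return cost


sizes = [100, 100, 100, 100]        # assumed sizes of A, B, C and D
print(all_at_once_cost(sizes))      # 100000000
print(left_to_right_cost(sizes, 100))  # 30000
```

Under these assumptions the pairwise evaluation examines 0.03 percent of the combinations of the all-at-once evaluation; with larger intermediate results the saving shrinks accordingly.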

Chapter

Users Manual

The user interface of the program is character based. One student in the department is working on a Graphical User Interface.

After starting the program, the system waits for a user command. After the user enters a command, the system waits for the next command. This process ends when the user exits from the system.

With some commands the user can change the values of the switches; with the other commands the user can ask the program to give information or execute a process. First I describe the switches and later the other commands.

The user is also able to give a new rule or fact to the system. The general form of the rules and facts is described in Section. There are some differences:

The user can enter more than one equality constraint

The user can enter constraints in which neither side of the constraint is zero

The user can use not only but and as well

Because the keyboard does not contain the mathematical operator symbols, the user should use other symbols instead. Table shows the symbols which can be used by the system. Usually more than one symbol can be used; they are separated by commas.

Mathematical In the system
and
or
R not R R
ZERO
Table Operators

Switches

Most of the switches have two possible values: they are either turned on or off. All the possible values should be typed in lowercase letters; in the descriptions below, the capital letters show the default value of the switch.

time onOFF

If the switch is turned on, then after each evaluation the program displays the time used during the evaluation. It displays not only the total time but also some partial times, for instance the time used by the different relational operators. All time values are in seconds, and they are real (elapsed) seconds, not CPU seconds.

tempfile onOFF

If the switch is turned on, then after each step of the Naive (Section) or Semi-Naive (Section) evaluation the program prints the derived tuples into a temporary file (tempbld). This file is not overwritten by the following steps, hence the user can analyse the results after each step. Although the file contains the results of several steps, each result has the same format as the input file, therefore it is possible to use different parts of this file as an input file and continue an interrupted execution.

trace onOFF

If the switch is turned on, then the program displays the internal representation of a rule after a new rule is added to the system.

optimize ONoff

If the switch is turned on, then the program uses relational algebra optimization (Section); if not, the program uses the original formula. To take effect, the value of the switch should be changed before loading the input file.

method oldnaiveSEMINAIVE

With this switch the user can choose between the implemented evaluation methods. The Naive (Section) and Semi-Naive (Section) methods work with both non-recursive and recursive queries, but the old method works only with non-recursive queries.

Commands

exitbye

The command exits the program.

loadconsult filename

The command loads the file with the given file name and reads the queries from the file. The format of the input file is described in Section.

formula R onto filename

The command displays the relational algebra formula of R which is used by the Naive evaluation (Section). If the user specifies a file name, then the formula is printed to the file; otherwise it is printed to the screen.

formuladelta R onto filename

The command displays the relational algebra formula of R which is used by the Semi-Naive evaluation (Section). If the user specifies a file name, then the formula is printed to the file; otherwise it is printed to the screen.

display R onto filename

If the user does not specify a relation name, then the program prints the names and the arities of all relations.

If the user specifies a relation name, then the program prints the rules of the relation.

As with the formula and formuladelta commands, the user can name a file; otherwise the result of the command is printed to the screen.

displaydf onto filename

The command prints all the derived facts in the database.

As with the previous commands, the user can name a file; otherwise the result of the command is printed to the screen.

memory

The command displays the total and the free memory used by the system

clear

The command erases all the relations rules from the database

RE E

1 n

With this command the user can ask the derived fact of a relation E can b e

i

either a variable or an element of One variable can o ccur more than once At

the first time this command is used, the system calculates the derived facts using one of the evaluation methods. Later the system uses the derived facts stored in memory, hence the answer will be faster.

Note This is the only command which ends with a instead of a

Input file format

The input file starts with a line containing begin and ends with a line containing end. Between these lines are the rule definitions.

The input file may contain empty lines, one-line comments after // (as in C++ or Java), and multi-line comments between /* and */ (as in C, C++ or Java).

If a line would be too long, it can be split into several lines; each line but the last should end with a character. In the examples of this thesis we do not use the character.

Chapter

Examples

Ancestors example

The problem

The ancestors example appeared in. In the input database we store information about parents and their children. Our goal is to calculate all the ancestors of one particular person.

The input database

Pure Datalog

In pure Datalog one of the easiest ways is to use the children relation, in which one tuple contains one parent and one child. For instance, if husband and wife have three children (child1, child2, child3), then our input database is the following:

children husband child

children husband child

children husband child

children wife child

children wife child

children wife child

New system

In the new system we need only one tuple to represent the previous example

children husband wife child child child

Comparison

The previous example shows that in the new system we need fewer tuples to store the same data. It can also be seen that the tuples in the new system are more complex than the tuples in pure Datalog. In the example we needed only one tuple instead of six. More generally, if a couple has $k$ children then pure Datalog needs $2k$ tuples, in contrast to the new system, which needs only one.

The Datalog program

Pure Datalog

AAncestorP childrenP person

AAncestorP childrenP C AAncestorC

New system

AAncestorP childrenPC person C

AAncestorP childrenPC AAncestorP C P

Comparison

The program contains two rules in both systems. The rules are very similar, although the new system has slightly more complex rules.

The output database

Pure Datalog

In the pure Datalog every tuple in the output relation represents one of the ancestors

New system

In the new system every tuple in the output relation represents two ancestors

Comparison

The new system contains half the number of tuples as pure Datalog

Execution complexity

In this comparison I assume that both systems are using the Semi-Naive or the Naive evaluation.

Pure Datalog

At the first step the system finds the parents of person using the first rule. The system should check all the children tuples and find those in which person is the child.

Later we need to use the second rule, hence we need to evaluate a join.

New system

At the first step the system finds the parents of person using the first rule. The system should check all the children tuples and find those in which person is one of the children.

Later we need to use the second rule, hence we need to evaluate a join and a selection.

Comparison

The first step is almost identical. Because the number of tuples is smaller in the new system, the program should check fewer tuples. On the other hand, the tuples are more complex in the new system, hence checking one tuple is more time-consuming. If we analyze one step, we can find real advantages.

If our goal is to find the ancestors of child1, then in the original system we should check all six tuples. Check here means to evaluate child1 = child1, child2 = child1, child3 = child1. We should evaluate each of them twice, because each child occurs in two tuples.

In the new system we should check only one tuple. Check here means that we should calculate the intersection of {child1} and {child1, child2, child3}. It means that we should evaluate child1 = child1, child1 = child2, child1 = child3. In this case we need to evaluate these only once. Finally we should check whether the intersection is empty or not.
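The two representations can be compared with a small Python sketch. It is not the Datalog program of the thesis: the function names, the extra grandparent tuple and the names grandpa and grandma are assumptions introduced for the example, and the iteration only mimics the idea of using the newly found ancestors in each step.

```python
def ancestors_pure(children, person):
    """children: set of (parent, child) pairs, as in pure Datalog."""
    found, frontier = set(), {person}
    while frontier:
        parents = {p for (p, c) in children if c in frontier} - found
        found |= parents
        frontier = parents          # only the new ancestors are used next
    return found


def ancestors_sets(children, person):
    """children: list of (set of parents, set of children) tuples."""
    found, frontier = set(), {person}
    while frontier:
        new = set()
        for parents, kids in children:
            if kids & frontier:     # one intersection test per tuple
                new |= parents
        frontier = new - found
        found |= frontier
    return found


# Hypothetical database: one couple with three children, plus the husband's parents.
set_db = [
    (frozenset({"grandpa", "grandma"}), frozenset({"husband"})),
    (frozenset({"husband", "wife"}), frozenset({"child1", "child2", "child3"})),
]
# The same data with one parent and one child per tuple.
pure_db = {(p, c) for parents, kids in set_db for p in parents for c in kids}

print(len(pure_db), len(set_db))
print(sorted(ancestors_sets(set_db, "child2")))
```

Both functions return the same set of ancestors; the set-based database stores the same information in two tuples instead of eight.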

In the later steps both systems evaluate a join; the second one also evaluates a selection. If in the pure Datalog system we denote the number of tuples in the relations children and AAncestor by $C_p$ and $A_p$ respectively, then the numbers of tuples in the new system are $C_n = \frac{C_p}{2k}$ and $A_n = \frac{A_p}{2}$, where $k$ is the average number of children. Therefore the number of basic operations during the join is $A_p C_p$ in the old system and $A_n C_n = \frac{A_p C_p}{4k}$ in the new system.

If we are using Semi-Naive evaluation, then the algorithm uses only the new tuples $\Delta A$, hence the number of basic operations is $\Delta A_p C_p$ in the old system and $\Delta A_n C_n = \frac{\Delta A_p C_p}{4k}$ in the new system.

Similarly to the first step, one basic step is more complex in the second system, but there is still an advantage to using the new system. Usually the cost of the selection is much less than the cost of the join, hence it is not a problem that in the new system we need a selection as well.

Figure: A family tree. Names by generation:
ggparent1 ggparent2
gparent1 gparent2 gparent3 gparent4
uncleswife uncle parent1 parent2 aunt auntshusband
cousin1 wife person brother cousin2 cousin3 cousin4
daughterinlaw son daughter soninlaw
grandson granddaughter

Runtime results

Figure shows a family tree which was used for test purposes. The square-cornered rectangles contain the names of the men; the rounded rectangles contain the names of the women. In the program there is no difference between the two sexes; the shapes only help to understand the family tree.

Table shows the running times. During the evaluation the program calculated not only the ancestors of person but the ancestors of everybody. Because the optimization of relational algebra formulas does not change the formulas in this example, the optimization has no effect on the running time; the small differences are due only to the inaccuracy of the time measurement. As can be seen in the table, the Semi-Naive evaluation is approximately eight times faster than the Naive evaluation, hence the Semi-Naive evaluation is a great improvement.

Genome Map

The problem

The following genome map problem is described in.

The deoxyribonucleic acid (DNA) is a string of nucleotides. There are four nucleotides: adenine (A), cytosine (C), guanine (G) and thymine (T).

Random substrings of a given DNA are called clones. Clones may overlap each other. It is possible to cut a DNA string into clones with so-called restriction enzymes. After cutting we lose all information about the order of the clones. Each clone can be analysed further. The clones can be digested by various enzymes, and we can measure the fragments after the digestion. To eliminate the errors of measurement we can round the lengths of the fragments.

As an input we have a set of clones $c_1, \ldots, c_n$ of a DNA string and the lengths of the fragments of each clone.

The goal is to find the original order of the clones. This problem is NP-complete. However, there exist different heuristics which make it possible to solve the problem.

In this example we have an order of clones, and the task is to decide whether it is a possible order of the clones or not.

Solution

The idea for the algorithm is described in.

Further restrictions

To apply this solution we need some further restrictions on the input database:

No clone contains any other clone.

No clone contains two different fragments with the same length, although it is possible that two different clones have different fragments with the same length.

There exists $k$ such that each fragment is contained in at most $k$ clones.

If $c_1, \ldots, c_n$ is the correct order of the clones, then for all $i$ ($1 \le i < n$) $c_i$ overlaps $c_{i+1}$.

Automaton

Because every fragment is contained in at most $k$ clones, it is enough to analyze $k+1$ clones at the same time. We call $k+1$ adjacent clones a window. At the beginning the window contains the first $k+1$ clones, but during the solution this window will shift right. We denote the clones in the window by $A_1, \ldots, A_{k+1}$.

During the solution we will change the values of $A_1, \ldots, A_{k+1}$. The values change if we shift the window or if we pick fragments from the clones. We pick fragments from several clones $A_1, \ldots, A_l$ ($l \le k$) at the same time. If a clone must contain the fragments that we pick next, then we call the clone active. If $A_j$ is active then for all $i < j$, $A_i$ is also active, hence one number is enough to store the set of active clones.

We create a nondeterministic automaton to solve the problem. The automaton contains $k+2$ states, where $S_0$ is the initial state, $H$ is the halt state, and $S_i$ is the state in which $i$ clones are active.

Now we need to define the transitions of the automaton.

If $A_i \subseteq A_{i+1}$ then we cannot pick a fragment from the first $i$ clones which is not in $A_{i+1}$, therefore $A_{i+1}$ can be declared an active clone too.

If $A_1 = \emptyset$ then we can shift the window right.

If there are fragments which are in the active clones and not in the first non-active clone, then we can pick these fragments.
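Two of the transitions above, activating the next clone and picking fragments, can be sketched in Python. This is only an illustration under assumptions: the window is represented as a list of Python sets of fragment lengths, the function names `can_activate_next` and `pick` are hypothetical, the fragment lengths in the usage example are invented, and the shifting of the window and the search over the nondeterministic choices are not shown.

```python
def can_activate_next(window, active):
    """True if A_i is a subset of A_(i+1), where i = active (1-based)."""
    return window[active - 1] <= window[active]


def pick(window, active):
    """Pick the fragments that are in all active clones but not in the
    first non-active clone, and remove them from the active clones.

    window -- list of sets A_1 .. A_(k+1)
    active -- number of active clones, 1 <= active < len(window)
    Returns (J, new window), or None if there is nothing to pick.
    """
    common = set.intersection(*window[:active])
    picked = common - window[active]
    if not picked:
        return None
    new_window = [a - picked for a in window[:active]]
    new_window += [set(a) for a in window[active:]]
    return picked, new_window


# Invented fragment lengths for three overlapping clones.
window = [{5, 10, 15}, {5, 8}, {8, 20}]
picked, window = pick(window, 1)        # J = A1 \ A2
print(picked, window)
print(can_activate_next(window, 1))     # A1 is now a subset of A2
```

After the second clone becomes active, a further call `pick(window, 2)` removes the shared fragment from both active clones, which leaves the first clone empty so that the window could be shifted.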

Figure shows an automaton when k. This automaton is taken from. Because this system supports more Boolean operators than the DISCO system, which was used in, the description of the edges of this automaton is simpler than in.

Figure: The nondeterministic automaton for k. States: S0, S1, S2, S3, S4, S5 and H. Edge labels:
init window; window empty
A1<=A2; A2<=A3; A3<=A4; A4<=A5
A1==@ (four edges)
J==A1\A2, A1=A1\J
J==(A1^A2)\A3, A1=A1\J, A2=A2\J
J==(A1^A2^A3)\A4, A1=A1\J, A2=A2\J, A3=A3\J
J==(A1^A2^A3^A4)\A5, A1=A1\J, A2=A2\J, A3=A3\J, A4=A4\J
J==(A1^A2^A3^A4^A5)\A6, A1=A1\J, A2=A2\J, A3=A3\J, A4=A4\J, A5=A5\J

C3 C4 C2 C5 C1 C6 C7

1510 5 8 2025 30 35 15 5 10

Figure The clones of the smaller example

Concrete examples

Example with input size n k

This is a small example when k Figure shows the clones with the fragments

and also shows the correct order of the clones

To represent this example we need the following part of the input file:

cloneNX N X

cloneNX N X

cloneNX N X

cloneNX N X

cloneNX N X

cloneNX N X

cloneNX N X

cloneNX N X

firstCloneN N

nextCloneNN N N

nextCloneNN N N

nextCloneNN N N

nextCloneNN N N

nextCloneNN N N

nextCloneNN N N

nextCloneNN N N

nextCloneNN N N

To implement the automaton we need the second part of the input file:

pickJ A B A B J B J

SL A A A A L firstClones clones A

nextCloness clones A

nextCloness clones A

nextCloness clones A

Naive SemiNaive

Problem Without With Without With

optimization optimization optimization optimization

Genome map

n k

Genome map

n k

Ancestor

Unavoidable sets

Multiset

Table Test results Pentium II MHz

i i

SL A A A A L SJ A A A A A A

SL A A A A L SJ A A A A A A

i i

SL A A A A L SJ A A A A

A clonec A nextClonec c clonec A

SL A A A A L SJ A A A A

A clonec A nextClonec c clonec A

i i

SJ B A A A SJJAAAA JA JA

pickJAB

SJ B B A A SJJAAAA JAA JA

pickJAB pickJAB

SJ B B B A SJJAAAA JAAA

JA pickJAB pickJAB pickJAB

GOODX SJ

I measured the evaluation time of this example in four different situations. Table shows the results. All the times are real seconds, not CPU seconds; the test was running under WinNT on a Pentium II MHz. As can be seen, the Semi-Naive evaluation is a great improvement: we need only the percent (optimization off) or percent (optimization on) of the time of the Naive evaluation. The optimization has a remarkable effect with Naive evaluation (we save percent of the time) and a slight effect if optimization is on ( percent). The reason for the small effect is that the original rules are already rather optimized; there is not much possibility to optimize the rules further. However, it has to be mentioned that the time of the optimization (less than second) is much smaller than this small effect, therefore the optimization is useful.

C8 C6 C2 C5 C1 C4 C3 C7 C9
35 8 25 40 20 15 3510 5 25 30 20 45 50 15 40
Figure The clones of the bigger example

Example with input size n k

The original example described in was also tested in this system. The clones and the fragments are shown in Figure. Table shows the test result of this example too.

Unavoidable Sets

Problem

The problem originates from two-player games, such as chess. We can assign different labels to different positions. In chess we can use labels like "white wins", "black wins", "draw". It is possible that a position has no labels assigned. If we define more labels, then it is also possible that more than one label is assigned to a position. For instance, if we define labels like "white has a queen", "white has a rook", "white has a bishop", then if white has two rooks and a queen and no bishops, the first two labels are assigned to the position.

Assume that white wants to reach a position which has a specific label, and black wants to avoid it. We can build a tree which contains the possible positions: the current position is the root, and there is a directed edge between two positions if one player can move from one position to the other. If the players take turns alternately, then this graph is a bipartite graph; each position contains the name of the player who moves next. We assume that the graph is acyclic. In chess it really is acyclic, because if the same position occurs thrice then the game is a draw.

Our goal is to calculate those labels which are unavoidable by black if white wants to reach the label.

Solution

We can assign labels to the leaves. To calculate the labels for the other nodes we do the following. If black has to move, then we calculate the intersection of the labels assigned to the children of the node, because black wants to avoid the label. If white has to move, then we calculate the union of the labels assigned to the children of the node, because white wants to reach the label. If we assign labels one after the other, then after finitely many steps we reach the root, because the graph is acyclic.

Mathematically the problem is the following. We have a directed acyclic bipartite graph. Let A and B be the two disjoint sets of vertices. The graph has a special vertex for which the indegree equals zero. We call this vertex the root. Let us suppose that the root is in A. Sets are assigned to the leaves. If the sets of all the children of a vertex are already defined, then we can assign a set to the vertex. If the vertex is in A, then we assign the union of the sets of the children; if the vertex is in B, then the intersection of the sets of the children. The goal is to find the set which is assigned to the root.

Figure: The acyclic graph. Nodes by level: 1; 2 3; 4 5 6; 7 8 9 10 11; 12 13 14 15 16. Label sets shown at the leaves: {3,7}, {3}, {4,3}, {3,4,5,6}, {4,6,7}.
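The bottom-up calculation described above can be sketched in a few lines of Python. The graph in the usage example is a small hypothetical one, not the graph of the thesis, and the function name `unavoidable` and the dictionary-based representation are assumptions of this sketch.

```python
def unavoidable(children, leaf_sets, node, in_a=True):
    """Set assigned to `node` in a bipartite acyclic graph.

    children  -- dict: inner vertex -> list of child vertices
    leaf_sets -- dict: leaf vertex -> set of labels
    in_a      -- True if the vertex is in A (white moves, union),
                 False if it is in B (black moves, intersection)
    """
    if node in leaf_sets:
        return set(leaf_sets[node])
    child_sets = [unavoidable(children, leaf_sets, child, not in_a)
                  for child in children[node]]
    if in_a:
        return set().union(*child_sets)
    return set.intersection(*child_sets)


# Hypothetical graph: the root 1 is in A, its children 2 and 3 are in B.
children = {1: [2, 3], 2: [4, 5], 3: [6, 7]}
leaf_sets = {4: {3, 7}, 5: {3}, 6: {3, 4}, 7: {4, 6, 7}}
print(unavoidable(children, leaf_sets, 1))
```

In this example vertex 2 receives {3}, vertex 3 receives {4}, and the root receives their union. For a graph in which vertices are shared by several parents, caching the result per vertex would avoid repeated work.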

Example

In this example the graph and the labels assigned to the leaves are shown in Figure.

To store the structure of the graph we need the first part of the input file:

leftPC P C

rightPC P C

leftPC P C

rightPC P C

leftPC P C

rightPC P C

leftPC P C

rightPC P C

leftPC P C

rightPC P C

leftPC P C

rightPC P C

leftPC P C

rightPC P C

leftPC P C

rightPC P C

leftPC P C

rightPC P C

leftPC P C

rightPC P C

leftPC P C

rightPC P C

To store the sets of the leaves we need the following part

whiteX S X S

whiteX S X S

whiteX S X S

whiteX S X S

whiteX S X S

The part which calculates the new sets contains only three rules

blackX S leftX LrightX RwhiteL WwhiteR W SWW

whiteX S leftX LrightX RblackL BblackR B SBB

unS whiteX S X

If we give this input le to the system we get the result that the set of unavoidable

lab els is f g Table shows the used time during evaluation This example also

shows the advantage of the Semi-Naive evaluation. Because the number of iterations is relatively small in this example, the effect is not too big. This example also shows that the optimization method may worsen a formula. This is very rare. The problem is that in this example the right-hand side of the black and white rules contains four relations and a selection; in most cases it is useful to evaluate selections before the join, but in this special case evaluating the join before the selection would have been better.

Chapter

Multisets

Introduction

A natural extension of the program is to use multisets instead of sets. Multisets are data structures similar to sets; the difference is that a set can contain an element at most once, while a multiset can contain several copies of an element. Unfortunately multisets do not form a Boolean algebra. However, as described in, multisets can be implemented using a limited set of multiset operators and applying other restrictions.

Extension of the program

We allow the following multiset op erators

V V

1 2

Where V and V are multiset variables With this op erator we can check

1 2

whether one multiset variable is a subset of another multiset variable or not

V M

V V V V V V

1 2 1 2 2 1

V E V E V V

2 2

V E V E V V

2 2

V V V V

2 2

V E V E V V V V

2 2 2

Table Other op erators which can b e expressed

Where V is a multiset variable and M is a concrete multiset With this op erator

we can change the value of the multiset variable

V V V

1 2 3

Where V V and V are multiset variables The value of V is calculated using

1 2 3 1

the already known value of V and V

2 3

Although we allow only these three operators, some other operators can be expressed with them. Table shows operators which can be expressed using the basic operators. In the table the $V_i$ are multiset variables and $E$ is a multiset constant.

All the multiset variables and constants are denoted with a sign in the input file. There are also other restrictions related to multisets:

At most one multiset variable in every relation.

Only the first variable can be a multiset.

An example

The problem

The next example is also related to genome maps and is described in. In the previous genome map example we cut the DNA with an enzyme and got clones; after this we used another enzyme to digest the clones. In this example we have two enzymes at the beginning. We cut the original DNA with one of them and get the so-called row clones, then cut the original with the other enzyme and get the so-called column clones. Neither the row clones nor the column clones can overlap each other. After this we use the same enzyme to digest both the row and the column clones.

The solution

Because of the genetic difference we know the first row clone and the first column clone. The structure of the row and column clones implies that one of these two first clones is a subset of the other one. Assume that the first column clone is a subset of the first row clone. We also know that at the beginning of the DNA there are the fragments of this column clone (which are also fragments of the first row clone), followed by those fragments of the first row clone which are not in the first column clone. Let S be the set of these fragments. The algorithm should find another column clone which is either a subset of S or a superset of S. If the column clone is a subset of S, it means that we still have more fragments from the row clones than from the column clones, hence we need to find another column clone. If the column clone is a superset of S, it means that we have more fragments from the column clones than from the row clones, hence we need to find a row clone now. The difference of the column clone and S will be the new value of S. We can repeat this step until S equals the empty set, which means that we have found the same fragments in both the row and the column clones. S will be empty at the end of the DNA, although there is a small possibility that S will be empty before that.
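The matching of row and column clones can be sketched with Python multisets (`collections.Counter`). This is an illustrative backtracking search, not the Datalog program of the thesis: the function names are hypothetical, the clones in the usage example are invented, and for brevity the sketch does not fix the known first row and column clones but lets the search choose them.

```python
from collections import Counter


def contains(big, small):
    """True if the multiset `small` is a sub-multiset of `big`."""
    return all(big[x] >= n for x, n in small.items())


def consistent(rows, cols, s=None, excess="row"):
    """Can all row and column clones be matched against each other?

    s      -- multiset of fragments currently unmatched
    excess -- side ("row" or "col") that the unmatched fragments come from
    """
    s = Counter() if s is None else s
    if not s and not rows and not cols:
        return True
    # With unmatched row fragments we need a column clone, and vice versa;
    # when s is empty, a clone of either side may come next.
    sides = ["row", "col"] if not s else ["col" if excess == "row" else "row"]
    for side in sides:
        pool = rows if side == "row" else cols
        for i, clone in enumerate(pool):
            rest = pool[:i] + pool[i + 1:]
            new_rows, new_cols = (rest, cols) if side == "row" else (rows, rest)
            if contains(s, clone):       # clone is a subset of S
                ok = consistent(new_rows, new_cols, s - clone, excess)
            elif contains(clone, s):     # clone is a superset of S
                ok = consistent(new_rows, new_cols, clone - s, side)
            else:
                continue
            if ok:
                return True
    return False


rows = [Counter([1, 2, 3]), Counter([4, 5, 6])]
cols = [Counter([1, 2]), Counter([3, 4]), Counter([5, 6])]
print(consistent(rows, cols))
```

The subtraction of two `Counter` objects keeps only positive counts, so it behaves like the multiset difference used for the new value of S.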

The Datalog with Boolean Constraints program which can solve this problem

downSSUCUR initialcSSUCUR

rightSSUCUR initialrSSUCUR

downSSUCURR downSUCUR rowRM R S

SS S R pickMUR URR

downSSUCCUR rightSUCUR columnCN S C

SS C S pickNUCUCC

rightSSUCCUR rightSUCUR columnCN C S

SS S C pickNUCUCC

rightSSUCURR downSUCUR rowRM S R

SS R S pickMURURR

haltX downX Y X Y

haltX rightX Y X Y

Concrete example

In this example the DNA contains fragments, ten row and nine column clones. Figure shows the fragments of the DNA string and the row and column clones.

Table shows the process of the solution. The first column shows the expression of the new value of S; the second column shows the new value of S. The third and fourth columns show the unused row and column clones.

Table shows the execution time of this example. The Semi-Naive evaluation

R9 R8 R3 R1 R4 R6 R2 R10 R7 R5

513 18 13 7 817 8 9 14 27 65 8 74 9 4 12 1112 10 10 10 28 5 4 5 10

C3 C1 C6 C8 C4 C7 C9 C5 C2

Figure The correct order of clones

S Row clones Column clones

C

R S

S C

C S

R S

C S

S R

S R

R S

C S

R S

C S

S R

R S

C S

R S

C S

R S

C S

Table The solution

is a great improvement in this example too: it needs only about of the time necessary for the Naive evaluation. However, the optimization of relational algebra formulas has no important effect on the execution time of this query.

Chapter

Further Work

GUI. Currently the program has a character-based interface, which can make it difficult for the user to handle the program. The biggest disadvantage of the character-based interface is that sometimes it is too difficult to interpret the results, because the result is only a set of formulas. Currently a student (Song Liu) in the Department is working on a Graphical User Interface, which will make it easier to understand the results.

Approximation. Every elimination method has some restrictions on the input database. Sometimes we cannot use any quantifier elimination method. In these cases approximation may help: when we cannot compute the correct quantifier-free formula, we only create a formula which approximates the result. A possible way of approximation is described in.

Multiset. Chapter describes an extension of the system which makes it possible to use multisets. Only a small subset of multiset operators is used in this extension; it would be possible to implement more multiset operators.

Indexing. Indexing can improve the speed of database systems. Although indexing is not too difficult in traditional relational database systems, it is much more difficult in constraint database systems. A good indexing method would improve the speed of this system.

Bibliography

B H Arnold and Boolean Algebra

PrenticeHall INC

J. Byon, P. Z. Revesz. DISCO: A constraint database system with sets. Proc. Workshop on Constraint Databases and Applications, pages. Springer-Verlag, Berlin-Heidelberg-New York.

Richard Helm, Kim Marriott, Martin Odersky. Spatial Query Optimization: From Boolean Constraints to Range Queries. Journal of Computer and System Sciences.

P. C. Kanellakis, G. M. Kuper, P. Z. Revesz. Constraint query languages. Journal of Computer and System Sciences.

Peter Z. Revesz. CSE Advanced Database Systems, class material. University of Nebraska-Lincoln, Spring.

Peter Z. Revesz. The Evaluation and the Computational Complexity of Datalog Queries of Boolean Constraint Databases. International Journal of Algebra and Computation, to appear.

Peter Z. Revesz. Refining Restriction Enzyme Genome Maps. Constraints.

J. D. Ullman. Principles of Database and Knowledge-base Systems. Computer Science Press.