Probabilistic Temp oral I I Calculus and Query Pro cessing

y z x

Alex Dekhtyar Fatma Ozcan Rob ert Ross VS Subrahmanian

Abstract

There is a vast class of applications in which we know that a certain event o ccurred but do

not know exactly when it o ccurred However as studied by Dyreson and Sno dgrass there

are many natural scenarios where probability distributions exist and quantify this uncertainty

Dekhtyar et al extended Dyreson and Sno dgrasss work and dened an extension of the relational

algebra to handle such data The rst contribution of this pap er is a declarative temp oral

probabilistic TP for short calculus which we show is equivalent in expressive p ower to the

temp oral probabilistic algebra of Our second ma jor contribution is a set of equivalence and

containment results for the TPalgebra Our third contribution is the development of cost mo dels

that may b e used to estimate the cost of TPalgebra op erations Our fourth contribution is an

exp erimental evaluation of the accuracy of our cost mo dels and the use of the equivalence results

as rewrite rules for optimizing queries by using an implementation of TPdatabases on top of

ODBC

Intro duction

Dyreson and Sno dgrass pioneered the study of uncertainty in temp oral databases where statements

of the form Event e o ccurred or will o ccur at some time p oint in the time interval t t are

p ermitted Such statements are common For instance a shipp er like Federal Express may tell

customers that their package will b e delivered within hours of drop o In such a case if the

smallest unit of time ab out which reasoning is p erformed is a minute then there are

minutes at which the package may b e p ossibly delivered However over this timeframe there is

a probability distribution reecting the probability that the package will b e delivered precisely t

minutes after drop o Such a probability distribution may b e skewed eg the probability that the

package is delivered within hours is probably zero if the package has to make its way from Seattle

to Boston Dyreson and Sno dgrass develop ed this idea substantially to provide a framework for

reasoning ab out temp oralprobabilistic data They also provided applications in a variety of other

areas including carb ondating of historical records and sto ck market analysis For example there

are literally hundreds of programs that make sto ck market predictions most prediction mo dels

This work was supp orted in part by the Army Research Lab under contract numb er DAALK the Army

Research Oce under grant numb er DAAD by DARPARL contract numb er F and by the

ARL CTA on Advanced Decision Architectures

y

Department of Computer Science University of Kentucky Lexington KY Email dekhtyarcsukyedu

z

Department of Computer Science University of Maryland College Park MD Email fatmacsumdedu

x

Department of Computer Science University of Maryland College Park MD Email robrosscsumdedu

Department of Computer Science University of Maryland College Park MD Email vscsumdedu

are uncertain and provide probabilistic outputs In the same vein statistical mo dels that track

p erformance of machines and machine parts on a factory o or yield probabilistic estimates of when

the parts will need to b e repaired andor replaced Decision making programs for such applications

typically execute calls to the results of suc predictions

Following on the imp ortant work of Dyreson and Sno dgrass work Dekhtyar et al develop ed

a temp oral probabilistic algebra TPA which eliminated many of the assumptions in the framework

of

We start this pap er with a brief overview of TPdatabases in Section We then move on to our

sp ecic contributions

First in Section we develop a temp oral probabilistic calculus TPcalculus for short We

show that this calculus which is similar to the safe has the nice prop erty

that it is equivalent in expressive p ower to the TPalgebra

Second in Section we develop a set of query equivalence results in the TPalgebra These

equivalences automatically yield a set of rewrite rules that a query optimizer for TPdatabases

might use

Third in Section we develop mechanisms to estimate the cost of executing a TPalgebra

query Even though TPalgebra op erations are implemented on top of a relational

these op erations are not mere queries implementing them on top of the

relational algebra involves writing a C program that includes emb edded SQL calls Costing

such programs builds on top of cost mo dels of relational op erators but involves taking into

account the sp ecic asp ects of the programs themselves Our cost mo dels involve identifying

a set of statistics to b e stored ab out the data as well as metho ds to compute such statistics

for intermediate results obtained during query pro cessing

Fourth in Section we conduct a set of exp eriments to evaluate three things Our rst goal

is to assess the accuracy of our cost mo del Our second goal is to assess the eectiveness

of our rewrite rules The third goal is to assess the eciency of a query optimizer based on

this cost mo del and these rewrite rules Exp eriments were run using our implementation of

TPdatabases which is built on top of ODBC using Paradox as a backend

We compare our work with related work in Section and nally conclude with directions for further

research

Preliminaries TPDatabases Overview

In this section we recapitulate the notions of a Temp oral Probabilistic database TPdatabase and

the Temp oral Probabilistic Algebra TPA Full details may b e found in

Temp oralProbabilistic DB Mo del

To make statements ab out when certain events o ccurred or when certain facts werewill b e true

TPdatabases use a calendar cf Sno dgrass and So o and temp oral constraints

Calendar A calendar consists of an ordered list of time units For example day month year is

a p ossible calendar Each time unit has a nite domain asso ciated with it Given a calendar over

time units tu tu a time point is an expression of the form v v where each v is in

n n i

the domain of tu For instance wrt the example calendar is a time p oint

i

Temp oral constraint Temporal constraints are inductively dened i If tu is a time unit op f

g and v domtu then tu op v is a temp oral constraint over

ii If t t are time p oints in and t t then t t is a temp oral constraint over

iii If C and C are temp oral constraints then so are C C C C and C

The syntax t t is a shorthand for the constraint t t t We abuse notation and

write t instead of t t Solutions of temp oral constraints denoted solC are dened in the

obvious way For example the constraint has as its solutions all

time p oints b etween pm to pm as well as all time p oints b etween pm to pm

Probability Distribution Function PDF Let S b e the set of all temp oral constraints over

calendar Then the function S is a PDF if D S t solD D t

P

D t Furthermore is a restricted PDF if D S

tsolD

This denition of a PDF is rich enough to capture almost all probability mass functions eg

uniform geometric binomial Poisson etc studied in classical probability theory Furthermore

probability density functions can b e approximated by PDFs via a pro cess of quantization

TPcase A TPcase over calendar is a tuple hC D L U i where i C and D are temp oral

constraints over ii solC solD iii L U and iv is a restricted PDF

For example h ui is a TPcase where is a uniform

distribution Intuitively C sp ecies the time p oints when an event is valid while D sp ecies the

time p oints that are distributed by Since solC solD it follows that assigns a probability

interval to each time p oint t sol C Sp ecically let PrhC D L U i t denote D t L U

Alternatively could b e dened as where is a restricted PDF is a unrestricted

L U L U

PDF and t solC D t L D t U This generalization is useful when the lower

L U

and upp er probability b ounds do not follow the same distribution Here PrhC D L U i t

L U

denotes the probability interval D t L D t U Although extending to a pair of PDFs

L U

is straightforward we avoid this redenition in order to maintain b etter compatibility with

TPtuple A TPtuple over scheme R A A and calendar is a pair d where

k

d is a relational k tuple over R and is a nonempty TPcase statement over ie is a set of

TPcases over where i j sol C C

i j i j

In the following supp ose R is a relation scheme and is a calendar over tu tu

n

TP A TPtable over R is a multiset of TPtuples over R

TPrelation A TPrelation r over R is a TPtable over R where R is a sup erkey for r

TPdatabase A TPdatabase over is a set of TPtables over

Annotation The annotation of TPrelation r over R pro duces a relation ANNr over relation

scheme R tu tu L U where domL domU and

n t t t t

ANNr fd t L U j d r t sol C L U Pr tg

t t t t

Equivalence TPrelations r and r are equivalent denoted r r i ANNr ANNr

Example Base TPrelations The following TPrelations are named TrainDep and BusArr

TrainNo From To C D L U

Baltimore New York u

g

BusNo From To C D L U

Ro ckville Baltimore b

u

The second TPcase in TrainDep says that there is a probability that train numb er from

Baltimore to New York will depart b etween and Furthermore given any time p oint t in

this interval the probability that the train will depart at exactly time t is b etween D t

g

and D t where D Note that when u g or b then

g

the PDF is uniform where n jsolD j geometric where p or binomial where n jsolD j

and p resp ectively 3

Temp oralProbabilistic Algebra TPA

In this section we briey overview some but not all of the TPA op erators in

Denition Selection condition Supp ose R is a relation scheme is a calendar and f

g Then C is a selection condition over R if it has one of the following forms

Data condition C A c where A R and constant c domA

Temp oral condition C T where T is a temp oral constraint over

Probabilistic condition C P p where P fL U g and probability p

Conjunction condition C C C where C is a selection condition over R 3

n i

We abuse notation and write A R to indicate that A is an attribute of R It should b e clear

from context whether we mean set memb ership or memb ership of an attribute in Rs schema

Denition TPlter A TPlter function maps a TPcase hC D L U i and a probabilistic

condition P p to a temp oral constraint C where solC ft solC j D t P pg 3

Various metho ds for implementing TPlter functions are given in For example a naive

approach involves testing every time p oint t sol C Better algorithms exploit the prop erties of

For instance if u then C can b e determined by testing only one time p oint in solC

Denition Selection on a TPrelation Supp ose r is a TPrelation over R C is a selection

condition over R and t is a TPlter function Then the selection of C on r using t denoted

t

r pro duces TPrelation r over R where the following constraints are satised

C

If C A c then r fd r j dA cg

If C T then r fd j fhC T D L U i j d r

hC D L U i solC T g g

If C P p then r fd j fht C D L U i j C d r

hC D L U i sol t C g g

If C C C then r r 3

C C

Although the implementation chosen for TPlter function t aects the eciency of the selection

t t

r must hold for all TPlter functions t t Thus we let r op eration notice that

C C

t

r with any choice for t r denote

C

C

Example Selection Let r TrainDep and let

r BusArr Then the following are the TPrelations for r and r

L

TrainNo From T o C D L U

Baltimore New York  _   u

  g

BusNo From To C D L U

Ro ckville Baltimore b

u

Notice that for b oth selections we only need to change the values for the C attribute 3

Denition Attribute list Supp ose relation scheme R A A and P is the primary

k

key for R Then F a a is an attribute list over R P if the following constraints are satised

n

i n ii each a is an attribute of R iii each attribute in P is an attribute in F and

i

iv no attribute o ccurs more than once in F 3

Denition Pro jection on a TPrelation Supp ose R A A r is a TPrelation over

k

R P is the primary key for R and F a a is an attribute list over R P Then the projection

n

of F on r denoted r pro duces TPrelation r over R where R F and

F

r fd j d i n d r d a da g

i i 3

Note that our pro jection op erator is exactly like pro jection in the classical relational algebra except

the elds in the primary key for r and the C D L U elds cannot b e pro jected out

Before presenting our denition of Cartesian pro duct we need to recall the notion of a probabilistic

conjunction strategy intro duced by Lakshmanan et al Intuitively this is a function that

returns the probability interval for a conjunction of two events given the probability intervals for

the individual events We shall also intro duce the concept of a Cartesian pro duct function These

functions are applied within the inner lo op of a Cartesian pro duct

Denition PCS A probabilistic conjunction strategy PCS is a binary function from closed

probability intervals to closed probability intervals where the following p ostulates are satised

Bottomline L U L U minL L minU U

Ignorance L U L U max L L minU U

Identity L U L U

Annihilator L U

Commutativity L U L U L U L U

Asso ciativity L U L U L U L U L U L U

Monotonicity L U L U L U L U if L U L U 3

The following are some sample probabilistic conjunction strategies

PCS name Probability interval returned

Ignorance L U L U max L L minU U

ig

Positive correlation L U L U minL L minU U

pc

Negative correlation L U L U max L L max U U

nc

Indep endence L U L U L L U U

in

Denition CPF A Cartesian product function CPF maps a TPcase a PCS and a TP

W

case to a TPcase statement where i sol C sol C C and

ii t sol C Pr t Pr t Pr t 3

A simple CPF example is cpf This function is dened as

s

cpf fht t L U ui j t sol C C L U Pr t Pr tg

s

ig If is a pair of PDFs then another CPF example is cpf fhC D L U

p

U L

where C C C D C L min L L U min U U and are new

L U

L

t

PDFs such that for each t solD D t D t u if L or

t t t t

L U

L

U

t

if U or u otherwise and L U Pr t Pr t otherwise u

t t t t

U

If is a single PDF then a hybrid CPF example is cpf Sp ecically cpf is dened

h h

ig and x ig if cpf fhC D L U as fhC D L L x

p

U L L

L

t solD D t D t x Otherwise cpf is the result of cpf

h s

L U

Denition Cartesian pro duct of two TPrelations Supp ose r is a TPrelation over R

r is a TPrelation over R is a probabilistic conjunction strategy cpf is a Cartesian pro duct

function and fA j A R A R g Then the Cartesian product of r and r under using

cpf

cpf denoted r r pro duces TPrelation r over R where R R R and

r fd j d r d r d d d

S

cpf g

3

cpf cpf cpf

Since r r r r must hold we let r r denote r r with any choice for cpf

Denition Renaming function Supp ose R A A Then a renaming function over

k

R is a function R that maps each attribute A R to an attribute RA fC D L U g where

i i

A A R RA RA A A 3

i j i j i j

Denition Renaming for a TPrelation Supp ose R A A r is a TPrelation

k

over R and R is a renaming function over R Then the renaming of r under R denoted r

R

pro duces TPrelation r over R where R RA RA and

k

r fd j d A R d r d RA dAg

3

Example Renaming and Cartesian pro duct Let R map TrainNo From To to TrainNo

r TrainFrom TrainTo let R map BusNo From To to BusNo B usFrom BusTo and let r

in R

r Then the following is a p ossible TPrelation for r

R

TrainNo TrainFrom TrainTo BusNo BusFrom BusTo C D L U

Baltimore New York Ro ckville Baltimore u

u

Notice that renaming op erators can only change the names of the data attributes 3

We are now ready to extend the classical denition of a natural join to TPdatabases

r is a Denition Join of two TPrelations Supp ose R A A R A A

k

k

TPrelation over R r is a TPrelation over R is a probabilistic conjunction strategy cpf

is a Cartesian pro duct function and t is a TPlter function Then the join of r and r under

cpf t

using cpf and t denoted r 1 r pro duces TPrelation r over R where R F

t

cpf

r r and R C F have the following values r

R F

C

R is a renaming function over R where

A R A R RA A A R RA R

This requirement indicates that only the attributes in R that o ccur in R get renamed

These attributes must b e renamed to attribute names that are not present in R

C is the selection condition a Ra a Ra where

n n

fa a g fa R j a Rg

n

This is the natural join predicate

where F is the attribute list A A Ra Ra

k n k

is a sublist of A A g fa R j a Rg and a a 3 fa a

n n

k k

k

cpf t t cpf cpf t

Since r 1 r r 1 r must hold we let r 1 r denote r 1 r with any choice for

P

cpf and t Now let N r denote jj ie the total numb er of TPcases in r It is easy to

dr

see that N r 1 r can b e huge The TPcompression op eration helps alleviate this problem

Denition TPcompression A TPcompression function is a function that maps TP

relation r over R to a TPrelation r over R where i N r N r and ii there exists

a bijection that maps each d t L U ANNr to a matching d t L U ANNr 3

t t t t

The following are some sample TPcompression functions

TPcompression function name Function prop erties

Identity r r

id id

Maximal compression N r N r

mc mc

Let jr j denote the numb er of TPtuples in r Then it must b e the case that N r jr j Note that

if can b e a pair of PDFs then N r j r j jr j for every TPrelation r

mc mc

Denition Intersection of two TPtables Supp ose r and r are TPtables over R

Then the multiset intersection of r and r denoted r r pro duces TPtable r over R where

r fd j d r d r

fhC C D L U i j hC D L U i solC C g

fh C C D L U i j hC D L U i sol C C gg

3

Denition Union of two TPtables Supp ose r and r are TPtables over R Then the

multiset union of r and r denoted r r pro duces TPtable r over R where

r fd j d r d r g

3

The multiset op erators pro duce TPtables TPtables can b e converted into TPrelations through

the pro cess of compaction Compactions are p erformed by executing several instantiations of a

compaction function These functions op erate by consulting a combination function

Denition Combination function Supp ose S fL U L U g is a nonempty mul

n n

tiset of probability intervals and let L U L U L U Then is a combination

n n

function if S returns a probability interval L U that satises the following conditions

Identity If L U L U then L U L U

n n

Bottomline L max L

i

L U S

i i

Furthermore is an equity combination function if L U L U whenever L U 3

The following are some sample equity combination functions

T

Combination function name Probability interval returned when L U

i i

L U S

i i

Optimistic equity S max L max U

eq i i

L U S L U S

i i i i

Enclosing equity S min L max U

ec i i

L U S L U S

i i i i

Pessimistic equity S min L min U

ep i i

L U S L U S

i i i i

Rejecting equity S

er

Skeptical equity S

esk

Quasiindep enden ce equity S L U

eq i i i

L U S L U S

i i i i

Denition Compaction function Supp ose G is a nonempty multiset of TPcase statements

t is a time p oint and is a combination function Then the multiset S is dened as

Gt

S fPr t j G t sol C g

Gt

A function cmp G is a compaction function if it returns a TPcase statement where

W W

sol C sol C and

f j G g

t sol C Pr t S 3

Gt

A simple example is cmp G fht t L U ui j S L U S g

Gt Gt

s

If is a pair of PDFs then another example is cmp G fhC D L U ig

p

L U

V

where C C D C L max max L

i

tsolD L U S

f j G g

i i

Gt

U and U max max are new PDFs such that for each t solD

i

tsolD L U S

L U

i i

Gt

U L

t t

if L or otherwise u if U or u D t D t u

t t t t t t

L U

L U

otherwise and L U S

t t Gt

If is a single PDF then a hybrid example is cmp Sp ecically compaction function

h

ig and ig if cmp G fhC D L U cmp G is dened as fhC D L L x

p h

U L L

t solD D t D t x Otherwise cmp G cmp G x

h s

L U

L

Denition Compaction of a TPtable Supp ose r is a TPtable over R is a combina

tion function and cmp is a compaction function Then the compaction of r under using cmp

cmp

denoted r pro duces TPrelation r over R where

r fd j d r cmp f j d r g g

3

cmp cmp cmp

Since r r must hold we let r denote r with any choice for cmp

Denition dierence b etween two TPrelations Supp ose r and r are TPrelations over

R Then the dierence between r and r denoted r r pro duces TPrelation r over R where

r fd j d r d r

fhC C D L U i j hC D L U i

hC D L U i solC C gg

3

We are now ready to dene a TPalgebra query

Denition TPAquery Supp ose db is a TPdatabase over and let q denote a TPAquery

i

over db where the answer to q is a TPrelation r over R Then a TPAquery over db has one of

i i i

the following forms

r where r is a TPrelation in db

q where C is a selection condition over R

C

q where P is the primary key for R and F is an attribute list over R P

F

q where R is a renaming function over R

R

q where is a TPcompression function

q q where is a probabilistic conjunction strategy and R R

q 1 q where is a probabilistic conjunction strategy

q q where is a combination function and i j n R R

n i j

q q where is a combination function and i j n R R

n i j

q q where R R 3

Notice that a series of multiset intersections or multiset unions must end with a compaction This

ensures that the answer to a TPAquery is always a TPrelation Since these op erators are often tied

together it is convenient to let r r and r r denote r r and r r resp ectively

TPCalculus

We now intro duce the temp oral probabilistic calculus TPcalculus This calculus is similar in spirit

to the safe tuple relational calculus

Syntax of the TPCalculus

Denition TPvariable Supp ose R A A is a relation scheme and S is the set of

k

all temp oral constraints over Then a TPvariable over R is a variable s over the domain

doms domA domA domC domD domL domU dom

k

where domC domD S domL domU and dom is the set of all restricted

PDFs over calendar Each hd C D L U i doms is an instance of s 3

We shall abuse notation and write R R to indicate that A R must hold for all A R

Denition TPatom Supp ose s is a TPvariable over R s is a TPvariable over R

R R and f g Then a TPatom over s s has one of the following forms

sA c where A R and constant c domA

sC s C T where T is a temp oral constraint over

sC s P p where P fL U g and probability p

sA s A where A R A R and domA domA 3

For example sA and sC s C are TPatoms over s s TPatoms are

used to construct more complex TPexpressions

Denition Limited TPexpression Supp ose s is a TPvariable over R and s is a TP

variable over R Then a TPexpression over s s is dened in the following way

A TPatom over s s is a TPexpression over s s

If E E are TPexpressions over s s then E E is a TPexpression over s s

A TPexpression E is limited if i it contains at least one TPatom of the form sA s A and ii

for all A R if sA s A and sA s A are TPatoms in E then i j 3

i j

For example sA sC s C sC s L sA s A is a

TPexpression over s s The notion of a TPlinker sp ecies equality constraints that tie together

the data attributes of three TPvariables

Denition TPlinker Supp ose s s and s are TPvariables over R R and R re

sp ectively Then a TPlinker over s s s has one of the following forms

sA s A if A R A R and domA domA

sA s A if A R A R and domA domA

L L if L and L are TPlinkers over s s s 3

For example sA s A sA s A is a TPlinker over s s s

Denition Strategy set A strategy pair is an expression of the form ht y pe s tri where t y pe

f g and s tr is a probabilistic conjunction strategy when t y pe s tr is a TPcompression

function when t y pe and s tr is a combination function when t y pe

A strategy set is a nite set of strategy pairs such that ht y pe s tr i and ht y pe s tr i

implies that t y pe t y pe 3

For example fh i h ig and fh i h ig are strategy sets We need strategy sets

in mc eq mc

b ecause TPformulae dened b elow are complex formulae that represent queries ab out conjunctions

andor disjunctions of events When expressing such queries the user needs to sp ecify information

ab out i their knowledge of the dep endencies if any b etween events handled by the query ii

whether they want the answer compressed if so then how and iii what combination strategy to

use eg to eliminate conicts

Denition TPformula Supp ose db is a TPdatabase over s is a TPvariable over R

and P is a primary key for R Then a TPformula over db has one of the following forms

r where r is a TPrelation over R and r is in db s s

s s F E where s is a TPvariable over R F is a TPformula over db of the form

s F E is a limited TPexpression over s s and a P a R

s s F s F L where F is a TPformula over db of the form s F F is

a TPformula over db of the form s F L is a TPlinker over s s s and is a strategy

set of the form fh i h ig

s s F s F where F F are TPformulae over db each F has the form

n n i

n

each s is a TPvariable over R is a strategy set of the form fh i h ig and s F

i i

i

f g 3

Example TPformulae Supp ose TPdatabase db contains TPrelations r r r over R R

R and where R A A R A A R A A and is a Gregorian calendar with a

chronon of one year Then the following examples are TPformulae over db

r s s

is dierent from the set memb ership symb ol s r restricts We emphasize that the symb ol

doms in the following way For each instance hd C D L U i doms there must b e a

TPtuple d r where TPcase hC D L U i

s s s r sC s L sA s A

This TPformula is related to a probabilistic selection followed by a pro jection that causes s to

b e a TPvariable over relation scheme A and calendar To avoid the need for pro jection

we could add TPatom sA s A to the TPexpression used in this TPformula

r s s s

fh ih ig

mc

ig

s s r sA s A sA s A sA s A sA s A

This TPformula is related to a join of r and r under the probabilistic conjunction strategy

followed by a TPcompression that uses the strategy If we want a Cartesian pro duct

ig mc

then all we need to do is replace TPlinker sA s A with TPlinker sA s A in the

TPformula ab ove this renames A in R to A To avoid compression we could replace

mc

with the identity TPcompression function

id

r s A s A s s s s

fh ih ig

eq mc

r s A s A s s s r s A s A s s s

fh ih ig

eq mc

r followed r and r This TPformula is related to the multiset intersection of

A A A

by a compaction that uses the combination function and a TPcompression that uses

eq

the strategy Notice that unless domA domA domA the fourth rule for

mc

constructing TPformulae will not allow us to intersect r r and r 3

We are now ready to dene a TPcalculus query

Denition TPCquery Supp ose db is a TPdatabase over and F is a TPformula of the

form s F where i s is the only free variable in F and ii r db for every TPrelation r

mentioned in F Then fs j F g is a TPCquery over db 3

We will assume throughout this pap er that all quantied TPvariables in a TPCquery dier

from each other and from the lone free TPvariable There is no loss of generality in making this

assumption since TPvariables can easily b e renamed

Semantics of the TPcalculus

In this section we provide a quick semantics for TPCqueries We start with the denition of a

TPassignment which is similar to the concept of a variable assignment in classical logic

Denition TPassignment A TPassignment is a function A that maps each TPvariable s

A A A A A A

to an instance hd C D L U i doms We omit the sup erscipt A when it is clear from

s s s s s s

context Also we let denote the TPcase hC D L U i 3

s s s s s s

For example A may assign s to a data tuple d and a TPcase where d r for some

TPrelation r The following denition sp ecies when a TPassignment satises a TPexpression

Denition Suitablesatisfying TPassignment Supp ose A is a TPassignment s is a TP

variable over R s is a TPvariable over R and E is a TPexpression over s s Then A is

and the following constraints are satised suitable for E denoted A E if solC solC

s

s

If E has the form sA c then d A c

s

T If E has the form sC s C T then t sol C t sol C

s s

p t P D If E has the form sC s P p then t solC

s

s s s

A If E has the form sA s A then d A d

s

s

If E has the form E E then A E and A E

We say that A satises E denoted A j E i A E and the following constraints are satised

R fA j there exists a TPatom of the form sA s A in E g We assume here that relation

scheme R is treated as a set ie the attribute ordering do es not matter

There is no temp oral constraint C where solC solC and replacing C ab ove with C

s s

allows A to b e suitable for E

i 3 U L hD L U i hD

s s s s s s s s

Example Satisfying TPassignment Supp ose s TrainDep see Example Then

doms fh Baltimore New York ui

h Baltimore New York g ig

Furthermore supp ose sC s C sC s L sTrainNo s TrainNo is

TPexpression E Then A j E if TPassignment A satises one of the following conditions

As h ui and

As h Baltimore New York ui

As h g i and

As h Baltimore New York g i 3

Recall that S denotes the set of all temp oral constraints over Consider a partitioning of S

where C and C are in the same i sol C solC Although the size of each partition

is innite we can restrict ourselves to use only one canonical temp oral constraint for each partition

It is imp ortant to note that under this restriction the set of all TPassignments A where A j E is

nite if E is a limited TPexpression over s s and doms is nite

Denition TPlinker satisfaction Supp ose A is a TPassignment s is a TPvariable over

R s is a TPvariable over R s is a TPvariable over R and L is a TPlinker over s s s

Then A satises L denoted A j L i the following constraints are satised

R fA j there exists a TPlinker of the form sA s A or sA s A in Lg As in the

preceding denition we assume that the attribute ordering is not imp ortant

If L is a TPexpression E over s s then A E

If L is a TPexpression E over s s then A E

If L L L where L is a TPexpression E over s s and L is a TPexpression E over

s s then A E and A E 3

TrainDep s B usArr and L is the following TPlinker over s s s For example supp ose s

sTrainNo s TrainNo sTrainFrom s From sTrainTo s To

sBusNo s BusNo sBusFrom s From sBusTo s To

Furthermore supp ose s is a TPvariable over R Then A j L if i the set for relation scheme R

A A

is fTrainNo TrainFrom TrainTo BusNo BusFrom BusTog and ii d A d A for each TPlinker

s s

i

in L of the form sA s A where i f g

i

Denition TPformula semantics Supp ose A is a TPassignment and F is a TPformula

over db Then A is a model for F denoted A j F i the following constraints are satised

r then there exists a d r where hC D L U i If F has the form s s

s s s s s s

If F has the form s s F E then A j s F and A j E

If F has the form s s F s F L where fh i h ig then

C A j L solC and C C A j s F A j s F

s s s s

t t Pr t solC Pr t Pr

s s s s

s F If F has the form s s F where fh i h ig then

n

n

i n A j s F and sol C

i s

i

If then d d d C C C and

s s s s s s

n n

t Pr t solC Pr t fPr tg

s s s s

n

d If then d fd g C t and

s s s s

n

Pr t fPr t j i I g where

s s t

i

A A

I fi j A A j s F d d t sol C g

t i s

s s

i

i i

If then d d and hC C D L U i where

s s s s s s s s

W

n

A A

C fC j A A j s F d d C C g 3

i s

i s s

i

i i

Denition TPCquery semantics Supp ose F is a TPformula of the form s F Then

the answer to TPCquery fs j F g is dened as the following TPtable

A A

3

fd j A A j F d d f gg

s s

The following imp ortant theorems jointly state that the TPalgebra and the TPcalculus have

exactly the same expressive p ower

Theorem TPA TPC Every TPAquery can b e expressed as a TPCquery 3

Theorem TPC TPA Every TPCquery can b e expressed as a TPAquery 3

Equivalence Results

In this section we describ e a set of query equivalence results that hold in the TPalgebra These query

equivalences provide rewrite rules that may b e used to optimize queries In the sequel we assume

r r are TPtables and C is a selection condition Recall from Section that two TPrelations are

equivalent if their annotated expansions are identical informally sp eaking two TPrelations are

equivalent i whenever tp is the restriction of a TPtuple to its data attributes if the probability that

tp is true at time t is p according to the rst TPrelation then the probability that tp is true at time

t is p according to the second relation as well ie the two TPrelations assign the same probabilities

Also unless stated otherwise the following equivalences hold for al l combination functions and for

al l probabilistic conjunction strategies thus making the results very widely applicable

SetTheoretical Prop erties

The standard relational algebra op erations are idempotent commutative and associative These

prop erties do not hold for all TPalgebra op erations Our rst result says that selection pro jec

tion compaction union and intersection are idemp otent and an idemp otence style result holds for

dierence as well

Theorem Idemp otence The following equivalences hold

r r

C C C

r r

F F F

r r if is any combination function

r r r if r is a TPrelation and is any combination function

r r r if r is a TPrelation and is any combination function

r r r r r

The following two results show that most of the imp ortant op erations in the TPalgebra are

commutative but not all are asso ciative

Theorem Commutativity The following equivalences hold

r r

C C

C C

r r r

F G F G G F

r r r r

r r r r

r r r r r r r r r

r r r r ignoring the order of data attributes

r r r r ignoring the order of data attributes

Theorem Asso ciativity The following equivalences hold

r r r r r r

r r r r r r

In general intersection and union are not asso ciative b ecause b oth these op erations involve apply

ing the compaction op erator The following example shows why intersection is not asso ciative

Example Intersection is not asso ciative Recall that for each data tuple d and time p oint

cmp

s

t the op erator collects all intervals l u l u asso ciated with d and t by dierent

ec

k k

TPtuples and computes the new interval l u as follows

l u l u i l u l u

k k k k

l u

min l l maxu u otherwise

k k

Now consider three TPrelations r r and r such that for some data tuple d and time p oint t

d r ht t t t ui d r ht t t t ui d r

and ht t t t ui Then r r will contain d where TPcase ht t t

ec

t ui since Also r r will contain TPtuple d

ec

where TPcase ht t t t ui since

But then r r r will contain TPtuple d where ht t t t ui as

ec ec

and so the probability interval is min max On

the other hand r r r will contain TPtuple d where ht t t t ui

ec ec

as and so the probability interval is min max

This indicates that r r r r r r

ec ec ec ec

Prop osition The following prop erties hold

r r r r

r r r r r

r r r r r r r

r r r r r r r

Pushing Selection Through Other Op erations

An imp ortant rewrite rule in the classical relational algebra allows us to push selection through some

exp ensive op erations like join In this section we study the cases where selection can b e pushed

through various TPop erations While many results are similar to the corresp onding equivalences

from the classical relational algebra in many imp ortant cases eg pushing selection into Carte

sian pro ductjoin general results can only b e obtained for data and temporal selection conditions

Sp ecic results for probabilistic selection conditions are discussed in Section

Theorem Selectionpro jection Supp ose F is an attribute list and C do es not involve any of

these attributes Then r r

C F F C

The results b elow hold only for data and temporal selection conditions The next result shows

that selection can b e pushed into a compaction From an eciency standp oint this is go o d b ecause

selection can often b e p erformed much faster than compaction

Theorem Pushing selection into compaction Let C b e a data condition or a temp oral con

dition Then r r

C C

The following theorem shows that in some cases selections can b e pushed into Cartesian pro ducts

Theorem Pushing selection into Cartesian pro duct Supp ose is a PCS C resp C is

a data selection condition for r resp r and C is a temporal selection condition Then

r r r r

C C

r r r r

C C

r r r r r r r r

C C C C C

Not surprisingly a similar theorem holds for join

Theorem Pushing selection into join We use the same notation as Theorem

r r r r

C C

r r r r

C C

r r r r r r r r

C C C C C

Next we establish equivalences for pushing selections into set op erations

Theorem Pushing selection into set op erations Let C b e either a data condition or a

temp oral condition Then

r r r r r r r r

C C C C C

r r r r

C C C

r r r r r r

C C C C

Conditional Query Equivalences in TPA

In this section we rst Section explain why it is hard to nd equivalence results probabilistic

selections Then in Section we provide a numb er of conditional equivalence results that we

have obtained for probabilistic selections under a variety of sp ecial cases

Why equivalences involving probabilistic selection conditions are hard

We start with an example that provides strong evidence indicating that no simple general rule for

pushing probabilistic selections into Cartesian pro ducts and hence joins exists

Example Probabilistic selection and Cartesian pro duct Consider TPrelations r and r

of Example and supp ose that their attributes have b een renamed via the renaming functions from

Example Supp ose probabilistic selection condition C L U Then

Table shows the results of the following four queries

BusNo BusFrom BusTo TrainNo TrainFrom TrainTo

d

Ro ckville Baltimore Baltimore New York

q r r

in

LU

Data Part C D L U

d u

u

q r r

in

LU LU

Data Part C D L U

q r r

in

LU

Data Part C D L U

d u

q r r

in

LU

Data Part C D L U

d u

Table Dierent combinations of Cartesian pro duct and probabilistic selection

r r r r q q

in C in C C

q r r q r r

C in in C

It is clear from Table that no appropriate rewrite rule similar to those describ ed in Theorem for

r r can b e obtained as the answers to all four queries are dierent We note here that

in C

r selects only the time p oint This makes r selects only the time p oint and

C C

the result of q an empty set

Let us now consider an example involving compaction and probabilistic selects

Example Probabilistic selection and compaction Consider the op erator from Exam

ec

ple and supp ose TPcase ht t t t ui ht t t t ui and

ht t t t ui Also supp ose TPtable r fd d g where f g and

f g Then r fd g where f g since

ec

Consider the probabilistic selection condition U We see that TPcase is in the answer to

r since its upp er b ound is In contrast the answer to r contains but not

U ec U

so the answer to r can only contain Therefore r and r

ec U U ec ec U

are not equivalent

As compaction is used to dene b oth intersection and union probabilistic selections and inter

sectionsunions do not commute The next example shows that r r is not equivalent to

C

r r when C is a probabilistic selection condition

C C

Example Probabilistic selection and dierence Supp ose TPrelation r d f g and

TPrelation r d f g where ht t t t ui and ht t t t ui Then

r r so r r On the other hand r will contain and r will not

L L L

contain so r r will contain Thus r r r r

L L L L L

Sp ecic query equivalences for probabilistic selections

Though the negative results ab out probabilistic selection presented ab ove are discouraging there is

go o d news In some cases probabilistic selection can b e pushed into other op erations In our rst

result we consider only probabilistic selection conditions of the form P p or P p

Theorem Pushing selections of the form P p or P p Supp ose C is a probabilistic se

lection condition of the form L p U p L p or U p Then

r r r r

C C C C

r r r r

C C C C

The following theorem indicates that when the probabilistic selection condition has the form P p

or P p and we use the positive correlation PCS then we can push probabilistic selection into

Cartesian pro duct and join

Theorem Pushing selection into Cartesian pro duct and join under the PCS

pc

Supp ose C is a probabilistic selection condition of the form L p U p L p or U p and

supp ose is any equity combination function Then

r r r r r r

C pc C pc pc C

r r r r r r

C pc C pc pc C

When the selection condition has the form P p or P p and we use the independence PCS

then the following theorem indicates that we can push probabilistic selection into Cartesian pro duct

L r and MIN Ur for each TPrelation r As we and join if the optimizer stores the statistics MIN

L r is dened as min fl j l D t L hC D L U i d r g shall see in Section MIN

Ur is dened as minfu j u D t U hC D L U i d r g and MIN

Theorem Pushing selection into Cartesian pro duct under the PCS

in

p

r r r r

Lp in Lp in

L

MIN Lr

p p p

r r r r

Lp in in Lp

L L L

MIN Lr

MIN Lr MIN Lr

p

r r r r

Lp in Lp in

L

MIN Lr

p p p

r r r r

in in Lp Lp

L L L

MIN Lr

MIN Lr MIN Lr

p

r r r r

U p in U p in

U

MIN Ur

p p p

r r r r

U p in in U p

U U U

MIN U r

MIN Ur MIN Ur

p

r r r r

U p in U p in

U

MIN Ur

p p p

r r r r

in in U p U p

U U U

MIN U r

MIN Ur MIN Ur

An analogous equivalence holds for join as shown b elow

Corollary Pushing selection into join under the PCS

in

p

r r r r

Lp in Lp in

L

MIN Lr

p p p

r r r r

in in Lp Lp

L L L

MIN Lr

MIN Lr MIN Lr

p

r r r r

Lp in Lp in

L

MIN Lr

p p p

r r r r

in in Lp Lp

L L L

MIN Lr

MIN Lr MIN Lr

p

r r r r

U p in U p in

U

MIN Ur

p p p

r r r r

U p in in U p

U U U

MIN Ur

MIN Ur MIN Ur

p

r r r r

U p in U p in

U

MIN Ur

p p p

r r r r

U p in in U p

U U U

MIN Ur

MIN Ur MIN Ur

We see from the ab ove results that with some work we can push probabilistic selections into cartesian

pro ducts and joins For example Theorem and Theorem show that for certain kinds of

probabilistics selection conditions we can easily push selections into Cartesian Pro duct and join

Lr MIN Ur and for each tprelation r then Theorem shows that as long as we maintain MIN

we can also push probabilistic selections in Cartesian Pro duct and join when indep endence is the

PCS used Most relational databases routinely maintain minima and maxima of dierent columns

for so this overhead seems acceptable

Pushing Pro jection Through Other Op erations

In this section we present results showing when and how pro jection can b e pushed into TPalgebra

op erations Our rst result shows that pro jection can b e pushed inside a compaction

Theorem Pushing pro jection into compaction r r

F F

As we have already see b efore compaction is part of union and intersection The ab ove result

provides hop e that pro jection can also b e pushed through the set op erations This turns out to b e

true as shown in the following theorem

Theorem Pushing pro jection into set op erations If r and r have the same schema

r r r r

F F F

r r r r

F F F

r r r r

F F F

The following results indicate that pro jections can b e pushed into Cartesian pro ducts and join

Theorem Pushing pro jection into Cartesian pro duct and join Supp ose r and

F

are TPAqueries Also let F F denote the attribute list that is obtained by concatenating r

F

F with F and removing duplicate attribute names Then

r r r r

F F F F

r r r r

F F F F

Costing TPQueries

In order to eciently execute a query in the TPalgebra we must rst have a cost mo del asso ciated

with the algebra Such a cost mo del has two parts The rst part sp ecies what statistics to

maintain ab out a TPrelation such statistics includes information such as cardinality information

distributions of attribute values and so on The second part deals with the physical costs of executing

the op eration which dep ends up on the implementation We rst discuss the former in section

and the latter in section Due to space restrictions in this pap er we only consider the Selection

Pro jection Cartesian Pro duct and Join op erators in the TPalgebra

Statistics for TPDatabases

For each TPrelation base relation or otherwise we maintain a set of statistics summarized in

Table Given a tprelation r let timesr ft j there exists a tptuple tp in r and a tpcase

r maxtimesr and MinC r C D L U in tp such that t sol C g Then MaxC

mintimesr

Statistics Description

CARD Numb er of TPtuples in the TPrelation

AVG TIME PTS Average numb er of solutions of C constraint in a TPtuple

MAX TP The latest time p oint in the TPrelation

MIN TP The earliest time p oint in the TPrelation

MaxC See b elow

MinC See b elow

DOMAIN MINtime unit The smallest value time unit can take in calendar

DOMAIN MAXtime unit The largest value time unit can take in calendar

AVG LU Average lower upp er b ound probability in the TPrelation

MIN LU Minimum lower upp er b ound probability value in the TPrelation

MAX LU Maximum lower upp er b ound probability value in the TPrelation

Table Statistics for TPrelations

We now address the following problem Given a TPrelation r whose statistical variables cf

Table are known and given that some TPA op eration is executed on this table how do we

estimate the values of these statistical variables for the output TPrelation Though of course this

value can b e correctly computed by answering the query we would like to estimate these values

without answering the query As in the case of relational database implementations these estimates

need to b e quickly computable and reasonably rather than totally correct This allows fast

evaluation of many p ossible dierent query plans for executing the query so that we can pick the

b est Clearly for data selects we can use the wellknown classical techniques Hence we ignore

this op eration and deal instead with op erations new to TPA

Temp oral Selection

Due to space reasons we cannot show how to estimate all these parameters We describ e b elow

time pts r metho ds to estimate car d r and av g

C C

Estimating cardinality car d r may b e written as car dr sel C where sel C denotes

C

the selectivity of the selection condition C We can estimate sel C by induction on the structure of

C

tu value The probability that an arbitrary time p oint satises the constraint tu value is

M AX tuv alue D O M AI N

For a TPcase in r the probability that all time p oints in

D O M AI N M AX tuD O M AI N MIN tu

the solution of that TPcases C constraint are greater than or equal to v al ue may b e estimated by

TIME PTS r AV G

M AX tuv alue D O M AI N

Hence the probability that at least one

D O M AI N M AX tuD O M AI N MIN tu

such solution of C will satisfy tu value is

AV G TIME PTS r

D O M AI N M AX tuv alue

which describ es sel C

D O M AI N M AX tuD O M AI N MIN tu

tu value The analysis is exactly analogous to the ab ove and the estimate of the selectivity of

C is

TIME PTS r AV G

v al ue D O M AI N MIN tu

sel C

D O M AI N M AX tu D O M AI N MIN tu

tu value By an analysis similar to that ab ove we see that here the estimate of the selectivity

of C is

AV G TIME PTS r

sel C

D O M AI N M AX tu D O M AI N MIN tu

tu value In this case we simply subtract the selectivity of tu value from

TIME PTS solutions to its C constraint The prob t t On the average a TPcase has AVG

t t

ability that an arbitrary time p oint will b e a solution of this constraint is

M AX TP r MIN TP r

The probability that an arbitrary time p oint will not b e a solution of this constraint is

t t

Therefore the probability that no time p oint of an arbitrary TP

M AX TP r MIN TP r

TIME PTS r AV G

t t

case will b e a solution of the constraint t t is

M AX TP r MIN TP r

Hence the total numb er of TPtuples that will not b e returned by this op eration can b e estimated

AV G TIME PTS r

t t

in

as C ARD Then the exp ected selectivity is

M AX TP r MIN TP r

TIME PTS r AV G

t t

sel t t Formula TS

M AX TP r MIN TP r

We can also compute the selectivity of this constraint in another way Here we try to estimate

the probability that t t overlaps with any temp oral constraint C of the form t t in the

i j

TPrelation These two intervals may overlap if either i F t t and t t ie t is b etween

i j

t and t or ii F t t and t t ie t is b etween t and t We can compute the

i j i i i

selectivities of these conditions as follows

sel t t sel t t

i j

r t r t M axC M inC

MIN TP r t M AX TP r t

TP (r ) TP (r )t1 t1MIN M AX

otherwise otherwise

M axC 1(r )MINTP (r )+1 M AX TP (r )M inC 2(r )+1

sel t t sel t t

i i

TP r t r t MIN M axC

M axC r t MIN TP r t

1(r )t1 TP (r ) M axC t2MIN

otherwise otherwise

M axC 1(r )MIN TP (r )+1 M axC 1(r )MIN TP (r )+1

Hence sel F sel t t sel t t and sel F sel t t sel t t Hence

i j i i

the selectivity of t t will b e

sel t t sel F sel F sel F sel F Formula TS

Later in Section we will run exp eriments to determine which of these two estimates is b etter

Non atomic temp oral constraints Any nonatomic temp oral constraint can b e written purely

in terms of and The selectivity of r is equal to minus the selectivity of r The

C C

r and then may b e obtained by rst estimating the selectivity of r selectivity of

C C C

estimating r

C

Estimating AV G TIME PTS Table shows how we may estimate the average numb er of time

p oints asso ciated with the C constraints in a TPtuple For space reasons instead of explaining all

the entries we explain only the rst one which also happ ens to b e the toughest case The probabil

ity that one of these time p oints has time unit tu v al ue is

D O M AI N M AX tuD O M AI N MIN tu

As the original TPrelation has AV G TIME PTS r time p oints in it p er TPtuple we may there

AV G TIME PTS r

time p oints in it p er fore assume that the output relation has

D O M AI N M AX tuD O M AI N MIN tu

TPtuple

Probabilistic Selection

Our metho ds to estimate the selectivity of a probabilistic selection condition use the following simple

but useful result

Prop osition Given a value p and a TPtuple tp d f g

n i

n

hC D L U i there are at most time points t sol C such that t D L p

i i i i i i j j j

i

p

where t sol C

j

n

Pro of Let T ft sol C j t D L p t sol C g Let s jT j If s then

i j j j j

p p

i

p

P P P

n

n

p As we know that T C t D L s p L t D L

i j j j i j j j

p tT i tT

i

p

p p

which yields a contradiction 3

Condition AV G TIME PTS

AV G TIME PTS r

tu v al ue max

D O M AI N M AX tuD O M AI N MIN tu

AV G TIME PTS r

tu v al ue AV G TIME PTS r M ax

D O M AI N M AX tuD O M AI N MIN tu

AV G TIME PTS r D O M AI N M AX tuv alue

tu v al ue

D O M AI N M AX tuD O M AI N MIN tu

AV G TIME PTS r v alueD O M AI N MIN tu

tu v al ue

D O M AI N M AX tuD O M AI N MIN tu

tptp

tp tp AV G TIME PTS r

M AX TP MIN TP

Table Formulas to compute AV G TIME PTS

time The ab ove result says that as the probabilities assigned by a PDF add up to at most

p

p oints can b e assigned a probability greater than or equal to p We now estimate the selectivity of

atomic probabilistic selection conditions one by one

Lr We rst note that if the value of pr ob in the selection condition is not b etween MIN

MIN U r and M AX Lr M AX U r for lowerupp er b ound conditions then the selectivity

Lr and the will always b e either or dep ending on the condition For example if pr ob M AX

condition is L pr ob then the result cardinality will b e as no time p oint in the TPrelation will

satisfy this condition The table b elow contains the selectivities of probabilistic selects when pr ob is

outside the MIN L M AX L MIN U M AX U interval

Condition pr ob M AX LU pr ob M I N LU

LU pr ob

LU pr ob

LU pr ob

LU pr ob

We now provide cardinality estimates for probabilistic selects when pr ob is in the ab ove intervals

We prop ose two sets of cardinality estimations stemming from two somewhat dierent approaches

Table summarizes the formulas for the two sets In this table P RO B denotes the following

expression

P RO B min

pr ob M AX TP r MIN TP r

Some explanations ab out the intuition b ehind the formulas for b oth sets are in order

We explain the computations b ehind the Set A formulas on the example of atomic proba Set A

bilistic condition L prob

Consider an arbitrary TPtuple in the input relation r This tuple satises the condition L pr ob

if at least one time p oint in the solution of one of its C constraints has the lower b ound of probability

Cond C Set A sel C Set B sel C

1

L pr ob min

pr ob  AV G TIME PTS (r )

1

L pr ob min

pr ob  AV G TIME PTS (r )

Lr pr ob AV G

AV G TIME PTS (r )

L pr ob P RO B

1

pr obotherwise min

pr obAV G TIME PTS (r )

prob and AV G TIME PTS r

AV G TIME PTS (r )

prob AV G Lr

L pr ob P RO B

L(r )pr ob AV G

(AV G TIME PTS (r )2)

otherwise

AV G L(r )

U pr ob

U pr ob

Ur prob AVG

AV G TIME PTS (r )

U pr ob P RO B

U (r ) AV G

pr obotherwise min

pr ob

prob AV G U r

AV G TIME PTS (r )

U pr ob P RO B

U (r )pr ob AV G

(AV G TIME PTS (r )2)

otherwise

AV G U (r )

Table Two sets of selectivity estimates for atomic probabilistic selection conditions

L pr ob The probability that an arbitrary time p oint t satises the L pr ob constraint can b e

b ounded ab ove by P RO B min Hence the probability that t

pr obM AX TP r MIN TP r

TIME PTS r AV G

and therefore do es not satisfy this constraint is

pr obM AX TP r MIN TP r

the probability that a tuple has a time p oint in its TPcase statement satisfying L pr ob is

TIME PTS r AV G

min Hence the selectivity of r

Lpr ob

pr obM AX TP r MIN TP r

is given by

AV G TIME PTS r

min

pr ob M AX TP r MIN TP r

The same reasoning applies to the all remaining atomic constraints on lower and upp er b ounds

featured in Table

The intuition for dierent atomic conditions is somewhat dierent Set B

L prob By Prop osition there can b e at most time p oints with probability equal to

pr ob

pr ob in any TPtuple Therefore when the numb er of time p oints in a TPtuple is large we exp ect

that the chances of any time p oint to have a fairly large lower b ound of the probability interval are

fairly small This inequality b etween small and large values of pr ob is represented by the expression

pr obAV G TIME PTS r

On the other hand the chances of nding a time p oint with any particular smal l lower b ound

of the probability interval should b e approximately the same We represent that by assuming that

there exists a small constant that serves as an upp er b ound on the probability to encounter a time

p oint satisfying L pr ob As there is a nite numb er of time p oints in the TPtuple and p otentially

a continuum of values pr ob can take should b e a relatively small numb er

L prob Whenever pr ob is less than or equal to the average lower b ound probability value for a

single time p oint we are all but guaranteed the existence of at least on time p oint with lower b ound

larger than pr ob Otherwise if pr ob is very close to then there is very little chance of nding a

time p oint with a larger lower b ound pr ob describ es this value in a min computation this will

win whenever pr ob is almost For the values of pr ob b etween the average lower b ound and

the chance to nd a large probability value is in inverse prop ortion to the numb er of time p oints in

a TPtuple and to the value of pr ob cf Prop osition

L prob Whenever pr ob and there is more than one time p oint in a TPtuple we are

guaranteed to have a time p oint having a probability less than and hence less than pr ob If

pr ob is greater than the average probability then similar to the case ab ove the existence of a time

p oint with a lower b ound smaller than pr ob is almost guaranteed b ecause there have to b e time

p oints with probability less than the average probability In the last entry in the ab ove formula

Lr we assume that the exp ected numb er of time p oints with lower b ounds greater than AV G

AV G Lr pr ob

TIME PTS r os AV G provides our estimate of probability that a given time

AV G Lr

p oint is in the interval b etween pr ob and the average lower b ound Such p oints do not satisfy our

requirements So by computing the probability that half of all time p oints are in this range assuming

that the other half is ab ove the average and subtracting it from we obtain a probability estimate

for the existence of at least one time p oint with lower b ound for probability under pr ob

The estimates for the remaining conditions use similar intuitions correcting for the fact that

Prop osition do es not apply to the upp er b ounds of probability intervals

Whether Set A or Set B estimates are used it is easy to estimate cardinality of queries such as

r by rst estimating the size of r calling the result r and then estimating r

Lp Lp

Lpp

Table sp ecies how given a probabilistic selection condition atomic Other statistical variables

we may use the values of the statistical variables asso ciated with input relation r to estimate the

values of AV G L AV G U for the output relation obtained by applying the select

Condition AV G L AV G U

L pr ob pr ob pr ob AV G DIFF

in in out

L pr ob maxAV G L pr ob sel AV G L AV G L AV G DIFF

in in out

L pr ob minAV G L pr ob sel AV G L AV G L AV G DIFF

U pr ob pr ob AV G DIFF pr ob

out in in

U pr ob AV G U AV G DIFF maxAV G U p sel AV G U

out in in

U pr ob AV G U AV G DIFF minAV G U p sel AV G U

in out

LIN p p maxp minAV G L p selL p AV G L AV G DIFF

in in

AV G L minAV G L p selL

in

p AV G L selL p

out in

UIN p p AV G U AV G DIFF maxp minAV G U p selU p

in in

AV G U minAV G U p selU

in

p AV G U selU p

1

Table Computations for AV G L and AV G U in the result relation

Cartesian Pro duct and Join

Given two arbitrary TPtuples tp r and tp r tp d f g hC D L U i

n i i i i i i

tp d f g hC D L U i we need to determine the probability that

m

j j j j j j

there will b e a TPtuple tp r r corresp onding to these two tuples The Cartesian pro duct

of tp and tp will b e nonempty if they share at least one time p oint ie if there exists i n

j m such that sol C C

i

j

S S

n m

Let C C and C C We rst need to compute the probability that a solu

i

i j

j

tion of C is also a solution of C Note that if t is outside the range of time p oints found in

TP r M AX TP r then it is denitely not a shared solution Therefore we rst r MIN

establish the probability that t MIN TP r M AX TP r As t sol C we know that

TP r M AX TP r t MIN

Let TRr r jMIN TP r M AX TP r MIN TP r M AX TP r j Then

0

TRr r

0 0

P r t MIN TP r M AX TP r jt MIN TP r M AX TP r

0 0

M AX TP r MIN TP r

Once we establish that t is in the range of r we need to determine the probability that t sol C

AV G TIME PTS r

This probability is given by

M AX TP r MIN TP r

We obtain the desired probability P r t sol C jt sol C by multiplying the two numb ers

0 0

TRr r AV G TIME PTS r

0

sel P r t sol C jt sol C

1

0 0 0 0

M AX TP r MIN TP r M AX TP r MIN TP r

By a symmetric argument

0

AV G TRr r TIME PTS r

0

sel P r t sol C jt sol C

2

M AX TP r MIN TP r M AX TP r MIN TP r

Hence the cardinality of the Cartesian pro duct can b e estimated as

C ARD r C ARD r maxsel sel

As the join op eration in TPalgebra is merely a Cartesian pro duct followed by a selection the

cardinality of join is given by

C ARD r C ARD r maxsel sel sel JC

where JC is the join condition whose selectivity is computed in the same way as in the relational

case

Physical costs

In order to sp ecify the physical costs of executing these op erations we must rst provide a brief

description of the implementation of TPdatabases

Implementation Overview

We have signicantly extended the implementation of TPdatabases outlined in by i adding a

probabilistic table index and ii developing a cost mo del that uses the physical implementation of

the algebra op erations using a relational implementation and the probabilistic table index as well

as well known segment tree indexes for temp oral data iii developing a set of rewrite rules based Indexes

TrainNo From To C D L U delta cid MIN-T MAX-T 151 Baltimore New York (12:05 ~ 12:08) v (12:11 ~ 12:12) (12:05 ~ 12:14) 0.5 0.6 u 3434 12:05 12:08 (12:15 ~ 12:17) (12:15 ~ 12:20) 0.3 0.4 g, 0.5 3434 12:11 12:!2

TrainDep 3435 12:15 12.17 ......

tid TrainNo From To Table-T 1021 151 Baltimore New York r d cid MIN-U MAX-U ...... 3434 0.06 0.06

3435 0.05 0.2

...... cid tid C D L U delta 3434 1021 (12:05~12:08)v(12:11~12:12) (12:05~12:14) 0.5 0.6 u Table-U 3435 1021 (12:15~ 12:17) (12:15~12:20) 0.3 0.4 g,0.5 r t cid MIN-L MAX-L ...... 3434 0.05 0.05 3435 0.0375 0.15

...... Data

Table-L

Figure Storing TPtuples in Paradox r and r relations and TableL TableU and TableT

d t

indexes

on the query equivalences we have derived in Section and iv building a query optimizer for TP

databases using the ab ove comp onents and the Cascades optimizer framework In this section

we briey describ e the implementation

A TPrelation consists of three parts data temporal and probabilistic Each TPrelation r over

relational schema A A is stored as two Paradox tables r over schema TId A A

k d k

and r over schema CId TId C D L U Delta Here TId is the TPtuple id which is used to join

t

r and r when a user wants to r and CId is the TPcase id which is used to uniquely identify

d t

a TPcase We create a primary key index on TId for r and on CId for r To p erform a query on

d t

r we use the Borland Database Engine BDE to p erform relational queries on Paradox tables r

d

and r Moreover the TPrelation r for the result of a query is materialized as two Paradox tables

t

r and r

t

d

Probabilistic table index To index probabilistic information we rst create index tables TableL

r

and TableU for each TPrelation r We explain the structure of TableL b elow the other

r r

table is built in a similar way TableL has the schema cid MINL MAXL For each tuple

r

cid C D L U r we store in TableL its cid and the values

t r

MINL min L D t

tsolC

MAXL max L D t

tsolC

MINU and MAXU values for each TPcase in TableU may b e stored in a similar way

r

Each probabilistic query allowed in TPA can b e expressed as a combination of L Interval and

U Interval where Interval can either b e op en or closed on either side For each such request

L a b TableL is scanned and the set of cids is determined such that a b MINL MAXL

r

This will b e the list of candidate TPcases We are guaranteed that the answer to L a b is in a

subset of these candidate TPcases Each then needs to b e retrieved and checked

For indexing temp oral data we use three dierent data structures Two of them Temporal index

are variations of segment trees and the third is a variation of the index table structure used to

index probabilistic information As segment trees are well studied we do not discuss them further

here Future work may involve extending sophisticated temp oral index structures such as those

develop ed by Tsotras to handle probabilistic temp oral data

The tabular index structure TableT for the temp oral data contains three elds cid the TPcase

r

id in r and MAXT and MINT the maximal and minimal time p oint for each contiguous interval

t

describ ed in TPcase with id cid Unlike the case with TableL and TableU where cid was a

r r

unique cid will no longer b e a for TableT On the other hand the result

r

of p erforming a select on a temp oral condition using TableT will b e the exact set of TPcase ids

r

matching the query not a set of candidate TPcases

Figure shows how temp oral probabilistic data of Example is stored and indexed in the under

lying Paradox database

Physical Cost Mo del

The TPdatabase implementation builds on top of a xed set of query templates that are used to

access the underlying Paradox tables in which TPrelations are stored Each TPalgebra op erator is

enco ded via a C program that builds on such top of these templates To mo del the physical cost

of such op erators we must therefore i mo del the costs of the templates and ii use the template

mo dels to mo del the costs of the C programs enco ding the dierent TPalgebra op erators derive

a formula that estimates the cost of how long such C programs need to run Solely using a

standard cost mo del for relational databases is not enough

Query Templates The set of Paradox query templates and their asso ciated cost formulas are

provided in Table We use simple calibration and regression analysis techniques such as those in

to compute the costs of these template queries

The notation used in Table requires further explanation A and A denote the data and TPcase

d t

tables for an input TPrelation A C and C denote the data and TPcase tables for the resulting

d t

TPrelation For binary op erations like join and Cartesian pro duct B and B denote the data and

d t

TPcase tables of the second op erand Two intermediate tables TmpF and TmpJ which will b e

discussed b elow are used for Cartesian pro duct and join op erations tabl eI ndex denotes a temp oral

or probabilistic index which is maintained as a Paradox table Finally we use r to denote the

schema of relation r

The cost formulas in Table were obtained by running several queries trying many dierent cost

formulas and using standard regression analysis and calibration metho ds to t the b est curves and

hence obtain the b est formulas The cost formulas for query templates are linear combinations

of input and output table cardinalities this is to b e exp ected as they involve a simple linear

Query Template Cost Formula

C C A a jA j b jC j c

d d C d d d

C C A C a jA j b jC j c

t t t d t d

A A TIdC TId

t t

d

A a jA j b

t t

C C A C a jA j b jC j c

d d d t d t

A A TIdC TId

t

d d

A a jA j b jS j c

t t

CIdS

T tabl eI ndex A T a jA j b sel C jtabl eI ndexj c

C t t

A A CIdT CId

t t

C C A a jA j b

d d F d d

C C A a jA j b

t t t t

A B a maxjA j jB j b jA j jB j c

t t t t t t

T A B TmpF a jTmpF j b

d d

A TIdTmpA B TIdTmpB

d d

C C T

d d

TmpCA B

d d

TmpJ TmpJ A B a min jA j jB j b jTmpF j c

A TIdB TId C d d d d

d d

T A B TmpJ costA TmpJ

t t t

A TIdTmpA B TIdTmpB

t t

T costB TmpJ

t

A B

t t

A a jTmpF j b A B B

F d F d

d d

TmpF B T A

A TIdTmpA B TIdTmpB

d d

d d

C C T

d d

TmpCA B

d d

Table Query Templates Used by Op erators in the TPalgebra and Their Cost Formulas

relational op eration Likewise Query is a Cartesian pro duct op eration and its cost formula

involves the pro ducts of the TPrelations involved

In the case of Queries and we observed that the cost largely dep ends on the cardinality of

TmpF Recall that TId is a primary key and there exists an index on TId in the data tables The

join conditions in queries and are on TId Hence TmpF needs to b e read once and all tuples

satisfying the join are retrieved via the index on A and similarly for B

d d

Query is a join b etween two data tables The join condition is on a data eld and there is no

index on data elds One would assume that Paradox uses either a nested lo op or a merge join

After extensive calibration the b est result seems to b e the expression shown in Table

Finally Query involves a two way join We were unable to t a satisfactory curve to match the

b ehavior of this query As a consequence we tried to mo del the cost of this query as two successive

join op erations and this led to the cost formula shown

TPAlgebra Op erations We are now ready to use the costs of query templates in order to de

rive cost formulas for TPalgebra op erations Each query b egins by creating the C and C tables

d t

which will hold the results To compute costs of TPalgebra op erations we executed each op era

tion several times to determine its average running times Our cost formulas fo cus on the cost of

retrieving and storing data in the Paradox tables We use CREATE COSTT to denote the cost

NEXT COST to denote the cost of pro cessing one Paradox tuple of creating Paradox table T GET

COSTT denotes the cost of inserting via a Borland Data Engine BDE and INSERT

one tuple into Paradox table T For brevity let CREATE COST denote CREATE COSTC

d

We assume all readers are familiar with standard database concepts like cursors

CREATE COSTC

t

Data Select As this op eration corresp onds to a relational select the data condition C is simply

passed to the BDE Thus C is computed by query and C by query The cost formula is

d t

COST costq uer y costq uer y cost C RE AT E

Temp oral Select without index A temp oral condition C is evaluated using sp ecial purp ose

co de since there is no equivalent notion in a relational database Consider a b o olean combination of

temp oral intervals of the form t t As there is no index all TPcases in A need to b e retrieved

i j t

and checked Thus we op en a cursor on the result of query read one TPcase at a time compute

i

C C C and insert a new TPcase which incorp orates C into C if sol C Once the

i t

computation of C is nished the appropriate tuples are inserted into C by executing query The

t d

cost formula is

cost C RE AT E COST costq uer y car dA GE T NEXT COST

t

car dC I N S E RT COST C costq uer y

t t

Temp oral Select with segment tree index In this case we rst retrieve a set S of CIds

by using the segment tree index This index lters out irrelevant data by trying to ensure that

C C for each TPcase whose CId is in S We then execute query to retrieve the

i i

TPcases referred to in S We read one TPcase at a time from the result of query compute C

and write the result into C Finally C is p opulated by executing query The cost formula is

t d

COST costsegTreeSearch costq uer y cost C RE AT E

j S j GE T NEXT COST car dC I N S E RT COST C costq uer y

t t

Temp oral Select with table index TableT TableT is stored as a Paradox table where

r r

MINT and MAXT are indexed using standard Paradox indices We rst execute query to retrieve

from this table all CIds of TPcases in A whose C constraints overlap with the selection condition

t

We then op en a cursor on the result of query read one tuple at a time compute C and save the

results in C We nally execute query to compute C Hence the cost formula is

t d

COST costq uer y sel C car dtabl eI ndex cost C RE AT E

GE T NEXT COST car dC I N S E RT COST C costq uer y

t t

Probabilistic Select without index As relational databases do not supp ort probabilistic se

lection conditions we use sp ecial purp ose co de to compute the results of this op eration which is

similar to the unindexed temp oral select op eration The only dierence is that C is computed via

TPlter C and hence solC will b e the set of time p oints in C which satisfy C The cost

i i

formula is the same as the one used for unindexed temp oral selections

Probabilistic Select with table index A probabilistic condition C can b e rewritten using one

interval if op is not or the union of two intervals For instance L p can b e rewritten

as L p As a selection condition is a disjunction of intervals probabilistic select queries are

implemented by using a pro cedure similar to the one used for temp oral selects with table indices In

fact b oth queries share the same cost formula

Pro ject As the TPalgebra only allows pro jections of data attributes this op erator only pro jects

out elds from A and retains all elds in A by executing query and query resp ectively The

d t

cost formula is

COST costq uer y costq uer y cost C RE AT E

Cartesian Pro duct To execute a Cartesian pro duct op eration b etween TPrelations A and B we

rst execute query to retrieve all tuple ids from A and B For each unique pair of TPtuples from

t t

A and B that pro duces a result in the Cartesian pro duct a new unique tuple id needs to b e created

for the resulting tuple The TPalgebra uses a mapping f which takes tuple ids from A and B as

input and returns a new tuple id for C This mapping is stored as a Paradox table TmpF with the

schema TmpA TmpB TmpC Here TmpA TmpB and TmpC refer to the TIds from A B and

C resp ectively The implementation then creates TmpF whose cardinality equals that of C We

d

then op en a cursor on the result of query and read one tuple at a time For each tuple in the

result of query we compute C A C B C and discard this tuple if solC Otherwise we

t t

generate a unique id for C CId and determine whether or not there is a value for f A TId B TId

t t t

If not we generate a unique id for C TId and assign this value to f A TId B TId In either case

t t t

the implementation inserts hC CId f A TId B TId C D L U i into C when these values satisfy

t t t t

Denition of Cartesian pro duct Note that may b e a new distribution function that distributes

L U among all time p oints t solD according to the probabilities determined by After all

tuples returned by query are pro cessed function f is saved For all a b c where f a b c the

co de inserts a b c into TmpF This information is used to p opulate C by executing query

d

Finally we remove TmpF The cost formula is

COST C RE AT E COST TmpF costq uer y cost C RE AT E

NEXT COST car dC I N S E RT COST C car dA car dB GE T

t t t t

car dTmpF I N S E RT COST TmpF costq uer y

Join To execute the join op eration we rst create a temp orary Paradox table TmpJ with the schema

TmpA TmpB and p opulate TmpJ by executing query TmpJ stores the TIds of the pairs which

satisfy the join condition C Hence the cardinality of TmpJ is cardA cardB sel C The

d d

rest of the join co de is similar to that of Cartesian pro duct except that Query is used instead of

Query to limit the numb er of TPcases visited and Query is used instead of Query as the

result requires a pro jection on F Furthermore TmpJ can b e removed any time after executing

query The cost formula is

COST C RE AT E COST TmpJ costq uer y cost C RE AT E

C RE AT E COST TmpF costq uer y

NEXT COST car dC I N S E RT COST C car dTmpJ GE T

t t

COST TmpF costq uer y car dTmpF I N S E RT

Exp erimental Results

We conducted exp eriments aimed at i evaluating the eectiveness of our selectivity estimates ii

evaluating the eectiveness of our rewrite rules and iii evaluating the eectiveness of our TP

optimizer as a whole

All these exp eriments use a table creation program TCP which generates tables in accordance

TUPLES numb er of TPtuples with various parameters describ ed in Table TCP generates NO

The TPrelations generated by TCP contain data elds f and f TCP assigns unique consecutive

integers to the f eld and randomly assigns integers in the interval F Range to the f eld

To generate a temp oral constraint C of the form t t a value is drawn randomly from the range

C R for t The numb er of time p oints that are solutions to a TPcase is controlled by drawing

TIME PTS R and adding it to t Values for L and U are generated value x from the range AVG

R and U R resp ectively Distributions in tpcases randomly by cho osing a value from the ranges L

are restricted to geometric g binomial b uniform u and mix When a mix is chosen TCP

randomly picks one of the g b or u distributions

Parameter Description

NO TUPLES numb er of TPtuples

F Range range for the data attribute f

C R range for the lower b ound of the temp oral constraint C

AVG TIME PTS R range for the average numb er of time p oints in a TPtuple

L R range for the lower b ound probability L

U R range for the upp er b ound probability U

Delta probability distribution function

Table Parameters of the Table Creation Program

Eectiveness of Selectivity Estimators

We used TCP to create TPrelations with various cardinalities and prop erties These are shown

b elow Each TPtuple has one case

TPrel NO TUPLES F Range C R AVG TIME PTS R L R U R Delta

r mix

r mix

r mix

r mix

r mix

r mix

Table TPrelations used in the rst set of exp eriments

Our exp eriments fo cus on the eectiveness of the selectivity estimation formulas for temp oral

and probabilistic selection conditions as well as join op erations Table contains the temp oral and

probabilistic selection queries we used in these exp eriments

Temp oral Selectivity Figure shows the actual cardinality of the resulting answers as well as

the estimated values provided by Formulas TS TS The leftmost bar in each group in the gure

Query

r

L

r

Query

L

r

L

r

r

L

r

r

r

L

r

r

L

r

r

L

r

r

L

r

r

U

r

r

U

r

r

U

r

r

U

r

r

U

r

r

U

r

U

r

U

a Temp oral Selection Queries b Probabilistic Selection Queries

Table Selection Queries for Exp eriments

shows the cardinality estimates generated by using Formula TS the middle bar show the estimates

generated by Formula TS and the rightmost bar shows the actual result cardinalities It is easy to

see that Formula TS generates b etter results than Formula TS

Figures and contain the actual cardinality and the cardinality estimates generated by Set

A and Set B formulas for probabilistic conditions of the form L pr ob L pr ob U pr ob and

U pr ob resp ectively As seen from the gures Set B formulas provide b etter estimates than Set

A formulas For L pr ob Set B results are query accurate we surmise that this is b ecause they

use Prop osition For the same reason and the fact that the sum of the upp er probability values

U AV G TIME PTS the formula do not have to add up to but add up on average to AV G

for U pr ob also p erforms very well The formulas for L pr ob and U pr ob in Set B assume

L AVG U is half of the that the numb er of time p oints having probability values less than AVG

average time p oints in a TPtuple In fact we exp ect that there are more time p oints with lower

b ounds of probability intervals b etween and AVG L than there are p oints with the lower b ounds

L and as the sum should add up to Hence the estimates of probability intervals b etween AVG

for L pr ob values are not as accurate as the L pr ob case A symmetric argument also applies to

the U pr ob condition The cardinality estimate generated by Set A for queries and in Table

b were and hence hence they do not app ear in Figures and

Join In order to assess the accuracy of our join size estimators we created queries involving a two

way join b etween various TPrelations These queries and their actual and the estimated cardinalities

are shown in Table All queries assume that no prior knowledge ab out the relationship b etween the 1200

1000

800

EstimatedCardby 600 FormulaTS1

cardinality EstimatedCardby FormulaTS2 400 ActualCard

200

0 1 2 3 4 5 6 7 8 9 101112

queryno

Figure Actual And Estimated Cardinalities for Temp oral Selections

1200 1200

1000 1000

800 800

EstimatedCardbySetA EstimatedCardbySetA 600 EstimatedCardbySetB 600 EstimatedCardbySetB ActualCard ActualCard cardinality cardinality

400 400

200 200

0 0 1 2 3 4 5 6 7 8

queryno queryno

Figure Actual And Estimated Cardinalities Figure Actual And Estimated Cardinalities

for L pr ob for L pr ob

700 1200

600 1000

500 800

400 EstimatedCardbySetA EstimatedCardbySetA EstimatedCardbySetB 600 EstimatedCardbySetB ActualCard ActualCard cardinality 300 cardinality

400 200

200 100

0 0 9 10 11 12 13 14 15 16

queryno queryno

Figure Actual And Estimated Cardinalities Figure Actual And Estimated Cardinalities

for U pr ob for U pr ob

events represented in base relations exist an assumption known as ignorance The conjunction

strategy for this can b e sp ecied as follows a b c d max a c minb d The

ig

results indicate that the cardinality estimates for the join provide decent estimates The estimated

cardinalities seem to b e reasonably accurate

Query No QUERY Actual Card Estimated Card

select from r r under ig where rf rf

select from r r under ig where rf rf

select from r r under ig where rf rf

select from r r under ig where rf rf

select from r r under ig where rf rf

select from r r under ig where rf rf

select from r r under ig where rf rf

Table Join Queries for Exp eriments

Eectiveness of Rewrite Rules

In this section we study the eectiveness of some of the rewrite rules proved in Section Due

to space reasons we are unable to present results on the eectiveness of all the rewrite rules We

fo cus on the result that temp oral and probabilistic selects are commutative Theorem and that

temp oral selects can b e pushed into joins Theorem

Commutativity of Temp oral and Probabilistic Selects If T resp P is a temp oral resp

probabilistic selection condition then by Theorem we know that r r We

P T T P

started by generating TPrelations using the TCP with the following parameters

TPrel NO TUPLES F Range C R AVG TIME PTS R L R U R Delta

r mix

To vary T we used the temp oral condition i for all i To vary P

i

we used the probabilistic condition L L for all i and L For each

pair T P we determined the time in seconds to compute r minus the time to compute

P T

r when r is not indexed ie when all TPcases must b e examined If this dierence is

T P

p ositive negative then it is b etter to p erform the probabilistic temp oral select rst resp ectively

The results are shown in Figure

In general if SelT and SelP are approximately equal then b oth orderings are approximately

equal Otherwise it is b etter to p erform the selection that oers the highest selectivity rst In

other words even though it takes a lot of time to compute TPlter P used in probabilistic as

i

compared to solving the constraint C T in the case of temp oral selects this dierence turns out

i

to b e negligible Intuitively this o ccurs b ecause testing is mostly a CPUintensive op eration while

i

writing out satisfying TPcases is a memoryintensive op eration and hence is more exp ensive Our selects and temp oral selects o four p ossible these selectivities are appro Pushing of their p erform a series of selects on T T ableL ableL The same conclusion goal r r selectivities to compute and is T emp oral to determine w Figure T

a 0.00 0.00 Figure ableT ys 19.51 19.51 A with resp ect to Selections r 25.88 T 25.88 holds when Time for on Time for when P T 31.88 31.88 ximately equal then w r r r r

37.58 w 37.58 v er probabilistic Th w e should determine their execution order according to our estimates Sel-P Sel-P The results are sho e should 48.32 48.32 r us in P r B r P to 63.58 63.58 regardless of the kind of select b eing is indexed the T Join T 80.31 80.31 r min r use system min 90.95 90.95 T Note that the eac r 97.42 selects 97.42 us time for In this case w us time for h e can giv used

100.00

of these 100.00 wn in Figure T T r ableT 1.13 12.53 23.22 34.43 45.03 55.46 67.14 77.84 88.84 100.00 e preference to data selects o 1.13 12.53 23.22 34.43 45.03 55.46 67.14 77.84 88.84 100.00 C metho ds T query T e created probabilistic r P to Sel-T Sel-T r P r without indices In conclusion when w r T compute with indices T r o T generate sample r considered r or D (6)-(4) (4)-(2) (2)-0 0-2 2-4 4-6 (8)-(6) (6)-(4) (4)-(2) (2)-0 0-2 2-4 4-6 6-8 can b e P T r table indexes v ev er temp oral Ho T aluated and e need to r relations w ev er if used r in

for r and r we used the parameters given in Figure

TPrel NO TUPLES F Range C R AVG TIME PTS R L R U R Delta

r mix

r mix

r mix

r mix

Table TPRelations Used In JoinTemp oral Selection Rewrite Rules

We varied T in the same way as for temp oral selections and we did not vary the conjunction strategy

as this do es not aect the relative running times of options ABC D We varied the numb er of

TPcases returned by the join by considering the queries r r where jr r j and

T

r r where jr r j The results for these two queries are shown resp ectively in

T

Figures and

12

10

8 A B 6 C

Time(sec) D 4

2

0

0 8 .45 .85 .35 .44 .72 .37 .45 0.0 5.8 0.00 13 21 32 42 56 74 84 10

Selectivityof T withrespectto(r 1joinr 2)

Figure A r r vs B r r vs C r r vs D r r

T T T T T

Figure demonstrates the case when jr r j is relatively large Here it is b etter to push selection

into the join for b oth r and r ie strategy B unless the selectivity of the selection predicate is very

low ie unless the selectivity of T is a large p ercentage This result should not b e surprising As

jr r j increases the preference for strategy B vs strategy A intensies Conversely as jr r j

decreases the threshold for T where A b ecomes preferable to B decreases An example is shown in

Figure Note that when the join removes a fairly large numb er of tuples strategy C or D may

p erform b etter than strategy B the choice of C or D dep ends on the selectivity of T with resp ect

to r and r This o ccurs b ecause strategy B requires three op erations whereas the other strategies

only need two

Pushing Probabilistic Selections into Join The query r 1 r where P is of the form L p

P

for some probability p can b e evaluated as A r 1 r or B r 1 r To

P P P P

determine when to use these options we used TCP to generate eight TPrelations with the following

parameters

TPrel NO TUPLES C R AVG TIME PTS R L R U R Delta

fr j i g mix

i

fr j i g mix

i

The F Range parameter was for r and r for r and r for r and r and

for r and r Thus jr 1 r j jr 1 r j jr 1 r j and

pc pc pc

i

for all i jr 1 r j To vary P we used probabilistic predicate L

pc

The results are shown in Figure Note that the conjunction strategy remained constant This

is acceptable since other exp eriments have veried that the running times dep end on j r 1 r j

P

instead of the choice for Nonetheless note that when then it is often the case that

r j j r 1 r j j r 1

P P

For each graph in Figure the times for strategy A remain more or less constant This is b ecause

the times to p erform the joins are constant and these times overwhelm the times it takes to p erform

the selections Furthermore in each graph the times for strategy B increase as the selectivity of

P increases We are interested in the threshold for P where the time for strategy B exceeds the

time for strategy A For graphs a b and c of Figure this o ccurs when the selectivity of P

is around and resp ectively Thus if jr 1 r j increases while maxjr j jr j remains

constant then the threshold for P increases Graph d of Figure uses base TPrelations r r

where jr j jr j jr j jr j Here the threshold for P is around This matches

3

2.5

2 A B 1.5 C

Time(sec) D 1

0.5

0

0 .90 .48 .71 .61 .52 .52 .65 .32 0.0 0.00 12 35 38 51 64 64 80 90 10

Selectivityof T withrespectto(r 3joinr 4)

Figure A r r vs B r r vs C r r vs D r r

T T T T T 8 50

7 45 40 6 35 5 30 A A 4 25 B B 20

Time (sec) 3 Time (sec) 15 2 10 1 5 0 0 0.00 0.88 8.85 30.97 63.72 76.99 87.61 0.00 0.17 9.34 33.00 60.61 75.42 82.74

Selectivity of P with respect to (r1 join r2) Selectivity of P with respect to (r3 join r4)

a b

450 80

400 70

350 60 300 50 250 A A 40 200 B B

Time (sec) Time (sec) 30 150 100 20 50 10 0 0 0.01 0.55 8.36 33.04 59.16 74.58 81.86 0.00 0.59 6.63 33.50 59.87 74.64 81.19

Selectivity of P with respect to (r5 join r6) Selectivity of P with respect to (r7 join r8)

c d

Figure A r 1 r vs B r 1 r

P pc P P pc P

the threshold for graph a but do es not match the threshold for graph b Thus to estimate the

threshold for P we cannot just consider jr 1 r j eg jr 1 r j jr 1 r j Instead consider

pc pc

jr 1 r j jr 1 r j

pc pc

the ratio b etween jr 1 r j and maxjr j jr j eg

maxjr jjr j maxjr jjr j

In conclusion there are many cases where it is b etter to use strategy B even though it involves

p erforming selects three times more often than strategy A The accuracy of predicting when these

cases o ccur strongly dep ends on the accuracy of the cardinality estimators

Eectiveness of The Query Optimizer

In this section we rep ort on exp eriments conducted by us to determine the overall eectiveness of

our TP query optimizer We wanted to determine how go o d or bad the plans picked by the

TPquery optimizer are We used TCP to create TPrelations having the prop erties shown in Table

TPrel NO TUPLES F Range C R AVG TIME PTS R L R U R Delta

r mix

r mix

r mix

r mix

r mix

r mix

r mix

r mix

r mix

Table TPrelations used in the second set of exp eriments

Simple Queries We rst tried to see what happ ens with queries involving only one TPop eration

We compared compared the actual running times with the optimizers estimated execution times

Table contains the queries we tried together with their actual and estimated execution times It

can b e seen that the optimizer estimates exhibit an acceptable amount of accuracy consistent with

the accuracy of traditional cost based query optimizers In those cases where the actual and estimate

dier the dierence app ears to b e due to inaccurate selectivity estimates However the optimizer

still seems to pick very go o d plans and seems to avoid very bad ones

Multiple Selection Queries We wanted to study the b ehavior of our optimizer when select queries

involve all three typ es of selects data temp oral probabilistic We created three queries shown in

Table There are dierent feasible query plans for each of these queries Figure enumerates

these plans shows their running times and uses a star to indicate which plan was picked by the

optimizer As seen from the gure the optimizer was able to cho ose the cheap est plan for queries

and and the second b est plan for query The dierence b etween the cheap est plan and the one

the optimizer picked was secs

Multiple Joins Next we studied how our optimizer b ehaves when queries with two joins are

executed Table shows four queries created by TCP In each query we chose TPrelations with

Query Actual time Estimated Time

r s s

L

r s s

L

r s s

L

r s s

L

r s s

L

r s s

r s s

r s s

r s s

r r s s

r fr f ig

r r s s

r fr f ig

r r s s

r fr f ig

r r s s

r fr f ig

r r s s

r fr f ig

Table Some Simple Queries Their Estimated and Actual Running Times

Query

r

f L

r

f L

r

f L

Table Selection Queries

varying numb er of average time p oints There are three ways of executing these two join op eration

Again we executed those plans and examined the optimizers choice in each case The results are

shown in Figure and the optimizers choice is shown with a star The reader will not that the

optimizer picked the b est in all four cases

Query

r r r

ig ig

r fr f r fr f

r r r

ig ig

r fr f r fr f

r r r

ig ig

r fr f r fr f

r r r

ig ig

r fr f r fr f

Table Join Queries

SelectionJoin Mix We created three queries involving a join a temp oral and a probabilistic

select op eration Recall that temp oral selects can b e pushed into the join but probabilistic selects

cannot Moreover temp oral selection can b e pushed into the rst argument or the second argument

or b oth arguments of the join We can execute the temp oral select by using an index or without 5

4.5

4

3.5 DPT

3 DTP PDT 2.5 IPDT PTD 2

executiontime IPTD TDP 1.5 ITDP

1 TPD ITPD 0.5

0 1 2 3

queryno

Figure Ordering of Data Temp oral and Probabilistic Selects

an index Finally we can commute temp oral and probabilistic selects Hence there are dierent

ways of executing these queries in of those temp oral selects are executed b efore the join and in

of them they are executed after the join Figure shows the actual execution times of these

plans for the three queries provided in Table Once again we marked the optimizers choice with

a star Although the optimizer was not able to pick the cheap est plan for the rst query it was able

to avoid very bad plans and furthermore the dierence in execution times b etween the plan chosen

by the optimizer and the b est plan was secs The optimizer chose the second b est plan for the

second query and was able to pick the b est plan for the last query

Query

r r

ig

r fr f L

r r

ig

r fr f L

r r

ig

r fr f L

Table Join Queries with Selections

Finally we made the previous joinselection mix queries more complex and created the queries

shown in Table For the rst and the third queries the optimizer applied rewrite rules to

generate a plan The most promising plans and the one chosen by the optimizer are shown in Figure

The optimizer was able to cho ose the b est plan for the rst two queries and it chose a go o d

plan for the third query The dierence b etween the execution times of the b est plan and the plan

the optimizer chose for query was secs

We conclude this section by rep orting that our cost mo dels are reasonably accurate and easy to

compute and that the optimizer do es its job eectively It quickly selects a query execution plan

that is very go o d while consistently avoiding bad plans 5

4.5

4

3.5

3

2.5

2 executiontime

1.5

1

0.5

0 1 2 3 4

queryno

Figure Ordering of Two Joins

Query

r r r

ig ig

r fr f r fr f r f r f r f L

r r r

ig ig

r fr f r fr f r f r f

r r r

ig ig

r fr f r fr f r f L

Table More Complex Queries

Related Work

Dyreson and Sno dgrass were one of the rst to mo del temp oral uncertainty using probabilities

by prop osing the concept of an indeterminate instant Intuitively an indeterminate instant is an

interval of time p oints with an asso ciated probability distribution They prop ose an extension of SQL

that supp orts i sp ecifying which temp oral attributes are indeterminate ii correlation credibility

which allows a query to use uncertainty to mo dify temp oral data for example by using an

EXPECTED value correlation credibility the query will return a determinate relation that retains

the most probable time p oint for the event iii ordering plausibility which is an integer b etween

and where denotes that any p ossible answer to the query is desired while denotes that only

a denite answer is desired and iv sp ecifying that certain temp oral intervals are indeterminate

Later extended the work of Dyreson and Sno dgrass in the following ways First they

prop osed TPcases as a way of using constraints and probabilities together to store probabilistic

temp oral data Then they develop ed an algebra extending the relational algebra to manipulate

this data They used probability intervals hence capturing p oint probabilities of Dyreson and

Sno dgrass as a sp ecial case They allowed users to sp ecify in their queries arbitrary knowledge

the users may have ab out the dep endencies b etween events and in fact their algebraic op erators were

parameterized by these dep endency assumptions Sp ecicall y all indep endence assumptions used in 4.5

4

3.5

3

2.5

2 executiontime 1.5

1

0.5

0 1 2 3

queryno

Figure Ordering of a Join with Temp oral and Probabilistic Selects

were eliminated in They also prop osed a set of op erations such as compaction that were not

considered elsewhere However they did not incorp orate some things that Dyreson and Sno dgrass

could handle For instance they assumed tuples have only one indeterminate temp oral attribute

while allows more than one Furthermore they do not have analogs of correlation credibility

or ordering plausibility intro duced by and they use metho ds to store probability distributions

provided by Dyreson and Sno dgrass

This pap er directly builds on the previous two works Sp ecically it provides cost mo dels for

TPdatabases which apply in large part to b oth the preceding works and it provides estimates of

cardinality and other statistical variables useful for query optimization In addition it derives a

useful set of equivalence results that may b e used for Using the software of it

builds what is to our knowledge the rst query optimizer for temp oral probabilistic databases

The work also extends a host of work on probabilistic non temp oral databases Kiessling et

als DUCK system provides an elegant logical axiomatic theory for rule based uncertainty

Building on past work of Kifer and his colleagues Lakshmanan and Sadri show how selected

probabilistic strategies can b e used to extend the previous probabilistic mo dels Lakshmanan and

Shiri have shown how deductive databases may b e parameterized through the use of conjunction

and disjunction strategies Barbara et al develop a p oint probabilistic data mo del and prop ose

probabilistic op erators When p erforming joins they assume that Bayes rule applies and hence as

they admit up front they make the assumption that all events are indep endent Also as they p oint

out unfortunately their denition leads to a lossy join Cavallo and Pittarelli s imp ortant

probabilistic relational database mo del uses probabilistic pro jection and join op erations but the

other relational algebra op erations are not sp ecied Also a relation in their mo del is analogous

to a single tuple in the framework of Dey and Sarkar prop ose an elegant NF approach

to handling probabilistic databases They supp ort i having uncertainty ab out some ob jects but

certain information ab out others ii rst normal form which is easy to understand and use iii 4

3.5

3

2.5

2

executiontime 1.5

1

0.5

0 1 2 3

queryno

Figure Plans for Complex Queries

elegant new op erations like conditionalization The NF representation used by them is a sp ecial case

of the annotated representation of who showed exp erimentally that when dealing the temp oral

uncertainty the TPrepresentation is sup erior Also the semantics of some of the op erations in

is dierent from the semantics of the same op erations in TPA Another imp ortant work is the

ProbView system for probabilistic databases by Lakshmanan et al ProbView extends the

classical relational algebra by allowing users to sp ecify in their query what probabilistic strategy

or strategies should b e used to parameterize the query ProbView removed the indep endence

assumption of previous works However ProbView has no notion of time though ProbView scaled

up well to massive numb ers of tuples it did not scale up well when massive amounts of uncertainty

are present as is the case with temp oral probabilistic databases where saying that an event sometime

b etween Jan yields a total of seconds Thus if a temp oral database

uses seconds as it lowest level of temp oral granularity this gives rise to cases to represent

just one statement something that would quickly overwhelm ProbView The work rep orted in

this pap er sp ecically builds up on the idea of a probabilistic conjunctive strategy of and applies

it to TPdatabases But in addition to these works none of which provided statistical estimates of

various parametersvariables asso ciated with TPdata we do so and we provide a cost mo del for

such data and a query optimizer whose sole goal is to avoid the problems faced by purely probabilistic

database systems attempting to cop e with huge amounts of uncertainty

There is also much work in the temp oral database community that deals with uncertainty Sno d

grass was one of the rst to mo del indeterminate instances in his do ctoral dissertation he

prop osed the use of a mo del based on three valued logic Dutta and Dub ois and Prade later

used a fuzzy logic based approach to handle generalized temporal events events that may o ccur

multiple times Gadia prop oses an elegant mo del to handle incomplete temp oral information as

well He mo dels values that are completely known values that are unknown but are known to have

o ccurred values that are known if they have o ccurred and values that are unknown even if they

o ccurred However he makes no use of probabilistic information

Koubarakis prop oses the use of constraints for representing event o ccurrences His framework

allows stating the facts that event e o ccurred b etween and AM and that event e o ccurs after

pm From this we may conclude that event e o ccurs after e While TPtuples supp ort this via

queries they do not explicitly enco de this data into tuples which Koubarakis can do

As mentioned earlier though many of these works touch up on uncertainty they do not provide the

comprehensive probability analysis provided in They also do not prop ose statistical variables

to maintain information ab out TPrelations and corresp onding cost mo dels which form the key

contributions of this pap er

The only comparable work in the area of query pro cessing and query optimization comes from the

eld of temp oral databases Salzb erg and Tsotras provide an excellent overview of temp oral indexing

metho ds develop ed in the community over the past decade Query optimization and cost mo dels

for temp oral databases have b een studied in early s by Gunadhi and Segev and their study

was later continued by So o Sno dgrass Jensen and Slivinskas Their research was primarily

concentrated on designing ecient algorithms for relational op erations on temp oral databases which

minimized IO costs These results however are not directly applicable to the TPDatabases due

to the complex structure of the data mo del develop ed in Moreover to our knowledge this is

the rst attempt to develop a detailed mo del for estimating the cost and cardinality of probabilistic

selections

Conclusions

Databases that contain nondeterministic temp oral information are growing in imp ortance Imp ortant

work in this eld was started by Dyreson and Sno dgrass who prop osed a probabilistic temp oral

mo del of such data Dekhtyar et al extended the framework of by prop osing an extension of

the relational algebra for such data as well as by eliminating many assumptions ab out the data that

was present in prior work

In this pap er we make additional contributions to the area of temp oral probabilistic TP databases

First we develop a TP calculus that is equivalent to the TP algebra in expressive p ower Second

we develop a large set of equivalence results in such databases that constitute rewrite rules that a

query optimizer might use Third we develop a cost mo del for TP databases Fourth we have

implemented our cost mo del and our rewrite rules in a prototyp e optimizer for TP databases Using

this implementation we have evaluated i the accuracy of our cost mo del ii the eectiveness of

our rewrite rules and iii the eectiveness of our TP optimizer as a whole

References

D Barbara H GarciaMolina and D Porter The Management of Probabilistic Data

IEEE Trans on Know ledge and Data Engineering Vol pps

MH Bohlen RT Sno dgrass and MD So o Coalescing in Temp oral Databases in

Proc VLDB pp

R Cavallo and M Pittarelli The Theory of Probabilistic Databases in Proc VLDB

Co dd E F Relational Completeness of Data Base Sublanguages in pages

A Dekhtyar R Ross and VS Subrahmanian Probabilistic Temp oral Databases Part I

Algebra ACM Transactions on Database Systems vol pps March

D Dey and S Sarkar A Probabilistic Relational Mo del and Algebra ACM Transac

tions on Database Systems Vol pps

C Dyreson and R Sno dgrass Supp orting ValidTime Indeterminacy ACM Transac

tions on Database Systems Vol Nr pps

D Dub ois and H Prade Pro cessing Fuzzy Temp oral Knowledge IEEE Transactions

on Systems Man and Cybernetics pps

S Dutta Generalized Events in Temp oral Databases in Proc th Intl Conf on Data

Engineering pp

S Gadia S Nair and YC Po on Incomplete Information in Relational Temp oral

Databases in Proc VLDB

H GarciaMolina J D Ullman and J Widom Database System Implementation Prentice

Hall

G Graefe The Cascades Framework for Query Optimization in the Bul letin of the

TC on Data Engineering pp

H Gunadhi A Segev A Framework for Query Optimization in Temp oral Databases

in Proc Conf on Statistical and Scientic Database Managementpp

H Gunadhi A Segev Query Pro cessing Algorithms for Temp oral Intersection Joins

in Proc ICDE pp

U Guntzer W Kiessling and H Thone New Directions for Uncertainty Reasoning in

Deductive Databases Pro c ACM SIGMOD pp

M Kifer and A Li On the Semantics of RuleBased Expert Systems with Uncer

tainty nd Intl Conf on Springer Verlag LNCS eds M Gyssens J

Paredaens D Van Gucht Bruges Belgium pp

W Kiessling H Thone and U Guntzer Database Support for Problematic Know ledge

Pro c EDBT pps Springer LNCS Vol

G Kollios VJ Tsotras Hashing Metho ds for Temp oral Data To app ear at IEEE Transac

tions on Knowledge and Data Engineering

M Koubarakis Database Mo dels for Innite and Indenite Temp oral Information

Information Systems Vol pps

VS Lakshmanan N Leone R Ross and VS Subrahmanian ProbView A Flexible Proba

bilistic Database System ACM Transactions on Database Systems Vol Nr pp

VS Lakshmanan and F Sadri Mo deling Uncertainty in Deductive Databases in

Proc Int Conf on Database Expert Systems and Applications DEXA Lecture Notes in

Computer Science Vol Springer pp

VS Lakshmanan and F Sadri Probabilistic Deductive Databases in Proc Int Logic

Programming Symp ILPS MIT Press

VS Lakshmanan and N Shiri Parametric Approach with Deductive Databases with

Uncertainty in Proc Workshop on Logic In Databases pp

R Ramakrishnan J Gehrke Database Management Systems nd Ed McGrawHill

S Ross A First Course in Probability Prentice Hall

Rustin R Data Base Systems PrenticeHall New Jersey

B Salzb erg and V Tsotras Comparison of Access Metho ds for TimeEvolving Data

ACM Computing Surveys vol No pp

H Samet The Design and Analysis of Spatial Data Structures Addison Wesley Read

ing MA

J Sho eneld Mathematical Logic Addison Wesley

G Slivinskas CS Jensen RT Sno dgrass Query Plans for Conventional and Tem

p oral Queries Involving Duplicates and Ordering in Proc ICDE

RT Sno dgrass Monitoring Distributed Systems ARelational Approach PhD disser

tation Carnegie Mellon University

RT Sno dgrass M So o Supp orting Multiple Calendars in RT Sno dgrass Ed The

TSQL Temporal pp Kluwer

M D So o R T Sno dgrass C S Jensen Ecient Evaluation of the ValidTime

Natural Join in Proc ICDE pp

D Zhang VJ Tsotras B Seeger Ecient Temp oral Join Pro cessing using Indices Pro c

ICDE to app ear San Jose CA FebMarch

Q Zhu and P Larson Solving Lo cal Cost Estimation Problem for Global Query

Optimization in Multidatabase Systems Journal of Distributed and Paral lel Databases

pp

A Pro ofs for the TPCalculus Theorems

Pro of of Theorem Supp ose db is a TPdatabase q is a TPAquery over db and q is a sub query

i

of q We pro ceed by induction on the numb er of op erators in q This numb er is zero in the base case

where q r Otherwise q must ob ey the structure sp ecied in Denition

Supp ose Q is a TPCquery over db and Q fs j F g is a sub query of Q By the inductive

i i

i

hyp othesis assume that TPAquery q can b e expressed as TPCquery Q for all sub queries of q

i i

Then TPAquery q can b e expressed as TPCquery Q in the following way

If q r then Q fs j s r g

If q q where C C C then Q fs j s F E E g where

C n n

E sA c when C A c

i i

E sC s C T when C T

i i

E sC s P p when C P p

i i

If q q where F a a then

F n

Q fs j s F sa s a sa s a g

n n

If q q where R a a then

R n

Q fs j s F sRa s a sRa s a g

n n

If q q then Q Q

then If q q q where R a a and R a a

n

n

Q fs j s F s F sa s a sa s a

n n

fhih ig

id

sa s a sa s a g

n

n

then If q q 1 q where R a a and R a a

n

n

Q fs j s F s F sa s a sa s a

n n

fhih ig

id

sa s a sa s a g

n

n

If q q q then Q fs j s F s F g

n n

fhih ig fhih ig

n

id id

If q q q then Q fs j s F s F g

n n

fhih ig fhih ig

n

id id

g 3 s F If q q q then Q fs j s F

Pro of of Theorem Supp ose db b e a TPdatabase s is a TPvariable over R Q fs j F g is

a TPCquery over db and Q fs j F g is a sub query of Q where s is a TPvariable over R

i i i i

i

We pro ceed by induction on the structures given in Denitions and

Supp ose q is a TPAquery over db and q is a sub query of q By the inductive hyp othesis assume

i

that TPCquery Q can b e expressed as TPAquery q for all sub queries of Q Then TPCquery Q

i i

can b e expressed as TPAquery q in the following way

r g then q r If Q fs j s

If Q fs j s F E g then q q where

C F R

RA A for each TPatom in E of the form sA s A

F a a when R a a

k k

C C C where

n

C A c when the ith TPatom in E is of the form sA c

i

C T when the ith TPatom in E is of the form sC s C T

i

C P p when the ith TPatom in E is of the form sC s P p

i

C true when the ith TPatom in E is of the form sA s A

i

If Q fs j s F s F Lg where fh i h ig then

q q 1 q where

F R R

R A A for each TPlinker in L of the form sA s A Otherwise R A A

R A A for each TPlinker in L of the form sA s A Otherwise R A A

F a a when R a a

k k

If Q fs j s F s F g where fh i h ig then

n

n

q q q when

n

q q q when

n

q q q when 3

n

B Pro ofs of Equivalence Results

The pro ofs of query equivalence results contained in this section are all done using Theoretical Anno

tated Temporal Algebra TATA dened in on annotated relations the at relational equivalents of

TPrelations Each annotated relation consists of annotated tuples which have the form d t L U

t t

where d is the data part of the tuple a collection of relational attributes similar to the data part of

a TPtuple t is a single time p oint and L U is a probability interval The annotation

t t

op eration can b e applied to TPtuples to pro duce equivalent annotated relations

TATA has b een dened in on annotated relations Except for TPcompression which is sp ecic

to the format of TPtuples all other TPA op erations have their analogs in TATA It has b een shown in

there that TPA op erations correctly implement the semantics of corresp onding TATA op erations

ie the equivalences ANN O pr O pANNr and ANN O pr r O pANNr ANNr hold

for unary and binary op erations resp ectively of the TPA and TATA

As annotated tuples are at reasoning in terms of annotations of TPrelations in pro ofs is less

cumb ersome The correct implementation theorems of assure us that a query equivalence holds

in TPA i it holds in TATA Hence we cho ose to prove equivalences in TATA to make our reasoning

more transparent and make the pro ofs shorter

Pro of of Theorem

Selection We prove the theorem statement for atomic selection conditions Theorem will

ensure that it holds for conjunctive selection conditions Part of this theorem will ensure

that it holds for disjunctive selection conditions

Three cases need to b e considered

Selection condition is on data Then the idemp otence of selection in TPA follows from

idemp otence of selection in classical relational algebra

Selection condition C is temporal Let at d t L U AN N r This means that

C

timep oint t satises selection condition C But then AN N r will also contain

C C

at hence r r The other subset inclusion r r follows from

C C C C C C

the denition of selection in TPA

Selection condition C is probabilistic Let at d t L U AN N r This means

C

that the probability interval L U satises selection condition C But then AN N r

C C

will also contain at hence r r The other subset inclusion r

C C C C C

r follows from the denition of selection in TPA

C

Projection Direct corollary of idemp otence of pro jection in classical relational algebra

pro jection in TPA aects only the data part of the tptuples

Compaction In compaction op eration is dened to have the following three prop erties

Compactness r is compact for all TPrelations r

No Fo oling Around NFA If r is compact then ANNr ANN r

Conservativeness If at d t L U ANNr then at d t L U ANNr

t t

t t

Idemp otence of compaction therefore follows from Compactness and NFA prop erties

Intersection By denition r r r r Let at d t L U AN N r Then

AN N r r AN N r AN N r will contain two distinct instances of at As r is compact

AN N r contains no other tuples for the pair d t and therefore AN N r r will only contain

two distinct instances of at for the pair d t By Identity prop erty of combination functions

fL U L U g L U therefore AN N r r AN N r r will contain the

tuple at d t L U As at at we have shown that r r r

Consider now an annotated tuple at d t L U AN N r r By denition of intersection

r has to contain some tuple at d t L U and as we know that r is compact this tuple

will b e unique for the pair d t But then by the Identity prop erty of the combination function

AN N r r will contain the tuple at d t L U As r r is compact at at ie

L L and U U But then at at and therefore at AN N r Hence r r r

Union By denition r r r r Let at d t L U AN N r Then AN N r r

AN N r AN N r will contain two distinct instances of at As r is compact AN N r contains

no other tuples for the pair d t and therefore AN N r r will only contain two distinct in

stances of at for the pair d t By Identity prop erty of combination functions fL U L U g

L U therefore AN N r r AN N r r will contain the tuple at d t L U

As at at we have shown that r r r

Consider now an annotated tuple at d t L U AN N r r By denition of union

r has to contain some tuple at d t L U and as we know that r is compact this tuple

will b e unique for the pair d t But then by the Identity prop erty of the combination function

AN N r r will contain the tuple at d t L U As r r is compact at at ie

L L and U U But then at at and therefore at AN N r Hence r r r

Dierence Consider an annotated tuple at d t L U AN N r We show at AN N r

r i at AN N r r r Let r denote r r Two cases are p ossible

There exists an annotated tuple at d t L U AN N r In this case by denition

of dierence at AN N r Also as AN N r r r see denition of dierence

at AN N r r

There is no tuple in AN N r for the pair d t In this case at AN N r AN N r r

But then at also has to b e in AN N r r 3

Pro of of Theorem

Selection See Theorem in

Projection We prove r The other equivalence will follow

F G G F

Let tp d r Let d d By denition of classical pro jection d will only contain

F

attributes in F Let d d Then d will contain only attributes from G Hence d will

G

contain only attributes from F G ie d d

F G

By denition of pro jection in TPA r will contain tptuple tp d d and

F F

r will contain tptuple tp d d But also r will contain

G F G F G

tptuple tp d d tp So for every tuple tp d r b oth r

F G G F

and r will contain the same tuple tp d Therefore r r

F G G F F G

Intersection r r r r Thus in order to show the commutativity of intersection we

need to show commutativity of multiset intersection r r r r

By denition of multiset intersection in TATA

AN N r r AN N r AN N r

fat t d L U AN N r jat AN N r at t d L U g

fat t d L U AN N r jat AN N r at t d L U g

and

AN N r r AN N r AN N r

fat t d L U AN N r jat AN N r at t d L U g

fat t d L U AN N r jat AN N r at t d L U g

As the multiset union op eration is commutative the ab ove two expressions are equivalent

and hence AN N r r AN N r r yielding immediately r r r r

Union r r r r Thus in order to show the commutativity of intersection we need

to show commutativity of multiset union r r r r

By denition of multiset union in TATA

AN N r r AN N r AN N r AN N r AN N r

and

AN N r r AN N r AN N r AN N r AN N r

As the multiset union op eration of set theory is commutative the ab ove two expressions

are equivalent and hence AN N r r AN N r r yielding immediately r r r r

Dierence

We show that r r r r r r the other equivalences follow from it

r r r r r r

Let at d t L U AN N r r r Then by denition of dierence at AN N r r

and AN N r contains no tuple for the pair d t As at AN N r r it must b e the case

that at AN N r and AN N r contains no tuple for the pair d t

But then AN N r r will contain no tuples containing data part d and timep oint t As

at AN N r we get at AN N r AN N r r AN N r r r

r r r r r r

Let at d t L U AN N r r r Then by denition of dierence at r and there is

no tuple containing the pair d t in AN N r r Hence there is no annotated tuple containing

the pair d t in eiter AN N r or AN N r Then at AN N r AN N r AN N r r

and consequently at AN N r r AN N r AN N r r r

Cartesian Pro duct

Recall that given two data tuples d and d we consider tuples d d and d d to b e equivalent

Let at d t L U r and at d t L U r Then ANN r r will contain the

tuple at d d t L U where L U L U L U Also ANN r r will contain

the tuple at d d t L U where L U L U L U By commutativity of

op eration L U L U L U L U ie L U L U But then at at

Therefore for each pair of tuples in ANNr and ANN r the same tuple will b e found in

ANNr r and ANNr if the tuples have the same timep oint This proves the theorem

Join

Rememb er that r r r r But by commutativity of cartesian pro duct

F C

r r r r and therefore

r r r r r r r r

F C F C

3

Pro of of Theorem

Cartesian Pro duct

r r r r r r

Let at d t L U ANNr r r Then by denition of cartesian pro duct there

exist annotated tuples at d t L U ANNr r and at d t L U ANNr

such that d d d and L U L U L U Also as at ANNr r there

exist tuples at d t L U ANNr and at d t L U ANN r such that d d d

and L U L U L U

Ignoring the renaming of the attributes

But then ANN r r will contain a tuple at d t L U where d d d and

L U L U L U and ANN r r r will contain a tuple at d t L U

where d d d d d d d and L U L U L U L U L U

L U L U L U L U L U by asso ciativity of But then since

L U L U we get at a which proves the inclusion

r r r r r r

Let at d t L U ANNr r r Then by denition of cartesian pro duct there

exist annotated tuples at d t L U ANNr r and at d t L U ANNr

such that d d d and L U L U L U Also as at ANN r r there

exist tuples at d t L U ANNr and at d t L U ANNr such that

d d d and L U L U L U

But then ANNr r will contain a tuple at d t L U where d d d and L U

L U L U and ANNr r r will contain a tuple at d t L U where

d d d d d d d and L U L U L U L U L U

L U L U L U L U L U by asso ciativity of But then since

L U L U we get at a which proves the inclusion and the theorem

Join

Let R R and R b e the lists of data attributes for tprelations r r and r resp ectively

By denition of join

r r r r r r

aRa RR R

aRR

r r r

RR RR RR R aRa RR R aRa

aRR RR aRR

by pushing selection through pro jection Theorem proved b elow and pushing selection through cartesian

pro duct Theorem proved b elow

r r r

RR RR RR R RR R aRa aRa

aRR RR aRR

r r r

RR RR RR RRR R aRa

aRR RR RR

Similar reasoning leads to

r r r r r r

RR R RRRR R R aRa

aRR R R R R

By asso ciativity of cartesian pro duct r r r r r r

We will now show that the selection conditions and pro jection lists in the two expressions ab ove

are equivalent

Pro jection lists

R R R R R R R R R R R R R R R R

R R R R R R R R R R R R R R R R R

R R R R R R R R R R R R R R R R R

R R R R R R R R R R R R R R R R

R R R R R R R R R R R R R R

R R R R R R R R R R R R R

R R R R R R R R R R R R R R R R

R R R R R R R R R R R R R R R R

R R R R R R R R R R R R R R R R R

R R R R R R R R R R R R R R R R R

R R R R R R R R R R R R R R R R R

R R R R R R R R R R R R R R R R R

Selection conditions

R R R R R R R R R R R R R R R R

R R R R R R R R R R R R R R R R

R R R R R R R R R R R R R R R R R R R

R R R R R R R R R R R R R R R R R R

R R R R R R R R R R R R R R R R

R R R R R R R R R R R R R R R R R

R R R R R R R

Combined the commutativity of the cartesian pro duct and the equivalence of the pro jection

lists and selection conditions prove the asso ciativity of join 3

Pro of of Prop osition

r r r r

Let at d t L U ANNr r We need to show at r r By denition of intersection

r and hence ANNr is compact Therefore at is the only tuple in ANNr r for the pair

d t By denitions of intersection and compaction we infer that there must b e annotated tuples

at d t L U ANNr r i n n such that fL U j i ng L U

i i i i i

But then by denition of multiset intersection each at must b elong to either ANN r or

i

ANNr with at least one tuple in each relation Without loss of generality assume that

at at ANNr and at at ANNr for some r n But then by

r r n

denition of multiset union all at b elong to ANNr r and no other annotated tuple for

i

the pair d t b elongs to ANNr r Therefore by denitions of the union and compaction

ANNr r will contain tuple at d t L U where L U fL U j i ng But

i i

then L U L U and therefore at at ie at ANNr r

r r r r r

In short r r will contain tuples from b oth r and r while r r r by denition of dierence

will contain only tuples from r

To b e more sp ecic consider two annotated tuples at d t L U ANNr and at

d t L U ANNr By denition of r r ANNr r will contain b oth at and at

On the other hand ANNr r will contain no tuple for the pair d t as at ANNr and

at ANN r Therefore ANNr r r will contain at as at ANNr and no tuple for

d t is in ANNr r However as at ANN r at ANN r r r

r r r r r r r

r r r r r r r

Let at d t L U ANNr r r Then at ANNr and there is no tuple in

ANNr r which would contain the d t pair Thus two p ossibilities have to b e considered

i neither ANNr nor ANNr contain tuples for the d t pair and ii one of the two relations

either ANN r or ANN r contains a tuple at d t L U

In case i b oth ANN r r and ANNr r will contain at and therefore ANNr r

r r will contain at

In case ii assume for simplicity that ANNr contains tuple at as ab ove the other case is

symmetric ANNr r will contain at while ANNr r wont But then at ANNr

r r r

r r r r r r r

Let now at d t L U ANNr r r r Then either at ANNr r or

at ANNr r or at is in b oth ANNr r and ANNr r

Let at ANNr r and at ANNr r the case when at ANNr r and at ANN r

r is symmetric Then at ANNr and ANN r contains no tuple at d t L U Also

as at ANN r it has to b e the case that ANNr contains some tuple at d t L U

However as no tuple for d t is in ANN r ANNr r will also contain no tuple for d t and

therefore ANNr r r will contain at

Now in the case when b oth ANNr r and ANNr r contain at we know that at ANNr

and neither ANNr nor ANN r contains any tuples for the pair d t Then ANNr r

will contain no tuple for d t either and therefore at ANNr r r

r r r r r r r

r r r r r r r

Let at d t L U ANNr r r Then at ANNr and there is no tuple in

ANNr r which would contain the d t pair Therefore neither ANNr nor ANNr

contain any tuple for d t But then b oth ANNr r and ANN r r will contain at and

therefore so will ANNr r r r

r r r r r r r

Let now at d t L U ANN r r r r Then at ANNr r and at

ANNr r and therefore ar ANNr and no tuple in either ANNr or ANNr will

contain the pair d t But then ANNr r will contain no such tuple as well and b ecause

at ANNr at ANNr r r 3

Pro of of Theorem

We prove this theorem for atomic constraints C Then by Theorem from this will also hold

for nonatomic constraints For an atomic constraint C three cases are p ossible

constraint In this case the statement of the theorem is true since a similar statement C is a data

is true in classical relational algebra The TPcase eld of r is not touched by either F of C

and hence b oth selection and pro jection on r will b ehave exactly as they would in the absence

of the TPcase

C is a temp oral constraint We know that F contains only data elds in it We also know that

C do es not refer to any data eld It is easy to see from this that the statement of the theorem

holds

Indeed let at d t l u b e an annotated tuple from AN N r Since C is a temp oral

C F

constraint we know that t satises C Since AN N r AN N r we know that at

C

AN N r Since AN N r AN N r there exists an annotated tuple at

F F F

d t l u AN N r such that d P d But then since t satises C at AN N r

C

and therefore at AN N r

F C

Going the other way let at d t l u AN N r Then AN N r contains

F C C

an annotated tuple at d t l u where d P d Since AN N r AN N r

C

at AN N r and therefore at AN N r But at AN N r implies that t

F C

satises C and therefore it must b e the case that at AN N r

C F

constraint As in the case ab ove F contains only data elds while C refers C is a probabilistic

only to the contents of the TPcase eld Applying reasoning similar to the case of temp oral

constraints we can establish that the statement of the theorem is true in this case as well 3

Pro of of Theorem

r r

C C

Let at d t l u AN N r We know that at satises C and as C is either a data

C

or temp oral constraint we know that any annotated tuples at d t l u will also satisfy

C

As at satises C and since AN N r AN N r for any tprelation r we know

C C

that at AN N r By the prop erty of Conservativeness of there exists at least one

annotated tuple at d t l u AN N r Now let AN N r d t fat at g

n

i nat d t l u We know that l u fl u l u g As it was noticed

i i i n n

ab ove all at AN N r d t satisfy C and therefore AN N r d t AN N r But as

C

AN N r AN N r AN N r d t AN N r d t Therefore at d t l u will

C C

b e in AN N r which means that r r

C C C

r r

C C

Let at d t l u AN N r By the Conservativeness prop erty of op eration

C

we know that there exists at least one annotated tuple tp d t l u AN N r

C

Let now AN N r d t fat at g where for all i n at d t l u and

C n i i i

l u fl u l u g We know that all these annotated tuples satisfy C Also since

n n

r AN N r AN N r d t for any tprelation r and selection condition C AN N

C

C

AN N r But since C is a data or temp oral constraint we know that all annotated tuples in

the set AN N r d t must satisfy it Therefore AN N r d t AN N r d t

C

From the latter equality and the fact that l u fl u l u g we get that at

n n

d t l u AN N r But we also know that at satises C as would any annotated tuple

with data part d and temp oral part t Therefore at AN N r which means that

C

r r

C C

As you may notice the key step in the pro of was the fact that in a tprelation r for any data

part d and timep oint t it was true that AN N r d t AN N r d t for temp oral or data select

C

conditions C This may not b e true in the case when select condition is probabilistic Therefore the

statement of the theorem ab ove is not true for probabilistic select conditions 3

Pro of of Theorem

We break the pro of of this fact into two parts r r r r

C C

AN N r r AN N r AN N r

C C

Let at d t l u AN N r r Clearly at AN N r r As we know

C

Theorem AN N r r AN N r AN N r hence at d t l u

AN N r AN N r Then by the denition of cartesian pro duct on annotated relations

there exist two annotated tuples at d t l u AN N r and at d t l u

AN N r r AN N r such that d d d and l u l u l u As at

C

we know that d satises C But since the only elds mentioned in C are those from

the relational schema of r and AN N r it must b e the case that d satises C as d is

AN N r AN N r r and therefore at the part of d But then at

C C

AN N r AN N r AN N r r

C C

AN N r AN N r Going the other way we assume that at d t l u

C

AN N r Clearly then there exist two tuples at d t l u AN N r and at d t l u

C

AN N r we also know such that d d d and l u l u l u As at

C

that at AN N r But then at AN N r AN N r AN N r r Also since

AN N r we know that d satises C Therefore since C contains only ref at

C

erences to the elds from the relational schema of r AN N r d also satises C and

AN N r r therefore at

C

r r r r

C C

The pro of of this part of the theorem is symmetric to the pro of of part

r r r r

C C C

We prove this statement similarly to the pro of of the statement in part 3

Pro of of Theorem

Let F b e the list of join attributes of r and r and C b e the join condition Then

r r r r r r r r

C F C C F C F C C C

r r r r

C C F C

r r r r r r r r

C F F C C C F

C C C

r r r r

C C F C

r r r r r r r r

C F F C C C F

C C C

r r r r

C C C C F C

3

Pro of of Theorem

r r C r r r r r r

C C C C

First we prove the statement of the theorem for multiset intersection

r r r r

C C C

Let at d t l u AN N r r Then at satises C and as C is a data or temp oral

C

condition so would any annotated tuple at with data part d and temp oral part t As

AN N r r AN N r r at r r Two cases are p ossible i at AN N r

C

and ii at AN N r We will consider case i the remaining case is symmetric

As at r we know that there has to b e at least one tuple at d t l u AN N r

As we have noticed ab ove at satises C In this case at AN N r and at

C

AN N r But then at AN N r AN N r which shows that r r

C C C C

r r

C C

r r r r Let at d t l u AN N r r AN N r

C C C C C C

AN N r There are two p ossibilities i at AN N r and ii at AN N r

C C C

We will consider the rst one pro of for the second case is symmetric

As at AN N r we know that at satises C and also that at AN N r Also as

C

at AN N r AN N r there has to b e at least one tuple at d t l u

C C

AN N r Clearly at also satises C and at AN N r But in this case fat at g

C

AN N r r and as at satises C at AN N r r proving that r r

C C

r r 3

C C

Now r r r r and therefore using the result of the Theorem we get r r

C

r r r r r r r r 3

C C C C C C

r r r r

C C C

First we prove the statement for multiset union

r r r r

C C C

Let at d t l u AN N r r In this case at satises C and at AN N r r

C

Two p ossibilities exist i at AN N r and ii at AN N r We will consider the

rst one here the pro of for the other one is symmetric

As at AN N r and at satises C at AN N r and therefore at AN N r

C C

AN N r AN N r AN N r This proves the desired subset inclusion

C C C

Let at d t l u AN N r r AN N r r r r r

C C C C C C

AN N r As in the previous case two p ossible situations exist i at AN N r

C C

and ii at AN N r We consider the rst p ossibility the pro of for the second one

C

will b e symmetric

As at AN N r at satises C and at AN N r Then at AN N r AN N r

C

AN N r r As at satises C we concluded that at AN N r r 3

C

Now r r r r and therefore using the result of the Theorem we get r r

C

r r r r r r r r 3

C C C C C C

r r r r r r

C C C C

We prove the rst equality Second equality can b e proven similarly

r r r r Let at d t l u AN N r r Then at satises

C C C C

C and at AN N r r In this case at r and there are no tuples of the form

d t l u in r

As at r and at satises C at AN N r Also we know that no tuple of the

C

form d t l u is in AN N r But then at AN N r AN N r

C C C

AN N r r which proves the desired inclusion

C C

Let at d t l u AN N r r In this case r r r r

C C C C C

at AN N r and no tuple of the form d t l u is in AN N r Clearly at

C C

satises C and as C is either a data or a temp oral constraint any tuple of the form

d t l u would satisfy C As r r at r Also we know that r do es not

C

contain any tuples of the form d t l u since otherwise r would have contained

C

them But then at AN N r r and therefore at AN N r r which shows

C

that r r r r 3

C C C

Pro of of Theorem

r r r r

C C C C

As ANN r ANNr and ANN r ANNr the r r r

C C C C C

r inclusion is immediate

C

We now concentrate on proving r r r r

C C C C

Let C L x The pro of for L x U x U x is similar Let at d t L U

ANN r r Then L x and at ANNr r By denition of cartesian pro duct

C

there exist annotated tuples at d t L U ANNr and at d t L U ANNr

such that L U L U L U

But by the Bottomline axiom for probabilistic conjunction strategies L minL L and

hence L x and L x Therefore b oth at and at satisfy C and at ANN r

C

ANN r But then at ANN r sig ma r and as at satises C and at

C C C

at ANN r r 3

C C C

r r r r

C C C C

r r r r r r r r

C F F C C C F

C C C

r r r r r r

C C C C C F C C C C F C C F C

r r 3

C C C

Pro of of Theorem

r r r r r r

C pc C pc pc C

Let C L x The pro ofs for L x U x and U x are similar

r r r r r r

C pc C pc pc C

Let at d t L U ANN r r Then at satises C ie L x and at

C pc

ANNr r By denition of cartesian pro duct there exist two annotated tuples at

pc

d t L U ANNr and at d t L U ANNr such that L U L U

pc

L U min L L minU U Two cases are p ossible

L L L L In this case at d t L U and at satises C therefore at

ANN r But then from the ab ove at ANN r r

C C pc

L L L L Reasoning analogously to case we get at ANNr r

pc C

Combining two cases together we get at ANN r r or at ANN r r But

C pc pc C

then at ANN r r r r

C pc pc C

r r r r r r

C pc C pc pc C

Let at d t L U ANN r r r r By denition of multiset union

C pc pc C

two cases are p ossible

at ANN r r In this case there exist two annotated tuples at d t L U

C pc

ANN r and at d t L U ANNr such that L U L U L U

C pc

minL L minU U As at ANNsig ma r at satises C and therefore L x

C

But then as L minL L L we get L x and therefore at satises C Also as

at ANN r at is also in ANNr But them at ANNr r As we have established

C pc

that at satises C at ANN r r

C pc

at ANNr r and at ANN r r In this case there exist two anno

pc C C pc

tated tuples at d t L U ANN r and at d t L U ANNr such that

C

L U L U L U minL L minU U Because at ANN r r

pc C pc

at ANN r and therefore at do es not satisfy C ie L x On the other hand

C

at ANN r and therefore i at ANNr and ii at satises C ie L x From

C

the latter and L x we conclude L L and therefore L minL L ie L L But

then L x and at satises C as at ANNr at will b e contained in ANN r r and as

pc

at satises C at ANN r r 3

C pc

r r r r r r

C pc C pc pc C

Let F b e the list of join attributes of r and r and let C b e the appropriate join condition

Then

r r r r r r r r

C pc pc F pc F C C pc C F

C C C

r r r r r r r r r

C pc pc C F C C pc C C pc pc C F C F C

r r r r r r 3 r

pc C C pc pc C F

C

Pro of of Theorem

x x

r r r r r r

MINLr

Lx Lx in Lx in in

L L

L

MINLr

MINLr

x

x

r r

Lx in

L

MINLr

We show the rst equivalence

x

r r r r

Lx in Lx in

L

MINLr

Let at d t L U ANN r r Then at satises the selection condition ie

Lx in

L x and also at ANNr r Then by denition of cartesian pro duct there exist two

in

annotated tuples at d t L U ANNr and at d t L U ANN r such that

x

L U L U L U L L U U As L x L L x and therefore L

in

L

x x

x

But then at ANN As MINLr L L r and therefore

L

L MINLr

MINLr

x x

r as at satises L x at ANN r r at ANNr

Lx in in

L L

MINLr MINLr

x

r r r r

Lx in Lx in

L

MINLr

x

r Then at satises L x condition Let at d t L U ANN r

Lx in

L

MINLr

x

and at ANNr r By denition of cartesian pro duct there exist two

in

L

MINLr

x

r annotated tuples at d t L U ANN r and at d t L U ANN

L

MINLr

such that L U L U L U L L U U Than at ANN r But then

in

at ANNr r and as at satises L x at ANN r r 3

in Lx in

x x x

r r r r r r

in Lx Lx in Lx in

L L L

MINLr MINLr

MINLr

x

r r

Lx in

L

MINLr

We show the second equivalence

x x

r r r r

in Lx in Lx

L L

MINLr

MINLr

Let at d t L U ANN r r Then at satises the selection condition ie

Lx in

L x and also at ANNr r Then by denition of cartesian pro duct there exist two

in

annotated tuples at d t L U ANNr and at d t L U ANN r such that

x

L U L U L U L L U U As L x L L x and therefore L

in

L

x x x

and MINLr L As MINLr L L Reasoning similarly L

L L

MINLr

x

yield L

MINLr

x x

r Therefore r and at ANN But then at ANN

L L

MINLr

MINLr

x x

r r As at satises L x at ANN

in

L L

MINLr

MINLr

x x

at ANN r r

Lx in

L L

MINLr

MINLr

x x

r r r r

in Lx in Lx

L L

MINLr

MINLr

x x

r r Then at satises L Let at d t L U ANN

in Lx

L L

MINLr

MINLr

x x

r By denition of cartesian x condition and at ANN r

in

L L

MINLr

MINLr

x

pro duct there exist two annotated tuples at d t L U ANN r and at

L

MINLr

x

r such that L U L U L U L L U U d t L U ANN

in

L

MINLr

Than at ANN r and at ANNr But then at ANN r r and as at satises L x

in

at ANN r r 3

Lx in

x x x

r r r r r r

in U x U x in U x in

U U U

MINU r MINU r

MINU r

x

r r

U x in

U

MINU r

We prove the third equivalence

x

r r r r

in Lx in Lx

L

MINLr

Let at d t L U ANN r r Then at satises the selection condition ie

Lx in

L x and also at ANNr r Then by denition of cartesian pro duct there exist two

in

annotated tuples at d t L U ANNr and at d t L U ANN r such that

x

L U L U L U L L U U As L x L L x and therefore L As

in

L

x x

x

r and therefore at MINLr L L But then at ANN

L

L

MINLr

MINLr

x x

ANN r r and as at satises L x at ANN r r

in Lx in

L L

MINLr MINLr

x

r r r r

Lx in Lx in

L

MINLr

x

r r Then at satises L x condition Let at d t L U ANN

in Lx

L

MINLr

x

r r By denition of cartesian pro duct there exist two and at ANN

in

L

MINLr

x

annotated tuples at d t L U ANN r and at d t L U ANN r

L

MINLr

such that L U L U L U L L U U Than at ANN r But then

in

at ANNr r and as at satises L x at ANN r r 3

in Lx in

x x x

r r r r r r

in U x U x in U x in

U U U

MINU r MINU r

MINU r

x

r r

in U x

U

MINU r

Pro of of the rst equivalence is analogous to the pro of in part of this theorem Pro of of

the second equivalence is analogous to the pro of in part of this theorem Pro of of the third

equivalence is analogous to the pro of in part of this theorem 3

Pro of of Theorem

Let at d t L U at d t L U b e all annotated tuples in ANNr for the pair d t

n n n

Let F b e a list of manifest attributes and let d d Then ANN r will contain tuples

F F

at at where at d t L U for i n But then ANN r will contain the

i i F

n

i

tuple at d t L U where L U L U L U

n n

Now consider ANN r This relation will contain annotated tuple at d t L U where

L U L U L U L U But then ANN r will contain the tuple at

n n F

d t L U at The latter equality proves the theorem 3

Pro of of Theorem

Intersection We show the statement of the theorem for multiset intersection Then by Theorem

it will also hold for intersection

r r r r

F F F

Let at d t L U ANN r r Then by denition of pro jection there exists an

F

annotated tuple at d t L U ANN r r such that d d As at ANN r r

F

either at ANNr or at ANN r Consider the former case the latter is symmetric at

ANNr r and at ANNr implies that there exists an annotated tuple at d t L U

ANNr r such that at ANN r But then ANN r contains the tuple at

F

d t L U and ANN r contains the tuple at d t L U As at and at are data

F

and timeidentical ANN r r will contain b oth at and at

F F

r r r r

F F F

Let at d t L U ANN r r Then either at ANN r or at

F F F

ANN r Consider the rst case the other case is symmetric

F

at ANN r and at d t L U ANN r r implies that there exists an

F F F

annotated tuple at d t L U ANN r r such that at ANN r

F F F

But then ANNr contains the tuple at d t L U and ANN r contains the tuple at

d t L U where d d Therefore as at and at are data and timeidenticals b oth at

F

and at are contained in ANNr r But then ANN r r will contain at d t L U

F

which proves the theorem 3

Union We prove the theorem for multiset union Then by Theorem it will also hold for

union

r r r r

F F F

Let at d t L U ANN r r Then by denition of pro jection there exists an

F

annotated tuple at d t L U ANN r r such that d d As at ANN r r

F

either at ANN r or at ANN r Consider the former case the latter is symmetric As at

ANNr ANN r contains the tuple at d t L U and therefore ANN r r

F F F

also contains at

r r r r

F F F

Let at d t L U ANN r r Then either at ANN r or at

F F F

ANN r Consider the rst case the other case is symmetric

F

As at ANN r ANNr contains the tuple at d t L U where d d There

F F

fore as at is contained in ANNr r But then ANN r r will contain at d t L U

F

which proves the theorem 3

Dierence

r r r r

F F F

Let at d t L U ANN r r Then by denition of pro jection ANN r r contains

F

the tuple at d t L U where d d Then by denition of dierence at ANNr

F

and ANNr contains no tuple for the pair d t But then at at ANN r and

F F

ANN r contains no tuple for the pair d t Therefore at ANN r r

F F F

r r r r

F F F

Let at d t L U ANN r r Then at ANN r and ANN r

F F F F

contains no tuple for the pair d t But then ANNr contains the tuple at d t L U where

d d and ANNr contains no tuple for the pair d t This means that at ANNr r

F

and therefore at ANN r r 3

F

Pro of of Theorem

r r r r

F

F F F

r r r r

F

F F F

r r Then by denition of pro jection there exists an Let at d t L U ANN

F F

d By denition of annotated tuple at d t L U ANN r r such that d

F F

cartesian pro duct there exist two annotated tuples at d t L U ANN r and at

d t L U ANNr such that d d d and L U L U L U But then

r contains the r contains the tuple at d t L U where d d and ANN

F F

F

r will r But then ANN r tuple at d t L U where d ANN

F F F

contain the tuple at d d t L U where L U L U L U We notice now

d d F F d d that L U L U and that d d d

F

F F F

r d Therefore at at and at ANN r

F F

r r r r

F F F F

r Then by denition of cartesian pro duct there Let at d t L U ANN r

F F

r exist annotated tuples at d t L U ANN r and at d t L U ANN

F

F

such that d d d and L U L U L U But then there exists an annotated

tuple at d t L U ANN r and an annotated tuple at d t L U ANNr

d But then ANNr r contains the tuple at such that d d and d

F F

d t L U where d d d and L U L U L U L U Then by

r r contains the tuple at d t L U where d denition of pro jection ANN

F F

d d d d Therefore at at and d d athcal F d d

m F F F F F

r r which completes the pro of 3 at ANN

F F

r r r r

F F F F

Let G denote the list of attributes to remain in the join and let C denote the join condition

Recall that G F and G F Then

r r r r r r r

C G C F F G C G F F F F F F

r 3 r r r r

F G C F

F F