Probabilistic Temporal Databases, II: Calculus and Query Processing 1

Probabilistic Temp oral Databases I I Calculus and Query Pro cessing y z x Alex Dekhtyar Fatma Ozcan Rob ert Ross VS Subrahmanian Abstract There is a vast class of applications in which we know that a certain event o ccurred but do not know exactly when it o ccurred However as studied by Dyreson and Sno dgrass there are many natural scenarios where probability distributions exist and quantify this uncertainty Dekhtyar et al extended Dyreson and Sno dgrasss work and dened an extension of the relational algebra to handle such data The rst contribution of this pap er is a declarative temp oral probabilistic TP for short calculus which we show is equivalent in expressive p ower to the temp oral probabilistic algebra of Our second ma jor contribution is a set of equivalence and containment results for the TPalgebra Our third contribution is the development of cost mo dels that may b e used to estimate the cost of TPalgebra op erations Our fourth contribution is an exp erimental evaluation of the accuracy of our cost mo dels and the use of the equivalence results as rewrite rules for optimizing queries by using an implementation of TPdatabases on top of ODBC Intro duction Dyreson and Sno dgrass pioneered the study of uncertainty in temp oral databases where statements of the form Event e o ccurred or will o ccur at some time p oint in the time interval t t are p ermitted Such statements are common For instance a shipp er like Federal Express may tell customers that their package will b e delivered within hours of drop o In such a case if the smallest unit of time ab out which reasoning is p erformed is a minute then there are minutes at which the package may b e p ossibly delivered However over this timeframe there is a probability distribution reecting the probability that the package will b e delivered precisely t minutes after drop o Such a probability distribution may b e skewed eg the probability that the package is delivered within hours is probably zero if the package has to make its way from Seattle to Boston Dyreson and Sno dgrass develop ed this idea substantially to provide a framework for reasoning ab out temp oralprobabilistic data They also provided applications in a variety of other areas including carb ondating of historical records and sto ck market analysis For example there are literally hundreds of programs that make sto ck market predictions most prediction mo dels This work was supp orted in part by the Army Research Lab under contract numb er DAALK the Army Research Oce under grant numb er DAAD by DARPARL contract numb er F and by the ARL CTA on Advanced Decision Architectures y Department of Computer Science University of Kentucky Lexington KY Email dekhtyarcsukyedu z Department of Computer Science University of Maryland College Park MD Email fatmacsumdedu x Department of Computer Science University of Maryland College Park MD Email robrosscsumdedu Department of Computer Science University of Maryland College Park MD Email vscsumdedu are uncertain and provide probabilistic outputs In the same vein statistical mo dels that track p erformance of machines and machine parts on a factory o or yield probabilistic estimates of when the parts will need to b e repaired andor replaced Decision making programs for such applications typically execute calls to the results of suc predictions Following on the imp ortant work of Dyreson and Sno dgrass work Dekhtyar et al develop ed a temp oral probabilistic algebra TPA which eliminated many of the assumptions in the framework of We start this pap er with a brief overview of TPdatabases in Section We then move on to our sp ecic contributions First in Section we develop a temp oral probabilistic calculus TPcalculus for short We show that this calculus which is similar to the safe relational calculus has the nice prop erty that it is equivalent in expressive p ower to the TPalgebra Second in Section we develop a set of query equivalence results in the TPalgebra These equivalences automatically yield a set of rewrite rules that a query optimizer for TPdatabases might use Third in Section we develop mechanisms to estimate the cost of executing a TPalgebra query Even though TPalgebra op erations are implemented on top of a relational database these op erations are not mere relational algebra queries implementing them on top of the relational algebra involves writing a C program that includes emb edded SQL calls Costing such programs builds on top of cost mo dels of relational op erators but involves taking into account the sp ecic asp ects of the programs themselves Our cost mo dels involve identifying a set of statistics to b e stored ab out the data as well as metho ds to compute such statistics for intermediate results obtained during query pro cessing Fourth in Section we conduct a set of exp eriments to evaluate three things Our rst goal is to assess the accuracy of our cost mo del Our second goal is to assess the eectiveness of our rewrite rules The third goal is to assess the eciency of a query optimizer based on this cost mo del and these rewrite rules Exp eriments were run using our implementation of TPdatabases which is built on top of ODBC using Paradox as a backend We compare our work with related work in Section and nally conclude with directions for further research Preliminaries TPDatabases Overview In this section we recapitulate the notions of a Temp oral Probabilistic database TPdatabase and the Temp oral Probabilistic Algebra TPA Full details may b e found in Temp oralProbabilistic DB Mo del To make statements ab out when certain events o ccurred or when certain facts werewill b e true TPdatabases use a calendar cf Sno dgrass and So o and temp oral constraints Calendar A calendar consists of an ordered list of time units For example day month year is a p ossible calendar Each time unit has a nite domain asso ciated with it Given a calendar over time units tu tu a time point is an expression of the form v v where each v is in n n i the domain of tu For instance wrt the example calendar is a time p oint i Temp oral constraint Temporal constraints are inductively dened i If tu is a time unit op f g and v domtu then tu op v is a temp oral constraint over ii If t t are time p oints in and t t then t t is a temp oral constraint over iii If C and C are temp oral constraints then so are C C C C and C The syntax t t is a shorthand for the constraint t t t We abuse notation and write t instead of t t Solutions of temp oral constraints denoted solC are dened in the obvious way For example the constraint has as its solutions all time p oints b etween pm to pm as well as all time p oints b etween pm to pm Probability Distribution Function PDF Let S b e the set of all temp oral constraints over calendar Then the function S is a PDF if D S t solD D t P D t Furthermore is a restricted PDF if D S tsolD This denition of a PDF is rich enough to capture almost all probability mass functions eg uniform geometric binomial Poisson etc studied in classical probability theory Furthermore probability density functions can b e approximated by PDFs via a pro cess of quantization TPcase A TPcase over calendar is a tuple hC D L U i where i C and D are temp oral constraints over ii solC solD iii L U and iv is a restricted PDF For example h ui is a TPcase where is a uniform distribution Intuitively C sp ecies the time p oints when an event is valid while D sp ecies the time p oints that are distributed by Since solC solD it follows that assigns a probability interval to each time p oint t sol C Sp ecically let PrhC D L U i t denote D t L U Alternatively could b e dened as where is a restricted PDF is a unrestricted L U L U PDF and t solC D t L D t U This generalization is useful when the lower L U and upp er probability b ounds do not follow the same distribution Here PrhC D L U i t L U denotes the probability interval D t L D t U Although extending to a pair of PDFs L U is straightforward we avoid this redenition in order to maintain b etter compatibility with TPtuple A TPtuple over relation scheme R A A and calendar is a pair d where k d is a relational k tuple over R and is a nonempty TPcase statement over ie is a set of TPcases over where i j sol C C i j i j In the following supp ose R is a relation scheme and is a calendar over tu tu n TPtable A TPtable over R is a multiset of TPtuples over R TPrelation A TPrelation r over R is a TPtable over R where R is a sup erkey for r TPdatabase A TPdatabase over is a set of TPtables over Annotation The annotation of TPrelation r over R pro duces a relation ANNr over relation scheme R tu tu L U where domL domU and n t t t t ANNr fd t L U j d r t sol C L U Pr tg t t t t Equivalence TPrelations r and r are equivalent denoted r r i ANNr ANNr Example Base TPrelations The following TPrelations are named TrainDep and BusArr TrainNo From To C D L U Baltimore New York u g BusNo From To C D L U Ro ckville Baltimore b u The second TPcase in TrainDep says that there is a probability that train numb er from Baltimore to New York will depart b etween and Furthermore given any time p oint t in this interval the probability that the train will depart at exactly time t is b etween D t g and D t

Load more