Probabilistic Databases with MarkoViews∗

Abhay Jha and Dan Suciu
Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350
[email protected], [email protected]

ABSTRACT

Most of the work on query evaluation in probabilistic databases has focused on the simple tuple-independent data model, where tuples are independent random events. Several efficient query evaluation techniques exist in this setting, such as safe plans, algorithms based on OBDDs, tree-decomposition, and a variety of approximation algorithms. However, complex data analytics tasks often require complex correlations, and query evaluation then is significantly more expensive, or more restrictive.

In this paper, we propose MVDB as a framework both for representing complex correlations and for efficient query evaluation. An MVDB specifies correlations by views, called MarkoViews, on the probabilistic relations, declaring the weights of the views' outputs. An MVDB is a (very large) Markov Logic Network. We make two sets of contributions. First, we show that query evaluation on an MVDB is equivalent to evaluating a Union of Conjunctive Queries (UCQ) over a tuple-independent database. The translation is exact (thus allowing the techniques developed for tuple-independent databases to be carried over to MVDBs), yet it is novel and quite non-obvious (some resulting probabilities may be negative!). This translation in itself, though, may not lead to much gain, since the translated query gets complicated as we try to capture more correlations. Our second contribution is to propose a new query evaluation strategy that exploits offline compilation to speed up online query evaluation. Here we utilize and extend our prior work on compilation of UCQs. We validate our techniques experimentally on a large probabilistic database with MarkoViews inferred from the DBLP data.

∗This work was supported in part by the NSF grants IIS-1115188, IIS-0915054, IIS-0911036.

1. INTRODUCTION

The task of analyzing and extracting knowledge from large datasets often requires probabilistic inference over a complex probabilistic model on the data. This step represents a major challenge. Most of the scalable query processing techniques developed for probabilistic databases assume that the tuples are independent events, or disjoint-independent [4, 1, 24]. For example, MystiQ, MayBMS, and SPROUT report running times of a few seconds on databases of tens of millions of tuples, using a combination of techniques such as safe plans [7], plan re-orderings and functional dependencies [23], Monte-Carlo simulation combined with top-k optimization [28], or approximate confidence computation that trades off precision for performance [24]. These systems scale to quite large databases, but are limited to independent, or disjoint-independent, probabilistic databases.

Tuple-independent probabilistic databases are insufficient for analyzing and extracting knowledge from practical datasets. As has been shown in the Machine Learning community, modeling correlations is critical in complex knowledge extraction tasks. For example, in Markov Logic Networks (MLN) [29], users can assert arbitrary probabilistic statements over the data, in the form of First Order Logic sentences, and assign a weight. The sentence, called a feature, is expected to hold to a degree indicated by the weight. Each feature may introduce correlations between a large number of base facts, and thus the MLN can express, very concisely, a large Markov Network. MLNs have been demonstrated to be effective at a variety of tasks, such as Information Extraction [26], Record Linkage [31], and Natural Language Processing [27]. A benefit of MLNs is that the same framework can be used both for learning the weights, and for inferring probabilities of new queries.

In this paper we present a new approach for representing and querying probabilistic databases. Our data model combines probabilistic databases with MLNs: it consists of a collection of probabilistic (tuples are annotated with a probability) and deterministic tables, and a collection of views, called MarkoViews. A MarkoView is expressed by a Union of Conjunctive Queries (UCQ) over the probabilistic and deterministic tables, and associates a weight to each tuple in the answer; intuitively, it asserts a likelihood for that output tuple, and therefore introduces a correlation between all contributing input tuples. A MarkoView can be seen as a set of MLN features, and thus its weights can be learned as in MLNs; we do not address learning in this paper, but focus solely on inference, or query evaluation. We call a database consisting of probabilistic tables and MarkoViews an MVDB. The data model of MVDBs is significantly richer than that of tuple-independent probabilistic databases, which we denote with INDB.

We make two sets of contributions. First, we show how

to translate query evaluation over an MVDB into query evaluation over an INDB. More precisely, we express the probability P(Q) on an MVDB in terms of the probability P0(Q ∨ W), where W is a union of queries, one for each MarkoView; therefore W is a UCQ. To be precise, we show that P(Q) = P0(Q | ¬W) = (P0(Q ∨ W) − P0(W))/(1 − P0(W)). The probability P0 is on an INDB obtained from the MVDB through a simple, yet quite non-obvious, transformation, discussed in Sect. 3. Without going into the transformation details, we would like to mention that ¬W logically stands for the MarkoViews; hence, intuitively, the translation computes the probability of the query given that the views hold. Note that if Q is a UCQ, so is the translated query; hence, both the query and the probabilistic model are simple and come from well-understood domains. In particular, while there are very few results on the tractability of MLNs, and none complete, the set of tractable UCQs over INDBs is already known [8]. Therefore, the translation moves our problem into a domain that is well understood and allows for easier detection of tractable instances. We are aware of one more such translation [12] in the literature, but it leads to a more complicated query, which is no longer a UCQ. In contrast, our translation leads to both a simple query (UCQ) and a simple model (tuple-independent), where the complexity is well understood and the tractable cases are fully characterized.

Our second set of contributions is to devise an efficient query evaluation method for MVDBs. The task is to compute P0(Q ∨ W), where both Q and W are UCQs. Note that W depends only on the MarkoViews, and does not depend on the query Q. We describe a new index structure for W, called an MV-index, and show how to use it to compute P0(Q ∨ W). The MV-index consists of an Ordered Binary Decision Diagram (OBDD) [5], extended with additional information that is critical for computing P0(Q ∨ W) efficiently. In prior work [15] we have shown that a certain class of UCQ queries, called inversion-free queries, always have an OBDD whose size is linear in the size of the active domain of the database. Here, we extend that construction to an algorithm that constructs an OBDD for any W: in the particular case when W is inversion-free, the resulting OBDD is guaranteed to be linear in the size of the database. The OBDD itself is not sufficient for computing P0(Q ∨ W) efficiently: state-of-the-art synthesis algorithms require time proportional to the product of the sizes of the two OBDDs for Q and W respectively. We describe here how to extend the OBDD to an MV-index, then present a top-down evaluation algorithm for computing P0(Q ∨ W) that traverses only a small fraction of the MV-index.

Running Example We illustrate in Fig. 1 how MarkoViews can be used to add domain knowledge to DBLP [19], a database consisting of a few million tuples. We use an approach developed in the AI community for inferring new relations, such as advisor, or affiliation, from a database of citations [29, 30, 18]. Any MVDB has three parts.

First, the deterministic database. In our example it is described at the top of Fig. 1, and consists of four base tables, Author, Wrote, Pub, HomePage, and two materialized views: FirstPub(aid,year), which records for each author the year of her first publication, and DBLPAffiliation(aid,inst), which associates an affiliation to some authors.¹

Second, an MVDB has probabilistic tables, shown in the middle of Fig. 1: Studentp(aid,year) stores likely years when an author was a student, Advisorp(aid1,aid2) stores likely advisor/advisee relationships, and Affiliationp(aid,inst) records inferred affiliations based on co-authorship. Each probabilistic table is defined by a query, which also associates a weight to every output tuple; for example, Studentp(aid,year)[e^{1−.15(year−year')}] :- ... associates the weight w = e^{1−.15(year−year')} to its output. Weights are often preferred over probabilities when the probability function is a product of potential functions, as in MLNs and MVDBs. The intuition is that the weight w represents the odds of a probability p, w = p/(1 − p) (formal definition in Sect. 2).

Third, the MVDB contains a set of MarkoViews, which in our example are shown at the bottom of Fig. 1. Each MarkoView is a query over probabilistic tables, and its purpose is to define correlations between the tuples in those tables. It does this by defining a view over the probabilistic tables, then asserting a certain weight for the tuples in the view. Weights < 1 define a negative correlation, weights > 1 define a positive correlation, and a weight = 1 means independence. A weight = 0 means a hard constraint: the view must be empty. For example, the MarkoView V1 defines a correlation between a tuple in Studentp and a tuple in Advisorp: it states that the more papers two people co-author during the years when the second person was a student, the more likely it is that the first person was his advisor. V2 defines a hard constraint: each person can have only one advisor. Finally, V3 introduces positive correlations between common affiliations for people who published a lot together.

Consider now the following simple query on the MVDB: find all students advised by Sam Madden. The query, written over the MVDB, is shown in Fig. 2 (a). If the tuples in Studentp and Advisorp were independent random variables, then this query could be computed very efficiently, because it is a safe query [7]. However, MarkoViews introduce correlations between the probabilistic tuples. We show in this paper that the probability of an answer aid, P(Q(aid)), can be expressed in terms of the probability of a query P0(Q(aid) ∨ W) over a tuple-independent database. We give the exact formula in Fig. 2 (c). The new INDB has five tuple-independent probabilistic tables: Studentp and Advisorp (which have the same sets of possible tuples as in the MVDB, and with the same weights), and NV1p, NV2p, NV3p, which are three new, tuple-independent probabilistic tables, whose possible tuples are obtained from the MarkoViews V1, V2, V3, and whose weights are derived from the latter through the formula (1 − w)/w. Note that if w > 1, then the translated weight is negative, which, in turn, corresponds to a negative probability; thus, the INDB may have some tuples with negative probabilities! However, the expression for P(Q) is exact, hence its value is guaranteed to be in [0, 1]. We discuss the translation from MarkoViews to INDBs in Sect. 3 and, in particular, the implications of having negative probabilities in a database in Sect. 3.3.

Query evaluation on an MVDB reduces to evaluating formulas like Fig. 2 (c) on an INDB. The query-dependent part of this expression is P0(Q(aid) ∨ W).

¹We computed the institute from the person's Webpage, when it was available in DBLP. For example, both Luis Gravano (www.cs.columbia.edu/~gravano) and Ken Ross (www.cs.columbia.edu/~kar) have the same institute, www.cs.columbia.edu.

Tables obtained from DBLP                 Derived tables (standard views)
Table                   # Tuples          Table                        # Tuples
Author(aid, name)       1M                FirstPub(aid, year)          1M
Wrote(aid, pid)         4.5M              DBLPAffiliation(aid, inst)   18.7K
Pub(pid, title, year)   1.7M              HomePage(aid, url)           18.7K

Three probabilistic tables: Studentp, Advisorp, Affiliationp

  Studentp(aid,year)[exp(1-.15*(year-year'))] :-
      FirstPub(aid,year'), year' - 1 <= year <= year' + 5
  Description: aid was a student in that year if his first publication
  was not long before.                                          Size: 6M

  Advisorp(aid1,aid2)[exp(.25*count(pid))] :-
      Student(aid1,year), Wrote(aid1,pid), Wrote(aid2,pid),
      Pub(pid,title,year), not Student(aid2,year), count(pid) > 2
  Description: aid2 was aid1's advisor if they published enough papers
  together while aid1 was a student and aid2 was not.           Size: .25M

  Affiliationp(aid,inst)[exp(.1*count(pid))] :-
      Wrote(aid,pid), Wrote(aid2,pid), DBLPAffiliation(aid2,inst),
      Pub(pid,title,year), aid <> aid2, year > 2005,
      not DBLPAffiliation(aid,inst2)
  Description: aid's affiliation is inst if she published recently with
  people from inst.                                             Size: .27M

The MarkoViews in the MVDB

  V1(aid1,aid2)[count(pid)/2] :-
      Advisorp(aid1,aid2), Studentp(aid1,year), Wrote(aid1,pid),
      Wrote(aid2,pid), Pub(pid,title,year)
  Description: The more they published together while aid2 was a
  student, the more likely aid1 was his advisor.                Size: .25M

  V2(aid1,aid2,aid3)[0] :-
      Advisorp(aid1,aid2), Advisorp(aid1,aid3), aid2 <> aid3
  Description: A person has only one advisor.                   Size: .38M

  V3(aid1,aid2,inst)[count(pid)/5] :-
      Affiliationp(aid1,inst), Affiliationp(aid2,inst),
      Wrote(aid1,pid), Wrote(aid2,pid), Pub(pid,title,year),
      year > 2004, count(pid) > 30
  Description: If two people have published a lot together recently,
  then their affiliations are very likely to be the same.       Size: 1.5K

Figure 1: An illustration of MarkoViews on the DBLP database

While this requires standard evaluation of a query over a tuple-independent database, it can still be a major challenge: the reason is that the lineage of W is usually large, because it includes most probabilistic tuples and all tuples in the MarkoViews. By contrast, the lineage of Q is much smaller, because it has a selection predicate ("%Madden%"). This is why we use the strategy of compiling W offline. More precisely, we construct an MV-index, by first converting W into an OBDD, then adding appropriate pointer structures. Using the MV-index, query evaluation can be sped up dramatically: Q took 15 ms and returned results for all 48 advisors similarly named. We describe query compilation in Sect. 4, and describe experiments in Sect. 5.

2. DEFINITIONS

We use R to denote a relational schema with relation names R1, R2, ..., Rk. We assume each relation has a key.² A database instance I is a k-tuple (R1^I, ..., Rk^I), where Ri^I is an instance of the relation Ri; with some abuse of notation we drop the superscript I and write Ri for both the relation name and the instance of that relation.

2.1 Probabilistic Databases

A probabilistic database is a pair D = (W, P), where W = {I1, ..., IN} is a set of instances, called possible worlds, and P : W → [0, 1] is a function such that Σ_{j=1,N} P(Ij) = 1. Thus, the instance is not known with certainty: every possible world Ij has some probability, P(Ij). A relation Ri is called deterministic if it has the same instance in all possible worlds, Ri^{I1} = ... = Ri^{IN}; otherwise, we say that Ri is probabilistic, and we sometimes add the superscript p, writing Ri^p to indicate that Ri is probabilistic. For example, in Fig. 1 the deterministic relations are Author, Wrote, Pub, HomePage, and the probabilistic relations are Studentp, Advisorp, Affiliationp.

We denote Tup the set of possible tuples, i.e. the set of all tuples occurring in all possible worlds I1, ..., IN. The tuples in Tup include the relation name where they come from, e.g. the tuples R(a, b) and S(a, b) are considered distinct tuples in Tup. We associate to each tuple t ∈ Tup a Boolean random variable, denoted Xt: given a random world Ij, Xt = 0 if t ∉ Ij and Xt = 1 if t ∈ Ij. The probability of the event ∃j. t ∈ Ij is denoted P(t) or P(Xt), and is also called the marginal probability of the tuple t.

A query is denoted Q(x̄), where x̄ are called free variables, or head variables. The answer to Q on an instance I, Q(I), is the set of all tuples ā s.t. I |= Q(ā), where Q(ā) is the Boolean query obtained by substituting the head variables x̄ with the constants ā. The answer on a probabilistic database D is a set of pairs of the form (ā, p), where ā ∈ Q(I) for some possible world I, and:

    p = P(Q(ā)) = Σ_{i: ā∈Q(Ii)} P(Ii)

²As usual, if there is no natural key, then the set of all attributes constitutes a key.
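The possible-worlds semantics above is easy to make concrete. The following is a minimal sketch of ours (not from the paper; `worlds`, `marginal`, and `prob` are illustrative names) that represents a small probabilistic database as an explicit list of worlds and computes a tuple's marginal probability and a Boolean query's probability by direct enumeration.

```python
# A toy probabilistic database: each possible world is a frozenset of
# tuples, paired with its probability; the probabilities sum to 1.
worlds = [
    (frozenset({("R", "a"), ("S", "a", "b")}), 0.4),
    (frozenset({("R", "a")}), 0.3),
    (frozenset({("S", "a", "b")}), 0.2),
    (frozenset(), 0.1),
]

def marginal(t):
    """Marginal probability P(t): total mass of the worlds containing t."""
    return sum(p for w, p in worlds if t in w)

def prob(q):
    """P(Q) for a Boolean query q, given as a predicate on a world."""
    return sum(p for w, p in worlds if q(w))

def q(world):
    # Boolean query Q = R(x),S(x,y): some constant x appears both in
    # an R-tuple and as the first attribute of an S-tuple.
    rs = {t[1] for t in world if t[0] == "R"}
    ss = {t[1] for t in world if t[0] == "S"}
    return bool(rs & ss)

print(marginal(("R", "a")))  # 0.7
print(prob(q))               # 0.4
```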

    Q(aid) :- Studentp(aid), Advisorp(aid,aid1),
              Author(aid,n), Author(aid1,n1),
              n1 like '%Madden%'
                         (a)

    W1 :- NV1p(aid1,aid2), Advisorp(aid1,aid2),
          Studentp(aid1,year), Wrote(aid1,pid),
          Wrote(aid2,pid), Pub(pid,title,year)
    W2 :- NV2p(aid1,aid2,aid3), Advisorp(aid1,aid2),
          Advisorp(aid1,aid3), aid2 <> aid3
    W3 :- NV3p(aid1,aid2,inst), Affiliationp(aid1,inst),
          Affiliationp(aid2,inst), Wrote(aid1,pid),
          Wrote(aid2,pid), Pub(pid,title,year),
          year > 2004, count(pid) > 30
    W  :- W1 ∨ W2 ∨ W3
                         (b)

    P(Q(aid)) = (P0(Q(aid) ∨ W) − P0(W)) / (1 − P0(W))
                         (c)

Figure 2: A query Q over the MVDB (a); helper queries on the INDB (b); expressing the probability on the MVDB in terms of the probability on the INDB (c).

The queries we consider in this paper are Unions of Conjunctive Queries, denoted UCQ, which are expressions of the form Q(x̄) = Q1(x̄) ∨ ... ∨ Qm(x̄), where each Qi(x̄) is a conjunctive query, i.e. has the form ∃ȳi.ϕi(x̄, ȳi), where ϕi is a conjunction of positive relational atoms and/or inequality predicates, such as z < 5. We write queries in datalog notation, indicating the head variables. For example, Q(x) = R(x),S(x,y) denotes the query ∃y.R(x) ∧ S(x,y), while Q = R(x),S(x,y) denotes the Boolean query ∃x.∃y.R(x) ∧ S(x,y). With some abuse of notation, we allow the use of aggregates and negations, but only on the deterministic tables.³

2.2 Tuple-Independent Databases

A probabilistic database is tuple-independent if, for any set of possible tuples t1, t2, ..., tn, the events Xt1, Xt2, ..., Xtn are independent. We write D0 for a tuple-independent database, and also denote it with INDB. It is uniquely defined by a pair (Tup, p), where Tup is the set of possible tuples and p : Tup → [0, 1] is any function. The possible worlds are all subsets I ⊆ Tup, and their probabilities are P(I) = ∏_{t∈I} p(t) · ∏_{t∈Tup−I} (1 − p(t)).

Tuple-independent probabilistic databases are the simplest, and most intensively studied, types of probabilistic databases [33]. Even though the input tuples are independent, correlations are introduced during query evaluation, and query evaluation is, in general, #P-complete, e.g. for the Boolean query Q = R(x),S(x,y),T(y). However, these probabilistic databases are now well understood, and a complete characterization of UCQ queries into #P-complete and PTIME queries exists [8]. In addition, many practical methods for query evaluation on tuple-independent databases have been proposed in the literature (see Sect. 6).

2.3 Markov Logic Networks (MLNs)

A Markov Logic Network is a set L = {(F1, w1), ..., (Fm, wm)}, where each Fi is a formula in First Order Logic called a feature, and each wi is a number called the weight [29, 9]. The formulas are over a relational vocabulary R, and may have free variables, which are interpreted as being universally quantified. Let C be a finite set of constants. A grounding of a formula Fi is a formula of the form Fi[ā/x̄], where the free variables x̄ of Fi are substituted with some constants ā in C; let G(Fi) denote the set of groundings of Fi, and let G(L) = {(G, wi) | ∃(Fi, wi) ∈ L : G ∈ G(Fi)} be the grounded MLN. Let Tup be the set of ground tuples with the relation symbols in R and constants in C. The weight Φ(I) of a world I ⊆ Tup, and the function Z, are:

    Φ(I) = ∏_{(G,w)∈G(L): I|=G} w        (1)

    Z = Σ_{I⊆Tup} Φ(I)                   (2)

Definition 1. The semantics of an MLN L is the probabilistic database DL = (W, P), where W = {I | I ⊆ Tup} and P(I) = Φ(I)/Z for all I ⊆ Tup.

The intuition is the following. Any subset of tuples is a possible world, and its weight is the product of the weights of all grounded features that are true in that world. The probability is obtained by normalizing with Z. Note that a feature weight w > 1 means that worlds where the feature holds are more likely; w < 1 means that worlds where the feature holds are less likely; and w = 1 means indifference. A weight w = ∞ is interpreted as a hard constraint: only worlds that satisfy the feature are considered possible. This can be seen by letting w → ∞ in the expression P(I) = Φ(I)/Z.

MLNs have been used in several applications of Machine Learning [31, 26, 27, 9]. One reason for their popularity is that they use the same formalism both for learning (of the weights w) and for probabilistic inference. There are two types of inferences within MLNs: MAP (maximum a posteriori) inference, which computes the most likely world, and marginal inference, which sums the probabilities of all worlds satisfying the query. In this paper we only address the latter, but our solutions easily generalize to solve the MAP inference problem as well.

³For example, we used count(pid) in V3 in Fig. 1, meaning that the subquery consisting of the last two lines is first computed as a view over the deterministic tables, then the resulting view is used as a single table in V3; after this transformation, V3 becomes a conjunctive query over probabilistic and deterministic tables.
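To make Eqs. (1)–(2) and Definition 1 concrete, here is a small, self-contained sketch (our illustration, not code from the paper; `mln_distribution` is a hypothetical helper) that enumerates all worlds over a set of ground tuples, computes Φ(I) and Z by brute force, and recovers marginal probabilities. For the two-feature MLN of the example that follows, it reproduces the odds-to-probability relationship p = w/(1 + w).

```python
from itertools import combinations

def mln_distribution(tuples, features):
    """Brute-force semantics of an MLN (Definition 1).

    tuples:   list of ground tuples (hashable labels).
    features: list of (sat, w), where sat(world) says whether the
              grounded feature holds in the world, and w is its weight.
    Returns a dict {world (frozenset): probability}.
    """
    worlds = [frozenset(s) for r in range(len(tuples) + 1)
              for s in combinations(tuples, r)]
    def phi(world):                               # Eq. (1)
        p = 1.0
        for sat, w in features:
            if sat(world):
                p *= w
        return p
    z = sum(phi(world) for world in worlds)       # Eq. (2)
    return {world: phi(world) / z for world in worlds}

# Two tuples with features (R(a1), w1), (R(a2), w2):
w1, w2 = 3.0, 0.5
dist = mln_distribution(
    ["R(a1)", "R(a2)"],
    [(lambda I: "R(a1)" in I, w1), (lambda I: "R(a2)" in I, w2)])

marg = lambda t: sum(p for I, p in dist.items() if t in I)
print(marg("R(a1)"), w1 / (1 + w1))   # both 0.75
print(marg("R(a2)"), w2 / (1 + w2))   # both 0.3333...
```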

Tuple-Independent Databases Revisited Consider two possible tuples, R(a1) and R(a2), and the MLN consisting of the features (R(a1), w1), (R(a2), w2). There are four possible worlds, ∅, {R(a1)}, {R(a2)}, {R(a1),R(a2)}, and their weights Φ(Ii) are:

    1    w1    w2    w1w2

The partition function is Z = 1 + w1 + w2 + w1w2 = (1 + w1)(1 + w2). In this case the MLN defines a tuple-independent database, where the tuples R(a1), R(a2) have probabilities p1 = w1/(1 + w1) and p2 = w2/(1 + w2). Indeed, the four possible worlds have probabilities

    (1 − p1)(1 − p2)    p1(1 − p2)    (1 − p1)p2    p1p2

and this distribution is equivalent to the above (up to the multiplicative factor Z). More generally, we define:

Definition 2. A tuple-independent database, INDB, is a pair D0 = (Tup0, w0), where Tup0 is a set of possible tuples and w0(t) associates a real number to each tuple t.

This definition is equivalent to the one given earlier, by setting the tuple probability to p(t) = w0(t)/(1 + w0(t)). Note that in a tuple-independent database a weight w represents the odds, w = p/(1 − p). In other words, weight values of 0, 1, ∞ correspond to probabilities 0, 1/2, 1 respectively. From now on, unless otherwise stated, we will assume in the rest of this paper that an INDB is given as in Def. 2, that is, we are given the weights of the tuples, not their probabilities. The tuple's probability can always be recovered as w/(1 + w).

2.4 MVDBs

In this section we introduce Markov Views (MarkoViews) and Markov View Databases (MVDBs).

Definition 3. A MarkoView is a rule of the form:

    V(x̄)[wexpr] :- Q        (3)

where V is the view name, Q is a UCQ, x̄ are head variables, and wexpr is an expression representing a non-negative weight.

Let R be a relational schema. An MVDB is a triple (Tup, w, V), where Tup is a set of possible tuples over the schema R, w : Tup → [0, ∞] associates a weight to each possible tuple, and V is a set of MarkoViews.

Let Iposs denote the deterministic database instance over the schema R consisting of all possible tuples Tup (forgetting their probabilities). For each MarkoView V, denote Tup_V the result of evaluating V on Iposs, and let Tup_𝒱 = ∪_V Tup_V: this is the set of all possible tuples in all views. For each t ∈ Tup_𝒱, let wV(t) denote its weight, as computed by the view V.

The semantics of an MVDB is a restricted MLN having one feature (Ft, wt) for each tuple t ∈ Tup ∪ Tup_𝒱, defined as follows. For each possible tuple t ∈ Tup in the probabilistic database we associate the feature whose formula is Ft = t, the grounded atom represented by the tuple t, and whose weight is wt = w(t). For each possible tuple t ∈ Tup_V in a view V, consider the view definition V(x̄)[wexpr] :- Q: we associate t with the feature whose formula is Ft = Q(t), the Boolean query obtained by substituting the head variables x̄ with the tuple t, and whose weight is wV(t). Thus, the feature Ft is a ground tuple in the first case, and a Boolean UCQ in the second case. In both cases, Ft has no free variables, and therefore it is already "grounded" according to the terminology of MLNs.

Definition 4. Let (Tup, w, V) be an MVDB. Its semantics is given by the probabilistic database DL (Def. 1) associated to the MLN L = {(Ft, wt) | t ∈ Tup ∪ Tup_𝒱}.

We denote P(Q) the probability of a Boolean query Q on the probabilistic database associated to the MVDB. The problem in this paper is to compute P(Q).

2.5 Discussion and Examples

An MVDB generalizes tuple-independent databases. Any INDB (Tup, w) is in particular an MVDB (Tup, w, ∅), without any MarkoViews. But MVDBs are much more powerful than INDBs, because they can impose correlations between arbitrary sets of tuples. We illustrate with several examples.

Example 1. Consider the MVDB with two possible tuples, Tup = {R(a), S(a)}, with weights w1, w2 respectively, and a single MarkoView:

    V(x)[w] :- R(x),S(x)

Here w is a constant. Intuitively, the view asserts that the tuples in R and S are correlated, by some weight w. We show now the associated MLN, L. There is a single tuple in the MarkoView, Tup_V = {V(a)}, and therefore the MLN has three features:

    L = {(R(a), w1), (S(a), w2), (R(a) ∧ S(a), w)}

The probabilistic database DL has four possible worlds, ∅, {R(a)}, {S(a)}, {R(a),S(a)}, with weights:

    1    w1    w2    w·w1·w2

Therefore, the two tuples R(a), S(a) are correlated. When w = 0, then R(a) and S(a) are exclusive events; when w = 1 they are independent events; when w = ∞ both are certain tuples. More generally, when w < 1 then R(a), S(a) are negatively correlated, and when w > 1 they are positively correlated.

Example 2. A more complex example is V(x)[w] = R(x),S(x,y). Each tuple t = V(a) in the view defines the MLN feature Ft = ∃y.R(a),S(a,y). The lineage of this Boolean query is (R(a) ∧ S(a,b1)) ∨ (R(a) ∧ S(a,b2)) ∨ ..., and the MarkoView introduces a correlation between all tuples in the lineage expression. Here the MarkoView correlates a large number of tuples, in turn forming a large clique in the associated Markov Network. The view V1 in Fig. 1 is of this type, because the year attribute of Studentp is projected out.

Example 3. An example of a large MVDB is given in Fig. 1. The set of possible tuples Tup is defined by the deterministic tables at the top, and by the three queries in the middle of Fig. 1 (which also define the weight function w for the probabilistic tables). The MarkoViews are V1, V2, V3 at the bottom of Fig. 1. The MVDB has over 6M probabilistic tuples (correlated) and over 6M deterministic tuples.

Thus, MarkoViews allow us to express both positive and negative correlations between probabilistic tuples. They are, however, strictly less expressive than MLNs, because they only allow UCQs as features. The advantage of imposing such a restriction is that it allows us to translate query evaluation of UCQs on MVDBs to query evaluation of UCQs on tuple-independent databases, as we show in the next section. For example, MarkoViews cannot express a feature like "transitively closed", which would be written in MLNs as ((R(x,y),R(y,z) ⇒ R(x,z)), w). One can express this in MarkoViews if we extend them to allow negations:

    V(x,y,z)[1.0/w] :- R(x,y),R(y,z),not R(x,z)

While the MLN rewards every grounding of R(x,y),R(y,z) ⇒ R(x,z) by a factor w, the MarkoView penalizes every violation by a factor 1/w: the two features are equivalent.
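The correlation claims in Example 1 are easy to check numerically. Below is a small sketch of ours (the helper name `example1` is hypothetical, not the paper's code) that builds the four-world distribution with weights 1, w1, w2, w·w1·w2 and compares P(R(a) | S(a)) against P(R(a)) for several values of w.

```python
def example1(w1, w2, w):
    # Worlds of Example 1 with weights 1, w1, w2, w*w1*w2.
    worlds = {frozenset(): 1.0,
              frozenset({"R(a)"}): w1,
              frozenset({"S(a)"}): w2,
              frozenset({"R(a)", "S(a)"}): w * w1 * w2}
    z = sum(worlds.values())
    pr = lambda ev: sum(wt for I, wt in worlds.items() if ev(I)) / z
    p_r = pr(lambda I: "R(a)" in I)
    p_r_given_s = (pr(lambda I: "R(a)" in I and "S(a)" in I)
                   / pr(lambda I: "S(a)" in I))
    return p_r, p_r_given_s

for w in [0.0, 0.5, 1.0, 4.0]:
    p, p_cond = example1(w1=1.0, w2=1.0, w=w)
    # w < 1: P(R|S) < P(R)  (negative correlation);
    # w = 1: P(R|S) = P(R)  (independence);
    # w > 1: P(R|S) > P(R)  (positive correlation).
    print(f"w={w}:  P(R(a))={p:.3f}  P(R(a)|S(a))={p_cond:.3f}")
```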

The technique that we describe in the next section applies to this extended view with negation too, i.e. queries can still be translated to a tuple-independent database; the problem is that the new query contains negation, which is less well understood in probabilistic databases. For that reason, we currently restrict MarkoViews to be UCQs, without negation.

3. TRANSLATING MVDB TO INDB

3.1 Main Idea

Consider two possible tuples R(a), S(a) with weights w1, w2 and the MarkoView V(x)[w] :- R(x),S(x), where w is a constant. The four possible worlds have weights:

    1    w1    w2    w·w1·w2

We show how to reduce the probability P(Q) to the probability of a query on a tuple-independent database. Any Boolean query Q corresponds to a subset of worlds; there are 2^4 = 16 inequivalent Boolean queries. Then P(Q) = Φ(Q)/Z, where Φ(Q) is the sum of the weights corresponding to the worlds that satisfy Q. For a very concrete example, if Q = R(a) ∨ S(a), then Φ(Q) = w1 + w2 + w·w1·w2, and P(Q) = (w1 + w2 + w·w1·w2)/(1 + w1 + w2 + w·w1·w2).

Consider now a tuple-independent database over three relations, R, S, NV, with three possible tuples R(a), S(a), NV(a); their weights are w1, w2, w0, where w1, w2 are as above and w0 will be determined shortly. Consider the hard constraint ¬W, where W ≡ R(a) ∧ S(a) ∧ NV(a). If one defines V(a) = ¬NV(a), then ¬W ≡ (R(a),S(a) ⇒ V(a)). Seven out of the eight possible worlds satisfy ¬W, and their weights are:

                 ∅         {R(a)}        {S(a)}        {R(a),S(a)}
    ¬NV(a):      1         w1            w2            w1w2
    NV(a):       w0        w0w1          w0w2          -
    Total:       1 + w0    (1 + w0)w1    (1 + w0)w2    w1w2

We write Φ0, Z0 (= Φ0(true)) and P0 for the weight function, the partition function, and the probability defined by this tuple-independent database. Suppose we want to compute P(Q) for some query Q over the schema R, S, and consider its weight Φ0(Q ∧ ¬W) in the new database: each column in the table above is either entirely included in the sum or not included at all, because Q does not refer to NV, only to R and S. Thus, Φ0(Q ∧ ¬W) is a sum of a subset of the weights in the last row, labeled "Total". Set w0 such that w = 1/(1 + w0): then Φ0(Q ∧ ¬W) = (1 + w0) · Φ(Q).⁴ For example, if Q = R(a) ∨ S(a), then:

    Φ0(Q ∧ ¬W) = (1 + w0)w1 + (1 + w0)w2 + w1w2
               = (1 + w0) · (w1 + w2 + (1/(1 + w0)) · w1w2)
               = (1 + w0) · Φ(Q)

Therefore:

    P(Q) = Φ(Q)/Φ(true) = Φ0(Q ∧ ¬W)/Φ0(¬W)
         = P0(Q ∧ ¬W)/P0(¬W) = (P0(Q ∨ W) − P0(W))/(1 − P0(W))

In summary, we have reduced the problem of evaluating P(Q) on an MVDB to the problem of evaluating the probabilities P0(Q ∨ W) and P0(W) on a tuple-independent database. Note that we can also express it as a conditional probability, P(Q) = P0(Q | ¬W); we prefer to use the expression above because both probability expressions P0(Q ∨ W) and P0(W) are for UCQs, for which query evaluation on tuple-independent databases is very well understood.

3.2 Main Theorem

Definition 5. Consider an MVDB D = (Tup, w, V) over the relational schema R. Let Tup_𝒱 be the set of all possible tuples in all views, and wV : Tup_𝒱 → [0, ∞] be their weights (Sect. 2.4). Let NV denote the relational schema having one relation symbol NVi for each MarkoView Vi. The tuple-independent database associated to D is the following database over the schema R ∪ NV: D0 = (Tup0, w0), where the set of possible tuples and the weight function are defined by:

    Tup0 = Tup ∪ TupNV
    TupNV = {NVi(ā) | Vi(ā) ∈ Tup_Vi}
    w0(t) = w(t)                     if t ∈ Tup
    w0(t) = (1 − wV(t)) / wV(t)      if t ∈ Tup_𝒱

In other words, to compute the INDB from the MVDB one proceeds as follows. All deterministic or probabilistic tables in the MVDB become independent tables in the INDB, with the same weights. (A deterministic table in the MVDB has all weights = ∞, hence it remains a deterministic table in the INDB.) In addition, create a new relation NVi for each MarkoView Vi: the possible tuples are all the possible tuples in the view Vi, and their weights are w0 = (1 − w)/w, where w is the weight defined by the MarkoView for that tuple. (Note that w = 1/(1 + w0), a fact we will use in the proof.) The next theorem is the main theoretical result in this paper, and key to our technique. It says that, in order to compute UCQ queries on the MVDB, it suffices to compute UCQ queries over the associated INDB:

Theorem 1. Let V = {V1, ..., Vm} be the MarkoViews in the MVDB. For i = 1, m let Qi be the UCQ defining the view Vi. Denote Wi the Boolean query

    Wi = ∃x̄i. NVi(x̄i) ∧ Qi(x̄i)        (4)

Further define W = ∨i Wi (this is also a Boolean UCQ query). Then, for every Boolean query Q, the following holds:

    P(Q) = (P0(Q ∨ W) − P0(W)) / (1 − P0(W))        (5)

Proof. We follow the same steps as in Sect. 3.1. We start by computing Φ(Q) according to Def. 4:

    Φ(Q) = Σ_{J⊆Tup: J|=Q} Φ(J),   where   Φ(J) = Φ0(J) · ∏_{t∈Tup_𝒱: J|=Ft} wV(t)

Recall that the MLN associated to the MVDB has two sets of features Ft: for t ∈ Tup and for t ∈ Tup_𝒱. Thus, Φ(J) has two parts: ∏_{t∈J} w(t), which is the same as Φ0(J), the weight of J in the INDB; and the product of all the weights of view features satisfied by the world J. This justifies Φ(Q).

⁴Φ(Q) is a sum of a subset of the weights in 1, w1, w2, w·w1·w2, while Φ0(Q ∧ ¬W) is a sum of the corresponding subset of the weights in (1 + w0), (1 + w0)w1, (1 + w0)w2, w1w2.

Next, we compute Φ0(Q ∧ ¬W):

    Φ0(Q ∧ ¬W) = Σ_{I: I|=Q∧¬W} Φ0(I)        (6)

The possible world I ranges over all subsets of Tup ∪ TupNV, and, hence, it can be written as I = J ∪ K, where J ⊆ Tup and K ⊆ TupNV. Fix the J component of I. Fix a MarkoView (Vi, w, Qi(x̄i)) and a possible tuple in this MarkoView, t = Vi(ā) ∈ Tup_Vi. There are two cases. Case 1: ā ∉ Qi(J); equivalently, J ̸|= Ft. In that case, we can satisfy the constraint Qi(x̄i) ⇒ ¬NVi(x̄i) by either including or not including the tuple t = NVi(ā) in K: the sum of the two weights is 1 + w0(t). Case 2: ā ∈ Qi(J); equivalently, J |= Ft. In that case, in order to satisfy the constraint we must remove the tuple t from K, hence its multiplicative contribution is 1. Thus, we rewrite Eq. 6 grouping by J:

    Σ_{J,K: J∪K|=Q∧¬W} Φ0(J ∪ K) = Σ_{J: J|=Q} Φ0(J) · ∏_{t: J̸|=Ft} (1 + w0(t)) · ∏_{t: J|=Ft} 1
                                 = P · Σ_{J: J|=Q} Φ0(J) · ∏_{t: J|=Ft} 1/(1 + w0(t))
                                 = P · Φ(Q)

where P = ∏_t (1 + w0(t)). In the last line we used the fact that 1/(1 + w0(t)) = wV(t). The theorem now follows by noting that P(Q) = Φ(Q)/Φ(true), and repeating the argument at the end of Sect. 3.1.

We end this section with three observations. First, we note that the lineage of Q ∨ W is no larger than the lineages of Q and W combined. In fact, the lineage of Q ∨ W is precisely the disjunction of the two lineage expressions of Q and W respectively. Thus, our translation does not add any complexity to the probabilistic inference problem. Second, we note that the translation gives us an immediate tool for identifying tractable cases. The UCQ queries that can be evaluated in PTIME over an INDB are fully characterized in [8], and are called safe queries. An immediate corollary of Theorem 1 is that query evaluation over an MVDB is tractable if both Q ∨ W and W are safe. Other tractability results on INDBs also carry over immediately to MVDBs, for example the query compilation results in [15]. Finally, a comment on denial views, which are MarkoViews where the weight is 0: for example, view V2 in Fig. 1 is a denial view. In that case NV is a deterministic table, since its weight is (1 − 0)/0 = ∞, and it can be dropped entirely from the definition of Wi, simplifying Eq. 4.

3.3 Discussion: Negative Probabilities

Some of the probabilities in D0 may be negative: if w > 1, then w0 = (1 − w)/w < 0, and the probability p0 = w0/(1 + w0) = 1 − w is negative. This may raise questions about the soundness of our approach. However, negative probabilities have been considered before; it has been proven that probability theory can be consistently extended to allow for negative probabilities [2], and there is interest in applying them to quantum mechanics [6] and financial modeling [13]. In our setting, the negative probabilities have a much more benign role: they are simply numbers that need to be plugged into the right-hand side of Eq. 5 to make the equality hold. Every query answer P(Q) will be a correct probability, in [0, 1], even if the probabilities P0 on the right are negative.

All familiar equalities still hold for P0: for example, the rule for negation, P0(¬Q) = 1 − P0(Q), and inclusion/exclusion, P0(Q1 ∨ Q2) = P0(Q1) + P0(Q2) − P0(Q1 ∧ Q2), still hold; similarly, if Q1, Q2 are independent queries (their lineages have no common variables), then the laws of independence continue to hold: P0(Q1 ∧ Q2) = P0(Q1)P0(Q2) and P0(Q1 ∨ Q2) = 1 − (1 − P0(Q1))(1 − P0(Q2)); similarly, Shannon's expansion formula holds. In fact all exact inference methods (the Davis-Putnam procedure, tree-width based methods, OBDD constructions) work without any modification on probability spaces with negative probabilities. We take advantage of this fact in the next section.

Approximate methods, however, no longer work out of the box. For example, the inequality P0(Q1 ∨ Q2) ≤ P0(Q1) + P0(Q2) must be replaced with the weaker |P0(Q1 ∨ Q2)| ≤ |P0(Q1)| + |P0(Q2)|; another issue is that approximation methods may no longer return final values in [0, 1]; it is also unclear how to run sampling-based methods. In this paper we do not consider any approximation or sampling methods; instead we restrict the discussion to exact probability computation. We show next how to do this quite effectively.
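Theorem 1, and the benign role of negative probabilities, can be checked by brute force on the toy example of Sect. 3.1. The sketch below is our illustration, not the paper's code (`p0` is a hypothetical helper): it computes P(Q) directly from the MVDB's MLN semantics, then recomputes it through the translated INDB with the NV tuple of weight w0 = (1 − w)/w, using Eq. 5. For w > 1 the tuple probability p0 = 1 − w is negative, yet the two answers agree and lie in [0, 1].

```python
from itertools import product

def p0(event, probs):
    """P0 of an event over independent tuples; probs may be negative."""
    total = 0.0
    names = list(probs)
    for bits in product([0, 1], repeat=len(names)):
        weight, world = 1.0, set()
        for t, b in zip(names, bits):
            weight *= probs[t] if b else (1.0 - probs[t])
            if b:
                world.add(t)
        if event(world):
            total += weight
    return total

w1, w2, w = 2.0, 3.0, 4.0          # w > 1 gives a negative NV weight
# Direct MVDB semantics (Sect. 3.1): worlds over {R(a), S(a)} with
# weights 1, w1, w2, w*w1*w2; the query is Q = R(a) v S(a).
z = 1 + w1 + w2 + w * w1 * w2
p_direct = (w1 + w2 + w * w1 * w2) / z

# Translated INDB: add NV(a) with weight w0 = (1 - w)/w, i.e. the
# (here negative) probability w0/(1 + w0) = 1 - w.
to_p = lambda wt: wt / (1 + wt)
probs = {"R": to_p(w1), "S": to_p(w2), "NV": 1 - w}
q   = lambda I: "R" in I or "S" in I                       # Q
w_  = lambda I: "R" in I and "S" in I and "NV" in I        # W
qw  = lambda I: q(I) or w_(I)                              # Q v W
p_translated = (p0(qw, probs) - p0(w_, probs)) / (1 - p0(w_, probs))
print(p_direct, p_translated)   # both 29/30, inside [0, 1]
```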

4. COMPILING MARKOVIEWS

We have translated the problem P(Q) on an MVDB into the problem P0(Q ∨ W) on a tuple-independent database. Thus, from now on we will only consider tuple-independent databases.

Even though query evaluation over INDBs has been well studied, there is an important distinction in our setting. While the lineage of Q is typically small, the lineage of W is usually large, especially as we try to capture more correlations. For example, the DBLP data in Fig. 1 has over 6M probabilistic tuples, and many of them (if not all) will be included in the lineage of W. Since W is defined offline, we employ the strategy of pre-compiling W, in order to speed up the online evaluation of P0(Q ∨ W). In this section we discuss the MV-Index, the data structure into which we compile W, along with the algorithm for compilation. The MV-Index has been designed to speed up the online evaluation of P0(Q ∨ W), which we discuss next.

Throughout this section we assume a tuple-independent probabilistic database D0. We associate to each tuple a Boolean variable X1, X2, ..., Xn (Sect. 2.1). We consider only Boolean queries in this section. W is already a Boolean query (Eq. 4); the user query Q is typically not a Boolean query, but for the purpose of query evaluation we substitute its head variables with an answer tuple, thus transforming it into a Boolean query. We denote ΦQ the lineage of the query Q on the probabilistic database. ΦQ is a Boolean formula using the variables X1, ..., Xn (see, e.g., [33]). Figure 3 shows a probabilistic database, a query Q, and its lineage expression ΦQ. Notice that the lineage of a disjunction is the disjunction of the lineages, ΦQ∨W = ΦQ ∨ ΦW.

4.1 MV Index

An MV-Index consists of a set of OBDDs augmented with certain pre-computations and indices that we describe below. But first we briefly review OBDDs.

OBDD: An Ordered Binary Decision Diagram (OBDD) [5] is a rooted DAG, where internal nodes are labeled with Boolean variables and have two outgoing edges, labeled 0 and 1; sink nodes (leaves) are labeled 0 or 1. There are two constraints: every path from the root to a leaf must visit each variable at most once, and any two paths must visit the variables in the same order (missing variables are OK); see the example in Figure 3.

    R  A           S  A   B          ΦQ = X1Y1 ∨ X1Y2 ∨ X2Y3 ∨ X2Y4
       a1  X1         a1  b1  Y1
       a2  X2         a1  b2  Y2     Q = R(x),S(x,y)
                      a2  b3  Y3
                      a2  b4  Y4

    [OBDD for ΦQ, visiting the variables in the order X1, Y1, Y2, X2, Y3, Y4; diagram not reproduced here]

Figure 3: A query Q on a probabilistic database, its lineage ΦQ, and an OBDD for ΦQ

Given an OBDD for Φ, one can compute the probability P0(Φ) in linear time. Denote p(u) the probability of the Boolean formula encoded by the OBDD rooted at node u. If u is a leaf node, set p(u) = 0 or p(u) = 1 (depending on its label); otherwise it is labeled with some variable, say Xi, and has two children, say u0, u1, corresponding to the outgoing edges labeled 0, 1 respectively. We set:

    p(u) = (1 − P0(Xi)) · p(u0) + P0(Xi) · p(u1)

This formula (Shannon expansion) also holds when P0(Xi) is negative. The size of an OBDD is the total number of nodes in it. The width at level i is the number of nodes labeled with variable Xi, and the width is the maximum width at any level. Note that the size of an OBDD is at most the width times the number of variables.

The MV-Index An augmented OBDD for Φ is an OBDD where each node u is annotated with two quantities. First, u.probUnder stores p(u). Second, u.reachability stores the sum of the probabilities of all paths from the root to u. The probability of a path is defined as a product of the following factors: for an edge Xi = 1 we include the factor P0(Xi), and similarly for Xi = 0 we include (1 − P0(Xi)). To see the intuition, suppose we want to compute P0(ϕ ∧ Φ) for some "small" expression ϕ, and assume ϕ = X1, the root variable of the OBDD. Then we just return p ∗ u.probUnder, where p is the probability of X1 and u is its 1-child. Assume now ϕ = Xi, and that there are c nodes in the OBDD labeled Xi, namely u1, u2, ..., uc. Assume every path from the root to a sink contains one of these nodes. Let v1, v2, ..., vc be their 1-children. Then:

    P0(Xi ∧ Φ) = p ∗ Σ_{j=1}^{c} uj.reachability ∗ vj.probUnder

When the width c is small, the augmented OBDD allows us to compute these probabilities efficiently.

Finally, an MV-Index consists of a set of augmented OBDDs, each of them associated with a particular key. These OBDDs are over disjoint sets of variables. Besides the OBDDs, it keeps two indices. InterBddIndex, given a tuple, returns the key of the OBDD where the tuple is located. IntraBddIndex returns all the nodes in the OBDD which correspond to that tuple. For example, in computing P0(Xi ∧ Φ) above, we still needed to locate all the nodes labeled with Xi. These indices enable us to do that in constant time.

4.2 Constructing the MV-Index

In this section we describe our algorithm to construct an OBDD for a UCQ. Given an OBDD, adding the pre-computations and indices needed to make it an MV-Index is straightforward, and hence skipped in this section.

Let Π be the order in which an OBDD visits the variables on each path. (E.g., Π = X1, Y1, Y2, X2, Y3, Y4 in Fig. 3.) The order Π uniquely determines the OBDD [35], up to merging of equivalent nodes; therefore the problem of finding a small OBDD is equivalent to finding a good order Π (meaning one that leads to a small OBDD). Denote OBDD_Π(Φ) the OBDD of Φ with order Π. If Φ = Φ1 ∨ Φ2, or Φ = Φ1 ∧ Φ2, and we are given OBDDs G1, G2 for Φ1, Φ2 with the same order Π, then one can compute OBDD_Π(Φ) in time O(|G1||G2|) by a procedure called synthesis. CUDD [32], a widely popular package for OBDDs, uses this synthesis procedure. It starts with some order Π and synthesizes the OBDD traversing Φ recursively.

We make the following improvement over the synthesis procedure in CUDD. Suppose Φ1, Φ2 are independent, i.e. they use disjoint sets of Boolean variables; then one can synthesize Φ1 ∨ Φ2 more efficiently: stack the OBDDs G1, G2 on top of each other, and redirect every 0-labeled leaf of G1 to the root of G2 (for Φ1 ∧ Φ2 one would redirect the 1-labeled leaves). We call this concatenation. The size of the new OBDD is only |G1| + |G2| and, unlike synthesis, concatenation is a constant-time operation. Thus, concatenation represents a major improvement over synthesis. We explain next where we can use concatenation when computing the OBDD of a UCQ.

Consider a Conjunctive Query (CQ) Q. A root variable in Q is a variable x that appears in all atoms of Q. One can write Q ≡ Q[a1/x] ∨ Q[a2/x] ∨ ..., where a1, a2, ... form the active domain of x. It is not hard to see that if Q has no self-joins then Q[ai/x] and Q[aj/x] have no tuples in common; hence the OBDD of Q can be obtained by concatenating the OBDDs of the Q[ai/x]. This is illustrated for Q = R(x),S(x,y) in Fig. 3.

In general, for a UCQ Q = Q1 ∨ Q2 ∨ ..., where the Qi are CQs, let xi be a root variable of Qi. We write Q = ∃x1.Q1 ∨ ∃x2.Q2 ∨ ... = ∃z.(Q1[z/x1] ∨ Q2[z/x2] ∨ ...). The new variable z occurs in all atoms of all conjunctive queries. We call z a separator variable if any two atoms with the same symbol contain z in the same attribute position. Let a1, a2, ... be the active domain for z. Then once again the same property holds: Q ≡ Q[a1/z] ∨ Q[a2/z] ∨ ..., and the queries on the right are independent, i.e. they do not have any common tuples. For example, let Q = R(x1),S(x1,y1) ∨ T(x2),S(x2,y2). Both x1 and x2 are root variables, and we write the query as follows (showing quantifiers explicitly now):

    Q = ∃z.[∃y1.(R(z) ∧ S(z,y1)) ∨ ∃y2.(T(z) ∧ S(z,y2))]

Thus, Q = Q[a1/z] ∨ Q[a2/z] ∨ ..., and the OBDD for Q has a size that is the sum of the sizes of the OBDDs for the Q[ai/z]. In general:

Proposition 1. If z is a separator in Q, then there exists an OBDD for Q of size at most the sum of the sizes of the OBDDs of Q[ai/z], i = 1, n, where a1, a2, ..., an is the active domain.

Finally, consider Q = R(x1),S(x1,y1) ∨ S(x2,y2),T(y2): this query does not have a separator variable. For such queries we fall back on general tools like CUDD. Then there are hybrid cases like Q = Q1 ∨ Q2, Q1 = R(x),S(x,y),R(y), Q2 = S(x,y),T(x). Here only Q2 has a root variable, so we will choose Π such that we can compute OBDD_Π(Q2) by concatenation, then use synthesis to compute OBDD_Π(Q1) using the same order Π.

Having reviewed the concepts from our prior work, we are now ready to describe the construction, which is the novel part of this section.

Fix a relational schema R = {R1, ..., Rk}. Order the relation names from smaller to larger arities. Let π = {πR1, ..., πRk} be a set where each πRi is a permutation on the attributes of the relation Ri. That is, if arity(Ri) = m then πRi is any permutation on 1, ..., m. Consider a database instance I over an ordered active domain a1 < a2 < ... < an. Then π defines an order Π on all tuples in I, as follows: Π = Π1, Π2, ..., Πn, where each Πj is the order obtained recursively as follows: for each relation Ri, retain from I only those tuples where the first attribute of Ri (according to πRi) is aj, then project out that attribute, and compute Πj recursively on the smaller database. For example, consider the schema R(A), S(A,B), and the permutations πR = (A) and πS = (A,B). Consider the database instance in Fig. 3, where the active domain is ordered by a1 < a2 < b1 < b2 < b3 < b4. Then Π = X1, Y1, Y2, X2, Y3, Y4.

Let π be given and let Q be a query. If x, y are two variables, then we write x >π y if, whenever y occurs in an atom on some attribute B, then x also occurs in that atom on some attribute A, and A comes before B in the permutation πRi.

Given π and a query Q, the following recursive procedure ConOBDD(π, Q) constructs OBDD_Π(Q), where Π is the permutation on the tuples associated to π:

• R1 Q = Q1 ∨ Q2: if Q1, Q2 have no relations in common, concatenate, else synthesize.
• R2 Q = Q1 ∧ Q2: if Q1, Q2 have no relations in common, concatenate, else synthesize.
• R3 Q = ∃x.Q1: if x >π y for all variables y in Q1, then concatenate, else synthesize.
• R4 Q = R(ā): trivial.

We choose π heuristically so as to minimize the number of synthesis steps in R3. In particular, if Q has a separator variable, then we always choose π such that every attribute holding a separator variable occurs first in the permutation π. Call a query Q inversion-free if there exists a π such that only concatenations are performed in R3. (This is equivalent to the definition of inversion-free queries in [15].) Let {a1, ..., an} be the active domain. The following proposition, based on [15], gives guarantees on the size of ConOBDD(π, Q) in certain cases.

Proposition 2. Let N be the size of the OBDD returned by ConOBDD(π, Q). (a) If Q admits a separator variable z, then N = Σj Nj, where Nj is the size of the OBDD of Q[aj/z], j = 1, n. (b) If Q is inversion-free, then the OBDD has constant width; hence N = O(n) [15].

4.3 Querying an MV-index

Our goal is to compute Eq. 5 efficiently; its numerator is P0(Q ∨ W) − P0(W) = P0(Q ∧ ¬W). The OBDD for ¬W is obtained immediately from the OBDD for W, by switching the 0 and 1 sink nodes. In this section we will show how to use an MV-Index for ¬W to compute P0(Q ∧ ¬W) efficiently: we call this operation intersection. Denote GW = OBDD_Π(¬W). Given Q, we first construct GQ = OBDD_Π(Q). Note that, although Π is imposed by W, constructing GQ is usually quite efficient, because the lineage of Q is typically small. A naive next step would be to compute GQ ∧ GW, but this requires traversing the entire index. We briefly review our improvements over the naive algorithm next.

MVIntersect MVIntersect uses GQ = OBDD_Π(Q) to guide a search through GW = OBDD_Π(¬W) and sums the probability of only those worlds that satisfy Q, via a top-down algorithm. One of the main challenges in doing a top-down intersection is maintaining a cache for memoization. This is well studied (c.f. [14] for a top-down algorithm), so we don't discuss it here. Since GW, GQ have the same variable order, our algorithm just traverses GW and prunes out branches where GQ is false. To do this we could implement a DFS traversal of GW whose stack maintains nodes from both GW and GQ. Whenever we pop false from GQ, we don't traverse the subtree below. The stack here, though, can become very big, and the node from GQ is almost always the same, since GQ is very small. MVIntersect exploits this by not adding the same node from GQ consecutively, but keeping a count of how many times the same query node has been pushed.

CC-MVIntersect Since our algorithm is main-memory, and considering the increasing gap between the access times of cache and memory, we optimized our data structure to be more cache-conscious and to minimize random accesses. A typical BDD data structure, for instance the one used in CUDD, stores bdd nodes as pointers, and each node contains the pointers to its neighbors. We improve upon this by keeping the bdd nodes in a vector sorted according to the DFS traversal of the obdd. Querying the obdd can now be done by a sequential traversal of this vector. The new bdd nodes, though, may need to store some additional information about their neighbors. We call this approach CC-MVIntersect.

Proposition 3. Let GW be the OBDD for ¬W. Let Xi and Xj be the first and last Boolean variables (according to the permutation Π) occurring in the lineage of Q, and let m = j − i + 1. The CC-MVIntersect procedure runs in time O(m · w), where w is the width of GW.

Finally, we would like to point out that if W is inversion-free, then, since its OBDD has constant width, the runtime of CC-MVIntersect is linear in the span of the query Q, i.e. the distance between the first and last variable in its lineage.
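The annotations that the MV-index adds to the OBDD (Sect. 4.1), and the way a single-variable intersection uses them, can be sketched compactly. The following is our illustration, not the paper's C++ implementation; names like `annotate` and `prob_with_var` are assumptions. It computes p(u) by Shannon expansion (valid even for negative probabilities, Sect. 3.3), fills reachability top-down, and uses the two annotations to evaluate P0(X ∧ Φ) without a full traversal.

```python
class Node:
    """OBDD node: a leaf (value 0/1) or an internal node labeled var,
    with children lo (0-edge) and hi (1-edge)."""
    def __init__(self, var=None, lo=None, hi=None, value=None):
        self.var, self.lo, self.hi, self.value = var, lo, hi, value
        self.prob_under = None     # p(u), Sect. 4.1
        self.reachability = 0.0    # total probability of root-to-u paths

def annotate(root, p):
    """Fill prob_under (bottom-up Shannon expansion; also correct when
    some p[X] is negative) and reachability (top-down path mass). A real
    implementation would visit nodes in topological order; this sketch
    recurses and revisits shared nodes once per incoming path."""
    def under(u):
        if u.prob_under is None:
            if u.value is not None:
                u.prob_under = float(u.value)
            else:
                u.prob_under = ((1 - p[u.var]) * under(u.lo)
                                + p[u.var] * under(u.hi))
        return u.prob_under

    def reach(u, mass):
        u.reachability += mass
        if u.value is None:
            reach(u.lo, mass * (1 - p[u.var]))
            reach(u.hi, mass * p[u.var])

    under(root)
    reach(root, 1.0)
    return root.prob_under          # = P0(Φ)

def prob_with_var(nodes, p_x):
    """P0(X ∧ Φ) via the augmented OBDD: p * Σ u.reachability *
    u.hi.prob_under over the nodes u labeled X, assuming every
    root-to-sink path passes through one of them."""
    return p_x * sum(u.reachability * u.hi.prob_under for u in nodes)

# Φ = X1 ∧ X2, with order X1, X2:
one, zero = Node(value=1), Node(value=0)
x2 = Node("X2", lo=zero, hi=one)
x1 = Node("X1", lo=zero, hi=x2)
print(annotate(x1, {"X1": 0.5, "X2": 0.4}))   # P0(Φ) = 0.2
print(prob_with_var([x2], 0.4))               # P0(X2 ∧ Φ) = 0.2
```

In the MV-index, the IntraBddIndex supplies the list of nodes passed to `prob_with_var` in constant time; MVIntersect generalizes this single-variable case to an arbitrary query OBDD GQ.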

5. EXPERIMENTAL EVALUATION

We report here our experimental evaluation of MVDBs, on real data obtained from [19]. We addressed four questions. How do MarkoViews, and indexed MarkoViews, compare to other approaches for probabilistic inference on large Markov Networks? How effective is the MV-index construction algorithm compared to the standard approach for constructing OBDDs? How effective is the MV-index-based query evaluation method, and how significant is the improvement of the CC-MVIntersect algorithm over MVIntersect? And, finally, how do MVDBs scale to the entire DBLP dataset?

Set Up For probabilistic inference, we compare our approach with Alchemy [37], the de-facto standard inference engine for MLNs. For OBDD construction, we have extended CUDD [32], a widely used OBDD package; for the OBDD experiments we compare native CUDD with our OBDD construction algorithm. Our implementation is written in C++. We used Postgres 9.0 for our experiments, which are run on a 2.66 GHz Intel Core 2 Duo, with 4GB RAM, running Mac OS X 10.6. The dataset used is DBLP [19], and we consider the features from Fig. 1.

5.1 Comparison with Alchemy

To compare against Alchemy, we construct an MLN using features over Advisorp and Studentp, i.e. V1, V2 in Fig. 1. We did not use Affiliationp, because the characters in the affiliation string violated Alchemy's requirements for constants. MLNs do not allow features to have parameterized weights as in MVDBs; instead, MLNs require a constant weight for each feature. One alternative is to materialize each feature into multiple features, one for each value of the weight; this would have increased the number of features. We opted instead to use constant weights in Alchemy, for simplicity. In the MVDB we continued to use the weights exactly as described in Fig. 1.

Not surprisingly, Alchemy did not scale to the entire DBLP dataset; we could scale it only up to aid = 10,000 in Studentp(aid,year), where aid ranges up to 1M. In our experiments, we generate 10 datasets, with the domain of aid ranging from 1 to i∗1000, i = 1...10. The lineage size of the MarkoViews, i.e. the tuples involved in the constraints (in other words the size of ΦW, Sect. 4), is plotted in Fig. 4. Over each such dataset we ran two queries of the form find the advisor of some student X, and find all students of some advisor Y. Fig. 5 and Fig. 6 show the running time comparison of Alchemy vs MarkoViews. We show both the total execution time and also the time Alchemy reports it spent in just sampling. It is known that Alchemy is very inefficient during the grounding phase, and that database techniques can speed up this phase considerably [20]; therefore the interesting line in Fig. 5 is the lower line, Alchemy-sampling, which is Alchemy's reported sampling time, and is a better measure (likely a lower bound) of the total probabilistic inference time. The sampling method used is MC-Sat [25].

The results in Fig. 5 and Fig. 6 show that Alchemy's MC-Sat algorithm is comparable (within a factor of 5, up or down) with evaluating the MarkoViews directly by constructing the OBDD. Note that OBDDs are an exact inference method while MC-Sat is an approximate method; also, OBDDs are not ideal for pure probabilistic inference, but are more appropriate for compilation. Once we use them for this purpose, by constructing an MV-index, the performance of the MarkoViews increases dramatically, and remains mostly constant as we increase the data size.

5.2 Comparison with CUDD

Here we studied the OBDD construction time, and compared it to the OBDD construction time of native CUDD. The running time depends on the size of the resulting OBDD, and therefore we needed a feature that allowed us both to increase the size of the OBDD linearly, and for which the OBDD constructed by native CUDD was the same as that resulting from our optimization. The MarkoView V2 in Fig. 1 had both properties. As we varied the domain of aid1 in Advisorp(aid1,aid2) linearly from [1000] to [10000], the size of the resulting OBDD varied linearly, as shown in Fig. 7. Furthermore, we checked that the sizes of the OBDDs returned by the two methods were indeed the same. Figure 8 shows that our algorithm was two orders of magnitude more efficient than CUDD's OBDD construction. Since aid1 is a separator, our approach exploits concatenation, which is much more efficient than the standard OBDD synthesis used by CUDD. Since CUDD is effective at detecting equivalent nodes in the OBDD, it constructs the same OBDD, but it takes much longer to achieve the same result. We could not use CUDD to construct an OBDD for the entire DBLP dataset, even for V2: we estimate it would have taken several hours.

5.3 MV-index-based Query Evaluation

Recall that our evaluation of P0(Q ∨ W) is based on an optimized intersection of the OBDD for Q with the MV-index for W. Here we compare two algorithms: MVIntersect and CC-MVIntersect (recall that CC stands for cache-conscious). We used the same setting as in the previous section. We used a simple query Q whose lineage consisted of 20 tuples chosen as a worst-case scenario: it forced the system to traverse the entire MV-index, rendering all pre-computations and indices useless. Thus, query evaluation requires a complete intersection of the two OBDDs. Figure 9 shows the running time of the two algorithms. As expected, the running time varies linearly with the size of the MV-index (which, recall Fig. 7, we designed carefully to be linear in the size of the data), because the entire OBDD must be traversed. The cache-conscious version improves by a factor of 2 over the plain algorithm. We note that this is the worst-case scenario; in typical cases, the query needs to traverse only a fraction of the MV-index, as will become clear next.

5.4 Scalability to a Large Dataset

We finally report our scalability results on the entire DBLP dataset, as described in Fig. 1. The MarkoViews have a separator, hence the MV-index is obtained by concatenating many small OBDDs; their total size is 1.38M. Note that not all probabilistic tuples ended up in the MV-index, because some did not participate in any views. It took under one hour to construct the OBDD and index.

We evaluated 10 queries, of the form find all students of an advisor X, and find the affiliation of a person Y. The running times are reported in Fig. 10 and Fig. 11 respectively. In all cases we used CC-MVIntersect. As one can see, the running times are below 5 ms for all queries, and many are below 1 ms. Query evaluation time includes the round trip call to Postgres, to compute the query's lineage, then the time to access the OBDD index, which is a main-memory data structure. Since each query includes a constant (the name of the advisor X, or the name of the person Y), only a small portion of the full OBDD had to be accessed at runtime, which explains the performance of query evaluation. Note that all probability computations are exact: this is unlike Alchemy's, which are approximate.

5.5 Discussion

We have demonstrated how MV-indexes can be used to dramatically speed up query evaluation. The key ingredient that makes the index construction possible is the translation of query evaluation from a Markov Network to a tuple-independent database, since OBDDs are only possible over independent variables. The OBDD construction is non-trivial. However, once the index is constructed, the performance of query evaluation becomes comparable to evaluating that query in Postgres.

Figure 4: Lineage size of the MV for each dataset (x-axis: aid domain; y-axis: lineage size, up to 12,000).

Figure 5: Alchemy vs. MV for querying the advisor of a student (x-axis: aid domain; y-axis: time in seconds, log scale; series: Alchemy-total, Alchemy-sampling, augmented OBDD, MVIndex).

Figure 6: Alchemy vs. MV for querying all students of an advisor (same axes and series as Fig. 5).

Figure 7: OBDD size (x-axis: aid1 domain; y-axis: OBDD size).

Figure 8: Cudd vs. MV: OBDD construction time (x-axis: aid1 domain; y-axis: time in seconds, log scale; series: Cudd Construction, MV Construction).

Figure 9: Querying time: MVIntersect vs. CC-MVIntersect (x-axis: aid1 domain; y-axis: time in seconds).

Figure 10: Querying students of an Advisor (x-axis: queries q1-q10; y-axis: time in seconds).

Figure 11: Querying Affiliations of an Author (x-axis: queries q1-q10; y-axis: time in seconds).
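As mentioned in Sect. 5.3, the intersection step can be sketched as the classic apply operation for AND on two OBDDs [5]. The code below is again only an illustration under hypothetical names, not the paper's MVIntersect or its cache-conscious variant; it also omits the unique-node table, so results may be unreduced. Memoizing on node pairs bounds the work by the product of the two OBDD sizes; when the query OBDD is small, this degenerates to a single scan of the MV-index, the linear worst case of Fig. 9.

    class Node:   # as in the earlier sketch, repeated for self-containment
        __slots__ = ("var", "lo", "hi")
        def __init__(self, var, lo, hi):
            self.var, self.lo, self.hi = var, lo, hi

    def obdd_and(a, b, order, memo=None):
        """AND of two OBDDs sharing the variable order `order` (var -> rank)."""
        if memo is None:
            memo = {}
        if a is False or b is False:
            return False
        if a is True:
            return b
        if b is True:
            return a
        key = (id(a), id(b))
        if key not in memo:
            # expand on whichever variable comes first in the global order
            if order[a.var] == order[b.var]:
                memo[key] = Node(a.var, obdd_and(a.lo, b.lo, order, memo),
                                 obdd_and(a.hi, b.hi, order, memo))
            elif order[a.var] < order[b.var]:
                memo[key] = Node(a.var, obdd_and(a.lo, b, order, memo),
                                 obdd_and(a.hi, b, order, memo))
            else:
                memo[key] = Node(b.var, obdd_and(a, b.lo, order, memo),
                                 obdd_and(a, b.hi, order, memo))
        return memo[key]

    order = {"x": 0, "y": 1}
    q = Node("x", False, True)                    # query lineage: x
    w = Node("x", Node("y", False, True), True)   # an MV-index entry: x OR y
    r = obdd_and(q, w, order)
    print(r.var, r.lo, r.hi)                      # x False True, i.e. r encodes x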

Future work is needed to assess the applicability of MVDBs, both in terms of modeling power and in terms of performance. Typical applications of MLNs [26, 31, 27] use 4-10 features, most of which can already be expressed as MarkoViews, but they use much smaller data sets.

6. RELATED WORK

Query Evaluation on Tuple-Independent Databases. This problem reduces to evaluating the probability P(Φ) of a Boolean formula Φ (over the Boolean variables Xt) called the lineage of the query Q. There are two lines of research on query evaluation on INDBs. One aims at identifying classes of queries for which P(Q) can be computed in polynomial time in the size of the database: these are called safe queries. For UCQs without inequality predicates (like x < 5 or x ≠ y), there exists a syntactic characterization of safe queries that is complete, i.e., a dichotomy: either Q is safe, and then one can compute P(Q) in PTIME, or Q is unsafe, and then computing P(Q) is #P-hard. The other line of research aims at developing effective heuristics for computing P(Φ). In this formulation the problem is related to model counting, for which several effective heuristics exist. The most popular is the Davis-Putnam procedure, which is based on Shannon expansion, P(Φ) = p_x · P(Φ[x := true]) + (1 − p_x) · P(Φ[x := false]); see for example [3]. It has been used in probabilistic databases in [17], and refinements of this procedure also exist [10] (a small sketch is given at the end of this section). Based on ideas similar to the Davis-Putnam procedure, a number of techniques have been described for using OBDDs for query evaluation on probabilistic databases [21, 22, 15]. Finally, these evaluation methods have been extended with several approximation techniques [24, 11].

Markov and Bayesian Networks in Probabilistic Databases. Several proposals exist for extending probabilistic databases to represent Markov Networks or Bayesian Networks. For example, in [16] the probabilistic database is defined directly as a Markov Network. The system uses a novel indexing technique for the junction tree decomposition of the network, allowing queries on a large database to be evaluated efficiently at runtime. The method works very well, but only when the tree width of the Markov Network is small; otherwise the method is intractable. Note that the Markov Networks in MVDBs have very large cliques, hence very large tree widths. Tuffy [20] implements MLNs directly in a relational database system. Its focus is less on developing new algorithms and more on leveraging query processing in the database to implement existing MLN algorithms. Two other projects [36, 34] implement MCMC inside a relational database for efficiently answering general SQL queries over a CRF model, commonly used in Information Extraction.
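The sketch promised above: a minimal Davis-Putnam-style computation of P(Φ) for a CNF lineage over independent variables, by recursive Shannon expansion. This is our illustration, not the algorithm of [3] or [17], which add many refinements; the names and the clause representation are hypothetical. The recursion is exponential in the worst case, consistent with the #P-hardness of unsafe queries.

    def prob_dpll(clauses, p):
        """clauses: a CNF formula as a list of sets of signed literals,
        e.g. [{"x", "-y"}]; p maps each variable to its marginal probability.
        Shannon expansion: P(Phi) = p[x]*P(Phi|x) + (1-p[x])*P(Phi|not x)."""
        if any(len(c) == 0 for c in clauses):
            return 0.0                  # empty clause: this branch is unsatisfiable
        if not clauses:
            return 1.0                  # no clauses left: formula already satisfied
        x = next(iter(next(iter(clauses)))).lstrip("-")  # pick a branch variable
        def condition(value):
            lit, neg = (x, "-" + x) if value else ("-" + x, x)
            out = []
            for c in clauses:
                if lit in c:
                    continue            # clause satisfied under this branch
                out.append(c - {neg})   # drop the falsified literal
            return out
        return (p[x] * prob_dpll(condition(True), p)
                + (1 - p[x]) * prob_dpll(condition(False), p))

    # P(x OR y) with independent x, y: 1 - (1-0.5)*(1-0.3) = 0.65
    print(prob_dpll([{"x", "y"}], {"x": 0.5, "y": 0.3}))

The OBDD-based methods [21, 22, 15] can be seen as memoized variants of this recursion, caching equivalent residual formulas as shared nodes.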

The systems discussed above aim to scale up existing general-purpose probabilistic inference methods. Our MVDB approach differs from these in that we do not optimize any existing algorithm for inference over complex probabilistic models, but propose a new approach by which we translate from a complex probabilistic model to a tuple-independent probabilistic model.

7. CONCLUSION

We described a new approach to probabilistic databases, which allows complex correlations to be defined between the tuples in a database. Our new approach is based on MarkoViews, which are a restricted form of a Markov Logic Network feature. We made two contributions that allow queries to be processed very efficiently on such databases. The first is a translation from MarkoViews into tuple-independent databases. The second is a compilation of the MarkoViews into OBDDs, which dramatically speeds up query execution. We have also validated our techniques experimentally.

8. REFERENCES

[1] L. Antova, T. Jansen, C. Koch, and D. Olteanu. Fast and simple relational processing of uncertain data. In ICDE, pages 983–992, 2008.
[2] M. S. Bartlett. Negative probability. Mathematical Proceedings of the Cambridge Philosophical Society, 41(01):71–73, 1945.
[3] E. Birnbaum and E. L. Lozinskii. The good old Davis-Putnam procedure helps counting models. J. Artif. Intell. Res. (JAIR), 10(1):457–477, 1999.
[4] J. Boulos, N. Dalvi, B. Mandhani, S. Mathur, C. Re, and D. Suciu. MystiQ: a system for finding more answers by using probabilities. In SIGMOD, pages 891–893, 2005.
[5] R. E. Bryant. Symbolic manipulation of Boolean functions using a graphical representation. In DAC, pages 688–694, 1985.
[6] M. Burgin. Interpretations of negative probabilities. CoRR, arXiv:1008.1287v1, 2010.
[7] N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. VLDB Journal, 16(4):523–544, 2007.
[8] N. N. Dalvi, K. Schnaitter, and D. Suciu. Computing query probability with incidence algebras. In PODS, pages 203–214, 2010.
[9] P. Domingos and D. Lowd. Markov Logic: An Interface Layer for Artificial Intelligence. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2009.
[10] O. Dubois. Counting the number of solutions for instances of satisfiability. Theor. Comput. Sci., 81(1):49–64, 1991.
[11] W. Gatterbauer, A. K. Jha, and D. Suciu. Dissociation and propagation for efficient query evaluation over probabilistic databases. In MUD, pages 83–97, 2010.
[12] V. Gogate and P. Domingos. Probabilistic theorem proving. In UAI, pages 256–265, 2011.
[13] E. G. Haug. Why so negative to negative probabilities? Wilmott Magazine, http://www.wilmott.com/article.cfm?PageNum_rsQueryArticlesMain=3&NoCookies=Yes&forumid=1, 2011.
[14] J. Huang and A. Darwiche. Using DPLL for efficient OBDD construction. In Theory and Applications of Satisfiability Testing, volume 3542 of Lecture Notes in Computer Science, pages 899–899. 2005.
[15] A. K. Jha and D. Suciu. Knowledge compilation meets database theory: compiling queries to decision diagrams. In ICDT, pages 162–173, 2011.
[16] B. Kanagal and A. Deshpande. Lineage processing over correlated probabilistic databases. In SIGMOD, pages 675–686, 2010.
[17] C. Koch and D. Olteanu. Conditioning probabilistic databases. PVLDB, 1(1):313–325, 2008.
[18] S. Kok and P. Domingos. Learning the structure of Markov logic networks. In ICML, pages 441–448, 2005.
[19] M. Ley. The DBLP computer science bibliography. http://www.informatik.uni-trier.de/~ley/db/.
[20] F. Niu, C. Ré, A. Doan, and J. W. Shavlik. Tuffy: Scaling up statistical inference in Markov logic networks using an RDBMS. PVLDB, 4(6):373–384, 2011.
[21] D. Olteanu and J. Huang. Using OBDDs for efficient query evaluation on probabilistic databases. In SUM, pages 326–340, 2008.
[22] D. Olteanu and J. Huang. Secondary-storage confidence computation for conjunctive queries with inequalities. In SIGMOD, pages 389–402, 2009.
[23] D. Olteanu, J. Huang, and C. Koch. SPROUT: Lazy vs. eager query plans for tuple-independent probabilistic databases. In ICDE, pages 640–651, 2009.
[24] D. Olteanu, J. Huang, and C. Koch. Approximate confidence computation in probabilistic databases. In ICDE, pages 145–156, 2010.
[25] H. Poon and P. Domingos. Sound and efficient inference with probabilistic and deterministic dependencies. In AAAI, pages 458–463, 2006.
[26] H. Poon and P. Domingos. Joint inference in information extraction. In AAAI, pages 913–918, 2007.
[27] H. Poon and P. Domingos. Unsupervised ontology induction from text. In ACL, pages 296–305, 2010.
[28] C. Re, N. N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probabilistic data. In ICDE, pages 886–895, 2007.
[29] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1-2):107–136, 2006.
[30] P. Singla and P. Domingos. Discriminative training of Markov logic networks. In AAAI, pages 868–873, 2005.
[31] P. Singla and P. Domingos. Entity resolution with Markov logic. In ICDM, pages 572–582, 2006.
[32] F. Somenzi. CUDD: CU Decision Diagram Package Release 2.4.2. http://vlsi.colorado.edu/~fabio/CUDD/, 2011.
[33] D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011.
[34] D. Z. Wang, M. J. Franklin, M. N. Garofalakis, J. M. Hellerstein, and M. L. Wick. Hybrid in-database inference for declarative information extraction. In SIGMOD, pages 517–528, 2011.
[35] I. Wegener. BDDs–design, analysis, complexity, and applications. Discrete Applied Mathematics, 138(1-2):229–251, 2004.
[36] M. L. Wick, A. McCallum, and G. Miklau. Scalable probabilistic databases with factor graphs and MCMC. PVLDB, 3(1):794–804, 2010.
[37] Alchemy - Open Source AI. http://alchemy.cs.washington.edu/.
