Local Structure and Determinism in Probabilistic Databases

Theodoros Rekatsinas, Amol Deshpande, Lise Getoor
University of Maryland, College Park, MD, USA
[email protected]  [email protected]  [email protected]

ABSTRACT

While extensive work has been done on evaluating queries over tuple-independent probabilistic databases, query evaluation over correlated data has received much less attention even though the support for correlations is essential for many natural applications of probabilistic databases, e.g., information extraction, data integration, computer vision, etc. In this paper, we develop a novel approach for efficiently evaluating probabilistic queries over correlated databases where correlations are represented using a factor graph, a class of graphical models widely used for capturing correlations and performing statistical inference. Our approach exploits the specific values of the factor parameters and the determinism in the correlations, collectively called local structure, to reduce the complexity of query evaluation. Our framework is based on arithmetic circuits, factorized representations of probability distributions that can exploit such local structure. Traditionally, arithmetic circuits are generated following a compilation process and cannot be updated directly. We introduce a generalization of arithmetic circuits, called annotated arithmetic circuits, and a novel algorithm for updating them, which enables us to answer probabilistic queries efficiently. We present a comprehensive experimental analysis and show speed-ups of at least one order of magnitude in many cases.

Categories and Subject Descriptors

H.2.4 [Database Management]: Systems—Query Processing; H.2.m [Database Management]: Miscellaneous; B.2.m [Arithmetic and Logic Structures]: Miscellaneous; G.3 [Mathematics of Computing]: Probability and Statistics

General Terms

Algorithms, Design, Management, Performance

Keywords

Probabilistic Databases, Arithmetic Circuits, Query Processing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD '12, May 20–24, 2012, Scottsdale, Arizona, USA. Copyright 2012 ACM 978-1-4503-1247-9/12/05 ...$10.00.

1. INTRODUCTION

An increasing number of applications are producing large volumes of uncertain data, fueling an interest in managing and querying such data using probabilistic databases. Examples of such applications include information extraction [20], data integration [17], sensor networks [16], object recognition [27], and OCR [26], to name a few. This has led to much work on extending relational databases to support uncertainty, and on efficiently evaluating different types of queries over such databases (see [40] for a recent survey).

Query evaluation over probabilistic databases is unfortunately known to be #P-hard even under tuple-independence assumptions for perhaps the simplest class of queries, namely, conjunctive queries without self-joins [10]. To overcome this limitation, a number of different approaches have been explored that can be categorized, at a high level, into extensional approaches and intensional approaches. In an extensional approach, the query evaluation process is guided solely by the query expression, and query operators are extended to directly compute the corresponding probabilities. For example, the probability of a join result tuple s ⋈ t is the product of the probabilities of the tuples s and t. When such extensional evaluation is possible, the query can be evaluated in polynomial time; hence, much research has focused on characterizing datasets, queries, and query plans for which extensional methods can be correctly applied [9, 30, 21, 31]. On the other hand, in an intensional approach, the intermediate tuples generated during query execution and the final result tuples are associated with propositional symbolic formulas (often called lineage expressions) over a subset of the random variables corresponding to the base input tuples. One of several general-purpose inference algorithms can then be used to compute the result tuple probabilities, either exactly, e.g., using Shannon expansion [30], variable elimination [38], etc., or approximately [25, 32, 34], depending on the complexity of the lineage expression and the uncertainty model.

Extensional methods scale well, but the class of queries for which they can be applied correctly is relatively small. Furthermore, if we relax the tuple-independence assumption and allow representation of correlated data, then extensional methods cannot be directly applied. Being able to handle correlations is critical for probabilistic databases because many of their natural applications require it. These include applications such as information extraction, data integration, computer vision, sensor networks, etc., where heavy use of machine learning techniques naturally results in complex correlations among the data. Hence, over the last few years, several probabilistic database systems have been proposed that can manage such correlated databases [37, 41, 42], with correlations typically captured using graphical models such as factor graphs or Bayesian networks [15]. However, intensional approaches that can process queries over such correlated databases and handle more general classes of queries are typically much slower than extensional approaches and have poor scalability, leading to a significant efficiency gap between the two approaches [21].

In this work, we aim to increase the efficiency of intensional methods by developing a framework that represents correlations using factor graphs [37, 42], and can exploit context-specific independence and determinism in the correlations, collectively referred to as local structure [13]. Context-specific independence [3, 43], often observed in practice, refers to independences that hold given a specific assignment of values to certain variables. Determinism in the correlations, i.e., assigning zero probability to some joint variable assignments, typically manifests in uncertainties involving logical constraints, e.g., mutual exclusion, implications, etc. Exploiting such local structure enables probabilistic inference to run efficiently in many scenarios where standard inference techniques such as variable elimination are not feasible [13]. Our framework builds upon the notion of an arithmetic circuit (AC) [13], which is a compiled and compact representation of a factor graph that can effectively exploit local structure to drastically reduce online inference times [5]. To our knowledge, there is no prior work on either modifying ACs directly or on computing probabilities of Boolean formulas over them. The highly-compact compiled form of ACs makes it a non-trivial challenge to support either of these efficiently. Hence, we introduce annotated arithmetic circuits (AACs), an extension where we add variable annotations on the internal operation nodes of an AC, and develop a novel algorithm for merging two AACs to, in essence, combine the uncertainties captured by the AACs. For evaluating conjunctive queries over an AAC-representation of a probabilistic database, we represent the resulting lineage formulas using ordered binary decision diagrams (OBDDs), suggested in prior work [30]. However, the AAC-representation of the database imposes significant constraints on how OBDDs can be generated, requiring us to develop new algorithms for this task.

Our approach can be seen as generalizing the prior work on what we refer to as instance optimal query evaluation, i.e., optimizing the evaluation of a particular query on a given dataset (with or without correlations). Using AACs enables us to exploit not only the conditional independences in the uncertainty, but also the local structure widely observed in practice. On the other hand, representing the lineage expressions as OBDDs enables us to utilize the structure and the regularities within the lineage expressions themselves, e.g., read-once lineage expressions can be evaluated efficiently on tuple-independent databases [30, 38]. In fact, for tuple-independent databases our approach reduces to the OBDD-based approach of Olteanu et al. [30], whereas for correlated databases without any local structure it reduces to the junction tree-based approach of Kanagal et al. [24, 23].

The main contributions of this paper are as follows:

• We introduce a new exact probabilistic query evaluation framework for supporting arbitrary correlations among the stored tuples, that exploits local structure for efficiency and generalizes several prior techniques for instance-optimal query evaluation.

• We introduce an extension of arithmetic circuits called annotated arithmetic circuits (AACs), that enables us to answer probabilistic queries efficiently. To our knowledge this is the first work that addresses the problem of evaluating complex queries over arithmetic circuits.

• We introduce a novel algorithm for incrementally updating AACs by exploiting the annotations present in the internal operation nodes of the circuit. Our merging algorithm allows us to incorporate new correlations introduced over a subset of tuples into the correlations already present in the database, without recompiling the existing arithmetic circuits.

• We experimentally verify the performance of our algorithm for both tractable and hard queries over tuple-independent databases and correlated databases based on the TPC-H benchmark. We observe a speed-up of at least one order of magnitude over variable elimination. Moreover, the performance of our algorithm is similar to that of other intensional methods based on Shannon decomposition.

2. PRELIMINARIES

In this section we present a short review of probabilistic databases and arithmetic circuits.

2.1 Probabilistic Databases

A probabilistic database can be defined using the possible world semantics [10]. Let R be a set of relations, X = {X1, ···, Xn} be a set of random variables associated with the tuples or attributes stored in the database (these could either be binary random variables capturing tuple existence uncertainty, or discrete random variables capturing attribute value uncertainty), and Φ be a joint probability distribution over X. A probabilistic database D is defined to be a probability distribution over a set of deterministic databases (or possible worlds) W, each of which is obtained by giving X a joint assignment x = {X1 = x1, ..., Xn = xn} such that xi ∈ dom(Xi). The probability associated with a possible world obtained from the joint assignment x is given by Φ.

Given a query q to be evaluated against database D, the result of the query is defined to be the union of results returned by each possible world. Furthermore, the marginal probability of each result t in the union is obtained by summing the probabilities of the possible worlds Wt ⊆ W that return t: Pr(t) = Σ_{w ∈ Wt} Pr(w).

Representation: Typically, we are not able to represent the uncertainty in the dataset using an explicit listing of the joint probability distribution Φ. Instead, more compact representations need to be used. The different representations differ in their expressibility and the complexity of query evaluation. The simplest representation associates tuple existence probabilities with individual tuples, and assumes that the tuple existences are independent of each other. However, most real-world datasets contain complex correlations; therefore, making an independence assumption can lead to oversimplification and large errors [36, 37]. Instead, we use a general and flexible representation of a probabilistic database, proposed by Sen et al. [36] and also used by Wick et al. [42], that can capture complex correlations among the tuples or the attributes in the database through use of factor graphs, a class of graphical models that generalizes both directed Bayesian networks and undirected Markov networks.
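For illustration, the factor-graph semantics can be sketched by brute-force enumeration. This is a toy Python sketch, not part of the paper's system; the dictionary encoding and variable names are our own, with the factor values taken from the example database of Figure 1. The joint distribution is the normalized product of the factor entries, and a marginal is obtained by summing over all consistent joint assignments:

```python
from itertools import product

# Factor tables from the example database of Figure 1 (binary variables;
# index 0 denotes the "true" assignment x_1 and index 1 the "false" one x_2).
factors = [
    (("X1",),      {(0,): 0.6, (1,): 0.4}),                                # f1
    (("X1", "X2"), {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.0, (1, 1): 1.0}),  # f2
    (("Y1",),      {(0,): 0.2, (1,): 0.8}),                                # f3
    (("Y1", "Y2"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.9}),  # f4
    (("Z1",),      {(0,): 0.8, (1,): 0.2}),                                # f5
    (("Z1", "Z2"), {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.0, (1, 1): 1.0}),  # f6
]
VARS = ("X1", "X2", "Y1", "Y2", "Z1", "Z2")

def phi(assign):
    """Unnormalized product of all factor entries for one joint assignment."""
    a = dict(zip(VARS, assign))
    p = 1.0
    for scope, table in factors:
        p *= table[tuple(a[v] for v in scope)]
    return p

worlds = list(product([0, 1], repeat=len(VARS)))
Z = sum(phi(x) for x in worlds)                        # normalization constant
marg_x1 = sum(phi(x) for x in worlds if x[0] == 0) / Z # Pr(X1 = x_1)
```

With these particular values Z = 1, so the distribution is already normalized, and the marginal Pr(X1 = x_1) comes out to 0.6. Of course, this enumeration is exponential in the number of variables; avoiding it is precisely what the compact representations discussed next are for.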

More formally, we have:

Definition 1. A factor f : dom(X1) × dom(X2) × ··· × dom(Xm) → R+ is a function over a set of random variables X = {X1, X2, ..., Xm} such that f(x) ≥ 0 for all x ∈ dom(X1) × ··· × dom(Xm). The set of variables X is called the scope of the factor, denoted Scope[f].

Definition 2. A factor graph P = (F, X) defines a joint distribution Φ over the set of random variables X via a set of factors F, where for every f(·) ∈ F, Scope[f] ⊆ X. Given a complete joint assignment x = {X1 = x1, ..., Xn = xn} such that xi ∈ dom(Xi), the joint distribution is defined by Φ(x) = Pr(x) = (1/Z) ∏_{f∈F} f(x_f), where x_f denotes the assignment x restricted to Scope[f] and Z = Σ_{x'∈dom(X)} ∏_{f∈F} f(x'_f).

This leads us to a formal definition of a probabilistic database:

Definition 3. A probabilistic database D is a pair (R, P) where R is a set of relations and P denotes a factor graph defined over the set of random variables associated with the tuples in R.

We require that the joint distribution defined by the factor graph satisfy certain normalization constraints, i.e., that the normalization constant Z = 1. This does not imply a limitation on the applicability of the proposed framework but is only used for ease of representation.

Figure 1: (a) A probabilistic database where the uncertainty is represented with factors. (b) The lineage corresponding to a conjunctive query. (Query: q() :- X(A,B), Y(B,C), Z(C,D); Lineage: (x1∧y1∧z1) ∨ (x2∧y2∧z2).)

Figure 2: A query over the probabilistic database in Figure 1. (a) The factor graph for the query with AND and OR factors. (b) The OBDD for the lineage of the result tuple when the random variables are independent.

Figure 1(a) shows an example probabilistic database represented using factors, along with the factor graphs themselves (which contain nodes for each factor and each random variable, and a factor is connected to all variables that it is defined over). We assume we only have tuple existence uncertainties, and the Boolean random variables corresponding to the tuple existences are associated with the tuples (x1, x2, ···). In the remainder of the paper, for any Boolean random variable x we will denote its true and false assignments by x_1 and x_2, respectively. The factors are stored separately. In this example, we have six factors, f1, ···, f6. The random variables over which they are defined are stored in a separate table (called factor). As an example, we have two factors containing variable x1, namely, f1 and f2, the latter of which is a joint factor over x1 and x2. The joint probability distribution over all variables is defined as:

Φ = f1(X1) f2(X1, X2) f3(Y1) f4(Y1, Y2) f5(Z1) f6(Z1, Z2).

Querying a probabilistic database: Executing an SQL query over a probabilistic database efficiently has been a subject of much research over the last decade in the database community, and as we discussed earlier, the approaches can roughly be divided into extensional approaches and intensional approaches. With our focus on correlated databases, we are restricted to using an intensional approach. In intensional methods, the relational operators are extended to build a Boolean formula (called lineage) for each intermediate tuple and each result tuple generated during query evaluation (Figure 1(b)). The marginal probability of a result tuple can now be obtained by computing the probability of the corresponding Boolean formula evaluating to true, which is #P-hard. Next, we review some of the intensional query evaluation techniques that have been proposed in the literature.

Variable Elimination (VE)-based Approach: In the VE-based approach [36], instead of constructing a lineage formula for each result tuple, we construct an equivalent representation as a factor graph where each intermediate tuple is explicitly represented (see Figure 2(a)). For each intermediate tuple, we add an appropriate factor containing the tuple and the tuples that generated it. For instance, for tuple i1 generated by joining tuples x1 and y1, we introduce an AND factor that captures the logical constraint that i1 is true iff x1 and y1 are both true. Similarly, for tuple i5, we add an OR factor capturing the logical constraint that i5 is true if either i2 or i4 is true (corresponding to a project operation). Query evaluation is now equivalent to performing inference on this factor graph to compute the marginal probability distribution of i5.

Variable elimination (VE) [14] is a simple and widely-used technique for performing inference over factor graphs. In essence, VE operates by eliminating one variable at a time from the factor graph until we are only left with the variable of interest (in this case, i5). For this purpose, a variable ordering needs to be chosen a priori to specify the order in which to eliminate the variables. At each iteration two operations are performed: let X be the variable under consideration. All factors that refer to X are multiplied to get a new factor f', and then X is summed out of f' to get a factor f'' with no reference to X. As an example, if we choose to eliminate x1 in the first step, we would multiply the factors f1, f2, and the AND factor on x1, y1, i1 to get a factor on x1, y1, i1, x2, and sum out x1 to get a new factor on y1, i1, x2. The complexity of the inference procedure is exponential in the size of the largest factor (measured as the number of variables) created during the process, which is at least the treewidth of the factor graph. However, finding the optimal variable ordering is NP-hard, and heuristics are typically used.

Sen et al. [37] introduced a lifted inference technique that exploits the symmetry in the probabilistic database to reduce the complexity of query evaluation. Our work is orthogonal to their proposal of exploiting symmetry, and it is an interesting future direction to see how these two can be combined.
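The two VE operations described above — multiplying all factors that mention a variable and then summing that variable out — can be sketched as follows. This is a toy implementation for binary variables; the tuple/dictionary factor encoding is our own, with f1 and f2 taken from Figure 1:

```python
from itertools import product

def multiply(fac1, fac2):
    """Pointwise product of two factors; a factor is (scope, table)."""
    s1, t1 = fac1
    s2, t2 = fac2
    scope = list(s1) + [v for v in s2 if v not in s1]
    table = {}
    for assign in product([0, 1], repeat=len(scope)):
        a = dict(zip(scope, assign))
        table[assign] = t1[tuple(a[v] for v in s1)] * t2[tuple(a[v] for v in s2)]
    return scope, table

def sum_out(fac, var):
    """Marginalize var out of factor fac, removing it from the scope."""
    scope, table = fac
    i = scope.index(var)
    new_scope = scope[:i] + scope[i + 1:]
    out = {}
    for assign, val in table.items():
        key = assign[:i] + assign[i + 1:]
        out[key] = out.get(key, 0.0) + val
    return new_scope, out

# Eliminate x1 from f1(x1) and f2(x1, x2) (values from Figure 1).
f1 = (["x1"], {(0,): 0.6, (1,): 0.4})
f2 = (["x1", "x2"], {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.0, (1, 1): 1.0})
g = sum_out(multiply(f1, f2), "x1")
# g is a factor over x2 alone: {(0,): 0.18, (1,): 0.82}
```

Repeating these two steps per the chosen ordering is the whole algorithm; the intermediate `table` sizes are what blow up exponentially on high-treewidth graphs.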
The prob- 1 1 b a b b b a a b θ 2 1 1 2 1 2 1 2 0.7 ability of any non-terminal node is computed as the sum over the a1 b2 b a 2 1 a2 b1 0 a b θ a b probabilities of its children, weighted by their corresponding edge 2 1 b1a2 2 2 1 probabilities. One can, therefore, compute the probability that the a b θ 2 2 b2a2 lineage formula evaluates to true by traversing all bottom-up paths from the 1-terminal node to the root, multiplying the probabilities Figure 3: ACs and their factor graphs. Although ACs are along the way, and then summing the products. This can be done DAGs, the directions on the edges are not explicitly drawn. (a) in time linear in the size of the OBDD, hence, when the lineage Assuming no local structure, the size of the AC is exponential formula results in an OBDD of polynomial size, the query can be in the number of variables. (b) Exploiting determinism (i.e., evaluated efficiently. P r(A = a2,B = b1) = 0) leads to an AC of smaller size. However not all Boolean formulas admit an OBDD of polyno- mial size. In fact, OBDD construction is also driven by a variable exponential number of terms, i.e., one term for each joint assign- ordering which dictates the order in which the variables are evalu- ment of the random variables in X . For example, the factor graph ated and corresponds to the top-down order of decision nodes in the in Figure 3(a), in which all random variables are binary, induces final OBDD. Choosing the optimal variable ordering is NP-hard. A the following MLF: comprehensive review of different construction techniques is pre- λa1 λb1 θa1 θb1a1 + λa1 λb2 θa1 θb2a1 + sented by Mantadelis et al. [29]. λa2 λb2 θa2 θb2a2 + λa2 λb2 θa2 θb2a2 Discussion: All the approaches mentioned above present signifi- cant limitations in presence of correlations and local structure. 
Fac- Given the MLF for a factor graph, we can compute the probabil- tor graphs do not exploit the local structure of the factors to reduce ity of evidence, denoted by Pr(e), i.e., the probability of a specific the complexity of inference. Moreover, OBDDs are applicable joint assignment of all (or of a subset of) the random variables, by only under the tuple-independence assumption. In the next section setting the appropriate evidence indicators to 0 instead of 1 and we present an approach that combines the representational power evaluating the MLF. While the MLF has exponential size, if we can of factor graphs with the compactness of decomposition methods factor it into something small enough to fit within memory, then we leading to more efficient query evaluation over correlated databases. can compute Pr(e) in time linear in the size of the factorization. The factorization will take the form of an arithmetic circuit [13]. 2.2 Arithmetic Circuits More rigorously we have the following definition. In this section we briefly review how context-specific indepen- Definition 4. An arithmetic circuit (AC) over variables Σ is a dence and determinism can be exploited to enable efficient exact rooted, directed acyclic graph whose leaf nodes are either numeric inference even in factor graphs with high treewidth, through use of constants or evidence indicators, internal nodes correspond to prod- arithmetic circuits [5, 6, 7, 8]. Context-specific independence is uct and sum operations, and the root node corresponds to the cir- prevalent in relational domains, since the underlying structure in- cuit’s output. The size of the arithmetic circuit is defined to be the troduces regularities and conditional independencies that are true number of its edges. only under specific contexts. Furthermore, determinism appears We elaborate more on the connection between ACs and variable during query evaluation, where every relational operator introduces elimination. 
As mentioned earlier, VE is an algorithm that acts on deterministic constraints over its input tuples, e.g., both input tu- a set of factors and, driven by a variable ordering, performs two ples of a join must exist in order for the intermediate tuple to exist. operations at each iteration: First, factors that contain a particular One can also consider the constraints introduced by foreign keys variable are multiplied to create a new factor and, then, that variable as another source of deterministic correlations. Exploiting such de- is summed out of that factor. An arithmetic circuit can be viewed terminism is important for improving the efficiency of probabilistic as the trace of the VE process for a particular factor graph [13]. query processing. We refer to the process of producing an AC from a factor graph as Let Φ(·) be the joint distribution over a set of random variables compilation. One way to do this is to represent the joint distribution X defined by a factor-graph. We associate Φ with a unique multi- by a propositional logical formula in tractable logical form, known linear function (MLF) [12] over two types of variables: as deterministic, decomposable negation normal form (d-DNNF), • Evidence indicators: For each random variable Y ∈ X with which is then mapped to an AC [11]. Other approaches are based dom(Y ) = {y1, ··· , yn}, we have a set of evidence indicators: on decision diagrams, and we discuss them in Section 4.2. Jha and Suciu [22] show that d-DNNF is a tractable logical form which {λy1 , λy2 , . . . , λyn }, i.e., one evidence indicator for each yi. subsumes decision diagrams. Therefore, arithmetic circuits can be • Factor parameters: For each factor f over a set of random vari- viewed as a generalization of the decision diagrams used in the ables X, we have a set of parameters θX=x. intensional probabilistic inference methods presented earlier. 
For any unobserved random variable, i.e., a random variable whose A probability of evidence query is computed by assigning ap- value is not fixed, all the evidence indicators are set to 1. When a propriate values to the evidence-indicator nodes and evaluating the particular value is assigned to a random variable, the indicator cor- circuit in a bottom-up fashion to compute the value of the root. For responding to that value is set to 1 and all other indicators is set to example, using the AC in Figure 3(b) we can compute the prob-

0. The factor parameters θX=x correspond to the actual values in ability of evidence Pr(A = a1,B = b1) by first setting λa1 = the factors of the factor graph. The MLF for a factor graph has an 1, λa2 = 0, λb1 = 1, λb2 = 0, and then traversing and evaluating + x + y the circuit. This process may be repeated for as many probabil- 1 1 + z 1 0 1 0 1 λ ity of evidence queries as desired and it is only linear in the size 0 1 λy λ λz z λ λ 11 y12 11 12 x11 x12 ** * of the AC. The size of an AC is in the worst case exponential in * * 0.2 0.8 0.8 * 0.2 + x + x the treewidth of the factor graph. However, if local structure is 0.6 2 2 0.4 + y + y 0 1 2 2 + z 2 + z 2 present, the size of an AC is often significantly smaller. Figure 3(b) 0 1 0 1 0 1 0 1 ** * 0 1 shows one example where the factor value for the joint assignment * * * * ** * λ λ A = a2,B = b1 is set to 0, hence, the corresponding sub-circuit is x 0.3 0.7 0 x λy 0.7 0.3 λ 0.1 0.9 λ 0.1 0.9 λ 0 21 22 21 y z21 z22 pruned, resulting in a much smaller AC. 22 Figure 4: The (complete) AACs corresponding to the proba- 3. ARITHMETIC CIRCUITS IN PROBA- bilistic database shown in Figure 1. BILISTIC DATABASES random variable can be present only in one AAC. Maintaining a In this section we discuss how arithmetic circuits can be used in collection of AACs, denoted by ACol, instead of a single AAC, correlated probabilistic databases. We begin by discussing a naive allows for indexing the AACs, thereby, offering more flexibility approach that uses ACs for inference alone, discuss its limitations, while merging them. and then present an overview of our proposed approach. Phase 2 - Lineage Processing: Given a query Q, we compute a lineage formula for each result tuple using standard techniques. 3.1 Naive Approach Phase 3 - Query Evaluation: During this phase we iterate through Let D denote a probabilistic database and P the factor graph rep- the result tuples of the given query Q. 
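Such a bottom-up AC evaluation can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tuple-based node encoding is our own, while the parameter values are those of the two-variable network of Figure 3 (θ_a1 = 0.6, θ_a2 = 0.4, θ_b1a1 = 0.3, θ_b2a1 = 0.7, θ_b1a2 = 0, θ_b2a2 = 1):

```python
from functools import reduce

def evaluate(node, ind):
    """Evaluate an AC node bottom-up given evidence-indicator values."""
    if isinstance(node, (int, float)):
        return float(node)            # parameter constant (a theta value)
    op, arg = node
    if op == "ind":
        return float(ind[arg])        # evidence indicator (a lambda value)
    vals = [evaluate(child, ind) for child in arg]
    return sum(vals) if op == "+" else reduce(lambda x, y: x * y, vals)

# A factored circuit for the MLF of the network in Figure 3:
#   sum over A of lambda_a * theta_a * (sum over B of lambda_b * theta_ba)
ac = ("+", [
    ("*", [("ind", "a1"), 0.6, ("+", [("*", [("ind", "b1"), 0.3]),
                                      ("*", [("ind", "b2"), 0.7])])]),
    ("*", [("ind", "a2"), 0.4, ("+", [("*", [("ind", "b1"), 0.0]),
                                      ("*", [("ind", "b2"), 1.0])])]),
])

no_evidence = {"a1": 1, "a2": 1, "b1": 1, "b2": 1}
evidence    = {"a1": 1, "a2": 0, "b1": 1, "b2": 0}   # A = a1, B = b1
# evaluate(ac, no_evidence) -> 1.0 ; evaluate(ac, evidence) -> 0.18
```

With all indicators at 1 the circuit returns the normalization constant (1.0 here), and with the indicators for A = a1, B = b1 it returns Pr(A = a1, B = b1) = 0.18; each query is a single linear pass over the circuit, as described above. Exploiting the zero parameter θ_b1a2 = 0 would let the corresponding sub-circuit be pruned entirely, as in Figure 3(b).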
resenting the correlations among the stored data. Consider a query Q against D. Following the factor graph approach (Figure 2(a)), we can construct a new (augmented) factor graph P0 for Q on which inference needs to be performed. Compiling P0 into an arithmetic circuit results in a compact representation of the VE process due to the deterministic intermediate factors introduced. Inference can be performed by parsing the circuit to compute the result probabilities. However, although the inference time in ACs is low, compilation time can be quite expensive. Furthermore, for each different query, the corresponding augmented factor graph needs to be compiled into a new AC. Therefore, such an approach is not a viable means for evaluating queries against probabilistic databases.

Ideally, we would like to avoid repeatedly compiling the base arithmetic circuit, ACP, from the database. Instead, it would be desirable that we construct ACP offline once, and save it in the database. Then, for a given query Q, we can construct a new arithmetic circuit, ACQ, and somehow "merge" the two ACs to get a single AC for computing the query result. However, arithmetic circuits do not support online updates. This significant limitation arises because only the leaf nodes of the circuit provide information about the random variables present in the factor graph through the corresponding evidence indicators. The internal nodes do not have enough information to determine which variables participate in a particular operation. Therefore, it is impossible to merge arithmetic circuits into a unified variable elimination trace that takes into account all the corresponding correlations.

3.2 Overview of the Proposed Framework

We begin with a brief overview of our proposed query evaluation framework. We elaborate on the steps in the next two sections. We also introduce the running example that we will be using.

Phase 1 - Preprocessing: We assume that a probabilistic database D and the factor graph P representing the correlations among the stored tuples are given as input to the proposed framework. The factors in P are represented using ADDs to capture the local structure. Offline, we compile the given probabilistic database into an annotated arithmetic circuit (AAC), an extended version of an AC where sum nodes are annotated with the variable on which the summation is performed. The AACs corresponding to the probabilistic database shown in Figure 1 are shown in Figure 4. As depicted, we do not have a single AAC for the entire network. Instead, we require that disconnected parts of the factor graph referring to independent sets of variables correspond to separate AACs. An immediate consequence of this is that a …

For each tuple we perform the following steps:

(a) The lineage formula introduces a set of new deterministic correlations among the random variables present in it. Specifically, the new correlations describe the logical constraints under which the lineage formula evaluates to 1. Then the lineage is compiled into a new AAC (called a lineage-AAC) that captures the constraints and represents them in a compact way.

(b) The lineage-AAC is merged with the collection of all the AACs that refer to variables present in the lineage-AAC.

(c) The result tuple probability is computed by traversing the resulting merged AAC.

We define and describe AACs in Section 4.1, and describe algorithms to compile the database factor graph and the lineage formula into AACs in Sections 4.2 and 4.3. We then present our merging algorithm for combining multiple AACs in Section 5.

4. ANNOTATED ARITHMETIC CIRCUITS

In this section we introduce annotated arithmetic circuits and show how a correlated probabilistic database can be compiled into a collection of AACs. We also introduce a novel algorithm for representing lineage formulas as AACs.

4.1 Definitions

An annotated arithmetic circuit is a generalization of an arithmetic circuit that includes full information on the sequence of arithmetic operations performed during inference and the variables that participate in them. We maintain this information by adding variable annotations to the internal operation nodes. More formally, we have the following definition:

Definition 5. An annotated arithmetic circuit (AAC) over variables Σ is a rooted, directed acyclic graph whose leaf nodes are either numeric constants or evidence indicators, internal nodes correspond to product and sum operations, and the root node corresponds to the circuit's output. Each sum node is annotated with the corresponding variable si that is being summed out, and its outgoing edges are annotated with the values of the different instantiations of variable si. The size of the annotated arithmetic circuit is defined to be the number of its edges.

AACs inherit their representational power from regular ACs, and therefore can capture the local structure and the conditional independences present in a network. Moreover, variable annotations provide the necessary information to detect the exact order of operations performed during variable elimination, as we discuss in the next section. This enables us to detect and directly update the corresponding parts of the circuit when new correlations are introduced, and thus plays a key role when merging different arithmetic circuits. To guarantee the correctness of the merging algorithm presented in Section 5, we require that the circuit be a complete trace of the variable elimination algorithm.

Definition 6. An AAC over variables Σ is a complete trace of variable elimination over variables in Σ if each evidence indicator λX=x for a random variable X is preceded by a sum node annotated with variable X and none of its sibling nodes is an evidence indicator.

When an AAC is a complete trace of VE, sum and product nodes appear in an interleaving manner. Every sum node will be followed by a product node, and every operation child of a product node will be a sum node. From now on, we will assume that all AACs correspond to complete traces, and we will drop the qualification.

Examples of AACs corresponding to complete traces of VE are shown in Figure 4. As we can see, the annotations in the sum nodes allow us to trace the exact variable that is being summed out.
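To make the AAC structure concrete, the following minimal sketch (our illustration, not the paper's C++ implementation; all names are hypothetical) encodes an AAC as nested tuples with sum nodes carrying their variable annotation, and evaluates it bottom-up. Setting all evidence indicators to 1 yields the normalization value, while clamping an indicator yields a marginal probability.

```python
# Hypothetical minimal AAC sketch. Node kinds:
#   ("const", value)        numeric constant leaf
#   ("ind", var, val)       evidence indicator lambda_{var=val}
#   ("*", children)         product node
#   ("+", var, children)    sum node annotated with the summed-out variable

def evaluate(node, evidence):
    """Bottom-up evaluation; evidence maps var -> observed value."""
    kind = node[0]
    if kind == "const":
        return node[1]
    if kind == "ind":                       # indicator is 1 unless contradicted
        _, var, val = node
        return 1.0 if evidence.get(var) in (None, val) else 0.0
    if kind == "*":
        out = 1.0
        for c in node[1]:
            out *= evaluate(c, evidence)
        return out
    if kind == "+":                         # annotated sum over one variable
        return sum(evaluate(c, evidence) for c in node[2])
    raise ValueError(kind)

# AAC for a single Bernoulli tuple-variable X with P(X = 1) = 0.6,
# written as a complete VE trace: sum over X of product(parameter, indicator).
aac_x = ("+", "X", [("*", [("const", 0.4), ("ind", "X", 0)]),
                    ("*", [("const", 0.6), ("ind", "X", 1)])])

print(evaluate(aac_x, {}))        # all indicators set to 1 -> 1.0
print(evaluate(aac_x, {"X": 1}))  # P(X = 1) -> 0.6
```

This mirrors how the final merged circuit is used: the paper computes a result probability by setting all indicator variables to 1 and parsing the circuit once, in time linear in its size.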
If we compare the first AAC presented in Figure 4 with the arithmetic circuit shown in Figure 3, we see that both capture the same pattern of determinism; however, the AAC has an extra sum node in the sub-circuit that expresses the deterministic conditional probabilities. This is necessary because the presented AAC keeps a full trace of the VE process. Moreover, when a product node has constant 1 as a child, we drop the constant, but the product node itself is kept since it is needed during merging. We note that a complete AAC exploits local structure as a regular AC does, and presents only a small increase in size compared to a regular AC.

4.2 Compiling Factor Graphs into AACs

We compile a factor graph into the corresponding collection of AACs by extending a compilation algorithm introduced by Chavira [6], which is based on variable elimination and algebraic decision diagrams (ADDs) [1]. Analogous to how a BDD is a representation of a Boolean function, an ADD is a graph representation of a function that maps instantiations of Boolean variables to real numbers. Our algorithm to compile factor graphs into AACs extends the algorithm described in [6] by adding variable annotations and merging contiguous product nodes, outputting a complete AAC. For brevity, we omit a detailed description here.

However, one important issue that warrants discussion here is the variable ordering used to generate the AACs. Similar to variable elimination and the OBDD construction algorithm, we need to choose an ordering of the variables to compile the database into AACs using the above procedure. Let Π denote the total ordering over all variables that is used to generate the AACs, and let ΠCol denote a collection of partial orderings over the disjoint sets of variables corresponding to the different AACs in ACol. As we will see in the next section, the AAC corresponding to the lineage of a query result tuple must respect all of these partial orderings. This crucial constraint imposed by the AAC merging algorithm means that we cannot use standard algorithms for constructing an AAC from a lineage expression.

4.3 Compiling Lineage Formulas into AACs

Lineage can be represented as a factor graph (Figure 2(a)), and we can use the above algorithm to construct an AAC for it. However, the lineage corresponds to a factor graph limited to consist of only two deterministic factors (AND and OR). Hence, we can employ more efficient techniques based on OBDD construction.

Query: q() :- X(A,B), Y(B,C), Z(C,D)    Lineage: (x1 ∧ y1 ∧ z1) ∨ (x2 ∧ y2 ∧ z2)
(a) OBDD, var. order: x1, y1, z1, x2, y2, z2; (b) OBDD, var. order: x1, x2, y1, y2, z1, z2; (c) Lineage AAC, var. order: x1, x2, y1, y2, z1, z2.
Figure 5: (a) The OBDD for the given lineage formula when the variable ordering is generated by a constraint-oblivious heuristic. (b, c) An example of the lineage AAC corresponding to the given query when executed against the database shown in Figure 1. The new constraint-aware heuristic was used to generate the variable ordering.

Choosing a variable order for the lineage OBDD: The order in which the variables are evaluated in an OBDD plays a crucial role in determining its size. Given a good variable ordering, the size of the OBDD can be polynomial in the size of the corresponding Boolean expression. In the general case, finding the optimal variable ordering for a given Boolean formula is NP-hard [2]. OBDDs have been extensively used in VLSI design, and many heuristics that give good variable orderings have been proposed in the corresponding literature [18]. In particular, there are two main guidelines that the proposed heuristics satisfy:

(a) Input variables of a Boolean formula that are connected should appear together in the ordering. For example, consider a CNF formula where we fix the values of a set of variables that belong to the same clause so that the clause evaluates to zero. The entire formula will then evaluate to zero.

(b) Input variables with higher fan-out should appear sooner in the variable ordering since they have greater influence on the output of the formula. Fixing the values of influential variables first may lead to fixing the values of larger parts of the Boolean formula. By having those variables together in the OBDD, we can significantly reduce the size of the OBDD.

Another advantage of these requirements is that they facilitate caching. During OBDD construction the results of intermediate operations are stored in a cache, following a similar rationale as dynamic programming algorithms where the subproblems of the initial problem are cached. This ensures that all intermediate OBDDs are of polynomial size if the final OBDD is of polynomial size.

Partial order constraints: This brings us to the main challenge in constructing an AAC for the lineage formula. Recall that ΠCol can be seen as a list of disjoint orderings, each specifying an order over a disjoint subset of the variables. Let Πfinal denote the variable ordering used to construct the lineage-AAC. We require that Πfinal satisfy two constraints, one motivated by the merging algorithm (Section 5), and the second motivated by the desire to enable caching in that phase (cf. Section 5). Specifically, we require that:

(c) Πfinal must respect the partial orderings in ΠCol. This is necessary since it enables merging multiple AACs that refer to the same set of variables.

(d) Variables that are present in a constraint must be kept together in Πfinal to enable caching in the merging phase.

Requirement (c) is mandatory for the correctness of the merging algorithm; therefore, we assume that it is always satisfied. We elaborate more on requirement (d). Heuristics that only take into account requirements (a) and (b) may generate a variable ordering that minimizes the size of the lineage-AAC but will not always give a good variable ordering for the final AAC. In particular, disregarding the partial ordering constraints may lead to a variable ordering that does not enable efficient caching during the merging phase, leading to a final AAC of exponential size. As we show in Section 5, caching plays an important role in the performance of our proposed AAC merging algorithm.

Keeping the variables that are present in a constraint together in Πfinal enables the detection of parts of the final AAC that refer to disjoint sets of variables, allowing caching of those parts. Interleaving variables present in different partial ordering constraints makes it harder to detect isomorphic sub-graphs of the AAC.

Consider the left part of the OBDD shown in Figure 5(a). As mentioned earlier, we require that the final (after merging) AAC respect a global variable ordering, specifically, the one shown in the figure. Observe that during merging, variables y1 and z1 will be inserted before x2, y2, and z2, as the latter depend on the former. Since x2 is independent of z1, the parts of the left sub-AAC that refer to x2 and appear in the two sub-AACs corresponding to z1 = 0 and z1 = 1, denoted by Az1=0 and Az1=1 respectively, will be the same. However, caching cannot be used since Az1=0 and Az1=1 are not isomorphic. The reason is that the sub-circuits corresponding to z2 which appear at the end of Az1=0 and Az1=1 are different. Finally, we note that the problem of minimizing the size of an OBDD given order constraints over the input variables is NP-hard because the OBDD construction without any such constraints is NP-hard. Next, we develop a heuristic for this problem.

Variable ordering heuristic: Ideally, we would like to find a variable order that satisfies all four requirements listed above. However, the requirements will usually conflict with each other. We introduce a new heuristic algorithm to generate a good variable ordering Πfinal that will be used during the construction of both the lineage AAC and the final AAC. The new heuristic is shown in Algorithm 1. It takes as input the ordering constraints defined in ΠCol and the lineage formula L, and it returns the variable ordering Πfinal. In Algorithm 1, we represent orderings as vectors.

Algorithm 1 variableOrdering(ΠCol: a collection of ordering constraints, L: a lineage formula): returns Variable Ordering
1: C ← transform L into its corresponding Boolean circuit.
2: VarL ← get the set of variables present in L.
3: Sg ← get the set of groups of connected variables for circuit C.
4: for g ∈ Sg do
5:   Og ← order the variables in g in decreasing order with respect to their fan-out in C.
6:   Assign a position score PScoreg to group g.
7: end for
8: OCl ← order the groups in increasing order by their PScore.
9: Πfinal ← {}
10: for g ∈ OCl do
11:   for v ∈ Og do
12:     if v ∉ Πfinal then
13:       Constrv ← get the constraint for v from ΠCol
14:       u ← arg max_{w ∈ Constrv ∩ L} (position of w in Constrv)
15:       S ← ∪_{w ∈ Constrv, w ⪯ u} w
16:       Πfinal ← Πfinal ∥ S
17:     end if
18:   end for
19: end for
20: return Πfinal

Let us elaborate on the algorithm. First, the Boolean formula L is converted to its corresponding Boolean circuit C. Assuming that connections between the input variables are only introduced because of the lineage formula, the algorithm starts by detecting groups of connected input variables. This directly addresses requirement (a). Let Sg denote the set of groups. Variables contained in a single group should appear together.

Variables are assigned to groups in Sg by traversing circuit C recursively: for each variable node v in C we examine its siblings, i.e., other nodes that are connected with v via some Boolean operation. If any of these siblings is assigned to a group g, we assign v to g. Otherwise, we create a new group and assign v to it. For each internal operation node o we examine its children nodes. If all of them belong to the same group then we assign o to it, otherwise we assign o to a new one.

The algorithm proceeds by generating an ordering Og among the variables contained in a single group. Variables are ordered in descending order of their fan-out in C. This step complies with requirement (b), stating that more influential variables appear sooner. So far the algorithm is oblivious to the ordering constraints in ΠCol. The following steps are introduced to account for them.

Ordering variables within groups only generates partial orderings over the variables. To get a total ordering, it is necessary to impose an ordering OCl on the groups according to their position score (PScore). Motivated by requirement (c), we define the following process: each input variable is associated with a position score, which is defined to be its position in the corresponding ordering in ΠCol. The position score of each group in Sg is defined to be the average position score of the variables contained in it. The intuition behind this is that variables that appear early in an ordering in ΠCol should also appear early in Πfinal.

Until now the proposed algorithm does not explicitly satisfy requirements (c) and (d), presented above. To address them, the algorithm iterates over the groups and the variables in them. Let v denote the variable under consideration at each step of the iteration. The algorithm finds the ordering constraint Constrv corresponding to v and appends to the final ordering Πfinal the set of variables S that contains v and all variables u ∈ Constrv that precede v and are not present in Πfinal. It is easy to see that both requirements (c) and (d) are satisfied.

In Algorithm 1 we show the append operation with the concatenation symbol ∥. Notice that some of the variables may not be present in the lineage formula but are necessary in the merging phase, as they appear before v in the corresponding ACs.

Recall that an OBDD corresponding to a lineage formula is a compact decision diagram representing the set of constraints over the instantiations of the random variables under which the lineage formula evaluates to true (Section 2), and can be constructed by choosing an ordering of the variables in which the variables are evaluated. As discussed above, the variable ordering we choose must respect all the partial orderings in ΠCol. However, none of the standard order selection algorithms can be used for our purpose, as they do not take into account the ordering constraints.

Constructing an AAC from an OBDD: The easiest way to construct an AAC for a given lineage formula is to first construct an OBDD, and then modify it by adding the necessary annotated operation nodes with the appropriate indicator constants. Figure 5(b) depicts the OBDD and Figure 5(c) the AAC corresponding to a conjunctive query executed against the probabilistic database shown in Figure 1. The query generates a single Boolean formula. There is a one-to-one mapping between the OBDD and the corresponding AAC; in particular, each decision node is converted into a sum node, and expanded to add a product node and an appropriate evidence indicator. We see that the AAC represents the deterministic correlations introduced by the query. During the lineage-AAC construction we ignore the correlations among the random variables.
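The ordering heuristic can be sketched as follows. This is a deliberately simplified illustration under our own assumptions, not the paper's implementation: lists of co-occurring lineage variables stand in for the Boolean circuit C, and all helper names are hypothetical. It shows the three ingredients: grouping connected variables, ranking groups by average position score, and emitting each variable together with its not-yet-placed predecessors from its ordering constraint.

```python
# Hypothetical, simplified sketch of the constraint-aware ordering heuristic:
# requirement (a): variables of the same clause stay together;
# requirement (b): higher fan-out first within a group;
# requirements (c), (d): groups ranked by average position score, and each
# variable appended with its constraint predecessors from Pi_Col.

def variable_ordering(pi_col, clauses, fanout):
    constraint = {v: order for order in pi_col for v in order}
    position = {v: i for order in pi_col for i, v in enumerate(order)}

    # one group per clause; within a group, decreasing fan-out (stable ties)
    groups = [sorted(dict.fromkeys(c), key=lambda v: -fanout[v])
              for c in clauses]
    # groups ordered by increasing average position score (PScore)
    groups.sort(key=lambda g: sum(position.get(v, 0) for v in g) / len(g))

    final = []
    for g in groups:
        for v in g:
            order = constraint.get(v, [v])
            prefix = order[: order.index(v) + 1]   # v plus its predecessors
            final.extend(u for u in prefix if u not in final)
    return final

pi_col = [["x1", "x2"], ["y1", "y2"]]
clauses = [["x2", "y1"]]              # a single lineage clause: x2 AND y1
fanout = {"x2": 1, "y1": 1}
print(variable_ordering(pi_col, clauses, fanout))
# -> ['x1', 'x2', 'y1']: x1 enters before x2 as Pi_Col requires, even
#    though x1 does not appear in the lineage formula itself.
```

The last point mirrors the remark above that some variables absent from the lineage formula still enter Πfinal because they precede lineage variables in the corresponding ACs.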
5. MERGING AACS

In this section, we introduce a new algorithm for merging a lineage-AAC with the corresponding complete AACs in the database. We begin with formally defining the problem.

The merging problem: As before, let ACol denote the collection of AACs produced after the compilation of the database, ΠCol the collection of partial orderings over the random variables in the database, and Πfinal the variable ordering generated by the heuristic algorithm presented in the previous section. The merging algorithm takes as input ACol and the lineage-AAC and generates a new AAC, which is used to evaluate the probability of the lineage.

The core idea of the algorithm can be simply stated: we traverse the lineage-AAC and all the appropriate database-AACs, i.e., AACs that refer to the variables present in the lineage-AAC, simultaneously by keeping one or more cursors over each of them. At any point, the algorithm considers exactly two AAC nodes, a node from the lineage-AAC and a corresponding node from a database-AAC, and tries to merge them. Since the database-AACs refer to disjoint sets of random variables, a node in the lineage-AAC can only correspond to one database-AAC node. We check whether we can compute the result of the merge operation for the two nodes immediately; otherwise we choose a variable to branch on and recursively perform the merge operation for each instantiation of the variable. The variable is chosen according to Πfinal. In the remainder of the section, we elaborate on these steps.

Path annotations: When traversing the input AACs, it is important to be able to identify in which path of an AAC a variable appears, since traversing redundant paths will significantly deteriorate performance. In general, a product node can have multiple sum nodes as children. This can happen when two or more variables are conditionally independent given the value of a particular variable. To address this issue, we introduce a new annotation for each variable, which we call the path annotation of the variable.

Path annotations are set according to the following process: we start by assigning a path annotation of 0 to the root of each AAC in ACol and then we traverse each AAC in a depth-first manner. If a product node has multiple sum children we extend the path annotation with the count information of each child. Consider for example a product node with two sum nodes as children and a path annotation 0. The annotations of its children will be 0::1 and 0::2. Path annotations are created offline during the compilation phase. The order in which sum nodes appear in the set of children of a product node is determined by the corresponding ordering ΠCol. Thus, for different instantiations of the predecessor variables the children of a product node appear in the same order.

Traversing multiple AACs simultaneously: The merging algorithm traverses all the AACs in a breadth-first or a depth-first manner, by keeping multiple cursors at different sum nodes, and recursively traversing down the children of an appropriate product node. If it is traversing down a product node that has two or more sum children, then multiple cursors are generated pointing to those different sum nodes. At any point, we may have at most as many cursors as the number of variables in the AAC. A key requirement here is to be able to identify first which database-AAC contains a particular variable, and then, along which path the variable may be found in that AAC. We use a simple index for the first purpose, whereas the path annotations are used for the second purpose. For example, let the considered variable have a path annotation of 0::1::1, and let the cursors in the corresponding database-AAC point to 0::1 and 0::2. By comparing the prefixes, we can deduce that the variable will be found under the former.

Merging AACs: We continue our discussion by introducing the merging algorithm (shown in Algorithm 2). The multi-merge operation is implemented recursively, reducing the operation of merging the lineage-AAC with the appropriate AACs into operations over smaller AACs, until we reach boundary conditions: AACs that correspond to constant nodes.

Let a1 and v1 denote the root node of the input lineage-AAC and the variable that is associated with it. The algorithm starts by finding the database-AAC Ad that needs to be merged and selects the appropriate cursor of Ad (using the procedure described above, and encapsulated in the function getAACCursor()). Let a2 denote the sum node that the selected cursor points to.

If the algorithm is given a trivial input, i.e., a lineage-AAC equal to 0 or 1, or if the result of the multi-merge between a1 and a2 is present in the cache, the multi-merge operation terminates immediately. Otherwise we must recursively compute the result of the operation for a1 and a2. Recall that v1 is the variable that corresponds to a1, and let v2 be the variable that corresponds to a2. Because all variables present in the lineage-AAC are already present in an AAC in ACol and all AACs respect Πfinal, there are only two cases that the algorithm needs to consider: (a) v2 ≺ v1 and (b) v2 = v1. We also note that in each turn the merging algorithm expands two contiguous levels of the AACs under consideration, exploiting that in a complete AAC sum and product nodes appear in turns. The algorithm proceeds as described below.

Case - Same variables: When both sum nodes refer to the same variable v1, the merge operation outputs a sum node a annotated with v1. To construct the children of a, the algorithm iterates through all values v in the domain of v1 and performs the following process: let c and c′ denote the child nodes of a1 and a2 respectively that correspond to v1 = v. The algorithm checks for the following terminal cases: (1) if either c or c′ is 0 it outputs 0, and (2) if c and c′ are indicator constants it outputs c′. If none of the terminal cases are met, then it outputs a new product node c1 with children the constant children of c. Subsequently, the merge operation is propagated by: (1) traversing both input AACs and updating the cursors of the database-AAC appropriately, (2) considering the descendant sum nodes of a1 and a2, and (3) recursively merging them. Finally, the result of the merge operation between a1 and a2 is cached.

Figure 6 depicts the merging operation as applied to the database-AACs in Figure 4 and the lineage-AAC in Figure 5(c) using the variable order X1, X2, Y1, Y2, Z1, Z2. The first step is to merge the sum nodes that are annotated with X1 and create a new sum node with the same annotation that contains both 0.6 and λx11 in its children nodes. As shown, the algorithm continues in a depth-first manner and considers the sum node present in the left sub-AAC.

Figure 6: Partial AAC produced after merging the lineage-AAC with the corresponding database-AAC (variable order: x1, x2, y1, y2, z1, z2).

Case - Different variables: When the merge operation is applied to sum nodes with variables v2 ≺ v1 in Πfinal, the output node is a sum node a annotated with v2. To construct the children of node a, the algorithm iterates through each child c of node a2 corresponding to a particular instantiation of v2 and performs the following process: if c is a constant it outputs c, otherwise it outputs a new product node c1 with children the constant children of c. Subsequently, the merge operation is propagated by: (1) traversing the database-AAC and updating its cursors appropriately, (2) considering the descendant sum nodes of a2, and (3) recursively merging them with a1. Finally, the result of the merge operation between a1 and a2 is stored in the cache.

Figure 7 depicts a subsequent step of the merging process shown in Figure 6. The merging operation is performed between two different variables, namely Y2 and Y1. Since Y1 ≺ Y2 in the variable order, the output is a copy of the sum node, annotated with Y1, in the database-AAC. The algorithm proceeds recursively to merge the sum nodes that are annotated with variable Y2.

Figure 7: A subsequent step of the merging process shown in Figure 6. Note that the sum nodes to be merged in this step refer to different variables, namely Y2 and Y1.

Algorithm 2 multiMerge(a1: lineage AAC, acol: AAC Collection): returns AAC
1: a2 ← getAACCursor(Var(a1))
2: if cache(a1, a2) ≠ ∅ then
3:   return cache(a1, a2)
4: else if (a1 == 0) or (a1 == 1) then
5:   return a1
6: end if
7: if (Pos(a2) < Pos(a1)) then
8:   a ← new + node; Var(a) ← Var(a2)
9:   for v ∈ Values(Var(a2)) do
10:    c ← getChild(a2, v)
11:    // c is either a ∗ node or a constant
12:    if (c == 0) or (c is indicator constant) then
13:      c1 ← c
14:    else
15:      c1 ← new ∗ node
16:      Const(c1) ← Const(c)
17:      moveCursorToChild(a2, v)
18:      c2 ← multiMerge(a1, acol); addChild(c1, c2)
19:      moveCursorToParent(a2, v)
20:    end if
21:    addChild(a, c1, v)
22:  end for
23: else if (Pos(a2) == Pos(a1)) then
24:  a ← new + node; Var(a) ← Var(a1)
25:  for v ∈ Values(Var(a1)) do
26:    c ← getChild(a1, v); c′ ← getChild(a2, v)
27:    // c and c′ are either ∗ nodes or constants
28:    if (c == 0) or (c′ == 0) then
29:      c1 ← 0
30:    else if (c and c′ are indicator constants) then
31:      c1 ← c′
32:    else
33:      c1 ← new ∗ node
34:      Const(c1) ← Const(c′)
35:      moveCursorToChild(a2, v)
36:      if (c has a + node in its children) then
37:        c ← getChild(c)
38:        c2 ← multiMerge(c, acol); addChild(c1, c2)
39:      end if
40:      moveCursorToParent(a2, v)
41:    end if
42:    addChild(a, c1)
43:  end for
44: end if
45: cacheInsert(a, a1, a2)
46: return a

In the algorithm, method getAACCursor(x) returns the appropriate running cursor of the AAC in ACol in which variable x appears. When applied to a sum node, method Var() returns the variable annotation of that node. Furthermore, method Values(v) returns a set of all possible values in the domain of variable v. For a sum node a, method addChild(a, c, v) adds a new child c in a for Var(a) = v. For product nodes, method Const() returns the children of the node corresponding to constant nodes. For a sum node with a variable annotation V, methods getChild(v) and Pos() return the child of the node corresponding to V = v and the position of the variable labeling of the node in the global variable order Πfinal, respectively. Method moveCursorToChild(a, v) sets the AAC cursor to point to the descendant sum node of node a in the path for which Var(a) = v. Finally, moveCursorToParent(a, v) returns the AAC cursor to the preceding sum node of a.

As mentioned in the previous section, minimizing memory usage and the number of operations performed during merging is important for the performance of the algorithm. A variable in the lineage-AAC may appear in different paths; therefore, after the merge, parts of the database-AACs will be repeated. In order to leverage the detection and caching of those sub-circuits, we require that variables which appear together in an AAC from ACol also appear together in the final AAC (Section 4.3, Requirement (d)).

Finally, we analyze the complexity of the merging algorithm after caching is introduced. Let m be the size of the query AAC and si be the size of the i-th AAC from the set A ⊆ ACol of AACs used during the merge phase. In the worst case the algorithm parses each entire annotated arithmetic circuit at most once. Therefore, its complexity is O(m ∗ Σ_{i∈A} si). To compute the result probability, we set all indicator variables in the final AAC to 1 and parse the circuit. This operation takes time linear in the size of the final circuit.
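The path-annotation scheme (root annotated 0; a product node with several sum children extending annotations as 0::1, 0::2, and so on) can be sketched as follows. This is our own minimal illustration with hypothetical names, not the paper's implementation.

```python
# Hypothetical sketch of path-annotation assignment: the root sum node of a
# database-AAC is annotated "0"; descending through a product node that has
# more than one sum child extends the annotation with the child's index
# ("0::1", "0::2", ...), so a cursor can later locate a variable's
# sub-circuit by comparing annotation prefixes.

def annotate(node, annotation="0", out=None):
    """node: ("+", var, children), ("*", children), or ("leaf",).
    Returns {node id: (variable, path annotation)} for all sum nodes."""
    out = {} if out is None else out
    if node[0] == "+":
        out[id(node)] = (node[1], annotation)
        for child in node[2]:
            annotate(child, annotation, out)
    elif node[0] == "*":
        sums = [c for c in node[1] if c[0] == "+"]
        for child in node[1]:
            ext = annotation
            if len(sums) > 1 and child[0] == "+":
                ext = annotation + "::" + str(sums.index(child) + 1)
            annotate(child, ext, out)
    return out

leaf = ("leaf",)
# a product node whose two sum children cover conditionally independent
# variables Y and Z, below a root sum node for X
aac = ("+", "X", [("*", [("+", "Y", [leaf]), ("+", "Z", [leaf])])])
print(sorted(annotate(aac).values()))
# -> [('X', '0'), ('Y', '0::1'), ('Z', '0::2')]
```

Prefix comparison then resolves cursor placement as in the example above: a variable annotated 0::1::1 lies under the cursor at 0::1 rather than the one at 0::2.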
For a sum node compare our approach to variable elimination, a generic approach which can support both tuple-independent and correlated tuples [36]. Database Compilation Time We examine two versions of VE. The first is regular VE using a 103 tabular representation of the factors, and the second is VE where factors are represented using ADDs. VE with ADDs can capture the local structure in the factors of the network. For our results we 2 report wall-clock times of queries averaged over 5 runs. 10

6.1 Datasets and Queries

We study both tuple-independent and correlated cases. The data used for the experiments was generated by a modified version of the TPC-H data generator. In the first case, the generator was modified to create tuple-independent probabilistic databases. We assume that all tables apart from nation and region are probabilistic, and associate each tuple with an existence probability sampled uniformly from (0, 1].

In the second case, we focus on probabilistic databases with arbitrary correlations. We extend the TPC-H data generator to generate correlated probabilistic data according to the following model. We assume that all tables apart from nation and region contain uncertain data. Furthermore, tables customer, supplier, and partsupp contain independent tuples. Following the foreign-key constraint, each tuple in the lineitem table depends on the corresponding tuple from the orders table. We note that many entries of the orders table are associated with multiple entries from the lineitem table. This introduces many conditionally independent random variables associated with tuples from the lineitem table.

For the part table, we assume that there is uncertainty over the part price, and that there is a mutual exclusion constraint over price. Finally, for the orders table we assume that the orders of a particular customer for a given month are correlated. In particular, this type of correlation can be represented as a chain: the orders are sorted chronologically, and the existence of each order depends on the preceding order. This scenario is realistic, as the orders within a particular month may be connected with the same project and may depend on each other for the fulfillment of that project. The length of the chain varies for databases of different sizes. For a scale factor of 0.001 the maximum length is restricted to 2, while for a scale factor of 0.1 it increases to 10.

We evaluate our framework for both tractable and hard queries for five different scale factors, namely 0.001, 0.005, 0.01, 0.05, and 0.1. We consider queries Q2, Q3, Q5, Q6, Q8, and Q16. Queries Q6 and Q16 do not contain complicated joins and are easy. However, they are challenging because they generate an increased number of result tuples. For every query, we remove the top-level aggregate functions and consider its Boolean version.

6.2 Experimental Results

We begin by evaluating the scalability of the database compilation technique based on arithmetic circuits. Figure 8 illustrates the compilation time as a function of the size of the underlying database and, in particular, the scale factor used to generate it.

Figure 8: The time required to compile the database as a function of its size.

As illustrated, the time required for compiling the database introduces a sizeable overhead to our framework. However, since compilation is performed offline, this overhead does not affect the performance of the system during query evaluation.

We next examine the efficiency of our framework during query evaluation and, in particular, the total execution time for the Boolean versions of the TPC-H queries described above. The results are shown in Figure 9. Each graph presents the total evaluation time for a single query for both independent and correlated databases of different sizes. Note that in all graphs the y-axis is in logarithmic scale. Finally, missing values for a particular scale factor correspond to cases where the total evaluation time exceeds the time threshold of 100 seconds.

We first focus on the hard queries. As shown, for all hard queries, evaluation based on AACs is at least one order of magnitude faster than regular VE, and it is also significantly faster than VE with ADDs. As expected, VE with ADDs is faster than tabular VE, since ADDs can capture the local structure in the factors of the network. However, even when using VE with ADDs, we still have to pay the cost of multiplying the different ADDs and summing out variables during online query evaluation. To the contrary, our approach has a significant advantage over both baselines, as this cost is paid only once, during the offline preprocessing phase.

Of particular interest are queries Q3 and Q8, where both versions of VE exceed the threshold of 100 seconds under both dependency assumptions for scale factors larger than 0.01. This is because of the increasing number of distinct variables in the lineage formulas for larger scale factors. For example, the lineage formula for Q3 contains 6276 distinct random variables for a scale factor of 0.1. Since we are considering the Boolean versions of the queries, the size of the lineage formula is directly associated with the treewidth of the final augmented factor graph. To the contrary, AACs are more scalable, since they can fully exploit determinism across the entire network rather than only at a factor level. For example, for Q3 and a scale factor of 0.1, the resulting AACs have 42486 and 53803 edges for the independent and the correlated case, respectively. Observe that the difference in size is not that significant despite the presence of correlations, as determinism is present in the network. In general, the sizes of the final AACs were sensitive to the queries and the size of the database. For the correlated database experiments, the size of the AACs ranged from 19 to 2570 edges for a scale factor of 0.001, and from 1359 to 353214 edges for a scale factor of 0.1.

We continue our discussion with queries Q6 and Q16. Recall that query Q6 contains a projection over table lineitem and no joins, while query Q16 contains a join between tables partsupp and part. For both, we observe that the performance gain of using AACs decreases as the size of the database increases.

To understand this behavior better, we ran micro-benchmarking experiments to investigate the performance of the different components in our framework. We evaluated all queries against correlated databases and measured the time spent at the different steps of the final AAC creation process. We measure the time for: (a) generating the lineage for the result tuples, (b) generating the final variable ordering using the new algorithm presented in Section 4.3, (c) creating the lineage OBDD and converting it to an AAC using the previous variable ordering, and (d) merging all the AACs together.
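To make the chain-correlation model of Section 6.1 concrete, the following minimal sketch samples the existence variables of one customer's orders within a month, with each order depending only on whether the preceding order exists. The function name and the conditional probabilities are illustrative assumptions, not the parameters of the actual generator, which extends the TPC-H dbgen tool.

```python
import random

def sample_order_chain(length, p_first=0.5, p_given_prev=0.8,
                       p_given_no_prev=0.2, rng=None):
    """Sample existence variables for a customer's orders in one month.

    The orders form a chain sorted chronologically: the first order exists
    with probability p_first, and every later order's existence probability
    depends only on whether the preceding order exists (a two-state
    conditional, matching the chain factors described in Section 6.1).
    All probability values here are illustrative placeholders.
    """
    rng = rng or random.Random()
    exists = []
    for i in range(length):
        if i == 0:
            p = p_first
        else:
            p = p_given_prev if exists[-1] else p_given_no_prev
        exists.append(rng.random() < p)
    return exists
```

For a scale factor of 0.1, the generator would call such a routine with chain lengths up to 10; each call contributes one chain of pairwise factors to the database factor graph.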

Figure 9: Query evaluation times for the Boolean versions of TPC-H queries for both independent and correlated databases of different sizes. Figures (a), (b), (e) refer to hard queries, while the rest refer to easy queries. Missing values for a scale factor correspond to queries that exceeded the time threshold of 100 seconds.

We omit the actual evaluation of the final AAC, since it is linear in the size of the final structure and, therefore, extremely efficient. Due to space constraints, we present the results for two representative queries in Figure 10. Similar patterns were observed for the rest of the queries.

As shown in the figure, most of the time is spent in creating the lineage OBDD. This time increases significantly as the size of the database (and consequently the size of the lineage formula) increases. In particular, the size of the lineage formula ranges from 59 to 6276 distinct random variables for Q3, and from 142 to 15010 distinct random variables for Q16. Moreover, we observed that for all cases where query evaluation with AACs exceeded the threshold of 100 seconds, the actual bottleneck was creating the OBDD for the lineage formula. We would like to point out that an external package was used for creating OBDDs; improving the performance and optimizing this process is left as future work. Nevertheless, we see that only a small portion of the total running time is spent in the new merging algorithm we proposed. Finally, this analysis also explains why for Q6 and Q16 we see a decreasing performance gain when using AACs.
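The reason the final AAC evaluation can be omitted from the breakdown is that it is a single bottom-up pass, linear in the number of edges. The following generic sketch for plain sum/product arithmetic circuits (not the paper's annotated variant) illustrates this: each node is visited once and each edge contributes one arithmetic operation.

```python
def evaluate_circuit(nodes, leaf_values):
    """Evaluate an arithmetic circuit in one bottom-up pass.

    `nodes` is a topologically ordered list of (node_id, op, children),
    where op is 'leaf', '+', or '*'; the last node is the root.
    Runtime is linear in the total number of edges.
    (Generic sketch; not the AAC data structures of Section 4.)
    """
    val = {}
    for node_id, op, children in nodes:
        if op == 'leaf':
            val[node_id] = leaf_values[node_id]
        elif op == '+':
            val[node_id] = sum(val[c] for c in children)
        elif op == '*':
            prod = 1.0
            for c in children:
                prod *= val[c]
            val[node_id] = prod
    return val[nodes[-1][0]]
```

For instance, a circuit with structure a*b + c evaluates in one pass over its five nodes; the same single-pass argument applies regardless of circuit size, which is why the evaluation step never dominates the timings in Figure 10.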

Figure 10: Evaluation time breakdown (lineage creation, variable ordering, lineage-to-AAC conversion, and merging phase) for queries 3 and 16, against correlated databases of multiple sizes.

7. RELATED WORK

Much of the work in the probabilistic database literature has focused on query evaluation under the tuple-independence assumption, with some of that work also allowing deterministic correlations like mutual exclusion. Several recent works have attempted to support more complex correlations, typically represented using graphical models; these include BayesStore [41], PrDB [37], and the work by Wick et al. [42], which uses an MCMC-based technique. However, none of that work exploits local structure for efficient query evaluation. In a followup to the OBDD-based approach, which is limited to tuple-independent databases, Olteanu et al. [32] proposed decomposition trees (d-trees), which can support simple correlations expressed via Boolean formulas, but cannot handle arbitrary correlations in a natural way. While obeying structural properties similar to those of AACs, d-trees decompose the lineage formula only partially and exploit sub-formulas that can be evaluated efficiently. Moreover, d-trees can be used to compute approximate confidence values. It would be an interesting future research direction to combine our approach with d-trees. In a recent work, Jha et al. [21] proposed a framework to combine the intensional and extensional approaches, in which an extensional method is used as much as possible, falling back to an intensional approach only when necessary. However, their approach cannot be applied directly to correlated databases represented using factor graphs. Aside from factor graphs, other representations like pc-tables [19] can be used to represent correlations. We note that our framework is still applicable in that case; however, the preprocessing compilation algorithm (Section 4.2) should be replaced with a logical knowledge base compilation algorithm [5] for compiling the database-AACs. Finally, Sanner et al. [35] propose an extension of ADDs, called Affine ADDs, that is capable of compactly representing context-specific, additive, and multiplicative structure. While sharing similarities with AACs, Affine ADDs cannot represent the conditional independences present in the correlations.

8. CONCLUSIONS AND FUTURE WORK

Probabilistic databases are becoming an increasingly appealing option to store and query uncertain data generated in many application domains. In this paper, we focused on efficiently supporting query evaluation over probabilistic databases that contain correlated data, naturally generated in many applications of probabilistic databases. We introduced a novel framework that exploits the prevalent determinism in the correlations, as well as other types of local structure such as context-specific independence. Our framework builds upon arithmetic circuits, an enhanced exact inference technique that exploits such local structure for compact representation and efficient inference. We introduced an extension of arithmetic circuits, called annotated arithmetic circuits, and showed how to execute probabilistic queries over them. Our experimental evaluation shows that our approach can result in orders-of-magnitude speedups in query execution times over prior approaches. Our techniques are also of independent interest to the machine learning community, where arithmetic circuits have re-emerged as an appealing alternative for modeling uncertainties in several domains with large volumes of uncertain data [28, 33]. Our research so far has raised several interesting challenges, including developing lifted inference-based approaches that can exploit not only local structure but also symmetry and other regularities in the uncertainties, and developing approximate query evaluation techniques.

Acknowledgments: We would like to thank Angelika Kimmig and Vassilis Zikas for their valuable comments. This work was supported by the Air Force Research Lab (AFRL) under contract FA8750-10-C-0191, and by NSF under grant IIS-0916736.

9. REFERENCES

[1] R. I. Bahar, E. A. Frohm, C. M. Gaona, G. D. Hachtel, E. Macii, A. Pardo, and F. Somenzi. Algebraic decision diagrams and their applications. In ICCAD, 1993.
[2] B. Bollig and I. Wegener. Improving the variable ordering of OBDDs is NP-complete. IEEE Trans. Comput., 45, 1996.
[3] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In UAI, 1996.
[4] R. E. Bryant. Symbolic boolean manipulation with ordered binary-decision diagrams. ACM Comput. Surv., 24(3), 1992.
[5] M. Chavira and A. Darwiche. Compiling Bayesian networks with local structure. In IJCAI, 2005.
[6] M. Chavira and A. Darwiche. Compiling Bayesian networks using variable elimination. In IJCAI, 2007.
[7] M. Chavira and A. Darwiche. On probabilistic inference by weighted model counting. Artif. Intell., 172(6-7), 2008.
[8] M. Chavira, A. Darwiche, and M. Jaeger. Compiling relational Bayesian networks for exact inference. Int. J. Approx. Reasoning, 42(1-2), 2006.
[9] N. Dalvi, K. Schnaitter, and D. Suciu. Computing query probability with incidence algebras. In PODS, 2010.
[10] N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.
[11] A. Darwiche. A logical approach for factoring belief networks. In KR, 2001.
[12] A. Darwiche. A differential approach to inference in Bayesian networks. JACM, 50, 2003.
[13] A. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.
[14] R. Dechter. Bucket elimination: A unifying framework for probabilistic inference. In UAI, 1996.
[15] A. Deshpande, L. Getoor, and P. Sen. Graphical models for uncertain data. In Managing and Mining Uncertain Data, Charu Aggarwal ed., Springer, 2008.
[16] A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, 2004.
[17] X. Dong, A. Y. Halevy, and C. Yu. Data integration with uncertainty. In VLDB, 2007.
[18] R. Ebendt, G. Fey, and R. Drechsler. Advanced BDD Optimization. Springer, 2005.
[19] T. J. Green and V. Tannen. Models for incomplete and probabilistic information. In EDBT Workshops, 2006.
[20] T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar information extraction system. IEEE Data Engineering Bulletin, 2006.
[21] A. Jha, D. Olteanu, and D. Suciu. Bridging the gap between intensional and extensional query evaluation in probabilistic databases. In EDBT, 2010.
[22] A. Jha and D. Suciu. Knowledge compilation meets database theory: compiling queries to decision diagrams. In ICDT, 2011.
[23] B. Kanagal and A. Deshpande. Indexing correlated probabilistic databases. In SIGMOD, 2009.
[24] B. Kanagal and A. Deshpande. Lineage processing over correlated probabilistic databases. In SIGMOD, 2010.
[25] C. Koch. Approximating predicates and expressive queries on probabilistic databases. In PODS, 2008.
[26] A. Kumar and C. Ré. Probabilistic management of OCR data using an RDBMS. PVLDB, 2012.
[27] L. V. S. Lakshmanan, N. Leone, R. Ross, and V. S. Subrahmanian. ProbView: a flexible probabilistic database system. TODS, 1997.
[28] D. Lowd and P. Domingos. Learning arithmetic circuits. In UAI, 2008.
[29] T. Mantadelis, R. Rocha, A. Kimmig, and G. Janssens. Preprocessing Boolean formulae for BDDs in a probabilistic context. In JELIA, 2010.
[30] D. Olteanu and J. Huang. Using OBDDs for efficient query evaluation on probabilistic databases. In SUM, 2008.
[31] D. Olteanu, J. Huang, and C. Koch. SPROUT: Lazy vs. eager query plans for tuple-independent probabilistic databases. In ICDE, 2009.
[32] D. Olteanu, J. Huang, and C. Koch. Approximate confidence computation in probabilistic databases. In ICDE, 2010.
[33] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In UAI, 2011.
[34] C. Ré, N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probabilistic data. In ICDE, 2007.
[35] S. Sanner and D. McAllester. Affine algebraic decision diagrams (AADDs) and their application to structured probabilistic inference. In IJCAI, 2005.
[36] P. Sen and A. Deshpande. Representing and querying correlated tuples in probabilistic databases. In ICDE, 2007.
[37] P. Sen, A. Deshpande, and L. Getoor. PrDB: managing and exploiting rich correlations in probabilistic databases. The VLDB Journal, 18, 2009.
[38] P. Sen, A. Deshpande, and L. Getoor. Read-once functions and query evaluation in probabilistic databases. PVLDB, 2010.
[39] F. Somenzi. CUDD: CU decision diagram package. http://vlsi.colorado.edu/fabio/CUDD/.
[40] D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Synthesis Lectures on Data Management. Morgan & Claypool, 2011.
[41] D. Z. Wang, E. Michelakis, M. Garofalakis, and J. M. Hellerstein. BayesStore: Managing large, uncertain data repositories with probabilistic graphical models. In VLDB, 2008.
[42] M. L. Wick, A. McCallum, and G. Miklau. Scalable probabilistic databases with factor graphs and MCMC. PVLDB, 3(1), 2010.
[43] N. L. Zhang and D. Poole. On the role of context-specific independence in probabilistic inference. In IJCAI, 1999.