BAYESSTORE: Managing Large, Uncertain Data Repositories with Probabilistic Graphical Models

Daisy Zhe Wang∗, Eirinaios Michelakis∗, Minos Garofalakis†∗, and Joseph M. Hellerstein∗
∗ Univ. of California, Berkeley EECS and † Yahoo! Research

PVLDB '08, August 23-28, 2008, Auckland, New Zealand. Copyright 2008 VLDB Endowment, ACM 978-1-60558-305-1/08/08

ABSTRACT

Several real-world applications need to effectively manage and reason about large amounts of data that are inherently uncertain. For instance, pervasive computing applications must constantly reason about volumes of noisy sensory readings for a variety of reasons, including motion prediction and human behavior modeling. Such probabilistic data analyses require sophisticated machine-learning tools that can effectively model the complex spatio/temporal correlation patterns present in uncertain sensory data. Unfortunately, to date, most existing approaches to probabilistic database systems have relied on somewhat simplistic models of uncertainty that can be easily mapped onto existing relational architectures: probabilistic information is typically associated with individual data tuples, with only limited or no support for effectively capturing and reasoning about complex data correlations. In this paper, we introduce BAYESSTORE, a novel probabilistic data management architecture built on the principle of handling statistical models and probabilistic inference tools as first-class citizens of the database system. Adopting a machine-learning view, BAYESSTORE employs concise statistical relational models to effectively encode the correlation patterns between uncertain data, and promotes probabilistic inference and statistical model manipulation as part of the standard DBMS operator repertoire to support efficient and sound query processing. We present BAYESSTORE's uncertainty model, based on a novel first-order statistical model, and we redefine traditional query processing operators to manipulate the data and the probabilistic models of the database in an efficient manner. Finally, we validate our approach by demonstrating the value of exploiting data correlations during query processing, and by evaluating a number of optimizations which significantly accelerate query processing.

1 Introduction

There is growing acknowledgment among database researchers and practitioners that modern database systems need to routinely deal with large amounts of uncertain information, be it incorrect, incomplete, or internally inconsistent. Work on Probabilistic Database Systems (PDBSs) has the goal of addressing this problem with techniques to help quantify, explain, and manage uncertainty, all within the familiar context of relational database models and languages, and without sacrificing scalability over the stored data collections.

Of course, the fundamental mathematical tools for managing uncertainty come from probability and statistics. In recent years, these tools have been aggressively imported into the computational domain under the rubric of Statistical Machine Learning (SML). Of special note here is the widespread use of Graphical Modeling techniques, including the many variants of Bayesian Networks (BNs) and Markov Random Fields (MRFs) [13]. These techniques can provide robust statistical models that capture complex correlation patterns among variables, while, at the same time, addressing some computational efficiency and scalability issues as well. Graphical models have been applied with great success in applications as diverse as signal processing, information retrieval, sensornets and pervasive computing, robotics, natural language processing, and computer vision.

Recent research efforts in PDBSs have injected new excitement into the area of uncertainty management in database systems. Unfortunately, the bulk of this work has, to date, relied on somewhat simplistic models of uncertainty, placing the focus on simple probabilistic extensions that can be easily mapped to existing relational database architectures, and essentially ignoring the state-of-the-art in SML. For instance, existing PDBSs typically associate probabilities directly with data at the level of individual tuples or tuple values. While such fine-grained probabilistic information may be warranted in certain scenarios (e.g., data integration), it also often gives rise to intractably large probabilistic reasoning problems [7]. Furthermore, in several application domains, including pervasive computing and sensornets, the granularity of uncertainty can be much coarser depending on the underlying random process being observed (e.g., all readings from sensor-1 follow the same distribution pattern). Existing PDBSs also offer very limited or no support for effectively modeling and reasoning about complex correlation patterns; unfortunately, as SML work demonstrates, such correlation patterns abound in real-world data. In short, existing PDBSs simply cannot support realistic, state-of-the-art probabilistic reasoning within the database system: such reasoning currently needs to occur outside the database, and its results can only be approximately mapped to and stored within the simplified uncertainty models supported by the PDBS; see, for instance, [11] for such an approximate mapping in the context of MRF-based information extraction.

Related Work. While traditional SML has provided well-founded mathematical tools for uncertainty management, such tools are not targeted at the declarative management and processing of large-scale data sets. Since the early 1980s, a number of PDBSs have been proposed in an effort to address this issue [12, 5, 3, 10, 7, 4, 17, 2, 1]. Moving away from statistical approaches, this work extends the relational model with probabilistic information captured at the level of individual tuple existence (i.e., a tuple may or may not exist in the DB) [5, 7, 10, 17] or individual tuple-value uncertainty (i.e., an attribute value in a tuple follows a probabilistic distribution) [3, 4, 2, 1]. The Trio [4] and MayBMS [2, 1] efforts, in particular, try to adopt both types of uncertainty, with Trio focusing on promoting data lineage as a first-class citizen in PDBSs and MayBMS aiming at more efficient tuple-level uncertainty representations through effective relational table decompositions. In all cases, probabilities are directly associated (and stored) with individual tuples and/or tuple values and are processed using standard relational query operators over uncertain tables. This is another major departure from SML, which typically imposes a clear separation between observed data (i.e., evidence) and uncertainty models (e.g., BNs or MRFs) [13].

Query processing in PDBSs is typically based on the standard possible worlds semantics, where a PDB is viewed as encoding a probability distribution over all possible deterministic instances. As demonstrated by Dalvi and Suciu [7], such query processing quickly gives rise to computationally intractable probabilistic inference problems, as complex correlation patterns can emerge during processing even if naive independence assumptions are made on the base data tables. In fact, modulo a restricted class of "safe" query execution plans, query processing in tuple-uncertain PDBSs is #P-complete in the size of the database [7]. This clearly raises some serious practicality concerns [...]

[...] distinguishing features:

• A new data uncertainty model based on a set of novel First-Order (FO) extensions to graphical models that enable declarative specifications of both tuple-level and attribute-level correlations among populations of data items.

• Seamless integration of state-of-the-art SML techniques with relational query processing, to directly support both relational queries and probabilistic model reasoning and manipulations inside the PDBS.

• Query optimizations able to provide significant performance benefits by exploiting graphical models to filter out unlikely tuples, without impairing the soundness or accuracy of the result.

In this paper we describe the BAYESSTORE system, its current state of implementation, and initial experiments demonstrating the importance of our approach from the perspectives of both performance and statistical robustness.

2 The BAYESSTORE Data Model

BAYESSTORE is founded on a novel data model that treats (uncer-
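The possible worlds semantics mentioned in the Related Work discussion can be made concrete with a small brute-force sketch. It assumes a tuple-independent table in which each tuple exists with its own probability. The table contents, the names possible_worlds and query_probability, and the example query are all hypothetical and chosen only for illustration; this is not how BAYESSTORE or any of the cited PDBSs evaluates queries.

```python
from itertools import product

# Hypothetical tuple-independent table (an assumption for illustration):
# each (sensor_id, reading) tuple exists independently with the given probability.
tuples = {
    ("s1", "hot"): 0.7,
    ("s2", "hot"): 0.4,
    ("s3", "cold"): 0.9,
}


def possible_worlds(table):
    """Yield (world, probability) for every subset of the uncertain tuples."""
    keys = list(table)
    for mask in product([False, True], repeat=len(keys)):
        prob = 1.0
        world = set()
        for key, present in zip(keys, mask):
            # Independence assumption: multiply per-tuple probabilities.
            prob *= table[key] if present else 1.0 - table[key]
            if present:
                world.add(key)
        yield world, prob


def query_probability(table, predicate):
    """Sum the probabilities of the worlds in which a Boolean query holds."""
    return sum(p for world, p in possible_worlds(table) if predicate(world))


def some_sensor_hot(world):
    """Boolean query: does at least one tuple report a 'hot' reading?"""
    return any(reading == "hot" for _, reading in world)


p_hot = query_probability(tuples, some_sensor_hot)
print(f"P(some sensor is hot) = {p_hot:.2f}")
```

With n uncertain tuples the enumeration visits 2^n worlds, which gives an informal sense of why exact query evaluation under this semantics becomes expensive as the database grows.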
