Extracting and Querying Probabilistic Information in Bayesstore

Extracting and Querying Probabilistic Information in BayesStore by Zhe Wang A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the GRADUATE DIVISION of the UNIVERSITY OF CALIFORNIA, BERKELEY Committee in charge: Professor Michael J. Franklin, Chair Professor Minos Garofalakis Professor Joseph M. Hellerstein Professor Cari Kaufman Fall 2011 Extracting and Querying Probabilistic Information in BayesStore Copyright c 2011 by Zhe Wang Abstract Extracting and Querying Probabilistic Information in BayesStore by Zhe Wang Doctor of Philosophy in Computer Science University of California, Berkeley Professor Michael J. Franklin, Chair During the past few years, the number of applications that need to process large-scale data has grown remarkably. The data driving these applications are often uncertain, as is the analysis, which often involves probabilistic models and statistical inference. Examples include sensor-based monitoring, information extraction, and online advertising. Such applications require probabilistic data analysis (PDA), which is a family of queries over data, uncertainties, and probabilistic models that involve relational operators from database literature, as well as inference operators from statistical machine learning (SML) literature. Prior to our work, probabilistic database research advocated an approach in which uncertainty is modeled by attaching probabilities to data items. However, such systems do not and cannot take advantage of the wealth of SML research, because they are unable to represent and reason the pervasive probabilistic correlations in the data. In this thesis, we propose, build, and evaluate BAYESSTORE, a probabilistic database system that natively supports SML models and various inference algorithms to perform advanced data analysis. This marriage of database and SML technologies creates a declarative and efficient probabilistic processing framework for applications dealing with large-scale uncertain data. We use sensor-based monitoring and information extraction over text as the two driving applications. Sen- sor network applications generate noisy sensor readings, on top of which a first-order Bayesian network model is used to capture the probability distribution. Information extraction applications generate uncertain entities from text using linear-chain conditional random fields. We explore a variety of research challenges, including extending the relational data model with probabilistic data and statistical models, efficiently implementing statistical inference algorithms in a database, defin- ing relational operators (e.g., select, project, join) over probabilistic data and models, developing joint optimization of inference operators and the relational algebra, and devising novel query exe- cution plans. The experimental results show: (1) statistical inference algorithms over probabilistic models can be efficiently implemented in the set-oriented programming framework in databases; (2) optimizations for query-driven SML inference lead to orders-of-magnitude speed-up on large corpora; and (3) using in-database SML methods to extract and query probabilistic information can significantly improve answer quality. 1 I dedicate this Ph.D. thesis to my parents and my family. i Contents Contents ii List of Figures vii List of Tables x Acknowledgements xi 1 Introduction 1 1.1 Bigger Data and Deeper Analysis . .1 1.2 Two Motivating Applications . .2 1.2.1 Sensor Networks . .3 1.2.2 Information Extraction (IE) . .3 1.3 Probabilistic Data Analysis (PDA) . .4 1.4 Existing Systems and Technologies . .5 1.4.1 Databases and Data Warehouses . .5 1.4.2 Statistical Machine Learning (SML) . .6 1.4.3 Loose Coupling of Databases and SML . .6 1.4.4 PDA Summary . .8 1.5 Our Approach: BAYESSTORE ............................8 1.6 Contributions . .9 1.7 Summary . 10 2 Background 12 2.1 Probabilistic Graphical Models (PGMs) . 12 2.1.1 Overview . 12 2.1.2 Bayesian Networks . 13 ii 2.1.3 Conditional Random Fields . 14 2.2 Inference Algorithms . 16 2.2.1 Top-k Inference on CRF Models . 16 2.2.2 Sum-Product Algorithm . 17 2.2.3 MCMC Inference Algorithms . 17 2.3 Integrating SML with Databases . 19 2.4 Probabilistic Database Systems . 20 2.4.1 Key Concepts . 20 2.4.2 PDB Research Projects . 23 2.4.3 Summary . 26 2.5 In-database SML Methods . 26 2.6 Statistical Relational Learning . 27 2.7 Summary . 29 3 The BAYESSTORE Data Model 30 3.1 BAYESSTORE Overview . 30 3.2 BAYESSTORE Data Model: Incomplete Relations and Probability Distribution . 32 3.3 BAYESSTORE Instance for Sensor Networks . 33 3.3.1 Sensor Tables: The Incomplete Relations . 33 3.3.2 First-Order Bayesian Networks: The Probability Distribution . 34 3.3.3 Summary . 41 3.4 BAYESSTORE Instance for Information Extraction . 41 3.4.1 Token Table: The Incomplete Relation . 42 3.4.2 Conditional Random Fields: The Probability Distribution . 42 3.5 Summary . 43 4 Probabilistic Relational Operators 44 4.1 Overview . 44 4.2 Selection . 45 4.2.1 Selection over Model ........................ 45 MF OBN 4.2.2 Selection over an Incomplete Relation σ(R ) ................ 48 4.3 Projection . 53 iii 4.4 Join . 55 4.4.1 Join over Model .......................... 55 MF OBN 4.4.2 Joining Incomplete Relations σ(R ) ..................... 56 4.5 Experimental Evaluation . 58 4.5.1 Methodology . 59 4.5.2 Data Size . 60 4.5.3 Data Uncertainty . 63 4.5.4 Model’s Connectivity Ratio . 63 4.5.5 Data Uncertainty . 63 4.5.6 First-order Inference . 64 4.6 Summary . 65 5 Inference Operators 66 5.1 Overview . 66 5.2 MR Matrix: A Materialization of the CRF Model . 67 5.3 Top-k Inference over CRF . 68 5.3.1 Viterbi SQL Implementations . 68 5.3.2 Experimental Results . 74 5.4 MCMC Algorithms . 74 5.4.1 SQL Implementation of MCMC Algorithms . 74 5.4.2 Experimental Results . 76 5.5 Guidelines for Implementing In-database Statistical Methods . 77 5.6 Summary . 78 6 Viterbi-based Probabilistic Query Processing 79 6.1 Overview . 79 6.2 Two Families of SPJ Queries . 81 6.3 Querying the ML World . 82 6.3.1 Optimized Selection over ML World . 82 6.3.2 Optimized Join over ML World . 85 6.4 Querying the Full Distribution . 86 6.4.1 Incremental Viterbi Inference . 86 iv 6.4.2 Probabilistic Selection . 89 6.4.3 Probabilistic Join . 90 6.4.4 Probabilistic Projection . 91 6.5 Experimental Results . 94 6.5.1 Selection over ML world (sel-ML): opt vs. naive .............. 95 6.5.2 Join over ML World (join-ML): opt vs. naive ................ 96 6.5.3 Probabilistic Selection: prob-sel vs. sel-ML ................ 98 6.5.4 Probabilistic Join: prob-join vs. join-ML .................. 99 6.6 Summary . 100 7 Query Processing and Optimization with MCMC Inference 101 7.1 Overview . 101 7.2 Query Template . 102 7.3 Cycles from IE Models and Queries . 104 7.4 Query-Driven MCMC Sampling . 105 7.5 Choosing Inference Algorithms . 107 7.5.1 Comparison between Inference Algorithms . 107 7.5.2 Parameters . 108 7.5.3 Rules for Choosing Inference Algorithms . 109 7.6 Hybrid Inference . 110 7.6.1 Query Processing Steps . 110 7.6.2 Query Plan Generation Algorithm . 111 7.6.3 Example Query Plans . 113 7.7 Experimental Results . 115 7.7.1 Query-Driven MCMC-MH . 116 7.7.2 MCMC vs. Viterbi on Top-k Inference . 117 7.7.3 MCMC vs. Sum-Product on Marginal Inference . 118 7.7.4 Exploring Model Parameters . 119 7.7.5 Hybrid Inference for Skip-chain CRFs . 119 7.7.6 Hybrid Inference for Probabilistic Join . 121 7.7.7 Hybrid Inference for Aggregate Constraint . ..

Extracting and Querying Probabilistic Information in Bayesstore

Probabilistic Databases

Adaptive Schema Databases ∗

Finding Interesting Itemsets Using a Probabilistic Model for Binary Databases

Open-World Probabilistic Databases

A Compositional Query Algebra for Second-Order Logic and Uncertain Databases

Answering Queries from Statistics and Probabilistic Views∗

Uva-DARE (Digital Academic Repository)

BAYESSTORE: Managing Large, Uncertain Data Repositories with Probabilistic Graphical Models

Brief Tutorial on Probabilistic Databases

Efficient In-Database Analytics with Graphical Models

Indexing Correlated Probabilistic Databases

Analyzing Uncertain Tabular Data