Scaling Probabilistic Databases

Scaling Probabilistic Databases Hernan´ Blanco University of Antwerp, Belgium [email protected] Supervised by Martin Theobald University of Ulm, Germany [email protected] ABSTRACT The Trio probabilistic database system [7] was the first Probabilistic databases, which have been widely studied over system that explicitly addressed the integration of data man- the past years, lie at the expressive intersection of databases agement via SQL, provenance (aka. \lineage") management and probabilistic graphical models, thus aiming to provide via Boolean formulas, and efficient probabilistic inference. efficient support for the evaluation of probabilistic queries The data model which was considered for Uncertainty and over uncertain, relational data. Lineage Databases (ULDBs) [1] was the first probabilis- Several Machine Learning approaches, on the one hand, tic database approach that was shown to provide a closed have recently investigated the issue of distributed probabilis- and complete probabilistic extension to the relational model tic inference but do not support relational data and SQL. which supports all of the core relational (i.e., SQL-based) Conventional database engines, on the other hand, do not operations over this kind of uncertain data. handle probabilistic data and queries, nor any form of un- A key in making probabilistic inference scalable to the certain data management. With this project, we aim to extent that is needed for modern data management appli- fill this prevalent gap between the two fields of Databases cations lies (without considering parallelization techniques and Machine Learning by scaling probabilistic databases to yet) in the identification of tractable query classes and the a distributed setting, which is a topic that so far has not adaptation of known inference techniques from graphical been addressed in the literature. The proposed PhD disser- models to a relational data setting. While for specific classes tation topic provides a number of intriguing and challenging of queries, probabilistic inference can directly be coupled aspects, both from a theoretical and a systems-engineering with the relational operations [9, 12], the performance may perspective. very quickly degenerate when the underlying independence assumptions among the data objects do not hold. Despite the polynomial runtime complexity for the data compu- 1. INTRODUCTION tation step that is involved in finding answer candidates Managing uncertain data via probabilistic databases to probabilistic queries, the confidence computation step, (PDBs) has evolved as an established field of research in i.e., the actual probabilistic inference, for these answers is recent years. The field meanwhile encompasses a plethora known to be of exponential cost already for fairly simple of applications, which are ranging from scientific data man- SQL queries. Consequently, much of the recent research in agement, sensor networks, data integration, to knowledge PDBs has focused on establishing a classification of query management systems [13, 12]. PDBs lie at the intersection plans for which confidence computations are either of poly- of databases and probabilistic graphical models. While clas- nomial runtime or are #P-hard [2, 12]. Thus, strategies for sical database approaches benefit from a mature and scal- efficient confidence computations and early pruning of low- able infrastructure for the management of relational data, confidence query answers remain a key challenge also for the probabilistic databases aim to further combine these well- scalable management of uncertain data via PDBs. studied data management strategies with inference techniques known from graphical models such as Bayesian Net- works and Markov Random Fields. PDBs adopt powerful 2. STATE-OF-THE-ART query languages from relational databases, including Rela- tional Algebra, the Structured Query Language (SQL), and Recent work on efficient confidence computations in PDBs logical query languages such as Datalog. has addressed this problem mainly from two ends, namely by restricting the class of queries that are allowed, i.e., by focusing on so-called safe query plans [12], or by considering a specific class of tuple-dependencies, commonly referred to as read-once functions [11]. Intuitively, safe query plans denote a class of queries for which confidence computations This work is licensed under the Creative Commons Attribution- can directly be coupled with the relational operators and NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this li- thus be performed by an extensional query plan. On the cense, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain per- other hand, read-once formulas denote a class of Boolean mission prior to any use beyond those covered by the license. Contact copyright holder by emailing [email protected]. lineage formulas which can be factorized in polynomial time Proceedings of the VLDB 2015 PhD Workshop into a form where every variable in the formula (each repre- 1 senting a database tuple) appears at most once, thus again Slave 1 Slave 2 permitting efficient confidence computations. While safe plans clearly focus on the characteristics of the query structure, and read-once formulas focus on the logi- x4 cal dependencies among individual data objects, top-k-style pruning approaches [4, 8] have been proposed as an alter- x2 x5 x7 native way to address confidence computations in PDBs. These approaches aim to efficiently identify the top-k most x1 probable answers, using lower and upper bounds for their x3 x6 x8 marginal probabilities, without the need to compute the ex- act probabilities of these query answers. In [4], a form of x9 first-order lineage formulas (based on a restricted class of first-order logic) is proposed, which allows to fully integrate the data and confidence computation steps in a probabilis- R1 A B R2 B C R3 C D tic database setting. PrDB [10] is one of the most generic PDB approaches in the field, aiming to significantly widen x1 a1 b1 x4 b1 c1 x6 c1 d2 the opportunities to include richer probabilistic models by x2 a2 b2 x5 b2 c3 x8 c1 d1 allowing uncertainty at both tuple and attribute level as x3 a3 b3 x7 b3 c2 x9 c2 d1 well as tuple correlations. It also achieves improvements in terms of fast inference by implementing a self-developed al- Figure 1: A sample factor graph. The red circles represent variable nodes which are database items (instantiated in relations gorithm based on bisimulation. However, also PrDB does R1, R2 and R3); the blue squares represent factor nodes. The not address the problem of distribution and parallelization, data is distributed across two cluster slaves by minimizing the cut where an enormous potential for scalability lies. between the two partitions of the factor graph. On the other hand, recent trends in relational databases clearly focus on \Big Data" management and \Cloud Com- DHJ puting". Distributed database engines like Apache's Cas- C sandra, HIVE, or HBase aim to achieve an even better scalability to many petabytes of data by going for distributed DMJ B file systems and by performing SQL queries via iterative and largely synchronous communication workflows based on Google's MapReduce or Apache's Hadoop frameworks. DIS(R1) DIS(R2) DIS(R3) There currently exists no distributed probabilistic database (a) (b) system. A number of systems brought up by the Machine Learning and Artificial Intelligence communities, such as Figure 2: (a) A representation of a distributed query execu- CMU's GraphLab [6], MIT's FactorIE, Microsoft's Infer.NET tion: distributed index scans over the tree base relations R1, R2 and R3 from Figure 1, a distributed merge join on B, and a (MSR Cambridge), and the recently released Amazon Ma- distributed hash join on C. (b) The results retrieved from eval- chine Learning and AMPLab's KeystoneML platforms fo- uating the query represented in Figure 2(a), each record holding cus on purely graph-based and partly also on distributed its respective lineage expression. approaches. However, all of these employ their own APIs and do not natively support relational data or SQL. From a more classical (i.e., deterministic) database perspective, we tic inference based on variable elimination and asynchronous have recently seen distributed engines like SciDB or Berke- message passing. ley's BOOM project, which were designed from scratch to be massively scalable (and which, in the first case, origi- 3.1 A Unifying Data Model nally even had the support for uncertain data as part of Specifically, in this PhD project we advocate for the in- its core design goals), but they ultimately do not support vestigation of a distributed factor graph model as being the probabilistic or otherwise uncertain data. core data model for the intended (both distributed and probabilistic) database infrastructure. Factor graphs provide a generic data model for capturing various kinds of proba- 3. NOVEL CONTRIBUTIONS bilistic graphical models and have successfully been applied In this project, we advocate for the development of a radi- to related (but non-distributed) probabilistic settings in the cally new, scalable, and massively distributed infrastructure past. A factor graph (see Figure 1) is a bipartite graph for probabilistic databases. The goal is to investigate how far G(X; Φ) that consists of a set of variable nodes X = fx1;:::; the advantages of mature database technologies can be car- xmg and a set of factor nodes Φ = fφ1(X1); : : : ; φn(Xn)g, ried over to a distributed and probabilistic setting. We pro- where each Xs ⊆ X is a subset of variables which are as- pose a probabilistic data(-base) model, seeking to combine sociated with a factor function φs : Xs ! R, such that Q both the generic semantics of a relational backend with par- G(X; Φ) = i φi(Xi). allelization opportunities for probabilistic inference based on Variable nodes will represent data items of mutable (usu- graphical models. Our intended research will specifically fo- ally binary) state, which are connected to factor nodes in cus on the core functionalities of a database in terms of effi- order to model weighted dependencies between the possi- cient query processing over a distributed, persistent storage ble states of the variable nodes.

Load more