Most Probable Explanations for Probabilistic Database Queries
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

İsmail İlkan Ceylan and Stefan Borgwardt
Faculty of Computer Science, Technische Universität Dresden, Germany

Thomas Lukasiewicz
Department of Computer Science, University of Oxford, UK

Abstract

Forming the foundations of large-scale knowledge bases, probabilistic databases have been widely studied in the literature. In particular, probabilistic query evaluation has been investigated intensively as a central inference mechanism. However, despite its power, query evaluation alone cannot extract all the relevant information encompassed in large-scale knowledge bases. To exploit this potential, we study two inference tasks, namely finding the most probable database and the most probable hypothesis for a given query. As natural counterparts of most probable explanations (MPE) and maximum a posteriori hypotheses (MAP) in probabilistic graphical models, they can be used in a variety of applications that involve prediction or diagnosis tasks. We investigate these problems relative to a variety of query languages, ranging from conjunctive queries to ontology-mediated queries, and provide a detailed complexity analysis.

1 Introduction

Research on building large-scale probabilistic knowledge bases (PKBs) has resulted in a number of systems including NELL [Mitchell et al., 2015], Yago [Hoffart et al., 2013], DeepDive [Shin et al., 2015], and Google's Knowledge Vault [Dong et al., 2014]. They have been used in a wide range of areas to automatically build structured knowledge bases. Most of them are based on the long tradition and foundations of probabilistic databases (PDBs) [Imieliński and Lipski, 1984; Suciu et al., 2011], which define probability distributions over sets of classical databases.

Probabilistic query evaluation is the key inference task underpinning these systems. However, PKBs encompass rich, structured knowledge, for which alternative inference mechanisms beyond probabilistic query evaluation are needed. Inspired by the maximal posterior probability computations in Probabilistic Graphical Models (PGMs) [Pearl, 1988; Koller and Friedman, 2009], we investigate the problem of finding most probable explanations for probabilistic queries to exploit the potential of such large databases to their full extent.

Computing the maximal posterior probability of a distinguished set of variables, given evidence about another set of variables, is one of the key computational problems in PGMs. In its general form, this problem is studied under the name of maximum a posteriori hypothesis¹ (MAP), where the hypothesis refers to an instantiation of a set of distinguished variables. The most probable explanation (MPE) is a special case of MAP, where the distinguished variables and the evidence variables cover all variables in the model. Maximal posterior probability computations have also been lifted to relational probabilistic models such as Markov Logic Networks (MLNs) [Richardson and Domingos, 2006].

¹ There exist other variants of these inference tasks [Kwisthout, 2011], and there are different naming conventions across communities; here, we use the terminology from the Bayesian Network (BN) literature [Darwiche, 2009].

Both MPE and MAP translate to probabilistic databases in a natural way through the rich structure of queries. The most probable database problem (analogous to MPE), first proposed in [Gribkoff et al., 2014], is the problem of determining the (classical) database with the largest probability that satisfies a given query. Intuitively, the query defines constraints on the data, and the goal is to find the most probable database that satisfies these constraints. We also introduce a more intricate notion, called most probable hypothesis, which only asks for partial databases satisfying the query (analogous to MAP). The most probable hypothesis contains only tuples that contribute to the satisfaction condition of the query, which allows us to pinpoint the most likely explanations of the query more precisely.

We study the computational complexity of the corresponding decision problems, denoted by MPD and MPH, respectively, for a variety of query languages. Our results provide detailed insights about the nature of these problems. We show that the data complexity of both problems is lower for existential queries than for universal queries. As expected, MPH usually has a higher complexity than MPD.

We extend our results towards ontology-mediated queries (OMQs), which enrich the well-known class of unions of conjunctive queries with the power of ontological rules based on Datalog± (also called existential rules) [Baget et al., 2011; Calì et al., 2013]. This follows the tradition of ontology-based data access [Poggi et al., 2008], which allows us to query PDBs in a more advanced manner. For OMQs, our analysis shows that the computational complexity of these problems can change significantly w.r.t. the ontology languages under consideration and the complexity-theoretic assumptions employed. Our results provide tight complexity bounds for all Datalog± languages known in the literature. Detailed proofs of all results can be found at https://lat.inf.tu-dresden.de/research/papers.html.

2 Background and Motivation

We recall the basics of classical query languages and databases and briefly describe the tuple-independent PDB model.

2.1 Queries and Databases

We consider a relational vocabulary consisting of finite sets R of predicates, C of constants, and V of variables. A term is a constant or a variable. An atom is of the form P(s1, ..., sn), where P is an n-ary predicate, and s1, ..., sn are terms. A tuple is an atom without variables.

A conjunctive query (CQ) is an existentially quantified formula ∃x φ, where φ is a conjunction of atoms, written as a comma-separated list. A union of conjunctive queries (UCQ) is a disjunction of CQs. If φ is an arbitrary Boolean combination of atoms, then we call ∃x φ an existential query (∃Q), and ∀x φ a universal query (∀Q). A first-order query (FOQ) is an arbitrary first-order formula. A query is Boolean if it has no free variables.

A database is a finite set of tuples. The central problem studied for databases is query evaluation: finding all answers to a query Q over a database D, which are assignments of the free variables in Q to constants such that the resulting first-order formula is satisfied in D in the usual sense. In the following, we consider only Boolean queries Q, and focus on the associated decision problem, i.e., deciding whether Q is satisfied in D, denoted as usual by D ⊧ Q.

2.2 Probabilistic Databases

The most elementary probabilistic database model is based on the tuple-independence assumption. We adopt this model and refer to [Suciu et al., 2011] for details and alternatives.

A probabilistic database induces a set of classical databases (called worlds), each associated with a probability value. Formally, a probabilistic database (PDB) P is a finite set of probabilistic tuples of the form ⟨t : p⟩, where t is a tuple and p ∈ [0, 1], and, whenever ⟨t : p⟩, ⟨t : q⟩ ∈ P, then p = q. A PDB P assigns to every tuple t the probability p if ⟨t : p⟩ ∈ P, and otherwise the probability 0. Since we usually consider a single, fixed PDB, we denote this probability assignment simply by P. We concentrate on the well-known possible worlds semantics. Under the tuple-independence assumption, P induces the following unique probability distribution over classical databases D:

$$P(D) := \prod_{t \in D} P(t) \cdot \prod_{t \notin D} (1 - P(t)).$$

Vegetarian              FriendOf
alice   0.7             alice   bob     0.7
bob     0.9             alice   chris   0.8
chris   0.6             bob     chris   0.1

Eats                        Meat
bob     spinach    0.7      shrimp     0.7
chris   mussels    0.8      mussels    0.9
alice   broccoli   0.2      seahorse   0.3

Figure 1: The PDB Pv, where each table represents a predicate and each row is a probabilistic tuple.

[…] through which we can ask the probability of vegetarians being friends with vegetarians. In the given PDB, alice, bob, and chris are all vegetarians and friends with each other (with some probability). The query

Q1′ = ∃y Veg(bob) ∧ FriendOf(bob, y) ∧ Veg(y)

is a special case of Q1 which asks whether bob has vegetarian friends. Its probability can be computed as P(Q1′) = 0.9 · 0.1 · 0.6 = 0.054. ∎

This task is known as probabilistic query evaluation, and it has been studied extensively in PDBs. In a celebrated result by [Dalvi and Suciu, 2012], it has been shown that UCQs exhibit a data complexity dichotomy between P and #P for probabilistic query evaluation. However, while dealing with large-scale knowledge bases, alternative inference mechanisms are needed. Consider again our running example, and suppose that we have observed that Q1′ is true and would like to learn what best explains this observation w.r.t. the PDB Pv. We revisit this example after providing a principled approach for dealing with such tasks.

3 Most Probable Explanations

We investigate the problem of finding most probable explanations for probabilistic database queries under two different semantics. Finding the most probable database is to determine the world with the largest probability that satisfies a given query, as formalized by [Gribkoff et al., 2014]:

Definition 2. The most probable database for a query Q over a PDB P is

$$\arg\max_{D \models Q} P(D),$$

where D ranges over all worlds induced by P.

Intuitively, a PDB defines a probability distribution over exponentially many classical databases, and the most probable database is the element in this collection that has the highest probability while still satisfying the query.
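
For small instances, Definition 2 and the tuple-independent distribution can be checked directly by enumerating all worlds. The following Python sketch does this for the PDB of Figure 1 and the query asking whether bob has vegetarian friends. It is a brute-force illustration that is exponential in the number of tuples, not an algorithm taken from this paper. The names PDB, PEOPLE, world_probability, worlds, satisfies_q1_prime, query_probability, and most_probable_database are hypothetical, and the predicate Vegetarian of Figure 1 is abbreviated to Veg as in the query.

```python
from itertools import product

# Probabilistic tuples of Figure 1, keyed by (predicate, constants...).
# Encoding tuples as Python tuples is an assumption made for illustration.
PDB = {
    ("Veg", "alice"): 0.7,
    ("Veg", "bob"): 0.9,
    ("Veg", "chris"): 0.6,
    ("FriendOf", "alice", "bob"): 0.7,
    ("FriendOf", "alice", "chris"): 0.8,
    ("FriendOf", "bob", "chris"): 0.1,
    ("Eats", "bob", "spinach"): 0.7,
    ("Eats", "chris", "mussels"): 0.8,
    ("Eats", "alice", "broccoli"): 0.2,
    ("Meat", "shrimp"): 0.7,
    ("Meat", "mussels"): 0.9,
    ("Meat", "seahorse"): 0.3,
}
PEOPLE = ("alice", "bob", "chris")


def world_probability(world, pdb):
    """P(D): product of P(t) for t in D and (1 - P(t)) for t not in D."""
    prob = 1.0
    for t, p in pdb.items():
        prob *= p if t in world else 1.0 - p
    return prob


def worlds(pdb):
    """Yield every subset of the tuples in the PDB (2^n worlds)."""
    tuples = list(pdb)
    for bits in product((False, True), repeat=len(tuples)):
        yield frozenset(t for t, keep in zip(tuples, bits) if keep)


def satisfies_q1_prime(world):
    """Exists y with Veg(bob), FriendOf(bob, y), and Veg(y) in the world."""
    return ("Veg", "bob") in world and any(
        ("FriendOf", "bob", y) in world and ("Veg", y) in world
        for y in PEOPLE
    )


def query_probability(pdb, query):
    """Probabilistic query evaluation: total mass of worlds satisfying the query."""
    return sum(world_probability(w, pdb) for w in worlds(pdb) if query(w))


def most_probable_database(pdb, query):
    """Definition 2 by exhaustive search: arg max of P(D) over worlds with D |= Q."""
    return max(
        (w for w in worlds(pdb) if query(w)),
        key=lambda w: world_probability(w, pdb),
    )
```

On the data of Figure 1, query_probability(PDB, satisfies_q1_prime) agrees with the value 0.054 computed in the text up to floating-point rounding, and most_probable_database(PDB, satisfies_q1_prime) returns one world that satisfies the query.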