Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Most Probable Explanations for Probabilistic Queries

Ismail˙ Ilkan˙ Ceylan and Stefan Borgwardt Thomas Lukasiewicz Faculty of Computer Science Department of Computer Science Technische Universitat¨ Dresden, Germany University of Oxford, UK [email protected] [email protected] [email protected]

Abstract variables, is one of the key computational problems in PGMs. In its general form, this problem is studied under the name Forming the foundations of large-scale knowledge of maximum a posteriori hypothesis1 (MAP), where the hy- bases, probabilistic have been widely pothesis refers to an instantiation of a set of distinguished studied in the literature. In particular, probabilistic variables. The most probable explanation (MPE) is a spe- query evaluation has been investigated intensively cial case of MAP, where the distinguished variables and the as a central inference mechanism. However, des- evidence variables cover all variables in the model. Maximal pite its power, query evaluation alone cannot ex- posterior probability computations have also been lifted to re- tract all the relevant information encompassed in lational probabilistic models such as Markov Logic Networks large-scale knowledge bases. To exploit this po- (MLNs) [Richardson and Domingos, 2006]. tential, we study two inference tasks; namely find- Both MPE and MAP translate to probabilistic databases ing the most probable database and the most prob- in a natural way through the rich structure of queries. The able hypothesis for a given query. As natural coun- most probable database problem (analogous to MPE), first terparts of most probable explanations (MPE) and proposed in [Gribkoff et al., 2014], is the problem of de- maximum a posteriori hypotheses (MAP) in prob- termining the (classical) database with the largest probabil- abilistic graphical models, they can be used in a ity that satisfies a given query. Intuitively, the query defines variety of applications that involve prediction or constraints on the data, and the goal is to find the most prob- diagnosis tasks. We investigate these problems rel- able database that satisfies these constraints. We also in- ative to a variety of query languages, ranging from troduce a more intricate notion, called most probable hypo- conjunctive queries to ontology-mediated queries, thesis, which only asks for partial databases satisfying the and provide a detailed complexity analysis. query (analogous to MAP). The most probable hypothesis contains only tuples that contribute to the satisfaction con- 1 Introduction dition of the query, which allows to more precisely pinpoint Research on building large-scale probabilistic knowledge the most likely explanations of the query. bases (PKBs) has resulted in a number of systems includ- We study the computational complexity of the correspond- MPD MPH ing NELL [Mitchell et al., 2015], Yago [Hoffart et al., ing decision problems, denoted by and , respect- 2013], DeepDive [Shin et al., 2015], and Google’s Know- ively, for a variety of query languages. Our results provide ledge Vault [Dong et al., 2014]. They have been used in a detailed insights about the nature of these problems. We show exist- wide range of areas to automatically build structured know- that the data complexity of both problems is lower for ential queries universal queries MPH ledge bases. Most of them are based on the long tradition and than for . As expected, MPD foundations of probabilistic databases (PDBs) [Imieliski and usually has a higher complexity than . ontology-mediated queries Lipski, 1984; Suciu et al., 2011], which define probability We extend our results towards (OMQs) unions of con- distributions over sets of classical databases. , which enrich the well-known class of junctive queries Probabilistic query evaluation is the key inference task un- with the power of ontological rules based on ± [ et al. derpinning these systems. However, PKBs encompass rich, Datalog (also called existential rules) Baget , 2011; et al. ] ontology- structured knowledge, for which alternative inference mech- Cal`ı , 2013 . This follows the tradition of based data access [ et al. ] anisms beyond probabilistic query evaluation are needed. In- Poggi , 2008 , which allows us to spired by the maximal posterior probability computations in query PDBs in a more advanced manner. For OMQs, our ana- Probabilistic Graphical Models (PGMs) [Pearl, 1988; Koller lysis shows that the computational complexity of these prob- and Friedman, 2009], we investigate the problem of finding lems can change significantly w.r.t. the ontology languages most probable explanations for probabilistic queries to ex- 1There exist other variants of these inference tasks [Kwisthout, ploit the potential of such large databases to their full extent. 2011], and there are different naming conventions across communit- Computing the maximal posterior probability of a distin- ies; here, we use the terminology from the Bayesian Network (BN) guished set of variables, given evidence about another set of literature [Darwiche, 2009].

950 Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) under consideration and the complexity-theoretic assump- Vegetarian FriendOf tions employed. Our results provide tight complexity bounds alice 0.7 alice bob 0.7 for all Datalog± languages known in the literature. Detailed bob 0.9 alice chris 0.8 proofs of all results can be found at https://lat.inf. chris 0.6 bob chris 0.1 tu-dresden.de/research/papers.html. Eats Meat 2 Background and Motivation bob spinach 0.7 shrimp 0.7 chris mussels 0.8 mussels 0.9 We recall the basics of classical query languages and data- alice broccoli 0.2 seahorse 0.3 bases and briefly describe the tuple-independent PDB model. 2.1 Queries and Databases Figure 1: The PDB Pv, where each represents a predicate and each is a probabilistic tuple. We consider a relational vocabulary consisting of finite sets R of predicates, C of constants, and V of variables.A term is a constant or a variable. An atom is of the form P(s1, . . . , sn), through which we can ask the probability of vegetarians being where P is an n-ary predicate, and s1, . . . , sn are terms. A friends with vegetarians. In the given PDB, alice, bob, and tuple is an atom without variables. chris are all vegetarians and friends with each other (with A conjunctive query (CQ) is an existentially quantified for- some probability). The query mula ∃x φ, where φ is a conjunction of atoms, written as a Q′ = ∃y Veg(bob) ∧ FriendOf(bob, y) ∧ Veg(y) comma-separated list. A union of conjunctive queries (UCQ) 1 is a disjunction of CQs. If φ is an arbitrary Boolean combin- is a special case of Q1 which asks whether bob has vegetarian ′ ation of atoms, then we call ∃x φ an existential query (∃Q), friends. Its probability can be computed as P(Q1) = 0.9 ⋅ 0.1 ⋅ and ∀x φ a universal query (∀Q).A first-order query (FOQ) 0.6 = 0.054. ∎ is an arbitrary first-order formula. A query is Boolean if it has This task is known as probabilistic query evaluation, and no free variables. it has been studied extensively in PDBs. In a celebrated res- A database is a finite set of tuples. The central problem ult by [Dalvi and Suciu, 2012], it has been shown that UCQs studied for databases is query evaluation: Finding all answers exhibit a data complexity dichotomy between P and #P for to a query Q over a database D, which are assignments of probabilistic query evaluation. However, while dealing with the free variables in Q to constants such that the resulting large-scale knowledge bases, alternative inference mechan- first-order formula is satisfied in D in the usual sense. In the isms are needed. Consider again our running example, and following, we consider only Boolean queries Q, and focus on suppose that we have observed that Q′ is true and would like the associated decision problem, i.e., deciding whether Q is 1 to learn what best explains this observation w.r.t. the PDB Pv. satisfied in D, denoted as usual by D ⊧ Q. We revisit this example after providing a principled approach 2.2 Probabilistic Databases for dealing with such tasks. The most elementary probabilistic is based on the tuple-independence assumption. We adopt this model 3 Most Probable Explanations and refer to [Suciu et al., 2011] for details and alternatives. We investigate the problem of finding most probable explan- A probabilistic database induces a set of classical databases ations for probabilistic database queries under two different (called worlds), each associated with a probability value. semantics. Finding the most probable database is to determ- Formally, a probabilistic database (PDB) P is a finite set ine the world with the largest probability that satisfies a given of probabilistic tuples of the form ⟨t ∶ p⟩ , where t is a tuple query, as formalized by [Gribkoff et al., 2014]: and p ∈ [0, 1], and, whenever ⟨t ∶ p⟩, ⟨t ∶ q⟩ ∈ P, then p = q.A Definition 2. The most probable database for a query Q over PDB P assigns to every tuple t the probability p if ⟨t ∶ p⟩ ∈ P, a PDB P is and otherwise the probability 0. Since we usually consider arg max P(D), a single, fixed PDB, we denote this probability assignment D⊧Q simply by P. We concentrate on the well-known possible where D ranges over all worlds induced by P. worlds semantics. Under the tuple-independence assumption, P induces the following unique probability distribution over Intuitively, a PDB defines a probability distribution over classical databases D: exponentially many classical databases, and the most prob- able database is the element in this collection that has the P(D) ∶= M P(t) M(1 − P(t)). highest probability while still satisfying the query. This can t∈D t∉D be seen as the best instantiation of a probabilistic model, and By the worlds induced by P we refer to those databases D hence analogous to MPE in BNs [Darwiche, 2009]. with P(D) > 0. Query evaluation is also enriched by prob- abilistic information. More formally, the probability of a Example 3. Consider again the PDB Pv and the query ( ) ∶= (D) Boolean query Q w.r.t. P is P Q ∑D⊧Q P . Q2 = ∀x, y ¬Veg(x) ∨ ¬Eats(x, y) ∨ ¬Meat(y), Example 1. Consider the PDB Pv given in Figure 1 and the conjunctive query which defines the constraints of being a vegetarian, which is violated by the tuples Veg(chris), Eats(chris, mussels) Q1 = ∃x, y Veg(x) ∧ FriendOf(x, y) ∧ Veg(y), and Meat(mussels). Hence, the most probable database

951 Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) for Q2 cannot contain all three of them. It is easy to see Our first result concerns the well-known class of UCQs, that Veg(chris) is removed, as it has the lowest probabil- and we observe that both MPD and MPH remain in polyno- ity. Thus, the most probable database (in this case unique) mial time in the data complexity and in NP for the combined contains all tuples of Pv that have a probability above 0.5, complexity. ′ except for Veg(chris). Suppose that we have observed Q1, Theorem 7. MPD and MPH for UCQs can be decided in and we are interested in finding an explanation for this ob- polynomial time in the data complexity and are NP-complete servation under the constraint of Q2, which is specified by in the combined complexity. ′ Q3 = Q1 ∧ Q2. In this case, the most probable database must contain the tuples Veg(chris) and FriendOf(bob, chris), Intuitively, polynomial-time data complexity is ensured by whereas Eats(chris, mussels) is omitted. ∎ the fact that UCQs are monotone queries: as a simple al- gorithm for MPD, consider all matches (of which there are Finding the most probable database is important, as it iden- polynomially many) of a given UCQ, and extend each match tifies the most likely state of a PDB relative to a query. How- with all (negated) tuples from the PDB that have a probability ever, it has certain limitations, which are analogous to the greater than 0.5. This results in (polynomially many) clas- limitations of MPE in PGMs [Koller and Friedman, 2009]. sical databases, one of which is the most probable database. Most importantly, one is always forced to choose a complete The polynomial data complexity result for MPH follows from database although the query usually affects only a subset of similar arguments. the tuples. In other words, it is usually not the case that the This result may seem immediate, as UCQs are monotone whole database is responsible for the goal query to be satis- queries, but for MPD we can show an even stronger result that fied. To be able to more precisely pinpoint the explanations applies to all existential queries, even if negations in front of of a query, we introduce the following more intricate notion. query atoms are allowed. For MPH, we show an additional Definition 4. The most probable hypothesis for a query Q hardness in the combined complexity. over a PDB P is Theorem 8. MPD for ∃Q can be decided in polynomial time arg max Q P(D), in the data complexity and is NP-complete in the combined P H⊧Q D⊧H complexity. MPH for ∃Q can be decided in Σ2 in the data complexity and is ΣP-complete in the combined complexity. where H ranges over sets of tuples t and negated tuples ¬t 3 such that t occurs in P, and ⊧ denotes the open-world entail- Determining the precise data complexity of MPH for ∃Q is ment , i.e., H ⊧ Q holds iff all worlds induced by P left as an open problem. that satisfy H also satisfy Q. For universally quantified queries, MPD becomes harder even in the data complexity. Given the rich structure of The sum inside the maximization evaluates to the product the query, it becomes possible to encode non-deterministic of the probabilities of the (negated) tuples in H, and hence we choices over all possible databases. It is important to note denote it by P(H). Note that each D ⊧ H must satisfy Q, and that a similar hardness result was also obtained in [Gribkoff thus the most probable database is a special case of the most et al., 2014] (although on a different query). probable hypothesis that has to contain all tuples from P. In contrast, the most probable hypothesis contains tuples only if Theorem 9. MPD for ∀Q is NP-complete in the data com- they contribute to the satisfaction of the query. plexity.

Example 5. The most probable hypothesis H for Q3 con- We observe a similar phenomenon for MPH, where the in- sists of Veg(bob), Veg(chris), FriendOf(bob, chris), crease in the complexity is even more dramatic. ¬ ( ) P and Eats chris, mussels , which yields a probability of Theorem 10. MPH for ∀Q is Σ2-complete in the data com- P(H) = 0.9 ⋅ 0.6 ⋅ 0.1 ⋅ (1 − 0.8) = 0.0108. Since the most plexity. probable hypothesis contains less tuples, it is more informat- Note that also MPE and MAP differ in terms of computa- ∎ ive than the most probable database for Q3. tional complexity: in BNs, the former is NP-complete [Shi- Whereas the most probable database represents full know- mony, 1994; Littman, 1999] and the latter NPPP-complete ledge about all facts, which corresponds to the common [Park and Darwiche, 2004a]. Differently, MPH remains closed-world assumption for (probabilistic) databases, the P PP in Σ2 ⊆ NP , since computing the probability of the hy- most probable hypothesis may leave tuples of P unresolved, pothesis can be done in polynomial time due to the tuple- which can be seen as a kind of open-world assumption (al- independence assumption. As we will examine later, allow- though the tuples that do not occur in P are still false). ing ontological rules to induce correlations on the tuples res- Complexity Results ults in different complexities, which are more comparable to We formulate the decision problems MPD and MPH and in- PGMs, which can express conditional dependencies between vestigate their computational complexity (see Table 1). We their variables. Our final result for this section concerns the assume familiarity with computational complexity theory, combined complexity of both problems for ∀Q. P and the distinction between data and combined complexity. Theorem 11. MPD and MPH for ∀Q are Σ2-complete in the Definition 6. Let Q be a query, P a PDB, and p ∈ (0, 1] combined complexity. a threshold. We denote by MPD (resp., MPH) the problem If we consider arbitrary FOQs, it is easy to show that the of deciding whether there exists a database D (resp., hypo- data complexity remains the same as for ∀Qs, and the com- thesis H) that satisfies Q with P(D) ≥ p (resp., P(H) ≥ p). bined complexity is still PSPACE, as for classical databases.

952 Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Moreover, all our combined complexity results for PDBs hold is entailed by D, denoted D ⊧ (Q, Σ), if I ⊧ Q holds for all even in the case where the arity of the predicates is bounded I ∈ mods(D, Σ). Observe that consistency of D w.r.t. Σ can + by some constant, commonly known as the bounded-arity thus be written as D ⊭ (–, Σ), or equivalently, D ⊭ (Q–, Σ ), (ba) combined complexity (see Table 1). The reason is that where Q– is the UCQ obtained from the disjunction of the our hardness results use only predicates of a bounded arity. bodies of all NCs in Σ, and Σ+ contains all TGDs of Σ. The probability of an OMQ over a PDB is defined as usual, by 4 Most Probable Explanations for OMQs summing up over the consistent worlds that entail the query. [ We now study the problems in the presence of ontological In general, the entailment problem is undecidable Beeri ] rules, which add deductive power to PDBs. The benefits of and Vardi, 1981 , which motivated syntactic fragments of [ using of ontologies for large-scale PKB completion are well TGDs (but not the NCs) Cal`ı et al., 2012; Baget et al., 2011; known [Jung and Lutz, 2012; Borgwardt et al., 2017]. From a Krotzsch¨ and Rudolph, 2011; Cal`ı et al., 2013; Fagin et al., ] broader perspective, extending tuple-independent PDBs with 2005 . We focus on only a few of these classes here (see + = ∅ logical rules is an old idea, aiming to induce correlations on a Table 1). If there are no TGDs at all, i.e., Σ , then logical level, and thus to relax the tuple independence, result- we denote the by OMQNC. One of the most ing in very powerful formalisms [Poole, 1993; 1997]. important classes, guarded programs (OMQG) allow only In the remainder of this paper, we restrict ourselves to such TGDs that have a body atom (the guard) that contains UCQs, but in exchange consider additional knowledge en- all body variables. In the special case of linear programs coded through an ontology. In the simplest case, we can for- (OMQL), the body may consist of only one atom. In frontier- mulate negative constraints (NCs) like guarded programs (OMQFG), only the frontier variables, i.e., those shared by the body and the head, need to be guarded. ∀x, y Veg(x) ∧ Eat(x, y) ∧ Meat(y) → –, We pay particular attention to the class of guarded and full programs (OMQGF), which does not allow existentially quan- which imposes the same constraints as Q2. NCs are a spe- ± cial case of denial constraints over databases [Staworko and tified variables, and is one of the least expressive Datalog - Chomicki, 2010]. We also allow to formulate more general based query languages with a P-complete data complexity. ontological knowledge in the form of tuple-generating de- At the upper end of the expressivity spectrum, we observe pendencies (TGDs). For instance, the first-order formulas classes like weakly guarded (OMQWG) programs, whose data complexity is already EXP-complete. From a theoretical per- ∀x, y FriendOf(x, y) → FriendOf(y, x), spective, the class of acyclic programs (OMQA), which do ∀x Veg(x) → ∃y Knows(x, y) ∧ Veg(y) not allow cyclic dependencies between predicates, is of in- terest, because it has a non-deterministic combined complex- are TGDs stating that the friend relation is symmetric and ity, which leads to a different behavior. that every vegetarian knows another vegetarian. In partic- A key paradigm in OMQ answering is the FO-rewritability ular, TGDs can express the well-known inclusion depend- of queries: an OMQ (Q, Σ) is FO-rewritable if there exists encies and join dependencies from . Taking a Boolean UCQ Q such that, for all consistent databases D, all parts together, we are talking about ontology-mediated Σ it holds that D ⊧ (Q, Σ) iff D ⊧ Q . Intuitively, the re- queries, which are UCQs in combination with a so-called Σ written query Q can be answered directly over the database, Datalog± ontology, i.e., a finite set of NCs and TGDs [Cal`ı et Σ without referring to the Datalog± program. A class OMQ is al., 2012]. We will pay particular attention to the case where X FO-rewritable if all its OMQs are FO-rewritable; in this case, only NCs are allowed, as this is closest to the constraints over the OMQ entailment problem for OMQ has a data complex- PDBs that we described in the previous examples, and it gives X ity of AC0. Of the classes mentioned above, only OMQ , us a baseline for the computational complexity. NC OMQ , and OMQ have this property. More formally, an NC ν is an FO formula ∀x ϕ(x) → –, L A where ϕ(x) is a conjunction of atoms, called the body of ν, In addition to data complexity, ba-combined complexity, and – is the truth constant false.A TGD σ is an FO for- and combined complexity, for OMQs we also consider the fixed-program (fp) combined complexity ± mula ∀x ϕ(x) → ∃y P(x, y), where ϕ(x) is a conjunction of , where the Datalog atoms, called the body of σ, and P(x, y) is an atom, called the program is viewed as fixed, but both (P)DB and query are head of σ.A (Datalog±) program (or ontology) Σ is a finite considered to be part of the input. The ba-combined complex- 2 ity is of interest for Description Logics [Baader et al., 2003], set of NCs and TGDs. An ontology-mediated query (OMQ) ± (Q, Σ) consists of a program Σ and a Boolean UCQ Q. some of which can be considered to be Datalog languages For the semantics, we extend the vocabulary by an infinite with arity at most 2. set N of nulls. An instance I is a possibly infinite set of tuples that may additionally contain nulls; it satisfies a TGD or NC σ 4.1 Most Probable Databases for OMQs if I ⊧ σ, where ⊧ denotes standard FO entailment. I satisfies For ontology-mediated queries, models are restricted to those a program Σ, written I ⊧ Σ, if I satisfies each formula in Σ. that are consistent with the ontology. In other words, the con- The set mods(D, Σ) of models of a database D relative to a straints are considered separately from the (existential) query; program Σ is {I S I ⊇ D and I ⊧ Σ}. A database D is consist- thus, the definitions of MPD and MPH have to be adapted ac- ent w.r.t. Σ if mods(D, Σ) is non-empty. The OMQ (Q, Σ) cordingly. The main difference is that we now consider only consistent 2 the worlds induced by the PDB, and thus maximize We omit the universal quantifiers in TGDs and NCs, and use only over consistent worlds. commas (instead of ∧) for conjoining atoms for ease of presentation.

953 Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Query Most Probable Database Most Probable Hypothesis Languages data fp-comb. ba-comb. comb. data fp-comb. ba-comb. comb. UCQ in P — NP NP in P — NP NP P P P ∃Q in P — NP NP in Σ2 — Σ3 Σ3 P P P P P ∀QNP— Σ2 Σ2 Σ2 — Σ2 Σ2 P FOQ NP — PSPACE PSPACE Σ2 —PSPACE PSPACE P P PP PP PP OMQNC NPNP Σ2 Σ2 PPNP NP NP P PP PP OMQL NPNP Σ2 PSPACE PPNP NP PSPACE NE NE PP NE NE OMQA NPNPP P PPNP P P P PP PP PP OMQGF NPNP Σ2 EXP NP NP NP EXP PP PP OMQG NPNPEXP 2EXP NP NP EXP 2EXP PP PP OMQFG NP NP 2EXP 2EXP NP NP 2EXP 2EXP OMQWG EXP EXP EXP 2EXP EXP EXP EXP 2EXP

Table 1: Complexity results for MPD and MPH for a wide range of queries: all listed results are original contributions of this paper except the data complexity for ∀Q. This latter result has been strengthened towards the weaker representation OMQNC.

Definition 12. The most probable database for an OMQ Under ba-combined complexity assumptions, we observe (Q, Σ) over a PDB P is an increase in complexity, which intuitively comes from the + arg max P(D). fact that the query (Q–, Σ ) is not fixed anymore. D⊧(Q,Σ),mods(D,Σ)≠∅ P Theorem 16. MPD for OMQNC is Σ2-hard in ba-combined As before, we consider the decision problem MPD, which complexity, even for Q = ⊺. checks for the existence of a world D as above that exceeds a We obtain a matching upper bound from Theorem 13 if given probability threshold p. Our complexity results for this C = NP. Most of the remaining hardness results follow problem are summarized in the lower left part of Table 1. from the complexity of classical OMQ entailment, since an A naive approach to solve MPD is to guess the database D, OMQ (Q, Σ) is entailed by a consistent database D iff the and then check that it entails the given OMQ, does not entail most probable database w.r.t. {⟨t ∶ 1⟩S t ∈ D} has probabil- + the query (Q–, Σ ), and exceeds the probability threshold. ity 1. In combination with Theorem 13, this yields tight com- Since the probability can be computed in polynomial time, plexity results for large deterministic classes like PSPACE or the problem can be decided by an NP Turing machine using EXP. This leaves open only the case of OMQA, for which en- an oracle to check OMQ entailment. tailment is NEXP-complete in the (ba-)combined complexity, NEXP NEXP NE Theorem 13. If entailment for OMQX is in a complexity which yields an upper bound of NP = P = P due C [ ] class C, then MPD for OMQX is in NP (under the same to Theorem 13 and results from Hemachandra, 1989 . We complexity assumptions). show that this bound is tight, using a reduction from a PNEXP- [ ] By a reduction from 3-colorability, we show that MPD is complete version of the tiling problem Furer,¨ 1983 . NE NP-hard already in the data complexity, even if we only use Theorem 17. MPH for OMQA is P -hard in the ba- NCs, i.e., the query and the positive program Σ+ are empty. combined complexity. This strengthens our previous result about ∀Qs, since NCs can be expressed by universal queries, but are not allowed to 4.2 Most Probable Hypotheses for OMQs use negated atoms. Similarly to MPD, we have to update the definition of MPH Theorem 14. MPD for OMQNC is NP-hard in the data com- to take into account only consistent worlds. plexity, even for Q = ⊺. Definition 18. The most probable hypothesis for an OMQ We show a matching upper bound even in the fp-combined (Q, Σ) over a PDB P is complexity, by reconsidering the approach from Theorem 13. + arg max Q P(D), Since here the query (Q–, Σ ) is fixed, for the non-entailment H⊧(Q,Σ) D⊇H check, we can refer to the data complexity of OMQ entail- mods(D,Σ)≠∅ ment. Assuming that this is possible in P, and further that the fp-combined complexity does not exceed NP, then the whole where H is a set of (non-probabilistic) tuples t occurring in P. test can be done by a single non-deterministic Turing machine Since we consider only monotone queries (i.e., UCQs), we in polynomial time. This insight yields an upper bound of NP do not have to include negated tuples in the hypothesis. for most of the classes we consider. To solve MPH, one can guess a hypothesis, and then check Theorem 15. If entailment for OMQX is in NP in the fp- whether it entails the query, and whether the probability mass combined complexity and in P in the data complexity, then of its consistent extensions exceeds the given threshold. The MPD for OMQX is in NP in the fp-combined complexity. latter part can be done by a PP Turing machine with an oracle

954 Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) for OMQ entailment (by normalizing the probability of the 2009]. It is well-known that they are a form of abduction, worlds such that the threshold is exactly 0.5). The oracle can which clearly also applies to the problems studied here. Max- be used also for the initial entailment check. imal posterior probability computations are central in PGMs and in statistical relational learning. However, analogous Theorem 19. If entailment for OMQX is in a complexity C problems have not been studied in depth in the context of class C, then MPH for OMQ is in NPPP (under the same X PDBs. To our knowledge, the only work in this direction is complexity assumptions). PH PP [ et al. ] PP P PP on most probable databases Gribkoff , 2014 , which we For any C ⊆ PH, this yields NP = NP = NP as an use as a starting point. upper bound, due to a result from [Toda, 1989]. Except for Probabilistic query evaluation has been investigated in PP, all other upper bounds in the lower right part of Table 1 depth in the literature [Imieliski and Lipski, 1984; Suciu et also follow from this observation. For C = NEXP, the whole al., 2011]. In fact, it is known to be either in P or #P-hard NEXP PP computation can be done by an NEXP oracle, and for UCQs in the data complexity [Dalvi and Suciu, 2012]. NEXP NE hence we again obtain NP = P in this case. Extensions of PDBs with ontological rules, in particular with On the other hand, we can transfer most of the lower Datalog±, are also well-covered in the literature [Gottlob et bounds from MPD by a simple reduction. al., 2013; Borgwardt et al., 2017]. The focus of these works Theorem 20. If MPD for OMQX is hard for a complexity is on probabilistic query answering. class C, then MPH for OMQX is also hard for C (under any Most of our data complexity results are covered by the P PP complexity measure except for the data complexity). classes NP, Σ2, PP, and NP . Though intractable, they We now discuss the more interesting cases below PSPACE. are at the core of many important problems, which motiv- For FO-rewritable classes of OMQs, we obtain a data com- ated a body of work tailored towards scalable algorithms for plexity of only PP. This is due to the fact that TGDs can be these classes. There is an immediate connection between compiled into the query, and the observation of Theorem 8 MPE and weighted MAX-SAT [Sang et al., 2007]. Simil- that existential queries can be processed using polynomially arly, advances in knowledge compilation [Park and Darwiche, many steps. Nevertheless, in each step, we have to compute 2004b; Pipatsrisawat and Darwiche, 2009] and approximate a non-trivial sum over the consistent worlds, which can be model counting [Chakraborty et al., 2016; Fremont et al., done in PP. Finally, the polynomially many PP checks can 2017] are tailored to achieve optimal, scalable algorithms for be compiled into a single PP operation [Beigel et al., 1995]. problems in classes such as PP and NPPP. Both MPD and Theorem 21. If entailment for OMQ is in AC0 in the data MPH can be cast into their propositional variants using the X lineage representation of queries, which gives immediate ac- complexity, then MPH for OMQX is PP-complete in the data complexity. cess to such algorithms. However, there is need for future PP work in this direction, as grounding is not an optimal way To frame this result, we show NP -hardness both for of handling these problems; performing inference directly on OMQGF (the prototypical non-FO-rewritable class) in the FO-structures is known to be more efficient. data complexity, and for OMQNC in the fp-combined com- plexity. We use two similar reductions from a problem in the 6 Summary and Outlook polynomial-time counting hierarchy [Wagner, 1986]. In the first one, we rely on the power of guarded and full TGDs; in We studied two inference tasks for PDBs; namely finding the the second, we use a non-fixed query and a careful choice of most probable database and the most probable hypothesis for probabilities to simulate the effect of these TGDs. a given query. The focus of this paper was to determine the precise complexity of these problems, and we provided a de- Theorem 22. MPH is NPPP-hard for OMQ in the data GF tailed analysis for these problems relative to a variety of query complexity, and for OMQ in the fp-combined complexity. NC languages, ranging from conjunctive queries to ontology- We also observe a dichotomy in the data complexity of mediated queries. The main focus of future work is on the one MPH, which follows from the dichotomy for probabilistic hand to extend these results to other classes (such as equal- query evaluation [Dalvi and Suciu, 2012]. Note that the PP- ity generating dependencies) and on the other hand to obtain hardness holds only under Turing reductions. even more refined results (such as classification results). Lemma 23 (Dichotomy). For FO-rewritable classes OMQX, MPH is either in P or PP-hard in the data complexity. Acknowledgments ± Our results apply to all decidable Datalog languages from This work was supported by DFG in the CRC 912 (HAEC), the literature, due to the generic nature of our theorems and GRK 1907 (RoSI), and the project BA 1122/19-1 (GoAsQ), the known complexity results for OMQ entailment [Cal`ı et and by the UK EPSRC grants EP/J008346/1, EP/L012138/1, al., 2012; Baget et al., 2011; Krotzsch¨ and Rudolph, 2011; EP/M025268/1, and by The Alan Turing Institute under the Cal`ı et al., 2013]. For example, full sets of TGDs (F) behave EPSRC grant EP/N510129/1. like OMQGF, and linear full (LF) and acyclic full (AF) sets of TGDs exhibit the same complexities as OMQL. References 5 Related Work [Baader et al., 2003] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider, edit- Our work is inspired by the maximal posterior probability ors. The Description Logic Handbook: Theory, Implementation, computations in PGMs [Pearl, 1988; Koller and Friedman, and Applications. Cambridge University Press, 2003.

955 Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

[Baget et al., 2011] Jean-Franc¸ois Baget, Marie-Laure Mugnier, [Koller and Friedman, 2009] Daphne Koller and Nir Friedman. Sebastian Rudolph, and Michael¨ Thomazo. Walking the com- Probabilistic Graphical Models: Principles and Techniques. plexity lines for generalized guarded existential rules. In Proc. MIT Press, 2009. IJCAI, 2011. [Krotzsch¨ and Rudolph, 2011] Markus Krotzsch¨ and Sebastian [Beeri and Vardi, 1981] Catriel Beeri and Moshe Y. Vardi. The im- Rudolph. Extending decidable existential rules by joining acyc- plication problem for data dependencies. In Proc. ICALP, 1981. licity and guardedness. In Proc. IJCAI, 2011. [Beigel et al., 1995] Richard Beigel, Nick Reingold, and Daniel [Kwisthout, 2011] Johan Kwisthout. Most probable explanations Spielman. PP is closed under intersection. JCSS, 50(2):191–202, in Bayesian networks: Complexity and tractability. IJAR, 1995. 52(9):1452–1469, 2011. [Borgwardt et al., 2017] Stefan Borgwardt, Ismail˙ Ilkan˙ Ceylan, [Littman, 1999] Michael L. Littman. Initial experiments in and Thomas Lukasiewicz. Ontology-mediated queries for prob- stochastic satisfiability. In Proc. AAAI, 1999. abilistic databases. In Proc. AAAI, 2017. [Mitchell et al., 2015] Tom Mitchell, William Cohen, Estevam [Cal`ı et al., 2012] Andrea Cal`ı, Georg Gottlob, and Thomas Hruschka, Partha Talukdar, Justin Betteridge, Andrew Carlson, Lukasiewicz. A general Datalog-based framework for tractable Bhavana Dalvi Mishra, Matthew Gardner, Bryan Kisiel, Jayant query answering over ontologies. JWS, 14:57–83, 2012. Krishnamurthy, Ni Lao, Kathryn Mazaitis, Thahir Mohamed, [Cal`ı et al., 2013] Andrea Cal`ı, Georg Gottlob, and Michael Kifer. Ndapa Nakashole, Emmanouil Platanios, Alan Ritter, Mehdi Taming the infinite chase: Query answering under expressive re- Samadi, Burr Settles, Richard Wang, Derry Wijaya, Abhinav lational constraints. JAIR, 48:115–174, 2013. Gupta, Xinlei Chen, Abulhair Saparov, Malcolm Greaves, and Joel Welling. Never-ending learning. In Proc. AAAI, 2015. [Chakraborty et al., 2016] Supratik Chakraborty, Kuldeep S. Meel, and Moshe Y. Vardi. Algorithmic improvements in approximate [Park and Darwiche, 2004a] James D. Park and Adnan Darwiche. counting for probabilistic inference: From linear to logarithmic Complexity results and approximation strategies for MAP ex- SAT calls. In Proc. IJCAI, 2016. planations. JAIR, 21(1):101–133, 2004. [Dalvi and Suciu, 2012] Nilesh Dalvi and Dan Suciu. The dicho- [Park and Darwiche, 2004b] James D. Park and Adnan Darwiche. tomy of probabilistic inference for unions of conjunctive queries. A differential semantics for jointree algorithms. AIJ, 156(2):197– J. ACM, 59(6):1–87, 2012. 216, 2004. [Darwiche, 2009] Adnan Darwiche. Modeling and Reasoning with [Pearl, 1988] Judea Pearl. Probabilistic Reasoning in Intelligent Bayesian Networks. Cambridge University Press, 2009. Systems. Morgan Kaufmann, 1988. [Dong et al., 2014] Xin Luna Dong, Evgeniy Gabrilovich, Geremy [Pipatsrisawat and Darwiche, 2009] Knot Pipatsrisawat and Adnan Heitz, Wilko Horn, Ni Lao, Kevin Patrick Murphy, Thomas Darwiche. A new d-DNNF-based bound computation algorithm Strohmann, Shaohua Sun, and Wei Zhang. Knowledge Vault: A for functional E-MAJSAT. In Proc. IJCAI, 2009. web-scale approach to probabilistic knowledge fusion. In Proc. [Poggi et al., 2008] Antonella Poggi, Domenico Lembo, Diego SIGKDD, 2014. Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Ric- [Fagin et al., 2005] Ronald Fagin, Phokion G. Kolaitis, Renee´ J. cardo Rosati. Linking data to ontologies. JDS, 10:133–173, 2008. Miller, and Lucian Popa. Data exchange: Semantics and query [Poole, 1993] David Poole. Probabilistic Horn abduction and answering. TCS, 336(1):89–124, 2005. Bayesian networks. AIJ, 64(1):81–129, 1993. [Fremont et al., 2017] Daniel Fremont, Markus Rabe, and Sanjit [Poole, 1997] David Poole. The independent choice logic for mod- Seshia. Maximum model counting. In Proc. AAAI, 2017. elling multiple agents under uncertainty. AIJ, 94(1):7–56, 1997. [Furer,¨ 1983] Martin Furer.¨ The computational complexity of the [Richardson and Domingos, 2006] Matthew Richardson and Pedro unconstrained limited domino problem (with implications for lo- Domingos. Markov logic networks. ML, 62(1):107–136, 2006. Logic and Machines gical decision problems). In , 1983. [Sang et al., 2007] Tian Sang, Paul Beame, and Henry Kautz. A [Gottlob et al., 2013] Georg Gottlob, Thomas Lukasiewicz, dynamic approach to MPE and weighted MAX-SAT. In Proc. Maria Vanina Martinez, and Gerardo I. Simari. Query answering IJCAI, 2007. AMAI under probabilistic uncertainty in Datalog+/– ontologies. , [Shimony, 1994] Eyal Solomon Shimony. Finding MAPs for belief 69(1):37–72, 2013. networks is NP-hard. AIJ, 68(2):399–410, 1994. [Gribkoff et al., 2014] Eric Gribkoff, Guy Van den Broeck, and [Shin et al., 2015] Jaeho Shin, Sen Wu, Feiran Wang, Christopher Dan Suciu. The most probable database problem. In Proc. BUDA, De Sa, Ce Zhang, and Christopher Re.´ Incremental knowledge 2014. base construction using DeepDive. In Proc. VLDB, 2015. [Hemachandra, 1989] Lane A. Hemachandra. The strong exponen- [Staworko and Chomicki, 2010] Sławomir Staworko and Jan tial hierarchy collapses. JCSS, 39(3):299–322, 1989. Chomicki. Consistent query answers in the presence of universal [Hoffart et al., 2013] Johannes Hoffart, Fabian M. Suchanek, Klaus constraints. Inf. Syst., 35(1):1–22, 2010. Berberich, and Gerhard Weikum. YAGO2: A spatially and tem- [Suciu et al., 2011] Dan Suciu, Dan Olteanu, Christopher Re,´ and porally enhanced knowledge base from Wikipedia. AIJ, 194:28– Christoph Koch. Probabilistic Databases. Morgan & Claypool, 61, 2013. 2011. [Imieliski and Lipski, 1984] Tomasz Imieliski and Witold Lipski. [Toda, 1989] Seinosuke Toda. On the computational power of PP Incomplete information in relational databases. J. ACM, and ⊕P. In Proc. SFCS, 1989. 31(4):761–791, 1984. [Wagner, 1986] Klaus W. Wagner. The complexity of combinat- [Jung and Lutz, 2012] Jean Christoph Jung and Carsten Lutz. orial problems with succinct input representation. Acta Inf., Ontology-based access to probabilistic data with OWL QL. In 23(3):325–356, 1986. Proc. ISWC, 2012.

956