Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Best Answers over Incomplete Data : Complexity and First-Order Rewritings

Amelie´ Gheerbrant and Cristina Sirangelo Universite´ de Paris, IRIF, CNRS, F-75013 Paris, France {amelie, cristina}@irif.fr

Abstract as if nulls were usual data values, thus merely using the stan- dard query engine to compute certain answers. Answering queries over incomplete data is ubiqui- In general though it is a common occurrence that few if tous in data management and in many AI applica- any certain answers can be found. If there are no certain an- tions that use query rewriting to take advantage of swers, it is still useful to provide a user with some answers, relational database technology. In these scenarios with suitable guarantees. To address this need, a framework one lacks full information on the data but queries to measure how close an answer is to certainty has recently still need to be answered with certainty. The cer- been proposed [Libkin, 2018b], setting the foundations to tainty aspect often makes query answering unfeasi- both a quantitative and a qualitative approach. We focus on ble except for restricted classes, such as unions of the qualitative notion of best answers. Those are a refinement conjunctive queries. In addition often there are no, of certain answers based on comparing query answers; one or very few, certain answers, thus expensive com- that is supported by a larger set of complete interpretations is putation is in vain. Therefore we study a relax- better. Best answers are those answers for which there is no ation of certain answers called best answers. They better one. They always exist and when certain answers exist are defined as those answers for which there is no the two notions simply coincide. better one (that is, no answer true in more possi- ble worlds). When certain answers exist the two Best answers is a natural notion, but we still know little notions coincide. We compare different ways of about it. Identifying the set of best answers among some given family of sets of answers is known to be complete in casting query answering as a decision problem and NP[log n] characterise its complexity for first-order queries, data complexity for the class P , which is considered showing significant differences in the behaviour of as “mildly” harder than both NP and CONP. However this best and certain answers. We then restrict attention very formulation of the decision problem is non standard. to best answers for unions of conjunctive queries Traditionally, in we rather focus on problems stat- and produce a practical algorithm for finding them ing that some given result belongs to the set of answers, or based on query rewriting techniques. that some given set is the set of answers. Certain answers as a decision problem is typically formulated in the first way. So do these variations matter ? We fully answer the question for 1 Introduction both best and certain answers of first-order queries, showing Answering queries over incomplete databases is crucial in significant differences in their computational behaviour. many different scenarios such as data integration [Lenzerini, Despite the high complexity of finding best answers in gen- 2002], data exchange [Arenas et al., 2014], inconsistency eral, one gains tractability when restricting to unions of con- management [Bertossi, 2011], data cleaning [Geerts et al., junctive queries. This is a common class of queries, usu- 2013], ontology-based data access (OBDA) [Bienvenu and ally well behaved computationally, even for certain answers. Ortiz, 2015], and many others. The common thread run- Finding best answers for them was shown to be tractable in ning through all these applications lies in computing cer- [Libkin, 2018b] via an adaptation of techniques used in the tain answers [Amendola and Libkin, 2018; Libkin, 2018a], context of certain answers [Gheerbrant and Libkin, 2015]. which is the standard way of answering queries over incom- Those are essentially resolution based algorithms for first- plete databases. Intuitively this produces answers that can order formulas; this makes them hard to implement in the be obtained from all the possible complete databases a given database context. To overcome this we develop new query incomplete database represents. However, computing such rewriting techniques. In particular we show that best an- query answers then relies on sophisticated algorithms that are swers to any union of conjunctive queries can be computed often difficult to implement. It is well known that restricting by issuing a new first-order query directly on the incomplete to unions of conjunctive queries allows to overcome the diffi- database. Query rewritings are standard in the context of, e.g., culty by using na¨ıve evaluation [Imielinski´ and Lipski, 1984]. consistent query answering, OBDA, query answering using This amounts to evaluating queries over incomplete databases views etc., i.e., all contexts where only partial information is

1704 Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19) available about the data to be queried [Calvanese et al., 2000; ∧, ∨, ¬ and quantifiers ∃, ∀. We write ϕ(¯x) for an FO- Calvanese et al., 2007; Cal`ı et al., 2013; Cal`ı et al., 2003b]. formula ϕ with free variables x¯. With slight abuse of no- First-order rewritings are particularly useful, as they allow to tation, x¯ will denote both a tuple of variables and the set use the power of standard database query engines. In fact of variables occurring in it. The set of constants used by when they exist, the rewritten queries can be implemented in ϕ is as usual denoted by adom(ϕ), and gives the active do- any relational query engine by expressing them in SQL, with main of the associated query. We interpret FO-formulas un- no need to implement ad-hoc algorithms. der active domain semantics, i.e. we consider D as a rela- tional structure with universe adom(D) ∪ adom(ϕ). Thus 2 Preliminaries an FO formula ϕ(¯x) represents a query (of active domain adom(ϕ)) mapping each database D into the set of tuples We represent missing information in relational databases {t¯over adom(D) ∪ adom(ϕ) | D |= ϕ(t¯)}. in the standard way using nulls [Abiteboul et al., 1995; Imielinski´ and Lipski, 1984; van der Meyden, 1998]. To evaluate FO-formulas with free variables we use as- Databases are populated by constants and nulls, coming re- signments ν from variables to constants in the active domain. Note that with a little abuse of notation we write D |= ϕ(t¯) spectively from two countably infinite sets Const and Null. ¯ We denote nulls by ⊥, sometimes with sub- or superscript. for D |=ν ϕ(¯x) under the assignment ν sending x¯ to t. We also allow them to repeat, thus adopting the model of Here it is important to note that the query associated to marked nulls, as customary in the context of applications ϕ is a mapping defined on all databases D, possibly with such as OBDA or data integration and exchange. A relational nulls. If D contains nulls, D |= ϕ(t¯) is to be intended schema, or vocabulary σ, is a set of relation names with asso- “na¨ıvely”, i.e. nulls of D are treated as new constants ciated arities. A database D over σ associates to each relation in the domain of D, distinct from each other, and distinct name of arity k in σ, a k-ary relation which is a finite subset from all the other constants in D and ϕ. For example the of (Const ∪ Null)k. Sets of constants and nulls occuring in D query ϕ(x, y) = ∃z (R(x, z) ∧ R(z, y)), on the database are denoted by Const(D) and Null(D). The active domain of D = {R(1, ⊥1),R(⊥1, ⊥2),R(⊥3, 2)} selects only the tu- D is adom(D) = Const(D) ∪ Null(D).A complete database ple (1, ⊥2). has no nulls. We consider the ∃, ∧, ∨-fragment of FO known as unions A valuation v : Null(D) → Const on a database D is of conjunctive queries and its ∃, ∧-fragment known as con- a map that assigns constant values to nulls occurring in D. junctive queries. By v(D) [resp. v(¯a)] we denote the result of replacing each ⊥ by v(⊥) in D [resp. in the tuple a¯]. The seman- Example 2.1. Let Q(x) = ∃y(R(y) ∧ S(y, x)) and D = tics [[D]] of an incomplete database D is the set {v(D) | {R(⊥1),R(1),S(⊥2, ⊥2)}. We have Supp(Q, D, ⊥2) = v is a valuation on D} of all complete databases it can rep- {v ∈ V(D) | v(⊥2) = 1 or v(⊥1) = v(⊥2)}, resent. V(D) denotes the set of all valuations defined on D. Supp(Q, D, 1) = {v ∈ V(D) | v(⊥2) = 1} and An m-ary query of active domain C ⊆ Const is a map that Supp(Q, D, ⊥1) = {v ∈ V(D) | v(⊥1) = v(⊥2)}. It 2 associates with a database D a subset of (adom(D) ∪ C)m. follows that (Q, D) = ∅ and Best(Q, D) = {(⊥2)}. The active domain of a query will be denoted as adom(Q). In order to study the complexity of best answer compu- To answer a query Q over an incomplete database D we fol- tation we shall need two classes in the second level of the low [Lipski, 1984; Libkin, 2018b] and adopt a slight gener- polynomial hierarchy. Both of these contain NP and CONP, alisation of the usual intersection based certain answers no- and are contained in Σp ∩ Πp. The class DP consists of lan- tion. We define the set of certain answers to Q over D as 2 2 guages L ∩ L where L ∈ NP and L ∈ CONP. This class 2(Q, D) = {a¯ over adom(D) ∪ adom(Q) | ∀v : v(¯a) ∈ 1 2 1 2 has appeared in database applications [Fagin et al., 2005; Q(v(D))}. The only difference with the usual notion is that NP[log n] we allow answers to contain nulls. Barcelo´ et al., 2014]. The class P consists of prob- Following [Libkin, 2018b], given a query Q, a database D, lems that can be solved in polynomial time with a logarithmic and a tuple a¯ over adom(D) ∪ adom(Q), we let the support number of calls to an NP oracle [Eiter and Gottlob, 1997]. of a¯ be the set of all valuations that witness it: Equivalently, it can be described as the class of problems solved in P with an NP oracle where calls to the oracle are Supp(Q, D, a¯) = {v ∈ V(D) | v(¯a) ∈ Q(v(D))} . done in parallel, i.e., independent of each other. This class has appeared in the context of AI, modal logic, OBDA [Got- Supports thus measure how close a tuple is to certainty. We tlob, 1995; Eiter and Gottlob, 1997; Calvanese et al., 2006; consider one answer to be better than another if it has more Bienvenu and Bourgaux, 2016], data exchange [Arenas et al., support. That is, given a database D, a k-ary query Q, and 2013]. k-tuples a,¯ ¯b over adom(D) ∪ adom(Q), we let a¯ ¡ ¯b ⇔ Supp(Q, D, a¯) ⊂ Supp(Q, D, ¯b) . Q,D 3 Complexity of Best and Certain Answers The set of best answers to Q over D is defined as the set of answers for which there is no better one : Best(Q, D) = {a¯ | A natural way to cast computing best answers as a decision ¯ ¯ ¬∃b :a ¯ ¡Q,D b}. problem would be to proceed along the lines of certain an- We focus on first-order (FO) queries of vocabulary σ writ- swers [Imielinski´ and Lipski, 1984] and ask whether a given ten here in the logical notation using Boolean connectives tuple is a best answer :

1705 Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

PROBLEM:BESTANSWER {c1, . . . , cm} to nodes of G. Then we define a query INPUT: A query Q, a database D, a tuple a¯ C(x) ∧ ∀y, z E(y, z) → L(y, x) QUESTION: Is a¯ ∈ Best(Q, D)? ϕ(x) = ∧ ∀y L(y, x) → ∃z E(y, z) Instead, [Libkin, 2018b] considers the following variant of ∧ ¬∃y E(y, y) . the problem, asking whether the set of best answers belongs to a specified family of sets : For any valuation v, ϕ(c) holds in v(DG) iff 1) c = cj for some j = 1..m (ensured by the first conjunct). 2) For such PROBLEM:BESTANSWER∈ a cj, the valuation v maps each null into {c1, . . . , cj} (sec- INPUT: A query Q, a database D, ond conjunct), i.e. v represents an assignment of colours to a family X of sets of tuples nodes of G, using at most the first j colours. 3) Each color QUESTION: Is Best(Q, D) ∈ X ? {c1, . . . , cj} is used by v, i.e. v represents an assignment of colours to nodes of G, using precisely the first j colours (third This suggests yet another alternative formulation of the conjunct). 4) There are no loops in E (fourth conjunct). problem, asking if a given set is the best answer : Thus, for a valuation v, the formula ϕ(cj) is true in v(DG) iff v represents a colouring of G using precisely the first j = PROBLEM:BESTANSWER colours {c1, . . . , cj} (which in the sequel we refer to as an INPUT: A query Q, a database D, exact j-colouring of G). a set X of tuples, Next we define : QUESTION: Is Best(Q, D) = X? Q(x) = C(x) ∧ ( ϕ(x)∨ ∃y (O(y) ∧ L(x, y) ∧ ϕ(y)) ) BESTANSWER∈ was shown in [Libkin, 2018b] to be For a valuation v, we have that Q(c ) holds in v(D ) iff ei- PNP[log n]-complete in data complexity. We show that the i G ther v represents an exact i-coloring of G; or v represents an other alternatives are computationally equivalent. Interest- exact j-coloring of G with j odd, and i ≤ j. In other words ingly, the situation is very different with certain answers, as valuations representing exact j-colorings, with j even, sup- we show next. port only the maximal color cj; while valuations representing Theorem 3.1. For FO queries the problems exact j-colorings, with j odd, support all colors {c1...cj}. [log n] BESTANSWER and BESTANSWER= are PNP -complete With this in place we can conclude the reduction for the in data complexity. BESTANSWER problem : Claim. c ∈ Best(Q, D ) iff the chromatic number of G Proof (sketch). The upper bound for BESTANSWER= im- 1 G is odd. mediately follows from the upper bound for An adaptation of this encoding can be used to reduce the BESTANSWER∈ (take the family X to be a singleton odd chromatic number problem to BESTANSWER= as well. {X}). As for BESTANSWER we only need a slight modifica- tion of the upper bound proof in [Libkin, 2018b]. To check whether a¯ ∈ Best(Q, D) we proceed as follows. Since the Now that we showed that all three formulations of best query is fixed, and has therefore fixed arity k, in polynomial answers actually collapse computationally, another natural time we can enumerate all the k-tuples of adom(D). Then, question arises. Does a similar result hold for certain an- using parallel calls to the NP oracle, we can check for each swers ? We obtain the decision problems CERTAINANSWER, such tuple ¯b whether Supp(Q, D, a¯) ⊆ Supp(Q, D, ¯b) CERTAINANSWER= and CERTAINANSWER∈ by replacing and whether Supp(Q, D, ¯b) ⊆ Supp(Q, D, a¯). With this everywhere in the statements of the above decision problems information, in polynomial time we know whether a¯ ¡ ¯b Q,D BESTANSWER by CERTAINANSWER and Best(Q, D) by ¯b for some . 2(Q, D). It is well known that data complexity of CER- We prove the two remaining lower bounds, reducing from TAINANSWER is CONP-complete for FO-queries [Abiteboul NP[log n] the same P -complete problem [Wagner, 1990]: given et al., 1991]. We complete the picture as follows. an undirected graph G, is its chromatic number χ(G) odd? = With each undirected graph G = hN,Ei with nodes N and Theorem 3.2. For FO queries, CERTAINANSWER is DP - [log n] ERTAIN NSWER∈ NP edges E, we associate a database DG over binary relations complete and C A is P -complete in L, E and unary relations C,O as follows. We use a null ⊥n data complexity. 0 in DG for each node n of G. For each edge {n, n } of G, Proof. To prove membership of CERTAINANSWER= in DP , we have pairs (⊥n, ⊥n0 ) and (⊥n0 , ⊥n) in the relation E of notice that for a query Q, this problem is the intersection DG. In relation C we have m constants {c1, . . . , cm} (intu- itively representing possible colors), where m is the number of two languages L1 ∩ L2 where L1 = {(D,X) | X ⊆ 2 2 of nodes of G. Relation O of DG is defined as {ci | i is odd} (Q, D)} and L2 = {(D,X) | X ⊆ (Q, D)}. L1 is known and L is a linear ordering on them, i.e., (ci, cj) ∈ L iff i ≤ j, to be in CONP : we guess a tuple a¯ ∈ X and a valuation for i, j ≤ m. v ∈ V (D) with v(¯a) 6∈ Q(v(D)). Similarly, L2 is in NP : ¯ 0 Remark that any valuation v of DG that maps each null we guess a tuple b ∈ X and a valuation v ∈ V (D) with into a constant of C represents an assignment of colours in v0(¯b) ∈ Q(v(D)).

1706 Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19) Type of problem a¯ ∈ Answer X = Answer Answer ∈ X Certain Answer coNP-complete [Abiteboul et al., 1991] DP complete PNP[log n]-complete Best Answer PNP[log n]-complete PNP[log n]-complete PNP[log n]-complete [Libkin, 2018b]

Figure 1: Summary of data complexity results for FO queries

[log n] To prove membership of CERTAINANSWER∈ in PNP , Definition 4.1 (NRV normal form). A conjunctive query Q is suppose the query Q is k-ary, and we are given a family of in non repeating variable normal form (NRV normal form) if sets of k-ary tuples X = {X1,...,Xn} and a database D. it is of the form Q(¯x) = ∃y,¯ z¯ (q(¯y, z¯) ∧ e(¯y, z¯) ∧ x¯ =z ¯) For each Xi ∈ X , we use the NP oracle to decide in parallel where variables in x¯y¯z¯ are pairwise distinct, x¯ and z¯ have whether Xi = 2(Q, D) (for each Xi the two calls to the the same length, and: oracle do not depend on each other and they can also be done • q(¯y, z¯) is a conjunction of relational atoms, where each in parallel). free variable in y,¯ z¯ has at most one occurrence in q, For DP -hardness, we reduce from the problem of check- ing whether χ(G), the chromatic number of an undirected • e(¯y, z¯) is a conjunction of equality atoms, possibly using graph G, equals 4 [Rothe, 2003] and for PNP[log n] -hardness, constants. we reduce from the related problem of checking whether We say that q(¯y, z¯) is the relational subquery of Q, and χ(G) is odd. With such a graph G, we associate the same e(¯y, z¯) ∧ x¯ =z ¯ is the equality subquery of Q. database DG as in the proof of Theorem 3.1. Using the exact coloring formula ϕ in the proof of Theorem 3.1, we define a Example 4.2. The query Q(x) from Example 2.1 is equiva- query lent to ∃y1y2z(R(y1) ∧ S(y2, z) ∧ y1 = y2 ∧ z = x), which is in NRV normal form. Q(x) := C(x) ∧ ∀y (ϕ(y) → L(x, y)) Clearly every conjunctive query Q is equivalent to a query 2(Q, D) = {c , . . . , c } χ(G) = n We claim that 1 n iff , in NRV normal form; moreover Q can be easily rewritten in 2(Q, D) = {c , . . . , c } χ(G) = 4 which entails 1 4 iff and NRV normal form (in linear time in the size of the query). 2(Q, D) ∈ {{c , . . . , c } | j 1 ≤ j ≤ |G|} 1 j is odd and iff Thus in what follows we assume w.l.o.g. that conjunctive χ(G) v(D ) |= ϕ(c ) c is odd. Recall that G i iff i is a color queries are given in NRV normal form. Intuitively the NRV {c , ..., c } v i in 1 |G| and represents an exact -coloring of the normal form allows us to separate the two ingredients of a graph. Now v(DG) |= Q(cj) iff cj is a color and there is no conjunctive query : the existence of facts in some relations of i < j such that v represents an exact i-coloring of the graph, the database on the one side, and a set of equality conditions which holds exactly whenever cj ∈ {c1, ..., cχ(G)}. on data values occurring in these facts, on the other side. The existence of facts does not depend on the valuation of nulls, and thus can be directly tested on the incomplete database. 4 First-Order Rewritings for Best Answers Instead equality atoms in an NRV normal form imply condi- Considering arbitrary FO-queries brought us an intrinsic in- tions that valuations need to satisfy in order for the query to tractability result for all variants of best answers. This moti- hold. We can thus first concentrate on the support of equality vates restricting to unions of conjunctive queries, for which subqueries. This will be encoded in FO and then integrated a polynomial time evaluation algorithm (in data complexity) in the rewriting of the whole conjunctive query. already exists [Libkin, 2018b]. The resolution based pro- We introduce a notion of equivalence of database elements cedure is however in sharp contrast with na¨ıve evaluation, w.r.t. to a set of equalities. Intuitively equivalent elements which allows to compute certain answers to unions of con- of a tuple t¯ are the ones which should be collapsed into a junctive queries via usual model checking. We thus initiate a single value in order for a valuation of t¯to satisfy all the given descriptive complexity analysis of the best answers problem, equalities. showing that for unions of conjunctive queries, it can essen- Definition 4.3. Given a database D, a conjunction of equal- tially be reduced - modulo a preprocessing of the query - to ity atoms γ(¯y) and an assignment ν :y ¯ ∪ adom(γ) → (na¨ıve) evaluation of an FO-formula. adom(D)∪adom(γ) preserving constants, we say that u, u0 ∈ Given a union of conjunctive queries Q, our starting adom(D) ∪ adom(γ) are equivalent w.r.t. γ and ν and write point towards an FO-rewriting for best answers is finding ν 0 0 0 u ≡γ u , if either u = u or (u, u ) belongs to the reflexive an FO-formula Q⊆(¯x, y¯) encoding the inclusion of sup- ports, i.e. selecting tuples s,¯ t¯ over adom(D) ∪ adom(Q) iff symmetric transitive closure of {(ν(x), ν(w)) | x = w ∈ γ}. ¯ ν Supp(Q, D, s¯) ⊆ Supp(Q, D, t). From Q⊆ one can easily The relation ≡γ is clearly an equivalence relation over define an FO-formula selecting precisely all best answers to adom(D) ∪ adom(γ), where each element outside the range Q on D: of ν forms a singleton equivalence class. best (¯x) := ∀y¯(Q (¯x, y¯) → Q (¯y, x¯)) (1) Q ⊆ ⊆ Example 4.4. Let γ be x1 = x2 ∧ x2 = x3 ∧ x4 = x5 ∧ We start by putting each conjunctive query in a normal x6 = 1. Let ν assign ⊥i to xi for i ≤ 5, and ⊥5 to x6. The ν form which eliminates repetition of variables, by introducing equivalence classes of ≡γ are {⊥i | i ≤ 3} and {1, ⊥4, ⊥5}, new equality atoms. plus one singleton for each other domain element.

1707 Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

In what follows we denote by ∼γ the reflexive symmetric incomplete database D over σ ∪ Null, Null is always inter- transitive closure of {(x, w) | x = w ∈ γ}. Note that this is preted by the set of nulls occurring in D (in accordance with an equivalence relation among variables and constants of γ. the semantics of the SQL construct IS NULL). I.e. we allow We will be using the following formula to provide a syntac- rewritings to test whether a database element is null or not. ν tic characterisation of ≡γ , where m is the number of equiva- For γ(¯y) a conjunction of equality atoms, using equivγ we 1 lence classes of ∼γ : define a new formula compγ (¯y) stating the existence of a val- uation that collapses all equivalent elements of a tuple: 0 0 equivγ (¯y, z, z ) := z = z ∨ compγ (¯y) := _ 0 ^ 0 0 0 0 (z = u1 ∧ z = vm ∧ vi = ui+1) ∀zz (equivγ (¯y, z, z ) ∧ ¬Null(z) ∧ ¬Null(z ) → z = z ) u1,v1...um,vm∈ y¯ ∪ adom(γ) | 1≤i

1 Queries we write in the sequel can be domain dependent. So it where each Qi is in NRV-normal form with relational sub- is important to recall that we always use active domain semantics. query qi(¯yi, z¯i) and equality subquery eqi(¯x, y¯i, z¯i).

1708 Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Lemma 4.14. Let D be a database, v a valuation of D and in Q, the quantifier rank of bestQ is linear in the size of Q. Q(¯x) a union of conjunctive queries in NRV-normal form, Thus whether a¯ ∈ best(Q, D) can be checked using space then v ∈ Supp(Q, D, r¯) if and only there exists i, s¯and t¯such O(|Q|, |D|). Moreover one can easily check that bestQ can ¯ ¯ ¯ that D |= qi(¯s, t) ∧ compeqi (¯rs¯t) and v(D) |= eqi(v(¯rs¯t)). be computed from Q in space polynomial in the size of |Q|. We are now ready to define the FO-formula encoding the Since space bounded computations can be composed without inclusion of supports. storing the intermediate output, computing bestQ from Q and then evaluating bestQ on D can be done overall in PSPACE ¯0 ^ in the size of |Q| and |D|. The rewriting approach thus im- Q⊆(¯x, x ) := (∀y¯z¯((qi(¯y, z¯) ∧ compeqi (¯x, y,¯ z¯)) → 1≤i≤n plies a PSPACE upper bound for the combined complexity of BESTANSWER for unions of conjunctive queries. How- _ 0 0 0 0 0 0 0 ∃y¯ z¯ (qj(¯y , z¯ ) ∧ implyeqi,eqj (¯xy¯z,¯ x¯ y¯ z¯ )) ) ) ever we can show that the problem actually stands in the third 1≤j≤n level of the polynomial hierarchy. Combining Lemmas 4.7, 4.14, Propositions 4.10, 4.12 and Theorem 4.19. For unions of conjunctive queries, combined p Corollary 4.13 we get : complexity of BESTANSWER is Π3-complete. Hardness al- ready holds for conjunctive queries. Proposition 4.15. D |= Q⊆(¯s, t¯) iff Supp(Q, D, s¯) ⊆ Supp(Q, D, t¯). Therefore under standard complexity theoretic assump- tions, our rewriting approach is not optimal in terms of com- Recall that from Q⊆ one can easily define a first order bined complexity, as it is often the case with generic ap- rewriting bestQ(¯x) for best answers as in (1). proaches. However it has the advantage of exploiting stan- Theorem 4.16. Given Q a union of conjunctive queries dard FO query evaluation, which despite the PSPACE com- ¯ over schema σ and an incomplete database D, t ∈ bined complexity, is highly optimised in database systems ¯ Best(Q, D) iff D |= bestQ(t). and works well in practice. Example 4.17. For Q, D, γ, γ0 as in Example 2.1 and 4.11 : 0 5 Future Work Q⊆(x, x ) := ∀y1y2z((R(y1)∧S(y2, z)∧compγ (y1, y2, z, x)) Constraints (e.g., keys and functional dependencies) are → known to raise the complexity of finding certain answers 0 0 0 0 0 0 0 0 0 0 ∃y1y2z (R1(y1)∧S(y2, z )∧implyγ,γ0 (y1y2zx, y1y2z x ))). [Cal`ı et al., 2003a]. They appear in another model of incom- This allows to derive for instance Supp(Q, D, 1) ⊆ pleteness - conditional tables [Imielinski´ and Lipski, 1984] Supp(Q, D, ⊥2) (as observed in Example 2.1). In - that in general leads to higher complexity of query evalu- [ ] fact the subquery R(y1) ∧ S(y2, z) ∧ compγ (y1, y2, z, x) ation Abiteboul et al., 1995 but are nonetheless useful in [ ] with free variables y1, y2, z, x selects on D tuples several applications Arenas et al., 2013 . We would like to (1, ⊥2, ⊥2, 1), (⊥1, ⊥2, ⊥2, 1), and no other tuple with last explore how our rewriting techniques interact with integrity element 1. Moreover as shown in Example 4.11 constraints. In another direction, while we focused on FO-rewritings, D |= implyγγ0 (⊥1⊥2⊥21, 1⊥2⊥2⊥2) we could consider more expressive rewriting languages such D |= implyγγ0 (1⊥2⊥21, 1⊥2⊥2⊥2) as Datalog or fixed point logics, as it is common in the context of OBDA, query answering using views, or consistent query Thus D |= Q⊆(1, ⊥2). Similarly one can show D |= answering [Bienvenu and Ortiz, 2015; Francis et al., 2015; Q⊆(⊥1, ⊥2) and therefore D |= bestQ(⊥2). Bertossi, 2011]. These more expressive logics are likely to As a corollary of Theorem 4.16, for a union of conjunc- be able to express rewritings of larger classes of queries than tive queries Q one can compute Best(Q, D) by first com- unions of conjunctive queries. puting the formula bestQ(¯x) from Q, then evaluating bestQ We would also like to investigate how our techniques can on D. Since data complexity of FO query evaluation is be extended to different semantics of incompleteness. We DLOGSPACE (and in particular AC0), this gives the follow- used here the closed-world semantics [Abiteboul et al., 1995; ing corollary : Imielinski´ and Lipski, 1984; van der Meyden, 1998], in which Corollary 4.18. For each fixed union of conjunctive queries data values are the only missing information, but there are Q, the data complexity of BESTANSWER is DLOGSPACE. other possible semantics, e.g. needed in order to cope with data inconsistencies [Cal`ı et al., 2003a], where query rewrit- Note that it was known from [Libkin, 2018b] that the data ings could still be found. complexity of computing best answers for unions of conjunc- Finally, we would like to investigate how techniques de- tive queries is polynomial time. In terms of combined com- veloped in this paper could be extended to study rewritings of plexity (i.e. when either Q, D and a¯ are in the input), the certain answers. rewriting approach (i.e. the procedure of computing bestQ from Q and then evaluating bestQ on D), can be easily shown to be in PSPACE. In fact it is well known that a first or- Acknowledgements der query ϕ can be evaluated on a database D in space at We thank the anonymous referees and Leonid Libkin for their most qr(ϕ) log |D| + log|ϕ|, where qr(ϕ) is the quantifier useful feedback. This work was supported by contract ANR- rank of ϕ. Note that although bestQ has size exponential 18-CE40-0031 QUID.

1709 Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

References [Calvanese et al., 2007] Diego Calvanese, Giuseppe De Gi- acomo, Maurizio Lenzerini, and Moshe Y. Vardi. View- [Abiteboul et al., 1991] Serge Abiteboul, Paris C. Kanel- based query processing: On the relationship between lakis, and Gosta¨ Grahne. On the representation and query- rewriting, answering and losslessness. TCS, 371(3):169– ing of sets of possible worlds. TCS, 78(1):158–187, 1991. 182, March 2007. [ et al. ] Abiteboul , 1995 Serge Abiteboul, Richard Hull, and [Eiter and Gottlob, 1997] Thomas Eiter and Georg Gottlob. VictorVianu. Foundations of Databases. Addison-Wesley, p The complexity class theta2 : Recent results and applica- Boston, MA, USA, 1995. tions in AI and modal logic. In FCT, pages 1–18, Krakow,´ [Amendola and Libkin, 2018] Giovanni Amendola and , September 1997. Springer. Leonid Libkin. Explainable certain answers. In IJCAI, [Fagin et al., 2005] Ronald Fagin, Phokion G. Kolaitis, and pages 1683–1690, Stockholm, Sweden, August 2018. Lucian Popa. Data exchange: getting to the core. ACM [Arenas et al., 2013] Marcelo Arenas, Jorge Perez,´ and TODS, 30(1):174–210, 2005. Juan L. Reutter. Data exchange beyond complete data. J. [Francis et al., 2015] Nadime Francis, Luc Segoufin, and ACM, 60(4):28:1–28:59, 2013. Cristina Sirangelo. Datalog rewritings of regular path [Arenas et al., 2014] Marcelo Arenas, Pablo Barcelo, queries using views. Logical Methods in Computer Sci- Leonid Libkin, and Filip Murlak. Foundations of Data ence, 11(4), 2015. Exchange. Cambridge University Press, 2014. [Geerts et al., 2013] Floris Geerts, Giansalvatore Mecca, [Barcelo´ et al., 2014] Pablo Barcelo,´ Leonid Libkin, and Paolo Papotti, and Donatello Santoro. The LLUNATIC Miguel Romero. Efficient approximations of conjunctive data-cleaning framework. PVLDB, 6(9):625–636, 2013. queries. SIAM J. Comput., 43(3):1085–1130, 2014. [Gheerbrant and Libkin, 2015] Amelie´ Gheerbrant and [Bertossi, 2011] Leopoldo E. Bertossi. Database Repairing Leonid Libkin. Certain answers over incomplete xml and Consistent Query Answering. Synthesis Lectures on documents: Extending tractability boundary. ACM ToCS, Data Management. Morgan & Claypool Publishers, 2011. 57(4):892–926, July 2015. [ ] [Bienvenu and Bourgaux, 2016] Meghyn Bienvenu and Gottlob et al., 2015 Georg Gottlob, Marco Manna, and An- Camille Bourgaux. Inconsistency-tolerant querying of dreas Pieris. Polynomial rewritings for linear existential description logic knowledge bases. In Tutorial Lectures of rules. In IJCAI, pages 2992–2998, Buenos Aires, Ar- RW Summer School, pages 156–202. LNCS, 2016. gentina, August 2015. AAAI Press. [Gottlob, 1995] Georg Gottlob. NP trees and Carnap’s modal [Bienvenu and Ortiz, 2015] Meghyn Bienvenu and Mag- logic. Journal of the ACM, 42(2):421–457, March 1995. dalena Ortiz. Ontology-mediated query answering with data-tractable description logics. In Reasoning Web. Web [Imielinski´ and Lipski, 1984] Tomasz Imielinski´ and Witold Logic Rules - 11th International Summer School, pages Lipski. Incomplete information in relational databases. 218–307, Berlin, Germany, August 2015. Springer. Journal of the ACM, 31(4):761–791, Oct 1984. [Cal`ı et al., 2003a] Andrea Cal`ı, Domenico Lembo, and Ric- [Lenzerini, 2002] Maurizio Lenzerini. Data integration: A cardo Rosati. On the decidability and complexity of query theoretical perspective. In ACM PODS, pages 233–246, answering over inconsistent and incomplete databases. In Madison, USA, June 2002. PODS, pages 260–271, June 2003. [Libkin, 2018a] Leonid Libkin. Certain answers as objects [Cal`ı et al., 2003b] Andrea Cal`ı, Domenico Lembo, and Ric- of knowledge. Artificial Intellligence, 232:195–207, 2018. cardo Rosati. Query rewriting and answering under con- [Libkin, 2018b] Leonid Libkin. Certain answers meet zero- straints in data integration systems. In IJCAI, pages 16–21, one laws. In ACM PODS, pages 1–19, Houston, USA, Acapulco, Mexico, August 2003. Morgan Kaufmann. June 2018. [Cal`ı et al., 2013] Andrea Cal`ı, Diego Calvanese, and Mau- [Lipski, 1984] . On relational algebra with rizio Lenzerini. Rewrite and conquer: Dealing with in- marked nulls. In PODS, pages 201–203, Waterloo, On- tegrity constraints in data integrations. In Seminal Contri- tario, Canada, April 1984. ACM. butions to Information Systems Engineering, pages 353– [Rothe, 2003] Jorg¨ Rothe. Exact complexity of exact-four- 359, 2013. colorability. Inf. Process. Lett., 87(1):7–12, 2003. [Calvanese et al., 2000] Diego Calvanese, Giuseppe De Gi- [van der Meyden, 1998] Ron van der Meyden. Logical ap- acomo, Maurizio Lenzerini, and Moshe Y. Vardi. What proaches to incomplete information: a survey. In Logics is view-based query rewriting? In KRDB, pages 17–27, for databases and information systems, pages 307–356, Berlin, Germany, August 2000. Norwell, MA, USA, 1998. Kluwer Academic Publishers. [Calvanese et al., 2006] Diego Calvanese, Giuseppe De Gi- [Wagner, 1990] Klaus W. Wagner. Bounded query classes. acomo, Domenico Lembo, Maurizio Lenzerini, and Ric- SIAM J. Comput., 19(5):833–846, 1990. cardo Rosati. Epistemic first-order queries over descrip- tion logic knowledge bases. In DL, Windermere, Lake District, UK, May-June 2006.

1710