Best Answers Over Incomplete Data : Complexity and First-Order Rewritings

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19) Best Answers over Incomplete Data : Complexity and First-Order Rewritings Amelie´ Gheerbrant and Cristina Sirangelo Universite´ de Paris, IRIF, CNRS, F-75013 Paris, France famelie, [email protected] Abstract as if nulls were usual data values, thus merely using the standard database query engine to compute certain answers. Answering queries over incomplete data is ubiqui- In general though it is a common occurrence that few if tous in data management and in many AI applica- any certain answers can be found. If there are no certain an- tions that use query rewriting to take advantage of swers, it is still useful to provide a user with some answers, relational database technology. In these scenarios with suitable guarantees. To address this need, a framework one lacks full information on the data but queries to measure how close an answer is to certainty has recently still need to be answered with certainty. The cer- been proposed [Libkin, 2018b], setting the foundations to tainty aspect often makes query answering unfeasi- both a quantitative and a qualitative approach. We focus on ble except for restricted classes, such as unions of the qualitative notion of best answers. Those are a refinement conjunctive queries. In addition often there are no, of certain answers based on comparing query answers; one or very few, certain answers, thus expensive com- that is supported by a larger set of complete interpretations is putation is in vain. Therefore we study a relax- better. Best answers are those answers for which there is no ation of certain answers called best answers. They better one. They always exist and when certain answers exist are defined as those answers for which there is no the two notions simply coincide. better one (that is, no answer true in more possible worlds). When certain answers exist the two Best answers is a natural notion, but we still know little notions coincide. We compare different ways of about it. Identifying the set of best answers among some given family of sets of answers is known to be complete in casting query answering as a decision problem and NP[log n] characterise its complexity for first-order queries, data complexity for the class P , which is considered showing significant differences in the behaviour of as “mildly” harder than both NP and CONP. However this best and certain answers. We then restrict attention very formulation of the decision problem is non standard. to best answers for unions of conjunctive queries Traditionally, in databases we rather focus on problems stat- and produce a practical algorithm for finding them ing that some given result belongs to the set of answers, or based on query rewriting techniques. that some given set is the set of answers. Certain answers as a decision problem is typically formulated in the first way. So do these variations matter ? We fully answer the question for 1 Introduction both best and certain answers of first-order queries, showing Answering queries over incomplete databases is crucial in significant differences in their computational behaviour. many different scenarios such as data integration [Lenzerini, Despite the high complexity of finding best answers in gen- 2002], data exchange [Arenas et al., 2014], inconsistency eral, one gains tractability when restricting to unions of con- management [Bertossi, 2011], data cleaning [Geerts et al., junctive queries. This is a common class of queries, usu- 2013], ontology-based data access (OBDA) [Bienvenu and ally well behaved computationally, even for certain answers. Ortiz, 2015], and many others. The common thread run- Finding best answers for them was shown to be tractable in ning through all these applications lies in computing cer- [Libkin, 2018b] via an adaptation of techniques used in the tain answers [Amendola and Libkin, 2018; Libkin, 2018a], context of certain answers [Gheerbrant and Libkin, 2015]. which is the standard way of answering queries over incom- Those are essentially resolution based algorithms for first- plete databases. Intuitively this produces answers that can order formulas; this makes them hard to implement in the be obtained from all the possible complete databases a given database context. To overcome this we develop new query incomplete database represents. However, computing such rewriting techniques. In particular we show that best an- query answers then relies on sophisticated algorithms that are swers to any union of conjunctive queries can be computed often difficult to implement. It is well known that restricting by issuing a new first-order query directly on the incomplete to unions of conjunctive queries allows to overcome the diffi- database. Query rewritings are standard in the context of, e.g., culty by using na¨ıve evaluation [Imielinski´ and Lipski, 1984]. consistent query answering, OBDA, query answering using This amounts to evaluating queries over incomplete databases views etc., i.e., all contexts where only partial information is 1704 Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19) available about the data to be queried [Calvanese et al., 2000; ^; _; : and quantifiers 9; 8. We write '(¯x) for an FO- Calvanese et al., 2007; Cal`ı et al., 2013; Cal`ı et al., 2003b]. formula ' with free variables x¯. With slight abuse of no- First-order rewritings are particularly useful, as they allow to tation, x¯ will denote both a tuple of variables and the set use the power of standard database query engines. In fact of variables occurring in it. The set of constants used by when they exist, the rewritten queries can be implemented in ' is as usual denoted by adom('), and gives the active do- any relational query engine by expressing them in SQL, with main of the associated query. We interpret FO-formulas un- no need to implement ad-hoc algorithms. der active domain semantics, i.e. we consider D as a relational structure with universe adom(D) [ adom('). Thus 2 Preliminaries an FO formula '(¯x) represents a query (of active domain adom(')) mapping each database D into the set of tuples We represent missing information in relational databases ft¯over adom(D) [ adom(') j D j= '(t¯)g. in the standard way using nulls [Abiteboul et al., 1995; Imielinski´ and Lipski, 1984; van der Meyden, 1998]. To evaluate FO-formulas with free variables we use as- Databases are populated by constants and nulls, coming re- signments ν from variables to constants in the active domain. Note that with a little abuse of notation we write D j= '(t¯) spectively from two countably infinite sets Const and Null. ¯ We denote nulls by ?, sometimes with sub- or superscript. for D j=ν '(¯x) under the assignment ν sending x¯ to t. We also allow them to repeat, thus adopting the model of Here it is important to note that the query associated to marked nulls, as customary in the context of applications ' is a mapping defined on all databases D, possibly with such as OBDA or data integration and exchange. A relational nulls. If D contains nulls, D j= '(t¯) is to be intended schema, or vocabulary σ, is a set of relation names with asso- “na¨ıvely”, i.e. nulls of D are treated as new constants ciated arities. A database D over σ associates to each relation in the domain of D, distinct from each other, and distinct name of arity k in σ, a k-ary relation which is a finite subset from all the other constants in D and '. For example the of (Const [ Null)k. Sets of constants and nulls occuring in D query '(x; y) = 9z (R(x; z) ^ R(z; y)), on the database are denoted by Const(D) and Null(D). The active domain of D = fR(1; ?1);R(?1; ?2);R(?3; 2)g selects only the tu- D is adom(D) = Const(D) [ Null(D).A complete database ple (1; ?2). has no nulls. We consider the 9; ^; _-fragment of FO known as unions A valuation v : Null(D) ! Const on a database D is of conjunctive queries and its 9; ^-fragment known as con- a map that assigns constant values to nulls occurring in D. junctive queries. By v(D) [resp. v(¯a)] we denote the result of replacing each null ? by v(?) in D [resp. in the tuple a¯]. The seman- Example 2.1. Let Q(x) = 9y(R(y) ^ S(y; x)) and D = tics [[D]] of an incomplete database D is the set fv(D) j fR(?1);R(1);S(?2; ?2)g. We have Supp(Q; D; ?2) = v is a valuation on Dg of all complete databases it can rep- fv 2 V(D) j v(?2) = 1 or v(?1) = v(?2)g, resent. V(D) denotes the set of all valuations defined on D. Supp(Q; D; 1) = fv 2 V(D) j v(?2) = 1g and An m-ary query of active domain C ⊆ Const is a map that Supp(Q; D; ?1) = fv 2 V(D) j v(?1) = v(?2)g. It 2 associates with a database D a subset of (adom(D) [ C)m. follows that (Q; D) = ; and Best(Q; D) = f(?2)g. The active domain of a query will be denoted as adom(Q). In order to study the complexity of best answer compu- To answer a query Q over an incomplete database D we fol- tation we shall need two classes in the second level of the low [Lipski, 1984; Libkin, 2018b] and adopt a slight gener- polynomial hierarchy. Both of these contain NP and CONP, alisation of the usual intersection based certain answers no- and are contained in Σp \ Πp.

Best Answers Over Incomplete Data : Complexity and First-Order Rewritings

Probabilistic Databases

Download a Copy of the 264-Page Publication

1 Tarski's Influence on Computer Science

Nonapplicable Nulls

My Six Encounters with Victor Marek — a Personal Account

Ooo Ooto •Oooo

Oooo Oo#Oo Oteoo

Best Answers Over Incomplete Data: Complexity and First-Order Rewritings

FINDING a MANHATTAN PATH and RELATED PROBLEMS by Witold

The Relational Model of Data and Cylindric Algebras

Query Processing on Probabilistic Data: a Survey

Remembering Professor Helena Rasiowa