1
The Complexity of Counting Problems over Incomplete Databases
MARCELO ARENAS∗, Universidad Católica & IMFD Chile PABLO BARCELÓ†, Universidad Católica & IMFD Chile MIKAËL MONET, Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France We study the complexity of various fundamental counting problems that arise in the context of incomplete databases, i.e., relational databases that can contain unknown values in the form of labeled nulls. Specifically, we assume that the domains of these unknown values are finite and, for a Boolean query 푞, we consider the following two problems: given as input an incomplete database 퐷, (a) return the number of completions of 퐷 that satisfy 푞; or (b) return the number of valuations of the nulls of 퐷 yielding a completion that satisfies 푞. We obtain dichotomies between #P-hardness and polynomial-time computability for these problems when 푞 is a self-join–free conjunctive query, and study the impact on the complexity of the following two restrictions: (1) every null occurs at most once in 퐷 (what is called Codd tables); and (2) the domain of each null is the same. Roughly speaking, we show that counting completions is much harder than counting valuations: for instance, while the latter is always in #P, we prove that the former is not in #P under some widely believed theoretical complexity assumption. Moreover, we find that both (1) and (2) can reduce the complexity of our problems. We also study the approximability of these problems and show that, while counting valuations always has a fully polynomial-time randomized approximation scheme (FPRAS), in most cases counting completions does not. Finally, we consider more expressive query languages and situate our problems with respect to known complexity classes. CCS Concepts: • Theory of computation → Complexity theory and logic; Incomplete, inconsistent, and uncertain databases; • Mathematics of computing → Approximation algorithms. Additional Key Words and Phrases: Incomplete databases, closed-world assumption, counting complexity, Fully Polynomial-time Randomized Approximation Scheme (FPRAS).
1 INTRODUCTION
Context. In the database literature, incomplete databases are often used to represent missing information in the data; see, e.g., [1, 37, 52]. These are traditional relational databases whose active domain can contain both constants and nulls, the latter representing unknown values [30]. There are many ways in which one can define the semantics of such a database, each being equally meaningful depending on the intended application. Under the so called closed-world assumption [1, arXiv:2011.06330v2 [cs.DB] 28 Apr 2021 45], a standard, complete database 휈 (퐷) is obtained from an incomplete database 퐷 by applying a valuation 휈 which replaces each null ⊥ in 퐷 with a constant 휈 (⊥). The goal is then to reason about the space formed by all valuations 휈 and completions 휈 (퐷) of 퐷. Decision problems related to querying incomplete databases have been well studied already. Consider for instance the problem Certainty(푞(푥¯)) which, for a fixed query 푞(푥¯), takes as input an incomplete database 퐷 and a tuple 푎¯ and asks whether 푎¯ is an answer to 푞 for every possible completion of 퐷. By now, we have a deep understanding of the complexity of these kind of decision
∗Also with Department of Computer Science & Institute for Mathematical and Computational Engineering, School of Engineering, Pontificia Universidad Católica de Chile. †Also with Institute for Mathematical and Computational Engineering, School of Engineering and Faculty of Mathematics, Pontificia Universidad Católica de Chile.
Authors’ addresses: Marcelo Arenas, Universidad Católica & IMFD Chile, [email protected]; Pablo Barceló, Universidad Católica & IMFD Chile, [email protected]; Mikaël Monet, Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France, [email protected]. 1:2 Marcelo Arenas, Pablo Barceló, and Mikaël Monet problems for different choices of query languages, including conjunctive queries (CQs) andFO queries [2, 30]. However, having the answer to this question is sometimes of little help: what if it is not the case that 푞 is certain on 퐷? Can we still infer some useful information? This calls for new notions that could be used to measure the certainty with which 푞 holds, notions which should be finer than those previously considered. This is for instance what the recent workin[38] does by introducing a notion of best answer, which are those tuples 푎¯ for which the set of completions of 퐷 over which 푞(푎¯) holds is maximal with respect to set inclusion. A fundamental complementary approach to address this issue can be obtained by considering some counting problems related to incomplete databases; more specifically, determining the number of completions/valuations of an incomplete database that satisfy a query 푞. These problems are relevant as they tell us, intuitively, how close is 푞 from being certain over 퐷, i.e., what is the level of support that 푞 has over the set of completions/valuations of 퐷. Surprisingly, such counting problems do not seem to have been studied for incomplete databases. A reason for this omission in the literature might be that, in general, it is assumed that the domain over which nulls can be interpreted is infinite, and thus incomplete databases might have an infinite number of completions/valuations. However, in many scenarios it is natural to assume that the domain over which nulls are interpreted is finite, in particular when dealing with uncertainty in practice [4, 6, 7, 11, 23, 46]. By assuming this we can ensure that the number of completions and valuations are always finite, and thus that they can be counted. This is the setting that we study.
Problems studied. We focus on the problems #Comp(푞) and #Val(푞) for a Boolean query 푞, which take as input an incomplete database 퐷 together with a finite set dom(⊥) of constants for every null ⊥ occurring in 퐷, and ask the following: How many completions, resp., valuations, of 퐷 satisfy 푞? More formally, a valuation of 퐷 is a mapping 휈 that associates to every null ⊥ of 퐷 a constant 휈 (⊥) in dom(⊥). Then, given a valuation 휈 of 퐷, we denote by 휈 (퐷) the database that is obtained from 퐷 after replacing each null ⊥ with 휈 (⊥), and we call such a database a completion. Besides, in this article we consider set semantics, that is, we remove repeated tuples from 휈 (퐷). For #Comp(푞) we count all databases of the form 휈 (퐷) such that 푞 holds in 휈 (퐷). Instead, for #Val(푞) we count the number of valuations 휈 such that 푞 holds in 휈 (퐷). It is easy to see that these two values can differ, as a completion might be obtained from two different valuations, i.e.,there might exist two distinct valuations 휈, 휈 ′ such that 휈 (퐷) = 휈 ′(퐷). We think that both problems are meaningful: while #Comp(푞) determines the support for 푞 over the databases represented by 퐷, we have that #Val(푞) further refines this by incorporating the support for a particular completion that satisfies 푞 over the set of valuations for 퐷. We deal with the data complexity of the problems #Comp(푞) and #Val(푞), focusing on obtaining dichotomy results for them in terms of counting complexity classes, as well as studying the existence of randomized algorithms that approximate their results under probabilistic guarantees. For the dichotomies, we concentrate on self-join-free Boolean conjunctive queries (sjfBCQs). This assumption simplifies the mathematical analysis, while at the same time defines a setting whichis rich enough for many of the theoretical concepts behind these problems to appear in full force. Notice that a similar assumption is used in several works that study counting problems over probabilistic and inconsistent databases; see, e.g., [18, 40]. To simplify further the presentation, in the bulk of the article we mainly consider self-join–free Boolean conjunctive queries that do not contain constants; however, we explain later (in Section7) how our results can be extended to queries that can contain constants and free variables. To refine our analysis, we study two restrictions of the problems #Comp(푞) and #Val(푞) based on natural modifications of the semantics, and analyze to what extent these restrictions simplify our problems. For the first restriction we consider incomplete databases in which each null occurs The complexity of counting problems over incomplete databases 1:3
Counting valuations Counting completions Non-uniform Uniform Non-uniform Uniform 푅 푥, 푥 푅 푥, 푥 Naïve ( ) 푅(푥, 푥) 푅(푥) ( ) 푅(푥) ∧ 푆 (푥) 푅(푥) ∧ 푆 (푥,푦) ∧ 푇 (푦) 푅(푥,푦) 푅(푥,푦) ∧ 푆 (푥,푦) 푅 푥 푆 푥,푦 푇 푦 푅 푥, 푥 Codd 푅(푥) ∧ 푆 (푥) ( ) ∧ ( ) ∧ ( ) 푅(푥) ( ) 푅(푥,푦) ∧ 푆 (푥,푦) 푅(푥,푦) Table 1. Our dichotomies for counting valuations and completions of sjfBCQs. For each of the eight cases, if an sjfBCQ 푞 contains a pattern mentioned in that case, then the problem is #P-hard (and #P-complete for counting valuations, as well as for counting completions over Codd tables). In turn, for each case if an sjfBCQ 푞 does not have any of the patterns mentioned in that case, then the problem isin FP.
FPRAS for counting valuations FPRAS for counting completions Non-uniform Uniform Non-uniform Uniform 푞 Naïve Always Always Never Only when has only unary atoms Codd Always Always Never ? Table 2. Our results on the existence of FPRAS for solving the problems studied in the article (assuming NP ≠ RP).
exactly once, which corresponds to the well-studied setting of Codd tables – as opposed to naive tables where nulls are allowed to have multiple occurrences. We denote the corresponding prob- lems by #ValCd (푞) and #CompCd (푞). For the second restriction, we consider uniform incomplete databases in which all the nulls share the same domain – as opposed to the basic non-uniform setting in which all nulls come equipped with their own domain. We denote the corresponding problems by #Valu (푞) and #Compu (푞). When both restrictions are in place, we denote the problems u (푞) u (푞) by #ValCd and #CompCd . Our dichotomies for exact counting. We provide complete characterizations of the complex- ity of counting valuations and completions satisfying a given sjfBCQ 푞, when the input is a Codd table or a naive table, and is a non-uniform or a uniform incomplete database (hence we have eight cases in total). Our eight dichotomies express that these problems are either tractable or #P-hard, and that the tractable cases can be fully characterized by the absence of certain forbidden patterns in 푞. In essence, a pattern is simply an sjfBCQ which can be obtained from 푞 by deleting atoms and occurrences of variables (the exact definition of this notion is given in Section3). Our characteriza- tions are presented in Table1. By analyzing this table we can draw some important conclusions as explained next. #Comp(푞) and #Val(푞) are computationally difficult: For very few sjfBCQs 푞 the aforementioned problems can be solved in polynomial time. Take as an example the uniform setting over naive tables. Then #Valu (푞) is #P-hard as long as 푞 contains the pattern 푅(푥, 푥), or 푅(푥) ∧ 푆 (푥,푦) ∧ 푇 (푦), or 푅(푥,푦) ∧ 푆 (푥,푦). That is, as long as there is an atom in 푞 that contains a repeated variable 푥, or a pair (푥,푦) of variables that appear in an atom and both 푥 and 푦 appear in some other atoms in 푞. By contrast, for this same setting, #Compu (푞) is #P-hard as long as 푞 contains the pattern 푅(푥,푦) or 푅(푥, 푥), that is, as long as there is an atom in 푞 that is not of arity one. 1:4 Marcelo Arenas, Pablo Barceló, and Mikaël Monet
#Val(푞) is always easier than #Comp(푞): In all of the possible versions of our problem, the tractable (푞) (푞) u (∃푥∃푦 푅(푥,푦)) cases for #Val are a strict superset of the ones for #Comp . For instance, #CompCd u (∃푥∃푦 푅(푥,푦)) is hard, while #ValCd is tractable. Even counting completions is hard: While counting the total number of valuations for an incom- plete database can always be done in polynomial time, observe from Table1 that the problem u (∃푥∃푦 푅(푥,푦)) #CompCd is #P-hard, and thus that simply counting the completions of a uniform Codd table with a single binary relation 푅 is #P-hard. Moreover, we show that in the non-uniform case a single unary relation suffices to obtain #P-hardness. Codd tables help but not much: We show that counting valuations is easier for Codd tables than for naive tables. In particular, there is always an sjfBCQ 푞 such that counting the valuations that satisfy 푞 is #P-hard, yet it becomes tractable when restricted to the case of Codd tables. However, for counting completions, both in the uniform and non-uniform setting, the sole restriction to Codd tables presents no benefits: for every sjfBCQ 푞, we have that #Comp(푞) (resp., #Compu (푞)) (푞) u (푞) is #P-hard if and only if #CompCd (resp., #CompCd ) is #P-hard. Non-uniformity complicates things: All versions of our problems become harder in the non-uniform setting. This means that in all cases there is an sjfBCQ 푞 for which counting valuations is tractable on uniform incomplete databases, but becomes #P-hard assuming non-uniformity, and the same holds for counting completions. Our dichotomies for approximate counting. Although #Val(푞) can be #P-hard, we prove that good randomized approximation algorithms can be designed for this problem. More precisely, we give a general condition under which #Val(푞) admits a fully polynomial-time randomized approximation scheme [33] (FPRAS). This condition applies in particular to all unions of Boolean conjunctive queries. Remarkably, we show that this no longer holds for #Comp(푞); more precisely, there exists an sjfBCQ 푞 such that #Comp(푞) does not admit an FPRAS under a widely believed complexity theoretical assumption. More surprisingly, even counting the completions of a uniform incomplete database containing a single binary relation does not admit an FPRAS under such an assumption (and in the non-uniform case, a single unary relation suffices). Generally, for sjfBCQs, we obtain seven dichotomies for our problems between polynomial-time computability of exact counting and non admissibility of an FPRAS. The only case that we did not completely solve is that u (푞) of #CompCd . Our dichotomies for approximate counting are illustrated in Table2. Beyond #P. It is easy to see that the problem of counting valuations is always in #P, provided that the model checking problem for 푞 is in P. This is no longer the case for counting completions, and in fact we show that, under a complexity theoretical assumption, there is an sjfBCQ 푞 for which #Compu (푞) is not in #P. This does not hold if restricted to Codd tables, however, as we prove that #CompCd (푞) is always in #P when the model checking problem for 푞 is in P. For reasons that we explain in the article, a suitable complexity class for the problem #Comp(푞) is SpanP, which is defined as the class of counting problems that can be expressed as thenumber of different accepting outputs of a nondeterministic Turing machine running in polynomial time. While we have not managed to prove that there is an sjfBCQ 푞 for which #Comp(푞) is SpanP- complete, we show that this is the case for the problem of counting completions for the negation of an sjfBCQ, even in the uniform setting; that is, we show that #Compu (¬푞) is SpanP-complete for some sjfBCQ 푞. Finally, we also show that SpanP is the right complexity class for counting valuations of queries for which model checking is in NP. Extension to queries with constants and free variables. As we said already, for pedagogical reasons we mostly present our results by considering queries that are Boolean and that do not have The complexity of counting problems over incomplete databases 1:5 constants. In Section7 however, we explain how to extend these results to the case of queries that have free variables and that can contain constants. For the case of a query 푞(푥¯) with free variables 푥¯, our counting problems are defined in the expected way; for instance the problem #Val(푞(푥¯)) takes as input an incomplete database 퐷, a tuple of constants 푎¯ of same arity as 푥¯, and it outputs the number of valuations 휈 of 퐷 such that 푎¯ in an answer to 푞(푥¯) on 휈 (퐷). We then extend our dichotomies and approximation results in this setting.
The current article extends the conference article [8] in the following ways: • u (푞) In [8] we left open the dichotomy for #ValCd , i.e., for counting valuations of sjfBCQs for Codd tables under the uniform setting. We close this case here, by finding one more hard pattern (namely, the pattern ∃푥,푦 푅(푥,푦) ∧ 푆 (푥,푦)) and showing that the problem can be solved in polynomial time for all other queries. • We added Section7 which explains how our framework can be extended to handle queries with constants and free variables; • Proposition 6.3, which establishes the NP-completeness of checking if a set of facts is a possible completion of an incomplete database, is new; • Finally, full proofs of most results are included in the body of the article. Organization of the article. We start with the main terminology used in the article in Section2, and then present in Section3 our four dichotomies on #Val(푞) when 푞 is an sjfBCQ, and the input incomplete database can be Codd or not, and the domain can be uniform or not. We then establish the four dichotomies on #Comp(푞) in Section4. In Section5, we study the approximability complexity of our problems. We then give in Section6 some general considerations about the exact complexity of the problem #Comp(푞) going beyond #P. We explain in Section7 how to extend our results to queries with constants and free variables. In Section8, we discuss related work and explain the differences with the problems considered in this article. Last, we provide some conclusions and mention possible directions for future work in Section9.
2 PRELIMINARIES Relational databases and conjunctive queries. A relational schema 휎 is a finite non-empty set of relation symbols written 푅, 푆, 푇 , ..., each with its associated arity, which is denoted by arity(푅). Let Consts be a countably infinite set of constants. A database 퐷 over 휎 is a set of facts of the form 푅(푎1, . . . , 푎arity(푅) ) with 푅 ∈ 휎, and where each element 푎푖 ∈ Consts. For 푅 ∈ 휎, we denote by 퐷(푅) the subset of 퐷 consisting of facts over 푅. Such a set is usually called a relation of 퐷. A Boolean query 푞 is a query that a database 퐷 can satisfy (written 퐷 |= 푞) or not (written 퐷 ̸|= 푞). If 푞 is a Boolean query, then ¬푞 is the Boolean query such that 퐷 |= ¬푞 if and only if 퐷 ̸|= 푞.A Boolean conjunctive query (BCQ) over 휎 is an FO formula of the form ∃푥¯ 푅1 (푥¯1) ∧ ... ∧ 푅푚 (푥¯푚) , (1) where all variables are existentially quantified, and where for each 푖 ∈ [1,푚], we have that 푅푖 is a relation symbol in 휎 and 푥¯푖 is a tuple of variables with |푥¯푖 | = arity(푅푖 ). To avoid trivialities, we will always assume that 푚 ⩾ 1, i.e., the query has at least one atom, and also that arity(푅푖 ) ⩾ 1 for all atoms. Observe that we do not allow constants to appear in the query (but we will come back to this issue in Section7). For simplicity, we typically write a BCQ 푞 of the form (1) as
푅1 (푥¯1) ∧ ... ∧ 푅푚 (푥¯푚), and it will be implicitly understood that all variables in 푞 are existentially quantified. As usual, we de- fine the semantics of a BCQ in terms of homomorphisms. A homomorphism from푞 to a database 퐷 is a 1:6 Marcelo Arenas, Pablo Barceló, and Mikaël Monet
(휈 (⊥1), 휈 (⊥2)) (푎, 푎)(푎,푏)(푏, 푎)(푏,푏)(푐, 푎)(푐,푏) 휈 (퐷) 푆 푆 푆 푆 푆 푆 푎 푏 푎 푏 푎 푏 푎 푏 푎 푏 푎 푏 푎 푎 푎 푎 푏 푎 푏 푎 푐 푎 푐 푎 푎 푎 푎 푎
휈 (퐷) |= 푞? Yes Yes Yes No Yes No
Fig. 1. The six valuations of the (non-uniform) incomplete database 퐷 = (푇, dom) with 푇 = {푆 (푎,푏), 푆 (⊥1, 푎), 푆 (푎, ⊥2)} from Example 2.2, and their corresponding completions. The Boolean conjunctive query 푞 is ∃푥 푆 (푥, 푥).
mapping from the variables in 푞 to the constants used in 퐷 such that {푅1 (ℎ(푥¯1)), . . . , 푅푚 (ℎ(푥¯푚))} ⊆ 퐷. Then, we have 퐷 |= 푞 if there exists a homomorphism from 푞 to 퐷.A self-join–free BCQ (sjfBCQ) is a BCQ such that no two atoms use the same relation symbol.
Incomplete databases. Let Nulls be a countably infinite set of nulls (also called labeled or marked nulls in the literature), which is disjoint with Consts. An incomplete database over schema 휎 is a pair 퐷 = (푇, dom), where 푇 is a database over 휎 whose facts contain elements in Consts∪Nulls, and where dom is a function that associates to every null ⊥ occurring in 퐷 a subset dom(⊥) of Consts. Intuitively, 푇 is a database that can mention both constants and nulls, while dom tells us where nulls are to be interpreted. Following the literature, we call 푇 a naive table [30]. An incomplete database 퐷 = (푇, dom) can represent potentially many complete databases, via what are called valuations. A valuation of 퐷 is simply a function 휈 that maps each null ⊥ occurring in 푇 to a constant 휈 (⊥) ∈ dom(⊥). Such a valuation naturally defines a completion of 퐷, denoted by 휈 (푇 ), which is the complete database obtained from 푇 by substituting each null ⊥ appearing in 푇 by 휈 (⊥). It is understood, since a database is a set of facts, that 휈 (푇 ) does not contain duplicate facts. By paying attention to completions of incomplete databases that are generated exclusively by applying valuations to them, we are sticking to the so called closed-world semantics of incompleteness [1, 45]. This means that the databases represented by an incomplete database 퐷 = (푇, dom) are not open to adding facts that are not “justified” by the facts in 푇 .
Example 2.1. Let 퐷 = (푇, dom) be the incomplete database consisting of the naive table 푇 = {푆 (⊥1, ⊥1), 푆 (푎, ⊥2)}, and where dom(⊥1) = {푎,푏} and dom(⊥2) = {푎, 푐}. Let 휈1 be the valuation mapping ⊥1 to 푏 and ⊥2 to 푐. Then 휈1 (푇 ) is {푆 (푏,푏), 푆 (푎, 푐)}. Let 휈2 be the valuation mapping both ⊥1 and ⊥2 to 푎. Then 휈2 (푇 ) is {푆 (푎, 푎)}. On the other hand, the function 휈 mapping ⊥1 and ⊥2 to 푏 is not a valuation of 퐷, because 푏 ∉ dom(⊥2). □
When every null occurs at most once in 푇 , then 퐷 is what is called a Codd table [16]; for instance, the incomplete database in Example 2.1 is not a Codd table because ⊥1 occurs twice. We also consider uniform incomplete databases in which the domain of every null is the same. Formally, a uniform incomplete database is a pair 퐷 = (푇, dom), where 푇 is a database over 휎 and dom is a subset of Consts. The difference now is that a valuation 휈 of 퐷 must simply satisfy 휈 (⊥) ∈ dom for every null of 퐷. We will often abuse notation and use 퐷 instead of 푇 ; for instance, we write 휈 (퐷) instead of 휈 (푇 ), or 푅(푎, 푎) ∈ 퐷 instead of 푅(푎, 푎) ∈ 푇 , or again 퐷(푅) instead of 푇 (푅). The complexity of counting problems over incomplete databases 1:7
Counting problems on incomplete databases. We will study two kinds of counting problems for incomplete databases: problems of the form #Val(푞), that count the number of valuations 휈 that yield a completion 휈 (퐷) satisfying a given BCQ 푞, and problems of the form #Comp(푞), that count the number of completions that satisfy 푞. The query 푞 is assumed to be fixed, so that each query gives rise to different counting problems, and we are considering the data complexity [53] of these problems. Before formally introducing our problems, let us observe that they are well defined if we assume that the set of constants to which a null can be mapped to is finite. Hence, for the (default) case of an incomplete database 퐷 = (푇, dom), we assume that dom(⊥) is always a finite subset of Consts. Similarly, for the case of a uniform incomplete database 퐷 = (푇, dom), we assume that dom is a finite subset of Consts. Finally, given a Boolean query 푞, we use notation sig(푞) for the set of relation symbols occurring in 푞. With these ingredients, we can define our problems for the (default) case of incomplete naive tables and a Boolean query 푞.
PROBLEM : #Val(푞) INPUT : An incomplete database 퐷 over sig(푞) OUTPUT : Number of valuations 휈 of 퐷 with 휈 (퐷) |= 푞
PROBLEM : #Comp(푞) INPUT : An incomplete database 퐷 over sig(푞) OUTPUT : Number of completions 휈 (퐷) of 퐷 with 휈 (퐷) |= 푞
We also consider the uniform variants of these problems, in which the input 퐷 is a uniform incomplete database over sig(푞), and the restriction of these problems where the input is a Codd table instead of a naive table. We then use the terms #Valu (푞), #Compu (푞) when restricted to the (푞) (푞) u (푞) u (푞) uniform case, #ValCd , #CompCd when restricted to Codd tables, and #ValCd , #CompCd when both restrictions are applied. As we will see, even though the problems #Val(푞) and #Comp(푞) look similar, they are of a different computational nature; this is because two distinct valuations can produce thesame completion of an incomplete database. We illustrate this phenomenon in the following example. Example 2.2. Let 푞 be the Boolean conjunctive query ∃푥 푆 (푥, 푥), and 퐷 be the (non-uniform) incomplete database 퐷 = (푇, dom), with 푇 = {푆 (푎,푏), 푆 (⊥1, 푎), 푆 (푎, ⊥2)}, dom(⊥1) = {푎,푏, 푐} and dom(⊥2) = {푎,푏}. We have depicted in Figure1 the six valuations of 퐷 together with the completions that they define. Out of these six valuations 휈, only four are such that 휈 (퐷) |= 푞, so we have #Val(푞)(퐷) = 4. Moreover, there are only 3 distinct completions of 퐷 that satisfy 푞 – because the first two are the same –so #Comp(푞)(퐷) = 3. □ 퐴, 퐵 퐴 p 퐵 퐴 Counting complexity classes. Given two problems , we write ⩽T when reduces to 퐵 under polynomial-time Turing reductions. When both 퐴 and 퐵 are counting problems, we 퐴 p 퐵 퐴 퐵 write ⩽par when can be reduced to under polynomial-time parsimonious reductions, i.e., when there exists a polynomial-time computable function 푓 that transforms an input 푥 of 퐴 to an input 푓 (푥) of 퐵 such that 퐴(푥) = 퐵(푓 (푥)). We say that a counting problem is in FP when it can be solved in polynomial time. We will consider the counting complexity class #P [50] of problems that can be expressed as the number of accepting paths of a nondeterministic Turing machine running in polynomial time. Following [50, 51], we define #P-hardness using Turing reductions. It is clear that FP ⊆ #P. Moreover, this inclusion is widely believed to be strict. Therefore, proving that a 1:8 Marcelo Arenas, Pablo Barceló, and Mikaël Monet counting problem is #P-hard implies that it cannot be solved in polynomial time under such an assumption.
3 DICHOTOMIES FOR COUNTING VALUATIONS In this section, for a fixed sjfBCQ q, we study the complexity of the problem of computing, given as input an incomplete database 퐷, the number of valuations 휈 of 퐷 such that 휈 (퐷) satisfies 푞. Recall that we have four cases to consider for this problem depending on whether we focus on naive or on Codd tables, where nulls are restricted to appear at most once, and whether we focus on non-uniform or uniform incomplete databases, where nulls are restricted to have the same domain. Our specific goal then is to understand whether the problem is tractable (in FP) or #P-hard in these scenarios, depending on the shape of 푞. To this end, the shape of an sjfBCQ 푞 will be characterized by the presence or absence of certain specific patterns. In the following definition, we introduce the necessary terminology to formally talk about the presence of a pattern in a query. Definition 3.1. Let 푞, 푞′ be sjfBCQs. We say that 푞′ is a pattern of 푞 if 푞′ can be obtained from 푞 by using an arbitrary number of times and in any order the following operations: deleting an atom, deleting an occurrence of a variable, renaming a relation to a fresh one, renaming a variable to a fresh one, and reordering the variables in an atom.1 □ Example 3.2. Recall that we always omit existential quantifiers in Boolean queries. Then we have that 푞′ = 푅′(푢,푢,푦) ∧ 푆 ′(푧) is a pattern of 푞 = 푅(푢, 푥,푢) ∧ 푆 ′(푦,푦) ∧ 푇 (푥, 푠, 푧, 푠). Indeed, 푞′ can be obtained from 푞 by deleting atom 푇 (푥, 푠, 푧, 푠), renaming 푅(푢, 푥,푢) as 푅′(푢, 푥,푢) to ob- tain 푅′(푢, 푥,푢) ∧ 푆 ′(푦,푦), reordering the variables in 푅′(푢, 푥,푢) to obtain 푅′(푢,푢, 푥) ∧ 푆 ′(푦,푦), renaming variable 푦 into 푧 to obtain 푅′(푢,푢, 푥) ∧ 푆 ′(푧, 푧), deleting the second variable occurrence in 푆 ′(푧, 푧) to obtain 푅′(푢,푢, 푥) ∧ 푆 ′(푧), and finally renaming variable 푥 into 푦 to obtain 푞′. □ We point out that in Definition 3.1, the important parts are those about deleting atoms and variable occurrences. The parts about reordering variable occurences inside an atom and about renaming relations and variables to fresh ones have obviously no effect on the complexity of the problem2; these are only here to allow us to formally say, for instance, that “푅(푥) is a pattern of 푅(푦)”, or that “푆 (푥,푢, 푥) is a pattern of 푇 (푤, 푧, 푧)” (as these are, in essence, the same queries). In the following general lemma, we show that if 푞′ is a pattern of 푞, then each of the problems considered in this section is as hard for 푞 as it is for 푞′. Recall in this result that unless stated otherwise, our problems are defined for naive tables under the non-uniform setting. 푞, 푞′ 푞′ 푞 푞′ p 푞 Lemma 3.3. Let be sjfBCQs such that is a pattern of . Then we have #Val( ) ⩽par #Val( ). Moreover, the same results hold if we restrict to Codd tables, and/or to the uniform setting. 푞′ p 푞 Proof. We first present the proof for #Val( ) ⩽par #Val( ), that is, for naive tables in the non- uniform setting. First of all, observe that we can assume without loss of generality that we did not reorder the variables in the atoms nor renamed relation names or variables by fresh ones, because, as mentioned above, this does not change the complexity of the problem.3 We can then write 푞 as
1We remind the reader that we assume all sjfBCQs to contain at least one atom and that all atoms must contain at least one variable. 2This is in particular because the conjunctive queries we consider have no self-joins (otherwise, reordering variables inside an atom could change the complexity). 3Formally, one can first check that we can assume without loss of generality that 푞′ was obtained from 푞 by first deleting some atoms and variable occurences to obtain a query 푞′′, and then performing some renamings and variable reorderings to obtain 푞′ (that is, we can always push the renaming and reordering parts at the end of the transformation). But then, since #Val(푞′′) and #Val(푞′) are obviously the same problem, we can assume 푞′ = 푞′′. The complexity of counting problems over incomplete databases 1:9
푅 (푥 ) ∧ ... ∧ 푅 (푥 ) and 푞′ as 푅 (푥 ′ ) ∧ ... ∧ 푅 (푥 ′ ), where 1 ⩽ 푗 < ... < 푗 ⩽ 푚 and 푥 ′ is 1 1 푚 푚 푗1 푗1 푗푝 푗푝 1 푝 푗푘 푥 4 obtained from 푗푘 by deleting some variable occurrences but not all , and the other atoms have been deleted. Let 퐷 ′ be an incomplete database input of #Val(푞′). Let 퐴 be the set of constants that are appearing in 퐷 ′ or are in a domain of some null occurring in 퐷 ′. For 1 ⩽ 푘 ⩽ 푝, we construct 퐷 푅 퐷 ′ 푅 푥 푥 , . . . , 푥 the relation ( 푗푘 ) from the relation ( 푗푘 ). Let us assume that 푗푘 is the tuple ( 1 푟 ) (with 퐷 푅 푡 ′ some variables possibly being equal). We initialize ( 푗푘 ) to be empty, and then for every tuple 퐷 ′ 푅 퐷 푅 푡 푡 ′ in ( 푗푘 ) we add to ( 푗푘 ) all the tuples that can be obtained from in the following way for 1 ⩽ 푖 ⩽ 푟: 푥 푥 a) If 푖 is a variable occurrence that has not been deleted from 푗푘 , then copy the element (constant or null) of 푡 ′ corresponding to that variable occurrence to the 푖-th position of 푡; 푥 푥 푖 b) Otherwise, if 푖 is a variable occurrence that has been deleted from 푗푘 , then fill the -th position of 푡 with every possible constant from 퐴. ′ Then we construct the relations 퐷(푅푖 ) where 푅푖 does not appear in 푞 (this can happen if we have deleted the atom 푅푖 (푥푖 )) by filling it with every possible 푅푖 -fact over 퐴. We leave the domains of all nulls unchanged. The whole construction can be performed in polynomial time (this uses the fact that 푞 is assumed to be fixed, so that the arities of the relations mentioned in 푞 are fixed). Hence, it only remains to be checked that #Val(푞′)(퐷 ′) = #Val(푞)(퐷), that is, that the reduction works and is indeed parsimonious. It is clear that the valuations of 퐷 ′ are exactly the same as the valuations of 퐷 (because they have the same sets of nulls). Hence it is enough to verify that for every valuation 휈, we have 휈 (퐷 ′) |= 푞′ if and only if 휈 (퐷) |= 푞. Let ℎ′ be a homomorphism from 푞′ to 휈 (퐷 ′) witnessing that 휈 (퐷 ′) |= 푞′ (i.e., we have ℎ′(푞) ⊆ 휈 (퐷 ′)). Then ℎ′ can clearly be extended in the expected way into a homomorphism ℎ from 푞 to 휈 (퐷): this is in particular thanks to the fact that we filled the missing columns with every possible constant. Conversely, let ℎ be a homomorphism from 푞 to 휈 (퐷) witnessing that 휈 (퐷) |= 푞. Then the restriction ℎ′ of ℎ to the variables occurring in 푞′ is such that ℎ(푞′) ⊆ 휈 (퐷 ′), hence we have 휈 (퐷 ′) |= 푞′. This concludes the proof for the case of naive tables in the non-uniform setting. For the cases of Codd tables and/or for the uniform setting, the reduction is exactly the same. Indeed, the domains of the nulls are unchanged, and it is clear that the presented construction preserves the property of being a Codd table. □
The idea is then to show the #P-hardness of our problems for some simple patterns, which then we combine with Lemma 3.3 and with some tractability proofs to obtain the desired dichotomies. Our findings are summarized in the first two columns ofTable1 in the introduction. We first focus on the two dichotomies for the non-uniform setting in Section 3.1, and then we move to the case of uniform incomplete databases in Section 3.2. We explicitly state when a #P-hardness result holds even in the restricted setting in which there is a fixed domain over which nulls are interpreted. In other words, when there is a fixed domain 퐴 such that the incomplete databases used in the reductions are of the form 퐷 = (푇, dom) and dom(⊥) ⊆ 퐴, for each null ⊥ of 푇 .
3.1 The complexity on the non-uniform case
In this section, we study the complexity of the problems #Val(푞) and #ValCd (푞), providing dichotomy results in both cases. We start by proving the #P-hardness results needed for these dichotomies. We first show that #Val(푅(푥, 푥)) is #P-hard by actually proving that hardness holds already in the uniform case. Proposition 3.4. #Valu (푅(푥, 푥)) is #P-hard. This holds even in the restricted setting in which all nulls are interpreted over the same fixed domain {1, 2, 3}.
4See Footnote1. 1:10 Marcelo Arenas, Pablo Barceló, and Mikaël Monet
Proof. We reduce from the problem of counting the number of 3-colorings of a graph 퐺 = (푉, 퐸), which is #P-hard [32]. For every node 푣 ∈ 푉 we have a null ⊥푣, and for every edge {푢, 푣} ∈ 퐸 we have the facts 푅(⊥푣, ⊥푢 ) and 푅(⊥푢, ⊥푣). The domain of the nulls is {1, 2, 3}. It is then clear that the number of valuations of the constructed database that do not satisfy 푅(푥, 푥) is exactly the number of 3-colorings of 퐺. Since the total number of valuations can be computed in PTIME, this concludes the reduction. □ The next pattern that we consider is 푅(푥) ∧ 푆 (푥). This time, we can show #P-hardness of the problem even for Codd databases.
Proposition 3.5. #ValCd (푅(푥) ∧ 푆 (푥)) is #P-hard. Proof. We start by recalling the setting of consistent query answering under key constraints. Intuitively, in this case we are given a set Σ of keys and a database 퐷 that does not necessarily satisfy Σ. Then the task is to reason about the set of all repairs of 퐷 with respect to Σ [9]. In our context, this means that one wants to count the number of repairs of 퐷 with respect to Σ that satisfy a given CQ 푞. When 푞 and Σ are fixed, we call this problem #Repairs(푞, Σ); see, e.g., [40]. We formalize these notions below. Here we focus on the case when Σ is a set of primary keys. Recall that this means that each relation name 푅 ∈ 휎 of arity 푛 comes equipped with its own key, i.e., key(푅) = 퐴, where 퐴 = ∅ or 퐴 = [1, . . . , 푝] for some 푝 ∈ {1, . . . , 푛}. Henceforth, 퐷 is inconsistent with respect to Σ if there is a relation name 푅 ∈ 휎 and facts 푅(푎¯), 푅(푏¯) ∈ 퐷 with 푎¯ ≠ 푏¯ such that
key(푅) = 퐴 and 휋퐴 (푎¯) = 휋퐴 (푏¯). In this case we say that the pair (푅(푎¯), 푅(푏¯)) is key-violating. Let us define a block in a database 퐷 with respect to a set Σ of primary keys to be any maximal set 퐵 of facts from 퐷 such that the facts in 퐵 are pairwise key-violating. A repair of 퐷 with respect to Σ is a subset 퐷 ′ of 퐷 that is obtained by choosing exactly one tuple from each block of 퐷 with respect to Σ. Let us consider a schema 휎 with two binary relations 푅′ and 푆 ′, such that key(푅′) = key(푆 ′) = {1}. That is, the first attribute of both 푅′ and 푆 ′ defines a key over such relations. We define this setof keys over 휎 to be Σ. Also, let 푞 = ∃푥,푦, 푧(푅′(푦, 푥) ∧ 푆 ′(푧, 푥)). For simplicity, we write the pair (푞, Σ) as 푅′(y, 푥) ∧ 푆 ′(z, 푥). The problem #Repairs(푅′(y, 푥) ∧ 푆 ′(z, 푥)), which given a database 퐷 ′ over schema 휎 aims at computing the number of repairs of 퐷 ′ under Σ that satisfy 푞, is known to be #P-complete [40].5 Now, observe that the #P-hardness of #ValCd (푅(푥) ∧ 푆 (푥)) easily follows from the hardness of the problem #Repairs(푅′(y, 푥) ∧푆 ′(z, 푥)). In fact, let 퐷 ′ be a database with binary relation 푅′, 푆 ′. We construct an incomplete Codd database 퐷 with unary relations 푅, 푆 as follows. For every constant 푎 that appears in the first attribute of 푅′, we have a tuple 푅(⊥) in 퐷, where ⊥ is a fresh null, and we set dom(⊥) = {푏 | 푅′(푎,푏) ∈ 퐷 ′}. For every constant 푎 that appears in the first attribute of 푆 ′, we have a tuple 푆 (⊥) in 퐷, where ⊥ is a fresh null, and we set dom(⊥) = {푏 | 푆 ′(푎,푏) ∈ 퐷 ′}. It is then clear that the number of repairs of 퐷 ′ that satisfy 푅′(y, 푥) ∧ 푆 ′(z, 푥) is equal to the number of valuations of 퐷 that satisfy 푅(푥) ∧ 푆 (푥), thus concluding the proof. We point out here that another proof of Proposition 3.5, that uses different techniques, can be found in the conference version of the article [8] (the proof that we presented here is shorter). □ Already with Propositions 3.5 and 3.4, we have all the relevant hard patterns for the non-uniform setting. We start by proving our dichotomy result for naive tables, which is our default case.
5To see that [40] establishes the hardness of 푞 = 푅′ (y, 푥) ∧ 푆′ (z, 푥), first apply their rewrite rule R7 (from Fig. 6)to obtain 푞′ = 푅′ (y, 푥) ∧ 푆′ (x, 푥), then apply rewrite rule R10 to obtain 푞′′ = 푅′ (y, 푥) ∧ 푆′ (x, 푎). Then, 푞′′ is #P-hard by Lemma 19, and so is 푞 by Lemma 7. The complexity of counting problems over incomplete databases 1:11
Theorem 3.6 (dichotomy). Let 푞 be an sjfBCQ. If 푅(푥, 푥) or 푅(푥) ∧ 푆 (푥) is a pattern of 푞, then #Val(푞) is #P-complete. Otherwise, #Val(푞) is in FP. Proof. The #P-hardness part of the claim follows from the last two propositions and from Lemma 3.3. We explain why the problems are in #P right after this proof. When 푞 does not have any of these two patterns then all variables have exactly one occurrence in 푞. This implies that every valuation 휈 of 퐷 is such that 휈 (퐷) satisfies 푞 (except when one relation is empty, in which case the result is simply zero). We can then compute the total number of valuations in FP by simply multiplying the sizes of the domains of every null in 퐷. □
Notice that in this theorem, the membership of #Val(푞) in #P can be established by considering a nondeterministic Turing Machine 푀 that, with input a non-uniform incomplete database 퐷, guesses a valuation 휈 of 퐷 and verifies whether 휈 (퐷) satisfies 푞. This machine works in polynomial time as we can verify whether 휈 (퐷) satisfies 푞 in polynomial time (since 푞 is a fixed FO query). Then given that #Val(푞)(퐷) is equal to the number of accepting runs of 푀 with input 퐷, we conclude that #Val(푞) is in #P. Obviously, the same idea works for codd tables, that is, #ValCd (푞) is also in #P. But with this restriction we obtain more tractable cases, as shown by the following dichotomy result.
Theorem 3.7 (dichotomy). Let 푞 be an sjfBCQ. If 푅(푥) ∧ 푆 (푥) is a pattern of 푞, then #ValCd (푞) is #P-complete. Otherwise, #ValCd (푞) is in FP. Proof. We only need to prove the tractability claim, since hardness follows from Proposition 3.5 and Lemma 3.3. We will assume without loss of generality that 퐷 contains no constants, as we can introduce a fresh null with domain {푐} for every constant 푐 appearing in 퐷, and the result is again a Codd table, and this does not change the output of the problem. Let 푞 be 푅1 (푥¯1) ∧ ... ∧ 푅푚 (푥¯푚). Observe that since 푞 does not have 푅(푥) ∧ 푆 (푥) as a pattern then any two atoms cannot have a variable in common. But then, since 퐷 is a Codd table we have 푚 Ö #ValCd (푞)(퐷) = #ValCd (푅푖 (푥¯푖 ))(퐷(푅푖 )). 푖=1
Hence it is enough to show how to compute #ValCd (푅푖 (푥¯푖 ))(퐷(푅푖 )) for every 1 ⩽ 푖 ⩽ 푚. Let 푡¯1,..., 푡¯푛 be the tuples of 퐷(푅푖 ). Let us write 휌(푡¯푗 ) for the number of valuations of the nulls appearing in 푡¯푗 푥 (푅 (푥 ))(퐷(푅 )) = Î | (⊥)| − Î푛 휌(푡¯ ) that do not match ¯푖 . Clearly, #ValCd 푖 ¯푖 푖 ⊥ appears in 퐷 (푅푖 ) dom 푗=1 푗 , so we only have to show how to compute 휌(푡¯푗 ) for 1 ⩽ 푗 ⩽ 푛. Since we can easily compute the total number of valuations of 푡¯푗 , it is enough to show how to compute the number of valuations of 푡¯푗 that match 푥¯푖 . For every variable 푥 that appears in 푥¯푖 , compute the size of the intersection of the domains of the corresponding nulls in 푡¯푗 , and denote it 푠푥 . Then the number of valuations of 푡¯푗 that 푥 Î 푠 match ¯푖 is simply 푥 appears in 푥¯푖 푥 . This concludes the proof. □ At this stage, we have completed the first column of Table1, and we also know that 푅(푥, 푥) is a hard pattern in the uniform setting for naive tables (but not for Codd tables, by Theorem 3.7). In the next section, we treat the uniform setting.
3.2 The complexity on the uniform case u (푞) u (푞) In this section, we study the complexity of the problems #Val and #ValCd , again providing dichotomy results in both cases. 3.2.1 Naïve tables. We start our investigation with the case of naive tables. In Proposition 3.4, we already showed that #Valu (푅(푥, 푥)) is #P-hard. In the following proposition, we identify two other simple queries for which this problem is still intractable. 1:12 Marcelo Arenas, Pablo Barceló, and Mikaël Monet
Proposition 3.8. #Valu (푅(푥) ∧푆 (푥,푦) ∧푇 (푦)) and #Valu (푅(푥,푦) ∧푆 (푥,푦)) are both #P-hard. This holds even in the restricted setting in which all nulls are interpreted over the same fixed domain {0, 1}. Proof. We reduce both problems from the problem of counting the number of independent sets in a graph (denoted by #IS), which is #P-complete [44]. We start with #Valu (푅(푥) ∧ 푆 (푥,푦) ∧푇 (푦)). Let 푞 = 푅(푥) ∧ 푆 (푥,푦) ∧푇 (푦) and 퐺 = (푉, 퐸) be a graph. Then we define an incomplete database 퐷 as follows. For every node 푣 ∈ 푉 , we have a null ⊥푣, and the uniform domain is {0, 1}. For every edge {푢, 푣} ∈ 퐸, we have facts 푆 (⊥푢, ⊥푣) and 푆 (⊥푣, ⊥푢 ) in 퐷. Finally, we have facts 푅(1) and 푇 (1) in 퐷. For a valuation 휈 of the nulls, consider the corresponding subset 푆휈 of nodes of 퐺, given by 푆휈 = {푡 ∈ 푉 | 휈 (⊥푡 ) = 1}. This is a bijection between the valuations of the database and the node subsets of 퐺. Moreover, we have that 휈 (퐷) ̸|= 푞 if and only if 푆휈 is an independent set of 퐺. Since the total number of valuations of 퐷 is 2|푉 |, we have that the number of independent sets 퐺 |푉 | − u (푞)(퐷) p u (푞) of is equal to 2 #Val . Hence, we conclude that #IS ⩽T #Val . The idea is similar for #Valu (푅(푥,푦) ∧ 푆 (푥,푦)): we encode the graph with the relation 푆 in the same way, and this time we add the fact 푅(1, 1). □ As shown in the following result, it turns out that the three aforementioned patterns are enough to fully characterize the complexity of counting valuations for naive tables in the uniform setting. Theorem 3.9 (dichotomy). Let푞 be an sjfBCQ. If 푅(푥, 푥) or 푅(푥)∧푆 (푥,푦)∧푇 (푦) or 푅(푥,푦)∧푆 (푥,푦) is a pattern of 푞, then #Valu (푞) is #P-complete. Otherwise, #Valu (푞) is in FP. The #P-completeness part of the claim follows directly from what we have proved already. Here, the most challenging part of the proof is actually the tractability part. We only present a simple example to give an idea of the proof technique, and defer the full proof to Appendix A.1. 푛,푚 We will use the following definition. Given ∈ N, let us write surj푛→푚 for the number of surjective functions from {1, . . . , 푛} to {1, . . . ,푚}. By an inclusion–exclusion argument, one can Í푚−1 (− )푖 푚(푚 − 푖)푛 show that surj푛→푚 = 푖=0 1 푖 (for instance, see [3]). It is clear that this can be computed in FP, when 푛 and 푚 are given in unary. Example 3.10. Let 푞 be the sjfBCQ 푅(푥) ∧ 푆 (푥), and 퐷 be an incomplete database over rela- tions 푅, 푆. Notice that 푞 does not have any of the patterns mentioned in Theorem 3.9. We will show that #Valu (푞) is in FP. Since 푞 contains only two unary atoms we can also assume without loss of generality that the input 퐷 is a Codd table (otherwise all valuations are satisfying). Since we can compute in FP the total number of valuations, it is enough to show how to compute the number of valuations of 퐷 that do not satisfy 푞. Let dom be the uniform domain, 푑 be its size, 푛푅 (resp., 푛푆 ) be the number of nulls in 퐷(푅) (resp., in 퐷(푆)) and 퐶푅 (resp., 퐶푆 ) be the set of constants occurring in 퐷(푅) (resp., in 퐷(푆)), with 푐푅 (resp., 푐푆 ) its size. We can assume without loss of generality that 퐶푅 ∩퐶푆 = ∅, as otherwise all the valuations are satisfying, and this is computable in PTIME. Furthermore, we can also assume that 퐶푅 ∪퐶푆 ⊆ dom, since we can remove the constants that are not in dom, as these can never match. Let 푀 := dom \(퐶푅 ∪ 퐶푆 ), and 푚 its size (i.e., with our assumptions we have 푚 = 푑 − 푐푅 − 푐푆 ). ′ ′ 푀 ⊆ 푀 푅 ⊆ 퐶 ′ ′ Fix some subsets and 푅. The quantity surj푛푅 →|푀 |+|푅 | then counts the number of ′ ′ valuations of the nulls of 퐷(푅) that span exactly 푀 ∪ 푅 . Moreover, letting 휈푅 be a valuation of the ′ ′ ′ 푛 nulls of 퐷(푅) that spans exactly 푀 ∪ 푅 , the quantity (푑 − 푐푅 − |푀 |) 푆 is the number of ways to extend 휈푅 into a valuation 휈 of all the nulls of 퐷 so that 휈 (퐷) ̸|= 푞: indeed, every null of 퐷(푆) can ′ take any value in dom \(퐶푅 ∪ 푀 ). The number of valuations of 퐷 that do not satisfy 푞 is then (keeping in mind that a null in 퐷(푅) cannot take a value in 퐶푆 ):
∑︁ ′ 푛푆 ′ ′ × (푑 − 푐 − |푀 |) surj푛푅 →|푀 |+|푅 | 푅 푀′ ⊆푀 ′ 푅 ⊆퐶푅 The complexity of counting problems over incomplete databases 1:13 and since the summands only depends on the sizes of 푀 ′ and 푅′, this is equal to 푚 푐 ∑︁ 푅 ′ 푛푆 surj ′ ′ × (푑 − 푐 − 푚 ) 푚′ 푟 ′ 푛푅 →푚 +푟 푅 0⩽푚′⩽푚 ′ 0⩽푟 ⩽푐푅 This last expression can clearly be computed in PTIME.6 □ 3.2.2 Codd tables. We conclude this section by turning our attention to the case of Codd tables. Notice that none of the results proved so far provides a hard pattern in this case. We identify in the following proposition a simple query for which the problem is intractable. u (푅(푥) ∧ 푆 (푥,푦) ∧ 푇 (푦)) Proposition 3.11. #ValCd is #P-hard. Proof. We reduce from the problem of counting the number of independent sets of a bipartite (simple) graph, written #BIS, which is #P-hard [44]. Let 퐺 = (푋 ⊔ 푌, 퐸) be a bipartite graph. Without loss of generality, we can assume that |푋 | = |푌 | = 푛; indeed, if |푋 | < |푌 | then we could simply add |푌 | − |푋 | isolated nodes to complete the graph, which simply multiplies the number of independent sets by 2|푌 |−|푋 |. Also, observe that counting the number of independent sets of 퐺 is the same as counting the number of pairs (푆1, 푆2) with 푆1 ⊆ 푋, 푆2 ⊆ 푌 , such that (푆1 × 푆2) ∩ 퐸 = ∅. We will call such a pair an independent pair. For 0 ⩽ 푖, 푗 ⩽ 푛, let 푍푖,푗 be the number of independent pairs (푆1, 푆2) such that |푆1| = 푖 and |푆2| = 푗. It is clear that (★) the number of independent sets of 퐺 퐺 Í 푍 푛 2 is then #BIS( ) = 0⩽푖,푗⩽푛 푖,푗 . The idea of the reduction is to construct in polynomial time ( + 1) incomplete databases 퐷푎,푏 for 0 ⩽ 푎,푏 ⩽ 푛 such that, letting 퐶푎,푏 be the number of valuations 휈 of 퐷푎,푏 with 휈 (퐷푎,푏 ) ̸|= 푅(푥) ∧ 푆 (푥,푦) ∧ 푇 (푦), the values of the variables 푍푖,푗 and 퐶푖,푗 form a linear system of equations AZ = C, with A an invertible matrix. This will allow us, using (푛 + 1)2 calls to u (푅(푥) ∧푆 (푥,푦) ∧푇 (푦)) 푍 (퐺) an oracle for #ValCd , to recover the 푖,푗 values, and then to compute #BIS using (★). We now explain how we construct 퐷푎,푏 from 퐺 for 0 ⩽ 푎,푏 ⩽ 푛, and define A. First, we fix an arbitrary linear order 푥1, . . . , 푥푛 of 푋, and similarly 푦1, . . . ,푦푛 for 푌 . The database 퐷푎,푏 has constants 푎푖 for 1 ⩽ 푖 ⩽ 푛, and has a fact 푆 (푎푖, 푎푗 ) whenever (푥푖,푦푗 ) ∈ 퐸. It has nulls ⊥1,..., ⊥푎 푅(⊥ ) 푖 푎 푎 = ⊥′,..., ⊥′ and facts 푖 for 1 ⩽ ⩽ (if 0 there are no such nulls and facts), and nulls 1 푏 푇 ′ 푖 푏 and facts (⊥푖 ) for 1 ⩽ ⩽ ; in particular, this is a Codd table. The uniform domain of the nulls is {푎푖 | 1 ⩽ 푖 ⩽ 푛}. Given a valuation 휈 of 퐷푎,푏, let 푃 (휈) be the pair of subsets of 푉 defined by 푃 휈 def 푥 푘 푎 휈 푎 , 푦 푘 푏 휈 ′ 푎 ( ) = ({ 푖 | ∃1 ⩽ ⩽ s.t. (⊥푘 ) = 푖 } { 푖 | ∃1 ⩽ ⩽ s.t. (⊥푘 ) = 푖 }) One can then easily check that the following two claims hold:
• For every valuation 휈 of 퐷푎,푏, we have that 휈 (퐷푎,푏) ̸|= 푅(푥) ∧ 푆 (푥,푦) ∧ 푇 (푦) iff 푃 (휈) is an independent pair of 퐺;7 • (푆 , 푆 ) 퐺 × 휈 For every independent pair 1 2 of , there are exactly surj푎→|푆1 | surj푏→|푆2 | valuations such that 푃 (휈) = (푆1, 푆2). 퐶 Í 푍 But then, we have 푎,푏 = 0⩽푖,푗⩽푛 (surj푎→푖 × surj푏→푗 ) 푖,푗 . In other words, we have the linear 2 2 def system of equations AZ = C, where A is the (푛 + 1) × (푛 + 1) matrix defined by A(푎,푏),(푖,푗) = ′ ′ 푛 푛 surj푎→푖 × surj푏→푗 . This matrix is the Kronecker product A ⊗ A of the ( + 1) × ( + 1) matrix with ′ def ′ entries A푎,푖 = surj푎→푖 . Since A is a triangular matrix with non-zero coefficients on the diagonal, it is invertible, hence so is A, which concludes the proof. □
6 ′ ′ Note that in the sum we do not need to specify that 푚 + 푟 ⩽ 푛푅, as when 푎 < 푏 we have surj푎→푏 = 0. 7This observation, and in fact the idea of reducing from #BIS, is due to Antoine Amarilli. 1:14 Marcelo Arenas, Pablo Barceló, and Mikaël Monet
Note that in Proposition 3.8, we proved that #Valu (푅(푥) ∧ 푆 (푥,푦) ∧ 푇 (푦)) is #P-hard in the general case where naive tables are allowed. Hence, the hardness of that query for naive tables was in fact a consequence of Proposition 3.11. However, we decided to provide a separate proof for Proposition 3.8, because in this case intractability holds already when nulls are interpreted over the fixed domain {0, 1}, whereas we do not know if this is true for Codd tables.
The second pattern that we show is hard for #Valu for Codd tables is 푅(푥,푦) ∧ 푆 (푥,푦). Again, notice that we already showed this query to be hard in the case of naive tables (as Proposition 3.8), even for a fixed domain of {0, 1}. In the case of Codd tables, hardness still holds, but the proof is more complicated and uses domains of unbounded size (which is why we provide separate proofs). We show:
u (푅(푥,푦) ∧ 푆 (푥,푦)) Proposition 3.12. #ValCd is #P-hard.
Proof. Let 푞 be the query 푅(푥,푦) ∧푆 (푥,푦). We reduce from the problem of counting the number of matchings of a 2-3–regular bipartite graph, which is #P-complete [14, 55]. Let 퐺 = (퐴 ⊔ 퐵, 퐸) be a 2-3–regular bipartite graph, with the nodes in 퐴 having degree 3 and those in 퐵 having def def 푛 |퐴| 푛 |퐵| 퐺 푛 3푛퐴 degree 2, and let 퐴 = , and 퐵 = . Notice that since is 2-3–regular we have that 퐵 = 2 . We say that a set 푆 ⊆ 퐸 of edges of 퐺 is an 퐴-semimatching if every node 푎 of 퐴 is adjacent to at most 1 edge of 푆; formally, for every 푎 ∈ 퐴 we have |{푒 ∈ 푆 | 푎 ∈ 푒}| ⩽ 1. The type of an 퐴-semimatching 푆 is the number 푐 of nodes in 퐵 that are adjacent to exactly 2 edges of 푆; def formally, 푐 = |{푏 ∈ 퐵 | |{푒 ∈ 푆 | 푏 ∈ 푒}| = 2}|. For 0 ⩽ 푐 ⩽ 푛퐵, we write 푇푐 for the number of 퐴-semimatchings of 퐺 of type 푐. Observe then that 푇0 is simply the number of matchings of 퐺. The idea of the reduction is then as follows. We will construct databases 퐷푘 for 0 ⩽ 푘 ⩽ 푛퐵 such that, letting 퐶푘 be the number of valuations 휈 of 퐷푘 such that 휈 (퐷푘 ) ̸|= 푞, the values of the variables 푇푘 and 퐶푘 form a system of linear equations AT = C, with A and invertible matrix. This will allow 푛 + u (푞) 푇 푇 us, using 퐵 1 calls to and oracle for #ValCd , to recover the values 푘 , and thus to obtain 0 in polynomial time, that is, the number of matchings of 퐺. We now explain how to construct the database 퐷푘 for 0 ⩽ 푘 ⩽ 푛퐵. In what follows we use the convention that {1, . . . , 푘} = ∅ when 푘 = 0. The database 퐷푘 contains the following facts:
(1) One fact 푅(푎, ⊥푎) for every 푎 ∈ 퐴; (2) One fact 푆 (⊥푏,푏) for every 푏 ∈ 퐵; ′ ′ (3) One fact 푆 (⊥푎′, 푎 ) for every 푎 ∈ 퐴 where 푎 is a fresh constant. In particular, observe that ′ def ′ there are 푛퐴 such facts. In what follows we will write 퐴 = {푎 | 푎 ∈ 퐴}. 푆 푎 , 푎′ 푎 , 푎 퐴 퐴 푎 푎 (4) One fact ( 1 2) for every ( 1 2) ∈ × with 1 ≠ 2; (5) One fact 푆 (푎, 푖) for every (푎, 푖) ∈ 퐴 × {1, . . . , 푘}; (6) One fact 푆 (푢, 푣) for every (푢, 푣) ∈ (퐴 ∪ 퐵)2 such that {푢, 푣} is not in 퐸. (7) One fact 푅(푢, 푣) for every (푢, 푣) ∈ (퐴′ ∪ 퐵)2.
def And finally, the (uniform) domain for all the nullsis dom = 퐴 ∪ 퐵 ∪ 퐴′ ∪ {1, . . . , 푘}. Note that this is indeed a Codd database. Now, let us compute 퐶푘 , the number of valuations 휈 of 퐷푘 such that 휈 (퐷푘 ) ̸|= 푞. For such a valuation 휈 of 퐷푘 such that 휈 (퐷푘 ) ̸|= 푞, observe that (★) for every 푎 ∈ 퐴 ′ it holds that 휈 (⊥푎) is either 푎 or is one of the nodes in 퐵 that is a neighbor of 푎; this is because otherwise, the facts from (4-6) would make the query be satisfied. Then, for a valuation 휈 of 퐷푘 def such that 휈 (퐷푘 ) ̸|= 푞, let us define SM(휈) = {{푎,푏} | (푎,푏) ∈ (퐴 × 퐵) ∩ 푅(휈 (퐷푘 ))}. Because of (★), observe that SM(휈) is a subset of 퐸, and that it is in fact an 퐴-semimatching. We can then partition The complexity of counting problems over incomplete databases 1:15 the valuations 휈 of 퐷푘 with 휈 (퐷푘 ) ̸|= 푞 according to the type of SM(휈) as follows: Ä Ä Ä {휈 | 휈 is a valuation of 퐷푘 with 휈 (퐷푘 ) ̸|= 푞} = {휈}.
0⩽푐⩽푛퐵 푆: 푆 is an 퐴-semimatching 휈: 휈 valuation of 퐷푘 of 퐺 of type 푐 휈 (퐷푘 )̸|=푞 SM(휈)=푆 (2) Fix an 퐴-semimatching 푆 of 퐺 of type 푐 ∈ {0, . . . , 푛퐵 }, and let us count how many valuations 휈 of 퐷푘 there are such that 휈 (퐷푘 ) ̸|= 푞 and SM(휈) = 푆. First of all, observe that for such a valuation 휈, ′ the value of 휈 (⊥푎) for every 푎 ∈ 퐴 is forced: it is 푎 if 푎 is adjacent to no edge of 푆, and otherwise it is 푏 for the unique {푎,푏} ∈ 푆. Therefore we have to count how many possibilities there are for the remaining nulls, those of the form ⊥푎′ and ⊥푏 from facts (2-3). We have:
• For a null ⊥푏 such that 푏 is adjacent to two edges of 푆, there are 푛퐴 + 푘 − 2 possible values in order to not satisfy the query. Note that there are 푐 such nulls ⊥푏. ′ • For a null ⊥푎′ such that 푎 is not adjacent to an edge in 푆 (so that we know that 휈 (⊥푎) = 푎 ), there are 푛퐴 + 푘 − 1 possible values in order to not satisfy the query. For a null ⊥푏 such that 푏 is adjacent to exactly one edge {푎,푏} of 푆 (i.e., we have 휈 (⊥푎) = 푏) there are again 푛퐴 + 푘 − 1 possible values to not satisfy the query. Observe that in total there are 푛퐴 − 2푐 such nulls ⊥푎′ or ⊥푏, because 푆 is an 퐴-semimatching. ′ • For a null ⊥푎′ such that 푎 is adjacent to an edge in 푆 (so that we know that 휈 (⊥푎) ≠ 푎 ) there are 푛퐴 + 푘 possibilities, and similarly for a null ⊥푏 such that 푏 is not adjacent to an edge in 푆 there are 푛퐴 + 푘 possible values. By the previous two items, in total there are 푛퐴 + 푛퐵 − 푐 − (푛퐴 − 2푐) = 푛퐵 + 푐 such nulls ⊥푎′ or ⊥푏 . 푐 푛퐴−2푐 푛퐵 +푐 Therefore, there are exactly (푛퐴 + 푘 − 2) (푛퐴 + 푘 − 1) (푛퐴 + 푘) valuations 휈 of 퐷푘 such that 휈 (퐷푘 ) ̸|= 푞 and SM(휈) = 푆. Since this depends only on the type of the 퐴-semimatching 푆 (and not on 푆 itself), we obtain from Equation2 that
퐶푘 = |{휈 | 휈 is a valuation of 퐷푘 with 휈 (퐷푘 ) ̸|= 푞}|
∑︁ 푐 푛퐴−2푐 푛퐵 +푐 = 푇푐 × (푛퐴 + 푘 − 2) (푛퐴 + 푘 − 1) (푛퐴 + 푘) . 0⩽푐⩽푛퐵 푐 That is, we have the linear system of equations AT = C, with A푘,푐 = (푛퐴 + 푘 − 2) (푛퐴 + 푛 −2푐 푛 +푐 푘 − 1) 퐴 (푛퐴 + 푘) 퐵 for 0 ⩽ 푐, 푘 ⩽ 푛퐵. But observe that we have A = DV, with D being 푛 푛 the diagonal (푛퐵 + 1) × (푛퐵 + 1) matrix with entries (푛퐴 + 푘 − 2) 퐴 (푛퐴 + 푘) 퐵 , and V being 푛 푛 (푛퐴+푘−2)(푛퐴+푘) 푘 푛 the ( 퐵 + 1) × ( 퐵 + 1) Vandermonde matrix with coefficients 2 for 0 ⩽ ⩽ 퐵. Hence (푛퐴+푘−1) to show that A is invertible, we only need to argue that the coefficients of this Vandermonde matrix V 푓 (푥) def= (푛퐴+푥−2)(푛퐴+푥) 푓 ′ (푥) = 2 are all distinct. But for the function 푛퐴 (푛 +푥−1)2 , one can check that 푛퐴 (푛 +푥−1)3 , 푓 , 푛 퐴 푛 퐴 so that 푛퐴 is strictly increasing on [0 퐵] (we assume 퐴 ⩾ 2 without loss of generality), so that all the coefficients are indeed distinct. This concludes the proof. □
As we show next, the patterns from Propositions 3.12 and 3.11 are the only hard patterns u for #ValCd. The following is then the last dichotomy of this section. Theorem 3.13 (dichotomy). Let 푞 be an sjfBCQ. If 푅(푥) ∧ 푆 (푥,푦) ∧푇 (푦) or 푅(푥,푦) ∧ 푆 (푥,푦) is a 푞 u (푞) u (푞) pattern of , then #ValCd is #P-complete. Otherwise, #ValCd is in FP. We only need to show the tractability part of that claim, as hardness follows from Propositions 3.12 and 3.11 and Lemma 3.3. First of all, observe that we can assume without loss of generality that the sjfBCQ 푞 is connected. This is because 푞 has no self-join and the database 퐷 is Codd, so, 1:16 Marcelo Arenas, Pablo Barceló, and Mikaël Monet letting 푞1, . . . , 푞푡 be the connected components of 푞, and letting 퐷푖 be the database 퐷 restricted to the relations appearing in 푞푖 , we have that
푡 u 푞 퐷 Ö u 푞 퐷 . #ValCd ( )( ) = #ValCd ( 푖 )( 푖 ) 푖=1 Second, notice that, for a connected sjfBCQ 푞, not containing any of these two patterns is equivalent to the following: there exists a variable 푥 such that all atoms of 푞 contain variable 푥, and for any two atoms of 푞, the only variable that they have in common is 푥. In other words, 푥 is in every atom and every other variable occurs in only one atom. For instance 푅1 (푥,푦,푦) ∧ 푅2 (푥,푥,푧,푧,푧,푢,푢) ∧ 푅3 (푥, 푥, 푣, 푡, 푡) is such a query. We now provide an example of a connected query with only two atoms. Example 3.14. We consider the query 푞 = 푅(푥, 푥,푦,푦) ∧ 푆 (푥, 푥). Let 퐷 be an incomplete Codd database over relations 푅, 푆, with uniform domain dom of size 푑. In this proof we will use sym- bols 푎, 푎1, 푎2,... to denote a constant or a null, and symbols 푐, 푐1, 푐2,... to denote constants. Moreover, unless stated otherwise, these symbols can refer to the same constant (but not to the same null because 퐷 is a Codd database). A fact that contains only constants is called a ground fact. Further- more, since that database is Codd we will always represent a null with ⊥, being understood that they are all distinct. We start with a few simplifications. First, we assume wlog that 퐷 does not contain ground facts that already satisfy the query. Second, we assume wlog that 퐷 does not contain facts of the form 푅(푎, 푎′, 푐, 푐 ′) or 푅(푐, 푐 ′, 푎, 푎′) or 푆 (푐, 푐 ′) where 푐 and 푐 ′ are distinct constants. Indeed, because 퐷 is Codd and such a fact 푓 can never be part of a match, we could simply remove 푓 from 퐷 and multiply the end result by the appropriate value (namely, 푑푡 where 푡 is the number of nulls of 푓 ). Last, we assume wlog that for any fact of the form 푅(푎, 푎′, 푐, ⊥) or 푅(푎, 푎′, ⊥, 푐) or 푅(⊥, 푐, 푎, 푎′) or 푅(푐, ⊥, 푎, 푎′) or 푆 (푐, ⊥) or 푆 (⊥, 푐), we have 푐 ∈ dom; indeed otherwise, those facts can never be part of a match and we could again remove them and multiply the result by the appropriate value. Next, we need to introduce some notation. For a fact 푓 of 퐷, the type of 푓 is the word in {0, 1}arity(푓 ) that has a 1 in position 푖 iff the 푖-th element of 푓 is a null. For instance the type of 푅(⊥, ⊥, 푐, ⊥) is 1101 and that of 푆 (푐, 푐) is 00. Observe that there are a fixed number of possible types, because the query is fixed. For a constant 푐 and fact 푓 of 퐷, we say that 푓 is 푐-determined if 푓 contains the constant 푐 at a position for variable 푥. For instance 푅(⊥, 푐, 푐 ′, ⊥) is 푐-determined and so is 푆 (푐, 푐). A fact 푓 that contains only nulls on the positions for 푥 is called free. With the simplifications of the last paragraph, a fact of 퐷 is either free or it is 푐-determined for a unique constant 푐. Let t be a type of 퐷 (for 푅 or for 푆). We say that t is free if it has only 1s in the positions corresponding to variable 푥; in other words, if it is the type of a free fact. Otherwise we say that t is determined. For a constant 푐 and determined type t, we write 푛푅,푐,t (resp., 푛푅,푐,t) the number of 푅-facts (resp., of 푆-facts) of 퐷 that are 푐-determined of type t. For a free type t of 푅 (resp., of 푆) we write 퐹푅,t (resp., 퐹푆,t) for the set of free 푅-facts (resp., of free 푆-facts) and f푅,t (resp.,ft) for its size. Let 푓 be a fact that is 푐-determined. We write 훼푓 for the number of valuations 휈 of the nulls in 푓 such that 푓 matches ′ ′ ′ the corresponding atom in 푞. For instance if 푓 is 푅(⊥, 푐, 푐 , ⊥) or again 푅(푐, 푐, 푐 , 푐 ) then 훼푓 = 1, 8 while if 푓 is 푅(푐, 푐, ⊥, ⊥) then 훼푓 = 푑. Similarly we let 훽푓 denote the number of valuations 휈 of the 푡 nulls in 푓 such that 푓 does not match the corresponding atom in 푞; which is then equal to 푑 − 훼푓 , for 푡 the number of nulls in 푓 . Observe that 훼푓 and 훽푓 depend only on the type t of 푓 . Hence we will write 훼t and 훽t instead. Let 푓 be a free fact. We let 훼푓 be the number of valuations of
8When 푓 is a ground fact – such as 푅(푐, 푐, 푐′, 푐′) – we recall the mathematical convention that there exists a unique function with emtpy domain, hence a unique valuation of the nulls of 푓 . The complexity of counting problems over incomplete databases 1:17 the nulls in 푓 that are not on a position for variable 푥 that match the corresponding part of the ′ atom in 푞. For instance if 푓 is 푅(⊥, ⊥, 푐 , ⊥) then 훼푓 = 1 and if 푓 is 푅(⊥, ⊥, ⊥, ⊥) then 훼푓 = 푑. We def 푡 also define 훽푓 = 푑 − 푑훼푓 where 푡 is the number of nulls, which correspond to the number of valuations of the nulls in 푓 such that the ground fact obtained does not match its corresponding atom. Again, since 훼푓 and 훽푓 depend only on the free type t of 푓 , we will write 훼t and 훽t instead. It is clear that we can compute all values 훼푓 ,푐 and 훽푓 ,푐 , for every determined fact 푓 and constant 푐 of 퐷, and values 훼푓 for every free fact in polynomial time. Last, we fix a linear order 푐1, . . . , 푐푛 on the constants 푐푖 such that there exists a fact of 퐷 that is 푐푖 -determined. Next, we explain how we can compute in FP the number of valuations of 퐷 that do not satisfy 푞. To do so, we will first define some quantities, and then show that we can compute these quantities in polynomial time using a dynamic programming approach. One of these quantities will be the number of valuations of 퐷 that do not satisfy 푞, which is what we want to compute. The quantities that we will define are of the form 푉 (params), where params consist of the following parameters (†): (a) one parameter 푣 whose range is 0 . . . 푛; 4 (b) one parameter 푞푅,t for every free type t ∈ {0, 1} of 푅, with range 0 ... f푅,t; (c) one parameter 푟푅 with range 0, . . . , 푛; (d) one parameter 푟푆 with range 0, . . . , 푟푅; We now explain what the quantity 푉 (params) with these parameters represent. To that end, we define the incomplete database 퐷(params) to be the database that contains only
• all of the facts that are 푐푖 -determined for 1 ⩽ 푖 ⩽ 푣 (if 푣 = 0 then 퐷푣 contains no determined facts); • for every free type t of 푅, 퐷(params) contains 푞푅,t free 푅-facts of 퐷 of type t (it doesn’t matter which ones, say the first 푞푅,t ones); • 퐷(params) contains all the free 푆-facts of 퐷. (Note that parameters (c) and (d) are not used to define the database but we still write 퐷(params).) The quantity 푉 (params) is then defined to be the number of valuations 휈 of database 퐷(params) such that 휈 (퐷(params)) ̸|= 푞 and such that the following holds: (1) for every 1 ⩽ 푖 ⩽ 푟푆 and free fact 푓 of 푅 or of 푆, the ground fact 휈 (푓 ) does not match the corresponding atom in 푞 with variable 푥 mapped to 푐푖 (this condition is empty if 푟푆 = 0); and (2) for every 푟푆 + 1 ⩽ 푖 ⩽ 푟푅 and every free fact 푓 of 푅, the ground fact 휈 (푓 ) does not match 푅(푥, 푥,푦,푦) with variable 푥 mapped to 푐푖 (this condition is empty if 푟푅 = 푟푆 ). By definition we then have that, when 푣 = 푛, 푞푅,t = f푅,t for every free type t of 푅 – so that 퐷(params) is equal to 퐷 –, 푟푅 = 0 and 푟푆 = 0 then 푉 (params) is then u (푞)(퐷) equal to #ValCd . Observe that there are a fixed number of parameters in params (because the query is fixed), and that the possible values of these parameters are polynomial in thesizeof the input. We explain how compute the values 푉 (params) by dynamic programming. Base case. Our base case will be when 푣 is equal to zero (and the other parameters are arbitrary as in (†)); in other words, the database 퐷(params) does not contain any determined facts, it contains only free facts. We have to compute the number of valuations of 퐷(params) that do not satisfy the query and such that (1) and (2) hold. We do so as follows. We guess a subset 푆푅 푐 , . . . , 푐 푐 푅 푐, 푐, 푐 ′, 푐 ′ 휈 퐷 of dom \{ 1 푟푅 }; this will be the set of constants such that ( ) ∈ ( (params)) for 푐 ′ 푆 푐 , . . . , 푐 푆 푐 some . We also guess a subset 푆 of dom \ { 1 푟푆 } ∪ 푅 ; this will be the set of constants such that 푆 (푐, 푐) ∈ 휈 (퐷(params)) (observe that 푆푅 and 푆푆 are disjoint; this is in order to avoid satisfying the query). To “achieve” these subsets, we for each free type t of 푅 guess a subset 푊푅,t of the free facts of 푅 of type t; these will be the 푅-facts 푓 of type t such that 휈 (푓 ) matches 푅(푥, 푥,푦,푦) with a constant in 푆푅 for variable 푥. Similarly we for each free type t of 푆 guess a subset 푊푆,t of the 1:18 Marcelo Arenas, Pablo Barceló, and Mikaël Monet free facts of 푆 of type t; these will be the 푆-facts 푓 of type t such that 휈 (푓 ) matches 푆 (푥, 푥) with a constant in 푆푆 for variable 푥. To ensure that these subsets are indeed enough to cover 푆푅 and 푆푆 , we surjectively assign each of the corresponding facts a constant in 푆푅 (for 푅-facts) or 푆푆 (for 푆-facts). For such facts 푓 , we then choose one of the 훼t valuations of 푓 that ensure that the ground fact obtained satisfies the correspoinding atom of the query; crucially, this quantity depends onlyon the type of 푓 . For the remaining (free) facts 푓 , we choose one of the 훽t valuations of 푓 that ensure that the ground facts obtained do not satisfy the corresponding atom of the query. In the end we obtain that 푉 (params) is equal to the following expression:
∑︁ ∑︁ ∑︁ ∑︁ ∑︁ ∑︁ ··· ··· 훾 (3) ⊆ \{ } ⊆ ⊆ ′ ⊆ ′ ′ ⊆ ′ 푆푅 dom 푐1,...,푐푟푅 ⊆ \ { }∪ 푊푅,t1 푄푅,t1 푊푅,tJ 푄푅,tJ 푊푅,t 퐹푆,t 푊푅,t 퐹푆,t 푆푆 dom 푐1,...,푐푟푆 푆푅 1 1 K K ... 푅 ′ ... ′ 푆 푄 where t1 tJ are all the free types of , t1 tK are all the free types of , 푅,ti is the set of the def 퐽 ′ def 퐾 first 푞 free 푅-facts of type t , and where, letting 푧 = Í |푊 | and 푧 = Í |푊 ′ |, we have 푅,ti i 푖=1 푅,ti 푖=1 푆,ti
퐽 퐾 Ö |푊 | |푄 |−|푊 | Ö |푊푆,t′ | |퐹푆,t′ |−|푊푆,t′ | 훾 훼 푅,ti 훽 푅,ti 푅,ti 훼 i 훽 i i . = surj푧→|푆 | × surj푧′→|푆 | × t t × ′ ′ 푅 푆 i i ti ti 푖=1 푖=1 Obviously, we cannot compute the expression in3 in polynomial time, since we are summing over subsets of the input facts. However, because 훾 depends only on the sizes of all these sets, we can express 푉 (params) as ∑︁ ∑︁ ∑︁ ∑︁ ∑︁ ∑︁ ··· ··· 훿, (4)
0⩽푠푅 ⩽푑−푟푅 0⩽푠푆 ⩽푑−푟푆 −푠푅 0⩽푤푅,t1 ⩽푞푅,t1 0⩽푤푅,t ⩽푞푅,t 0⩽푤푅,t′ ⩽f푅,t′ 0⩽푤푅,t′ ⩽f푅,t′ J J 1 1 K K def 퐽 ′ def 퐾 where, letting again 푧 = Í 푤 and 푧 = Í 푤 ′ , we have 푖=1 푅,ti 푖=1 푆,ti
퐽 퐾 Ö 푤 푞 −푤 Ö 푤푆,t′ f푆,t′ −푤푆,t′ 훿 훼 푅,ti 훽 푅,ti 푅,ti 훼 i 훽 i i =surj푧→푠 × surj푧′→푠 × t t × ′ ′ 푅 푆 i i ti ti 푖=1 푖=1 퐽 퐾 푑 − 푑푅 푑 − 푟푆 − 푠푅 Ö 푞푅,t Ö f푆,t′ × × i × i . 푠 푠 푤 푤 ′ 푅 푆 푖=1 푅,ti 푖=1 푆,ti But then, because the expression4 contains a fixed number of nested sums (this number depends only on the query 푞), and because indices and summands are polynomial, this quantity can be computed in polynomial time. This concludes the base case, that was when 푣 = 0. Inductive case. Next we explain how we can compute the quantities 푉 (params) (with the pa- rameters params as in (†)) in polynomial time from quantities 푉 (params′) with a strictly smaller value for parameter (a). Hence we assume that parameter 푣 is ⩾ 1. The idea is to get rid of the 푐푣- determined facts of 퐷(params), by partitioning the valuations of 퐷(params) that do not satisfy 푞 and satisfy (1) and (2) into
(A) those valuations 휈 that do not satisfy the query and satisfy (1) and (2) and such that no 푐푣- determined 푅-fact or free 푅-fact 푓 of 퐷(params) is such that 휈 (푓 ) matches 푅(푥, 푥,푦,푦) with variable 푥 mapped to 푐푣; and (B) those valuations 휈 that do not satisfy the query and satisfy (1) and (2) and such that at least one 푐푣-determined 푅-fact or free 푅-fact 푓 of 퐷(params) is such that 휈 (푓 ) matches 푅(푥, 푥,푦,푦) The complexity of counting problems over incomplete databases 1:19
with variable 푥 mapped to 푐푣 (and thus, no 푐푣-determined 푆-fact or free 푆-fact 푓 of 퐷(params) is such that 휈 (푓 ) matches 푆 (푥, 푥) with variable 푥 mapped to 푐푣). 푅 푓 푐 훽 To compute valuations in (A), we choose for each -fact that is 푣-determined one of the 푓 ,푐푣 valuations of its nulls that is such that 휈 (푓 ) does not match 푅(푥, 푥,푦,푦), we disallow the free 푅 facts of 푅 to have valuations that would match on 푐푣 by increasing 푟 by one, and we choose any valuation for the 푐푣-determined facts of 푆 (since we know that they will not be part of a match). Formally, letting 푡푓 be the number of nulls of a fact 푓 , we have the expression