Magic Sets for Data Integration∗

Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)

Wolfgang Faber, Gianluigi Greco, and Nicola Leone
Department of Mathematics, University of Calabria, Italy
{faber,ggreco,leone}@mat.unical.it

∗ Supported by M.I.U.R. within projects “Potenziamento e Applicazioni della Programmazione Logica Disgiuntiva” and “Sistemi basati sulla logica per la rappresentazione di conoscenza: estensioni e tecniche di ottimizzazione,” and “tocai.it: Tecnologie Orientate alla Conoscenza per Aggregazioni di Imprese in Internet.”

Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We present a generalization of the Magic Sets technique to Datalog¬ programs with (possibly unstratified) negation under the stable model semantics, originally defined in (Faber, Greco, & Leone 2005; 2007). The technique optimizes Datalog¬ programs by means of a rewriting algorithm that preserves query equivalence, under the proviso that the original program is consistent. The approach is motivated by recently proposed methods for query answering in data integration and inconsistent databases, which use cautious reasoning over consistent Datalog¬ programs under the stable model semantics.

In order to prove the correctness of our Magic Sets transformation, we have introduced a novel notion of modularity for Datalog¬ under the stable model semantics, which is more suitable for query answering than previous module definitions, and which is also relevant per se. A module under this definition guarantees independent evaluation of queries if the full program is consistent. Otherwise, it guarantees soundness under cautious and completeness under brave reasoning.

Introduction

Datalog¬ programs are function-free logic programs where negation may occur in the bodies of rules. Datalog¬ with stable model semantics¹ (Gelfond & Lifschitz 1988) is a very expressive query language in a precise mathematical sense: under brave (cautious) reasoning², Datalog¬ can express every query that is decidable in the complexity class NP (co-NP) (Schlipf 1995).

¹ Unless explicitly specified, Datalog¬ will always denote Datalog with negation under stable model semantics in this paper.
² Note that brave and cautious consequences are also called possible and certain answers, respectively.

In many recent proposals for data integration and reasoning on inconsistent databases, query answering turned out to be co-NP-complete and, in fact, it was reduced to cautious reasoning on suitable Datalog¬ programs (Arenas, Bertossi, & Chomicki 2000; Greco, Greco, & Zumpano 2001; Lembo 2004; Bravo & Bertossi 2003). An important feature of these programs is that they are consistent, viz. that a stable model is guaranteed to exist. However, given the co-NP hardness of their evaluation, the design of optimization techniques is of utmost importance for applications in real scenarios, where the size of the input database may be huge.

In this paper, we focus on the optimization of Datalog¬ programs, by discussing an extension of the well-known Magic Set method (Bancilhon et al. 1986; Beeri & Ramakrishnan 1991). This method exploits the fact that, while answering a user query, often only a certain part of the stable models needs to be considered, so there is no need to compute these models in their entirety. In fact, its aim is to focus the instantiation of the program on those ground rules that are really needed to answer the query, by propagating binding information from the query goal into the program rules. Unlike the original method, Datalog¬ also requires body-to-head propagation in the presence of unstratified negation. The key idea is then to identify rules for which this is necessary, which we term dangerous rules.

The formal properties of the proposed approach have been analyzed in depth. First, we show that the program obtained is query equivalent under brave and cautious reasoning to the original program if the latter is consistent, making it a perfect fit for data integration applications, where consistency is guaranteed. If the original program is not guaranteed to be consistent, we can still show that on the transformed program, brave reasoning is complete, and cautious reasoning is sound with respect to the original program.

In order to establish the above results, we introduce a suitable notion of modularity for query answering over Datalog¬ programs. Previous notions like splitting sets (Lifschitz & Turner 1994) and modules (Eiter, Gottlob, & Mannila 1997) have been defined for stable model generation, while the new notion is tailored to query answering.

Finally, we analyze the complexity of determining whether a predicate is dangerous in a given program, which is a central notion of our Magic Set method. It turns out that this task is NL-complete and thus tractable.

Datalog¬ Programs

An atom $p(t_1, \ldots, t_k)$ is composed of a predicate symbol $p$ of arity $k$ and terms $t_1, \ldots, t_k$, each of which is either a constant or a variable. A literal is either an atom $a$ or its negation not $a$. A (Datalog¬) rule $r$ is of the form

$$h \text{ :- } b_1, \ldots, b_m, \text{not } b_{m+1}, \ldots, \text{not } b_n. \qquad (1)$$

where $h, b_1, \ldots, b_n$ are atoms and $0 \le m \le n$. The atom $h$ is called the head of the rule. A Datalog¬ program is a set of Datalog¬ rules. If all defining rules of a predicate $p$ are facts (that is, $n = m = 0$), then $p$ is an EDB predicate; otherwise $p$ is an IDB predicate. A set of facts for EDB predicates of a program $P$ is called an EDB (for $P$). A query $Q$ is just an atom.

Let the (Herbrand) universe and base for a Datalog¬ program $P$ be denoted $U_P$ and $B_P$, respectively. The ground instantiation of $P$ w.r.t. $U_P$ is denoted by $Ground(P)$. An interpretation is a subset of $B_P$. A ground positive literal $A$ (resp. negative literal not $A$) is true w.r.t. $I$ if $A \in I$ (resp. $A \notin I$); otherwise it is false. An interpretation $I$ satisfies a ground rule $r \in Ground(P)$ if the head of $r$ is true w.r.t. $I$ whenever the body of $r$ is true w.r.t. $I$. An interpretation $I$ is a model of a Datalog¬ program $P$ if $I$ satisfies all rules in $Ground(P)$.

Each not-free program $P$ has a least (under subset inclusion) model, which is denoted by $LM(P)$ and is the unique stable model of $P$. Given a Datalog¬ program $P$ and an interpretation $I$, the Gelfond-Lifschitz transform $P^I$ is obtained from $Ground(P)$ by deleting all rules containing not $b$ where $b \in I$, and deleting all not literals in the remaining rules. The set of stable models of a Datalog¬ program $P$, denoted by $SM(P)$, is the set of interpretations $I$ such that $I = LM(P^I)$.

A program $P$ is consistent if $SM(P) \neq \emptyset$; otherwise it is inconsistent. A program $P$ is data consistent if $P_F = P \cup F$ is consistent for each EDB $F$.

Given a ground atom $a$ and a Datalog¬ program $P$, $a$ is a cautious (or certain) consequence of $P$, denoted by $P \models_c a$, if $\forall M \in SM(P): a \in M$; $a$ is a brave (or possible) consequence of $P$, denoted by $P \models_b a$, if $\exists M \in SM(P): a \in M$. Given a query $Q$, $Ans_c(Q, P)$ denotes the set of substitutions $\vartheta$ such that $P \models_c Q\vartheta$; $Ans_b(Q, P)$ denotes the set of substitutions $\vartheta$ such that $P \models_b Q\vartheta$. Let $P$ and $P'$ be Datalog¬ programs and $Q$ be a query. [...]

[...] an arc is marked if $a$ appears in $B^-(r)$. A cycle of $DG_P$ is a sequence of nodes $C = n_1, \ldots, n_k$, such that each $n_i$ ($1 < i < k$) occurs exactly once in $C$, $n_1 = n_k$, and each $(n_i, n_{i+1})$ ($1 \le i < k$) is an arc in $DG_P$. An odd cycle in $DG_P$ is a cycle $C = n_1, \ldots, n_k$ such that an odd number of the arcs $(n_i, n_{i+1})$ ($1 \le i < k$) is marked. In analogy, one can also define the atom dependency graph $DG_P^A$ of a ground program $P$, by considering atoms rather than predicates. We now use this notion to define dangerous predicates and rules. The intuition is that dangerous predicates may inhibit a stable model.

Definition 1 Let $P$ be a program (resp., ground program), and $d$ be a predicate (resp., atom) of $P$. Then, we say that $d$ is dangerous if either
1. $d$ occurs in an odd cycle of $DG_P$ (resp., $DG_P^A$), or
2. $d$ occurs in the body of a rule with a dangerous head predicate (resp., atom).
A rule $r$ is dangerous if it contains a dangerous predicate (resp., atom) in the head. □

Based on this definition, we define a notion of independence for sets of atoms. These sets must be closed under rules in the head-to-body direction and in the body-to-head direction for dangerous rules. The defining rules of these sets then form modules.

Definition 2 An independent atom set of a ground program $P$ is a set $S \subseteq B_P$ such that for each atom $a \in S$ the following holds:
1. if $a$ is the head of a rule $r \in P$ then all atoms of $r$ are in $S$, and
2. if $a$ appears in the body of a dangerous rule $r \in P$ then all atoms of $r$ are in $S$.
A subset $T$ of a program $P$ is a module if $T = \{r \mid \text{the head of } r \text{ is in } S\}$ for some independent atom set $S$. □

These modules can be used to partially evaluate programs, as the following results show.
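The definitions of the Gelfond-Lifschitz transform, stable models, and brave and cautious consequences in the Datalog¬ Programs section can be made concrete with a small sketch. The Python code below is only an illustration and not the evaluation method of the paper: it enumerates every interpretation of a ground program by brute force, which is exponential and suitable for toy inputs only. The tuple representation of rules and the names least_model, reduct, stable_models, brave, cautious, EVEN, and ODD are assumptions introduced here.

```python
from itertools import chain, combinations

# Assumed representation (not the paper's notation): a ground rule is a tuple
# (head, positive_body, negative_body), with atoms given as strings.

def least_model(positive_rules):
    """Least model LM(P) of a not-free ground program, by fixpoint iteration."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, pos in positive_rules:
            if head not in model and all(b in model for b in pos):
                model.add(head)
                changed = True
    return model

def reduct(rules, interp):
    """Gelfond-Lifschitz transform P^I: delete rules containing 'not b' with
    b in I, then delete the 'not' literals of the remaining rules."""
    return [(h, pos) for h, pos, neg in rules
            if not any(b in interp for b in neg)]

def stable_models(rules):
    """All interpretations I with I = LM(P^I), found by exhaustive search."""
    atoms = sorted({a for h, pos, neg in rules for a in (h, *pos, *neg)})
    subsets = chain.from_iterable(
        combinations(atoms, k) for k in range(len(atoms) + 1))
    return [set(s) for s in subsets
            if least_model(reduct(rules, set(s))) == set(s)]

def cautious(rules, atom):
    """True if the atom belongs to every stable model."""
    return all(atom in m for m in stable_models(rules))

def brave(rules, atom):
    """True if the atom belongs to at least one stable model."""
    return any(atom in m for m in stable_models(rules))

# a :- not b.   b :- not a.   c :- a.   c :- b.
EVEN = [("a", (), ("b",)), ("b", (), ("a",)),
        ("c", ("a",), ()), ("c", ("b",), ())]

# p :- not p.
ODD = [("p", (), ("p",))]

print(stable_models(EVEN))
print(cautious(EVEN, "c"), brave(EVEN, "a"), cautious(EVEN, "a"))
print(stable_models(ODD))
```

On EVEN there are two stable models, {a, c} and {b, c}, so c is a cautious consequence while a is only a brave one. ODD has no stable model, so the universally quantified cautious test holds vacuously for every atom, which illustrates why consistency matters for cautious reasoning.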
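Definitions 1 and 2 can likewise be sketched for small ground programs. The Python code below is a hedged illustration under stated assumptions: arcs of the atom dependency graph are taken to run from each body atom to the head atom and to be marked when the body atom occurs under negation (the direction is an assumption, but reversing all arcs would not change which atoms lie on a cycle); simple cycles are enumerated by brute force rather than by an efficient procedure; and module_for, which computes the least set containing some seed atoms and closed under the two conditions of Definition 2, is a hypothetical helper introduced here.

```python
# Assumed representation: a ground rule is (head, positive_body, negative_body).

def dangerous_atoms(rules):
    """Dangerous atoms of a ground program (Definition 1), by brute force."""
    # Assumed arc direction: body atom -> head atom, marked (1) if negated.
    arcs = {}
    for head, pos, neg in rules:
        for b in pos:
            arcs.setdefault(b, set()).add((head, 0))
        for b in neg:
            arcs.setdefault(b, set()).add((head, 1))

    dangerous = set()

    def walk(start, node, path, parity):
        # Depth-first enumeration of the simple cycles through 'start'.
        for succ, marked in arcs.get(node, ()):
            p = parity ^ marked
            if succ == start:
                if p == 1:  # odd number of marked arcs on this cycle
                    dangerous.update(path)
            elif succ not in path:
                walk(start, succ, path + [succ], p)

    for atom in list(arcs):
        walk(atom, atom, [atom], 0)

    # Condition 2: body atoms of rules with a dangerous head are dangerous.
    changed = True
    while changed:
        changed = False
        for head, pos, neg in rules:
            body = set(pos) | set(neg)
            if head in dangerous and not body <= dangerous:
                dangerous |= body
                changed = True
    return dangerous

def module_for(rules, seed):
    """Least atom set containing 'seed' that satisfies Definition 2,
    together with the rules whose head lies in that set."""
    danger = dangerous_atoms(rules)
    s = set(seed)
    changed = True
    while changed:
        changed = False
        for head, pos, neg in rules:
            body = set(pos) | set(neg)
            atoms = {head} | body
            # Head-to-body closure, plus body-to-head for dangerous rules.
            if (head in s or (head in danger and body & s)) and not atoms <= s:
                s |= atoms
                changed = True
    return s, [r for r in rules if r[0] in s]

P = [
    ("p", ("q",), ("p",)),  # p :- q, not p.
    ("q", ("r",), ()),      # q :- r.
    ("s", ("r",), ()),      # s :- r.
    ("r", (), ()),          # r.
    ("a", (), ("b",)),      # a :- not b.
    ("b", (), ("a",)),      # b :- not a.
]

print(dangerous_atoms(P))
print(module_for(P, {"s"})[0])
print(module_for(P, {"a"})[0])
```

In this toy program, p lies on an odd cycle, which makes q and r dangerous as well. A query on s therefore pulls in the rules for q and p through body-to-head propagation, whereas a query on a only needs the two rules for a and b, whose cycle has an even number of marked arcs.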
