
Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)

Magic Sets for Data Integration∗

Wolfgang Faber and Gianluigi Greco and Nicola Leone
Department of Mathematics, University of Calabria, Italy
{faber,ggreco,leone}@mat.unical.it

Abstract

We present a generalization of the Magic Sets technique to Datalog¬ programs with (possibly unstratified) negation under the stable model semantics, originally defined in (Faber, Greco, & Leone 2005; 2007). The technique optimizes Datalog¬ programs by means of a rewriting algorithm that preserves query equivalence, under the proviso that the original program is consistent. The approach is motivated by recently proposed methods for query answering in data integration and inconsistent databases, which use cautious reasoning over consistent Datalog¬ programs under the stable model semantics.

In order to prove the correctness of our Magic Sets transformation, we have introduced a novel notion of modularity for Datalog¬ under the stable model semantics, which is more suitable for query answering than previous module definitions, and which is also relevant per se. A module under this definition guarantees independent evaluation of queries if the full program is consistent. Otherwise, it guarantees soundness under cautious and completeness under brave reasoning.

Introduction

Datalog¬ programs are function-free logic programs where negation may occur in the bodies of rules. Datalog¬ with stable model semantics¹ (Gelfond & Lifschitz 1988) is a very expressive query language in a precise mathematical sense: under brave (cautious) reasoning², Datalog¬ can express every query that is decidable in the complexity class NP (co-NP) (Schlipf 1995).

In many recent proposals for data integration and reasoning on inconsistent databases, query answering turned out to be co-NP-complete and, in fact, it was reduced to cautious reasoning on suitable Datalog¬ programs (Arenas, Bertossi, & Chomicki 2000; Greco, Greco, & Zumpano 2001; Lembo 2004; Bravo & Bertossi 2003). An important feature of these programs is that they are consistent, viz. that a stable model is guaranteed to exist. However, given the co-NP hardness of their evaluation, the design of optimization techniques is of utmost importance for applications in real scenarios, where the size of the input may be huge.

In this paper, we focus on the optimization of Datalog¬ programs, by discussing an extension of the well-known Magic Set method (Bancilhon et al. 1986; Beeri & Ramakrishnan 1991). This method exploits the fact that while answering a user query, often only a certain part of the stable models needs to be considered, so there is no need to compute these models in their entirety. In fact, its aim is to focus the instantiation of the program on those ground rules that are really needed to answer the query, by propagating binding information from the query goal into the program rules. Differing from the original method, Datalog¬ also requires body-to-head propagation in the presence of unstratified negation. The key idea is then to identify rules for which this is necessary, which we term dangerous rules.

The formal properties of the proposed approach have been analyzed in depth. First, we show that the program obtained is query equivalent under brave and cautious reasoning to the original program if the latter is consistent, making it a perfect fit for applications where consistency is guaranteed. If the original program is not guaranteed to be consistent, we can still show that on the transformed program, brave reasoning is complete, and cautious reasoning is sound with respect to the original program.

In order to establish the above results, we introduce a suitable notion of modularity for query answering over Datalog¬ programs. Previous notions like splitting sets of (Lifschitz & Turner 1994) and modules of (Eiter, Gottlob, & Mannila 1997) have been defined for stable model generation, while the new notion is tailored to query answering.

Finally, we analyze the complexity of determining whether a predicate is dangerous in a given program, which is a central notion of our Magic Set method. It turns out that this task is NL-complete and thus tractable.

∗ Supported by M.I.U.R. within projects “Potenziamento e Applicazioni della Programmazione Logica Disgiuntiva” and “Sistemi basati sulla logica per la rappresentazione di conoscenza: estensioni e tecniche di ottimizzazione,” and “tocai.it: Tecnologie Orientate alla Conoscenza per Aggregazioni di Imprese in Internet.” Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
¹ Unless explicitly specified, Datalog¬ will always denote Datalog with negation under stable model semantics in this paper.
² Note that brave and cautious consequences are also called possible and certain answers, respectively.

Datalog¬ Programs

An atom $p(t_1, \ldots, t_k)$ is composed of a predicate symbol $p$ of arity $k$ and terms $t_1, \ldots, t_k$, each of which is either a constant or a variable. A literal is either an atom $a$ or its
negation $\mathit{not}\ a$. A (Datalog¬) rule $r$ is of the form

$$h \text{ :- } b_1, \ldots, b_m, \mathit{not}\ b_{m+1}, \ldots, \mathit{not}\ b_n. \qquad (1)$$

where $h, b_1, \ldots, b_n$ are atoms and $0 \le m \le n$. The atom $h$ is called the head of the rule. A Datalog¬ program is a set of Datalog¬ rules. If all defining rules of a predicate $p$ are facts (that is, $n = m = 0$), then $p$ is an EDB predicate; otherwise $p$ is an IDB predicate. A set of facts for EDB predicates of a program $P$ is called an EDB (for $P$). A query $Q$ is just an atom.

Let the (Herbrand) universe and base for a Datalog¬ program $P$ be denoted $U_P$ and $B_P$, respectively. The ground instantiation of $P$ w.r.t. $U_P$ is denoted by $\mathit{Ground}(P)$. An interpretation is a subset of $B_P$. A ground positive literal $A$ (resp. negative literal $\mathit{not}\ A$) is true w.r.t. $I$ if $A \in I$ (resp. $A \notin I$); otherwise it is false. An interpretation $I$ satisfies a ground rule $r \in \mathit{Ground}(P)$ if the head of $r$ is true w.r.t. $I$ whenever the body of $r$ is true w.r.t. $I$. An interpretation $I$ is a model of a Datalog¬ program $P$ if $I$ satisfies all rules in $\mathit{Ground}(P)$.

Each $\mathit{not}$-free program $P$ has a least (under subset inclusion) model, which is denoted by $\mathit{LM}(P)$ and is the unique stable model of $P$. Given a Datalog¬ program $P$ and an interpretation $I$, the Gelfond-Lifschitz transform $P^I$ is obtained from $\mathit{Ground}(P)$ by deleting all rules containing $\mathit{not}\ b$ where $b \in I$, and deleting all $\mathit{not}$ literals in the remaining rules. The set of stable models of a Datalog¬ program $P$, denoted by $\mathit{SM}(P)$, is the set of interpretations $I$ such that $I = \mathit{LM}(P^I)$.

A program $P$ is consistent if $\mathit{SM}(P) \neq \emptyset$, otherwise it is inconsistent. A program $P$ is data consistent if $P_F = P \cup F$ is consistent for each EDB $F$.

Given a ground atom $a$ and a Datalog¬ program $P$, $a$ is a cautious (or certain) consequence of $P$, denoted by $P \models_c a$, if $\forall M \in \mathit{SM}(P) : a \in M$; $a$ is a brave (or possible) consequence of $P$, denoted by $P \models_b a$, if $\exists M \in \mathit{SM}(P) : a \in M$. Given a query $Q$, $\mathit{Ans}_c(Q, P)$ denotes the set of substitutions $\vartheta$ such that $P \models_c Q\vartheta$; $\mathit{Ans}_b(Q, P)$ denotes the set of substitutions $\vartheta$ such that $P \models_b Q\vartheta$.

Let $P$ and $P'$ be Datalog¬ programs and $Q$ be a query. Then, $P$ is brave-sound w.r.t. $P'$ and $Q$, denoted $P \subseteq^b_Q P'$, if $\mathit{Ans}_b(Q, P_F) \subseteq \mathit{Ans}_b(Q, P'_F)$ is guaranteed for each EDB $F$; $P$ is cautious-sound w.r.t. $P'$ and $Q$, denoted $P \subseteq^c_Q P'$, if $\mathit{Ans}_c(Q, P_F) \subseteq \mathit{Ans}_c(Q, P'_F)$ for all $F$. $P$ is brave-complete (resp., cautious-complete) w.r.t. $P'$ and $Q$, denoted $P \supseteq^b_Q P'$ (resp., $P \supseteq^c_Q P'$), if $\mathit{Ans}_b(Q, P_F) \supseteq \mathit{Ans}_b(Q, P'_F)$ (resp., $\mathit{Ans}_c(Q, P_F) \supseteq \mathit{Ans}_c(Q, P'_F)$). Finally, $P$ and $P'$ are brave-equivalent (resp., cautious-equivalent) w.r.t. $Q$, denoted by $P \equiv^b_Q P'$ (resp., $P \equiv^c_Q P'$), if $P \subseteq^b_Q P'$ and $P \supseteq^b_Q P'$ (resp., $P \subseteq^c_Q P'$ and $P \supseteq^c_Q P'$).

Dangerous Rules and Independent Atom Sets

With every program $P$, we associate a marked directed graph $DG_P = (N, E)$, called the predicate dependency graph of $P$, where (i) each predicate of $P$ is a node in $N$, and (ii) there is an arc $(a, b)$ in $E$ directed from node $a$ to node $b$ if there is a rule $r \in P$ such that two predicates $a$ and $b$ of literals appear in $B(r)$ and $H(r)$ (the body and the head of $r$), respectively. Such an arc is marked if $a$ appears in $B^-(r)$ (the negative body of $r$). A cycle of $DG_P$ is a sequence of nodes $C = n_1, \ldots, n_k$ such that each $n_i$ ($1 < i < k$) occurs exactly once in $C$, $n_1 = n_k$, and each $(n_i, n_{i+1})$ ($1 \le i < k$) is an arc in $DG_P$. An odd cycle in $DG_P$ is a cycle $C = n_1, \ldots, n_k$ such that an odd number of the arcs $(n_i, n_{i+1})$ ($1 \le i < k$) is marked. In analogy, one can also define the atom dependency graph $DG^A_P$ of a ground program $P$, by considering atoms rather than predicates. We now use this notion to define dangerous predicates and rules. The intuition is that dangerous predicates may inhibit a stable model.

Definition 1 Let $P$ be a program (resp., ground program), and $d$ be a predicate (resp., atom) of $P$. Then, we say that $d$ is dangerous if either
1. $d$ occurs in an odd cycle of $DG_P$ (resp., $DG^A_P$), or
2. $d$ occurs in the body of a rule with a dangerous head predicate (resp., atom).
A rule $r$ is dangerous if it contains a dangerous predicate (resp., atom) in the head. □

Based on this definition, we define a notion of independence for sets of atoms. These sets must be closed under rules in the head-to-body direction and in the body-to-head direction for dangerous rules. The defining rules of these sets then form modules.

Definition 2 An independent atom set of a ground program $P$ is a set $S \subseteq B_P$ such that for each atom $a \in S$ the following holds:
1. if $a$ is the head of a rule $r \in P$ then all atoms of $r$ are in $S$, and
2. if $a$ appears in the body of a dangerous rule $r \in P$ then all atoms of $r$ are in $S$.
A subset $T$ of a program $P$ is a module if $T = \{r \mid \text{the head of } r \text{ is in } S\}$ for some independent atom set $S$. □

These modules can be used to partially evaluate programs, as the following results show.

Theorem 1 Let $T$ be a module of a ground program $P$, then given an arbitrary EDB $F$, the following holds:
1. $\mathit{SM}(P_F)/T_F$³ $\subseteq \mathit{SM}(T_F)$, and
2. $\mathit{SM}(T_F) = \mathit{SM}(P_F)/T_F$, if $P_F$ is consistent.

³ For an interpretation $I$ and a set $T$ of rules, $I/T$ denotes the restriction of $I$ to $T$, precisely, $I/T = I \cap B_T$.

Corollary 2 Let $T$ be a module of a ground program $P$ and $F$ be an EDB. If $P_F$ is consistent, each stable model of $P_F$ can be obtained by enlarging a stable model of $T_F$.

Importantly, this also means that one obtains the same answers to a query by considering only the module in which the query predicate is contained, under the proviso that the original program is consistent.

Theorem 3 Given query $Q$, which is covered by module $T$ of a ground program $P$, and an EDB $F$, such that $P_F$ is consistent, then it holds that:
1. $\mathit{Ans}_b(Q, T_F) = \mathit{Ans}_b(Q, P_F)$, and
2. $\mathit{Ans}_c(Q, T_F) = \mathit{Ans}_c(Q, P_F)$.

More generally, the following properties hold for query answering:

Theorem 4 Given query $Q$, which is covered by module $T$ of a ground program $P$, it holds that:
1. $T \supseteq^b_Q P$ and $T \subseteq^c_Q P$;
2. $T \equiv^b_Q P$ and $T \equiv^c_Q P$, if $P$ is data consistent.

Magic Sets for Datalog¬ Programs

While with the notion of dangerous rules and modules we have an adequate means for identifying subprograms which are sufficient to answer a query, the definitions are not constructive. For this reason we next present a generalization of the Magic Set method for Datalog¬ programs, which effectively constructs a module for answering a given query.

The algorithm is given in Figure 1. Unlike the traditional phased approach of Magic Sets, it uses a stack-based architecture, where the stack holds adorned predicates (predicates with an associated binding pattern, viz. one of b [bound] or f [free] for each argument). The idea is to create so-called magic predicates, which restrict the range of constants for predicate arguments to those relevant for the query. The relevant defining rules are then modified to include these magic predicates. Rules which are never reached from the query and are not dangerous will not be modified, and thus not considered.

Input: A Datalog¬ program $P$, and a query $Q = g(\bar{t})$.
Output: The optimized program $\mathit{MS}^{\neg}(Q, P)$.
var $S$: stack of adorned predicates; modifiedRules, magicRules: set of rules;
begin
 1. modifiedRules := $\emptyset$; magicRules := BuildQuerySeeds($Q$, $S$);
 2. while $S \neq \emptyset$ do
 3.   $p^\alpha$ := $S$.pop();
 4.   for each rule $r \in P$ with $H(r) = p(\bar{t})$ do
 5.     $r_a$ := Adorn($r$, $p^\alpha$, $S$);
 6.     magicRules := magicRules $\cup$ Generate($r_a$);
 7.     modifiedRules := modifiedRules $\cup$ {Modify($r_a$)};
 8.   end for
 9.   for each dangerous rule $d \in P$ of the form $h(\bar{t}_h) \text{ :- } Q_1(\bar{t}_1), \ldots, Q_m(\bar{t}_m)$ where $Q_i = p$ or $Q_i = \mathit{not}\ p$ do
10.     let $d_s$ be the rule $p(\bar{t}_i) \text{ :- } h(\bar{t}_h), Q_1(\bar{t}_1), \ldots, Q_{i-1}(\bar{t}_{i-1}), Q_{i+1}(\bar{t}_{i+1}), \ldots, Q_m(\bar{t}_m)$;
11.     let $d_a$ := Adorn($d_s$, $p^\alpha$, $S$);
12.     magicRules := magicRules $\cup$ Generate($d_a$);
13.   end for
14. end while
15. $\mathit{MS}^{\neg}(Q, P)$ := magicRules $\cup$ modifiedRules;
16. return $\mathit{MS}^{\neg}(Q, P)$;
end.

Figure 1: Magic Set Algorithm for Datalog¬ Programs.

The algorithm begins by analyzing the binding pattern of the query, pushing the respective adorned query predicate onto the stack and possibly storing some fact for the associated magic predicate (via the function BuildQuerySeeds). After this initialization, the algorithm performs several steps for each adorned predicate by popping them from the stack (adorned predicates are pushed on the stack only once).

In steps 4-8, the binding of $p^\alpha$ is propagated head-to-body in each rule $r$ of $P$ having an atom $p(\bar{t})$ in the head. This propagation works as in the standard Magic Set method for stratified Datalog¬ programs using an appropriate sideways information passing strategy (SIPS).

Steps 9-13 perform the propagation of the binding through each dangerous rule $d$ in $P$ of the form $h(\bar{t}_h) \text{ :- } Q_1(\bar{t}_1), \ldots, Q_m(\bar{t}_m)$, in which the predicate $p$ occurs in $Q_i(\bar{t}_i)$ inside the positive or negative body. These steps are, in fact, crucial for guaranteeing cautious-completeness and brave-soundness for consistent programs. In this case, in order to simulate body-to-head propagations and to minimize the effort of doing so, we swap the head and the matching body literal and apply the standard method as if it were a head-to-body propagation. So, the rule $d$ is first replaced by an “inverted” rule $d_s$ of the form $p(\bar{t}_i) \text{ :- } h(\bar{t}_h), Q_1(\bar{t}_1), \ldots, Q_{i-1}(\bar{t}_{i-1}), Q_{i+1}(\bar{t}_{i+1}), \ldots, Q_m(\bar{t}_m)$, which has been obtained by swapping the head atom with the body atom (possibly occurring negated) propagating the binding. Then, the adornment can be carried out as usual by means of the function Adorn. Since this “inverted” rule was not part of the original program and its only purpose is generating binding information, it will not give rise to a modified rule, but only to magic rules. Finally, after all the adorned predicates have been processed, the algorithm outputs the program $\mathit{MS}^{\neg}(Q, P)$.

We would like to show that $\mathit{MS}^{\neg}(Q, P)$ forms a module of $P$ which covers the query predicate. There is one technical issue, however, because the algorithm creates new symbols (the magic predicates). This means that $\mathit{MS}^{\neg}(Q, P)$ formally is not a subset of $P$ and hence cannot be a module by itself. But we can partially evaluate the program on the predicates that were introduced during the transformation, and consider the obtained equivalent ground program, which we can show to be indeed a module of $\mathit{Ground}(P)$. These considerations give rise to the following central result.

Theorem 5 Let $P$ be a Datalog¬ program, let $Q$ be a query, and $F$ be an EDB. Then, the following holds:
1. $\mathit{MS}^{\neg}(Q, P) \subseteq^c_Q P$ and $\mathit{MS}^{\neg}(Q, P) \supseteq^b_Q P$;
2. $\mathit{Ans}_b(Q, \mathit{MS}^{\neg}(Q, P)_F) = \mathit{Ans}_b(Q, P_F)$, if $P_F$ is consistent;
3. $\mathit{Ans}_c(Q, \mathit{MS}^{\neg}(Q, P)_F) = \mathit{Ans}_c(Q, P_F)$, if $P_F$ is consistent;
4. $\mathit{MS}^{\neg}(Q, P) \equiv^b_Q P$ and $\mathit{MS}^{\neg}(Q, P) \equiv^c_Q P$, if $P$ is data consistent.

Complexity Results

A desideratum of the Magic Set algorithm is of course that it runs in polynomial time in terms of the input program. All operations in the algorithm are deterministic and most are clearly simple. Moreover, under the realistic assumption that the maximum arity of predicates is bounded by a constant, it is also easy to see that the number of adorned predicates is bounded by a polynomial. There is however one step in the
algorithm that needs clarification, the cost of determining dangerous predicates and rules.

Theorem 6 Let $P$ be a program and $d$ be a predicate. Then, deciding whether $d$ is dangerous is NL-complete.

This result is not straightforward, as deciding whether the first condition in Def. 1 holds is NP-complete. We conclude that the Magic Set algorithm is tractable for programs with bounded predicate arities. We can also specify the complexity for computing the set of dangerous rules.

Theorem 7 Let $P$ be a Datalog¬ program. All dangerous rules can be computed in $O(|\mathit{Preds}(P)|^3 + |P|)$, where $\mathit{Preds}(P)$ is the set of IDB predicates of $P$.

An Application to Data Integration

This work was motivated by needs arising within the project INFOMIX on data integration (Leone et al. 2005), funded by the European Commission. Here, a fully functional data integration system has been implemented by exploiting a reduction from query answering in data integration systems to cautious reasoning over Datalog¬ programs. Indeed, a data integration system $I = \langle G, S, M \rangle$ (where $G$ is a global schema, $S$ are source schemata and $M$ is a mapping) together with a database $D$ for $S$ and a query $q$ over $G$ are encoded into a Datalog¬ program $\Pi(q, I)$, in a way that cautious answers for $\Pi(q, I)$ correspond to the answers given by the data integration system, cf. (Lembo, Lenzerini, & Rosati 2003; Lembo 2004).

The binding propagation techniques proposed in this paper can be profitably exploited to isolate the relevant part of a database. Importantly, our optimization fits perfectly into the data integration framework. Indeed, the loosely-sound semantics for data integration always guarantees the existence of a database repair regardless of the types of constraints in $\Sigma$, provided that the schema is non-key-conflicting (Lembo 2004). Thus, the resulting logic program is consistent and our rewriting fully preserves the original semantics of the data-integration query, since query equivalence is ensured (Theorem 5).

Theorem 8 Let $I = \langle G, S, M \rangle$ be a data integration system, $D$ be a database for $S$, and $q$ be a query over $G$. Then, $\mathit{ans}(q, I, D)$ coincides with $\mathit{Ans}_c(q, \mathit{MS}^{\neg}(q, \Pi(q, I)) \cup D)$.

In order to test the effectiveness of the Magic Set technique in data integration systems, we have carried out some experiments on the demonstration scenario of the INFOMIX project, which refers to the information system of the University “La Sapienza” in Rome. A detailed discussion of the results is available in (INFOMIX Project Team 2004). The results confirmed that on various practical queries the performance is considerably improved by Magic Sets, in some cases even by orders of magnitude.

Our Magic Set technique provides benefits with respect to two crucial parameters in INFOMIX. By limiting the computation to the fraction of the retrieved global database which is relevant to the query binding, it generally produces smaller ground programs. Moreover, by disregarding mapping conflicts that are irrelevant for answering the query at hand, it can actually give rise to exponential savings.

Conclusion

We have provided a brief overview of an extension of the Magic Set method for Datalog¬ programs under the stable model semantics. The technique guarantees query equivalence for consistent programs, which occur in applications like data integration or querying inconsistent databases. Future work includes studying promising combinations with works such as (Bonatti 2004).

References

Arenas, M.; Bertossi, L. E.; and Chomicki, J. 2000. Specifying and querying database repairs using logic programs with exceptions. In FQAS 2000, 27–41.
Bancilhon, F.; Maier, D.; Sagiv, Y.; and Ullman, J. D. 1986. Magic Sets and Other Strange Ways to Implement Logic Programs. In PODS’86, 1–16.
Beeri, C., and Ramakrishnan, R. 1991. On the power of magic. JLP 10(1–4):255–259.
Bonatti, P. A. 2004. Reasoning with infinite stable models. Artificial Intelligence 156(1):75–111.
Bravo, L., and Bertossi, L. 2003. Logic programs for consistently querying data integration systems. In IJCAI 2003, 10–15.
Eiter, T.; Gottlob, G.; and Mannila, H. 1997. Disjunctive Datalog. ACM TODS 22(3):364–418.
Faber, W.; Greco, G.; and Leone, N. 2005. Magic sets and their application to data integration. In ICDT’05, LNCS 3363.
Faber, W.; Greco, G.; and Leone, N. 2007. Magic Sets and their Application to Data Integration. JCSS 73(4):584–609.
Gelfond, M., and Lifschitz, V. 1988. The Stable Model Semantics for Logic Programming. In ICLP’88, 1070–1080. Cambridge, Mass.: MIT Press.
Greco, G.; Greco, S.; and Zumpano, E. 2001. A logic programming approach to the integration, repairing and querying of inconsistent databases. In ICLP’01, LNCS 2237.
INFOMIX Project Team. 2004. Demo Scenario. Tech. Report INFOMIX S7-1, INFOMIX Project Consortium. http://sv.mat.unical.it/infomix.
Lembo, D.; Lenzerini, M.; and Rosati, R. 2003. Methods and techniques for query rewriting. Tech. Report D5.2, Infomix Consortium.
Lembo, D. 2004. Dealing with Inconsistency and Incompleteness in Data Integration. Ph.D. Dissertation, Università di Roma “La Sapienza”.
Leone, N.; Gottlob, G.; Rosati, R.; Eiter, T.; Faber, W.; Fink, M.; Greco, G.; Ianni, G.; Kałka, E.; Lembo, D.; Lenzerini, M.; Lio, V.; Nowicki, B.; Ruzzi, M.; Staniszkis, W.; and Terracina, G. 2005. The INFOMIX System for Advanced Integration of Incomplete and Inconsistent Data. In SIGMOD 2005, 915–917. ACM Press.
Lifschitz, V., and Turner, H. 1994. Splitting a Logic Program. In ICLP’94, 23–37. MIT Press.
Schlipf, J. S. 1995. The Expressive Powers of Logic Programming Semantics. JCSS 51(1):64–86.
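
As an illustration of the definitions of the Gelfond-Lifschitz transform, stable models, and brave and cautious consequences, the following Python sketch enumerates the stable models of a small ground program by brute force. It is not part of the original work: the rule encoding (triples of head, positive body, negative body) and all function names are assumptions made for this example, and the exhaustive enumeration is exponential in the number of atoms, so it is only suitable for toy programs.

```python
from itertools import combinations

# Hypothetical encoding: a ground rule is (head, positive_body, negative_body),
# where atoms are strings and the bodies are frozensets of atoms.

def least_model(positive_rules):
    """Least model of a not-free ground program, by naive fixpoint iteration."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, pos in positive_rules:
            if head not in model and pos <= model:
                model.add(head)
                changed = True
    return model

def reduct(rules, interp):
    """Gelfond-Lifschitz transform: delete rules containing 'not b' with b in
    interp, then delete the remaining 'not' literals."""
    return [(head, pos) for head, pos, neg in rules if not (neg & interp)]

def stable_models(rules):
    """All interpretations I with I == least_model(reduct(rules, I))."""
    atoms = set()
    for head, pos, neg in rules:
        atoms |= {head} | pos | neg
    atoms = sorted(atoms)
    models = []
    for size in range(len(atoms) + 1):
        for candidate in combinations(atoms, size):
            interp = set(candidate)
            if least_model(reduct(rules, interp)) == interp:
                models.append(frozenset(interp))
    return models

def brave(rules, atom):
    """atom belongs to at least one stable model."""
    return any(atom in m for m in stable_models(rules))

def cautious(rules, atom):
    """atom belongs to every stable model (vacuously true if there is none)."""
    return all(atom in m for m in stable_models(rules))

# a :- not b.   b :- not a.   c :- a.   c :- b.
example = [
    ("a", frozenset(), frozenset({"b"})),
    ("b", frozenset(), frozenset({"a"})),
    ("c", frozenset({"a"}), frozenset()),
    ("c", frozenset({"b"}), frozenset()),
]
print(brave(example, "a"), cautious(example, "a"), cautious(example, "c"))
# prints: True False True
```

The single rule p :- not p has no stable model under this encoding, which is the simplest instance of an inconsistent program; every atom is then a cautious consequence, in line with the universal quantification in the definition.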

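
Definition 1 can also be made concrete with a small Python sketch that computes the dangerous predicates of a program from its marked dependency graph. This is one possible polynomial-time procedure and not necessarily the one underlying Theorems 6 and 7. It rests on an observation that is not stated in the text above: a closed walk with an odd number of marked arcs always contains a simple odd cycle, so, once condition 2 of Definition 1 is taken into account, a predicate is dangerous exactly when it can reach a predicate that lies on such a closed walk. The rule encoding and the function names are hypothetical.

```python
from collections import deque

# Hypothetical encoding: a rule is (head_pred, positive_body_preds, negative_body_preds).

def dependency_arcs(rules):
    """Arcs (body_pred, head_pred, marked); an arc is marked when the body
    predicate occurs in the negative body of the rule."""
    arcs = set()
    for head, pos, neg in rules:
        arcs |= {(b, head, False) for b in pos}
        arcs |= {(b, head, True) for b in neg}
    return arcs

def dangerous_predicates(rules):
    succ = {}
    nodes = set()
    for a, b, marked in dependency_arcs(rules):
        nodes |= {a, b}
        succ.setdefault(a, []).append((b, 1 if marked else 0))

    def reach(start):
        """(node, parity) states reachable from start via at least one arc,
        where parity counts marked arcs modulo 2."""
        seen = set()
        queue = deque([(start, 0)])
        while queue:
            node, parity = queue.popleft()
            for nxt, weight in succ.get(node, []):
                state = (nxt, parity ^ weight)
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
        return seen

    reachable = {n: reach(n) for n in nodes}
    # Predicates on a closed walk with an odd number of marked arcs.
    on_odd_walk = {n for n in nodes if (n, 1) in reachable[n]}
    # Closure under condition 2: everything that can reach such a predicate.
    return {n for n in nodes
            if n in on_odd_walk
            or any(m in on_odd_walk for m, _ in reachable[n])}

def dangerous_rules(rules):
    """Rules whose head predicate is dangerous."""
    danger = dangerous_predicates(rules)
    return [r for r in rules if r[0] in danger]

# p :- q, not r.   r :- p.   s :- t.   a :- not b.   b :- not a.
example = [
    ("p", {"q"}, {"r"}),
    ("r", {"p"}, set()),
    ("s", {"t"}, set()),
    ("a", set(), {"b"}),
    ("b", set(), {"a"}),
]
print(sorted(dangerous_predicates(example)))  # prints: ['p', 'q', 'r']
```

In the example, p and r lie on a cycle with exactly one marked arc, q feeds a rule with a dangerous head, and the cycle through a and b has two marked arcs and is therefore not odd.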