
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 1, NO. 1, MARCH 1989

What You Always Wanted to Know About Datalog (And Never Dared to Ask)

STEFANO CERI, GEORG GOTTLOB, AND LETIZIA TANCA

Abstract: Datalog is a database query language based on the logic programming paradigm; it has been designed and intensively studied over the last five years. We present the syntax and semantics of Datalog and its use for querying a relational database. Then, we classify optimization methods for achieving efficient evaluations of Datalog queries, and present the most relevant methods. Finally, we discuss various enhancements of Datalog, currently under study, and indicate what is still needed in order to extend Datalog's applicability to the solution of real-life problems. The aim of this paper is to provide a survey of research performed on Datalog, also addressed to those members of the database community who are not too familiar with logic programming concepts.

Index Terms: Deductive databases, recursive queries, relational databases.

I. INTRODUCTION

Recent years have seen substantial efforts in the direction of merging artificial intelligence and database technologies for the development of large and persistent knowledge bases. An important contribution towards this goal comes from the integration of logic programming and databases. The community has concentrated mostly on well-formalized issues, like the definition of a new rule-based language, called Datalog, specifically designed for interacting with large databases; and the definition of optimization methods for various types of Datalog rules, together with the study of their efficiency [120], [15], [16]. In parallel, various experimental projects have shown the feasibility of Datalog programming environments [119], [24], [88].¹

¹The term "Datalog" was also used by Maier and Warren in [82] to denote a subset of Prolog.

Present efforts in the integration of artificial intelligence and databases take a much more basic and pragmatic approach; in particular, several attempts fall in the category of "loose coupling," where existing AI and DB environments are interconnected through ad-hoc interfaces. In other cases, AI systems have solved persistency issues by developing internal databases for their tools; but these internal databases typically do not allow data sharing and recovery, thus do not properly belong to current database technology [26]. The spread and success of such enhanced AI systems, however, indicate that there is a great need for them.

Loose coupling has been attempted in the area of Logic Programming and databases as well, by interconnecting Prolog systems to relational databases [42], [25], [46], [W], [33], [97]. Although interesting results have been achieved, most studies indicate that simple interfaces are too inefficient; an enhancement in efficiency is achieved by intelligent interfaces [64], [33]. This indicates that loose coupling might in fact solve today's problems, but in the long run, strong integration is required. More generally, we expect that knowledge base management systems will provide direct access to data and will support rule-based interaction as one of the programming paradigms. Datalog is a first step in this direction.

The reaction of the database community to Datalog has often been marked by skepticism; in particular, the immediate or even future practical use of research on sophisticated rule-based interactions has been questioned [94]. Nevertheless, we do expect that Datalog's experience, properly filtered, will teach important lessons to researchers involved in the development of knowledge base systems. The purpose of this paper is to give a self-contained survey of the research that has been performed recently on Datalog. More exhaustive treatments can be found in the books [34], [122], and [92]. Other interesting survey papers, approaching the subject from different perspectives, are [53] and [100].

This paper is organized as follows. In Section II we present the foundations of Datalog: the syntactic structure and the semantics of Datalog programs. In Section III we explain how Datalog is used as a query language over relational databases; in particular, we indicate how Datalog can be immediately translated to equations of relational algebra. In Section IV we present a taxonomy of the various optimization methods, emphasizing the distinction between program transformation and evaluation methods. In Section V, we present a survey of evaluation methods and program transformation techniques, selecting the most representative ones within the classes outlined in Section IV. In Section VI, we present several formal extensions given to the Datalog language to enhance its expressiveness, and we survey some of the current research projects in this area. Finally, in Section VII we attempt an evaluation of what will be required in Datalog in order to become more attractive and usable.

Manuscript received March 1, 1989. This paper is based on Parts II and III of the book by S. Ceri, G. Gottlob, and L. Tanca, Logic Programming and Databases (New York: Springer-Verlag, to be published).
S. Ceri is with the Dipartimento di Matematica, Università di Modena, Modena, Italy.
G. Gottlob is with the Institut für Angewandte Informatik, TU Wien, Austria.
L. Tanca is with the Dipartimento di Elettronica, Politecnico di Milano, Milano, Italy.
IEEE Log Number 8928155.
1041-4347/89/0300-0146$01.00 © 1989 IEEE

II. DATALOG: SEMANTICS AND EVALUATION PARADIGMS

In this section we define the syntax of the Datalog query language and explain its logical semantics. We give a model-theoretic characterization of Datalog programs and show how they can be evaluated in a bottom-up fashion. This evaluation corresponds to computing a least fixpoint. Finally, we briefly mention another evaluation paradigm called "top-down evaluation," and compare Datalog to the well-known logic programming language Prolog.

A. The Syntax of Datalog Programs

Datalog is in many respects a simplified version of general Logic Programming [78]. A logic program consists of a finite set of facts and rules. Facts are assertions about a relevant piece of the world, such as: "John is the father of Harry". Rules are sentences which allow us to deduce facts from other facts. An example of a rule is: "If X is a parent of Y and if Y is a parent of Z, then X is a grandparent of Z". Note that rules, in order to be general, usually contain variables (in our case, X, Y, and Z). Both facts and rules are particular forms of knowledge.

In the formalism of Datalog both facts and rules are represented as Horn clauses of the general shape

$L_0$ :- $L_1, \ldots, L_n$

where each $L_i$ is a literal of the form $p_i(t_1, \ldots, t_{k_i})$ such that $p_i$ is a predicate symbol and the $t_j$ are terms. A term is either a constant or a variable.² The left-hand side (LHS) of a Datalog clause is called its head and the right-hand side (RHS) is called its body. The body of a clause may be empty. Clauses with an empty body represent facts; clauses with at least one literal in the body represent rules.

²Note that in general Logic Programming, a term can also be a complex nested structure made of function symbols, constants, and variables. Extensions of Datalog to cover such complex terms are outlined in Section VI.

The fact "John is the father of Bob", for example, can be represented as father(bob, john). The rule "If X is a parent of Y and if Y is a parent of Z, then X is a grandparent of Z" can be represented as

grandpar(Z, X) :- par(Y, X), par(Z, Y).

Here the symbols par and grandpar are predicate symbols, the symbols john and bob are constants, and the symbols X, Y, and Z are variables. We will use the following notational convention: constants and predicate symbols are strings beginning with a lower-case character; variables are strings beginning with an upper-case character. Note that for a given Datalog program it is always clear from the context whether a particular nonvariable symbol is a constant or a predicate symbol. We require that all literals with the same predicate symbol are of the same arity, i.e., that they have the same number of arguments. A literal, fact, rule, or clause which does not contain any variables is called ground.

Any Datalog program P must satisfy the following safety conditions.
• Each fact of P is ground.
• Each variable which occurs in the head of a rule of P must also occur in the body of the same rule.
These conditions guarantee that the set of all facts that can be derived from a Datalog program is finite.

B. Datalog and Relational Databases

In the context of general Logic Programming it is usually assumed that all the knowledge (facts and rules) relevant to a particular application is contained within a single logic program P. Datalog, on the other hand, has been developed for applications which use a large number of facts stored in a relational database. Therefore, we will always consider two sets of clauses: a set of ground facts, called the Extensional Database (EDB), physically stored in a relational database, and a Datalog program P called the Intensional Database (IDB).³ The predicates occurring in the EDB and in P are divided into two disjoint sets: the EDB-predicates, which are all those occurring in the extensional database, and the IDB-predicates, which occur in P but not in the EDB. We require that the head predicate of each clause in P be an IDB-predicate. EDB-predicates may occur in P, but only in clause bodies.

³In the literature an IDB is usually defined as a set of rules. Here the Datalog program P may also contain facts. This terminological extension is harmless.

Ground facts are stored in a relational database; we assume that each EDB-predicate r corresponds to exactly one relation R of our database such that each fact $r(c_1, \ldots, c_k)$ of the EDB is stored as a tuple $\langle c_1, \ldots, c_k \rangle$ of R.

Also the IDB-predicates of P can be identified with relations, called IDB-relations, or also derived relations, defined through the program P and the EDB. IDB relations are not stored explicitly, and correspond to relational views. The materialization of these views, i.e., their effective (and efficient) computation, is the main task of a Datalog compiler or interpreter.

As an example of a relational EDB, consider a database $E_1$ consisting of two relations with respective schemes PERSON(NAME) and PAR(CHILD, PARENT). The first contains the names of persons and the second expresses a parent relationship between persons. Let the actual instances of these relations have the following values:

PERSON = { <ann>, <bertrand>, <charles>, <dorothy>, <evelyn>, <fred>, <george>, <hilary> }

PAR = { <dorothy, george>, <evelyn, george>, <bertrand, dorothy>, <ann, dorothy>, <ann, hilary>, <charles, evelyn> }.
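The notational convention and the two safety conditions of Section II-A, together with the mapping from relation tuples to ground facts, can be checked mechanically. The following Python sketch is illustrative only: the encoding of literals as (predicate, arguments) pairs and the helper names is_var, is_ground, is_safe, and edb_facts are assumptions made for this example and do not come from the paper.

```python
# Assumed encoding for this sketch: a literal is (predicate, args);
# a clause is (head, body), where body is a list of literals and an
# empty body marks a fact.

def is_var(term):
    # Convention of Section II-A: variables begin with an upper-case character.
    return term[:1].isupper()

def is_ground(literal):
    # A literal is ground when none of its arguments is a variable.
    return not any(is_var(t) for t in literal[1])

def is_safe(clause):
    head, body = clause
    if not body:
        # Safety condition 1: each fact is ground.
        return is_ground(head)
    # Safety condition 2: every head variable also occurs in the body.
    body_vars = {t for _, args in body for t in args if is_var(t)}
    return all(t in body_vars for t in head[1] if is_var(t))

def edb_facts(relations):
    # Each tuple <c1, ..., ck> of relation R becomes the ground fact r(c1, ..., ck).
    return {(name, row) for name, rows in relations.items() for row in rows}

grandpar = (("grandpar", ("Z", "X")),
            [("par", ("Y", "X")), ("par", ("Z", "Y"))])
# Hypothetical unsafe rule: Y occurs in the head but not in the body.
unsafe = (("q", ("X", "Y")), [("par", ("X", "X"))])

E1 = {
    "person": {(n,) for n in ("ann", "bertrand", "charles", "dorothy",
                              "evelyn", "fred", "george", "hilary")},
    "par": {("dorothy", "george"), ("evelyn", "george"),
            ("bertrand", "dorothy"), ("ann", "dorothy"),
            ("ann", "hilary"), ("charles", "evelyn")},
}
facts = edb_facts(E1)
```

Under this encoding, is_safe(grandpar) holds while is_safe(unsafe) does not, and facts contains one ground fact per tuple of PERSON and PAR.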

These relations express the set of ground facts $E_1$ = {person(ann), person(bertrand), ..., par(dorothy, george), ..., par(charles, evelyn)}.

Let $P_1$ be a Datalog program consisting of the following clauses:

r1: sgc(X, X) :- person(X).
r2: sgc(X, Y) :- par(X, X1), sgc(X1, Y1), par(Y, Y1).

Due to rule r1, the derived relation SGC will contain a tuple <p, p> for each person p. Rule r2 is recursive and states that two persons are same generation cousins whenever they have parents which are in turn same generation cousins. Further examples of tuples belonging to SGC are: <dorothy, evelyn>, <charles, ann>, and <ann, charles>.

Note that program $P_1$ can be considered as a query against the EDB $E_1$, producing as answer the relation SGC. In this setting, the distinction between the two sets of clauses, $E_1$ and $P_1$, makes yet more sense. Usually a database (in our case the EDB) is considered as a time-varying collection of information. A query (in our case, a program P), on the other hand, is a time-invariant mapping which associates a result to each possible database state. For this reason we will formally define the semantics of a Datalog program P as a mapping from database states to result states. The database states are collections of EDB-facts and the result states are IDB-facts.

Usually Datalog programs define large IDB-relations. It often happens that a user is interested in a subset of these relations. For instance, he or she might want to know the same generation cousins of Ann only rather than all same generation cousins of all persons in the database. To express such an additional constraint, one can specify a goal to a Datalog program. A goal is a single literal preceded by a question mark and a dash, for example, in our case, ?-sgc(ann, X). Goals usually serve to formulate ad hoc queries against a view defined by a Datalog program.

C. The Logical Semantics of Datalog

Each Datalog fact F can be identified with an atomic formula $F^*$ of First-Order Logic. Each Datalog rule R of the form $L_0$ :- $L_1, \ldots, L_n$ represents a first-order formula $R^*$ of the form $\forall X_1 \ldots \forall X_m (L_1 \wedge \ldots \wedge L_n \Rightarrow L_0)$, where $X_1, \ldots, X_m$ are all the variables occurring in R. A set S of Datalog clauses corresponds to the conjunction $S^*$ of all formulas $C^*$ such that $C \in S$.

The Herbrand Base HB is the set of all facts that we can express in the language of Datalog, i.e., all literals of the form $p(c_1, \ldots, c_k)$ such that all $c_i$ are constants. Furthermore, let EHB denote the extensional part of the Herbrand base, i.e., all literals of HB whose predicate is an EDB-predicate and, accordingly, let IHB denote the set of all literals of HB whose predicate is an IDB-predicate. If S is a finite set of Datalog clauses, we denote by cons(S) the set of all facts that are logical consequences of $S^*$.

The semantics of a Datalog program P can now be described as a mapping $\mathcal{M}_P$ from EHB to IHB which to each possible extensional database $E \subseteq EHB$ associates the set $\mathcal{M}_P(E)$ of intensional "result facts" defined by

$\mathcal{M}_P(E) = cons(P \cup E) \cap IHB$.

Let K and L be two literals (not necessarily ground). We say that K subsumes L, denoted by $K \triangleright L$, if there exists a substitution $\theta$ of variables such that $K\theta = L$, i.e., if $\theta$ applied to K yields L. If $K \triangleright L$ we also say that L is an instance of K. For example, q(a, b, b) and q(c, c, c) are both instances of q(X, Y, Y), but q(b, b, a) is not.

When a goal "?- G" is given, then the semantics of the program P with respect to this goal is a mapping $\mathcal{M}_{P,G}$ from EHB to IHB defined as follows:

$\forall E \subseteq EHB: \mathcal{M}_{P,G}(E) = \{ H \mid H \in \mathcal{M}_P(E) \wedge G \triangleright H \}$.

D. The Model Theory of Datalog

Model theory is a branch of mathematical logic which defines the semantics of formal systems, by considering their possible interpretations, i.e., the different intended meanings that the used symbols and formulas may assume. Early developments of general model theory date back to works of Löwenheim [79], Skolem [113], and Gödel [56]. A comprehensive exposition is given in [39]. The model theoretic characterization of general logic theories often requires the use of quite sophisticated tools of modern algebra [83]. The model theory of Horn clause systems and logic programs, however, is less difficult. It has been developed in [125], [11] and is also explained in [78]. Datalog, being a simplified version of Logic Programming, can be described very easily in terms of model theory.

An interpretation (in the context of Datalog) consists of an assignment of a concrete meaning to constant and predicate symbols. A Datalog clause can be interpreted in several different ways. A clause may be true under a certain interpretation and false under another one. If a clause C is true under a given interpretation, we say that this interpretation satisfies C.

The concept of logical consequence, in the context of Datalog, can be defined as follows: a fact F follows logically from a set S of clauses, iff each interpretation satisfying every clause of S also satisfies F. If F follows from S, we write $S \models F$.

Note that this definition captures quite well our intuitive understanding of logical consequence. However, since general interpretations are quite unhandy objects, we will limit ourselves to consider interpretations of a particular type, called Herbrand Interpretations. The recognition that we may forget about all other interpretations is due to the famous logicians Löwenheim, Skolem, and Herbrand.⁴

⁴Actually, Herbrand interpretations can be defined in the much more general context of Clausal Logic [39], [80] and Logic Programming [125], [78].

A Herbrand interpretation assigns to each constant symbol "itself", i.e., a lexicographic entity. Predicate symbols are assigned predicates ranging over constant symbols. Thus, two nonidentical Herbrand interpretations differ only in the respective interpretations of the predicate symbols. For instance, one Herbrand interpretation may satisfy the fact l(t, q) and another one may not satisfy this fact. For this reason, any Herbrand interpretation can be identified with a subset $\mathcal{I}$ of the Herbrand base HB. This subset contains all the ground facts which are true under the interpretation. Thus, a ground fact $p(c_1, \ldots, c_n)$ is true under the interpretation $\mathcal{I}$ iff $p(c_1, \ldots, c_n) \in \mathcal{I}$. A Datalog rule of the form $L_0$ :- $L_1, \ldots, L_n$ is true under $\mathcal{I}$ iff for each substitution $\theta$ which replaces variables by constants, whenever $L_1\theta \in \mathcal{I} \wedge \ldots \wedge L_n\theta \in \mathcal{I}$, then it also holds that $L_0\theta \in \mathcal{I}$.

A Herbrand interpretation which satisfies a clause C or a set of clauses S is called a Herbrand model for C or, respectively, for S.

Consider, for example, the Herbrand interpretation

$\mathcal{I}_1$ = {person(john), person(jack), person(jim), par(jim, john), par(jack, john), sgc(john, john), sgc(jack, jack), sgc(jim, jim)}.

Obviously, $\mathcal{I}_1$ is not a Herbrand model of the program $P_1$ given in Section II-B. Let $\mathcal{I}_2 = \mathcal{I}_1 \cup$ {sgc(jim, jack), sgc(jack, jim)}. It is easy to see that $\mathcal{I}_2$ is a Herbrand model of $P_1$.

The set cons(S) of all consequence facts of a set S of Datalog clauses can thus be characterized as follows: cons(S) is the set of all ground facts which are satisfied by each Herbrand model of S. Since a ground fact F is satisfied by a Herbrand interpretation $\mathcal{I}$ iff $F \in \mathcal{I}$, cons(S) is equal to the intersection of all Herbrand models of S. Summarized:

$cons(S) = \{ F \in HB \mid S \models F \} = \bigcap \{ \mathcal{I} \mid \mathcal{I} \text{ is a Herbrand model of } S \}$.

Let $\mathcal{I}_1$ and $\mathcal{I}_2$ be two Herbrand models of S. It is easy to see that their intersection $\mathcal{I}_1 \cap \mathcal{I}_2$ is also a Herbrand model of S. More generally, it can be shown that the intersection of an arbitrary (possibly infinite) number of Herbrand models of S always yields a Herbrand model of S. This property is called the model intersection property. Note that the model intersection property not only holds for Datalog clauses, but for a more general type of clause called definite Horn clauses [125], [78].

From the model intersection property it follows, in particular, that for each set S of Datalog clauses, cons(S) is a Herbrand model of S. Since cons(S) is a subset of any other Herbrand model of S, we call cons(S) the least Herbrand model of S.

Up to here nothing has been said about how cons(S) can be computed. The following subsections as well as Sections III-V deal with this problem. In the rest of this section we will remain at a quite abstract level and we will not care about implementation and storage issues. In particular, we will ignore that a set S of Datalog clauses may consist of program clauses and of EDB-clauses. The problem of retrieving facts from an EDB is deferred to Sections III-V.

E. The Proof Theory of Datalog

Proof theory is concerned with the analysis of logical inference. As a branch of mathematical logic, this discipline was founded by Hilbert [60]. While the investigation of relevant extensions of first-order logic (e.g., number theory) is faced with considerable problems such as consistency and incompleteness issues [55], [57], the proof theoretic analysis of subformalisms of first-order logic, such as Logic Programming or Datalog, is easier and much less ambitious, and is more oriented towards algorithmic issues.

In this section we show how Datalog rules can be used to produce new facts from given facts. We define the notion of "fact inference" and introduce a proof-theoretic framework which allows one to infer all ground facts which are consequences of a finite set of Datalog clauses.

Consider a Datalog rule R of the form $L_0$ :- $L_1, \ldots, L_n$ and a list of ground facts $F_1, \ldots, F_n$. If a substitution $\theta$ exists such that for each $1 \le i \le n$, $L_i\theta = F_i$, then, from rule R and from the facts $F_1, \ldots, F_n$, we can infer in one step the fact $L_0\theta$. The inferred fact may be either a new fact or it may be already known.

What we have just described is a general inference rule, which produces new Datalog facts from given Datalog rules and facts. We refer to this rule as the Elementary Production Principle (EPP). In some sense, EPP can be considered as being a meta-rule, since it is independent of any particular Datalog rules, and treats them just as syntactic entities.

For example, consider the Datalog rule r1 of program $P_1$:

r1: sgc(X, X) :- person(X).

From this rule and from the fact person(george) we can infer in one step sgc(george, george). The substitution used here was $\theta = \{X \leftarrow george\}$. Now recall the second rule of $P_1$

r2: sgc(X, Y) :- par(X, X1), sgc(X1, Y1), par(Y, Y1)

and consider the facts par(dorothy, george), sgc(george, george), and par(evelyn, george). By applying EPP and using the substitution $\theta = \{X \leftarrow dorothy, Y \leftarrow evelyn, X1 \leftarrow george, Y1 \leftarrow george\}$, we infer in one step sgc(dorothy, evelyn).

Let S be a set of Datalog clauses. Informally, a ground fact F can be inferred from S, denoted by $S \vdash F$, iff either $F \in S$ or F can be obtained by applying the inference rule EPP a finite number of times. The relationship "$\vdash$" is more precisely defined by the following recursive rules:
• $S \vdash F$ if $F \in S$.
• $S \vdash F$ if a rule $R \in S$ and ground facts $F_1, \ldots, F_n$ exist such that $\forall\, 1 \le i \le n: S \vdash F_i$ and F can be inferred in one step by the application of EPP to R and $F_1, \ldots, F_n$.

The sequence of applications of EPP which is used to infer a ground fact F from S is called a proof of F from S. Any proof can be represented as a proof tree with different levels and with the derived fact F at the top.

Let $S_1$ denote the set of all clauses present in our example-program $P_1$ and in our example-database $E_1$, i.e., $S_1 = P_1 \cup E_1$. It is easy to see that $S_1 \vdash$ sgc(ann, charles). The corresponding proof-tree is depicted in Fig. 1.
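The Elementary Production Principle and the relation "$\vdash$" can be made concrete with a small Python sketch. It is a minimal illustration under stated assumptions, not the paper's implementation: literals are encoded as (predicate, arguments) pairs, the helper names match, epp, and derivable are invented for this example, and epp tries every list of known facts against the rule body rather than taking one ordered list as in the text.

```python
def is_var(term):
    # Variables begin with an upper-case character (Section II-A convention).
    return term[:1].isupper()

def match(literal, fact, theta):
    """Extend substitution theta so that literal under theta equals fact.

    Returns the extended substitution, or None if matching fails or the
    bindings are incompatible with theta.
    """
    (pred, args), (fpred, consts) = literal, fact
    if pred != fpred or len(args) != len(consts):
        return None
    theta = dict(theta)
    for a, c in zip(args, consts):
        if is_var(a):
            if theta.setdefault(a, c) != c:   # variable already bound elsewhere
                return None
        elif a != c:                          # constant mismatch
            return None
    return theta

def epp(rule, facts):
    """All facts inferable in one step from rule and the given ground facts."""
    head, body = rule
    thetas = [{}]
    for literal in body:
        thetas = [t2 for t in thetas for f in facts
                  if (t2 := match(literal, f, t)) is not None]
    return {(head[0], tuple(t.get(a, a) for a in head[1])) for t in thetas}

def derivable(rules, facts):
    """Apply EPP repeatedly until no new ground fact can be inferred."""
    known = set(facts)
    while True:
        new = set().union(*(epp(r, known) for r in rules)) - known
        if not new:
            return known
        known |= new

# Program P1 and database E1 from Section II-B.
r1 = (("sgc", ("X", "X")), [("person", ("X",))])
r2 = (("sgc", ("X", "Y")),
      [("par", ("X", "X1")), ("sgc", ("X1", "Y1")), ("par", ("Y", "Y1"))])
E1 = {("person", (n,)) for n in ("ann", "bertrand", "charles", "dorothy",
                                 "evelyn", "fred", "george", "hilary")}
E1 |= {("par", p) for p in [("dorothy", "george"), ("evelyn", "george"),
                            ("bertrand", "dorothy"), ("ann", "dorothy"),
                            ("ann", "hilary"), ("charles", "evelyn")]}

closure = derivable([r1, r2], E1)
```

With these definitions, epp(r1, {("person", ("george",))}) yields the single fact sgc(george, george), and closure contains sgc(ann, charles), in line with the claim that $S_1 \vdash$ sgc(ann, charles). The set-at-a-time loop in derivable is a design choice of the sketch; adding one fact per step would reach the same set.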

Now that we have a proof-theoretic framework which allows us to infer new ground facts from an original set of Datalog clauses S, let us compare this approach to the model-theoretic approach presented in the last subsection. The following important theorem holds.

Soundness and Completeness Theorem: Let S be a set of Datalog clauses and let F be a ground fact. $S \vdash F$ iff $S \models F$.

A proof of this theorem can be found in [34].

Fig. 1. Proof tree of sgc(ann, charles).

In order to check whether EPP applies to a rule R of the form $L_0$ :- $L_1, \ldots, L_n$ and to an (ordered) list of ground facts $F_1, \ldots, F_n$, one has to find an appropriate substitution $\theta$ such that for each $1 \le i \le n$, $L_i\theta = F_i$. Such a substitution can be found by matching each $L_i$ in the rule body against the corresponding fact $F_i$. Such a matching either fails or provides a "local" substitution $\theta_i$ for all variables occurring in $L_i$. If at least one matching fails or if the local substitutions are not mutually compatible because they assign different constants to the same variable, then EPP does not apply. If all matchings are successful and if the substitutions are all compatible, then we obtain the global substitution $\theta$ by forming the union of all local substitutions $\theta = \theta_1 \cup \ldots \cup \theta_n$. (Note that matching is a particular form of unification [34].)

There exists a very simple method of computing cons(S):

    FUNCTION INFER(S)
    INPUT: a finite set S of Datalog clauses
    OUTPUT: cons(S)
    BEGIN
      W := S;
      WHILE EPP applies to some rule and facts of W
            producing a new ground fact F ∉ W
        DO W := W ∪ {F};
      RETURN(W ∩ HB)  /* all facts of W, but not the rules */
    END.

The INFER algorithm always terminates and produces as output a finite set of facts cons(S). If we assume that the arities of all predicates that may occur in a Datalog program are bound by a constant, then the output of the INFER algorithm and its runtime are both polynomial in the size of its input S.

The order in which INFER generates new facts corresponds to the bottom-up order of proof trees. For this reason, the principle underlying INFER is called bottom-up evaluation. In the terminology of artificial intelligence this principle is also referred to as forward chaining, because the Datalog rules are processed forward, i.e., in the sense of the logical implication sign, from premises to conclusions.

F. Fixpoint Characterization of cons(S)

Let us show that the set cons(S) can be characterized as the least fixpoint of a mapping $T_S$ from $\wp(HB)$ to $\wp(HB)$, where $\wp(HB)$ denotes the powerset of HB.

If S is a set of Datalog clauses, then let infer1(S) denote the set of all ground facts which can be inferred in one step from the rules and facts of S through EPP. Furthermore, let FACTS(S) denote the set of all facts of S and let RULES(S) denote the set of all rules of S. Obviously, we have $S = FACTS(S) \cup RULES(S)$.

The transformation $T_S$ associated to S is a mapping from $\wp(HB)$ to $\wp(HB)$ defined as follows:

$\forall W \subseteq HB: T_S(W) = W \cup FACTS(S) \cup infer1(RULES(S) \cup W)$.

It is easy to see that any Herbrand interpretation $\mathcal{I} \subseteq HB$ is a Herbrand model of S iff $\mathcal{I}$ is a fixpoint of $T_S$, i.e., $T_S(\mathcal{I}) = \mathcal{I}$. In particular, the least Herbrand model cons(S) is the least fixpoint of $T_S$.

cons(S) can be computed by least fixpoint iteration, i.e., by computing in order $T_S(\emptyset)$, $T_S(T_S(\emptyset))$, $T_S(T_S(T_S(\emptyset)))$, ..., until one term is equal to its predecessor. This final term is cons(S), the least fixpoint of $T_S$. This computation is quite similar to the INFER algorithm; however, instead of adding one new fact at each iteration step, a set of new facts is added: all those new facts which can be inferred in one step from the old ones.

There exists even a transformation $T'_S$ which is somewhat less complicated than $T_S$ and has the same least fixpoint cons(S) for each finite set of Datalog clauses:

$\forall W \subseteq HB: T'_S(W) = FACTS(S) \cup infer1(RULES(S) \cup W)$.

Thus, $T'_S$ differs from $T_S$ by the omission of the term W in the union.

Although both $T_S$ and $T'_S$ have the same least fixpoint cons(S), they do not have the same set of fixpoints. In particular, not all Herbrand models of S are fixpoints of $T'_S$.

G. Top-Down Evaluation of Datalog Goals

The top-down method is a radically different way of evaluating Datalog programs. Proof trees are constructed from the top to the bottom. This method is particularly appropriate when a goal is specified together with a Datalog program.

Consider the program $P_1$ and the EDB $E_1$ of our "same-generation" example. Assume that the goal ?-sgc(ann, X) is specified. One possibility of finding the required answers is to compute first the entire set $cons(P_1 \cup E_1)$ by bottom-up evaluation and then throw away all facts which are not subsumed by our goal. This would be a noticeable waste of energy, since we would derive many more facts than necessary.

The other possibility is to start with the goal and construct proof trees from the top to the bottom by applying EPP "backwards". In the context of artificial intelligence such methods are also referred to as backward chaining. The general principle of backward chaining is described in [114] and in [34].

In Section V of this paper we will present a top-down method for evaluating Datalog programs against an EDB. This method, called the Query-Subquery approach (QSQ), implicitly constructs all proof trees for a given goal in a recursive fashion.

Also the well-known programming language Prolog [45] is based on the principle of backward chaining. However, as we outline in the next subsection, the semantics of Prolog differs from that of Datalog.

H. Datalog and Prolog

From the syntactical point of view, Datalog is a subset of Prolog; hence each set of Datalog clauses could be parsed and executed by a Prolog interpreter. However, Datalog and Prolog differ in their semantics.

While Datalog, as a simplified version of general Logic Programming [78], has a purely declarative semantics, the meaning of Prolog programs is defined by an operational semantics, i.e., by the specification of how Prolog programs must be executed. A Prolog program is processed according to a resolution strategy which uses a depth-first search method with backtracking for constructing proof trees and respects the order of the clauses and literals as they appear in the program [45]. This strategy does not guarantee termination. The termination of a recursive Prolog program depends strongly on the order of the rules in the program, and on the order of the literals in the rules.

Consider, for example, the program $P_1'$ consisting of the following clauses:

r1': sgc(X, Y) :- sgc(X1, Y1), par(X, X1), par(Y, Y1).
r2': sgc(X, X) :- person(X).

Note that this program differs from program $P_1$ of Section II-B only by the order of the rules, and of literals in the rule bodies. From a Datalog viewpoint, the order of clauses and literals is totally irrelevant, hence $P_1'$, as a Datalog program, is equivalent to $P_1$. On the other hand, if we submit $P_1'$ and the EDB $E_1$ to a Prolog interpreter and activate it, say, with the goal "?-sgc(ann, X)", then we would run into infinite recursion without getting any result.

Prolog has several system predicates, such as the cut, which render the language even more procedural. It is, however, a very rich and flexible programming language which has gained enormous popularity in recent years.

It is possible to couple Prolog to an external database. A Prolog interpreter can then distinguish between IDB and EDB-predicates. When an EDB goal is encountered during the execution of a Prolog program, the interpreter tries to retrieve a matching tuple from mass memory. Due to the procedural semantics of Prolog, which prescribes a particular order of visiting goals and subgoals, the required interaction between the interpreter and the external database is of the type one-tuple-at-a-time. This method of accessing mass storage data is quite inefficient compared to the set-oriented methods used by high-level query languages. Several enhanced coupling mechanisms have been proposed [64], [33], but none takes full advantage of set-oriented techniques. This is probably not possible without compromising Prolog's semantics.

It is the aim of the next section to show that Datalog is well suited for set-oriented techniques.

III. DATALOG IS REALLY A DATABASE LANGUAGE

Although expressing queries and views in Datalog is quite intuitive and fascinating from the user's viewpoint, we should not forget that the aim of database query languages like Datalog is to provide access to large quantities of data stored in mass memory. Thus, in order to enable an easy integration between Datalog and database management systems, we need to relate the logic programming formalism to the most common database languages. We have chosen relational algebra as such a data retrieval language. This section provides an informal description of the translation of Datalog programs and goals into relational algebra.

A. Translation of Datalog Queries into Relational Algebra

Each clause of a Datalog program is translated, by a syntax-directed translation algorithm, into an inclusion relationship of relational algebra. The set of inclusion relationships that refer to the same predicate is then interpreted as an equation of relational algebra. Thus, we say that a Datalog program gives rise to a system of algebraic equations. Each IDB-predicate of the Datalog program corresponds to a variable relation; each EDB-predicate of the Datalog program corresponds to a constant relation. Determining a solution of the system corresponds to determining the value of the variable relations which satisfy the system of equations [61], [63]. The translation from Datalog to relational algebra is described in [28], [34], and [122].

Let us consider a Datalog clause $C: p(\alpha_1, \ldots, \alpha_n)$ :- $q_1(\beta_1, \ldots, \beta_k), \ldots, q_m(\beta_s, \ldots, \beta_h)$. The translation associates to C an inclusion relationship $Expr(Q_1, \ldots, Q_m) \subseteq P$ among the relations $P, Q_1, \ldots, Q_m$ that correspond to predicates $p, q_1, \ldots, q_m$,⁵ with the convention that relation attributes are named by the number of the corresponding argument in the related predicate. For example, the Datalog rules of program $P_1$ from Section II:

r1: sgc(X, X) :- person(X).
r2: sgc(X, Y) :- par(X, X1), sgc(X1, Y1), par(Y, Y1)

are translated into the inclusion relationships:

$\Pi_{1,5}((PAR \bowtie_{2=1} SGC) \bowtie_{4=2} PAR) \subseteq SGC$
$\Pi_{1,1}\, PERSON \subseteq SGC$

⁵Note that some of the $q_i$ might be p itself, yielding a recursive rule.

The rationale of the translation is that literals with common variables give rise to joins, while the head literal determines the projection. Details of translation rules can be found in [28]. Note that, in order to obtain a two-column relation SGC in the second inclusion relationship, we have performed a double projection of the unique column of relation PERSON.

For each IDB predicate p, we now collect all the inclusion relationships of the type $Expr_i(Q_1, \ldots, Q_m) \subseteq P$, and generate an algebraic equation having P as LHS, and the union of all the left-hand sides of the inclusion relationships as RHS:

$P = Expr_1(Q_1, \ldots, Q_m) \cup Expr_2(Q_1, \ldots, Q_m) \cup \cdots$

For instance, from the above inclusion relationships we obtain the following equation:

$SGC = \Pi_{1,5}((PAR \bowtie_{2=1} SGC) \bowtie_{4=2} PAR) \cup \Pi_{1,1}\, PERSON$

Note that the transformation of several inclusion relationships into one equation really captures the minimality requirement contained in the least Herbrand model semantics of a Datalog program. In fact, it expresses the fact that we are only interested in those ground facts that are consequences of our program.

We also translate logic goals into algebraic queries. Input Datalog goals are translated into projections and selections over one variable relation of the system of algebraic equations. For example, the logic goal "?-p(X)." is equivalent to the algebraic query "$P$", and "?-q(a, X)." is equivalent to "$\sigma_{1=a} Q$".

B. The Expressive Power of Datalog

The system produced by the above translation includes all the classical relational operations, with the exception of difference; we say that it is written in positive relational algebra, RA+. It can be easily shown that each defining expression of RA+ can also be translated into a Datalog program [34].

Fig. 2. The expressive power of Datalog.

[...] comparable. Indeed, the semantics of Datalog is based on the choice of a specific model (the least Herbrand model), while first-order logic does not a priori require a particular choice of the model. Interesting treatments of the problem of the expressive power of relational query languages can be found in [10] and [2].

IV. THE OPTIMIZATION PROBLEM: AN OVERVIEW

In this section we provide a classification of efficient [...] view of methods is required, since optimization can be achieved using a variety of techniques, and understanding their relationships is not obvious.

We classify optimization methods according to the formalism, to the search strategy, to the objective of the optimization, and to the type of considered information.

A. Logic and Algebraic Formalism

It comes out from Section III that programs in Datalog can equivalently be expressed as systems of equations of RA+. So, we consider two alternative formalisms, that we regard, respectively, as logic and algebraic, and we discuss optimization methods that belong to both worlds.

We emphasize the convenience of mapping to the algebraic formalism, as this allows reusing classical results, such as conjunctive query optimization, common subexpression analysis, and the quantitative determination of costs associated to each operation of relational algebra [69], [84], [122].

B. Search Strategy

We recall from Section II that the evaluation of a Datalog goal can be performed in two different ways: bottom-up, starting from the existing facts and inferring new facts, or rather top-down, trying to verify the premises which are needed in order for the conclusion to hold.

In fact, these two evaluation strategies represent differ-
This means that Datalog is at least as ent interpretations of a rule. Bottom-up evaluations con- expressive as RA+; in fact, Datalog is strictly more ex- sider rules as productions, that apply the initial program pressive than Z&4+ because in Datalog it is possible to ex- to the EDB, and produce all the possible consequences of press recursive queries, which are not expressible in RA+. the program, until no new fact can be deduced. Bottom- However, there are expressions in full relational alge- up methods can naturally be applied in a set-oriented fash- bra that cannot be expressed by Datalog programs. These ion, i.e., taking as input the entire relations of the EDB. are the queries that make use of the diference operator. This is a desirable feature in the Datalog context, where For example, given two binary relations R and S, there is large quantities of data must be retrieved from mass mem- no Datalog rule defining R - S. Fig. 2 graphically rep- ory. On the other hand, bottom-up methods do not take resents the situation. We will see in Section VI that these immediate advantage of the selectivity due to the exis- expressions can be captured by enriching pure Datalog tence of arguments bound to constants in the goal predi- with the use of logical ( 1 ). cate. Note also that, even though Datalog is syntactically a In top-down evaluation, instead, rules are seen as prob- subset of first-order logic, strictly speaking they are not lem generators. Each goal is considered as a problem that CERI et al.: DATALOG 153

must be solved. The initial goal is matched with the left-hand side of some rule, and generates other problems corresponding to the right-hand side predicates of that rule; this process is continued until no new problems are generated. In this case, if the goal contains some bound argument, then only facts that are somehow related to the goal constants are involved in the computation. Thus, this evaluation mode already performs a relevant optimization because the computation automatically disregards many of the facts which are not useful for producing the result. On the other hand, in top-down methods it is more natural to produce the answer one-tuple-at-a-time, and this is an undesirable feature in Datalog.

If we restrict our attention to the top-down approach, we can further distinguish two search methods: breadth-first or depth-first. With the depth-first approach, we face the disadvantage that the order of literals in rule bodies strongly affects the performance of methods. This happens in Prolog, where not only efficiency, but even termination of programs is affected by the left-to-right order of subgoals in the rule bodies. Instead, Datalog goals seem more naturally executed through breadth-first techniques, as the result of the computation is neither affected by the order of predicates within the right-hand sides (RHS) of rules, nor by the order of rules within the program.

C. Objectives of Optimization Methods

Our third classification criterion is based on the different objectives of optimization methods: some methods perform program transformation, namely, transforming a program into another program which is written in the same formalism, but yields a more efficient computation when one applies an evaluation method to it; we refer to these as rewriting methods. Given a goal $G$ and a program $P$, the rewritten program $P'$ is equivalent to $P$ with respect to $G$, as it produces the same result. Formally, recalling the definition of $\mathcal{M}_{P,G}$ of Section II, two programs $P$ and $P'$ are equivalent with respect to a goal $G$ iff $\mathcal{M}_{P,G} = \mathcal{M}_{P',G}$. These methods contrast with the pure evaluation methods, which propose effective evaluation strategies, where the optimization is performed during the evaluation itself.

D. Type of Considered Information

Optimization methods differ in the type of information used in the optimization.

Syntactic optimization is the most widely used; it deals with those transformations to a program which descend from the program's syntactic features. In particular, we distinguish two kinds of structural properties. One is the analysis of the program structure, and in particular the type of rules which constitute the program. For example, some methods exploit the linearity (see Section V) of the rules to produce optimized forms of evaluation. The second one is the structure of the goal, and in particular the selectivity that comes from goal constants. These two approaches are not mutually exclusive: it is possible to build syntactic methods which combine both cases.

Semantic optimization, instead, concerns the use of additional semantic knowledge about the database in order to produce an efficient answer to a query; the combination of the query with additional semantic information is performed automatically. Semantic methods are often based on integrity constraints, which express properties of valid databases. For instance, a constraint might state that "all intercontinental flights directed to Milan land in Malpensa airport". This constraint can be used to produce the answer of a goal asking for "the arrival airport of the intercontinental flight AZ747 from New York to Milan" without accessing the EDB. Although we think that semantic optimization has the potential for significant improvements of query processing strategies, we do not further consider semantic optimization methods in this paper; various approaches to semantic optimization are proposed in the literature; among others, see [70], [36].

E. Classification of Evaluation and Optimization Methods

Fig. 3(a) summarizes classification criteria for Datalog optimization methods; they concern the search strategy (bottom-up or top-down), the objective (rewriting or pure evaluation), and the formalism (logic or algebraic). By combining approaches and excluding some alternatives that do not correspond to relevant classes of methods, we obtain four classes, each including rather homogeneous methods:
1) top-down evaluation methods
2) bottom-up evaluation methods
3) logic rewriting methods
4) algebraic rewriting methods.

Fig. 3(b) shows some of the most well-known evaluation and optimization methods present in the literature. This list is not exhaustive for obvious brevity reasons; we apologize to authors whose methods are omitted. In the next section we survey informally some of these methods.

(a)
    CRITERION           ALTERNATIVES
    Formalism           logic vs. relational algebra
    Search technique    bottom-up vs. top-down
    Traversal order     depth-first vs. breadth-first
    Objective           rewriting vs. pure evaluation
    Approach            syntactic vs. semantic
    Structure           rule structure vs. goal structure

(b)
    EVALUATION METHODS
        BOTTOM-UP                                             TOP-DOWN
        Naive (Jacobi, Gauss-Seidel) [28], [29], [61], [15]   Query-subquery [128], [130]
        Semi-naive [15]                                       Henschen-Naqvi [59]

    REWRITING METHODS
        LOGIC                                                 ALGEBRAIC
        Magic sets [14], [17]                                 Variables reduction and
        Counting [14]                                         Constant reduction [29]
        Magic Counting [106]
        Static filtering [68]

Fig. 3. Classification of evaluation and optimization methods.

Finally, we expect optimization methods to satisfy three important properties.

• Methods must be sound: they should not include in the result tuples which do not belong to it.
• Methods must be complete: they must produce all the tuples of the result.
• Methods must terminate: the computation should be performed in finite time.

Although we omit to present formal proofs, all the methods surveyed in the next section satisfy all these properties.

V. SURVEY OF EXECUTION METHODS AND OPTIMIZATION TECHNIQUES

In this section we present methods for optimizing a Datalog program, i.e., for generating efficiently the actual set of tuples which satisfy a given goal for a given set of Datalog rules. In order to do this, we first present a bottom-up naive evaluation method (called the Gauss-Seidel method), and discuss briefly its sources of inefficiency. Then, we examine how to reduce these sources of inefficiency: we introduce the Semi-Naive approach which improves a bottom-up computation of linear rules, and we present a top-down efficient strategy, called Query-Subquery. Then, we present two of the most known rewriting methods, Magic Sets and Counting, which are used to optimize the behavior of bottom-up computations. Finally, we briefly overview other optimization methods.

A. Bottom-Up Evaluation

The Gauss-Seidel method is an algebraic version of the naive evaluation paradigm, which is present in many different forms in the literature [13], [85], [41], [86], [87]. The method is also well known in numerical analysis, where it is used for determining the iterative solution (fixpoint) of systems of equations. Assume the following system $\Sigma$ of relational equations:

$$R_i = E_i(R_1, \ldots, R_n), \quad (i = 1, \ldots, n).$$

The Gauss-Seidel method proceeds as follows: initially, the variable relations $R_i$ are set equal to the empty set. Then, the computation $R_i := E_i(R_1, \ldots, R_n)$, $(i = 1, \ldots, n)$ is iterated until all the $R_i$ do not change between two consecutive iterations (namely, until the $R_i$ have reached a fixpoint). At the end of the computation, the value assumed by the variable relations $R_i$ is the solution of the system $\Sigma$.

    GAUSS-SEIDEL METHOD
    INPUT: A system of algebraic equations Σ, and an Extensional Database EDB.
    OUTPUT: The values of the variable relations R_1, ..., R_n.
    METHOD:
        FOR i := 1 TO n DO R_i := ∅;
        REPEAT
            cond := true;
            FOR i := 1 TO n DO
            BEGIN
                S := R_i;
                R_i := E_i(R_1, ..., R_n);
                IF R_i ≠ S THEN cond := false;
            END;
        UNTIL cond;
        FOR i := 1 TO n DO OUTPUT(R_i).
    ENDMETHOD

Note that step "$R_i := E_i(R_1, \ldots, R_n)$" of the Gauss-Seidel algorithm has the same effect as the application of rule EPP of Section II. However, instead of acting on single tuples, here we apply algebraic operations simultaneously to entire relations.

Variants of the Gauss-Seidel method are the Jacobi method and the Chaotic method [28]. The latter is obtained by computing the various algebraic expressions not in a strict sequential order. Different versions of the chaotic method yield the so-called lazy and dataflow evaluations, that correspond to starting the evaluation of computable relations, respectively, at the latest or at the earliest convenience.

The Gauss-Seidel method has two sources of inefficiency:

a) Several tuples are computed multiple times during the iteration process. In particular, during the iterative evaluation of a relation $R$, tuples belonging to relations $R^{(i)}$ will also belong to all subsequent relations $R^{(j)}$, $j \geq i$, until the fixpoint is reached.

b) The method produces the entire result relations. If the goal contains constant arguments, they are selected only at the end. In this way, several tuples are computed without being really required, and eliminated by the final selection.

Inefficiencies due to observation a) above are partially eliminated through the semi-naive methods discussed in the remainder of this subsection; inefficiencies due to observation b) above are dealt with by evaluation methods (like query-subquery) and rewriting methods (like magic sets, counting, etc.), discussed in the next subsections.

Semi-naive evaluation is a bottom-up technique designed for eliminating redundancy in the evaluation of tuples at different iterations. Several versions of this method can also be found in the literature [13], [28], [20].

Consider the Gauss-Seidel algorithm, applied to solve a single equation on variable $R$. Let $R^{(k)}$ be the temporary value of relation $R$ at iteration step $k$. The differential of $R$ at step $k$ of the iteration is defined as

$$D^{(k)} = R^{(k)} - R^{(k-1)}.$$

The differential term, expressing the new tuples of $R^{(k)}$ at each iteration $k$, is exactly what we would like to use at each iteration, instead of the entire relation $R^{(k)}$; this is legal with linear equations. An equation of relational algebra

$$R = E(R)$$

is linear with respect to $R$ if it verifies the following property:

$$E(R' \cup R'') = E(R') \cup E(R'').$$

This property must hold for any two relations $R'$ and $R''$ having the same arity as $R$. Note that linearity is ensured if only one occurrence of $R$ appears in the expression $E(R)$. The differential of $E(R)$ at step $k$ of the iteration is simply $E(D^{(k)})$:

$$E(D^{(k)}) = E(R^{(k)} - R^{(k-1)}) = E(R^{(k)}) - E(R^{(k-1)})$$

so that one can compute $R^{(k+1)} = E(R^{(k)})$ as the union

$$E(R^{(k-1)}) \cup E(D^{(k)}).$$

The advantage of this formula is that, at each iteration $k$, we need to compute $E(D^{(k)})$ rather than $E(R^{(k)})$. This can be generalized to systems of equations (see [29], [63]).

Extensions of this method, the general semi-naive, and the pseudo-rewriting semi-naive (discussed, among others, in [34], [135], and [63]) enable a less efficient application of a similar approach to nonlinear equations.

B. Top-Down Evaluation

The query-subquery (QSQ) algorithm [129] is an efficient top-down evaluation algorithm, optimizing the behavior of backward-chaining methods as described in Section II.

The objective of the QSQ method is to access the minimum number of facts needed in order to determine the answer. In order to do this, the fundamental notion of subquery is introduced. A goal, together with a program, determines a query. Literals in the body of any one of the rules defining the goal predicate are subgoals of the given goal. Thus, a subgoal, together with the program, yields a subquery; this definition applies recursively to subgoals of rules which are subsequently activated. In order to answer the query, each goal is expanded in a list of subgoals, which are recursively expanded in their turn.

The method maintains two sets: a set $P$ of answer tuples, containing answers to the main goal and answers to intermediate subqueries, which is represented by a set of temporary relations (one relation for each IDB-predicate); and a set $Q$ of current subqueries (or subquery instances), which contains all the subgoals that are currently under consideration.

Thus, the function of the QSQ algorithm is twofold: generating new answers and generating new subqueries that must be answered. There are two versions of the QSQ algorithm, an iterative one (QSQI) and a recursive one (QSQR). The difference between the two concerns which of these two functions has priority over the other: QSQI privileges the production of answers, thus, when a new subquery is encountered, it is suspended until the end of the production of all the possible answers that do not require using the new subquery. QSQR behaves in the opposite way: whenever a new subquery is found, it is recursively expanded and the answering to the current subquery is postponed to when the new subquery has been completely solved.

At the end of the computation, $P$ includes the answer to the goal; hence, as in the Gauss-Seidel method, it is required to perform the final selection once, at the end. However, the reader should notice that the QSQ method uses the information about constants in the goal, hence, the size of $P$ and of all the relations involved in the computations is comparatively much smaller than the size of all the relations involved in the Gauss-Seidel computation.

This algorithm can be compared to the Prolog inferential machine, which is also top-down. The comparison is purely indicative, based on the idea of using a Prolog inferential machine to execute a Datalog program. We note that Prolog acts one-tuple-at-a-time, while QSQ is set-oriented, as it processes whole relations. In this sense, QSQ is appropriate for database processing. Also, QSQ is breadth-first and always terminates, while Prolog is depth-first and may instead not terminate in some cases.

The query-subquery algorithm was introduced by Vieille in [128]. The version introduced in that paper was found to be incomplete. The author [129], [130] and others [93], [100] have subsequently provided corrections or complete versions of the algorithm. A detailed description of the method can be found in [34].

C. Magic Sets

The method of the magic sets is a logical rewriting method which transforms a program into a larger one, containing some more rules that define new IDB predicates. These IDB predicates serve as constraints, which force the program variables to satisfy some additional conditions. Thus, during bottom-up computation, the variables of the modified rules may assume only some of the values that were instead allowed for variables of the original rules. In most cases, this makes the new program more efficient.

Details about this algorithm can be found in [14], [17], [22], [34]. We show here a significant example, and make some comments on the rewritten program. Consider again the program $P_1$ with the goal ?-sgc(ann, X). After the magic sets transformation, the program becomes:

    r1': sgc(X, X) :- person(X).
    r2': sgc(X, Y) :- magic(X1), par(X, X1), sgc(X1, Y1), par(Y, Y1).
    r3': magic(ann).
    r4': magic(X1) :- magic(X), par(X, X1).

Let us first notice what has happened to the initial program: two new rules have been added, and one rule has been modified. Let us consider this rule (r2'). A new literal has been added to its body: magic(X1). The presence of this new literal forces the argument X1 of the EDB relation PAR to assume only some specific values, i.e., values which also belong to the IDB relation MAGIC.

Let us now see how this new relation is defined. The first rule (r3') simply says that the constant ann belongs to it. Rule r4' says that, if X is in MAGIC, and X1 is a parent of X, then also X1 is in MAGIC. The result of the computation of relation MAGIC is

$$MAGIC = \{ \langle ann \rangle, \langle dorothy \rangle, \langle hilary \rangle, \langle george \rangle \}.$$

Thus, the IDB relation MAGIC contains all the ancestors of ann. The tuples of the relation MAGIC defined by the magic rules form the magic set. It is often called the cone of ann, referring to the fact that, starting from ann, the ancestors grow "fanning out" like a cone. By imposing that, in r2', X1 must belong to relation MAGIC, we are imposing that, in the computation of Ann's cousins at the same generation, we only have to consider those other pairs of same generation cousins whose first element is an ancestor of Ann.

This example can be used to introduce the idea of sideways information passing (SIP) [22]. Intuitively, given a certain rule and a literal in the rule body with some bound argument(s), one can use the knowledge about the relation corresponding to this literal to obtain bindings for uninstantiated variables in other argument positions. This process can be iterated for each literal in the rule body, and recursively on other IDB-predicates. Thus, known information (bindings to constants) is passed sideways within the rule body. As we have seen in Section V-B, this is the normal behavior of top-down evaluation methods, for instance QSQ.

After the magic sets transformation, the resulting program can be evaluated by a simple algorithm like Gauss-Seidel or semi-naive, still taking advantage of the binding passing strategy. However, the application of a bottom-up computation method to the rewritten program may produce more tuples than exactly those of the goal answer, as it includes all the same generation pairs of all people belonging to the cone of ann. Thus, as in QSQ, the resulting relation must be finally selected to obtain the answer.

The magic sets method has been extended by Saccà and Zaniolo to a class of queries to logic programs that contain function symbols [105]. A more sophisticated technique, called "Supplementary Magic Sets", has been introduced by Beeri and others [22], whose virtue is to eliminate some repeated computations. Another improvement of the magic set method is proposed in [99], where top-down evaluation is completely mimicked by bottom-up, using a semi-naive evaluation of rewritten rules; the rule rewriting uses initial goal bindings in a very sophisticated way; for instance, goals like p(X, X) give rise to the binding of the two occurrences of variable X to each other. This was not covered by the original magic sets method.

D. Counting

The Counting method is a rewriting method based on the knowledge of the goal bindings; the method includes the computation of the magic set, but each element of the magic set is complemented by additional information expressing its "distance" from the goal constant. The counting method was also first presented in [14]; an improved version was introduced in [22].

Consider again the program $P_1$ used as an example of the magic set method in the previous subsection. The magic set method restricts the computation to the ancestors of ann; for each of these elements, the counting method maintains the information whether it is one of ann's parents (distance 1), ann's grandparents (distance 2), ann's great-grandparents (distance 3), etc. The rewritten program contains the computation of these distances. At this point, computation may be restricted, respectively, to the children of ann's parents, to the grandchildren of ann's grandparents, to the great-grandchildren of ann's great-grandparents, etc.

The following is the result of applying the counting transformation to the output of the magic sets method:

    sgc'(X, X, I) :- person(X), integer(I).
    sgc'(X, Y, I) :- counting(X1, I), par(X, X1), sgc'(X1, Y1, J), par(Y, Y1), I = J - 1.
    counting(ann, 0).
    counting(X1, I) :- counting(X, J), par(X, X1), I = J + 1.

With the goal:

    ?- sgc'(ann, Y, 0).

With some liberality, we use built-in predicates for addition and subtraction, that will be discussed in Section VI-A. The reader can observe that the counting predicate increments generation levels from ann upwards, marking the level of each element of the magic set, whereas the sgc' predicate decrements generation levels. The goal selects only the sgc pairs of level 0, i.e., those at the same level as ann. At each step, among the tuples of the magic set, the program only uses the tuples that have the appropriate distance from ann.

The application of the counting method has the potential for improving the efficiency of the computation, but clearly the method does not terminate when the database is cyclic (as the increment of the counting variable is not arrested). Thus, the counting method applies to acyclic databases only. A further hypothesis is that the program be linear, with at most one recursive rule for each predicate.

In order to improve the applicability of the counting method whenever it is not known a priori whether the database is cyclic, Saccà and Zaniolo have introduced [106] the "magic counting method", which constantly monitors the counting computation in order to determine whether the underlying database is cyclic; if so, it switches to the magic set computation.

E. The Method of Henschen and Naqvi

One of the earliest pure evaluation methods proposed in the literature was developed by Henschen and Naqvi
[59]; the method applies to linear Datalog programs with goals that contain bound arguments.

This method produces an iterative program that evaluates the goal through several steps. Each step produces some of the answer tuples, and, at the same time, computes symbolically a new expression that has to be evaluated at the following step. The method is based on a "functional" interpretation of predicates: we can view any predicate $p$ having at least two arguments as a set function from some of its arguments to the remaining ones, associating to each set $S$ of values of its first arguments a set $S'$ of values of its second arguments; for instance, if $p$ is a binary predicate, we denote by $f_p$ the functional mapping

$$S' = f_p(S) = \{ y \mid x \in S \text{ and } p(x, y) \}.$$

With this notation, one can also perform the composition of predicate functions, in the usual mathematical sense. This can be applied to the solution of linear Datalog programs, by considering the bound arguments as the function's domains. We exemplify the method of Henschen-Naqvi on program $P_1$, with the goal: ?-sgc(a, X).

Here, $f_{person}$ denotes a unary function returning all persons, $f_{par}$ is the set function from the first to the second argument of the relation PAR, and $f_{rap}$ is the set function from the second to the first argument of the relation PAR.

The most interesting feature of the method of Henschen and Naqvi is that it integrates two kinds of computation: at a certain step, some tuples of the answer are computed, but also some symbolic manipulation is performed. The first kind of computation is typical of pure evaluation methods, the second is characteristic of rewriting methods.

Substantially the same functional interpretation of predicates is given in a more general form by Gardarin and De Maindreville in [52], who propose a method for evaluating queries as function series, where the functions involved correspond to the functional interpretation of the predicates. Follow-up work can be found in [54].

F. Other Efficient Evaluation Methods

Several other methods can be found in the literature; most of them share some characteristics with the methods presented above.

Static Filtering is a rewriting method introduced by Kifer and Lozinskii in [68]. In this method, a bottom-up evaluation is viewed as a flow of tuples through a graph derived from the program, called relation-axiom graph, with two types of nodes: relation nodes, associated to predicates, and axiom nodes, associated to rules. Computation is ideally performed inside axiom nodes.

When the goal has some bound arguments, those tuples produced during the graph traversal which do not satisfy the bindings can be eliminated at the end of the computation; the idea of the method is that of "cutting" off useless tuples from the computation at an earlier stage of their flow towards the goal node. This is achieved by imposing conditions on predicate arguments, called filters, to the output edges of each relation node; the expressions of filters are derived from the goal bindings, and propagated along the graph by a push operation.

A similar approach was presented by Devanbu and Agrawal in [47], but restricting the application to linear rules with only one occurrence of the recursive predicate in the RHS. Another interesting (pure evaluation) algorithm (the Apex method) was previously introduced by Lozinskii in [81]. The first method for pushing selections into recursive expressions was proposed by Aho and Ullman in a seminal paper [10]; the static filtering technique can be considered as a generalization of that concept.

Another generalization of [10] is presented by Ceri and Tanca in [29], by introducing the pair of methods Variable Reduction and Constant Reduction which apply to generic systems of algebraic equations. Both methods push the initial selections so as to use their "filtering" ability as soon as possible; the two methods act either by rewriting equations into equivalent ones with different (smaller) variables or by reducing the size of involved relations prior to the equations' evaluation. These methods are part of a structured approach to the optimization of systems of algebraic equations derived from logic programs, presented in [29].

An idea similar to magic sets is exhibited in the Alexander Method, published in [101]; the method consists of rewriting the rules for each recursive predicate so that they represent, respectively, problems and solutions for the predicate; all new rules obtained are evaluated bottom-up, but the evaluation of the problem rules simulates in fact top-down evaluation and allows binding passing among subgoals.

There are also other possible approaches to optimization. One of them is the method for redundancy elimination of Sagiv: in [107] an algorithm is presented for minimizing the size (in terms of number of rules in the program and of number of atoms in a rule) of a Datalog program under a decidable condition called uniform equivalence. Another kind of optimization is achieved by Ramakrishnan and others, in [98], where the objective of optimization is pushing projections, rather than selections, in the body of a Datalog rule. This means deleting some argument positions in the literals of the rule body, which also has, sometimes, the effect of making some rules redundant for the computation.

A comparative evaluation of the performance of methods presented prior to 1986 is presented by Bancilhon and Ramakrishnan [15]. The study uses a benchmark which includes only few, conventional programs; it excludes cyclic databases, which are quite common. However, this paper remains one of the few quantitative approaches to the comparison of optimization methods; we think that this topic deserves a much more thorough treatment, although achieving a comparative evaluation of methods' performance is extremely difficult due to the variety and inherent complexity of methods and to the continuous evolution of the field.

G. Computing Transitive Closures Efficiently

Computing transitive closures efficiently has been recognized as a significant subproblem of the efficient computation of general recursive queries [62], [7]-[9], [123]. In fact, many proposals of other extensions to the relational query languages in "nontraditional" directions include this operation [8], [102], [58], [136]. Thus, an independent area of research has been developed which studies the efficient implementation of transitive closures improving the naive and semi-naive methods. The efficient methods are mainly concerned with transitive closures of binary relations representing graphs and trees, very often making use of sparse matrices to represent them.

In [62], a logarithmic technique is applied in order to build more efficient iterations. The paper provides an interesting study of performance on tree-shaped databases, showing that logarithmic techniques perform much better than the traditional ones (as could be expected). In [7], direct algorithms are proposed, which are based on studies performed in different contexts [108], [109], [131], [132]. The name "direct" algorithm descends from the fact that the length of the computation does not depend on the length of paths in the underlying graph. Several research efforts have also been directed towards the parallel computation of transitive closures. In these proposals, the computation of transitive closures is performed on several processors at the same time; the significant

The meaning of this program is obvious. By use of the inequality built-in predicate we avoid that a person is considered as his own sibling.

From a formal point of view, built-ins can be considered as EDB-predicates with a different physical realization than ordinary EDB-predicates: they are not explicitly stored in the EDB but are implemented as procedures which are evaluated during the execution of a Datalog program. However, built-ins correspond in most cases to infinite relations, and this may endanger the safety of Datalog programs.

Safety means that a Datalog program should always have a finite output, i.e., the intensional relations defined by a Datalog program must be finite. It is easy to see that safety can be guaranteed by requiring that each variable occurring as argument of a built-in predicate in a rule body must also occur in an ordinary predicate of the same rule body, or must be bound by an equality (or a sequence of equalities) to a variable of such an ordinary predicate or to a constant. Here, by "ordinary predicate", we mean a non-built-in predicate.

During the evaluation of a Datalog rule with built-in predicates, the following principle has to be observed: defer the evaluation of a built-in predicate until all arguments of this predicate are bound to constants. An exception to this principle can (sometimes should) be made for the equality predicate. An equality should be evaluated as soon as one of its two arguments is a constant or is bound to a constant.

In a similar way, arithmetical built-in predicates can be used. For instance, a predicate plus(X, Y, Z) may be used for expressing X + Y = Z, where the variables X, Y, and Z are supposed to range over a numeric domain.
problem is avoiding maintaining too many duplicates, and During the evaluation of a rule body, such a predicate can to perform too many duplicate computations on different be evaluated as soon as bindings for its “input variables” processors. Two such algorithms were proposed in [ 1241. (here X and Y) are provided. Other significant papers are [9] and [67]. Finally, let us remark that when Datalog rules are trans- formed into algebraic equations, then many built-in pred- icates can be expressed through join conditions. The above VI. EXTENSIONS OF PURE DATALOG defined program P,, for instance, is translated into the The Datalog syntax we have been considering so far following equation of relational algebra: corresponds to a very restricted subset of first-order logic and is often referred to as pure Datalog. Several exten- SIB = n 2,4(PAR W,=,,,2z2 PAR) sions of pure Datalog have been proposed in the literature where SIB and PAR denote the relations corresponding to or are currently under investigation. The most important the predicates sibling and parent, respectively. of these extensions are built-in predicates, negation, and complex objects. B. Incorporating Negation into Datalog: The Problem A. Built-In Predicates In pure Datalog, the negation sign “ 1 ” is not allowed to appear. However, by adopting the Closed World As- Built-in predicates (or “built-ins’ ‘) are expressed by sumption (CWA), we may infer negative facts from a set special predicate symbols such as > , < , 2 , I , =, # of pure Datalog clauses. with a predefined meaning. These symbols can occur in Note that the CWA is not a universally valid logical the right-hand side of a Datalog rule; they are usually rule, but just a principle that one may or may not adopt, written in infix notation. depending on the semantics given to a language. 
In the Consider, for example, the following program P2 con- context of Datalog, the CWA can be formulated as fol- sisting of a single rule where par is an EDB predicate: lows: P2: sibZing(X, Y) :- par(Z, X), par(Z, Y), X # Y. CWA: If a fact does not logically follow from a set of CERI et al.: DATALOG 159

Datalog clauses, then we conclude that the negation of this fact is true.

Negative Datalog facts are positive ground literals preceded by the negation sign, for instance, ¬sgc(bertrand, hilary). Note that this negative fact follows by the CWA from P1 ∪ E1, since sgc(bertrand, hilary) does not follow from P1 ∪ E1. If F denotes a negative ground fact, then |F| denotes its positive counterpart. For example: |¬sgc(bertrand, hilary)| = sgc(bertrand, hilary).

The CWA applied to pure Datalog allows us to deduce negative facts from a set S of Datalog clauses. It does not, however, allow us to use these negative facts in order to deduce some further facts. In real life, it is often necessary to express rules whose premises contain negative information, for instance: “if X is a student and X is not a graduate student, then X is an undergraduate student”. In pure Datalog, there is no way to represent such a rule.

Note that in relational algebra an expression corresponding to the above rule can be formulated with ease by use of the set-difference operator “−”. Assume that a one-column relation STUD contains the names of all students and another one-column relation GRAD contains the names of all graduate students. Then we obtain the relation UND of all undergraduate students by simply subtracting GRAD from STUD, thus

UND = STUD − GRAD.

Our intention is now to extend pure Datalog by allowing negated literals in rule bodies. Assume that the unary predicate symbols stud, und, and grad represent the properties of being a student, an undergraduate, and a graduate, respectively. Our rule could then be formulated as follows:

und(X) :- stud(X), ¬grad(X).

More formally, let us define Datalog¬ as the language whose syntax is that of Datalog except that negated literals are allowed in rule bodies. Accordingly, a Datalog¬ clause is either a positive (ground) fact or a rule where negative literals are allowed to appear in the body. For safety reasons we also require that each variable occurring in a negative literal of a rule body also occurs in a positive literal of the same rule body.

In order to discuss the semantics of Datalog¬ programs, we first generalize the notion of the Herbrand Model (see Section II-D) to cover negation in rule bodies. Let ℑ be a Herbrand interpretation, i.e., a subset of the Herbrand base HB. Let F denote a positive or negative Datalog fact.

F is satisfied in ℑ iff
    F is a positive fact and F ∈ ℑ, or
    F is a negative fact and |F| ∉ ℑ.

Now, let R be a Datalog¬ rule of the form L0 :- L1, ..., Ln and let ℑ be a Herbrand interpretation. R is satisfied in ℑ iff for each ground substitution θ for R, whenever it holds that for all 1 ≤ i ≤ n, Liθ is satisfied in ℑ, then it also holds that L0θ is satisfied in ℑ. (Note that L0θ is satisfied in ℑ iff L0θ ∈ ℑ, since L0θ is positive.)

Let S be a set of Datalog¬ clauses. A Herbrand interpretation ℑ is a Herbrand model of S iff all facts and rules of S are satisfied in ℑ.

In analogy to pure Datalog, we require that the set of all positive facts derivable from a set S of Datalog¬ clauses be a minimal model of S. However, a set S of Datalog¬ clauses may have more than one minimal Herbrand model. For instance, if S_c = {boring(chess) :- ¬interesting(chess)}, then S_c has two minimal Herbrand models: H_a = {interesting(chess)} and H_b = {boring(chess)}.

The existence of several minimal Herbrand models for a set of Datalog¬ clauses entails difficulties in defining the semantics of Datalog¬ programs: which of the different minimal Herbrand models should be chosen? Note also that the model minimality requirement is inconsistent with the CWA. By the CWA both facts ¬boring(chess) and ¬interesting(chess) can be deduced from the above set S_c. Thus, neither of the models H_a and H_b is consistent with the CWA.

In the following we describe a policy which is commonly referred to as stratified evaluation of Datalog¬ programs, or simply as stratified Datalog¬. This policy permits us to select a distinguished minimal Herbrand model in a very natural and intuitive way by approximating the CWA. However, as we will see later, this method does not apply to all Datalog¬ programs, but only to a particular subclass, the so-called stratified programs [37], [12]. Note also that this technique can be used in the more general context of Logic Programming with negation [12] as well.

C. Stratified Datalog¬

This policy of choosing a particular Herbrand model, and thus of determining the semantics of a Datalog¬ program, is guided by the following intuition: when evaluating a rule with one or more negative literals in the body, first evaluate the predicates corresponding to these negative literals. Then the CWA is “locally” applied to these predicates.

For instance, the clause set S_c, as defined above, would be evaluated as follows: before trying to evaluate the predicate boring, we evaluate the predicate interesting which occurs negatively in the rule body. Since there are no rules and facts in S_c allowing us to deduce any fact of the form interesting(α), the set of positive answers to this predicate is empty. In particular, interesting(chess) cannot be derived. Hence, by applying the CWA “locally” to the interesting predicate, we derive ¬interesting(chess). Now we evaluate the unique rule of S_c and get boring(chess). Thus, the computed Herbrand model is H_b = {boring(chess)}.

When several rules occur in a Datalog¬ program, then the evaluation of a rule body may engender the evaluation of subsequent rules. These rules may contain in turn negative literals in their bodies and so on. Thus, it is required that before evaluating a predicate in a rule head, it is always possible to completely evaluate all the predicates which occur negatively in the rule body or in the bodies of some subsequent rules and all those predicates which are needed in order to evaluate these negative predicates. If a program fulfills this condition it is called stratified.

Any stratified program P can be partitioned into disjoint sets of clauses P = P^1 ∪ ... ∪ P^i ∪ ... ∪ P^n called strata, such that each IDB-predicate of P has its defining clauses within one stratum, P^1 contains only clauses with either no negative literals or with negative literals corresponding to EDB-predicates, and each stratum P^i contains only clauses whose negative literals correspond to predicates defined in lower strata. The partition of P into P^1 ... P^n is called a stratification of P.

Assume a stratified program P with given stratification P^1 ... P^n has to be evaluated against an EDB E. The evaluation is done stratum-by-stratum as follows. First, P^1 is evaluated by applying the CWA locally to the EDB, i.e., by assuming ¬p(c1, ..., ck) for each k-ary EDB-predicate p and constants c1, ..., ck where p(c1, ..., ck) ∉ E. Then the other strata are evaluated in ascending order. During the evaluation of each stratum P^i, the result of the previous computations is used and the CWA is made “locally” for all EDB-predicates and for all predicates defined by lower strata.

Consider, for example, the following program P_s, where d is the only EDB-predicate:

r1: p(X, Y) :- ¬q(X, Y), s(X, Y).
r2: q(X, Y) :- q(X, Z), q(Z, Y).
r3: q(X, Y) :- d(X, Y), ¬r(X, Y).
r4: r(X, Y) :- d(Y, X).
r5: s(X, Y) :- q(X, Z), q(Y, T), X ≠ Y.

A stratification of P_s is: P_s^1 = {r4}, P_s^2 = {r2, r3, r5}, P_s^3 = {r1}. Assume P_s is evaluated over the EDB E_s = {d(a, b), d(b, c), d(e, e)}. The evaluation of the first stratum produces the new facts: r(b, a), r(c, b), r(e, e). The computation of the second stratum yields the following additional facts: q(a, b), q(b, c), q(a, c), s(a, b), s(b, a). Finally, by evaluating the third stratum we get p(b, a).

Note that a stratified program has, in general, several different stratifications. The program P_s, for example, has the following alternative stratification: P_s^1 = {r4}, P_s^2 = {r2, r3}, P_s^3 = {r1, r5}. However, it can be shown [12] that all stratifications are equivalent, i.e., the result of the evaluation of a stratified Datalog¬ program P is independent of the stratification used.

It is easy to decide whether a given Datalog¬ program is stratified by analyzing the Extended Dependency Graph EDG(P) of P. The nodes of EDG(P) consist of the IDB-predicate symbols occurring in P. There is a (directed) edge <p, q> in EDG(P) iff the predicate symbol q occurs positively or negatively in a body of a rule whose head predicate is p. Furthermore, the edge <p, q> is marked with a “¬” sign iff there exists at least one rule R with head predicate p such that q occurs negatively in the body of R. The extended dependency graph EDG(P_s) of the program P_s is depicted in Fig. 4.

A Datalog¬ program P is stratified iff EDG(P) does not contain any cycle involving an edge labeled with “¬”. If P is stratified, then it is quite easy to construct a particular stratification for P from EDG(P) [34]. Another method for constructing a stratification is given in [122].

It can be shown that the strata-by-strata evaluation of a stratified program P on the base of an underlying EDB E always produces a minimal Herbrand model of P ∪ E. This model is also called the Perfect Model and can be characterized in a purely nonprocedural way [12], [95]. Local stratification, a refinement of stratification, is proposed in [95].

D. Inflationary Evaluation and Expressive Power of Datalog¬ Programs

Another evaluation paradigm for Datalog¬ programs has been proposed recently in [4], [72]. This method, called inflationary evaluation, has the advantage of applying to all Datalog¬ programs and not just stratified Datalog¬ programs.

Let P be a Datalog¬ program and E an EDB. The inflationary evaluation of P on E is performed iteratively so that all rules of P are processed in parallel at each step. From the EDB and the facts already derived, new facts are derived by applying the rules of P. These new facts are added to the result at the end of each step. At each step, the CWA is made temporarily during the evaluation of the rule bodies: it is assumed that the negation of all facts not yet derived is valid. The procedure terminates when no more additional facts can be derived.

Consider, for example, the following nonstratified program P_i, where d is the only EDB-predicate:

r1: s(X) :- p(X), q(X), ¬r(X).
r2: p(X) :- d(X), ¬q(X).
r3: q(X) :- d(X), ¬p(X).
r4: r(X) :- d(X), d(b).

Assume that this program is evaluated inflationarily against an EDB E_i = {d(a)}. During the first iteration step, rule r2 produces the new fact p(a), and rule r3 produces the new fact q(a). During the second iteration step, rule r1 produces the new fact s(a). Since no further facts are derivable, the procedure stops with the result {p(a), q(a), s(a)}.

It is easy to see that the inflationary evaluation of a Datalog¬ program P corresponds to the computation of a least fixpoint [72], [34]. Furthermore, the result united with E is a Herbrand model of P ∪ E. However, in general, it is not a minimal Herbrand model. For instance, in the above example the computed Herbrand model is {d(a), p(a), q(a), s(a)}, but this model is not minimal since

Fig. 4. Extended dependency graph EDG(P_s).
Fig. 5. Hierarchy of expressiveness of different versions of Datalog.
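The stratification test of Section VI-C, based on the extended dependency graph, can be sketched in a few lines of Python. This is an illustration only and not the construction of [34] or [122]: the rule encoding, the function names, and the level-assignment loop (a common textbook technique) are assumptions introduced here. Each rule of P_s is reduced to its head predicate and the list of body predicates with a negation flag; EDB predicates and built-ins are not nodes of the graph.

```python
# Rules of P_s as (head predicate, [(body predicate, is_negated), ...]).
# The built-in inequality of r5 is omitted because it is not a graph node.
P_S = [
    ("p", [("q", True), ("s", False)]),   # r1
    ("q", [("q", False), ("q", False)]),  # r2
    ("q", [("d", False), ("r", True)]),   # r3
    ("r", [("d", False)]),                # r4
    ("s", [("q", False), ("q", False)]),  # r5
]


def extended_dependency_graph(rules):
    """Return the IDB predicates and the edges <p, q>; an edge maps to True if marked with a negation sign."""
    idb = {head for head, _ in rules}
    edges = {}
    for head, body in rules:
        for pred, negated in body:
            if pred in idb:  # EDB predicates such as d are skipped
                edges[(head, pred)] = edges.get((head, pred), False) or negated
    return idb, edges


def reachable(edges, start):
    """Nodes reachable from `start` by following at least one edge."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for (u, v) in edges:
            if u == node and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_stratified(rules):
    """True iff no cycle of the graph contains a negatively marked edge."""
    _, edges = extended_dependency_graph(rules)
    # A marked edge <p, q> lies on a cycle iff p is reachable from q.
    return not any(neg and p in reachable(edges, q)
                   for (p, q), neg in edges.items())


def strata(rules):
    """Assign each IDB predicate the lowest stratum number compatible with the graph."""
    if not is_stratified(rules):
        raise ValueError("program is not stratified")
    idb, edges = extended_dependency_graph(rules)
    level = {p: 1 for p in idb}
    changed = True
    while changed:
        changed = False
        for (p, q), neg in edges.items():
            needed = level[q] + (1 if neg else 0)  # strictly higher across a negation
            if level[p] < needed:
                level[p] = needed
                changed = True
    return level
```

For P_s this sketch places r in stratum 1, q and s in stratum 2, and p in stratum 3, which corresponds to the first stratification given in the text. A program containing the two rules r2 and r3 of P_i is rejected, because p and q depend negatively on each other.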

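The two worked examples of Sections VI-C and VI-D can be replayed with a small naive evaluator. The Python sketch below rests on assumptions made for illustration: the tuple encoding of rules, the name neq for the inequality built-in, and the brute-force enumeration of substitutions are not taken from the paper and would be far too slow for real databases. A negative literal is taken to hold when its atom is not among the facts derived so far, which is the temporary CWA of the inflationary method; applied one stratum at a time, the same loop gives the stratum-by-stratum evaluation.

```python
from itertools import product


def fixpoint(rules, facts, constants):
    """Apply all rules in parallel steps until no new fact appears.

    A rule is ((head_pred, head_args), body); a body literal is
    (pred, args, is_negated). Upper-case arguments are variables.
    """
    facts = set(facts)
    while True:
        new = set()
        for head, body in rules:
            variables = sorted({t for _, args, _ in [(*head, False)] + body
                                for t in args if t[0].isupper()})
            for values in product(sorted(constants), repeat=len(variables)):
                sub = dict(zip(variables, values))

                def ground(args):
                    return tuple(sub.get(t, t) for t in args)

                def holds(literal):
                    pred, args, negated = literal
                    if pred == "neq":  # built-in inequality, all arguments bound
                        x, y = ground(args)
                        return x != y
                    return ((pred, ground(args)) in facts) != negated

                if all(holds(literal) for literal in body):
                    new.add((head[0], ground(head[1])))
        if new <= facts:  # nothing new: the procedure terminates
            return facts
        facts |= new      # new facts are added at the end of the step


def stratified(strata, edb, constants):
    """Evaluate the strata in ascending order, reusing earlier results."""
    facts = set(edb)
    for stratum in strata:
        facts = fixpoint(stratum, facts, constants)
    return facts


XY = ("X", "Y")
R1 = (("p", XY), [("q", XY, True), ("s", XY, False)])
R2 = (("q", XY), [("q", ("X", "Z"), False), ("q", ("Z", "Y"), False)])
R3 = (("q", XY), [("d", XY, False), ("r", XY, True)])
R4 = (("r", XY), [("d", ("Y", "X"), False)])
R5 = (("s", XY), [("q", ("X", "Z"), False), ("q", ("Y", "T"), False),
                  ("neq", XY, False)])
STRATA_PS = [[R4], [R2, R3, R5], [R1]]
E_S = {("d", ("a", "b")), ("d", ("b", "c")), ("d", ("e", "e"))}

X = ("X",)
P_I = [
    (("s", X), [("p", X, False), ("q", X, False), ("r", X, True)]),  # r1
    (("p", X), [("d", X, False), ("q", X, True)]),                   # r2
    (("q", X), [("d", X, False), ("p", X, True)]),                   # r3
    (("r", X), [("d", X, False), ("d", ("b",), False)]),             # r4
]
E_I = {("d", ("a",))}

stratified_result = stratified(STRATA_PS, E_S, {"a", "b", "c", "e"})
inflationary_result = fixpoint(P_I, E_I, {"a", "b"})
```

Under these assumptions the stratified run adds r(b, a), r(c, b), r(e, e), q(a, b), q(b, c), q(a, c), s(a, b), s(b, a), and p(b, a) to E_s, and the inflationary run returns {d(a), p(a), q(a), s(a)}, in agreement with the two examples in the text.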
{d(a), q(a)} and {d(a), p(a)} are smaller Herbrand models of P_i ∪ E_i.

Just as for pure Datalog programs, one can specify a goal together with a Datalog¬ program. In that case, again, the output consists of all those derived IDB-facts which are subsumed by the given goal. It can be shown [4], [71], [72] that for each stratified Datalog¬ program P with goal G there exists a Datalog¬ program P' such that the output of the stratified computation of P on any EDB E w.r.t. G is equal to the output of the inflationary computation of P' on E w.r.t. G. This means that inflationary Datalog¬ is at least as expressive as stratified Datalog¬. Moreover, there exist programs whose inflationary evaluation (w.r.t. a given goal) cannot be simulated by any strata-by-strata computation of a stratified Datalog¬ program. Thus, inflationary Datalog¬ is computationally strictly more expressive than stratified Datalog¬. Furthermore, it has been shown that inflationary Datalog¬ has the same expressive power as Fixpoint Logic on Finite Structures, a well-known formalism obtained by extending first-order logic with a least fixpoint operator for positive first-order formulas [6].

It is also easy to see that stratified Datalog¬ is in turn strictly more expressive than pure Datalog [4]. Fig. 5 shows the hierarchy of expressiveness of the different languages and formalisms presented in this paper. More information on this subject can be found in [38].

We finish our discussion on incorporating negation into Datalog by giving a brief survey of other relevant work on this topic.

A different approach for defining the meaning of logic programs with negation is based on a proposal of Clark [44]. He essentially considers the completion of a logic program by viewing the definitions of derived predicates as logical equivalences rather than logical implications. The semantics of a logic program with negation can then be defined as a minimal model of its completion. This approach was discussed by Shepherdson [110], [111] and Lloyd [78]. Fitting [49] and Kunen [74] refined this approach by using three-valued logic: in their setting, a fact can be true, false, or undefined. In the context of Datalog these approaches are not completely satisfactory.

A recently introduced and very promising approach, also based on three-valued logic, is the well-founded semantics by Van Gelder, Ross, and Schlipf [126]. Their method nicely extends the stratified approach to arbitrary logic programs with negation. In particular, every stratified program is semantically characterized by a total model, i.e., a model such that each fact of the Herbrand base has either truth value “true” or “false”. This model coincides with the perfect model mentioned above. Nonstratified programs, on the other hand, can be characterized by partial models, where single facts may assume truth value “undefined.” A fixpoint method for computing the well-founded partial model is given in [127], while resolution-based procedural semantics for well-founded negation is provided in [103]. Further important papers related to well-founded semantics are [27], where the relationship to logical constructivism is investigated, and [96].

E. Complex Objects

The “objects” handled by pure Datalog programs correspond to the tuples of relations which in turn are made of attribute values. Each attribute value is atomic, i.e., not composed of sub-objects; thus the underlying data model consists of relations in first normal form. This model has the advantage of being both mathematically simple and easy to implement. On the other hand, several new application areas (such as computer-aided design, office automation, and knowledge representation) require the storage and manipulation of (deeply nested) structured objects of high complexity. Such complex objects cannot be represented as atomic entities in the normalized relational model but are broken into several autonomous objects. This implies a number of severe problems of conceptual and technical nature.

For this reason, the relational model has been extended in several ways to allow the compact representation of complex objects. Datalog can be extended accordingly. The main features that are added to Datalog in order to represent and manipulate complex objects are function symbols as a glue for composing objects from sub-objects and set constructors for being able to build objects which are collections of other objects. Function symbols are “uninterpreted”, i.e., they do not have any predefined meaning. Usually one also adds a number of predefined functions for manipulating sets and elements of sets to the standard vocabulary of Datalog.

There exist several different approaches for incorporating the concept of a complex structured object into the formalism of Datalog [133], [119], [75]. One of the most well-known approaches has been developed within the LDL (Logic Data Language) Project, carried out at MCC, Austin, TX. The notation for representing sets and related issues used in this subsection is the one of LDL.

Examples of complex facts involving function symbols and sets are:

person(name(joe, berger), birthdate(1956, june, 30),
    children({max, sarah, jim})).

person(name(joe, coker), birthdate(1956, june, 30),
    children({bill, sarah})).

person(name(bebe, suong), birthdate(1958, may, 5),
    children({jim, max, sarah})).

Here name, birthdate, and children are function symbols. Variables may represent atomic objects (i.e., constants) or compound objects. The following rule relates the last names of all persons having the same birthdate and the same first name:

similar(X, Y) :- person(name(Z, X), B, C),
    person(name(Z, Y), B, D).

By this rule, we can derive, for instance, the new fact similar(berger, coker).

Two sets are considered equal iff they contain the same elements, independently of the order in which these elements appear. The following rule defines a predicate eqchilds(X, Y) stating that X and Y are the names of persons whose children have exactly the same first names:

eqchilds(X, Y) :- person(X, B, C), person(Y, D, C).

By this rule we can derive, for instance, eqchilds(name(joe, berger), name(bebe, suong)).

LDL offers several built-in predicates and functions for handling sets. The most important are:

• member(t, S), a built-in predicate for expressing that t is an element of the set S. Notice that t can be a complex term and may contain sets as components.

• union(S, A, B), a built-in predicate for expressing that S = A ∪ B.

Sets can be introduced not only by enumeration but also by grouping. Grouping allows us to define a set in a rule head by indicating the properties of its elements in the corresponding rule body. The following rule, for instance, defines the set of all persons (identified by their last name) who have a child called Sarah.

sarahpar(<X>) :-
    person(name(A, X), B, C), member(sarah, C).

This rule generates the new fact sarahpar({berger, coker, suong}).

Using complex objects in Datalog is not as easy as it might appear. Several problems have to be taken into consideration. First of all, the use of function symbols may endanger the safety of programs. It is undecidable whether a Datalog program with function symbols has a finite or an infinite result. The simplest solution is to leave the responsibility to the programmer. A similar problem is the finiteness of sets. Furthermore, not all Datalog programs with sets have a well-defined semantics. In particular, one should avoid self-referential set definitions such as p(<X>) :- p(X). Such definitions come close to Russell's paradox. A large class of programs free of self-reference, called admissible programs, is defined in [21]. Note also that the test whether two terms (or literals) involving sets match is a computationally hard problem. This is a particular case of theory unification [116].

Another interesting problem is the consistency problem for monovalued data functions. A monovalued data function f is an evaluable function symbol, interpreted as a mathematical function. Such functions can be defined by the rules of a logic program. However, the uniqueness of the function value must be ensured. References [3], [76], and [77] deal with this problem.

A Logic Programming language for data manipulation such as LDL should be conceived in accordance with an appropriate data model which formalizes the storage and retrieval principles and the manipulation primitives that a DBMS offers for the objects referenced by the language. Pure Datalog, for instance, can be based on the relational model in first normal form (nowadays often called the flat relational model) because the concept of a literal nicely matches the one of a tuple in a relation and because the single evaluation steps of a Datalog program can be translated into appropriate sequences of relational operations. On the other hand, Logic Programming languages dealing with structured objects such as LDL require more complex data models.

Quite a number of extensions of the relational model have been developed in recent years in order to allow the storage and manipulation of complex objects. The most famous ones are the NF² model by Jaeschke and Schek [66], the model of nested relations by Fisher and Thomas [48], the model of Abiteboul and Beeri [1], [2] (which is more general than the former two models), and the “Franco-Armenian Data model” FAD by Bancilhon, Briggs, Khoshafian, and Valduriez [19] (based on a calculus for complex objects by Bancilhon and Khoshafian [18]). ALGRES, a quite powerful data model for complex objects, supports an extended relational algebra augmented with a fixpoint operator, and thus is an ideal base for implementing interpreters or compilers for logic data languages with complex objects [30], [31].

An excellent overview and comparison of data models for complex objects is given in [1].

F. An Overview of Research Prototypes

We mention here some of the research prototypes currently under development based on the language Datalog (or its variations).

The LDL project is under development at Microelectronics and Computer Technology Corporation (MCC),

Austin, TX. The project's goal is to implement an integrated system for processing queries in Logic Data Language (LDL), a language which extends Datalog. We gave some examples of the most significant constructs of LDL in Section VI. A general overview of the LDL project can be found in [43]. Features for dealing with complex terms in LDL are presented by Zaniolo [133]; a language overview is given by Tsur and Zaniolo [119]; the treatment of sets and negation is formally presented by Beeri et al. [21]; other papers on the subject are [19] and [73]. An excellent description of the LDL language and of related features is given in [92].

The NAIL! project (Not Another Implementation of Logic!) is under development at Stanford University with the support of NSF and IBM. NAIL! processes queries in Datalog, but interfaces a conventional SQL database system (running on IBM PC/RT). An overview of the NAIL! project is presented by Morris, Ullman, and Van Gelder in [88]; an update can be found in [89]. Various kinds of algorithms applied in the prototype are discussed in [90] and [121].

The KIWI Esprit project, sponsored by the EEC, is a joint effort for the development of knowledge bases, programmed through an object-oriented language (OOPS), and interfaced to an existing relational database. The Advanced Database Interface (ADE) of KIWI is developed jointly by the University of Calabria, CRAI, and ENIDATA (Italy). A general overview of KIWI and ADE can be found in [106a]. The mini-magic variation to the magic set approach, used in KIWI, is described by Saccà and Zaniolo [104].

The ALGRES Project, under development in the frame of the METEOR Esprit project, is also sponsored by the EEC. The ALGRES project extends the relational model to support nonnormalized relations and a fixpoint operator, and supports Datalog as programming language. ALGRES is a joint effort of the Politecnico di Milano and of TXT-Techint (Italy). A general overview of the ALGRES Project can be found in a paper by Ceri, Crespi-Reghizzi et al. [32]. Other papers are [30] and [31].

Other relevant research projects which deal with rule-based computations are POSTGRES, under development at Berkeley University [117], and the 5th GENERATION Project [50], [91], under development at the Institute for New Generation Computer Technology (ICOT), Tokyo, Japan. More details about the above-mentioned projects can be found in [34]. Other recent overviews of the projects on databases and logic were presented by Zaniolo [134] in a dedicated issue of IEEE-Data Engineering and by Gardarin and Simon [53] in TSI.

VII. CONCLUSIONS

There is no doubt that Datalog theory has been nicely developed in the last five years, through the flourishing of many elegant contributions. The main attraction of Datalog is the possibility of dealing, within a unique formalism, with nonrecursive expressions (or views) as well as with recursive ones. Although this area is still very active, we feel that some basic understanding has been established, thus allowing for systematic treatment.

One of the major challenges that Datalog research has still to meet is to convince the knowledge base community of the practical merits of this theory. The weaknesses of Datalog work have been indicated as follows.

a) Very few applications have been shown which can take full advantage of Datalog's expressive power. In particular, no useful applications have been reported so far for nonlinear or mutually recursive rules.

b) Datalog is not considered as a programming language, but rather as a “pure” computational paradigm. For instance, Datalog does not provide support for writing user interfaces, and does not support quite useful programming tools, such as modularization and structured types.

c) Datalog does not compromise its clean declarative style in any way, while sometimes it is required that the programmer may take control of inference processing, by stating the order and method of execution of rules. This is typical, for instance, of many expert system shells.

d) Datalog systems have been considered, until now, as closed worlds that do not talk to other systems, while the current trend is towards supporting heterogeneous systems.

Some of the above criticisms are in fact well founded, and provide an indication of the directions in which we expect Datalog to move in order to become fully applicable. Datalog research will have to consider with great care the advances in other research areas; in particular, we have indicated in Section VI that Datalog can be extended to support complex terms; this is a first step towards the development of new language paradigms which use some of the concepts from object-oriented databases. The foundations of such an evolution have already been laid by Beeri [23] and Abiteboul and Kanellakis [5]; this work has shown that rule-based and object-oriented approaches are not in opposition, but rather they are capable of providing useful programming concepts to each other.

In summary, we expect that supporting rule computation will be one of the ingredients of future knowledge base systems; Datalog research has provided exact methods and a fairly good understanding for approaching this issue.

REFERENCES

[1] S. Abiteboul and S. Grumbach, “Bases de donnees et objets complexes,” Tech. Sci. Inform., vol. 6, no. 5, 1987.
[2] S. Abiteboul and C. Beeri, “On the power of languages for the manipulation of complex objects,” in Proc. Int. Workshop Theory Appl. Nested Relations Complex Objects, Darmstadt, West Germany, abstract, 1987.
[3] S. Abiteboul and R. Hull, “Data functions, datalog and negation,” in Proc. ACM-SIGMOD Conf., 1988.
[4] S. Abiteboul and V. Vianu, “Procedural and declarative database update languages,” in Proc. ACM SIGMOD-SIGACT Symp. Principles Database Syst., 1988, pp. 240-250.
[5] S. Abiteboul and P. C. Kanellakis, “Object identity as a query language primitive,” manuscript, 1989.
[6] P. Aczel, “An introduction to inductive definitions,” in The Handbook of Mathematical Logic, J. Barwise, Ed. Amsterdam, The Netherlands: North-Holland, 1977, pp. 739-782.
[7] R. Agrawal and H. V. Jagadish, “Direct algorithms for computing the transitive closure of database relations,” in Proc. 13th Int. Conf. Very Large Databases, Brighton, U.K., 1987.
[8] -, “Alpha: An extension of relational algebra to express a class of recursive queries,” in Proc. IEEE 3rd Int. Conf. Data Eng., Los Angeles, CA, Feb. 1987.
[9] -, “Multiprocessor transitive closure algorithms,” Data Eng. (Special Issue on Databases for Parallel and Distributed Systems), vol. 12, Mar. 1989.
[10] A. V. Aho and J. D. Ullman, “Universality of data retrieval languages,” presented at the 6th ACM Symp. Principles Programming Languages, San Antonio, TX, Jan. 1979.
[11] K. R. Apt and M. H. Van Emden, “Contributions to the theory of logic programming,” J. ACM, vol. 29, no. 3, 1982.
[12] K. Apt, H. Blair, and A. Walker, “Towards a theory of declarative knowledge,” IBM Res. Rep. RC 11681, Apr. 1986.
[33] S. Ceri, G. Gottlob, and G. Wiederhold, “Efficient database access through Prolog,” IEEE Trans. Software Eng., Feb. 1989.
[34] S. Ceri, G. Gottlob, and L. Tanca, Logic Programming and Databases. New York: Springer-Verlag, to be published.
[35] U. S. Chakravarthy, J. Minker, and J. Grant, “Semantic query optimization: Additional constraints and control strategies,” in Proc. 1st Int. Conf. Expert Database Syst., L. Kerschberg, Ed., Charleston, 1986; and in Expert Database Systems. Menlo Park, CA: Benjamin-Cummings, 1987.
[36] U. S. Chakravarthy, J. Grant, and J. Minker, “Foundations of semantic query optimization for deductive databases,” in Proc. Int. Workshop Foundations Deductive Databases Logic Programming, J. Minker, Ed., Aug. 1986.
[37] A. Chandra and D. Harel, “Horn clause queries and generalizations,” J. Logic Programming, vol. 1, pp. 1-15, 1985.
[38] A. Chandra, “A theory of database queries,” in Proc. ACM SIGMOD-SIGACT Symp. Principles Database Syst., Mar. 1988.
[39] C.
L. Chang and R. C. Lee, Symbolic Logic and Mechanical Theo- [13] F. Bancilhon, “Naive evaluation of recursively defined relations,” rem Proving. New York: Academic, 1973. in On Knowledge Based Management Systems-Integrating Data- [40] C. C. Chang and H. J. Keisler, Model Theory. Amsterdam, The base and AI Systems, Brodie and Mylopoulos, Eds. New York: Netherlands. 1977. Springer-Verlag, 1985. [41] C. Chang, “On the evaluation of queries containing derived rela- [14] F. Bancilhon, D. Maier, Y. Sagiv, and J. D. Ullman, “Magic sets tions in relational databases,” in Advances in Database Theory, and other strange ways to implement logic programs,” in Proc. Vol. 1, H. Gallaire, J. Minker, and J. M. Nicholas, Eds. New ACM SZGMOD-SIGAT Symp. Principles Database Syst., Cam- York: Plenum, 1981. bridge, MA, Mar. 1986. [42] C. L. Chang and A. Walker, ‘ ‘PROSQL: A Prolog programming [15] F. Bancilhon and R. Ramakrishnan, “An amateur’s introduction interface with SQLiDS,” in Proc. First Workshop Expert Data- to recursive query processing,” in Proc. ACM-SZGMOD Conf., base Syst., Kiawah Island, SC, Oct. 1984; and in Expert Database May 1986. Systems, L. Kerschberg, Ed. Menlo Park, CA: Benjamin-Cum- 1161 -, “Performance evaluation of data intensive logic programs,” mings, 1986. in Foundations of Deductive Databases and Logic Programming, [43] D. Chimenti, T. O’Hare, R. Krishnamurthy, S. Naqvi, S. Tsur, J. Minker, Ed. Washington, DC, 1986. C. West, and C. Zaniolo, “An overview of the LDL system,” [17] F. Bancilhon, D. Maier, Y. Sagiv, and J. D. Ullman, “Magic sets: Special Issue on Databases and Logic, IEEE Data Engineering, Algorithms and examples,” manuscript, 1986. vol. 10, Dec. 1987. [18] F. Bancilhon and S. Khoshafian, “A calculus for complex ob- [44] K. L. Clark, “Negation as failure, ” in Logic and Databases. New jects,” in Proc. SZGMOD 86, 1986. York: Plenum, 1978. [19] F. Bancilhon, T. Briggs, S. Khoshafian, and P. Valduriez, “FAD, 1451 W. F. Clocksin and C. S. 
Mellish, Programming in Prolog. New A powerful and simple database language,” in Proc. 13th Int. Conf. York: Springer-Verlag, 1981. Very Large Data Bases, Brighton, U.K., 1987. 1461 F. Cunnens and R. Demolombe, “A PROLOG-relational DBMS _ _ .I [20] R. Bayer, “Query evaluation and recursion in interface using delayed evaluation,” presented at the Workshop on systems,” manuscript, 1985. Integration of Logic Programming and Databases, Venice, Dec. [21] C. Beeri, et al., “Sets and negation in a logical database language 1986. (LDLl),” in Proc. ACM SZGMOD-SZGACT Symp. Principles Da- [47] P. Devanbu and R. Agrawal, “Moving selections into fixpoint tabase Syst., San Diego, CA, Mar. 1987. queries,” manuscript, Bell Labs, Murray Hill, NJ, 1986. [22] C. Beeri and R. Ramakrishnan, “On the power of magic,” in Proc. [48] P. Fischer and S. Thomas, “Operators for non-first-normal-form ACM SZGMOD-SIGACT Symp. Principles Database Syst., San relations,” in Proc. 7th Int. Comput. Software Appl. Conf., Chi- Diego, CA, Mar. 1987. cago, IL, 1983. [23] C. Beeri, “Data models and languages for databases,” in Proc. [49] M. Fitting, “A Kripke-Kleene semantics for logic programs,” J. 2nd Znt. Con& Database Theory, Bruges, Belgium, 1988; and in Logic Programming, vol. 2, pp. 295-312, 1985. LNCS 326. New York: Springer-Verlag. 1988. [50] K. Fuchi, “Revisiting original philosophy of fifth generation com- [24] J. Bocca, H. Decker, J.-M. Nicolas, L. Vielle, and M. Wallace, puter project,” presented at the Int. Conf. Fifth Generation Com- “Some steps toward a DBMS-based KBMS,” in Proc. ZFZP World put. Syst., 1984. Conf., Dublin, 1986. [Sl] H. Gallaire and J. Minker, Eds, Logic and Databases. New York: [2.5] J. Bocca, “On the evaluation strategy of EDUCE,” in Proc. ACM- Plenum, 1978. SZGMOD Conf., Washington, DC, May 1986. [52] G. Gardarin and C. De Maindreville, “Evaluation of database re- [26] M. L. 
Brodie, “Future intelligent information systems: AI and da- cursive logic programs as recurrent function series,” in Proc. ACM- tabase technologies working together,” in Readings in Artificial SZGMOD Conf., Washington, DC, May 1986. Intelligence and Databases. San Mateo, CA: Morgan Kaufman, [53] G. Gardarin and E. Simon, “Les systemes de gestion de bases se 1988. donnees deductives,” Technique et Science Informatiques, vol. 6, VI F. Bry, “Logic programming as constructivism: A formalization 1987. and its application to databases,” in 8th ACM Symp. Principles [54] G. Gardarin, “Magic functions: A technique to optimize extended Datnbase Cyst. (PODS), Mar. 1989, pp. 34-50. datalog recursive programs, ” in Proc. 13th Conf. Very Large Da- 1281- _ S. Ceri. G. Gottlob, and L. Lavazza, “Translation and optimiza- tabases, Brighton, U.K., 1987. tion of logic queries: the algebraic approach,” in Proc. Z2th Znt. [55] G. Gentzen, “Die Wiederspruchsfreiheit der reinen Zahlentheo- Conf Very Large Data Bases, Kyoto, Aug. 1986. rie,” Math. Annalen, vol. 112, pp. 493-565, 1936. [29] S. Ceri and L. Tanca, “Optimization of systems of algebraic equa- [56] K. Godel, “Die Vollstandigkeit der Axiome des logischen Funk- tions for evaluating Datalog queries, ” in Proc. 13th Znt. Conf. Very tionenkalkiils,” Monatshefte fir Mathematik und Physik, vol. 37, Large Data Bases, Brighton: U.K., Sept. 1987. pp. 349,360, 1930. 1301_ S. Ceri. S. Cresni Reahizzi. G. Gottlob, F. Lamperti. L. Lavazza. 1571. _ -. “Uber formal unentscheidbare Sltze der Principia Mathe- L. Tanca, and R. Zicari, “The ALGRES project,” in Proc. Int. matica und verwandter Systeme I,” Monatshefte fiir &iathematik Conf Extending Database Technol. (EDBT88), Venice, 1988. und Physik, vol. 38, pp. 173-198, 1931. [31] S. Ceri, S. Crespi-Reghizzi, L. Lavazza, and R. Zicari, [58] A. 
Guttman, “New features for relational database systems to sup- “ALGRES: A system for the specification and prototyping of com- port CAD applications,” Ph.D. dissertation, Dep. Comput. Sci., plex databases,” Dip. Elettronica, Politecnico di Milano, Int. Rep. Univ. California, Berkeley, June 1984. 87-018, 1987. [59] L. J. Henschen and S. A. Naqvi, “On compiling queries in recur- [32] S. Ceri, S. Crespi-Reghizzi, G. Lamperti, L. Lavazza, and R. Zi- sive first order databases,” J. ACM, vol. 31, no. 1, 1984. cari, “ALGRES: An advanced database system for complex ap- [60] D. Hilbert, “Axiomatisches Denken,” Mathematische Annalen 78, plications,” IEEE Software, to be published. pp. 405-415, 1918. CERl er al.: DATALOG 165

[61] Y. E. Ioannidis and E. Wong, “An algebraic approach to recursive inference,” Electron. Res. Lab., Univ. California, Berkeley, Int. Rep. UCB/ERL M85/92, 1985.
[62] Y. E. Ioannidis, “On the computation of the transitive closure of relational operators,” in Proc. 12th Int. Conf. Very Large Databases, Kyoto, Japan, 1986.
[63] Y. E. Ioannidis and E. Wong, “Transforming non-linear recursion into linear recursion,” manuscript, 1987.
[64] Y. E. Ioannidis, J. Chen, M. A. Friedman, and M. M. Tsangaris, “BERMUDA - An architectural perspective on interfacing Prolog to a database machine,” Dep. Comput. Sci., Univ. Wisconsin, Tech. Rep. 723, Oct. 1987.
[65] H. Itoh, “Research and development on knowledge base systems at ICOT,” in Proc. 12th Int. Conf. Very Large Data Bases, Kyoto, Japan, Aug. 1986.
[66] B. Jaeschke and H. J. Schek, “Remarks on the algebra of non first normal form relations,” in Proc. ACM SIGMOD-SIGACT Symp. Principles Database Syst., Los Angeles, CA, 1982, pp. 124-138.
[67] J. F. Jenq and S. Sahni, “All pairs shortest paths on a hypercube multiprocessor,” in Proc. IEEE Int. Conf. Parallel Processing, Aug. 1987.
[68] M. Kifer and E. L. Lozinskii, “Filtering data flow in deductive databases,” in Proc. 1st Int. Conf. Database Theory, Roma, Sept. 1986.
[69] W. Kim, D. S. Reiner, and D. S. Batory, Query Processing in Database Systems. New York: Springer-Verlag, 1985.
[70] J. King, “Quist: A system for semantic query optimization in relational databases,” in Proc. 7th Int. Conf. Very Large Data Bases, Cannes, 1981.
[71] Ph.G. Kolaitis, “On the expressive power of stratified datalog programs,” Stanford Univ., Stanford, CA, preprint, Nov. 1987.
[72] Ph.G. Kolaitis and Ch.H. Papadimitriou, “Why not negation by fixpoint?,” in Proc. ACM SIGMOD-SIGACT Symp. Principles Database Syst., 1988, pp. 231-239.
[73] R. Krishnamurthy and C. Zaniolo, “Optimization in a logic based language for knowledge and data intensive applications,” in Proc. Int. Conf. Extending Database Technol. (EDBT88), Venice, 1988; and LNCS, No. 303. New York: Springer-Verlag, 1988.
[74] K. Kunen, “Negation in logic programming,” J. Logic Programming, vol. 4, pp. 289-308, 1987.
[75] G. M. Kuper, “Logic programming with sets,” in Proc. ACM SIGMOD-SIGACT Symp. Principles Database Syst., 1987, pp. 11-20.
[76] E. Lambrichts, P. Nees, J. Paredaens, P. Peelman, and L. Tanca, “MilAnt: An extension of datalog with complex objects, functions and negation,” Dep. Comput. Sci., Univ. Antwerp, Int. Rep., 1988.
[77] E. Lambrichts, P. Nees, J. Paredaens, P. Peelman, and L. Tanca, “Integration of functions in the fixpoint semantics of rule based systems,” in Proc. 2nd Symp. Math. Fundamentals Database Theory, Visegrad, Hungary, June 1989; and LNCS. New York: Springer-Verlag, 1989.
[78] J. W. Lloyd, Foundations of Logic Programming, 2nd extended ed. New York: Springer-Verlag, 1987.
[79] L. Löwenheim, “Über Möglichkeiten im Relativkalkül,” Mathematische Annalen, vol. 76, pp. 447-470, 1915.
[80] D. W. Loveland, Automated Theorem Proving: A Logical Basis. Amsterdam, The Netherlands: North-Holland, 1978.
[81] E. Lozinskii, “Evaluating queries in deductive databases by generating,” in Proc. Int. Joint Conf. Artificial Intell., 1985.
[82] D. Maier and D. S. Warren, Computing with Logic. Menlo Park, CA: Benjamin-Cummings, 1988.
[83] A. I. Malcev, Algebraic Systems. New York: Springer-Verlag, 1973.
[84] V. Mannino, P. Chu, and T. Sager, “Statistical profile estimation in database systems,” ACM Comput. Surveys, vol. 20, Sept. 1988.
[85] D. McKay and S. Shapiro, “Using active connection graphs for reasoning with recursive rules,” in Proc. 7th Int. Joint Conf. Artificial Intell., 1981.
[86] G. Marque-Pucheu, “Algebraic structure of answers in a recursive logic data-base,” Acta Inform., 1983.
[87] G. Marque-Pucheu, J. M. Gallausiaux, and G. Jomier, “Interfacing Prolog and relational database management systems,” New Applications of Databases, Gardarin and Gelenbe, Eds. New York: Academic, 1984.
[88] K. Morris, J. D. Ullman, and A. Van Gelder, “Design overview of the Nail! system,” in Proc. Int. Conf. Logic Programming. New York: Academic, 1986.
[89] K. Morris, J. Naughton, Y. Saraiya, J. Ullman, and A. Van Gelder, “YAWN! (Yet another window on NAIL!),” Special Issue on Databases and Logic, IEEE Data Eng., vol. 10, Dec. 1987.
[90] K. A. Morris, “An algorithm for ordering subgoals in Nail!,” in Proc. ACM SIGMOD-SIGACT Symp. Principles Database Syst., Austin, TX, 1988.
[91] K. Murakami, T. Kakuta, N. Miyazaki, S. Shibayama, and H. Yokota, “A relational database machine, first step to knowledge base machine,” in Proc. 10th Symp. Comput. Architecture, June 1983.
[92] S. Naqvi and S. Tsur, A Logical Language for Data and Knowledge Bases. New York: Computer Science Press, 1989.
[93] W. Nejdl, “Recursive strategies for answering recursive queries - the RQA/FQI strategy,” in Proc. 13th Int. Conf. Very Large Data Bases, Brighton, U.K., Sept. 1987.
[94] E. Neuhold and M. Stonebraker, “Future directions in DBMS research,” Int. Comput. Sci. Inst., Berkeley, CA, TR 88-01, May 1988.
[95] T. Przymusinski, “On the semantics of stratified deductive databases,” in Proc. Workshop Foundations Deductive Databases Logic Programming, Washington, DC, 1986, pp. 433-443.
[96] -, “Every logic program has a natural stratification and an iterated least fixed point model,” in 8th ACM Symp. Principles Database Syst. (PODS), Mar. 1989, pp. 11-21.
[97] Quintus Computer Systems Inc., Mountain View, CA, Quintus Prolog Data Base Interface Manual, version 1, June 29, 1987.
[98] R. Ramakrishnan, C. Beeri, and R. Krishnamurthy, “Optimizing existential Datalog queries,” in Proc. ACM SIGMOD-SIGACT Symp. Principles of Database Syst., Austin, TX, Mar. 1988.
[99] -, “Magic templates, A spellbinding approach to logic evaluation,” in Proc. Logic Programming Conf., Aug. 1988.
[100] D. Roelants, “Recursive rules in logic databases,” Philips Res. Lab., Bruxelles, Rep. R513, Mar. 1987, submitted for publication.
[101] J. Rohmer, R. Lescoeur, and J. M. Kerisit, “The Alexander method: technique for the processing of recursive axioms in deductive databases,” in New Generation Computing, vol. 4. New York: Springer-Verlag, 1986.
[102] A. Rosenthal, S. Heiler, U. Dayal, and F. Manola, “Traversal recursion: A practical approach to supporting recursive applications,” in Proc. ACM SIGMOD 1986 Int. Conf. Management of Data, Washington, DC, May 1986.
[103] K. A. Ross, “A procedural semantics for well founded negation in logic programs,” in 8th ACM Symp. Principles Database Syst. (PODS), Mar. 1989, pp. 22-32.
[104] D. Saccà and C. Zaniolo, “On the implementation of a simple class of logic queries for databases,” in Proc. ACM 1986 SIGMOD-SIGACT Symp. Principles Database Syst., Cambridge, MA, Mar. 1986.
[105] -, “Implementing recursive logic queries with function symbols,” MCC Tech. Rep. DB-401-86, Dec. 1986.
[106] -, “Magic counting methods,” in Proc. ACM-SIGMOD Conf., San Francisco, CA, May 1987.
[106a] D. Saccà, M. Dispinzieri, A. Mecchia, C. Pizzuti, C. Del Gracco, and P. Naggar, “The advanced database environment of the KIWI system,” Special Issue on Databases and Logic, IEEE Data Engineering, vol. 10, no. 4, Dec. 1987.
[107] Y. Sagiv, “Optimizing Datalog programs,” in Proc. ACM 1987 SIGMOD-SIGACT Symp. Principles Database Syst., San Diego, CA, Mar. 1987.
[108] L. Schmitz, “An improved transitive closure algorithm,” Comput., vol. 30, 1983.
[109] C. P. Schnorr, “An algorithm for transitive closure with linear expected time,” SIAM J. Comput., vol. 7, May 1978.
[110] J. C. Shepherdson, “Negation as Failure II,” J. Logic Programming, vol. 2, no. 3, pp. 185-202, 1985.
[111] -, “Negation in logic programming,” in Foundations of Deductive Databases and Logic Programming, J. Minker, Ed. Los Altos, CA, 1988, pp. 19-88.
[112] O. Shmueli and Sh. Naqvi, “Set grouping and layering in Horn clause programs,” in Proc. Int. Conf. Logic Programming, 1987, pp. 152-177.
[113] T. Skolem, “Logisch-kombinatorische Untersuchungen über die Erfüllbarkeit oder Beweisbarkeit mathematischer Sätze nebst einem Theoreme über dichte Mengen,” Skrifter utgit av Videnskapsselskapet i Kristiania, I, Mat.-Nat. Klasse, no. 4, 1920.

[114] D. E. Smith, M. R. Genesereth, and M. L. Ginsberg, “Controlling recursive inference,” Artificial Intell., vol. 30, no. 3, 1986.
[115] L. Sterling and E. Shapiro, The Art of Prolog. Cambridge, MA: M.I.T. Press, 1986.
[116] M. E. Stickel, “A unification algorithm for associative commutative functions,” JACM, vol. 28, July 1981.
[117] M. Stonebraker and L. A. Rowe, Eds., “The Postgres papers,” Univ. Calif., Berkeley, Mem. UCB/ERL M86/86, June 1987 (revised version).
[118] L. Tanca, “Optimization of recursive logic queries to relational databases” (in Italian), Ph.D. dissertation, Politecnico di Milano and Universita’ di Napoli, 1988.
[119] S. Tsur and C. Zaniolo, “LDL: A logic-based query language,” in Proc. 12th Int. Conf. Very Large Data Bases, Kyoto, Japan, 1986.
[120] J. D. Ullman, “Implementation of logic query languages for databases,” ACM Trans. Database Syst., vol. 10, no. 3, 1985.
[121] J. D. Ullman and A. Van Gelder, “Testing applicability of top-down capture rules,” Stanford Univ., Stanford, CA, Int. Rep. STAN-CS-851046; and ACM-J., to be published.
[122] J. D. Ullman, Principles of Databases and Knowledge-Base Systems, Volume I. Potomac, MD: Computer Science, 1988.
[123] P. Valduriez and H. Boral, “Evaluation of recursive queries using join indices,” in Proc. 1st Int. Conf. Expert Database Syst., Charleston, SC, 1986.
[124] P. Valduriez and S. Khoshafian, “Parallel evaluation of the transitive closure of a database relation,” Int. J. Parallel Programming, vol. 17, Feb. 1988.
[125] M. Van Emden and R. Kowalski, “The semantics of predicate logic as a programming language,” J. ACM, vol. 4, Oct. 1976.
[126] A. Van Gelder, A. Ross, and J. S. Schlipf, “The well-founded semantics for general logic programs,” in 7th ACM Symp. Principles Database Syst. (PODS), Mar. 1988, pp. 221-230.
[127] A. Van Gelder, “The alternating fixpoint of logic programs with negation,” in 8th ACM Symp. Principles Database Syst. (PODS), Mar. 1989, pp. 1-10.
[128] L. Vieille, “Recursive axioms in deductive databases: The Query-Subquery approach,” in Proc. Int. Conf. Expert Database Syst., L. Kerschberg, Ed., Charleston, 1986.
[129] -, “A database complete proof procedure based on SLD resolution,” ECRC, Munich, West Germany, Int. Rep. IR-KB40, Nov. 1986.
[130] -, “From QSQ to QoSaQ: Global optimization of recursive queries,” in Proc. 2nd Int. Conf. Expert Database Syst., L. Kerschberg, Ed., Tyson Corner, 1988.
[131] H. S. Warren, “A modification of Warshall’s algorithm for the transitive closure of binary relations,” Commun. ACM, vol. 18, Apr. 1975.
[132] S. Warshall, “A theorem on boolean matrices,” J. ACM, vol. 9, June 1962.
[133] C. Zaniolo, “The representation and deductive retrieval of complex objects,” in Proc. 11th Int. Conf. Very Large Data Bases, Aug. 1985.
[134] -, Special Issue on Databases and Logic, IEEE Data Eng., vol. 10, Dec. 1987.
[135] C. Zaniolo and D. Saccà, “Rule rewriting methods for efficient evaluation of Horn logic,” MCC Tech. Rep. DB-084-87, 1987.
[136] M. M. Zloof, “Query-by-example: Operations on the transitive closure,” IBM, Yorktown Heights, NY, RC 5526, 1975.

Stefano Ceri is a Professor of Computer Science at the Dipartimento di Matematica, University of Modena, Modena, Italy, and Visiting Professor at Stanford University, Stanford, CA, during the summer terms. Until 1986 he was with the Dipartimento di Elettronica, Politecnico di Milano, Milan, Italy. His research interests include distributed databases, deductive and object-oriented databases, , medical databases, and the use of databases in software engineering. He is the author of numerous articles in these areas and coauthor of the book, Distributed Databases, Principles and Systems (New York: McGraw-Hill). He has been an active participant of several joint projects between the university and industry, sponsored by the National Research Council of Italy, the NSF, and Esprit (EEC).
Dr. Ceri is a member of ACM, Vice-Chairman of the IEEE Technical Committee on Data Engineering, and Associate Editor of the journals, ACM Transactions on Database Systems and Distributed Computing.

Georg Gottlob received the Dipl.-Ing. degree and the Doctorate in computer science from the Technical University of Vienna, Vienna, Austria. From 1982 to 1984 he served as a Research Associate at the Politecnico di Milano, Milan, Italy, and from 1985 to 1988 he was a Staff Scientist at the Institute for Applied Mathematics, Italian National Research Council (C.N.R.), Genoa. During the Summers of 1985 and 1987, he lectured and performed research at Stanford University, Stanford, CA. Presently he is a Professor of Computer Science at the Technical University of Vienna, where he directs the Database and Expert-System Subdivision. His research interests are in the fields of databases, expert-systems, and applied mathematical logic.
Dr. Gottlob is a member of ACM, the IEEE Computer Society, and the Kurt Goedel Society.

Letizia Tanca received the Ph.D. degree in applied mathematics and computer science from the Politecnico di Milano, Milan, Italy, in 1988. Prior to working towards the Ph.D. degree, she worked in industry as a Software Designer. At present she is Postdoctorate Fellow at the Politecnico di Milano. Her main research interests are the treatment of negative and functional information in relational and deductive databases and extensions to the relational data model (in Milan) within the project Algres (an extended relational environment for manipulating complex objects). She is also collaborating with Prof. Paredaens of the University of Antwerp (UIA), on the project of a rule-based language for manipulating complex objects and functions. She has written a book on the integration of Relational Databases and Logic Programming, coauthored with S. Ceri and G. Gottlob (New York: Springer-Verlag, to be published).