What You Always Wanted to Know About Datalog (And Never Dared to Ask)
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 1, NO. 1, MARCH 1989

STEFANO CERI, GEORG GOTTLOB, AND LETIZIA TANCA

Abstract: Datalog is a database query language based on the logic programming paradigm; it has been designed and intensively studied over the last five years. We present the syntax and semantics of Datalog and its use for querying a relational database. Then, we classify optimization methods for achieving efficient evaluations of Datalog queries, and present the most relevant methods. Finally, we discuss various enhancements of Datalog, currently under study, and indicate what is still needed in order to extend Datalog's applicability to the solution of real-life problems. The aim of this paper is to provide a survey of research performed on Datalog, also addressed to those members of the database community who are not too familiar with logic programming concepts.

Index Terms: Deductive databases, logic programming, recursive queries, relational databases, query optimization.

Manuscript received March 1, 1989. This paper is based on Parts II and III of the book by S. Ceri, G. Gottlob, and L. Tanca, Logic Programming and Databases (New York: Springer-Verlag, to be published).
S. Ceri is with the Dipartimento di Matematica, Università di Modena, Modena, Italy.
G. Gottlob is with the Institut für Angewandte Informatik, TU, Wien, Austria.
L. Tanca is with the Dipartimento di Elettronica, Politecnico di Milano, Milano, Italy.
IEEE Log Number 8928155.
1041-4347/89/0300-0146$01.00 © 1989 IEEE

I. INTRODUCTION

Recent years have seen substantial efforts in the direction of merging artificial intelligence and database technologies for the development of large and persistent knowledge bases. An important contribution towards this goal comes from the integration of logic programming and databases. The focus has been mostly concentrated by the database theory community on well-formalized issues, like the definition of a new rule-based language, called Datalog, specifically designed for interacting with large databases; and the definition of optimization methods for various types of Datalog rules, together with the study of their efficiency [120], [15], [16]. In parallel, various experimental projects have shown the feasibility of Datalog programming environments [119], [24], [88].¹

Present efforts in the integration of artificial intelligence and databases take a much more basic and pragmatic approach; in particular, several attempts fall in the category of "loose coupling," where existing AI and DB environments are interconnected through ad-hoc interfaces. In other cases, AI systems have solved persistency issues by developing internal databases for their tools; but these internal databases typically do not allow data sharing and recovery, thus do not properly belong to current database technology [26]. The spread and success of such enhanced AI systems, however, indicate that there is a great need for them.

Loose coupling has been attempted in the area of Logic Programming and databases as well, by interconnecting Prolog systems to relational databases [42], [25], [46], [W], [33], [97]. Although interesting results have been achieved, most studies indicate that simple interfaces are too inefficient; an enhancement in efficiency is achieved by intelligent interfaces [64], [33]. This indicates that loose coupling might in fact solve today's problems, but on the long range, strong integration is required. More generally, we expect that knowledge base management systems will provide direct access to data and will support rule-based interaction as one of the programming paradigms. Datalog is a first step in this direction.

The reaction of the database community to Datalog has often been marked by skepticism; in particular, the immediate or even future practical use of research on sophisticated rule-based interactions has been questioned [94]. Nevertheless, we do expect that Datalog's experience, properly filtered, will teach important lessons to researchers involved in the development of knowledge base systems. The purpose of this paper is to give a self-contained survey of the research that has been performed recently on Datalog. More exhaustive treatments can be found in the books [34], [122], and [92]. Other interesting survey papers, approaching the subject from different perspectives, are [53] and [100].

This paper is organized as follows. In Section II we present the foundations of Datalog: the syntactic structure and the semantics of Datalog programs. In Section III we explain how Datalog is used as a query language over relational databases; in particular, we indicate how Datalog can be immediately translated to equations of relational algebra. In Section IV we present a taxonomy of the various optimization methods, emphasizing the distinction between program transformation and evaluation methods. In Section V, we present a survey of evaluation methods and program transformation techniques, selecting the most representative ones within the classes outlined in Section IV. In Section VI, we present several formal extensions given to the Datalog language to enhance its expressiveness, and we survey some of the current research projects in this area. Finally, in Section VII we attempt an evaluation of what will be required in Datalog in order to become more attractive and usable.

¹The term "Datalog" was also used by Maier and Warren in [82] to denote a subset of Prolog.

II. DATALOG: SEMANTICS AND EVALUATION PARADIGMS

In this section we define the syntax of the Datalog query language and explain its logical semantics. We give a model-theoretic characterization of Datalog programs and show how they can be evaluated in a bottom-up fashion. This evaluation corresponds to computing a least fixpoint. Finally, we briefly mention another evaluation paradigm called "top-down evaluation," and compare Datalog to the well-known logic programming language Prolog.

A. The Syntax of Datalog Programs

Datalog is in many respects a simplified version of general Logic Programming [78]. A logic program consists of a finite set of facts and rules. Facts are assertions about a relevant piece of the world, such as: "John is the father of Harry". Rules are sentences which allow us to deduce facts from other facts. An example of a rule is: "If X is a parent of Y and if Y is a parent of Z, then X is a grandparent of Z". Note that rules, in order to be general, usually contain variables (in our case, X, Y, and Z). Both facts and rules are particular forms of knowledge.

In the formalism of Datalog both facts and rules are represented as Horn clauses of the general shape

$L_0 \mathrel{:-} L_1, \ldots, L_n$

where each $L_i$ is a literal of the form $p_i(t_1, \ldots, t_{k_i})$ such that $p_i$ is a predicate symbol and the $t_j$ are terms. A term is either a constant or a variable.² The left-hand side (LHS) of a Datalog clause is called its head and the right-hand side (RHS) is called its body. The body of a clause may be empty. Clauses with an empty body represent facts; clauses with at least one literal in the body represent rules.

The fact "John is the father of Bob", for example, can be represented as father(bob, john). The rule "If X is a parent of Y and if Y is a parent of Z, then X is a grandparent of Z" can be represented as [...]

[...] of the same arity, i.e., that they have the same number of arguments. A literal, fact, rule, or clause which does not contain any variables is called ground.

Any Datalog program P must satisfy the following safety conditions.
• Each fact of P is ground.
• Each variable which occurs in the head of a rule of P must also occur in the body of the same rule.
These conditions guarantee that the set of all facts that can be derived from a Datalog program is finite.

B. Datalog and Relational Databases

In the context of general Logic Programming it is usually assumed that all the knowledge (facts and rules) relevant to a particular application is contained within a single logic program P. Datalog, on the other hand, has been developed for applications which use a large number of facts stored in a relational database. Therefore, we will always consider two sets of clauses: a set of ground facts, called the Extensional Database (EDB), physically stored in a relational database, and a Datalog program P called the Intensional Database (IDB).³ The predicates occurring in the EDB and in P are divided into two disjoint sets: the EDB-predicates, which are all those occurring in the extensional database, and the IDB-predicates, which occur in P but not in the EDB. We require that the head predicate of each clause in P be an IDB-predicate. EDB-predicates may occur in P, but only in clause bodies.

Ground facts are stored in a relational database; we assume that each EDB-predicate r corresponds to exactly one relation R of our database such that each fact $r(c_1, \ldots, c_k)$ of the EDB is stored as a tuple $\langle c_1, \ldots, c_k \rangle$ of R.

Also the IDB-predicates of P can be identified with relations, called IDB-relations, or also derived relations, defined through the program P and the EDB. IDB-relations are not stored explicitly, and correspond to relational views. The materialization of these views, i.e., their effective (and efficient) computation, is the main task of a Datalog compiler or interpreter.

As an example of a relational EDB, consider a database $E_1$ consisting of two relations with respective schemes PERSON(NAME) and PAR(CHILD, PARENT).