
IMPLEMENTATION OF THE RELATIONLOG SYSTEM

A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES AND RESEARCH IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE
UNIVERSITY OF REGINA

BY Riqiang Shan Regina, Saskatchewan September 4, 1998

© Copyright 1998: Riqiang Shan


Abstract

Advanced database applications could require the abilities to store data with complex structures and to perform inference. However, such capabilities are not directly supported in existing database systems. Relationlog, a persistent deductive database system, has been designed and implemented to directly support the storage and inference of data with a complex structure. It provides a uniform declarative language for defining, manipulating and querying a database. It also supports user-defined complex types and employs various evaluation strategies for query processing without the user's intervention. The Relationlog system can be used in business management, information retrieval, industry, academic course teaching and research. This thesis gives an introduction to the Relationlog database language, which includes a SQL-like Data Definition Language, Data Manipulation Language, and Query Language. It discusses the design and implementation of the Relationlog system as well as its evaluation strategies. It also contains the experimental results and a comparison with other related systems.

Acknowledgments

I would like to take this opportunity to thank the many people who helped me a lot. Many thanks go first to my supervisor, Dr. Mengchi Liu, for his valuable advice and support throughout this research project. I am grateful to Dr. Liu for providing valuable guidance during my research, for letting me use every computer that he has, and for arranging my financial support. He has provided several key ideas that appear in various chapters of this thesis. Secondly, many thanks go to the members of my thesis committee, Dr. Mingyuan Chen, Dr. Larry Saxton, and Dr. Brien Maguire, for their precious comments and suggestions. I am also very grateful for the financial support given by the Department of Computer Science, the Faculty of Graduate Studies and Research of the University of Regina, the Faculty of Science of the University of Regina, and the Natural Sciences and Engineering Research Council of Canada. My special thanks go to Mr. Cory Butz and Mr. John Palmquist for their patience in proofreading my thesis. I also highly appreciate a lot of help from my generous friends: Binshu Shen, Huichu Mo, Guoqiang Huang, Xiaoyang Guan, Qinglai Zhang, and Xiaoyun Wang. Lastly, I would like to give my thanks to my beloved wife, Mingyuan Deng, for her encouragement, understanding, support and sacrifice in my studies.

Contents

Abstract
Table of Contents
List of Tables
List of Figures

Chapter 1  Introduction

Chapter 2  Background
    2.1  Relational Model
    2.2  Complex Value Models
    2.3  Deductive Databases
         2.3.1  Datalog
         2.3.2  Bottom-Up Evaluation
         2.3.3  Top-Down Evaluation
    2.4  Implemented Deductive Database Systems
         2.4.1  The LDL System
         2.4.2  The CORAL System
         2.4.3  Aditi
         2.4.4  Glue-Nail

Chapter 3  Relationlog Database Language
    3.1  Data Definition Language
         3.1.1  Domain Types
         3.1.2  Schema Definition
         3.1.3  Rules and Views
    3.2  Query Language
    3.3  Data Manipulation Language
    3.4  Relationlog Programs
         3.4.1  Well-Typed Programs
         3.4.2  Stratification

Chapter 4  Design and Implementation of Relationlog
    4.1  System Architecture
    4.2  File Architecture of the Persistent Database
         4.2.1  Nodes
         4.2.2  Data File
         4.2.3  Index File
         4.2.4  System Catalog Files
    4.3  Storage and Update Subsystem
         4.3.1  Buffer Manager
         4.3.2  Index Manager
         4.3.3  Update Manager
    4.4  DDL Manager
         4.4.1  Representation of Schemas
         4.4.2  Representation of Rules
         4.4.3  Representation of Catalogs
    4.5  DML Manager
         4.5.1  Representation of Relations
         4.5.2  Update Issues
    4.6  Query Manager
         4.6.1  Internal Representation
         4.6.2  Query Optimization
         4.6.3  Grouping

Chapter 5  Query Evaluation Strategies
    5.1  Matching
    5.2  Semi-Naive Bottom-Up with Rule Orderings
         5.2.1  Stratification
         5.2.2  Basic Semi-Naive Algorithm for Relationlog
         5.2.3  Extended Semi-Naive Algorithm for Relationlog
         5.2.4  Rule Orderings
    5.3  Top-Down Pipelining
    5.4  Summary

Chapter 6  Experimental Results and Related Work
    6.1  Experimental Results
    6.2  Comparison with Other Systems

Chapter 7  Conclusion and Future Research
    7.1  Contribution
    7.2  Future Research

Bibliography

Appendix A  Syntax of Relationlog

List of Tables

2.1  Similar concepts in databases and Datalog
6.1  Test results on STU
6.2  Test results on edges
6.3  Summary of Deductive Database Systems

List of Figures

2.1  Nested relations defined with extended relational algebra operators
2.2  Top-Down Evaluation of a Query
3.1  Sample Domain Definitions
3.2  Sample Schema Definitions
3.3  Sample Relations
3.4  ROBOT Relation Schema
3.5  ROBOT relation
3.6  Sample Relationlog Program family
3.7  Dependency Graph for the family Program
3.8  Dependency Graph for the family Program
4.1  Relationlog System Architecture
4.2  File Structure
4.3  Index and Data File
4.4  Index File
4.5  Page Slot Address Format
4.6  Mapping Between Disk and Memory
4.7  Components of the DDL Manager
4.8  Internal Schema Representation
4.9  Schema Structure for ROBOT
4.10 Internal Rule Representation
4.11 System Catalog Files
4.12 Internal Representation of a Fact
4.13 The Updated Persons Fact
5.1  Dependency Graph of base-view
5.2  Query Evaluation Flowchart

Chapter 1

Introduction

Advanced database applications could require effective storage, efficient access and inference of large amounts of data with complex structures. However, such capabilities are not directly supported by existing database systems. In recent decades, the nested relational and complex value models have been developed to extend the applicability of the traditional relational model to more complex applications such as Computer-Aided Design (CAD), image processing and text retrieval. Extended relational algebra and calculus are provided for such kinds of models. It is shown in [3] that extended relational algebra with the powerset operator and safe extended relational calculus are also equivalent and can simulate iteration and express transitive closure. However, they do so in a very inefficient way. Computations of transitive closure using either framework require inherently exponential space, which means that they are not practical for real database applications. Another important direction of intense research has been in using a logic programming based language, Datalog [3, 6], as a database query language. Such a language provides a natural way to express queries on a relational database. Furthermore, by allowing recursion and negation, it is more expressive than the traditional relational algebra and calculus. In the past few years, there have been some efforts to combine these two approaches, mainly by extending Datalog with tuple and set constructors. The main merit of these extensions is that their natural use of a fixpoint construct allows us to express transitive closure declaratively in polynomial space and time [26], which makes them expressive enough while still practical. However, most of the research on such deductive databases remains at the theoretical level. A few implemented systems such as LDL [8] and CORAL [22] are just memory-based, which cannot deal with large volumes of data. Other systems such as Aditi [27] and LOLA [13] are disk-based and perform well for large applications, but fail to support complex values. The objective of the Relationlog system is to directly support the storage and inference of data with complex structures. There have been several different implementations of Relationlog since 1996. The first one is built on top of an Oracle/Ingres system by using an embedded SQL/C program. It turns Oracle/Ingres into a persistent deductive database system that supports data with complex values. However, the performance is not satisfactory, as nested relations have to be stored as flat relations and the program has no control over the query optimization in the database. The second is built on top of the EXODUS persistent storage manager [5]. Although EXODUS could be a good choice, it is no longer supported and does not work under the latest Solaris operating system, which is the only kind of operating system available for us to develop the system. Besides, its host language, E, does not provide a reasonable debugging environment, so it is very difficult to develop large systems like Relationlog. The implementation was finally abandoned. Finally, the Persistent Almost-Relational Object Database Manager, PARODY [25], has been chosen. It has the following advantages:

(1) persistent objects
(2) concise, fully-implemented C++ code
(3) reasonable performance for small and medium databases
(4) portability to multiple platforms

These are all essential for the purposes of the Relationlog research prototype. To make PARODY more suitable for the implementation of Relationlog, the original code has been modified to include meta information definitions and index management. Therefore, Relationlog now subsumes PARODY. Although PARODY performs poorly for large applications, it could be improved by some further research to make Relationlog suitable for future challenges. The Relationlog system was mainly developed on a SUN SPARCstation running Solaris. It contains 20,000 lines of C++ code. It has also been successfully ported to SGI, Linux, DEC and Windows NT/95. From the language point of view, the Relationlog system provides a declarative query language based on Relationlog [16, 18] and also a declarative data manipulation language based on DatalogU [17]. It also provides a powerful set of constructs for representing and manipulating both partial and complete information on sets, which were inspired by LDL. Moreover, the extended relational algebra operations, as defined in [1, 9, 23], can be represented in Relationlog directly, and more importantly, recursively in a way similar to Datalog. From an implementation perspective, there are many novel features in the Relationlog system:

schemas for both extensional and intensional relations

direct inference and access to embedded values in a nested or complex value relation as if the relation is normalized

standard user interface that is easy to use

domain types, schemas, facts, and rules are persistent

facts and rules needed for query evaluation are dynamically selected and results are temporarily stored for later queries

rules used for query evaluation are dynamically rewritten based on the query

intensional data can be materialized as opposed to being evaluated dynamically every time they are queried

a least recently used (LRU) mechanism is used to remove intensional and extensional data from memory when memory space is needed

various evaluation strategies are used automatically based on the nature of the query and the data in the database, without the user's intervention, so that there is no need for specifying query forms or modes as in Aditi, LDL and CORAL

The Relationlog system [24, 19] is designed, implemented and tested by the author. It can be used in business management, engineering, information retrieval and industry. It is also a useful tool for database course teaching and academic research. This thesis gives an introduction to the Relationlog database language, which includes a SQL-like Data Definition Language, Data Manipulation Language, and Query Language. It also discusses the implementation and performance of the Relationlog system. It is organized as follows. Chapter 2 presents the background materials that are relevant to the Relationlog system. Chapter 3 introduces the Relationlog database language. The core of the thesis lies in Chapter 4 and Chapter 5, where the design and implementation of the system are discussed as well as the novel evaluation strategies. Chapter 6 presents the experimental results of the Relationlog system and a comparison with related deductive database systems. Chapter 7 concludes the thesis, summarizes the contribution of this research, and explores future directions.

Chapter 2

Background

This chapter provides the background materials that are related to the Relationlog system. It introduces the relational model, complex value models, deductive databases and the implemented deductive systems relevant to the Relationlog system.

2.1 Relational Model

The relational model uses a collection of tables, called relations, to represent data. The table is named by a relation name. Each column of the table is called an attribute and has an attribute name. The collection of all attribute names in the table composes the schema of the relation. Each row is called a tuple and represents a collection of related data values. Data values are taken from user-defined sets called domains. They are indivisible and so are called atomic values. Such kinds of relations are also called 1NF (First Normal Form) relations. A relational database is a finite collection of relations.

Example 2.1.1 The following table illustrates a relation describing the relationship between departments, employees, and research areas. For instance, the first row (cpsc, bob, db) represents that employee bob is in the computer science (cpsc) department and doing research in the database (db) area.

Dept_Employee_Area
  Dept   Employee   Area
  cpsc   bob        db
  cpsc   joe        os
  math   sam        ...
  ...    ...        ...

A relation of the relational model is also called a flat relation. It has the following properties:

1. no two rows are identical;

2. row order is insignificant;

3. column order is significant;

Over the relational data model, there are two major query languages to manipulate data in a relational database: relational algebra and calculus. They are equivalent in computational power [3]. The relational algebra [12] is a collection of operators that deal with whole relations and yield new relations as a result. The major operators of the relational algebra are:

1. Cartesian Product: Given relations R1 and R2, the Cartesian product, denoted by R1 × R2, produces a relation that has the attributes of R1 and R2 and includes all possible combinations of tuples from R1 and R2.

2. Projection: Given a relation R and a set A of attribute names in R, the projection of R on A, denoted by πA(R), returns only the specified columns of the given relation, with duplicates removed.

3. Selection: Given a relation R and a collection P of conditions over the relation, the selection of R based on P, denoted by σP(R), selects only those tuples of the relation which satisfy the given conditions.

4. Union: Given two relations R1 and R2, the union of R1 and R2, denoted by R1 ∪ R2, produces a relation, with duplicates eliminated, that includes all the tuples in R1, in R2, or in both R1 and R2. R1 and R2 must be union compatible, which means they have the same schema definition.

5. Difference: Given two relations R1 and R2, the difference of R1 and R2, denoted by R1 − R2, produces a relation that includes the tuples in R1 but not in R2. R1 and R2 must be union compatible.

6. Intersection: Given two relations R1 and R2, the intersection of R1 and R2, denoted by R1 ∩ R2, produces a relation that includes all the tuples in both R1 and R2. R1 and R2 must be union compatible.

7. (Natural) Join: Given two relations R1 and R2, the natural join of R1 and R2, denoted by R1 ⋈ R2, is a relation formed by computing R1 × R2, selecting out all tuples whose values on each attribute common to R1 and R2 coincide, and projecting one occurrence of each of the common attributes.

As discussed in [26, 12], union, difference, Cartesian product, projection and selection are the primitive operators. Based on them, the other two operators, intersection and join, can be derived.
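As an illustration (these identities are standard and are not spelled out in the original text), the two derived operators can be expressed with the primitive ones as follows:

R1 ∩ R2 = R1 − (R1 − R2)
R1 ⋈ R2 = πL(σC(R1 × R2))

where C equates each pair of attributes common to R1 and R2, and L keeps one occurrence of each common attribute.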

2.2 Complex Value Models

Simplicity is the main advantage of the relational model. However, it becomes a severe limitation when dealing with many practical database applications. Consider Example 2.1.1: the Dept_Employee_Area relation suffers from the problem of data redundancy. Apparently, all the employees in the computer science department have the same attribute value cpsc stored separately. What if they could share the same space allocated for cpsc? This problem could be solved by splitting the table into a few smaller tables, each containing non-redundant information. However, the tables need to be joined together each time to get the desired results, which takes time. To overcome this kind of problem, nested relational and complex value models [20, 14, 2] have been proposed. The complex value models extend the relational model. The domains of complex value models are atomic values, (nested) tuples, or (nested) sets, while the relational model requires that all attributes have atomic values. That is, the value of a tuple on an attribute of a complex value relation can be an atomic value, a tuple or a set. A complex value relation without any sets of sets or tuples with tuple components is also called a nested relation [26].

Example 2.2.1 The following is a complex value relation that stores the same information as the relation in Example 2.1.1.

Dept_Employees
  Dept   Employees
  cpsc   {[bob, {db, ai}], [joe, {os, pl, db}]}
  math   {[sam, {...}], [tom, {...}]}

The relation contains two tuples with nested tuples and sets. For instance, the first one specifies the computer science (cpsc) department and all the employees working in the department along with their research areas. The value of the attribute Employees is a nested set, which takes a nested tuple as its element. The tuple also has a set-valued attribute Areas. Sets are used to store similar elements together so that the redundancy problem of the relational model is solved. For the complex value models, extended relational algebra and calculus [23] have been developed. It has been proven that the extended algebra, domain independent calculus, and safe-range calculus for complex values are equivalent [3]. The extended relational algebra extends those basic operations mentioned in the last section, and includes some new operators for manipulating sets and tuples, listed as follows:

Constructive operators set_create and tup_create. They are used for set creation and tuple creation.

Destructive operators set_destroy and tup_destroy. They are used to destroy sets and tuples.

Unnest and nest operators for nested relations. The transformation of a nested relation into 1NF is called unnesting. The reverse process of transforming a 1NF relation into a nested relation is called nesting.

The unnest and nest operators are derived from the destructive and constructive operators. Figure 2.1 shows 8 nested relations: P1, P2, P5, P6 and P7 are nested relations with the same schema; P4 is a flat relation; P8 and P3 are also nested relations sharing most common attributes, except that P8 has one more atomic-value attribute named A. The following operations show how the extended relational algebra operators function on the nested relations in Figure 2.1: P1 = nest(B, (C, D))(nest(C, D)(P4)), P4 = unnest(C, D)(unnest(B, (C, D))(P1)), P5 = P1 ∪ P2, P6 = P1 ∩ P2, P7 = P1 − P2, and P8 = P2 ⋈ P3. P1 can be generated by applying nesting on P4 as in the first operation of the above example, while P4 can be created by applying unnesting on P1 as in the second operation. The next three operations perform Union, Intersection and Difference on P1 and P2 in order to produce P5, P6 and P7. The last one shows that P8 is the product of a Natural Join between P2 and P3.

Figure 2.1: Nested relations defined with extended relational algebra operators

2.3 Deductive Databases

Relational database systems have many advantages such as ease of use, applications to a variety of problems, and suitability for parallel processing. However, the expressive power and functionality offered by a relational database language is limited compared with logic programming languages, as discussed in [6]. Deductive databases [4, 26] integrate the advantages of relational databases and logic programming techniques [26, 6] and thus have the capability to define rules, which can deduce or infer additional information (new facts) from the facts that are stored in the database. At first, Prolog [6], the most popular logic programming language, was used as a database language in most earlier deductive database systems at the end of the seventies. In the eighties, Datalog, a subset of Prolog, was designed for use as a standard database language to eliminate the drawbacks of Prolog such as tuple-at-a-time processing and order sensitivity [6]. Most deductive database systems, which are Datalog extensions, have been proposed and implemented since then. This section first introduces Datalog and then two query evaluation strategies for deductive database systems: bottom-up and top-down. The Datalog extensions will be discussed in the next section.

2.3.1 Datalog

Datalog [3, 6] is used to define rules declaratively in conjunction with an existing set of relations. Datalog programs are built from basic objects called atoms, which are predicates with unique names. An atom is of the form p(t1, ..., tn), where p is the predicate name and n is the number of arguments for predicate p. The arguments t1, ..., tn are called terms, which can be either constant values or variables. This chapter adopts the convention that all constants in a predicate either are numeric or are character strings starting with lowercase letters only, whereas variable names always start with an uppercase letter. A Datalog program is a set of clauses of the form A ← L1, ..., Ln, which means A can be derived from L1, ..., and Ln. A is an atom, called the head, while L1, ..., Ln are literals, called the body or conditions of the clause. A literal is either an atom, called a positive literal, or an atom preceded by not, called a negative literal. A clause (a literal) in which no variables appear is called a ground clause (ground literal). A ground clause with an empty set of conditions is a fact. A clause with a non-empty head and conditions is called a rule. A clause with an empty head, on the other hand, is called a goal or a query which the system tries to evaluate, while each condition (literal) in the body is called a subgoal. Note that an atom in Datalog denotes a relation. The predicate name of a fact in Datalog corresponds to a relation name, while the arguments pertain to the attribute values of a tuple in the relation. A ground clause (fact) corresponds to a tuple in a relation. A goal relates to a query on a database. The meaning of an attribute value in a tuple is determined solely by its position within the tuple. Rules are somewhat similar to relational views. They specify virtual (temporary) relations that are not actually stored but that can be formed from the relations by applying evaluation mechanisms based on the rule specifications. The main difference between rules and views is that rules may involve recursion and hence may yield views that cannot be defined in terms of standard relational views. Generally, a rule-defined relation is called an intensional relation, while a fact-defined relation is called an extensional relation or a base relation. The rules in Example 2.3.1 define the ancestor relation with the parent relation. Table 2.1 summarizes the correspondence between similar concepts in Datalog and in databases as discussed above.

Database Concepts      Datalog Concepts
Relation               Predicate
Attribute              Predicate argument
Tuple                  Ground clause (fact)
View                   Rule
Query                  Goal

Table 2.1: Similar concepts in databases and Datalog

Example 2.3.1 Consider the following rules:

ancestor(X, Y) ← parent(X, Y).
ancestor(X, Z) ← parent(X, Y), ancestor(Y, Z).

Tuples of the base (extensional) relation parent are:

parent(bob, tim)
parent(bob, ann)
parent(ann, pam)
parent(pam, tom)

Tuples of the intensional relation ancestor are derived by applying the rules:

ancestor(bob, tim)
ancestor(bob, ann)
ancestor(ann, pam)
ancestor(pam, tom)

The above four are generated from the first rule.

ancestor(bob, pam)
ancestor(bob, tom)
ancestor(ann, tom)

The last three are generated by recursively applying the second rule.

2.3.2 Bottom-Up Evaluation

Deductive database systems based on Datalog mainly employ bottom-up evaluation, starting with base relations to generate new facts by applying rules. As facts are generated, they are checked against the query predicate goal for a match. The systems involve three kinds of queries: queries on only fact-defined predicates, on non-recursive predicates and on recursive predicates. A rule is recursive if its head predicate is also used to define at least one of its body predicates in another rule. Otherwise, it is called non-recursive. The first two kinds are similar to the relational algebra queries. As for the third kind, a fixed point (fixpoint) of a Datalog program with respect to the involved extensional relations is a solution for these intensional relations that satisfies the rules in the program. It is possible that for some programs there are more than two solution sets. A solution S0 is called the least fixpoint if S contains S0 for any solution S of the program. For example, consider a query ancestor(ann, Y) on the program in Example 2.3.1. A bottom-up evaluation system first checks whether any of the existing facts directly matches the query. Since all the facts are for the parent predicate, no match is found, so the first rule is now applied to the existing facts to generate new facts, the first four facts for ancestor in Example 2.3.1. Only one match, pam, is found for Y. Then the system continues to generate new facts by applying the second rule (three times) until no more new facts are found. Thus a fixed point of the program is reached and the last three facts for ancestor shown in Example 2.3.1 are generated. From them, another match, tom, is found for Y. For evaluating a set of rules that may contain recursive rules, several strategies have been proposed. Two of them are of concern in this thesis:

Naive Strategy. The input is a set of rules with a set of base relations corresponding to the fact-defined predicates and another set of relations corresponding to the rule-defined predicates. The output is the least fixed point solution. The procedure uses permanent values in the extensional relations and current values in the intensional relations to compute new values for the intensional relations. The process is repeated until, at some point, none of the rule-defined predicates has any change of values. At this stage, the least fixed point solution is supposed to be reached.

Semi-Naive Strategy. The main problem with naive evaluation is that, at each iteration of evaluating the intensional relations, a redundant computation of the same tuples occurs. It is better to concentrate on the "incremental change" to the rule-defined predicates at each round and use these in computing additional tuples on the next round. The semi-naive strategy is an efficient strategy because of its approach of computing the "differential" of the tuples in each of the rule-defined predicates at each iteration (a sketch of this rewriting for the ancestor rules of Example 2.3.1 is given at the end of this subsection). Since negated literals may appear in the bodies of rules in programs, there may be several different fixed points. Only a stratified program with negation can achieve the least fixed point. A program is stratified if there is no recursion through negation. Programs in this class have a very intuitive semantics and can be efficiently evaluated. The following example P2 describes a stratified program.

ancestor(X, Y) :- parent(X, Y).                        (r1)
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).        (r2)
nocyc(X, Y) :- ancestor(X, Y), not ancestor(Y, X).     (r3)

Note that the third rule has a negative literal in its body. This program is stratified because the definition of the predicate nocyc depends (negatively) on the definition of ancestor, but the definition of ancestor does not depend on the definition of nocyc. A bottom-up evaluation of P2 would first compute a fixed point of rules r1 and r2 (the rules defining ancestor). Rule r3 is applied only when all the ancestor facts are known.
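As an illustration of the semi-naive idea (this rewriting is not part of the original presentation; the predicate names delta_ancestor and new_ancestor are chosen here only for exposition), the recursive ancestor rule of Example 2.3.1 can be evaluated with a "delta" relation holding only the tuples that were new in the previous round:

delta_ancestor(X, Y) :- parent(X, Y).                      % round 0: the exit rule
new_ancestor(X, Z) :- parent(X, Y), delta_ancestor(Y, Z).  % each later round
% delta_ancestor for the next round is new_ancestor minus the ancestor tuples
% already derived; the iteration stops when delta_ancestor becomes empty.

On the parent facts above, the first recursive round produces only ancestor(bob, pam) and ancestor(ann, tom), and the second round produces only ancestor(bob, tom), instead of recomputing all previously derived tuples as the naive strategy would.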

2.3.3 Top-Down Evaluation

Besides bottom-up evaluation strategies, many systems also use top-down evaluation. Top-down evaluation starts with the query predicate goal and attempts to find matches to the variables that lead to valid facts in the database. In this approach, facts are not explicitly generated, as they are in bottom-up evaluation. For example, consider the query ancestor(ann, Y) on the program in Example 2.3.1. The system first searches for any facts with the ancestor predicate whose first argument matches ann. Since there are no such facts, the system then locates the first rule whose head has the same predicate name as the query, leading to the rule: ancestor(X, Y) ← parent(X, Y). The inference mechanism then matches X to ann, leading to the rule: ancestor(ann, Y) ← parent(ann, Y). The variable X is now said to be bound to the value ann. The system proceeds to substitute ancestor(ann, Y) with parent(ann, Y), and it searches for facts that match parent(ann, Y) to find an answer for Y. At this point, the search using the first rule is exhausted, so the system searches for the next rule, which leads to the recursive rule. The inference mechanism then binds X to ann, resulting in the modified rule: ancestor(ann, Z) ← parent(ann, Y), ancestor(Y, Z). It then starts searching for facts that satisfy both subgoals of the body. Thus, the system finds pam for Y and then the last answer to the query, which is tom. The whole process is shown in Figure 2.2.


Figure 2.2: Top-Down Evaluation of a Query

2.4 Implemented Deductive Database Systems

Datalog has been extended in the past several years by incorporating tuple and/or set constructors. This section briefly describes several implemented deductive database systems: LDL, CORAL, Aditi, and Glue-Nail.

2.4.1 The LDL System

The Logical Data Language (LDL) [8] system was started in 1984 at Microelectronics and Computer Technology Corporation (MCC) with two primary objectives:

To develop a system that extends the relational model, yet exploits some of the desirable features of a DBMS (database management system).

To enhance the functionality of a DBMS so that it works as a deductive DBMS and also supports the development of general-purpose applications.

The resulting system is now a commercial deductive DBMS. Based on Datalog, LDL provides a general-purpose declarative logic language. It differs from the Datalog style of programming in the following ways:

Rules are compiled in LDL.

There is a notion of a schema for the fact base in LDL before compile time. The fact base is freely updated at run-time.

The LDL execution model is simpler, based on the operation of matching and the computation of "least fixed points". These operators, in turn, use simple extensions to the relational algebra.

LDL can indirectly support tuples by using functors and directly support sets. We can represent the Dept_Employees relation in Example 2.2.1 in LDL by using the following two facts:

dept_employees(cpsc, {empl(bob, {db, ai}), empl(joe, {os, pl, db})})
dept_employees(math, {empl(sam, {gt, n, si}), empl(tom, {cv})})

The functor empl is introduced to represent the tuples of the set Employees. To support access to elements in a set, LDL allows the use of the member predicate (∈). To support the construction of sets with rules, LDL provides two powerful mechanisms: set enumeration and set grouping. With set enumeration, a set is constructed by listing all of its elements. With set grouping, a set is constructed by defining its elements with a property that they must satisfy. The following are several examples.

Example 2.4.1 Consider the relation parentof represented as facts in LDL as follows:

parentof(bob, pam)
parentof(bob, tom)

The following set grouping rule in LDL groups all parents of a person into a set and obtains parentsof(bob, {pam, tom}), where <Y> is a set grouping term:

parentsof(X, <Y>) :- parentof(X, Y)

To access deeply nested data and to nest/unnest relations in LDL, we have to proceed procedurally by introducing intermediate variables and/or relations.

Example 2.4.2 The nested relation Dept_Employees in Example 2.2.1 can be obtained from the normalized relation Dept_Employee_Area in Example 2.1.1 by using the nest operation of extended relational algebra defined in [23] twice.

In order to obtain Dept_Employees from Dept_Employee_Area in LDL, we have to use grouping several times and introduce several intermediate relations as follows:

r1(D, E, <A>) :- dept_employee_area(D, E, A)
r2(D, empl(E, As)) :- r1(D, E, As)
dept_employees(D, <Es>) :- r2(D, Es)

It is easy to view the LDL execution as a bottom-up computation using relational algebra. For instance, let p(...) be the query with the following rule, where p1 and p2 are either extensional or intensional predicates (relations):

p(X, Y) :- p1(X, Z), p2(Z, Y)

This query can be answered by first computing the relations representing p1 and p2 and then computing their join, followed by a projection.
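In the algebra notation of Section 2.1, this evaluation amounts to the following (an illustrative formulation, not taken from the original text), where the join is on the shared argument Z:

p = π{X, Y}(p1 ⋈ p2)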

2.4.2 The CORAL System

The CORAL system [22], which was developed at the University of Wisconsin at Madison, builds on experience gained from the LDL project. Like LDL, the system provides a declarative language based on Datalog while supporting sets and functors. There are many important differences, however, in both the language and its implementation. From a language standpoint, CORAL adapts LDL's set-grouping constructs. It also supports a class of programs with negation and grouping that is strictly larger than the class of stratified programs. It is closer to Prolog than to LDL in supporting tuples with variables; thus, the tuple equal(X, X) can be stored in the database and denotes that every binary tuple in which the first and the second field values are the same is in the relation called equal. From an implementation perspective, CORAL implements several optimizations to deal with non-ground tuples efficiently, in addition to techniques such as magic templates and special optimizations of different kinds of linear programs. It also provides an efficient way to compute non-stratified queries. In addition, it has a good interface with C++ and is extensible, enabling a user to customize the system for special applications by adding new data types or relation implementations, for example. CORAL's main evaluation techniques are based on bottom-up evaluation. However, CORAL also provides a top-down evaluation mode. Since many different semantics and evaluation methods are supported, CORAL provides a module mechanism for organizing programs. Each module exports a query predicate and can be considered to be simply a definition of this predicate. A module contains one or more rules defining the exported predicate and possibly some local predicates. The semantics and evaluation of a module can be controlled by adding per-module annotations, and each module's controls are completely independent of those for other modules. This makes it possible to mix top-down evaluation in one module with bottom-up evaluation in another module.

2.4.3 Aditi

Aditi [27] was developed at the University of Melbourne. It is a disk-based deductive database system that can store both large extensional and intensional relations on disk. It supports multiple users and exploits parallelism in the underlying machine. Moreover, it uniquely provides many efficient evaluation mechanisms, such as magic sets, context transformations for linear rules, parallel evaluation and top-down tuple-at-a-time evaluation. Users can specify several flags for one predicate to state the preferable evaluation mechanisms. Like most other deductive database systems, Aditi supports only stratified forms of negation and has a Prolog interface. Unlike the other deductive database systems mentioned above, Aditi provides system performance analysis based on a real flight database. However, Aditi does not support complex values such as nested tuples and sets. That is the reason why it can have many evaluation strategies.

2.4.4 Glue-Nail

Glue-Nail [11] is originally from Stanford. It is based on the earlier NAIL [10] system. NAIL is a conventional logic programming language, extended with higher order syntax based on HiLog [7]. Its evaluation algorithm supports well-founded negation, including general negation in the absence of function symbols. Glue is a new imperative language with primitives that operate on relations, and it is intended to be used directly by application programmers to write efficient database algorithms and to perform updates and I/O. (Other deductive database systems use existing languages such as C++ and Prolog to perform tasks that require side-effects, and their designers aim to automatically compile problem descriptions into code executing efficient algorithms.) Glue-Nail is a self-contained and purely memory-based system: it performs disk I/O only to load relations into memory at startup and to save any modified relations at shutdown. Thus it cannot be used for large applications.

Chapter 3

Relationlog Database Language

Datalog [6, 26] is a deductive language for flat relations. Relationlog [16, 18] is a deductive language for nested relations and complex values. Relationlog stands in the same relationship to the nested relational and complex value models as Datalog stands to the relational model. However, Relationlog is typed, unlike Datalog [18]. This chapter introduces the Relationlog Data Definition Language (DDL), Data Manipulation Language (DML), and Query Language (QL).

3.1 Data Definition Language

The data definition language of Relationlog allows the specification of domain types, relation schemas, views and rules. It is similar to the SQL data definition language, which is used in ORACLE, Sybase, Ingres and so on. As a result, users familiar with those widely-used DBMSs can get used to this system easily.

3.1.1 Domain Types

The primitive domain types supported are as follows:

(1) string(n) for character strings with fixed length n.
(2) string, which is an alias for string(16).
(3) token(n) for alphanumeric strings without space, hyphen and quotation characters, with maximum n characters.
(4) token, which is an alias for token(16).
(5) integer(1) for 1-byte integers.
(6) integer(2) for 2-byte integers.
(7) integer(4) for 4-byte integers.
(8) integer(8) for 8-byte integers.
(9) integer, which is an alias for integer(4).
(10) float(4) for 4-byte real numbers.
(11) float(8) for 8-byte real numbers.
(12) float, which is an alias for float(8).

The following are some examples of acceptable and unacceptable values of these primitive domain types (see Appendix A):

string:  'BobSam', "Bob's Wife", 'Bob", "abc"def"g", 'Bob'swife'
token:   BobSam, Bob Sam, bob-Sam, Bob's-Wife, 2sons, Bob
integer: 5, 8, 120, -20, 1.5, 1e4, one, "two"
float:   1, 2.0, 3.14, 1.2e4, one, 'two'

In addition to the primitive domain types, the Relationlog system also supports tuple, set, list, and bag types that can be defined using the primitive domain types. The following are examples of these types.

Tuple types: [Last : string, First : string], [City : string, Country : string]
Set types:   {token}, {[Last : string, First : string]}
List types:  |integer|, |[Last : string, First : string]|
Bag types:   *integer*, *[Last : string, First : string]*

These non-primitive types must be created by the user using the create domain or create type command. Note that the bag and list types are currently treated the same as set types in the Relationlog system, with different notations. Figure 3.1 shows how to define new domain types in Relationlog.

create domain NameType string(10)
create domain FullName [ Last : NameType, First : NameType ]
create domain Date [ Year : integer, Month : integer, Day : integer ]
create domain Incomes |integer|
create domain PersonList |FullName|
create domain Location [ City : string, Country : string ]
create domain Paper [ Author : NameType, Title : string(40), FirstPage : integer, LastPage : integer ]

Figure 3.1: Sample Domain Definitions

Note that as a complex value based language, Relationlog does not allow circular type definitions such as

create domain Person [ Name : FullName, BirthDate : Date, Wife : Person ]

User-defined types can be deleted using the drop domain or drop type command. For example, the following command can be used to delete the user-defined type FullName:

drop domain FullName

However, if a user-defined type is used in the definitions of other types or relation schemas, the deletion of the type is not allowed. The modification of types is not allowed in the Relationlog system, as the type definitions are used to allocate storage space for relations.

3.1.2 Schema Definition

Domain types are used to define schemas of relations. The Relationlog system supports two kinds of relations: base relations (or extensional relations) and views (or intensional relations). Base relations are stored persistently on disk. Views are defined using rules based on the base relations and other views. The schema for both kinds of relations must be defined explicitly using the create relation or create table command. Figure 3.2 shows several relation schema definitions. As in traditional relational database systems, Relationlog supports the definition of keys. However, keys must be defined on atomic attributes. The index mechanism is tightly integrated with the creation and deletion of tables. When a table is created or dropped, the indices for it are created or dropped. Relationlog can manipulate the index mechanism automatically if the user does not handle it. That means the Relationlog system checks the schema definition to create a B-tree index on user-preferred attributes, which exist in the schema, or on a default attribute, which is generally the first attribute defined in sequence. For the last two examples, the user explicitly defines the attributes JID, Volume and Number as the keys of the relation. If the user did not provide any key, JID would be selected as the default primary key. A newly created relation is empty at first. The user has to use the insert command, to be discussed in Section 3.3, to load data into the base relations, or use rules to derive data in the views. Figure 3.3 gives three relations Persons, Conferences and Journals which will be used as the running example. Note that the current implementation of Relationlog does not support the storage of partial tuples (that is, null-values) and partial sets introduced in [18]. Instead, sets in the relations must be complete.

create relation Persons ( Name : NameType, BirthDate : Date, Parents : {NameType}, LivesIn : Location, Earnings : Incomes ) key = (Name)
create relation Parentsof ( Name : NameType, Parents : {NameType} ) key = (Name)
create relation Ancestorsof ( Name : NameType, Ancs : {NameType} ) key = (Name)
create relation Conferences ( CID : token, Year : integer, Location : Location, Papers : {Paper} ) key = (CID, Year)
create relation Journals ( JID : token, Volume : integer, Number : integer, Papers : {Paper} ) key = (JID, Volume, Number)
create relation Publications ( Author : Name, Papers : {[Title : string(40), ID : token, FirstPage : integer, LastPage : integer]} ) key = (Name)

Figure 3.2: Sample Schema Definitions

Conferences
  CID   Year   Location           Papers
  ...   ...    [Athens, Greece]   { [Pam, Ingres, 11, 20], [Bob, Sybase, 21, 30], [Ann, Datalog, 21, 30], ... }
  ...   ...    ...                { [Bob, NF2, 16, 30], ... }

Journals
  JID    Volume   Number   Papers
  JLP    1        1        { [Tom, Logic, 1, 10], [Bob, HornClause, 11, 20] }
  TODS   1        1        { [Tom, Relation, 1, 10], [Pam, Calculus, 11, 20], [Jim, Algebra, 21, 30] }

Persons
  Name   BirthDate   Parents   LivesIn   Earnings
  ...    ...         ...       ...       ...

Figure 3.3: Sample Relations

Relationlog is suitable for more complex relations with nested tuples and sets. Consider the ROBOT relation schema in Figure 3.4.

create relation ROBOT (
    ID : string(20),
    ARMS : {[ ID : string(30),
              AXES : {[ KINEMATICS : [ DH_MATRIX : {[ COLUMN : integer, VECTOR : {float} ]},
                                       JOINT_ANGLE : [ MAX : integer, MIN : integer ] ],
                        DYNAMICS : [ MASS : float, ACCEL : float ] ]} ]},
    GRIPPERS : {[ ID : string(20), FUNCTION : string(20) ]} )

Figure 3.4: ROBOT Relation Schema

This relation is used to represent the parts of real robots. It has three main attributes: ID (string), ARMS (nested set), and GRIPPERS (nested set). GRIPPERS has a tuple type, which has two attributes of string type: ID and FUNCTION. ARMS has a tuple type inside, which contains two attributes: ID (string) and AXES (nested set). AXES also has a tuple type inside, which has two tuple-type attributes: KINEMATICS (nested tuple) and DYNAMICS (tuple). DYNAMICS has two attributes of float type: MASS and ACCEL. KINEMATICS contains a set type DH_MATRIX and a tuple type JOINT_ANGLE, which has two attributes of integer type: MAX and MIN. DH_MATRIX has a tuple inside, which contains an integer type (COLUMN) and a set of float type (VECTOR). A subset of the ROBOT relation is shown in Figure 3.5.

Figure 3.5: ROBOT relation

To remove a relation from a Relationlog database, the users can use the drop relation (or drop table) and delete relation (or delete table) commands. The former deletes all information about the dropped relation from the database, while the latter deletes all tuples in the relation but retains the schema of the relation. To modify the schema of a relation in a Relationlog database, such as to insert, delete or modify the attributes in a relation schema, the users can use the alter relation or alter table command. Consider the following example:

alter relation Persons add Spouse : NameType
alter relation Persons drop Spouse
alter relation Persons modify Parents : |NameType|

The first command says add an attribute Spouse and its corresponding type to the relation Persons. If the relation does not exist or the attribute has already been defined in the relation, this operation will fail. Besides, if the relation already has tuples in it, this operation will also fail, as it would result in null-values in the existing tuples, which is not allowed in the current implementation of Relationlog. Otherwise, the attribute is appended to the end of the existing attributes in the Persons relation. The second command says delete the attribute Spouse from the relation schema Persons and the corresponding attribute values from all the tuples. The last command says modify the type of the attribute Parents of the relation Persons to the list type |NameType|. The operation will fail if the attribute does not exist or there are tuples in the relation which violate the new type constraints.

3.1.3 Rules and Views

Based on the database, deductive information can be defined by using rules in Relationlog. A rule is of a form similar to that of Datalog:

A :- L1, ..., Ln.

Every Li (1 ≤ i ≤ n) in the body is either a positive literal with variables, a negative literal which consists of a negation keyword (not) and a positive literal, or an arithmetic or set-theoretic comparison literal over variables and constants. The head A is a positive literal. A rule can be used to deduce attribute values for existing objects, or to describe how to construct objects and obtain their attribute values. A rule has to be safe, which means every variable that appears in the head A also appears in the body in a non-negated literal. Relationlog does not follow the Prolog/Datalog convention which treats words starting with upper case letters as variables and words starting with lower case letters as constants. Instead, all variables in Relationlog must start with _ and _ itself can be used as an anonymous variable. In this way, words starting with upper case letters can be treated as token type constants. Besides, Relationlog directly supports arithmetic comparison expressions such as _X = _Y + 10 and _X ≥ _Y / 2, and set-theoretic comparison expressions such as _X ∈ _Y, _S = _S1 ∪ _S2, and _S1 ∩ _S2. Views are defined using rules based on base relations and other views. There can be two kinds of views in the Relationlog system: materialized (stored) and non-materialized. Materialized views are stored persistently on disk as base relations and are maintained current, while non-materialized views are evaluated when they are queried. Materialized views are created using the command create stored view while non-materialized views are created using the command create view. Unlike traditional relational languages such as SQL, Relationlog supports recursively defined views. For instance, the following commands define three views: the non-materialized views Parentsof and Ancestorsof and the materialized view Publications, based on the base relations Persons, Conferences and Journals defined earlier:

create view Parentsof as
    Parentsof(_Name, _Parents) :- Persons(_Name, _Age, _Parents, _Address)

create view Ancestorsof as
    Ancestorsof(_Name, <_Ancestor>) :- Parentsof(_Name, <_Ancestor>);
    Ancestorsof(_Name, <_Ancestor>) :- Parentsof(_Name, <_Parent>), Ancestorsof(_Parent, <_Ancestor>)

create stored view Publications as
    Publications(_Name, <[_Title, _ID, _FPage, _LPage]>) :- Conferences(_ID, _Loc, <[_Name, _Title, _FPage, _LPage]>);
    Publications(_Name, <[_Title, _ID, _FPage, _LPage]>) :- Journals(_ID, _Vol, _Num, <[_Name, _Title, _FPage, _LPage]>)

where _Name, _Parents, _Age, _Address, _Parent, _Ancestor, _Title, _ID, _Loc, _FPage, _LPage, _Vol, and _Num are logical variables. In particular, _Parents in the first command is a set-valued variable which ranges over the set of parents of a person tuple. The term <_Ancestor> in the second view definition is called a partial set term and is used for different purposes. For the one in the head of the rules, it is used to group every ancestor of a specific person into a set. For the one in the body of the rules, it denotes an element of the set in the matching tuple. Similarly, the partial set term <[_Title, _ID, _FPage, _LPage]> in the head of the third view definition is used for grouping, while the partial set term <[_Name, _Title, _FPage, _LPage]> in the body of the rules is used to denote a tuple in the set of the matching tuple. These commands will fail if the schemas for the relations are not defined or the rules in the view definitions are not well-typed with respect to their relation schemas. Note that the view Ancestorsof is recursively defined.

Remark: The schema and view definitions could have been combined together into the view definition. As views/rules may be dependent on each other, however, there may be views that have rules that are not well-typed. Relationlog requires all rules in a database to be stratified. The stratification of rules is automatically checked every time a new view is created. How to stratify the rules of a database will be discussed in Section 3.4. With rules/views, Relationlog directly supports the extended relational algebra operators. Consider the following relational schemas for the nested relations given in Figure 2.1:

P1(A : string, A1 : {[B : string, B1 : {[C : string, D : string]}]})
P2(A : string, A1 : {[B : string, B1 : {[C : string, D : string]}]})
P3(E : string, B : string, B1 : {[C : string, D : string]})
P4(A : string, B : string, C : string, D : string)
P5(A : string, A1 : {[B : string, B1 : {[C : string, D : string]}]})
P6(A : string, A1 : {[B : string, B1 : {[C : string, D : string]}]})
P7(E : string, B : string, B1 : {[C : string, D : string]})
P8(A : string, E : string, B : string, B1 : {[C : string, D : string]})

As discussed earlier, the following example shows how the extended relational algebra operators function on nested relations: P1 = nest(B, (C, D))(nest(C, D)(P4)), P4 = unnest(C, D)(unnest(B, (C, D))(P1)), P5 = P1 ∪ P2, P6 = P1 ∩ P2, P7 = P1 − P2, and P8 = P2 ⋈ P3. In Relationlog, these relational algebra operations can be represented directly using rules as follows:
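As an illustration of the style only (a sketch under the partial set term syntax introduced above, not necessarily the exact rules intended here; the variable _E ranges over an entire element tuple of a nested set), the union, intersection and difference cases can be written as:

P5(_A, <_E>) :- P1(_A, <_E>)
P5(_A, <_E>) :- P2(_A, <_E>)
P6(_A, <_E>) :- P1(_A, <_E>), P2(_A, <_E>)
P7(_A, <_E>) :- P1(_A, <_E>), not P2(_A, <_E>)

Nest and unnest can be written in the same style, using a partial set term in the head to group the (C, D) pairs for each B and a partial set term in the body to unnest them.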

Note that in the above rules a variable _E is used to denote a tuple. Indeed, variables can be used to represent not only atomic values but also nested sets and tuples. The materialized and non-materialized views can be dropped using the drop view command. However, if a view is used by other views, then the view cannot be dropped. For example, the following command will fail, as the view Parentsof is used in the view Ancestorsof.

drop view Parentsof

If the view to be dropped is non-materialized, then the rules used to define the view will be deleted. If the view to be dropped is materialized, then both the rules that define the view and the derived relation will be deleted.

3.2 Query Language

Relationlog directly supports all Datalog queries and rules. It also allows direct access to and inference of deeply nested data by providing two powerful set terms: partial set terms of the form <O1, ..., On> and complete set terms of the form {O1, ..., On}, together with tuple terms of the form [O1, ..., On] or [A1 : O1, ..., An : On], where O1, ..., On are variables, constants, set or tuple terms and A1, ..., An are attribute names. The query language of Relationlog allows the user to directly query base relations and materialized or non-materialized views with the query commands. A query is an expression of the following form:

query L1, ..., Ln

where L1, ..., Ln are a sequence of positive literals, negated object literals, and the usual arithmetic and set-theoretic comparison literals. A positive literal in a query is organized as:

p(E1, ..., En)

where E1, ..., En are combinations of variables, constants, set and tuple terms. These are very similar to rules. A negative literal in a query consists of the keyword not and a positive literal. The arithmetic and set-theoretic comparison literals are the same as the ones used in rules. The Relationlog system will always give all answers at once. Consider the following query in Relationlog:

query Persons(_Name, [_Year, _, _], _Parents, _, _), _Year < 1965

This query says list the name, birth year, and parents of every person who was born before 1965. In the current implementation of Relationlog, the attribute values in a tuple have a left-to-right order. The anonymous variable _ in a tuple expression can be omitted based on this order, abbreviating the query into the following equivalent one:

query Persons(_Name, [_Year], _Parents), _Year < 1965

The results of this query based on the relation shown in Figure 3.3 will be displayed as a set of bindings as follows:

_Name = Name : "Tom", _Year = Year : 1913, _Parents = Parents : {}
_Name = Name : "Bob", _Year = Year : 1933, _Parents = Parents : {}
_Name = Name : "Pam", _Year = Year : 1956, _Parents = Parents : {"Bob"}
_Name = Name : "Joe", _Year = Year : 1962, _Parents = Parents : {"Pat"}

The following example shows how to find two siblings who live in the same city.

query Persons(_Name1, _, _Parents, _Location), Persons(_Name2, _, _Parents, _Location), _Name1 != _Name2

where _Location is a tuple variable which ranges over a nested tuple in the matching tuple. The views in Relationlog can be queried in the same way as base relations. The following are several examples:

1. query Parentsof(_X, _X1)

2. query Parentsof("Bob", <_X1>), not Parentsof("Jim", <_X1>)

3. query Parentsof(_X, <"Tom">), _X != "Bob"

4. query Publications("Bob", <[_Title]>)

5. query Publications(_Name, <["Logic"]>), Persons(_Name, _Age, _, [_City, _Country])

The first query asks for every fact of the relation Parentsof. The terms of the form <...> in these queries stand for partial sets. The second query asks for the parents of Bob that are not parents of Jim. The third one asks for the person who is not Bob but has Tom as one of his parents. The fourth query says find the title of every paper that Bob wrote. The last query says find the name, age, and location of the author who wrote the paper Logic. Note that if a view is materialized, then a query on it will be answered directly by retrieving data from the corresponding disk file. If it is not materialized, then the Relationlog system will first evaluate all relevant rules using proper strategies to find answers to the query. The results will be stored temporarily for later queries until the database is no longer active. Displaying the results of a query as a set of bindings may not be the way the user likes. Besides, the results of a previous query may be useful for later queries. Therefore, Relationlog also allows a query to be represented as a rule whose body specifies what is to be queried and whose head specifies how to format the query results. Thus the following query is also allowed in Relationlog:

query H :- L1, ..., Ln

Here H indicates a temporary relation, the attributes of which can either be implicitly derived from the rule body or explicitly declared by the user. In the first case, the head is just used to format the output. In the second case, the head relation is declared before the query is issued, and the result of the query is stored temporarily in the database until the relation is deleted. Consider the following two queries:

query Children(_Parent, <[_Child, _Year]>) :- Persons(_Child, [_Year], <_Parent>)

query Children2(_Parent, _Child, _Year) :- Children(_Parent, <[_Child, _Year]>)

The first query will display the results based on the relation shown in Figure 3.3 as a nested relation as follows:

(Joe, {[Sam, 1983]})
(Bob, {[Pam, 1956], [Joe, 1962]})

The second query will display the results of the first query as a flat relation:

(Joe, Sam, 1983)
(Bob, Pam, 1956)
(Bob, Joe, 1962)

3.3 Data Manipulation Language

Data manipulation in the Relationlog system involves extensional data updates and intensional data updates. The user can insert into, delete from and update base relations using two elementary data manipulation language commands, insert and delete, with the following forms:

insert L
delete L1, ..., Ln

where L, L1, ..., Ln are update literals of the same form as the literals in rules in Section 3.1.3. Since update commands are tightly integrated with queries in Relationlog, the following DML command handles more complex updates:

query L1, ..., Ln, delete Ed, insert Ei

where L1, ..., Ln are query literals, Ed is a positive literal for deletion, and Ei is a positive literal for insertion. Note that the variables in Ed and Ei must appear in the preceding query literals or the command will fail. In some cases, the deletion part can be omitted. An update command in Relationlog is treated as a transaction which can either succeed or fail. If it fails, it has no effect on the database at all. Based on the update expressions that can be used in update commands, the user can insert new facts or delete parts of existing facts using elementary or complex DML commands. For example, the relation Persons shown in Figure 3.2 can be populated with the following commands:

insert Persons("Tom", [1913,11,20], {}, ["Toronto", "Canada"], {5000})
insert Persons("Sam", [1983,5,30], {"Jim"}, ["Calgary", "Canada"], {1500})
insert Persons("Bob", [1933,4,1], {}, ["LosAngeles", "USA"], {200,2500})
insert Persons("Pam", [1956,7,8], {"Bob"}, ["Chicago", "USA"], {900,900})
insert Persons("Joe", [1962,1,24], {"Jim"}, ["NewYork", "USA"], {3000})

The tuple to be inserted will first be checked against the corresponding schema definition. Only well-typed tuples can be inserted successfully. Note that the attribute values in a tuple have a left-to-right order. Besides, a null value is not allowed at all in the current Relationlog system since it has no clear semantics (see Appendix A). The following examples show how to delete tuples from the database.

delete Persons("Tom", _, _, _, _)
delete Persons("Tom")
delete Persons(_, [_Year]), _Year < 1965

The first command says delete the tuple with the name Tom from the Persons relation. If the tuple does not exist, the operation will fail. This command can be abbreviated to the second one, in which the anonymous arguments are omitted. The third command says delete every person who was born before 1965. Note that this command implies a query, i.e., find all persons who were born before 1965 and then delete them. The following examples show how to update atomic values in tuples in Relationlog relations.

delete Persons(_X, _Y, _Z, ["Toronto", _W]), insert Persons(_X, _Y, _Z, ["Vancouver", _W])
query Persons("Sam", _, <_SamsParent>), delete Persons(_SamsParent, _X, _Y), insert Persons(_SamsParent, _X, _Y, ["Vancouver", "Canada"])

The first command says transfer every person from Toronto to Vancouver. The second command says transfer Sam's parents to Vancouver, Canada. Note that there is an explicit query command in the second update command. Indeed, complex updates can be performed in Relationlog by using complex queries in the update commands. The following examples show how to update elements in sets:

insert Persons("Pam", _, <"Bob">)
delete Persons("Joe", _, <"Pat">)
delete Parentsof("Bob", <_X>), not Parentsof("Jim", <_X>), insert Parentsof("Philip", <_X>)

The first command says insert Bob into Pam's parents set. The operation will fail if the tuple for Pam is not in the relation or Bob is already a parent of Pam. The second command says delete Pat from Joe's parents set rather than the whole tuple. This operation will fail if Pat is not in Joe's parents set. The last one is a combined update. It says transfer every parent of Bob who is not a parent of Jim to the parents of Philip. Updates in the Relationlog system have a declarative semantics which is a straightforward extension of the semantics presented in [17]. Note that complex update commands may explicitly (query) or implicitly (delete) involve queries. Such an update command is performed in Relationlog in two steps. First, the query (implicit or explicit) is evaluated. Then, based on the instantiation of the variables, the update is performed. Updating base relations may invalidate materialized views and the results of previous queries. Relationlog will delete the invalidated results of previous queries and update the materialized views.

3.4 Relationlog Programs

In order to simplify the creation of database relations, the Relationlog system allows the user to put all the information necessary for the database creation into a Relationlog program, which consists of four parts: types, schema, facts and rules. The types part contains the domain type definitions. The schema part contains the relation schemas for both extensional and intensional relations. The facts part contains tuples in base relations. The rules part contains rules used to define views. If there are no domain types involved, the types part can be omitted. For example, Figure 3.6 shows a Relationlog program, named family, which is part of the sample database shown in Section 3.1 with one more TrueAncsof relation added.

Types
NameType = string(10)
FullName = [Last : NameType, First : NameType]
Date = [Year : integer, Month : integer, Day : integer]
Incomes = {integer}
Schema
Persons(Name : NameType, BirthDate : Date, Parents : {string}, LivesIn : Location, Earnings : Incomes) key = (Name)
Parentsof(Name : NameType, Parents : {NameType}) key = (Name)
Ancestorsof(Name : NameType, Ancs : {NameType}) key = (Name)
TrueAncsof(Name : NameType, TrueAncs : {NameType}) key = (Name)
Facts
Persons("Tom", [1913,11,20], {}, ["Toronto", "Canada"], {5000})
Persons("Sam", [1983,5,30], {"Joe"}, ["Calgary", "Canada"], {1500})
Persons("Bob", [1933,4,1], {}, ["LosAngeles", "USA"], {200,2500})
Persons("Pam", [1956,7,8], {"Bob"}, ["Chicago", "USA"], {900,900})
Persons("Joe", [1962,1,24], {"Bob"}, ["NewYork", "USA"], {3000})
Rules
Parentsof(_Name, _Parents) :- Persons(_Name, _Age, _Parents, _Address)
Ancestorsof(_Name, <_Ancestor>) :- Parentsof(_Name, <_Ancestor>)
Ancestorsof(_Name, <_Ancestor>) :- Parentsof(_Name, <_Parent>), Ancestorsof(_Parent, <_Ancestor>)
TrueAncsof(_Name, <_Ancestor>) :- Ancestorsof(_Name, <_Ancestor>), not Parentsof(_Name, <_Ancestor>)

Figure 3.6: Sample Relationlog Program family

All Relationlog programs have to be well-typed, which means that each literal of a rule related to a relation satisfies the corresponding schema definition and so does every fact.

Example 3.4.1 Consider the following database:

Schema
p(A : integer, B : {string})
q(A : string, C : integer)

The rule is well-typed with respect to the schema but the fact is not. Therefore the program is not well-typed.

Stratification

The rules in a Relationlog program are required to be stratified. Stratification is automatically determined by using a dependency graph, which is a marked graph constructed as follows:

1. the set of nodes consists of all predicates appearing in the schema of the database

2. there is an edge from x to y if there is a rule in which x is in the head and y is in the body. The edge is marked if y is in a negated relation expression (literal) or y contains complete set terms such that there exists a rule in which one of the complete set terms is in its head.

The rules in a database are stratified if the dependency graph has no cycle with a marked edge. Since a given program has only a finite number of predicate symbols, it can be statically determined whether a program is stratified or not. Consider the family program in Figure 3.6. Its dependency graph is shown in Figure 3.7. There is no cycle with marked edges (marked edges are labeled in the graph) in the graph. Therefore, the family program is stratified. The graph also shows that the rules can be divided into two strata: the view TrueAncsof composes the second stratum; the other views constitute the first stratum. The views in the lowest (first) stratum have to be evaluated first, then the second, and so on.


Figure 3.7: Dependency Graph for the family Program
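To make the construction concrete, the following Python sketch (illustrative only; the data layout and names are not those of the Relationlog source) builds the marked dependency graph from rules given as (head, body) pairs and tests for a cycle containing a marked edge, using the family rules above as input:

# Each rule is (head_pred, [(body_pred, marked), ...]); "marked" is True
# when the body predicate occurs negated or inside a complete set term.

def is_stratified(rules):
    edges = {}                                  # x -> {y: marked}
    for head, body in rules:
        edges.setdefault(head, {})
        for pred, marked in body:
            edges[head][pred] = edges[head].get(pred, False) or marked

    def has_marked_cycle(start):
        # depth-first search for a cycle through `start` with a marked edge
        stack, visited = [(start, False)], set()
        while stack:
            node, seen_marked = stack.pop()
            for nxt, marked in edges.get(node, {}).items():
                m = seen_marked or marked
                if nxt == start and m:
                    return True
                if (nxt, m) not in visited:
                    visited.add((nxt, m))
                    stack.append((nxt, m))
        return False

    return not any(has_marked_cycle(p) for p in edges)

# The family program: no marked edge lies on a cycle, so it is stratified.
family = [
    ("Parentsof",   [("Persons", False)]),
    ("Ancestorsof", [("Parentsof", False)]),
    ("Ancestorsof", [("Parentsof", False), ("Ancestorsof", False)]),
    ("TrueAncsof",  [("Ancestorsof", False), ("Parentsof", True)]),
]
print(is_stratified(family))                    # True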

Example 3.4.2 Consider the following Relationlog program.

Schema
r(A : token, B : {token})
q(A : token, B : {token})
p(A : token, B : {token})
Facts
r(c, {a, b})
Rules

q(_X, <_Y>) :- r(_X, <_Y>), not p(_X, <_Y>)

p(_X, <_Y>) :- q(_X, <_Y>), r(_Y, _S)

This program is not stratified since there is a marked edge in the cycle p → q → p in its dependency graph (see Figure 3.8).

Figure 3.8: Dependency Graph for the Program in Example 3.4.2

Chapter 4

Design and Implementation of Relationlog

The Relationlog system was inspired by LDL [8]. It also adopted most of the applicable and suitable technologies from CORAL [22], Aditi [27], and Glue-Nail [11]. Unlike these deductive database system implementations, the Relationlog system lets the user specify whether to make an intensional relation persistent. If a user queries a persistent intensional relation, Relationlog simply uses indices based on the existing relations to find results so that the response time is minimal (see Section 5.1 for more details). As the current implementation of the Relationlog system is built on top of PARODY [25], Relationlog performs reasonably well for small and medium databases and relatively slowly for large ones due to the limitations of PARODY itself. However, it is sufficient for research and teaching purposes. This limitation could be overcome by further research to make Relationlog suitable for future challenges. This chapter presents the design and implementation of the Relationlog system. Section 4.1 describes the Relationlog system architecture. Then, according to the system architecture, each layer of the system is discussed in turn. To build a solid ground for understanding the internal system structures, Section 4.2 analyzes the file organization of the Relationlog persistent databases. Section 4.3 describes the storage management of Relationlog. Sections 4.4 to 4.6 discuss the implementation of the DDL Manager, DML Manager, and Query Manager.

4.1 System Architecture

The Relationlog system has been implemented as a single-user persistent deductive database system due to limited manpower. The Relationlog system architecture is shown in Figure 4.1. Like most database systems, the Relationlog system is organized into three layers. The first layer is the user interface. The textual user interface is currently the primary communication medium between users and the Relationlog system. It accepts and processes user commands including definitions, queries, and updates. It performs syntactical analysis of the commands and passes the valid commands to the Data Management and Query Subsystem or rejects them. It also displays all desired results or error messages, if any.

Figure 4.1: Relationlog System Architecture

The second layer is the Data Management and Query Subsystem which consists of three managers: the DDL Manager, the Query Manager, and the DML Manager. They cooperate with each other tightly and direct the Storage and Update Subsystem to handle smaller tasks. The DDL Manager is responsible for processing all DDL commands, maintaining system catalogs about domains, relations, indices and rules, and answering all the type checking requests from the DML and Query Managers. It also checks that all rules are stratified when a new view is created. As views may be materialized, the DDL Manager is also responsible for dropping materialized intensional relations and the corresponding rules when a view is dropped. The Query Manager is responsible for data retrieval and rule evaluation. If the data to be retrieved is in an extensional relation or in a materialized view, then it simply uses matching and indices to find the results. If the data to be retrieved is in a non-materialized view, it uses semi-naive bottom-up evaluation with rule ordering as proposed in [21] and a top-down pipelining mechanism to find the results. The DML Manager performs all the updates to the extensional relations. As an update may imply a query, the DML Manager may request the Query Manager to process the query before performing the update. After an extensional relation is updated, it will also request the Query Manager to propagate the updates to materialized views that depend on the updated relation, if any. The third layer is the Storage and Update Subsystem which consists of three main managers: the Buffer Manager, the Index Manager, and the Update Manager. The Buffer Manager deals with loading, dropping, and updating domains, relation schemas, relation tuples, indices and rules between main memory and the persistent database, which consists of disk files. The Index Manager is in charge of the creation and maintenance of the B-tree and hash indices for relations and provides a transparent interface. The Update Manager deals with the updates of domains, relation schemas, relation tuples, indices and rules.

4.2 File Architecture of the Persistent Database

The Relationlog persistent database consists of UNIX files with similar structures. The files store facts, indices, and catalogs. When Relationlog starts running, five system catalog files are automatically created for meta information management: database.sys, which records the information of all databases created in the system; relation.sys, which records the information of all relations in each database; attribute.sys, which records the information of all attributes of each relation; rule.sys, which records the information of all rules of each database; and ruleunit.sys, which records the information of all rule literals in each rule. Each database created in Relationlog has four files of its own in the UNIX file system: a data file with extension .EDB which contains all facts of extensional relations, a data file with extension .IDB which contains all facts of materialized intensional relations, an index file with extension .EDX which contains key indices to the facts of extensional relations, and an index file with extension .IDX which contains key indices to the facts of intensional relations. The Relationlog file architecture is based on PARODY [25]. The original structure has been modified for meta information management and sets. All the files in the Relationlog persistent database are composed of nodes. There is a deleted-node queue in each file so that deleted nodes can be re-allocated. Nodes are addressed logically by node numbers starting with 1. Relationlog uses the node system in two ways: the index files and the system catalog files are organized into fixed-length nodes, while the data files are unformatted, variable-length data threads that can extend beyond the boundaries of a node. This section discusses how the data, index, and system catalog files of the Relationlog persistent database are organized.

4.2.1 Nodes

A file contains a header record followed by node records. Figure 4.2 shows this structure, which is the same as in PARODY. It can support:

1. variable-length persistent relations;

2. indices into the facts on nested relations;

3. persistent objects that can grow or shrink in length or be deleted;

4. reuse of deleted file space;

5. meta information management.

Figure 4.2: File Structure

Header Record The header record contains two node numbers. The first node number points to the highest node that has been allocated. For instance, the node n is the highest one in Figure 4.2. The second points to the first node in a thread of deleted nodes.

Node Records Each node record corresponds to a unique node number of its own. It is a fixed-length disk record with a next-node number at the front of the record followed by anything Relationlog wants to put into the node. The next-node number is a thread pointer. It points to the next node in a logical node thread. Relationlog uses the node number of each node record to access the node by translating the number into the physical location of the node in the file.

Node Threads Since an object may vary in length, a single node is sometimes not enough. A node thread consisting of a number of nodes is then used to contain the whole object. Relationlog maintains a pointer to the first node in the thread. The node pointer at the front of the first node points to the second node, which points to the third, and so on, until a node is reached that has zero in its next-node pointer, which marks the end of the thread. The nodes in a thread are in no particular sequence in the file.

Node Deletion When a node is deleted, Relationlog first clears the node's data area to zeros and adds it to the deleted-node thread, which is pointed to by the deleted-node pointer in the header record (see Figure 4.2). The pointer actually points to the most recently deleted node: Relationlog moves the deleted-node pointer from the header into the deleted node's next-node pointer and moves the deleted node's own address into the deleted-node pointer in the header.

Node Allocation When a new node is needed, Relationlog looks first at the deleted-node pointer in the header. If the pointer is nonzero, Relationlog allocates that node for the new usage and moves that node's next-node pointer into the deleted-node pointer in the header. However, if the deleted-node thread has no nodes, Relationlog allocates a node from the end of the file by using the node number of the highest node allocated as recorded in the file header.
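The interplay of the header pointers just described can be summarized by the following Python sketch, a simplified in-memory model of node allocation and deletion (the class and field names are illustrative and not those of PARODY or Relationlog):

# Simplified model of node allocation/deletion using the header's
# deleted-node pointer and each node's next-node pointer (0 = end of thread).

class NodeFile:
    def __init__(self):
        self.highest = 0          # highest node number ever allocated
        self.deleted = 0          # head of the deleted-node thread
        self.next = {}            # node number -> next-node pointer
        self.data = {}            # node number -> payload

    def allocate(self):
        if self.deleted:                         # reuse a deleted node
            node = self.deleted
            self.deleted = self.next[node]       # pop it off the thread
        else:                                    # grow the file
            self.highest += 1
            node = self.highest
        self.next[node] = 0
        self.data[node] = b""
        return node

    def delete(self, node):
        self.data[node] = b""                    # clear the data area
        self.next[node] = self.deleted           # push onto deleted thread
        self.deleted = node

f = NodeFile()
a, b = f.allocate(), f.allocate()                # nodes 1 and 2
f.delete(a)
print(f.allocate() == a)                         # True: deleted node 1 is reused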

4.2.2 Data File

The Relationlog system stores facts in node threads. The extended structure for a Relationlog data file, based on the one in Figure 4.2, is shown in Figure 4.3. Each fact starts from a node, and that node number is the fact's address. The fact starts with three integers followed by the data of the fact. The first two integers at that address are the relation identification of the fact and a logical node number, both used in locating the fact and rebuilding the index file, which is described in the following section. The third integer indicates the attribute node related to this fact so that the nested set elements can be accessed. Thus this extended structure is suitable for storing nested relations. If the length of a fact is greater than that of a node, it spills over into the next node in the thread, continuing that way until the entire fact is recorded. The balance of the last node in a fact's thread is padded with zeros. The same physical file of nodes holds all the objects of all the relations in a database. Nothing that stands alone in the file itself tells the database the types of facts or anything about their formats. For Relationlog to correctly locate a fact, it needs the address of the fact's first node. That address is maintained by the indices, as in Figure 4.3, described in the following section.

Relation Header Tables Each relation is represented by a table of index header records. The first node in the file is the header table for the relation that has relation identification zero. That node is the first of a thread of index header nodes. The second node in the thread is the table for relation identification one, and so on.

Index Header Records Each relation header record is a table of index header records. There is an index header record for each key in the relation. The first one is for the primary key. The second one is for the first secondary key, and so on. Each index header record contains two data items: the node number of the root node for the index and the length in bytes of the index key value. The system assigns the root node number when the index is first built, and it extrapolates the key length from the first time the key is created.

4.2.4 System Catalog Files

PARODY [25] cannot deal with meta information. Thus it has been extended with the capability for meta information management through the system catalog files. The structure of the catalog files is the same as the basic structure in Figure 4.2. Each system catalog file, however, has a different node size depending on its basic block. A basic block can be the description of a database, a relation, an attribute, a rule or a rule literal. It takes up exactly one node in its corresponding file. Most kinds of blocks are connected with each other through node numbers. The representation of these blocks in the catalog files is further described in Section 4.4.

4.3 Storage and Update Subsystem

The Storage and Update Subsystem is the lowest layer. It accepts requests from the Data Management and Query Subsystem and responds to them with TRUE, indicating the job is done, or FALSE, denoting a failure along with an error message. According to the requests, it provides rapid access to domains, schemas, facts, indices and rules in the persistent database. Generally speaking, the subsystem performs the following tasks:

1. memory management using an LRU (least recently used) algorithm, which moves the least recently used data out of memory upon a memory request when no memory space is available.

2. updates of domains, schemas, facts, indices, and rules.

3. retrieval of attributes, facts, indices, and rules by their unique node numbers, much like most OO DBMSs use OIDs to identify an object.

4. support for the B-tree and hash indices.

5. support for many different kinds of data types, such as atomic types, tuple types, set types, bag types and list types, so that the user can declare and manipulate complex nested relations with these types.

4.3.1 Buffer Manager

Node numbers are used as pointers in the Relationlog system. They have to be translated into exact memory addresses for accesses in memory. The Buffer Manager is in charge of the address mapping that translates node numbers of attributes, facts, indices, and rules into memory addresses, and of updates in cooperation with the Update Manager. To make efficient use of each node, Relationlog adopts a technique similar to the one used in cache-memory mapping [28], called Set Associative Mapping. Each relation has its own buffer space for its own facts. The buffer space is composed of a number of memory pages, each holding some memory slots (see Figure 4.5). This manager is responsible for managing the space occupied by objects (including facts, attributes, indices, and rules) requested by the applications. That means that if an object is currently in main memory, its memory address is returned to the application requesting it. Otherwise the Buffer Manager employs an LRU algorithm to drop some old objects that have not been used for a long time, and then loads this object into memory. If the node to be dropped has been changed or deleted (marked by the Update Manager), the Buffer Manager physically changes or deletes the node stored on disk. The format of an object in its page buffer is similar to its disk format. The mapping technique is described as follows:

Set Associative Mapping

One page = C slots (blocks); one set = J slots (blocks). Each page of the buffer space is grouped into P = C/J sets.

1. Mapping function: Q = A mod P, where Q is the group (set) number and A is the node number.

2. Page slot format: Tags Node Number Data

Figure 4.5: Page Slot Address Format

3. Page structure: each page is organized into P rows (sets), and each row consists of J slots. Each slot has Node Number, Tags and Data fields. In a buffer, there are a few pages containing the facts of a relation. Each page has the same size according to the relation on the top level of a schema, or contains node-size slots for set objects. The number of pages in the buffer is limited by the available memory and changes dynamically.

4. Mapping scheme:

(a) Use the node number as the index to access a set in a relation buffer (in the related pages).
(b) Compare this node address with all Node Number fields in this set.
(c) If there is a match, then respond to the requesting application with the memory address so that the content of this node can be accessed. Otherwise a page miss occurs.

5. Page miss: when a page miss occurs, the Buffer Manager uses the node number to access the data in the persistent database. It then transfers the whole node from the persistent database to a memory slot within the set, using the LRU replacement algorithm if no space is available. Meanwhile it writes the node number into the Node Number part of the slot and makes marks in the Tags. (A sketch of this scheme is given below.)
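The following Python sketch gives a simplified model of this set-associative lookup with LRU replacement; the tags and page layout are omitted and all names are illustrative:

# Set-associative lookup: node number -> set number via modulo, then a
# linear scan of the J slots in that set; on a miss the least recently
# used slot in the set is replaced.

class SetAssociativeBuffer:
    def __init__(self, num_sets, slots_per_set, disk):
        self.P = num_sets                           # P sets per buffer
        self.J = slots_per_set                      # J slots per set
        self.sets = [[] for _ in range(num_sets)]   # each entry: [node, data]
        self.disk = disk                            # node number -> contents

    def access(self, node):
        group = node % self.P                       # mapping function Q = A mod P
        for slot in self.sets[group]:
            if slot[0] == node:                     # hit: serve from memory
                self.sets[group].remove(slot)
                self.sets[group].append(slot)       # mark as most recently used
                return slot[1]
        data = self.disk[node]                      # miss: load node from disk
        if len(self.sets[group]) >= self.J:
            self.sets[group].pop(0)                 # evict least recently used
        self.sets[group].append([node, data])
        return data

disk = {n: "node-%d" % n for n in range(1, 25)}
buf = SetAssociativeBuffer(num_sets=2, slots_per_set=4, disk=disk)
print(buf.access(10))                               # loaded from disk, then cached
print(buf.access(10))                               # served from the buffer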

For example, suppose there are 8 slots in memory and 24 nodes on disk. The 8 slots are grouped into 2 sets: Group0 and Group1. Similarly, the 24 nodes are divided into 6 sets: G0, ..., G5. In each set, there are 4 slots. According to the mapping scheme, Group0 corresponds to G0, G2, G4, while Group1 corresponds to G1, G3, G5. The mapping between memory slots and disk nodes is shown in Figure 4.6. If an application needs node 10, then 10 mod 4 = 2 indicates that the node is in G2, and G2 corresponds to Group0. Thus node 10 should be in memory Group0. At the same time, the Buffer Manager selects a slot in the set, sets Node Number to the node number and makes a mark in the Tags. Assuming slot 0 is the first free slot, slot 0 is the one chosen for loading.

Figure 4.6: Mapping Between Disk and Memory

4.3.2 Index Manager

The Index Manager maintains the B-tree indices [15] for the Relationlog relations (extensional and intensional) and hash indices [15] for temporary relations. They were chosen because they are efficient and standard, and are used by most database systems. B-tree indices, residing both in memory and on disk, are used for defining and maintaining the keys of each relation. They are stored in the index file of each database. Hash indices are used only in memory for temporary relations, which can be the differential relations of semi-naive evaluation or temporarily created views.

a. B-Tree Indices
The Relationlog system uses the B-tree structure to implement the indices. The B-tree [15] is a balanced tree of key values used to locate the fact node that matches a specified key argument. The tree itself is a collection of nodes, where each node contains from one to a fixed number of keys. A B-tree generally consists of a root node and usually two or more lower nodes. The nodes store keys in key value sequence. Each key consists of the key index value provided by the fact and the fact's node address in the database. When the tree has multiple levels, each key in a parent node points to the lower node that contains keys greater than the parent key and less than the next adjacent key in the parent. The nodes at the lowest level are called leaves and do not contain pointers to lower nodes. When a node becomes overpopulated by key value insertions, it splits into two lower nodes and the middle key value is inserted into the parent node of the original pre-split node. If the root node splits, a new root node is grown as the parent of the split nodes. When a node becomes underpopulated as the result of key value deletions, it combines with an adjacent node. The key value that is between the two nodes and that is in the common parent node of the two is moved into the combined node and deleted from the parent node. If the deleted key is the last key in the root node, then the old root node is deleted, and the newly combined node becomes the root node. The B-tree guarantees that the tree always remains in balance; that is, the number of levels from the root node to the bottom of the tree is always the same no matter which branch is considered. In the Relationlog system, each relation can have a primary key (unique or not) or several secondary keys. Therefore, a fact can be accessed by only a portion of the attribute values related to its indices. The nodes of the indices are always loaded into memory before the nodes of the relations. After the index checking, only the needed facts are loaded into memory through their indices. Meanwhile, an LRU algorithm is used to decide which node is to be moved out of the buffer when memory space is needed.

b. Hash Indices
As with the B-tree indices, Relationlog adopts the traditional hash index [15] to locate facts of in-memory temporary relations. By computing a hash function, Relationlog maps the combined attribute values into an entry of the in-memory hash index for a temporary relation and then returns the address of the desired fact in memory. When several facts map to the same entry of the hash table, buckets are provided for maintaining the facts with the same entry in a queue. Here the term bucket denotes a unit of memory that can store one or more key values of facts. To insert a fact, the combined attribute values are computed with the chosen hash function, which gives the address of the bucket for that fact. If there is still space in the bucket to store the fact, then the fact is stored in that bucket. Otherwise a new bucket is allocated and hooked up to the end of the old one. To perform a lookup on the combined attribute values, the values are computed with the hash function. Then Relationlog searches the buckets one by one starting from the address given by the computation. Deletion is equally straightforward. It includes a lookup on the combined attribute values. Once the fact is found, it is deleted.
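A minimal Python sketch of such a bucketed hash index is given below; it is illustrative only, and in Relationlog the key is the combination of the attribute values of a fact:

# Hash index for temporary relations: the combined key values are hashed
# to a bucket; each bucket is a small list of (key, fact address) pairs.

class HashIndex:
    def __init__(self, num_buckets=101):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, fact_address):
        self._bucket(key).append((key, fact_address))

    def lookup(self, key):
        return [addr for k, addr in self._bucket(key) if k == key]

    def delete(self, key):
        bucket = self._bucket(key)
        bucket[:] = [(k, a) for k, a in bucket if k != key]

idx = HashIndex()
idx.insert(("Pam", 1956), 14)        # key = combined attribute values
print(idx.lookup(("Pam", 1956)))     # [14]
idx.delete(("Pam", 1956))
print(idx.lookup(("Pam", 1956)))     # []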

4.3.3 Update Manager

The Update Manager deals with the updates of the nodes that hold domains, schemas, facts, indices and rules. As an update may imply a query, the DML and DDL Managers consult with the Query Manager to find the exact nodes to be updated and then send these node numbers to the Update Manager so as to perform the updates. The Update Manager then consults with the Buffer Manager and the Index Manager to bring the nodes that need to be updated into memory (if they are not already there). It also marks the node with a Changed or Deleted tag on its corresponding memory page slot (see Figure 4.5) so that the Buffer Manager will change or delete them when the session is over (the database is closed) or the nodes are dropped from memory.

4.4 DDL Manager

The DDL Manager processes all Relationlog DDL commands, maintains system catalogs about domains, schemas, relations, indices and rules, and answers all the type checking requests and rule applications from the DML and Query Managers. Figure 4.7 shows its four parts: DDL Processor, Schema Manager, Type Checker, and Rule Manager.


Figure 4.7: Components of the DDL Manager

1. DDL Processor The Data Definition Language Processor is responsible for pre-processing the user inputs, checking their validity and translating them into internal expressions. It communicates with the Schema and Rule Managers to gain exclusive access to the schemas and rules, and makes the changes.

2. Schema Manager The Schema Manager is responsible for implementing all changes to schema definitions. It processes the internal schema update expressions from the DDL Processor. After further checking of the syntax and of semantic safety and soundness, the request with internal schema update expressions is either processed or rejected. The Schema Manager keeps a version of the schema in the catalog files and tracks the changes of every attribute.

3. Type Checker The Type Checker is responsible for ensuring that the submitted queries and update expressions are legal. This involves attribute identifier resolution and type checking of the attribute values. Example errors include unknown relation or attribute names, passing incompatible types to arithmetic operators, and malformed expressions. In Relationlog, types are checked each time a query is submitted. This ensures that, after a modification of the schema, the Type Checker will not accept incorrect queries generated by older application programs. Timings of Relationlog operations indicate that type checking is only a small part of the overall cost of query evaluation.

4. Rule Manager The Rule Manager maintains the essential information concerning rules. It deals with the internal rule update expressions from the DDL Processor. First of all, it consults with the Type Checker to make sure that the expressions are well-typed. Then, the safety and acceptability of the requests are checked. After that, the internal expressions are optimized. If everything goes well with the checking and optimization, the request will be committed. Otherwise it will fail. The Rule Manager also maintains a hard copy of the rules in the system catalog files and makes safe changes on it.

In the rest of this section, some important issues are elaborated: how to represent schemas, rules and catalogs.

4.4.1 Representation of Schemas

In the Relationlog system, all relation schemas in a database are created as novel tree-like structures. Each attribute is identified by its unique node number. Attributes are connected to each other by pointers (node numbers) in the catalog file attribute.sys. For example, the Persons relation in Section 3.1.2 is represented internally as in Figure 4.8.

Figure 4.8: Internal Schema Representation

The schema shows that each fact of the relation Persons is a tuple. There are a few attributes in this tuple: Name of a string type, Age of an integer type, Parents as a set of strings, and Address as a tuple of two string types, Street and City. Note that if the type of an attribute is a tuple, set, list or bag, then it has some sub-types as its children. Figure 4.9 shows a more complicated schema definition, that of the ROBOT relation in Section 3.1.2.

Figure 4.9: Schema Structure for ROBOT
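The following Python sketch approximates this tree-like representation: every attribute becomes a node identified by a number, and compound types point to their children (the numbering and helper names are illustrative, not those of attribute.sys):

# Attribute nodes linked by node numbers, in the spirit of attribute.sys.

nodes = {}          # node number -> {"name", "type", "children"}
counter = [0]       # next node number to hand out

def attr(name, typ, children=()):
    counter[0] += 1
    node_id = counter[0]
    nodes[node_id] = {"name": name, "type": typ, "children": []}
    for child in children:
        nodes[node_id]["children"].append(attr(*child))
    return node_id

persons = attr("Persons", "tuple", (
    ("Name", "string"),
    ("Age", "integer"),
    ("Parents", "set", (("element", "string"),)),
    ("Address", "tuple", (("Street", "string"), ("City", "string"))),
))

def show(node_id, depth=0):
    node = nodes[node_id]
    print("  " * depth + "%d %s : %s" % (node_id, node["name"], node["type"]))
    for child in node["children"]:
        show(child, depth + 1)

show(persons)       # prints the Persons schema tree with its node numbers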

4.4.2 Representation of Rules

The Relationlog rules have been introduced in the previous chapter. This section shows how the rules are represented internally. Each rule has a few literals concerning relations or arithmetic and set-oriented operations. Each literal is called a query (rule) node in the internal representation. Each query (rule) node has a few query (rule) units corresponding to the attributes if the literal concerns one of the relations in the database. Since a rule can also be used as a query, query nodes and units are used to stand for rule nodes and units for compatibility with queries. The units are connected to their relatives by node numbers in a tree-like structure like the schemas. Each unit can contain a constant or a variable pointer to VariableGroup. VariableGroup contains all the variables in the rules of a database. However, if the query node is only a condition operation, a label is set indicating a condition node. Consider the following rule introduced in Section 3.1.3:

Its internal representation is shown in Figure 4.10. This rule has three literals corresponding to three query nodes. The description of the rule includes the stratification layer number (Stratum No.) and the head and body node numbers of the rule (for easier understanding, relation names are used instead). Each query node has a few members: CND indicating whether the node is a condition (TRUE for yes), the relation node number Relation (the relation name is again used), IsHead denoting the position of the node in a rule, Negation (TRUE) marking negative literals, AttrRoot standing for the attribute root node number of this query node, UnitNo. giving the number of query units belonging to the query node, and some other bytes left for later use. A query unit is related to a variable or a constant. The variable in a query unit is represented as a pointer pointing to VariableGroup. To distinguish variables with the same name, some tags are used, such as the related attribute node number (AttrID) and other information (Other Info.) like the rule number and so on. The Value column associates the real value with the variable during query processing, which can also be a pointer. If the query node is a condition, the operator is indicated, and so are the operands (variables or constants). In the example, it is assumed that the stratification layer of the rule is 1 and that the attribute root node numbers for the three query nodes are respectively 12, 3 and 18. The description of the rule is then P8 for the head, and P2 and P3 for the body. In the real system, these IDs for the head and body are the node numbers of each query node, but for better illustration the names are used in Figure 4.10 instead of the real node numbers. Three query nodes are needed with the relation names P8, P2, and P3. They are all non-condition nodes, among which TRUE in the IsHead field of P8 indicates that it is the head of the rule, while FALSE for the others indicates the body of the rule. For the node P8, there are six query units corresponding to the head literal. There are 7 and 5 query units respectively for the two body nodes. In the rule, there are variables _A, _B, _C and _E stored in VariableGroup, where the variable names are connected with their nodes, attributes and values. Because _E here represents a tuple, its Value field in VariableGroup contains the node number holding the whole tuple during evaluation.


Figure 4.10: Internal Rule Representation

4.4.3 Representation of Catalogs

As introduced in Section 4.2.4, there are five system catalog files with extension .sys for the meta information of databases, schemas and rules. Each system catalog file has a different node size depending on its basic block. A basic block can be an attribute node, a query/rule node, or a query/rule unit node for the schemas and rules discussed in the previous two sections. It can also be the description of a database or a relation. It takes up exactly one node in its corresponding file. The next-node field of each node points to the next node depending on the context. Each node may contain some pointers to other files. In database.sys, each node holding the description of a database contains some pointers, actually node numbers, pointing to the descriptions of its related relations and rules. It also stores the VariableGroup for all the variables in the rules. In relation.sys and rule.sys, the nodes of the descriptions of relations and rules are connected to the next by node numbers. The description node of a relation contains information about the schema, pointing to attribute.sys, and other information like the layer number and so on. Likewise, the description nodes of rules are hooked up to each other by node numbers and contain pointers to their rule literals in ruleunit.sys. Consider the family database in Section 3.4. The meta information concerning its relation schemas and rules is partly given in Figure 4.11. The node in database.sys contains the description of the database family denoting all the meta information of its relation schemas and rules. In relation.sys or rule.sys, each node stream contains the descriptions of all relations or rules, such as Parentsof and Ancestorsof in relation.sys. A description node of a relation contains a node number pointing to the root node of its schema in attribute.sys, through which all attributes can be accessed. Similarly, a description node of a rule has the node numbers of its head and first body rule units, and thus points to its head and body array in ruleunit.sys.

Summary: when a database is activated (opened) in the Relationlog system, all attributes of domain types, schemas, rule nodes, and rule units included in this database are created as tree-like structures in main memory from the system catalog files in order to achieve high performance. Each block (node) is identified by its unique node number. Units are connected to each other by node numbers. The indices are loaded into memory only when needed for facts, whose representation is described in the following section.

Figure 4.11: System Catalog Files

4.5 DML Manager

The DML Manager performs all the updates to the extensional relations. As an update may imply a query, the DML Manager may request the Query Manager to process the query before performing the update. After an extensional relation is updated, it will also request the Query Manager to propagate the updates to the materialized views that depend on the updated relation, if any.

4.5.1 Representation of Relations

In the Relationlog system, tuples in the same relation with n top-level non-empty sets, lists, and bags are stored in n + 1 node streams. One stream is for the tuple itself and one stream for each set, list, or bag in the tuple, with its first node number, its description and its element count stored in the tuple stream so that its elements in the set stream can be accessed from the tuple stream. The number of nodes needed for the tuple stream is the same for all tuples of the relation, but the number of nodes needed for the other streams depends on the elements they hold. For example, an empty set does not occupy any node. The maximum number of nodes currently allowed in a Relationlog file is 2^32. The current Relationlog system does not support raw data. If a set, list or bag in the tuple has nested sets, lists, or bags, then additional node streams are used to store them. For nested tuples, there are no separate node streams because the length of the tuples is known from the type definition. Instead, the nested tuples are stored in their root tuple stream. Consider the following fact in the Persons relation introduced in Section 3.1.2:

("Mary", 25, { "Lzlyn,"Seann), [" 123 St" ," Toronto"])

Because of the nested set, this fact is stored in two node streams: one stream for the atomic values and the nested tuple, and the other stream for the nested set. Suppose that the available nodes are 10, 11, 14, ..., that the first stream takes the two nodes 10 and 11, and that the second node stream takes just node 14. Then node 10 contains the following data: the next node number 11, the relation identification for the relation Persons, the value Mary for the attribute Name, the value 25 for the attribute Age, the description of the Parents set, and the value "123 St" for the attribute Street of Address. The description of the only set is the first node number of the thread storing all the elements of the set and a binary number 1 indicating a complete set (0 for a partial set). In node 11, there is only one value, Toronto, for the attribute City of Address; the remaining bytes are all initialized to 0. Node 14 just contains the set (2 for the set count and 10 for its parent node). The internal representation of this fact is shown in Figure 4.12.
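The splitting of a nested tuple into node streams can be modelled by the following Python sketch (a simplification: streams are plain lists, set descriptors record only the stream number and element count, and all names are illustrative):

# Split a tuple into a tuple stream plus one stream per non-empty set.
# Nested tuples are flattened into the tuple stream; sets get descriptors.

import itertools

stream_ids = itertools.count(100)

def store(tup, streams):
    record = []
    for value in tup:
        if isinstance(value, set):
            if not value:
                record.append(("set", None, 0))     # empty set: no stream
            else:
                sid = next(stream_ids)
                streams[sid] = sorted(value)        # the set's own stream
                record.append(("set", sid, len(value)))
        elif isinstance(value, list):               # nested tuple, flattened
            record.extend(store(value, streams))
        else:
            record.append(value)
    return record

streams = {}
fact = ["Mary", 25, {"Lily", "Sean"}, ["123 St", "Toronto"]]
streams["tuple"] = store(fact, streams)
print(streams["tuple"])   # ['Mary', 25, ('set', 100, 2), '123 St', 'Toronto']
print(streams[100])       # ['Lily', 'Sean']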


Figure 4.12: Internal Representation of a Fact

4.5.2 Update Issues

The Relationlog file structure in Figure 4.2 is suitable for updates. This section shows how to update a fact without destroying it, continuing with the previous example in Figure 4.12. Suppose one more string attribute Salary is added at the end of the Persons relation and the Salary value 5000 is inserted into the fact used above. Moreover, the attribute Parents is altered to Family to hold the whole family, and two brothers, Tom and Pam, are added to the same set. It is assumed that the attribute value 5000 can still be held in the last node of the fact's node stream and that one node (14) cannot contain four elements of a set. Thus a new node will be allocated and its number will be put into the next-node field of the first node containing the set elements. The next available node is supposed to be 15. Then the first three elements are in node 14 and the last one is in node 15. The unused bytes are filled with 0. In addition, the set element count is changed from 2 to 4. Figure 4.13 shows how this updated fact is stored in the Relationlog database.

4.6 Query Manager

The Query Manager is responsible for data retrieval and rule evaluation pertaining to the facts stored extensionally in the database or defined intensionally by rules. It first translates a query into its internal expressions and optimizes them based on the conditions the query possesses. It then uses several evaluation strategies, such as matching, semi-naive bottom-up evaluation with rule ordering, and a top-down pipelining mechanism, to find the results. Due to the introduction of sets in Relationlog, the Query Manager finally has to group the derived partial set elements [18] in order to answer the user's queries. This section concerns the internal representation of queries, query optimization, and the grouping mechanism. The evaluation strategies themselves will be discussed in the following chapter for better illustration.


Figure 4.13: The Updated Persons Fact

4.6.1 Internal Representation

Because each variable is associated with an attribute, and so is each known value, Relationlog adopts an internal tree-like structure for each query literal related to a relation, in which each node is called a query unit. Thus the Type Checker can easily carry out the type checking and the communication with the Schema Manager. Each query unit is connected with its neighbors and children through pointers, which mirrors the representation of the relation schemas. The internal representation of query expressions is the same as that of the units in a rule illustrated in Section 4.4.2. The difference between them is that all the meta information (nodes and units) of rules is stored in the catalog files, while that of queries is kept temporarily in memory.

4.6.2 Query Optimization

In order to evaluate queries more efficiently, Relationlog adjusts and optimizes the internal expressions of a query based on the conditions the query possesses. It uses the following optimization algorithm, which is based on the algorithms in [3] and [22].

Algorithm 4.6.1 Optimization Algorithm
Input: E1, ..., En in a queue Q1, from the body of a rule or a query expression
Output: EO1, ..., EOn in a queue Q2 for the optimized expressions
Method: Let Ei (0 < i ≤ n) be the first literal related to a relation in Q1 with as many known values as possible in its query units, so as to narrow the selection scope.
WHILE Q1 is not empty

(a) Add Ei to Q2
(b) Delete Ei from Q1
(c) Let Ei be the literal in Q1 with as many known values, "bound" variables, or privileged condition expressions (appearing in the literals of Q1) as possible in its query units (if it is an arithmetic or set condition, all the variables on one side of it have to be bound)

END WHILE

Consider the following query based on the relations defined in Section 3.1.2:

query Persons(_Name, _Age, _, [_City, _Country]), Publications(_Name, <["Logic"]>), _Name != "Tom"

The query says find the name, age, and location of the author who is not Tom and wrote the paper Logic. It contains three literals corresponding to E1, E2, E3 in Q1, while Q2 is empty at first. Thus E2 is picked first for Q2 since it has one bound attribute value. Then E3 is selected since all its variables are "bound". Finally E1 is picked up and the loop finishes with the empty queue Q1. Now in Q2, the sequence E2, E3, E1 corresponds to:

Publications(_Name, <["Logic"]>), _Name != "Tom", Persons(_Name, _Age, _, [_City, _Country])

Note that the original expressions are used here instead of the internal ones for better illustration.
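The effect of Algorithm 4.6.1 can be reproduced with the following Python sketch, in which a literal is simply a predicate name with an argument list, anything that is not a variable counts as a known value, and the special handling of arithmetic and set conditions is omitted (all names are illustrative):

# Greedy reordering of query literals: repeatedly pick the literal with
# the most constants/bound variables, then mark its variables as bound.

def is_var(arg):
    return isinstance(arg, str) and arg.startswith("_")

def reorder(literals):
    remaining, ordered, bound = list(literals), [], set()

    def score(lit):
        _, args = lit
        return sum(1 for a in args if not is_var(a) or a in bound)

    while remaining:
        best = max(remaining, key=score)
        remaining.remove(best)
        ordered.append(best)
        bound.update(a for a in best[1] if is_var(a))
    return ordered

query = [
    ("Persons",      ["_Name", "_Age", "_City", "_Country"]),
    ("Publications", ["_Name", "Logic"]),
    ("!=",           ["_Name", "Tom"]),
]
for lit in reorder(query):
    print(lit)
# Publications is chosen first (one constant), then the comparison
# (its variable is now bound), and finally Persons.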

4.6.3 Grouping

As discussed in [18], atoms with partial sets are deliberately allowed to be inferred in Relationlog as intermediate results. They must be grouped properly with the existing facts to make the database compact. Suppose the following intensional fact based on the relation Ancestorsof is derived from the rules:

Ancestorsof("Jim", <"Tom">)

The fact says Tom is one ancestor of Jim. Suppose another fact

Ancestorsof("Jim", <"John", "Leroy">)

already exists in the database, which means that John and Leroy are two ancestors of Jim. The above two facts are equivalent to the following fact:

Ancestorsof("Jim", <"John", "Leroy", "Tom">)

That means they should be grouped together into one compact fact. The following algorithm, based on the grouping operator introduced in [18], shows how Relationlog groups facts so as to keep the database compact at all times.

Algorithm 4.6.2 Grouping Algorithm
Input: a new nested tuple Ttmp and the existing grouped tuples in the database
Output: grouped tuples including Ttmp
Method: Compare only the non-set attributes one by one with the existing tuples with the same key as Ttmp, if applicable.

Let Ti be the first existing tuple in the database with the same key as Ttmp
WHILE Ti is not null

  compare only the non-set attributes of Ttmp one by one with Ti
  IF Ti and Ttmp have the same non-set attribute values THEN
    find the new set elements existing in Ttmp
    IF there are such new elements THEN
      insert the elements into the same set of Ti, recursively calling this grouping algorithm itself if there are nested attributes in the set
    ELSE
      return FALSE
    END IF
  ELSE
    Let Ti be the next tuple with the same key as Ttmp
  END IF

END WHILE
Move Ttmp into the pool of existing tuples as a new one with appropriate indices
return TRUE
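A simplified Python rendering of the grouping algorithm is given below; a tuple is modelled as (key, non-set attributes, set), nested sets are not handled, and the names are illustrative:

# Group a newly derived tuple with the existing tuples: if a tuple with
# the same key and the same non-set attributes exists, merge the sets;
# otherwise add the new tuple to the database.

def group(database, new_tuple):
    key, non_set, elements = new_tuple
    for existing in database:
        if existing[0] == key and existing[1] == non_set:
            fresh = elements - existing[2]
            if not fresh:
                return False                   # nothing new to add
            existing[2].update(fresh)          # merge into the existing set
            return True
    database.append([key, non_set, set(elements)])
    return True

db = [["Jim", (), {"John", "Leroy"}]]
group(db, ("Jim", (), {"Tom"}))
print(db)   # [['Jim', (), {'John', 'Leroy', 'Tom'}]] (set order may vary)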

With this algorithm, the above grouping problem can be solved: the two facts of the relation Ancestorsof at the beginning of this section are grouped into one compact fact. In the next chapter, the evaluation strategies used by the Query Manager are described.

Chapter 5

Query Evaluation Strategies

Before the user can query a Relationlog database, the database must be opened. By opening the database, the Relationlog system loads into memory the meta information about relations, schemas and rules. Then the Query Manager, as introduced in the last chapter, takes care of query evaluation using several evaluation strategies. Generally speaking, if a query only involves an extensional relation or a materialized view, then the Query Manager simply uses matching and indices to find the results. If the query involves a non-materialized view, it uses semi-naive bottom-up evaluation with rule orderings (proposed in [21]) and top-down pipelining mechanisms to find the results. This chapter discusses these three evaluation strategies one by one and summarizes how they cooperate with each other.

5.1 Matching

Most deductive database systems such as Aditi [27], LDL [8], CORAL [22], etc., use magic set rewriting techniques to improve the performance of queries that involve rules. Basically, they focus on how to improve the evaluation speed by using the constants provided in the queries. There are several problems with this approach. Firstly, if there is no constant in the query, then it has the worst performance; therefore, the average performance is only half of the worst case. In order to obtain such performance, query forms (or modes) have to be specified that tell which argument is free and which is bound. For a relation with n arguments, there can be 2^n query forms to be specified if sets are supported. The rules have to be rewritten according to the query forms before any query can be issued. This is normally done at compile time, so it is possible that the user issues a query that does not match any query form. Finally, the same rules are evaluated many times for similar queries and even for the same query. Relationlog uses a quite different approach. The user can specify whether to make an intensional relation persistent in the Relationlog system. If a user queries a persistent intensional relation, Relationlog simply uses matching and indices to find the results, so that the response time is minimal. Obviously, the cost of maintaining persistent intensional relations lies in the updates. In order to reduce the update frequency, Relationlog does not propagate updates to persistent intensional relations for each individual update. Instead, it only puts a mark on the relations indicating that they are no longer valid. When the relations are queried, the Relationlog system then propagates the updates by evaluating the rules. Unlike other deductive database system implementations (LDL, CORAL), the Relationlog system also keeps the derived data of non-persistent intensional relations in main memory if memory space allows, and temporarily on disk with proper indices, until the database is closed. In this way, the results of user queries may already be available without any need for further evaluation, so that matching and indices can be used to find the results and the response time is minimal.
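The lazy propagation policy can be pictured with the following Python sketch; evaluate_rules stands in for the actual rule evaluation machinery, and all names are illustrative:

# Persistent intensional relations carry a "valid" mark. Updates to base
# relations only clear the mark; the rules are re-evaluated the next time
# the relation is queried, after which matching/indices answer the query.

class PersistentView:
    def __init__(self, evaluate_rules):
        self.evaluate_rules = evaluate_rules   # callback: recompute the view
        self.facts = evaluate_rules()
        self.valid = True

    def invalidate(self):                      # called on base-relation updates
        self.valid = False

    def query(self, match):
        if not self.valid:                     # propagate updates lazily
            self.facts = self.evaluate_rules()
            self.valid = True
        return [f for f in self.facts if match(f)]

base = {("Tom", "Bob")}
view = PersistentView(lambda: set(base))       # trivial "copy" view
base.add(("Bob", "Pam"))
view.invalidate()
print(view.query(lambda f: f[0] == "Bob"))     # [('Bob', 'Pam')]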

5.2 Semi-Naive Bottom-Up with Rule Orderings

Relationlog programs can be evaluated bottom-up by repeatedly applying all rules in iterations until the fixpoint is reached. With the introduction of negation, it is often preferable to apply the rules in some order, that is, by stratification. The default bottom-up evaluation strategy in the Relationlog system is Basic Semi-Naive evaluation (BSN) [21], but a variant, called Extended Semi-Naive evaluation, is also available for programs with many mutually recursive predicates. For the variant, the rule orderings proposed in [21] are applied in order to reduce rule applications and iterations and achieve better performance. This section presents stratification, the Basic Semi-Naive evaluation, the Extended Semi-Naive evaluation and the rule orderings. The algorithms to be introduced are based on the ones in [3] and [18].

5.2.1 Stratification

Before any evaluation in Relationlog, the rules/views in the database have to be stratified, as covered in Section 3.1.2. The stratification algorithm developed for Relationlog is as follows:

Algorithm 5.2.1 Relationlog Stratification Algorithm
Input: queue QI of rules R1, ..., Rn to be stratified
Output: queue QO of stratified rules with proper layer numbers set, or error messages indicating that there are cycles with negation or complete sets
Method: First of all, divide the rules in QI into two queues: QI containing the rules with negation or complete sets in the body, QO containing the other rules. Then, set all the layer numbers of predicates to 1.
WHILE QI is not empty
  Let Ri be the first rule in QI.
  WHILE Ri is not null
    IF all the predicates in its body with negation or complete sets are not in the head of any other rules THEN
      a) MaxNegation := the highest layer number of the predicates with negation or complete sets
      b) MaxPositive := the highest layer number of the other predicates
      c) layer of the head := maximum{1 + MaxNegation, MaxPositive}
      d) recursively reset the layer numbers of the corresponding predicates in QO
      e) Delete this rule Ri from QI and get out of this loop
    ELSE
      Let Ri be the next rule in QI
    END IF
  END WHILE
  IF Ri is null and QI is not empty THEN return error messages
END WHILE
return QO and store the relation layer numbers in the system catalog files

Schema:
    Base1(A: string, B: string)
    Base2(A: string, B: string)
    View1(A: string, B: string)
    View2(A: string, B: string)
    View3(A: string, B: string)
    View4(A: string, B: string)
    View5(A: string, B: string)
    View6(A: string, B: string)

Facts:
    Base1("a", "b")
    Base1("b", "c")
    Base1("b", "a")
    Base1("b", "d")
    Base1("d", "e")
    Base2("b", "c")

Rules:
    R1: View2(X, Y) :- Base1(X, Y)
    R2: View1(X, Y) :- Base2(X, Y)
    R3: View2(X, Y) :- View3(X, Y)
    R4: View3(X, Y) :- View4(X, Y)
    R5: View4(X, Y) :- View2(X, Y), not View1(X, Y)
    R6: View5(X, Y) :- View5(X, Z), View6(Z, Y)
    R7: View5(X, Y) :- View3(X, Y)
    R8: View6(X, Y) :- View5(X, Y)

There are eight relations defined, and also eight rules defining the intensional relations View1, View2, View3, View4, View5 and View6. These eight rules have to be stratifiable. Figure 5.1 shows the dependency graph (introduced in Section 3.1.2) of the program. Only View1 is in Stratum 1. However, the graph cannot be constructed in the Relationlog system; the stratification algorithm is used instead.

Figure 5.1: Dependency Graph of base-view

According to the stratification algorithm, these eight rules are first assigned 1 as their layer number and then grouped into two queues: QO only contains R5, and QI contains the rest of the rules. The layer number of R5's head predicate View4 is assigned 2. Then the rules in QI are recursively reset:

(1) R4's head predicate View3 is also in the second stratum because the predicate View4 is in its body
(2) The head predicate View2 of R3 and the head View5 of R7 are both in the second stratum because the predicate View3 is in their bodies
(3) View6 is also included in the second stratum because View5 is in R8's body
(4) Finally, set all rules' layer numbers to the layer numbers of their head predicates

Thus this program can be divided into two strata: stratum 1: R2; stratum 2: R1, R3, R4, R5, R6, R7, R8.
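The layering that the algorithm computes can also be pictured with a small fixpoint-style sketch. This is a simplification: it models only negation (complete set terms would be handled the same way), omits the two queues and the catalog storage, and uses a hypothetical rule representation:

    def stratify(rules):
        # each rule is (head, positive_body, negative_body) over predicate names;
        # negative_body also stands for body predicates inside complete set terms
        preds = {p for head, pos, neg in rules for p in [head, *pos, *neg]}
        layer = {p: 1 for p in preds}
        for _ in range(len(preds) + 1):            # enough passes if stratifiable
            changed = False
            for head, pos, neg in rules:
                new = max([layer[p] for p in pos] +
                          [layer[p] + 1 for p in neg] + [1])
                if new > layer[head]:
                    layer[head] = new
                    changed = True
            if not changed:
                return layer                        # stable layer numbers
        raise ValueError("cycle through negation: not stratifiable")

    # For the base-view program above this assigns layer 1 to View1 (and the base
    # relations) and layer 2 to the remaining views, matching the two strata.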

5.2.2 Basic Semi-Naive Algorithm For Relationlog

Semi-naive bottom-up evaluation consists of a rule rewriting part, performed at execution time, which creates versions of the rules with delta relations (delta relations contain the changes to relations since the previous iteration), and an evaluation part. The evaluation part evaluates each rewritten rule once in each iteration and performs some updates to the relations at the end of each iteration. An evaluation terminates when an iteration produces no new facts. The join operation is used quite often in the semi-naive bottom-up evaluation. The join order used in Relationlog is currently left-to-right in the rule, with a simple reordering that moves delta relations to the front of the join order. The reordering is done with the expectation that the delta relations have a smaller number of tuples than the other relations. The join mechanism is based on hash indices for delta relations, B-tree indices for base relations and views, and the pipelining mechanism which will be discussed in Section 5.3.

This section discusses the Basic Semi-Naive Algorithm for Relationlog. The next section will describe its variant. The Basic Semi-Naive Algorithm for Relationlog is similar to the one for Datalog. The major difference is that grouping is the last step of each iteration in Relationlog, since the representation of nested relations is different from that of flat relations. A minimal sketch of the loop is given below, and the illustrative example that follows demonstrates how the algorithm functions.
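The sketch is a simplification that works on unnested (From, To) pairs for the edge-reach program of Example 5.2.2; in the actual system the rewritten rules, the indices, and the nested storage format are used, but the control flow is the same:

    def reach_semi_naive(edge):
        # edge: set of (x, z) pairs, i.e. the unnested form of the edge relation
        reach = set()
        delta = set(edge)                   # rule 1: reach(X, <Y>) :- edge(X, <Y>)
        while delta:                        # stop when an iteration adds no new facts
            reach |= delta
            # rule 2 rewritten with the delta relation:
            #   reach(X, <Y>) :- edge(X, <Z>), delta_reach(Z, <Y>)
            delta = {(x, y) for (x, z) in edge
                            for (z2, y) in delta if z2 == z} - reach
        # grouping is the last step: gather the Y values into one nested set per X
        grouped = {}
        for x, y in reach:
            grouped.setdefault(x, set()).add(y)
        return grouped

    # reach_semi_naive({("a","b"), ("b","c"), ("c","d"), ("c","e")}) returns
    # {"a": {"b","c","d","e"}, "b": {"c","d","e"}, "c": {"d","e"}}, as in Example 5.2.2.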

Example 5.2.2 Consider the following Relationlog database edge-reach.

Schema:
    edge(From: token, To: {token})
    reach(From: token, To: {token})
Facts:
    edge(a, {b})
    edge(b, {c})
    edge(c, {d, e})
Rules:
    reach(X, <Y>) :- edge(X, <Y>)
    reach(X, <Y>) :- edge(X, <Z>), reach(Z, <Y>)
Then the following evaluation is performed:

(1) reach^0 := {}
(2) Δreach^1 := {}
(3) i := 1
(4) reach^1 := reach^0 ∪ Δreach^1
    Δreach^2 := {reach(a,<b>), reach(b,<c>), reach(c,<d>), reach(c,<e>)} - Δreach^1
    i := 2; reach^2 := reach^1 ∪ Δreach^2
    Δreach^3 := {reach(a,<c>), reach(b,<d>), reach(b,<e>)}
    i := 3; reach^3 := reach^2 ∪ Δreach^3
          := {reach(a,<b>), reach(b,<c>), reach(c,<d>), reach(c,<e>), reach(a,<c>), reach(b,<d>), reach(b,<e>)}
    Δreach^4 := {reach(a,<d>), reach(a,<e>)}

    i := 4; reach^4 := reach^3 ∪ Δreach^4
          := {reach(a,<b>), reach(b,<c>), reach(c,<d>), reach(c,<e>), reach(a,<c>), reach(b,<d>), reach(b,<e>), reach(a,<d>), reach(a,<e>)}
    Δreach^5 := {}

(5) reach := Grouping(reach^4) := {reach(a, {b, c, d, e}), reach(b, {c, d, e}), reach(c, {d, e})}
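The Grouping step itself can be pictured with a hypothetical sketch for a single set-valued attribute (the real operation works on the nested storage format of Section 4.5.1 and handles arbitrarily nested tuples):

    def grouping(facts, set_position):
        # facts: an iterable of tuples; the component at set_position is a set
        merged = {}
        for fact in facts:
            key = fact[:set_position] + fact[set_position + 1:]
            merged.setdefault(key, set()).update(fact[set_position])
        return [key[:set_position] + (frozenset(values),) + key[set_position:]
                for key, values in merged.items()]

    # grouping([("a", {"b"}), ("a", {"c"}), ("b", {"c"})], 1)
    #   -> [("a", frozenset({"b", "c"})), ("b", frozenset({"c"}))]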

5.2.3 Extended Semi-Naive Algorithm For Relationlog

For more efficient evaluation of programs with many mutually recursive views, the Extended Semi-Naive Algorithm was also developed for Relationlog. The following is the abstract algorithm designed for this extended semi-naive evaluation.

Algorithm 5.2.2 Extended Semi-Naive Algorithm
Input: ordered rules R1, ..., Rn in one stratum
Output: all facts of the intensional relations in this stratum

Method: First of all, set the order number of all rules to 1

WHILE there are still some new facts
    FOR i := 1 TO n
        Δp_Ri^old := Δp_Ri^new; Δp_Ri^new := {}
    END FOR
    FOR i := 1 TO n
        Let p_h be the head predicate defined in rule Ri, Δp_Ri^new := {}
        FOR each predicate p_j in the same stratum in the body of Ri

            IF p_j equals the head predicate of some rule Rk THEN
                IF 0 <= k < i THEN Δtmp_j^old := Δp_Rk^old + Δp_Rk^new
                IF k > i THEN Δtmp_j^old := Δp_Rk^old
            END IF
        END FOR
        Apply the semi-naive rewritten version of Ri for each such p_j by replacing Δp_j^old with Δtmp_j^old
        p_h^old := Grouping(p_h^old + Δp_Ri^new)
    END FOR
END WHILE

In this algorithm, Relationlog maintains the relation p^old for the head predicate appearing in all the rules R1, ..., Rn. Moreover, for each rule Ri, 1 <= i <= n, if its head p_h is in the body of other rules, there are two differential relations for its head predicate p_h: Δp_Ri^old for the facts evaluated in the last iteration and Δp_Ri^new for the latest facts. Hash indices are created for these differential relations in memory. When a rewritten rule is applied in one iteration, the new facts for a differential relation in the body, which exists in the heads of other rules, come not only from the last iteration but also from the applications of the rules before this rule in the same iteration. That is, the new facts for a relation calculated in one iteration can be used by the subsequent rules in the same iteration. Thus, it considerably reduces the number of rule applications and the number of iterations. In this respect, the algorithm is much like GSN mentioned in [21], which has the fastest speed but requires a specific storage organization for the differential relations. Both of the semi-naive algorithms developed for the Relationlog system employ a reasonable storage organization for the differential relations, as described in Section 4.5.1, in spite of the introduction of nested sets and tuples. Moreover, the last step in each iteration is grouping, which gathers the set elements together so that the space for storing the intermediate results is minimized. The remaining part of this section gives two simple examples so as to shed light on this algorithm. In Example 5.2.3, the Extended Semi-Naive evaluation takes 3 iterations, only slightly fewer than the Basic Semi-Naive evaluation, which takes 4, since the reach relation is not used to recursively define any other views. Example 5.2.4 illustrates the advantage of the Extended Semi-Naive evaluation over the Basic Semi-Naive evaluation on programs with mutually recursive views.

Example 5.2.3 Consider the same edge-reach database used in Example 5.2.2.

(1) reach^old := {}
(2) Δreach_r1^new := {}; Δreach_r1^old := {}; Δreach_r2^new := {}; Δreach_r2^old := {}
(3) i := 1;
    Δreach_r1^new := {reach(a,<b>), reach(b,<c>), reach(c,<d>), reach(c,<e>)} (for r1)
    reach^old := {reach(a,{b}), reach(b,{c}), reach(c,{d,e})}
    Δtmp_reach^old := {reach(a,<b>), reach(b,<c>), reach(c,<d>), reach(c,<e>)}
    Δreach_r2^new := {reach(a,<c>), reach(b,<d>), reach(b,<e>)} (for r2)
    reach^old := {reach(a,{b,c}), reach(b,{c,d,e}), reach(c,{d,e})}
    i := 2;
    Δreach_r1^old := Δreach_r1^new; Δreach_r2^old := Δreach_r2^new; Δreach_r1^new := Δreach_r2^new := {}
    Δreach_r1^new := {} (for r1); reach^old remains the same.
    Δtmp_reach^old := {reach(a,<b>), reach(b,<c>), reach(c,<d>), reach(c,<e>), reach(a,<c>), reach(b,<d>), reach(b,<e>)}
    Δreach_r2^new := {reach(a,<d>), reach(a,<e>)} (for r2)
    reach^old := {reach(a,{b,c,d,e}), reach(b,{c,d,e}), reach(c,{d,e})}
    i := 3;
    Δreach_r1^old := Δreach_r1^new; Δreach_r2^old := Δreach_r2^new; Δreach_r1^new := Δreach_r2^new := {}
    Δreach_r1^new := {} (for r1); reach^old remains the same.
    Δreach_r2^new := {} (for r2); reach^old remains the same.
    No more new facts.

(4) reach := reach^old := {reach(a,{b,c,d,e}), reach(b,{c,d,e}), reach(c,{d,e})}

Example 5.2.4 Consider the family database in Example 3.6.

(1) ancestorsof^old := {}
(2) Δancestorsof_r1^new := {}; Δancestorsof_r1^old := {}; Δancestorsof_r2^new := {}; Δancestorsof_r2^old := {}; ΔtrueAncsof_r3^new := {}; ΔtrueAncsof_r3^old := {}
(3) i := 1;
    Δancestorsof_r1^new := {ancestorsof(jim,<pat>), ancestorsof(pat,<bob>), ancestorsof(bob,<tom>), ancestorsof(bob,<pam>)} (for r1)
    ancestorsof^old := {ancestorsof(jim,{pat}), ancestorsof(pat,{bob}), ancestorsof(bob,{tom,pam})}
    Δtmp_ancestorsof^old := {ancestorsof(jim,<pat>), ancestorsof(pat,<bob>), ancestorsof(bob,<tom>), ancestorsof(bob,<pam>)}
    Δancestorsof_r2^new := {ancestorsof(jim,<bob>), ancestorsof(pat,<tom>), ancestorsof(pat,<pam>)} (for r2)
    ΔtrueAncsof_r3^new := {trueAncsof(jim,<bob>), trueAncsof(pat,<tom>), trueAncsof(pat,<pam>)} (for r3)
    ancestorsof^old := {ancestorsof(jim,{pat,bob}), ancestorsof(pat,{bob,tom,pam}), ancestorsof(bob,{tom,pam})}
    trueAncsof^old := {trueAncsof(jim,{bob}), trueAncsof(pat,{tom,pam})}
    i := 2;
    Δancestorsof_r1^old := Δancestorsof_r1^new; Δancestorsof_r2^old := Δancestorsof_r2^new; Δancestorsof_r1^new := Δancestorsof_r2^new := {}
    Δancestorsof_r1^new := {} (for r1); ancestorsof^old remains the same.
    Δtmp_ancestorsof^old := {ancestorsof(jim,<pat>), ancestorsof(pat,<bob>), ancestorsof(bob,<tom>), ancestorsof(bob,<pam>), ancestorsof(jim,<bob>), ancestorsof(pat,<tom>), ancestorsof(pat,<pam>)}
    Δancestorsof_r2^new := {ancestorsof(jim,<tom>), ancestorsof(jim,<pam>)} (for r2)
    ΔtrueAncsof_r3^new := {trueAncsof(jim,<tom>), trueAncsof(jim,<pam>)} (for r3)
    ancestorsof^old := {ancestorsof(jim,{pat,bob,tom,pam}), ancestorsof(pat,{bob,tom,pam}), ancestorsof(bob,{tom,pam})}
    trueAncsof^old := {trueAncsof(jim,{bob,tom,pam}), trueAncsof(pat,{tom,pam})}
    i := 3;
    Δancestorsof_r1^old := Δancestorsof_r1^new; Δancestorsof_r2^old := Δancestorsof_r2^new; Δancestorsof_r1^new := Δancestorsof_r2^new := {}
    Δancestorsof_r1^new := {} (for r1); ancestorsof^old remains the same.
    Δancestorsof_r2^new := {} (for r2)
    ΔtrueAncsof_r3^new := {} (for r3); ancestorsof^old and trueAncsof^old remain the same.
    No more new facts.
(4) ancestorsof := ancestorsof^old := {ancestorsof(jim,{pat,bob,tom,pam}), ancestorsof(pat,{bob,tom,pam}), ancestorsof(bob,{tom,pam})}
    trueAncsof := trueAncsof^old := {trueAncsof(jim,{bob,tom,pam}), trueAncsof(pat,{tom,pam})}

This example shows that the Extended Semi-Naive Algorithm takes only 3 iterations to reach the fixpoint, while the Basic Semi-Naive Algorithm needs 5. The correctness of these bottom-up algorithms was proven in [18].
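The distinguishing feature of the Extended Semi-Naive loop, namely that facts produced by earlier rules in an iteration are immediately visible to the rules applied later in the same iteration, can be sketched as follows. The rule and relation representations are hypothetical, and grouping, hash indices and disk storage are omitted:

    def extended_semi_naive(rules, facts):
        # rules: ordered list of (head, body_fn) pairs for one stratum, where
        #        body_fn(facts, delta) returns the facts derivable for `head`
        #        when `delta` gives the tuples to treat as new in the rewritten body
        # facts: dict mapping every predicate name to its current set of facts
        heads = [h for h, _ in rules]
        delta_new = {h: set() for h in heads}
        # first iteration: every existing fact counts as new
        delta = {p: set(fs) for p, fs in facts.items()}
        for head, body_fn in rules:
            produced = body_fn(facts, delta) - facts.setdefault(head, set())
            delta_new[head] |= produced
            facts[head] |= produced
            delta.setdefault(head, set()).update(produced)
        while any(delta_new.values()):
            delta_old, delta_new = delta_new, {h: set() for h in heads}
            for head, body_fn in rules:
                # rules already applied in this iteration have refilled delta_new,
                # so their facts are visible at once; later rules contribute only
                # the delta from the previous iteration (the Δtmp construction)
                visible = {h: delta_old[h] | delta_new[h] for h in heads}
                produced = body_fn(facts, visible) - facts[head]
                delta_new[head] |= produced
                facts[head] |= produced
        return facts

With the two reach rules of Example 5.2.3 expressed as body functions, this loop should reproduce the three-iteration behaviour traced above.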

5.2.4 Rule Orderings

Rule orderings [21] are required in order to compute the answers correctly and efficiently. For example, in stratified programs, lower strata must be evaluated first. Rule orderings can also improve efficiency by reducing the number of rule applications. Let q and r be base relations. Consider the following program:

    R0: p0(X) :- q(X)
    R1: p1(X) :- p0(X)
    ...
    Rk: pk(X) :- pk-1(X), r(X)

In an iteration of a Basic Semi-Naive (BSN) evaluation, all the rules of a stratum are applied independently. Rule Rk will be successfully applied for the first time only in the (k+1)th iteration. In an Extended Semi-Naive (ESN) evaluation, however, it would be possible to successfully apply rule Rk in the first iteration itself if the rules are applied in the order shown and the facts produced by each rule application are immediately made available to subsequent rule applications. In the above example, if the computation of a fact using the ESN evaluation takes n iterations, then the computation of the same fact using the BSN evaluation strategy could take up to O(kn) iterations. In the above example, if the rules are applied in the opposite order, i.e., Rk, Rk-1, ..., R0, the number of iterations taken is practically the same as in BSN evaluation, even if facts are made available immediately. Thus a good ordering of rules is important for reducing the rule applications and iterations.

The Relationlog system orders rules before applying an Extended Semi-Naive evaluation. The rule ordering algorithm, based on GSN and PSN in [21], is as follows:

Algorithm 5.2.3 Rule Ordering Algorithm
Input: rules R1, ..., Rn in one stratum to be ordered
Output: QO with ordered rules
Method: First of all, set the ordering numbers of all rules to 1. Then, group the rules R1, ..., Rn into two queues: QO containing the rules whose body predicates all have a lower layer number than the head, QI containing the other rules.

WHILE QI is not empty
    Find the first rule Ri in QI whose body predicates with the same layer number as the head are not in the head of rules remaining in QI
    IF Ri is not null THEN
        a) Assign OrderCount to the ordering number of the rule
        b) Increment OrderCount
        c) Delete this rule Ri from QI
        d) Insert Ri into QO
    END IF
END WHILE
return QO

Still consider the database in Example 5.2.1. There are two strata for the database, listed as follows: stratum 1: R2; stratum 2: R1, R3, R4, R5, R6, R7, R8. Only the second stratum is of concern for ordering rules. Applying the rule ordering algorithm, two queues are generated: QO contains R1, and QI contains R3, R4, R5, R6, R7, R8, with all order numbers set to 1 at first. The algorithm checks QI; first, all of R5's body predicates are either in the head of the rules in QO or have a lower layer number, so R5 is picked. Then, in sequence, R4, R3, R7, R8 and R6 are picked into QO. Finally the ordering of all the rules in stratum 2 is as follows: R1, R5, R4, R3, R7, R8, R6, where R1 is applied first in an iteration and R6 last.
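The ordering pass can be sketched as follows (a rough sketch: the rule representation is hypothetical, and the two-queue bookkeeping and ordering counters of Algorithm 5.2.3 are folded into ordinary list operations):

    def order_rules(rules, layer):
        # rules: list of (head, body_predicates) for one stratum; layer: pred -> stratum
        ordered = [r for r in rules
                   if all(layer[p] < layer[r[0]] for p in r[1])]   # body below stratum
        pending = [r for r in rules if r not in ordered]
        defined = {head for head, _ in ordered}
        progress = True
        while pending and progress:
            progress = False
            for rule in pending:
                head, body = rule
                if all(layer[p] < layer[head] or p in defined for p in body):
                    ordered.append(rule)
                    defined.add(head)
                    pending.remove(rule)
                    progress = True
                    break
        return ordered + pending     # leftovers, if any, keep their original order

    # For stratum 2 of Example 5.2.1, with the rules given in their original order,
    # this produces R1, R5, R4, R3, R7, R8, R6.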

5.3 Top-down Pipelining

Pipelining [8] [22] is essentially top-down evaluation. When a query comes, the first rule in the list associated with the queried predicate is applied. This could involve recursive calls on other rules within the database (possibly the rewritten ones). If there are answers satisfying the rule and the query, the computation sleeps and the first generated answer tuple is returned. A subsequent request for the next answer tuple results in the reactivation of the sleeping computation, and processing continues until the next answer is returned. At any stage, if a rule fails to produce an answer, the next rule in the rule list for the head predicate is tried. When there are no more rules to be tried, the query on the predicate fails. When the topmost query fails, no further answers can be generated, and the pipelined evaluation is terminated.

However, things are rather different in the Relationlog system, for it mainly employs the semi-naive bottom-up evaluation with rule orderings to deal with queries involving non-materialized intensional relations. Queries often involve the join operation over two or more relations (so does the semi-naive evaluation). The join operation is done in a pipelining fashion. Suppose the join operation is over two relations R1 and R2 and one tuple T1 has been located for R1; only the tuples of R2 that join with T1 of R1 are selected for further evaluation. This avoids unnecessary joins on irrelevant tuples of R1 and R2, which is one form of pipelining evaluation in the Relationlog system. Moreover, due to the introduction of complex values in Relationlog, a query may contain partial set terms such as <X> in one literal, where the variable X occurs in the rest of the query. Although there may be many elements in the set, only one element can be activated at a time, which is much like the activation of rules in real top-down systems such as Prolog. After control has returned to the part containing the partial set term, the next element in the set will be tried. This is another form of pipelining evaluation in Relationlog. The pipelining algorithm in Relationlog, based on the work proposed in [8] and [22], is as follows.

Algorithm 5.3.1 Pipelining Algorithm
Function Name: Pipelining-Query
Input: i for the current query expression Li, 0 <= i <= n, of the array L1, ..., Ln
Output: TRUE indicates that there are tuples found; FALSE means there are not
Method: Evaluate all the tuples related to Li by the Relationlog semi-naive algorithms; this subroutine is recursively called.
let Ti be the first tuple satisfying Li (it could be nested)
WHILE Ti is not null
    WHILE there are partial set variables in Li and still some elements left untouched
        IF Pipelining-Query(i + 1) equals TRUE THEN
            put the related attribute values in Ti and send them to the requester
        END IF
    END WHILE
    let Ti be the next tuple for Li
END WHILE

For instance, consider the following query based on the edge-reach database: edge(X, <Z>), reach(Z, <Y>). This query can be answered by first computing the relations representing edge and then reach, and finally computing their join followed by a projection. Only those tuples in reach that join with the tuples of edge are computed, in a pipelined fashion. This avoids the computation of any tuple of reach that does not join with edge; on the other hand, if a tuple in reach joins with many tuples in edge, it is computed many times. Moreover, because the variable Z is in a partial set, there may be several elements in the set. So one element is bound to Z at a time, and only the facts related to that element are used for the join. This is different from other deductive database systems. Suppose there are edge(c, {d, e}) and reach(d, {e, f}) evaluated using rules. The only fact of edge is unified with the first subexpression: c is bound to X, and the first element of the set, d, is bound to Z. Then reach(d, <Y>) substitutes reach(Z, <Y>). Then e is selected for Y first, and the result is sent to the requester. Next, f is bound to Y. No further evaluation can be made now, so control returns to the first expression. The second element of the set of edge is then bound to Z, and the similar evaluation described above continues.
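Using Python generators to play the role of the suspended and reactivated computations, the pipelined join of the example can be sketched as follows (the data structures are illustrative):

    def edge_elements(edge):
        # edge: dict mapping a node to its set of successors, e.g. {"c": {"d", "e"}}
        for x, successors in edge.items():
            for z in successors:          # one element of the partial set at a time
                yield x, z

    def pipelined_query(edge, reach):
        # answers  edge(X, <Z>), reach(Z, <Y>)  one result at a time
        for x, z in edge_elements(edge):
            for y in reach.get(z, ()):    # only reach tuples that join with (x, z)
                yield x, z, y

    # list(pipelined_query({"c": {"d", "e"}}, {"d": {"e", "f"}})) yields
    # ("c", "d", "e") and ("c", "d", "f") in some order; "e" gets no join partner.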

5.4 Summary

In this chapter, three novel query evaluation strategies have been discussed with some illustrative examples. Given a query, the Query Manager of Relationlog automatically determines which strategies will be used and divides the query into corresponding subqueries for the best performance.

The flowchart in Figure 5.2 shows how Relationlog handles a user's query. The Query Manager optimizes the query first by applying Algorithm 4.6.1. It uses matching to promptly respond to a query involving only an extensional relation, a materialized view or a previously evaluated view. If the query involves a non-materialized view, the Query Manager of Relationlog uses the semi-naive bottom-up evaluation with rule orderings to find the results. Furthermore, if the query involves the join operation over several extensional or intensional (view) relations, the top-down pipelining mechanism is used to avoid joins over irrelevant tuples, combined with semi-naive evaluation for the subgoals (literals) related to non-materialized views that have not been evaluated. Algorithm 4.6.2 is used to group the intermediate facts to make the database compact.

Figure 5.2: Query Evaluation Flowchart (the flowchart routes a query to matching, BSN, or ESN evaluation with rule ordering, or to pipelining evaluation, followed by grouping)

Chapter 6

Experimental Results and Related Work

To give an insightful analysis of the system performance, this chapter first gives some essential experimental results based on two experimental databases. Then it compares related systems with Relationlog.

6.1 Experimental Results

This section reports on two sets of experiments. Since there is no applicable large database available currently, the testing databases were created with random values and are available over the Internet at http://www.cs.uregina.ca/~mliu/RLOG. All the tests are performed on a Sun SPARCstation 20 running SunOS 5.5 (Solaris 2.5) with one 125-MHz hyperSPARC processor and 32 MB of memory. The test database is stored on a local disk. Only elapsed times are reported instead of CPU times, because they show how the system would appear to a user, while CPU time measurements leave out context switches, disk I/O, and other similar effects. Applications that can run on relational systems still run on the Relationlog system with similar performance. Relationlog keeps the performance of traditional applications competitive by using conventional relational technology whenever possible. The first set of experiments contains two relations: Dept (for department) and STU (student). Their schema definition is as follows:

STU(SNO: integer, Name: string, DNO: string(5), BirthDate: [Year: integer, Month: integer, Day: integer], Friends: {string}, Address: [Street: string, City: string, Country: string])  key = (SNO, Name)
Dept(DNO: string(5), Name: string(30))

The STU relation has more than 20,000 tuples (if it is flattened, there are more than 100,000 tuples), while the Dept relation has 20. Each tuple of the STU relation represents a student with its identification SNO, Name, department information DNO, BirthDate, Friends, and living place Address. The Dept relation contains all the department information: the department identification DNO and the department name Name. Based on this database, a lot of tests have been carried out. The values for the string-type attributes are randomly created from a pre-arranged table. The tests concentrate on the database operations, including basic operations like projection, selection, and join, and complex operations like nesting and unnesting. Moreover, the update performance is also tested.

For projection alone on the whole data set of the STU relation, the query STU(SNO, NA, DNO, BIR, _, ADD) is selected for processing all the tuples. The variables represent all attribute values except the Friends set. In the combined queries of projection and selection, STU(Pno, NA, DNO) and STU(SNO, Name, DNO), Pno stands for a random value for the person's SNO and Name represents a random value for the person's Name, while the variables SNO, NA, DNO, BIR and ADD stand for the person's SNO, Name, DNO, BirthDate and Address. To measure the performance more precisely, 100 random values are chosen for each of Pno and Name.

The query Dept(Dno, DN), STU(_, NA, Dno) is used for testing the join operation of the two relations on the same attribute DNO. DN corresponds to the attribute Name in the Dept relation schema. Likewise, 100 random values are selected for Dno, for the cases with and without indices on the third attribute DNO of the STU relation. Similarly, a query STU(10000, NA, DNO, BIR, _, (ADD)) is chosen for unnesting. The average time stands for creating one flat tuple. Since there are on average six elements in the Friends set of each tuple, the total time is for unnesting each tuple. As for the updates, one randomly created fact is tested first for insertion and deletion. Then 100 tuples are created for larger tests. Table 6.1 reports all the results in seconds.

Since Relationlog handles indices transparently, there is some difference in the projection and selection table in the average time between the queries STU(Pno, NA, DNO) and STU(SNO, NA, DNO, BIR, _, ADD), though they are all based on similar indices. The difference of 0.003 (0.008 - 0.005) here is due to data swapping between disk and memory. Since it takes time for key comparison, queries based on long keys take longer than the ones on short keys. That is why STU(10000, NA, DNO) responds faster than STU(SNO, "Tom", DNO), since a string comparison takes longer than an integer comparison. The join table shows that after adding indices on the third attribute of the STU relation, Relationlog processes the join much faster than without them. In the updates table, a deletion takes nearly the same time as an insertion, since deletions also involve padding zeros to the deleted nodes.

Projection and selection
query                               average   total
STU(10000, NA, DNO)                 0.004     0.004
STU(SNO, "Tom", DNO)                0.029     0.029
STU(Pno, NA, DNO)                   0.005     0.491
STU(SNO, Name, DNO)                 0.030     2.632
STU(SNO, NA, DNO, BIR, _, ADD)      0.008     1598.760

Join
query                               Non-index                        Index
Dept("CS", DN), STU(_, NA, "CS")    351.745 (200020 tuples)          173.067 (200020 tuples)
Dept(Dno, DN), STU(_, NA, Dno)      35053.132 (20002000 tuples)      17283.7 (20002000 tuples)

Unnesting and nesting
          STU(10000, NA, DNO, BIR, _, (ADD))   STU(Pno, NA, DNO, BIR, _, (ADD))
total     0.28                                 0.26
average   0.067                                0.063

Updates
one insertion   100 insertions           one deletion   100 deletions
0.07            0.06 (5.832 in total)    0.066          0.059 (5.923 in total)

Table 6.1: Test results on STU

The second set of experiments uses a database similar to the edge-reach database introduced in Example 5.2.2. The difference here is that the schema is changed to edge(From: integer, To: {integer}), and similarly for the schema of reach. The test queries are of the form reach(F, <T>) with different values generated for F. The experiments vary the contents of the relation edge and the values of F. The edge relation contains full binary trees of various sizes, i.e., tuples of the form edge(i, <2i>) and edge(i, <2i+1>) for values of i ranging from 1 up to a maximum that depends on the size of the tree. Six data sets use maximums of 255, 511, 1023, 2047, 4095, and 8191, corresponding to tree depths of 8, 9, 10, 11, 12, and 13 respectively. This produces databases with 510, 1022, 2046, 4094, 8190, and 16382 tuples correspondingly in the relational model. Each fact of the edge relation has a B-tree index on its first attribute value.

At first, two versions of the queries are used for the experiments. The first contains only one value (a random number) for F; the results achieved with this version are labeled "small query". The other version contains five hundred values for F randomly chosen from the middle three levels of the full binary tree of the relevant edge relation; the results achieved with this version are labeled "large query". The reason why only trees of even depths for edge are chosen is that the middle three levels move downward one level only when the total depth of the tree increases by two.

The number of values for F makes virtually no difference for the semi-naive program, since Relationlog computes the entire reach relation and matches the result with the value of F. The reach relation, however, can be materialized on disk temporarily, for either the user or the system can control the materialization of the views in Relationlog. At first it takes some time to put all the facts of the reach relation on disk. After the issuance of the first query, all the facts of reach will be evaluated and materialized on disk. This "first query" is the third version of the test queries. For further queries on the reach relation without updates, Relationlog will directly use matching to find the desired results. Table 6.2 reports the results. Rows correspond to the query types first, small and large, while columns correspond to the depths of the edge relation.

size     8        9         10        11         12         13
first    32.536   125.205   426.433   2110.763   7642.767   30137.425
small    0.007    0.009     0.011     0.015      0.018      0.021
large    0.010    0.013     0.017     0.021      0.024      0.029

Table 6.2: Test results on edgeset

The testing results show that the evaluation time of the first calculation at depth i+1 is nearly 3 times more than that at depth i, due to the data enlargement. If there are no updates, the average time for the large queries does not increase much. Since there is an index for each tuple of the edge-reach database, as in Example 5.2.2, the small time increment is caused by the depth of the B-tree indices. The experimental results show the satisfactory performance of Relationlog both on traditional databases and on deductive databases. They also show that Relationlog performs nicely on large applications that do not change very often and still gives reasonable response times in any case.
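For reference, the full-binary-tree edge sets described above can be reproduced with a small sketch (it returns plain (parent, child) pairs rather than the stored nested facts):

    def binary_tree_edges(max_node):
        # edge(i, <2i>) and edge(i, <2i+1>) for i = 1 .. max_node
        return [(i, child) for i in range(1, max_node + 1)
                           for child in (2 * i, 2 * i + 1)]

    # len(binary_tree_edges(255)) == 510, the size of the smallest data set above.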

6.2 Comparison with Other Systems

Chapter 2 introduces some relevant deductive database systems: LDL, CORAL, Aditi and Glue-Nail. This section compares them with Relationlog in order to give an insightful analysis pertaining to the Relationlog system. Table 6.3 compares some of the important features of these systems and Relationlog. The following issues are considered in the comparison:

1. Recursion. All of the systems above allow the rules to use general recursion, which includes linear and nonlinear recursion.

2. Negation. All systems allow negated subgoals in rules. When rules involve negation, there may be many minimal fixpoints that could be interpreted as the meaning of the rules, and the system has to select from among these possibilities one model that is regarded as the intended model, against which queries will be answered.

3. Nested Sets and Nested Tuples. Only Relationlog directly supports nested sets and tuples. LDL and CORAL directly support nested sets and indirectly support tuples. Both LDL and CORAL adopt Prolog set treatments. Relationlog stores tuple and set values most efficiently and supports inference over deeply nested relations.

4. Typed. Only Relationlog is well-typed both for relation schemas and for facts. LDL and Aditi support schema notions for relations, but fail to support well-typed facts. CORAL and Glue-Nail are both like Prolog/Datalog without schema notation.

5. Updates. Logical rules do not, in principle, involve updating of the database. However, most systems have some approach to specifying updates, either through special dictions in the rules or update queries outside the rule system. A "Yes" in the table identifies a system that supports updates in logical rules or similar queries.

                LDL          CORAL                   Aditi          Glue-Nail     Relationlog
Developer       MCC          U. Wisconsin            U. Melbourne   Stanford      U. Regina
Recursion       General      General                 General        General       General
Negation        Stratified   Modularly Stratified    No             Stratified    Stratified
Nested Sets     Yes          Yes                     No             No            Yes
Nested Tuples   Indirect     Indirect                No             No            Yes
Typed           Partial      No                      Partial        No            Yes
Updates         Yes          Yes                     No             Glue only     Yes
Evaluation      Bottom-up,   Bottom-up, Magic-Set,   Bottom-up,     Bottom-up,    Bottom-up,
                Magic-Set    Top-down                Magic-Set      Magic-Set     Top-down Pipeline
Storage         EDB,         EDB,                    EDB, IDB,      EDB,          EDB, IDB,
                Memory       Memory                  Disk           Memory        Disk
Interface       Yes          Yes                     Yes            -             No

Table 6.3: Summary of Deductive Database Systems

6. Evaluation. Deductive database systems need to perform evaluation. Top-down and bottom-up mechanisms are the most commonly used. All the other systems except Relationlog support Magic-Set, which is an efficient optimization strategy applicable to flat relations.

7. Storage. All systems allow extensional relations (EDB) to be stored on disk, but some also store intensional relations (IDB) in secondary storage, which is the key for large applications. Only Aditi and Relationlog are disk-based and support IDB storage, which makes them suitable for large applications.

8. Interface. LDL, CORAL and Aditi connect to other languages or systems such as C/C++ and Prolog. These connections embed calls to the deductive database systems in another language, or allow other languages or systems to be invoked from the deductive database systems. Currently Relationlog does not support such kinds of connections.

A few existing deductive database systems have been compared with the Relationlog system. Relationlog was inspired by LDL [8], and it adopts the applicable techniques of LDL [8], CORAL [22], and Aditi [27]. It is the only typed system that can store both base relations and temporary relations on disk and support complex-value relations. Moreover, it is unique in being able to combine set and tuple terms with deep inference. It is also unique in its query processing strategies for deeply nested relations.

Chapter 7

Conclusion and Future Research

This thesis has described the Relationlog persistent deductive database system, which supports effective storage, efficient access and inference of large amounts of data with complex structures. The Relationlog system provides a declarative query language that is based on Relationlog [16, 18] and a declarative data manipulation language that is based on DatalogU [17]. It is a typed, value-oriented deductive database system that supports user-defined data types. Moreover, the extended relational algebra operations, as defined in [1, 9, 23], can be represented in Relationlog directly, and more importantly, recursively in a way similar to Datalog.

The current implementation of the Relationlog system is on top of PARODY [25] in C++ under UNIX and Windows. It performs reasonably well for small and medium databases and relatively slowly for large ones with too many updates, due to the limitations of PARODY itself. The experimental results show the satisfactory performance of Relationlog both on traditional databases and on deductive databases. As a deductive system, Relationlog programs must be well-typed and stratified in order to be evaluated bottom-up.

The Relationlog system is organized into 3 layers: the user interface, the Data Management and Query Subsystem, and the Storage and Update Subsystem. The user interface is the primary communication medium between the users and the Relationlog system. It passes definitions, queries, and updates to the Data Management and Query Subsystem. The Data Management and Query Subsystem then processes the requests and directs the Storage and Update Subsystem to handle the smaller tasks respectively. The Storage and Update Subsystem handles these tasks and deals with loading, dropping, and updating domains, schemas, tuples, indices and rules between main memory and disk files.

7.1 Contribution

The Relationlog system makes the following contributions to the deductive database community.

1. The Relationlog system is the first typed, disk-resident deductive database system supporting inference over complex values, while adopting the applicable techniques of LDL [8], CORAL [22], and Aditi [27].

2. Relationlog can be used for different purposes, such as engineering databases, CAD databases, text retrieval, image processing, etc., where nested or complex-value relations are needed. It can also be used as a teaching tool for advanced database courses on nested relational databases, complex-value databases, deductive databases, and deductive complex-value databases.

3. The Relationlog system can store large base relations and temporary relations on disk and supports complex updates.

4. Relationlog employs novel evaluation strategies: disk-based matching, extended semi-naive evaluation with rule ordering, and pipelining techniques. The matching evaluation is based on the results of earlier queries, which could be in memory or temporarily on disk. The extended semi-naive evaluation with rule ordering is inspired by GSN and PSN proposed in [21] and is suitable for inference over nested relations with the grouping operation. The pipelining technique is suitable for the recursive activation of the elements of a set in the case of a join.

The language was proposed by Dr. Mengchi Liu in [18]. The design and implementation were done by the author.

7.2 Future Research

There are several open issues which still need to be addressed. From the language point of view, Relationlog has the following limitations. First, it supports incomplete sets by using partial set terms, but not incomplete tuples (i.e., tuples with null values). Therefore, it is not clear what the meaning of recursive rules is when both incomplete sets and tuples occur. Next, the complete set terms in Relationlog cannot contain partial set terms, so it is impossible to have a set of which we know some of the elements but for which we do not have complete knowledge of some of the other elements.

On the other hand, from the database system point of view, the Relationlog system still needs more work to overcome some problems. The first problem is how to benchmark the Relationlog system. Since there is no applicable large database available currently, and there is no comparable related system, how to precisely measure the system performance of Relationlog would be another interesting issue in the near future. In order to benchmark the system, a new host language needs to be developed. A C++ interface for Relationlog would be optimal, so that C++ functions can be called in Relationlog or Relationlog can be embedded into C++ in the future.

Secondly, since the Relationlog system is built on top of PARODY, the limitations of PARODY are inherited by Relationlog. Because PARODY employs nodes for the tuples, it reads a node from disk each time a tuple is needed, instead of loading in bulk. This causes too many I/O operations. The problem can be fixed by building a large memory buffer for controlling the loading and dropping of groups of nodes. When Relationlog asks for a node, the new buffer system would provide it from buffers with preloaded nodes, in anticipation, instead of reading directly from disk. Using better swapping mechanisms for such a buffer system would dramatically improve the system performance.

Finally, more efficient evaluation strategies need to be developed. Even though Relationlog can materialize intensional relations, it takes too much space to store the intensional facts on disk and too much time to process the evaluation. In some extreme cases, it takes unaffordable time and space to evaluate all the intensional facts using semi-naive techniques, due to the potentially huge volume of intensional facts, which may even exceed the disk capacity. Therefore, more efficient evaluation strategies like magic sets should be extended for Relationlog in order to evaluate applications with huge numbers of intensional facts over nested relations.

Bibliography

[1] S. Abiteboul and N. Bidoit. Non first normal form relations: An algebra allowing data restructuring. Journal of Computer and System Sciences, 33(3):361-393, 1986.

[2] S. Abiteboul, P.C. Fischer, and H.J. Schek, editors. Nested Relations and Complex Objects in Databases. Lecture Notes in Computer Science, vol. 361, Springer-Verlag, 1989.

[3] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.

[4] K.R. Apt, H.A. Blair, and A. Walker. Towards a theory of declarative knowledge. In J. Minker, editor, Foundations of Deductive Databases and Logic Programming, pages 89-148. Morgan Kaufmann Publishers, 1988.

[5] M. Carey, D. DeWitt, and S. Vandenberg. A data model and query language for EXODUS. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 413-423, Chicago, Illinois, 1988.

[6] S. Ceri, G. Gottlob, and L. Tanca. Logic Programming and Databases. Springer-Verlag, 1990.

[7] Q. Chen and W. Chu. Hilog: A high-order logic programming language for non-1NF deductive databases. In W. Kim, J.M. Nicolas, and S. Nishio, editors, Proceedings of the International Conference on Deductive and Object-Oriented Databases, pages 431-452, Kyoto, Japan, 1989. North-Holland.

[8] D. Chimenti, R. Gamboa, R. Krishnamurthy, S. Naqvi, S. Tsur, and C. Zaniolo. The LDL system prototype. IEEE Transactions on Knowledge and Data Engineering, 2(1):76-90, 1990.

[9] L. Colby. A recursive algebra and query optimization for nested relations. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 124-138, Portland, Oregon, 1989.

[10] M. Derr, S. Morishita, and G. Phipps. Design and implementation of the Glue-Nail database system. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 147-167, Washington, D.C., 1993.

[11] Marcia A. Derr, Shinichi Morishita, and Geoffrey Phipps. The Glue-Nail deductive database system: Design, implementation, and evaluation. VLDB Journal, 3(2):123-160, 1994.

[12] R. Elmasri and S. B. Navathe. Fundamentals of Database Systems. Benjamin/Cummings, 1994.

[13] B. Freitag, H. Schutz, and G. Specht. LOLA - a logic language for deductive databases and its implementation. In Proceedings of the Second International Symposium on Database Systems for Advanced Applications, pages 216-225, Tokyo, Japan, 1991.

[14] G. Jaeschke and H.J. Schek. Remarks on the algebra of non first normal form relations. In Proceedings of the ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, pages 124-138, 1982.

[15] H.F. Korth and A. Silberschatz. Database System Concepts (second edition). McGraw-Hill, 1991.

[16] M. Liu. Relationlog: A typed extension to datalog with sets and tuples (extended abstract). In Proceedings of the International Logic Programming Symposium (ILPS '95), pages 83-97, Portland, Oregon, U.S.A., December 4-7, 1995. MIT Press.

[17] M. Liu. Extending datalog with declarative updates. Submitted for Publication, 1998.

[18] M. Liu. Relationlog: A typed extension to datalog with sets and tuples. Journal of Logic Programming, 36(3):271-299, 1998.

[19] M. Liu and R. Shan. The design and implementation of the Relationlog deductive database system. In Proceedings of the 9th International Workshop on Database and Expert Systems Applications (DEXA Workshop '98), Vienna, Austria, August 24-28, 1998. IEEE CS Press.

[20] A. Makinouchi. A consideration of normal form of not-necessarily-normal form relations in the relational data model. In Proceedings of the 5th International Conference on Very Large Data Bases, pages 447-453, Los Angeles, 1977.

[21] R. Ramakrishnan, D. Srivastava, and S. Sudarshan. Rule ordering in bottom-up fixpoint evaluation of logic programs. In Proceedings of the International Conference on Very Large Data Bases, pages 359-371, Brisbane, Queensland, Australia, 1990. Morgan Kaufmann Publishers, Inc.

[22] Raghu Ramakrishnan, Divesh Srivastava, S. Sudarshan, and Praveen Seshadri. The CORAL deductive system. VLDB Journal, 3(2):161-210, 1994.

[23] M. A. Roth, H. F. Korth, and A. Silberschatz. Extended algebra and calculus for nested relational databases. ACM TODS, 13(4):389-417, 1988.

[24] R. Shan and M. Liu. Introduction to the Relationlog system. In Proceedings of the 6th Intl. Workshop on Deductive Databases and Logic Programming (DDLP '98), Manchester, UK, June 20, 1998.

[25] Al Stevens. C++ Database Development. MIS Press, 2nd edition, 1994.

[26] J.D. Ullman. Principles of Database and Knowledge-Base Systems, volume 2. Computer Science Press, 1989.

[27] J. Vaghani, K. Ramamohanarao, D.B. Kemp, Z. Somogyi, P.J. Stuckey, T.S. Leask, and J. Harland. The Aditi deductive database system. VLDB Journal, 3(2):245-288, 1994.

[28] C.N. Zhang. CS400 Class Notes. Department of Computer Science, University of Regina, 1994.

Appendix A

Syntax of Relationlog

database           ::= { domain } { schema } { facts } { rules }

schema             ::= schema relational_schema { relational_schema }
domain             ::= domain types domain_type { domain_type }
relational_schema  ::= predicate_symbol ( name: type { , name: type } )
domain_type        ::= name: type
name               ::= string
type               ::= atomic_type | set_type | tuple_type
atomic_type        ::= integer(n) | string(n) | real(n, n)
set_type           ::= { type } { (min, max) }
tuple_type         ::= [ type { , type } ]
predicate_symbol   ::= token
facts              ::= facts { fact }
fact               ::= predicate_symbol ( object { , object } )
object             ::= individual | partial_set | complete_set | tuple
individual         ::= integer_number | real_number | character_string
partial_set        ::= < object { , object } >
complete_set       ::= { non_partial_object { , non_partial_object } }
tuple              ::= [ object { , object } ]
non_partial_object ::= individual | complete_set | tuple
rules              ::= rules { rule }
rule               ::= rule_head :- rule_body
rule_head          ::= atom
rule_body          ::= literal { , literal }
literal            ::= atom | not atom
atom               ::= predicate_symbol ( term { , term } )
term               ::= variable | individual | partial_set_term | complete_set_term | tuple_term
variable           ::= capital_letter { letter }
partial_set_term   ::= < term { , term } >
complete_set_term  ::= { non_partial_term { , non_partial_term } }
tuple_term         ::= [ term { , term } ]
non_partial_term   ::= atomic_term | complete_set_term | [ non_partial_term { , non_partial_term } ]

query              ::= ?- literal { , literal }
