
Functional Dependencies in Horn Clause Queries

ALBERTO O. MENDELZON and PETER T. WOOD
Computer Systems Research Institute, University of Toronto

When a database query is expressed as a set of Horn clauses whose execution proceeds by top-down resolution of goals, there is a need to improve the backtracking behavior of the interpreter. Rather than putting on the programmer the onus of using extra-logical operators such as cut to improve performance, we show that some uses of the cut can be automated by inferring them from functional dependencies. This requires some knowledge of which variables are guaranteed to be bound at query execution time; we give a method for deriving such information using data flow analysis.

Categories and Subject Descriptors: H.2.4 [Database Management]: Systems—query processing; I.2.3 [Artificial Intelligence]: Deduction and Theorem Proving—deduction

General Terms: Algorithms, Languages

Additional Key Words and Phrases: Logic programming, data flow analysis, functional dependency, relational database

1. INTRODUCTION

Horn clause programs are an attractive formalism for database access. The recent textbook by Ullman [18] surveys work in this area. Some of this work deals with the integration of existing or future database systems with a Horn clause component; another approach, and the one we pursue here, is simply to take a widely available system, such as PROLOG [5], and try to use it for database access without any further modification. It is clear that there are many limitations in such an approach, but there are also advantages. For applications where the databases are small, for quick prototyping of larger applications, and for interactive queries where data access is localized and the database can be held in virtual memory, the use of a widely available language such as PROLOG makes for systems that are easy to build and maintain and easy to port.

The results presented in this paper extend those of A. O. Mendelzon, "Functional Dependencies in Logic Programs," in Eleventh International Conference on Very Large Data Bases (Stockholm, Sweden, Aug. 21-23, 1985), IEEE, New York, 1985, pp. 324-330. Authors' current address: Computer Systems Research Institute, University of Toronto, Toronto, Canada M5S 1A4. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1991 ACM 0362-5915/91/0300-0031 $01.50
ACM Transactions on Database Systems, Vol. 16, No. 1, March 1991, Pages 31-55.

However, as Warren reported [19], the rigid backtracking behavior of PROLOG is inadequate even for relatively undemanding database applications. The logic programming community is well aware of this problem and several proposals for selective backtracking exist [3, 8, 12]. In practice, PROLOG systems provide a control primitive called cut that gives the programmer some say in how backtracking is handled. But the use of the cut or other extra-logical control operators makes programs less declarative and more procedural, partially defeating the goals of logic programming. In this paper we propose to use declarative information provided by the programmer in the form of functional dependencies to automate the insertion of cuts. While this does not cover all uses of the cut, we think it is a step in the right direction.
Our approach differs from the work on selective backtracking cited above in that it requires no changes to the PROLOG interpreter and can be implemented in a preprocessor. For example, suppose we have a database with the following predicates:

capital(A, C): the capital of country A is city C
hq(X, C): company X is headquartered in city C
sales(X, S): company X makes S million dollars a year in sales

We are interested in those countries whose capital cities are headquarters for some company that has sales of more than 100 million a year:

big(C) ← capital(C, X), hq(Y, X), sales(Y, S), S > 100.

Now suppose we try to evaluate the query

big(usa)

with the usual PROLOG backtracking mechanism. First, the capital predicate will be evaluated, binding X to "Washington". Suppose there is only one company, "D.C. Power Corp.", with headquarters in Washington, and its sales are only 1 million. When S > 100 fails, the PROLOG interpreter will backtrack to try to find another sales figure for D.C. Power; when in turn this fails, it will backtrack again and, failing to find another company with headquarters in Washington, it will try to find another capital of the U.S.A. Many versions of PROLOG provide a control operator called cut, denoted "!", that lets the programmer avoid these unnecessary attempts to resatisfy predicates that cannot be resatisfied. For example, we could write

big(C) ← α(C, X), hq(Y, X), sales(Y, S), S > 100.
α(C, X) ← capital(C, X), !.

The effect of the cut in the α clause is to prevent the interpreter from trying to resatisfy α if any goal to the right of it in the big clause fails. We can apply the same transformation to the sales predicate and obtain

big(C) ← α(C, X), hq(Y, X), β(Y, S), S > 100.
α(C, X) ← capital(C, X), !.
β(Y, S) ← sales(Y, S), !.
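The effect of these cuts can be simulated in a short Python sketch (the facts, numbers, and function names are ours, not the paper's): each goal's solutions come from an enumeration, and the α/β cuts commit to the first solution of a functionally determined goal.

```python
# Toy model of big(C) <- alpha(C, X), hq(Y, X), beta(Y, S), S > 100,
# where alpha and beta commit to the first (and, by the functional
# dependencies, only) solution of capital and sales.
capital = {"usa": "washington"}      # capital(C, X): C -> X
hq = [("dc_power", "washington")]    # hq(Y, X)
sales = {"dc_power": 1}              # sales(Y, S): Y -> S

def big(country):
    # alpha(C, X): at most one X per C, so no retry on backtracking
    for x in ([capital[country]] if country in capital else [])[:1]:
        for (y, city) in hq:         # hq may still be resatisfied
            if city != x:
                continue
            # beta(Y, S): at most one S per Y
            s = sales.get(y)
            if s is not None and s > 100:
                return True
    return False

# D.C. Power's sales (1 million) fail the test, and neither capital
# nor sales is ever re-queried for an alternative binding.
```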

However, there is still one serious source of inefficiency in the backtracking behavior of this program. Suppose there are 100 companies with sales over 100 million headquartered in Paris; then the query

big(france), big(usa)

will resatisfy big(france) 100 times before failing. Again, the cut is helpful in avoiding this as follows:

big(C) ← α(C, X), γ(Y, X, S).
γ(Y, X, S) ← hq(Y, X), β(Y, S), S > 100, !.

with α and β defined as before. The cut at the end of the γ clause means that, once a country has been proven big, the interpreter will not try to reprove this in a different way. What started as a clean logical definition of big countries has become, through the introduction of cuts, a complex program that can only be understood procedurally. In this paper we propose to automate this process by using functional dependency statements to infer where cuts can be inserted. For example, the introduction of α and β is allowed by the fact that in capital(C, X), C functionally determines X, and in sales(Y, S), Y functionally determines S. We would like the program to be written as follows:

capital(C, X): C → X
sales(Y, S): Y → S
big(C) ← capital(C, X), hq(Y, X), sales(Y, S), S > 100.

and let the system produce the improved version. The method we present handles the introduction of α and β, but not of γ; this is actually the problem of decomposition into independent subqueries that Warren solved in [19]. Consider the above clause for big along with the query

big(france), big(usa)

mentioned earlier. After X has been instantiated to Paris, the arguments of the remaining predicates in the clause share no uninstantiated variables with either the head of the clause or the capital predicate. Warren's query planner detects this, while an extended PROLOG interpreter ensures that the remaining predicates are processed independently from the capital predicate and that no more than one solution is found for them. This has the same effect as the introduction of γ above. There is one problem with our approach, and it is well known to logic programmers familiar with the cut. Note that, even though the functional dependency hq(Y, X): Y → X holds, we have not used it. The reason is that hq(Y, X) is invoked with the Y variable free and the X variable bound; if sales(Y, S) fails, it is in fact necessary to backtrack and find another binding for Y; that is, another company with headquarters in city X. In fact, the introduction of α and β preserves the meaning of the program only if we are using big as a test, rather than as a generator. For example, if the query is

big(W)

where W is a variable, and it happens that the first country that appears in the capital predicate is "usa", then the answer will be empty. Thus the application of our method requires some knowledge of which variables are guaranteed to be bound on invocation of each predicate. We use data flow analysis techniques to obtain this knowledge statically. In the next section, we formalize the above concepts. In Section 3, we present the cut-insertion process and show its correctness. In Section 4, we give a method for computing the information on binding of variables that is required by the cut-inserter. This method is extended in Section 5 to allow additional binding information to be computed. Finally, we discuss related work in Section 6.
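The test-versus-generator failure mode just described can be made concrete in a small Python sketch (facts and names are ours): committing to capital's first solution is harmless when C is bound, but loses answers when big is used as a generator.

```python
# Assumed facts: the first country listed has no qualifying company.
capital = [("usa", "washington"), ("france", "paris")]
hq = [("co1", "paris")]
sales = {"co1": 500}

def big_countries(cut_capital=False):
    """Enumerate countries C satisfying big(C); cut_capital=True mimics
    (unsoundly) committing to capital's first solution."""
    out = []
    for (c, x) in (capital[:1] if cut_capital else capital):
        for (y, city) in hq:
            if city == x and sales[y] > 100:
                out.append(c)
    return out

# Used as a generator, the premature cut makes the answer set shrink:
# without it we find france; with it we stop after usa and find nothing.
```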

2. DEFINITIONS

2.1 Logic Programming

A program P is a sequence of Horn clauses C1, . . ., Cn. Each Ci is of the form

Ci: A(x1, . . ., xn) ← B1(y11, . . ., y1k1), . . ., Bm(ym1, . . ., ymkm).

where m ≥ 0, ki ≥ 0 for each 1 ≤ i ≤ m, and each xi, 1 ≤ i ≤ n, and each yij, 1 ≤ i ≤ m, 1 ≤ j ≤ ki, is either a variable or a constant. We assume that every variable that appears in the head of a clause also appears in its body. A query φ is a statement of the form

φ: G(z1, . . ., zp)

where each zi is either a variable or a constant. The meaning of P on query φ is

P(φ) = {(c1, . . ., cp) | ci is a constant, 1 ≤ i ≤ p, and P ⊨ G(c1, . . ., cp)};

that is, G(c1, . . ., cp) is a logical consequence of P.

2.2 PROLOG

The PROLOG programming language [5] provides a practical implementation of logic programming. A PROLOG interpreter attempts to compute the meaning of program P on query φ by applying the following procedure. Construct a tree (known as an SLD-tree) whose nodes are sequences of atomic formulae, called goals. The root of the tree is φ. Suppose a node contains the goal (L1, . . ., Ln), and there is a clause

Cj: A ← B1, . . ., Bk

in P such that A "matches" L1, or, more formally, A can be unified with L1 using unifier θ. Then the node has a child labeled with

(θ(B1), . . ., θ(Bk), θ(L2), . . ., θ(Ln)),

that is, where the substitutions given by θ are applied to each literal in the new goal (see [9] for more details). The children of a node are ordered according to the value of j in Cj. There are three kinds of branches in the tree: success branches are those that end in a leaf labeled with the empty goal; failure branches end in a leaf whose leftmost literal cannot be unified with the left-hand side of any clause; and infinite branches do not terminate. The PROLOG interpreter searches this tree in depth-first order; whenever a leaf at the end of a success branch is reached, the composition of all the substitutions θ applied on the path from the root to this leaf is applied to the original query and the result is output. Because we want all solutions to a query, the interpreter backtracks through the tree to search for additional solutions. We denote the result of this computation by

Pr(P, φ).

It can be shown that the above procedure is sound; that is, Pr(P, φ) ⊆ P(φ) for function-free programs [9]. However, it is not complete; that is, there are cases where the search procedure does not terminate and fails to find successful leaves that exist in the tree.
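The depth-first procedure of this section can be sketched as a miniature interpreter for function-free Horn clauses (our own simplification, for illustration only): leftmost-goal selection, textual clause order, and backtracking that yields one substitution per success branch.

```python
import itertools

def is_var(t):
    # Variables are strings starting with an uppercase letter.
    return isinstance(t, str) and t[:1].isupper()

def walk(t, s):
    while is_var(t) and t in s:
        t = s[t]
    return t

def unify(a, b, s):
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return {**s, a: b}
    if is_var(b):
        return {**s, b: a}
    return None

def unify_atoms(g, h, s):
    if g[0] != h[0] or len(g[1]) != len(h[1]):
        return None
    for x, y in zip(g[1], h[1]):
        s = unify(x, y, s)
        if s is None:
            return None
    return s

fresh = itertools.count()

def rename(clause):
    # Rename clause variables apart before each use.
    n = next(fresh)
    r = lambda t: f"{t}_{n}" if is_var(t) else t
    (p, args), body = clause
    return (p, tuple(map(r, args))), [(q, tuple(map(r, a))) for q, a in body]

def solve(goals, program, s):
    # Depth-first, leftmost-goal SLD resolution.
    if not goals:
        yield s
        return
    g, rest = goals[0], goals[1:]
    for clause in program:
        head, body = rename(clause)
        s2 = unify_atoms(g, head, s)
        if s2 is not None:
            yield from solve(body + rest, program, s2)

# Transitive closure (as in Example 1 below) over two edge facts:
prog = [(("e", ("a", "b")), []),
        (("e", ("b", "c")), []),
        (("t", ("X", "Y")), [("e", ("X", "Y"))]),
        (("t", ("X", "Y")), [("e", ("X", "Z")), ("t", ("Z", "Y"))])]
answers = sorted({walk("Y", s) for s in solve([("t", ("a", "Y"))], prog, {})})
```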

2.3 The Cut Operator

The cut operator, denoted by "!", may appear anywhere on the right-hand side of a clause. Cut is treated as the constant true, but it also has the following side-effect on the depth-first search procedure. Let C be a clause whose body contains a cut symbol, and p be the parent goal in the SLD-tree whose child goal c resulted from the unification of the leftmost literal in p with the head of clause C. Let d be the leftmost descendant of c whose initial subgoal is the cut. If the PROLOG interpreter ever backtracks to node d, the whole subtree rooted at p is considered visited by the search procedure and it is not searched further. Note that, if the part of the tree that is not searched does not contain a success branch, the cut does not make the interpreter miss any answers that would be found without the cut. However, if the omitted part of the tree contained some infinite branches, the presence of the cut might cause some additional sound answers to be found, because the interpreter will avoid those infinite paths.

2.4 Functional Dependencies

A functional dependency is a statement of the form

A(x1, . . ., xn): xi1, . . ., xip → xj1, . . ., xjq

where, for 1 ≤ k ≤ p, 1 ≤ ik ≤ n, and, for 1 ≤ l ≤ q, 1 ≤ jl ≤ n. The dependency states that any two A-atoms implied by the program that agree on the arguments in positions i1, . . ., ip also agree on the arguments in positions j1, . . ., jq.

In all that follows we assume our programs and databases satisfy the given set of functional dependencies; questions of how to enforce this constraint are outside the scope of this paper.

2.5 Binding Rules

A binding rule b for program P is a statement of the form

A: i1, . . ., ik

where A is an n-ary predicate that appears in some clause of P, and k ≤ n. We say that P runs under b if, in every execution of P, predicate A is invoked with the i1th, . . ., ikth arguments bound to constants. An input rule r for program P is a statement of the same form as a binding rule. We say that P runs under input rule r if, whenever P is run with goal A(x1, . . ., xn), all the arguments xi1, . . ., xik are constants. An input rule is a constraint provided by the programmer; it gives the system some information on how certain predicates will be used as goals. Given some input rules, it is possible to infer binding rules from them. Consider the following example.

Example 1. Let P be the program below, which computes the transitive closure T of a binary relation E:

C1: T(X, Y) ← E(X, Y).
C2: T(X, Y) ← E(X, Z), T(Z, Y).

Suppose we are given the input rule T: 1; that is, we are told that this program will be used to compute all nodes reachable from a given fixed node. Since Z will always be bound to some constant after executing E(X, Z), we know the invocation of T in C2 will also have its first argument bound. It follows that P will run under binding rule T: 1. This in turn implies that it will run under binding rule E: 1, because E always inherits its first argument from T. On the other hand, consider program Q:

C1: T(X, Y) ← E(X, Y).
C2: T(X, Y) ← T(Z, Y), E(X, Z).

Note that this is logically equivalent to program P. From the same input rule, T: 1, no binding rules can be inferred. In particular, Q will not run under binding rule T: 1, because the invocation of T in C2 will be with a free variable as first argument. Given a set of input rules I for program P, we say I implies binding rule b if, whenever P is run under every input rule in I, it necessarily runs under binding rule b. We will defer the problem of computing an approximation to the binding rules implied by a given set of input rules until Section 4.

2.6 Graph Terminology

Since we model the binding information contained in a program by means of a directed graph, we now introduce the necessary terminology. We assume that the reader is familiar with the basic definitions of graph theory, as found (for example) in [2]. Let G = (V, E) be a directed graph; that is, V is a set of nodes, while E is a set of ordered pairs (u, v) ∈ V × V called directed edges. If (u, v) ∈ E, then v is called a successor of u in G. A node v ∈ V is isolated if there is no u ∈ V such that either (u, v) ∈ E or (v, u) ∈ E. A sequence p = (z0, z1), (z1, z2), . . ., (zn−1, zn), n ≥ 1, where each zi ∈ V, 0 ≤ i ≤ n, is a path of length n from z0 to zn in G. If z0 = zn, then p is called a cycle. A cycle of length 1 is called a self-loop. A (strongly connected) component K of G is a maximal subset of V such that, for every pair of nodes u and v in K, there is a path from u to v and a path from v to u. The (acyclic) condensation of G, denoted Ḡ, is obtained from G as follows. Let K1, K2, . . ., Km be the components of G. The condensation Ḡ of G is a directed graph with m nodes k1, k2, . . ., km such that there is a directed edge from ki to kj in Ḡ if and only if in G there is a directed edge from some node in Ki to some node in Kj.
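These definitions can be illustrated with a small Python sketch (our own helper, adequate for the small graphs that arise in this paper): components are computed as classes of mutually reachable nodes, and the condensation keeps only edges between distinct components.

```python
def reachability(V, E):
    # r[u] = set of nodes reachable from u (including u itself).
    r = {v: {v} for v in V}
    changed = True
    while changed:
        changed = False
        for (u, v) in E:
            new = r[v] - r[u]
            if new:
                r[u] |= new
                changed = True
    return r

def condensation(V, E):
    r = reachability(V, E)
    # u and v share a component iff each reaches the other.
    comp = {v: frozenset(w for w in V if v in r[w] and w in r[v]) for v in V}
    nodes = set(comp.values())
    edges = {(comp[u], comp[v]) for (u, v) in E if comp[u] != comp[v]}
    return nodes, edges

# a <-> b form a cycle, so they collapse to one node of the condensation;
# c and d are their own (singleton) components.
V = {"a", "b", "c", "d"}
E = {("a", "b"), ("b", "a"), ("b", "c"), ("c", "d")}
nodes, edges = condensation(V, E)
```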

3. INSERTING CUTS

Our method inserts a cut after a literal if the functional dependencies imply that the bound variables for that literal functionally determine every variable that appears both in the literal concerned and either to its right in the clause or in the head of the clause. The idea is that backtracking to this literal again cannot make any difference in what happens to the right of it, since all free variables that are "live" must take exactly the same values again. We can show that the method is correct in the sense that the modified program computes at least all the tuples that the original one computed for the same query.¹ We use the notation

B!(ȳ)

within a clause body as shorthand for the pair of clauses obtained by replacing B!(ȳ) with α(ȳ) and adding the clause

α(ȳ) ← B(ȳ), !.

where α is a predicate appearing nowhere else in the program.

¹ In some cases, the modified program may actually compute more tuples than the original one, while remaining sound.


Given program P, input rules I, and set of functional dependencies F, let Cj be a clause

Cj: A(x̄) ← B1(ȳ1), . . ., Bi−1(ȳi−1), Bi(ȳi), Bi+1(ȳi+1), . . ., Bm(ȳm).

Let {z1, . . ., zp} be the set of variables that appear in ȳi and also appear in some ȳk for k > i or in x̄; that is, the variables that appear both as arguments of Bi and either to the right of it or as arguments of the left-hand side of the clause. Suppose that the input rules I imply the binding rule

Bi: j1, . . ., jq

and the functional dependencies F imply the functional dependency

Bi(ȳi): yij1, . . ., yijq → z1, . . ., zp.

Then replace Bi in clause Cj by B!i. In other words, replace Bi by some new predicate α and introduce a new clause Ck as follows:

Cj: A(x̄) ← B1(ȳ1), . . ., Bi−1(ȳi−1), α(ȳi), Bi+1(ȳi+1), . . ., Bm(ȳm).

Ck: α(ȳi) ← Bi(ȳi), !.

The resulting program will be denoted by P!. Let us now describe the construction of the SLD-tree for query φ applied to P! in which clause Cj is invoked, and the inserted cut takes effect. The tree is depicted in Figure 1. Assume that the initial goal φ is G(v̄). If Cj is invoked, then at some level in the tree there will be a goal G1 with initial subgoal A(ū), where ū unifies with x̄. At this point, there may be a number of other subgoals D1, . . ., Ds comprising G1. Then G1 resolves with the clause Cj, giving rise to goal G2. Assuming each of the subgoals B1, . . ., Bi−1 succeeds, the predicate α will eventually head the list of subgoals. Let this goal be G4, and its parent in the tree be G3. The cut insertion procedure ensures that the subgoal α will always unify with the head of exactly one clause, namely Ck, yielding goal G5. Assuming that Bi eventually succeeds, goal G6 will appear along the leftmost branch rooted at goal G5. Because the cut succeeds immediately, there is only one branch emanating from G6, giving rise to goal G7. The subtree rooted at G7 is searched exhaustively, and when no more alternatives exist, the cut forces backtracking to continue at the second branch (from the left) rooted at goal G3. That is, all of the branches emanating from G5, except the leftmost leading to goal G6, are pruned from the tree and are not searched. The cut insertion procedure described above can be shown correct in the following sense.

THEOREM 1. Let P! be the result of applying the transformation above to program P. Then for any query φ that satisfies the input rules I,

Pr(P, φ) ⊆ Pr(P!, φ) ⊆ P(φ).

PROOF. The following lemma is crucial to all our results.

[Figure 1 shows an SLD-tree with goals G0: ← G(v̄); G1; G2: ← B1(ȳ1), . . ., Bi−1(ȳi−1), α(ȳi), Bi+1(ȳi+1), . . ., Bm(ȳm), D1, . . ., Ds; G3; G4: ← α(ȳi), Bi+1, . . ., Bm, D1, . . ., Ds; G5; G6: ← !, Bi+1, . . ., Bm, D1, . . ., Ds; and G7: ← Bi+1, . . ., Bm, D1, . . ., Ds.]

Fig. 1. An SLD-tree.

LEMMA 1. Let A(x1, . . ., xn) be a subgoal of goal G in an SLD-tree T. After A(x1, . . ., xn) has succeeded, each xi will be bound to some constant ci, 1 ≤ i ≤ n.

PROOF. Without loss of generality, consider the goal G: ← A, B1, . . ., Bm. In order for the subgoal A to succeed, G must eventually reduce to G′: ← B1, . . ., Bm. The proof proceeds by induction on the length of the path from G to G′ in T.

Basis. Length is 1. In this case, A(x1, . . ., xn) must unify with a unit clause. Our assumption that every argument of the head of a clause also appears in the body or is a constant ensures that all unit clauses are in fact ground clauses. Hence, each xi will be bound to some constant ci, 1 ≤ i ≤ n.

Induction. Assume that the result holds for all subgoals that succeed in k or fewer steps (where a step corresponds to an edge in T). Consider a subgoal A(x1, . . ., xn) for which the path from G to G′ is of length k + 1. Because we have assumed that every variable appearing in the head of a clause also appears in the body, each xi, 1 ≤ i ≤ n, will be bound to either a variable or a constant appearing to the left of B1, . . ., Bm in goals on the path from G to G′ in T. Those variables in A bound to variables in subgoals that succeed in k or fewer steps will be bound to constants by the inductive hypothesis. Let ← D, B1, . . ., Bm be the parent of G′ in T. Any variable of A not bound to a constant in k or fewer steps must be bound to a variable appearing in D. Since D must unify with a ground clause in step k + 1, we conclude that all variables appearing in A will be bound to constants after A succeeds. ❑

LEMMA 2. The cut insertion process is sound; that is, Pr(P!, φ) ⊆ P(φ).

PROOF. The insertion of cut symbols does not alter the declarative meaning of a program. Hence, the cut insertion process produces a program P! that is logically equivalent to the original program P. The result follows from the fact that, in the absence of function symbols, the PROLOG interpreter is sound [9]. ❑

LEMMA 3. The cut insertion process is complete with respect to Pr(P, φ); that is, Pr(P, φ) ⊆ Pr(P!, φ).

PROOF. Given a program P, input rules I, and the set of functional dependencies F, we must prove that every answer produced by P on query φ is produced by P!. Let Cj be defined as before, namely,

Cj: A(x̄) ← B1(ȳ1), . . ., Bi−1(ȳi−1), Bi(ȳi), Bi+1(ȳi+1), . . ., Bm(ȳm).

Suppose that I implies the binding rule Bi: j1, . . ., jq, and that F implies the functional dependency

Bi(ȳi): yij1, . . ., yijq → z1, . . ., zp,

where {z1, . . ., zp} is defined as before; that is, {z1, . . ., zp} is the set of variables that appear in ȳi and also appear in some ȳk for k > i or in x̄. Then clause Cj is replaced by the clauses

Cj: A(x̄) ← B1(ȳ1), . . ., Bi−1(ȳi−1), α(ȳi), Bi+1(ȳi+1), . . ., Bm(ȳm).
Ck: α(ȳi) ← Bi(ȳi), !.

yielding program P!. If clause Cj is never invoked in the solution of φ, then clearly Pr(P, φ) ⊆ Pr(P!, φ). So assume Cj is invoked, in which case the SLD-tree T of Figure 1 will be constructed by the PROLOG interpreter. We assume further that subgoal Bi succeeds, and that all branches rooted at goal G7 in T are finite (success or failure branches); otherwise, the cut has no effect, and Pr(P, φ) ⊆ Pr(P!, φ). The effect of the cut is to prune all branches emanating from G5 except the leftmost leading to G6. No answers will be lost by the pruning of failure or infinite branches from T. We must demonstrate that the only success branches pruned from T are those which would give rise to duplicate answers for φ. First we introduce some notation. Let X be the set of all variables appearing in x̄, Yi be the set of all variables in ȳi for 1 ≤ i ≤ m, and Y = Y1 ∪ . . . ∪ Ym. By assumption, X ⊆ Y. Let Z = {z1, . . ., zp} = ZR ∪ ZL, where ZR = (Yi+1 ∪ . . . ∪ Ym) ∩ Yi (the variables that appear both as arguments of Bi and to the right of Bi in Cj), and ZL = X ∩ Yi (the variables that appear both as arguments of Bi and as arguments of the left-hand side of Cj). Now let V be the set of all variables appearing in some Di, 1 ≤ i ≤ s, and U be the set of variables in ū (see Figure 1). Then let W = U ∩ V (those variables that appear both in ū and some Di, 1 ≤ i ≤ s). During the unification of A(ū) with A(x̄), let the elements of X with which the elements of W are unified be given by the set S. Note then that in goal G2 all arguments of any Di (that is, all elements of V) except those in S are distinct from any element of Y. Lemma 1 shows that every argument of any predicate appearing in a goal is either a constant or an (unbound) variable.

Thus, as long as we ensure that before a clause is used its variables are renamed to be distinct from every variable appearing in T, there will be no need to rename variables between goals in T. We claim that every success branch rooted at G5 contains a node logically equivalent to G7. Clearly this is sufficient, since then all answers computed along these branches (if any) will have already been computed along the success branches (if any) rooted at G7. If a branch rooted at G5 is a success branch, then G6 must eventually reduce to the sequence of subgoals Bi+1, . . ., Bm, D1, . . ., Ds, which we will denote by G8. We must show that the argument binding patterns in G7 and G8 are identical. Note that during the solution of Bi, no variables in G7 and G8 other than those in Yi will be bound to constants. Since the variables in Yi − Z do not appear anywhere in G7 or G8, their bindings can be ignored. From the binding rule Bi: j1, . . ., jq, we know that at goal G5 the variables YiJ = {yij1, . . ., yijq} are already bound to constants. Since program P satisfies the functional dependency

Bi(ȳi): yij1, . . ., yijq → z1, . . ., zp,

we have that each variable in Z must be bound to the same constant in both G7 and G8. Now, Z contains ZR, which is the set of variables that appear among the

yiJ's and as some ykr in Bk for k > i. So all bindings to constants in Bi+1, . . ., Bm in G8 must be the same as in G7. Similarly, ZL contains those variables that appear among the yiJ's and as some variable in Dl, 1 ≤ l ≤ s; that is, some element of S (through unification of A(ū) with the head of Cj). Thus, all bindings to constants in D1, . . ., Ds in G8 must be the same as in G7. The bindings of all variables in G7 and G8 other than those in Z remain unchanged from G6, and are hence equal. ❑

Theorem 1 follows directly from Lemma 2 and Lemma 3. ❑

Example 2. Consider the program for computing transitive closure from Example 1, but augmented with a functional dependency that constrains the relation E to be a function:

E(X, Y): X → Y
T(X, Y) ← E(X, Y).
T(X, Y) ← E(X, Z), T(Z, Y).

Suppose the input rule, as in Example 1, is T: 1, which implies the binding rules T: 1 and E: 1. Then the transformation can be applied to both clauses of the program, producing:

T(X, Y) ← E!(X, Y).
T(X, Y) ← E!(X, Z), T(Z, Y).

Intuitively, what our transformations have done is to take advantage of the fact that any node related to a given fixed node can be found by following the edges of E without backtracking.
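A Python sketch of that intuition (our own formulation, with E stored as a dict precisely because X → Y makes it a function):

```python
def reachable(e, start):
    """Nodes reachable from start by following the (functional) edge
    relation e; there are no choice points, hence nothing to backtrack
    over. Stops when a node has no outgoing edge or an edge repeats."""
    seen, x = [], start
    while x in e and e[x] not in seen:
        x = e[x]
        seen.append(x)
    return seen

# a -> b -> c -> c: following edges visits b and c exactly once each.
```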

Example 3. Consider the following program P, where a project(P, M, E) tuple means employee E works for project P and manager M manages project P.

project(P, M, E): P → M
project(P, M, E): E → P
project(turing, ric, jim).
project(turing, ric, steve).
salary(ric, 50k).
bigproj(P) ← project(P, M, E), salary(M, S), S > 50k.

Suppose the input rule given is bigproj: 1. The bigproj predicate defines those projects whose manager makes more than 50k. Suppose a query that satisfies the input rule, such as

bigproj(turing)

is posed. After we find that jim works for project turing but the project manager's salary is not greater than 50k, there is no need to look for other managers of project turing, since the functional dependency tells us there are no others. Our transformation lets us replace project by project! to reflect this fact. Note that the potential payoff in the last example is greater than that of Example 2. In Example 3, we avoid examining every employee that works on a project once we determine that the project does not meet the query criteria. It is interesting to point out that the potential payoff comes from the inefficiency introduced by the database design: the project predicate is not in Boyce-Codd Normal Form. It is also interesting to look at the case where the arguments of a predicate do not appear anywhere to its right. Even if there are no functional dependencies, the method can be applied according to our definition.
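A Python sketch of this saving (the facts are those of Example 3; the accounting is ours): the cut after project means the manager's salary is checked once, not once per employee.

```python
# project(P, M, E) tuples and the salary relation from Example 3.
project = [("turing", "ric", "jim"), ("turing", "ric", "steve")]
salary = {"ric": 50}

def bigproj(p):
    checks = 0
    for (proj, m, e) in project:
        if proj != p:
            continue
        checks += 1
        if salary[m] > 50:
            return True, checks
        break  # the cut: P -> M, so every other tuple yields the same M
    return False, checks

# Without the break, ric's salary would be re-checked for steve's tuple.
```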

Example 4. Consider the query "find an employee who makes more than 20k and works for a department that sells something":

ans(E) ← dept(E, D), sells(D, P), salary(E, S), S > 20k.

Note that the variables in the sells literal do not appear to its right nor in the head of the clause; therefore we can replace sells by sells!. The effect of this transformation is that, if an employee's salary is not greater than 20k, we are not going to recheck his salary for each product sold by his department. Thus our method solves as a special case the "deep backtracking" problem mentioned by [16]. It is also similar to the decomposition into independent subproblems described in [19].

4. COMPUTING THE BINDING RULES

Given a set of input rules I for a program P we would like to compute the set of binding rules b such that I implies b. However, consider the following program P:

C1: R(X) ← R(X).
C2: Q(X) ← P(X).
C3: Q(X) ← R(X), P(Z).

Then the input rule Q: 1 implies the binding rule P: 1 if and only if the evaluation of R(X) does not halt. The database predicates in P are R and P. If the ground clauses for R and P are R(a) and P(c), then PROLOG will answer "yes" to the query ← Q(a) but will not halt when given the query ← Q(b). Thus, detecting whether evaluation of a predicate will halt cannot be done on the basis of the program alone, but requires inspection of both the query and the "facts" represented by the database predicates as well. Algorithms for eliminating looping in PROLOG in certain cases are discussed in [6] and [13]. In the general case, this is a difficult problem. Apt et al. [1] give a sound and complete algorithm for detecting halting of function-free PROLOG programs, but they point out it is impractical, since it may require as much work as solving the goal. They also show that there is no "simple" sound and complete halting test, where simple means independent of the particular program, i.e., considering only the current SLD derivation. For these reasons, we take a conservative approach that computes a subset of all the binding rules implied by a set of input rules, by assuming that all literals that appear in a clause actually get invoked in some execution. For our purposes, it is safe to compute a subset, since the fewer binding rules we have, the fewer transformations we can apply to our program. Referring to the above example, although we cannot infer the binding rule P: 1 from the input rule Q: 1, we show in the next section that we can in fact replace both occurrences of P by P!; the first because X will always be bound to a constant in C2, and the second because the trivial functional dependency P(X): X → ∅ holds (as in Example 4). In this section, we first present a simplified technique for inferring binding rules. We will refer to the following method as Algorithm 1. Given a program P and set of input rules I, we construct a directed graph G as follows.
For each n-ary predicate A appearing in P, G has n nodes labeled A1, A2, . . ., An. If A also appears as the head of some clause in P, G has n additional nodes labeled Aiᴵ, for 1 ≤ i ≤ n. These are called the I-nodes. There is also one special node, labeled ω.

Fig. 2. Graph depicting argument binding patterns in the transitive closure program.

A (directed) edge from A_i to B_j in G intuitively means that the ith argument of A will be bound only if the jth argument of B is. There are three types of edges:

(1) If predicate A appears as head of a clause, there is an edge from A_i to A_i^I for each 1 ≤ i ≤ n.
(2) If a clause has head A and an occurrence of some predicate B in its body, and the jth argument of A is also the ith argument of B, and is not an argument of any predicate appearing before B in the body, then there is an edge from B_i to A_j.
(3) If A occurs in the body of a clause and its ith argument is not an argument of any predicate to the left of A in the clause, then there is an edge from A_i to ω_I.

Figure 2 shows the graph for the transitive closure program of Example 1, which is reproduced below.

C_1: T(X, Y) ← E(X, Y).
C_2: T(X, Y) ← E(X, Z), T(Z, Y).

Let Ḡ be the acyclic condensation of G. Mark all A_i^I nodes such that the input rule A : i is in I. Now repeat exhaustively the following process: mark any non-I node if all its successors are marked. The following theorem shows that this method computes a subset of the binding rules implied by I.
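The construction and marking can be sketched concretely for the transitive closure program. In the following sketch the edges of Figure 2 are transcribed by hand; node names such as "T1^I" and "w_I" are our own ASCII rendering of the paper's notation, and the single self-loop on T_2 is dropped, as the acyclic condensation would do.

```python
# Edges of G for  C1: T(X,Y) <- E(X,Y).  C2: T(X,Y) <- E(X,Z), T(Z,Y).
edges = {
    "T1": {"T1^I"},          # type 1: T heads a clause
    "T2": {"T2^I", "T2"},    # type 1, plus the type-2 self-loop via Y in C2
    "E1": {"T1"},            # type 2: X links E's 1st argument to T's 1st
    "E2": {"T2", "w_I"},     # type 2 via Y in C1; type 3 via Z in C2
    "T1^I": set(), "T2^I": set(), "w_I": set(),
}

def mark(edges, input_rules):
    """Fixpoint marking: I-nodes are marked only by input rules, w_I is
    never marked, and a non-I node is marked once all successors are."""
    marked = set(input_rules)                       # step 0: mark I-nodes
    changed = True
    while changed:
        changed = False
        for node, succs in edges.items():
            if node in marked or node.endswith("^I") or node == "w_I":
                continue
            # a self-loop is ignored, as the acyclic condensation does
            if all(s in marked for s in succs - {node}):
                marked.add(node)
                changed = True
    return marked

print(sorted(mark(edges, {"T1^I"})))                # ['E1', 'T1', 'T1^I']
```

Under the input rule T : 1 this marks exactly T_1^I, T_1, and E_1, matching the binding rules T : 1 and E : 1 derived in Example 5 below.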

THEOREM 2. For each component K of Ḡ that is marked by Algorithm 1, and each A_i ∈ K, I implies the binding rule A : i.

PROOF. Given a program P and a set of input rules I, let G be the graph constructed by Algorithm 1. The proof of Theorem 2 proceeds in two stages. We first prove the correctness of the marking algorithm for the case when G is acyclic. Then we show that it is sound to apply the marking algorithm to the acyclic condensation of G when G is cyclic. □

LEMMA 4. If G is acyclic, then for each A_i marked by Algorithm 1, I implies the binding rule A : i.

PROOF. The proof is by induction on the number of steps taken by the marking algorithm. For this purpose, we restate the marking algorithm as follows:

(0) Mark all A_i^I nodes such that the input rule A : i is in I.
(1) Mark all isolated non-I nodes of G.
(k) (k ≥ 2) Mark any non-I node if all its successors were marked in steps 1 to k − 1.

Since G is acyclic and the number of clauses in P is finite, the algorithm clearly terminates. Let the application of query Φ (satisfying input rules I) to the program P be represented by the SLD-tree T. A predicate A is invoked when it appears as the first subgoal of a goal G_i in T. Before it can be invoked, predicate A must be added to the list of subgoals. This can occur in one of two ways: either (1) A appears as the predicate in Φ, or (2) A appears in the body of a clause whose head unifies with the leftmost subgoal of a goal in T. We must show that if a node A_i in G is marked by the algorithm, then program P runs under binding rule A : i; that is, in every execution of P, predicate A is invoked with its ith argument bound to a constant.

Basis. Assume node A_i is marked in step 1 of the algorithm. Since A_i is isolated, predicate A never appears as the head of a clause (otherwise, (A_i, A_i^I) would be an edge of G). This means that A can be invoked only as a result of (2) above. Furthermore, any variable X occurring as the ith argument of A in a clause C_j is never the first occurrence of X in C_j (otherwise, ω_I would be a successor of A_i) or the first occurrence of X in the body of C_j if X appears in the head B of C_j (otherwise, (A_i, B_k) would be an edge of G, for some k). Hence, whenever predicate A appears in a clause C_j in P, it has as its ith argument a variable X, which appears before A in the body of C_j. By Lemma 1, this means that whenever A is invoked, its ith argument will be bound to a constant, since the predicates to the left of A in C_j will have succeeded. Thus, I implies the binding rule A : i.

Induction. Assume that the input rules I imply the binding rules B : l for all nodes B_l marked by the first k (k ≥ 1) steps of the algorithm. Let A_i be a node that is marked in step k + 1, that is, all successors of A_i in G were marked before step k + 1. Assume predicate A appears in the body of n clauses C_{j_1}, ..., C_{j_n} (n ≥ 0), and (possibly) as the head of m clauses C_{k_1}, ..., C_{k_m} (m ≥ 0), where k_1, ..., k_m are not necessarily distinct from j_1, ..., j_n.

Suppose that A is invoked as part of a query Φ, that is, as a result of (1) above. Then we can assume that m > 0 and that A appears as the head of some clause C_{k_l}. Therefore, (A_i, A_i^I) must be an edge in G, and A_i^I must have been marked in a previous step. So the ith argument of A will be bound to a constant in Φ.

Suppose instead that A is added to the list of subgoals by (2) above, and that the p-ary predicate B is the head of clause C_{j_l}, in the body of which A appears. If B_l is not a successor of A_i in G for any l, 1 ≤ l ≤ p, then the variable X occurring as the ith argument of A in C_{j_l} must appear as an argument of a predicate to the left of A in the body of C_{j_l}; otherwise, edge (A_i, ω_I) would be in G, and A_i could not be marked. As a result, when predicate A is invoked in C_{j_l}, the ith argument of A will always be bound to a constant (by Lemma 1). On the other hand, if B_l is a successor of A_i in G, then B_l must already have been marked by the algorithm. By the induction hypothesis, this means that the lth argument of B will be bound to a constant in every execution of P. Since the variable X appearing as the ith argument of A in C_{j_l} also appears as the lth argument of B (by virtue of the edge (A_i, B_l) in G), X will always be bound to a constant through the unification of B with the head of C_{j_l}. Thus, the ith argument of A will always be bound to a constant whenever A is invoked in any execution of P on query Φ. □

LEMMA 5. If G is cyclic, then for each component K of Ḡ that is marked by Algorithm 1, and each A_i ∈ K, I implies the binding rule A : i.

PROOF. We must show that it is safe to mark a component K of the acyclic condensation Ḡ of G (thereby marking every A_i ∈ K in G) if all the successors of K have been marked. To prove this, we must demonstrate that if all the arguments corresponding to the successors of K in Ḡ are always bound to constants in every execution of a query Φ applied to a program P, then all the arguments in P which are represented by the nodes in K will also always be bound to constants.

Let G be the (cyclic) graph constructed from a program P, and let Ḡ be the acyclic condensation of G, containing some component K. Suppose that all the successor nodes of K are marked by the algorithm, that is, the arguments corresponding to these nodes are always bound to constants in every execution of P (by Lemma 4). Let A_i be a node in K, and let variable X be the ith argument of predicate A in clause C_k. Consider an execution step to be the unification of a subgoal with the head of a clause in P. The proof is by induction on the number of steps in this execution of P to invoke some predicate A in clause C_k (such that A_i ∈ K) for the first time.

Basis. If A is invoked in the first step of the execution, then A must have constituted the query Φ, and A must be the head of clause C_k. Hence, A_i^I is a successor of A_i in G and of K in Ḡ. By assumption A_i^I is marked, and its ith argument is bound to a constant. We conclude that X will be bound to a constant through unification.

Induction. Assume that any arguments of predicates corresponding to nodes in K which are invoked in the first n (n ≥ 1) steps of this execution of P are bound to constants. Let A in clause C_k be invoked in step n + 1, where A_i ∈ K and the ith argument of A is X. If X is not bound to a constant in C_k when A is invoked, either it must be the first occurrence of X in C_k, or X must appear as the jth argument of the head of C_k, say predicate B, which must have been unbound when B was invoked. In the first case, ω_I would be a successor of A_i in G. Since ω_I can never be in K and is never marked, this contradicts the assumption that all successors of K were marked. In the second case, B_j is a successor of A_i in G, and would have been invoked in step n. If B_j is not in K, then B_j is a successor of K in Ḡ, and is marked but not always bound; we again have a contradiction. On the other hand, if B_j ∈ K, we contradict the inductive hypothesis.

We conclude that every argument represented by the nodes in each component K of Ḡ, all of whose successor nodes are marked, must always be bound to a constant. Hence the marking algorithm is correct; that is, for every A_i ∈ K marked by the algorithm, I implies the binding rule A : i. □

Example 5. Let us apply the algorithm to the graph in Figure 2, with the single input rule T : 1. Since the graph is acyclic (except for the T_2 self-loop), its acyclic condensation is itself. By the given input rule, we mark T_1^I. It follows that T_1 and E_1 also get marked, and no other nodes. The result is that binding rules T : 1 and E : 1 are implied.

5. COMPUTING ADDITIONAL BINDING RULES

In this section, we show how to extend the method of the previous section in order to compute additional binding rules. However, the resulting set of binding rules remains a subset of all the binding rules implied by a set of input rules, since we still analyze only the program (not the ground clauses or the query) and assume that all predicates will be invoked. We will refer to the new method as Algorithm 2.

Algorithm 1 can be viewed as being conservative in the sense that a binding rule A : i has to be satisfied by all occurrences of the predicate A in a program P before it can be inferred. In contrast, Algorithm 2 can infer a binding rule for a predicate A that holds for any invocation of A initiated from the body of a particular clause C_k in P. We will denote the fact that the ith argument of predicate A is bound in clause C_k by C_k.A : i.

Example 6. Consider the program P:

C_1: P(X, Y) ← P(Y, X).
C_2: Q(X) ← R(X), P(X, Y).

along with the ground clauses R(a), R(b), and P(a, b), and the query ← Q(X). A PROLOG interpreter will find the answer X = a and then loop forever. Algorithm 1 does not improve program P since, although the first argument of P in C_2 is always bound to a constant, this cannot be exploited because the binding rule P : 1 cannot be inferred using Algorithm 1. However, Algorithm 2 permits the binding rule C_2.P : 1 to be computed, and since the functional dependency P(X, Y): X → ∅ holds, we can replace P by P! in C_2. This transformation results in an additional answer (X = b) being found for the query ← Q(X). Therefore, there exist programs P for which

Pr(P^1!, Φ) ⊂ Pr(P^2!, Φ), where P^i! denotes program P after transformation by Algorithm i, 1 ≤ i ≤ 2.
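The effect of replacing P by P! in C_2 can be pictured with a small simulation (entirely our own encoding, not part of the method): solution streams become Python generators, and bang plays the role of the inserted cut by committing to the first solution.

```python
def p_solutions(x):
    """Y with P(x, Y). For x = 'a', the fact P(a, b) answers first; on
    backtracking the recursive clause C1: P(X,Y) <- P(Y,X) loops forever,
    modeled here as a generator that never produces another answer."""
    if x == "a":
        yield "b"
        while True:          # C1 expands P(a,Y) -> P(Y,a) -> P(a,Y) -> ...
            pass
    elif x == "b":
        yield "a"            # P(b, a) follows from C1 and the fact P(a, b)

def bang(stream):
    """P!: commit to the first solution, as the inserted cut does."""
    for sol in stream:
        yield sol
        return               # never ask the stream for a second answer

def q_solutions():
    """C2: Q(X) <- R(X), P!(X, Y), with ground clauses R(a) and R(b)."""
    for x in ("a", "b"):                  # solutions of R(X)
        for _y in bang(p_solutions(x)):   # P! never backtracks into the loop
            yield x

# With the cut, both answers are found; without bang(), the enumeration
# would hang inside p_solutions('a') after the first answer X = a.
print(list(q_solutions()))                # ['a', 'b']
```

The simulation reproduces the extra answer X = b of Example 6: the cut prevents backtracking into the nonterminating recursive derivation for X = a.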

Algorithm 1 also fails to infer binding rules associated with repeated variables in the head of a clause.

Example 7. Consider the following program P:

C_1: max(X, Y, X) ← gt(X, Y).
C_2: max(X, Y, Y) ← le(X, Y).

Given the input rule max : 1,2, it is not possible to replace gt by gt! or le by le! using Algorithm 1. To do so requires the binding rule max : 3, which can only be inferred by considering all occurrences of max as the head of a clause. We will return to this example after defining Algorithm 2.

Given a program P and a set of input rules I, the nodes of the graph G are as before, except that there is an additional special node β. The edges of G are labeled and are of five types (the first three are similar to those used in Algorithm 1):

(1) If predicate A appears as the head of clause C_k, there is an edge labeled k from A_i to A_i^I for each 1 ≤ i ≤ n.
(2) If a clause C_k has head A and an occurrence of some predicate B in its body, and the jth argument of A is also the ith argument of B, and is not an argument of any predicate appearing before B in the body of C_k, then there is an edge labeled k from B_i to A_j.
(3) If A occurs in the body of clause C_k and its ith argument is not an argument of any predicate to the left of A in the body of C_k, then there is an edge labeled k from A_i to ω_I.
(4) If predicate A appears as the head of clause C_k and its ith argument is a variable that also occurs as the jth argument of A in the head of C_k, there is an edge labeled k from A_i^I to A_j^I.
(5) If A occurs in the body of a clause C_k and its ith argument is either a variable that occurs to the left of A in the body of C_k or is a constant, then there is an edge labeled k from A_i to β.
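As a sketch, the five edge types can be derived mechanically from a small clause encoding (our own: a clause is a (head, body) pair and capitalized arguments are variables; following the behavior of Algorithm 1's corresponding rule, an argument that appears in the head contributes type-2 edges rather than a type-3 edge to ω_I):

```python
# Sketch of the labeled-edge construction for Algorithm 2. Node names
# like "P1" (argument positions), "P1^I" (I-nodes), "w_I", and "beta"
# are our own ASCII encoding of the paper's notation.

def is_var(term):
    return term[0].isupper()

def build_edges(clauses):
    edges = set()                                  # (source, target, label)
    for k, (head, body) in enumerate(clauses, start=1):
        hpred, hargs = head
        for i, a in enumerate(hargs, start=1):
            edges.add((f"{hpred}{i}", f"{hpred}{i}^I", k))        # type 1
            for j, b in enumerate(hargs, start=1):
                if i != j and is_var(a) and a == b:               # type 4
                    edges.add((f"{hpred}{i}^I", f"{hpred}{j}^I", k))
        for pos, (bpred, bargs) in enumerate(body):
            earlier = {a for (_q, qargs) in body[:pos] for a in qargs}
            for i, a in enumerate(bargs, start=1):
                if is_var(a) and a not in earlier:
                    for j, h in enumerate(hargs, start=1):
                        if h == a:                                # type 2
                            edges.add((f"{bpred}{i}", f"{hpred}{j}", k))
                    if a not in hargs:                            # type 3
                        edges.add((f"{bpred}{i}", "w_I", k))
                else:                                             # type 5
                    edges.add((f"{bpred}{i}", "beta", k))
    return edges

# The program of Example 6: C1: P(X,Y) <- P(Y,X).  C2: Q(X) <- R(X), P(X,Y).
clauses = [
    (("P", ("X", "Y")), [("P", ("Y", "X"))]),
    (("Q", ("X",)), [("R", ("X",)), ("P", ("X", "Y"))]),
]
g = build_edges(clauses)
assert ("R1", "Q1", 2) in g       # type 2: X first occurs in R
assert ("P1", "beta", 2) in g     # type 5: X occurs to the left, in R
assert ("P2", "w_I", 2) in g      # type 3: Y first occurs in P's body literal
```

The assertions reproduce the edges described for this program in Example 8 below.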

If (A_i, B_j) is an edge labeled k in G, we call B_j a k-successor of A_i. For the ith argument of predicate B in the body of clause C_k with head A ≠ B, the node B_i in G will have as k-successor either β, ω_I, or one or more A_j's (when there are repeated variables in the head A). If A = B, then B_i will also have B_i^I as a k-successor.

The label set of a node A_i, denoted λ(A_i), is defined to be the set of labels k such that A_i has a k-successor. For a given predicate A with n arguments, λ(A_i) = λ(A_j), 1 ≤ i, j ≤ n, and each set λ(A_i) denotes the set of clauses in which A occurs. For an I-node A_i^I, λ(A_i^I) is defined to be the set of labels k such that node A_i has A_i^I as a k-successor, that is, those k for which A is the head of clause C_k.

Let Ḡ consist of the I-nodes of G along with the acyclic condensation of the non-I nodes of G, the label set of a component node K in Ḡ being the union of the label sets of each node A_i ∈ K in G. The marking procedure associates with each node K in Ḡ a set μ(K) of edge labels known as the marking set of

Fig. 3. Graph constructed by Algorithm 2 for the program of Example 6.

K. Initially, the marking set of every node in Ḡ is empty. The marking procedure consists of five phases:

(1) For each I-node A_i^I in Ḡ such that the input rule A : i is in I, set μ(A_i^I) = λ(A_i^I). If there are m clauses in P, μ(β) = {1, 2, ..., m}.
(2) For each I-node A_i^I in Ḡ, include k in μ(A_i^I) if A_i^I has some k-successor A_j^I for which k ∈ μ(A_j^I).
(3) Repeat the following process exhaustively on Ḡ: for a non-I node K, set μ(K) to λ(K) if for all k ∈ λ(K) each k-successor K′ of K has k ∈ μ(K′).
(4) Reform G from Ḡ, setting the marking set of each node A_i in G equal to the intersection of λ(A_i) and the marking set of the component to which A_i belongs in Ḡ. Remove all I-nodes other than ω_I from G, yielding G⁻.
(5) For each node A_i in G⁻, include k ∈ λ(A_i) in μ(A_i) if each k-successor B_j of A_i has k ∈ μ(B_j).

The first phase simply applies the initial markings. In phase 2, additional input rules are inferred based on the occurrence of shared variables in the head of a clause. The next marking phase is used to mark components arising from cycles induced by recursive clauses, as well as argument positions that are always bound in P. Thus, in phase 4 when the graph G is reconstructed, if a node A_i in G has μ(A_i) = λ(A_i), it means that A_i is bound in every clause in which it appears, so the binding rule A : i is computed by Algorithm 2. Phase 5 determines argument positions that are bound when the corresponding predicate is invoked from the body of a particular clause C_k in P. For each k included in μ(A_i) in phase 5, Algorithm 2 computes the binding rule C_k.A : i.
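A sketch of the phases on the program of Example 6 (C_1: P(X, Y) ← P(Y, X); C_2: Q(X) ← R(X), P(X, Y)), with the labeled edges transcribed by hand. Since that program has no input rules and no repeated head variables, phases 1 and 2 reduce to initializing μ(β), and, as noted in the comments, we elide the cycle condensation because no node is markable in phase 3 here; names are our own.

```python
edges = {                                    # (source, target, clause label)
    ("P1", "P1^I", 1), ("P2", "P2^I", 1),    # type 1: P heads C1
    ("P1", "P2", 1), ("P2", "P1", 1),        # type 2: Y and X link body P to head P
    ("Q1", "Q1^I", 2),                       # type 1: Q heads C2
    ("R1", "Q1", 2),                         # type 2: X first occurs in R
    ("P1", "beta", 2),                       # type 5: X occurs to the left, in R
    ("P2", "w_I", 2),                        # type 3: Y first occurs in P
}
nodes = {n for (s, t, _k) in edges for n in (s, t)}

def lam(n):                                  # label set: labels on outgoing edges
    return {k for (s, _t, k) in edges if s == n}

def succs(n, k, removed=frozenset()):
    return {t for (s, t, kk) in edges if s == n and kk == k and t not in removed}

mu = {n: set() for n in nodes}
mu["beta"] = {1, 2}                          # phase 1 with no input rules (m = 2)
# Phase 3 as a per-node fixpoint (a full implementation would first
# condense cycles such as {P1, P2}; here no node is markable anyway).
changed = True
while changed:
    changed = False
    for n in nodes - {"beta", "w_I"}:
        if n.endswith("^I") or mu[n]:
            continue
        if all(k in mu[t] for k in lam(n) for t in succs(n, k)):
            mu[n], changed = set(lam(n)), True
# Phase 4: drop the I-nodes (all component marking sets are empty here).
removed = {n for n in nodes if n.endswith("^I")}
# Phase 5: per-clause rules C_k.A : i, requiring at least one surviving
# k-successor so that A really occurs in the body of C_k.
rules = set()
for n in nodes - removed - {"beta", "w_I"}:
    for k in lam(n):
        sk = succs(n, k, removed)
        if sk and all(k in mu[t] for t in sk):
            rules.add((k, n))
print(rules)                                 # {(2, 'P1')}, i.e., C2.P : 1
```

Phase 3 marks nothing, and phase 5 yields exactly the binding rule C_2.P : 1, as Example 8 derives by hand.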

Example 8. Consider the program P of Example 6. The graph constructed by Algorithm 2 is depicted in Figure 3. For example, there is an edge from R_1 to Q_1 labeled 2, since the first occurrence of X in the body of C_2 is as the first argument of R. The fact that X appears subsequently as the first argument of P means that there is an edge labeled 2 from P_1 to β. The label sets associated with predicate P are as follows: λ(P_1^I) = λ(P_2^I) = {1}, while λ(P_1) = λ(P_2) = {1, 2}. Initially, node β is assigned the marking set {1, 2}, while the other marking sets remain empty, since there are no input rules. No markings can be applied during phase 3; however, phase 5 sets

μ(P_1) to {2}, since β is the only 2-successor of P_1. This allows the binding rule C_2.P : 1 to be inferred by Algorithm 2, permitting P to be replaced by P! in C_2. As stated previously, this transformation can result in additional (correct) answers being found by the PROLOG interpreter.

Before proving that Algorithm 2 computes only valid binding rules, we show that, for programs with no repeated variables in the heads of clauses, phases 1 to 4 of Algorithm 2 compute exactly the binding rules computed by Algorithm 1.

LEMMA 6. For a program P in which no variable is repeated in the head of any clause, Algorithm 2 computes in phases 1 to 4 exactly the binding rules computed by Algorithm 1.

PROOF. We need to show that A : i is computed by Algorithm 2 if and only if it is computed by Algorithm 1. Let G_1 and G_2 be the graphs constructed by Algorithms 1 and 2, respectively. After the marking procedure of Algorithm 1 on Ḡ_1 has completed, we can assume that G_1 is reformed and that, for each marked component K in Ḡ_1, every node A_i ∈ K is marked in G_1. The nodes of G_1 and G_2 are identical, except for β, which has no binding rules associated with it. This means that we must prove that, for each node A_i (other than β), μ(A_i) = λ(A_i) in G_2 if and only if A_i is marked in G_1.

Since there are no repeated variables in the head of any clause in P, there are no edges of type 4 in G_2, and hence phase 2 of the marking algorithm does not apply. As a result, for all nodes other than β, there is an edge from A_i to B_j in G_1 if and only if there is an edge from A_i to B_j in G_2. Finally, after phases 1, 3, and 4 of Algorithm 2, the marking set of node A_i is either equal to λ(A_i) or is empty.

Consider a reverse topological ordering of the nodes of Ḡ_1 or Ḡ_2 excluding β; that is, K < K′ if K is a successor of K′. The proof proceeds by induction on the position of a node in the ordering.

Basis. A node in position 1 must either be isolated in Ḡ_1 or be an I-node. In both cases this node must correspond to a single node A_i in G_1. If A_i is an I-node, then clearly A_i is marked in G_1 and has μ(A_i) = λ(A_i) in G_2 if and only if A : i is an input rule in I. If A_i is isolated in G_1 (and is not ω_I), then A cannot appear as the head of any clause in P, and, whenever A appears in the body of a clause C_k, its ith argument is either a constant or a variable occurring to the left of A in the body of C_k. Hence, the only successor of A_i in G_2 is β, and β is a k-successor of A_i for each clause C_k in which A appears. Since A_i is isolated in G_1, it is marked by Algorithm 1; since for each clause C_k in which A appears, k ∈ μ(β), μ(A_i) is set to λ(A_i) by Algorithm 2.

Induction. Assume that, for all nodes K′ whose position in the ordering is less than n, n > 1, it is the case that K′ is marked in G_1 if and only if μ(K′) = λ(K′) in G_2. Now consider node K whose position in the ordering is n. If K is an I-node or is isolated in G_1, the result follows as above.

Assume that K is neither an I-node nor isolated in G_1; hence, K has a successor K′ in both G_1 and G_2. The position of any successor K′ of K is less than n in the reverse topological ordering. By the inductive hypothesis, each K′ is marked in G_1 if and only if μ(K′) = λ(K′) in G_2.

Assume that K is marked in G_1 by Algorithm 1. Thus, for every successor K′ of K, μ(K′) = λ(K′) in G_2. Node K may also have β as a successor in G_2, but μ(β) = λ(β). If K′ is a k-successor of K in G_2, then either K′ is β or, for some B_j ∈ K′ and some A_i ∈ K, the variable in the jth position of the head B of clause C_k first appears as the ith argument of A in the body of C_k. Consequently, by the definition of λ(K′), k ∈ λ(K′), and since μ(K′) = λ(K′) we have that k ∈ μ(K′). As this is true for each k-successor of K, we conclude that Algorithm 2 sets μ(K) to λ(K) in phase 3.

Now assume that Algorithm 2 sets μ(K) to λ(K) in phase 3. Hence, for all k ∈ λ(K), each k-successor K′ of K has k ∈ μ(K′). Since μ(K′) is not empty, it must be equal to λ(K′). Therefore, by the inductive hypothesis, K′ must be marked in G_1 or must be β. Since each successor of K in G_1 is a k-successor of K in G_2, for some k ∈ λ(K), we conclude that Algorithm 1 also marks K.

If component K in Ḡ_2 has μ(K) = λ(K), then in phase 4 each node A_i ∈ K has μ(A_i) = λ(A_i), since λ(K) is the union of the label sets of each A_i, while μ(A_i) is the intersection of λ(A_i) and μ(K). Since the components in G_1 and G_2 have the same nodes (except for β), it follows that A_i is marked in G_1 if and only if μ(A_i) = λ(A_i) in G_2. □

The following theorem shows that Algorithm 2 provides a sound method for computing binding rules.

THEOREM 3. Let G⁻ be the graph that results from applying Algorithm 2 to a program P with set of input rules I. For each binding rule R, of the form A : i or C_k.A : i for C_k in P, that is computed by Algorithm 2, I implies R.

PROOF. It was shown in Lemma 6 that, when phase 2 does not apply, phases 1 to 4 of Algorithm 2 compute only binding rules implied by I. Consider first the inclusion of phase 2 in Algorithm 2. It is not difficult to see that, unless for some I-node A_i^I phase 2 sets μ(A_i^I) to λ(A_i^I), no additional binding rules are computed by Algorithm 2. None are computed in phase 3, since the label sets of A_i^I and A_i are identical; hence, in order to mark A_i in phase 3, Algorithm 2 requires that μ(A_i^I) = λ(A_i^I). Furthermore, no additional binding rules are computed in phase 5, since all I-nodes other than ω_I are removed from G in phase 4.

Now assume that phase 2 sets μ(A_i^I) to λ(A_i^I). This is done only if, for each k ∈ λ(A_i^I), there is some k-successor A_j^I for which k ∈ μ(A_j^I). Consider a particular clause C_k in which A appears as the head, and the subgraph G′ defined by the edges labeled k connecting I-nodes in G. Because the relation defined by these edges is symmetric and transitive, if some node A_i^I in G′ is marked with k in phase 1, then all nodes in G′ are marked with k in phase 2; otherwise, no node in G′ is marked with k in phase 2. If some node A_i^I in G′ is marked with k in phase 1, it means that, whenever A is a goal of P, the ith argument of A is always bound to a constant. Node A_j^I is a predecessor of A_i^I in G if the same variable appears as both the ith and jth argument of A in the head of C_k. Hence, if the goal of P unifies with the head of C_k, the jth argument of A will also be bound to a constant through unification. If k ∈ μ(A_j^I) for all k ∈ λ(A_j^I), then whenever the goal of P unifies with the head of any clause whose head predicate is A, the jth argument of A will always be bound to a constant. This is equivalent to having an additional input rule A : j. Thus phases 1 to 4 remain sound when phase 2 applies.

Next we consider phase 5.
Assume that k ∈ λ(A_i) is added to μ(A_i) in phase 5. Note that A must appear in the body of C_k, since there are no I-nodes (other than ω_I) in G⁻. Clearly, ω_I cannot be a k-successor of A_i. If β is a k-successor of A_i, then it is the only k-successor of A_i. In this case, the ith argument of A is either a constant or a variable occurring to the left of A in the body of C_k. In either case, the PROLOG evaluation algorithm ensures that this argument is always bound to a constant when A is invoked via the body of C_k.

If A_i has one or more k-successors, none of which is β, then each k-successor B_j of A_i in G⁻ has k ∈ μ(B_j). So B is the head of C_k, and we know that, since each B_j was marked in phase 4, μ(B_j) = λ(B_j); hence, by Lemma 6 and the above proof, I implies the binding rule B : j, and thus the jth argument of B is always bound to a constant whenever B is invoked. Since the same variable appears as both the ith argument of A and the jth argument of B, we can conclude that, through unification, the ith argument of A will always be bound to a constant when A is invoked via the body of clause C_k. In other words, I implies C_k.A : i. □

The cut insertion procedure for Algorithm 2 is similar to that for Algorithm 1. Given a program P, input rules I, and a set of functional dependencies F, let clause C_j be defined as before, namely,

C_j: A ← B_1, ..., B_i, ..., B_q.

If the input rules I imply the binding rule

C_j.B_i : k,

and the functional dependencies F imply the functional dependency

B_i(W_1, ..., W_n): W_k → Z_1, ..., Z_p

(where Z_1, ..., Z_p are the variables that appear both as arguments of B_i and either to the right of it or as arguments of the left-hand side of the clause), then replace B_i in clause C_j by B_i!.
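Checking whether F implies the required dependency is an attribute-closure computation; the following is a minimal sketch in our own formulation, with argument positions standing in for the variables of B_i:

```python
# Sketch of the cut-insertion test: literal B in clause C_j may be
# replaced by B! when the variables Z_1, ..., Z_p that matter to the
# rest of the clause are functionally determined by B's bound arguments.

def closure(positions, fds):
    """Attribute closure of a set of argument positions under FDs,
    each FD given as a (lhs_positions, rhs_positions) pair."""
    closed = set(positions)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closed and not set(rhs) <= closed:
                closed |= set(rhs)
                changed = True
    return closed

def can_insert_cut(args, bound, z_vars, fds):
    """args: B's argument variables; bound: positions known bound by a
    binding rule; z_vars: variables of B used to the right or in the head."""
    z_positions = {i for i, a in enumerate(args) if a in z_vars}
    return z_positions <= closure(bound, fds)

# C2: Q(X) <- R(X), P(X, Y) of Example 6: C2.P : 1 holds and Y is used
# nowhere else, so no dependencies are needed (the trivial FD suffices).
print(can_insert_cut(("X", "Y"), bound={0}, z_vars=set(), fds=[]))    # True
# empSal(X, W): W is used to the right (in gt/le), and the dependency
# empSal(X, Y): X -> Y, encoded as ({0}, {1}), makes the cut safe.
print(can_insert_cut(("X", "W"), bound={0}, z_vars={"W"},
                     fds=[({0}, {1})]))                               # True
```

Without the empSal dependency, the second call would return False, and B could not safely be replaced by B!.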

Example 9. Figure 4 shows the graph for the program P of Example 7, which is reproduced below.

C_1: max(X, Y, X) ← gt(X, Y).
C_2: max(X, Y, Y) ← le(X, Y).

For diagrammatic simplicity, we have drawn a single edge labeled by two values instead of two edges when the edges have the same endpoints, and have replaced two edges of opposite direction connecting the same nodes by a single bidirectional edge.

Fig. 4. Graph constructed for the program of Example 9.

Recall that we are given the input rule max : 1,2 for P. This means that max_1^I and max_2^I are each assigned the marking set {1, 2} in phase 1 of Algorithm 2. In phase 2, {1} is included in μ(max_3^I) because max_1^I is a 1-successor of max_3^I and the label 1 is in μ(max_1^I); similarly, {2} is inherited by max_3^I from max_2^I. Since μ(max_3^I) becomes equal to λ(max_3^I), the input rule max : 3 is inferred. The following marking sets are assigned during phase 3: μ(max_1) = μ(max_2) = μ(max_3) = {1, 2}, μ(le_1) = μ(le_2) = {2}, and μ(gt_1) = μ(gt_2) = {1}. Since the marking set for each of these nodes is equal to its label set, we can infer that each of the corresponding argument positions is always bound. Hence, we are able to replace le by le! and gt by gt! in P.

Although the insertion of cuts does not actually improve the efficiency of the program of Example 9, there are similar programs in which improvements can be made. Consider the program P:

C_1: maxEarn(X, Y, X) ← empSal(X, W), empSal(Y, Z), gt(W, Z).
C_2: maxEarn(X, Y, Y) ← empSal(X, W), empSal(Y, Z), le(W, Z).

where empSal is a database predicate, relating employees and their salaries, in which the functional dependency empSal(X, Y): X → Y holds. Each occurrence of empSal in the two clauses can be replaced by empSal!, while gt and le can be replaced by gt! and le!, respectively.

The previous example raises the issue of repeated occurrences of a predicate A in the body of a clause. These can be accommodated in Algorithm 2 by associating nodes in the graph with each occurrence of a predicate in a program. Further modifications of the algorithm would cater for 0-ary predicates. Such a predicate A in program P would be represented by an isolated, marked node A_0 in the graph G; hence, A could always be replaced by A! in P, but the fact that A_0 is marked in G would not allow any other bindings to be inferred.

Finally, we should mention that, if we are willing to run Algorithm 2 each time a query is formulated, then additional changes to the algorithm would permit yet more bindings to be inferred. For example, in the program

C_1: P(X, Y) ← S(Y, X).
C_2: Q(X) ← R(X), P(X, Y).

with the query ← Q(X), we would be able to infer that the second argument of S in clause C_1 is always bound to a constant (for the given query).

6. DISCUSSION

The work presented in this paper bears some similarity to studies for detecting determinacy or functionality of predicates in logic programs [7, 10, 15]. A predicate P is determinate if there exists at most one derivation of a solution for P; P is functional if all solutions produce the same answer. In discussing related work, it is necessary to distinguish between deriving a property that a predicate P always satisfies in a program P (that is, every occurrence of P in P satisfies the property), as opposed to a property that some literal over P in P satisfies. Our method can determine all database predicates or literals that are functional, as well as some nondatabase literals that are functional.

Certain previous approaches consider PROLOG programs in their full generality, that is, with function symbols and "negation." In addition, each argument of a predicate is assumed to be used in one of three modes: ground, free, or unknown. What we have called an input rule is related to the ground mode for an argument. Finally, some of our assumptions, which allow us to utilize the fact that all unit clauses are ground, for example, simplify our algorithms considerably.

The papers [10] and [15] deal with detecting whether a predicate P is determinate or functional, given certain modes for its arguments. As a result, their methods cannot discover cases in which individual literals over P are functional while P itself is not. Also, neither method assumes any knowledge of functional dependencies; rather, such dependencies have to be inferred by searching through all ground unit clauses. Of course, this may allow additional dependencies, which happen to hold in the present database state, to be exploited. More recently and independently, predefined functional dependencies have been included in the approach of [7].
Sufficient conditions are given there for detecting the functionality of both predicates and literals, although to do so requires solving a set of (possibly) recursive equations. This is more general than our method, since we have no means for deriving the functionality of a predicate which appears as the head of some clause. However, although Debray's condition for determining the functionality of a literal is essentially the same as ours, no detailed algorithm for this step is given in the paper.

The analysis of binding patterns has become a key ingredient in bottom-up processing of Horn clause queries, following the approach introduced by Ullman [17] and developed independently of our work. Ramakrishnan et al. [14] propose transformations that capture some aspects of the cut operator that are appropriate to the bottom-up model of computation.

ACKNOWLEDGMENTS

We would like to thank the referees for their detailed and helpful comments on an earlier draft of this paper.

REFERENCES
1. APT, K. R., BOL, R. N., AND KLOP, J. W. On the safe termination of PROLOG programs. Unpublished manuscript, Aug. 1989.
2. BONDY, J. A., AND MURTY, U. S. R. Graph Theory with Applications. Elsevier Science, New York, 1976.
3. BRUYNOOGHE, M., AND PEREIRA, L. M. Deduction revision by intelligent backtracking. In Implementations of PROLOG, J. Campbell, Ed., Ellis Horwood, 1984.
4. CHANDRA, A. K., AND HAREL, D. Horn clause queries and generalizations. J. Logic Program. 2, 1 (Apr. 1985), 1-15.
5. CLOCKSIN, W. F., AND MELLISH, C. S. Programming in PROLOG. Springer-Verlag, Berlin, 1981.
6. COVINGTON, M. A. A further note on looping in PROLOG. ACM SIGPLAN Not. 20, 8 (Aug. 1985), 28-31.
7. DEBRAY, S. K. Detection and optimization of functional computations in PROLOG. Tech. Rep. 85/020, Dept. of Computer Science, SUNY at Stony Brook, 1985.
8. KOHLI, M., AND MINKER, J. Intelligent control using integrity constraints. In Proceedings of the National Conference on Artificial Intelligence (Washington, D.C., Aug. 22-26, 1983), AAAI, 1983, 202-205.
9. LLOYD, J. W. Foundations of Logic Programming. Springer-Verlag, Berlin, 1984.
10. MELLISH, C. S. The automatic generation of mode declarations for PROLOG programs. Res. Rep. 163, Dept. of Artificial Intelligence, Univ. of Edinburgh, 1981.
11. MENDELZON, A. O. Functional dependencies in logic programs. In Eleventh International Conference on Very Large Data Bases (Stockholm, Aug. 21-23, 1985), 324-330.
12. PEREIRA, L. M., AND PORTO, A. Selective backtracking. In Logic Programming, K. Clark and S.-A. Tärnlund, Eds., Academic Press, New York, 1982.
13. POOLE, D., AND GOEBEL, R. On eliminating loops in PROLOG. ACM SIGPLAN Not. 20, 8 (Aug. 1985), 38-40.
14. RAMAKRISHNAN, R., BEERI, C., AND KRISHNAMURTHY, R. Optimizing existential queries. In Seventh ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Austin, Tex., Mar. 21-23, 1988), 89-102.
15. REDDY, U. S. Transformation of logic programs into functional programs. In International Symposium on Logic Programming, IEEE, New York, 1984, 187-196.
16. SCIORE, E., AND WARREN, D. S. Towards an integrated database-Prolog system. In Proceedings of the First International Conference on Expert Database Systems (Menlo Park, Calif., 1986), 801-815.
17. ULLMAN, J. D. Implementation of logical query languages for databases. ACM Trans. Database Syst. 10, 3 (Sept. 1985), 289-321.
18. ULLMAN, J. D. Principles of Database and Knowledge-Base Systems. Computer Science Press, Rockville, Md., 1988.
19. WARREN, D. H. D. Efficient processing of interactive relational database queries expressed in logic. In Seventh International Conference on Very Large Data Bases (Cannes, Sept. 9-11, 1981), 272-281.

Received November 1986; revised November 1989; accepted March 1990
