<<

DO loop which is used is that the range of values assumed by the index variable is known upon entry to the loop. Thus, most but not all ALGOL for loops can be handled. Programming G. Manacher The analysis is performed from the standpoint of a Techniques Editor for a multiprocessor computer. Two general methods are described. The hyperplane method is The Parallel Execution applicable to both multiple instruction stream computers and single instruction stream computers such as the of DO Loops ILLIAC IV, the CDC STAR-100 and the Texas Instruments ASC. The coordinate method is applicable to single Leslie Lamport instruction stream computers. Both methods translate a Massachusetts Computer Associates, Inc. nest of DO loops into a form explicitly indicating the parallel execution. The DO loops may be of a fairly general nature. The major restrictions are that the loop body contain no I/o and no transfer of control to any statement outside the loop. Methods are developed for the parallel execution of These methods are basically quite simple, and can different iterations of a DO loop. Both asynchronous drastically reduce the execution time of the loop on a multiprocessor computers and array computers are parallel computer. They are currently being imple- considered. Practical application to the design of mented in the ILLIAC IV FORTRAN compiler. Preliminary for such computers is discussed. results indicate that they will yield parallel execution Key Words and Phrases: , for a fairly large class of programs. multiprocessor computers, array computers, vector The two methods are described separately in the computers, loops following two sections. The final section discusses CR Categories: 4.12, 5.24 some practical considerations for their implementation.

I. The Hyperplane Method

Example. To illustrate the hyperplane method, we Introduction consider the following loop. DO 99 1 = 1, L Any program using a significant amount of computer DO 99 J = 2, M time spends most of that time executing one or more DO 99 K = 2, N loops. For a large class of programs, these loops can be U(J,K) = (U(Jq-I,K) q- U(J, Kq-1) represented as FORTRAN DO loops. We consider meth- @ @ @ ods of executing these loops on a multiprocessor com- q- U(J--1,K) q- U(J,K--1)) • .25 puter, in which different processors independently execute different iterations of the loop at the same time. @ @ This approach was inspired by the ILLIAC IV since 99 CONTINUE (1) it is the only type of parallel computation which that (For future reference, we have assigned a name to computer can perform [1]. However, even for a com- each occurrence of the variable U, and written it in a puter with independent processors, it is inherently more circle beneath the occurrence.) This is a simplified efficient than the usual approach of having the processors version of a standard relaxation computation. work together on a single iteration of the loop. This is The loop body is executed L(M-- 1)(N-- 1) times-- because it requires much less communication between once for each point (I,J,K) in the index set ~ = { (i,j,k) : individual processors. 1

83 Communications February 1974 of Volume 17 the ACM Number 2 best a formidable task. It is impossible it L, M, and N Fig. 1. Computation of U(4,6) for l = 9. are not all known at compile time. Our approach is to try to execute the loop body Computed concurrently for all points (1,J,K) in a lying along a when ! = 8 plane. In particular, the hyperplane method will find that the body of loop (1) can be executed concurrently for all points (I,J,K) lying in the plane defined by • 21 9- J 9- K = constant. The constant is incremented 6 u • after each execution, until the loop body has been executed for all points in a. # To describe this more precisely, we need a means of Computed F" expressing concurrent computation. We use the state- when I = 9 ment DO 99 CONC FOR ALL (J,K) E 8 where 8 is a finite set of pairs of integers) It has the following meaning: Assign a separate processor to each element of 8. For each (j,k) E 8, the processor assigned to (j,k) is to set J = j, g = k and execute the state- ments following the DO CONC statement through statement 99. All processors are to run concurrently, completely independent of one another. No synchroniza- 2 4 tion is assumed. Execution is complete when all proc- essors have executed statement 99. Given loop (1), the hyperplane method chooses new index variables i, J,/~ related to/, J, K by i=2I+J+K J=I i.e. during the previous execution of the DO I loop, with /~ = r (2) 1 = 8. The values of U(3,6) and U(4,5) were calculated and the inverse relations during the current execution of the outer DO 1 loop, with I = 9. This is shown in Figure 1. I=J Now consider loop (3). At any time during its J=i-2J-E execution, U(p,q) is being computed concurrently for K = g7 . (2') up to half the elements of the array U. These computa- Loop (1) is then rewritten as tions involve many different values of 1. Figure 2 illus- trates the execution of the DO CONC for 7 = 27. The DO 99i= 6,2.L 9- Mg-N points (p,q) for which U(p,q) is being computed are DO 99 CONC FOR ALL (J,l{) E { (j, k) : marked with "x"s, and the value of I for the computa- 1

84 Communications February 1974 of Volume 17 the ACM Number 2 Fig. 2. Execution for [ = 27. Fig. 3. Execution for 7 = 28. q. / l q •• % % • %• % • • % • • % • ) , ..x .> 6" 8' • "X cP / X • X 8 X X g •% •••% % X • x • X • / ° X• • x %• %•• % ••%• • /O • a 6 • / • ~( X X 6 X • m

~• /0 • • %• • X • x / • X X • X • • ~, • • % •%•• % • 4, " / • X " x • '~ • X • X ° • % •

• X • x • / • X • X • X • • %

2 • ~ • x x X • X • X

I I I 0 I I I L i i i i i i i L 2 4 6 p 2 4 6 p

could then only be applied to the DO J/DO K loop. (A4) Any occurrence in the loop body of a generated The reader can check that applying the general method variable VAR is of the form VAR(e 1, ..., e'), described below to this loop reduces the number of where each e ~ is an expression not containing any sequential iterations from (M-- 1)(N-- I) to M q- N -- 3 generated variable. --still a significant reduction. Assumption (A3) could be replaced by the assump- tion that we know which data can be modified by a Notations and Assumptions To describe the general methods, we introduce some or function call. However, this would com- plicate the discussion. Assumption (A4) must be notation. We consider loops of the following form: strengthened to assure that the hyperplane method will DO 11 = l 1, u I work. This will be done below. We let Z denote the set of all integers, and Z ~ DO F = ln, u ~ denote the set of n-tuples of integers. For completeness, [ loop body I we define Z ° = {0}. The index set ~ of loop (4) is de- CONTINUE (4) fined to be the subset of Z" consisting of all values as- sumed by (11, . . ., I") during execution of the loop, so where l j and u j may be any integer-valued expressions, possibly involving 11, ..., I j-1. (Our use of superscripts = {(i 1, ..., i") : l 1 < i I ~ U 1, "'" }. and subscripts is in accord with the usual notation of Note that ~ may not be known at compile time. The tensor algebra.) We could allow arbitrary constant element (i 1, ..., i ") of ~ represents the execution of DO increments, but this would add many complicated the loop body for11 l" 1 ' ...,/~ = i n. details. We order the elements of Z" lexicographically in A variable which appears on the left-hand side of an the usual manner, with (2,9,13) < (3,-1,10) < assignment statement in the loop body is called a (3,0,0). For any elements P and Q of ~, the loop body generated variable. is executed for P before it is executed for Q if and only We make the following assumptions about the if P< Q. loop body. Addition and subtraction of elements of Z" are (AI) It contains no I/O statement. defined as usual by coordinate-wise addition and sub- (A2) It contains no transfer of control to any statement traction. Thus (3,-- 1,0) -F (2,2,4) = (5,1,4). We let 0 outside the loop. denote the element (0, 0, ..., 0). It is easy to see that (A3) It contains no subroutine or function call which for any P,Q C z n, we have P < Q if and only if can modify data. Q-P>O.

85 Communications February 1974 of Volume 17 the ACM Number 2 Rewriting the Loop Basic Considerations To generalize the rewriting procedure used in our Let VAR be a program array variable. An occurrence example, loop (4) will be rewritten in the form of VAR is any appearance of it in the loop body. If it DO a f = k I,/1 appears on the left-hand side of an assignment state- ment, the occurrence is called a generation; otherwise, DO a Jk = kk, k it is called a use. Thus, generations modify the values of DO ot CONC FOR ALL elements of the array, and uses do not. Consider the use u2 of the variable U in loop (1). (jk+l, ..., j,) E 3s~ ..... j~ During execution of the loop body for (i,j,k) E ~, [ loop body ] it references the array element U(j+ 1,k). We define the a CONTINUE (5) occurrence mapping Tu2:g ~ Z 2 by T~[(i,j,k)] = where Ssl ..... jk is a subset of Z "-k which may depend (j+ l,k). Similarly, if f is an occurrence of an r-dimen- upon the values of f, ..., jk. sional generated variable VAR in loop (4), then the To perform this rewriting, we will construct a one- occurrence mapping TI : g ~ Z r is defined so that f to-one mapping J : Z" ~ Z" of the form references the Tz(P) element of VAR during execution of the loop body for P E g. Assumption (A4) guarantees j[(/t,.., in)] = (~ astlj, . . . ~ aj,,i i) that this is a reasonable definition. ' ¢=x ' ~=1 (6) We are looking for a condition to assure that the = (j1,...,j,) rewritten loop (5) is equivalent to the given loop (4). for integers a/. (2) We then choose the k ~, t~ ~ and From our example, we can see that the significant con- $sl ..... j~ so that the index set ~ of loop (5) equals sideration is the sequence of references to array ele- J(~), and write the body of loop (5) so that its execu- ments. In loop (1), a value for U(5,6) is generated by tion for the point J(P) E ~ is equivalent to the execu- ul during execution of the loop body for (8,5,6) E ~. tion of the body of loop (4) for P E a. This value is used by u2 during the execution for (9,4,6). Define the mapping r :Z" ---~Z k by r[(I 1, ..., I")] = Therefore, when we change the order of executions in (j~,..., jk), so ~r(P) consists of the first k coordinates the rewriting, we must still have the execution for of J(P). It is then clear that for any points J(P), (8,5,6) precede the execution for (9,4,6). By statement J(Q) E ~, the execution of the body of loop (5) for (E) above, this means that r must satisfy 7r[(8,5,6)] < J(P) precedes the execution for J(Q) if and only if r[(9,4,6)]. Indeed, for our particular choice of r we It(P) < 7r(Q). If we consider loop (5) to be a reordering have ~-[(8,5,6)] = 27 < ~-[(9,4,6)] = 28. of the execution of loop (4), this statement is equivalent In general, let VAR be any variable. If a generation to the following. and a use of VAR both reference the same array ele- ment during execution of the loop, then the order of (E) For any P,Q E ~, the execution of the loop body the references must be preserved. In other words, iff for P precedes that for Q, in the new ordering of is a generation and g is a use of VAR, and TI(P) = executions, if and only if ~-(P) < 7r(Q). To(Q) for some points P, Q E a, then: (i) if P < Q, The loop body is executed concurrently for alk we must have ~r(P) < lr(Q); and (ii) if Q < P, we elements of ~ lying on a set of the form {P : r(P) = must have ~r(Q) < 7r(P). In the above example, constant E Zk}. Since J is assumed to be a one-to- T~I [(8,5,6)] = T,~[(9,4,6)] = (5,6), and (8,5,6) < (9,4,6), one linear mapping, these sets are parallel (n-k)- so we must have ~r[(8,5,6)] < ~r[(9,4,6)]. Note that if P dimensional planes in Z". ¢3) We thus have concurrent = Q, then the order of execution of the references will execution of the loop body along (n-k)-dimensional automatically be preserved since they happen during planes through the index set. For k = 1, these are a single execution of the loop body. hyperplanes. We use the name "hyperplane method" The above rule should also apply to any two genera- to also include the case k > 1. tions of a variable. This guarantees that the variable In our example, we had n = 3 and k -- 1. The map- has the correct values after the loop is run. It also ensures that a use will always obtain the value assigned ping J :Z 3 ---4 Z 3 is defined by J[(i,j,k)] = (2i+j+k,i,k), and ~- :Z 3 ---~Z is defined by ~r[(i,j,k)] = 2i +j + k. by the correct generation. The general problem is to find a mapping J for which These remarks can be combined into the following loop (5) gives an algorithm equivalent to that of loop basic rule. (4). By requiring that J be a linear mapping, we have (C1) For every variable, and every ordered pair of greatly restricted the class of mappings which are to be occurrencesf,g of that variable, at least one of which examined. It is this restriction which makes the analysis is a generation: if TI(P) = Ta(Q) for P,Q E ~ with feasible. P < Q, then ~r must satisfy the relation 7r(P) < J is one-to-one if and only if (6) can be-solved to write the ~'(a). I s. as linear expressions in the J~ with integer coefficients. 3We consider Z ~ to be a subset of ordinary Euclidean n-space Notice that the case Q < P is obtained by interchanging in the obvious way. f and g.

86 Communications February 1974 of Volume 17 the ACM Number 2 Table I. (C2) For every variable, and every ordered pair of Elements Sets Constraints occurrencesf, g of that variable, at least one of which >0 is a generation: for every X E (f,g) with X > 0, (ul,ul) = (*,0,0) (+,0,0) a~ > 0 7r must satisfy ~'(X) > 0. (ul,u2) = (%--1,0) (+,-1,0) al - a2 > 0 (u2,ul) = (*,1,0) (+,1,0) al + a2 > 0 Any ~r satisfying (C2) also satisfies (C1). Hence, (0,1,0) a2 > 0 rule (C2) gives a sufficient condition for loop (5) to be (ul,u3) = (*,0,--1) (+,0,-1) al - a3 > 0 equivalent to loop (4). Moreover, (C2) is independent (u3,ul) = (*,0,1) (+,0,1) al + a~ > 0 (0,0,1) a3 > 0 of the index set ~. (ul,u4) = (,,1,0) same as (u2,ul) Each condition 7r(X) > 0 given by rule C2 is a (u4,ul) = (*,-1,0) same as (ul,u2) constraint on our choice of 7r. If 7r satisfies all these (ul,uS) = (*,0,1) same as (u3,ul) constraints, then loop (5) is equivalent to loop (4). (u5,ul) = (*,0,-1) same as (ul,u3) Table I lists the constraints on ~r for loop (1). In this case ~r :Z 3 ---~Z is of the form rr[(i,j,k)] = a~i + a2j + a3k, and Table I gives the constraints which must be Rule (Cl) ensures that the new ordering of execu- satisfied by ax, a2, and a3. For example, the set of ele- tions of the loop body preserves all relevant orderings ments > 0in (ul, u2) is (+,--1,0) = {(x,--l,0) : of variable references. The orderings not necessarily x> 0}. The requirement that ~-[(x,-- 1,0)] > 0 for each preserved are those between references to different array x > 0 yields the constraint a~ -- a2 > 0. Our choice of elements, and between two uses. Changing just these al = 2, a2 = as = 1 satisfies all these constraints. orderings cannot change the value of anything com- Therefore, loop (3) is equivalent to loop (1). puted by the loop. The assumptions we have made about the loop body, especially the assumption that it contains no premature exit from the loop, therefore Computing the Sets (f,g) imply that rule (C1) gives a sufficient condition for In order to guarantee that we find a mapping ~r loop (5) to be equivalent to loop (4). For most loops, which satisfies (C2), some further restriction must be (C1) is also a necessary condition. made on the forms of variable occurrences allowed in the loop body. We make the following assumption.

The Sets (f, g) (A5) Each occurrence of a generated variable FAR in The trouble with rule (C1) is that it requires us to the loop body is of the form consider many pairs of points P,Q in ~. For the loop (1), there are (L--1) (M-1) (N-1) pairs of elements VAn (Iq +ml, . . . , Iir'Jvmr), (7) P,Q E ~ with T,I(P) = T,,.,(Q) and P < Q. However, where the m ~ are integer constants, and jl,... ,jr are T~I(P) = T~,~(Q) only if Q = P -I- (,,-1,0), where • r distinct integers between 1 and n. Moreover, the denotes any integer. We would like to be able to work j~ are thesame for any two occurrences of VAR. with the single descriptor (.,--1, 0) rather than all the pairs P,Q. Thus, if a generation A(I2--1,I1,14+1) appears in This suggests the following definition. For any the loop body, then the occurrence A(f+l,IX+6,14) occurrence f,g of a generated variable in loop (4), may also appear. However, the occurrence A(I 1- 1,I2,/4) define the subset (f, g) ofZ ~ by (f, g) = {X : TI(P) = may not. To(P+X) for some P E Z"}. Observe that (f,g) is It is possible to generalize our results to the case of independent of the index set ~. In our example, occurrences of the form VAR(e ~, . . . , e r) in which e ~ is (ul, u2) = {(x,--1,0) :x E Z}, and we denote this set any linear function of fl, ..., F. However, the results by (.,-- 1,0). The other sets (f,g) of loop (1) which we become weaker and much more complicated. will use are listed in Table I. Now let f be the occurrence (7) and let g be We now rewrite rule (C1) in terms of the sets (f,g). the occurrence VAR(I j' +n 1, . . . ,I jr-'}-n'). Then First, note that 7r(P+X) = 7r(P) + 7r(X), since we Ti[(p~,..., p")] = (pJI+m~,..., pJr+rnr), and have assumed ~- to be a linear mapping. (Recall the To[(p 1, . . ., p")] = (pJ'+n 1, . . ., pJ'+nr). It is easy to definition of 7r, and formula (6).) Also, remember that see from the definition that (f,g) is the set of all ele- A < A 9- Bifand only if B > 0. Then just substi- ments of Z" whose jkth coordinate is m k -- n k, for tuting P 9- X for Q in rule (Cl) yields this rule. k = 1,..., r, and whose remaining n -- r coordinates are any integers. (CI') For... generation: if Tf(P) = To(P+X) for As an example, suppose n = 5 andf, g are the oc- P,P q- XC ~ with X > 0, then 7r must satisfy currences VAR(I3+ I,f + 5,IS), VAR(I~--}-I,I2,IS--k l). the relation ~-(X) > 0. Then (f,g) is the set {(x,5,0,y,--1) :x, y E Z}, which Removing the clause "for P,P 9- X E ~" from (CI') we denote by (,,5,0,.,--1). gives a stronger condition for ~- to satisfy. Doing The index variable I j is said to be missing from VAR this and using the definition of (f,g) then gives the if fl is not one of the I jk in (7). It is clear that I j is miss- following more stringent rule. ing from VAR if and only if the set (f,g) has an • in

87 Communications February 1974 of Volume 17 the ACM Number 2 the jth coordinate, for any occurrences f,g of VAR. set 9, the construction of r and .1 given above is also We call f a missing index variable if it is missing from independent of 9. This completes the proof.[] some generated variable in the loop. Observe that the theorem is trivially true without the restriction that J be independent of a, because The Hyperplane Theorem given any set 9 we can construct a J for which the sets The following result is an important special case of Ss2, .... s- contain at most one element, and the order a more general result which will be given later. 4 The of execution of the loop body is unchanged. For proof contains an algorithm for constructing a map- example, if 9 = {(x,y,z) :1 ~ x < 10, 1 ~ y < 5, ping 7r which satisfies (C2). The reader can check that 1 < z < 7}, let J[(x,y,z)] = (35x+7y+z, x, y). Such it gives the ~- which we used for loop (1). As we will see a J is clearly of no interest. However, because the in loop (11) below, the algorithm sometimes works mapping J provided by the theorem depends only on even if the hypothesis of the theorem is not satisfied. the loop body, it will always give real concurrent execu- HYPERPLANE CONCURRENCY THEOREM. Assume that tion for a large enough index set. loop (4) satisfies (A1)-(A5), and that none of the index Condition (C2) gives a set of constraints on the variables 12, ..., I ~ is a missing variable. Then it can be mapping ~- :Z ~ --~ Z. The Hyperplane Theorem proves rewritten in the form of loop (5)for k = 1. Moreover, the existence of a ~- satisfying those constraints. We now the mapping J used for the rewriting can be chosen to be consider the problem of making an optimal choice of rr. independent of the index set g. It seems most reasonable to minimize the number PROOF. We will first construct a mapping 7r : Z n --+ Z of steps in the outer DO j1 loop of (5). (Remember that which satisfies rule (C2). Let 6, be the set consisting of k = 1.) If a sufficiently large number of processors are all the elements X > 0 of all the sets (f,g) referred available, then this gives the maximum amount of con- to in (C2). We must construct ~- so that ~-(X) > 0 for current computation. This means that we must minimize all X C 6,. ill -- ~kl in loop (5). But X1 and 1 are just the upper and Let "+" denote any positive integer, so lower bounds of {r(P) : P ~ 9}. Setting M ~ = u ~ -- l ~, (+, x2,..., x ~) is any element of Z" of the form it is easy to see that 1 _ Xl equals (x, x2,..., x ") with x > 0. Since 11 is the only index M1]al[ + ... + Mnla~f, (9) variable which may be missing, we can write 6, = {)(1,...,XN}, where X~ = (Xr 1,...,x~"), or X~ = where the at are defined by (8). Finding an optimal r (+, Xr2, . .., X~~) for some integers x/. is thus reduced to the following integer programming The mapping ~r is defined by problem: find integers al,..., a, satisfying the con- 7r[(i1,..., /n)] = a~I1 -t- "'" -t- a,1 ~ (8) straint inequalities given by rule (C2), which minimize the expression (9). for nonnegative integers a~, to be chosen below. Since Observe that the greatest common divisor of the 2 n al _> 0, ~-[(1, x~-, . . . , x~ )] > 0 implies ~r[(x, Xr2,..., resulting a~ must be I. This follows because the con- x~n)] > 0 for any integer x > 0. Therefore, each X~ of straints are of the form xla~ +... + xna~ > 0, so the form (+, x~2,..., x~") can be replaced by X~ = dividing the a~ by their g.c.d, gives new values of a~ (1, x~ 2, ..., x~"), and it is sufficient to construct ~- such satisfying the constraints, with a smaller value for (9). that ~r(X~) > 0 for each r = 1, . . ., N. Hence, the method of [4] can be applied to finding the Define 6,j = {X~ :x~ 1 ..... x{ -1 = 0, x/ # 0}, mapping J. so 6,i is the set of all X~ whose jth coordinate is the left- Although the above integer programming problem most nonzero one. Then each X, is an element of some is algorithmically solvable, we know of no practical 6,3.. method of finding a solution in the general case. How- Now construct the a~. sequentially for j = n, n -- 1, ever, the construction used in proving the Hyperplane •.., 1 as follows. Let a; be the smallest nonnegative Theorem should provide a good choice of 7r. In fact, integer such that ajx/ + ... -4- a, xf ~ > 0 for each for most reasonable loops such as loop (1), it actually X~ = (0,..., 0, x/,..., x~n) ~ 6,j. Since X, > 0 gives the optimal solution. and x/ ~ 0 imply x, j > 0, this is possible. Clearly, we have ~r(X~) > 0 for all X~ C 6,~'. But The General Plane Theorem each X~ is in some 6'y, so r(X~) > 0 for each r = 1, . . ., We now generalize the Hyperplane Theorem to N. Thus, 7r satisfies rule (C2). Observe that the first cover the case when some of the index variables nonzero aj that was chosen must equal 1, so 1 is the I2,..., I n are missing. Concurrent execution is then greatest common divisor of the as. (If all the aj are possible for the points in 9 lying along parallel planes. zero, then 6' must be empty, so we can let ~r[(f,..., Each missing variable may lower the dimension of the 1")] = 11.) A classical number theoretic calculation, planes by one. The following theorem may be viewed described in [4, p. 31], then gives a one-to-one linear as a generalization of a result stated in [5, p. 584]. mapping J:Z" ~ Z" such that J[(f,..., F)] = 4 A weaker version of this result can be found in [3]. 0r[(l~,..., 1,)],...).(5) 5If a j = 1, then we can define Jas follows: for each k > 2, Since the sets (fig) are independent of the index let jk equal some distinct Itk with/k ¢ j.

88 Communications February 1974 of Volume 17 the ACM Number 2 PLANE CONCURRENCY THEOREM. Assume that loop For completeness, we state a sufficient condition (4) satisfies (A1)-(A5) and that at most k -- 1 of the for the loop body to be concurrently executable for all index vartables• I,...2 , i n are missing. Then loop (4) can points in o--i.e, to be able to rewrite loop (4) with a be rewritten in the forrn &loop (5). Moreover, the map- DO a CONC FOR ALL (11 , . . . , I n) C S ping J used for the rewriting can be chosen to be inde- statement• This involves setting J equal to the identity pendent of the index set ~. mapping, k = 0, and 7r : Z" ~ Z ° the mapping defined PROOF. The proof is a generalization of the proof of by~r(P) = 0forallPC Z". Since (g,f) = {--X:XC the Hyperplane Theorem. Let I s2, ..., Iik be the pos- (fig) }, it is clear that this ~- satisfies (C2) if and only if sibly missing variables among I2,..., F. Set jl = all the sets (fig) are equal to {0}. The method of 1,jk+l = n + 1, and assume jl < j2 < • - • < j, < j,+l • computing these sets then gives the following rather Let 6, be the set of all elements X > 0 of all sets obvious result• (fig) referred to by rule (C2). We must construct ~- so If loop (4) satisfies (A1)-(AS), none of the index that 7r(X) > 0 for all X ~ 6,. Let 6,j = {(0,..., variables are missing, and all occurrences of any 0, xJ,..., x ") C 6):x j > 0}, so 6,i is the set of all generated variable are identical, then the loop can elements of 6' whose jth coordinate is the left-most be rewritten as: nonzero one. Then every element of 6" is in one of DO a CONC FOR ALL (f , . . . , I") C S the 6"j.. The mapping ~- :Z" ---+ Z k will be constructed with The hypothesis means that in the expression (6) for 7r(P) = (Trl(p), . . . , ~rk(P)), where each 7r i :Z" --+ Z is each generated variable VAR, r = n and the m ~ are the defined by ~J[(f,..., I")] = a~I ~ + ... + a,~I ~ for same for all occurrences of VAR. nonnegative integers aj . Moreover, we will have a/= 0 ifj < j~ or j > j~+l • This implies that if X C 6"j" and and j > j~+l, then TrY(X) = 0. It therefore suffices to II. The Coordinate Method construct ~-~ so that for each j with ji <_ j < ji+l, and each X C 6)3' : TrY(X) > 0--for we then have ~-(X) = Example• We illustrate the coordinate method with (o,..., 0, #(x),..., ~(x)) > 0. the following loop. Recall that for the sets (f,g), an • can appear only in the jl, • • •, jk coordinates. Thus any element of any DO 24 1 = 2, M of the sets 6'3' with ji ~ j < ji+l can be represented in DO 24 J = l, N the form (0,..., 0, x~,...,Ji X{i+1-1 ,...), or 21 A(L J) = B(I, J) + C(I) ji+l ". (0,...,0, +, x, ,.. . . , x~ '+'-1, . . .) for afinitecol- ® ® @ lection of integers x/, j~ _< j < ji+l • By the same argu- 22 c(I) = B(I- I,J) ment used in the proof of the Hyperplane Theorem, @ ® we can replace "+" by x'~~ = 1, and choose a/ >_ 0, j~0 23 B(I, J) = A(I + 1, J) ** 2 for each r. Choosing a/ = 0 for j < jl and j > j~+t ® @ completes the construction of the required ~-~. 24 CONTINUE (11) The construction of [4] is then applied to give invertable relations of the form J~• = a~i . I i~ -t- ..- + The hyperplane method would rewrite this as a DO i/ a~+~_a.Ii Ji+l--1 , and J*~ = z...,*=s~'~Ji+l--1 b~.I] r , for ji < j < DO CONC Jloop with] = I-k- J, and J = J. (Al- i~+,. Combining these and reordering the JJ gives the though J is a missing variable, so the hypothesis of the required mapping J.[] hyperplane theorem is not satisfied, the algorithm used As in the hyperplane case, to get an optimal solu- in the proof still gives a ~- satisfying (C2).) The rewritten tion, we want to minimize the number of iterations of loop has M + N -- 2 sequential iterations. the outer DO loops. This means minimizing (u~--;~+ l) For a synchronous, single instruction stream com- • .. (p~--~,~+l). It is easy to verify that if none of the puter like the ILLIAC IV, we can do better than this by expressions l ~, u ~ involve any index variable, then this using the coordinate method. To express synchronous number is equal to (M1 [ a~ll + ... + M~[a~l[ + 1) parallel execution, we introduce the DO SIM (for • .. (M~[a~*[ +-.. + M"[a, *] + 1), where M ~ = SIMultaneously) statement having the following form. u ~ -- l ~, and the a/are defined by (6)• DO c~ SIM FOR ALL C $, Finding the best a/is now an integer programming problem. Note that a solution with at ..... a, = 0 where 8 is a finite set of integers• Its meaning is similar for some i gives a solution to the rewriting problem to that of the DO CONC statement, except that the with k replaced by k -- 1, since that 7r~ can be removed computation is performed synchronously by the indi- without affecting the constraint inequalities• The Plane vidual processors• Each element of S is assigned to a Concurrency Theorem proves the existence of a separate processor, and each statement in the range of ~r : Z" --+ Z ~ satisfying (C2), for a particular value of k. the DO SIM is, in turn, simultaneously executed by all It may be possible to find such a ~r for a smaller k. the processors. An assignment statement is executed by

89 Communications February 1974 of Volume 17 the ACM Number 2 first computing the right-hand side, then simultaneously conditional assignment statement such as performing the assignment. IF (A(I~).GT.O) B(I ~) == A(I 1) The coordinate method does not introduce new It is easily implemented on the ILLIAC IV by turning off index variables. It will rewrite loop (11) as individual processors. The real assumption in (A7) is DO 24 J = 1, N that there are no loops within the loop body. In that DO 24SIMFORALLIE {i:2 < i < M} case, conditional branches can be removed by adding TEMP(I) = A(I + 1, J) 1F clauses. @ We are not making assumption(A5). Some restric- 21 A(/, J) = ~(i, ,I) + c(i) tions on subscript expressions must be made by a real ® @ @ compiler to permit computation of the (f,g) sets. We will not consider this problem. 23 B(I, J) = TEMP(I) ** 2 To simplify the discussion, we assume that each d ~ ® equals 1, and that the expressions l ~ and u ~ do not con- 22 c(i) = B(I- 1, J) tain any of the index variables I j. The modifications @ @ necessary for the general case are described later. 24 CONTINUE (12) The coordinate method will rewrite loop (13) as

Observe that the DO SIM must be executed by syn- DO a I j~ = l i~, u j' chronous processors. Processor i must generate the DO a I ~ = l jk, u j~ value for B(i, j) in statement 23 before processor i + 1 uses it in statement 22. We also see from this that it DO ct SIM FOR ALL (1~k+l, . . . , ff~) C $ was necessary to rearrange the loop body in writing I loop body ] loop (12) in order to obtain a loop equivalent to the a CONTINUE (14) original one. where jl < "" < jk and 8 is the set {(x k+l, ..., x n) : Loop (12) requires only N sequential iterations, 1~'' _< x' < uS'}. instead of the M + N -- 2 required by the hyperplane The mapping r : Z ~ ~ Z ~ is defined as before. method. Moreover, the change of index variables in the However, now it is the simple mapping r[(i 1, . . . , i n) ] = hyperplane method produces more complicated sub- (iJl,..., iJ~). In other words, r just deletes the script expressions, significantly increasing the time jk+t, • • •, jn coordinates. In our example, ~r was defined needed for a single execution of the loop body. By by ~r[(i,j)] = j. using the original index variables, the coordinate method eliminates this source of inefficiency. However, Basic Considerations there are sortie loops, such as loop (1), which cannot Any DO CONC statement can be executed as a be rewritten with the coordinate method. These loops DO SIM, since it must give the same result if the asyn- require the hyperplane method. chronous processors happen to be synchronized. Thus, the rewriting could be done just as before by trying to Assumptions and Notation find a r which satisfies (C2). However, the synchrony In general, we consider a loop of the form of the computation allows us to weaken the condition DO a 11 = 11, u I, d I (C2). Recall that rule (C1) was made so that the rewriting DO a I" = l ~, u", d" will preserve the order in which two different references ! loop body I are made to the same array element. For references ct CONTINUE (13) made during two different executions of the loop body, the asynchrony of the processors requires that the We assume that the loop body satisfies assumptions order of those executions be preserved. However, with (A1)-(A4). In addition, we make the following assump- synchronous processors, we can allow the two loop body tions: executions to be done simultaneously if the references (A6) Each d ~ is an integer constant. will then be made in the correct order. The order of these two references is determined by the positions (A7) There is no conditional transfer of control within the loop body. within the loop body of the occurrences which do the referencing. Assumption (A7) prohibits a statement such as First, assume that we do not change the loop body. For two occurrences f and g, let f-* g denote that the IF (A(I1).GT.O) GO TO 9 execution of f precedes the execution of g within the in the loop body. Such a statement would be me~tning- loop body. This means either that the statement con- less inside a DO SIM 11 loop, since all processors must taining f precedes the statement containing g, or that f execute the same statement. However, we do allow a is a use and g a generation in the same statement. The

90 Communications February 1974 of Volume 17 the ACM Number 2 Table I1. loop body, we now have to make sure that any two Is (s1 (i)) Ordering relations references to the same array element made during a The sets (f,g) violated? $1 (ii) $2 single execution of the loop body are still made in the (al,al) = (0,0) NO -- -- correct order. The following analogue of rule (C1) (al,a2) = (--1,0) NO -- -- (a2,al) = (1,0) NO a2 ---* al -- handles this. (b3,b3) = (0,0) NO -- -- For ... generation: if Ts(P) = To(P) for some (bl,b3) = (o,o) NO -- bl ~ b3 (b3,bl) = (0,0) NO -- -- P E ~ and f precedes g in the original loop body, (b2,b3) = (--1,0) NO -- -- then f---* g. (b3,b2) = (1,0) NO b3 ---* b2 -- (cl,cl) = (0,*) NO -- -- Rewriting this in terms of the sets (fig) gives the follow- (c1,c2) = (0,*) NO -- cl --o c2 ing rule. (c2,cl) = (0,*) NO -- -- ($2) For every variable, and every ordered pair of occurrences f,g of that variable, at least one of which is a generation: if 0 E (f,g) and f precedes g in the original loop body, then f--+ g. Rules (S1) and ($2) guarantee that the rewritten loop (14) is equivalent to the original loop (13). Note that rule ($2) does not involve Ir. above observation allows us to change rule (C1) to the following weaker condition on ~-. The Coordinate Algorithm For... generation: if Ts(P) = To (P) for P, Q E (S 1) and ($2) together give a sufficient condition for with P < Q, then we must have either a particular rewriting to be equivalent to the original (i) ~-(P) < r(O), or loop. Rule (Sl(i)) gives a condition which must be (ii) r(P) = r(Q) andf---~ g. satisfied by ~-. Rules (Sl(ii)) and ($2) specify ordering relations among the occurrences in the rewritten loop In this rule, either (i) or (ii) is sufficient to ensure that body. However, they do not indicate whether it is occurrencef references the array element Ts(P) for the possible to rewrite the loop body so that these relations point P E a before g references the same array element are satisfied. We now give a method for deciding if for Q E ~. The conditions can be rewritten in the fol- such a rewriting exists. lowing equivalent form: First, we make a trivial observation: a use in an (i) r(P) _< r(O), and (ii)/fr(P) = r(Q) thenf----~ g. assignment statement must precede the generation in that statement. This observation is given the status of In the same way that (C2) was obtained from (C1), a rule. the above rule gives the following rule. ($3) For any use f and generation g in a single state- (S1) For every variable and every ordered pair of ment, we must havef--~ g. occurrencesf,g of that variable, at least one of which is a generation: for every X E (f,g) with X > 0, Now let ~ denote the relations given by rules (S1)- we must have ($3). Add all relations implied by transitivity. That is, (i) ~r(X) > 0, and whenever f ~ g and g ~ h, add the relation f--~ h. (ii) if re(X) = O, then f ---~ g. (An efficient algorithm for doing this is given by [6].) If the resulting ordering relations are consistent--that If r satisfies rule (S1), then it satisfies the preceding is, if we do not havef--~f for any occurrence f--then rule, so the rewritten loop (14) is equivalent to the the loop body can be rewritten to satisfy the ordering original loop (13). relations. So far, our discussion has assumed that we have To show how the rewriting is actually done, we de- not changed the loop body. Now let us consider chang- scribe the application of the coordinate method to loop ing the order of execution of the occurrences. That is, (11). The calculations for Steps 1, 3, and 4 are shown we may change the position of occurrences within the in Table II. loop body, as we did in writing loop (12). (There was no point in doing this for asynchronous processors Step 1. Compute the relevant sets (f,g) for rules (SI) since it couldn't help.) and ($2). Let f ~ g mean that f is executed before g in the Step 2. Choose the DO SIM variables. We wish to rewritten loop body. Then rule (S1) guarantees that the rewrite loop (I1) as a DO J/DO SIM 1 loop, so correct temporal ordering of references is maintained the mapping ~- is defined by ~r[(i,j)] = j. when the references were made in the original loop Step 3. Check that (Sl(i)) is not violated. during different executions of the loop body. Having Step 4. Find the ordering relations given by (Sl(ii)) changed the positions of occurrences in rewriting the and ($2).

91 Communications February 1974 of Volume 17 the ACM Number 2 Step 5. Apply ($3) to get the following relations: In general, there are 2" - 1 choices for the DO statement 21 : bl -~ al SIM variables in Step 2. Steps 3-11 are repeated for cl --* al different choices until a suitable one is found. Rule statement 22: b2 --~ c2 (S1) should quickly eliminate many possibilities. In our statement 23: a2 -~ c3 example, the choice of a DO I/DO SIM J rewriting is Step 6. Find all relations implied by transitivity: eliminated by the relation cl ~ cl given by (Sl(ii)). b3 --* c2 [by b3 --* b2 and b2 -~ c2] One can also show that loop (13) can be rewritten a2 --* b2 [by a2 --* b3 and b3 -~ b2] with a DO SIM (I jk, ..., P") only if it can be re- bl --* b2 [by bl --* b3 and b3 --. b2] written with a DO SIM (I ~+1, ..., Ii"), where bl --* c2 [by bl -~ b2 and b2 --* c2] jk < "'" < j,. Thus, eliminating DO I/DO SIM J a2 --~ c2 [by a2 --~ b3 and b3 ~ c2] for loop (11) also eliminates the possibility of a DO Step 7. Check that no relation of the form f --* f was SIM (I,J) rewriting. If no choice of DO SIM variables found in Step 4 or Step 6. works, then the hyperplane method must be tried. Step 8. Order the generations in any way which is con- To handle arbitrary DO increments d ~, one need sistent with the above relations--i.e, obeying only generalize the definition of the set (f,g) as fol- b3 --* c2. We let al --* b3 --* c2. We then write: lows: (f,g) = {(xl,...,x") C Z":Ts(P) = To[P + 21 A(I, J) = (dlx 1, ..., dnxn)], for some P C Z"}. The rules (S1)- @ ($3) and the algorithm described above remain the same. 23 B(I, J) = For arbitrary DO limits if, u s we proceed as follows. @ For each i: if the expression 1s contains some I t, then 22 C(I)= replace I ~ by the new index variable i ~ = I s -- l s. Steps @ 1-10 are then executed as before. In Step 11, a more Step 9. Insert the uses in positions implied by the complicated procedure is needed to find the DO limits ordering relations (recall that a2 ~ al)" and DO SIM set for the rewritten loop. A(I + 1, J) @ 21 A(I, J) = B(I, J) + C(I) Ill. Practical Considerations @ @ @ 23 B(I, J) = **2 Satisfying the Assumptions @ Our analysis required several assumptions about the 22 C(I) = B(I-- 1, J) given loop. If a loop does not satisfy these assumptions, @ @ then it may still be possible to rewrite it so that it does. We have already indicated that assumption (A7) can Step 10. Add any extra variables necessitated by uses be met by replacing conditional transfers with IF no longer appearing in their original statements: clauses. We now describe some other useful techniques. TEMP(I) = A(i + l, J) Our first assumption was that the DOs are tightly @ nested, as in loop (4) ; i.e. we did not allow loops such as 21 A(I, J) = B(I, J) + C(I) @ @ @ D099I= 1, M 21 A(I, 1) = 0 23 B(I, J) = TEMP(I) ** 2 @ DO 99 J = 2, N 22 C(I) = B(I-- 1, J) It is easy to rewrite this as the following tightly nested @ @ loop: Step 11. Insert the DO and DO SIM statements, to get loop (12). D099I= 1, M DO 99 J = 2, N 21 IF (J .EQ. 2) A(I,J--1) = 0 Further Remarks It is easy to deduce a general algorithm for the coordinate method from the preceding example. The This method works in general. It may be possible later method can be extended to cover the case of an in- to move statement 21 back outside the J loop and consistent ordering of the occurrences. In that case, remove the IF clause. A future paper will describe the loop can be broken into a sequence of sub-loops. methods of handling nontightly nested loops without Every generation g for which the relation g --~ g does using this artifice [8]. not hold can be executed within a DO SIM loop. An Assumption (A4) can sometimes be satisfied by sub- algorithm for doing this is described in [7]. stituting for generated variables. One technique is illus-

92 Communications February 1974 of Volume 1"/ the ACM Number 2 trated by the following example. Given Consider the loop K=N D02I= 1, N D061= 1, N A(2,I--1) = B(I) 5 B(I) = A(K) 2 A(2,I) = C(I) 6 K=K--1 we can rewrite it as The coordinate method can rewrite this as a DO SIM D051I= I,N I loop. However, to execute this DO SIM loop in 5 B(I) = A(N-k-I--I) parallel on the ILLIAC IV requires a peculiar method of 51 CONTINUE storing the arrays. This storage scheme would probably 61 K= 1 be incompatible with the requirements of the rest of The use of auxiliary variables to effect negative incre- the program. menting is fairly common in FORTRAN programs. In general, the computer's data accessing mechanism will limit the forms of variable occurrences which may Scalar Variables appear in the loop. It may also limit the utility of the Even though the loop satisfies all the restrictions, hyperplane method. For example, implementation of it is clear that these methods can give no parallel com- the hyperplane method on the STAR-100 requires dy- putation if there are generated scalar variables. Any namic reformating of the arrays. such variable must be eliminated? We have allowed a conditional assignment state- Often, the variable simply acts as a temporary ment such as storage word within a single execution of the loop body. IF (A(I).GT.O)B(1) = .,4(1) The variable X in the following loop is an example. inside a DO SIM I loop. This is easily implemented on D031= 1, 10 the ILLIA¢ IV and with vector operations on the STAR-100. X = SQRT(A(I)) However, it cannot be implemented with the ASC vector B(I) = X operations. 3 C(1) = EXP(X) Other computer designs will require different re- In this loop, each occurrence of X can be replaced by strictions on the loops. However, our methods seem XX(I), where XXis a new variable. sufficiently general to be applicable to any parallel In general, we want to replace each occurrence of computer to be built in the near future. the scalar by VAR (i1,... ,I"), for a new variable VAR. (After the rewriting, to save space, we can lower Conclusion the dimension of VAR by eliminating any subscript We have presented methods for obtaining parallel not containing a DO FOR ALL index variable.) A execution of a DO loop nest. A number of details and simple analysis of the loop body's flow path determines refinements were omitted for simplicity. Some of these if this is possible. are described in [7]. However, all the basic ideas neces- Another common situation is for the variable X sary for their implementation have been included. to appear in the loop body only in the statement Preliminary experience with the ILLIAC IV FORTRAN X = X -b expression, where the expression does not compiler indicates that these methods can be used to involve X. This statement just forms the sum of the obtain parallel execution for a fairly large class of expression for all points in the index set ~. We can re- sequential programs. place it by the statement lIAR (11, ..., /") = expres- sion, and add the following "statement" after the Received February 1972; revised January 1973 loop: X = X q- ~'~(? ..... I")~ VAR (I 1, ..., I"). The References sum can be executed in parallel with a special sub- 1. Mclntyre, David. An introduction to the ILLIAC-IV routine. computer. Datamation 16, 4 (Apr. 1970), 60-67. The same approach applies when the variable is 2. Ramamoorthy, C.V., and Gonzalez, M.J. A survey of techniques for recognizing parallel processable streams in used in a similar way to compute the maximum or computer programs. Proc. AFIPS 1969 FJCC, Vol. 35. AFIPS minimum value of an expression for all points in a. Press, Montvale, N. J. pp. 1-15. 3. Muroaka, Yoishi. Parallelism exposure and exploitation in programs. Ph.D. Th., U. of Illinois, Urbana, II1., 1971. Practical Restrictions 4. Mordell,L.J. Diophantine Equations. Academic Press, New The methods we have described yield parallelism York, 1969. in the form of DO CONC or DO SIM loops. In order 5. Karp, R.M., Miller, R.E., and Winograd, S. The Organization of computations for uniform recurrence equations. J. ACM 14, for them to be of use in a real compiler, the particular 3 (July 1967), 563-590. target computer must be capable of' efficiently executing 6. Warshall, Stephen. A theorem on Boolean matrices. J. ACM these loops in parallel. The structure of the computer 9, 1 (Jan. 1962), 11-12. 7. Lamport, Leslie, and Presberg, David. The parallel execution of will place additional restrictions on the loops which FORTRAN DO loops. Mass. Computer Associates, Inc., AD the compiler can handle. 742-279. Wakefield, Mass. 1971. 8. Lamport, Leslie. The coordinate method for the parallel exe- s In our formalism, a scalar is a zero-dimensional array. Each cution of DO loops. To appear in Proc. 1973 Sagamore Comput. ~,g) set for a scalar variable equals all of Z ~. Conf.

93 Communications February 1974 of Volume 17 the ACM Number 2