The Parallel Execution of DO Loops

Programming Techniques
G. Manacher, Editor

Leslie Lamport
Massachusetts Computer Associates, Inc.

Methods are developed for the parallel execution of different iterations of a DO loop. Both asynchronous multiprocessor computers and array computers are considered. Practical application to the design of compilers for such computers is discussed.

Key Words and Phrases: parallel computing, multiprocessor computers, array computers, vector computers, loops

CR Categories: 4.12, 5.24

Communications of the ACM, February 1974, Volume 17, Number 2

Copyright © 1974, Association for Computing Machinery, Inc. General permission to republish, but not for profit, all or part of this material is granted provided that ACM's copyright notice is given and that reference is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery.

This research was supported by the Advanced Research Projects Agency of the Department of Defense and was monitored by Army Research Office-Durham under Contract No. DAHC04-70-C-0023. Author's address: Massachusetts Computer Associates, Inc., Lakeside Office Park, Wakefield, MA 01880.

Introduction

Any program using a significant amount of computer time spends most of that time executing one or more loops. For a large class of programs, these loops can be represented as FORTRAN DO loops. We consider methods of executing these loops on a multiprocessor computer, in which different processors independently execute different iterations of the loop at the same time.

This approach was inspired by the ILLIAC IV since it is the only type of parallel computation which that computer can perform [1]. However, even for a computer with independent processors, it is inherently more efficient than the usual approach of having the processors work together on a single iteration of the loop. This is because it requires much less communication between individual processors.

The methods presented are, of course, independent of the syntax of FORTRAN. The basic feature of the FORTRAN DO loop which is used is that the range of values assumed by the index variable is known upon entry to the loop. Thus, most but not all ALGOL for loops can be handled.

The analysis is performed from the standpoint of a compiler for a multiprocessor computer. Two general methods are described. The hyperplane method is applicable to both multiple instruction stream computers and single instruction stream computers such as the ILLIAC IV, the CDC STAR-100 and the Texas Instruments ASC. The coordinate method is applicable to single instruction stream computers. Both methods translate a nest of DO loops into a form explicitly indicating the parallel execution.

The DO loops may be of a fairly general nature. The major restrictions are that the loop body contain no I/O and no transfer of control to any statement outside the loop.

These methods are basically quite simple, and can drastically reduce the execution time of the loop on a parallel computer. They are currently being implemented in the ILLIAC IV FORTRAN compiler. Preliminary results indicate that they will yield parallel execution for a fairly large class of programs.

The two methods are described separately in the following two sections. The final section discusses some practical considerations for their implementation.

I. The Hyperplane Method

Example. To illustrate the hyperplane method, we consider the following loop.

    DO 99 I = 1, L
      DO 99 J = 2, M
        DO 99 K = 2, N
          U(J,K) = (U(J+1,K) + U(J,K+1)
                    + U(J-1,K) + U(J,K-1)) * .25
    99 CONTINUE                                        (1)

(For future reference, we have assigned a name to each occurrence of the variable U, and written it in a circle beneath the occurrence.) This is a simplified version of a standard relaxation computation.

The loop body is executed L(M−1)(N−1) times, once for each point (I,J,K) in the index set ℐ = {(i,j,k) : 1 ≤ i ≤ L, 2 ≤ j ≤ M, 2 ≤ k ≤ N}. We want to speed up the computation by performing some of these executions concurrently, using multiple processors. Of course, this must be done in such a way as to produce the same results as the given loop.

The obvious approach is to expand the loop into the L(M−1)(N−1) statements

    U(2,2) = ....
    U(2,3) = ....

and then apply the techniques described in [2]. This is at best a formidable task. It is impossible if L, M, and N are not all known at compile time.

Our approach is to try to execute the loop body concurrently for all points (I,J,K) in ℐ lying along a plane. In particular, the hyperplane method will find that the body of loop (1) can be executed concurrently for all points (I,J,K) lying in the plane defined by 2I + J + K = constant. The constant is incremented after each execution, until the loop body has been executed for all points in ℐ.

To describe this more precisely, we need a means of expressing concurrent computation. We use the statement

    DO 99 CONC FOR ALL (J,K) ∈ 𝒮

where 𝒮 is a finite set of pairs of integers.¹ It has the following meaning: Assign a separate processor to each element of 𝒮. For each (j,k) ∈ 𝒮, the processor assigned to (j,k) is to set J = j, K = k and execute the statements following the DO CONC statement through statement 99. All processors are to run concurrently, completely independent of one another. No synchronization is assumed. Execution is complete when all processors have executed statement 99.

¹ We remind the reader that a set is an unordered collection of elements. We will not bother to define a syntax for expressing sets, but will use the customary informal mathematical notation.

Given loop (1), the hyperplane method chooses new index variables Ī, J̄, K̄ related to I, J, K by

    Ī = 2I + J + K
    J̄ = I
    K̄ = K                                              (2)

and the inverse relations

    I = J̄
    J = Ī − 2J̄ − K̄
    K = K̄.                                             (2')

Loop (1) is then rewritten as

    DO 99 Ī = 6, 2*L + M + N
      DO 99 CONC FOR ALL (J̄,K̄) ∈ {(j,k) : 1 ≤ j ≤ L,
                  2 ≤ Ī − 2j − k ≤ M and 2 ≤ k ≤ N}
        U(Ī-2*J̄-K̄, K̄) = (U(Ī-2*J̄-K̄+1, K̄)
                          + U(Ī-2*J̄-K̄, K̄+1)
                          + U(Ī-2*J̄-K̄-1, K̄)
                          + U(Ī-2*J̄-K̄, K̄-1)) * .25
    99 CONTINUE                                        (3)

Using relations (2) and (2'), the reader can check that loop (3) performs the same L(M−1)(N−1) loop body executions as loop (1), except in a different order. To see why both loops give the same results, consider the computation of U(4,6) in the execution of the original loop body for the element (9,4,6) ∈ ℐ. It is set equal to the average of its four neighboring array elements: U(5,6), U(4,7), U(3,6), U(4,5). The values of U(5,6) and U(4,7) were calculated during the execution of the loop body for (8,5,6) and (8,4,7), respectively, i.e. during the previous execution of the DO I loop, with I = 8. The values of U(3,6) and U(4,5) were calculated during the current execution of the outer DO I loop, with I = 9. This is shown in Figure 1.

[Fig. 1. Computation of U(4,6) for I = 9. Labels: "Computed when I = 8", "Computed when I = 9".]

Now consider loop (3). At any time during its execution, U(p,q) is being computed concurrently for up to half the elements of the array U. These computations involve many different values of I. Figure 2 illustrates the execution of the DO CONC for Ī = 27. The points (p,q) for which U(p,q) is being computed are marked with "x"s, and the value of I for the computation is indicated. Figure 3 shows the same thing for Ī = 28.

[Fig. 2. Execution for Ī = 27. Axes: p, q.]

[Fig. 3. Execution for Ī = 28. Axes: p, q.]

Note how the values being used in the computation of U(4,6) in Figure 3 were computed in Figure 2. A comparison with Figure 1 illustrates why this method of concurrent execution is equivalent to the algorithm specified by loop (1).

The rewriting has reduced the number of sequential iterations from L(M−1)(N−1) to 2L + M + N − 5. This gives the possibility of an enormous reduction in execution time. The actual saving in execution time will depend upon the overhead in executing the DO CONC, as well as the actual number of processors available. The DO CONC set contains up to (M−1)(N−1)/2 points. Since individual executions may be asynchronous, the DO CONC is easily implemented with fewer processors.

We must point out that a real program would probably have a loop terminated by a convergence test in place of the outer DO I loop. The hyperplane method could then only be applied to the DO J/DO K loop.
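The equivalence of loops (1) and (3) can also be checked numerically. The following is a minimal sketch in Python, used here only as a stand-in for the paper's FORTRAN notation, since DO CONC is not an executable statement. The function names `make_grid`, `relax`, `sequential` and `hyperplane` are hypothetical, and the DO CONC is simulated, as an assumption, by visiting the points of each plane in a random order on a single processor. Because no point on a plane 2I + J + K = constant reads an array element written by another point on the same plane, any order within the plane should give the same result.

```python
import random


def make_grid(M, N, seed=0):
    """Hypothetical helper: an (M+2) x (N+2) array of random values.

    Rows 1..M+1 and columns 1..N+1 are the ones the loops touch; index 0
    is unused so that subscripts match the FORTRAN ones.
    """
    rng = random.Random(seed)
    return [[rng.random() for _ in range(N + 2)] for _ in range(M + 2)]


def relax(U, j, k):
    """The loop body: replace U(j,k) by the average of its four neighbors."""
    U[j][k] = (U[j + 1][k] + U[j][k + 1] + U[j - 1][k] + U[j][k - 1]) * 0.25


def sequential(U, L, M, N):
    """Loop (1): the original nest, executed in its given order."""
    for _i in range(1, L + 1):
        for j in range(2, M + 1):
            for k in range(2, N + 1):
                relax(U, j, k)
    return U


def hyperplane(U, L, M, N, rng):
    """Loop (3): one sequential step per plane 2I + J + K = ibar.

    The points of a plane are shuffled to mimic processors that finish in
    an arbitrary order (an assumption made for this simulation only).
    Returns the array and the number of sequential steps taken.
    """
    steps = 0
    for ibar in range(6, 2 * L + M + N + 1):
        # The DO CONC set: pairs (jbar, kbar) with 1 <= jbar <= L,
        # 2 <= ibar - 2*jbar - kbar <= M and 2 <= kbar <= N.
        points = [(jbar, kbar)
                  for jbar in range(1, L + 1)
                  for kbar in range(2, N + 1)
                  if 2 <= ibar - 2 * jbar - kbar <= M]
        rng.shuffle(points)
        for jbar, kbar in points:
            # Inverse relations (2'): J = ibar - 2*jbar - kbar, K = kbar.
            relax(U, ibar - 2 * jbar - kbar, kbar)
        steps += 1
    return U, steps


if __name__ == "__main__":
    L, M, N = 3, 4, 5
    a = sequential(make_grid(M, N, seed=1), L, M, N)
    b, steps = hyperplane(make_grid(M, N, seed=1), L, M, N, random.Random(7))
    print("same result:", a == b, "sequential steps:", steps)
```

Both functions evaluate the same expression on the same operand values for every point of the index set, so the comparison is made with exact equality rather than a tolerance. The step count returned by `hyperplane` is the number of planes swept, which corresponds to the 2L + M + N − 5 sequential iterations stated in the text.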
