
ON PROGRAMMING PARALLEL COMPUTERS

Leslie Lamport
Massachusetts Computer Associates, Inc.

INTRODUCTION

In this paper, I will make some general observations about how computers should be programmed, and how programs should be compiled. I will restrict my attention to programming computers to solve numerical analysis problems, although most of my remarks can be applied to other problem areas as well. My primary concern is for large parallel computers. However, I will show that parallelism is not a major issue, and I will not restrict the discussion to any single type of computer architecture.

The paper is a very general one, and I will make few specific proposals for language or compiler design. Instead, I will examine the fundamental nature of the programming/compiling process, how this process is done now, and how it should be done in the future. Few of my observations will be new or original. However, a careful analysis will lead to a somewhat surprising conclusion: a higher level language is needed both to reduce programming costs and also to obtain more efficient programs. The analysis will also provide some useful ideas for designing a higher level language and compiling efficient code from it.

ON PROGRAMMING

I will begin with a general discussion of what programming is all about. I will not say anything very profound, but I hope to clarify some of the problems, and provide a basis for the more specific material I will present later.

We are given a domain of problems to be solved. We may think of these problems as specified in some very abstract form -- perhaps as "pure thoughts". I will remain vague about the nature of a problem specification, since it is very hard to define it precisely. For example, the specification of a problem involving the simulation of a physical system includes some sort of statement about the likely range of certain physical parameters.

Given one of these problems, a necessary task is to generate a suitable machine code for it, where a machine code is a collection of operating instructions for some specific computer. For example, a machine language (binary) program for a PDP-10 and an input tape for some particular Turing machine are both machine codes.

This task is usually accomplished by writing a program, which is an intermediate stage between the problem and the machine code. The task of generating machine code for a problem is then broken into two steps: (1) programming -- the essentially human task of constructing a program which represents a solution to the problem, and (2) compiling -- the essentially computer-performed task of constructing a machine code which correctly implements the program. I will represent this with the following picture:

             programming           compiling
    PROBLEM -------------> PROGRAM ----------> MACHINE CODE

We can think of the problem, the program, and the machine code as variables assuming values in their appropriate domains, so programming and compiling are mappings between these domains.

The usefulness of this picture lies in the fact that it clearly shows the duality or complementarity of the programming and compiling tasks. Suppose we consider the problem and the machine code to be fixed and the program to be variable. Choosing a program which makes the programming task easier will tend to make the compiling task more difficult, and vice-versa -- assuming that all else remains the same. (Requiring the programmer to be blindfolded will make programming more difficult, but it will not make compiling any easier.) It is important to remember that this duality exists only when the program is varied, but the machine code -- or the class of acceptable machine codes -- is not.

In going from the problem to the machine code, the goal is to minimize the cost of solving the problem. This cost has three distinct components:

I. Programming Cost -- the cost of generating the program. Its major element is the human effort required, including the debugging effort. One seldom solves a single problem on a single computer. Therefore, evaluating the effect of some policy on this cost requires considering the cost of modifying the program for a different problem (program maintenance) or for a different computer.

II. Compiling Cost -- the cost of generating machine code from the program. Its major element is the computer resource used to perform all the compilations needed to write and debug the program. It also includes part of the cost of developing the compiler. Since this latter cost is shared by all uses of the compiler, it is usually negligible.

III. Running Cost -- the cost of executing the machine code on a computer in order to solve the problem. It consists mainly of the cost of the computer resource used.

Observe that the first two costs are attached to the arrows in our picture. Hence, they are dual to each other. Decreasing the programming cost will tend to increase the compiling cost, and vice-versa -- assuming that the quality of the machine code and all other factors remain the same. The third cost is just attached to the machine code. It measures the quality of the end product of the programming and compiling processes.

Let us now examine the compiling task more closely. It can be separated conceptually into two parts: (i) translation -- the process of transforming a program into some correct machine code, and (ii) compiler optimization -- the process of choosing the best possible machine code. Although it is impossible to completely separate these two aspects of compiling, many compilers do have a separate optimization phase which is devoted solely to performing this choosing. Of course, other phases of these compilers also perform some compiler optimization, since they make choices among different possible machine codes.

There is a dual separation of the programming task into two parts: (i) the process of transforming the problem into some program, and (ii) programmer optimization -- the process of choosing the best possible program. It is impossible to actually separate them, but it is useful to remember that there are these two different aspects to the programming task.

Let us return to our picture of the programming/compiling process. Moving from left to right in the picture represents a loss of information. Choosing a particular program destroys some information about the problem. A machine code implementation of that program contains even less information about the problem than the program does. Information is the same as freedom of choice. There are many possible machine codes to solve a problem. When we write a program for the problem, we lose the possibility of obtaining some of those machine codes. For example, writing a PDP-10 assembly language program for a matrix calculation problem undoubtedly destroys the information about the problem which is needed to generate an efficient machine code for the ILLIAC IV. We could certainly write a compiler to produce ILLIAC IV machine code from a PDP-10 assembly language program. However, it could not generate the type of efficient ILLIAC machine code that we could obtain by programming in an ILLIAC array language such as IVTRAN. By writing our program in PDP-10 assembly language, we have greatly restricted the compiler's freedom in choosing a machine code to solve the original problem.

It is very difficult to precisely define what "loss of information" means. In a certain sense, the PDP-10 assembly language program is equivalent to the original problem. It is theoretically possible to "decompile" it into an IVTRAN program which can then be compiled into very efficient ILLIAC IV code. However, we know that this is very difficult because something has been lost by transforming the problem into the PDP-10 program and must be put back by the decompiler. It is this vague something which I am calling "information".

This concept of information is closely related to the concept of entropy in physics. A liter of blue ink and a liter of yellow ink are equivalent to the two liters of green fluid obtained by mixing them, since the individual "ink molecules" are unchanged by the mixing process. However, ergodic theory tells us that it is more difficult to separate the two colors than it is to mix them. Without an ergodic theory for programming, we cannot precisely define the concept of information. However, it seems to be a valid concept, and we can place some trust in our intuitive understanding of it.

Consider a specific program for a certain problem. We say that it is a high level program if it contains much information about the problem, and that it is a low level program if it contains little information about the problem. In other words, we have lost less information about the problem in writing a high level program than in writing a low level one. Equivalently, a high level program allows a great deal of freedom of choice in selecting a machine code to implement it, and a low level program allows little choice of machine code implementation. For example, an assembly language program allows the compiler little choice in the machine language it produces. It is only free to choose such things as the absolute machine addresses of certain program segments. Hence, the program is a low level one. Conversely, a Fortran program can be compiled into many different machine codes. For example, there may be many different instruction sequences for evaluating a single Fortran arithmetic expression. Hence, the Fortran program is a high level one.

Another way of viewing this is to think of a problem as specifying what is to be done, and a machine code as specifying how it is done. A program lies somewhere in between. The closer it comes to describing what is to be done rather than how to do it, the higher level a program it is.

A high level programming language is a language in which one usually writes high level programs, and a low level programming language is one in which low level programs are usually written. However, it is important to remember that even with a single programming language, one can write higher and lower level programs for the same problem.

Now let us consider the programming/compiling duality in terms of the level of a program. Suppose we want to obtain a certain quality of machine code for a given problem. The more freedom of choice the compiler has, the harder it is to make an optimal choice. Therefore, compiler optimization is harder for a high level language than for a low level one. Dually, the programmer has to make a more specific choice when he writes a low level program than when he writes a high level one. Therefore, programmer optimization is harder for a low level program than for a high level one. Choosing the level of the program provides a means of making a trade-off between programming cost and compiling cost.

Our programming/compiling picture leads us to expect compiler optimization to be costlier for a high level program than for a low level one. However, it gives us no reason to expect this to be true of translation. Thus, although it is costlier to compile a single Fortran statement than a single assembly language statement, a Fortran program will have fewer statements than an assembly language one which solves the same problem. There is no reason to suppose that translating a Fortran program is costlier than translating an equivalent assembly language program.

There is, however, one reason why it may be costlier to translate a higher level program than a lower level one. It seems to cost more to develop a compiler for a higher level language, even one which does no optimizing. Hence, the program's share of the compiler development cost is greater for a program written in a high level language than for one written in a low level language. For a popular language like Fortran, this cost is negligible. However, it becomes significant if one is thinking of writing a compiler for an extremely high level language, such as English.

FORTRAN

Before I propose a new method of programming numerical analysis problems, let us consider the current method. Most numerical analysis programs are written in Fortran, so I will restrict my attention to Fortran programming. By "Fortran", I will mean ANSI standard Fortran [1], which represents the "intersection" of the various dialects which have been implemented.

Why is Fortran so popular? There are historical reasons for its popularity -- it was the first widely available high level language, so it has the weight of tradition behind it. However, there is a more valid reason why it is still so widely used today: Fortran programming provides a reasonably low cost means of solving numerical analysis problems. Let us examine why this is so in terms of the three components of this cost.

(1) Programming Cost: Fortran is a significantly higher level language than assembly language, and we know that it should be easier to write programs in a high level language. Because it is fairly simple and easy to learn, Fortran actually does make it easier to write programs. There is a small number of different statement types, and it uses ordinary algebraic notation and a simple subscript notation. Fortran also requires no detailed knowledge of how computers actually work -- it can be used by a programmer who has never heard of accumulators or index registers. This all helps to reduce programming costs by making Fortran programs easier to write and to maintain.

Another important factor in reducing programming costs is Fortran's machine-independence. A Fortran compiler has been written for almost every computer, so one can transfer a Fortran program from one machine to another with minimal effort. Although there are some incompatibilities among the dialects of Fortran implemented on different computers, these are relatively minor and affect only a small fraction of the statements in most programs.

(2) Compiling Cost: It is not too difficult to compile a Fortran program into machine code. The syntax of Fortran does not lead to efficient parsing, probably because parsing was not well understood when the language was designed. However, most of the Fortran statements can be translated quite easily into machine code. (The formatted I/O statements are notable exceptions.)

(3) Running Cost: Fortran was designed to enable the programmer to produce efficient machine code, thus keeping the running cost low. The concern for efficient machine code was apparent from Fortran's inception. For example, the numerical IF statement was designed to allow efficient use of the IBM 704's compare instruction.* Efficiency has taken precedence over elegance throughout Fortran's evolution. Thus, recursive function definitions are still not allowed.

* Ironically, the problem of detecting "minus zero" defeated this attempt at efficiency.
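For readers who have not met it, the numerical IF is a three-way branch. In the minimal sketch below (the statement labels are invented), control transfers to statement 10, 20, or 30 according as the expression is negative, zero, or positive -- a decision the 704 could make with a single compare instruction.

      IF (A - B) 10, 20, 30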
This emphasis on efficient machine code has made it possible for a Fortran programmer to produce reasonably efficient machine code for solving numerical analysis problems on almost any computer. The ability to produce efficient machine code for a wide range of computers has contributed enormously to Fortran's success. One area in which Fortran compilers have not produced efficient machine code is input/output. This has led to non-standard I/O statement types in several dialects of Fortran.

Although Fortran produces efficient machine code for problems in numerical analysis, this is not true for other, non-numeric problem domains. Fortran's success is largely the result of designing it specifically for the domain of numerical analysis problems. However, the ability of Fortran to reduce programming costs has led to its use even for problem areas in which it does not produce very good machine code. For example, the ILLIAC IV Fortran compiler is written in PDP-10 Fortran.

Having listed the reasons for using Fortran, I will now explain why I feel that Fortran is no longer adequate. Its basic problem is that it is too low level a language. Fortran has come to be used as a sort of "universal machine language". (It has even been proposed that other high level languages should be compiled by first translating them into Fortran [4].) A machine language is one in which the compiler can simply translate a program in a straightforward manner without doing much optimization. The programmer who uses Fortran as a machine language is therefore trying to write a low level program. He wants to do all that he can to choose the machine code, and to remove as much choice as possible from the compiler. Let us examine the effect of this practice on the cost of problem solving.

27 "unstructured" programs, replete with un- writing a Fortran program, one is essentially pro- disciplined and other obscuring fea- gramming a "Fortran computer". This Fortran com- tures. This increase in programming cost is puter reflects the essential properties of a standard well recognized, and I need not dwell upon sequential computer, as used for numerical compu- it. As we all know, the programming cost tations. (The fact that this is still true today is is becoming a larger and larger part of the perhaps partly due to the influence of Fortran on total cost of solving a problem. computer design.) An efficient program for the (2) Compiling Cost: By optimizing his own Fortran computer will describe an efficient program program, the programmer can decrease the for a real machine. cost of compiling it. Since the compiler However, input/output on the 370/165 is has less to do, it should cost less to com- quite different from on the 704. Thus, ANSI stan- pile the program. Although this is true, the dard Fortran has not been successful at producing saving is not that important. Computing efficient machine code for I/O. For a similar tea- costs are decreasing, and we would much son, the non-I/O aspect of Fortran is unsatisfac- prefer to let the compiler do the optimizing tory for the new large parallel computers. The instead of the programmer. I will have ILLIAC IV, consisting of sixty-four identical pro- more to say about compiling cost later. cessors controlled by a single instruction stream, (3) Runninq Cost: The costs of programmer is quite different from an IBM 704. One therefore optimization have been incurred in order to does not expect Fortran to be a good language for decrease running costs. It has been felt the ILLIAC IV. that the extra programming effort was nec- It is usually thought that Fortran is unsuit- essary in order to obtain more efficient able for the ILLIAC IV simply because it lacks any machine code. There seems to be a general way of specifying parallel execution. However, feeling that low level programs are neces- this is not the reason. ANSI Fortran will soon be sary in order to obtain good machine code. extended to include parallel operations [ 3 ], but However, our picture of the programming/ the following problems will still prevent it from compiling process shows that this feeling generating efficient code for parallel computers: is based upon the assumption that the pro- (1) Parallel computers differ in what they grammer is better at optimizing than the can execute in parallel. The presence or compiler is. If this assumption is correct, absence of a particular feature on a stan- then the programmer wants to do the opti- dard sequential computer will have only a mizing, and leave little choice to the com- small effect on the efficiency of the com- piler, thus writing a low level program. piled code. However, for the ILLIAC, it However, if the compiler is better at opti- could mean the difference between sequen- mizing than the programmer, then the op- tial and parallel execution of a program -- posite is true. We then want the program a difference of up to a factor of sixty-four to be a high level one, and leave most of in execution time. Machine independence the optimizing to the compiler. is one of Fortran's main features. If the In practice, neither the programmer nor programmer can only express parallel com- the compiler is strictly better at optimizing. 
putation which can be implemented on all Each performs certain types of optimiza- parallel computers, then he will be unable tions better than the other. Programmers to write an efficient parallel program for are usually better at such tasks as choosing a numerical for solving the prob- any of them. (2) There is evidence to indicate that lem. Compilers are often better at such people are not very good at finding the pos- other tasks as . In order sible parallel execution in their programs. to produce the most efficient machine code, In [ 7 ], I described a case in which com- it is necessary to have the programmer per- piling techniques developed for the ILLIAC form only those optimizations which he can transform a standard sequential algo- does best, and leave the rest to the com- rithm into better parallel programs than ones piler. written explicitly in parallel for two dif- The success of any new high level language ferent types of parallel computers. I once for numerical analysis problems will depend largely gave several members of the IVTRAN com- upon its ability to obtain efficient machine code. piler group a seven llne Fortran program to Therefore, let us consider how good Fortran Is In be rewritten in parallel for the ILLIAC. this respect. I will show that Fortran is too low (This rather contrived program was a sim- level a language to obtain efficient machine code plified version of the example used in [8].) for the large parallel computers of the present such None of them obtained as much parallel ex- as the [LLIAC IV and the CDC Star, or for the com- ecution as would a compiler using the plex computers of the future. methods of [ 8 ]. The idea of needing a higher level language These observations about parallel program- to produce more efficient machine code is a new ming can be abstracted to obtain the following one for most people. In order to justify it, let me general reasons why higher level programs are first consider why it has been possible to compile needed to produce more efficient machine code. efficient code for so many different computers from a single Fortran program. The basic reason is that (1) For programs to be run on a wider vari- up to now, computers have all been pretty much ety of computers, more freedom of choice alike. They have contained a single arithmetic must be left to the compiler. unit connected to a random access memory of com- (2) As computers become more complex, parable speed. Viewed in this way, an IBM 370/ and parallelism introduces a large degree 165 is basically the same as an IBM 704. When of complexity, the programmer becomes

(2) As computers become more complex, and parallelism introduces a large degree of complexity, the programmer becomes less competent at optimization. Hence, more optimization must be left to the compiler.

A further example of the difficulty of writing efficient programs in Fortran is given by a surprising result reported by Owens [12]. He described taking a Fortran program coded for the CDC 7600 and translating it into a vector program for the CDC STAR; the new program ran twice as fast on the 7600 as did the original program. The immediate implication of this is that it is better to program the 7600 as a vector computer than as a "standard Fortran computer".

I have worked on one important aspect of compiler optimization for parallel computers: developing techniques for the parallel execution of sequential programs. Many such techniques have been found, by myself and others [8-11, 13, 14]. Very often, the basic algorithm used in writing a Fortran program allows parallel execution. If the program were written in a simple, direct style, then these optimization techniques would permit the compiler to obtain this parallel execution. However, after the programmer has finished optimizing his Fortran program, it is almost impossible for a compiler to find this parallelism.

What the programmer has done by his optimization is to write a lower level Fortran program, thus destroying information about the problem. Even if the information has not actually been "destroyed", it has been hidden so as to make it much more difficult to discover. I believe that most of the programmer optimization techniques we have learned for writing more efficient programs will ultimately turn out to be bad programming practice. As an example, consider the elementary technique of removing invariant code from a loop. Suppose we want to set all elements of a 50 element array A equal to the product of the scalars B and C. A novice might write the following:

      DO 1 I = 1, 50
    1 A(I) = B * C

However, he would soon learn to write this "better" version:

      TEMP = B * C
      DO 1 I = 1, 50
    1 A(I) = TEMP

But is it really better? The first version is easier to understand, since we need only look at statement 1 to see what values the elements of A receive. Hence, the first version leads to lower programming cost. Furthermore, it is a simple job for the compiler to transform the first version into the second. It then knows that TEMP is a temporary variable which need not be saved after execution of the DO loop. It might be difficult for the compiler to determine this from the second version -- especially if the programmer were clever enough to save memory by using TEMP elsewhere, perhaps even putting it in COMMON storage. I.e., programmer optimization can destroy the information needed to generate optimal machine code -- code in which B*C is saved in a live register and not stored in memory. Finally, on the ILLIAC IV, the DO loop would be executed in parallel just once for all values of I. The first version is actually more efficient than the second for the ILLIAC.

So far, I have indicated why the programmer should write a higher level program. He could try to do this simply by writing a better structured, higher level Fortran program. However, I will now describe how Fortran makes it impossible to write a sufficiently high level program. First of all, Fortran requires the programmer to be too specific -- to make too many choices. Below are two examples of this; the sketches following them make the restrictions concrete.

(1) The ANSI Fortran standard requires that the integrity of parenthesized subexpressions be maintained when evaluating an arithmetic expression. The compiler may not use distributivity relations to obtain an algebraically equivalent expression which can be executed more efficiently. Algebraically equivalent expressions may differ because of roundoff error, but often this is not the case. The user has no way of specifying which parentheses are significant. Although it does not affect the ILLIAC, this restriction prevents the use of the methods of [11] for other types of parallel computers.

(2) Many numerical algorithms involve a convergent iterative procedure for computing successive approximations to the desired result. The iteration is terminated when some convergence criterion is met. As explained in [10], the most efficient parallel execution of such an algorithm often requires executing extra iterations after convergence occurs. However, this is impossible because the Fortran programmer must specify that the calculations stop after a particular iteration, and the compiled code must produce the exact results specified by the program.
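The following sketches illustrate restrictions (1) and (2). They are not taken from any particular program; the variable names, labels, and the square-root iteration are invented for illustration.

C     RESTRICTION (1).  THE PARENTHESES BELOW ARE PROTECTED BY THE
C     STANDARD: THE COMPILER MAY NOT REWRITE THE STATEMENT AS
C     X = A*B + A*C, EVEN WHEN ROUNDOFF IS KNOWN TO BE HARMLESS,
C     AND THE PROGRAMMER CANNOT SAY WHICH PARENTHESES MATTER.
      X = A * (B + C)

C     RESTRICTION (2).  A TYPICAL CONVERGENCE LOOP (HERE, NEWTON'S
C     METHOD FOR A SQUARE ROOT).  THE TEST FORCES THE CALCULATION
C     TO STOP AT EXACTLY THE ITERATION WHERE CONVERGENCE OCCURS,
C     SO THE COMPILED CODE MAY NOT EXECUTE THE EXTRA ITERATIONS
C     THAT EFFICIENT PARALLEL EXECUTION OFTEN REQUIRES.
      DO 10 ITER = 1, LIMIT
      XNEW = 0.5 * (XOLD + A / XOLD)
      IF (ABS(XNEW - XOLD) .LT. EPS) GO TO 20
   10 XOLD = XNEW
   20 CONTINUE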
Secondly, Fortran requires the destruction of certain information about the problem, because it gives the programmer no way of including it in his program. The following are some examples of the information which a compiler would like to have to help it optimize, but cannot find out from a Fortran program.

(1) The context in which a subroutine or function is executed. From where is the subroutine called? For the ILLIAC, it is sometimes best to compile two different versions of a single routine -- one to be called in parallel, and another to be called sequentially which can perform its own parallel computations.

(2) Information about a called function or subroutine. What can be modified by calling the routine? Is the routine called from anywhere else? If not, perhaps it should be compiled in line.

(3) Frequency of execution information. How often is a particular branch of the program actually executed? If a DO loop has variable limits, what are the approximate values of those limits? This information was provided by the original Fortran FREQUENCY statement, which has unfortunately disappeared from the language.

(4) There are also many other facts about the program which might help the compiler generate better code. E.g., is a certain input variable always positive? This might be necessary to allow the parallel execution of a loop on the ILLIAC. The compiler would like to ask the programmer such questions, but Fortran gives him no way of putting the answers in his program.

The above criticism of Fortran is in no way intended to belittle its achievements. It has been a very useful tool, with a remarkable longevity. The ILLIAC IV compiler will produce surprisingly good parallel code from sequential Fortran programs, largely because of the basic conceptual soundness of many aspects of the language. However, a new way of programming is needed for the ILLIAC and for other new complex computers.

A NEW WAY TO PROGRAM

The discussion of Fortran indicates that both the programming cost and the execution cost can be reduced by writing higher level programs. I will now try to show how this can actually be done.

Programs should be written in a simple, straightforward fashion. This will make them easy for people to understand, thereby reducing the cost of writing and maintaining them, and easy for compilers to understand, thereby allowing better compiler optimization. This implies the use of hierarchical, structured programming.

Achieving this style of programming requires a programming language which encourages simple, straightforward programming. The need to inhibit constructions such as GOTOs which lead to unstructured programs is well understood. What is perhaps less well understood is the need to inhibit unnecessary elegance. It is elegant to allow dynamically defined data types. However, their use is not likely to lead to simple, easily understood programs. Elegance leads to short programs, but not necessarily to straightforward ones. Writing simple programs does not mean using the simplest possible programming language. A Turing machine is simple, but its programs are not.

A good programming language is sufficiently general and elegant to permit clear, simple programs, but not so elegant as to defeat attempts to compile efficient code. An example of what I consider to be a good programming language feature is the DO loop. It is very useful -- especially for the array computations common in numerical analysis -- it is easy to understand, it encourages good structured programming, and it makes compiler optimization easier. Of course, I am assuming a properly defined DO loop, with no "extended range", and for which a "DO I = 1, N" results in no execution of the loop body if N < 1. One might desire some syntactic improvement to make the loop body easier to identify, but semantically it is marvelous.

The fact that a language is to be compiled for a parallel computer does not mean that it must be a "parallel language". People tend to think in terms of sequential processes. Some parallel operations are simple and natural, such as writing "A = 0" rather than a nest of DO loops (see the sketch below) in order to set all elements of the array A to zero. However, the fact that people think sequentially implies that programs should be essentially sequential.
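In present-day Fortran, that single assignment has to be spelled out. A minimal sketch, with invented dimensions:

      DIMENSION A(50,50)
      DO 1 I = 1, 50
      DO 1 J = 1, 50
    1 A(I,J) = 0.0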
Encouraging simple programs means discouraging the programmer's clever attempts at programmer optimization. This will not be easy, since programmer optimization has become a reflex action. Leaving invariant code inside a loop requires a conscious effort for many programmers. I do not know how the programmer's urge to blindly optimize can best be discouraged. One possible answer is provided by the growing interest in program correctness [2]. The desire to prove the correctness of his program should force the programmer to concentrate on simplicity rather than self-defeating efficiency.

It is easier to see how Fortran's specific impediments to writing higher level programs can be overcome. For example, one can devise a way of specifying convergent iterative algorithms which allows the execution of extra loop iterations. It is also easy to devise ways of including frequency of execution information in the program. I will not discuss any of these problems. However, I will briefly discuss the problem of I/O specification. Higher level programming means specifying what rather than how. Specification of I/O should be done by describing the relation between the input/output and the program variables. Thus, instead of writing

    READ from file F into A
    PROCESS A
    READ from file F into B
    PROCESS B

the programmer would specify in some way that A and B are the first two elements on file F, and then just write

    PROCESS A
    PROCESS B

The compiler would then be free to insert the actual READ operations any place it chose, so long as it produced a correct implementation.

So far, I have been talking only about a new programming language. In addition to a new language, the programmer needs a whole new way of interacting with the compiler. With Fortran, the programmer simply compiles each subroutine individually, completely independent of any other subroutine. This is no longer satisfactory. If the compiler is to assume greater responsibility for optimizing, then it requires information about the whole program. It cannot produce optimal machine code if it sees only one subprogram at a time, knowing nothing about the rest of the program.

I envision the compiler -- that object which actually produces machine code -- being embedded in a larger programming system. The programmer uses the system in all phases of program design, from the initial high level design down to the final programming stage. At each stage, the programmer supplies the system with preliminary information about the program, and the system can respond with preliminary estimates of what the machine code will be like. This will allow the system to identify the most important parts of the program, so the programmer and the compiler can concentrate their efforts on them.

To obtain an efficient machine-independent program, the programmer may want to write two (or more) versions of some subroutines. Both versions would become part of the program. The compiler can choose the version which will produce the best machine code for each individual computer.

The ideas I have given for the programming system are quite vague, and not all original. Some of them can be found in [16]. Actually designing such a system is a formidable task. I have tried to indicate why it is necessary. I will next explain why I think it is feasible.

COMPILING

The most difficult part of the programming system to implement is the compiler. Translation -- the process of simply producing any correct machine code for a program -- is not difficult. The difficult part of compiling is optimizing -- producing good machine code. I will therefore consider only the task of compiler optimization.

Compiler optimization methods can be divided into two general classes: local and global. Local optimization is performed on a single semantic unit of the program, such as one program statement, and it maintains the integrity of that unit. It includes such optimizations as choosing the fastest instruction sequence to evaluate a single arithmetic expression. Local optimization produces a separate machine code object for each program unit. Although very important and often quite difficult, local optimization presents no conceptual problems.

Global optimization involves the interaction among separate program units. It produces machine code in which there is no single machine code object which implements an individual program unit. Common subexpression elimination, in which a subexpression appearing in two distinct program statements is evaluated only once, is an example of a global optimization. The resulting machine code does not contain two separate objects which implement the two program statements.
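A hypothetical two-statement fragment (variables invented) makes the idea concrete:

C     THE SUBEXPRESSION B*C APPEARS IN BOTH STATEMENTS.  A GLOBAL
C     OPTIMIZER EVALUATES IT ONCE AND KEEPS THE VALUE IN A REGISTER,
C     SO NEITHER STATEMENT IS IMPLEMENTED BY A SEPARATE PIECE OF
C     MACHINE CODE.
      X = B * C + D
      Y = B * C - E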
The distinction between local and global optimization depends upon the level of semantic unit chosen. If the unit is the program statement, then removing invariant code from a DO loop is a global optimization. However, it is a local optimization if the DO loop is the semantic unit being considered.

The programmer's attempts at global optimization are primarily responsible for the high (programming) cost of programmer optimization. The basic principle of structured programming is that one should be able to think about individual program units independently of one another. Global optimization destroys this independence, so it destroys the hierarchic program structure necessary for reducing programming costs.

Most global compiler optimization techniques that have been developed are global with respect to the individual program statements, but local with respect to larger semantic units such as the subroutine. Effective compiler optimization must be global at the highest possible level. Thus, it should include inter-subroutine optimization. Such optimization techniques are discussed in [17]. As a simple example, let INVERT be a general subroutine for inverting an N x N matrix, with the matrix and the value of N as arguments. Suppose that a particular program always calls INVERT with N equal to 100. The compiler can then generate more efficient code for INVERT by substituting 100 for N, and eliminating N as an argument.
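A sketch of the transformation in Fortran terms; the declarations are invented, and the body of the inversion routine is omitted:

C     AS WRITTEN BY THE PROGRAMMER:
      SUBROUTINE INVERT (A, N)
      DIMENSION A(N,N)
C     ... BODY OF THE INVERSION ROUTINE ...
      RETURN
      END

C     AS SPECIALIZED BY THE COMPILER, GIVEN THAT EVERY CALL IN THE
C     PROGRAM HAS THE FORM  CALL INVERT (X, 100):
      SUBROUTINE INVERT (A)
      DIMENSION A(100,100)
C     ... SAME BODY, WITH 100 SUBSTITUTED FOR N ...
      RETURN
      END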
Programs seldom remain unchanged for very long. They are usually subject to modification, which means recompiling. In order to prevent very large compilation costs, this implies the use of incremental compiling: only the program unit which is changed should have to be recompiled. However, if there is no single machine code unit which implements it, how can we recompile just that one program unit?

With global optimization, we cannot guarantee that changing a single program unit will require recompiling just that unit. However, we can minimize the number of units which must be recompiled. To do this, some machine code object must be associated with each program unit. Global optimization prevents us from finding any such object which implements the original program unit. However, we can find a machine code object which implements that unit under certain assumptions about its environment. For example, common subexpression elimination removes the calculation of a subexpression from a statement and replaces it with a fetch of the value from some register. The compiler generates a section of machine code which correctly implements the statement under the assumption that the register contains the value of the subexpression.

For each program unit, the compiler can generate a machine code object and a list of assumptions under which that object correctly implements the unit. The unit must be recompiled if changing another program unit invalidates those assumptions. For example, consider the subroutine INVERT. The compiler generates a machine code subroutine for INVERT which is correct under the assumption that its argument is a 100 x 100 array. When recompiling any subroutine that calls INVERT, the compiler checks if the call still satisfies this assumption. If so, then INVERT need not be recompiled. If not, then a new machine code must be compiled for INVERT.

With a sophisticated global optimizer, incremental compilation requires saving a great deal of information about the environment of each program unit. Hence, it is feasible only for large program units such as subroutines. A less sophisticated optimizer can save less information, and can allow incremental compiling of smaller program units. Thus, a compiler which can recompile an individual Fortran statement is feasible only if it does no common subexpression elimination.

Two major objections have been raised against sophisticated optimizing compilers. The first is that they often make compilation too expensive. This would be a valid objection if the optimizer were only trying to minimize execution cost. However, a good optimizer will try to minimize the sum of the compiling cost, the execution cost, and the debugging cost. To do this, the optimizer must choose among various options for the different techniques which it can apply. These will include the option of no optimization (perhaps executing the program interpretively), options for applying techniques to produce faster running machine code, and options for including debugging aids in the machine code.

In order to minimize the total cost, the compiler must be able to estimate both the cost of performing a certain type of optimization and the benefits it will yield. Both types of estimates require that the compiler be able to estimate the quality of the machine code which it will generate with its various options. For example, it may know that with no optimization, a program block containing N arithmetic operations will yield machine code requiring an average of 2N microseconds to execute on a particular computer. Eliminating common subexpressions will reduce this average to 1.5N microseconds, but will add N^2 milliseconds to the compiling time.

Estimating the benefit of applying an optimization technique requires that the compiler have good frequency of execution information. Not only must the compiler know how often a program block will be executed during a single execution of a subroutine, but it must also know how often the subroutine is executed during a single execution of the program. Moreover, it must know how many times the programmer expects to execute the program before the subroutine must be recompiled.

Frequency of execution information and the ability to estimate the quality of machine code will allow the compiler to choose the optimization option which is most likely to minimize total compiling and execution cost. It can first use the frequency information to quickly identify which parts of the program produce most of the execution time. It can then choose the best option for each of those parts. For example, suppose the compiler discovers that most of the execution time of a program occurs in one program block containing 500 arithmetic operations, which is executed 100,000 times each time the program is run. It then knows that it should perform common subexpression elimination on that block if the program is expected to be executed more than 10 times before the block must be recompiled.
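The figure of 10 is the break-even point implied by the numbers above:

    saving per program run  =  (2N - 1.5N) microseconds x 100,000
                            =  0.5 x 500 x 100,000 microseconds  =  25 seconds
    added compiling time    =  N^2 milliseconds  =  500^2 milliseconds
                            =  250 seconds
    break-even point        =  250 seconds / 25 seconds  =  10 program runs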
Initially, it raised against optimizing compilers is that they carl require a great deal of prompting from the pro- are unreliable. There are two kinds of unreliabil- grammer in order to produce an efficient low level ity: producing incorrect machine code, and pes- program. The compiler can be continually im- simizing instead of optim-izing. Producing incor- proved to perform the transformation better, grad- rect code has certainly been a real problem with ually relieving the programmer of the optimization optimizing compilers. However, it is just a mani- task. Even the initial version would be very use- festation of the general problem of writing correct ful, especially since optimization is usually nec- programs. Progress in structured programming and essary for only a small part of the program [ S ]. program correctness should enable it to be solved. It should become less likely for compiler optimiza- CONCLUSION tion to produce an incorrect machine code than for programmer optimization to produce an incorrect I have tried to show that using a higher pro g ra m. level programming language for numerical analysis The problem of designing an optimizing problems can reduce programming costs while pro- compiler which actually optimizes instead of pes- ducing better machine code. This is a very en- simizing is harder to solve. High level programs couraging idea, and it suggests that the cost of leave most of the optimization to the compiler, problem solving can be made very small by using a However, the programmer will be better than the high enough level language. Unfortunately, there compiler at some kinds of optimization, although is one limiting factor: the cost of developing the which kinds may depend upon the particular pro- compiler. This cost is prohibitive for an extremely grammer. A programmer who wants to produce very high level language such as English. I have in- good machine code with a high level language dicated that a good can be could find himself fighting the optimizer. written for a higher level language than Fortran. The solution to this problem is to let the However, it would be a difficult and costly soft- programmer and the compiler cooperate, instead of ware project, and the programming profession has having the compiler always act independently. a poor record on such projects. Developing the This can be accomplished with a generalization of type of high level programming system which I an idea described in [ 6 ]. We can let the pro- have described will require a level of programming grammer see the results of those optimizations quality that has rarely been achieved. I hope which he might do better than the compiler. More that progress in structured programming methods specifically, there can be a compiler optimization will make it possible. phase in which the program is transformed into an equivalent lower level program in a machine- REFERENCES specific dialect of the original programming lan- [ i ] ANSI Fortran Standard, American Standards

[1] American Standards Association. American Standard Basic Fortran, X3.10. ANSI, New York, 1966.

[2] Elspas, B. et al. An Assessment of Techniques for Proving Program Correctness. Computing Surveys 4, 2 (June, 1972), pp. 97-147.

[3] Engel, F. Future FORTRAN Development. SIGPLAN Notices 8, 3 (March, 1973), pp. 4-5.

[4] Gear, C.W. What Do We Need in Programming Languages? Mathematical Software II, Informal Conference Proceedings, Purdue University (May, 1974), pp. 19-24.

[5] Knuth, D. An Empirical Study of Fortran Programs. Software: Practice and Experience, Vol. 1, Issue 2 (April-June, 1971), pp. 105-133.

[6] Knuth, D. Structured Programming with go to Statements. Stanford University, Department of Computer Science, Tech. Report STAN-CS-74-416, May, 1974.

[7] Lamport, L. Some Remarks on Parallel Programming. Massachusetts Computer Associates, Inc., Wakefield, Massachusetts, CA-7211-2011, November, 1972.

[8] Lamport, L. The Coordinate Method for the Parallel Execution of DO Loops. Proceedings of the 1973 Sagamore Conference on Parallel Processing (August, 1973), pp. 1-12.

[9] Lamport, L. The Parallel Execution of DO Loops. Comm. ACM 17, 2 (February, 1974), pp. 83-93.

[10] Lamport, L. The Hyperplane Method for an Array Computer. To appear in Proceedings of the 1974 Sagamore Conference on Parallel Processing.

[11] Muraoka, Y. Parallelism Exposure and Exploitation. Ph.D. Thesis, University of Illinois, Urbana, Illinois (1971).

[12] Owens, J.L. The Influence of Machine Organization on Algorithms. Proceedings of the Symposium on Complexity of Sequential and Parallel Algorithms, Carnegie-Mellon University (May, 1973).

[13] Ramamoorthy, C.V. and Gonzalez, M.J. A Survey of Techniques for Recognizing Parallel Processable Streams in Computer Programs. AFIPS Proceedings, Fall Joint Computer Conference, Vol. 35 (1969), pp. 1-15.

[14] Schneck, P.B. Automatic Recognition of Vector and Parallel Operations. Proceedings of the ACM 25th Anniversary Conference (August, 1972), pp. 772-779.

[15] Schneck, P.B. and Angel, E. A Fortran to Fortran Optimizing Compiler. The Computer Journal, Vol. 16, Number 4 (November, 1973), pp. 322-330.

[16] Schwartz, J.T. On Programming, An Interim Report on the SETL Project. Computer Science Department, Courant Institute of Mathematical Sciences, N.Y.U. (1972).

[17] Wegbreit, B. Procedure Closure in EL1. The Computer Journal, Vol. 17, Number 1 (February, 1974), pp. 38-43.
