Proceedings of the ACM SIGPLAN '84 Symposium on Compiler Construction, SIGPLAN Notices Vol. 19, No. 6, June 1984

Peep - An Architectural Description Driven Peephole Optimizer

Robert R. Kessler (1)
Portable AI Support Systems Project
Department of Computer Science
University of Utah
Salt Lake City, Utah 84112

(1) Work supported in part by the Hewlett Packard Company, International Business Machines Corporation and the National Science Foundation under Grant Numbers MCS81-21750 and MCS82-04247.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. ©1984 ACM 0-89791-139-3/84/0600/0106 $00.75

Abstract

Peep is an architectural description driven peephole optimizer that is being adapted for use in the Portable Standard Lisp compiler. Tables of optimizable instructions are generated prior to the creation of the compiler from the architectural description of the target machine. Peep then performs global flow analysis on the target machine code and optimizes instructions as defined in the table. This global flow analysis allows optimization across basic blocks of instructions, and the use of tables created at compiler-generation time minimizes the overhead of discovering optimizable instructions.

1. Introduction

Peep is functionally similar to traditional peephole optimizers [12, 11]. That is, its purpose is to pass over the target machine code produced by a compiler, eliminating redundant operations and combining instructions into more efficient ones. The term "peephole" was derived from the fact that the optimizer only looks at a small local window (peephole) of adjacent machine instructions when searching for optimizations and does not use any global knowledge of the program. For example, a typical peephole optimization would be to combine the two-instruction sequence of load 1 into register X and add X to Y, into an increment of Y. Peep is different from traditional peephole optimizers in two important ways: first, instead of being hand written it is automatically generated from an architectural description of the target machine; and second, it performs global flow analysis [14] over the code to relax the adjacency constraint and allow optimization across basic blocks (instruction sequences with one entrance and one exit). However, it is a restricted global flow analysis, because even though it uses the global information it still performs "local" peephole transformations; for example, it does not do code motion or loop invariant removal.

Peep is currently being integrated into the Portable Standard Lisp (PSL) compiler [9]. The PSL environment is an ideal place to utilize Peep, since its compiler needs a new optimizer for each target machine. (Currently PSL is supported on DecSystem-20, DEC Vax, Motorola 68000, Cray-1 and IBM-370.) The PSL compiler generates code by translating the input language into a sequence of virtual machine instructions, which are then macro expanded into the target machine "LAP" instructions. These instructions are then translated into binary code for subsequent direct loading into the running image, or into a FASL file for later loading. Peep has been inserted into the compiler as a LAP to LAP pass. This provides a good environment to experiment with Peep, allowing easy comparison of optimized and unoptimized code sequences. We have chosen the Motorola MC68000 [13] as the first target machine to which we will apply the Peep optimizations. The primary reason is that the architecture is fairly contemporary and offers a wide range of different addressing modes. Also, in looking over a number of generated code sequences, we observed quite a few instructions that would benefit from Peep optimizations.

We begin with a review of the first significant machine independent peephole optimizer, PO, developed by Davidson and Fraser [7, 5] and its latest version [4, 6]. PO and Peep have recently grown closer in their functionality and will be used for comparison. We follow this comparison with a discussion of the previous versions of Peep and a brief description of the machine description language that is used in Peep. That is followed by a discussion of the two main parts of Peep: 1) the Peep Table Generator (PTG), which performs the analysis of the target machine; and 2) the Peep optimizer (embedded in the compiler), which first performs a global flow analysis upon the LAP code and then performs optimizations as specified by the tables produced by the PTG. We conclude with a complexity comparison with PO, the current status of Peep and finally, directions for future Peep research.
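The load-and-add example from the introduction can be sketched as a traditional, purely local peephole pass. This is only an illustration in Python, not part of Peep: the tuple encoding of instructions and the "inc" mnemonic are hypothetical, and the rewrite assumes register X is not needed afterwards, which a two-instruction window cannot verify on its own.

```python
# Hypothetical instruction encoding (not from the paper):
#   ("load", constant, reg)   reg := constant
#   ("add", src, dst)         dst := dst + src
#   ("inc", dst)              dst := dst + 1

def peephole(code):
    """Scan a two-instruction window and fold 'load 1,X; add X,Y' into 'inc Y'.

    Assumption of this sketch: X is dead after the pair.
    """
    out = []
    i = 0
    while i < len(code):
        if i + 1 < len(code):
            first, second = code[i], code[i + 1]
            if (first[0] == "load" and first[1] == 1
                    and second[0] == "add" and second[1] == first[2]):
                out.append(("inc", second[2]))
                i += 2  # both instructions are consumed by the rewrite
                continue
        out.append(code[i])
        i += 1
    return out


adjacent = [("load", 1, "X"), ("add", "X", "Y")]
separated = [("load", 1, "X"), ("move", "A", "B"), ("add", "X", "Y")]

print(peephole(adjacent))   # [('inc', 'Y')]
print(peephole(separated))  # unchanged: the pair is not lexically adjacent
```

The second call shows the adjacency constraint that Peep's flow analysis is meant to relax: an unrelated intervening instruction hides the pair from a purely local window.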

2. Previous Peephole Optimizers

In the past, peephole optimization has been an ad-hoc technique, customized for each compiler, usually added as an afterthought when it was discovered that, for certain localized cases, the compiler was not generating the most efficient code possible. The first serious work attempting to formalize this type of optimization was conducted by Davidson and Fraser [7, 5]. They defined PO, a peephole optimizer written in SNOBOL that used a target machine description (written in the Instruction Set Processor language ISP [3, 2]) to create a machine independent peephole optimizer. PO performed the following steps:

1. If PO operates directly upon an input assembly code, it first translates each instruction into a list of equivalent semantic equations (derived from the ISP description). When PO is placed directly into a compiler, the compiler can emit the semantic equations, and eliminate this step;

2. PO then scans backward through the code, eliminating dead register assignments (e.g. when a register is set to some value in one instruction and then set again in the next, the first assignment is eliminated);

3. PO then attempts to combine the two semantic equations of lexically adjacent instructions within a basic block. This is performed by substituting referenced resources in the second instruction with the values of the resources in the first instruction. The combined semantic equation is then used to search the machine instruction descriptions for an instruction that performs the equivalent operation;

4. PO performs one other operation. When it finds a label, it searches to determine if that label is still referenced (the reference could have been optimized away). If it is not, PO removes the label and attempts to combine the labeled instruction with the previous one.

PO was enhanced in late 1981 and has been described in Davidson's dissertation [4] and POPL-9 [6]. The enhancements included a reimplementation and a new technique for optimizing "logically adjacent" instructions instead of lexically adjacent ones. PO uses a simple flow analysis upon each basic block of code, to determine where resources are accessed. These are linked together into a set of lists of related instructions (related by access to the various machine resources). Each list is then scanned for possible optimizations. Thus, even though there may be a lexically intervening instruction, it will not be logically included in a particular list and may be ignored by the optimizer. This is an excellent technique, since it requires only one pass over the code to create the lists and one more pass over each list to find the optimizations.

PO is a major advance over the previous peephole optimization techniques, mainly because it is target machine independent. PO and Peep have many areas of commonality, including the fact that both utilize flow analysis for optimizing non-adjacent instructions. However, Peep takes this a step further and uses global flow analysis to allow optimizations between basic blocks. The primary significance of Peep is that instead of performing the manipulation of the semantic equations of the input assembly code at run time, it is performed when the compiler is generated. The Peep Table Generator finds instruction pairs as candidates for replacement, and stores them in a table for use by the Peep optimizing phase within the compiler. The optimizer performs flow analysis, and upon finding an instruction pair that is in the table, performs the optimization.

3. Early Versions of Peep

Peep has undergone a number of changes in its evolutionary cycle, from its beginnings in the COG system [10]. The general structure still remains the same, consisting of the Peep Table Generator, which derives optimization tables at compiler generation time, and the actual Peep optimizer that utilizes these tables to perform the optimizations within the compiler. Originally, the PTG derived two types of optimization tables:

Cancellation: Pairs of instructions that could both be eliminated because their effect was to cancel each other (for example, a PUSH followed by a POP); and

Compression: Pairs of instructions that could be compressed into a single more efficient instruction (for example, a load of 1 followed by an add compresses into an increment).

Peep then took these tables and laboriously scanned the input code searching for optimizable instructions. It checked the resources accessed by each instruction, and ignored those that would not conflict with the optimization pair being searched for. This algorithm could potentially perform many passes over the code, resulting in a complexity measure of N^2.

After investigating PO, it was decided that flow analysis was a better technique for performing the optimization pass in Peep. Flow analysis was both a faster technique (order N within a basic block) and eliminated the necessity of deriving the cancellation pairs in the PTG (the flow analysis can do dead register elimination, which is functionally equivalent to the cancellation pairs, using only the knowledge of which resources each instruction accesses). Finally, it was decided that with only a little more work, global flow analysis could be added to allow optimizations across basic blocks. This addition allows many more optimizations, including dead "resource" elimination. This has resulted in the current version of Peep, which has a simplified PTG and uses global flow analysis.

4. Target Machine Description

Peep utilizes a target machine description that is Lisp based. This allows easy expression of the constructs, and maximal flexibility in writing the definition (by allowing the machine definer to write Lisp macros where needed). It is also in the spirit of the PSL compiler and system, in which nearly all parts of the system are written in PSL itself. The user must define constants, registers, addressing modes and instructions.

The constant definition informs the system of the constant, and gives a range for its values. The register definition is used to define all registers belonging to a particular class (e.g. the A registers on the MC68000) and any synonym names for the registers (e.g. register A7 is also referred to as the ST register). In most machine descriptions, each target machine instruction allows a number of addressing modes for each operand. Peep allows an enumeration specification of each of the operands.

This enumeration set is used in the pattern matcher for the instructions. When searching for optimizations, two operands may be intersected to find the overlap of addressing modes. If the intersection is empty then the optimization is ignored.

The instruction definition consists of the following parts:

1. The input format for the instruction;

2. The semantic equation describing the function. We use a restricted set of Lisp operators and simple register transfer statements;

3. A time cost of the instruction. This can be a constant or an equation that depends upon the addressing mode of the operands; and

4. A space cost for the instruction. This may also be a constant, or an equation dependent upon the addressing modes of the operands. Note that the units of time and space do not matter, as long as they are consistent across each instruction.

Figure 1 below illustrates a sample of the MC68000 machine description.

    % Defines the 8 A registers and synonyms.
    (define-register An
      (A0 0) (A1 1) (A2 2) (A3 3) (A4 4) (A5 5)
      (A6 6 ZeroAReg) (A7 7 ST))

    % Define the 8 D registers and synonyms.
    (define-register Dn
      (D0 NIL) (D1 T1) (D2 T2) (D3 T3)
      (D4 T4) (D5 T5) (D6 T6) (D7 T7))

    % A quickConstant is in the range 0 to 7.
    (define-constant QuickConst 0 7)

    % Define the addressing modes in terms of
    % previously defined modes (not presented here)
    % Addressing mode EA (Effective Address),
    % includes all of the modes for the 68000.
    (define-addressing-mode EA
      Dn An Indirect Postincrement Predecrement
      Displacement Indexed Absolute Displacement-PC
      Indexed-PC Immediate)

    % EA-An is EA without An
    (define-addressing-mode EA-An
      Dn Indirect Postincrement Predecrement
      Displacement Indexed Absolute Displacement-PC
      Indexed-PC Immediate)

    % A couple of instructions.
    % Note, Make-Flags is a macro that takes the
    % expression, and a list of condition codes
    % settings and returns an expression that
    % defines them all.
    ((moveq Quick-Constant Dn)
     (Make-Flags (setf Dn Quick-Constant)
                 Dn * * 0 0 -)
     4                               % Time is 4 units
     1)                              % Space is 1 unit

    ((add.l EA-All Dn)
     (Make-Flags (setf Dn (+ Dn EA-All))
                 Dn * * * * *)
     (+ 8 (EA-Time EA-All 4))        % Time & Space depend
     (long EA-An))                   % on Addressing mode

    ((addq.l QuickConst EA-An)
     (Make-Flags (setf EA-An (+ EA-An QuickConst))
                 EA-An * * * * *)
     (+ 8 (EA-Time EA-An 4))
     (long EA-An))

Figure 1: Sample of the Motorola MC68000 Machine Description

5. Peep Table Generator

As a side effect of reading in the machine description, the PTG builds a table of the defined and used resources for each instruction. This table is used by the flow analysis routines to determine the effects of each instruction. This table building process is a simple translation of the input machine description into a usable form. It takes the machine instruction definition and produces a table indexed by the machine operator. Each entry includes the used resources and those defined, including the resulting equation. For example, for the add.l instruction, the table entry would be:

    (add.l ,EA-All ,Dn)         % Table entry, the "," designates
                                % an addressing mode to match.
    (,EA-All ,Dn)               % Used resources
    ((,Dn (+ ,Dn ,EA-All))      % Defined resources
     (,N (< ,Dn 0))             % The various condition
     (,Z (= ,Dn 0))             % codes.
     (,V (Overflow))
     (,C (Carry))
     (,X (Carry)))

The real work of the PTG begins when it searches for instructions that may be combined into a single, more efficient instruction. This is accomplished by algebraically combining the semantic equation of each instruction with that of each of the other instructions (one instruction at a time). The newly combined instruction is used to search a table of all of the instruction specifications. If an instruction is found that is more efficient (in time and/or space, as designated by the compiler builder) then the three instructions are stored in the table for use by Peep. This is accomplished using the following technique:

1. Compare each instruction with every other instruction and determine if the first instruction defines a resource that is used in the second. If it cannot reference the resource, then proceed to step 2. Otherwise, substitute the source of the first instruction into the reference of the second and search for an instruction that performs the combined semantics (see discussion below on searching with addressing modes). If one is found, make sure that its space/time cost is less than the combination of the two original instructions;

2. Upon reaching this step, the two instructions are independent. Thus, the PTG appends them together and searches for an instruction that performs both tasks. This case rarely occurs, but an example is the multiple word move instruction on the MC68000, which allows transferring a set of registers to consecutive memory locations. Two move instructions could be combined into a single movem instruction under the constraint on the target memory locations.

This process is obviously of order N^2. However, since it is performed at compiler generation time it need only be done once and the resulting table stored for use in the compiler. The searching process is heavily pattern match based and therefore utilizes Will Galway's pattern compiler [8].
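Step 1 of the technique above can be sketched in Python using the three instructions of Figure 1. This is a rough illustration under stated assumptions, not the PTG itself: the Insn record, the "load"/"add" operation tags and the flat costs are hypothetical simplifications (Figure 1's real costs depend on the addressing mode), condition codes are ignored, QuickConst is treated as its own operand kind, and Step 2 is not modeled. Addressing modes are represented as sets, so "can be mapped into" becomes a non-empty intersection.

```python
from itertools import permutations
from typing import NamedTuple, Optional

# Addressing-mode sets taken from Figure 1; EA_ALL stands for the full set.
EA_ALL = frozenset({
    "Dn", "An", "Indirect", "Postincrement", "Predecrement", "Displacement",
    "Indexed", "Absolute", "Displacement-PC", "Indexed-PC", "Immediate",
})
EA_AN = EA_ALL - {"An"}          # "EA without An"
DN = frozenset({"Dn"})
QUICK = frozenset({"QuickConst"})


class Insn(NamedTuple):
    name: str
    op: str          # hypothetical tag: "load" is dst := src, "add" is dst := dst + src
    src: frozenset   # operand kinds allowed for the source
    dst: frozenset   # operand kinds allowed for the destination
    cost: int        # placeholder flat cost, not the mode-dependent cost of Figure 1


INSNS = [
    Insn("moveq", "load", QUICK, DN, 4),
    Insn("add.l", "add", EA_ALL, DN, 8),
    Insn("addq.l", "add", QUICK, EA_AN, 8),
]


def combine(first: Insn, second: Insn):
    """Substitute first's source into second's use of first's destination."""
    if first.op != "load":
        return None              # this sketch only handles a plain load as the definer
    if not (first.dst & second.src):
        return None              # second cannot reference what first defines
    return (second.op, first.src, second.dst)


def find_replacement(combined, budget: int) -> Optional[Insn]:
    """Search the instruction set for a cheaper instruction with the combined semantics."""
    op, src, dst = combined
    for cand in INSNS:
        if cand.op == op and (cand.src & src) and (cand.dst & dst) and cand.cost < budget:
            return cand
    return None


def build_table():
    """Order N^2 pass over all ordered instruction pairs."""
    table = {}
    for first, second in permutations(INSNS, 2):
        combined = combine(first, second)
        if combined is None:
            continue
        repl = find_replacement(combined, first.cost + second.cost)
        if repl is not None:
            table[(first.name, second.name)] = repl.name
    return table


print(build_table())  # {('moveq', 'add.l'): 'addq.l'}
```

Because the quadratic loop runs once when the table is built, the optimizer itself only needs a dictionary lookup per candidate pair, which is the trade-off the text describes.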

Multiple addressing modes for each instruction make the search and algebraic combination processes a little more complicated. However, this is easily solved when one recognizes that the addressing modes can be viewed as a set of types of operands for each instruction. The following example should help to illustrate this technique (Figure 1 contains the original definition of the instructions used in this example; we have removed the condition code and cost information to make the explanation more transparent). The first instruction, moveq, loads a quick constant (range 0 to 7) into a D register:

    moveq  -> (setf Dn Quick-Constant)

Using Step 1 above, this may be combined with the second instruction, add.l, only if Dn can be mapped into EA-All:

    add.l  -> (setf Dn (+ Dn EA-All))

EA-All is the set of all addressing modes, so it does include Dn. That results in the combined instruction:

    ???    -> (setf Dn (+ Dn QuickConst))

with the constraint that EA-All in the add.l instruction must be a Dn. Searching the instructions, we find the addq.l instruction:

    addq.l -> (setf EA-An (+ EA-An QuickConst))

This instruction matches the combined instruction only if once again Dn can be mapped into EA-An. EA-An does include Dn as one of its modes and therefore matches. The resulting pattern, with the constraints imposed by the matches, is:

    Patterns     (moveq QuickConst Dx)
               & (add.l Dx Dy)
    Replacement  (addq.l QuickConst Dy)

Note, Dx and Dy are syntactic renamings to avoid possible confusion that would be caused by two Dn's.

6. Peep Optimization

Peep first performs a global flow analysis on each procedure, using techniques quite similar to those described in Aho and Ullman [1]. The analysis phase first breaks up the procedure into basic blocks. These blocks are then linked into a control flow graph. A fairly standard dataflow analysis is performed to determine the "logical adjacency" of the instructions within each basic block. This is accomplished by looking up which resources each instruction uses and "defines" (modifies). This is slightly different from the flow analysis described in Aho and Ullman, since, unlike the dataflow graphs used there, each instruction may define many resources. For example, an instruction may change both a register and some condition codes. This has ramifications in dead code elimination, in that usage of ALL resources defined in the instruction must be dead. The final analysis step is to determine the global flow of resources between the basic blocks. Again, the technique used is essentially the same as the techniques for global flow analysis described in Aho and Ullman.

At this point the following information has been found for each instruction in the procedure:

- The usage points for each resource defined by the instruction.

- The definition points for each resource used by the instruction.

- A list of "equivalent resources" for each resource used by the instruction. These are common subexpressions. Their detection allows Peep to (for example) replace a use of a memory location with the use of a register, when they both contain the same value.

This information is then used to perform the following transformations on instructions:

- When an instruction uses a value that is available in more than one resource, it is modified to use a "canonical" resource of least cost.

- "Dead" instructions are deleted. A dead instruction is one for which there are no usage points for any of the resources it modifies. A consequence of this and the previous transformation is that instructions are deleted when they calculate a value that is already available. This is because all usages of that value will refer to the "canonical" resource, and the other instruction becomes dead.

- Each definition point for a resource used by an instruction defines a pair of instructions: the defining instruction and the using instruction. Each pair is looked up in the table of optimizable pairs provided by the PTG, and the pair is replaced if it matches a pattern in that table. Note that usage information can affect the success of a match. For example, an increment instruction may affect condition codes in a manner different from an add instruction, so the use of the increment instruction would only be legal if the condition codes are not used.

Note that the first two transformations are basically machine independent, although machine dependent information was used during the analysis phase.

Peep updates the definition and usage information after performing an optimization. The affected instructions (those for which the definition/usage information has changed) are rescanned for further possible optimizations. This procedure is repeated until no further optimizations are found.

7. Complexity Comparison with PO

The worst case complexity of the PO peephole optimizer is of order N (where N is the number of instructions in the basic block to be optimized). This worst case could occur when the optimizer scans the full N instructions and repeatedly optimizes the last instruction pair until a single instruction remains. Since Peep uses nearly the same algorithm for performing the optimization, its complexity is of order N log N [14]. This extra overhead is due to the global flow analysis. However, since Peep has already
[...] optimization should be less than that of PO. Peep uses a simple table lookup, while PO must perform an algebraic reduction of the instructions, followed by a table search for a candidate replacement.

8. Current Status and Future Research

The first implementation for Peep/PSL has begun for the Motorola MC68000. The machine has been defined and the PTG table has been derived. The global flow analysis has been implemented. It is still a little cumbersome, but is scheduled for reimplementation after we gain more experience with its use. The original PSL compiler did not have any peephole optimization phase, so using Peep has necessarily slowed the compilation process down. Our initial measurements indicate that Peep adds about 40% in time to perform its optimizations. We expect that by using techniques as outlined in [14] this figure will be improved significantly. On the average, we have seen about a 20% improvement in the code size.

Future work in Peep will continue in the following areas:

1. We will continue to extend Peep to the other machines in the PSL family until we are satisfied that the overhead incurred by the global flow analysis is minimized.

2. Loop invariant removal is not done in the current system and could be added relatively easily. However, we are not convinced that this is the proper place for this type of optimization, as it should really be moved to earlier in the compiler.

3. As discussed by Tanenbaum [15], peephole optimization could be applied to the virtual machine operations. By describing the virtual machine operations in terms of an architectural description, a peephole optimizer could be created to operate directly on the virtual machine. We plan to investigate the feasibility of this approach (however, the new PSL compiler that we are designing will not have an intermediate virtual machine and may therefore not be appropriate for this type of optimization).

4. Addressing modes that modify resources (like predecrement and postincrement on the MC68000) need more study to be handled efficiently. We currently view them as two "setf" statements, which complicates the definition and the algebraic reduction process.

9. Acknowledgments

The author would like to thank Will Galway for his many discussions in the design of the system, and for his effort in implementing the global flow analysis phase of Peep. Finally, we appreciate the assistance provided by other members of the Utah PASS Project in reading earlier drafts of this paper.

10. References

1. Aho, A.V. and Ullman, J.D. Principles of Compiler Design. Addison-Wesley, Reading, Mass., 1977.

2. Barbacci, M.R. Instruction Set Processor Specifications (ISPS): The Notation and its Applications. Tech. Rept. CMU-CS-79-123, Carnegie-Mellon University, May, 1979.

3. Bell, C.G. and Newell, A. The PMS and ISP descriptive systems. In Computer Structures: Readings and Examples, McGraw-Hill, New York, New York, 1971, pp. 15-88, 607-637.

4. Davidson, J.W. Simplifying Code Generation Through Peephole Optimization. Technical Report TR 81-19, Department of Computer Science, The University of Arizona, December, 1981.

5. Davidson, J.W. and Fraser, C.W. "The Design and Application of a Retargetable Peephole Optimizer." ACM Transactions on Programming Languages and Systems 2, 2 (April 1980), 191-202.

6. Davidson, J.W. and Fraser, C.W. Eliminating Redundant Object Code. Conference Record of the Ninth Annual ACM Symposium on Principles of Programming Languages, New York, New York, January, 1982, pp. 128-132.

7. Fraser, C.W. A Compact, Machine-Independent Peephole Optimizer. Sixth Annual ACM Symposium on Principles of Programming Languages, ACM, January, 1979, pp. 1-6.

8. Galway, W. A Pattern Compiler for Lisp. Utah Portable AI Support Systems Project, Opnote 84-02, University of Utah, Department of Computer Science, April, 1984.

9. Griss, M.L. and Hearn, A.C. "A Portable LISP Compiler." Software - Practice and Experience 11 (June 1981), 541-605.

10. Kessler, R.R. COG: An Architectural-Description-Driven Compiler Generator. Ph.D. Th., Department of Computer Science, University of Utah, Salt Lake City, Utah 84112, January 1981.

11. Lowry, E.S. and Medlock, C.W. "Object Code Optimization." CACM 12, 1 (1969), 13-22.

12. McKeeman, W.M. "Peephole Optimization." CACM 8, 7 (1965), 443-444.

13. Motorola, Inc. MC68000 16-Bit Microprocessor User's Manual. Prentice-Hall, Inc., Englewood Cliffs, N.J. 07632, 1982.

14. Muchnick, S.S. and Jones, D.D. Program Flow Analysis. Prentice-Hall, Inc., Englewood Cliffs, N.J. 07632, 1981.

15. Tanenbaum, A.S.; van Staveren, H. and Stevenson, J.W. "Using Peephole Optimization on Intermediate Code." ACM Transactions on Programming Languages and Systems 4, 1 (1982), 21-36.
