An E Ective Technique for VLIW and Superscalar Compilation

The Sup erblo ck: An E ectiveTechnique for VLIW and Sup erscalar Compilation Wen-mei W. Hwu Scott A. Mahlke William Y. Chen Pohua P. Chang Nancy J. Warter Roger A. Bringmann Roland G. Ouellette Richard E. Hank Tokuzo Kiyohara Grant E. Haab John G. Holm Daniel M. Lavery Abstract A compiler for VLIW and sup erscalar pro cessors must exp ose sucient instruction-level parallelism ILP to e ectively utilize the parallel hardware. However, ILP within basic blo cks is extremely limited for control-intensi ve programs. Wehave develop ed a set of techniques for exploiting ILP across basic blo ck b oundaries. These techniques are based on a novel structure called the superblock . The sup erblo ck enables the optimizer and scheduler to extract more ILP along the imp ortant execution paths by systematically removing constraints due to the unimp ortant paths. Sup erblo ck optimization and scheduling have b een implemented in the IMPACT-I compiler. This implementation gives us a unique opp ortunity to fully understand the issues involved in incorp orating these techniques into a real compiler. Sup erblo ck optimizations and scheduling are shown to b e useful while taking into accountavariety of architectural features. Index terms -codescheduling, control-intensive programs, instruction-level parallel pro cessing, optimizing compiler, pro le information, sp eculative execution, sup erblo ck, sup erscalar pro cessor, VLIW pro cessor. The authors are with the Center for Reliable and High-Performance Computing, University of Illinois, Urbana-Champaign, Illinois, 61801. Pohua Chang is now with Intel Corp oration. Roland Ouellette is now with Digital Equipment Corp oration. Tokuzo Kiyohara is with Matsushita Electric Industrial Co., Ltd., Japan. Daniel Lavery is also with the Center for Sup ercomputing Research and Development at the University of Illinois at Urbana-Champaign. 1 To App ear: Journal of Sup ercomputing, 1993 2 1 Intro duction VLIW and sup erscalar pro cessors contain multiple data paths and functional units, making them capable of issuing multiple instructions p er clo ck cycle [Colwell et al. 1987; Fisher 1981; Horst et al. 1990; Intel 1989; Rau et al. 1989; Warren 1990]. As a result, the p eak p erformance of the coming generation of VLIW and sup erscalar pro cessors will range from two to eight times greater than their scalar predecessors which execute at most one instruction p er clo ck cycle. However, previous studies have shown that using conventional co de optimization and scheduling metho ds, sup erscalar and VLIW pro cessors cannot pro duce a sustained sp eedup of more than two for control-intensive programs [Jouppi and Wall 1989; Schuette and Shen 1991; Smith et al. 1989]. For such programs, conventional compilation metho ds do not provide enough supp ort to utilize these pro cessors. Traditionally, the primary ob jective of co de optimization is to reduce the numb er and complexityof executed instructions. Few existing optimizing compilers attempt to increase the instruction-level parallelism, 1 or ILP , of the co de. This is b ecause most optimizing compilers have b een designed for scalar pro cessors, which b ene t little from the increased ILP. Since VLIW and sup erscalar pro cessors can take advantage of the increased ILP, it is imp ortant for the co de optimizer to exp ose enough parallelism to fully utilize the parallel hardware. The amount of ILP within basic blo cks is extremely limited for control-intensive programs. Therefore, the co de optimizer must lo ok b eyond basic blo cks to nd sucient ILP.Wehave develop ed a set of optimizations to increase ILP across basic blo ck b oundaries. These optimizations are based on a novel structure called the superblock. The formation and optimization of sup erblo cks increases the ILP along the imp ortant execution paths by systematically removing constraints due to the unimp ortant paths. Because these optimizations increase the ILP within sup erblo cks, they are collectively referred to as superblock ILP optimizations. Unlike co de optimization, co de scheduling for VLIW pro cessors has received extensive treatmentinthe literature [Aiken and Nicolau 1988; Bernstein and Ro deh 1991; Ellis 1985; Fisher 1981; Gupta and So a 1990]. In particular, the trace scheduling technique invented by Fisher has b een shown to b e very e ective for rearranging instructions across basic blo cks [Fisher 1981]. An imp ortant issue for trace scheduling is the compiler implementation complexity incurred by the need to maintain correct program execution after 1 One can measure the ILP as the average numb er of simultaneously executable instructions p er clo ck cycle. It is a function of the data and control dep endences b etween instructions in the program as well as the instruction latencies of the pro cessor hardware. It is indep endent of all other hardware constraints. To App ear: Journal of Sup ercomputing, 1993 3 moving instructions across basic blo cks. The co de scheduling technique describ ed in this pap er, whichis derived from trace scheduling, employs the sup erblo ck. Sup erblo ck ILP optimizations remove constraints, and the co de scheduler implementation complexity is reduced. This co de scheduling approach will b e referred to as superblock scheduling. In order to characterize the cost and e ectiveness of the sup erblo ck ILP optimizations and sup erblo ck scheduling, wehave implemented these techniques in the IMPACT-I compiler develop ed at the University of Illinois. The fundamental premise of this pro ject is to provide a complete compiler implementation that allows us to quantify the impact of these techniques on the p erformance of VLIW and sup erscalar pro cessors by compiling and executing large control-intensive programs. In addition, this compiler allows us to fully understand the issues involved in incorp orating these optimization and scheduling techniques into a real compiler. Sup erblo ck optimizations are shown to b e useful while taking into accountavarietyof architectural parameters. Section 2 of this pap er intro duces the sup erblo ck. Section 3 gives a concise overview of the sup erblo ck ILP optimizations and sup erblo ckscheduling. Section 4 presents the cost and p erformance of these techniques. The concluding remarks are made in Section 5. 2 The Sup erblo ck The purp ose of co de optimization and scheduling is to minimize the execution time while preserving the program semantics. When this is done globally, some optimization and scheduling decisions may decrease the execution time for one control path while increasing the time for another path. By making these decisions in favor of the more frequently executed path, an overall p erformance improvement can b e achieved. Trace scheduling is a technique that was develop ed to allowscheduling decisions to b e made in this manner [Ellis 1985; Fisher 1981]. In trace scheduling the function is divided into a set of traces that represent the frequently executed paths. There may b e conditional branches out of the middle of the trace side exits and transitions from other traces into the middle of the trace side entrances. Instructions are scheduled within each trace ignoring these control- ow transitions. After scheduling, b o okkeeping is required to ensure the correct execution of o -trace co de. Co de motion past side exits can b e handled in a fairly straightforward manner. If an instruction I is moved from ab ovetobelow a side exit, and the destination of I is used b efore it is rede ned when the side To App ear: Journal of Sup ercomputing, 1993 4 . Instr 5 Instr 1 Instr 1 Instr 3 Instr 2 Instr 2 Instr 1 Instr 5 Instr 4 Instr 3 Instr 3 Instr 2 Instr 2 Instr 4 Instr 4 Instr 3 Instr 3 Instr 5 Instr 1 Instr 4 Instr 4 Instr 5 Instr 5 . (a) (b) Figure 1: Instruction scheduling across trace side entrances. a Moving an instruction b elow a side entrance. b Moving an instruction ab ove a side entrance. exit is taken, then a copyofI must also b e placed b etween the side exit and its target. Movementofan instruction from b elowtoabove a branch can also b e handled without to o much diculty. The metho d for doing this is describ ed in section 3.4. More complex b o okkeeping must b e done when co de is moved ab ove and b elow side entrances. Figure 1 illustrates this b o okkeeping. In Figure 1a, when Instr 1 is moved b elow the side entrance to after Instr 4, the side entrance is moved b elow Instr 1. Instr 3 and Instr 4 are then copied to the side entrance. Likewise, in Figure 1b, when Instr 5 is moved ab ove the side entrance, it must also b e copied to the side entrance. Side entrances can also make it more complex to apply optimizations to traces. For example, Figure 2 shows how copy propagation can b e applied to the trace and the necessary b o okkeeping for the o -trace co de. In this example, in order to propagate the value of r1 from I1 to I3, b o okkeeping must b e p erformed. Before propagating the value of r1, the side entrance is moved to b elow I3 and instructions I2 and I3 are copied to the side entrance. The b o okkeeping asso ciated with side entrances can b e avoided if the side entrances are removed from the trace. A superblock is a trace which has no side entrances. Control may only enter from the top but may leave at one or more exit p oints.

Load more