The Sup erblo ck: An E ectiveTechnique
for VLIW and Sup erscalar Compilation
Wen-mei W. Hwu Scott A. Mahlke William Y. Chen Pohua P. Chang
Nancy J. Warter Roger A. Bringmann Roland G. Ouellette Richard E. Hank
Tokuzo Kiyohara Grant E. Haab John G. Holm Daniel M. Lavery
Abstract
A compiler for VLIW and sup erscalar pro cessors must exp ose sucient instruction-level parallelism
ILP to e ectively utilize the parallel hardware. However, ILP within basic blo cks is extremely limited
for control-intensi ve programs. Wehave develop ed a set of techniques for exploiting ILP across basic
blo ck b oundaries. These techniques are based on a novel structure called the superblock . The sup erblo ck
enables the optimizer and scheduler to extract more ILP along the imp ortant execution paths by sys-
tematically removing constraints due to the unimp ortant paths. Sup erblo ck optimization and scheduling
have b een implemented in the IMPACT-I compiler. This implementation gives us a unique opp ortunity
to fully understand the issues involved in incorp orating these techniques into a real compiler. Sup erblo ck
optimizations and scheduling are shown to b e useful while taking into accountavariety of architectural
features.
Index terms -codescheduling, control-intensive programs, instruction-level parallel pro cessing, op-
timizing compiler, pro le information, sp eculative execution, sup erblo ck, sup erscalar pro cessor, VLIW
pro cessor.
The authors are with the Center for Reliable and High-Performance Computing, University of Illinois, Urbana-Champaign,
Illinois, 61801. Pohua Chang is now with Intel Corp oration. Roland Ouellette is now with Digital Equipment Corp oration.
Tokuzo Kiyohara is with Matsushita Electric Industrial Co., Ltd., Japan. Daniel Lavery is also with the Center for Sup ercom-
puting Research and Development at the University of Illinois at Urbana-Champaign. 1
To App ear: Journal of Sup ercomputing, 1993 2
1 Intro duction
VLIW and sup erscalar pro cessors contain multiple data paths and functional units, making them capable of
issuing multiple instructions p er clo ck cycle [Colwell et al. 1987; Fisher 1981; Horst et al. 1990; Intel 1989;
Rau et al. 1989; Warren 1990]. As a result, the p eak p erformance of the coming generation of VLIW and
sup erscalar pro cessors will range from two to eight times greater than their scalar predecessors which execute
at most one instruction p er clo ck cycle. However, previous studies have shown that using conventional co de
optimization and scheduling metho ds, sup erscalar and VLIW pro cessors cannot pro duce a sustained sp eedup
of more than two for control-intensive programs [Jouppi and Wall 1989; Schuette and Shen 1991; Smith et al.
1989]. For such programs, conventional compilation metho ds do not provide enough supp ort to utilize these
pro cessors.
Traditionally, the primary ob jective of co de optimization is to reduce the numb er and complexityof
executed instructions. Few existing optimizing compilers attempt to increase the instruction-level parallelism,
1
or ILP , of the co de. This is b ecause most optimizing compilers have b een designed for scalar pro cessors,
which b ene t little from the increased ILP. Since VLIW and sup erscalar pro cessors can take advantage of
the increased ILP, it is imp ortant for the co de optimizer to exp ose enough parallelism to fully utilize the
parallel hardware.
The amount of ILP within basic blo cks is extremely limited for control-intensive programs. Therefore, the
co de optimizer must lo ok b eyond basic blo cks to nd sucient ILP.Wehave develop ed a set of optimizations
to increase ILP across basic blo ck b oundaries. These optimizations are based on a novel structure called the
superblock. The formation and optimization of sup erblo cks increases the ILP along the imp ortant execution
paths by systematically removing constraints due to the unimp ortant paths. Because these optimizations
increase the ILP within sup erblo cks, they are collectively referred to as superblock ILP optimizations.
Unlike co de optimization, co de scheduling for VLIW pro cessors has received extensive treatmentinthe
literature [Aiken and Nicolau 1988; Bernstein and Ro deh 1991; Ellis 1985; Fisher 1981; Gupta and So a
1990]. In particular, the trace scheduling technique invented by Fisher has b een shown to b e very e ective
for rearranging instructions across basic blo cks [Fisher 1981]. An imp ortant issue for trace scheduling is
the compiler implementation complexity incurred by the need to maintain correct program execution after
1
One can measure the ILP as the average numb er of simultaneously executable instructions p er clo ck cycle. It is a function
of the data and control dep endences b etween instructions in the program as well as the instruction latencies of the pro cessor
hardware. It is indep endent of all other hardware constraints.
To App ear: Journal of Sup ercomputing, 1993 3
moving instructions across basic blo cks. The co de scheduling technique describ ed in this pap er, whichis
derived from trace scheduling, employs the sup erblo ck. Sup erblo ck ILP optimizations remove constraints,
and the co de scheduler implementation complexity is reduced. This co de scheduling approach will b e referred
to as superblock scheduling.
In order to characterize the cost and e ectiveness of the sup erblo ck ILP optimizations and sup erblo ck
scheduling, wehave implemented these techniques in the IMPACT-I compiler develop ed at the University
of Illinois. The fundamental premise of this pro ject is to provide a complete compiler implementation
that allows us to quantify the impact of these techniques on the p erformance of VLIW and sup erscalar
pro cessors by compiling and executing large control-intensive programs. In addition, this compiler allows us
to fully understand the issues involved in incorp orating these optimization and scheduling techniques into
a real compiler. Sup erblo ck optimizations are shown to b e useful while taking into accountavarietyof
architectural parameters.
Section 2 of this pap er intro duces the sup erblo ck. Section 3 gives a concise overview of the sup erblo ck ILP
optimizations and sup erblo ckscheduling. Section 4 presents the cost and p erformance of these techniques.
The concluding remarks are made in Section 5.
2 The Sup erblo ck
The purp ose of co de optimization and scheduling is to minimize the execution time while preserving the
program semantics. When this is done globally, some optimization and scheduling decisions may decrease
the execution time for one control path while increasing the time for another path. By making these decisions
in favor of the more frequently executed path, an overall p erformance improvement can b e achieved.
Trace scheduling is a technique that was develop ed to allowscheduling decisions to b e made in this
manner [Ellis 1985; Fisher 1981]. In trace scheduling the function is divided into a set of traces that
represent the frequently executed paths. There may b e conditional branches out of the middle of the trace
side exits and transitions from other traces into the middle of the trace side entrances. Instructions are
scheduled within each trace ignoring these control- ow transitions. After scheduling, b o okkeeping is required
to ensure the correct execution of o -trace co de.
Co de motion past side exits can b e handled in a fairly straightforward manner. If an instruction I is
moved from ab ovetobelow a side exit, and the destination of I is used b efore it is rede ned when the side
To App ear: Journal of Sup ercomputing, 1993 4
...... Instr 5 Instr 1 Instr 1 Instr 3 Instr 2 Instr 2 Instr 1 Instr 5 Instr 4 Instr 3 Instr 3 Instr 2 Instr 2 Instr 4 Instr 4 Instr 3 Instr 3 Instr 5 Instr 1 Instr 4 Instr 4 Instr 5 Instr 5 ......
(a) (b)
Figure 1: Instruction scheduling across trace side entrances. a Moving an instruction b elow a side entrance.
b Moving an instruction ab ove a side entrance.
exit is taken, then a copyofI must also b e placed b etween the side exit and its target. Movementofan
instruction from b elowtoabove a branch can also b e handled without to o much diculty. The metho d for
doing this is describ ed in section 3.4.
More complex b o okkeeping must b e done when co de is moved ab ove and b elow side entrances. Figure 1
illustrates this b o okkeeping. In Figure 1a, when Instr 1 is moved b elow the side entrance to after Instr 4,
the side entrance is moved b elow Instr 1. Instr 3 and Instr 4 are then copied to the side entrance. Likewise,
in Figure 1b, when Instr 5 is moved ab ove the side entrance, it must also b e copied to the side entrance.
Side entrances can also make it more complex to apply optimizations to traces. For example, Figure 2
shows how copy propagation can b e applied to the trace and the necessary b o okkeeping for the o -trace
co de. In this example, in order to propagate the value of r1 from I1 to I3, b o okkeeping must b e p erformed.
Before propagating the value of r1, the side entrance is moved to b elow I3 and instructions I2 and I3 are
copied to the side entrance.
The b o okkeeping asso ciated with side entrances can b e avoided if the side entrances are removed from
the trace. A superblock is a trace which has no side entrances. Control may only enter from the top but
may leave at one or more exit p oints. Sup erblo cks are formed in two steps. First, traces are identi ed using
execution pro le information [Chang and Hwu 1988]. Second, a pro cess called tail duplication is p erformed
to eliminate any side entrances to the trace [Chang et al. 1991b]. A copy is made of the tail p ortion of
the trace from the rst side entrance to the end. All side entrances into the trace are then moved to the
corresp onding duplicate basic blo cks. The basic blo cks in a sup erblo ck need not b e consecutive in the co de.
However, our implementation restructures the co de so that all blo cks in a sup erblo ck app ear in consecutive
To App ear: Journal of Sup ercomputing, 1993 5
...... r1 <- 64 . r1 <- 64 . . . . .
I1: r1 <- 4 I1: r1 <- 4 I2': r2 <- label(_A)
I2: r2 <- label(_A) I2: r2 <- label(_A) I3': r3 <- mem(r2+r1)
I3: r3 <- mem(r2+r1) I3: r3 <- mem(r2+4)
......
(a) (b)
Figure 2: Applying copy propagation to an instruction trace. a Before copy propagation. b After copy
propagation with b o okkeeping co de inserted.
order for b etter cache p erformance.
The formation of sup erblo cks is illustrated in Figure 3. Figure 3a shows a weighted ow graph which
represents a lo op co de segment. The no des corresp ond to basic blo cks and arcs corresp ond to p ossible
control transfers. The count of each basic blo ck indicates the execution frequency of that basic blo ck.
In Figure 3a, the count of fA; B ; C; D; E ; F g is f100; 90; 10; 0; 90; 100g, resp ectively. The count of each
control transfer indicates the frequency of invoking these control transfers. In Figure 3a, the count of
fA ! B; A ! C; B ! D; B ! E; C ! F; D ! F; E ! F; F ! Ag is f90; 10; 0; 90; 10; 0; 90; 99g, resp ectively.
Clearly, the most frequently executed path in this example is the basic blo ck sequence < A;B;E;F >. There
are three traces: fA; B; E; F g, fC g, and fD g. After trace selection, each trace is converted into a sup erblo ck.
In Figure 3a, we see that there are two control paths that enter the fA; B; E; F g trace at basic blo ck F .
Therefore, we duplicate the tail part of the fA; B; E; F g trace starting at basic blo ck F . Each duplicated basic
blo ck forms a new sup erblo ck that is app ended to the end of the function. The result is shown in Figure 3b.
Note that there are no longer side entrances into the most frequently traversed trace, < A;B;E;F >;ithas
b ecome a sup erblo ck.
Sup erblo cks are similar to the extended basic blo cks. An extended basic blo ck is de ned as a sequence
of basic blo cks B ...B such that for 1 i 1 k i i+1 1 a unique predecessor [Aho et al. 1986]. The di erence b etween sup erblo cks and extended basic blo cks lies mainly in how they are formed. Sup erblo ck formation is guided by pro le information and side entrances are removed to increase the size of the sup erblo cks. It is p ossible for the rst basic blo ck in a sup erblo ckto To App ear: Journal of Sup ercomputing, 1993 6 Y Y 1 1 A A 100 100 90 90 10 10 B C B C 90 10 90 10 90 90 0 0 D E D E 0 90 10 0 90 10 90 90 0 0 F F 100 90 99 89.1 1 0.9 Z Z 9.9 0.1 F' 10 (a) (b) Figure 3: An example of sup erblo ck formation. have a unique predecessor. 3 Sup erblo ck ILP Optimization and Scheduling Before sup erblo ckscheduling is p erformed, sup erblo ck ILP optimizations are applied which enlarge the sup erblo ck and increase instruction parallelism by removing dep endences. 3.1 Sup erblo ck Enlarging Optimizations The rst category of sup erblo ck ILP optimizations is sup erblo ck enlarging optimizations. The purp ose of these optimizations is to increase the size of the most frequently executed sup erblo cks so that the sup erblo ck scheduler can manipulate a larger numb er of instructions. It is more likely the scheduler will nd indep endent instructions to schedule at every cycle in a sup erblo ck when there are more instructions to cho ose from. An imp ortant feature of sup erblo ck enlarging optimizations is that only the most frequently executed parts of a program are enlarged. This selective enlarging strategy keeps the overall co de expansion under control [Chang To App ear: Journal of Sup ercomputing, 1993 7 et al. 1991b]. Three sup erblo ck enlarging optimizations are describ ed as follows. BranchTarget Expansion. Branch target expansion expands the target sup erblo ck of a likely taken control transfer which ends a sup erblo ck. The target sup erblo ck is copied and app ended to the end of the original sup erblo ck. Note that branch target expansion is not applied for control transfers which are lo op backedges. Branch target expansion continues to increase the size of a sup erblo ckuntil a prede ned sup erblo ck size is reached, or the branch ending the sup erblo ck do es not favor one direction. Lo op Peeling. Sup erblo ck lo op p eeling mo di es a sup erblo ck lo op a sup erblo ck which ends with a likely control transfer to itself which tends to iterate only a few times for each lo op execution. The lo op 2 The original b o dy is replaced by straight-line co de consisting of the rst several iterations of the lo op. lo op b o dy is moved to the end of the function to handle executions which require additional iterations. After lo op p eeling, the most frequently executed preceding and succeeding sup erblo cks can b e expanded into the p eeled lo op b o dy to create a single large sup erblo ck. Lo op Unrolling. Sup erblo ck lo op unrolling replicates the b o dy of a sup erblo ck lo op which tends to iterate many times. To unroll a sup erblo ckloopN times, N 1 copies of the sup erblo ck are app ended to the original sup erblo ck. The lo op backedge control transfers in the rst N 1 lo op b o dies are adjusted or removed if p ossible to account for the unrolling. If the iteration countisknown on lo op entry, it is p ossible to remove these transfers by using a preconditioning lo op to execute the rst mo d N iterations. However, preconditioning lo ops are not currently inserted for unrolled lo ops. 3.2 Sup erblo ck Dep endence Removing Optimizations The second category of sup erblo ck ILP optimizations is sup erblo ck dep endence removing optimizations. These optimizations eliminate data dep endences b etween instructions within frequently executed sup erblo cks, which increases the ILP available to the co de scheduler. As a side e ect, some of these optimizations increase the numb er of executed instructions. However, by applying these optimizations only to frequently executed sup erblo cks, the co de expansion incurred is regulated. Five sup erblo ck dep endence removing optimizations are describ ed as follows. Register Renaming. Reuse of registers by the compiler and variables by the programmer intro duces arti cial anti and output dep endences and restricts the e ectiveness of the scheduler. Many of these arti cial dep endences can b e eliminated with register renaming [Kuck et al. 1981]. Register renaming assigns unique 2 Using the pro le information, the lo op is p eeled its exp ected numb er of iterations. To App ear: Journal of Sup ercomputing, 1993 8 registers to di erent de nitions of the same register. A common use of register renaming is to rename registers within individual lo op b o dies of an unrolled sup erblo ck lo op. Op eration Migration. Op eration migration moves an instruction from a sup erblo ck where its result is not used to a less frequently executed sup erblo ck. By migrating an instruction, all of the data dep endences asso ciated with that instruction are eliminated from the original sup erblo ck. Op eration migration is p er- formed by detecting an instruction whose destination is not referenced in its home sup erblo ck. Based on a cost constraint, a copy of the instruction is placed at the target of each exit of the sup erblo ck in which the destination of the instruction is live. Finally, the original instruction is deleted. Induction Variable Expansion. Induction variables are used within lo ops to index through lo op iterations and through regular data structures such as arrays. When data access is delayed due to dep en- dences on induction variable computation, ILP is typically limited. Induction variable expansion eliminates rede nitions of induction variables within an unrolled sup erblo ck lo op. Each de nition of the induction variable is given a new induction variable, thereby eliminating all anti, output, and ow dep endences among the induction variable de nitions. However an additional instruction is inserted into the lo op preheader to initialize each newly created induction variable. Also, patch co de is inserted if the induction variable is used outside the sup erblo ck to recover the prop er value for the induction variable. Accumulator Variable Expansion. An accumulator variable accumulates a sum or pro duct in each iteration of a lo op. For lo ops of this typ e, the accumulation op eration often de nes the critical path within the lo op. Similar to induction variable expansion, anti, output, and ow dep endences b etween instructions which accumulate a total are eliminated by replacing each de nition of accumulator variable with a new accumulator variable. Unlike induction variable expansion, though, the increment or decrementvalue is not required to b e constant within the sup erblo ck lo op. Again, initialization instructions for these new accumulator variables must b e inserted into the sup erblo ck preheader. Also, the new accumulator variables are summed at all sup erblo ck exit p oints to recover the value of the original accumulator variable. Note that accumulator variable expansion applied to oating p ointvariables may not b e safe for programs which rely on the order which arithmetic op erations are p erformed to maintain accuracy.For programs of this typ e, an option is provided for the user to disable accumulator variable expansion of oating p ointvariables. Op eration Combining. Flow dep endences b etween pairs of instructions with the same precedence each with a compile-time constant source op erand can b e eliminated with op eration combining [Nakatani and Eb cioglu 1989]. Flow dep endences which can b e eliminated with op eration combining often arise b etween To App ear: Journal of Sup ercomputing, 1993 9 (d) pre : r11 = r1 r21 = r1 + 4 r31 = r1 + 8 r14 = r4 r24 = 0 r34 = 0 L1 : r12 = MEM(A + r11) Original Loop bne (r12 1) L13 (c) r13 = MEM(B + r11) (a) L1 : if (A[i] == 1) C += B[i] r14 = r14 + r13 i++ L1 : r2 = MEM(A + r1) bge (r21 N') L2 if (i < N) goto L1 bne (r2 1) L3 L2 : r3 = MEM(B + r1) r22 = MEM(A + r21) bne (r22 1) L23 r4 = r4 + r3 r23 = MEM(B + r21) r1 = r1 + 4 r24 = r24 + r23 bge (r1 N') L2 (b) L1 : r2 = MEM(A + r1) bge (r31 N') L2 r2 = MEM(A + r1) bne (r2 1) L3 r32 = MEM(A + r31) bne (r2 1) L3 r3 = MEM(B + r1) bne (r32 1) L33 r3 = MEM(B + r1) r4 = r4 + r3 r33 = MEM(B + r31) r4 = r4 + r3 r1 = r1 + 4 Loop Dependence r34 = r34 + r33 r1 = r1 + 4 blt (r1 N') L1 Unroll Removal r11 = r11 + 12 L2 : bge (r1 N') L2 r21 = r21 + 12 L3 : r1 = r1 + 4 r2 = MEM(A + r1) r31 = r31 + 12 blt (r1 N') L1 bne (r2 1) L3 blt (r11 N') L1 r3 = MEM(B + r1) goto L2 L2 : r4 = r14 + r24 r4 = r4 + r3 r4 = r4 + r34 r1 = r1 + 4 blt (r1 N') L1 L2 : L13 : r11 = r11 + 4 r21 = r21 + 4 r31 = r31 + 4 L3 : r1 = r1 + 4 blt (r1 N') L1 blt (r11 N') L1 goto L2 goto L2 L23 : r11 = r11 + 8 r21 = r21 + 8 r31 = r31 + 8 blt (r11 N') L1 goto L2 L33 : r11 = r11 + 12 r21 = r21 + 12 r31 = r31 + 12 blt (r11 N') L1 goto L2 Figure 4: An application of sup erblo ck ILP optimizations, a original program segment, b assembly co de after sup erblo ck formation, c assembly co de after sup erblo ck lo op unrolling, d assembly co de after sup erblo ck dep endence removal optimizations. address calculation instructions and memory access instructions. Also, similar opp ortunities o ccur for lo op variable increments and lo op exit branches. The ow dep endence is removed by substituting the expression of the rst instruction into the second instruction and evaluating the constants at compile time. Example. An example to illustrate lo op unrolling, register renaming, induction variable expansion, and accumulator variable expansion is shown in Figure 4. This example assumes that the condition of the if statement within the lo op is likely to b e true. After sup erblo ck formation, the resulting assembly co de is shown in Figure 4b. To enlarge the sup erblo ck lo op, lo op unrolling is applied. The lo op is unrolled three times in this example Figure 4c. After lo op unrolling, data dep endences limit the amount of ILP in the sup erblo ck lo op. The dep endences among successive up dates of r 4 Figure 4c are eliminated with accumulator variable expansion. In Figure 4d, three temp orary accumulators, r 14, r 24, and r 34, are created within the su- To App ear: Journal of Sup ercomputing, 1993 10 p erblo ck lo op. After accumulator expansion, all up dates to r 4 within one iteration of the unrolled lo op are indep endent of one another. In order to maintain correctness, the temp orary accumulators are prop erly ini- tialized in the lo op preheader pre, and the values are summed at the lo op exit p oint L2. The dep endences among successive up dates of r 1 along with up dates of r 1 and succeeding load instructions Figure 4c are eliminated with induction variable expansion. In Figure 4d, three temp orary induction variables, r 11, r 21, and r 31, are created within the sup erblo ck lo op. After induction variable expansion, the chain of dep en- dences created by the induction variable is eliminated within one iteration of the unrolled lo op. Finally, register renaming is applied to the load instructions to eliminate output dep endences. After all sup erblo ck ILP optimizations are applied, the execution of the original lo op b o dies within the unrolled sup erblo ckloop may b e completely overlapp ed by the scheduler. 3.3 Sup erblo ckScheduling After the sup erblo ck ILP optimizations are applied, sup erblo ckscheduling is p erformed. Co de scheduling within a sup erblo ck consists of two steps: dep endence graph construction followed by list scheduling. A dep endence graph is a data structure which represents the control and data dep endences b etween instruc- tions. A control dep endence is initially assigned b etween each instruction and every preceding branch in the sup erblo ck. Some control dep endences may then b e removed according to the available hardware supp ort describ ed in the following section. After the appropriate control dep endences have b een eliminated, list scheduling using the dep endence graph, instruction latencies, and resource constraints is p erformed on the sup erblo ck. 3.4 Sp eculative Execution Supp ort Sp eculative execution refers to the execution of an instruction b efore it is known that its execution is required. Such an instruction will b e referred to as a speculative instruction. Sp eculative execution o ccurs when the sup erblo ckscheduler moves an instruction J ab ove a preceding conditional branch B . During run time, J will b e executed b efore B , i.e., J is executed regardless of the branch direction of B .However, according to 3 the original program order, J should b e executed only if B is not taken. Therefore, the execution result of J must not b e used if B is taken, which is formally stated as follows: 3 Note that the blo cks of a sup erblo ck are laid out sequentially by the compiler. Each instruction in the sup erblo ck is always on the fall-through path of its preceding conditional branch. To App ear: Journal of Sup ercomputing, 1993 11 Restriction 1. The destination of J is not used b efore it is rede ned when B is taken. This restriction usually has very little e ect on co de scheduling after sup erblo ck ILP optimization. For example, in Figure 4d, after dep endence removal in blo ck L1, the instruction that loads B [i]into r 13 can b e executed b efore the direction the preceding bne instruction is known. Note that the execution result of this load instruction must not b e used if the branchistaken. This is trivially achieved b ecause r 13 is never used b efore de ned if the branch is taken; the value thus loaded into r 13 will b e ignored in the subsequent execution. A more serious problem with sp eculative execution is the prevention of premature program termination due to exceptions caused by sp eculative instructions. Exceptions caused by sp eculative instructions which would not have executed on a sequential machine must b e ignored, which leads to the following restriction: Restriction 2. J will never cause an exception that may terminate program execution when branch B is taken. In Figure 4d, the execution of the instruction that loads B [i]into r13 could p otentially cause a memory access violation fault. If this instruction is sp eculatively scheduled b efore its preceding branch and such a fault o ccurs during execution, the exception should b e ignored if the branchwas taken. Rep orting the exception if the branch is taken would have incorrectly terminated the execution of the program. Two levels of hardware supp ort for Restriction 2 will b e examined in this pap er: restrictedpercolation model and general percolation model. The restricted p ercolation mo del includes no supp ort for disregarding the exceptions generated by the sp eculative instructions. For conventional pro cessors, memory load, memory store, integer divide, and all oating p oint instructions can cause exceptions. When using the restricted p ercolation mo del, these instructions may not b e moved ab ove branches unless the compiler can prove that those instructions can never cause an exception when the preceding branchistaken. The limiting factor of the restricted p ercolation mo del is the inabilitytomove p otential trap-causing instructions with long latency, such as load instructions, ab ove branches. When the critical path of a sup erblo ck contains manyof these instructions, the p erformance of the restricted p ercolation mo del is limited. The general p ercolation mo del eliminates Restriction 2 by providing a non-trapping version of instructions that can cause exceptions [Chang et al. 1991a]. Exceptions that may terminate program execution are avoided by converting all sp eculative instructions which can p otentially cause exceptions into their non- trapping versions. For programs in which detection of exceptions is imp ortant, the loss of exceptions can b e recovered with additional architectural and compiler supp ort [Mahlke et al. 1992]. However, in this To App ear: Journal of Sup ercomputing, 1993 12 82% AAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Base AAAA AA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Profile AAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAA AAAA AA AAAA AAAA AAAA AAAA AAAAAAAA AAAA AA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAA AAAA AA AAAA AAAA AAAA AAAA AAAAAAAA AAAA AA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Superblock Formation AAAA AAA AAAA AA AAAA AAAA AAAA AAAA AAAAAAAA AAAA AA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAA AAAA AA AAAA AAAA AAAA AAAA AAAAAAAA AAAA AAAAA AAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAA AAAA AA AAAA AAAA AAAA AAAA AAAAAAAA AAAA AA AAAAAAAAAAAAAA A AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Superblock Optimization AAAA AAA AAAA AA AAA AAAA AAAA AAAA AAAA AAAA AAAA AAAAAAAA AAAA AA AAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAA AAAA AA AAA AAAA AAAA AAAA AAAA AAAA AAAA AAAAAAAA AAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAA AAAA AA AAA AAAA AAAA AAAA AAAA AAAA AAAA AAAAAAAA AAAA AAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAA AAAA AA AAA AAAA AAAA AAAA AAAA AAAA AAAAAAAAAAA AAAAAAAA AAAA AAAAAAAAAAAAAA A AAA AAAA AAA AAAA AAAA AAAA AAAAAAAAAAA AAAA AAA AAAA AAA AAAA AAAA AAAA AAAAAAAAAAA AAA AAAA AAA AAAA AAAA AAAA AAAAAAAAAAA AAA AAAA AAA AAAA AAAA AAAA AAAAAAA 12% AAAA 2% 4% Figure 5: Compiler co de size break down. The entire compiler consists of approximately 92,000 lines of C co de with one co de generator. pap er only the non-trapping execution supp ort is utilized. In Section 4.6, we will showhow the general p ercolation mo del allows the sup erblo ckscheduler to exploit much more of the ILP exp osed by sup erblo ck ILP optimizations than the restricted p ercolation mo del. 4 Implementation Cost and Performance Results In this section, we rep ort the implications of sup erblo ck ILP optimization and scheduling on compiler size, compile time, and output co de p erformance. We additionally examine three architectural factors that di- rectly a ect the e ectiveness of sup erblo ck ILP optimizations and scheduling: sp eculative execution supp ort, instruction cache misses, and data cache misses. 4.1 Compiler Size The IMPACT-I C compiler serves two imp ortant purp oses. First, it is intended to generate highly opti- mized co de for existing commercial micropro cessors. Wehave constructed co de generators for the MIPS TM TM TM TM TM R2000 ,SPARC , Am29000 , i860 , and HP PA-RISC pro cessors. Second, it provides a platform for studying new co de optimization techniques for instruction-level parallel pro cessing. New co de optimiza- tion techniques, once validated, can b e immediately applied to the VLIW and sup erscalar implementations of existing and new commercial architectures. Figure 5 shows the p ercentage of compiler size due to each level of compiler sophistication. The base level accounts for the C front-end, traditional co de optimizations, graph-coloring-based register allo cation, basic blo ckscheduling, and one co de generator [Chaitin 1982; Chow and Hennessy 1990]. The traditional optimizations include classical lo cal and global co de optimizations, function inline expansion, instruction placement optimization, pro le-based branch prediction, and constant preloading [Aho et al. 1986; Hwu and To App ear: Journal of Sup ercomputing, 1993 13 Chang 1989a; Hwu and Chang 1989b]. The pro le level generates dynamic execution frequency information and feeds this information back to the compiler. The sup erblo ck formation level p erforms trace selection, tail duplication, and sup erblo ckscheduling. The sup erblo ck ILP optimization level p erforms branch expan- sion, lo op unrolling, lo op p eeling, register renaming, induction variable expansion, accumulator expansion, op eration migration, and op eration combining. As shown in Figure 5, the compiler source co de dedicated to the sup erblo ck techniques is only ab out 14 of the entire IMPACT-I compiler. 4.2 Benchmark Programs Benchmark Size Benchmark Description Pro le Description Input Description cccp 4787 GNU C prepro cessor 20 C source les 100 - 5000 lines 1 C source le 4000 lines cmp 141 compare les 20 similar/dissimilar les 2 similar les 4000 lines compress 1514 compress les 20 C source les 100 - 5000 lines 1 C source le 4000 lines eqn 2569 typ eset math formulas 20 ditro les 100 - 4000 lines 1 ditro le 17000 lines eqntott 3461 b o olean minimization 5 les of b o olean equations standard SPEC 92 input espresso 6722 b o olean minimization 20 original espresso b enchmarks opa grep 464 string search 20 C source les with search strings 1 C source le 4000 lines lex 3316 lexical analyzer generator 5 lexers for C, lisp, pascal, awk, pic C lexer li 7747 lisp interpreter 5 gabriel b enchmarks queens 7 tbl 2817 format tables for tro 20 ditro les 100 - 4000 lines 1 ditro le 5000 lines wc 120 word count 20 C source les 100 - 5000 lines 1 C source le 4000 lines yacc 2303 parser generator 10 grammars for C, pascal, pic, eqn C grammar Table 1: The b enchmarks. Table 1 shows the characteristics of the b enchmark programs to b e used in our compile time and output co de p erformance exp eriments. The Size column indicates the sizes of the b enchmark programs measured in numb ers of lines of C co de excluding comments. The remaining columns describ e the b enchmarks, the input les used for pro ling, and the input le used for p erformance comparison. In a few cases, such as lex and yacc, one of the pro le inputs was used as the test input due to an insucientnumb er of realistic test cases. Most of the b enchmark programs are large and complex enough that it would b e virtually imp ossible to conduct exp erimentation using these programs without a working compiler. 4.3 Base Co de Calibration All the sup erblo ck ILP optimization and scheduling results will b e rep orted as sp eedup over the co de gen- erated by a base compilation. For the sp eedup measures to b e meaningful, it is imp ortant to show that the base compilation generates ecient co de. This is done by comparing the execution time of our base compilation output against that pro duced bytwo high quality pro duction compilers. In Table 2, we compare TM TM the output co de execution time against that of the MIPS C compiler release 2.1, -O4 and the GNU C To App ear: Journal of Sup ercomputing, 1993 14 TM compiler release 1.37.1, -O, on a DECstation3100 which uses a MIPS R2000 pro cessor. The M I P S:O 4 and GN U:O columns show the normalized execution time for co de generated by the MIPS and GNU C compilers with resp ect to the IMPACT base compilation output. The results show that our base compiler p erforms slightly b etter than the two pro duction compilers for all b enchmark programs. Therefore, all the sp eedup numb ers rep orted in the subsequent sections are based on an ecient base co de. Benchmark IMPACT.O5 MIPS.O4 GNU.O cccp 1.00 1.08 1.09 cmp 1.00 1.04 1.05 compress 1.00 1.02 1.06 eqn 1.00 1.09 1.10 eqntott 1.00 1.04 1.33 espresso 1.00 1.02 1.15 grep 1.00 1.03 1.23 lex 1.00 1.01 1.04 li 1.00 1.14 1.32 tbl 1.00 1.02 1.08 wc 1.00 1.04 1.15 yacc 1.00 1.00 1.11 Table 2: Execution times of b enchmarks on DECstation3100 4.4 Compile Time Cost Due to the prototyp e nature of IMPACT-I, very little e ort has b een sp ent minimizing compile time. During the development of the sup erblo ck ILP optimizer, compile time was given much less concern than correctness, output co de p erformance, clear co ding style, and ease of software maintenance. Therefore, the compile time results presented in this section do not necessarily represent the cost of future implementations of sup erblo ck ILP optimizations and scheduling. Rather, the compile time data is presented to provide some initial insight into the compile time cost of sup erblo ck techniques. Figure 6 shows the p ercentage increase in compile time due to each level of compiler sophistication b eyond TM base compilation on a SPARCstation-I I workstation. To show the cost of program pro ling, we separated the compile time cost of deriving the execution pro le. The pro ling cost rep orted for each b enchmark in Figure 6 is the cost to derive the execution pro le for a typical input le. To acquire stable pro le information, one should pro le each program with multiple input les. The sup erblo ck formation part of the compile time re ects the cost of trace selection, tail duplication, and increased scheduling cost going from basic blo ckscheduling to sup erblo ckscheduling. For our set of b enchmarks, the overhead of this part is b etween 2 and 23 of the base compilation time. The sup erblo ck ILP optimization part of the compile time accounts for the cost of branch expansion, lo op To App ear: Journal of Sup ercomputing, 1993 15 7.12 5 Superblock Optimization AA 4 AA Superblock Formation AA AA AA Profile AA 3 AA Base AA AAAA A AAAA 2 AAAA A AAAA AAAA A A A AAAA AAAAAAAAA AAAAAAAAA AAAA A AAAAA A A AAAA A AAAA AAAA AAAA AAAA AAA A AAAAA AAAAAAAAAAAAAA AAAAAAAAA AAAAA AAAA AAAA A AAAAAAAAAA AAAAAAAAAA AAAA AAAA AAAAA AAAA AAAA Normalized Compile Time AAAAA A AAAAA A A A A AAAA A A AAAAA A AAAAA AAAAA AAAAAAAAA AAAA AAAA AAA AAAA AAAA AAA AAAAA AAA AAAAA A A A A A A A A A AAAAA AAAA AAAAA AAAA AAAA AAAAA AAAA AAAA AAAA AAAAA AAAA AAAAA A A A 1 A A A A A A A A A A AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA A A A A A A A A A AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA A A A A A A A A A A A A AAAA AAAA AAAA AAAA AAAA AAAA AAA AAAA AAA AAAA AAA AAAA A A A A A A A A A AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA A A A A A A A A A AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA A A A A A A A A A AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA A A A A A A A A A AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA 0 AAAA i wc yacc l t l espresso g cccp cmp compress eqn e b e l q r x n e t p o t t Figure 6: Compile time cost of sup erblo ck ILP optimizations. unrolling, lo op p eeling, and dep endence removing optimizations, as well as the further increase in scheduling cost due to enlarged sup erblo cks. Note that the overhead varies substantially across b enchmarks. Although the average overhead is ab out 101 of the base compilation time, the overhead is as high as 522 for cmp. After examining the compilation pro cess in detail, we found out that sup erblo ck enlargement created 2 some huge sup erblo cks for cmp. Because there are several O N algorithms used in ILP optimization and co de scheduling, where N is the numb er of instructions in a sup erblo ck, the compile time cost increased dramatically due to these huge sup erblo cks. This problem can b e solved by the combination of sup erblo ck size control and more ecient optimization and scheduling algorithms to decrease the worst-case compile time overhead. 4.5 Performance of Sup erblo ck ILP Optimization and Scheduling The compile time and base co de calibration results presented in this pap er have b een based on real machine execution time. From this p oint on, we will rep ort the p erformance of b enchmarks based on simulation of a wide variety of sup erscalar pro cessor organizations. All results will b e rep orted as sp eedup over a 4 scalar pro cessor executing the base compilation output co de. The scalar pro cessor is based on the MIPS R2000 instruction set with extensions for enhanced branch capabilities [Kane 1987]. These include squashing branches, additional branch op co des suchasBLT and BGT, and the ability to execute multiple branches p er cycle [Hwu and Chang 1993]. The underlying microarchitecture stalls for ow dep endent instructions, To App ear: Journal of Sup ercomputing, 1993 16 AAA AA 7 A AAA A AA AAA AAA AAA AAA AAA AA AAA Superblock Optimization A A A A AA 6 A AAA AAA AAA AAA AA A AA AAA AAA Superblock Formation A AAA AAA AAA AAA A A A AA AA AA AAA AA AAA AAA 5 Traditional Optimizations A AAA AAA AAA AAA AAAAAA A A A A A AA AA AA AAAA AAA AAA AAA AAAAA AAAAAA AAAAAA AAA AAA AAAAA AAAA AAAAAA AAA AAA AAAAA 4 A A A A A AAAA A A A AAA AAAA AA AA AA AA AAAA AAAAA AAA AAAAAA AAAAA AAA AAAAA AAA Speedup AAAAA AAA AAA AAAAAA AAAAA AAA AAAAAA AAA A A A AAAA A A A A A A A A AAAA AA AA AA AA AAAA AA AAAA AA AAAAA AAAAAA AAAAAA AAAAAA AAAAA AAAAAA AAAAAA AAAAAA AAAAAAAA AAAAAA AAAAAA AAAAAA AAAAA AAAAAA AAAAA AAAAAA 3 AAA AAAAAAAA AAAAA AAAAAA AAAAAAAAA AAAAA AAAAAA AAAAAAAA AAAAAA A A A AAAA A A AAA A A A A A AAA A A A A A A A AA AAAA AAAAA AA AA AAAA AA AAAA AA AA AA AAAA AA AA AAA AAAAAAAA AAAAA AAAAAAAAA AAAAAA AAAAAA AAAAAAAA AAAAA AAAAAA AAAAAA AAAAAAAA AAAAAA AAAAAA AAAAAAAA AAAAA AAAAAAAAAAA AAAAA AAAAA AAAAAAAA AAAAAAAA AAAAA AAAAAA AAAAAAAA AAAAAA A A A AAAAAA AAAA A A AAAA A AAA A A AAAA A A A A A AAAAAA AAAA A A A A AAAA A AA AA AAAAAAAA AAA AAAAAAAAAA AAAAA AAAA AAAA AA AA AAAA AAAAAA AA AA AA AAAA AA AA 2 A AAAAAA AAAAAAAA AAAAAAA AAAAAAAAA AAAAAA AAAAAAAAAAA AAAAAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAAA AAAAAAAA AAAAAA AAA AAAAAAAA AAAAAAAA AAAAAAAAAAAA AAAAAAAA AAAAAAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAAAAAAAA AAAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA AAAAAA AAA AAAAAAAA AAAAAAAAAAA AAAAAAA AAAAAA AAA AAAAAAAA AAAAAAAAAAAAAAA AAAA A AAAAAAAAA A A A AAAAAAAAA AAAA AAAAAA AAAA A AAAA AAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAA AA AAAAAAAAA AAAAAA AAAAAAAAA AAAAAAAA AAAAAAA AAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 1 AAA 248 248 248 248 248 248 248 248 248 248 248 248 cccp cmp compress eqn eqntott expresso greplexlitbl wc yacc Figure 7: Performance improvement due to sup erblo ck ILP optimization. The sp eedup numb ers are relative to the scalar pro cessor with base level compilation. resolves output dep endences with register renaming, and resolves anti-dep endences by fetching op erands at an early stage of the pip eline. The instruction latencies assumed in the simulations are: 1 cycle for integer add, subtract, comparison, shift, and logic op erations, 2 cycles for load in cache, 1 cycle for store, 2 cycles for branches, 3 cycles for integer multiply, 10 cycles for integer divide, 3 cycles for oating p oint add, subtract, multiply, comparison, and conversion, and 10 cycles for oating p oint divide. The load and store latency for data cache misses will b e discussed in Section 4.8. To derive the execution cycles for a particular level of compiler optimization and a pro cessor con guration, a simulation is p erformed for each b enchmark. Although the base scalar pro cessor is always simulated with the output co de from the base compilation, the sup erscalar pro cessors are simulated with varying levels of optimization. Exp erimental results presented in this section assume ideal cache for b oth scalar and sup erscalar pro cessors. This allows us to fo cus on the e ectiveness of the compiler to utilize the pro cessor resources. The e ect of cache misses will b e addressed in Sections 4.7 and 4.8. In Figure 7, the p erformance improvement due to each additional level of compiler sophistication is shown for sup erscalar pro cessors with varying instruction issue and execution resources. An issue K pro cessor has the capacity to issue K instructions p er cycle. In this exp eriment, the pro cessors are assumed to have uniform function units, thus there are no restrictions on the p ermissible combinations of instructions among the K issued in the same cycle. As shown in Figure 7, b oth sup erblo ck formation and sup erblo ck ILP optimization signi cantly increase the p erformance of sup erscalar pro cessors. In fact, without these techniques, the sup erscalar pro cessors To App ear: Journal of Sup ercomputing, 1993 17 AAA AA 7 A A AA AAA AAA AAA AAA AAA AA A A A AA 6 General Percolation A AA AAA AAA AAA AAA AAA AAA AA AAA AAA AAA A A A AA A AA AA Restricted Per colation A AA AAA AAA 5 A AAA AAA AAA AAAAAA A A A A A AA AA AA AA AA AAA AAA AAA AAAAAA AAAAAA AAA AAA AAA AAAAAA AAAAA AAAAAA AAA AAA AAAAAA A A A A AAAA A AAAA AAA AAAA AAAA AA AA AAA A 4 A AAAAA AAA AAAAA AAAAAA AAA AAAAAA AAA Speedup AAAAA AAA AAA AAAAAA AAAAAA AAA AAAAAA AAA A A A AAAA A A A A A A A A AAAA AA AA AAAA AA AA AA AA AA AA AAAAA AAAAAA AAAAA AAAAA AAAAAA AAAAAA AAAAAA AAAAAA AAAAAAAA AAAAAA AAAAAA AAAAAA AAAAAA AAAAA AAAAAA AAAAA 3 AAA AAAAAAA AAAAAA AAAAAA AAAAAAAA AAAAAA AAAAA AAAAAAAAA AAAAA A A A A A A A AAA A A A AAAAAA AAA A A AAAA A A A AAAAAA AA AA AAAA AA AAAA AAAAAA AAAA AAAAA AA AAAA AAA AAAAAA AAA AAAAAAAAA AAAAAAAA AAA AAA AAAAAAAA AAAAAA AAAAAA AAAAA AAAAAAAA AAAAA AAA AAAAAAA AAAAAAAAA AAAAAAAA AAAAAA AAAAAA AAAAAAAA AAAAAAAAA AAAAA AAAAA AAAAAAAA AAAAAA AAAA AAAA A AAAA A A A A AAAA AAAAAA A A A A A A A A AAAA A A A A AAAA A AA AAA AAA AA AA AA AAAA AAAAA AAAAAA AA AAAA AAAA AA AAAA AAAA AAAA AA AAAA 2 A AAAAAA AAAAAAA AAAAAAAAA AAAAAAAA AAAAA AAAAAAAAA AAAAAAAA AAAAAAAAA AAAAAAAAA AAAAAAA AAAAAAAA AAAAAAAA AAAAA AAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAA AAAAAAAA AAAAAAAA AAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAA AAAAAAAA AAAAAAAA A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A AA AAAA AAAAAA AAAA AA AA AAAA AAAAAA AAAA AA AA AAAA AAAA AA AA AAAA AAAAAA AAAA AA AA AAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAA AAA AAA AAA AAA AAA AAA AAA AAA AAA 1 AAA 248 248 248 248 248 248 248 248 248 248 248 248 cccp cmp compress eqn eqntott expresso greplexlitbl wc yacc Figure 8: E ect of sp eculative supp ort on sup erblo ck ILP optimization results. achieve little sp eedup over the base scalar pro cessor. Note that for cmp, grep, and wc, a 4-issue pro cessor achieves more than four times sp eedup over the base pro cessor. This sp eedup is sup erlinear b ecause the 4-issue pro cessor executes co de with sup erblo ck optimization whereas the base pro cessor only executes tra- ditionally optimized co de. For the 4-issue pro cessor, the cumulative p erformance improvement due to the sup erblo ck techniques range from 53 to 293 over the base compilation. This data clearly demonstrates the e ectiveness of the sup erblo ck techniques. 4.6 E ect of Sp eculative Execution The general co de p ercolation mo del is p erhaps the most imp ortant architectural supp ort for sup erblo ck ILP optimization and scheduling. The ability to ignore exceptions for sp eculative instructions allows the sup erblo ckscheduler to fully exploit the increased parallelism due to sup erblo ck ILP optimizations. This advantage is quanti ed in Figure 8. The general co de p ercolation mo del allows the compiler to exploit from 13 to 143 more instruction level parallelism for issue 8. Without hardware supp ort, the scheduler can take advantage of some of the parallelism exp osed by sup erblo ck optimization. However, using sp eculative instructions in the general co de p ercolation mo del, the scheduler is able to improve the p erformance of the 8-issue pro cessor to b etween 2.36 and 7.12 times sp eedup. Furthermore, as the pro cessor issue rate increases, the imp ortance of general co de p ercolation increases. For most b enchmarks with restricted p ercolation, little sp eedup is obtained from a 4-issue pro cessor to a 8-issue pro cessor. However, when general p ercolation is used, substantial improvements are observed from a 4-issue pro cessor to a 8-issue pro cessor. These results con rm our qualitative analysis in Section 3.4. To App ear: Journal of Sup ercomputing, 1993 18 5 A AAAA AAA A AAA AAA A Superblock Optimizations A AAAA A AAA AAAA A AAA AA A 4 A Superblock Formation A AAAA AAA A AAAA A AAA AAA A Traditional Optimizations A AAAA AAA A AAAA A A AAAA AAAA A A AAA AAAA 3 A A A AAAA AAAA A A AAAA AAAA A A A AAAA AAAA AAAA A A A AAAA AAAA AAAA AAAA A A A AAAA AAAA AAAA AAAA A A A A AAAA AAA AAAA AAAA A A AAA AAAA AAAAA AAAA 2 A A A A A AAAA AAAA AAAA AAAA AAAAA AAAA A A A A AAAA AAAA AAAA AAAA AAAAA AAAA A AAAA A A A A A AAAA AAAA AAA AAAA AAAA AAAA A A AAAAA AAAA AAAAA AAAA AAAAA AAAA AAAA AAAA AAAAA A A A A AAAA AAAAA AAAA AAAAA AAAA AAAAA AAAA AAAA AAAA AAAAA Output Code Expansion A A A AAAAA AAAAA AAAA AAAAA A AAAAA AAAA AAAAAA A AAAAA AAA AAA AAA AAAAA AAA AAAAA AAAA AAAA AAA AAAAA A A A A A A A AAAA AAAAA AAAA AAAAAAAAAA AAAA AAAAA AAAA AAAAA AAAA AAAAAAAAAA AAAAA A A A A A A A AAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA 1 A A A A A A A A AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA A A A A A A A A A A A A AAA AAAA AAA AAAA AAA AAAA AAA AAAA AAA AAAA AAAA AAAA A A A A A A A AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA A A A A A A A AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA AAAA A A A A A A A A A A A A AAA AAAA AAA AAAA AAA AAAA AAA AAAA AAA AAAA AAAA AAAA AAAA AAAAA AAAA AAAAA AAAA AAAAA AAAA AAAAA AAAA AAAAAAAAAA AAAAA AAAAA AAAA AAAAA AAAA AAAAA AAAA AAAAA AAAA AAAAAAAAAA AAAAA 0 AAAA i wc yacc l t l espresso g cccp cmp compress eqn e b e l q r x n e t p o t t . Figure 9: Output co de size expansion due to sup erblo ck ILP techniques. 4.7 Instruction Cache E ects The expansion of co de from sup erblo ck formation and sup erblo ck optimizations will have an e ect on in- struction cache p erformance. Most sup erblo ck ILP optimizations rely on co de duplication to enlarge the scheduling scop e. Some optimizations, such as accumulator expansion, add new instructions to maintain the correctness of execution after optimization. As shown in Figure 9, the sup erblo ck ILP optimizations signif- icantly increase the co de size. The co de sizes of eqn and cmp are increased by 23 and 355 resp ectively. Such co de expansion can p otentially degrade instruction cache p erformance. In addition, each stall cycle due to cache misses has a greater impact on the p erformance of sup erscalar pro cessors than that of scalar pro cessors. Therefore, it is imp ortanttoevaluate the impact of instruction cache misses on the p erformance of sup erblo ck ILP optimizations. Figure 10 shows the sp eedup of an 8-issue pro cessor over the base scalar pro cessor when taking instruc- tion cache miss p enaltyinto account. The four bars asso ciated with each b enchmark corresp ond to four combinations of two optimization levels and two cache re ll latencies. The two cumulative optimization levels are sup erblo ck formation A,C and sup erblo ck ILP optimization B,D. The two cache re ll latencies are 16 and 32 clo ck cycles. Each bar in Figure 10 has four sections shown the relative p erformance of four cache sizes: 1K, 4K, 16K, and ideal. The caches are direct mapp ed with 32-byte blo cks. Each instruction cache miss is assumed to cause the pro cessors to stall for the cache re ll latency minus the overlap cycles due to a load forwarding mechanism [Chen et al. 1991]. Since instruction cache misses a ect b oth the base scalar pro cessor p erformance and the sup erscalar pro cessors p erformance, sp eedup is calculated by taking To App ear: Journal of Sup ercomputing, 1993 19 AA 7 AA AA AA AA AA AA AA AA AA AA Ideal cache AAA AA AA AA AAA AA AA AA AA AA 6 A AA AA AA AA A AAA AAA A 16k instruction cache A AA AA AA AA AAA AAA AAA AA AA AA AA AAA AAA AAA AA AA AA AA AAA AAA AAA 4k instruction cache AAA AA AA AA A A A AA A AA AA 5 A AA AA AA AA AA AAA AAA AAA AA AA AA AA AA AA AA AA AAA AAA 1k instruction cache A AA AA AA A AA A A AA AA AA AA AA AA AA AA AA AA AAA AA AAA AAA AA AA AA AA AA AAA AAA AAA AA AA AA AA AA AA AA AAA AAA AAA AA 4 A AA AA AA AA AA AAA AAA A AAA A A AAA AAA AA AAA AA AA Speedup AA AAAA AA AA AA AA AAA AAA AAA AAA AAA AAA AAA AAA AAA AAA AA AAAA AA AA AA AA AAA AAA AAA AAA AAA AAA AAA AAA AAA AAA AAA AA AAAA AA AA AA AA AAA AAA AAA AAA AAA AAA AAA AAA AAA AAA AAA AA AAAA AA AA A A AA A A A A A AAA A AA A A AA AA AA A AA AA AA AAA AA AA AA 3 A AA AAAA AA AA AAAAAAAAAAAA AA AAA AAAAAAAAA AAA AAA AAA AAA AAA AA AAA AAA AA AAAA AA AA AAAAAAAAAAAA AA AAA AAA AAA AAAAAAAAA AAA AAA AAA AAA AAA AA AAA AAA AA AA AAAAAA AAAA A A AA A A A A A A AA A A AAA A A A A AA AAAAAAAAA AA AA AA AA AAAAAA AA AA AAA AA AA AA AA AA AA AA AAAAAA AAA AAA AAAAAAAAA AA AAA AAA AAA AA AAAAAAAAA AAA AAA AAA AAA AAA AAA AAA AAA AA AA AA AAAAAA AAA AAA AAAAAAAA AAA AA AAA AAA AAA AA AAA AAAAAAAAA AAA AAA AAA AAA AAA AAA AAA AAA AAAAAA AA AAAAAA AA AAAA AAA AAA AAAAAAAAA AAAA AAA AAAAAAAAAAAA AA AAA AAAAAAA AAA AAA AAAAAA AAA AAA AAA AAA AAA 2 AAA AAAAAA AA AA AA AAAAAA AAA A A A AAAA A AA AAA AAAA A A AA AAAAAA A A A A A AAAAAAAAAA A AAA A A A AAA AA AA AAAAAAAAA AAAA AA AAAAAAAAA AA AAA AAAAAA AA AA AAAAAA AA AAA AA AA AA AAAAAA AAAA AAAA AA AAAAAA AAAAAAAAAAAA AAA AAAAAAAAA AA AAAAA AAAAAAAAAAA AA AAAAAA AAAAAAA AAAAAAAAAAA AAAAAA AAAA AAA AAA AAA AAAAAAAAAAAA AAAAAA AAAAAA AAAA AA AA AAAAAA AAAAAAAAAAAA AAA AAAAAAAAAA AAAAA AAAAAAAAA AAAAAA AAAAAAA AAAAAAAAAAA AAAAAAAAAA AAAAAAAAAAAA AAA AAAAAAAAAAA A A A A A AAAAAA A A A A AAAAAAA A A A A AAAAA A A A A AA AAA A A A A AAAA A A AAAA A AAAAAA A A A A AAAAAAAA AA AAAAAAAA AAAA AAAAAAAA AAAA AAAAAA AAAAAA AAAAAAAA AAAAAAAAA AA AAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 1 AAA ABCD ABCD ABCD ABCD ABCD ABCD ABCD ABCD ABCD ABCD ABCD ABCD cccp cmp compress eqn eqntott expresso greplexlitbl wc yacc Figure 10: Instruction cache e ect on sup erblo ck ILP techniques where, A and C represent sup erblo ck formation, B and D sup erblo ck optimization, A and B have a cache re ll latency of 16 cycles, C and D have a cache re ll latency of 32 cycles. instruction cache misses into account for b oth p erformance measurements. As shown in Figure 10, for larger caches, sup erblo ck ILP optimizations increase p erformance despite the e ect of cache misses. Even for 1K caches, sup erblo ck ILP optimizations increase p erformance for all but compress, grep, and wc. The p erformance approaches that of ideal cache when the instruction cache is 16K bytes or larger for b oth 16 cycle and 32 cycle cache re ll latencies. Since most mo dern high p erformance computers have more than 64K bytes of instruction cache, the p erformance advantage of sup erblo ck ILP optimizations is exp ected to b e relatively una ected by instruction misses in future high p erformance computer systems. 4.8 Data Cache E ects Because sup erblo ck optimizations do not a ect the numb er of data memory accesses, the numb er of extra cycles due to data cache misses remains relatively constant across the optimization levels. However, since the sup erblo ck optimizations reduce the numb er of execution cycles, the overhead due to data cache misses increases. Figure 11 shows the e ect of four cache con gurations on the p erformance of an 8-issue pro cessor. The data cache organizations have the same blo ck size and re ll latencies as those used in the instruction cache exp eriments, but the cache sizes are 4K, 16K, 64K, and ideal. Note that data cache misses have more in uence on the p erformance results than instruction cache misses. This is particularly true for the compress, eqntott, and lex b enchmarks where there is a noticeable di erence b etween the sp eedups for the 64K cache To App ear: Journal of Sup ercomputing, 1993 20 AA 7 AA AA AA AA AA AA AA AA AA AA AA AAA Ideal cache AA AA AAA AA AA AA AA AA AA AA AA 6 A AA AA AA AA A A AA 64k data cache A AA AA AA AA AAA AA AA AA AA AAA AAA AA AA AA AA AAA AAA 16k data cache AAA AA A AA AAA AA AA A AAA 5 A AA AA AA AAA AA AAA AAA AA AA AA AA AA AA AA AA AAA AAA 4k data cache A AA AA AA AA AA A A AAA AA AA AA AA AA AAA AA AA AA AA AA AAA AAA AAA AA AA AA AA AA AA AAA AAA AAA AAA AA AA AA AA AA AA AAA AAA AAA AAA 4 A AA AA AA AA AA A A A A AAA AAA AA AA AA AA AAA Speedup AA AA AA AA AA AA AAA AAA AAA AAA AAAAAA AAA AAA AAA AA AA AA AA AA AA AAA AAA AAA AAA AAA AAA AAA AA AAAAAA AAA AA AA AA AA AA AA AAA AAA AAA AAA AAA AAA AAA AA AAA AAA AA AAAAAA AA AA AA A A A AAA A AAA A A A AA A AA AAA AA AAA AA AA AA AA 3 A AA AAAAAA AA AA AA AA AAA AAAAAA AAA AAA AAA AAA AAA AAA AAA AA AAAAAA AA AA AA AAA AAAAAAAAA AAA AAA AAA AAA AAA AAA AAA AA AAAAAA AA AA AA AAA A A AAA A A AAA A A A AAAA A AAA AAAAAAA AA AA AA AA AA AA AA AA AA AAAAAA AA AA AA AAA AAA AA AAA AAAAAAA AAA AAA AAA AA AAA AAA AAA AAAAAA AAA AA AA AAAAAA AA AAA AA AA AAAAA AAA AA AAA AAAAAAA AAA AAA AAA AA AAA AAA AAA AAA AAA AA AA AA AAAAAA AAAA AA AAA AAA AAAAAAA AAA AA AAAAA AAA AAA AA AAA AAAAAAA AAA AAA AAAAAAAAAA AAA AAA AAA AAA AAA 2 AAA AA AAAA AA AA AAAAAA A AAA A AAAA AAAAAA AAA AAAA AAA AA AAA A A A A A AAAA AAAA A A A A A AA AAA AAAA AAA AA AAAA AAAAA AAA AAA AAAAAA AA AA AAAAAAA AA AA AA AA AA AA AAAAAA AAAA AA AAAA AAAAAA AAAAAA AAAAAA AAA AAAA AAAAAA AAA AAAAA AAAAAAAAAAAA AA AAAAAA AAAAAAA AAAAAAAAAAA AAAAAAAAA AAA AAA AAA AAAAAAAAAAAA AA AA AAAAAA AAAA AA AA AAAA AAAAAA AAAAAAAAAAAA AAA AAAA AAAAAAAAAAA AAA AAAAAAAAA AAAAAA AAAAAAA AAAAAAAAAAA AAAAAAA AAAAAAAAAAAA AAA AAAAAAAAAAAA A A A A A AAAAAA A A A A AAAAAAA A A A A AAAAA A A A A AA AAA A A A A AAAA A A A A A AAAAAA A A AAAA AAAAAAAA AA AA AAAAAAAAAAA AA AAAAAAAA AAAA AAAAAA AAAAAA AAAA AAAAAAAA AA AAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 1 AAA ABCD ABCD ABCD ABCD ABCD ABCD ABCD ABCD ABCD ABCD ABCD ABCD cccp cmp compress eqn eqntott expresso greplexlitbl wc yacc Figure 11: Data cache e ect on sup erblo ck ILP techniques. where, A and C represent sup erblo ck formation, B and D sup erblo ck optimization, A and B have a cache re ll latency of 16 cycles, C and D have a cache re ll latency of 32 cycles. and the ideal cache. The p o or cache p erformance in the case of the compress b enchmark, can b e attributed to large internal data structures. The compress b enchmark has two large tables, each larger than 64K bytes when large input les are used. The e ect of the data cache on the p erformance of sup erblo ck optimizations illustrates the need to include data prefetching and other load latency hiding techniques in the compiler. 5 Conclusion Control intensive programs challenge instruction level parallel pro cessing compilers with excess constraints from many p ossible execution paths. In order to compile these programs e ectively,wehave designed the sup erblo ck ILP optimizations and sup erblo ckscheduling to systematically remove constraints due to unimp ortant execution paths. The IMPACT-I prototyp e proves that it is feasible to implement sup erblo ck ILP optimization and sup erblo ckscheduling in a real compiler. The development e ort dedicated to the prototyp e implementation is ab out 10 p erson-years in an academic environment. The implementation of the sup erblo ck techniques accounts for approximately 14 of the compiler source co de. Sup erblo ck techniques add an average overhead of 101 to the base compilation time. Wewould like to emphasize that the prototyp e is not tuned for fast compilation. The results here do not necessarily represent the compile time cost of commercial implementations. Rather, these numb ers are rep orted to prove that the compile time overhead is acceptable in a prototypical implementation. Using simulation, we demonstrate that sup erscalar pro cessors achievemuch higher p erformance with su- To App ear: Journal of Sup ercomputing, 1993 21 p erblo ck ILP optimization and sup erblo ckscheduling. For example, the improvement for an 4-issue pro cessor ranges from 53 to 293 across the b enchmark programs. Three architecture factors strongly in uence the p erformance of sup erscalar and VLIW pro cessors: sp ec- ulative execution supp ort, instruction cache misses, and data cache misses. Wehave shown that the general co de p ercolation mo del allows the compiler to exploit from 13 to 143 more instruction level parallelism. Considering the mo derate cost of sp eculative execution hardware, we exp ect that many future sup erscalar and VLIW systems will provide such supp ort. Although the instruction cache misses can p otentially cause large p erformance degradation, we found that the b enchmark p erformance results remain una ected for instruction caches of reasonable size. Since most workstations have more than 64K bytes of instruction cache, we do not exp ect the instruction misses to reduce the p erformance advantage of sup erblo ck ILP optimizations. Similar conclusions can b e drawn for the data cache. However, several b enchmarks require more advanced data prefetching techniques to comp ensate for the e ect of high cache miss rates. In conclusion, the IMPACT-I prototyp e proves that sup erblo ck ILP optimization and scheduling are not only feasible but also cost e ective. It also demonstrates that substantial sp eedup can b e achieved by sup erscalar and VLIW pro cessors over the current generation of high p erformance RISC scalar pro cessors. It provides one imp ortant set of data p oints to supp ort instruction level parallel pro cessing as an imp ortant technology for the next generation of high p erformance pro cessors. Acknowledgments The authors would liketoacknowledge all the memb ers of the IMPACT research group for their supp ort. This research has b een supp orted by the National Science Foundation NSF under Grant MIP-8809478, Dr. Lee Ho evel at NCR, Hewlett-Packard, the AMD 29K Advanced Pro cessor Development Division, Matsushita Electric Corp oration, Joint Services Engineering Programs JSEP under Contract N00014-90-J-1270, and the National Aeronautics and Space Administration NASA under Contract NASA NAG 1-613 in co op era- tion with the Illinois Computer Lab oratory for Aerospace Systems and Software ICLASS. Scott Mahlkeis supp orted byanIntel Fellowship. Grant Haab is supp orted byaFannie and John Hertz Foundation Grad- uate Fellowship. John Holm is supp orted byanAT&T Fellowship. Daniel Lavery is also supp orted by the Center for Sup ercomputing Research and Development at the University of Illinois at Urbana-Champaign To App ear: Journal of Sup ercomputing, 1993 22 under Grant DOE DE-FGO2-85ER25001 from the U.S. Department of Energy, and the IBM Corp oration. References Aho, A., Sethi, R., and Ullman, J. 1986. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA. Aiken, A. and Nicolau, A. 1988. A developmentenvironment for horizontal micro co de. IEEE Transactions on Software Engineering, 14, May, 584{594. Bernstein, D. and Ro deh, M. 1991. Global instruction scheduling for sup erscalar machines. In Proceedings of the ACM SIGPLAN 1991 ConferenceonProgramming Language Design and Implementation June, 241{255. Chaitin, G. J. 1982. Register allo cation and spilling via graph coloring. In Proceedings of the ACM SIGPLAN 82 Symposium on Compiler Construction June, 98{105. Chang, P.P. and Hwu, W. W. 1988. Trace selection for compiling large C application programs to micro co de. In Proceedings of the 21st International Workshop on Microprogramming and Microarchitecture Nov., 188{198. Chang, P.P., Mahlke, S. A., Chen, W. Y., Warter, N. J., and Hwu, W. W. 1991a. IMPACT: An architectural framework for multiple-instruction-issue pro cessors. In Proceedings of the 18th International Symposium on Computer Architecture May, 266{275. Chang, P.P., Mahlke, S. A., and Hwu, W. W. 1991b. Using pro le information to assist classic co de optimizations. SoftwarePractice and Experience, 21, 12 Dec.:1301{1321. Chen, W. Y., Chang, P.P., Conte, T. M., and Hwu, W. W. 1991. The e ect of co de expanding optimizations on instruction cache design. Technical Rep ort CRHC-91-17, Center for Reliable and High-Performance Computing, University of Illinois, Urbana, IL. Chow, F. C. and Hennessy, J. L. 1990. The priority-based coloring approach to register allo cation. ACM Transactions on Programming Languages and Systems, 12, Oct., 501{536. To App ear: Journal of Sup ercomputing, 1993 23 Colwell, R. P., Nix, R. P., O'Donnell, J. J., Papworth, D. B., and Ro dman, P. K. 1987. A VLIW architecture for a trace scheduling compiler. In Proceedings of the 2nd International ConferenceonArchitectural Support for Programming Languages and Operating Systems Apr., 180{192. Ellis, J. 1985. Bul ldog: A Compiler for VLIW Architectures. The MIT Press, Cambridge, MA. Fisher, J. A. 1981. Trace scheduling: A technique for global micro co de compaction. IEEE Transactions on Computers, c-30, July, 478{490. Gupta, R. and So a, M. L. 1990. Region scheduling: An approach for detecting and redistributing parallelism. IEEE Transactions on Software Engineering, 16, Apr., 421{431. Horst, R. W., Harris, R. L., and Jardine, R. L. 1990. Multiple instruction issue in the NonStop Cyclone pro cessor. In Proceedings of the 17th International Symposium on Computer Architecture May, 216{ 226. Hwu, W. W. and Chang, P.P. 1989a. Achieving high instruction cache p erformance with an optimizing compiler. In Proceedings of the 16th International Symposium on Computer Architecture May, 242{ 251. Hwu, W. W. and Chang, P.P. 1989b. Inline function expansion for compiling realistic C programs. In Pro- ceedings of the ACM SIGPLAN 1989 ConferenceonProgramming Language Design and Implementation June, 246{257. Hwu, W. W. and Chang, P.P. Accepted for publication. Ecient instruction sequencing with inline target insertion. IEEE Transactions on Computers. Intel 1989. i860 64-Bit Microprocessor. Santa Clara, CA. Jouppi, N. P. and Wall, D. W. 1989. Available instruction-level parallelism for sup erscalar and sup erpip elined machines. In Proceedings of the 3rd International ConferenceonArchitectural Support for Programming Languages and Operating Systems Apr., 272{282. Kane, G. 1987. MIPS R2000 RISC Architecture. Prentice-Hall, Inc., Englewo o d Cli s, NJ. Kuck, D. J. 1978. The Structure of Computers and Computations. John Wiley and Sons, New York, NY. To App ear: Journal of Sup ercomputing, 1993 24 Kuck, D. J., Kuhn, R. H., Padua, D. A., Leasure, B., and Wolfe, M. 1981. Dep endence graphs and compiler optimizations. In Proceedings of the 8th ACM Symposium on Principles of Programming Languages Jan., 207{218. Mahlke, S. A., Chen, W. Y., Hwu, W. W., Rau, B. R., and ansker, M. S. S. 1992. Sentinel scheduling for VLIW and sup erscalar pro cessors. In Proceedings of 5th International ConferenceonArchitectural Support for Programming Languages and Operating Systems Oct.. Nakatani, T. and Eb cioglu, K. 1989. Combining as a compilation technique for VLIW architectures. In Proceedings of the 22nd International Workshop on Microprogramming and Microarchitecture Sept., 43{55. Rau, B. R., Yen, D. W. L., Yen, W., and Towle, R. A. 1989. The Cydra 5 departmental sup ercomputer. IEEE Computer, Jan., 12{35. Schuette, M. A. and Shen, J. P. 1991. An instruction-level p erformance analysis of the Multi ow TRACE 14/300. In Proceedings of the 24th International Workshop on Microprogramming and Microarchitecture Nov., 2{11. Smith, M. D., Johnson, M., and Horowitz, M. A. 1989. Limits on multiple instruction issue. In Proceedings of the 3rd International ConferenceonArchitectural Support for Programming Languages and Operating Systems Apr., 290{302. Warren, Jr., H. S. 1990. Instruction scheduling for the IBM RISC system/6000 pro cessor. IBM Journal of Research and Development, 34, Jan., 85{92.