The Superblock: An Effective Technique
for VLIW and Superscalar Compilation

Wen-mei W. Hwu, Scott A. Mahlke, William Y. Chen, Pohua P. Chang,
Nancy J. Warter, Roger A. Bringmann, Roland G. Ouellette, Richard E. Hank,
Tokuzo Kiyohara, Grant E. Haab, John G. Holm, Daniel M. Lavery

To appear: The Journal of Supercomputing, 1993

Abstract

A compiler for VLIW and superscalar processors must expose sufficient instruction-level parallelism (ILP) to effectively utilize the parallel hardware. However, ILP within basic blocks is extremely limited for control-intensive programs. We have developed a set of techniques for exploiting ILP across basic block boundaries. These techniques are based on a novel structure called the superblock. The superblock enables the optimizer and scheduler to extract more ILP along the important execution paths by systematically removing constraints due to the unimportant paths. Superblock optimization and scheduling have been implemented in the IMPACT-I compiler. This implementation gives us a unique opportunity to fully understand the issues involved in incorporating these techniques into a real compiler. Superblock optimizations and scheduling are shown to be useful while taking into account a variety of architectural features.

Index terms - code scheduling, control-intensive programs, instruction-level parallel processing, optimizing compiler, profile information, speculative execution, superblock, superscalar processor, VLIW processor.



The authors are with the Center for Reliable and High-Performance Computing, University of Illinois, Urbana-Champaign, Illinois, 61801. Pohua Chang is now with Intel Corporation. Roland Ouellette is now with Digital Equipment Corporation. Tokuzo Kiyohara is with Matsushita Electric Industrial Co., Ltd., Japan. Daniel Lavery is also with the Center for Supercomputing Research and Development at the University of Illinois at Urbana-Champaign.


1 Introduction

VLIW and superscalar processors contain multiple data paths and functional units, making them capable of issuing multiple instructions per clock cycle [Colwell et al. 1987; Fisher 1981; Horst et al. 1990; Intel 1989; Rau et al. 1989; Warren 1990]. As a result, the peak performance of the coming generation of VLIW and superscalar processors will range from two to eight times greater than their scalar predecessors which execute at most one instruction per clock cycle. However, previous studies have shown that using conventional code optimization and scheduling methods, superscalar and VLIW processors cannot produce a sustained speedup of more than two for control-intensive programs [Jouppi and Wall 1989; Schuette and Shen 1991; Smith et al. 1989]. For such programs, conventional compilation methods do not provide enough support to utilize these processors.

Traditionally, the primary objective of code optimization is to reduce the number and complexity of executed instructions. Few existing optimizing compilers attempt to increase the instruction-level parallelism, or ILP, of the code. [1] This is because most optimizing compilers have been designed for scalar processors, which benefit little from the increased ILP. Since VLIW and superscalar processors can take advantage of the increased ILP, it is important for the code optimizer to expose enough parallelism to fully utilize the parallel hardware.

[1] One can measure the ILP as the average number of simultaneously executable instructions per clock cycle. It is a function of the data and control dependences between instructions in the program as well as the instruction latencies of the processor hardware. It is independent of all other hardware constraints.

The amount of ILP within basic blocks is extremely limited for control-intensive programs. Therefore, the code optimizer must look beyond basic blocks to find sufficient ILP. We have developed a set of optimizations to increase ILP across basic block boundaries. These optimizations are based on a novel structure called the superblock. The formation and optimization of superblocks increases the ILP along the important execution paths by systematically removing constraints due to the unimportant paths. Because these optimizations increase the ILP within superblocks, they are collectively referred to as superblock ILP optimizations.

Unlike code optimization, code scheduling for VLIW processors has received extensive treatment in the literature [Aiken and Nicolau 1988; Bernstein and Rodeh 1991; Ellis 1985; Fisher 1981; Gupta and Soffa 1990]. In particular, the trace scheduling technique invented by Fisher has been shown to be very effective for rearranging instructions across basic blocks [Fisher 1981]. An important issue for trace scheduling is the compiler implementation complexity incurred by the need to maintain correct program execution after moving instructions across basic blocks. The code scheduling technique described in this paper, which is derived from trace scheduling, employs the superblock. Superblock ILP optimizations remove constraints, and the code scheduler implementation complexity is reduced. This code scheduling approach will be referred to as superblock scheduling.

In order to characterize the cost and effectiveness of the superblock ILP optimizations and superblock scheduling, we have implemented these techniques in the IMPACT-I compiler developed at the University of Illinois. The fundamental premise of this project is to provide a complete compiler implementation that allows us to quantify the impact of these techniques on the performance of VLIW and superscalar processors by compiling and executing large control-intensive programs. In addition, this compiler allows us to fully understand the issues involved in incorporating these optimization and scheduling techniques into a real compiler. Superblock optimizations are shown to be useful while taking into account a variety of architectural parameters.

Section 2 of this paper introduces the superblock. Section 3 gives a concise overview of the superblock ILP optimizations and superblock scheduling. Section 4 presents the cost and performance of these techniques. The concluding remarks are made in Section 5.

2 The Superblock

The purpose of code optimization and scheduling is to minimize the execution time while preserving the program semantics. When this is done globally, some optimization and scheduling decisions may decrease the execution time for one control path while increasing the time for another path. By making these decisions in favor of the more frequently executed path, an overall performance improvement can be achieved.

Trace scheduling is a technique that was developed to allow scheduling decisions to be made in this manner [Ellis 1985; Fisher 1981]. In trace scheduling the function is divided into a set of traces that represent the frequently executed paths. There may be conditional branches out of the middle of the trace (side exits) and transitions from other traces into the middle of the trace (side entrances). Instructions are scheduled within each trace ignoring these control-flow transitions. After scheduling, bookkeeping is required to ensure the correct execution of off-trace code.

Code motion past side exits can be handled in a fairly straightforward manner. If an instruction I is moved from above to below a side exit, and the destination of I is used before it is redefined when the side exit is taken, then a copy of I must also be placed between the side exit and its target. Movement of an instruction from below to above a branch can also be handled without too much difficulty. The method for doing this is described in Section 3.4.


[Figure 1 diagram omitted: two before-and-after instruction sequences (Instr 1 through Instr 5) showing the bookkeeping copies required when an instruction moves across a trace side entrance.]

Figure 1: Instruction scheduling across trace side entrances. (a) Moving an instruction below a side entrance. (b) Moving an instruction above a side entrance.


More complex bookkeeping must be done when code is moved above and below side entrances. Figure 1 illustrates this bookkeeping. In Figure 1(a), when Instr 1 is moved below the side entrance to after Instr 4, the side entrance is moved below Instr 1. Instr 3 and Instr 4 are then copied to the side entrance. Likewise, in Figure 1(b), when Instr 5 is moved above the side entrance, it must also be copied to the side entrance.

Side entrances can also make it more complex to apply optimizations to traces. For example, Figure 2 shows how copy propagation can be applied to the trace and the necessary bookkeeping for the off-trace code. In this example, in order to propagate the value of r1 from I1 to I3, bookkeeping must be performed. Before propagating the value of r1, the side entrance is moved to below I3 and instructions I2 and I3 are copied to the side entrance.

The bookkeeping associated with side entrances can be avoided if the side entrances are removed from the trace. A superblock is a trace which has no side entrances. Control may only enter from the top but may leave at one or more exit points. Superblocks are formed in two steps. First, traces are identified using execution profile information [Chang and Hwu 1988]. Second, a process called tail duplication is performed to eliminate any side entrances to the trace [Chang et al. 1991b]. A copy is made of the tail portion of the trace from the first side entrance to the end. All side entrances into the trace are then moved to the corresponding duplicate basic blocks. The basic blocks in a superblock need not be consecutive in the code. However, our implementation restructures the code so that all blocks in a superblock appear in consecutive order for better cache performance.
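To make the second step concrete, the C fragment below sketches tail duplication on a toy flow-graph representation. The Block type, the fixed-size arrays, and the helper names are simplifications invented for this sketch, not the IMPACT-I data structures; instruction lists and profile weights are omitted.

    #include <stdio.h>
    #include <stdlib.h>

    #define MAXB 64                      /* assumed bounds for the sketch */
    #define MAXP 8

    typedef struct Block {
        char name[8];
        struct Block *succ[2];           /* at most two control-flow successors */
        struct Block *pred[MAXP];        /* explicit predecessor list           */
        int npred;
    } Block;

    /* Turn trace[0..n-1] into a superblock: duplicate the tail starting
       at the first side entrance, and move every off-trace edge onto the
       duplicates so that control can only enter the trace at the top.   */
    void tail_duplicate(Block *trace[], int n)
    {
        Block *dup[MAXB];
        int first = -1;

        /* 1. Find the first block after the head with an off-trace predecessor. */
        for (int i = 1; i < n; i++)
            if (trace[i]->npred > 1) { first = i; break; }
        if (first < 0) return;           /* no side entrance: already a superblock */

        /* 2. Copy the tail portion and chain the copies together. */
        for (int i = first; i < n; i++) {
            dup[i] = malloc(sizeof(Block));
            *dup[i] = *trace[i];
            dup[i]->npred = 0;
            snprintf(dup[i]->name, sizeof dup[i]->name, "%s'", trace[i]->name);
        }
        for (int i = first; i < n - 1; i++)
            for (int k = 0; k < 2; k++)
                if (dup[i]->succ[k] == trace[i + 1]) {
                    dup[i]->succ[k] = dup[i + 1];
                    dup[i + 1]->pred[dup[i + 1]->npred++] = dup[i];
                }

        /* 3. Retarget each side entrance to the corresponding duplicate,
              keeping only the on-trace edge into each original block.    */
        for (int i = first; i < n; i++) {
            Block *b = trace[i], *keep = trace[i - 1];
            int m = 0;
            for (int j = 0; j < b->npred; j++) {
                Block *p = b->pred[j];
                if (p == keep) { b->pred[m++] = p; continue; }
                dup[i]->pred[dup[i]->npred++] = p;
                for (int k = 0; k < 2; k++)
                    if (p->succ[k] == b) p->succ[k] = dup[i];
            }
            b->npred = m;
        }
    }

Applied to the example of Figure 3 below, the trace is {A, B, E, F} and the first block with a side entrance is F, so only F is copied (as F') and the edges from C and D are redirected to F'.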


         ......                          ......
         r1 <- 64                        r1 <- 64
            .                               .
            .                               .
    I1:  r1 <- 4                 I1:  r1 <- 4
    I2:  r2 <- label(_A)         I2:  r2 <- label(_A)     I2': r2 <- label(_A)
    I3:  r3 <- mem(r2+r1)        I3:  r3 <- mem(r2+4)     I3': r3 <- mem(r2+r1)
         ......                          ......
           (a)                                   (b)

Figure 2: Applying copy propagation to an instruction trace. (a) Before copy propagation. (b) After copy propagation with bookkeeping code inserted.

The formation of superblocks is illustrated in Figure 3. Figure 3(a) shows a weighted flow graph which represents a loop code segment. The nodes correspond to basic blocks and the arcs correspond to possible control transfers. The count of each basic block indicates the execution frequency of that basic block. In Figure 3(a), the count of {A, B, C, D, E, F} is {100, 90, 10, 0, 90, 100}, respectively. The count of each control transfer indicates the frequency of invoking these control transfers. In Figure 3(a), the count of {A->B, A->C, B->D, B->E, C->F, D->F, E->F, F->A} is {90, 10, 0, 90, 10, 0, 90, 99}, respectively. Clearly, the most frequently executed path in this example is the basic block sequence <A, B, E, F>. There are three traces: {A, B, E, F}, {C}, and {D}. After trace selection, each trace is converted into a superblock.

In Figure 3(a), we see that there are two control paths that enter the {A, B, E, F} trace at basic block F. Therefore, we duplicate the tail part of the {A, B, E, F} trace starting at basic block F. Each duplicated basic block forms a new superblock that is appended to the end of the function. The result is shown in Figure 3(b). Note that there are no longer side entrances into the most frequently traversed trace, <A, B, E, F>; it has become a superblock.

Superblocks are similar to extended basic blocks. An extended basic block is defined as a sequence of basic blocks B1...Bk such that for 1 <= i < k, Bi is the only predecessor of B(i+1), and B1 does not have a unique predecessor [Aho et al. 1986]. The difference between superblocks and extended basic blocks lies mainly in how they are formed. Superblock formation is guided by profile information and side entrances are removed to increase the size of the superblocks. It is possible for the first basic block in a superblock to have a unique predecessor.


[Figure 3 flow graphs omitted: (a) the original weighted flow graph, with entry Y, basic blocks A-F, and exit Z; (b) the graph after tail duplication, in which block F has been duplicated as F' and the side entrances from C and D have been redirected to F'.]

Figure 3: An example of superblock formation.

3 Superblock ILP Optimization and Scheduling

Before superblock scheduling is performed, superblock ILP optimizations are applied which enlarge the superblock and increase instruction parallelism by removing dependences.

3.1 Superblock Enlarging Optimizations

The first category of superblock ILP optimizations is superblock enlarging optimizations. The purpose of these optimizations is to increase the size of the most frequently executed superblocks so that the superblock scheduler can manipulate a larger number of instructions. It is more likely the scheduler will find independent instructions to schedule at every cycle in a superblock when there are more instructions to choose from. An important feature of superblock enlarging optimizations is that only the most frequently executed parts of a program are enlarged. This selective enlarging strategy keeps the overall code expansion under control [Chang et al. 1991b]. Three superblock enlarging optimizations are described as follows.

Branch Target Expansion. Branch target expansion expands the target superblock of a likely taken control transfer which ends a superblock. The target superblock is copied and appended to the end of the original superblock. Note that branch target expansion is not applied for control transfers which are loop backedges. Branch target expansion continues to increase the size of a superblock until a predefined superblock size is reached, or the branch ending the superblock does not favor one direction.

Loop Peeling. Superblock loop peeling modifies a superblock loop (a superblock which ends with a likely control transfer to itself) which tends to iterate only a few times for each loop execution. The loop body is replaced by straight-line code consisting of the first several iterations of the loop. [2] The original loop body is moved to the end of the function to handle executions which require additional iterations. After loop peeling, the most frequently executed preceding and succeeding superblocks can be expanded into the peeled loop body to create a single large superblock.

[2] Using the profile information, the loop is peeled its expected number of iterations.

Loop Unrolling. Superblock loop unrolling replicates the body of a superblock loop which tends to iterate many times. To unroll a superblock loop N times, N-1 copies of the superblock are appended to the original superblock. The loop backedge control transfers in the first N-1 loop bodies are adjusted or removed if possible to account for the unrolling. If the iteration count is known on loop entry, it is possible to remove these transfers by using a preconditioning loop to execute the first (iteration count mod N) iterations. However, preconditioning loops are not currently inserted for unrolled loops.
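As a source-level illustration of the preconditioning idea (which, as just noted, IMPACT-I does not currently perform), the following C sketch unrolls a summation loop four times; the function and names are invented for this example. The preconditioning loop absorbs the first n mod 4 iterations, so the unrolled body always executes a multiple of 4 iterations and its internal backedge control transfers can be removed.

    enum { U = 4 };                       /* unroll factor */

    long sum_unrolled(const long *a, int n)
    {
        long s = 0;
        int i = 0;

        for (; i < n % U; i++)            /* preconditioning loop:      */
            s += a[i];                    /* first (n mod U) iterations */

        for (; i < n; i += U) {           /* unrolled body: backedge    */
            s += a[i];                    /* taken once per U original  */
            s += a[i + 1];                /* iterations                 */
            s += a[i + 2];
            s += a[i + 3];
        }
        return s;
    }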

3.2 Superblock Dependence Removing Optimizations

The second category of superblock ILP optimizations is superblock dependence removing optimizations. These optimizations eliminate data dependences between instructions within frequently executed superblocks, which increases the ILP available to the code scheduler. As a side effect, some of these optimizations increase the number of executed instructions. However, by applying these optimizations only to frequently executed superblocks, the code expansion incurred is regulated. Five superblock dependence removing optimizations are described as follows.

Register Renaming. Reuse of registers by the compiler and variables by the programmer introduces artificial anti and output dependences and restricts the effectiveness of the scheduler. Many of these artificial dependences can be eliminated with register renaming [Kuck et al. 1981]. Register renaming assigns unique registers to different definitions of the same register. A common use of register renaming is to rename registers within individual loop bodies of an unrolled superblock loop.
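The effect is easiest to see at the source level. In the contrived C fragment below (invented for illustration), two unrolled bodies initially funnel through a single temporary t, creating anti and output dependences between the bodies; renaming gives each body its own temporary, leaving only the true dependences within each body.

    void before_renaming(const int *a, int *b, int i)
    {
        int t;
        t = a[i];                /* definition 1 of t                         */
        b[i] = t + 1;
        t = a[i + 1];            /* definition 2: must wait for the use above */
        b[i + 1] = t + 1;
    }

    void after_renaming(const int *a, int *b, int i)
    {
        int t1 = a[i];           /* each definition gets a unique register    */
        int t2 = a[i + 1];       /* so the two loads can issue in parallel    */
        b[i] = t1 + 1;
        b[i + 1] = t2 + 1;
    }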

Operation Migration. Operation migration moves an instruction from a superblock where its result is not used to a less frequently executed superblock. By migrating an instruction, all of the data dependences associated with that instruction are eliminated from the original superblock. Operation migration is performed by detecting an instruction whose destination is not referenced in its home superblock. Based on a cost constraint, a copy of the instruction is placed at the target of each exit of the superblock in which the destination of the instruction is live. Finally, the original instruction is deleted.
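A source-level analogue, with an invented function and condition: a multiply whose result is live only at a rare exit is migrated from the hot superblock to the exit target, removing its dependences from the frequent path.

    int scan(const int *a, int n, int x, int y)
    {
        int t = 0, i;
        for (i = 0; i < n; i++) {         /* hot superblock               */
            t = x * y;                    /* t is not referenced here...  */
            if (a[i] < 0) goto rare_exit; /* ...t is live only on exit    */
        }
        return 0;
    rare_exit:
        return t + i;
    }

    int scan_migrated(const int *a, int n, int x, int y)
    {
        int i;
        for (i = 0; i < n; i++)           /* the multiply and its         */
            if (a[i] < 0) goto rare_exit; /* dependences are gone         */
        return 0;
    rare_exit:
        return x * y + i;                 /* copy placed at the exit target */
    }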

Induction Variable Expansion. Induction variables are used within loops to index through loop iterations and through regular data structures such as arrays. When data access is delayed due to dependences on induction variable computation, ILP is typically limited. Induction variable expansion eliminates redefinitions of induction variables within an unrolled superblock loop. Each definition of the induction variable is given a new induction variable, thereby eliminating all anti, output, and flow dependences among the induction variable definitions. However, an additional instruction is inserted into the loop preheader to initialize each newly created induction variable. Also, patch code is inserted if the induction variable is used outside the superblock to recover the proper value for the induction variable.
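A C rendering of what Figure 4(d) below does with r11, r21, and r31 (the pointer names here are invented): each unrolled body indexes the array through its own expanded induction variable, so the three loads no longer wait on a serial chain of increments.

    long sum_expanded(const long *a, int n)   /* n assumed a multiple of 3 */
    {
        const long *p1 = a;                   /* preheader: one induction  */
        const long *p2 = a + 1;               /* variable per unrolled body */
        const long *p3 = a + 2;
        long s = 0;

        for (int i = 0; i < n; i += 3) {
            s += *p1;                         /* independent addresses      */
            s += *p2;
            s += *p3;
            p1 += 3; p2 += 3; p3 += 3;        /* each variable advances by  */
        }                                     /* the full unrolled stride   */
        return s;
    }

If the original induction variable were live after the loop, patch code at each exit would recompute it from one of the expanded copies.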

Accumulator Variable Expansion. An accumulator variable accumulates a sum or product in each iteration of a loop. For loops of this type, the accumulation operation often defines the critical path within the loop. Similar to induction variable expansion, anti, output, and flow dependences between instructions which accumulate a total are eliminated by replacing each definition of the accumulator variable with a new accumulator variable. Unlike induction variable expansion, though, the increment or decrement value is not required to be constant within the superblock loop. Again, initialization instructions for these new accumulator variables must be inserted into the superblock preheader. Also, the new accumulator variables are summed at all superblock exit points to recover the value of the original accumulator variable. Note that accumulator variable expansion applied to floating point variables may not be safe for programs which rely on the order in which arithmetic operations are performed to maintain accuracy. For programs of this type, an option is provided for the user to disable accumulator variable expansion of floating point variables.
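A source-level sketch of the transformation (the function is invented): three partial accumulators break the serial add chain; they are initialized in the preheader and summed at the exit. For doubles this reassociates the additions, which is exactly the floating point caveat above.

    double dot3(const double *a, const double *b, int n) /* n a multiple of 3 */
    {
        double s1 = 0.0, s2 = 0.0, s3 = 0.0;  /* preheader initialization */

        for (int i = 0; i < n; i += 3) {
            s1 += a[i]     * b[i];            /* the three accumulations  */
            s2 += a[i + 1] * b[i + 1];        /* are independent of one   */
            s3 += a[i + 2] * b[i + 2];        /* another                  */
        }
        return s1 + s2 + s3;                  /* summed at the exit point */
    }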

Operation Combining. Flow dependences between pairs of instructions with the same precedence, each with a compile-time constant source operand, can be eliminated with operation combining [Nakatani and Ebcioglu 1989]. Flow dependences which can be eliminated with operation combining often arise between address calculation instructions and memory access instructions. Also, similar opportunities occur for loop variable increments and loop exit branches. The flow dependence is removed by substituting the expression of the first instruction into the second instruction and evaluating the constants at compile time.
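For instance (an invented C fragment), an address calculation feeding a load can be folded away:

    int load_field(const int *base, int r1)
    {
        int r2 = r1 + 4;             /* first instruction: constant operand 4 */
        return base[r2 + 8];         /* second instruction depends on r2      */
    }

    int load_field_combined(const int *base, int r1)
    {
        return base[r1 + 12];        /* constants 4 and 8 evaluated at        */
    }                                /* compile time; the dependence is gone  */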


(a) original program segment:

    L1:     if (A[i] == 1)
                C += B[i];
            i++;
            if (i < N) goto L1;

(b) assembly code after superblock formation:

    L1:     r2 = MEM(A + r1)
            bne (r2 1) L3
            r3 = MEM(B + r1)
            r4 = r4 + r3
            r1 = r1 + 4
            blt (r1 N') L1
    L2:     ...
    L3:     r1 = r1 + 4
            blt (r1 N') L1
            goto L2

(c) assembly code after superblock loop unrolling:

    L1:     r2 = MEM(A + r1)
            bne (r2 1) L3
            r3 = MEM(B + r1)
            r4 = r4 + r3
            r1 = r1 + 4
            bge (r1 N') L2
            r2 = MEM(A + r1)
            bne (r2 1) L3
            r3 = MEM(B + r1)
            r4 = r4 + r3
            r1 = r1 + 4
            bge (r1 N') L2
            r2 = MEM(A + r1)
            bne (r2 1) L3
            r3 = MEM(B + r1)
            r4 = r4 + r3
            r1 = r1 + 4
            blt (r1 N') L1
    L2:     ...
    L3:     r1 = r1 + 4
            blt (r1 N') L1
            goto L2

(d) assembly code after superblock dependence removing optimizations:

    pre:    r11 = r1
            r21 = r1 + 4
            r31 = r1 + 8
            r14 = r4
            r24 = 0
            r34 = 0
    L1:     r12 = MEM(A + r11)
            bne (r12 1) L13
            r13 = MEM(B + r11)
            r14 = r14 + r13
            bge (r21 N') L2
            r22 = MEM(A + r21)
            bne (r22 1) L23
            r23 = MEM(B + r21)
            r24 = r24 + r23
            bge (r31 N') L2
            r32 = MEM(A + r31)
            bne (r32 1) L33
            r33 = MEM(B + r31)
            r34 = r34 + r33
            r11 = r11 + 12
            r21 = r21 + 12
            r31 = r31 + 12
            blt (r11 N') L1
    L2:     r4 = r14 + r24
            r4 = r4 + r34
    L13:    r11 = r11 + 4
            r21 = r21 + 4
            r31 = r31 + 4
            blt (r11 N') L1
            goto L2
    L23:    r11 = r11 + 8
            r21 = r21 + 8
            r31 = r31 + 8
            blt (r11 N') L1
            goto L2
    L33:    r11 = r11 + 12
            r21 = r21 + 12
            r31 = r31 + 12
            blt (r11 N') L1
            goto L2

Figure 4: An application of superblock ILP optimizations: (a) original program segment, (b) assembly code after superblock formation, (c) assembly code after superblock loop unrolling, (d) assembly code after superblock dependence removal optimizations.


Example. An example to illustrate loop unrolling, register renaming, induction variable expansion, and accumulator variable expansion is shown in Figure 4. This example assumes that the condition of the if statement within the loop is likely to be true. After superblock formation, the resulting assembly code is shown in Figure 4(b). To enlarge the superblock loop, loop unrolling is applied. The loop is unrolled three times in this example (Figure 4(c)). After loop unrolling, data dependences limit the amount of ILP in the superblock loop.

The dependences among successive updates of r4 (Figure 4(c)) are eliminated with accumulator variable expansion. In Figure 4(d), three temporary accumulators, r14, r24, and r34, are created within the superblock loop. After accumulator expansion, all updates to r4 within one iteration of the unrolled loop are independent of one another. In order to maintain correctness, the temporary accumulators are properly initialized in the loop preheader (pre), and the values are summed at the loop exit point (L2). The dependences among successive updates of r1, along with updates of r1 and succeeding load instructions (Figure 4(c)), are eliminated with induction variable expansion. In Figure 4(d), three temporary induction variables, r11, r21, and r31, are created within the superblock loop. After induction variable expansion, the chain of dependences created by the induction variable is eliminated within one iteration of the unrolled loop. Finally, register renaming is applied to the load instructions to eliminate output dependences. After all superblock ILP optimizations are applied, the execution of the original loop bodies within the unrolled superblock loop may be completely overlapped by the scheduler.

3.3 Superblock Scheduling

After the superblock ILP optimizations are applied, superblock scheduling is performed. Code scheduling within a superblock consists of two steps: dependence graph construction followed by list scheduling. A dependence graph is a data structure which represents the control and data dependences between instructions. A control dependence is initially assigned between each instruction and every preceding branch in the superblock. Some control dependences may then be removed according to the available hardware support described in the following section. After the appropriate control dependences have been eliminated, list scheduling using the dependence graph, instruction latencies, and resource constraints is performed on the superblock.
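The following C sketch outlines the list scheduling step for one superblock. The DepGraph structure and the priority field are simplified stand-ins (a common priority is the instruction's height above the superblock exit); this is a generic sketch of list scheduling under uniform resources, not the IMPACT-I scheduler.

    #define MAXI 256

    typedef struct {
        int  n;                       /* number of instructions             */
        int  latency[MAXI];           /* result latency of each instruction */
        char dep[MAXI][MAXI];         /* dep[i][j]: j depends on i (a DAG)  */
        int  priority[MAXI];          /* scheduling priority                */
    } DepGraph;

    /* Schedule one superblock on a machine issuing up to 'width'
       instructions per cycle; cycle_out[i] receives i's issue cycle. */
    void list_schedule(const DepGraph *g, int width, int cycle_out[])
    {
        int scheduled[MAXI] = {0};
        int done = 0, cycle = 0;

        while (done < g->n) {
            int issued = 0;
            while (issued < width) {
                int best = -1;
                for (int j = 0; j < g->n; j++) {    /* best ready instr */
                    if (scheduled[j]) continue;
                    int ready = 1;
                    for (int i = 0; i < g->n; i++)  /* operands available? */
                        if (g->dep[i][j] && (!scheduled[i] ||
                            cycle_out[i] + g->latency[i] > cycle)) {
                            ready = 0;
                            break;
                        }
                    if (ready && (best < 0 ||
                                  g->priority[j] > g->priority[best]))
                        best = j;
                }
                if (best < 0) break;                /* nothing fits this cycle */
                scheduled[best] = 1;
                cycle_out[best] = cycle;
                issued++;
                done++;
            }
            cycle++;                                /* advance the clock */
        }
    }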

3.4 Speculative Execution Support

Speculative execution refers to the execution of an instruction before it is known that its execution is required. Such an instruction will be referred to as a speculative instruction. Speculative execution occurs when the superblock scheduler moves an instruction J above a preceding conditional branch B. During run time, J will be executed before B, i.e., J is executed regardless of the branch direction of B. However, according to the original program order, J should be executed only if B is not taken. (Note that the blocks of a superblock are laid out sequentially by the compiler; each instruction in the superblock is always on the fall-through path of its preceding conditional branch.) Therefore, the execution result of J must not be used if B is taken, which is formally stated as follows:



Restriction 1. The destination of J is not used before it is redefined when B is taken.

This restriction usually has very little effect on code scheduling after superblock ILP optimization. For example, in Figure 4(d), after dependence removal in block L1, the instruction that loads B[i] into r13 can be executed before the direction of the preceding bne instruction is known. Note that the execution result of this load instruction must not be used if the branch is taken. This is trivially achieved because r13 is never used before defined if the branch is taken; the value thus loaded into r13 will be ignored in the subsequent execution.

A more serious problem with speculative execution is the prevention of premature program termination due to exceptions caused by speculative instructions. Exceptions caused by speculative instructions which would not have executed on a sequential machine must be ignored, which leads to the following restriction:

Restriction 2. J will never cause an exception that may terminate program execution when branch B is taken.

In Figure 4(d), the execution of the instruction that loads B[i] into r13 could potentially cause a memory access violation fault. If this instruction is speculatively scheduled before its preceding branch and such a fault occurs during execution, the exception should be ignored if the branch was taken. Reporting the exception if the branch is taken would have incorrectly terminated the execution of the program.

Two levels of hardware support for Restriction 2 will be examined in this paper: the restricted percolation model and the general percolation model. The restricted percolation model includes no support for disregarding the exceptions generated by the speculative instructions. For conventional processors, memory load, memory store, integer divide, and all floating point instructions can cause exceptions. When using the restricted percolation model, these instructions may not be moved above branches unless the compiler can prove that those instructions can never cause an exception when the preceding branch is taken. The limiting factor of the restricted percolation model is the inability to move potential trap-causing instructions with long latency, such as load instructions, above branches. When the critical path of a superblock contains many of these instructions, the performance of the restricted percolation model is limited.

The general percolation model eliminates Restriction 2 by providing a non-trapping version of instructions that can cause exceptions [Chang et al. 1991a]. Exceptions that may terminate program execution are avoided by converting all speculative instructions which can potentially cause exceptions into their non-trapping versions. For programs in which detection of exceptions is important, the loss of exceptions can be recovered with additional architectural and compiler support [Mahlke et al. 1992]. However, in this paper only the non-trapping execution support is utilized.



In Section 4.6, we will show how the general percolation model allows the superblock scheduler to exploit much more of the ILP exposed by superblock ILP optimizations than the restricted percolation model.
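A source-level picture of what the scheduler does under the two models follows. The nontrapping_load() below is a hypothetical stand-in for a non-trapping load instruction (it returns a garbage value instead of raising an exception); it is not a real API, and the compiler performs this transformation on machine code rather than on C.

    extern int nontrapping_load(const int *addr);  /* hypothetical intrinsic */

    int restricted(const int *p, int ok)
    {
        if (!ok) return 0;     /* branch B: the potentially trapping load */
        return *p;             /* J cannot be hoisted above it            */
    }

    int general(const int *p, int ok)
    {
        int j = nontrapping_load(p);  /* J speculated above B; a bad      */
                                      /* address yields garbage, no fault */
        if (!ok) return 0;            /* if B is taken, j is never used,  */
        return j;                     /* satisfying Restriction 1         */
    }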

4 Implementation Cost and Performance Results

In this section, we report the implications of superblock ILP optimization and scheduling on compiler size, compile time, and output code performance. We additionally examine three architectural factors that directly affect the effectiveness of superblock ILP optimizations and scheduling: speculative execution support, instruction cache misses, and data cache misses.

4.1 Compiler Size

The IMPACT-I C compiler serves two important purposes. First, it is intended to generate highly optimized code for existing commercial microprocessors. We have constructed code generators for the MIPS R2000, SPARC, Am29000, i860, and HP PA-RISC processors. Second, it provides a platform for studying new code optimization techniques for instruction-level parallel processing. New code optimization techniques, once validated, can be immediately applied to the VLIW and superscalar implementations of existing and new commercial architectures.

Figure 5 shows the percentage of compiler size due to each level of compiler sophistication. The base level accounts for the C front-end, traditional code optimizations, graph-coloring-based register allocation, basic block scheduling, and one code generator [Chaitin 1982; Chow and Hennessy 1990]. The traditional optimizations include classical local and global code optimizations, function inline expansion, instruction placement optimization, profile-based branch prediction, and constant preloading [Aho et al. 1986; Hwu and Chang 1989a; Hwu and Chang 1989b].

[Figure 5 pie chart omitted: breakdown of compiler source code by level of sophistication, with the base level accounting for 82% and the profile, superblock formation, and superblock optimization levels sharing the remaining 18% (segments of 12%, 4%, and 2%).]

Figure 5: Compiler code size breakdown. The entire compiler consists of approximately 92,000 lines of C code with one code generator.

The profile level generates dynamic execution frequency information and feeds this information back to the compiler. The superblock formation level performs trace selection, tail duplication, and superblock scheduling. The superblock ILP optimization level performs branch expansion, loop unrolling, loop peeling, register renaming, induction variable expansion, accumulator expansion, operation migration, and operation combining. As shown in Figure 5, the compiler source code dedicated to the superblock techniques is only about 14% of the entire IMPACT-I compiler.

4.2 Benchmark Programs

Benchmark   Size   Benchmark Description        Profile Description                      Input Description
cccp        4787   GNU C preprocessor           20 C source files (100-5000 lines)       1 C source file (4000 lines)
cmp          141   compare files                20 similar/dissimilar files              2 similar files (4000 lines)
compress    1514   compress files               20 C source files (100-5000 lines)       1 C source file (4000 lines)
eqn         2569   typeset math formulas        20 ditroff files (100-4000 lines)        1 ditroff file (17000 lines)
eqntott     3461   boolean minimization         5 files of boolean equations             standard SPEC 92 input
espresso    6722   boolean minimization         20 original espresso benchmarks          opa
grep         464   string search                20 C source files with search strings    1 C source file (4000 lines)
lex         3316   lexical analyzer generator   5 lexers for C, lisp, pascal, awk, pic   C lexer
li          7747   lisp interpreter             5 gabriel benchmarks                     queens 7
tbl         2817   format tables for troff      20 ditroff files (100-4000 lines)        1 ditroff file (5000 lines)
wc           120   word count                   20 C source files (100-5000 lines)       1 C source file (4000 lines)
yacc        2303   parser generator             10 grammars for C, pascal, pic, eqn      C grammar

Table 1: The benchmarks.

Table 1 shows the characteristics of the benchmark programs to be used in our compile time and output code performance experiments. The Size column indicates the sizes of the benchmark programs measured in numbers of lines of C code, excluding comments. The remaining columns describe the benchmarks, the input files used for profiling, and the input file used for performance comparison. In a few cases, such as lex and yacc, one of the profile inputs was used as the test input due to an insufficient number of realistic test cases. Most of the benchmark programs are large and complex enough that it would be virtually impossible to conduct experimentation using these programs without a working compiler.

4.3 Base Code Calibration

All the superblock ILP optimization and scheduling results will be reported as speedup over the code generated by a base compilation. For the speedup measures to be meaningful, it is important to show that the base compilation generates efficient code. This is done by comparing the execution time of our base compilation output against that produced by two high quality production compilers. In Table 2, we compare the output code execution time against that of the MIPS C compiler (release 2.1, -O4) and the GNU C compiler (release 1.37.1, -O) on a DECstation 3100, which uses a MIPS R2000 processor. The MIPS.O4 and GNU.O columns show the normalized execution time for code generated by the MIPS and GNU C compilers with respect to the IMPACT base compilation output. The results show that our base compiler performs slightly better than the two production compilers for all benchmark programs. Therefore, all the speedup numbers reported in the subsequent sections are based on an efficient base code.

Benchmark IMPACT.O5 MIPS.O4 GNU.O

cccp 1.00 1.08 1.09

cmp 1.00 1.04 1.05

compress 1.00 1.02 1.06

eqn 1.00 1.09 1.10

eqntott 1.00 1.04 1.33

espresso 1.00 1.02 1.15

grep 1.00 1.03 1.23

lex 1.00 1.01 1.04

li 1.00 1.14 1.32

tbl 1.00 1.02 1.08

wc 1.00 1.04 1.15

yacc 1.00 1.00 1.11

Table 2: Execution times of benchmarks on a DECstation 3100.

4.4 Compile Time Cost

Due to the prototype nature of IMPACT-I, very little effort has been spent minimizing compile time. During the development of the superblock ILP optimizer, compile time was given much less concern than correctness, output code performance, clear coding style, and ease of software maintenance. Therefore, the compile time results presented in this section do not necessarily represent the cost of future implementations of superblock ILP optimizations and scheduling. Rather, the compile time data is presented to provide some initial insight into the compile time cost of superblock techniques.

Figure 6 shows the percentage increase in compile time due to each level of compiler sophistication beyond base compilation on a SPARCstation-II workstation. To show the cost of program profiling, we separated the compile time cost of deriving the execution profile. The profiling cost reported for each benchmark in Figure 6 is the cost to derive the execution profile for a typical input file. To acquire stable profile information, one should profile each program with multiple input files.

The superblock formation part of the compile time reflects the cost of trace selection, tail duplication, and the increased scheduling cost going from basic block scheduling to superblock scheduling. For our set of benchmarks, the overhead of this part is between 2% and 23% of the base compilation time.

The superblock ILP optimization part of the compile time accounts for the cost of branch expansion, loop unrolling, loop peeling, and dependence removing optimizations, as well as the further increase in scheduling cost due to enlarged superblocks.


[Figure 6 bar chart omitted: compile time for each benchmark, normalized to the base compilation and broken into Base, Profile, Superblock Formation, and Superblock Optimization components; the largest total reaches 7.12.]

Figure 6: Compile time cost of superblock ILP optimizations.

Note that the overhead varies substantially across benchmarks. Although the average overhead is about 101% of the base compilation time, the overhead is as high as 522% for cmp. After examining the compilation process in detail, we found that superblock enlargement created some huge superblocks for cmp. Because there are several O(N^2) algorithms used in ILP optimization and code scheduling, where N is the number of instructions in a superblock, the compile time cost increased dramatically due to these huge superblocks. This problem can be solved by the combination of superblock size control and more efficient optimization and scheduling algorithms to decrease the worst-case compile time overhead.

4.5 Performance of Superblock ILP Optimization and Scheduling

The compile time and base code calibration results presented in this paper have been based on real machine execution time. From this point on, we will report the performance of benchmarks based on simulation of a wide variety of superscalar processor organizations. All results will be reported as speedup over a scalar processor executing the base compilation output code. The scalar processor is based on the MIPS R2000 instruction set with extensions for enhanced branch capabilities [Kane 1987]. These include squashing branches, additional branch opcodes such as BLT and BGT, and the ability to execute multiple branches per cycle [Hwu and Chang 1993]. The underlying microarchitecture stalls for flow dependent instructions, resolves output dependences with register renaming, and resolves anti-dependences by fetching operands at an early stage of the pipeline. The instruction latencies assumed in the simulations are: 1 cycle for integer add, subtract, comparison, shift, and logic operations; 2 cycles for load (in cache); 1 cycle for store; 2 cycles for branches; 3 cycles for integer multiply; 10 cycles for integer divide; 3 cycles for floating point add, subtract, multiply, comparison, and conversion; and 10 cycles for floating point divide. The load and store latency for data cache misses will be discussed in Section 4.8.


[Figure 7 bar chart omitted: speedup for each benchmark at issue rates 2, 4, and 8, broken into Traditional Optimizations, Superblock Formation, and Superblock Optimization components.]

Figure 7: Performance improvement due to superblock ILP optimization. The speedup numbers are relative to the scalar processor with base level compilation.


To derive the execution cycles for a particular level of compiler optimization and a processor configuration, a simulation is performed for each benchmark. Although the base scalar processor is always simulated with the output code from the base compilation, the superscalar processors are simulated with varying levels of optimization. Experimental results presented in this section assume an ideal cache for both scalar and superscalar processors. This allows us to focus on the effectiveness of the compiler to utilize the processor resources. The effect of cache misses will be addressed in Sections 4.7 and 4.8.

In Figure 7, the performance improvement due to each additional level of compiler sophistication is shown for superscalar processors with varying instruction issue and execution resources. An issue-K processor has the capacity to issue K instructions per cycle. In this experiment, the processors are assumed to have uniform function units, thus there are no restrictions on the permissible combinations of instructions among the K issued in the same cycle.

As shown in Figure 7, both superblock formation and superblock ILP optimization significantly increase the performance of superscalar processors. In fact, without these techniques, the superscalar processors achieve little speedup over the base scalar processor.


[Figure 8 bar chart omitted: speedup for each benchmark at issue rates 2, 4, and 8 under the Restricted Percolation and General Percolation models.]

Figure 8: Effect of speculative support on superblock ILP optimization results.

Note that for cmp, grep, and wc, a 4-issue processor achieves more than four times speedup over the base processor. This speedup is superlinear because the 4-issue processor executes code with superblock optimization whereas the base processor only executes traditionally optimized code. For the 4-issue processor, the cumulative performance improvement due to the superblock techniques ranges from 53% to 293% over the base compilation. This data clearly demonstrates the effectiveness of the superblock techniques.

4.6 Effect of Speculative Execution

The general code percolation model is perhaps the most important architectural support for superblock ILP optimization and scheduling. The ability to ignore exceptions for speculative instructions allows the superblock scheduler to fully exploit the increased parallelism due to superblock ILP optimizations. This advantage is quantified in Figure 8. The general code percolation model allows the compiler to exploit from 13% to 143% more instruction-level parallelism for issue 8. Without hardware support, the scheduler can take advantage of some of the parallelism exposed by superblock optimization. However, using speculative instructions in the general code percolation model, the scheduler is able to improve the performance of the 8-issue processor to between 2.36 and 7.12 times speedup. Furthermore, as the processor issue rate increases, the importance of general code percolation increases. For most benchmarks with restricted percolation, little speedup is obtained from a 4-issue processor to an 8-issue processor. However, when general percolation is used, substantial improvements are observed from a 4-issue processor to an 8-issue processor. These results confirm our qualitative analysis in Section 3.4.


[Figure 9 bar chart omitted: output code expansion for each benchmark, broken into Traditional Optimizations, Superblock Formation, and Superblock Optimizations components.]

Figure 9: Output code size expansion due to superblock ILP techniques.

4.7 Instruction Cache Effects

The expansion of code from superblock formation and superblock optimizations will have an effect on instruction cache performance. Most superblock ILP optimizations rely on code duplication to enlarge the scheduling scope. Some optimizations, such as accumulator expansion, add new instructions to maintain the correctness of execution after optimization. As shown in Figure 9, the superblock ILP optimizations significantly increase the code size. The code sizes of eqn and cmp are increased by 23% and 355%, respectively. Such code expansion can potentially degrade instruction cache performance. In addition, each stall cycle due to cache misses has a greater impact on the performance of superscalar processors than on that of scalar processors. Therefore, it is important to evaluate the impact of instruction cache misses on the performance of superblock ILP optimizations.

Figure 10 shows the speedup of an 8-issue processor over the base scalar processor when taking the instruction cache miss penalty into account. The four bars associated with each benchmark correspond to four combinations of two optimization levels and two cache refill latencies. The two cumulative optimization levels are superblock formation (A, C) and superblock ILP optimization (B, D). The two cache refill latencies are 16 and 32 clock cycles. Each bar in Figure 10 has four sections showing the relative performance of four cache sizes: 1K, 4K, 16K, and ideal. The caches are direct mapped with 32-byte blocks. Each instruction cache miss is assumed to cause the processors to stall for the cache refill latency minus the overlap cycles due to a load forwarding mechanism [Chen et al. 1991]. Since instruction cache misses affect both the base scalar processor performance and the superscalar processor performance, speedup is calculated by taking instruction cache misses into account for both performance measurements.

[Figure 10 bar chart omitted: speedup of the 8-issue processor for each benchmark under ideal, 16K, 4K, and 1K instruction caches, for bars A through D as described in the caption.]

Figure 10: Instruction cache effect on superblock ILP techniques, where A and C represent superblock formation, B and D superblock optimization; A and B have a cache refill latency of 16 cycles, C and D have a cache refill latency of 32 cycles.


As shown in Figure 10, for larger caches, superblock ILP optimizations increase performance despite the effect of cache misses. Even for 1K caches, superblock ILP optimizations increase performance for all but compress, grep, and wc. The performance approaches that of an ideal cache when the instruction cache is 16K bytes or larger, for both 16-cycle and 32-cycle cache refill latencies. Since most modern high performance computers have more than 64K bytes of instruction cache, the performance advantage of superblock ILP optimizations is expected to be relatively unaffected by instruction misses in future high performance computer systems.

4.8 Data Cache Effects

Because superblock optimizations do not affect the number of data memory accesses, the number of extra cycles due to data cache misses remains relatively constant across the optimization levels. However, since the superblock optimizations reduce the number of execution cycles, the relative overhead due to data cache misses increases. Figure 11 shows the effect of four cache configurations on the performance of an 8-issue processor. The data cache organizations have the same block size and refill latencies as those used in the instruction cache experiments, but the cache sizes are 4K, 16K, 64K, and ideal. Note that data cache misses have more influence on the performance results than instruction cache misses. This is particularly true for the compress, eqntott, and lex benchmarks, where there is a noticeable difference between the speedups for the 64K cache and the ideal cache.

[Figure 11 appears here: a bar chart plotting speedup (y-axis, 1 to 7) for each benchmark (cccp, cmp, compress, eqn, eqntott, espresso, grep, lex, li, tbl, wc, yacc), with bars for an ideal cache and for 64K, 16K, and 4K data caches.]

Figure 11: Data cache effect on superblock ILP techniques, where A and C represent superblock formation, B and D superblock optimization; A and B have a cache refill latency of 16 cycles, and C and D have a cache refill latency of 32 cycles.

This is particularly true for the compress, eqntott, and lex benchmarks, where there is a noticeable difference between the speedups for the 64K cache and the ideal cache. The poor cache performance of the compress benchmark can be attributed to its large internal data structures: compress uses two large tables, each larger than 64K bytes when large input files are used. The effect of the data cache on the performance of superblock optimizations illustrates the need to include data prefetching and other load latency hiding techniques in the compiler.
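The arithmetic behind this observation can be sketched as follows. The cycle counts are hypothetical; only the qualitative trend, a constant stall component becoming a larger fraction of a shrinking total, reflects our measurements:

```python
# Hypothetical illustration: data cache stall cycles stay roughly constant
# while superblock optimization shrinks the compute cycles, so the stalls
# claim a growing share of total execution time. Numbers are invented.

def stall_fraction(compute_cycles, stall_cycles):
    return stall_cycles / (compute_cycles + stall_cycles)

stalls = 200_000    # data cache miss cycles: unchanged by superblocks
before = 1_000_000  # compute cycles before superblock optimization
after = 500_000     # compute cycles after (an assumed 2x reduction)

print(stall_fraction(before, stalls))  # ~0.17 of total time
print(stall_fraction(after, stalls))   # ~0.29 of total time
# This growing fraction is the target for prefetching and latency hiding.
```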

5 Conclusion

Control-intensive programs challenge instruction-level parallel processing compilers with excess constraints from many possible execution paths. In order to compile these programs effectively, we have designed the superblock ILP optimizations and superblock scheduling to systematically remove constraints due to unimportant execution paths. The IMPACT-I prototype proves that it is feasible to implement superblock ILP optimization and superblock scheduling in a real compiler. The development effort dedicated to the prototype implementation is about 10 person-years in an academic environment.

The implementation of the superblock techniques accounts for approximately 14% of the compiler source code. Superblock techniques add an average overhead of 101% to the base compilation time. We would like to emphasize that the prototype is not tuned for fast compilation. The results here do not necessarily represent the compile-time cost of commercial implementations. Rather, these numbers are reported to show that the compile-time overhead is acceptable in a prototype implementation.
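For concreteness, an average overhead of 101% means superblock compilation roughly doubles the compile time. A trivial sketch, with a made-up base compile time:

```python
# Reading of the 101% average overhead; the 60-second base compile time
# is a made-up example value, only the 101% figure is from our report.
base_seconds = 60.0
superblock_seconds = base_seconds * (1.0 + 1.01)
print(superblock_seconds)  # 120.6: roughly 2x the base compile time
```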

Using simulation, we demonstrate that superscalar processors achieve much higher performance with superblock ILP optimization and superblock scheduling. For example, the improvement for a 4-issue processor ranges from 53% to 293% across the benchmark programs.

Three architectural factors strongly influence the performance of superscalar and VLIW processors: speculative execution support, instruction cache misses, and data cache misses. We have shown that the general code percolation model allows the compiler to exploit from 13% to 143% more instruction-level parallelism. Considering the moderate cost of speculative execution hardware, we expect that many future superscalar and VLIW systems will provide such support.

Although instruction cache misses can potentially cause large performance degradation, we found that the benchmark performance results remain unaffected for instruction caches of reasonable size. Since most workstations have more than 64K bytes of instruction cache, we do not expect instruction cache misses to reduce the performance advantage of superblock ILP optimizations. Similar conclusions can be drawn for the data cache. However, several benchmarks require more advanced data prefetching techniques to compensate for the effect of high cache miss rates.

In conclusion, the IMPACT-I prototype proves that superblock ILP optimization and scheduling are not only feasible but also cost-effective. It also demonstrates that substantial speedup can be achieved by superscalar and VLIW processors over the current generation of high-performance RISC scalar processors. It provides one important set of data points to support instruction-level parallel processing as an important technology for the next generation of high-performance processors.

Acknowledgments

The authors would like to acknowledge all the members of the IMPACT research group for their support. This research has been supported by the National Science Foundation (NSF) under Grant MIP-8809478, Dr. Lee Hoevel at NCR, Hewlett-Packard, the AMD 29K Advanced Processor Development Division, Matsushita Electric Corporation, the Joint Services Engineering Programs (JSEP) under Contract N00014-90-J-1270, and the National Aeronautics and Space Administration (NASA) under Contract NASA NAG 1-613 in cooperation with the Illinois Computer Laboratory for Aerospace Systems and Software (ICLASS). Scott Mahlke is supported by an Intel Fellowship. Grant Haab is supported by a Fannie and John Hertz Foundation Graduate Fellowship. John Holm is supported by an AT&T Fellowship. Daniel Lavery is also supported by the Center for Supercomputing Research and Development at the University of Illinois at Urbana-Champaign under Grant DOE DE-FG02-85ER25001 from the U.S. Department of Energy, and the IBM Corporation.

References

Aho, A., Sethi, R., and Ullman, J. 1986. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA.

Aiken, A. and Nicolau, A. 1988. A development environment for horizontal microcode. IEEE Transactions on Software Engineering, 14 (May), 584–594.

Bernstein, D. and Rodeh, M. 1991. Global instruction scheduling for superscalar machines. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation (June), 241–255.

Chaitin, G. J. 1982. Register allocation and spilling via graph coloring. In Proceedings of the ACM SIGPLAN 82 Symposium on Compiler Construction (June), 98–105.

Chang, P. P. and Hwu, W. W. 1988. Trace selection for compiling large C application programs to microcode. In Proceedings of the 21st International Workshop on Microprogramming and Microarchitecture (Nov.), 188–198.

Chang, P. P., Mahlke, S. A., Chen, W. Y., Warter, N. J., and Hwu, W. W. 1991a. IMPACT: An architectural framework for multiple-instruction-issue processors. In Proceedings of the 18th International Symposium on Computer Architecture (May), 266–275.

Chang, P. P., Mahlke, S. A., and Hwu, W. W. 1991b. Using profile information to assist classic code optimizations. Software: Practice and Experience, 21, 12 (Dec.), 1301–1321.

Chen, W. Y., Chang, P. P., Conte, T. M., and Hwu, W. W. 1991. The effect of code expanding optimizations on instruction cache design. Technical Report CRHC-91-17, Center for Reliable and High-Performance Computing, University of Illinois, Urbana, IL.

Chow, F. C. and Hennessy, J. L. 1990. The priority-based coloring approach to register allocation. ACM Transactions on Programming Languages and Systems, 12 (Oct.), 501–536.


Colwell, R. P., Nix, R. P., O'Donnell, J. J., Papworth, D. B., and Rodman, P. K. 1987. A VLIW architecture for a trace scheduling compiler. In Proceedings of the 2nd International Conference on Architectural Support for Programming Languages and Operating Systems (Apr.), 180–192.

Ellis, J. 1985. Bulldog: A Compiler for VLIW Architectures. The MIT Press, Cambridge, MA.

Fisher, J. A. 1981. Trace scheduling: A technique for global microcode compaction. IEEE Transactions on Computers, C-30 (July), 478–490.

Gupta, R. and Soffa, M. L. 1990. Region scheduling: An approach for detecting and redistributing parallelism. IEEE Transactions on Software Engineering, 16 (Apr.), 421–431.

Horst, R. W., Harris, R. L., and Jardine, R. L. 1990. Multiple instruction issue in the NonStop Cyclone processor. In Proceedings of the 17th International Symposium on Computer Architecture (May), 216–226.

Hwu, W. W. and Chang, P. P. 1989a. Achieving high instruction cache performance with an optimizing compiler. In Proceedings of the 16th International Symposium on Computer Architecture (May), 242–251.

Hwu, W. W. and Chang, P. P. 1989b. Inline function expansion for compiling realistic C programs. In Proceedings of the ACM SIGPLAN 1989 Conference on Programming Language Design and Implementation (June), 246–257.

Hwu, W. W. and Chang, P. P. Accepted for publication. Efficient instruction sequencing with inline target insertion. IEEE Transactions on Computers.

Intel 1989. i860 64-Bit Microprocessor. Santa Clara, CA.

Jouppi, N. P. and Wall, D. W. 1989. Available instruction-level parallelism for superscalar and superpipelined machines. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems (Apr.), 272–282.

Kane, G. 1987. MIPS R2000 RISC Architecture. Prentice-Hall, Inc., Englewood Cliffs, NJ.

Kuck, D. J. 1978. The Structure of Computers and Computations. John Wiley and Sons, New York, NY.


Kuck, D. J., Kuhn, R. H., Padua, D. A., Leasure, B., and Wolfe, M. 1981. Dependence graphs and compiler optimizations. In Proceedings of the 8th ACM Symposium on Principles of Programming Languages (Jan.), 207–218.

Mahlke, S. A., Chen, W. Y., Hwu, W. W., Rau, B. R., and Schlansker, M. S. 1992. Sentinel scheduling for VLIW and superscalar processors. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems (Oct.).

Nakatani, T. and Ebcioglu, K. 1989. Combining as a compilation technique for VLIW architectures. In Proceedings of the 22nd International Workshop on Microprogramming and Microarchitecture (Sept.), 43–55.

Rau, B. R., Yen, D. W. L., Yen, W., and Towle, R. A. 1989. The Cydra 5 departmental supercomputer. IEEE Computer (Jan.), 12–35.

Schuette, M. A. and Shen, J. P. 1991. An instruction-level performance analysis of the Multiflow TRACE 14/300. In Proceedings of the 24th International Workshop on Microprogramming and Microarchitecture (Nov.), 2–11.

Smith, M. D., Johnson, M., and Horowitz, M. A. 1989. Limits on multiple instruction issue. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems (Apr.), 290–302.

Warren, Jr., H. S. 1990. Instruction scheduling for the IBM RISC System/6000 processor. IBM Journal of Research and Development, 34 (Jan.), 85–92.