Published in HICSS-26 Conference Proceedings, January 1993, Vol. 1, pp. 497-506.

The Benefit of Predicated Execution for Software Pipelining

Nancy J. Warter    Daniel M. Lavery    Wen-mei W. Hwu

Center for Reliable and High-Performance Computing
University of Illinois at Urbana-Champaign
Urbana, IL 61801

Abstract

Software pipelining is a compile-time scheduling technique that overlaps successive loop iterations to expose operation-level parallelism. An important problem with the development of effective software pipelining algorithms is how to handle loops with conditional branches. Conditional branches increase the complexity and decrease the effectiveness of software pipelining algorithms by introducing many possible execution paths into the scheduling scope. This paper presents an empirical study of the importance of an architectural support, referred to as predicated execution, on the effectiveness of software pipelining. In order to perform an in-depth analysis, we focus on Rau's modulo scheduling algorithm for software pipelining. Three versions of the modulo scheduling algorithm, one with and two without predicated execution support, are implemented in a prototype compiler. Experiments based on important loops from numeric applications show that predicated execution support substantially improves the effectiveness of the modulo scheduling algorithm.

1 Introduction

Software pipelining has been shown to be an effective scheduling method for exploiting operation-level parallelism [1] [2] [3] [4]. The basic idea behind software pipelining is to overlap the iterations of a loop body and thus expose sufficient parallelism to utilize the underlying hardware. There are two fundamental problems that must be solved for software pipelining. The first is to ensure that overlapping lifetimes of the same virtual register are allocated to unique physical registers. The second problem is enabling loop iterations with conditional branches to be overlapped.

These problems have been solved in the implementations for the Warp [5] [3] and Cydra 5 [6] machines. Both implementations are based on the modulo scheduling techniques proposed by Rau and Glaeser [7]. The Cydra 5 has special hardware support for modulo scheduling in the form of a rotating register file and predicated operations [8]. In the absence of special hardware support in the Warp, Lam proposes modulo variable expansion and hierarchical reduction for handling register allocation and conditional constructs respectively [3]. Rau has studied the benefits of hardware support for register allocation [9].

In this paper we analyze the implications and benefits of predicated operations for software pipelining loops with conditional branches. In particular, the analysis presented in this paper is designed to help future microprocessor designers determine whether predicated execution support is worthwhile given their own estimation of the increased hardware cost.

Although software pipelining has been shown to be effective for many different architectures, in this paper we limit the discussion to VLIW and superscalar processors. For clarity, we use VLIW terminology. Thus, an instruction refers to a very long instruction word which contains multiple operations.

2 Modulo Scheduling

In a modulo scheduled software pipeline, a loop iteration is initiated every II cycles, where II is the Initiation Interval [7] [3] [10] [11]. The II is constrained by the most heavily utilized resource and the worst case recurrence for the loop. These constraints each form a lower bound for II. A tight lower bound is the maximum of these lower bounds. In this paper, this tight lower bound is referred to as the minimum II.

If an iteration uses a resource r for c_r cycles and there are n_r copies of this resource, then the minimum II due to resource constraints, RII, is

    RII = \max_{r \in R} \left\lceil \frac{c_r}{n_r} \right\rceil,

where R is the set of all resources. If a dependence edge, e, in a cycle has latency l_e and connects operations that are d_e iterations apart, then the minimum II due to dependence cycles, CII, is

    CII = \max_{c \in C} \left\lceil \frac{\sum_{e \in E_c} l_e}{\sum_{e \in E_c} d_e} \right\rceil,

where C is the set of all dependence cycles and E_c is the set of edges in dependence cycle c.

Figure 1 shows how modulo scheduling is applied to a loop without recurrences. The loop has four operations with dependences shown in the data dependence graph. The arcs on the dependence graph show the type of the dependence (flow) and latency. In this example, a 2-issue VLIW with uniform function units is assumed. The load operation has a 4-cycle latency and the add and multiply have unit latencies. Since there are no recurrences, the minimum II depends on the maximum of the resource constraints. Since there are two uniform function units and four operations in the loop, the minimum II is \lceil 4/2 \rceil = 2. With no recurrences, the operations can be list scheduled using the modulo resource reservation table [12]. The modulo resource reservation table has a column for each function unit and a row for each cycle in II. Note that after the load, add, and multiply are scheduled, the store operation can be scheduled in cycle 6. However, there are no available resources at cycle 0 (6 mod 2) in the reservation table and thus the store operation is delayed a cycle and scheduled in cycle 7.

[Figure 1: Modulo scheduling a loop without recurrences. The algorithm (1) generates the data dependence graph, (2) determines the Iteration Interval (II) as the maximum of the resource constraints, and (3) list schedules using the modulo resource reservation table (cycle mod II by function unit). For two uniform function units and II = 2, the start times are LD = 0, ADD = 4, MUL = 5, ST = 7.]

If a schedule is not found for a given II, then II is incremented and the loop body is rescheduled. This process repeats until an II is found that satisfies the resource and dependence constraints. If the loop does not have any recurrences, a schedule can usually be found for the minimum II [7]. In the presence of recurrences, heuristics were developed in the Cydra 5 compiler to generate near-optimal schedules. The basic principle is to schedule the recurrence nodes first and then the nodes not constrained by recurrences [12] [11].

After II is scheduled, the code for the software pipelined loop is generated. Since not every instruction is issued in the first II cycles, a prologue is required to "fill" the pipeline. There are p = \lceil (\text{latest issue time} + 1) / II \rceil - 1 II's, or stages, in the prologue. In the example in Figure 1, p = \lceil (7+1)/2 \rceil - 1 = 3. The kernel of the software pipeline corresponds to the portion of the code that is iterated. Since the loop body spans multiple II's, a register lifetime may overlap itself. If this happens, the registers for each lifetime must be renamed. This can be done in hardware using a rotating register file as in the Cydra 5 [6]. With this support, the kernel has one stage. Without special hardware support, modulo variable expansion can be used to unroll the kernel and rename the registers [3]. The number of times the kernel is unrolled, u, is determined by the maximum register lifetime modulo II. The last part of the pipeline is the epilogue, which consists of p stages needed to complete executing the last p iterations of the loop. In total, the software pipeline has 2p + u stages.
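The minimum II computation and the modulo reservation table can be sketched in Python. The RII formula is taken directly from the text; the greedy placement loop below is a simplified stand-in for the modulo resource reservation table (a single pool of uniform function units, operations visited in topological order), and the dependence arcs for the Figure 1 loop (the load feeding both the add and the multiply, which feed the store) are assumptions chosen to reproduce the stated start times:

```python
from math import ceil

def resource_min_ii(cycles_used, copies):
    """RII = max over resources r of ceil(c_r / n_r)."""
    return max(ceil(cycles_used[r] / copies[r]) for r in cycles_used)

def modulo_list_schedule(ops, deps, num_fus, ii):
    """Place each op at the earliest dependence-legal cycle, delaying it
    while its modulo row (cycle mod II) already holds num_fus ops."""
    start, rows = {}, [0] * ii
    for op in ops:  # ops assumed to be in topological order
        t = max((start[s] + lat for s, d, lat in deps if d == op), default=0)
        while rows[t % ii] >= num_fus:
            t += 1          # resource conflict in the reservation table row
        rows[t % ii] += 1
        start[op] = t
    return start

# Figure 1 example: 2 uniform FUs, 4-cycle load, unit-latency add/multiply.
ops = ["LD", "ADD", "MUL", "ST"]
deps = [("LD", "ADD", 4), ("LD", "MUL", 4), ("ADD", "ST", 1), ("MUL", "ST", 1)]
ii = resource_min_ii({"FU": len(ops)}, {"FU": 2})   # ceil(4/2) = 2
schedule = modulo_list_schedule(ops, deps, 2, ii)
print(ii, schedule)   # 2 {'LD': 0, 'ADD': 4, 'MUL': 5, 'ST': 7}
```

The sketch reproduces the behavior described above: the store becomes dependence-ready at cycle 6, but row 0 (6 mod 2) of the reservation table is full, so it is delayed to cycle 7.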

3 Scheduling Loops with Conditional Branches

In order to apply modulo scheduling to loops with conditional branches, the loop body must first be converted into straight-line code. Two techniques, hierarchical reduction [13] [3] and if-conversion [14] [15], have been developed for converting conditional constructs (e.g., if-then-else constructs) into straight-line code for the Warp [5] and Cydra 5 [6] machines respectively. In this section, we analyze the impact of these two conversion techniques on the scheduling constraints.

3.1 Hierarchical Reduction

Hierarchical reduction is a technique that converts code with conditional constructs into straight-line code by collapsing each conditional construct into a pseudo operation [13] [16] [3].

In this section, these pseudo operations will be referred to as reduct_op's. It is a hierarchical technique since nested conditional constructs are reduced by collapsing from the innermost to the outermost. After hierarchical reduction, reduct_op's can be scheduled with other operations in the loop. Hierarchical reduction assumes no special hardware support. Thus, the conditional constructs are regenerated after modulo scheduling, and all operations that have been scheduled with a reduct_op are duplicated to both paths of the conditional construct.

A reduct_op is formed by first list scheduling both paths of the conditional construct. The resource usage of the reduct_op is determined by the union of the resource usages of both paths after list scheduling. The dependences between operations within the conditional construct and those outside are replaced by dependences between the reduct_op and the outside operations. A dependence between two operations is characterized by the type (flow, anti, output, and control), distance, and latency. The distance is the number of loop iterations the dependence spans. The latency is the minimum number of cycles that must be between the operations to guarantee the dependence is met. While the type and distance of the dependence do not change, the latency for a given dependence is modified to take into account when the original operation is scheduled with respect to the reduct_op. Consider two operations op_i and op_j, where op_j is an operation from the conditional construct that has been list scheduled at time t_j. A dependence arc with latency d, source op_i, and destination op_j is replaced by a dependence arc between op_i and the reduct_op, op_r, with latency d' = d - t_j.(1) If instead op_j is the source and op_i is the destination, the latency for the dependence between op_r and op_i is d + t_j.

Figure 2 shows a C code segment of a loop and the corresponding assembly code. Figures 3 and 4 show how the hierarchical reduction technique reduces the control construct in this code segment. The machine being scheduled in this example is a VLIW with two uniform function units. All operations have a one cycle latency. Figure 3 shows the dependence graph for the loop body. The dependence arcs are marked with the type, distance, and latency. The types are abbreviated as follows: flow (f), anti (a), and control (c). Figure 4a shows how the reduction operation is formed by list scheduling the two paths (op5, op6, op7) and (op5, op8). The resultant resource usage of the reduct_op op5' is also shown. Figure 4b shows the modified loop dependence graph. Note that the latency for the flow dependence between operations op4 and op6 and also between op4 and op8 is originally one. These dependences are replaced by a flow dependence between op4 and op5' with zero latency, since op6 and op8 are scheduled one cycle later than op5'. Note that there are no control dependences after the graph has been reduced. This code can now be modulo scheduled. After scheduling, the reduct_op's must be expanded. Any operations scheduled within a reduct_op must be copied to both paths of the regenerated conditional construct.

[Figure 2: Example C loop segment with corresponding assembly code.

  a: C code segment                 b: assembly code segment

  for (i = 0; i < max; i++) {       op1:      r1 <- 0
      if (i == 0) {                 op2:      r2 <- label_A
          k = c1 * A[i];            op3:      r3 <- max*8
      } else {                      op4:  L1: r4 <- mem[r2+r1]
          k = c2 * A[i];            op5:      bne r1, 0, L2
      }                             op6:      r5 <- r4 * c1
      A[i] = k;                     op7:      jump L3
  }                                 op8:  L2: r5 <- r4 * c2
                                    op9:  L3: mem[r2+r1] <- r5
                                    op10:     r1 <- r1 + 8
                                    op11:     bne r1, r3, L1  ]

While hierarchical reduction allows a loop with conditional constructs to be modulo scheduled, it places some artificial scheduling constraints on the loop by first list scheduling the operations of the conditional construct. The list schedule causes the reduct_op to have a complex resource usage which may conflict with already scheduled operations during modulo scheduling. Figure 4 illustrates the complex resource usage patterns formed by hierarchical reduction. In addition, if a reduct_op spans more than one II, it may conflict with itself and thus no schedule for that II can be found. Since a reduct_op can be overlapped with other reduct_op's, including itself, there is a potential for large code expansion. If a reduct_op is overlapped n times, there are 2^n possible execution paths. Other operations scheduled with the overlapping reduct_op must be copied to each of these paths.

(1) It is possible to have a dependence with a negative latency. After op_r is scheduled, op_i can be scheduled d' cycles earlier if there are no resource conflicts and all other dependences are satisfied.
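The latency rewrite described above can be sketched as a small graph transformation. The arc set below is a hypothetical fragment modeled on the Figure 3/4 example (op6 and op8 list scheduled one cycle after op5, all original latencies one); a fuller implementation would also keep only the maximum latency when several arcs collapse onto the same (source, destination) pair:

```python
def collapse_into_reduct_op(deps, inner_time, reduct):
    """Redirect dependence arcs that touch list-scheduled inner ops onto the
    reduct_op, adjusting latencies by the inner op's list-schedule time t_j.
    deps: list of (src, dst, latency); inner_time: {inner op: t_j}."""
    out = []
    for src, dst, lat in deps:
        if dst in inner_time:            # arc into the construct: d' = d - t_j
            out.append((src, reduct, lat - inner_time[dst]))
        elif src in inner_time:          # arc out of the construct: d' = d + t_j
            out.append((reduct, dst, lat + inner_time[src]))
        else:
            out.append((src, dst, lat))
    return sorted(set(out))              # drop duplicate collapsed arcs

# op5 at t=0, op6/op8 at t=1 within the reduct_op op5'; op9 is outside.
inner = {"op5": 0, "op6": 1, "op8": 1}
deps = [("op4", "op6", 1), ("op4", "op8", 1),
        ("op6", "op9", 1), ("op8", "op9", 1)]
new = collapse_into_reduct_op(deps, inner, "op5'")
print(new)   # op4 -> op5' now has latency 0; op5' -> op9 has latency 2
```

This reproduces the text's observation: the 1-cycle flow dependences op4->op6 and op4->op8 become a single zero-latency dependence from op4 to op5'.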

3.2 If-conversion

If-conversion is another technique that can be used to convert loops with conditional constructs into straight-line code [14]. The basic concept behind if-conversion is to replace conditional branch operations with equivalent compare operations which set a flag.

Operations that are control dependent on the conditional branch are converted into operations that only execute if the flag is set properly. In this way, control dependences are converted into data dependences [15]. To execute if-converted code, hardware support must be available for setting a conditional execution flag and for allowing conditional execution of operations. In a vector processor, if-conversion is supported by a mask vector. In a superscalar or VLIW processor, if-conversion can be supported by predicated execution. The architecture support for predicated execution is presented in Section 4.

[Figure 3: Dependence graph of example code segment (op4 through op11).]

[Figure 4: Applying hierarchical reduction to example code segment. a: the resource usage of the reduct_op op5' is the union of the resource usages of the list-scheduled paths (op5, op6, op7) and (op5, op8); b: the modified loop dependence graph.]

Figure 5a shows the assembly code after if-conversion for the example in Figure 2. A predicate has an id and a type, represented as <id, type>, where the type is either true or false. The predicate compare operation op5' sets (clears) the predicate <p1,T> (<p1,F>) if r1 is not equal to zero. Otherwise, it will clear (set) <p1,T> (<p1,F>). Operation op6' executes if <p1,F> is set and operation op8' executes if <p1,T> is set. Note that while only one of these operations executes, both will be fetched. Thus, with if-conversion, the resource constraints along both paths are summed. Therefore, if-conversion will usually require more resources than hierarchical reduction, but operation placement is more flexible since these operations are not list scheduled prior to modulo scheduling. Also note that if-conversion removes the need for the jump operation op7. However, if a predicate compare operation is not guaranteed to be executed during the execution of the loop body, then the predicate must be invalidated at the beginning of the loop.

[Figure 5: Assembly code segment after if-conversion. a: assembly code segment; b: modified loop dependence graph after if-conversion.

  op1:      r1 <- 0
  op2:      r2 <- label_A
  op3:      r3 <- max*8
  op4:  L1: r4 <- mem[r2+r1]
  op5':     pred_ne r1, 0, p1
  op6':     r5 <- r4 * c1
  op8':     r5 <- r4 * c2
  op9:      mem[r2+r1] <- r5
  op10:     r1 <- r1 + 8
  op11:     bne r1, r3, L1  ]
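The transformation can be illustrated by simulating one iteration of the original branching body (Figure 2) and the if-converted body (Figure 5). The array contents and the constants c1 and c2 are made up, and r1 is treated as an element index rather than a byte offset; the two guarded assignments model the fetch-both, execute-one behavior of predication:

```python
def branching_body(A, r1, c1=3, c2=5):
    r4 = A[r1]
    if r1 != 0:               # op5: bne r1, 0, L2
        r5 = r4 * c2          # op8 (else path)
    else:
        r5 = r4 * c1          # op6 (then path)
    return r5

def if_converted_body(A, r1, c1=3, c2=5):
    r4 = A[r1]
    p1 = (r1 != 0)            # op5': pred_ne r1, 0, p1
    r5 = None
    if not p1:                # op6' executes under <p1,F>
        r5 = r4 * c1
    if p1:                    # op8' executes under <p1,T>
        r5 = r4 * c2
    return r5

A = [2, 4, 6]
assert all(branching_body(A, i) == if_converted_body(A, i) for i in range(3))
```

Both op6' and op8' are always "fetched" (both guards are evaluated every iteration), which is exactly why the resource constraints of the two paths are summed under if-conversion.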

In the example in Figure 5a, the predicate compare operation op5' will always be executed and thus no predicate invalidate operation is required. The modified dependence graph in Figure 5b illustrates that the control dependences in Figure 3 have been converted into data dependences.

4 Architecture Support for Predicated Execution

The architecture support required for predicated execution consists of: (1) predicate compare operations, (2) predicate invalidate operations, (3) a predicate register file, and (4) predicated operations. An example of an architecture similar to the Cydra 5 is shown in Figure 6 [6] [8] [11]. There are 14 types of predicate compare operations: integer, single precision floating-point, and double precision floating-point versions of equal, not equal, greater than, and greater than or equal; and unsigned integer versions of greater than and greater than or equal. These operations compare two source operands and set the value of the destination predicate register accordingly. A predicate invalidate operation is used to invalidate the specified predicate register.

The predicate register file has R 2-bit registers. The low-order bit corresponds to the predicate being false and the high-order bit corresponds to the predicate being true. Both bits should never be set at the same time. If both are cleared, the predicate is invalid. A predicate compare operation will set the low-order bit and clear the high-order bit if the result of the comparison is false. Likewise, it will set the high-order bit and clear the low-order bit if the result is true. The predicate invalidate operation clears both bits of a predicate register. Each operation within the wide instruction word has a predicate register specifier of width log2(R) + 1, where log2(R) bits specify the predicate register and the remaining bit specifies the type (true or false). The width can be reduced if the predicate register file is implemented as a rotating register file as in the Cydra 5 [11].

All operations within the wide instruction are executed. After the predicate register file access delay, an operation in the execution pipeline that references a cleared predicate register will be squashed.

[Figure 6: Architecture support for predicated execution. Each operation in the wide instruction carries a predicate specifier; the function units access the integer and floating-point register file and a predicate register file of R 2-bit (true/false) registers.]
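A minimal model of these 2-bit predicate semantics is sketched below. The register count and method names are illustrative, not the Cydra 5 encoding; the point is the three states (true, false, invalid) and the squash rule:

```python
# Bit 0 = "predicate is false", bit 1 = "predicate is true"; both bits
# clear means the predicate is invalid, so any operation guarded on it
# (under either type) is squashed.
class PredicateFile:
    def __init__(self, size):
        self.regs = [0b00] * size           # all predicates start invalid

    def compare(self, dest, result):
        """Predicate compare: set exactly one bit from the comparison."""
        self.regs[dest] = 0b10 if result else 0b01

    def invalidate(self, dest):
        """Predicate invalidate: clear both bits."""
        self.regs[dest] = 0b00

    def enabled(self, reg, want_true):
        """True if an op with specifier <reg, type> would execute."""
        bit = 0b10 if want_true else 0b01
        return bool(self.regs[reg] & bit)   # cleared bit -> op is squashed

pf = PredicateFile(4)
pf.compare(1, result=True)                  # e.g. pred_ne finding r1 != 0
print(pf.enabled(1, True), pf.enabled(1, False))   # True False
pf.invalidate(1)
print(pf.enabled(1, True), pf.enabled(1, False))   # False False
```

After an invalidate, operations of both types referencing the register are squashed, which is what makes the begin-of-loop invalidation described in Section 3.2 safe.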

5 Compiler Support

Both hierarchical reduction and if-conversion, as well as modulo scheduling, have been implemented in the IMPACT C compiler. These techniques are applied to the appropriate loops after classical code optimizations have been performed [17] and after the loops have been translated into the target machine assembly code, but before register allocation. In our current implementation, we apply software pipelining to inner loops that do not have function calls or early exits from the loop.

After a loop is determined to be appropriate for software pipelining, either hierarchical reduction or if-conversion is applied to remove conditional constructs. Then the loop is modulo scheduled. Next, modulo variable expansion is applied to rename overlapping register lifetimes. The resultant kernel code is used to generate the prologue and epilogue, where operations that are not issued in a given cycle are replaced by no-op's. If hierarchical reduction has been performed, the reduct_op's are expanded. Since there can be no early exits from the software pipeline,(2) the loop must execute p + k*u times, where p is the number of stages in the prologue, k is an integer greater than or equal to one, and u is the number of stages in the kernel determined by modulo variable expansion. A non-software pipelined version of the loop is required to execute the remaining number of iterations. If the loop trip count is greater than p + u, the remaining number of iterations is (trip count - p) mod u. If the trip count is less than p + u, only the non-software pipelined loop is executed. If the trip count is known to be less than p + u at compile time, the software pipeline is not generated.

(2) Early exits require a special epilogue for each stage in the prologue and kernel, which increases the code generation complexity and code expansion considerably.
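The iteration bookkeeping above can be sketched as a simple split of the trip count between the pipelined loop (p + k*u iterations) and the non-pipelined remainder loop; the p and u values in the example are made up:

```python
def split_trip_count(trip_count, p, u):
    """Return (iterations run in the software pipeline, iterations left
    for the non-pipelined copy of the loop)."""
    if trip_count < p + u:                  # too short: scalar loop only
        return 0, trip_count
    remainder = (trip_count - p) % u        # (trip count - p) mod u
    return trip_count - remainder, remainder

print(split_trip_count(100, 3, 2))   # (99, 1): 99 = 3 + 48*2 in the pipeline
print(split_trip_count(4, 3, 2))     # (0, 4): below the p + u threshold
```

The pipelined share always has the form p + k*u with k >= 1, matching the constraint stated above.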

In the presence of recurrences, Dehnert et al. discuss the importance of compiler optimizations such as back-substitution and load-store removal to reduce CII [8]. Another important optimization is needed to remove recurrences involving induction variables. When an induction variable is post-incremented as in the example in Figure 2a, any operations that use the induction variable will cause a recurrence as shown in Figure 2b. This recurrence will force all of the operations in the cycle to be scheduled within one II. However, the recurrence can be removed by converting the induction variable to be pre-incremented. We refer to this optimization as induction variable reversal. To perform this optimization, the operation that increments the induction variable is moved to the beginning of the loop, and the initial value of the induction variable is decremented by the increment value. The code after applying induction variable reversal to the assembly code segment in Figure 2 is shown in Figure 7.(3) First op10 is moved to the beginning of the loop and then 8 is subtracted from the initial value in op1. Note that there is still a recurrence due to op10. However, since only one operation is in the recurrence cycle, the cycle can always be scheduled. After this optimization, op4, op5 and op9 are no longer constrained to be scheduled within one II. When the induction variable spans more than one II, overlapping lifetimes are renamed using modulo variable expansion.

[Figure 7: Assembly code segment after induction variable reversal.

  op1:       r1 <- -8
  op2:       r2 <- label_A
  op3:       r3 <- max*8
  op10:  L1: r1 <- r1 + 8
  op4:       r4 <- mem[r2+r1]
  op5:       bne r1, 0, L2
  op6:       r5 <- r4 * c1
  op7:       jump L3
  op8:   L2: r5 <- r4 * c2
  op9:   L3: mem[r2+r1] <- r5
  op11:      bne r1, r3, L1  ]

(3) The code in Figure 2 has been optimized by the classical optimizations induction variable strength reduction and induction variable elimination [17].
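Induction variable reversal can be sketched by comparing the two loop shapes from Figures 2 and 7 (8-byte step, offsets modeled as plain integers); the check confirms that moving the increment to the top of the loop and starting one step low visits the same offsets:

```python
def post_incremented(n, step=8):
    """Figure 2 shape: use r1, then increment it (op10 at the bottom)."""
    r1, visited = 0, []
    for _ in range(n):
        visited.append(r1)     # op4/op9 use r1 here
        r1 += step             # op10: post-increment closes the recurrence
    return visited

def reversed_form(n, step=8):
    """Figure 7 shape: op1 starts one step low, op10 moves to the top."""
    r1, visited = -step, []    # initial value decremented by the step
    for _ in range(n):
        r1 += step             # op10 at the top of the loop
        visited.append(r1)
    return visited

assert post_incremented(5) == reversed_form(5) == [0, 8, 16, 24, 32]
```

In the reversed form, only op10 remains in the recurrence cycle, so the cycle can always be scheduled, while the uses of r1 are free to move across II boundaries.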

6 Experimental Results

In order to determine the benefits of predicated execution, we ran experiments for three forms of modulo scheduled loops: (1) with if-conversion, (2) with hierarchical reduction, and (3) with limited hierarchical reduction. The limited hierarchical reduction form does not allow any reduct_op to overlap itself (i.e., to span more than one II).

6.1 Machine Model

The machine model for these experiments is a VLIW processor with Cray-style interlocking. There are uniform resource constraints with the exception that only one branch can be issued per cycle. There are 14 branch operations in order to be compatible with the predicate compare operations. There are no branch delay slots. Other than the branch operations, we use the instruction set and operation latencies of the Intel i860.(4) Most integer operations take 1 cycle except for the integer load which takes 2 cycles. The integer multiply and divide and the floating-point divide are implemented using approximation algorithms [18]. The floating-point load, ALU, and single-precision multiply take 3 cycles, and the double precision multiply takes 4 cycles. For if-conversion, we assume the predicated execution support discussed in Section 4. The predicate compare operations have a one cycle latency. In order to measure the predicate register requirements, the size of the predicate register file is unlimited. The model has an infinite register file and we assume an ideal cache. The experiments were performed using machines with instruction widths (or, equivalently, issue rates) of 2, 4, and 8.

The base processor for these experiments is a RISC processor with an infinite register file and an ideal cache. The base schedule is a basic block schedule.

(4) We assume that the load and floating-point pipelines are automatically advanced.

6.2 Benchmarks

To run our experiments, we collected a set of 28 loops with conditional branches from the Perfect and SPEC benchmarks. Since the focus of this study is to analyze the relative performance of conditional branch handling techniques, only DOALL loops (loops without cross-iteration memory dependencies) were included in the test suite.

6.3 Results

Performance

The speedup results presented are the harmonic mean of the speedup for each loop. We assume that the loop executes an infinite number of times. Thus, the effect of the prologue and epilogue is not accounted for.

Figure 8 shows the performance of the three techniques before induction variable reversal. Comparing this graph with the graph of the performance after induction variable reversal shown in Figure 9, we see that the optimization improves the performance of hierarchical reduction by as much as 160% for an issue-8 machine and the performance of if-conversion by as much as 157% for an issue-8 machine.

In Figure 9, modulo scheduling with if-conversion performs approximately 25%, 29%, and 48% better than hierarchical reduction for issue rates 2, 4, and 8 respectively. For lower issue rates, hierarchical reduction performs worse since it is more difficult to schedule the reduct_op using fewer resources. The reason the relative performance increases for issue-8 is that the constraint of one branch per cycle becomes a limiting factor. For an issue-2 machine, limited hierarchical reduction has approximately the same performance as hierarchical reduction. This is due to the fact that the reduced resources increase II such that a conditional construct can usually be scheduled within one II. However, for issue 4 and 8, hierarchical reduction performs approximately 10% and 45% better than limited hierarchical reduction.

[Figure 8: Speedup before induction variable reversal (if-conversion, hierarchical reduction, and limited hierarchical reduction versus issue rate).]

[Figure 9: Speedup after induction variable reversal (if-conversion, hierarchical reduction, and limited hierarchical reduction versus issue rate).]
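For reference, the speedup summaries above are harmonic means over the per-loop speedups, which weights each loop by its (equal) share of baseline execution time rather than letting one fast loop dominate; the per-loop values below are made up:

```python
def harmonic_mean(xs):
    """Harmonic mean: n divided by the sum of reciprocals."""
    return len(xs) / sum(1.0 / x for x in xs)

speedups = [2.0, 4.0, 4.0]        # hypothetical per-loop speedups
print(harmonic_mean(speedups))    # 3.0, vs. an arithmetic mean of 3.33
```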

Code Expansion

Figure 10 shows the average code expansion of the three techniques. This is the code expansion for the software pipelined loop including the prologue, kernel, and epilogue.(5) It does not include the code expansion for the extra loop required to execute the remaining iterations that do not fit into the software pipeline. Figure 10 shows that hierarchical reduction has 5% and 27% more code expansion than if-conversion for issue 2 and issue 4 respectively. As the issue rate increases, more conditional constructs overlap and thus the code expansion increases. For an issue-8 machine the code expansion for hierarchical reduction is approximately 78% greater than that for if-conversion. If the loop has a high trip count, the working set is the kernel code. In this case, if-conversion will have a much smaller working set than hierarchical reduction. Also, after if-conversion there is straight-line code, which increases the data locality. While limited hierarchical reduction has comparable code expansion to if-conversion, its performance is too poor to offset this benefit.

(5) Explicit prologue and epilogue code is not required with predicated execution. However, if they are generated, the compiler can overlap their execution with operations outside the loop.

[Figure 10: Code expansion of the software pipelined loop (prologue, kernel, and epilogue) for the three techniques versus issue rate.]

Predicate Register File Size

Figure 11 gives the percentage of the loops that can fit in the specified predicate register file size. As expected, the predicate register requirements increase as the issue rate increases and more iterations are overlapped. A predicate register file with R registers requires log2(R) + 1 bits in the predicate register specifier. To schedule all of the loops for an issue-8 machine would require a 6-bit predicate register specifier. However, the majority of the loops scheduled have only one conditional construct per loop. Thus, if a rotating register file is used [8], a 1-bit predicate register specifier would be adequate.

[Figure 11: Number of predicated registers (percentage of loops versus predicate register file size, for issue rates 2, 4, and 8).]
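The specifier-width arithmetic from the text can be written out as a one-line sketch:

```python
from math import log2

def specifier_bits(num_pred_regs):
    """log2(R) bits to name one of R predicate registers, plus one bit
    for the true/false type."""
    return int(log2(num_pred_regs)) + 1

print(specifier_bits(32))   # 6: enough for all loops on the issue-8 machine
print(specifier_bits(1))    # 1: type bit only, as with a rotating file
```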

Minimum Loop Trip Count

The minimum loop trip count for a loop to be software pipelined is the number of stages in the prologue and epilogue. This metric can be used to determine the magnitude of the trip count required to see a benefit from software pipelining. Figures 12-14 show the distributions of the minimum loop trip count for issue 2, 4, and 8. The numbers on the x-axis represent the upper bound of a given range (e.g., 8 refers to the range 5-8). As expected, as the issue rate increases the minimum loop trip count increases because the II decreases, and thus the loop body is scheduled across more stages. For the same reason, the minimum trip count is the greatest for if-conversion.

[Figure 12: Minimum loop trip count for issue-2.]

[Figure 13: Minimum loop trip count for issue-4.]

[Figure 14: Minimum loop trip count for issue-8.]

7 Related Work

Modulo scheduling schedules operations for a given iteration interval by delaying operations to eliminate resource conflicts [19]. Other global software pipelining techniques rely on global compaction, where operations are scheduled as early as possible. Aiken and Nicolau's perfect pipelining technique compacts and unrolls the loop body until a pattern is reached [20]. Su and Wang have developed a similar technique, GURPR*, that does not require the computational complexity of determining a repeating pattern and that reduces the code expansion overhead [21]. Ebcioglu and Nakatani's enhanced pipeline scheduling also compacts the code by moving operations upward until an instruction is filled. Once an instruction is filled, it is moved across the backedge and is again available for scheduling. In this fashion, operations are scheduled across iteration boundaries, enabling software pipelining [22].

Enhanced pipeline scheduling also requires special hardware support for a decision tree which allows operations from the same conditional con- [...]

[...] which have a variable II, such as enhanced pipeline scheduling [25] [22]. Since predicated instructions remove the conditional branches, operations can be moved without worrying about code duplication to guarantee that the proper operations are executed along both paths of a branch. Perfect pipelining, enhanced pipeline scheduling, and GURPR* all require code duplication and thus do not have the code space efficiency afforded by scheduling with predicated execution.

8 Conclusion

In order to study the benefit of predicated execution for software pipelining, we have implemented hierarchical reduction, if-conversion, and modulo scheduling in a prototype compiler. The implementation effort allows us to gain insight into the complexity and constraints imposed by conditional branches on the modulo scheduling technique.

With this implementation, we are able to quantify the impact of architectural support for predicated execution on the performance of loops with conditional branches. For our benchmarks, software pipelined loops with predicated execution support execute on the average 34% faster than without the support. The primary reason for this improvement is that the pre-pass list scheduling in the hierarchical reduction method creates pseudo operations with complex resource usage patterns. These pseudo operations limit the effectiveness of software pipelining for many loops. On the other hand, the if-conversion method does not need pre-pass list scheduling. This increased flexibility accounts for the superior performance results with predicated execution support.

Acknowledgements

The authors would like to acknowledge all members [...]

structs in di erent iterations of the lo op to over-

of the IMPACT research group for their supp ort, par-

lap. Jones and Allan showed that mo dulo schedul-

ticularly Scott Mahlke, David Lin, and William Chen

ing p erforms slightly b etter than their implementa-

for their contributions of compiler supp ort for pred-

tion of enhanced pip eline scheduling for lo ops with-

icated execution. Sp ecial thanks to the anonymous

out conditional statements [23]. They also predicted

referees whose comments and suggestions help ed to

that enhanced pip eline scheduling would p erform b et-

improve the quality of this pap er signi cantly.

ter than Lam's implementation of mo dulo scheduling

due to the constraints imp osed by hierarchical reduc- This research has b een supp orted by the Joint Ser-

tion. The drawback to mo dulo scheduling with b oth vices Engineering Programs JSEP under Contract

if-conversion and hierarchical reduction is that the II N00014-90-J-1270, Dr. Lee Ho evel at NCR, the AMD

remains constant for every iteration [24]. Thus, a lo op 29K Advanced Pro cessor Development Division, Mat-

that has a conditional construct with an infrequently sushita Electric Industrial Corp oration Ltd., Hewlett-

executed path much longer than the frequently exe- Packard, and the National Aeronautics and Space Ad-

cuted path will take longer to execute than techniques ministration NASA under Contract NASA NAG1-

Published in HICSS-26 Conference Pro ceedings, January 1993, Vol. 1, pp. 497-506. 10

613 in co op eration with the Illinois Computer Lab ora- [12] P. Y. T. Hsu, Highly Concurrent Scalar Processing.

PhD thesis, Department of Computer Science, Uni-

tory for Aerospace Systems and Software ICLASS.

versity of Illinois, Urbana, IL, 1986.

[13] G. Wo o d, \Global optimization of microprograms

through mo dular control constructs," in Proceedings

References

of the 12th Annual Workshop on Microprogramming,

pp. 1{6, 1979.

[1] A. Charlesworth, \An approach to scienti c array

[14] R. Towle, Control and Data Dependence for Program

pro cessing: The architectural design of the AP-

Transformations. PhD thesis, Department of Com-

120B/FPS-164 family," in IEEE Computer, Septem-

puter Science, University of Illinois, Urbana, IL, 1976.

b er 1981.

[15] J. R. Allen, K. Kennedy,C.Porter eld, and J. War-

[2] R. Touzeau, \A Fortran compiler for the FPS-164 Sci-

ren, \Conversion of control dep endence to data de-

enti c Computer," in Proceedings of the SIGPLAN

p endence," in Proceedings of the 10th ACM Sym-

'84 Symposium on Compiler Construction, June 1984.

posium on Principles of Programming Languages,

[3] M. S. Lam, \Software pip elini ng: An e ective

pp. 177{189, January 1983.

scheduling technique for VLIW machines," in Pro-

[16] M. Lam, A Systolic Array . PhD

ceedings of the ACM SIGPLAN 1988 Conferenceon

thesis, Carnegie Mellon University, Pittsburg, PA,

Programming Language Design and Implementation,

1987.

pp. 318{328, June 1988.

[17] A. Aho, R. Sethi, and J. Ullman, : Princi-

[4] R. L. Lee, A. Kwok, and F. Briggs, \The oating

ples, Techniques, and Tools. Reading, MA: Addison-

p oint p erformance of a sup erscalar SPARC pro ces-

Wesley, 1986.

sor," in Proceedings of the 4th International Confer-

[18] Intel, i860 64-Bit Microprocessor. Santa Clara, CA,

enceonArchitecture Support for Programming Lan-

1989.

guages and Operating Systems, pp. 28{37, April 1989.

[19] J. H. Patel and E. S. Davidson, \Improving the

[5] M. Annaratone, E. Amould, T. Gross, H. Kung,

throughput of a pip elin e by insertion of delays," in

M. Lam, O. Menzilciogl u, and J. Webb, \The Warp

Proceedings of the 3rd International Symposium on

compiler: Architecture, implementation and p erfor-

Computer Architecture, pp. 159{164, 1976.

mance," IEEE Transactions on Computers,vol. 12,

[20] A. Aiken and A. Nicolau, \Optimal lo op paralleli za-

Decemb er 1987.

tion," in Proceedings of the ACM SIGPLAN 1988

[6] B. R. Rau, D. W. L. Yen, W. Yen, and R. A. Towle,

ConferenceonProgramming Language Design and

\The Cydra 5 departmental sup ercomputer," IEEE

Implementation, pp. 308{317, June 1988.

Computer, pp. 12{35, January 1989.

[21] B. Su and J. Wang, \GURPR*: A new global software

[7] B. R. Rau and C. D. Glaeser, \Some scheduling tech-

pip elini ng algorithm," in Proceedings of the 24th An-

niques and an easily schedulable horizontal architec-

nual Workshop on Microprogramming and Microar-

ture for high p erformance scienti c computing," in

chitecture, pp. 212{216, Novemb er 1991.

Proceedings of the 20th Annual Workshop on Micro-

[22] K. Eb cioglu and T. Nakatani, \A new compilation

programming and Microarchitecture, pp. 183{198, Oc-

technique for parallelizi ng lo ops with unpredictable

tob er 1981.

branches on a VLIW architecture," in Second Work-

[8] J. C. Dehnert, P. Y. Hsu, and J. P. Bratt, \Over-

shop on Languages and Compilers for Paral lel Com-

lapp ed lo op supp ort in the Cydra 5," in Proceedings of

puting, August 1989.

the Third International ConferenceonArchitectural

[23] R. B. Jones and V. H. Allan, \Software pip elini ng:

Support for Programming Languages and Operating

An evaluation of Enhanced Pip elini ng," in Proceed-

Systems, pp. 26{38, April 1989.

ings of the 24th International Workshop on Micropro-

gramming and Microarchitecture, pp. 82{92, Novem-

[9] B. R. Rau, M. Lee, P.P. Tirumalai, and M. S.

b er 1991.

Schlansker, \Register allo cation for software pip elined

lo ops," in Proceedings of the ACM SIGPLAN 92 Con-

[24] F. Gasp eroni, \Compilation techniques for VLIW

ferenceonProgramming Language Design and Imple-

architectures," Tech. Rep. 66741, IBM Research

mentation, pp. 283{299, June 1992.

Division , T.J. Watson Research Center, Yorktown

Heights, NY 10598, August 1989.

[10] C. Eisenb eis, \Optimization of horizontal micro co de

[25] K. Eb cioglu, \A compilation technique for software generation for lo op structures," in International Con-

ference on Supercomputing, pp. 453{465, July 1988.

pip elini ng of lo ops with conditional jumps," in Pro-

ceedings of the 20th Annual Workshop on Micropro-

[11] P. Tirumalai, M. Lee, and M. Schlansker, \Paralleliza-

gramming and Microarchitecture, pp. 69{79, Decem-

tion of lo ops with exits on pip elined architectures," in

b er 1987.

Supercomputing,Novemb er 1990.