Published in HICSS-26 Conference Pro ceedings, January 1993, Vol. 1, pp. 497-506. 1
The Bene t of Predicated Execution for Software Pip elining
Nancy J. Warter Daniel M. Lavery Wen-mei W. Hwu
Center for Reliable and High-Performance Computing
University of Illinois at Urbana/Champaign
Urbana, IL 61801
Abstract implementations for the Warp [5] [3] and Cydra 5 [6]
machines. Both implementations are based on the
mo dulo scheduling techniques prop osed by Rau and
Software pipelining is a compile-time scheduling
Glaeser [7]. The Cydra 5 has sp ecial hardware sup-
technique that overlaps successive loop iterations to
p ort for mo dulo scheduling in the form of a rotating
expose operation-level paral lelism. An important prob-
register le and predicated op erations [8]. In the ab-
lem with the development of e ective software pipelin-
sence of sp ecial hardware supp ort in the Warp, Lam
ing algorithms is how to hand le loops with conditional
prop oses mo dulo variable expansion and hierarchical
branches. Conditional branches increase the complex-
reduction for handling register allo cation and condi-
ity and decrease the e ectiveness of software pipelin-
tional constructs resp ectively [3]. Rau has studied the
ing algorithms by introducing many possible execution
b ene ts of hardware supp ort for register allo cation [9].
paths into the scheduling scope. This paper presents
In this pap er we analyze the implications and b en-
an empirical study of the importanceofanarchi-
e ts of predicated op erations for software pip elining
tectural support, referredtoaspredicated execution,
lo ops with conditional branches. In particular, the
on the e ectiveness of software pipelining. In order
analysis presented in this pap er is designed to help
to perform an in-depth analysis, we focus on Rau's
future micropro cessor designers to determine if predi-
modulo scheduling algorithm for software pipelining.
cated execution supp ort is worthwhile given their own
Three versions of the modulo scheduling algorithm,
estimation of the increased hardware cost.
one with and two without predicated execution support,
are implementedinaprototypecompiler. Experiments
Although software pip elining has b een shown to b e
basedonimportant loops from numeric applications
e ective for many di erent architectures, in this pap er
show that predicated execution support substantial ly
we limit the discussion to VLIW and sup erscalar pro-
improves the e ectiveness of the modulo scheduling al-
cessors. For clarity,we use VLIW terminology.Thus,
gorithm.
an instruction refers to a very long instruction word
which contains multiple op erations.
1 Intro duction
2 Mo dulo Scheduling
Software pip elining has b een shown to b e an e ec-
tivescheduling metho d for exploiting op eration-level
parallelism [1][2][3] [4]. The basic idea b ehind soft-
In a mo dulo scheduled software pip eline, a lo op it-
eration is initiated every II cycles, where II is the Ini- ware pip elining is to overlap the iterations of a lo op
tiation Interval [7] [3] [10] [11]. The II is constrained b o dy and thus exp ose sucient parallelism to utilize
the underlying hardware. There are two fundamental
by the most heavily utilized resource and the worst
problems that must b e solved for software pip elining.
case recurrence for the lo op. These constraints each
The rst is to ensure that overlapping lifetimes of the
form a lower b ound for II. A tightlower b ound is the
same virtual register are allo cated to unique physical
maximum of these lower b ounds. In this pap er, this
tightlower b ound is referred to as the minimum II.If registers. The second problem is enabling lo op itera-
tions with conditional branches to b e overlapp ed.
an iteration uses a resource r for c cycles and there
r
are n copies of this resource, then the minimum II These problems have b een solved in the compiler r
Published in HICSS-26 Conference Pro ceedings, January 1993, Vol. 1, pp. 497-506. 2
til an II is found that satis- Algorithm Loop Data Dependence Modulo Resource This pro cess rep eats un
Graph Reservation Table
es the resource and dep endence constraints. If the
1. Generate data
veany recurrences, a schedule can dependence graph LD functional units lo op do es not ha
um II [7]. In the pres- 2. Determine Iteration FU1 usually b e found for the minim
Interval (II) ADD ere develop ed in the 0 LD ADD ence of recurrence s, heuristics w - maximum of resource
constraints
Cydra 5 compiler to generate near-optimal schedules.
3. List schedule using MUL hedule the recurrence no de
1 MUL ST The basic principle is to sc modulo resource
reservation table
rst and then the no des not constrained by recur-
ST - 2 functional units
] [11]. - uniform FU's s = start time rences [12
- II = 2
After II is scheduled, the co de for the software
pip elined lo op is generated. Since not every in-
Figure 1: Mo dulo scheduling a lo op without recur-
struction is issued in the rst II cycles, a prologue
rences.
is required to \ ll" the pip eline. There are p =