Resource-Constrained Software Pipelining

Alexandru Nicolau
Department of Information and Computer Science
University of California, Irvine
Irvine, CA
email: nicolau@ics.uci.edu

Alexander Aiken
Computer Science Division
University of California, Berkeley
Berkeley, CA
email: aiken@cs.berkeley.edu

Steven Novack
Department of Information and Computer Science
University of California, Irvine
Irvine, CA
email: snovack@ics.uci.edu

Abstract

This paper presents a software pipelining algorithm for the automatic extraction of fine-grain parallelism in general loops. The algorithm accounts for machine resource constraints in a way that smoothly integrates the management of resource constraints with software pipelining. Furthermore, generality in the software pipelining algorithm is not sacrificed to handle resource constraints, and scheduling choices are made with truly global information. Proofs of correctness and the results of experiments with an implementation are also presented.

1 Introduction

Recently there has been considerable interest in a class of parallelization techniques known collectively as software pipelining. Software pipelining algorithms compute a static parallel schedule, overlapping the operations of a loop body in much the same way that a hardware pipeline overlaps operations in a dynamic instruction stream. The schedule computed by a software pipelining algorithm is suitable for execution on a synchronous, tightly-coupled parallel machine, such as a superscalar or VLIW (Very Long Instruction Word) machine.

Software pipelining algorithms are interesting for at least three reasons. The first reason is that superscalar and VLIW machines are being built: IBM's System/6000 can execute four operations in parallel, and Intel's i860 and i960 chips can execute three operations in parallel. The largest tightly-coupled synchronous machine built to date is Multiflow's TRACE, which has 28 functional units. Several computer manufacturers (e.g., HP, Philips, Siemens) are also developing VLIW or superscalar architectures. The second reason is that these tightly-coupled machines must be programmed at a very low level.

Someone writing a program for a tightly-coupled machine must develop a parallel schedule, which means that person must know about and account for details of the hardware design, such as instruction timings and resource conflicts between functional units. This task is extremely time-consuming and error-prone; compilation techniques are needed to translate programs written at a reasonably high level into good parallel schedules.

The final reason is that software pipelining techniques hold the promise of producing better code with faster compilation time than other scheduling techniques. This potential is illustrated by the example in Figure 1. Figure 1(a) shows a simple sequential loop, and Figures 1(b) and 1(c) show two different parallel schedules for the loop. For convenience, we label the operations in the original loop a through d and refer only to these labels in the parallel loops. In this example, some parallelism is present within the loop body, because operations b and c can be executed simultaneously, as well as across iterations, because d from one iteration can overlap with a from the next iteration. The classical approach to scheduling the loop in Figure 1(a) is to unroll the loop body some number of times and then apply scheduling heuristics within the unrolled loop body [Fis], as illustrated in Figure 1(b). While this approach allows parallelism to be exploited between some iterations of the original loop, there is still sequentiality imposed between iterations of the unrolled loop body. In general, if the loop could be fully unrolled, all parallelism, both inside and across iterations, could be exploited by this approach. However, full unrolling is usually impossible or impractical to obtain. Software pipelining provides a direct way of exploiting parallelism inside and across all iterations of a loop; hence software pipelining achieves the effect of scheduling with full unrolling. A software-pipelined version of the original loop is given in Figure 1(c).

1.1 Previous Work

One body of work on software pipelining has focused on establishing the formalism required to adequately address what software pipelining algorithms can and cannot achieve. Results in this line of development include a software pipelining algorithm that generates optimal code for loops without conditional tests [ANa] and a proof that optimal software pipelining is impossible in general [SGE]. However, this work has largely ignored resource constraints.

Existing software pipelining algorithms handle resource constraints in a variety of ways. Some algorithms deal with only weak forms of resource constraints (e.g., the number of operations that can be executed in parallel). Others assume resource constraints are handled in a separate fixup phase after software pipelining [NPA]. Several software pipelining algorithms account for resource constraints directly as part of the software pipelining algorithm (e.g., [RG, Lam]). However, in most such algorithms the treatment of resource constraints is intimately connected to software pipelining; that is, the software pipelining is not separable from the handling of resource constraints. One of our interests is to separate what is really intrinsic to software pipelining from other, orthogonal concerns. A more extensive discussion of previous and related work is included in Section 9.

[Figure 1: Loop Unrolling and Software Pipelining. (a) An example loop with operations a through d; (b) the loop unrolled twice and scheduled; (c) the pipelined loop.]

1.2 Our Approach

In this paper we present an algorithm that smoothly integrates software pipelining with the treatment of resource constraints, while at the same time maintaining a structured design that separates orthogonal concerns. Our algorithm serves two purposes. First, we believe the algorithm represents a practical direction and can form the basis of implementations of software pipelining; we discuss an implementation of our algorithm in Section 7. Second, the algorithm represents a summary of many of the most interesting aspects of our investigation of software pipelining over the last several years [Nic, ANa, ANb, Aik, Aik, AN]. Our algorithm has several novel features:

1. The handling of resource constraints is orthogonal to the software pipelining.

2. At each step, the algorithm has global information about the operations that can be scheduled.

3. In a technical sense (defined precisely in Section 8), given sufficient resources, our algorithm can produce code arbitrarily close to the theoretical optimum.

The advantage of the first point is that the treatment of resources could be modified, say for a different machine, and no changes would be required in the overall algorithm. The second and third points together imply that the quality of the final pipelined loop is limited only by the ability to make good resource allocation decisions (see Section 6), and not by the design of the software pipelining algorithm.

Our software pipelining algorithm is built from two components: a scheduler and a dependence analyzer. The machine-dependent scheduler is used to incrementally build a parallelized loop from a sequential loop. For each parallel instruction, the scheduler selects operations to schedule based on the set of operations available for scheduling in that instruction and on the available resources. The set of available operations is maintained by a global dependence analyzer; as the scheduler makes decisions about where to place operations, the set of available operations is updated incrementally. Together, the scheduler and the dependence analyzer encapsulate all machine-dependent information. As the parallelized loop is constructed, the software pipelining algorithm checks for repeating states that can be pipelined. The software pipelining algorithm itself is very simple; the difficulty lies in establishing minimal restrictions on the scheduler and dependence analyzer that guarantee the correctness and termination of the software pipelining algorithm.

The rest of this paper is divided into nine sections. Section 2 defines the model of parallel computation used to develop the algorithm. Section 3 works through a small example to give an intuitive idea of how the software pipelining algorithm works. Section 4 describes the algorithm and presents a proof of correctness. Section 5 gives an algorithm for incrementally maintaining the set of available operations. Section 6 describes the integration of resource constraints into the algorithm; handling resources well is critical in realistic applications of software pipelining. Section 7 briefly describes an implementation of our algorithm, some additional optimizations, and some experimental results. The experimental results bear out the strengths of our approach and point out some weaknesses; both are discussed at length. Section 8 presents a result that suggests our algorithm can achieve the best schedules possible in the presence of resource constraints. A discussion of related work is in Section 9. The final section summarizes and presents some conclusions.

2 Basic Terminology

This section develops a simple model of a tightly-coupled synchronous parallel machine. The formalism is used to explain our software pipelining algorithm and to provide a basis for a proof of correctness.

A program is an automaton ⟨X, n₀, N⟩. X is a set of n operations {x₁, ..., xₙ}. Operations are divided into assignments, which read and write a global store; tests, boolean-valued functions that affect the flow of control; and a distinguished operation stop.

The body of the program is a set N of states {n₀, ..., n_m}; the state n₀ is the start state of the program. Associated with each state n is ops(n), the operations of n, which are elements of X. The states represent parallel instructions: intuitively, when control reaches a state n, all operations in ops(n) are executed simultaneously. To simplify the presentation, we assume that all operations execute in unit time. Extensions to multicycle operations and pipelined functional units are discussed in Section 7.

A configuration is a pair ⟨n, s⟩, where n is a state and s is a store (the contents of memory locations and registers). The transition function δ maps configurations into configurations. An execution is a sequence of configurations ⟨..., ⟨n_i, s_i⟩, ...⟩ such that δ(⟨n_i, s_i⟩) = ⟨n_{i+1}, s_{i+1}⟩.

The transition function δ describes how a tightly-coupled synchronous machine actually executes a parallel instruction. We deliberately avoid defining a transition function in any detail; the transition functions of superscalar and VLIW machines are complex and vary considerably from machine to machine. The greatest source of complexity is defining what it means to execute more than one test in parallel (multiway jumps). As an example, in one possible model, tests within a state n are always organized as a binary decision tree with a unique root. One branch of each test in the decision tree is labeled true; the other is labeled false. Each leaf of the decision tree is a pointer to another state. When the state is executed, all of the tests are evaluated in parallel in the store. The next state to be executed is the leaf that terminates the unique path from the root where every branch is labeled by the value of that test in the store. There are other possible implementations of multiway jumps; many mechanisms have been proposed and implemented [Fis, KN, AAG, Ebc]. The software pipelining algorithm we present applies to any of these control-flow mechanisms.

We use the following abstraction of control flow throughout this paper. We assume that control flow is determined entirely by tests; that is, the result of evaluating the tests in a state determines the next state. A branch of a state n is a truth assignment ⟨x₁ = true, ..., x_k = false⟩ to the tests x₁, ..., x_k in n. The set of all branches of n is branch(n); if n has no tests, then branch(n) is the singleton set {⟨⟩} consisting of the empty truth assignment. The function succonbranch maps a state n and a branch c ∈ branch(n) to a successor node n′; the name succonbranch stands for "successor on branch". (Note that in most cases a node with k tests will not have 2^k distinct successors. For generality we treat each of the 2^k branches separately in our algorithm; in an implementation for a particular control-flow mechanism, many branches can be merged.) We assume that if succonbranch(n, c) = n′ and the evaluation of the tests in configuration ⟨n, s⟩ satisfies the truth assignment c, then δ(⟨n, s⟩) = ⟨n′, s′⟩ for some store s′.

The set of successors succ(n) of a state n is {n′ | ∃c s.t. succonbranch(n, c) = n′}. When n is executed, control is transferred to some n′ ∈ succ(n). A state that contains the operation stop cannot contain other operations and cannot have any successors.

We next define a meaning function for programs, which is used in the proof that our software pipelining algorithm is correct, i.e., that it preserves the meaning of the original program.

Definition 1 Let P be a program ⟨X, n₀, N⟩. If there is an execution ⟨⟨n₀, s⟩, ..., ⟨n_k, s_k⟩⟩ such that ops(n_k) = {stop}, then P(s) = s_k. If no such execution exists, then P(s) = ⊥.

Programs P and P′ are equivalent (P ≡ P′) if ∀s. P(s) = P′(s).

Software pipelining is a loop parallelization technique, so we must describe the loops we are interested in parallelizing. For convenience we use the following definition. A sequential loop is a program with i operations x₁, ..., x_i and i states n₁, ..., n_i, where ops(n_j) = {x_j}. All backedges go to the start state n₁; that is, if n_i ∈ succ(n_j) and i ≤ j, then i = 1. Every state is assumed to be reachable from the start state.
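To make the model concrete, the following is a minimal sketch of this program representation in Python (the names State, ops, and succ_on_branch are ours; the paper prescribes no implementation). It encodes states with operation sets, branches as truth assignments, and succonbranch as a per-state map from branches to successors.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple

# A branch is a truth assignment to the tests of a state,
# e.g. (("x1", True), ("x2", False)); () is the empty truth assignment.
Branch = Tuple[Tuple[str, bool], ...]

@dataclass(eq=False)
class State:
    ops: FrozenSet[str]                                # ops(n)
    succ_on_branch: Dict[Branch, "State"] = field(default_factory=dict)

    def succ(self):
        # succ(n) = { n' | there is a c with succonbranch(n, c) = n' }
        return set(self.succ_on_branch.values())

def sequential_loop(op_names):
    """Build a sequential loop: one operation per state, with the
    backedge going to the start state n1."""
    states = [State(frozenset({x})) for x in op_names]
    for a, b in zip(states, states[1:]):
        a.succ_on_branch[()] = b                       # straight-line flow
    states[-1].succ_on_branch[()] = states[0]          # backedge to n1
    return states
```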

3 An Example

Given a sequential loop L, our software pipelining algorithm incrementally builds a parallelized loop from L. Initially the parallelized loop is empty (has no states), and the algorithm chooses a set of operations from the sequential loop L that legally can be scheduled as the start state of the parallel loop. After scheduling a subset of the available operations as the start state, the algorithm recursively schedules the successors of the start state by considering what operations can be scheduled in the successor states, and so on. The main difficulty is guaranteeing that this procedure terminates. We show that eventually the scheduled states must fall into a detectable repeating pattern, at which point a loop can be constructed from this pattern of repeating states.

An important data structure used by the algorithm is an incrementally maintained set A of available operations. At each step, A contains a set of operations available for scheduling in the current state being constructed. How this set is built and maintained is discussed in Section 5. For now, it is only important to understand that the set A contains all operations that could be scheduled legally in the current state without violating program semantics.

Initially the new program graph is empty and A contains all operations available for scheduling in the first state. Consider the program in Figure 2. We display programs as control-flow graphs, with the convention that true branches of tests are to the left and false branches are to the right. Not all operations can be scheduled in the first state; for example, c must be scheduled after b, since c references a value that b writes. In standard compiler terminology, there is a data dependence from b to c [KKP].

[Figure 2: An example loop. Operation a assigns A[i] := f(A[i]); b assigns to j from i; c tests A[j], branching to d (true) or e (false); d and e both assign to B[j] from A[j]; f tests i against n to decide whether to loop; g is stop.]

For this example we assume a machine model in which all reads take place before any writes during execution of a state, and write conflicts are not permitted. In this model the operations a, b, and f are all available for scheduling in the first state. Because the algorithm may overlap operations from different iterations, we superscript operations with the scheduled iteration from which they came. In addition, we subscript available operation sets to keep track of different values for different states. Thus, initially, A₁ = {a¹, b¹, f¹}.

Another component of the pipelining algorithm is the scheduler. The scheduler selects from A a set of operations to schedule in the current state. Together, the procedure to maintain the set of available operations and the scheduler encapsulate all machine-dependent information; the software pipelining algorithm itself is built on top of these two components.

A pipelined version of the loop in Figure 2 is given in Figure 3; in Figure 3 the state n_i is labeled by the integer i. The rest of this section describes how the software pipelining algorithm computes this parallel schedule from the sequential loop. For the first state n₁, assuming that the machine has sufficient resources, the scheduler could choose to schedule all available operations. Because f¹ is a test, there will be two successors of the first state: one for the case where f¹ evaluates to true and one for the case where f¹ evaluates to false. The sets of available operations are different for the two successors.

Consider the successor n₂ of n₁ for the case where f¹ evaluates to false. This case is easy, as the program terminates on this branch. The new set A₂ of available operations is {c¹, d¹, e¹}, reflecting the fact that a¹, b¹, and f¹ have been scheduled and that this branch of f¹ is the loop exit. Because write conflicts are not permitted, d¹ and e¹ cannot be scheduled in the same state, but both are available.

[Figure 3: The loop after software pipelining. State n₁ = {a, b, f}. On f's false branch the exit path runs through n₂ = {c, d}, then n₃ = {e} (c false), then n₄ = {g}. On f's true branch the steady state is n₅ = {c, d, a}, with successors n₆ = {b, f} (c true) and n₇ = {e, b, f} (c false); from n₆ and n₇, the true branch of f loops back to n₅ and the false branch exits to n₂.]

At this point, all dependences on the two statements have been satisfied. Assume that the scheduler selects operations c¹ and d¹ for state n₂. Operation c¹ is a test, so there are two successors of this state. For the successor n₃ where c¹ evaluates to false, the set of available operations A₃ is {e¹}. Assume that the scheduler places e¹ in n₃. For the single successor n₄ of n₃, the set of available operations A₄ is just {g¹}, the stop operation; thus n₄ contains only g¹. Backing up to n₂, the set of available operations for the branch where c¹ evaluates to true is also {g¹}, so the successor of n₂ on this path is also n₄.

This completes the terminating path from n₁. On the other path, where f¹ evaluates to true, the new set of available operations A₅ is {c¹, d¹, e¹, a²}. Note that the operation a² from the second iteration is available for scheduling in parallel with statements from the first iteration. A subtle point is that operation b² is not available for scheduling, even though all reads take place before all writes and all operations from the first iteration that read variable j are available in A₅. Operation b² is not available because, as before, d¹ and e¹ cannot be scheduled in the same state. Even though all operations that read j are in A₅, not all of these can be scheduled in n₅, and this fact prevents statements that write j from being available.

Assume that the scheduler selects operations c¹, d¹, and a² for state n₅. Operation c¹ is a test, so there are two successors of this state. For the successor n₆ where c¹ evaluates to true, the set A₆ is {b², f²}. Assuming that the scheduler places both operations in n₆, the set of available operations for the successor of n₆ on the path where f² is true is {c², d², e², a³}. Note that, except for the superscripts, this set is exactly the same as A₅. The superscripts are just a way of keeping track of the iteration of each operation; the sets have the same operations. Rather than continue scheduling at this point, the pipelining algorithm simply makes n₅ a successor of n₆. Similarly, the set of available operations for the successor of n₆ where f² evaluates to false is {c², d², e²}. Except for superscripts, this is exactly the same as A₂. As before, the pipelining algorithm makes n₂ a successor of n₆.

[Figure 4: Another example loop. (a) The example loop: operation a computes j from i with an integer division, b and c assign to elements of A using indices involving i and j, and d increments i. (b) An incorrect schedule produced if repeating states are detected too early.]

Backing up, the pipelining algorithm next considers the successor n₇ of n₅ where c¹ evaluates to false. The set of available operations A₇ is {e¹, b², f²}. Assuming that the scheduler places all three operations in n₇, the sets of available operations for the two successors of n₇ are the same as for n₆, and scheduling proceeds just as it did for n₆. The algorithm terminates with the schedule in Figure 3.

There are three technically difficult aspects of the software pipelining algorithm. The first problem is justifying the step where previously scheduled states are reused, such as when the pipelining algorithm decided to make n₅ the successor of n₆. We have simply implied that this is correct, and in the example it happens to be correct, but in general this step is not correct. Intuitively, the problem is that just because two sets of available operations happen to be the same for two different states, that does not by itself guarantee that all subsequent sets of available operations would be the same in all successors of those states.

We illustrate this problem with the loop in Figure 4(a). To make the example as simple as possible, there are no conditional statements or exits from the loop. Assuming that the variable i is always zero upon entering the loop, note that statements b and c are independent for an initial run of iterations and data dependent for the iterations that follow. If dependence analysis recognizes that b and c are independent for the initial iterations, then, as the parallelized loop is built, the scheduler could place b and c together in those iterations. Following the pipelining strategy of the previous example, repeating states would be detected in the second iteration, leading to the parallelized program in Figure 4(b), which is clearly incorrect. In this example, irregular dependencies make it difficult to detect repeating behavior. Section 4 formalizes the software pipelining algorithm and provides constraints on the scheduler and available operation information that guarantee the correctness and termination of the software pipelining algorithm.

The second problem is computing the sets of available operations. An algorithm for maintaining these sets incrementally was first presented in [EN] for programs without loops (i.e., with acyclic control-flow graphs). In Section 5 we present a detailed description of the computation and maintenance of available operations for use in software pipelining of loops; our presentation is simpler and easier to understand and implement than the algorithm in [EN].

The third significant problem is managing finite resources. While resource allocation does not bear directly on the correctness of our software pipelining algorithm, good resource usage is obviously important if the algorithm is to be useful in practice. In Section 6 we show how the handling of finite resources is integrated with software pipelining in our system.

4 The Software Pipelining Algorithm

The example in Section 3 illustrates that the key step in our algorithm is discovering when states can be reused to form a software pipeline. Recognizing patterns in the scheduled operations is not trivial, and is in fact not valid if the scheduler and the available operation analysis are not constrained in some way. For example, if the scheduler merely selects operations to schedule at random, no repeating behavior can be inferred. Similarly, even if the scheduler is well-behaved, the example in Figure 4 shows that if the available operation analysis does not exhibit a detectable pattern, software pipelining is not possible.

In this section we present constraints on the scheduler and available operation analysis that make software pipelining possible. These constraints are quite weak and are easily satisfied in practice. After presenting the constraints, we present the software pipelining algorithm itself and prove its correctness. Finally, we discuss termination of the software pipelining algorithm.

4.1 The Constraints

Recall that x_i^c denotes the instance of operation x_i from iteration c of a loop. The following definition is used in the discussion of the constraints.

Definition 2 Let X = {..., x_i^j, ...} be a set of operations. The set X^{+c} is the set {..., x_i^{j+c}, ...}.

As discussed in Section 3, one component of the software pipelining algorithm is a scheduler for a specific machine. The following constraint requires that the scheduler is a function, that the scheduler must schedule some operation in every state, and that the operation chosen can depend on the set of operations available and on the relative distance in iterations between the operations available, but not on the actual iterations of the operations available.

Constraint 1 Let X be a set of operations. The scheduler must be a function mapping a set of already scheduled operations and a set of available operations to a single operation or the value none. In addition, schedule(X, A) ≠ none if X = ∅. We also require that, for all i,

schedule(X, A) = x_j^k implies schedule(X^{+i}, A^{+i}) = x_j^{k+i}, and
schedule(X, A) = none implies schedule(X^{+i}, A^{+i}) = none.

j

j j

In our algorithm X is the set of op erations already scheduled in the state currently under considera

k

tion The op eration x returned by the scheduler is an additional instruction to b e scheduled in the same

j

state The primary restriction imp osed by Constraint is that the scheduler is a function of op erations

available in the state b eing scheduled This constraint is weak b ecause the set of available op erations pro

vides global information ab out the programthe scheduler can choose any statement that could b e legally

scheduled in the current state This particular constraint also has a signicant design b enet it cleanly

separates the scheduler from the rest of the algorithm thus isolating the most machinedependent p ortion

of the co de Any scheduler satisfying Constraint will work with the software pip elini ng algorithm In

Section we show how to generalize Constraint to include resource constraints
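For illustration, here is a minimal greedy scheduler in this style (a sketch; the issue-width parameter and the (name, iteration) encoding of operations are our assumptions, not the paper's). It satisfies Constraint 1 because its choice depends only on operation names and on iteration offsets relative to the earliest available iteration, so shifting every iteration by i shifts the result by i.

```python
def schedule(X, A, width=4):
    """Return one more operation to place in the current state, or None.
    X: operations already scheduled in the state; A: available operations.
    Operations are (name, iteration) pairs."""
    if not A or len(X) >= width:            # state full or nothing available
        return None
    base = min(it for (_, it) in A)         # earliest iteration in A
    # The key uses only the offset (it - base), never the absolute
    # iteration, which gives the shift invariance Constraint 1 demands.
    return min(A, key=lambda op: (op[1] - base, op[0]))
```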

The scheduler is used by the software pipelining algorithm to repeatedly select operations for scheduling in a state. When the scheduler returns none, the state is finished and successors of the state are scheduled. In Section 3 we presented a simplified example in which the scheduler chooses a subset of the available operations for scheduling. However, the iterative method described here is necessary in general, because the operations available for scheduling in a state n can depend on the set of operations already scheduled in n. For example, consider the simple program fragment in Figure 5. Assuming that the parallel machine performs reads before writes, it is clear that both a and b can be scheduled together in the first (and only) state n. However, b cannot be scheduled in n unless a is also scheduled in n; that is, b is not available for scheduling in n unless a is scheduled in n. Otherwise, if the set of available operations were simply {a, b}, then the scheduler could choose to schedule b in n and a in n's successor, which is incorrect.

A second constraint is placed on the available operations. At any moment there is a set of operations A available for scheduling associated with a state n. There are two ways that A can be updated. First, the procedure call updateone(n, A, x^i) returns a pair consisting of the updated state, with operations ops(n) ∪ {x^i}, and the new set of available operations, given that x^i has been scheduled. Second, when n is complete, we wish to compute the sets of operations available in the successors of n. The procedure call next(n, A) maps n and A to a set of pairs {⟨n_j, A_j⟩}, where for every branch c_j ∈ branch(n), n_j is a new empty node, n_j = succonbranch(n, c_j), and A_j is the set of operations available in n_j. Implementations of procedures updateone and next are given in Section 5.

(One can imagine even more powerful schedulers, for example a scheduler having global information about not just one state at a time but all states at all times. Because scheduling is inherently a very hard problem, however, it is not clear that this extra theoretical power translates into any practical advantage over the scheme presented here; see Section 8.)

a: j := i
b: i := i + 1

Figure 5: A simple program

Constraint 2 Consider an arbitrary set of available operations A, a state n, and an operation x^j. Then there exists a set of operations B such that, for all i,

updateone(m, A^{+i}, x^{j+i}) = ⟨m′, B^{+i}⟩, where ops(m) = ops(n)^{+i} and ops(m′) = ops(n)^{+i} ∪ {x^{j+i}}.

Furthermore, there exist sets of operations A_j and states n_j for 1 ≤ j ≤ |branch(n)| such that, for all i,

next(m, A^{+i}) = {..., ⟨n_j, A_j^{+i}⟩, ...}, where ops(m) = ops(n)^{+i}.

Constraint 2 says that the operations available may depend on which operations have already been scheduled and on the relative distance in iterations between operations already scheduled, but it cannot depend on the actual values of the iterations of operations already scheduled. In the implementation of updateone, the result node m′ is simply n updated to include the operation x^{j+i} (see Section 5). Whether Constraint 2 is satisfied or not depends on the form of the data dependence analysis used to maintain operation availability information. Standard data dependence graphs satisfy Constraint 2, as do extensions to dependence graphs such as labeling edges with constant distance vectors [PBJ]. In fact, as far as we know, every proposed representation of dependence information satisfies this constraint. Constraint 2 is needed to rule out pathological cases like Figure 4, where irregular dependence analysis leads to incorrect schedules.

4.2 The Algorithm

The software pipelining algorithm is given in Figure 6. Given an initial set of available operations, the procedure pipeline invokes the procedure schedulestate to build a single state, then builds states for all the branches of that state, and so on. If at any point the algorithm encounters the same set of available operations (modulo iteration numbers) a second time, it uses the previously scheduled state. The algorithm never backtracks to explore alternative schedules. While a backtracking version could be designed easily, we feel a backtracking algorithm would be too slow to be practical.

The order in which pipeline processes the successors of a scheduled state is unspecified and makes no difference in the final parallel program.

procedure schedulestate(n, A)
    while schedule(ops(n), A) ≠ none do
        let x = schedule(ops(n), A) in
            ⟨n, A⟩ := updateone(n, A, x)
    return ⟨n, A⟩

procedure pipeline(A)
    ∀X. scheduledbefore(X) := no
    let r be an empty node in todo := {⟨r, A⟩}
    while ∃⟨n, A⟩ ∈ todo do
        if ∃j s.t. scheduledbefore(A^{+j}) ≠ no then
            n := scheduledbefore(A^{+j})
            todo := todo − {⟨n, A⟩}
        else
            let ⟨n′, A′⟩ = schedulestate(n, A) and
                {..., ⟨n_i, A_i⟩, ...} = next(n′, A′) in
                scheduledbefore(A) := n′
                todo := (todo ∪ {..., ⟨n_i, A_i⟩, ...}) − {⟨n, A⟩}

Figure 6: The software pipelining algorithm

The order in which states are scheduled can make a difference in the efficiency of the available operations computation; in Section 5 we present a slightly modified version of the algorithm in Figure 6 that processes states in an efficient order.
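The driver loop can be made concrete as in the following sketch (Python; `schedule`, `updateone`, `next_states`, and `redirect` are assumed helpers standing for the paper's schedule, updateone, next, and the edge redirection implied by reusing a state, and `canonical` normalizes iteration numbers so that A and A^{+j} compare equal):

```python
def canonical(A):
    """Hashable key equal for A and any shift A+j of it."""
    if not A:
        return frozenset()
    base = min(it for (_, it) in A)
    return frozenset((name, it - base) for (name, it) in A)

def schedule_state(n, A):
    # Fill one state: keep asking the scheduler until it returns None.
    x = schedule(n.ops, A)
    while x is not None:
        n, A = updateone(n, A, x)
        x = schedule(n.ops, A)
    return n, A

def pipeline(root, A):
    scheduled_before = {}                  # canonical availability -> state
    todo = [(root, A)]
    while todo:
        n, A = todo.pop()
        key = canonical(A)
        if key in scheduled_before:
            # Same availability seen before (modulo shift): reuse the old
            # state, creating the backedge that closes the pipeline.
            redirect(n, scheduled_before[key])
        else:
            n2, A2 = schedule_state(n, A)
            scheduled_before[key] = n2
            todo.extend(next_states(n2, A2))
```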

We use Constraints 1 and 2 to prove the correctness of the software pipelining algorithm in Figure 6. Let L be a sequential loop and let L′ be the result of software pipelining; we show that L ≡ L′. As a first step in the proof, we must assume that the available operation analysis is correct. Intuitively, the available operation analysis is correct if any schedule that is consistent with the analysis preserves program semantics. We use the program in Figure 7 to formalize this intuition. This program is identical to the one in Figure 6, except that it does not reuse previously scheduled states. Let L∞ be the infinite parallel program defined by this algorithm for a loop L. The available operation analysis is correct if, for any choice of scheduler, L ≡ L∞.

The essential step in proving the correctness of procedure pipeline is to show that every execution of L′ is also an execution of L∞.

Lemma 1 Let L′ = pipeline(A) and let L∞ = pipeline∞(A). For all k, if there is an execution ⟨⟨n₀, s₀⟩, ..., ⟨n_k, s_k⟩⟩ of L′, then there is an execution ⟨⟨n₀′, s₀⟩, ..., ⟨n_k′, s_k⟩⟩ of L∞.

Proof: The proof is by induction on the length of an execution. For the base case, let e = ⟨⟨n₀, s⟩⟩ be an execution of L′. Consider how the initial states of L′ and L∞ are built. The initial set of available operations A is the same for both. Now, in procedure pipeline we have scheduledbefore(A) = no, because initially no states are scheduled. Then schedulestate(n₀, A) = ⟨n₀′, B⟩ in both algorithms, so ops(n₀) = ops(n₀′). Clearly e′ = ⟨⟨n₀′, s⟩⟩ is an execution of L∞.

procedure pipeline∞(A)
    let r be an empty node in
        todo := {⟨r, A⟩}
    (* the condition of the while is always true *)
    while ∃⟨n, A⟩ ∈ todo do
        let ⟨n′, A′⟩ = schedulestate(n, A) and
            {..., ⟨n_i, A_i⟩, ...} = next(n′, A′) in
            todo := (todo ∪ {..., ⟨n_i, A_i⟩, ...}) − {⟨n, A⟩}

Figure 7: An algorithm that defines an infinite parallel program

For the induction step, assume that e = ⟨⟨n₀, s₀⟩, ..., ⟨n_i, s_i⟩⟩ is an execution of L′ and that e′ = ⟨⟨n₀′, s₀⟩, ..., ⟨n_i′, s_i⟩⟩ is an execution of L∞. Furthermore, assume that there exists a k such that, when the states n_i and n_i′ were scheduled, the sets of available operations were A in procedure pipeline and A^{+k} in procedure pipeline∞, respectively. Finally, assume that ops(n_i)^{+k} = ops(n_i′). It is easy to check that all of these assumptions hold after the base case.

If ops(n_i) = {stop}, then n_i and n_i′ are final states and we are done. Otherwise, in the next transition we have

δ(⟨n_i, s_i⟩) = ⟨n_{i+1}, s_{i+1}⟩
δ(⟨n_i′, s_i⟩) = ⟨n_{i+1}′, s_{i+1}′⟩

The stores s_{i+1} and s_{i+1}′ must be the same in the two transitions, since by hypothesis n_i and n_i′ have the same operations. Let c be the branch taken in state ⟨n_i, s_i⟩; note that c is also taken in state ⟨n_i′, s_i⟩, because n_i and n_i′ have the same operations evaluated in the same store. To finish the proof, we need to show that n_{i+1} and n_{i+1}′ have the same operations, possibly differing in the iteration numbers used by the pipelining algorithm. That is, we must show that ops(n_{i+1})^{+j} = ops(n_{i+1}′) for some j.

Consider once more the state of the two software pipelining algorithms when n_{i+1} and n_{i+1}′ are scheduled. By Constraint 2 and the induction hypothesis, ⟨m, B⟩ ∈ next(n_i, A) and ⟨m′, B^{+k}⟩ ∈ next(n_i′, A^{+k}), where m and m′ are fresh, empty states on branch c from n_i and n_i′, respectively. Now there are two cases.

For the first case, assume ∀j. scheduledbefore(B^{+j}) = no when ⟨m, B⟩ is removed from the todo list by pipeline. In L′, let schedulestate(m, B) = ⟨n_{i+1}, C⟩. Then by Constraints 1 and 2, in L∞ we have schedulestate(m′, B^{+k}) = ⟨n_{i+1}′, C^{+k}⟩ and ops(n_{i+1})^{+k} = ops(n_{i+1}′).

For the second case, assume that when ⟨m, B⟩ is removed from the todo list by pipeline, there is a j such that scheduledbefore(B^{+j}) = n_{i+1}. Then n_{i+1} was scheduled in L′ using available operations B^{+j}. The rest of the argument is symmetric to the case above, using B^{+j} in place of B and the fact that n_{i+1} = scheduledbefore(B^{+j}). □

Constraints 1 and 2 are needed to prove Lemma 1. These constraints ensure that having the same operations available for two states implies that all possible branches from those states are also the same. Combining the correctness condition for the available operation analysis with Lemma 1 gives a proof of correctness.

Theorem 1 If procedure pipeline produces a loop L′ from an initial loop L, then L ≡ L′.

Proof: To prove L ≡ L′, we must show ∀s. L(s) = L′(s). First, L(s) = L∞(s), since by assumption the available operations analysis is correct. By Lemma 1, every execution of L′ is also an execution of L∞, so L′(s) = L∞(s) = L(s). □

4.3 Termination

Theorem 1 proves that the software pipelining algorithm produces only correct results, but it does not show that the algorithm always terminates. To show termination, we must prove that the todo set in procedure pipeline is eventually empty. The todo set decreases in size when there is a pair ⟨n, A⟩ such that, for some j, A^{+j} has been scheduled previously. Let ≈ be the equivalence relation on sets of operations defined by A ≈ B ⟺ ∃j s.t. A^{+j} = B. If we assume that the procedure schedulestate always terminates, then to prove termination it is sufficient to show that there are only finitely many equivalence classes under ≈.

Unfortunately, there may be infinitely many equivalence classes, and in fact the procedure pipeline is not necessarily terminating under the constraints given so far. Consider, for example, what happens if the A sets simply increase in size on each recursive call. A necessary condition for A ≈ B is that |A| = |B|; if there are sets of unbounded cardinality, then there are infinitely many equivalence classes. An additional constraint is placed on the availability information to limit the size of the set of operations available for scheduling.

Constraint 3 There is a constant k such that, for all possible availability sets A, if x^j ∈ A, then y^h ∉ A for any h ≥ j + k.

This constraint states that operations can be available from at most k consecutive iterations at one time. Thus the scheduler has a sliding window of operations, and until the operations in the first iteration of the window are scheduled, the window cannot be shifted to include a new iteration at the end.

Lemma 2 Constraint 3 ensures that there are only finitely many equivalence classes of sets of operations under ≈.

Proof: If there are n operations in a loop body and k consecutive iterations can appear in A, then every available operation set is a subset of {x_i^{c+j} | 1 ≤ i ≤ n and 0 ≤ j < k} for some c. □

The value k of Constraint 3 is a parameter of the software pipelining algorithm. It need not be the same for every loop scheduled (i.e., it can be computed dynamically), but it must have a maximum value for any particular loop. Also, it is not necessary to make the window an integral number of iterations; partial iterations work just as well, although the details of the implementation are a bit more complex.

While Constraint 3 is motivated by the need to guarantee termination, it also leads to a good implementation of the procedure pipeline. The most expensive part of pipeline is checking whether, for some j, the set A^{+j} has ever been scheduled before, where A is the current set of available operations. For a window size of k iterations, operation availability information for iterations j through j + k − 1 can be represented as a bit vector of length k·n, where n is the number of operations in the sequential loop. The bit h·n + i is 1 if operation x_i^{j+h} is available for scheduling; otherwise it is 0. When iteration j has been completely scheduled (this occurs when the first n bits are all 0), the bit vector is shifted n bits, discarding the information for iteration j, and the last n bits are set to reflect the availability of operations in iteration j + k. With this representation, checking whether the same availability information has been seen before only requires checking whether the same bit vector has been seen before, which can be implemented very efficiently through hashing.
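A sketch of this representation (Python; the class name is ours, and the bit vector is modeled as an arbitrary-precision integer):

```python
class AvailabilityWindow:
    """Bits for iterations base .. base+k-1; bit h*n + i is 1 iff
    operation x_i of iteration base+h is available."""
    def __init__(self, n_ops, k):
        self.n, self.k = n_ops, k
        self.bits = 0
        self.base = 0                        # first iteration in the window

    def set_bit(self, i, iteration, value):
        h = iteration - self.base
        assert 0 <= h < self.k               # Constraint 3: inside the window
        mask = 1 << (h * self.n + i)
        self.bits = (self.bits | mask) if value else (self.bits & ~mask)

    def shift_if_done(self):
        # Once the first iteration is fully scheduled (its n bits are all 0),
        # discard it and slide the window forward one iteration; availability
        # bits for the newly exposed iteration are then set with set_bit.
        low_mask = (1 << self.n) - 1
        while self.bits and (self.bits & low_mask) == 0:
            self.bits >>= self.n
            self.base += 1

    def key(self):
        # Equal keys mean equal availability modulo an iteration shift, so
        # this integer can be hashed to find previously scheduled states.
        return self.bits
```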

5 Available Operations

Available operations analysis plays a role in our algorithm similar to the role global dataflow analysis plays in traditional optimizing compilers. An algorithm for computing available operations was first given in [EN]; for historical reasons, available operations were termed unifiable-ops in [EN]. In this section we give a new presentation of available operations. While functionally equivalent to the algorithms of [EN], our presentation is both simpler and more direct, and the final algorithms are easier to implement. The development is divided into two parts. First, we show how to compute the initial set of available operations. Second, we show how to incrementally update the information in response to decisions made by the scheduler. At the end of the section we prove the correctness of the analysis and discuss some efficiency considerations.

5.1 Computing Available Operations

Recall that Constraint 3 forces the available operations to span no more than k iterations of a loop. Therefore, to compute the operations available for scheduling, it is sufficient to examine at most k iterations of a loop. Since any number of unrolled iterations forms a loopless (acyclic) program, we restrict the problem of computing available operations to an analysis of loopless programs.

Computing available operations requires the use of dependence analysis between operations. There are many variations on dependence analysis in the literature that satisfy our requirements (Constraint 2), and it is beyond the scope of this paper to include them here [KKP, FOW, PBJ]. The algorithms in this section are presented using an abstract mechanism for dependence; by using a particular dependence analysis representation, the algorithms can be made more efficient. We use the following definitions to model dependence analysis.

Definition 3 A location is either a memory address or a register. For operations x and y and sets of operations X, we define:

write(x) = the set of locations x may write; write(X) = ∪_{x ∈ X} write(x)
kill(x) = the set of locations x must write; kill(X) = ∪_{x ∈ X} kill(x)
read(x) = the set of locations x may read; read(X) = ∪_{x ∈ X} read(x)
depends(x, y) ⟺ write(x) ∩ (read(y) ∪ write(y)) ≠ ∅; depends(X, y) ⟺ ∃x ∈ X s.t. depends(x, y)

[Figure 8: Operation b can kill a reference live at the root. A test a branches to operation b, which assigns to j, on one side and to operation c, which reads j, on the other.]

The set write(x) (resp. read(x)) must include every location x could ever write (resp. read). The set kill(x) must include only locations x always writes. Two different sets write(x) and kill(x) are defined because dependence analysis must be conservative: in general, it is not always possible to know at compile time exactly which locations an operation may read or write. The predicate depends(x, y) is true if there may be a dependence from x to y.

Defining correct available operation analysis requires identifying the operations that cannot be available because of potential data dependence violations. Assume that x precedes y on a path and depends(x, y) is true. Then clearly y cannot be available on that path until x is scheduled, or else y could be scheduled before x, resulting in a dependence violation. The following dataflow equation specifies the operations reachable from state n that are not data dependent on an intervening operation:

nodeps(n) = ops(n) ∪ ((∪_{n′ ∈ succ(n)} nodeps(n′)) − {x | depends(ops(n), x)})

Since the program fragment P being analyzed is loopless, nodeps(n) can be computed for all n by a single bottom-up traversal of the control-flow graph for P.
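A direct transcription of this equation (a sketch; `depends` and the reverse-topological ordering of the loopless program are assumed from the surrounding definitions):

```python
def compute_nodeps(states_bottom_up):
    """states_bottom_up: states of a loopless program in reverse topological
    order, so every successor is processed before its predecessors."""
    nodeps = {}
    for n in states_bottom_up:
        reach = set()
        for m in n.succ():
            reach |= nodeps[id(m)]
        # Remove operations that depend on something executed in n.
        reach = {x for x in reach if not any(depends(y, x) for y in n.ops)}
        nodeps[id(n)] = reach | set(n.ops)
    return nodeps
```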

The program in Figure 8 illustrates another situation in which an operation cannot be available. In this case, operation b cannot be available for scheduling in the first state, because its definition of j could change the value read by the reference to j in operation c. In standard compiler terminology, location j is live at the first state, and b can kill c's reference to j. Clearly, any operation that can kill a live reference cannot be available.

The second component of the available operations analysis is a computation of live references. A reference to location l is live at a state n if there is a state reachable from n where l is potentially read and there is no intervening write of l. A conventional live reference analysis is not sufficient for our purposes; instead, we wish to compute live references discounting the effect of a particular operation x. More precisely, we wish to know the set of live references assuming that x has been moved to the root state of the program. In this case, to say x has been moved to the root means that all occurrences of x that can potentially move to the root are not counted in the live variable computation. The intuitive justification behind this computation is that, when moving an operation x in the schedule, it is necessary to check if x will kill live references in its new position. However, in deciding whether or not x will kill live references in its new position, one should not count references of x itself in its current position.

The following dataflow equation defines the set of locations live at state n, modulo operation x:

live(n, x) = read(Y) ∪ ((∪_{n′ ∈ succ(n)} Z_{n′}) − kill(Y))

where Y = ops(n) − {x} and

Z_{n′} = live(n′, x) if x ∈ nodeps(n′)
Z_{n′} = live(n′, stop) otherwise

The two cases in the definition of Z_{n′} distinguish between the cases where occurrences of x can or cannot be blocked by data dependencies. If there is an occurrence of x on a path that is not blocked by data dependencies (i.e., x ∈ nodeps(n′)), then that occurrence of x is discounted in the live reference computation (i.e., Z_{n′} = live(n′, x)). If there is no occurrence of x that can potentially move, then all live references are counted (i.e., Z_{n′} = live(n′, stop), which counts all references, since stop has no effect on the store). As with the computation of nodeps, live(n, x) can be computed for all states n and operations x by a single bottom-up traversal of the control-flow graph for P. Some further improvements to the efficiency of this procedure are discussed at the end of the section.
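The same bottom-up pass computes live (a sketch; `read_set` and `kill_set` stand for the read and kill functions of Definition 3, and `live_stop` is the live(·, stop) table, computed by this same routine with x = stop):

```python
def compute_live(states_bottom_up, nodeps, live_stop, x):
    """live(n, x): locations live at n, discounting occurrences of x that
    could move to the root (x in nodeps of the successor)."""
    live = {}
    for n in states_bottom_up:
        Y = set(n.ops) - {x}
        inflow = set()
        for m in n.succ():
            if x in nodeps[id(m)]:
                inflow |= live[id(m)]        # movable occurrence: discount x
            else:
                inflow |= live_stop[id(m)]   # count all references
        live[id(n)] = read_set(Y) | (inflow - kill_set(Y))
    return live
```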

Let r be the initial state, or root, of P. An operation x is available for scheduling in r if it satisfies three conditions: x is in nodeps(r); x does not kill any live reference in operations other than x; and x is in the sliding window of operations. (Recall that Constraint 3 requires that operations from no more than k consecutive iterations be available for any state.) For a set of available operations A, let minit(A) be the minimum iteration number of any operation in A. Then

available(r) = {x^i | x^i ∈ R and i < minit(R) + k}
where R = nodeps(r) − {x | write(x) ∩ live(r, x) ≠ ∅}

Note that we are concerned with live references only at the root r; operations that potentially kill live references at an internal state n are included in nodeps(n). The program in Figure 9 illustrates this situation. Unlike Figure 8, the reference to j in operation c is not live at the root, because it reads the value written by d. The important observation is that any operation that can kill a reference that is not live at the root (e.g., b can kill c's reference to j) must be dependent on some preceding operation (e.g., depends(d, b)). That is, a reference to j that is not live at the root must be preceded by an operation that writes j; this operation prevents other operations that could write j from being available at the root.
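Putting the pieces together, available(r) can be computed as in the following sketch (Python; `write_set` stands for write of Definition 3, and `live_of` maps an operation to its live(r, ·) set at the root, e.g., built from the previous sketch):

```python
def available(root, nodeps, live_of, write_set, k):
    """Operations in nodeps(r) that kill no reference live at r,
    restricted to the k-iteration sliding window of Constraint 3."""
    R = {op for op in nodeps[id(root)]
         if not (write_set(op) & live_of(op))}
    if not R:
        return R
    min_it = min(it for (_, it) in R)      # minit(R)
    return {op for op in R if op[1] < min_it + k}
```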

The most expensive part of computing available(r) is computing live(r, x) for every operation x. The efficiency of the naive procedure described above can be improved in two ways. First, it is not necessary to compute live(r, x) for every x; it is sufficient to compute it only for those operations in nodeps(r), because nodeps(r) is a superset of the operations available for scheduling. Second, operations that kill live references could be detected earlier in the computation instead of checking only at the root; we have not given this alternative, to simplify the presentation.

[Figure 9: Operation b can kill j, but j is not live at the root. An assignment d writes j before the test a; on one branch b writes j, while on the other c reads the value of j written by d.]

5.2 Maintaining Available Operations

We first describe at a high level how available operations are maintained, after which we give implementations of the procedures next and updateone. Let P be a sequential loopless program. We add an empty state r (a state with no operations) to P and make it the root; r is the initial state in procedure pipeline (see Figure 6). This empty state r will be filled with operations chosen by the scheduler.

The next step is to compute the dataflow analysis of Section 5.1. At this point, the set of operations available for scheduling in state r is available(r).

Once the initial global analysis of P is completed, we are ready to begin scheduling states. When a state n is scheduled, it is first filled with operations by schedulestate, and then ⟨n, A⟩ is removed from the todo set and n's successors are added to the todo set. A state in the todo set is a frontier state. At any point in the incremental development of the parallelized loop, every frontier state of the parallel loop has the property that its known predecessors have been scheduled and its successors have yet to be scheduled. (The unknown predecessors are those that are added through backedges inserted to complete the software-pipelined loop.) Available operations are needed only for the frontier states; predecessors of frontier states are never modified. When procedure pipeline terminates, there are no frontier states, and the modified program P is the parallel loop.

Figure 10 gives a generic snapshot of the algorithm's data structures during scheduling. The states above the line labeled A are parallel states already scheduled by the algorithm; these states are arranged as a tree, except where backedges have been added by pipelining. The states between the lines A and B are the frontier states: empty states that have yet to be filled with operations. The states below line B are states of the original sequential loop. Conceptually, these states are not part of the parallel loop; they are used instead to compute the available operations information for the frontier states.

The first todo set is {⟨r, available(r)⟩}; thus, initially, r is the only frontier state. The procedure pipeline selects the pair ⟨r, available(r)⟩ from todo and fills r with operations by calling schedulestate(r, available(r)). The procedure schedulestate in turn calls updateone one or more times to choose the scheduled operations (see Figure 6).

[Figure 10: A snapshot of software pipelining during scheduling. States above line A are already-scheduled parallel states; states between lines A and B are frontier states; states below line B belong to the original sequential loop.]

The procedure call updateone(r, available(r), x) performs two tasks. First, x is deleted from the interior states of P and x is added to the frontier state r; thus this transformation moves x to r from its original place in the sequential schedule. (Some copies of x may have to remain in interior states of P if x cannot move on all paths to the frontier state; see the discussion below.) When a test is moved to r, the control flow of P must be modified to preserve P's semantics. Second, the sets nodeps and live are updated where necessary.

An important fact is that both the deletion of x and the updating of the nodeps and live sets can be restricted to a relatively small subset of the states of P; this property makes the incremental cost of maintaining available operations reasonable. The new set of available operations is the updated available(r).

When the scheduling of r is complete, next(r, A) = {..., ⟨r_i, A_i⟩, ...} gives the set of empty successors r_i of r and the corresponding sets of available operations A_i. We implement next(r, A) by inserting a new empty state r_i before each n_i = succonbranch(r, c_i) on branch c_i ∈ branch(r). The set available(r_i) is exactly the set of operations available for scheduling on branch c_i from r. Note that the r_i are new frontier states of P. This implementation of next allows P, and therefore the available operation analysis, to be shared among all elements of the todo set. As scheduling proceeds, there will be multiple frontier states in P, one for each element in todo. An implementation of next is given in Figure 11.

Lemma 3 Let P′ be P with the modifications performed by next. Then P′ ≡ P.

Proof: Procedure next only inserts empty nodes in P. □

To complete the description of available operations, we must give an implementation of procedure updateone. We could do this trivially in terms of the local transformations of Percolation Scheduling [Nic], but for completeness we describe a direct implementation that is closer to the way it should be done in practice.

procedure next(r, A)
    for each c_i ∈ branch(r) do
        let r_i be a fresh empty state and n_i = succonbranch(r, c_i) in
            succonbranch(r, c_i) := r_i
            succonbranch(r_i, ⟨⟩) := n_i
            nodeps(r_i) := nodeps(n_i)
            live(r_i, x) := live(n_i, x) for all operations x
    return {⟨r_i, available(r_i)⟩ | r_i as defined above}

Figure 11: Implementation of next

Let r be a frontier state of P, let x = schedule(ops(r), available(r)), and assume that x ≠ none. We first describe how x is deleted from P and added to r when x is an assignment; this is the easier case. Moving an operation x while preserving P's semantics is a little subtle, because x may be available at the frontier state but still blocked by data dependencies on some paths. The program in Figure 12(a) illustrates this situation. In this example, c is available at the root because c does not kill any references live at the root, and because there is a path from the root to c (in this case, passing through the false branch of a) such that c is not dependent on any operation on the path. However, there may be other paths from the root to c (in this case, passing through the true branch of a) such that c is dependent on some operation on the path; clearly c cannot be deleted from such a path. In addition, there may even be paths from other frontier states to c (represented by the incoming edge from e). If c is moved to the root in Figure 12, it still must be preserved on paths from other frontier states.

As illustrated in Figure 12(b), this problem can be resolved by duplicating states so that no instance of the operation being moved is shared between paths where it can move and paths where it cannot move. In this example, only the single state containing c needs to be duplicated, but in general multiple states may have to be duplicated. In Figure 12(c), the state containing c has been deleted and the operation moved to the root. It is easy to verify that the program in Figure 12(c) is equivalent to the program in Figure 12(a).

To formalize which states are duplicated and which states are deleted, we need some additional definitions and notation. A path is a sequence of states ⟨n₁, ..., n_k⟩ such that n_{i+1} ∈ succ(n_i) for all 1 ≤ i < k. A state n_k is covered by a state n for operation x if there is a path ⟨n, ..., n_k⟩ such that operation x is in nodeps(n_i) for every state n_i on the path:

covered(n, x) = {n_k | there exists a path ⟨n₁, ..., n_k⟩ with n₁ = n such that, for all 1 ≤ i ≤ k, x ∈ nodeps(n_i)}

We say a path is covered by ⟨n, x⟩ if every state of the path is in covered(n, x). When an operation x is moved to a frontier state r, it should be deleted only from paths that are covered by ⟨r, x⟩; other paths should be left unchanged. The simplest case is when every path to x is covered by ⟨r, x⟩. We say a loopless program P is delete consistent for ⟨r, x⟩ if, for every n ∈ covered(r, x) such that ops(n) = {x}, every path from a frontier state to n is covered by ⟨r, x⟩. If P is delete consistent for ⟨r, x⟩, then x is not blocked by data dependencies on any path to the frontier state r. Hence we can delete each state n with ops(n) = {x} in covered(r, x), update the predecessors of n to point to n's successor, and add x to r.

[Figure 12: Moving an assignment. (a) The loop before c is moved; (b) the loop made delete consistent by duplicating the state containing c; (c) the loop after c is moved to the root.]

Lemma 4 Let x be an assignment such that x ∈ available(r), and let N = {n | ops(n) = {x} and n ∈ covered(r, x)}. Assume that P is delete consistent for ⟨r, x⟩. Let P′ be P with the following changes (recall that ⟨⟩ is the empty branch; see Section 2):

1. Modify each n′ where succonbranch(n′, c) = n for some n ∈ N so that succonbranch(n′, c) = succonbranch(n, ⟨⟩).

2. Delete every n ∈ N.

3. Let ops(r) = ops(r) ∪ {x}.

Then P′ ≡ P.

Proof: For brevity, we only sketch the proof. The transformation can be implemented by a sequence of semantics-preserving Percolation Scheduling transformations between adjacent nodes [Nic]. Since each individual Percolation Scheduling transformation preserves program semantics, the entire sequence preserves program semantics. □

Of course, Lemma 4 only applies if P is delete consistent. We next show how to make an arbitrary loopless program delete consistent for ⟨r, x⟩. The set of predecessors of a state n is pred(n) = {n′ | n ∈ succ(n′)}. The following lemma gives an easy test for determining whether P is delete consistent.

Lemma 5 P is delete consistent for ⟨r, x⟩ iff, for all n ∈ covered(r, x), pred(n) ⊆ covered(r, x).

[Figure 13: Moving a test. (a) The loop before test b is moved; (b) the loop after b is moved to the root, with the state containing a duplicated on b's true and false branches to preserve control flow.]

Proof: If every predecessor of a member of covered(r, x) is in covered(r, x), then clearly every path from r to an n ∈ covered(r, x) is covered by ⟨r, x⟩. For the other direction, assume that there is an n ∈ covered(r, x) and, for some n′ ∈ pred(n), n′ ∉ covered(r, x). Then there must be a path from some frontier state r′ of the form ⟨r′, ..., n′, n⟩. This path is not covered by ⟨r, x⟩. □

The following algorithm makes P delete consistent for ⟨r, x⟩. Let C = covered(r, x). Iterate the following two steps until no n is chosen in step 1:

1. Choose n ∈ C such that some p ∈ pred(n) is not also in C.

2. Let n′ be a duplicate of n, and for every p ∈ pred(n) such that p ∉ C, if succonbranch(p, c) = n, then modify p so that succonbranch(p, c) = n′.

Note that this algorithm copies the minimum number of states needed to make P delete consistent. Once P is delete consistent, the steps of Lemma 4 can be applied to move x to the frontier state.
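A sketch of this duplication loop (Python; `duplicate` is an assumed helper that copies a state together with its outgoing succonbranch edges, and `preds` maps a state to its current predecessors):

```python
def make_delete_consistent(C, preds):
    """Duplicate states until every predecessor of a state in
    C = covered(r, x) is itself in C (the test of Lemma 5)."""
    C = set(C)
    changed = True
    while changed:
        changed = False
        for n in list(C):
            outside = [p for p in preds(n) if p not in C]
            if not outside:
                continue
            n2 = duplicate(n)                    # copy kept outside C
            for p in outside:
                for c, s in list(p.succ_on_branch.items()):
                    if s is n:
                        p.succ_on_branch[c] = n2 # redirect p around C
            changed = True
    return C
```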

All that remains is to update the nodeps and live sets. States that are duplicated in making P delete consistent retain the nodeps and live information of the original state. The set of states for which the analysis can change is covered(r, x); since both nodeps and live are computed bottom-up, the analysis can be updated in a single bottom-up pass over the paths covered by ⟨r, x⟩.

Finally, we show how to update the available operations in the case where the operation chosen by updateone is a test. Let r be a frontier state of P and let x = schedule(ops(r), available(r)). For updateone to move a test x while preserving P's semantics, it is necessary to modify the control flow of P. Intuitively, we duplicate all the covered paths from r to x; the original set of paths leads to the successor on x's true branch, and the duplicate set of paths leads to the successor on x's false branch. This transformation is illustrated in Figure 13. The program in Figure 13(a) is already delete consistent for ⟨root, b⟩. When b is moved to the root in Figure 13(b), state a is duplicated on b's true and false branches to preserve control flow. Recall that a branch c is a truth assignment to tests ⟨x₁ = b₁, ..., x_n = b_n⟩, where each b_i is one of true or false. The following lemma shows how to move a test to a frontier state.

Lemma. Let P be a program with frontier state r, and let x ∈ available(r), where x is a test. Assume P is delete consistent for (r, x), and let X = covered(r, x) in P. Program P′ is P with the following modifications, performed in order:

(1) For each n ∈ X, let n′ be a state such that ops(n′) = ops(n), and for each c ∈ branch(n), with m = succonbranch(n, c):

    succonbranch(n′, c) = m                                if m ∉ covered(r, x)
    succonbranch(n′, c) = m′                               if m ∈ covered(r, x) and ops(m) ≠ {x}
    succonbranch(n′, c) = succonbranch(m, ⟨(x, false)⟩)    if m ∈ covered(r, x) and ops(m) = {x}

(2) For each n ∈ X, if succonbranch(n, c) = m and ops(m) = {x}, then modify n so that succonbranch(n, c) = succonbranch(m, ⟨(x, true)⟩).

(3) For each c ∈ branch(r), where c = ⟨(x_1, b_1), ..., (x_n, b_n)⟩ and m = succonbranch(r, c), do:

    succonbranch(r, ⟨(x, true), (x_1, b_1), ..., (x_n, b_n)⟩) = m
    succonbranch(r, ⟨(x, false), (x_1, b_1), ..., (x_n, b_n)⟩) = m′ if m ∈ X, and m if m ∉ X

(4) Let ops(r) = ops(r) ∪ {x}.

Then P′ ≡ P.

Proof: Again, for brevity we only sketch the proof. It is easy to verify that P′ preserves the control flow of P. As in the proof of the previous lemma, the transformation can be expressed as a sequence of local transformations between adjacent nodes. □

Part (1) of the lemma duplicates covered paths by creating a copy n′ of every state in covered(r, x) and by assigning successors so that the paths formed by the n′ lead to the false branch of x. Part (2) modifies the original states in covered(r, x) so that they lead to the true branch of x. Part (3) modifies the branches of r to point to the original nodes if x is true and to the copied nodes if x is false. A description of procedure updateone is given in the figure below.

The implementation we have described is somewhat naive, and there are inefficiencies that can be eliminated at the cost of greater complexity in the algorithm. Most of the potential problems are related to space explosion, either in the size of the final code or in the size of intermediate data structures used by the algorithm. Some states that are initially different may become identical as a result of scheduling operations. This observation applies both to states in the parallel schedule and to states that have yet to be scheduled. A good implementation should merge states that are identical and are on identical paths. When performed on the states of the parallel schedule, this optimization reduces the size of the final code.

(Note that in this construction the original states containing x are not removed from P, but they become unreachable because control flow is redirected around them.)

procedure updateone(r, A, x)
    make P delete consistent for (r, x)
    X := covered(r, x)
    if x is an assignment then
        perform the steps of the lemma for moving an assignment
    else
        perform the steps of the lemma for moving a test
    update nodeps and live for n ∈ X and any states added in making P delete consistent
    return ⟨r, available(r)⟩

Figure: Implementation of procedure updateone.

A separate potential problem lies in the definition of delete consistency. Making the sequential program delete consistent prior to moving an operation x may result in duplicating many states of the sequential program. These duplicates cannot subsequently be merged, because x occurs on one set of paths in its original position (i.e., on those paths where x was blocked by a data dependence) and not on the set of paths where x was moved. A partial solution is to move x as far as possible on the paths where it is blocked by a data dependence, thus allowing some sharing of common paths. Some scheduling systems have this property [Nic, ME]. However, this optimization may be of marginal value in our algorithm, because the duplicated states of the sequential program are soon eliminated by subsequent scheduling anyway.

Another approach to improving the efficiency of the techniques presented here is to use a representation other than the control-flow graph for computing available operations. The obvious alternative is to use some form of the program dependence graph, which admits more efficient algorithms for some purposes; see [LA, AJLS] for uses of program dependence graphs in the context of software pipelining. We have presented our techniques using a control-flow graph representation for simplicity only; there is no barrier to using other, potentially faster, representations in an implementation.

Correctness of the Analysis

Earlier we assumed the available operations analysis was correct in order to prove the correctness of the software pipelining algorithm. Recall that the available operations analysis is correct if L ≡ L_∞, where L is a sequential loop and L_∞ is the infinite parallel program computed by pipeline. In this section we prove that the implementation of available operations given in the preceding sections is correct.

Lemma. Let L′ be the program defined by pipeline for some loop L. Then for any scheduler, L ≡ L′.

Proof: Let P be the infinite acyclic program formed by full unrolling of L. Apply pipeline using P for the available operations analysis, and let L′ be the final program. Each transformation of P by next or updateone preserves the semantics of P, by the preceding lemmas. Therefore L ≡ P ≡ L′. □

Managing the Window

For performance reasons it is obviously desirable to minimize the number of iterations of L that are actually used in the available operations analysis. It is possible to use only a few iterations of P because the scheduling constraint forces the available operations analysis for any state to span no more than k iterations of loop L. In this section we show how the number of iterations needed for available operations analysis can be limited to k.

The only problem with limiting the number of iterations used in the analysis is that different frontier nodes may require available operations from different iteration windows. For example, for a frontier state r, operations may be available from iterations i to i+k, but for another frontier state r′, operations may be available from iterations i+c to i+k+c. In a naive implementation, P must contain operations from iterations i through i+k+c to cover both frontier states. Fortunately, this is not necessary. We can first schedule r using iterations i to i+k, along with any other states that have operations available from iteration i. Once all states with operations available from iteration i are scheduled, a new iteration of L can be added to P and the window shifted to i+1 to i+k+1.

The figure below is a modified version of pipeline. In this implementation, all frontier states that have operations available from iteration i are scheduled before any frontier states whose operations are available only from later iterations. A new iteration is added to P only when every state that has operations available from iteration i is already scheduled. Thus P always contains the minimal number of iterations, and iterations are added to P as infrequently as possible.

There is one detail omitted from the figure: when the i-th iteration of L is added to P, the live sets of the leaf states (i.e., the states at the end of iteration i) must be initialized to the set of locations live at the end of iteration i.
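The bookkeeping this requires is small: work is keyed on the earliest iteration an available set draws from. A sketch, where minit, schedule_state, and add_iteration are passed in as stand-ins for the corresponding pieces of the algorithm:

    def pipeline_windowed(todo, minit, schedule_state, add_iteration):
        itnum = 0
        while todo:
            # schedule every frontier state still drawing from iteration itnum
            current = {s for s in todo if minit(s) == itnum}
            while current:
                state = current.pop()
                todo.discard(state)
                todo |= set(schedule_state(state))   # successors re-enter todo
                current = {s for s in todo if minit(s) == itnum}
            itnum += 1
            add_iteration(itnum)   # window now covers iterations itnum .. itnum + k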

Resources

Resource allocation is a critical issue for software pipelining algorithms. In this section we show how the allocation of functional units can be smoothly integrated into our software pipelining algorithm. Our approach to incorporating functional resources is similar to the reservation table methods used in dynamic scheduling algorithms [Bae]. To describe the modifications to the algorithm that accommodate functional resources, we require some additional definitions. Let {f_1, ..., f_n} be the set of functional units for a machine. We drop the assumption that every operation executes in a single cycle, and assume that c is the greatest number of cycles required by any operation. A reservation table is an n × c array of boolean values, where entry (i, j) is true iff resource f_i is busy at cycle j. For an operation x, the reservation table resources(x) describes the resources required by x in each cycle of x's execution.


procedure pipeline(A0)
    for all X: scheduledbefore[X] := no
    P := k iterations of L with empty root r
    todo := {⟨r, A0⟩}
    itnum := 1
    while todo ≠ ∅ do
    begin
        while there is a ⟨n, A⟩ ∈ todo such that minit(A) = itnum do
        begin
            if there is a j such that scheduledbefore[A^j] ≠ no then
                n′ := scheduledbefore[A^j]
                todo := todo − {⟨n, A⟩}
            else
                let ⟨n′, A′⟩ = schedulestate(n, A) and
                    {⟨n_i, A_i⟩} = next(n′, A′) in
                todo := (todo ∪ {⟨n_i, A_i⟩}) − {⟨n, A⟩}
                scheduledbefore[A′] := n′
        end
        itnum := itnum + 1
        add one iteration of L to the end of P and update the analysis
    end

Figure: The modified software pipelining algorithm.

There is one difficulty in extending the model to multi-cycle operations. If an operation x requires multiple cycles to complete, then its result is not available for multiple cycles. However, data dependencies and resource constraints alone do not prevent operations that depend on x's result from being scheduled in the cycle after x is initiated. We resolve this problem by treating an i-cycle operation x as i one-cycle operations; operations that depend on the result of x are made dependent on the last operation in the chain. To guarantee legal schedules, it is necessary to constrain the i unit-cycle operations to be scheduled in successive cycles without interruption. This constraint can be encapsulated entirely within the policy for selecting operations to schedule, and thus does not affect the overall structure of the software pipelining algorithm.
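A sketch of this expansion, assuming each original operation records its latency, its dependence predecessors (deps), and its consumers; OneCycleOp is a hypothetical wrapper for one cycle of the original operation:

    class OneCycleOp:
        def __init__(self, op, cycle):
            self.op, self.cycle = op, cycle
            self.deps = set()
            self.must_follow_immediately = None

    def expand_multicycle(op):
        pieces = [OneCycleOp(op, cycle=c) for c in range(op.latency)]
        pieces[0].deps = set(op.deps)          # first piece inherits op's inputs
        for a, b in zip(pieces, pieces[1:]):
            b.deps = {a}                       # chain the pieces in order
            b.must_follow_immediately = a      # scheduling policy, not a data dep
        for user in op.consumers:              # consumers wait for the last piece
            user.deps.discard(op)
            user.deps.add(pieces[-1])
        return pieces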

To allocate functional units, the software pipelining algorithm is modified so that when a state n is scheduled, there is a reservation table associated with n describing resource usage at that point in the schedule. The scheduler is modified so that it chooses an operation that is both available and for which resources can be allocated. Two reservation tables R_1 and R_2 are compatible if they do not require the same functional unit in the same cycle, i.e., there is no entry (i, j) such that R_1[i, j] = true = R_2[i, j]. If the reservation table R is associated with state n, then the scheduler must choose an operation x to schedule in n such that compatible(R, resources(x)). The following constraint modifies the earlier constraint on the scheduler to include reservation tables.

Constraint. Let X and A be sets of operations and let R be a reservation table. The scheduler is a function that takes a set of already scheduled operations, a set of available operations, and a reservation table, and returns an operation. In addition, if X ≠ ∅ and there exists an x ∈ A such that compatible(resources(x), R), then schedule(ops(n), A, R) ≠ none; i.e., the scheduler must choose an operation in every state if possible. Finally, we require that the scheduler's choice be invariant under shifts of the iteration index: for all i,

    schedule(X^{+i}, A^{+i}, R) = x^{j+i}, where x^{j+i} ∈ A^{+i} and compatible(resources(x^j), R), if schedule(X, A, R) = x^j, and
    schedule(X^{+i}, A^{+i}, R) = none if schedule(X, A, R) = none,

where X^{+i} denotes X with all iteration indices shifted by i.

The procedures next and updateone must also be modified to update reservation tables to reflect the changes in available resources when operations are scheduled. The procedure call next(n, A, R) should advance the reservation table R by one cycle, to reflect the fact that in successors of n the resources used in the first cycle of R are no longer reserved. The procedure call updateone(n, A, R, x) should not only update n and A, but also update R by adding the resources required by x.

The next constraint modifies the earlier constraint on updateone and next to include reservation tables. The logical or of two reservation tables R_1 and R_2 is a table R such that R[i, j] = R_1[i, j] ∨ R_2[i, j]. The reservation table advance(R) is a table R′ such that R′[i, j] = R[i, j+1] for j < c, and R′[i, c] = false.
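All three table operations are straightforward. The sketch below represents a reservation table as a list of n rows (one per functional unit) of c booleans (one per cycle):

    def compatible(R1, R2):
        # true iff no unit is claimed by both tables in the same cycle
        return all(not (a and b)
                   for row1, row2 in zip(R1, R2)
                   for a, b in zip(row1, row2))

    def table_or(R1, R2):
        # logical or: R[i][j] = R1[i][j] or R2[i][j]
        return [[a or b for a, b in zip(row1, row2)]
                for row1, row2 in zip(R1, R2)]

    def advance(R):
        # shift one cycle: R'[i][j] = R[i][j+1], last column becomes false
        return [row[1:] + [False] for row in R]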

Constraint. Consider an arbitrary set of available operations A, a state n, an operation x, and a reservation table R. Then there exists a set of operations B such that for all i:

(1) updateone(m, A^{+i}, R, x^{+i}) = ⟨m′, B^{+i}, R ∨ resources(x^{+i})⟩, where ops(m) = ops(n)^{+i} and ops(m′) = ops(n)^{+i} ∪ {x^{+i}}.

Furthermore, there exist sets of operations A_j and states n_j for 1 ≤ j ≤ |branch(n)| such that for all i:

(2) next(m, A^{+i}, R) = {⟨n_j, A_j^{+i}, advance(R)⟩}, where ops(m) = ops(n)^{+i}.

procedure schedulestate(n, A, R)
    while schedule(ops(n), A, R) ≠ none do
        let x = schedule(ops(n), A, R) in
            ⟨n, A, R⟩ := updateone(n, A, R, x)
    return ⟨n, A, R⟩

procedure pipeline(A0)
    for all X and R: scheduledbefore[X, R] := no
    let r be an empty node and R0 be an empty reservation table in
        todo := {⟨r, A0, R0⟩}
    while there is a ⟨n, A, R⟩ ∈ todo do
        if there is a j such that scheduledbefore[A^j, R] ≠ no then
            n′ := scheduledbefore[A^j, R]
            todo := todo − {⟨n, A, R⟩}
        else
            let ⟨n′, A′, R′⟩ = schedulestate(n, A, R) and
                {⟨n_i, A_i, R_i⟩} = next(n′, A′, R′) in
            todo := (todo ∪ {⟨n_i, A_i, R_i⟩}) − {⟨n, A, R⟩}
            scheduledbefore[A′, R′] := n′

Figure: The software pipelining algorithm with reservation tables.

The figure above gives a modified version of the software pipelining algorithm that includes reservation tables. For simplicity, the modifications are presented relative to the original algorithm rather than the more efficient windowed version. Note that the detection of repeating states now involves both the set of available operations and the reservation table. Using the two constraints above, it is straightforward to adapt the original proof of correctness of software pipelining to prove the correctness of the algorithm in the figure. Termination is still guaranteed because there are only a finite number of reservation tables, and therefore repeating states are guaranteed to occur.
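The scheduler's obligation under the first constraint, to choose some compatible available operation whenever one exists, can be met by a simple greedy filter. In this sketch, priority is an assumed heuristic ranking and resources is passed in; compatible is as defined above:

    def schedule(X, A, R, priority, resources):
        # X (the already scheduled operations) is available to smarter heuristics
        candidates = [x for x in A if compatible(resources(x), R)]
        if not candidates:
            return None        # schedulestate then stops filling this state
        return max(candidates, key=priority)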

Register Allocation

Registers are another critical resource that must be utilized effectively to achieve good results in practice. Traditional register allocation can interact very badly with software pipelining. If register assignment is performed before scheduling (the usual practice), then software pipelining may produce poor results, because the register allocator may unnecessarily reuse registers, thus adding data dependences to the program. Our approach is to modify an initial register allocation on the fly during software pipelining.

The basic technique is easy to describe; it is based on a similar technique of Ebcioglu [Ebc].

[Figure: Dynamically improving register allocation. (a) Instruction b is unavailable: b's destination register is an operand register of a. (b) After renaming registers: b writes a spare register, and a register move c restores the original destination.]

Consider the program fragment in part (a) of the figure above. In this example, operation b is not available for scheduling at the root because its target register is one of the operand registers of operation a. However, if there is a spare register, then the dependence can be broken by renaming the destination register of b, as in part (b). Now operation b is available for scheduling. It is necessary to insert a register move c into the program to restore the machine state after operation a. This transformation is a heuristic: it assumes that the advantage gained in eliminating the dependence outweighs the cost of the extra copy. This is usually true, and almost always the copy operation can be removed by a later global pass of generalized copy propagation [PNW].
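The renaming step itself is mechanical. A sketch, with a hypothetical Operation record carrying a dest register and a srcs list:

    from dataclasses import dataclass, field

    @dataclass
    class Operation:                        # hypothetical three-address operation
        opcode: str
        dest: str
        srcs: list = field(default_factory=list)

    def rename_destination(a, b, free_regs):
        # break the dependence of b on a by renaming b's destination
        if b.dest not in a.srcs or not free_regs:
            return None                     # nothing to do, or no spare register
        old, fresh = b.dest, free_regs.pop()
        b.dest = fresh                      # b now writes the spare register
        # the returned copy c restores the machine state after operation a
        return Operation(opcode="move", dest=old, srcs=[fresh])

A later pass of generalized copy propagation can usually delete the returned move.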

There is an additional problem with the register allocation scheme described above. Including register allocation in the software pipelining algorithm requires that registers be taken into account when determining when two states are the same. A sufficient condition is that two states s_1 and s_2 can be considered the same only if each register that holds a value generated by an operation x^i in state s_1 holds the value generated by the corresponding shifted operation x^{i+j} in state s_2 (for a fixed shift j). This condition guarantees correctness and termination, and is analogous to the similar requirements for available operations and functional units.

In practice, it appears that this condition alone permits an impractical number of states to be generated in some loops, and a repeating pattern fails to emerge within a reasonable number of steps. The problem is that even for small register files, the number of possible assignments of values to registers is astronomical. To accelerate convergence of the pipelining algorithm, it is necessary to limit the space of possible register assignments in some way. The solution we use is as follows. A register file r′ is a renaming of register file r if r can be mapped to r′ by some set of register-to-register transfers. In the software pipelining algorithm, two states are considered to be equivalent if the register files in the two states are renamings of each other and the other conditions on operations and functional units are satisfied.
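Under this design, the equivalence test reduces to a check on the values held, plus the transfers to issue on the back edge. A simplified sketch: register files are dicts from register names to symbolic values, and cyclic permutations, which need a scratch register, are ignored.

    def is_renaming(rf1, rf2, live):
        # rf2 is a renaming of rf1 if every live value in rf2 is held
        # somewhere in rf1, so transfers can move it into place
        held = set(rf1.values())
        return all(v in held for v in rf2.values() if v in live)

    def back_edge_copies(rf1, rf2):
        # (source, destination) moves that turn rf1 into rf2
        where = {v: reg for reg, v in rf1.items()}
        return [(where[v], reg)
                for reg, v in rf2.items()
                if v in where and rf1.get(reg) != v]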

This design identifies many register files with one another, thereby accelerating convergence of the algorithm. The cost is that register-to-register transfers must be issued on the back edges of pipelined loops to move values into the correct registers. These copy operations can be eliminated by a separate copy-elimination optimization pass after software pipelining. Leaving the copy operations in the code is reasonable as well, as they incur only a minor performance penalty (see the experimental results below).

Implementation and Experiments

The software pipelining algorithm described here has been implemented as part of a compiler project at the University of California, Irvine. The compiler is a version of the GNU C compiler (GCC), modified to accommodate our methods. GCC is used as a front end to translate the C source into an intermediate representation. This translation includes an initial register allocation and a number of common optimizations, such as jump optimization (e.g., removing jumps to jumps) and common subexpression elimination; GCC's loop unrolling and inlining are disabled and replaced by our software pipelining and scheduling algorithm.

A number of incremental optimizations, e.g., incremental tree-height reduction, are beneficial in conjunction with software pipelining. For the results presented in this paper, only dynamic renaming (see the previous section) and load-after-store elimination are performed together with pipelining. Load-after-store elimination identifies loads that depend on a unique store; such loads can be eliminated in favor of uses of the value being stored. In some cases the store can be removed as well, if it is known that the eliminated load is the only read of the location written by the store. Load-after-store elimination is useful because it removes register spill code that becomes dead as a result of dynamic renaming. Both dynamic renaming and load-after-store elimination are an inherent part of our on-the-fly register allocation scheme.

The strength of our software pipelining framework is the flexibility to exploit whatever fine-grain parallelism is available in a loop. Restrictions placed on code motions are designed to be as weak as possible while still guaranteeing correctness and termination. As discussed below, this flexibility does in fact translate into very good speedups across a variety of architectural models.

The downside of the weak restrictions of our system is that there are a huge number of potential states, even for small loops and machines with modest resources. The huge state space can cause slow convergence of pipelining to a pattern and large final loops. There is a clean solution to this problem: the scheduler should be designed to minimize code explosion by restricting code motions that increase code size. For the purposes of this paper, we have focused the experiments to reveal information about the software pipelining algorithm, not about particular smart scheduling heuristics. Thus we have used only very simple greedy list-scheduling heuristics that make no effort to take account of the impact of code motions on code size. As we shall see shortly, in the majority of cases the size of the state space is not a problem, and software pipelining converges quickly to a pattern even with naive scheduling.

In some cases, using an iteration window that is large enough to maximally exploit the available parallelism results in unjustifiably slow convergence. To identify these cases, in this experiment we find it useful to introduce the notion of cutoff convergence, which constrains the maximum number of iterations scheduled to some fixed amount; the remainder of any paths that have not converged after the cutoff number of iterations are simply scheduled sequentially. We stress that cutoff convergence is a creature of our experiment: its purpose is to identify when code explosion is a problem. In practice, one should prefer scheduling heuristics designed to prevent code explosion; this topic is discussed further below.

Name     Latency   Description
ALU      1 cycle   integer add/sub and logical operations
SHIFT    1 cycle   arithmetic and logical shifts
FALU     cycles    floating point add/sub and logical operations
MUL      cycles    integer and floating point multiply
DIV      cycles    integer and floating point divide
MEM      cycles    cache read (a cache miss stalls the processor); 1 cycle for a cache write
BRANCH   cycles    conditional branch

Table: Functional unit kind and latency (the multi-cycle latency counts were lost).

Architectural Models

Two pipelined VLIW architecture models are used for the experiments: one with homogeneous functional units and one with heterogeneous functional units. Both models assume a single wide register file shared by all functional units. With the exception of two unlimited-resource experiments used to measure threshold performance results, the register file is assumed to have a fixed number of registers.

Operation latencies for both models, given in the table above, are similar to those of the Motorola superscalar. An instance of the heterogeneous model has one of three fixed numbers, or an unlimited number, of each of the functional units defined in the table. An instance of the homogeneous model has one of three fixed numbers, or an unlimited number, of homogeneous functional units, where each homogeneous functional unit can perform any of the functions defined in the table.

Each VLIW instruction specifies one (possibly NOP) operation for each functional unit. Each operation has the optional side effect of advancing the pipeline. For both models there are no hardware interlocks for detecting data or control hazards, so the compiler is entirely responsible for ensuring that all hazards are avoided at run time.

Experimental Results

The tables below show the dynamic speedup measured for both target architecture models on the Livermore Loops. The speedups are with respect to running the unscheduled code sequentially on the target architecture. Thus the speedups reflect both the exploitation of multiple functional units and pipelining within a functional unit.

[Table: Homogeneous Multi-cycle Functional Units: SPEEDUP. One row per Livermore kernel (LL) plus the average; one column per homogeneous functional-unit configuration, plus the Infx and InfxInf threshold configurations. Numeric entries lost.]

[Table: Heterogeneous Multi-cycle Functional Units: SPEEDUP. One row per Livermore kernel (LL) plus the average; one column per heterogeneous configuration (units of each type), plus the Infx and InfxInf threshold configurations. Numeric entries lost.]

The first three columns of the homogeneous speedup table show the speedups for the homogeneous model with the fixed register file and the three homogeneous functional-unit configurations, respectively. The heterogeneous table shows the same information for the heterogeneous model configured with the three counts of units of each type, again assuming the fixed register file. The last two columns of each table show threshold performance levels that are discussed below.

For the results presented in the first columns of each table, loops were pipelined with progressively larger iteration windows until there was no noticeable increase in the average speedup over all benchmarks. The numbers in parentheses at the top of each column show the smallest iteration window sizes for which the highest average performance was attained for each configuration, and for which the speedups shown in the tables were generated. Notice that all of the window sizes are fairly small. For the results shown in the first three columns, a fixed maximum number of iterations is scheduled, i.e., this is the cutoff. In almost all cases only a fraction of this number is needed for convergence to a pattern. The conv type column in the miscellaneous performance tables below shows how the algorithm terminated: P indicates convergence to a pattern, and C indicates cutoff convergence on at least one path.

The last two columns, which are identical for both tables, show the speedups obtained assuming an unlimited number of functional units and either the fixed register file (column Infx) or an unlimited number of registers (column InfxInf). For both columns we want to show the maximum speedup that can be obtained for the specific architecture configuration, given the fixed code motion capabilities, scheduling heuristics, and front-end optimizations used in our system. Therefore, for the Infx and InfxInf columns, the iteration window size and cutoff limits were set to the number of iterations that would be executed by each loop at run time. Thus loops that exhibit natural convergence are guaranteed to be optimal in the sense defined for the theorem of the section on optimal software pipelining below, and the few loops that do not converge are optimal in the same sense because they are fully unrolled and scheduled. Note that at the largest finite configurations, the speedups are already optimal with respect to the Infx numbers, but were obtained with a small iteration window. For many of the loops, the largest-configuration speedups are optimal even with respect to the InfxInf numbers, which shows that even optimal register allocation for these loops cannot increase performance.

A wide range of speedups appears in the tables. What we have tried to show is that, given a fixed set of code motion capabilities, scheduling heuristics, and front-end optimizations such as those produced by GCC, our software pipelining algorithm is able to achieve the same performance as fully unrolling and scheduling the loop. Furthermore, despite the generality of our approach, the algorithm manages to achieve good utilization of resources even with naive scheduling heuristics. The overall performance of these benchmarks, with either pipelining or complete unrolling, could be improved in a number of ways that are orthogonal to our software pipelining approach, e.g., by improving memory reference disambiguation.

(Although, due to cyclic dependencies involving long-latency operations such as floating-point division and/or procedure calls, the performance of some of these loops would not likely improve significantly even with perfect disambiguation.)

[Table: Homogeneous Multi-cycle Functional Units: MISC PERFORMANCE MEASURES. For each homogeneous configuration and for the Infx and InfxInf configurations, the table lists, per Livermore kernel: conv type, reg use, min loop, max loop, and total size. Numeric entries lost.]

[Table: Heterogeneous Multi-cycle Functional Units: MISC PERFORMANCE MEASURES. For each heterogeneous configuration (units of each type) and for the Infx and InfxInf configurations, the table lists, per Livermore kernel: conv type, reg use, min loop, max loop, and total size. Numeric entries lost.]

In the rest of this section we discuss and interpret the performance results in more detail. To aid in interpreting the results shown in the speedup tables, we present the following performance measures in the miscellaneous performance tables:

conv type: Convergence type. P means that pipelining converged on a pattern, and C means that it converged on the cutoff.
reg use: The maximum number of registers used at any instruction (i.e., state).
min loop: The number of instructions on the shortest path through the pipelined loop.
max loop: The number of instructions on the longest path through the pipelined loop.
total size: The total number of instructions in the benchmark, including inner loop instructions as well as all code preceding and succeeding the loop.

As discussed in the register allocation section, software pipelining sometimes inserts register-to-register transfers in order to speed convergence of the algorithm. Because these transfers can be eliminated by copy propagation, albeit at the cost of an increase in code size, we have not counted them in the speedup figures. Even if the copies are not eliminated, the figures in the tables show that the performance penalty is low: the worst case is that all registers are live and must be copied on the back edge, which costs only a small fraction of the length of the pipelined loop body. For large loops the penalty is smaller still; for small loops the overhead can be reduced by unrolling the pipelined loop body.

There are two interesting anomalies in the speedup tables. The first is that for a few benchmarks in both tables, the speedup actually decreases slightly after some increases in the number of resources. One cause is that even though two pipelined loops may exhibit the same asymptotic speedup, the overhead from their pre-loop and/or post-loop code can differ (e.g., the speedup of one kernel decreases slightly when going from Infx to InfxInf). The other cause of some small decreases in performance when resources increase is overly simplistic scheduling heuristics. For instance, the list scheduling heuristics currently used in the compiler allow operations to be scheduled much earlier than their next use, potentially saturating the register file at subsequent states and thus preventing the removal, via renaming, of false dependencies that might otherwise allow operations on the critical path to be scheduled earlier. A couple of the kernels provide good examples of this effect. The fewer resources there are, the fewer unimportant (i.e., off the critical path) operations are scheduled far ahead of their next uses, and the less likely it is that the register file becomes saturated with unimportant values. This problem can be alleviated with different scheduling heuristics; in any case, this issue is orthogonal to software pipelining itself.


(Since there is only a single path through most of these loops, min loop and max loop are usually equal, and in that case they give the total number of instructions in the pipelined loop.)

The other anomaly occurs for a few loops in each table. Factoring out considerations like the above scheduling anomaly and other heuristic aspects such as speculative scheduling, we would expect speedup to increase linearly with the number of functional units until some threshold speedup is reached. Thus, for each doubling in the number of functional units, we would expect the speedup to be the lesser of twice the old speedup and the maximum unlimited speedup; however, for this second class of anomaly, at the largest doubling in each table we see that the speedup is slightly less than this expected value. The reason is that while the iteration window size was chosen to maximize the average speedup shown in the tables, the performance of a few of the loops in each table would have improved with a larger window size, though without any significant effect on the average speedup over all loops.

Finally, it is interesting to consider the circumstances under which the algorithm fails to converge to a pattern before the cutoff is reached. An analysis of the kernels with type C convergence (see the miscellaneous performance tables) shows that the problem arises in vectorizable loops or, more generally, loops with very few flow dependencies. In this case the only constraints are resource constraints, and operations are free to move almost anywhere in the schedule. In this situation the lack of dependence structure in the program, combined with greedy scheduling heuristics, tends to lead to an explosion in the set of states, slowing convergence. Variations on a device of Ebcioglu's may show how to modify the scheduler to avoid this problem [Ebc]. The basic idea is to introduce artificial dependencies that don't harm parallelism extraction but dramatically reduce the number of potential states the scheduler may explore. For example, a rule of thumb for vectorizable loops could be that operation x^i must be scheduled no later than the time of x^{i+1}. Since the loop is vectorizable, there is no reason to prefer scheduling one before the other; eliminating some orderings reduces the overall number of potential states.
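A sketch of that rule of thumb, where ops_by_iteration is an assumed list of dicts mapping an operation's name to its instance in each unrolled iteration; the function returns ordering edges (x^i, x^{i+1}) meaning the second may not be scheduled before the first:

    def add_artificial_deps(ops_by_iteration):
        # constrain x in iteration i to be scheduled no later than
        # x in iteration i+1, pruning interchangeable orderings
        edges = []
        for earlier, later in zip(ops_by_iteration, ops_by_iteration[1:]):
            for name, x_i in earlier.items():
                if name in later:
                    # an ordering hint only, not a data dependence
                    edges.append((x_i, later[name]))
        return edges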

Notice that in most cases the total size of the final loop is an order of magnitude larger than the shortest and longest paths through the loop. Because loop control conditionals from succeeding iterations are scheduled in parallel with operations from preceding iterations, a new loop exit path is usually created for each iteration scheduled. In some cases this is simply the cost side of a cost vs. performance tradeoff inherent to scheduling conditional jumps, the benefits of which are to allow strictly control dependent operations (e.g., operations like stores that cannot be renamed) to be scheduled earlier, and to commit to alternative control paths as early as possible so as to minimize the amount of speculative scheduling. Fortunately, in many cases, such as for loop exit code, this cost can often be significantly reduced by merging multiple identical control paths into a single shared path. In the context of available operations scheduling, this is accomplished by merging states with identical available operations sets, an optimization we have not implemented.

(To guarantee the preservation of correct semantics when scheduling a conditional above operations that precede it, it is necessary to duplicate those operations onto each branch of the conditional after it has been scheduled.)

On Optimal Software Pipelining

In this section we briefly review research on the limitations of software pipelining, especially a result showing that optimal software pipelining is unachievable [SGE]. Given this result, we show that our algorithm is as good as possible, in the sense that it can produce arbitrarily good schedules.

Research in software pipelining has naturally focused on discovering algorithms for computing pipelined schedules, both in general and for specific machines. Concurrently, researchers have investigated the theoretical limitations of software pipelining. One of the central theoretical questions is whether or not there is a software pipelining algorithm that produces optimal pipelined schedules for an arbitrary loop. Because scheduling algorithms are based on preserving data dependences, the natural meaning of optimality is with respect to the length of dependence chains.

Definition. A program L is time optimal if, for every execution ⟨⟨X_1, s_1⟩, ..., ⟨{stop}, s_n⟩⟩ of L, n is the length of the longest dependence chain in the execution.

The obvious form of the optimality question is stated as follows: is there an algorithm which takes as input a machine description (i.e., resource constraints, instruction timings, etc.) and a loop, and produces a time optimal schedule for that machine? This problem statement is not very useful, however, because scheduling problems with finite resources are computationally intractable even without software pipelining. To gain some insight into software pipelining itself, researchers have usually abstracted the problem as: given sufficient resources and a loop L, is there an algorithm which computes a time optimal schedule for L?

The answer to this question is trivially no for some programs, such as the example given earlier in the paper. Recall that in that example, instructions d and e cannot be scheduled in the same instruction because they write the same store location. One branch of the test must always be optimized at the expense of the other branch, and thus there does not exist a parallel version that is time optimal. The conflict between d and e is usually classified as another type of dependence, an output dependence [KKP]. To avoid this problem, we can rephrase the question again: given unbounded resources and a loop L without output dependences, is there an algorithm which computes a time optimal schedule for L? This question has been resolved negatively [SGE]. Again, the problem is that for some loops an optimal closed-form parallel version does not exist.

While the definition is natural, it appears that so many qualifications are required to apply it in the analysis of general software pipelining algorithms that it ceases to be useful. For our purposes, we adopt a different definition of what it means for a software pipelining algorithm to be as good as possible. Recall from the correctness discussion that the loop L_∞ is the infinite parallel program that results from scheduling with complete information about available operations. While L_∞ may not be optimal, it represents the best that can be done with global knowledge of the program and the ability to fully unroll loops. The following theorem shows that as the window size k of the software pipelining algorithm increases, the quality of the code approaches that of L_∞.

Theorem. Let L be a loop, and let L_k be the result of applying pipeline with a scheduling window of k iterations. Let s be any store on which the execution of L terminates. Define t(L, s) to be the length of the execution of L in store s. Then

    lim_{k→∞} t(L_k, s) = t(L_∞, s).

Proof: Let i be the largest index of any iteration in the execution of L on store s. For any k ≥ i, programs L_k and L_∞ have identical executions on store s. □

The theorem is a theoretical result, since in practice the scheduling window k cannot cover more than a few iterations. However, it does show that within the framework of our algorithm it is possible to generate arbitrarily good code, subject to the ability of the scheduler to make good scheduling decisions for finite resources.

Related Work

Software pipelining is actually a relatively old idea: programmers in the microcode community software pipelined code by hand for decades [Kog]. The first semi-automatic technique for software pipelining was proposed by Charlesworth [Cha]. For an overview of the history of instruction level parallelism, see [RF].

Today there are a variety of algorithms and frameworks for software pipelining. We describe each and discuss its relationship to our own work. Because of the large amount of work in the area, our discussion of each proposal is necessarily brief.

Modulo Scheduling

Modulo scheduling is an important software pipelining technique introduced by Rau and Glaeser [RG] and subsequently used as the basis for numerous other algorithms [Lam, Jon, RTS, Huf, WMHR]. Modulo scheduling has been used in compilers for the FPS series [Tou], the polycyclic machine [RG], and Cydrome's Cydra [Cyd].

A basic modulo scheduling algorithm works as follows. Consider a loop L that requires a resource k times per iteration of the loop body. If the target machine has t copies of the resource, then an upper bound on the throughput is one iteration of L every ⌈k/t⌉ cycles. Let the initiation interval s be the maximum of ⌈k/t⌉ over all resources. In modulo scheduling, the loop body is heuristically scheduled one statement at a time. When a statement is scheduled at time c, the instance of that statement in iteration i is scheduled at time c + is. If at any point a statement cannot be added to the schedule due to resource or dependency constraints, then the schedule is abandoned and the algorithm either backtracks or tries a larger initiation interval.
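The resource bound on the initiation interval can be computed directly. A small sketch, with uses and counts as assumed dicts keyed by resource name: for instance, a loop issuing four memory operations per iteration on a machine with two memory units cannot initiate iterations faster than one every ceil(4/2) = 2 cycles.

    import math

    def initiation_interval(uses, counts):
        # s = max over resources of ceil(k/t): k uses per iteration,
        # t copies of the resource on the machine
        return max(math.ceil(k / counts[res]) for res, k in uses.items())

    def modulo_slot(c, i, s):
        # a statement placed at time c runs at time c + i*s in iteration i
        return c + i * s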

Modulo scheduling smoothly integrates the simultaneous treatment of resource constraints and software pipelining. The primary disadvantage of modulo scheduling is that it does not apply directly to loops with conditional tests in the loop body. Two extensions have been proposed to overcome this limitation. Lam introduced hierarchical reduction to combine modulo scheduling with complex control flow [Lam]. In hierarchical reduction, the then and else branches of a conditional test are first scheduled independently. The shorter branch is padded with no-ops to make it the same length as the longer branch, and the scheduler encapsulates the entire if-then-else construct as a single statement. Hierarchical reduction suffers from several drawbacks: first, some paths are padded with no-ops, which may slow execution; second, treating the if-then-else as a single statement necessarily overestimates resource requirements; and third, preserving the control structure of the program restricts possible code motions.

A second proposal for integrating modulo scheduling with conditional tests is to use if-conversion [AKPW] before modulo scheduling and reverse if-conversion [WHB, WMHR] after modulo scheduling. When a loop is if-converted, the expression of control flow is changed from explicit jumps to guarded operations, where each operation of the original loop is guarded by the predicates of the conditionals that control its execution. In this way, all non-trivial control flow in the loop is replaced by data dependences.

Modulo scheduling with if-conversion appears to improve upon modulo scheduling with hierarchical reduction [WMHR]. However, if-conversion retains the undesirable features of hierarchical reduction to a considerable degree. First, because control flow is expressed as data dependence, speculative execution of operations (i.e., moving operations above conditionals) is not possible, nor is it possible to reorder conditionals, for the same reason. Thus the possible code motions are restricted. (This limitation is noted in [WMHR]; no indication is given of how it can be overcome.) In addition, performing if-conversion greatly hinders the management of limited resources during scheduling. If-conversion schedules all the operations in the original loop body in a single basic block. These operations compete for resources during scheduling, including operations that could never execute simultaneously because they appear on different control paths in the original loop. Thus straightforward modulo scheduling of if-converted loops overestimates resource requirements.

For the case of loops without control flow and unlimited resources, there is considerable commonality between our algorithm and modulo scheduling. For example, in [ANa] it was shown that a simplified version of our algorithm produces optimal code for loops without conditionals in the body and for machines with sufficient resources. Despite the differences in conception between the two algorithms, this result was later shown to hold for a small modification of modulo scheduling as well [Jon].

In short, our algorithm combines software pipelining, resource constraints, and handling of control flow with a flexibility not matched by current modulo scheduling techniques. In our opinion, the significant practical advantage of modulo scheduling at this time is that in cases where both techniques produce equally fast schedules, the schedules produced by modulo scheduling are generally more concise [Jon].

Pipeline Scheduling

The work most closely related to our own is that of Ebcioglu [Ebc] and Ebcioglu and Nakatani [NE, EN], and later Moon and Ebcioglu [ME]. Pipeline scheduling differs from our approach in that the loop body is not constructed by scheduling and testing for repeating states. Instead, the original loop is incrementally transformed to create a parallel schedule. Software pipelining is achieved by moving operations across the back edge of the loop; this has the effect of moving an operation between loop iterations. The handling of control flow is based on the same principles as our own approach and is equally general. A scheduling window of operations is also used [NE], although the purpose is to reduce code explosion rather than to guarantee termination.

An advantage of pipeline scheduling is that the loop is always equivalent to the original loop, and therefore it is legal to apply any semantics-preserving transformation to the loop at any time, even transformations that have little to do directly with scheduling. Ebcioglu and Nakatani exploit this property by aggressively renaming registers and performing strength reduction optimizations, which substantially alter the dependence structure of the loop. We also apply some of these optimizations (see the implementation section), but cannot apply them as generally as pipeline scheduling, because of our need to guarantee regular dependences for correctness. As an aside, to the best of our knowledge, modulo scheduling implementations do not perform any transformations that modify the dependence graph.

An advantage of our algorithm over the current pipeline scheduling algorithm is in the handling of resource constraints. Pipeline scheduling uses only local transformations to move operations from one state to another. Thus, at some points, resource constraints may need to be violated in the schedule as an operation moves through one state on its way to another. To deal with this property, pipeline scheduling has a moderately elaborate phase structure in which resource constraints are alternately enforced and relaxed on specific portions of the loop body. Our algorithm treats resource constraints in a more direct and uniform way.

GURPR

GURPR (for Global Unrolling, Pipelining and Rerolling) is a software pipelining technique proposed by Su, Ding and Xia [SDX]. The technique is based on URPR, an algorithm for pipelining loops without tests [SDX]. Given a loop L, the first step of GURPR is to apply URPR to each path through the original loop body. The separate pipelined paths are then put together to form the pipelined loop, with compensation code added at points where execution could jump from one path to another.

The approach is similar in philosophy to trace scheduling: paths are first optimized as basic blocks, ignoring jumps into and out of the path, and then fixup code is added to ensure correctness. GURPR is also subject to the same criticism as trace scheduling: there is no reason why the execution of a program should repeatedly follow the same path through the loop body. Our approach, and the approach of Ebcioglu and Nakatani, is more uniform, overlapping iterations on all paths rather than on a subset of paths.

Petri Net Techniques

Recently there has been interest in using Petri Nets to formalize the software pipelining problem [GWN, RA]. There is a natural mapping from operations, dependences, and resource constraints into Petri Nets, thus combining all of these features in a single, well understood formalism. This approach has been shown to be competitive with modulo scheduling with hierarchical reduction [RA] and appears promising.

The weakness of current algorithms based on Petri Net techniques is that control flow is handled in a way very similar to if-conversion. The net effect of the mapping into the Petri Net model is that control flow is enforced just like data dependences, and thus speculative execution of operations is not possible. Furthermore, the rate of execution of iterations is determined by the length of the longest path through the loop body, even when shorter paths through the loop are taken during execution.

Conclusions

We have presented a simple but fairly detailed description of a compaction-based software pipelining algorithm that handles resource constraints. The novel aspect of our algorithm is that it cleanly (in fact, completely) separates issues specific to software pipelining, such as detecting repeating pipeline states and termination, from other orthogonal issues, such as the computation of available operations and scheduling decisions. We hope that this makes two contributions to the state of the art. First, our algorithm explains in a fairly simple way what software pipelining is about and what its unique characteristics are. Second, the modular and simple design of our algorithm should facilitate the development of general, retargetable implementations of software pipelining.

References

[AAG] M. Annaratone, E. Arnould, T. Gross, H. T. Kung, M. Lam, O. Menzilcioglu, K. Sarocky, and J. A. Webb. Warp architecture and implementation. In Proceedings of the Annual International Symposium on Computer Architecture, June.

[Aik] A. Aiken. Compaction-Based Parallelization. PhD thesis, Cornell University Department of Computer Science. Technical Report.

[Aik] A. Aiken. A theory of compaction-based parallelization. Theoretical Computer Science.

[AJLS] V. H. Allan, J. Janardhan, R. M. Lee, and M. Srinivas. Enhanced region scheduling on a program dependence graph. In Proceedings of the International Symposium and Workshop on Microarchitecture (MICRO), December.

[AKPW] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Proceedings of the Symposium on Principles of Programming Languages, January.

[ANa] A. Aiken and A. Nicolau. Optimal loop parallelization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June.

[ANb] A. Aiken and A. Nicolau. Perfect Pipelining: A new loop parallelization technique. In Proceedings of the European Symposium on Programming. Springer-Verlag Lecture Notes in Computer Science, March.

[AN] A. Aiken and A. Nicolau. A realistic resource-constrained software pipelining algorithm. In Advances in Languages and Compilers for Parallel Processing. MIT Press.

[Bae] J. L. Baer. Computer Systems Architecture. Computer Science Press.

[Cha] A. E. Charlesworth. An approach to scientific array processing: The architectural design of the AP-120B/FPS-164 family. IEEE Computer.

[Cyd] Cydrome, Inc., Palo Alto, CA. Technical Summary.

[Ebc] K. Ebcioglu. A compilation technique for software pipelining of loops with conditional jumps. In Proceedings of the Annual Workshop on Microprogramming, December.

[EN] K. Ebcioglu and A. Nicolau. A global resource-constrained parallelization technique. In Proceedings of the ACM SIGARCH International Conference on Supercomputing, June.

[EN] K. Ebcioglu and T. Nakatani. A new compilation technique for parallelizing loops with unpredictable branches on a VLIW architecture. In Languages and Compilers for Parallel Computing. MIT Press.

[Fis] J. Fisher. 2^n-way jump microinstruction hardware and an effective instruction binding method. In Proceedings of the Annual Workshop on Microprogramming, December.

[Fis] J. A. Fisher. Trace Scheduling: A technique for global microcode compaction. IEEE Transactions on Computers, July.

[FOW] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, June.

[GWN] G. Gao, Y. Wong, and Q. Ning. A timed Petri-Net model for fine-grain loop scheduling. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June.

[Huf] R. A. Huff. Lifetime-sensitive modulo scheduling. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June.

[Jon] R. B. Jones. Constrained Software Pipelining. Master's thesis, Department of Computer Science, Utah State University, Logan, UT, September.

[KKP] D. J. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. Wolfe. Dependence graphs and compiler optimizations. In Proceedings of the SIGACT/SIGPLAN Symposium on Principles of Programming Languages, January.

[KN] K. Karplus and A. Nicolau. Efficient hardware for multiway jumps and prefetches. In Proceedings of the Annual Workshop on Microprogramming, December.

[Kog] P. M. Kogge. The microprogramming of pipelined processors. In Proceedings of the Annual International Symposium on Computer Architecture.

[LA] R. M. Lee and V. H. Allan. Advanced software pipelining and the program dependence graph. In Fourth IEEE Symposium on Parallel and Distributed Processing, December.

[Lam] M. Lam. A Systolic Array Optimizing Compiler. PhD thesis, Carnegie Mellon University.

[ME] S. M. Moon and K. Ebcioglu. An efficient resource-constrained global scheduling technique for superscalar and VLIW processors. In Proceedings of the International Symposium and Workshop on Microarchitecture (MICRO), December.

[NE] T. Nakatani and K. Ebcioglu. Combining as a compilation technique for VLIW architectures. In Proceedings of the Annual Workshop on Microprogramming.

[NE] T. Nakatani and K. Ebcioglu. Using a lookahead window in a compaction-based parallelizing compiler. In Proceedings of the Annual Workshop on Microprogramming.

[Nic] A. Nicolau. Uniform parallelism exploitation in ordinary programs. In Proceedings of the International Conference on Parallel Processing, August.

[NPA] A. Nicolau, K. Pingali, and A. Aiken. Fine-grain compilation for pipelined machines. Journal of Supercomputing, August.

[PBJ] K. Pingali, M. Beck, R. Johnson, M. Moudgill, and P. Stodghill. Dependence flow graphs: An algebraic approach to program dependences. In Proceedings of the Symposium on Principles of Programming Languages, January.

[PNW] R. Potasman, A. Nicolau, and H. G. Wang. Register allocation, renaming and their impact on fine-grain parallelism. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing. Springer-Verlag Lecture Notes in Computer Science, April.

[RA] M. Rajagopalan and V. H. Allan. Efficient scheduling of fine grain parallelism in loops. In Proceedings of the Annual International Symposium on Microarchitecture, December.

[RF] B. R. Rau and J. Fisher. Instruction-level parallel processing: History, overview and perspective. Journal of Supercomputing, January.

[RG] B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the Annual Workshop on Microprogramming, October.

[RTS] B. R. Rau, P. P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June.

[SDX] B. Su, S. Ding, and J. Xia. URPR: an extension of URCR for software pipelining. In Proceedings of the Annual Workshop on Microprogramming, October.

[SDX] B. Su, S. Ding, and J. Xia. GURPR: a method for global software pipelining. In Proceedings of the Annual Workshop on Microprogramming, December.

[SGE] U. Schwiegelshohn, F. Gasperoni, and K. Ebcioglu. On optimal parallelization of arbitrary loops. Journal of Parallel and Distributed Computing.

[Tou] R. F. Touzeau. A Fortran compiler for the FPS scientific computer. In Proceedings of the ACM SIGPLAN Symposium on Compiler Construction, June.

[WHB] N. J. Warter, G. E. Haab, and J. W. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the International Symposium and Workshop on Microarchitecture (MICRO), December.

[WMHR] N. J. Warter, S. A. Mahlke, W. W. Hwu, and B. R. Rau. Reverse if-conversion. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June.