Optimal Instruction Scheduling Using Integer Programming

Kent Wilken, Jack Liu, Mark Heffernan

Department of Electrical and Computer Engineering

University of California, Davis, CA 95616

fwilken,jjliu,[email protected]

Abstract

This paper presents a new approach to local instruction scheduling based on integer programming that produces optimal instruction schedules in a reasonable time, even for very large basic blocks. The new approach first uses a set of graph transformations to simplify the data-dependency graph while preserving the optimality of the final schedule. The simplified graph results in a simplified integer program which can be solved much faster. A new integer-programming formulation is then applied to the simplified graph. Various techniques are used to simplify the formulation, resulting in fewer integer-program variables, fewer integer-program constraints and fewer terms in some of the remaining constraints, thus reducing integer-program solution time. The new formulation also uses certain adaptively added constraints (cuts) to reduce solution time. The proposed optimal instruction scheduler is built within the Gnu Compiler Collection (GCC) and is evaluated experimentally using the SPEC95 floating point benchmarks. Although optimal scheduling for the target processor is considered intractable, all of the benchmarks' basic blocks are optimally scheduled, including blocks with up to 1000 instructions, while total compile time increases by only 14%.

1 Introduction

Instruction scheduling is one of the most important compiler optimizations because of its role in increasing pipeline utilization. Conventional approaches to instruction scheduling are based on heuristics and may produce schedules that are suboptimal. Prior work has considered optimal instruction scheduling, but no approach has been proposed that can optimally schedule a large number of instructions in reasonable time. This paper presents a new approach to optimal instruction scheduling which uses a combination of graph transformations and an advanced integer-programming formulation. The new approach produces optimal schedules in reasonable time even for scheduling problems with 1000 instructions.

This research was supported by Equator Technologies, Mentor Graphics' Embedded Software Division, Microsoft Research, the National Science Foundation's CCR Division under grant #CCR-9711676, and by the University of California's MICRO program.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PLDI 2000, Vancouver, British Columbia, Canada. Copyright 2000 ACM 1-58113-199-2/00/0006…$5.00.

1.1 Local Instruction Scheduling

The local instruction scheduling problem is to find a minimum length instruction schedule for a basic block. This instruction scheduling problem becomes complicated (interesting) for pipelined processors because of data hazards and structural hazards [11]. A data hazard occurs when an instruction i produces a result that is used by a following instruction j, and it is necessary to delay j's execution until i's result is available. A structural hazard occurs when a resource limitation causes an instruction's execution to be delayed.

The complexity of local instruction scheduling can depend on the maximum data-hazard latency which occurs among the target processor's instructions. In this paper, latency is defined to be the difference between the cycle in which instruction i executes and the first cycle in which data-dependent instruction j can execute. Note that other authors define latency (delay) to be the cycle difference minus one, e.g., [2, 17]. We prefer the present definition because it naturally allows write-after-read data dependencies to be represented by a latency of 0 (write-after-read dependent instructions can execute in the same cycle on a typical multiple-issue processor, because the read occurs before the write within the pipeline).

Instruction scheduling for a single-issue processor with a maximum latency of two is easy. Instructions can be optimally scheduled in polynomial time following the approach proposed by Bernstein and Gertner [2]. Instruction scheduling for more complex processors is hard. No polynomial-time algorithm is known for optimally scheduling a single-issue processor with a maximum latency of three or more [2]. Optimal scheduling is NP-complete for realistic multiple-issue processors [3]. Because optimal instruction scheduling for these more complex processors is considered intractable, production compilers use suboptimal heuristic approaches. The most common approach is list scheduling, where instructions are represented as nodes in a directed acyclic data-dependency graph (DAG) [15]. A graph edge represents a data dependency, an edge weight represents the corresponding latency, and each DAG node is assigned a priority. Critical-path list scheduling is a common variation, where an instruction's priority is based on the maximum-length path through the DAG from the node representing the instruction to any leaf node [15]. While critical-path list scheduling is usually effective, it can produce suboptimal results even for small scheduling problems. Consider the DAG in Figure 1, taken from [17], where each edge is labeled with its latency and each node is labeled with its critical-path priority.
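The Figure 1 example can be checked with a small simulation. The following is a minimal sketch of single-issue critical-path list scheduling, not the scheduler used in this paper. The edge set (A to D, B to D, B to E, C to E, each with latency 2) is read from Figure 1 and should be treated as an assumption, and the names `EDGES`, `list_schedule`, and `tie_break` are hypothetical. The `tie_break` argument stands in for the arbitrary choice the scheduler makes among equal-priority nodes.

```python
# Edge set as read from Figure 1 (an assumption): four latency-2 edges.
EDGES = {("A", "D"): 2, ("B", "D"): 2, ("B", "E"): 2, ("C", "E"): 2}
NODES = ["A", "B", "C", "D", "E"]


def critical_path_priority(node):
    """Longest latency-weighted path from node to any leaf node."""
    succ = [(v, lat) for (u, v), lat in EDGES.items() if u == node]
    return max((lat + critical_path_priority(v) for v, lat in succ), default=0)


def list_schedule(tie_break):
    """Single-issue list scheduling.

    tie_break is a list of all nodes; among ready nodes of equal priority,
    the one appearing earliest in tie_break is issued. Returns
    {node: cycle} with cycles numbered from 1. A cycle in which no
    instruction is ready is a stall.
    """
    prio = {n: critical_path_priority(n) for n in NODES}
    cycle_of, cycle = {}, 1
    while len(cycle_of) < len(NODES):
        ready = [
            n for n in NODES
            if n not in cycle_of
            and all(u in cycle_of and cycle_of[u] + lat <= cycle
                    for (u, v), lat in EDGES.items() if v == n)
        ]
        if ready:
            best = max(ready, key=lambda n: (prio[n], -tie_break.index(n)))
            cycle_of[best] = cycle
        cycle += 1
    return cycle_of


def schedule_length(tie_break):
    """Length in cycles of the schedule produced for a given tie-break order."""
    return max(list_schedule(tie_break).values())
```

Under these assumptions, `schedule_length(list("ACBDE"))` and `schedule_length(list("CABDE"))` give six cycles (one stall), while the other four orders of A, B and C give five cycles.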

Figure 1: Simple example where critical-path list scheduling can produce a suboptimal schedule. (DAG with nodes A, B and C, each labeled with priority 2, and nodes D and E, each labeled with priority 0; each edge is labeled with latency 2.)

For this DAG, nodes A, B and C all have the same priority, so the scheduler can arbitrarily select the order of these instructions. If the initial order is A, C, B or C, A, B, the next cycle will be a stall because the latency from B to D or from B to E is not satisfied. Other orders of A, B and C will produce an optimal schedule which has no stall.

1.2 Optimal Instruction Scheduling

Although optimal instruction scheduling for complex processors is hard in theory, in practice it may be possible to optimally solve important instances of instruction scheduling problems in reasonable time using methods from combinatorial optimization. Prior work has used various combinatorial optimization approaches to optimally schedule instructions for complex processors. However, none of these approaches can optimally schedule large basic blocks in reasonable time.

Ertl and Krall [8] use constraint logic programming and consistency techniques to optimally schedule instructions for a single-issue processor with a maximum latency greater than two. Vegdahl [23] and Kessler [13] use dynamic programming to optimally schedule instructions. Chou and Chung [6] and Tomiyama et al. [22] use approaches that implicitly enumerate all possible schedules to find an optimal schedule. For efficiency, [6] and [22] propose techniques to prune the enumeration tree so that redundant or equivalent schedules are not explicitly enumerated. Experimental results for these various approaches show that they are effective only for small to medium-sized basic blocks (30 instructions or less).

Prior work has also used integer programming to optimally schedule instructions. An integer program is a linear program, with the added requirement that all problem variables must be assigned a solution value that is an integer. Like the other approaches to optimal instruction scheduling, prior work using integer programming has only produced approaches that are effective for small to medium-sized scheduling problems. Arya [1] proposes an integer programming formulation for vector processors, although the basic approach is general and can be applied to other processor types. The experimental results in [1] suggest the formulation can significantly improve the instruction schedule. However, the results are limited to only three small to medium-sized problems (12 to 36 instructions). Only the smallest problem is solved optimally, with the other problems timing out before an optimal solution is found. Leupers and Marwedel [14] propose an integer programming formulation for optimally scheduling a multiple-issue processor. Although their work focuses on DSP processors, again the basic approach is general and can be used for other processor types. The experimental results in [14] are also limited to a few small to medium-sized problems (8 to 37 instructions). While the solution time might be acceptable for the largest problem studied (37 instructions solves in 130 seconds), the solution time does not appear to scale well with problem size (the next smaller problem, 24 instructions, solves in 7 seconds). Thus the approach does not appear practical for large instruction scheduling problems. Chang, Chen and King [5] propose an integer programming formulation that combines local instruction scheduling with local register allocation. Experimental results are given for one simple 10-instruction example which takes 20 minutes to solve optimally. These results suggest that the approach has very limited practicality.

Although prior work using integer programming has produced limited success for optimal instruction scheduling, integer programming has been used successfully to optimally solve various other compiler optimization problems, including array dependence analysis [19], data layout for parallel programs [4] and global register allocation [9]. Integer programming is the method of choice for solving many large-scale real-world combinatorial optimization problems in other fields [16], including other scheduling problems such as airline crew scheduling. This successful use of integer programming elsewhere suggests that improved integer programming formulations may be the key to solving large-scale instruction scheduling problems.

This paper presents a new approach to optimal instruction scheduling based on integer programming, the first approach which can solve very large scheduling problems in reasonable time. The paper is organized as follows. Section 2 describes a basic integer-programming formulation for optimal instruction scheduling, which is representative of formulations proposed in prior work. The material in Section 2 provides the reader with background on how instruction scheduling can be formulated as an integer programming problem, and provides a basis for comparing the new integer-programming formulation. Experimental results in Section 2 show that the basic formulation cannot solve large instruction scheduling problems in reasonable time, which is consistent with the results from prior work. Section 3 introduces a set of DAG transformations which can significantly simplify the DAG, while preserving the optimality of the schedule. The simplified DAG leads to simplified integer programs which are shown experimentally to solve significantly faster. Section 4 introduces a new integer-programming formulation that simplifies the integer program by reducing the number of integer-program variables, reducing the number of integer-program constraints, and simplifying the terms in some of the remaining constraints. The simplified integer programs are shown experimentally to solve dramatically faster. The last section summarizes the paper's contributions and outlines future work.

2 Basic Integer-Programming Formulation

This section describes a basic integer-programming formulation for optimal instruction scheduling, which is representative of formulations proposed in prior work. The basic formulation provides background for the DAG transformations presented in Section 3 and the new integer-programming formulation presented in Section 4.

2.1 Optimal Instruction Scheduling, Basic Formulation

In the basic formulation, the basic block is initially scheduled using critical-path list scheduling. The length U of the resulting schedule is an upper bound on the length of an optimal schedule. Next, a lower bound L on the schedule length is computed. Given the DAG's critical path c and the processor's issue rate r, a lower bound on the schedule for a basic block with n instructions is:

$$L = 1 + \max\{c,\ \lceil n/r \rceil - 1\}$$

If U = L, the schedule is optimal, and an integer program is unnecessary. If U > L, an integer program is produced (as described below) to find a length U − 1 schedule. If the integer program is infeasible, the length U schedule is optimal. Otherwise a length U − 1 schedule was found and a second integer program is produced to find a length U − 2 schedule. This cycle repeats until a minimum-length schedule is found.

To produce an m clock-cycle schedule for an n-instruction basic block, a 0-1 integer-program scheduling variable is created for each instruction, for each clock cycle in the schedule. The scheduling variable $x_i^j$ represents the decision to schedule (1) or not schedule (0) instruction i in clock cycle j. The scheduling variables for the corresponding clock cycles are illustrated in Figure 2.

$$\begin{array}{cccc|c}
x_1^1 & x_2^1 & \cdots & x_n^1 & C_1 \\
x_1^2 & x_2^2 & \cdots & x_n^2 & C_2 \\
\vdots & \vdots & & \vdots & \vdots \\
x_1^m & x_2^m & \cdots & x_n^m & C_m
\end{array}$$

Figure 2: The array of 0-1 scheduling variables for a basic block with n instructions for a schedule of m clock cycles.

A solution for the scheduling variables must be constrained so that a valid schedule is produced. A constraint is used for each instruction i to ensure that i is scheduled at exactly one of the m cycles, a must-schedule constraint with the following form:

$$\sum_{j=1}^{m} x_i^j = 1$$

Additional constraints must ensure the schedule meets the processor's issue requirements. Consider a statically scheduled r-issue processor that allows any r instructions to issue in a given clock cycle, independent of the instruction type. For this r-issue processor, an issue constraint of the following form is used for each clock cycle j:

$$\sum_{i=1}^{n} x_i^j \le r$$

If a multiple-issue processor has issue restrictions for various types of instructions, a separate issue constraint can be used for each instruction type [14].

A set of dependency constraints is used to ensure that the data dependencies are satisfied. For each instruction i, the following expression resolves the clock cycle in which i is scheduled:

$$\sum_{j=1}^{m} j \cdot x_i^j$$

Because only one $x_i^j$ variable is set to 1 and the rest are set to 0 in an integer program solution, the expression produces the corresponding coefficient j, which is the clock cycle in which instruction i is scheduled. Using this expression, a dependency constraint of the following form is produced for each edge in the DAG to enforce the dependency of instruction i on instruction k, where the latency from k to i is the constant $L_{ki}$:

$$\sum_{j=1}^{m} j \cdot x_k^j + L_{ki} \le \sum_{j=1}^{m} j \cdot x_i^j$$

Prior work has proposed a basic method for simplifying the integer program and hence reducing integer-program solution time. Upper and lower bounds can be determined for the cycle in which an instruction i can be scheduled, thereby reducing an instruction's scheduling range. All of i's scheduling variables for cycles outside the scheduling range set by i's upper and lower bounds are unnecessary and can be eliminated. After scheduling range reduction, if any instruction's scheduling range is empty (its upper bound is less than its lower bound), no length m schedule exists and an integer program is not needed.

Chang, Chen and King propose using the critical path distance from any leaf node (any root node) or the number of successors (predecessors) to determine an upper bound (lower bound) for each instruction [5]. For the r-issue processor defined above, a lower bound $L_i$ on i's scheduling range is set by:

$$L_i = 1 + \max\{cr_i,\ \lceil (1 + p_i)/r \rceil - 1\} \qquad (1)$$

where $cr_i$ is the critical path distance from any root node to i, and $p_i$ is the number of i's predecessors. Similarly, an upper bound $U_i$ on i's scheduling range is set by:

$$U_i = m - \max\{cl_i,\ \lceil (1 + s_i)/r \rceil - 1\} \qquad (2)$$

where $cl_i$ is the critical path distance from i to any leaf node, and $s_i$ is the number of i's successors.

Collectively, the reduced set of scheduling variables, the must-schedule constraints, the issue constraints, and the dependency constraints constitute a basic 0-1 integer programming formulation for finding a schedule of length m. Applied iteratively as described above, this formulation produces an optimal instruction schedule.

2.2 Basic Formulation, Experimental Results

The basic formulation is built inside the Gnu Compiler Collection (GCC), and is compared experimentally with critical-path list scheduling. As shown in [2], optimal instruction scheduling for a single-issue processor with a two-cycle maximum latency can be done in polynomial time. However, optimal scheduling for a single-issue processor with a three-cycle maximum latency, the next harder scheduling problem, is considered intractable [2]. If this easiest hard scheduling problem cannot be solved optimally in reasonable time, there is little hope for optimally scheduling more complex processors. Thus, this paper focuses on optimal scheduling for a single-issue processor with a three-cycle maximum latency.

The SPEC95 floating point benchmarks were compiled using GCC 2.8.0 with GCC's instruction scheduler replaced by an optimal instruction scheduler using the basic formulation. The benchmarks were compiled using GCC's highest level of optimization (-O3) and were targeted to a single-issue processor with a maximum latency of three cycles. The target processor has a latency of 3 cycles for loads, 2 cycles for all floating point operations and 1 cycle for all integer operations.
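The bounds of Section 2.1, namely the block-level lower bound L and the per-instruction scheduling range of Equations (1) and (2), can be computed directly from the DAG. The sketch below is illustrative only and is not taken from the paper's implementation. It assumes that $p_i$ and $s_i$ count all (transitive) predecessors and successors, that the node list is supplied in topological order, and that edges are stored as a dictionary; the function names and the `FIG1_*` example data (the Figure 1 DAG as read from the figure) are hypothetical.

```python
from math import ceil


def path_data(nodes, edges):
    """Longest distances and transitive neighbours for a DAG.

    nodes: instruction names in topological order (assumed).
    edges: {(k, i): latency} for each dependency of i on k.
    """
    cr = {i: 0 for i in nodes}         # longest distance from any root to i
    cl = {i: 0 for i in nodes}         # longest distance from i to any leaf
    preds = {i: set() for i in nodes}  # all predecessors of i
    succs = {i: set() for i in nodes}  # all successors of i
    for i in nodes:                    # forward pass in topological order
        for (k, j), lat in edges.items():
            if j == i:
                cr[i] = max(cr[i], cr[k] + lat)
                preds[i] |= preds[k] | {k}
    for i in reversed(nodes):          # backward pass in reverse order
        for (j, k), lat in edges.items():
            if j == i:
                cl[i] = max(cl[i], cl[k] + lat)
                succs[i] |= succs[k] | {k}
    return cr, cl, preds, succs


def block_lower_bound(nodes, edges, r=1):
    """L = 1 + max{c, ceil(n / r) - 1}, with c the DAG's critical path."""
    cr, _, _, _ = path_data(nodes, edges)
    c = max(cr.values())
    return 1 + max(c, ceil(len(nodes) / r) - 1)


def scheduling_ranges(nodes, edges, m, r=1):
    """{i: (L_i, U_i)} from Equations (1) and (2) for a length-m schedule."""
    cr, cl, preds, succs = path_data(nodes, edges)
    return {
        i: (1 + max(cr[i], ceil((1 + len(preds[i])) / r) - 1),
            m - max(cl[i], ceil((1 + len(succs[i])) / r) - 1))
        for i in nodes
    }


# Figure 1 DAG as read from the figure (an assumption).
FIG1_NODES = ["A", "B", "C", "D", "E"]
FIG1_EDGES = {("A", "D"): 2, ("B", "D"): 2, ("B", "E"): 2, ("C", "E"): 2}
```

For this example on a single-issue processor, `block_lower_bound` gives 5, and for m = 5 instruction A has range (1, 3) and instruction D has range (3, 5). An instruction whose pair has a lower bound above its upper bound has an empty scheduling range, which rules out a length-m schedule without solving an integer program.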

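The three constraint families of the basic formulation can also be exercised on a tiny block without an integer-programming solver. The sketch below deliberately swaps the solver for exhaustive enumeration, so it is exponential in the block size and is only meant to make the constraints concrete; it is not the paper's implementation. Instructions are numbered from 0, `x[i][j]` plays the role of $x_i^{j+1}$, and all function names are hypothetical.

```python
from itertools import product


def feasible(x, n, m, edges, r):
    """Check a 0-1 array x[i][j] (instruction i, cycle j + 1) against the
    must-schedule, issue and dependency constraints."""
    # Must-schedule: each instruction occupies exactly one cycle.
    if any(sum(x[i]) != 1 for i in range(n)):
        return False
    # Issue: at most r instructions issue in any cycle.
    if any(sum(x[i][j] for i in range(n)) > r for j in range(m)):
        return False
    # Dependency: sum_j j*x_k^j + L_ki <= sum_j j*x_i^j for each edge (k, i).
    cycle = [sum((j + 1) * x[i][j] for j in range(m)) for i in range(n)]
    return all(cycle[k] + lat <= cycle[i] for (k, i), lat in edges.items())


def find_schedule(n, m, edges, r=1):
    """Return 1-based cycles of some length-m schedule, or None if none exists.

    Enumerating one cycle per instruction satisfies the must-schedule
    constraint by construction; feasible() still checks it for completeness.
    """
    for cycles in product(range(m), repeat=n):
        x = [[int(j == c) for j in range(m)] for c in cycles]
        if feasible(x, n, m, edges, r):
            return [c + 1 for c in cycles]
    return None


def optimal_length(n, edges, upper, lower, r=1):
    """Shrink a known length-`upper` schedule one cycle at a time until the
    next shorter length is infeasible or the lower bound is reached."""
    best = upper
    while best > lower and find_schedule(n, best - 1, edges, r) is not None:
        best -= 1
    return best


# Figure 1 DAG as read from the figure (an assumption): A..E numbered 0..4.
FIG1 = {(0, 3): 2, (1, 3): 2, (1, 4): 2, (2, 4): 2}
```

With a six-cycle list schedule as the upper bound and 5 as the lower bound, `optimal_length(5, FIG1, upper=6, lower=5)` returns 5, and `find_schedule(5, 4, FIG1)` returns `None` because five instructions cannot issue in four single-issue cycles.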

The SPEC95 integer benchmarks are not included in this experiment because for this processor model there would be no instructions with a 2-cycle latency, which makes the scheduling problems easier to solve.

The optimal instruction scheduler is given a 1000 second time limit to find an optimal schedule. If an optimal schedule is not found within the time limit, the best improved schedule produced using integer programming (if any) is selected, otherwise the schedule produced by list scheduling is selected. The integer programs are solved using CPLEX 6.5, a commercial integer-programming solver [12], running on an HP C3000 workstation with a 400MHz PA-8500 processor and 512MB of main memory. The experimental results for the basic formulation are shown in Table 1.

                                                      Basic Formulation
Total Basic Blocks (BB)                                           7,402
BB from List Scheduling shown to be Optimal w/o IP                6,885
BB Passed to IP Formulation                                         517
BB IP Solved Optimally                                              482
BB IP Timed Out                                                      35
BB IP Improved and Optimal                                           15
BB IP Improved but Non-Optimal                                        0
Total Cycles IP Improved                                             15
Total Scheduling Time (sec.)                                     35,879

Table 1: Experimental results using the basic integer programming formulation.

Various observations are made about these data. First, using only list scheduling most of the schedules, 6,885 (93%), are shown to be optimal because the upper bound U equals the lower bound L or because after scheduling range reduction for a schedule of length U − 1, an instruction's scheduling range is empty. This is not surprising because this group of basic blocks includes such trivial problems as basic blocks with one instruction. For the 517 non-trivial problems that require an integer program, 482 (93%) are solved optimally and 35 (7%) are not solved, using a total of 35,879 CPU seconds (10.0 hours). As a point of reference, the entire benchmark suite compiles in 708 seconds (11.8 minutes) when only list scheduling is used. Thus, the basic formulation fails in two important respects: not all basic blocks are scheduled optimally, and the scheduling time is long. Only 15 of the basic blocks (3% of the non-trivial basic blocks) have an improved schedule, and the total static cycle improvement is only 15 cycles, a modest speedup. Speedup will be much higher for a more complex processor (longer latency and wider issue). For example, results in [21] for the multiple-issue Alpha 21164 processor show that list scheduling is suboptimal for more than 50% of the basic blocks studied. The proper conclusion to draw from the results in Table 1 is not that optimal instruction scheduling does not provide significant speedup, but that the basic integer-programming formulation cannot produce optimal schedules in reasonable time, even for the easiest of the hard scheduling problems. A much better approach to optimal instruction scheduling is needed.

3 DAG Transformations

A set of graph transformations is proposed which can simplify the DAG before the integer program is produced. These transformations are fast (low-order polynomial time in the size of the DAG) and are shown to preserve the optimality of the final schedule. The integer program produced from a transformed DAG is significantly simplified and solves much faster.

3.1 DAG Standard Form

The transformations described in the following sections are for DAGs in standard form. A DAG in standard form has a single root node and a single leaf node. A DAG with multiple leaf and root nodes is transformed into a DAG in standard form by adding an artificial leaf node and an artificial root node. The artificial root node is the immediate predecessor of all root nodes. A latency-one edge extends from the artificial root node to each root node in the DAG. Similarly, an artificial leaf node is the immediate successor of all DAG leaf nodes. A latency-one edge extends from each leaf node of the DAG to the artificial leaf node. These artificial nodes do not affect the optimal schedule length of the original DAG nodes and are removed after scheduling is complete.

3.2 DAG Partitioning

Some DAGs can be partitioned into smaller subDAGs which can be optimally scheduled individually, and the subDAG schedules can be recombined to form a schedule that is optimal for the entire DAG. Partitioning is advantageous because the integer-program solution time is super-linear in the size of the DAG, and thus the total time to solve the partitions is less than the time to solve the original problem. Partitioning is also advantageous because even though the original DAG may require an integer program to find its optimal schedule, one or more partitions may be optimal from list scheduling and an integer program is not required for those partitions.

A DAG can be partitioned at a partition node. A partition node is a node that dominates the DAG's leaf node and post-dominates its root node. A partition node forms a barrier in the schedule. No nodes above a partition node may be scheduled after the partition node, and no nodes below the partition node may be scheduled before the partition node. An efficient algorithm for finding partition nodes is described in Section 3.5.1. The algorithm's worst-case execution time is O(e), where e is the number of edges in the DAG.

Figure 3a shows a DAG which can be divided into two partitions at partition node D. Nodes A, B, C, and D form one partition, and nodes D, E, F, and G form the other partition. As illustrated, the two partitions are each optimally scheduled, and the schedules are then combined to form an optimal schedule for the entire DAG.

3.3 Redundant Edge Elimination

A DAG may include redundant edges. An edge between nodes A and B ($edge_{AB}$) is redundant if there is another path $P_{AB}$ from A to B, and the distance along $P_{AB}$ is greater than or equal to the distance across $edge_{AB}$. The distance along a path is the sum of the latencies of the path's edges. The distance across an edge is the edge's latency. In a schedule of the DAG nodes, $edge_{AB}$ enforces the partial order of A and B and the minimum latency between A and B. However, because $P_{AB}$ enforces both of these conditions, $edge_{AB}$ is unnecessary and can be removed. Each DAG edge requires a dependency constraint in the integer program.
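The redundancy test defined above can be sketched as a longest-path check. This is a naive illustration, not the O($n_P e_P$) algorithm of Section 3.5.2: it recomputes paths recursively and can be exponential on large DAGs. The function names are hypothetical, and `EXAMPLE_EDGES` is a made-up DAG shaped like Figure 4 whose latencies are chosen for illustration rather than read from the figure.

```python
def longest_other_path(edges, src, dst, skipped):
    """Longest latency-weighted path from src to dst that avoids `skipped`.

    edges: {(u, v): latency} of an acyclic graph.
    Returns None if no such path exists.
    """
    best = None
    for (u, v), lat in edges.items():
        if u != src or (u, v) == skipped:
            continue
        tail = 0 if v == dst else longest_other_path(edges, v, dst, skipped)
        if tail is not None and (best is None or lat + tail > best):
            best = lat + tail
    return best


def redundant_edges(edges):
    """Edges (A, B) for which another A-to-B path is at least as long as
    the edge's own latency."""
    found = set()
    for (a, b), lat in edges.items():
        distance = longest_other_path(edges, a, b, (a, b))
        if distance is not None and distance >= lat:
            found.add((a, b))
    return found


# Hypothetical DAG in the shape of Figure 4 (latencies are illustrative).
EXAMPLE_EDGES = {
    ("A", "B"): 1, ("A", "C"): 1, ("B", "D"): 1, ("C", "D"): 1,
    ("D", "E"): 2, ("D", "F"): 1, ("E", "G"): 1, ("F", "G"): 1,
    ("B", "E"): 2,
}
```

In this example the path B, D, E has distance 3, which is at least the latency 2 of the edge from B to E, so `redundant_edges(EXAMPLE_EDGES)` reports only that edge.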

Figure 3: Example DAG that can be partitioned. (Panels (a) to (d).)

Figure 4: Example redundant edge, $edge_{BE}$. (Panels (a) and (b).)

Figure 5: Example regions in a DAG. (Panels (a) to (c).)

Therefore, removing redundant edges reduces the integer program's size and hence reduces its solution time. Removing redundant edges can also create new opportunities for partitioning, further simplifying the integer program. An efficient algorithm for finding redundant edges in a DAG partition is described in Section 3.5.2. The algorithm's worst-case execution time is $O(n_P e_P)$, where $n_P$ is the number of nodes and $e_P$ is the number of edges in the DAG partition.

In Figure 4, $edge_{BE}$ is a redundant edge which can be removed. When $edge_{BE}$ is removed, the DAG is reduced to the DAG shown in Figure 3, which can then be partitioned.

3.4 Region Linearization

Region linearization is a DAG transformation which orders the set of nodes contained in a region of the DAG. A region R is defined by a pair of nodes, the entry node A and the exit node B. R is the subDAG induced by node A, node B, and all nodes that are successors of A and predecessors of B, nodes which are called the region's interior nodes. R must contain two disjoint paths from A to B and each path must include a node other than A or B. Figure 5 shows three example regions inside the rectangles. For region EI in Figure 5b, node E is the entry node, node I is the exit node, nodes F, G, and H are the interior nodes, and node X is external to the region.

In region linearization, list scheduling is used to produce a schedule for the instructions in each DAG region. Under certain conditions (described below), the original region subDAG can be replaced by a linear chain of the region's nodes in the order determined by list scheduling, while preserving the optimality of the final overall schedule. This can significantly simplify the DAG, and hence the integer program. This paper considers region linearization for a single-issue processor. Region linearization for multiple-issue processors is also possible and will be considered in a future paper.

3.4.1 Single-Entry Single-Exit Regions

An order O of the nodes in a region can be enforced while preserving optimality if for every optimal schedule of the DAG, the region's nodes can be reordered to the order O, producing a valid schedule with the same length. An order O that satisfies this condition is a dominant order for the region. The simplest type of region for which a dominant order can be found is a single-entry, single-exit (SESE) region. An SESE region is a region where the entry node dominates the interior nodes, and the exit node post-dominates the interior nodes. Figure 5a illustrates an SESE region. SESE regions can be found in O(n) time [18].

Theorem 1: If the schedule of an SESE region meets the following conditions, then the schedule's node order O is a dominant order of the region's nodes:

• The schedule of the interior nodes is dense. A schedule is dense if the schedule contains no empty cycles.

• The first interior node in order O has an incoming edge in the DAG that is one of the minimum-latency edges outgoing from the entry node to the region's interior.

• The last interior node in order O has an outgoing edge in the DAG that is one of the minimum-latency edges incoming to the exit node from the region's interior.

Proof: Assume an optimal schedule where the SESE region's nodes are in order O'. Remove the region's nodes from the schedule. The nodes in order O can always be placed in the vacancies left by the removed nodes. The entry node in order O will remain the first region node in the schedule and the exit node will remain the last region node in the schedule. Thus, order O satisfies the latencies between the region's nodes and the external nodes. A vacancy cannot occur within the minimum latency from the entry node to the interior nodes, nor within the minimum latency to the exit node from the interior nodes. Therefore, the latencies to the first interior node and from the last interior node in order O are satisfied. If the schedule of the vacancies left by the interior nodes in order O' is dense, then the interior node schedule for order O can be placed directly in the vacancies. Or, if the schedule of the vacancies left by the interior nodes of order O' is not dense, the dense schedule of order O's interior nodes can be stretched to match the schedule of the vacancies. The latencies in the stretched schedule between the interior nodes are satisfied, because the latencies are greater than or equal to the latencies in the dense schedule of the interior nodes of order O. □

A dominant order for an SESE region can be enforced in a DAG by replacing the region subDAG by a subDAG with the region's nodes in the dominant order.
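The three conditions of Theorem 1 are mechanical to test once a region schedule is available. The following sketch checks them for a single-issue schedule; it is only an illustration under stated assumptions, not the linearization algorithm of Section 3.5.4, and it does not verify that the schedule itself respects the region's latencies. The function name, the argument layout, and the example region are hypothetical.

```python
def meets_theorem1(entry, exit_, interior_cycles, edges):
    """Check the Theorem 1 conditions for a single-issue region schedule.

    interior_cycles: {interior node: cycle in which it is scheduled}.
    edges: {(u, v): latency} for the region's subDAG.
    """
    order = sorted(interior_cycles, key=interior_cycles.get)
    cycles = [interior_cycles[n] for n in order]
    interior = set(order)

    # Condition 1: dense, i.e. the interior occupies consecutive cycles.
    dense = all(b - a == 1 for a, b in zip(cycles, cycles[1:]))

    # Edges from the entry node into the interior, and from the interior
    # to the exit node, keyed by the interior endpoint.
    into = {v: lat for (u, v), lat in edges.items()
            if u == entry and v in interior}
    out_of = {u: lat for (u, v), lat in edges.items()
              if v == exit_ and u in interior}

    # Condition 2: the first interior node receives a minimum-latency edge
    # from the entry node.
    first_ok = into.get(order[0]) == min(into.values())
    # Condition 3: the last interior node sends a minimum-latency edge
    # to the exit node.
    last_ok = out_of.get(order[-1]) == min(out_of.values())
    return dense and first_ok and last_ok


# Hypothetical SESE region: entry A, exit F, interior nodes B, C and D.
REGION_EDGES = {("A", "B"): 1, ("A", "C"): 2, ("B", "D"): 1,
                ("D", "F"): 1, ("C", "F"): 1}
```

For this made-up region, the interior schedule B, C, D in cycles 2, 3, 4 satisfies all three conditions. The schedule C, B, D in cycles 3, 4, 5 fails the second condition, because the edge from A to C has latency 2 while the minimum latency from the entry node into the interior is 1.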

Figure 6: Examples of region linearization transformation. (Panels (a) to (d).)

Figure 7: Example DAG with nodes in a linear sequence.

The interior nodes in the transformed region are separated by latency-one edges. The entry node is separated from the first interior node and the last interior node is separated from the exit node by the same edges that occur between these nodes in the original DAG. For the DAG in Figure 6a, the node order A, B, C, D, E, F is a dominant order. Figure 6b shows the region after the linearization transformation.

3.4.2 Non-SESE Regions

Regions which are not SESE regions contain side-exit nodes and side-entry nodes. A side-exit node is an interior node with an immediate successor that is not a region node. A side-entry node is an interior node with an immediate predecessor that is not a region node. Figures 5b and 5c are examples of non-SESE regions. In Figure 5b, node G is a side-entry node. In Figure 5c, node K is a side-exit node. The conditions for linearizing non-SESE regions are more restrictive than those for SESE regions because of the additional dependencies to and from nodes outside the region.

Theorem 2: If the schedule of a non-SESE region meets the following conditions then the schedule's node order O is a dominant order of the region's nodes:

• The order O satisfies the conditions for a dominant order of an SESE region.

• All nodes that precede a side-exit node in order O are predecessors of the side-exit node in the DAG.

• All nodes that follow a side-entry node in order O are [...]

[...] the dependencies to side-entry nodes from nodes outside the region are also satisfied. □

An algorithm for finding all of the regions in a DAG partition is described in Section 3.5.3. The algorithm's worst-case execution time is $O(n_P e_P)$, where $n_P$ is the number of nodes and $e_P$ is the number of edges in the DAG partition. In Section 3.5.4, an algorithm for linearizing a region is described which has a worst-case execution time of $O(e_P + n_R^2)$, where $n_R$ is the number of nodes in the region.

Figure 6c shows a non-SESE region with a side-entry node K and a side-exit node I. The node order G, H, I, J, K, L, M is a dominant order for the region. Figure 6d shows the region after the linearization transformation.

3.5 Efficient Algorithms for DAG Transformations

This section describes a set of efficient algorithms for performing the DAG transformations described in Section 3.2 through Section 3.4.

3.5.1 DAG Partitioning Algorithm

An algorithm is proposed for finding the partition nodes of a DAG. If a DAG is drawn as a sequence of nodes in a topological sort [7], then all edges from nodes preceding a partition node terminate at or before the partition node. Figure 7 shows the DAG from Figure 3a redrawn as a linear sequence in the topological sort A, B, C, D, E, F, G. No edge extends across the dashed lines at the partition nodes.¹

An efficient algorithm for finding partition nodes iterates through the nodes of the DAG in a topological sort $O = (N_1, \ldots, N_n)$, where n is the number of nodes in the DAG. A variable latest is maintained across the iterations of the algorithm. latest equals the latest immediate successor of any node before the node of the current iteration. Initially, latest is set equal to the root node. Partition nodes are determined and latest is updated as follows:

for i ← 1 to n

if N = latest

i

successors of the side-entry node in the DAG.

N is a partition no de

i

for each N 2 S (N )

i

k

Pro of: Assume an optimal schedule with the region's no des

0

if N is later in O than latest

k

in order O . Remove the region's no des from the schedule.

latest N

k

The region's no des in order O can always b e placed in the

vacancies left by the removed no des. The order O satis es

where S (N ) is the set of immediate successors of N .

i i

the conditions for a dominantschedule for an SESE region.

The original instruction order of the DAG no des b efore

Therefore, following the pro of of Theorem 1, the dep enden-

instruction scheduling can be used as the top ological sort

cies b etween interior no des are satis ed. Similarly, the de-

for the algorithm. EachDAG edge and no de is considered

p endencies b etween no des outside the region and the region

only once in the execution of the algorithm. Therefore, the

entry and exit no des are satis ed. Only the predecessors

algorithm runs in O (n + e) time, where n is the number of

of a side-exit no de in the DAG precede the side-exit no de

no des and e is the numb er of edges.

in order O . Therefore, the minimum number of nodes pre-

Table 2 illustrates the execution of the algorithm on the

cede each side-exit no de in order O . If the region no des are

example DAG in Figure 7. The column `current latest'

removed from the schedule and reordered in order O , a side-

indicates the value of latest at the start of the iteration.

exit no de cannot b e placed in a vacancy later than the origi-

The column `new latest' indicates the value of latest at

nal lo cation of the side-exit no de in the schedule. Therefore,

the end of the iteration.

the dep endencies from the side-exit no de to no des outside

1

The ro ot and leaf no des of a DAGare,by de nition, partition

the region are satis ed. A symmetric argument proves that no des.

126
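The partition-node loop of Section 3.5.1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `partition_nodes` and the successor map `succ` are hypothetical, and the edge set is an assumption chosen so that the run reproduces the values reported in Table 2 (the actual edges of Figure 7 are not recoverable from the text).

```python
def partition_nodes(order, succ):
    """Return the partition nodes of a DAG.

    order: nodes in a topological sort (e.g. original instruction order).
    succ:  dict mapping each node to a list of its immediate successors.
    """
    pos = {node: i for i, node in enumerate(order)}  # position in the sort
    latest = order[0]          # initially the root node
    partition = []
    for node in order:
        if node == latest:
            # No edge from an earlier node reaches past this node.
            partition.append(node)
        for s in succ.get(node, []):
            if pos[s] > pos[latest]:
                latest = s     # s is later in the order than latest
    return partition


# Hypothetical edge set (an assumption) consistent with Table 2.
order = ["A", "B", "C", "D", "E", "F", "G"]
succ = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E", "F"],
    "E": ["G"],
    "F": ["G"],
}
found = partition_nodes(order, succ)
print(found)
```

For this assumed DAG the function marks A, D and G, which are the rows labeled "yes" in Table 2. Each node and edge is visited once, in line with the stated $O(n + e)$ running time.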

Table 2: Execution of the partitioning algorithm on the DAG in Figure 7.

    iteration i   N_i   current latest   new latest   partition node?
    1             A     A                C            yes
    2             B     C                D            no
    3             C     D                D            no
    4             D     D                F            yes
    5             E     F                G            no
    6             F     G                G            no
    7             G     G                G            yes

3.5.2 An Algorithm for Finding Redundant Edges

An efficient algorithm is proposed for finding the redundant edges of a partition. The algorithm iterates through all nodes in the partition except the last node. At each iteration, each edge extending from the current node is compared with parallel paths in the DAG. Let $O = (N_1, \ldots, N_{n_P})$ be a topological sort of the partition nodes, where $n_P$ is the number of nodes in the partition. Redundant edges are determined as follows:

    for $i \leftarrow 1$ to $n_P - 1$
        for each $N_j \in S(N_i)$
            for each $N_k \in S(N_i)$, where $N_k \neq N_j$
                if $l(edge_{N_i N_k}) + D(N_k, N_j) \geq l(edge_{N_i N_j})$
                    $edge_{N_i N_j}$ is redundant

where $S(N_i)$ is the set of immediate successors of node $N_i$ in the DAG, and $l(edge_{N_i N_k})$ is the latency of $edge_{N_i N_k}$. $D(N_k, N_j)$ is the critical-path distance in the DAG from node $N_k$ to node $N_j$. $D(N_k, N_j) = -\infty$ if no path exists from $N_k$ to $N_j$.

The elimination of redundant edges requires the critical-path distance between all nodes in the partition. An algorithm described in [7] calculates the critical-path distance from one node of a DAG to every other node in the DAG. The worst-case execution time of this algorithm is $O(e_P)$, where $e_P$ is the number of edges in the DAG partition. Therefore, the worst-case execution time for calculating the critical-path distances between all nodes of the partition is $O(n_P e_P)$, where $n_P$ is the number of nodes in the partition. Once critical-path distances are calculated, the algorithm for finding redundant edges iterates through each edge $edge_{N_i N_j}$ of the partition once. At each iteration, $edge_{N_i N_j}$ is compared with all other edges extending from the same node $N_i$. The number of edges which may extend from a single node is $O(n_P)$. Therefore, the worst-case execution time of the redundant-edge finding algorithm is $O(n_P e_P)$. Although the number of edges in a partition can be $O(n_P^2)$, the DAGs from the experiments described in Section 2.2 generally have $O(n_P)$ edges.

3.5.3 An Algorithm for Finding Regions

An efficient algorithm is proposed for finding the regions in a partition. Section 3.4 defines a region by conditions on the paths from the region's entry node to its exit node. A region may be equivalently defined by relative dominance. Node C dominates node B relative to node A if every path in the DAG from A to B includes C, and $C \neq A$. Under this definition, node B dominates itself relative to A. Let $R(A, B)$ be defined as the dominator of B relative to A which is earliest in a topological sort. If A is a predecessor of B, $R(A, B)$ is uniquely defined because each dominator of B relative to A is necessarily a predecessor or successor of any other dominator of B relative to A. If A is not a predecessor of B or A = B, $R(A, B)$ is undefined.

Theorem 3: Nodes A and B define a region if and only if there exist two immediate predecessors of B, nodes $I_1$ and $I_2$, for which $R(A, I_1)$ and $R(A, I_2)$ are defined and $R(A, I_1) \neq R(A, I_2)$.

Proof: Let $I_1$ and $I_2$ be immediate predecessors of B, and let $R(A, I_1)$ and $R(A, I_2)$ be defined with $R(A, I_1) \neq R(A, I_2)$. Because $R(A, I_1) \neq R(A, I_2)$, a common dominator of $I_1$ and $I_2$ relative to A does not exist (excluding A). Accordingly, there must exist disjoint paths $P_{A I_1}$ from A to $I_1$ and $P_{A I_2}$ from A to $I_2$. $P_{A I_1}$ concatenated with $edge_{I_1 B}$ and $P_{A I_2}$ concatenated with $edge_{I_2 B}$ define two disjoint paths from A to B. Therefore, A and B define a region.

Let nodes $A'$ and $B'$ define a region. By definition, there exist two disjoint paths $P_{A'B'}$ and $P'_{A'B'}$ from $A'$ to $B'$. $P_{A'B'}$ and $P'_{A'B'}$ must each include a different immediate predecessor of $B'$, nodes $I'_1$ and $I'_2$ respectively. $A'$ is a predecessor of $I'_1$ and $I'_2$ so $R(A', I'_1)$ and $R(A', I'_2)$ are defined, and $R(A', I'_1) \neq R(A', I'_2)$ because there exist disjoint paths from $A'$ to $I'_1$ and from $A'$ to $I'_2$. □

Let $O = (N_1, \ldots, N_n)$ be a topological sort of the nodes in a partition. The value $R(A, B)$ for each pair of nodes A and B in the partition can be determined as follows:

    for $i \leftarrow 2$ to $n$
        for $j \leftarrow 1$ to $i$
            if $R(N_j, N_k)$ is undefined $\forall N_k \in P(N_i)$
                if $N_j \in P(N_i)$
                    $R(N_j, N_i) \leftarrow N_i$
                else
                    $R(N_j, N_i) \leftarrow$ undefined
            else if $R(N_j, N_{k_1}) \neq R(N_j, N_{k_2})$,
                    for some $N_{k_1} \in P(N_i)$, $N_{k_2} \in P(N_i)$ where
                    $R(N_j, N_{k_1})$ and $R(N_j, N_{k_2})$ are defined
                $R(N_j, N_i) \leftarrow N_i$
                $N_j$ and $N_i$ define a region
            else
                for some $N_k \in P(N_i)$ where $R(N_j, N_k)$ is defined
                $R(N_j, N_i) \leftarrow R(N_j, N_k)$

where $P(N_i)$ is the set of immediate predecessors of node $N_i$ in the DAG.

For each iteration $i$ of the outer loop, every edge terminating at a node preceding $N_i$ in order O is considered $O(1)$ times. Therefore, the worst-case execution time of the algorithm is $O(n_P e_P)$.

3.5.4 An Algorithm for Region Linearization

An algorithm for region linearization is proposed which uses critical-path list scheduling to find a schedule S for a region R. If node order O of S is a dominant order for the region R as defined in Section 3.4 then O is enforced by linearizing the region nodes in the DAG.

The region finding algorithm described in Section 3.5.3 finds the entry node and exit node for each region of a partition. The interior nodes of the region can be determined by performing a forward depth-first search from the entry node and a reverse depth-first search from the exit node. Nodes which are touched in both searches are interior nodes of the region. The searches can also identify side-entry nodes and side-exit nodes.
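The two depth-first searches used in Section 3.5.4 to find a region's interior nodes can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `region_nodes`, the adjacency-map representation and the small example DAG (including the outside nodes X and Y) are all hypothetical. Side-entry and side-exit nodes are classified using the definitions given in Section 3.4.2.

```python
def region_nodes(succ, entry, exit_):
    """Return (interior, side_entry, side_exit) for the region entry..exit_.

    succ: dict mapping each node to a list of its immediate successors.
    """
    # Build the reverse adjacency map for the backward search.
    pred = {}
    for u, vs in succ.items():
        for v in vs:
            pred.setdefault(v, []).append(u)

    def dfs(start, adj):
        """Iterative depth-first search; returns nodes reachable from start."""
        seen, stack = set(), [start]
        while stack:
            u = stack.pop()
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    forward = dfs(entry, succ)    # nodes reachable from the entry node
    backward = dfs(exit_, pred)   # nodes from which the exit node is reachable
    interior = (forward & backward) - {entry, exit_}
    region = interior | {entry, exit_}
    # Side-exit: interior node with an immediate successor outside the region.
    side_exit = {n for n in interior
                 if any(s not in region for s in succ.get(n, []))}
    # Side-entry: interior node with an immediate predecessor outside the region.
    side_entry = {n for n in interior
                  if any(p not in region for p in pred.get(n, []))}
    return interior, side_entry, side_exit


# Hypothetical DAG: a diamond A..D with one edge entering C from outside (X)
# and one edge leaving B to the outside (Y).
succ = {"A": ["B", "C"], "B": ["D", "Y"], "C": ["D"], "X": ["C"]}
interior, side_entry, side_exit = region_nodes(succ, "A", "D")
print(sorted(interior), sorted(side_entry), sorted(side_exit))
```

Each search touches every edge at most once, which agrees with the $O(e_P)$ bound quoted for the two searches.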

By Theorem 2, in scheduling a non-SESE region R, no interior node may be scheduled before a side-exit node which is not a predecessor of the side-exit node in the DAG. Symmetrically, no interior node may be scheduled after a side-entry node which is not a successor of the side-entry node in the DAG. To enforce these constraints during region scheduling, temporary edges are added to the region subDAG. Latency one edges are added from each side-exit node to each interior node which is not a predecessor of the side-exit node. Similarly, latency one edges are added to each side-entry node from every interior node which is not a successor of the side-entry node. If there exists a cycle in the dependency graph after this transformation, no dominant order for the region is found. The temporary dependencies are removed after scheduling the region.

After adding the temporary dependencies, the region nodes are scheduled using critical-path list scheduling. If the node order O of the resulting region schedule is a dominant order of the region, then the region is linearized as described in Section 3.4.

Finding the interior nodes of a region requires two depth-first searches of the partition. The worst-case execution time of the searches is $O(e_P)$. A non-SESE region may have $O(n_R)$ side-exit and side-entry nodes, where $n_R$ is the number of nodes in the region. Each side-exit and side-entry node may require the addition of $O(n_R)$ temporary dependencies to and from other region nodes. Therefore, adding and removing temporary dependencies to a region has a worst-case execution time of $O(n_R^2)$. Critical-path list scheduling of the region has a worst-case execution time of $O(n_R^2)$ [15]. If a dominant order is produced by list scheduling, replacing the region subDAG with a linearized subDAG requires the deletion of $O(n_R)$ edges and the insertion of $O(n_R)$ edges. Collectively, therefore, the worst-case execution time of the region linearization algorithm is $O(e_P + n_R^2)$.

3.6 DAG Transformations, Experimental Results

The experiment described in Section 2.2 was repeated with GCC's instruction scheduler replaced by an optimal instruction scheduler that includes the DAG transformations and the basic integer-program formulation. As before, each basic block is first scheduled using list scheduling and the basic blocks with schedules that are shown to be optimal are removed from further consideration. For this experiment, the DAG transformations are then applied to the remaining basic blocks, and the basic blocks are scheduled again using list scheduling. The resulting schedules are checked for optimality and those basic blocks without optimal schedules are then solved using integer programming. The results from this experiment are shown in Table 3.

Table 3: Experimental results using DAG transformations and the basic integer-programming formulation.

    Total Basic Blocks (BB)                   7,402
    BB Optimal from List Scheduling           6,885
    BB Optimal from DAG Transformations         143
    BB Passed to IP Formulation                 374
    BB IP Solved Optimally                      368
    BB IP Timed Out                               6
    BB IP Improved and Optimal                   23
    BB IP Improved but Not Optimal                1
    Total Cycles IP Improved                     31
    Total Scheduling Time (sec.)              7,743

Various observations can be made by comparing the data in Table 3 with the data obtained using only the basic formulation in Table 1. The DAG transformations reduce the total number of integer programs by 28%, to 374 from 517. The DAG transformations reduce the number of integer programs that time out by 83%, to 6 from 35. Total optimal scheduling time using the DAG transformations is reduced by about five-fold, to 7,743 seconds from 35,879 seconds. The number of basic blocks that have an improved schedule using the DAG transformations increases by 60%, to 24 from 15. There is a 1 cycle average schedule improvement for the 15 blocks which improved using the basic formulation alone. The additional 9 basic blocks which improved using the DAG transformations have an average improvement of 1.8 cycles. This suggests that the DAG transformations help solve the more valuable problems with larger cycle improvements. Overall these data show that graph transformation is an important technology for producing optimal instruction schedules in reasonable time. However, additional new technology is clearly needed.

4 Advanced Integer-Programming Formulation

This section describes an advanced formulation for optimal instruction scheduling which dramatically decreases integer-program solution time compared with the basic formulation.

4.1 Advanced Scheduling-Range Reduction

As described in Section 2.1, the basic formulation uses a technique that can reduce an instruction's scheduling range, and hence the number of scheduling variables. Significant additional scheduling-range reductions are possible.

The basic formulation uses static range reduction based on critical-path distance or the number of successors and predecessors. This technique is termed static range reduction because it is based on static DAG properties. An additional criterion for static range reduction is proposed which uses the number of predecessors (or successors) and the minimum latency from (or to) any immediate predecessor (or successor). For the r-issue processor defined earlier, if instruction $i$ has $p_i$ predecessors, the predecessors will occupy at least $\lfloor p_i / r \rfloor$ cycles. If the minimum latency from a predecessor of $i$ to $i$ is $pred\_minl_i$, $i$ can be scheduled no sooner than cycle $1 + \lfloor p_i / r \rfloor + pred\_minl_i$. Given this additional criterion, Equation 1 can be extended to create a tighter lower bound $L_i$ on $i$'s scheduling range:

$L_i = 1 + \max\{ cr_i,\ \lceil (1 + p_i)/r \rceil - 1,\ \lfloor p_i / r \rfloor + pred\_minl_i \}$

Symmetrically, Equation 2 can be extended to create a tighter upper bound $U_i$ on $i$'s scheduling range:

$U_i = m - \max\{ cl_i,\ \lceil (1 + s_i)/r \rceil - 1,\ \lfloor s_i / r \rfloor + succ\_minl_i \}$

where $succ\_minl_i$ is the minimum latency from $i$ to a successor of $i$.

Figure 8: Example scheduling range reduction based on one-cycle scheduling range. Panels (a)-(c).

A new range reduction technique, iterative range reduction, is proposed. Iterative range reduction uses initial logical implications (described below) to reduce the scheduling range of one or more instructions. This range reduction may in turn allow the ranges of predecessors or successors to be tightened. An instruction $i$'s lower bound can be tightened by selecting the maximum of:

• The static lower bound $L_i$, or

• For each immediate predecessor $j$, $j$'s lower bound plus the latency of $edge_{ji}$.

The second criterion allows $i$'s lower bound to be tightened iteratively as the lower bounds of $i$'s immediate predecessors are tightened iteratively. Similarly, instruction $i$'s upper bound can be tightened by selecting the minimum of:

• The static upper bound $U_i$, or

• For each immediate successor $j$, $j$'s upper bound minus the latency of $edge_{ij}$.
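The predecessor and successor tightening rules of Section 4.1 can be sketched as a simple fixed-point loop. This is a minimal illustration, not the authors' implementation. Two assumptions are made and should be read as such: the bounds are seeded with the trivial range [1, m] instead of the static bounds $L_i$ and $U_i$, and the six-node edge set is a hypothetical reconstruction chosen to be consistent with the ranges printed in Figure 8a for a schedule of length six. The name `propagate_bounds` is likewise hypothetical.

```python
def propagate_bounds(edges, lb, ub):
    """Tighten scheduling ranges along dependency edges until nothing changes.

    edges: dict mapping (pred, succ) -> latency.
    lb/ub: dicts of current lower/upper bounds (cycles) per node.
    """
    lb, ub = dict(lb), dict(ub)   # work on copies
    changed = True
    while changed:
        changed = False
        for (u, v), latency in edges.items():
            # Lower bound of v: predecessor's lower bound plus edge latency.
            if lb[u] + latency > lb[v]:
                lb[v] = lb[u] + latency
                changed = True
            # Upper bound of u: successor's upper bound minus edge latency.
            if ub[v] - latency < ub[u]:
                ub[u] = ub[v] - latency
                changed = True
    return lb, ub


# Assumed edges and latencies (not taken verbatim from the paper).
edges = {("A", "B"): 1, ("A", "C"): 1, ("B", "D"): 2,
         ("C", "E"): 3, ("D", "F"): 1, ("E", "F"): 1}
nodes = "ABCDEF"
m = 6  # target schedule length
lb, ub = propagate_bounds(edges, {n: 1 for n in nodes}, {n: m for n in nodes})
ranges = {n: (lb[n], ub[n]) for n in nodes}
print(ranges)
```

With these assumed edges the loop yields A [1,1], B [2,3], C [2,2], D [4,5], E [5,5] and F [6,6]. An empty range (lower bound above upper bound) after propagation would indicate that no schedule of length m exists.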

Following an initial logical implication, the predecessor and successor range reductions may iteratively propagate through the DAG and may lead to additional logical implications that can reduce scheduling ranges. These new logical implications may in turn allow additional predecessor and successor range reductions. This process can iterate until no further reductions are possible.

For the r-issue processor defined earlier, a logical implication can be made for instructions that have a one-cycle scheduling range. If an instruction $i$ has a one-cycle scheduling range that spans cycle $C_k$, $i$ must be scheduled at cycle $C_k$. If $r$ instructions with one-cycle ranges must be scheduled at cycle $C_k$, no other instruction can be scheduled at cycle $C_k$. Thus, cycle $C_k$ can be removed from the scheduling range of any other instruction $j$ which includes cycle $C_k$. Based on $j$'s range reduction, the ranges of $j$'s predecessors and successors may be reduced. This may lead to additional instructions with one-cycle ranges, and the process may iterate for further range reductions.

Iterative range reduction can lead to scheduling ranges that are infeasible, which implies that no length $m$ schedule exists. Two infeasibility tests are:

• The scheduling range of any node is empty because its upper bound is less than its lower bound.

• For any k-cycle range, more than $rk$ instructions have scheduling ranges that are completely contained within the k cycles, i.e., the scheduling ranges violate the pigeon hole principle [10].

Figure 8 illustrates iterative range reduction using the one-cycle logical implication for a single-issue processor. Figure 8a shows each node labeled with the lower and upper bounds that are computed using static range reduction for a schedule of length six². Nodes A, C, E and F have one-cycle ranges, so the corresponding cycles (1, 2, 5 and 6) are removed from the scheduling ranges of all other nodes, resulting in the scheduling ranges shown in Figure 8b. Predecessor and successor ranges can then be tightened using the iterative range reduction criterion, producing the scheduling ranges shown in Figure 8c. The resulting scheduling ranges are infeasible because node B's scheduling range is empty.

² List scheduling produces a length 7 schedule for this DAG, so an integer program is produced to find the next shorter schedule, as described in Section 2.1.

A second logical implication based on probing is used to reduce scheduling range. Probing is a general approach that can be used for preprocessing any 0-1 integer program to improve its solution time [20]. Probing selects a 0-1 integer program variable and attempts to show that the variable cannot be 1 by assuming the variable's value is 1 and then showing that the resulting problem is infeasible. If the problem is infeasible, by contradiction, the variable's value must be 0 and the variable can be eliminated from the problem. General-purpose probing is used in commercial solvers, but has a very high computation cost [12].

A specific probing technique called instruction probing is proposed for reducing instruction scheduling ranges. Instruction probing is computationally efficient because it exploits knowledge of the instruction scheduling problem and exploits the DAG's structure. Instruction probing is done for each instruction $i$, starting with $i$'s lower bound. A lower-bound probe consists of temporarily setting $i$'s upper bound equal to the current lower bound. This has the effect of temporarily scheduling $i$ at the current lower bound. Based on $i$'s reduced scheduling range, the ranges of $i$'s predecessors are temporarily tightened throughout the DAG. If the resulting scheduling ranges are infeasible, the probe is successful and $i$'s lower bound is permanently increased by 1. Based on $i$'s new lower bound, the ranges of $i$'s successors are permanently tightened throughout the DAG. If the resulting scheduling ranges are infeasible, the overall scheduling problem is infeasible. Otherwise, the new lower bound is probed and the process repeats. If a lower-bound probe is unsuccessful, $i$'s lower-bound probing is complete. $i$'s upper bound is then probed in a symmetric manner.

Figure 9 illustrates the use of instruction probing for a single-issue processor. Figure 9a shows the scheduling ranges that are produced using static range reduction for a schedule of length 8. Figure 9b shows the temporary scheduling ranges that result from probing node B's lower bound. Based on the one-cycle logical implication, cycle 2 is removed from node C's range. Node C's increased lower bound in turn causes the lower bounds for nodes E, G and H to be temporarily tightened. Because node E has a one-cycle range (cycle 5), cycle 5 is removed from node D's range, which in turn causes node F's lower bound to be tightened. Nodes F and G must be scheduled at the same cycle, which is infeasible. Thus, node B cannot be scheduled at cycle 2 and cycle 2 is permanently removed from B's scheduling range. The consequence of B's permanent range reduction is shown in Figure 9c. Based on B's tightened lower bound, the lower bounds of nodes D, F and H are permanently tightened. Based on node B's one-cycle range, cycle 3 is removed from node C's range. Due to node D's one-cycle range, cycle 6 is removed from node G's range. The resulting ranges are infeasible because nodes F and G must be scheduled at the same cycle, thus no 8-cycle schedule exists for the DAG.

Figure 9: Example scheduling range reduction using instruction probing. Panels (a)-(c).

4.2 Optimal Region Scheduling

For a region as defined in Section 3, the critical-path distance between the region's entry and exit nodes may not be sufficiently tight. This can lead to excessive integer-program solution time because the integer program produced from the DAG is under-constrained. A region is loose if the critical-path distance from the region's entry node to its exit node is less than the distance from the entry node to the exit node in an optimal schedule of the region. Clearly in any valid overall schedule, a region's entry and exit nodes can be no closer than the distance between these nodes in an optimal schedule of the region. A loose region can be tightened by computing an optimal schedule for the region to determine the minimum distance between the region's entry and exit nodes. A pseudo edge can then be added to the DAG from the entry node to the exit node, with a latency that equals the distance between these nodes in the optimal region schedule. Pseudo edges may allow the scheduling ranges of the entry and exit nodes to be reduced. These range reductions may then iteratively propagate through the DAG as described in the previous subsection and may result in scheduling ranges that are infeasible for scheduling the basic block in $m$ cycles. Even if the overall scheduling problem is not shown to be infeasible, the problem will solve faster because of the reduced number of scheduling variables.

Figure 10: Example pseudo edge insertion. Panels (a)-(c).

The application of region scheduling is illustrated in Figure 10. Figure 10a shows the scheduling ranges for a schedule of length 12. The region AF is an instance of the region in Figure 8, which has an optimal schedule of length 7. Thus, after region scheduling is applied to region AF a latency 6 pseudo edge is added from node A to node F, as shown in Figure 10b. The pseudo edge causes the lower bounds for nodes F, G, H, I, J and K to be tightened to the ranges shown in Figure 10b. Based on node H's one-cycle range, cycle 8 is removed from node G's range. The increase in node G's lower bound causes node I's lower bound to increase. The resulting ranges are shown in Figure 10c. These ranges are infeasible because nodes I and J must be scheduled at the same cycle. Thus, the overall schedule is shown to be infeasible based on the optimal schedule of only one inner region.

Optimal region schedules are computed starting with the smallest inner regions, progressing outward to the largest outer region, the entire DAG partition. Optimal schedules for the inner regions can usually be found quickly using the infeasibility tests. If an optimal region schedule requires an integer program and the inner regions have pseudo edges, a dependency constraint is produced for each pseudo edge, in the manner described in Section 2.1. As the optimal region scheduling process moves outward, the additional pseudo edges progressively constrain the scheduling of the large outer regions, allowing the optimal scheduling of the outer regions to be solved quickly, usually using only the infeasibility tests.

4.3 Branch-and-Cut

Typically 0-1 integer programs are initially solved as a linear program (LP) in which the solution values can be non-integers [24]. If a variable in the initial solution is a non-integer, branch-and-bound can be used to find an all integer solution [24]. Branch-and-bound selects a branch variable, which is one of the variables that was not an integer in the LP solution. Two subproblems are created in which the branch variable is fixed to 0 and to 1, and the subproblems are solved as LPs. The collection of subproblems forms a branch-and-bound tree [24]. The root of the tree is the initial LP of the integer program and each node is an LP subproblem with some variables fixed to integer values. By solving the LP subproblems, the branch-and-bound tree is traversed until an all integer solution is found, or the problem is determined to be infeasible [24].

For some integer programming applications, the number of nodes in the branch-and-bound tree that must be solved to produce an optimal integer solution can be dramatically reduced by adaptively adding application-specific constraints, or cuts, to the LP subproblem at each node in the branch-and-bound tree. The cuts are designed to eliminate areas of the solution space in the LP subproblem which contain no integer solutions. This enhancement to the branch-and-bound method is called branch-and-cut [24]. Two types of cuts are proposed for solving instruction scheduling integer programs: dependency cuts and spreading cuts.

4.3.1 Dependency Cuts

In an LP solution, an instruction may be fractionally scheduled over multiple cycles. For an instruction $k$ that is data dependent on instruction $j$, fractions of $j$ can be scheduled after all cycles in which $k$ is scheduled, without violating the corresponding dependency constraint. This is illustrated in Table 4, which shows a partial LP solution for a scheduling problem that includes instructions $j$ and $k$, where $k$ is dependent on $j$ with latency 1. The solution satisfies the dependency constraint between $j$ and $k$ because $j$ is scheduled at cycle $1 \times 0.5 + 5 \times 0.5 = 3$, while $k$ is scheduled at cycle 4. However, a fraction of $j$ is scheduled after all fractions of $k$. This invalid solution can be eliminated in the subsequent LP subproblems by adding the following dependency cut for cycle $c$, the last cycle in which $j$ can be scheduled given the position of the last fraction of $k$ in the current LP solution:

$\sum_{i=LB(k)}^{c+l_{jk}} x_k^i \;\leq\; \sum_{i=LB(j)}^{c} x_j^i$

where $LB(j)$ and $LB(k)$ are the scheduling range lower bounds for $j$ and $k$, respectively, and $l_{jk}$ is the latency of the dependency between $j$ and $k$.

For example, the following dependency cut can be added for cycle 3 for the solution in Table 4 to prevent $j$ from being fractionally scheduled after a fraction of $k$ in cycle 4 in subsequent LP subproblems:

$x_j^1 + x_j^2 + x_j^3 \;\geq\; x_k^2 + x_k^3 + x_k^4$

Figure 11: Example DAG with redundant dependency $edge_{CD}$.
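To make the dependency cut of Section 4.3.1 concrete, the sketch below evaluates it on the fractional solution reported in Table 4. It is only an illustration: the helper names `expected_cycle` and `violates_dependency_cut` are hypothetical, a schedule is represented as an assumed dict from cycle to scheduled fraction, and the lower bounds LB(j) = 1 and LB(k) = 2 are read off the index ranges of the example cut in the text.

```python
def expected_cycle(x):
    """Cycle at which the LP 'schedules' an instruction: sum of cycle * fraction."""
    return sum(cycle * frac for cycle, frac in x.items())


def violates_dependency_cut(x_j, x_k, c, latency, lb_j, lb_k, eps=1e-9):
    """True if the solution breaks the cut
    sum_{i=lb_k}^{c+latency} x_k^i <= sum_{i=lb_j}^{c} x_j^i."""
    k_sum = sum(x_k.get(i, 0.0) for i in range(lb_k, c + latency + 1))
    j_sum = sum(x_j.get(i, 0.0) for i in range(lb_j, c + 1))
    return k_sum > j_sum + eps


# Fractional LP solution from Table 4 (k depends on j with latency 1).
x_j = {1: 0.5, 5: 0.5}
x_k = {4: 1.0}

# The ordinary dependency constraint is satisfied: 4 >= 3 + 1.
dependency_ok = expected_cycle(x_k) >= expected_cycle(x_j) + 1
# The cut for cycle c = 3 is violated, so adding it removes this solution.
cut_violated = violates_dependency_cut(x_j, x_k, c=3, latency=1, lb_j=1, lb_k=2)
# An all-integer schedule (j at cycle 3, k at cycle 4) still satisfies the cut.
integer_violated = violates_dependency_cut({3: 1.0}, {4: 1.0}, 3, 1, 1, 2)
print(dependency_ok, cut_violated, integer_violated)
```

The fractional solution passes the basic dependency constraint but fails the cut (1.0 on the k side against 0.5 on the j side), while the integer schedule passes both. This is the behavior a valid cut is expected to have.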

Table 4: Example LP solution to illustrate a dependency cut.

    clock cycle   variables
    C_1           $x_j^1 = 0.5$
    C_2
    C_3
    C_4           $x_k^4 = 1.0$
    C_5           $x_j^5 = 0.5$

4.3.2 Spreading Cuts

In an LP solution, an instruction $k$ may be fractionally scheduled closer to $k$'s immediate predecessors than the latencies allow in an integer solution, but still satisfy the corresponding dependency constraints. This is illustrated in Table 5, which shows a partial LP solution for a scheduling problem which includes instructions $i$, $j$, and $k$. Instruction $k$ is dependent on both $i$ and $j$ with latency 2. The solution satisfies the dependency constraints of the LP because $k$ is scheduled at cycle $3 \times 0.5 + 4 \times 0.5 = 3.5$, and $i$ and $j$ are scheduled at cycle $1 \times 0.5 + 2 \times 0.5 = 1.5$. However, in cycle 3, a fraction of $k$ is scheduled closer to $i$ and $j$ than the latency allows in an all integer solution. This invalid solution can be eliminated in the subsequent LP subproblems by adding the following spreading cut for cycle $c$, the lowest cycle in which a fraction of $k$ is scheduled in the current solution:

$\sum_{i=c-l+1}^{c} x_k^i \;+\; \sum_{I \in P(k)} x_I^{c-l+1} \;\leq\; 1$

where $P(k)$ is the set of immediate predecessors of instruction $k$.

For example, the following spreading cut can be added for cycle 3 for the solution in Table 5 to force the fractions of $i$ and $j$ to be spread further apart from a fraction of $k$ in cycle 3 in subsequent LP subproblems:

$x_k^2 + x_k^3 + x_i^2 + x_j^2 \;\leq\; 1$

Table 5: Example LP solution to illustrate a spreading cut.

    clock cycle   variables
    C_1           $x_i^1 = 0.5$   $x_j^1 = 0.5$
    C_2           $x_i^2 = 0.5$   $x_j^2 = 0.5$
    C_3           $x_k^3 = 0.5$
    C_4           $x_k^4 = 0.5$

4.4 Redundant Constraints

Some constraints in the integer program may be redundant and can be removed. This simplifies the integer program and reduces its solution time.

If an instruction $i$ has a one-cycle scheduling range at cycle C, the must-schedule constraint for instruction $i$ can be eliminated, and the issue constraint for cycle C can be eliminated for a single-issue processor.

The integer program includes a dependency constraint for each DAG edge. The dependency constraint ensures that each instruction is scheduled sufficiently earlier than any of its dependent successors. However, if the scheduling ranges of dependent instructions are spaced far enough apart, the data dependency between the two instructions will necessarily be satisfied in the schedule produced by the integer program. In this case, the dependency constraint for the corresponding edge can be removed from the integer program. More precisely, if an instruction $k$ is dependent on instruction $j$ with latency $L$, then the $j$ to $k$ dependency constraint is redundant if:

$L$ + Upper bound of $j$ $\leq$ Lower bound of $k$

For the example DAG in Figure 11, the dependency constraint for $edge_{CD}$ is redundant because the upper bound for node C is cycle 3, the lower bound for node D is cycle 5 and the latency of $edge_{CD}$ is 1.

4.5 Algebraic Simplification

An algebraic simplification reduces the number of dependency constraint terms and reduces the size of the coefficients of the remaining terms. The basic dependency constraint for $edge_{jk}$ has one term for each cycle in $j$'s range and one term for each cycle in $k$'s range. The dependency constraint is simplified to include terms only for cycles in which $j$'s and $k$'s scheduling ranges overlap. The simplified constraints allow the integer program to solve faster.

Suppose instruction $i$ is dependent on $k$ with latency $L$. The scheduling ranges for $k$ and $i$ are $[a, b]$ and $[c, d]$, respectively, and the dependency constraint is not redundant. That is, $c + 1 \leq b + L$. The basic dependency constraint is:

$\sum_{j=c}^{d} j \cdot x_i^j \;\geq\; \sum_{j=a}^{b} j \cdot x_k^j + L \qquad (3)$

Subtracting c from b oth sides of (3) yields:

b d

X X

j j

Symmetric spreading cuts can be used to prevent in-

+ L c  j  x c j  x

i

k

structions from b eing fractionally scheduled closer to their

j =a j =c

immediate successors than the latencies allowinaninteger solution.

131
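As a numeric aside, the Table 5 example can be replayed in a few lines of Python. The sketch below is illustrative only: the function names and the dictionary representation (clock cycle mapped to the LP value of the corresponding x variable) are assumptions made here, not part of the scheduler described in this paper. It evaluates the basic dependency constraint (3), the left side of the spreading cut of Section 4.3.2, and the redundancy test of Section 4.4.

```python
from fractions import Fraction


def weighted_cycle(x):
    """Return sum_j j * x^j for one instruction (x maps cycle -> LP value)."""
    return sum(cycle * value for cycle, value in x.items())


def dependency_holds(x_pred, x_succ, latency):
    """Basic dependency constraint (3): pred cycle + latency <= succ cycle."""
    return weighted_cycle(x_pred) + latency <= weighted_cycle(x_succ)


def spreading_cut_lhs(x_succ, x_preds, c, latency):
    """Left side of the spreading cut for cycle c; the cut requires it to be <= 1."""
    first = c - latency + 1
    succ_part = sum(x_succ.get(cycle, 0) for cycle in range(first, c + 1))
    pred_part = sum(x.get(first, 0) for x in x_preds)
    return succ_part + pred_part


def is_redundant(latency, upper_pred, lower_succ):
    """Section 4.4 test: L + upper bound of predecessor <= lower bound of successor."""
    return latency + upper_pred <= lower_succ


# Fractional LP solution of Table 5 (k depends on i and j with latency 2).
half = Fraction(1, 2)
x_i = {1: half, 2: half}
x_j = {1: half, 2: half}
x_k = {3: half, 4: half}

print(dependency_holds(x_i, x_k, 2), dependency_holds(x_j, x_k, 2))  # True True
print(spreading_cut_lhs(x_k, [x_i, x_j], c=3, latency=2))            # 3/2
print(is_redundant(1, 3, 5))                                         # True (Figure 11, edge CD)
```

For the Table 5 values both dependency constraints hold with equality (1.5 + 2 = 3.5), while the spreading cut for cycle 3 evaluates to 1.5, which exceeds 1, so the cut removes this fractional solution. The Figure 11 edge passes the redundancy test because 1 + 3 is no greater than 5.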

Because the $x_k$ and $x_i$ variables are nonzero for only one cycle j:

$$\sum_{j=a}^{b} (j + L - c) \cdot x_k^j \le \sum_{j=c}^{d} (j - c) \cdot x_i^j \tag{4}$$

After schedule range tightening, $a \le c - L$, and because the dependency constraint is not redundant, $c - L + 1 \le b$. Accordingly, (4) may be expanded to:

$$\sum_{j=a}^{c-L} (j + L - c) \cdot x_k^j + \sum_{j=c-L+1}^{b} (j + L - c) \cdot x_k^j \le \sum_{j=c}^{d} (j - c) \cdot x_i^j \tag{5}$$

The right side of (5) is nonnegative. If $x_k^j$ is nonzero for a cycle $j \le c - L$, the left side of (5) is nonpositive. Then, the inequality is necessarily satisfied independent of the values of the $x_i$ variables. Therefore, the left summation on the left hand side of (5) may be eliminated, and the dependency constraint becomes:

$$\sum_{j=c-L+1}^{b} (j + L - c) \cdot x_k^j \le \sum_{j=c}^{d} (j - c) \cdot x_i^j \tag{6}$$

Let $M = b + L - c$. The right hand side of (6) can be transformed as follows:

$$\sum_{j=c}^{d} (j - c) \cdot x_i^j = M - \sum_{j=c}^{d} (M - j + c) \cdot x_i^j$$

Because the left hand side of (6) is at most M, all right hand side cycles after $b + L - 1$ don't affect the constraint: they have nonpositive coefficients and can only produce a right hand side result of at least M. Thus the right hand side of (6) can be simplified to:

$$M - \sum_{j=c}^{b+L-1} (M - j + c) \cdot x_i^j \tag{7}$$

Substituting (7) for the right hand side in (6) yields the simplified constraint:

$$\sum_{j=c-L+1}^{b} (j + L - c) \cdot x_k^j \le M - \sum_{j=c}^{b+L-1} (M - j + c) \cdot x_i^j$$

To illustrate the simplification, assume k has scheduling range [8,15], i has scheduling range [14,78], and i is dependent on k with latency 1. The original data dependency constraint:

$$\sum_{j=8}^{15} j \cdot x_k^j + 1 \le \sum_{j=14}^{78} j \cdot x_i^j$$

is simplified to:

$$\sum_{j=14-1+1}^{15} (j + 1 - 14) \cdot x_k^j \le (15 + 1 - 14) - \sum_{j=14}^{15+1-1} (15 + 1 - j) \cdot x_i^j$$

which is:

$$x_k^{14} + 2 x_k^{15} \le 2 - 2 x_i^{14} - x_i^{15}$$

Before simplification the dependency constraint has 74 terms. After simplification there are only 4 variable terms. The maximum sized coefficient has been reduced to 2 from 74.

4.6 Advanced IP Formulation, Experimental Results

The experiment described in Section 2.2 is repeated with GCC's instruction scheduler replaced by an optimal instruction scheduler that includes the DAG transformations and the advanced integer-program formulation. The experimental results in Table 6 show the dramatic improvement provided by the new optimal scheduler. All basic blocks are scheduled optimally and the scheduling time is very reasonable. The graph transformations, advanced range reduction and region scheduling techniques reduce to 22 the number of basic blocks that require an integer program, down from 517 using the basic formulation. These 22 most difficult problems require a total solver time of only 45 seconds, an average of only 2 seconds each. The total increase in compilation time is only 98 seconds, a 14% increase in total compilation time, which includes the time for DAG transformations, advanced range reduction and region scheduling. Total scheduling time is reduced by more than 300-fold compared with the basic formulation. The improvement in code quality is more than 4 times that of the basic formulation, 66 static cycles compared with 15 cycles. The additional 6 basic blocks that are solved optimally using the advanced formulation all have improved schedules, with an average improvement of 5.8 cycles. This suggests that the hardest problems to solve are those which provide the most performance improvement.

Total Basic Blocks (BB)                                                7,402
BB Optimal from List Scheduling                                        6,885
BB Optimal from Graph Transformations                                    143
BB Optimal from Advanced Range Reduction and from Region Scheduling      353
BB Passed to IP Formulation                                               22
BB IP Solved Optimally                                                    22
BB IP Timed Out                                                            0
BB Improved and Optimal                                                   29
BB Improved but Not Optimal                                                0
Total Cycles Improved                                                     66
Total Scheduling Time (sec.)                                              98

Table 6: Experimental results using DAG transformations and the advanced integer programming formulation.

The new approach can optimally schedule very large basic blocks. The scattergram in Figure 12 shows a dot for each of the 517 basic blocks that are processed by the graph transformations and the advanced integer programming formulation. The axes indicate the block's size and the time to optimally schedule that block. This figure shows that many very large blocks, as large as 1000 instructions, are optimally scheduled in a short time.

5 Summary

This paper presents a new approach to optimal instruction scheduling which is fast. The approach quickly identifies most basic blocks for which list scheduling produces an optimal schedule, without using integer programming. For the remaining blocks, a simplified DAG and an advanced integer-programming formulation lead to optimal scheduling times which average a few seconds per block, even for a benchmark set with some very large basic blocks. These results show that the easiest of the hard instruction scheduling
problems can be solved in reasonable time. This is an important first step toward solving harder instruction scheduling problems in reasonable time, including scheduling very large basic blocks for long-latency multiple-issue processors. The proposed approach will also serve as a solid base for future work on optimal formulations of combined local instruction scheduling and local register allocation that can be solved in reasonable time for large basic blocks.

Figure 12: Scattergram of basic-block size versus optimal scheduling time (horizontal axis: Intermediate Instructions in Basic Block, 1 to 10000; vertical axis: Total Scheduling Time in secs., 0.001 to 1000).
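The algebraic simplification of Section 4.5 can be checked by brute force on the example given there. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, and only all-integer schedules are enumerated (each instruction placed in exactly one cycle of its scheduling range). It builds the coefficients of the simplified constraint for k in [8,15], i in [14,78] and latency 1, and confirms that the original and simplified constraints accept exactly the same placements.

```python
def simplified_constraint(a, b, c, d, latency):
    """Coefficients of the simplified constraint for ranges k:[a,b], i:[c,d].

    Returns (M, lhs, rhs), read as
        sum_j lhs[j] * x_k^j  <=  M - sum_j rhs[j] * x_i^j
    """
    M = b + latency - c
    lhs = {j: j + latency - c for j in range(max(a, c - latency + 1), b + 1)}
    rhs = {j: M - j + c for j in range(c, min(d, b + latency - 1) + 1)}
    return M, lhs, rhs


def original_holds(k_cycle, i_cycle, latency):
    # Constraint (3) for a 0-1 solution: each sum collapses to the chosen cycle.
    return k_cycle + latency <= i_cycle


def simplified_holds(k_cycle, i_cycle, M, lhs, rhs):
    # Cycles that were dropped from the constraint contribute a coefficient of 0.
    return lhs.get(k_cycle, 0) <= M - rhs.get(i_cycle, 0)


a, b, c, d, latency = 8, 15, 14, 78, 1
M, lhs, rhs = simplified_constraint(a, b, c, d, latency)

# Compare both constraints on every integer placement of k and i.
equivalent = all(
    original_holds(k, i, latency) == simplified_holds(k, i, M, lhs, rhs)
    for k in range(a, b + 1)
    for i in range(c, d + 1)
)

print(M, lhs, rhs)          # 2 {14: 1, 15: 2} {14: 2, 15: 1}
print(equivalent)           # True
print(len(lhs) + len(rhs))  # 4 variable terms remain
```

The printed coefficients match the simplified constraint derived in the text, $x_k^{14} + 2 x_k^{15} \le 2 - 2 x_i^{14} - x_i^{15}$. The check covers integer solutions only; it says nothing about how the two forms compare on fractional LP solutions.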