Instruction Scheduling Using Integer Programming Optimal
Kent Wilken Jack Liu Mark He ernan
Department of Electrical and Computer Engineering
UniversityofCalifornia,Davis, CA 95616
fwilken,jjliu,[email protected]
Abstract { This paper presents a new approach to 1.1 Lo cal Instruction Scheduling
local instruction scheduling based on integer programming
The lo cal instruction scheduling problem is to nd a min-
that produces optimal instruction schedules in a reasonable
imum length instruction schedule for a basic blo ck. This
time, even for very large basic blocks. The new approach
instruction scheduling problem b ecomes complicated (inter-
rst uses a set of graph transformations to simplify the data-
esting) for pip elined pro cessors b ecause of data hazards and
dependency graph while preserving the optimality of the nal
structural hazards [11 ]. A data hazard o ccurs when an in-
schedule. The simpli edgraph results in a simpli ed integer
struction i pro duces a result that is used bya following in-
program which can be solved much faster. A new integer-
struction j, and it is necessary to delay j's execution until
programming formulation is then applied to the simpli ed
i's result is available. A structural hazard o ccurs when a
graph. Various techniques are used to simplify the formu-
resource limitation causes an instruction's execution to b e
lation, resulting in fewer integer-program variables, fewer
delayed.
integer-program constraints and fewer terms in some of the
The complexity of lo cal instruction scheduling can de-
remaining constraints, thus reducing integer-program solu-
p end on the maximum data-hazard latency which occurs
tion time. The new formulation also uses certain adap-
among the target pro cessor's instructions. In this pap er,
tively addedconstraints (cuts) to reduce solution time. The
latency is de ned to be the di erence between the cycle
proposed optimal instruction scheduler is built within the
in which instruction i executes and the rst cycle in which
Gnu Compiler Col lection (GCC) and is evaluated experi-
data-dep endent instruction j can execute. Note that other
mental ly using the SPEC95 oating point benchmarks. Al-
authors de ne latency (delay) to b e the cycle di erence mi-
though optimal scheduling for the target processor is consid-
nus one, e.g., [2 , 17 ]. We prefer the present de nition b e-
eredintractable, al l of the benchmarks' basic blocks areop-
cause it naturally allows write-after-read data dep endencies
timal ly scheduled, including blocks with up to 1000 instruc-
to b e represented by a latency of 0 (write-after-read dep en-
tions, while total compile time increases by only 14%.
dent instructions can execute in the same cycle on a typical
multiple-issue pro cessor, b ecause the read o ccurs b efore the
1 Intro duction
write within the pip eline).
Instruction scheduling for a single-issue pro cessor with
Instruction scheduling is one of the most imp ortant compiler
a maximum latency of two is easy. Instructions can be
optimizations b ecause of its role in increasing pip eline uti-
optimally scheduled in p olynomial time following the ap-
lization. Conventional approaches to instruction scheduling
proach prop osed by Bernstein and Gertner [2 ]. Instruc-
are based on heuristics and may pro duce schedules that are
tion scheduling for more complex pro cessors is hard. No
sub optimal. Prior work has considered optimal instruction
p olynomial-time algorithm is known for optimally schedul-
scheduling, but no approach has been prop osed that can
ing a single-issue pro cessor with a maximum latency of three
optimally schedule a large numb er of instructions in reason-
or more [2 ]. Optimal scheduling is NP-complete for realistic
able time. This pap er presents a new approachtooptimal
multiple-issue pro cessors [3 ]. Because optimal instruction
instruction scheduling which uses a combination of graph
scheduling for these more complex pro cessors is considered
transformations and an advanced integer-programming for-
intractable, pro duction compilers use sub optimal heuristic
mulation. The new approach pro duces optimal schedules
approaches. The most common approachislist scheduling,
in reasonable time even for scheduling problems with 1000
where instructions are represented as no des in a directed
instructions.
acyclic data-dep endency graph (DAG) [15 ]. A graph edge
represents a data dep endency, an edge weight represents the
This research was supp orted by Equator Technologies, Men-
tor Graphics' Emb edded Software Division, Microsoft Research, the
corresp onding latency, and each DAG no de is assigned a
National Science Foundation's CCR Division under grant #CCR-
priority. Critical-path list scheduling is a common variation,
9711676, and by the University of California's MICRO program.
where an instruction's priority is based on the maximum-
AG from the no de representing Permission to make digital or hard copies of all or part of this work for length path through the D
personal or classroom use is granted without fee provided that copies
the instruction to anyleafnode[15]. While critical-path list
Permissionare not madeto make or distributed digital or for hard profit copies or commercial of all or advantage part of thisand thatwork for
heduling is usually e ective, it can pro duce sub optimal re-
personalcopies or bear classroom this notice use and is the granted full citation without on the fee first provided page. To copy that copies sc
en for small scheduling problems. Consider the DAG
are nototherwise, made or or distributed republish, to forpost profit on servers or commercial or to redistribute advantage to lists, and that sults ev
en from [17 ], where each edge is lab eled with copiesrequires bear thisprior notice specific and permission the full and/or citation a fee. on the first page. To copy in Figure 1, tak
PLDI 2000, Vancouver, British Columbia, Canada.
its latency and each no de is lab eled with its critical-path otherwise,Copyright or republish,2000 ACM 1-58113-199-2/00/0006to post on servers or$5.00. to redistribute to lists, requires prior specific permission and/or a fee. PLDI 2000, Vancouver, British Columbia, Canada. Copyright 2000 ACM 1-58113-199-2/00/0006…$5.00. 121
us the approach do es not app ear practical for large
222 onds). Th
heduling problems. Chang, Chen and King [5]
A B C instruction sc
teger programming formulation that combines
2 2 22 prop ose an in
instruction scheduling with lo cal register allo cation. 0 0 lo cal
D E
Exp erimental results are given for one simple 10-instruction
example whichtakes 20 minutestosolve optimally. These
results suggest that the approach has very limited practical-
Figure 1: Simple example where critical-path list scheduling
ity.
can pro duce a sub optimal schedule.
Although prior work using integer programming has pro-
duced limited success for optimal instruction scheduling, in-
teger programming has b een used successfully to optimally
priority. For this DAG, no des A, B and C all have the same
solvevarious other compiler optimization problems, includ-
priority,sothescheduler can arbitrarily select the order of
ing array dep endence analysis [19 ], data layout for paral-
these instructions. If the initial order is A, C , B or C , A,
lel programs [4 ] and global register allo cation [9 ]. Inte-
B , the next cycle will b e a stall b ecause the latency from B
ger programming is the metho d of choice for solving many
to D or from B to E is not satis ed. Other orders of A, B
large-scale real-world combinatorial optimization problems
and C will pro duce an optimal schedule which has no stall.
in other elds [16 ], including other scheduling problems such
as airline crew scheduling. This successful use of integer
1.2 Optimal Instruction Scheduling
programming elsewhere suggests that improved integer pro-
gramming formulations maybethekey to solving large-scale
Although optimal instruction scheduling for complex pro ces-
instruction scheduling problems.
sors is hard in theory, in practice it maybepossibletoop-
This pap er presents a new approach to optimal instruc-
timally solve imp ortant instances of instruction scheduling
tion scheduling based on integer programming, the rst ap-
problems in reasonable time using metho ds from combinato-
proach which can solve very large scheduling problems in
rial optimization. Prior work has used various combinatorial
reasonable time. The pap er is organized as follows. Section
optimization approaches to optimally schedule instructions
2 describ es a basic integer-programming formulation for op-
for complex pro cessors. However none of these approaches
timal instruction scheduling, which is representativeoffor-
can optimally schedule large basic blo cks in reasonable time.
mulations prop osed in prior work. The material in Section
Ertl and Krall [8] use constraint logic programming and
2provides the reader with background on how instruction
consistency techniques to optimally schedule instructions for
scheduling can be formulated as an integer programming
a single-issue pro cessor with a maximum latency greater
problem, and provides a basis for comparing the new integer-
than two. Vegdahl [23 ] and Kessler [13 ] use dynamic pro-
programming formulation. Exp erimental results in Section
gramming to optimally schedule instructions. Chou and
2show that the basic formulation cannot solve large instruc-
Chung [6 ] and Tomiyama et al. [22 ] use approaches that
tion scheduling problems in reasonable time, which is consis-
implicitly enumerate all p ossible schedules to nd an opti-
tent with the results from prior work. Section 3 intro duces
mal schedule. For eciency, [6 ] and [22 ] prop ose techniques
a set of DAG transformations which can signi cantly sim-
to prune the enumeration tree so that redundantorequiv-
plify the DAG, while preserving the optimalityofthesched-
alentschedules are not explicitly enumerated. Exp erimen-
ule. The simpli ed DAG leads to simpli ed integer pro-
tal results for these various approaches show that they are
grams which are shown exp erimentally to solve signi cantly
e ective only for small to medium-sized basic blo cks (30 in-
faster. Section 4 intro duces a new integer-programming for-
structions or less).
mulation that simpli es the integer program by reducing the
Prior work has also used integer programming to opti-
number of integer-program variables, reducing the number
mally schedule instructions. An integer program is a linear
of integer-program constraints, and simplifying the terms in
program, with the added requirement that all problem vari-
some of the remaining constraints. The simpli ed integer
ables must b e assigned a solution value that is an integer.
programs are shown exp erimentally to solve dramatically
Like the other approaches to optimal instruction schedul-
faster. The last section summarizes the pap er's contribu-
ing, prior work using integer programming has only pro-
tions and outlines future work.
duced approaches that are e ective for small to medium-
sized scheduling problems. Arya [1 ] prop oses an integer pro-
gramming formulation for vector pro cessors, although the
2 Basic Integer-Programming Formulation
basic approach is general and can b e applied to other pro-
cessor typ es. The exp erimental results in [1 ] suggest the
This section describ es a basic integer-programming formula-
formulation can signi cantly improve the instruction sched-
tion for optimal instruction scheduling, which is representa-
ule. However, the results are limited to only three small to
tiveofformulations prop osed in prior work. The basic for-
medium-sized problems (12 to 36 instructions). Only the
mulation provides background for the DAG transformations
smallest problem is solved optimally, with the other prob-
presented in Section 3 and the new integer-programming
lems timing out b efore an optimal solution is found. Leup ers
formulation presented in Section 4.
and Marwedel [14 ] prop ose an integer programming formu-
lation for optimally scheduling a multiple-issue pro cessor.
2.1 Optimal Instruction Scheduling, Basic Formulation
Although their work fo cuses on DSP pro cessors, again the
basic approach is general and can b e used for other pro cessor
In the basic formulation, the basic blo ck is initially sched-
typ es. The exp erimental results in [14 ] are also limited to a
uled using critical-path list scheduling. The length U of the
few small to medium-sized problems (8 to 37 instructions).
resulting schedule is an upp er b ound on the length of an
While the solution time might b e acceptable for the largest
optimal schedule. Next, alower b ound L on the schedule
problem studied (37 instructions solves in 130 seconds), the
length is computed. Given the DAG's critical path c and
solution time do es not app ear to scale well with problem size
the pro cessor's issue rate r ,alowerboundontheschedule
(the next smaller problem, 24 instructions, solves in 7 sec-
for a basic blo ckwithn instructions is:
122
in which instruction i is scheduled. Using this expression, a
x11 ,x ,...,x1 1
t of the following form is pro duced for 1 2 n C dep endency constrain
2 2 2 2
h edge in the DAG to enforce the dep endency of instruc-
x,x,...,x12 n C eac
i on instruction k , where the latency from k to i is the 33 3 C3 tion
x,x,...,x1 2 n
constant L : 44 4 C4 ki
x,x,...,x1 2 n
m m X
55 5 5 X j x,x,...,x C j
12 n
j x + L j x
ki
i
k
j =1 j =1 ork has prop osed a basic metho d for simplifying x,x,...,xmm m Cm Prior w
1 2 n
the integer program and hence reducing integer-program so-
lution time. Upp er and lower b ounds can be determined
for the cycle in which an instruction i can be scheduled,
Figure 2: The arrayof0-1scheduling variables for a basic
thereby reducing an instruction's scheduling range. All of i's
blo ck with n instructions for a schedule of m clo ck cycles.
scheduling variables for cycles outside the scheduling range
set by i's upp er and lower b ounds are unnecessary and can
b e eliminated. After scheduling range reduction, if anyin-
L = 1 + maxfc; dn=r e 1g
struction's scheduling range is empty (its upp er b ound is
If U = L,theschedule is optimal, and an integer program
less than its lower b ound), no length m schedule exists and
is unnecessary.IfU>L,aninteger program is pro duced (as
an integer program is not needed.
describ ed b elow) to nd a length U 1schedule. If the inte-
Chang, Chen and King prop ose using the critical path
ger program is infeasible, the length U schedule is optimal.
distance from anyleafnode(any ro ot no de) or the number
Otherwise a length U 1schedule was found and a second
of successors (predecessors) to determine an upp er bound
integer program is pro duced to nd a length U 2sched-
(lower b ound) for each instruction [5 ]. For the r -issue pro-
ule. This cycle rep eats until a minimum-length schedule is
cessor de ned ab ove, a lower b ound L on i's scheduling
i
found.
range is set by:
To pro duce an m clo ck-cycle schedule for an n-instruction
basic blo ck, a 0-1 integer-program scheduling variable is cre-
L = 1 + maxfcr ; d(1 + p )=r e 1g (1)
i i i
ated for each instruction, for eachclock cycle in the schedule.
j
where cr is the critical path distance from any root node
i
represents the decision to sched- The scheduling variable x
i
to i,andp is the number of i's predecessors. Similarly,an
i
ule (1) or not schedule (0) instruction i in clo ck cycle j . The
upp er b ound U on i's scheduling range is set by:
i
scheduling variables for the corresp onding clo ck cycles are
illustrated in Figure 2.
U = m maxfcl ; d(1 + s )=r e 1g (2)
i i i
A solution for the scheduling variables must be con-
strained so that a valid schedule is pro duced. A constraint
where cl is the critical path distance from i to any leaf no de,
i
is used for each instruction i to ensure that i is scheduled at
and s is the number of i's successors.
i
exactly one of the m cycles, a must-schedule constraint with
Collectively, the reduced set of scheduling variables, the
the following form:
must-schedule constraints, the issue constraints, and the de-
m
X
p endency constraints constitute a basic 0-1 integer program-
j
x =1
i
ming formulation for nding a schedule of length m. Applied
j =1
iteratively as describ ed ab ove, this formulation pro duces an
optimal instruction schedule.
Additional constraints must ensure the schedule meets
the pro cessor's issue requirements. Consider a statically
2.2 Basic Formulation, Exp erimental Results
scheduled r-issue pro cessor that allows any r instructions
to issue in a given clo ck cycle, indep endent of the instruc-
The basic formulation is built inside the Gnu Compiler Col-
tion typ e. For this r -issue pro cessor, an issue constraint of
lection (GCC), and is compared exp erimentally with critical-
the following form is used for each clo ck cycle j :
path list scheduling. As shown in [2 ], optimal instruction-
n
scheduling for a single-issue pro cessor with a two-cycle max-
X
j
imum latency can be done in p olynomial time. However,
x r
i
optimal scheduling for a single-issue pro cessor with a three-
i=1
cycle maximum latency, the next harder scheduling problem,
is considered intractable [2 ]. If this easiest hard schedul-
If a multiple-issue pro cessor has issue restrictions for var-
ing problem cannot b e solved optimally in reasonable time,
ious typ es of instructions, a separate issue constraintcanbe
there is little hop e for optimally scheduling more complex
used for each instruction typ e [14 ].
pro cessors. Thus, this pap er fo cuses on optimal schedul-
A set of dependency constraints is used to ensure that
ing for a single-issue pro cessor with a three-cycle maximum
the data dep endencies are satis ed. For each instruction i,
latency.
the following expression resolves the clo ck cycleinwhich i
The SPEC95 oating pointbenchmarks were compiled
is scheduled:
m
using GCC 2.8.0 with GCC's instruction scheduler replaced
X
j
by an optimal instruction scheduler using the basic formula-
j x
i
tion. The b enchmarks were compiled using GCC's highest
j =1
level of optimization (-O3) and were targeted to a single-
issue pro cessor with a maximum latency of three cycles. The
Because only one x variable is set to 1 and the rest are
i
target pro cessor has a latency of 3 cycles for loads, 2 cycles
set to 0 in an integer program solution, the expression pro-
for all oating point op erations and 1 cycle for all integer
duces the corresp onding co ecient j , whichistheclock cycle
123
transformations are fast (low-order p olynomial time in the op erations. The SPEC95 integer benchmarks are not in-
size of the DAG) and are shown to preserve the optimality cluded in this exp eriment b ecause for this pro cessor mo del
of the nal schedule. The integer program pro duced from a there would b e no instructions with a 2-cycle latency,which
transformed DAG is signi cantly simpli ed and solves much makes the scheduling problems easier to solve.
faster. The optimal instruction scheduler is given a 1000 second
time limit to nd an optimal schedule. If an optimal sched-
ule is not found within the time limit, the b est improved
3.1 DAG Standard Form
schedule pro duced using integer programming (if any) is se-
The transformations describ ed in the following sections are
lected, otherwise the schedule pro duced bylistscheduling is
for DAGs in standard form. ADAG in standard form has
selected. The integer programs are solved using CPLEX 6.5,
a single ro ot no de and a single leaf no de. A DAG with
a commercial integer-programming solver [12 ], running on
multiple leaf and ro ot no des is transformed into a DAGin
an HP C3000 workstation with a 400MHz PA-8500 pro ces-
standard form by adding an arti cial leaf no de and an ar-
sor and 512MB of main memory. The exp erimental results
ti cial ro ot no de. The arti cial ro ot no de is the immediate
for the basic formulation are shown in Table 1.
predecessor of all ro ot no des. A latency-one edge extends
from the arti cial ro ot no de to each root node in the DAG.
Basic Formulation
Similarly, an arti cial leaf no de is the immediate successor
Total Basic Blo cks (BB) 7,402
of all DAG leaf no des. A latency-one edge extends from
BB from List Scheduling
eachleafnodeoftheDAG to the arti cial leaf no de. These
shown to b e Optimal w/o IP 6,885
arti cial no des do not a ect the optimal schedule length of
BB Passed to IP Formulation 517
the original DAG no des and are removed after scheduling is
BB IP Solved Optimally 482
complete.
BB IP Timed Out 35
BB IP Improved and Optimal 15
3.2 DAGPartitioning
BB IP Improved but Non-Optimal 0
Some DAGs can b e partitioned into smaller subDAGs which
Total Cycles IP Improved 15
can be optimally scheduled individually, and the subDAG
Total Scheduling Time (sec.) 35,879
schedules can b e recombined to form a schedule that is op-
timal for the entire DAG. Partitioning is advantageous b e-
Table 1: Exp erimental results using the basic integer pro-
cause the integer-program solution time is sup er-linear in
gramming formulation.
the size of the DAG, and thus the total time to solve the
partitions is less than the time to solve the original prob-
lem. Partitioning is also advantageous b ecause even though
Various observations are made ab out these data. First,
the original DAG may require an integer program to nd
using only list scheduling most of the schedules, 6,885 (93%),
its optimal schedule, one or more partitions may b e optimal
are shown to b e optimal b ecause the upp er b ound U equals
from list scheduling and an integer program is not required
the lower b ound L or b ecause after scheduling range reduc-
for those partitions.
tion for a schedule of length U 1, an instruction's schedul-
ADAG can b e partitioned at a partition node. A par-
ing range is empty. This is not surprising b ecause this group
tition no de is a no de that dominates the DAG's leaf no de
of basic blo cks includes such trivial problems as basic blo cks
and p ost-dominates its ro ot no de. A partition no de forms
with one instruction. For the 517 non-trivial problems that
a barrier in the schedule. No no des ab ove a partition no de
require an integer program, 482 (93%) are solved optimally
may be scheduled after the partition no de, and no no des
and 35 (7%) are not solved using a total of 35,879 CPU sec-
below the partition no de maybescheduled b efore the parti-
onds (10.0 hours). Asapoint of reference, the entire b ench-
tion no de. An ecient algorithm for nding partition no des
mark suite compiles in 708 seconds (11.8 minutes) when only
is describ ed in Section 3.5.1. The algorithm's worst-case ex-
list scheduling is used. Thus, the basic formulation fails in
ecution time is O (e), where e is the numb er of edges in the
two imp ortant resp ects: not all basic blo cks are scheduled
DAG.
optimally,andthescheduling time is long. Only 15 of the
Figure 3a shows a DAGwhich can b e divided into two
basic blo cks (3% of the non-trivial basic blo cks) haveanim-
partitions at partition no de D . No des A, B , C , and D
proved schedule, and the total static cycle improvementis
form one partition, and no des D , E , F , and G form the
only 15 cycles, a mo dest sp eedup. Sp eedup will be much
other partition. As illustrated, the two partitions are each
higher for a more complex pro cessor (longer latency and
optimally scheduled, and the schedules are then combined
wider issue). For example, results in [21 ] for the multiple-
to form an optimal schedule for the entire DAG.
issue Alpha 21164 pro cessor show that list scheduling is sub-
optimal for more than 50% of the basic blo cks studied. The
3.3 Redundant Edge Elimination
prop er conclusion to draw from the results in Table 1 is not
that optimal instruction scheduling do es not provide sig-
A DAG may include redundant edges. An edge between
ni cant sp eedup, but that the basic integer-programming
no des A and B (edg e ) is redundant if there is another
AB
formulation cannot pro duce optimal schedules in reasonable
path P from A to B , and the distance along P is greater
AB AB
time, even for the easiest of the hard scheduling problems.
than or equal to the distance across edg e . The distance
AB
Amuch b etter approach to optimal instruction scheduling
along a path is the sum of the latencies of the path's edges.
is needed.
The distance across an edge is the edge's latency.Inasched-
ule of the DAG no des, edg e enforces the partial order of
AB
3 DAGTransformations
A and B and the minimum latency b etween A and B . How-
ever, b ecause P enforces b oth of these conditions, edg e
AB AB
A set of graph transformations is prop osed which can sim-
is unnecessary and can b e removed. EachDAG edge requires
plify the DAG b efore the integer program is pro duced. These
a dep endency constraint in the integer program. Therefore,
124 A 1 1 A A C 1 1 B C B A E X B C 1 2 D C 1 D B 1 1 1 2 A J D 1 1 G 1 1 D 2 1 F F 1 B C D E K L 2 D 1 E F 1 2 2 2 H 1 2 F G D 3 Y M 11 E F E G I 1 1 G G (a) (b) (c)
(a) (b) (c) (d)
Figure 5: Example regions in a DAG.
Figure 3: Example DAG that can b e partitioned. Single-Entry Single-Exit Regions A A 3.4.1
11 11
O of the no des in a region can b e enforced while B C B C An order
1 1 2
optimality if for every optimal schedule of the
2 preserving
AG, the region's no des can b e reordered to the order O , 1 D D D
2 1 2 1
pro ducing a valid schedule with the same length. An order
that satis es this condition is a dominant order for the E F E F O
1 1 1 1
region. The simplest typ e of region for which a dominantor-
G G
der can b e found is a single-entry, single-exit (SESE) region.
An SESE region is a region where the entry no de dominates
(a) (b)
the interior no des, and the exit no de p ost-dominates the in-
terior no des. Figure 5a illustrates an SESE region. SESE
regions can b e found in O (n) time [18 ].
Figure 4: Example redundant edge, edg e .
BE
Theorem 1: If the schedule of an SESE region meets the
fol lowing conditions, then the schedule's node order O is a
removing redundant edges reduces the integer program's size
dominant order of the region's nodes:
and hence reduces its solution time. Removing redundant
The schedule of the interior nodes is dense. Aschedule
edges can also create new opp ortunities for partitioning, fur-
is dense if the schedule contains no empty cycles.
ther simplifying the integer program. An ecient algorithm
for nding redundant edges in a DAG partition is describ ed
The rst interior node in order O has an incoming edge
in Section 3.5.2. The algorithm's worst-case execution time
in the DAG that is one of the minimum-latency edges
is O (n e )wheren is the numb er of no des and e is the
P P P P
outgoing from the entry node to the region's interior.
numb er of edges in the DAG partition.
In Figure 4, edg e is a redundantedge whichcanbe
BE
The last interior node in order O has an outgoing edge
removed. When edg e is removed, the DAG is reduced to
BE
in the DAG that is one of the minimum-latency edges
the DAGshown in Figure 3, which can then b e partitioned.
incoming to the exit node from the region's interior.
Pro of: Assume an optimal schedule where the SESE re-
3.4 Region Linearization
0
gion's no des are in order O . Remove the region's no des
Region linearization isaDAG transformation whichorders
from the schedule. The no des in order O can always be
the set of no des contained in a region of the DAG. A region
placed in the vacancies left by the removed no des. The en-
R is de ned by a pair of no des, the entry no de A and the
try no de in order O will remain the rst region no de in the
exit no de B . R is the subDAG induced bynodeA, no de
schedule and the exit no de will remain the last region no de
B , and all no des that are successors of A and predecessors
in the schedule. Thus, order O satis es the latencies be-
of B , no des which are called the region's interior nodes. R
tween the region's no des and the external no des. Avacancy
must contain two disjoint paths from A to B and each path
cannot occur within the minimum latency from the entry
must include a no de other than A or B. Figure 5 shows
no de to the interior no des, nor within the minimum latency
three example regions inside the rectangles. For region EI
to the exit no de from the interior no des. Therefore, the la-
in Figure 5b, no de E is the entry no de, no de I is the exit
tencies to the rst interior no de and from the last interior
no de, no des F , G,andH are the interior no des, and no de
no de in order O are satis ed. If the schedule of the vacan-
0
X is external to the region.
cies left bytheinterior no des in order O is dense, then the
In region linearization, list scheduling is used to pro duce
interior no de schedule for order O can b e placed directly in
aschedule for the instructions in eachDAG region. Under
the vacancies. Or, if the schedule of the vacancies left by
0
certain conditions (describ ed b elow), the original region sub-
the interior no des of order O is not dense, the dense sched-
DAG can b e replaced by a linear chain of the region's no des
ule of order O 's interior no des can be stretched to match
in the order determined by list scheduling, while preserving
the schedule of the vacancies. The latencies in the stretched
the optimality of the nal overall schedule. This can signif-
schedule between the interior no des are satis ed, b ecause
icantly simplify the DAG, and hence the integer program.
the latencies are greater than or equal to the latencies in
This pap er considers region linearization for a single-issue
the dense schedule of the interior no des of order O . 2
pro cessor. Region linearization for multiple-issue pro cessors
A dominant order for an SESE region can b e enforced
is also p ossible and will b e considered in a future pap er.
in a DAGby replacing the region subDAGbya subDAG
with the region's no des in the dominant order. The interior
125 1 1 G 2 1 1 A B C D E F G A H 1 2 2 G 1 1 1 1 1 I A B 2 2 2 J 1 1 X H 1 Y B C C 1 J
K 1 X
Example DAG with no des in a linear sequence. 11 1 1 Figure 7: 1 1 I D D 2 K E L 1 1 3 2 2 2 Y
F E M L try no des from no des outside the 2 2 the dep endencies to side-en
F M
region are also satis ed.2
An algorithm for nding all of the regions in a DAGpar-
(a) (b) (c) (d)
tition is describ ed in Section 3.5.3. The algorithm's worst-
case execution time is O (n e )where n is the number of
P P P
no des and e is the numb er of edges in the DAG partition.
Figure 6: Examples of region linearization transformation.
P
In Section 3.5.4, an algorithm for linearizing a region is de-
2
scrib ed which has a worst-case execution time of O (e + n ),
P
R
where n is the numb er of no des in the region.
no des in the transformed region are separated by latency one
R
Figure 6c shows a non-SESE region with a side-entry
edges. The entry no de is separated from the rst interior
no de K andaside exit no de I . The no de order G, H , I ,
no de and the last interior no de is separated from the exit
J , K , L, M is a dominant order for the region. Figure 6d
no de by the same edges that o ccur b etween these no des in
shows the region after the linearization transformation.
the original DAG. For the DAG in Figure 6a, the no de order
A, B , C , D , E , F is a dominant order. Figure 6b shows the
region after the linearization transformation.
3.5 Ecient Algorithms forDAGTransformations
This section describ es a set of ecient algorithms for p er-
3.4.2 Non-SESE regions
forming the DAG transformations describ ed in Section 3.2
through Section 3.4.
Regions which are not SESE regions contain side-exit nodes
and side-entry nodes. A side-exit no de is an interior no de
with an immediate successor that is not a region no de. A
3.5.1 DAGPartitioning Algorithm
side-entrynodeisaninterior no de with an immediate pre-
An algorithm is prop osed for nding the partition no des of
decessor that is not a region no de. Figures 5b and 5c are
a DAG. If a DAG is drawn as a sequence of no des in a
examples of non-SESE regions. In Figure 5b, no de G is a
topological sort [7], then all edges from no des preceding a
side-entry no de. In Figure 5c, no de K is a side-exit no de.
partition no de terminate at or before the partition no de.
The conditions for linearizing non-SESE regions are more
Figure 7 shows the DAG from Figure 3a redrawn as a linear
restrictive than those for SESE regions b ecause of the addi-
sequence in the top ological sort A, B , C , D , E , F , G. No
tional dep endencies to and from no des outside the region.
1
edge extends across the dashed lines at the partition no des .
Theorem 2: If the schedule of a non-SESE region meets
An ecient algorithm for nding partition no des iterates
the fol lowing conditions then the schedule's node order O is
through the no des of the DAG in a top ological sort O =
a dominant order of the region's nodes:
(N ;:::;N ), where n is the number of nodes in the DAG.
1 n
Avariable latest is maintained across the iterations of the
The order O satis es the conditions for a dominant
algorithm. latest equals the latest immediate successor of
order of an SESE region.
any no de b efore the no de of the current iteration. Initially,
latest is set equal to the ro ot no de. Partition no des are
All nodes that precede a side-exit node in order O are
determined and latest is up dated as follows:
predecessors of the side-exit node in the DAG.
for i 1ton
All nodes that fol low a side-entry node in order O are
if N = latest
i
successors of the side-entry node in the DAG.
N is a partition no de
i
for each N 2 S (N )
i
k
Pro of: Assume an optimal schedule with the region's no des
0
if N is later in O than latest
k
in order O . Remove the region's no des from the schedule.
latest N
k
The region's no des in order O can always b e placed in the
vacancies left by the removed no des. The order O satis es
where S (N ) is the set of immediate successors of N .
i i
the conditions for a dominantschedule for an SESE region.
The original instruction order of the DAG no des b efore
Therefore, following the pro of of Theorem 1, the dep enden-
instruction scheduling can be used as the top ological sort
cies b etween interior no des are satis ed. Similarly, the de-
for the algorithm. EachDAG edge and no de is considered
p endencies b etween no des outside the region and the region
only once in the execution of the algorithm. Therefore, the
entry and exit no des are satis ed. Only the predecessors
algorithm runs in O (n + e) time, where n is the number of
of a side-exit no de in the DAG precede the side-exit no de
no des and e is the numb er of edges.
in order O . Therefore, the minimum number of nodes pre-
Table 2 illustrates the execution of the algorithm on the
cede each side-exit no de in order O . If the region no des are
example DAG in Figure 7. The column `current latest'
removed from the schedule and reordered in order O , a side-
indicates the value of latest at the start of the iteration.
exit no de cannot b e placed in a vacancy later than the origi-
The column `new latest' indicates the value of latest at
nal lo cation of the side-exit no de in the schedule. Therefore,
the end of the iteration.
the dep endencies from the side-exit no de to no des outside
1
The ro ot and leaf no des of a DAGare,by de nition, partition
the region are satis ed. A symmetric argument proves that no des.
126
is earliest in a top ological sort. If A is a predecessor of B ,
current new partition
R (A; B ) is uniquely de ned b ecause each dominator of B
iteration i N latest latest no de?
i
relativetoA is necessarily a predecessor or successor of any
1 A A C yes
other dominator of B relativetoA.IfA is not a predecessor
2 B C D no
of B or A = B , R (A; B ) is unde ned.
3 C D D no
Theorem 3: Nodes A and B de ne a region if and only
4 D D F yes
if there exist two immediate predecessors of B , nodes I
1
5 E F G no
and I , for which R (A; I ) and R (A; I ) are de ned and
2 1 2
6 F G G no
R (A; I ) 6= R (A; I ).
1 2
7 G G G yes
Pro of: Let I and I b e immediate predecessors of B ,and
1 2
R (A; I ) and R (A; I ) are de ned and R (A; I ) 6= R (A; I ).
1 2 1 2
Table 2: Execution of the partitioning algorithm on the
Because R (A; I ) 6= R (A; I ), a common dominator of I and
1 2 1
DAG in Figure 7.
I relativetoA do es not exist (excluding A). Accordingly,
2
there must exist disjoint paths P from A to I and P
AI 1 AI
1 2
from A to I . P concatonated with edg e and P
2 AI I B AI
1 1 2
3.5.2 An Algorithm for Finding Redundant Edges
concatonated with edg e de ne two disjoint paths from A
I B
2
to B . Therefore, A and B de ne a region.
An ecient algorithm is prop osed for nding the redundant
0 0
Let no des A and B de ne a region. By de nition, there
edges of a partition. The algorithm iterates through all
0 0 0
0 0
exist two disjoint paths P and P from A and B .
0 0
A B
A B
no des in the partition except the last no de. Ateach itera-
0
0 0
and P must each include a di erent immediate P
0 0
A B
A B
tion, each edge extending from the current no de is compared
0 0 0
predecessor of B ,nodesI and I resp ectively. A is a pre-
1 2
with parallel paths in the DAG. Let O =(N ;:::;N )be
1 n
P
0 0 0 0
decessor of I and I so R (A; I )and R (A; I ) are de ned,
1 2 1 2
a top ological sort of the partition no des, where n is the
P
0 0
and R (A; I ) 6= R (A; I ) b ecause there exist disjoint paths
1 2
numb er of no des in the partition. Redundant edges are de-
0 0
from A to I and from A to I . 2
1 2
termined as follows:
Let O =(N ;::;N ) b e a top ological sort of the no des
1 n
in a partition. The value R (A; B ) for each pair of no des A
for i 1ton 1
P
and B in the partition can b e determined as follows:
for each N 2 S (N )
j i
for each N 2 S (N ), where N 6= N
i j
k k
for i 2ton
if l (edg e )+D (N ;N ) l (edg e )
N N j N N
k
i i j
k
for j 1toi
is redundant edg e
N N
i j
if R (N ;N ) is unde ned 8 N 2 P (N )
j i
k k
if N 2 P (N )
j i
where S (N ) is the set of immediate successors of no de N
i i
R (N ;N ) N
j i i
in the DAG, and l (edg e ) is the latency of edg e .
N N N N
i i
k k
else
D (N ;N ) is the critical-path distance in the DAG from
j
k
R (N ;N ) unde ned
j i
no de N to no de N . D (N ;N )= 1 if no path exists
j j
k k
), ) 6= R (N ;N else if R (N ;N
j j
k k
2 1
from N to N .
i
k
for some N 2 P (N );N 2 P (N ) where
i i
k k
1 2
The elimination of redundant edges requires the critical-
R (N ;N )andR (N ;N ) are de ned
j j
k k
1 2
path distance b etween all no des in the partition. An algo-
R (N ;N ) N
j i i
rithm describ ed in [7 ] calculates the critical-path distance
N and N de ne a region
j i
from one no de of a DAGtoevery other no de in the DAG.
else
The worst-case execution time of this algorithm is O (e ),
P
for some N 2 P (N ) where R (N ;N ) is de ned
i j
k k
where e is the number of edges in the DAG partition.
P
R (N ;N ) R (N ;N )
j i j
k
Therefore, the worst-case execution time for calculating the
critical-path distances b etween all no des of the partition is
where P (N )is the set of immediate predecessors of no de
i
O (n e ), where n is the number of no des in the parti-
P P P
N in the DAG.
i
tion. Once critical-path distances are calculated, the algo-
For each iteration i of the outer lo op, every edge ter-
rithm for nding redundant edges iterates through each edge
minating at a no de preceding N in order O is considered
i
edg e of the partition once. Ateach iteration, edg e
N N N N
i j i j
O (1) times. Therefore, the worst case execution time of the
is compared with all other edges extending from the same
algorithm is O (n e ).
P P
no de N . The number of edges whichmay extend from a
i
single no de is O (n ). Therefore, the worst-case execution
P
3.5.4 An Algorithm for Region Linearization
time of the redundant-edge nding algorithm is O (n e ).
P P
2
Although the numb er of edges in a partition can b e O (n ),
P
An algorithm for region linearization is prop osed which uses
the DAGs from the exp eriments describ ed in 2.2 generally
critical-path list scheduling to nd a schedule S foraregion
have O (n ) edges.
P
R . If no de order O of S is a dominant order for the region
R as de ned in Section 3.4 then O is enforced by linearizing
3.5.3 An Algorithm for Finding Regions
the region no des in the DAG.
The region nding algorithm describ ed in Section 3.5.3
An ecient algorithm is prop osed for nding the regions in
nds the entry no de and exit no de for each region of a parti-
a partition. Section 3.4 de nes a region by conditions on
tion. The interior no des of the region can b e determined by
the paths from the region's entry no de to its exit no de. A
p erforming a forward depth- rst search from the entry no de
region maybe equivalently de ned by relative dominance.
and a reverse depth- rst search from the exit no de. No des
No de C dominates no de B relativetonodeA if every path
which are touched in b oth searches are interior no des of the
in the DAG from A to B includes C , and C 6= A. Under
region. The searches can also identify side-entry no des and
this de nition, no de B dominates itself relativetoA. Let
side-exit no des.
R (A; B ) b e de ned as the dominator of B relativetoA which
127
By Theorem 2, in scheduling a non-SESE region R ,noin-
Total Basic Blo cks (BB) 7,402
terior no de maybescheduled b efore a side-exit no de whichis
BB Optimal from List Scheduling 6885
not a predecessor of the side-exit no de in the DAG. Symmet-
BB Optimal from DAGTransformations 143
rically,nointerior no de maybescheduled after a side-entry
BB Passed to IP Formulation 374
no de which is not a successor of the side-entry no de in the
BB IP Solved Optimally 368
DAG. To enforce these constraints during region scheduling,
BB IP Timed Out 6
temp orary edges are added to the region subDAG. Latency
BB IP Improved and Optimal 23
one edges are added from each side-exit no de to eachinterior
BB IP Improved but Not Optimal 1
no de which is not a predecessor of the side-exit no de. Sim-
Total Cycles IP Improved 31
ilarly, latency one edges are added to each side-entry no de
Total Scheduling Time (sec.) 7,743
from every interior no de which is not a successor of the side-
entry no de. If there exists a cycle in the dep endency graph
Table 3: Exp erimental results using DAG transformations
after this transformation, no dominant order for the region
and the basic integer-programming formulation.
is found. The temp orary dep endencies are removed after
scheduling the region.
After adding the temp orary dep endencies, the region
no des are scheduled using critical-path list scheduling. If
solve the more valuable problems with larger cycle improve-
thenodeorderO of the resulting region schedule is a dom-
ments. Overall these data show that graph transformation
inant order of the region, then the region is linearized as
is an imp ortanttechnology for pro ducing optimal instruc-
describ ed in Section 3.4.
tion schedules in reasonable time. However additional new
Finding the interior no des of a region requires two depth-
technology is clearly needed.
rst searches of the partition. The worst-case execution
time of the searches is O (e ). A non-SESE region may
P
4 Advanced Integer-Programming Formulation
have O (n ) side-exit and side-entry no des, where n is the
R R
number of no des in the region. Each side-exit and side-
This section describ es an advanced formulation for optimal
entry no de may require the addition of O (n ) temp orary
R
instruction scheduling which dramatically decreases integer-
dep endences to and from other region no des. Therefore,
program solution time compared with the basic formulation.
adding and removing temp orary dep endencies to a region
2
has a worst-case execution time of O (n ). Critical-path list
R
4.1 Advanced Scheduling-Range Reduction scheduling of the region has a worst-case execution time of
2
O (n ) [15 ]. If a dominant order is pro duced by list schedul-
R
As describ ed in Section 2.1, the basic formulation uses a
ing, replacing the region subDAG with a linearized subDAG
technique that can reduce an instruction's scheduling range,
requires the deletion of O (n ) edges and the insertion of
R
and hence the number of scheduling variables. Signi cant
O (n ) edges. Collectively, therefore, the worst-case execu-
R
additional scheduling-range reductions are p ossible.
2
tion time of the region linearization algorithm is O (e + n ).
P
R
The basic formulation uses static range reduction based
on critical-path distance or the number of successors and
3.6 DAGTransformations, Exp erimental Results
predecessors. This technique is termed static range reduc-
tion b ecause it is based on static DAG prop erties. An addi-
The exp eriment describ ed in Section 2.2 was rep eated with
tional criterion for static range reduction is prop osed which
GCC's instruction scheduler replaced by an optimal instruc-
uses the numb er of predecessors (or successors) and the min-
tion scheduler that includes the DAG transformations and
imum latency from (or to) any immediate predecessor (or
the basic integer-program formulation. As b efore, each ba-
successor). For the r -issue pro cessor de ned earlier, if in-
sic blo ckis rstscheduled using list scheduling and the ba-
struction i has p predecessors, the predecessors will o ccupy
i
sic blo cks with schedules that are shown to b e optimal are
at least bp =r c cycles. If the minimum latency from a prede-
i
removed from further consideration. For this exp eriment,
cessor of i to i is pr ed minl , i can b e scheduled no so oner
i
the DAG transformations are then applied to the remaining
minl . Given this additional than cycle 1 + bp =r c + pr ed
i i
basic blo cks, and the basic blo cks are scheduled again using
criterion, Equation 1 can be extended to create a tighter
list scheduling. The resulting schedules are checked for opti-
lower b ound L on i's scheduling range:
i
mality and those basic blo cks without optimal schedules are
then solved using integer programming. The results from
L =1+maxfcr ; d(1 + p )=r e 1; bp =r c + pr ed minl g
i i i i i
this exp eriment are shown in Table 1.
Various observations can b e made by comparing the data
Symmetrically, Equation 2 can b e extended to create a
in Table 3 with the data obtained using only the basic for-
tighter upp er b ound U on i's scheduling range:
i
mulation in Table 1. The DAG transformations reduce the
total number of integer programs by 28%, to 374 from 517.
U = m maxfcl ; d(1 + s )=r e 1; bs =r c + succ minl g
i i i i i
The DAG transformations reduce the number of integer pro-
where succ minl is the minimum latency from i to a suc-
grams that time out by83%, to 6 from 35. Total optimal
i
cessor of i.
scheduling time using the DAG transformations is reduced
A new range reduction technique, iterative range reduc-
by ab out ve-fold, to 7743 seconds from 35,879 seconds. The
tion, is prop osed. Iterative range reduction uses initial logi-
numb er of basic blo cks that have an improved schedule us-
cal implications (describ ed b elow) to reduce the scheduling
ing the DAG transformations increases by60%,to24from
range of one or more instructions. This range reduction may
15. There is a 1 cycle average schedule improvement for
in turn allow the ranges of predecessors or successors to b e
the 15 blo cks which improved using the basic formulation
tightened. An instruction i's lower b ound can b e tightened
alone. The additional 9 basic blo cks whichimproved using
by selecting the maximum of:
the DAG transformations haveanaverage improvementof
1.8 cycles. This suggests that the DAG transformations help
The static lower b ound L ,or i
128
For each immediate predecessor j , j 's lower b ound plus [1,1] [1,1] [1,1]
A A A
the latency of edg e .
ji 11 1 1 1 1 [2,3] [3,3] [3,2] [2,2]
B C [2,2] B C [2,2] B C
ws i's lower b ound to b e tight-
The second criterion allo 2 3 2 3 2 3
ely as the lower b ounds of i's immediate pre- ened iterativ [4,4] [4,4] [5,5]
[4,5] D E [5,5] D E [5,5] D E
tened iteratively. Similarly, instruction i's decessors are tigh 1 1 1 1 1 1
F F F
tened by selecting the minimum of: upp er b ound can b e tigh [6,6] [6,6] [6,6]
(a) (b) (c)
The static upp er b ound U ,or
i
For each immediate successor j , j 's upp er b ound minus
Figure 8: Example scheduling range reduction based on one-
the latency of edg e .
ij
cycle scheduling range.
Following an initial logical implication, the predecessor
and successor range reductions may iteratively propagate
b e 0 and the variable can b e eliminated from the problem.
through the DAG and may lead to additional logical impli-
General-purp ose probing is used in commercial solvers, but
cations that can reduce scheduling ranges. These new logical
has a very high computation cost [12 ].
implications may in turn allow additional predecessor and
A sp eci c probing technique called instruction probing
successor range reductions. This pro cess can iterate until
is prop osed for reducing instruction scheduling ranges. In-
no further reductions are p ossible.
struction probing is computationally ecient b ecause it ex-
For the r -issue pro cessor de ned earlier, a logical im-
ploits knowledge of the instruction scheduling problem and
plication can b e made for instructions that have a one-cycle
exploits the DAG's structure. Instruction probing is done
scheduling range. If an instruction i has a one-cycle schedul-
for each instruction i, starting with i's lower b ound. A lower-
k
ing range that spans cycle C , i must b e scheduled at cycle
b ound prob e consists of temp orarily setting i's upp er b ound
k
C . If r instructions with one-cycle ranges must b e sched-
equal to the current lower b ound. This has the e ect of
k
uled at cycle C , no other instruction can b e scheduled at
temp orarily scheduling i at the currentlower b ound. Based
k k
cycle C . Thus, cycle C can b e removed from the schedul-
on i's reduced scheduling range, the ranges of i's predeces-
k
ing range of any other instruction j which includes cycle C .
sors are temp orarily tightened throughout the DAG. If the
Based on j 's range reduction, the ranges of j 's predecessors
resulting scheduling ranges are infeasible, the prob e is suc-
and successors may b e reduced. This may lead to additional
cessful and i's lower b ound is p ermanently increased by1.
instructions with one-cycle ranges, and the pro cess mayit-
Based on i's new lower b ound, the ranges of i's successors
erate for further range reductions.
are p ermanently tightened throughout the DAG. If the re-
Iterative range reduction can lead to scheduling ranges
sulting scheduling ranges are infeasible, the overall schedul-
that are infeasible, which implies that no length m schedule
ing problem is feasible. Otherwise, the new lower b ound is
exists. Two infeasibility tests are:
prob ed and the pro cess rep eats. If a lower-b ound prob e is
unsuccessful, i's lower-b ound probing is complete. i's upp er
The scheduling range of any no de is empty b ecause its
b ound is then prob ed in a symmetric manner.
upp er b ound is less than its lower b ound.
Figure 9 illustrates the use of instruction probing for
a single-issue pro cessor. Figure 9a shows the scheduling
For any k -cycle range, more than rk instructions have
ranges that are pro duced using static range reduction for
scheduling ranges that are completely contained within
a schedule of length 8. Figure 9b shows the temp orary
the k cycles, i.e., the scheduling ranges violate the pi-
scheduling ranges that result from probing no de B 's lower
geon hole principle [10 ].
b ound. Based on the one-cycle logical implication, cycle 2
is removed from no de C 's range. No de C 's increased lower
Figure 8 illustrates iterative range reduction using the
b ound in turn causes the lower b ounds for no des E , G and H
one-cycle logical implication for a single-issue pro cessor. Fig-
to b e temp orarily tightened. Because no de E has a one-cycle
ure 8a shows eachnode lab eled with the lower and upp er
range (cycle 5), cycle 5 is removed from no de D 's range,
b ounds that are computed using static range reduction for
2
which in turn causes no de F 's lower b ound to b e tightened.
aschedule of length six . No des A, C , E and F have one-
No des F and G must b e scheduled at the same cycle, which
cycle ranges, so the corresp onding cycles (1,2,5 and 6) are
is infeasible. Thus, no de B cannot be scheduled at cycle
removed from the scheduling ranges of all other no des, re-
2 and cycle 2 is p ermanently removed from B 's scheduling
sulting in the scheduling ranges shown in Figure 8b. Prede-
range. The consequence of B 's p ermanent range reduction
cessor and successor ranges can then b e tightened using the
is shown in Figure 9c. Based on B 's tightened lower b ound,
iterative range reduction criterion, pro ducing the schedul-
the lower b ounds of no des D , F and H are p ermanently
ing ranges shown in 8c. The resulting scheduling ranges are
tightened. Based on no de B 's one-cycle range, cycle 3 is
infeasible b ecause no de B 's scheduling range is empty.
removed from no de C 's range. Due to no de D 's one-cycle
A second logical implication based on probing is used
range, cycle 6 is removed from no de G's range. The result-
to reduce scheduling range. Probing is a general approach
ing ranges are infeasible b ecause no des F and G must b e
that can b e used for prepro cessing any0-1integer program
scheduled at the same cycle, thus no 8-cycle schedule exists
to improve its solution time [20 ]. Probing selects a 0-1 inte-
for the DAG.
ger program variable and attempts to show that the variable
cannot b e 1 by assuming the variable's value is 1 and then
4.2 Optimal Region Scheduling
showing that the resulting problem is infeasible. If the prob-
lem is infeasible, bycontradiction, the variable's value must
For a region as de ned in Section 3, the critical-path distance
2
between the region's entry and exit no des may not b e su-
List scheduling pro duces a length 7 schedule for this DAG, so
an integer program is pro duced to nd the next shorter schedule, as
ciently tight. This can lead to excessiveinteger-program so-
describ ed in Section 2.1.
lution time b ecause the integer program pro duced from the
129
Optimal region schedules are computed starting with the
A [1,2] A [1,1] A [1,1]
inner regions, progressing outward to the largest
1 1 1 1 1 1 smallest
tire DAG partition. Optimal schedules [2,3] B C [2,3] [2,2] B C [3,3] [3,3] B C [2,2] outer region, the en
3 2 3 2 3 2
for the inner regions can usually b e found quickly using the
ytests. If an optimal region schedule requires an [5,6] D E [4,5] [6,6] D E [5,5] [6,6] D E [4,5] infeasibilit
1 2 1 2 1 2
integer program and the inner regions have pseudo edges, a
[6,7] F G [6,7] [7,7] F G [7,7] [7,7] F G [7,7]
dep endency constraint is pro duced for each pseudo edge, in
11 11 11 As the optimal region H H H the manner describ ed in Section 2.1.
[7,8] [8,8] [8,8]
scheduling pro cess moves outward, the additional pseudo
(a) (b) (c)
edges progressively constrain the scheduling of the large
outer regions, allowing the optimal scheduling of the outer
regions to b e solved quickly, usually using only the infeasi-
Figure 9: Example scheduling range reduction using instruc-
bility tests.
tion probing.
4.3 Branch-and-Cut [1,2] [1,2] [1,2] A A A
1 1 1 1 1
ypically 0-1 integer programs are initially solved as a lin- 1 T 1 1 1
[2,4] B C [2,4] B C [2,4] B C
h the solution values can b e non- [2,3] [2,3] [2,3] ear program (LP) in whic
2 3 2 6 3 2 6 3
tegers [24 ]. If a variable in the initial solution is a non- [4,6] D E [5,6] [4,6] D E [5,6] [4,6] D E [5,6] in
1 1 1 1 1
teger, branch-and-bound can b e used to nd an all integer
1 in F [6,7] F [7,7] F [7,7] [2,9] [2,9] L [2,9] L 1 L
1 1 1 1
Branch-and-b ound selects a branch variable, 1 solution [24 ].
[7,9] G H [7,8] [8,9] G H [8,8] [9,9] G H [8,8]
h is one of the variables that was not an integer in the 2 3 2 3 2 3 whic
[10,11] [11,11] [11,11]
solution. Two subproblems are created in which the [9,11] I J 3 [10,11] I J 3 [11,11] I J 3 LP
1 1 1 1 1 1
branch variable is xed to 0 and to 1, and the subprob-
[11,12] K [12,12] K [12,12] K
lems are solved as LPs. The collection of subproblems form
(a) (b) (c)
a branch-and-bound tree [24 ]. The ro ot of the tree is the
initial LP of the integer program and eachnodeis an LP
subproblem with some variables xed to integer values. By
Figure 10: Example pseudo edge insertion.
solving the LP subproblems, the branch-and-b ound tree is
traversed until an all integer solution is found, or the prob-
lem is determined to b e infeasible [24 ].
DAG is under-constrained. A region is loose if the critical-
For some integer programming applications, the num-
path distance from the region's entry no de to its exit no de
ber of no des in the branch-and-b ound tree that must be
is less than the distance from the entry no de to the exit
solved to pro duce an optimal integer solution can be dra-
no de in an optimal schedule of the region. Clearly in any
matically reduced by adaptively adding application-sp eci c
valid overall schedule, a region's entry and exit no des can
constraints, or cuts, to the LP subproblem at each no de
be no closer than the distance between these no des in an
in the branch-and-b ound tree. The cuts are designed to
optimal schedule of the region. A lo ose region can b e tight-
eliminate areas of the solution space in the LP subprob-
ened by computing an optimal schedule for the region to
lem whichcontain no integer solutions. This enhancement
determine the minimum distance b etween the region's entry
to the branch-and-b ound metho d is called branch-and-cut
and exit no des. A pseudo edge can then b e added to the
[24 ]. Twotyp es of cuts are prop osed for solving instruction
DAGfromtheentry no de to the exit no de, with a latency
scheduling integer programs: dependency cuts and spreading
that equals the distance b etween these no des in the optimal
cuts.
region schedule. Pseudo edges may allow the scheduling
ranges of the entry and exits no des to b e reduced. These
4.3.1 Dep endency Cuts
range reductions may then iteratively propagate through the
DAG as describ ed in the previous subsection and may result
In an LP solution, an instruction may b e fractionally sched-
in scheduling ranges that are infeasible for scheduling the ba-
uled over multiple cycles. For an instruction k that is data
sic blo ckinm cycles. Even if the overall scheduling problem
dep endent on instruction j , fractions of j canbescheduled
is not shown to b e infeasible, the problem will solve faster
after all cycles in which k is scheduled, without violating the
b ecause of the reduced number of scheduling variables.
corresp onding dep endency constraint. This is illustrated in
The application of region scheduling is illustrated in Fig-
Table 4, which shows a partial LP solution for a schedul-
ure 10. Figure 10a shows the scheduling ranges for a sched-
ing problem that includes instructions j and k , where k is
ule of length 12. The region AF is an instance of the region
dep endenton j with latency 1. The solution satis es the de-
in Figure 8, which has an optimal schedule of length 7. Thus,
p endency constraintbetween j and k because j is scheduled
after region scheduling is applied to region AF a latency 6
at cycle 1 0:5+5 0:5 = 3, while k is scheduled at cycle
pseudo edge is added from no de A to no de F , as shown in
4. However a fraction of j is scheduled after all fractions of
Figure 10b. The pseudo edge causes the lower b ounds for
k . This invalid solution can b e eliminated in the subsequent
no des F , G, H , I , J and K to b e tightened to the ranges
LP subproblems by adding the following dependency cut for
shown in Figure 10b. Based on no de H 's one-cycle range, cy-
cycle c, the last cycle in which j can b e scheduled given the
cle8isremoved from no de G's range. The increase in no de
p osition of the last fraction of k in the current LP solution:
G's lower b ound causes no de I 's lower b ound to increase.
The resulting ranges are shown in Figure 10c. These ranges
c+l
jk c
X X
i i
are infeasible b ecause no des I and J must b e scheduled at
x x
k j
the same cycle. Thus, the overall schedule is shown to b e
i=LB (k ) i=LB (j )
infeasible based on the optimal schedule of only one inner region.
130
where LB (j ) and LB (k ) are the scheduling range lower
[1,1] A
b ounds for j and k , resp ectively, and l is the latency of
jk 11
ween j and k .
the dep endency b et [2,3] B C [2,3]
or example, the following dep endency cut can b e added
F 1
2 2
for cycle 3 for the solution in Table 4 to prevent j from
[5,6] D E [5,6]
b eing fractionally scheduled after a fraction k in cycle 4 in
11
subsequent LP subproblems:
[7,7] F
1 2 3 2 3 4
x + x + x x + x + x
j j j k k k
Figure 11: Example DAG with redundant dep endency
edg e .
CD
clock cycle variables
1
C x =0:5
1
j
C
2
4.4 Redundant Constraints
C
3
4
Some constraints in the integer program may b e redundant
C x =1:0
4
k
5
and can be removed. This simpli es the integer program
C x =0:5
5
j
and reduces its solution time.
If an instruction i has a one cycle scheduling range at
Table 4: Example LP solution to illustrate a dep endency
cycle C , the must-schedule constraint for instruction i can
cut.
b e eliminated, and the issue constraint for cycle C can b e
eliminated for a single-issue pro cessor.
The integer program includes a dep endency constraint
for eachDAG edge. The dep endency constraint ensures that
4.3.2 Spreading Cuts
each instruction is scheduled suciently earlier than anyof
In an LP solution, an instruction k may be fractionally
its dep endent successors. However, if the scheduling ranges
scheduled closer to k 's immediate predecessors than the la-
of dep endent instructions are spaced far enough apart, the
tencies allowinaninteger solution, but still satisfy the cor-
data dep endency b etween the two instructions will neces-
resp onding dep endency constraints. This is illustrated in
sarily b e satis ed in the schedule pro duced by the integer
Table 5, whichshows a partial LP solution for a scheduling
program. In this case, the dep endency constraint for the
problem which includes instructions i, j , and k . Instruction
corresp onding edge can be removed from the integer pro-
k is dep endent on b oth i and j with latency 2. The solution
gram. More precisely, if an instruction k is dep endenton
satis es the dep endency constraints of the LP b ecause k is
instruction j with latency L, then the j to k dep endency
scheduled at cycle 3 0:5+4 0:5 =3:5, and i and j are
constraint is redundantif:
scheduled at cycle 1 0:5+2 0:5=1:5. However, in cycle 3,
a fraction of k is scheduled closer to i and j than the latency
L + Upp er b ound of j Lower b ound of k
allows in an all integer solution. This invalid solution can
b e eliminated in the subsequent LP subproblems by adding
For the example DAG in Figure 11, the dep endency con-
the following spreading cut for cycle c, the lowest cycle in
straint for edg e is redundant b ecause the upp er b ound
CD
which a fraction of k is scheduled in the current solution:
for no de C is cycle 3, the lower b ound for no de D is cycle 5
c
X X
and the latency of edg e is 1.
CD
i c l+1
x + x 1
k
I
i=c l+1
I 2P (k )
4.5 Algebraic Simpli cation
where P (k ) is the set of immediate predecessors of instruc-
An algebraic simpli cation reduces the number of dep en-
tion k .
dency constraint terms and reduces the size of the co e-
For example, the following spreading cut can b e added
cients of the remaining terms. The basic dep endency con-
for cycle 3 for the solution in Table 5 to force the fractions
straintfor edg e has one term for each cycle in j 's range
jk
of i and j to b e spread further apart from a fraction of k in
and one term for each cycle in k 's range. The dep endency
cycle 3 in subsequent LP subproblems:
constraint is simpli ed to include terms only for cycles in
2 2 3 2
which j 's and k 's scheduling ranges overlap. The simpli ed
1 + x + x + x x
j i k k
constraints allowtheinteger program to solve faster.
Supp ose instruction i is dep endenton k with latency L.
The scheduling ranges for k and i are [a,b] and [c,d], re-
clock cycle variables
sp ectively, and the dep endency constraint is not redundant.
1 1
C x =0:5 x =0:5
1
i j
That is, c +1 b + L. The basic dep endency constraintis:
2 2
=0:5 =0:5 C x x
2
i j
d b
3
X X
C x =0:5
3
k
j j
+ L j x (3) j x
4
i
k
C x =0:5
4
k
j =a j =c
Table 5: Example LP solution to illustrate a spreading cut.
Subtracting c from b oth sides of (3) yields:
b d
X X
j j
Symmetric spreading cuts can be used to prevent in-
+ L c j x c j x
i
k
structions from b eing fractionally scheduled closer to their
j =a j =c
immediate successors than the latencies allowinaninteger solution.
131
4.6 Advanced IP Formulation, Exp erimental Results Because the x and x variables are nonzero for only one
i
k
cycle j :
The exp eriment describ ed in 2.2 is rep eated with GCC's in-
b d
X X
struction scheduler replaced by an optimal instruction sched-
j j
(4) (j + L c) x (j c) x
i
k
uler that includes the DAG transformations and the ad-
j =a j =c
vanced integer-program formulation. The exp erimental re-
sults in Table 6 show the dramatic improvementprovided
After schedule range tightening, a c L, and b ecause
by the new optimal scheduler. All basic blo cks are sched-
the dep endency constraintisnotredundant, c L +1 b.
uled optimally and the scheduling time is very reasonable.
Accordingly,(4)may b e expanded to:
The graph transformations, advanced range reduction and
c L b d
region scheduling techniques reduce to 22 the number of
X X X
j j j
basic blo cks that require an integer program, down from
(5) (j +L c)x + (j +L c)x (j c)x
i
k k
517 using the basic formulation. These 22 most dicult
j =a j =c L+1 j =c
problems require a total solver time of only 45 seconds, an
j
average of only 2 seconds each. The total increase in com-
The right side of (5) is nonnegative. If x is nonzero for a
k
pilation time is only 98 seconds, a 14% increase in total
cycle j c L, the left side of (5) is nonp ositive. Then, the
compilation time, which includes the time for DAG trans-
inequality is necessarily satis ed indep endent of the values
formations, advanced range reduction and region scheduling.
of the x variables. Therefore, the left summation on the
i
Total scheduling time is reduced by more than 300 fold com-
left hand side of (5) may b e eliminated, and the dep endency
pared with the basic formulation. The improvementincode
constraint b ecomes:
quality is more than 4 times that of the basic formulation,
b d
X X
66 static cycles compared with 15 cycles. The additional 6
j j
(6) (j + L c) x (j c) x
i
k
basic blo cks that are solved optimally using the advanced
j =c L+1 j =c
formulation all have improved schedules, with the average
improvement of 5.8 cycles. This suggests that the hardest
Let M = b + L c. The right hand side of (6) can b e
problems to solve are those which provide the most p erfor-
transformed as follows:
mance improvement.
d d
X X
j j
(j c) x = M (M j + c) x
i i
j =c j =c
Total Basic Blo cks (BB) 7,402
BB Optimal from List Scheduling 6,885
Because the left hand side of (6) is at most M ,allright
BB Optimal from Graphs Transformation 143
hand side cycles after b + L 1 don't a ect the constraint
BB Optimal from Advanced Range Reduction
b ecause they have negative co ecients and will pro duce a
and from Region Scheduling 353
right hand side result that is greater than M . Thus the right
BB Passed to IP Formulation 22
hand side of (6) can b e simpli ed to:
BB IP Solved Optimally 22
b+L 1
X
BB IP Timed Out 0
j
M (7) (M j + c) x
BB Improved and Optimal 29
i
j =c
BB Improved but Not Optimal 0
Total Cycles Improved 66
Substituting (7) for the right hand side in (6), yields the
Total Scheduling Time (sec.) 98
simpli ed constraint:
b+L 1 b
Table 6: Exp erimental results using DAG transformations
X X
j j
M (M j + c) x (j + L c) x
and the advanced integer programming formulation.
i
k
j =c j =c L+1
To illustrate the simpli cation, assume k has scheduling
The new approach can optimally schedule very large ba-
range [8,15], i has scheduling range [14,78], and i is dep en-
sic blo cks. The scattergram in Figure 12 shows a dot for
dent on k with latency 1. The original data dep endency
each of the 517 basic blo cks that are pro cessed by the graph
constraint:
transformations and the advanced integer programming for-
mulation. The axes indicate the blo ck's size and the time
15 78
X X
j j
to optimally schedule that blo ck. This gure shows that
j x +1 j x
i
k
manyvery large blo cks, as large as 1000 instructions, are
j =8 j =14
optimally scheduled in a short time.
is simpli ed to:
5 Summary
15 15+1 1
X X
j j
(j +1 14) x (15 + 1 14) (15+1 j ) x
i
k
This pap er presents a new approach to optimal instruction
j =14 1+1 j =14
scheduling which is fast. The approach quickly identi es
most basic blo cks for whichlistscheduling pro duces an op-
whichis:
timal schedule, without using integer programming. For
14 15 14 15
the remaining blo cks, a simpli ed DAG and an advanced
x +2 x 2 2 x x
k k i i
integer-programming formulation lead to optimal schedul-
Before simpli cation the dep endency constraint has 74
ing times whichaverage a few seconds p er blo ck, even for a
terms. After simpli cation there are only 4 variable terms.
b enchmark set with some very large basic blo cks. These re-
The maximum sized co ecient has b een reduced to 2 from
sults show that the easiest of the hard instruction scheduling 74.
132
[9] D. Go o dwin and K. Wilken. Optimal and Near-Optimal
1000
Global Register Allo cation Using 0-1 Integer Program-
Software|Practice and Experience, 26(8):929{
100 ming. 965, August 1996.
10
[10] J. Grossman. Discrete Mathematics. Macmillan, 1990.
1
[11] J. Hennessy and D. Patterson. Computer Architecture:
0.1
A Quantitative Approach. Morgan Kaufmann, 1996.
0.01 Second Edition.
Total Scheduling Time (secs.)
ILOG. ILOG CPLEX 6.5 User's Manual. ILOG, 1999. 0.001 [12]
1 10 100 1000 10000
C. Kessler. Scheduling Expression DAGs for Min-
Intermediate Instructions in Basic Block [13]
iml Register Need. Computer Languages, 24(1):33{53,
April 1998.
Figure 12: Scatter-gram of basic-blo cksizeversus optimal
[14] R. Leup ers and P. Marwedel. Time-Constrained Co de
scheduling time.
Compaction for DSPs. IEEE Transactions VLSI Sys-
tems, 5(1):112{122, March 1997.
problems can b e solved in reasonable time. This is an imp or-
[15] S. Munchnick. Advanced Compiler Design and Imple-
tant rst step toward solving harder instruction scheduling
mentation. Morgan Kaufmann, 1997.
problems in reasonable time, including scheduling very large
[16] G. Nemhauser. The Age of Optimization: Solving
basic blo cks for long-latency multiple-issue pro cessors. The
Large-Scale Real-World Problems. Operations Re-
prop osed approach will also serve as a solid base for future
search, 42(1):5{13, Jan.{Feb. 1994.
work on optimal formulations of combined lo cal instruction
scheduling and lo cal register allo cation that can b e solved
[17] K. Palem and B. Simons. Scheduling Time-Critical In-
in reasonable time for large basic blo cks.
structions on RISC Machines. ACM Transactions on
Programming Languages and Systems, 15(4):632{658,
References
Septemb er 1993.
[18] K. Pingali and G. Bilardi. Optimal Control Dep endence
[1] S. Arya. An Optimal Instruction-Scheduling Mo del for
and the Roman Chariots Problem. ACM Transactions
a Class of Vector Pro cessors. IEEE Transactions on
on Programming Languages and Systems, 19(3):462{
Computers, C-34(11):981{995, Novemb er 1985.
491, May 1997.
[2] D. Bernstein and I. Gertner. Scheduling Expressions
[19] W. Pugh. The Omega Test: AFast and Practical Inte-
on a Pip elined Pro cessor with a Maximal DelayofOne
ger programming Algorithm for Dep endence Analysis.
Cycle. ACM Transactions on Programming Languages
In Proceedings Supercomputing '91, pages 18{22, Nov
and Systems, 11(1):57{66, January 1989.
1991.
[3] D. Bernstein, M. Ro deh, and I. Gertner. On
[20] M. Savelsb ergh. Prepro cessing and Probing for Mixed
the Complexity of Scheduling Problems for Paral-
Integer Programming Problems. ORSA Journal of
lel/Pip elined Machines. IEEE Transactions on Com-
Computing, 6(4):445{454, Fall 1994.
puters, 38(9):1308{1313, Septemb er 1989.
[21] P. Sweany and S. Beaty. Instruction Scheduling Us-
[4] R. Bixby, K. Kennedy, and U. Kremer. Automatic Data
ing Simulated Annealing. In Proc. 3rd International
Layout Using 0-1 Integer Programming. In Proceedings
Conference on Massively Paral lel Computing Systems
of Conference on Paral lel Architectures and Compila-
(MPCS '98), 1998.
tion Techniques, August 1994.
[22] H. Tomiyama, T. Ishihara, A. Inoue, and H. Yasuura.
[5] C-M Chang, C-M Chen, and C-T King. Using Inte-
Instruction Scheduling to Reduce Switching Activityof
ger Linear Programming for Instruction Scheduling and
O -Chip Buses for Low-Power Systems with Caches.
Register Allo cation in Multi-Issue Pro cessors. Com-
IEICE Trans. on Fundamentals of Electronics, Com-
puters and Mathematics with Applications, 34(9):1{14,
munications and Computer Sciences, E81-A(12):2621{
November 1997.
2629, Decemb er 1998.
[6] H. Chou and C. Chung. An Optimal Instruction Sched-
[23] S. Vegdahl. Local Code Generation and Compaction
uler for Sup erscalar Pro cessors. IEEE Trans. on Paral-
in Optimizing Microcode. PhD thesis, Carnegie Mellon
lel and Distributed Systems, 6(3):303{313, March 1995.
University, Decemb er 1982.
[7] T. Cormen, C. Leiserson, and R. Rivest. Introduction
[24] Laurence A. Wolesey. Integer Programming. John Wi-
to Algorithms. McGraw-Hill, 1989.
ley & Sons, Inc., 1998.
[8] A. Ertl and A. Krall. Optimal Instruction Schedul-
ing Using Constraint Logic Programming. In Program-
ming Language Implementation and Logic Program-
ming (PLILP). Springer-Verlag, 1991.
133