Using Profile Information to Assist Classic Code Optimizations

Pohua P. Chang, Scott A. Mahlke, and Wen-mei W. Hwu

Center for Reliable and High-performance Computing

University of Illinois, Urbana-Champaign

hwu@crhc.uiuc.edu

SUMMARY

This paper describes the design and implementation of an optimizing compiler that automatically generates profile information to assist classic code optimizations. This compiler contains two new components, an execution profiler and a profile-based code optimizer, which are not commonly found in traditional optimizing compilers. The execution profiler inserts probes into the input program, executes the input program for several inputs, accumulates profile information, and supplies this information to the optimizer. The profile-based code optimizer uses the profile information to expose new optimization opportunities that are not visible to traditional global optimization methods. Experimental results show that the profile-based code optimizer significantly improves the performance of production programs that have already been optimized by a high-quality global code optimizer.

Key Words: C, code optimization, compiler, profile-based code optimization, profiler

INTRODUCTION

The major objective of code optimizations is to reduce the execution time. Some classic code optimizations, such as dead code elimination, common subexpression elimination, and copy propagation, reduce the execution time by removing redundant computation. Other code optimizations, such as loop invariant code removal and loop induction variable elimination, reduce the execution time by moving instructions from frequently executed program regions to infrequently executed program regions. This paper describes an optimizing compiler that accurately identifies frequently executed program paths and optimizes them.

To appear in Software: Practice and Experience.

Static analysis, such as loop detection, can estimate execution counts, but the estimates are imprecise: the outcomes of conditional statements, loop iteration counts, and recursion depths are rarely predictable using static techniques. For example, a loop nested within a conditional statement does not contribute to the execution time if the condition for its evaluation is never true. Optimizing such a loop may degrade the overall program performance if it increases the execution time of other parts of the program.

Classic code optimizations use other static analysis methods, such as live-variable analysis, reaching definitions, and definition-use chains, to ensure the correctness of code transformations. These static analysis methods do not distinguish between frequently and infrequently executed program paths. However, there are often instances where a value is destroyed on an infrequently executed path which exists to handle rare events. As a result, one cannot apply optimizations to the frequently executed paths unless the infrequently executed paths are systematically excluded from the analysis. This requires an accurate estimate of the program's run-time behavior.

Profiling is the process of selecting a set of inputs for a program, executing the program with these inputs, and recording the run-time behavior of the program. By carefully selecting inputs, one can derive an accurate estimate of program run-time behavior with profiling. The motivation for integrating a profiler into a C compiler is to guide the code optimizations with profile information. We refer to this scheme as profile-based code optimization. In this paper, we present a new method for using profile information to assist classic code optimizations. The idea is to transform the control flow graph according to the profile information, so that the optimizations are not hindered by rare conditions. Because profile-based code optimizations demand less work from the user than hand-tuning of a program does, profile-based code optimizations can be applied to very large application programs. With profile-based code optimizations, much of the tedious work can be eliminated from the hand-tuning process. The programmers can concentrate on more intellectual work, such as algorithm tuning. (In this paper, we assume that the reader is familiar with the static analysis methods.)

The contribution of this paper is a description of our experience with the generation and use of profile information in an optimizing C compiler. The prototype profiler that we have constructed is robust and has been tested with large C programs. We have modified many classic code optimizations to use profile information. Experimental data show that these code optimizations can substantially speed up realistic non-numeric C application programs. We also provide insight into why these code optimizations are effective.

The intended audience of this paper is optimizing compiler designers and production software developers. Compiler designers can reproduce the techniques that are described in this paper. Production software developers can evaluate the cost-effectiveness of profile-based code optimizations for improving product performance.

RELATED STUDIES

Using profile information to hand-tune algorithms and programs has become a common practice for serious program developers. Several UNIX profilers are available, such as prof, gprof, and tcov. The prof output shows the execution time and the invocation count of each function. The gprof output not only shows the execution time and the invocation count of each function, but also shows the effect of called functions in the profile of each caller. The tcov output is an annotated listing of the source program, in which the execution count of each straight-line segment of C statements is reported. These profiling tools allow programmers to identify the most important functions and the most frequently executed regions in the functions.

It should be noted that profile-based code optimizations are not alternatives to conventional optimizations, but are meant to be applied in addition to conventional optimizations. (UNIX is a trademark of AT&T.)

Recent studies of profile-based code optimizations have provided solutions to specific architectural problems. The accuracy of branch prediction is important to the performance of pipelined processors that use the squashing branch scheme. It has been shown that profile-based branch prediction at compile time performs as well as the best hardware schemes. Trace scheduling is a popular global microcode compaction technique. For trace scheduling to be effective, the compiler must be able to identify frequently executed sequences of basic blocks. It has been shown that profiling is an effective method to identify frequently executed sequences of basic blocks in a flow graph. Instruction placement is a code optimization that arranges the basic blocks of a flow graph in a particular linear order, to maximize the sequential locality and to reduce the number of executed branch instructions. It has been shown that profiling is an effective method to guide instruction placement. A C compiler can implement a multiway branch (i.e., a switch statement in C) as a sequence of branch instructions or as a hash table lookup and jump. If most occurrences are satisfied by a few case conditions, then it is better to implement a sequence of branch instructions, ordered from the most likely case to the least likely case. Otherwise, it is better to implement a hash table lookup and jump.

Profile information can help a register allocator to identify the frequently accessed variables. Function inline expansion eliminates the overhead of function calls and enlarges the scope of global code optimizations. Using profile information, the compiler can identify the most frequently invoked calls and determine the best expansion sequence. A counter-based execution profiler that measures the average execution times and their variance can be optimized to achieve a low run-time overhead. The estimated execution times can be used to guide program partitioning and scheduling for multiprocessors.

DESIGN OVERVIEW

[Figure 1 is a block diagram: C programs and input data feed the compiler front-end (Box A) and the profiler (Box C); the intermediate code passes through the code optimizer (Box B) and the code generator to the host assemblers (MIPS, SPARC, Intel, and AMD targets).]

Figure 1: A block diagram of our prototype C compiler

Figure 1 shows the major components of our prototype C compiler. Box A contains the compiler front-end and the code generator. Box B is the global code optimizer that operates on the intermediate form. Table 1 lists the local and global code optimizations that we have implemented in our prototype compiler. In order to have profile-based code optimizations, we have added a new Box C to the prototype compiler. The profile information is then integrated into the intermediate code. Some code optimizations in Box B are modified to use the profile information; these code optimizations form a separate pass that is performed after the classic global code optimizations. Our prototype compiler generates code for several existing processor architectures, including MIPS, SPARC, Intel, and AMD processors.

local                               global

constant propagation                constant propagation
copy propagation                    copy propagation
common subexpression elimination    common subexpression elimination
redundant load elimination          redundant load elimination
redundant store elimination         redundant store elimination
                                    loop unrolling
                                    loop invariant code removal
constant combining                  loop induction strength reduction
operation folding                   loop induction elimination
dead code removal                   dead code removal
code reordering                     global variable migration

Table 1: Classic code optimizations

Program representation. Our intermediate code has the following properties. The operation codes are very close to those of the host machines (e.g., MIPS and SPARC). It is a load-store architecture: arithmetic instructions are register-to-register operations, and data transfers between registers and memory are specified by explicit memory load/store instructions. The intermediate code provides an infinite number of temporary registers.

In optimizing compilers, a function is typically represented by a flow graph, where each node is a basic block and each arc is a potential control flow path between two basic blocks. Because classic code optimizations have been developed based on the flow graph data structure, we extend the flow graph data structure to contain profile information. We define a weighted flow graph as a quadruplet {V, E, count, arc_count}, where each node in V is a basic block, each arc in E is a potential control flow path between two basic blocks, count(v) is a function that returns the execution count of a basic block v, and arc_count(e) is a function that returns the taken count of a control flow path e.
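This quadruplet maps directly onto a small data structure. The following Python sketch is our own illustration (the class and method names are not from the compiler described here); it shows one way the count and arc_count functions can be accumulated from observed execution paths:

```python
from collections import defaultdict

class WeightedFlowGraph:
    """A flow graph annotated with profile information: count[v] is the
    execution count of basic block v, arc_count[e] the taken count of arc e."""

    def __init__(self):
        self.nodes = set()                  # basic block ids (V)
        self.arcs = set()                   # (src, dst) pairs (E)
        self.count = defaultdict(int)       # execution count per block
        self.arc_count = defaultdict(int)   # taken count per arc

    def add_arc(self, src, dst):
        self.nodes.update((src, dst))
        self.arcs.add((src, dst))

    def record(self, path):
        """Accumulate profile counts for one dynamic path of block ids."""
        for v in path:
            self.count[v] += 1
        for e in zip(path, path[1:]):
            self.arc_count[e] += 1
```

Recording each dynamically executed path of block ids is enough to annotate a plain flow graph into a weighted one.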

Each basic block contains a straight-line segment of instructions. The last instruction of a basic block may be one of the following types: an unconditional jump instruction, a two-way conditional branch instruction, a multiway branch instruction, or an arithmetic instruction. For simplicity, we assume that a jump-subroutine instruction is an arithmetic instruction, because it does not change the control flow within the function where the jump-subroutine instruction is defined. Except for the last instruction, all other instructions in a basic block must be arithmetic instructions that do not change the flow of control to another basic block.

Profiler implementation. We are interested in collecting the following information with the profiler:

1. The number of times a program has been profiled.
2. The invocation count of each function.
3. The execution count of each basic block.
4. For each two-way conditional branch instruction, the number of times it has been taken.
5. For each multiway branch instruction, the number of times each case has been taken.

With this information, we can annotate a flow graph to form a weighted flow graph. (Algorithms for finding dominators, detecting loops, computing live-variable information, and other dataflow analyses have been developed on the flow graph data structure. One exception to the jump-subroutine assumption is when a longjmp is invoked by the callee of a jump-subroutine instruction and the control does not return to the jump-subroutine instruction; another exception is when the callee of a jump-subroutine instruction is exit. However, these exceptions do not affect the correctness of code optimizations based on flow graphs.)

Automatic profiling is supported by four tools: a probe insertion program, an execution monitor, a program to combine several profile files into a summarized profile file, and a program that maps the summarized profile data into a flow graph to generate a weighted flow graph data structure. All that a user has to do to perform profiling is to supply input files. The compiler automatically performs the entire profiling procedure in five steps.

(a) The probe insertion program assigns a unique id to each function and inserts a probe at the entry point of each function. Whenever the probe is activated, it produces a function(id) token; in a function(id) token, id is the unique id of the function. The probe insertion program also assigns a unique id to each basic block within a function, and inserts a probe in each basic block to produce a bb(fid, bid, cc) token every time that basic block is executed. In a bb(fid, bid, cc) token, fid identifies a function, bid identifies a basic block in that function, and cc is the branch condition. The output of the probe insertion program is an annotated intermediate code.

(b) The annotated intermediate code is compiled to generate an executable program, which produces a trace of tokens every time the program is executed.

(c) The execution monitor program consumes a trace of tokens and produces a profile file. We have implemented the execution monitor program in two ways. It can be a separate program which listens through a UNIX socket for incoming tokens. Alternatively, it can be a function which is linked with the annotated user program. The second approach is at least two orders of magnitude faster than the first approach, but may fail when the original user program contains a very large data section that prevents the monitor program from allocating the necessary memory space. Fortunately, we have not yet encountered this problem.

(d) Step (c) is repeated once for each additional input. All profile files are combined into a summarized profile file by summing the counts and keeping a counter that indicates the number of profile files combined. From the above information, the average execution counts can be derived.

(e) Finally, the average profile data is mapped into the original intermediate code, using the assigned function and basic block identifiers.
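Steps (c) and (d) can be sketched as follows. This is a hypothetical reconstruction in Python (the token format and function names are ours, not the actual monitor's):

```python
from collections import Counter

def consume_tokens(tokens):
    """Execution monitor (step c): consume a trace of (fid, bid) tokens
    and accumulate per-basic-block execution counts into a profile."""
    profile = Counter()
    for fid, bid in tokens:
        profile[(fid, bid)] += 1
    return profile

def combine_profiles(profiles):
    """Step (d): sum the counts over all profile files and keep a counter
    of how many were combined, from which average counts are derived."""
    total = Counter()
    for p in profiles:
        total.update(p)
    runs = len(profiles)
    average = {block: c / runs for block, c in total.items()}
    return total, runs, average
```

Each run of the annotated program yields one profile; averaging over runs gives the per-block counts that are mapped back onto the intermediate code in step (e).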

CODE OPTIMIZATION ALGORITHMS

Optimizing frequently executed paths. All profile-based code optimizations presented in this section explore a single concept: optimizing the most frequently executed paths. We illustrate this concept using an example. Figure 2 shows a weighted flow graph which represents a loop program. The counts of the basic blocks {A, B, C, D, E, F} and the arc_counts of the arcs {A→B, A→C, B→D, B→E, C→F, D→F, E→F, F→A} are shown in the figure. Clearly, the most frequently executed path in this example is the basic block sequence (A, B, E, F). Traditionally, the formulations of non-loop-based classic code optimizations are conservative and do not perform transformations that may increase the execution time of any basic block. The formulations of loop-based classic code optimizations consider the entire loop body as a whole, and do not consider the case where some basic blocks in the loop body are rarely executed because of a very biased if statement. In the rest of this section, we describe several profile-based code optimizations that make more aggressive decisions and explore more optimization opportunities.


[Figure 2 shows the weighted flow graph for this example, with nodes A through F and their execution and arc counts.]

Figure 2: A weighted flow graph


We propose the use of a simple data structure, called a superblock, to represent a frequently executed path. A superblock is a linear sequence of basic blocks that can be reached only from the first block in the sequence. The program control may leave the superblock from any basic block. When execution reaches a superblock, it is very likely that all basic blocks in that superblock are executed.

The basic blocks in a superblock do not have to be consecutive in the code. However, our implementation restructures the code so that, as far as the optimizer is concerned, all blocks in a superblock are always consecutive.

Forming superblocks. The formation of superblocks is a two-step procedure: trace selection and tail duplication. Trace selection identifies basic blocks that tend to execute in sequence and groups them into a trace. The definition of a trace is the same as the definition of a superblock, except that the program control is not restricted to enter at the first basic block. Trace selection was first used in trace scheduling, and an experimental study of several trace selection algorithms has been reported. The outline of a trace selection algorithm is shown in Figure 3. The best_predecessor_of(node) function returns the most probable source basic block of node, if that source basic block has not yet been marked; the growth of a trace is stopped when the most probable source basic block of the current node has been marked. The best_successor_of(node) function is defined symmetrically.

Figure 2 shows the result of trace selection: each dotted-line box represents a trace. There are three traces: {A, B, E, F}, {C}, and {D}. After trace selection, each trace is converted into a superblock by duplicating the tail part of the trace, in order to ensure that the program control can only enter at the top basic block. The tail duplication algorithm is shown in Figure 4.


algorithm trace_selection(a weighted flow graph G)
begin
    mark all nodes in G unvisited;
    while (there are unvisited nodes) begin
        seed = the node with the largest execution count
               among all unvisited nodes;
        mark seed visited;
        /* grow the trace forward */
        current = seed;
        loop
            s = best_successor_of(current);
            if (s == 0) exit loop;
            add s to the trace;
            mark s visited;
            current = s;
        endloop
        /* grow the trace backward */
        current = seed;
        loop
            s = best_predecessor_of(current);
            if (s == 0) exit loop;
            add s to the trace;
            mark s visited;
            current = s;
        endloop
    endwhile
end algorithm

Figure 3: A trace selection algorithm
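The pseudocode of Figure 3 can be turned into executable form. The sketch below is our own rendering under an assumed data representation (count maps blocks to execution counts, arc_count maps arcs to taken counts); it follows the algorithm, stopping growth when the most probable neighbor is already marked:

```python
def trace_selection(nodes, count, arc_count):
    """Greedy trace selection over a weighted flow graph (after Figure 3)."""
    visited = set()

    def best_neighbor(block, outgoing):
        # most probable successor (or predecessor) of block, or None if it
        # has none or the best one is already marked
        arcs = [(c, d if outgoing else s)
                for (s, d), c in arc_count.items()
                if (s if outgoing else d) == block]
        if not arcs:
            return None
        best = max(arcs)[1]
        return None if best in visited else best

    traces = []
    while len(visited) < len(nodes):
        # seed: the unvisited node with the largest execution count
        seed = max((b for b in nodes if b not in visited),
                   key=lambda b: count[b])
        visited.add(seed)
        trace = [seed]
        current = seed                      # grow the trace forward
        while (s := best_neighbor(current, True)) is not None:
            trace.append(s); visited.add(s); current = s
        current = seed                      # grow the trace backward
        while (p := best_neighbor(current, False)) is not None:
            trace.insert(0, p); visited.add(p); current = p
        traces.append(trace)
    return traces
```

On a graph shaped like the example of Figure 2, with the (A, B, E, F) path dominating the counts, this yields the three traces {A, B, E, F}, {C}, and {D}.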

algorithm tail_duplication(a trace B(1), ..., B(n))
begin
    let B(i) be the first basic block that
    is an entry point to the trace, except for i = 1;
    for (k = i .. n) begin
        create a new block that contains a copy of B(k);
        place the new block at the end of the function;
        redirect all control flows to B(k), except
        the one from B(k-1), to the new block;
    endfor
end algorithm

Figure 4: The tail duplication algorithm
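In the same spirit, tail duplication can be sketched as follows. The representation is again our own assumption: preds maps each block to the set of its predecessor blocks, and duplicated blocks are named with a prime suffix rather than being real copies:

```python
def tail_duplication(trace, preds):
    """Duplicate the tail of a trace starting at its first side entrance
    (after Figure 4), so the remaining trace is a superblock."""
    # find the first block, other than the head, that has an entry
    # which does not come from its predecessor inside the trace
    i = next((k for k in range(1, len(trace))
              if preds[trace[k]] - {trace[k - 1]}), None)
    if i is None:
        return trace, []                  # already a superblock
    # the copies are appended to the end of the function; side entrances
    # are redirected to them, so the trace is entered only at its head
    duplicated = [block + "'" for block in trace[i:]]
    return trace, duplicated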


From the example in Figure 2, we see that there are two control paths that enter the {A, B, E, F} trace at basic block F. Therefore, we duplicate the tail part of the {A, B, E, F} trace, starting at basic block F. Each duplicated basic block forms a new superblock that is appended to the end of the function. The result is shown in Figure 5. More code transformations are applied after tail duplication to eliminate jump instructions. For example, the duplicated F superblock in Figure 5 could itself be duplicated, and each copy combined with the C and D superblocks to form two larger superblocks.

In order to control the amount of code duplication, we exclude all basic blocks whose execution count is below a threshold value from the trace selection process. They are also excluded from profile-based code optimization, to control the increase in compile time.

Formulation of code optimizations. Table 2 shows a list of classic code optimizations that we have extended to use profile information; the original formulations of these classic code optimizations are well documented. In Table 2, the second column describes the extended scopes of these code optimizations. The non-loop-based code optimizations work on a single superblock at a time. The loop-based code optimizations work on a single superblock loop at a time. A superblock loop is a superblock that has a frequently taken backedge from its last node to its first node. The optimizer first applies live-variable analysis to detect variables that are live across superblock boundaries, and then optimizes one superblock at a time. For each superblock, the profile-based code optimizations are applied one or more times, up to a limit, or until no more opportunities can be detected.

In the following discussion, each code optimization consists of a precondition function and an action function. (Note that when blocks are duplicated, the profile information has to be scaled accordingly. Scaling the profile information will destroy its accuracy; fortunately, code optimizations after forming superblocks only need approximate profile information.)


[Figure 5 shows the flow graph after tail duplication: blocks A, B, E, and F form one superblock, and a duplicated block F' handles the paths from C and D.]

Figure 5: Forming superblocks


name                                  scope

constant propagation                  superblock
copy propagation                      superblock
constant combining                    superblock
common subexpression elimination      superblock
redundant store elimination           superblock
redundant load elimination            superblock
dead code removal                     superblock
loop invariant code removal           superblock loop
loop induction variable elimination   superblock loop
global variable migration             superblock loop

Table 2: Superblock code optimizations

The precondition function is used to detect optimization opportunities and to ensure that the transformation improves overall program performance. The action function performs the actual code transformation. To apply a code optimization, the optimizer identifies sets of instructions that may be eligible for the optimization. The precondition function is then invoked to make an optimization decision for each set. With the approval of the precondition function, the action function transforms the eligible sets into their more efficient equivalents.

We denote the set of variables that an instruction op(i) modifies by dest(i). We denote the set of variables that op(i) requires as source operands by src(i). We denote the operation code of op(i) by f(i). Therefore, op(i) refers to the operation dest(i) = f(i)(src(i)).

Local optimizations extended to superblocks. There are several local code optimizations that can be extended in a straightforward manner to superblocks. These local optimizations include constant propagation, copy propagation, constant combining, common subexpression elimination, redundant load elimination, and redundant store elimination. (In this paper, we assume that there can be at most one element in dest(i) of any instruction op(i). The details of the required extensions can be found in a technical report.)


Traditionally, local optimization cannot be applied across basic blocks, and global code optimization must consider each possible execution path equally. However, there are often instances where an optimization opportunity is inhibited by an infrequently executed path. As a result, one cannot apply optimizations to the frequently executed paths unless the infrequently executed paths are systematically excluded from the analysis. Forming superblocks with tail duplication achieves this effect. Therefore, profile-based code optimizations can find more opportunities than traditional code optimizations.

To illustrate why local code optimizations are more effective when they are applied to superblocks, consider the case of common subexpression elimination shown in Figure 6. The original program is shown in Figure 6(a), and the program after trace selection and tail duplication is shown in Figure 6(b). Because of tail duplication, op(C) cannot be reached from op(B); therefore, common subexpression elimination can be applied to op(A) and op(C).
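The single-entry property is what makes this safe: a value computed anywhere in a superblock is available at every later instruction of that superblock. A minimal sketch follows (the instruction encoding is our own, and only register operands are modeled):

```python
def superblock_cse(ops):
    """Common subexpression elimination within one superblock.
    Each op is (dest, opcode, srcs); execution is strictly top-down, so an
    expression stays available until one of its sources, or the register
    holding its value, is redefined."""
    available = {}          # (opcode, srcs) -> register holding that value
    out = []
    for dest, opcode, srcs in ops:
        key = (opcode, srcs)
        reused = key in available
        if reused:
            out.append((dest, "move", (available[key],)))   # reuse the value
        else:
            out.append((dest, opcode, srcs))
        # dest is redefined here: kill expressions that read dest
        # or whose value lives in dest
        available = {k: v for k, v in available.items()
                     if dest not in k[1] and v != dest}
        if not reused and dest not in srcs:
            available[key] = dest
    return out
```

The second computation of a repeated expression collapses to a register move, which later copy propagation can then clean up.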

Dead code removal. Dead code removal operates on one instruction at a time. Let op(x) be an instruction in a superblock. The traditional formulation of the precondition function of dead code removal is that if the value of dest(x) will not be used later in execution, op(x) can be eliminated. To take full advantage of profile information, we propose an extension to dead code removal. In the extension, the precondition function consists of the following boolean predicates:

1. The superblock where op(x) is defined is not a superblock loop.

2. op(x) is not a branch instruction.

3. dest(x) is not used before being redefined in the superblock.


[Figure 6 lists three short program segments made of instructions op(A), op(B), and op(C), in which op(A) and op(C) compute the same expression; the register operands appear in the original figure.]

Figure 6: An example of superblock common subexpression elimination: (a) original program segment; (b) program segment after superblock formation; (c) program segment after common subexpression elimination


4. Find an integer y such that op(y) is the first instruction that modifies dest(x), with x < y. If dest(x) is not redefined in the superblock, set y to m, where op(m) is the last instruction in the superblock. Find an integer z such that op(z) is the last branch instruction in {op(k), x < k < y}. Either there is no branch instruction in {op(k), x < k < y}, or src(x) is not modified by any instruction in {op(j), x < j < z}.

The action function of dead code removal consists of the following steps:

1. For every branch instruction op(i) in {op(i), x < i < y}, if dest(x) is live when op(i) is taken, copy op(x) to a place between op(i) and every possible target superblock of op(i) when op(i) is taken.

2. If y is m, and the superblock where op(x) is defined has a fall-through path (because the last instruction in the superblock is not an unconditional branch), copy op(x) to become the last instruction of the superblock.

3. Eliminate the original op(x) from the superblock.
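A simplified executable sketch of these predicates and actions follows. The modeling is our own: a superblock is a list of ("arith" or "branch", dest, srcs) tuples, and live_at_exit gives the registers live when each branch is taken:

```python
def remove_dead_op(ops, x, live_at_exit):
    """Superblock dead code removal for op(x) in a non-loop superblock.
    Returns (new_ops, branch_indices_needing_a_copy), or None if any
    precondition fails."""
    kind, dest, srcs = ops[x]
    if kind == "branch":
        return None                                  # predicate 2
    # first redefinition of dest after x (or one past the end of the block)
    y = next((i for i in range(x + 1, len(ops)) if ops[i][1] == dest),
             len(ops))
    if any(dest in ops[i][2] for i in range(x + 1, y)):
        return None                                  # predicate 3: dest used
    branches = [i for i in range(x + 1, y) if ops[i][0] == "branch"]
    z = branches[-1] if branches else x
    if any(ops[i][1] in srcs for i in range(x + 1, z)):
        return None                                  # predicate 4: src changed
    # action: copy op(x) onto each taken side exit where dest is live,
    # then delete the original from the superblock
    copies = [i for i in branches if dest in live_at_exit.get(i, ())]
    return [op for i, op in enumerate(ops) if i != x], copies
```

The instruction disappears from the frequent path and survives only on the rarely taken side exits that still need its value.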

Dead code elimination is like common subexpression elimination in that tail duplication is a major source of opportunities to apply it. A special feature of our dead code elimination is that it can eliminate an instruction from a superblock by copying it to some control flow paths that exit from the middle of the superblock. This code motion is beneficial because the program control rarely exits from the middle of a superblock.

Figure 7 shows a simple example of dead code removal. The program is a simple loop that has been unrolled four times. The loop index variable r has been expanded into four registers. (A variable is live if its value will be used before being redefined; an algorithm for computing live variables can be found in the compiler literature.)


[Figure 7 lists the unrolled loop body before and after the transformation; the register-level instructions and the labels X, Y, and Z mark the side-exit targets where copies of the index-update instructions are placed.]

Figure 7: An example of superblock dead code removal: (a) original program segment; (b) program segment after dead code removal


These four registers can be computed in parallel. If the loop index variable is live after the loop execution, then it is necessary to update the value of r in each iteration, as shown in Figure 7(a). According to the definition of superblock dead code removal, these update instructions become dead code, since their uses are replaced by the expanded registers. The update instructions can be moved out of the superblock, as shown in Figure 7(b).

Loop optimizations. Superblock loop optimizations can identify more optimization opportunities than traditional loop optimizations, which must account for all possible execution paths within a loop. Superblock loop optimizations reduce the execution time of the most likely path of execution through a loop. In traditional loop optimizations, a potential optimization may be inhibited by a rare event, such as a function call to handle a hardware failure in a device driver program, or a function call to refill a large character buffer in text processing programs. In superblock loop optimizations, function calls that are not in the superblock loop do not affect the optimization of the superblock loop.

We have identified three important loop optimizations that most effectively utilize profile information: invariant code removal, global variable migration, and induction variable elimination. Each optimization is discussed in a following subsection.

Loop invariant code removal. Invariant code removal moves instructions whose source operands do not change within the loop to a preheader block. Instructions of this type are then executed only once each time the loop is invoked, rather than on every iteration. The precondition function for invariant code removal consists of the following boolean predicates, which must all be satisfied:

1. src(x) is not modified in the superblock.


2. op(x) is the only instruction which modifies dest(x) in the superblock.

3. op(x) must precede all instructions which use dest(x) in the superblock.

4. op(x) must precede every exit point of the superblock in which dest(x) is live.

5. If op(x) is preceded by a conditional branch in the superblock, it must not possibly cause an exception.

The action function of invariant code removal moves op(x) to the end of the preheader block of the superblock loop.

In the precondition function, predicate 5 handles the case where op(x) is not executed on every iteration of the superblock loop: an instruction that is not executed on every iteration may not be moved to the preheader if it can possibly cause an exception. Memory instructions, floating point instructions, and integer divides are the most common instructions which cannot be removed unless they are executed in every iteration.

Some of these predicates depend on two optimization components: memory disambiguation and interprocedural analysis. Currently, our prototype C compiler performs memory disambiguation but no interprocedural analysis. Thus, if op(x) is a memory instruction, the precondition function will return false if there are any subroutine calls in the superblock loop.
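Under the same simplified modeling as before (a superblock loop as a list of ("arith" or "branch", dest, srcs) tuples), the precondition test can be sketched as:

```python
def loop_invariant_candidates(ops, may_trap):
    """Find instructions movable to the preheader of a superblock loop.
    Checks predicates 1-3 and 5; predicate 4 (op(x) precedes every exit
    point where dest(x) is live) is assumed to hold in this sketch."""
    dests = [d for kind, d, s in ops if kind == "arith"]
    movable = []
    for i, (kind, dest, srcs) in enumerate(ops):
        if kind != "arith":
            continue
        if any(s in dests for s in srcs):            # 1: a source is modified
            continue
        if dests.count(dest) != 1:                   # 2: other writers of dest
            continue
        if any(dest in s for _, _, s in ops[:i]):    # 3: a use precedes op(x)
            continue
        preceded = any(k == "branch" for k, _, _ in ops[:i])
        if preceded and may_trap(i):                 # 5: conditional + may trap
            continue
        movable.append(i)
    return movable
```

In a loop shaped like the example of Figure 8, the load of the invariant memory variable is detected as movable, while instructions reading registers written inside the loop are not.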

The increased optimization opportunities created by limiting the search space to a superblock for invariant code removal are best illustrated by an example. Figure 8 shows a simple example of superblock loop invariant code removal. In Figure 8(a), op(A) is not loop invariant in the traditional sense, because its source operand is a memory variable and op(D) is a function call that may modify any memory variable. On the other hand, op(A) is invariant in the superblock loop. The result of superblock loop invariant code removal is shown in Figure 8(b).


[Figure 8 shows a superblock loop containing op(A), a load of the memory variable buffer_length, followed by op(B) and op(C), with a rarely taken path containing op(D), a call to a refill function; after the transformation, op(A) is moved to the loop preheader.]

Figure 8: An example of superblock loop invariant code removal: (a) original program segment; (b) program segment after loop invariant code removal


Global variable migration. Global variable migration moves frequently accessed memory variables, such as globally declared scalar variables, array elements, or structure elements, into registers for the duration of the loop. Loads and stores to these variables within the loop are replaced by register accesses. A load instruction is inserted in the preheader of the loop to initialize the register, and a store is placed at each loop exit to update memory after the execution of the loop.

The precondition function for global variable migration consists of the following boolean predicates, which must all be satisfied. If op(x) is a memory access, let address(x) denote the memory address of the access.

1. op(x) is a load or store instruction.

2. address(x) is invariant in the superblock loop.

3. If op(x) is preceded by a conditional branch, it must not possibly cause an exception.

4. The compiler must be able to detect, in the superblock loop, all memory accesses whose addresses can equal address(x) at run time, and these addresses must be invariant in the superblock loop.

The action function of global variable migration consists of three steps:

1. A new load instruction op(a), with src(a) = address(x) and dest(a) = temp_reg, is inserted after the last instruction of the preheader of the superblock loop.

2. A store instruction op(b), with dest(b) = address(x) and src(b) = temp_reg, is inserted as the first instruction of each block that can be immediately reached when the superblock loop is exited.

(Footnote: if a basic block that is immediately reached from a control flow exit of the superblock loop can be reached from multiple basic blocks, a new basic block needs to be created to bridge the superblock loop and the originally reached basic block.)


3. All loads in the superblock loop with src(i) = address(x) are converted to register move instructions with src(i) = temp_reg, and all stores with dest(i) = address(x) are converted to register move instructions with dest(i) = temp_reg. The unnecessary copies are removed by later applications of copy propagation and dead code removal.
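The three steps above can be sketched in C under hedged assumptions (g_count, tally_original, and tally_migrated are hypothetical names; the local variable temp_reg plays the role of the register, since allocation itself cannot be expressed in portable C):

```c
#include <assert.h>

int g_count = 0;          /* hypothetical globally declared scalar */

/* Before migration: each update loads and stores g_count in memory. */
void tally_original(const int *a, int n) {
    for (int i = 0; i < n; i++)
        if (a[i] > 0)
            g_count = g_count + a[i];   /* memory load + store per update */
}

/* After migration: op(a) loads g_count into temp_reg in the preheader,
 * the loop body touches only the register, and op(b) stores it back at
 * the loop exit. */
void tally_migrated(const int *a, int n) {
    int temp_reg = g_count;             /* op(a): load in the preheader */
    for (int i = 0; i < n; i++)
        if (a[i] > 0)
            temp_reg += a[i];           /* register access replaces memory */
    g_count = temp_reg;                 /* op(b): store at the loop exit */
}
```

Both versions leave g_count with the same final value; the migrated version simply keeps the working copy out of memory for the duration of the loop.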

The figure below shows a simple example of superblock global variable migration. The memory variable x[r] cannot be migrated to a register by traditional global variable migration, because r is not loop invariant in the entire loop. On the other hand, r is loop invariant in the superblock loop, and x[r] can be migrated to a register by superblock global variable migration. The result is shown in part (b) of the figure. Extra instructions op(X) and op(Y) are added at the superblock loop boundary points to ensure correctness of execution.

Loop induction variable elimination. Induction variables are variables in a loop that are incremented by a constant amount each time the loop iterates. Induction variable elimination replaces the uses of one induction variable by another induction variable, thereby eliminating the need to increment the variable on each iteration of the loop. If the eliminated induction variable is needed after the loop is exited, its value can be derived from one of the remaining induction variables.

The precondition function for induction variable elimination consists of the following boolean predicates, which must all be satisfied:

1. op(x) is an inductive instruction of the form dest(x) <- dest(x) + K1.

2. op(x) is the only instruction which modifies dest(x) in the superblock.

3. op(y) is an inductive operation of the form dest(y) <- dest(y) + K2.



[Figure: an example of superblock global variable migration. (a) Original program segment, in which x[r] is accessed in memory on every iteration. (b) Program segment after global variable migration, with extra instructions op(X) and op(Y) added at the superblock loop boundary. The register operands of the instruction listing were lost in extraction.]


4. op(y) is the only instruction which modifies dest(y) in the superblock.

5. op(x) and op(y) are incremented by the same value, i.e., K1 = K2.

6. There are no branch instructions between op(x) and op(y).

7. For each operation op(j) in which src(j) contains dest(x), either j = x or all elements of src(j) except dest(x) are loop invariant.

8. All uses of dest(x) can be modified to dest(y) in the superblock without incurring a time penalty.

The action function of induction variable elimination consists of four steps:

1. op(x) is deleted.

2. A subtraction instruction op(m): dest(m) <- dest(x) - dest(y) is inserted after the last instruction in the preheader of the superblock loop.

3. For each instruction op(a) which uses dest(x), let other_src(a) denote the source operand of op(a) which is not dest(x). A subtraction instruction op(n): dest(n) <- other_src(a) - dest(m) is inserted after the last instruction in the preheader. The source operands of op(a) are then changed from dest(x) and other_src(a) to dest(y) and dest(n), respectively.

4. An addition instruction op(o): dest(x) <- dest(y) + dest(m) is inserted as the first instruction of each block that can be immediately reached when the superblock loop is exited and in which dest(x) is live-in.
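A hedged C sketch of the idea (find_original and find_eliminated are hypothetical functions, not the paper's code; x and y advance by the same constant, so x can be reconstructed as y plus the difference m of the initial values):

```c
#include <assert.h>

/* Before: two induction variables, x and y, each incremented by 1. */
int find_original(const int *a, int n, int key) {
    int x = 100;                     /* op(x): dest(x) <- dest(x) + 1 */
    for (int y = 0; y < n; y++) {
        if (a[y] == key)
            return x;                /* use of dest(x) */
        x = x + 1;
    }
    return -1;
}

/* After elimination: op(x) is deleted; op(m) computes the constant
 * difference of the initial values in the preheader, and the use of x
 * becomes y + m. */
int find_eliminated(const int *a, int n, int key) {
    int m = 100 - 0;                 /* op(m): dest(m) <- dest(x) - dest(y) */
    for (int y = 0; y < n; y++) {
        if (a[y] == key)
            return y + m;            /* x reconstructed only where needed */
    }
    return -1;
}
```

The eliminated version never increments x inside the loop; it recomputes x from y and m only at the point where the value is actually used, mirroring steps 2 and 4 of the action function.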

(Footnote: the restriction K1 = K2 can be removed for some special uses of dest(x); however, these special uses are too complex to be discussed in this paper.)

(Footnote, on the time-penalty predicate: for example, if we know that dest(x) != dest(y) because of different initial values, then a branch-if-not-equal instruction bne(dest(x), ...) is converted to a bne(dest(y), ...) instruction. For some machines, bne(dest(y), ...) needs to be broken down into a compare instruction plus a branch instruction, and then the optimization may degrade performance.)


It should be noted that the third step of the action function may increase the execution time of op(a) by changing a source operand from an integer constant to a register. For example, a branch-if-greater-than-zero instruction becomes a compare instruction plus a branch instruction if the constant zero source operand is converted to a register. The final precondition predicate prevents the code optimizer from making a wrong optimization decision in such cases. In traditional loop induction variable elimination, we check the entire loop body for violations of the precondition predicates. In superblock loop induction variable elimination, we check only the superblock, and therefore find more optimization opportunities.

Extension of superblock loop optimizations. In order to further relax the conditions for invariant code removal and global variable migration, the compiler can unroll the superblock loop body once. The first superblock serves as the first iteration of the superblock loop for each invocation, while the duplicate is used for iterations two and above. The compiler is then able to optimize the duplicate superblock loop knowing that each instruction in the superblock has been executed at least once. For example, instructions that are invariant, but conditionally executed due to a preceding branch instruction, can be removed from the duplicate superblock loop. With this extension, some of the precondition predicates for invariant code removal and for global variable migration can be eliminated. The implementation of our C compiler includes this extension.
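The peeling structure can be sketched in C (loop_original and loop_peeled are hypothetical names; the superblock-specific subtleties, such as off-trace branches, cannot be expressed in portable C, so this shows only the shape of the extension: the first iteration runs in original form, and the duplicate loop is optimized knowing every instruction has executed once):

```c
#include <assert.h>

/* Original loop: the invariant load of *len_p occurs on every iteration. */
int loop_original(const int *a, int n, const int *len_p) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int len = *len_p;            /* invariant load, every iteration */
        sum += a[i % len];
    }
    return sum;
}

/* After unrolling once: the peeled first iteration performs the load, so
 * the duplicate loop body keeps len in a register and omits the load. */
int loop_peeled(const int *a, int n, const int *len_p) {
    int sum = 0;
    if (n <= 0)
        return sum;
    int len = *len_p;                /* executed at least once */
    sum += a[0 % len];               /* peeled first iteration */
    for (int i = 1; i < n; i++)
        sum += a[i % len];           /* duplicate loop: load removed */
    return sum;
}
```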

EXPERIMENTATION

The table below shows the characteristics of the benchmark programs. The size column indicates the sizes of the benchmark programs, measured in numbers of lines of C code. The description column briefly describes the benchmark programs.


name       size   description
cccp              GNU C preprocessor
cmp               compare files
compress          compress files
eqn               typeset mathematical formulas for troff
eqntott           boolean minimization
espresso          boolean minimization
grep              string search
lex               lexical analysis program generator
mpla              pla generator
tbl               format tables for troff
wc                word count
xlisp             lisp interpreter
yacc              parsing program generator

Table: Benchmarks.

name       input  description
cccp              C source files
cmp               similar / different files
compress          C source files
eqn               ditroff files
eqntott           boolean equations
espresso          boolean functions (original espresso benchmarks)
grep              C source files, with various search strings
lex               lexers for C, Lisp, Pascal, awk, and pic
mpla              boolean functions minimized by espresso (original espresso benchmarks)
tbl               ditroff files
wc                C source files
xlisp             gabriel benchmarks
yacc              grammars for C, Pascal, pic, eqn, awk, etc.

Table: Input data for profiling.


For each benchmark program we have selected a number of input data sets for profiling. The preceding table shows the characteristics of these input data sets. The input column indicates the number of inputs used for each benchmark program. The description column briefly describes the input data. For each benchmark program we have also collected one additional input and used that input to measure performance. The execution time of a benchmark program annotated with probes for collecting profile information is several times slower than that of the original benchmark program. It should be noted that our profiler implementation is only a prototype and has not been tuned for performance.
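The probes mentioned above can be pictured as counters attached to basic blocks. A minimal hand-written sketch (probe_count and abs_sum are hypothetical; the actual profiler inserts such probes automatically and dumps the accumulated counts after each run):

```c
#include <assert.h>

/* Hypothetical probe counters, one per basic block of interest. */
long probe_count[3];

int abs_sum(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        probe_count[0]++;            /* probe: loop body block */
        if (a[i] < 0) {
            probe_count[1]++;        /* probe: infrequent block */
            sum -= a[i];
        } else {
            probe_count[2]++;        /* probe: frequent block */
            sum += a[i];
        }
    }
    return sum;
}
```

After a profiled run, the counter values identify the frequent path (here, the else-side block), which is exactly the information the superblock optimizations consume.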

name       global   profile   MIPS -O   GNU -O
cccp
cmp
compress
eqn
eqntott
espresso
grep
lex
mpla
tbl
wc
xlisp
yacc
avg
s.d.

Table: DEC execution speed for each individual benchmark.

The preceding table shows the output code quality of our prototype compiler. We compare the output code speed against that of the MIPS C compiler (at -O) and the GNU C compiler (at -O) on a DEC workstation, which uses a MIPS R-series processor. The numbers shown in the table are the speedups over the actual execution times of globally optimized code produced by our prototype compiler.


name       global   profile
cccp
cmp
compress
eqn
eqntott
espresso
grep
lex
mpla
tbl
wc
xlisp
yacc
avg
s.d.

Table: Ratios of code expansion.

The profile column shows the speedup achieved by applying profile-based code optimizations in addition to global code optimizations. Note that the input data used to measure the performance of the profile-based code optimizations are different from those used to gather the profile information.

The MIPS -O column shows the speedup achieved by the MIPS C compiler over our global code optimizations. The GNU -O column shows the speedup achieved by the GNU C compiler over our global code optimizations. The numbers in the MIPS -O and GNU -O columns show that our prototype global code optimizer performs slightly better than the two production compilers for all benchmark programs. The execution speed table clearly shows the importance of these superblock code optimizations.

The sizes of the executable programs directly affect the cost of maintaining these programs in a computer system, in terms of disk space. In order to control the code expansion due to tail duplication, basic blocks are added into a trace only if their execution counts exceed a predefined constant threshold; a fixed execution count threshold is used for these experiments. The code expansion table shows how the code optimizations affect the sizes of the benchmark programs. The profile column shows the sizes of profile-based code optimized programs relative to the sizes of globally optimized programs. The table shows that our prototype compiler has effectively controlled the code expansion due to forming superblocks.

The cost of implementing the profile-based classic code optimizations is modest. In our prototype compiler, the conventional global code optimizer, the profile-based classic code optimizer, and the profiler are each implemented in a moderate body of C code, with the profiler also relying on a few supporting subroutines.

CONCLUSIONS

We have shown how an execution profiler can be integrated into an optimizing compiler to provide the compiler with run-time information about input programs. We have described our design and implementation of profile-based classic code optimizations. We have identified two major reasons why these code optimizations are effective: eliminating control flows into the middle sections of a trace, and optimizing the most frequently executed path in a loop. Experimental results have shown that profile-based classic code optimizations significantly improve the performance of production C programs.

Acknowledgements

The authors would like to thank Nancy Warter, Andy Glew, William Chen, and all members of the IMPACT research group for their support, comments, and suggestions. We would like to acknowledge the anonymous referees, whose comments have helped us to improve the quality of this paper significantly. This research has been supported by the National Science Foundation (NSF) under Grant MIP, by Dr. Lee Hoevel at NCR, by the AMD K Advanced Processor Development Division, and by the National Aeronautics and Space Administration (NASA) under Contract NASA NAG, in cooperation with the Illinois Computer Laboratory for Aerospace Systems and Software (ICLASS).

References

1. A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley Publishing Company.

2. S. L. Graham, P. B. Kessler, and M. K. McKusick, "gprof: A Call Graph Execution Profiler", Proceedings of the SIGPLAN Symposium on Compiler Construction, SIGPLAN Notices, June.

3. S. L. Graham, P. B. Kessler, and M. K. McKusick, "An Execution Profiler for Modular Programs", Software - Practice and Experience, John Wiley & Sons, Ltd., New York.

4. AT&T Bell Laboratories, UNIX Programmer's Manual, Murray Hill, NJ, January.

5. S. McFarling and J. L. Hennessy, "Reducing the Cost of Branches", International Symposium on Computer Architecture, Conference Proceedings, Tokyo, Japan, June.

6. W. W. Hwu, T. M. Conte, and P. P. Chang, "Comparing Software and Hardware Schemes for Reducing the Cost of Branches", Proceedings of the Annual International Symposium on Computer Architecture, Jerusalem, Israel, May.

7. J. A. Fisher, "Trace Scheduling: A Technique for Global Microcode Compaction", IEEE Transactions on Computers, July.

8. J. R. Ellis, Bulldog: A Compiler for VLIW Architectures, The MIT Press.

9. P. P. Chang and W. W. Hwu, "Trace Selection for Compiling Large C Application Programs to Microcode", Proceedings of the Annual Workshop on Microprogramming and Microarchitectures, San Diego, California, November.

10. W. W. Hwu and P. P. Chang, "Achieving High Instruction Cache Performance with an Optimizing Compiler", Proceedings of the Annual International Symposium on Computer Architecture, Jerusalem, Israel, June.

11. K. Pettis and R. C. Hansen, "Profile Guided Code Positioning", Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June.

12. P. P. Chang and W. W. Hwu, "Optimization for Supercomputer Scalar Processing", Proceedings of the International Conference on Supercomputing, Crete, Greece, June.

13. D. W. Wall, "Global Register Allocation at Link Time", Proceedings of the SIGPLAN Symposium on Compiler Construction, June.

14. D. W. Wall, "Register Windows vs. Register Allocation", Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, June.

15. W. W. Hwu and P. P. Chang, "Inline Function Expansion for Compiling Realistic C Programs", Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, Portland, Oregon, June.

16. V. Sarkar, "Determining Average Program Execution Times and Their Variance", Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, June.

17. V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, Pitman, London, and The MIT Press, Cambridge, Massachusetts.

18. P. P. Chang, S. A. Mahlke, and W. W. Hwu, "Using Profile Information to Assist Classic Code Optimizations", Technical Report, Center for Reliable and High-Performance Computing (CRHC), University of Illinois, Urbana-Champaign.

19. F. Allen and J. Cocke, "A Catalogue of Optimizing Transformations", in R. Rustin (editor), Design and Optimization of Compilers, Prentice-Hall, Englewood Cliffs, NJ.