RICE UNIVERSITY

Value-Driven Redundancy Elimination

by

Loren Taylor Simpson

A Thesis Submitted

in Partial Fulfillment of the

Requirements for the Degree

Doctor of Philosophy

Approved, Thesis Committee:

Keith D. Cooper, Associate Professor, Chair
Department of Computer Science
Rice University

Linda Torczon, Faculty Fellow
Department of Computer Science
Rice University

Ken Kennedy, Noah Harding Professor
Department of Computer Science
Rice University

Sarita Adve, Assistant Professor
Department of Electrical and Computer Engineering
Rice University

Houston, Texas
April

Value-Driven Redundancy Elimination

Loren Taylor Simpson

Abstract

Value-driven redundancy elimination is a combination of value numbering and code motion. Value numbering is an optimization that assigns numbers to values in such a way that two values are assigned the same number if the compiler can prove they are equal. When this optimization discovers two computations that produce the same value, it can, under certain circumstances, eliminate one of them. Code motion is an optimization that attempts to move instructions to less frequently executed locations. Traditional techniques must assume that every definition produces a distinct value; therefore, an instruction cannot move past a definition of one of its subexpressions. This restriction can be relaxed when certain definitions are known to produce redundant values. By understanding how these two optimizations interact, we can simplify each of them, and the resulting combination will be more powerful than the sum of the two parts. Value numbering will be simpler because it need not be concerned with eliminating instructions, and code motion will be simpler because identifying subexpressions is not necessary.

This research investigates this value-driven approach to redundancy elimination. We improve upon the known algorithms for both value numbering and code motion, and we show experimental evidence of their effectiveness.

Acknowledgments

To say that this thesis has been an individual effort would be grossly misleading. Many people have contributed to this work in tangible and intangible ways. I would first like to thank God for leading me to Rice and surrounding me with wonderful friends and coworkers. My wife Sallye, and my parents and her parents, have provided me with constant support and encouragement.

The members of my committee are Keith Cooper, Linda Torczon, Ken Kennedy, and Sarita Adve. They have contributed greatly to this research. I admire and respect them deeply, and I am honored to be associated with them.

This research has been part of the Massively Scalar Compiler Project at Rice University. The project is supported by ARPA, and Vivek Sarkar of IBM has provided my support. The members of the group, past and present, are Keith Cooper, Linda Torczon, Ken Kennedy, Preston Briggs, Tim Harvey, Lisa Thomas, Rob Shillingsburg, Cliff Click, Chris Vick, John Lu, Edmar Wienskoski, Phil Schielke, and Linlong Jiang. Preston Briggs was a great influence during my early years as a graduate student, and I regret that he is no longer a member of our project. Tim Harvey has been the kind of friend who will go out of his way to help someone and then deny that it was any trouble at all.

Bob Morgan of DEC has shown a great deal of interest in the project, and he has encouraged our research in the area of redundancy elimination. Dave Spott and David Wallace of Sun Microsystems are implementing the algorithm for SCC-based value numbering in their next-generation compiler.

My time as a graduate student at Rice has been one of the most rewarding periods in my life. This is due in part to the standard of excellence set by the past and present Rice graduate students.

Contents

Abstract
Acknowledgments
List of Illustrations

Introduction
  Intermediate Representations
  Static Single Assignment Form
  Optimization
  Classification of Optimizations
  Safety, Opportunity, and Profitability
  Redundancy Elimination
  Organization of the Thesis

Hash-Based Value Numbering
  Static Single Assignment Form
  Dominator-Tree Value Numbering
  Incorporating Value Numbering into SSA Construction
  Unified Hash Table
  Interaction with Other Optimizations
  Data Structures
  Extensions
  Summary

Value Partitioning
  Complexity
  Data Structures to Support Partitioning
  Refining the Partition
  Handling Commutative Operations
  Eliminating Redundant Stores
  Summary

SCC-Based Value Numbering
  Shortcomings of Previous Techniques
  The RPO Algorithm
  Extensions
  Discussion
  The SCC Algorithm
  Example
  Uninitialized Values
  Summary

Code Removal
  Dominator-Based Removal
  AVAIL-Based Removal
  Summary

Code Motion
  Partial Redundancy Elimination
  Lazy Code Motion
  Critical Edges
  Moving LOAD Instructions
  Summary

Value-Driven Code Motion
  VDCM Algorithm
  Examples
  Summary

Relief of Register Pressure
  Identifying Blocks with High Register Pressure
  Choosing Expressions to Relieve Pressure
  Selecting Locations to Insert Instructions
  Results
  Summary

Experimental Results
  Experimental Compiler
  Tests Performed
  Raw Instruction Counts
  Normalized Instruction Counts
  Comparison with Previous State of the Art
  Compile Times
  Relief of Register Pressure
  Summary

Recommendations
  Summary of Contributions

A  Operator Strength Reduction
  The Algorithm
  Preliminary Transformations
  Finding Region Constants and Induction Variables
  Code Replacement
  Example
  Running Time
  Linear Function Test Replacement
  Follow-up Transformations
  Previous Work
  Summary

Bibliography

Illustrations

Organization of a typical compiler
Program before conversion to SSA
Program after conversion to SSA
Value numbering example
CFG for if-then-else construct
Dominator-tree value numbering
Dominator-tree value numbering example
Value numbering during SSA construction
Unified hash table
Interaction with other optimizations
Data structure for scoped hash table
Example program containing loads and stores
Partitioning algorithm
Partitioning example
Partitioning steps for example program
Incorrect partition when positions are ignored
Operations supported by the partition
Data structures for representing the partition
Data structures for refining the partition
Commutativity example
Partition for second commutativity example
Data structures for handling commutative operations
Example program for redundant-store elimination
Initial partition for redundant-store elimination example
Partition to enable redundant-store elimination
Data structures for redundant-store elimination
Improved by hash-based techniques
Improved by partitioning techniques
The RPO algorithm
Example requiring D(SSA) iterations
Example with D(CFG) ≠ D(SSA)
Tarjan's SCC-finding algorithm
SCC-based value numbering algorithm
Example with equal induction variables
Example with edges removed from SCC
Example with uninitialized values
Program improved by dominator-based removal
Program improved by AVAIL-based removal
Naive bucket sorting algorithm
Better bucket sorting algorithm
Example program
Data-flow equations for AVAIL-based removal
Program improved by partial redundancy elimination
Unnecessary code motion
Data-flow equations for lazy code motion (Part 1)
Data-flow equations for lazy code motion (Part 2)
Splitting a critical edge
Incorrect motion of a store instruction
Algorithm for computing altered using values
Algorithm for computing altered using lexical names
Expression tree
VDCM example
Another VDCM example
Example where redundancy elimination decreases register pressure
Motivating example for heuristic 1
Motivating example for heuristic 2
Motivating example for heuristic 3
Data-flow equations for relief of register pressure
Sample ILOC routine
Number of ILOC operations for hash-based value numbering techniques, Spec benchmark
Number of ILOC operations for value partitioning techniques, Spec benchmark
Number of ILOC operations for SCC-based value numbering techniques, Spec benchmark
Number of ILOC operations for hash-based value numbering techniques, FMM benchmark
Number of ILOC operations for value partitioning techniques, FMM benchmark
Number of ILOC operations for SCC-based value numbering techniques, FMM benchmark
Comparison of hash-based value numbering techniques, Spec benchmark
Comparison of value partitioning techniques, Spec benchmark
Comparison of SCC-based value numbering techniques, Spec benchmark
Comparison of value numbering techniques using dominator-based removal, Spec benchmark
Comparison of value numbering techniques using AVAIL-based removal, Spec benchmark
Comparison of value numbering techniques using lazy code motion, Spec benchmark
Comparison of value numbering techniques using value-driven code motion, Spec benchmark
Comparison of hash-based value numbering techniques, FMM benchmark
Comparison of value partitioning techniques, FMM benchmark
Comparison of SCC-based value numbering techniques, FMM benchmark
Comparison of value numbering techniques using dominator-based removal, FMM benchmark
Comparison of value numbering techniques using AVAIL-based removal, FMM benchmark
Comparison of value numbering techniques using lazy code motion, FMM benchmark
Comparison of value numbering techniques using value-driven code motion, FMM benchmark
Comparison with previous state of the art, Spec benchmark
Comparison with previous state of the art, FMM benchmark
Comparison of relief heuristics, Spec benchmark, LOAD/STORE weight
Comparison of relief heuristics, Spec benchmark, LOAD/STORE weight
Comparison of relief heuristics, FMM benchmark, LOAD/STORE weight
Comparison of relief heuristics, FMM benchmark, LOAD/STORE weight
Example
Transformed code
SSA graph
Tarjan's SCC-finding algorithm
Operator strength reduction algorithm
Code replacement functions
After operator strength reduction
A worst-case example

Chapter 1

Introduction

A compiler is a program that translates programs in one language, called the source language, to equivalent programs in another language, called the target language. Usually, the source language is a high-level language written by a programmer, and the target language is the machine code for some computer. The primary goals of the compiler are to:

1. Preserve the meaning of the original code.
2. Produce efficient code.
3. Compile quickly.

Figure 1.1 shows the organization of a typical compiler, consisting of three stages. The front end must perform scanning, parsing, and context-sensitive analysis. The optimizer transforms the program to improve its efficiency. Generally, the optimizer is organized as a sequence of passes, each with a specific purpose. The back end generates the target language; it must perform tasks such as register allocation and scheduling. There are many interesting problems involved in building both the front end and the back end of a compiler, but this thesis is concerned primarily with the compiler's optimizer.

Figure 1.1: Organization of a typical compiler. The source language enters the front end, flows through the optimizer to the back end, which emits the target language; each stage may report errors.

Intermediate Representations

The compiler transforms a program represented in the source language to an equivalent program represented in the target language. Along the way, a variety of intermediate representations will be used. The design of the intermediate representations is a fundamental part of the compiler development process. The intermediate representation allows the three stages to communicate, and it determines the amount of information about the program that is available.

Often, the compiler will build an abstract syntax tree (AST) during parsing. It is very similar to a parse tree, except that many of the unnecessary nodes are removed. The AST is very convenient for performing context-sensitive analysis and certain high-level optimizations.

The compiler can generate three-address code during a walk of the AST. Three-address code is a sequence of instructions of the form

    x ← y op z

Three-address code gets its name because each instruction can access three names. During the generation of three-address code, the compiler may create temporary names for results that are not given a name by the programmer. For example, a large expression will be calculated by a sequence of three-address instructions, and each instruction will define a compiler-generated temporary name. When the value of the entire expression has been calculated, it can be stored in one of the program's variables.
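The walk described above can be sketched as a small recursive routine. This is only an illustration: the function, the tuple encoding of expressions, and the temporary names t1, t2, ... are hypothetical choices, not part of the thesis's compiler.

```python
import itertools

def gen_three_address(node, code, temps):
    """Flatten an expression tree into three-address instructions.

    An expression is either a variable name (a string) or a tuple
    (op, left, right). Each interior node is assigned a
    compiler-generated temporary; the returned name holds the value
    of the whole (sub)expression."""
    if isinstance(node, str):            # leaf: a program variable
        return node
    op, left, right = node
    y = gen_three_address(left, code, temps)
    z = gen_three_address(right, code, temps)
    t = f"t{next(temps)}"                # hypothetical temporary name
    code.append(f"{t} <- {y} {op} {z}")
    return t

# a * b + c becomes two instructions, each defining a temporary
code = []
gen_three_address(('+', ('*', 'a', 'b'), 'c'), code, itertools.count(1))
```

Here the value of the whole expression ends up in the last temporary, which the compiler could then store into a program variable.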

The three-address code can be divided into basic blocks, or maximal sequences of straight-line code. A basic block has the property that if one instruction executes, they all execute, in order. The basic blocks are organized into a control-flow graph (CFG). The nodes in the CFG represent basic blocks, and the edges correspond to possible control flow from one basic block directly to another.
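As a sketch of how a compiler might carve three-address code into basic blocks, the classic leader-based partitioning can be written as follows; the (label, op, target) instruction encoding is a simplification invented for this example.

```python
def basic_blocks(instrs):
    """Partition instructions into basic blocks using leaders: the
    first instruction, every branch target, and every instruction
    following a branch begin a new block.

    Each instruction is (label, op, target); target is a label for
    'jump'/'branch' and None otherwise."""
    labels = {label: i for i, (label, _, _) in enumerate(instrs)}
    leaders = {0}
    for i, (_, op, target) in enumerate(instrs):
        if op in ('jump', 'branch'):
            if target in labels:
                leaders.add(labels[target])     # branch target
            if i + 1 < len(instrs):
                leaders.add(i + 1)              # fall-through successor
    starts = sorted(leaders)
    return [instrs[s:e] for s, e in zip(starts, starts[1:] + [len(instrs)])]
```

The CFG edges then connect each block to the blocks led by its branch target and its fall-through successor.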

Static Single Assignment Form

Many of the optimizations presented here are based on static single assignment (SSA) form. The basic idea used in constructing SSA form is to give unique names to the targets of all assignments in the routine and then to overwrite uses of the assignments with the new names. A complication arises in routines with more than one basic block: values can flow into a block from more than one definition site, and each site has a unique SSA name for the item. Consider the example in Figure 1.2.

              B1: if (...)
               /         \
      B2: x ← ...     B3: x ← ...
               \         /
              B4: y ← x

Figure 1.2: Program before conversion to SSA

There are two definition points for the name x referenced in B4. If each definition point is overwritten with a unique SSA name, how can we correctly replace the reference to x? Special assignments, called φ-functions, are used to solve the problem. These functions are placed at the routine's join points, that is, at the beginning of the basic blocks which have more than one predecessor. One φ-node is inserted at each join point for each name in the original routine; this name is called the φ-node's original name. In practice, to save space and time, φ-functions are placed at only certain join points and for only certain names. Specifically, a φ-function is placed at the birthpoint of each value: the earliest location where the joined value exists.

The φ-functions provide a single definition for a name that had more than one definition on different control-flow paths in the original routine. Each φ-node defines a new name for the original item as a function of all of the SSA names which are current at the end of the join point's predecessors. Any uses of the original item in the basic block are replaced by the new name. The number of inputs for a φ-node is equal to the number of predecessors of the block in which the φ-node appears. The semantics of the φ-function are simple: the φ-node selects the value of the input that corresponds to the block from which control is transferred, and assigns this value to the result. When φ-functions are properly inserted, it becomes possible for any routine to be renamed to produce an equivalent routine in which each name has exactly one definition point. The SSA form of the example in Figure 1.2 is shown in Figure 1.3; renaming has already been performed.

              B1: if (...)
               /          \
      B2: x0 ← ...     B3: x1 ← ...
               \          /
              B4: x2 ← φ(x0, x1)
                  y0 ← x2

Figure 1.3: Program after conversion to SSA

If control is transferred from block B2 to B4, the φ-node will assign x2 the value from x0. If control is transferred from block B3 to B4, the φ-node will assign x2 the value from x1.
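Operationally, this rule amounts to selecting a φ input by the edge along which control arrived. The tiny sketch below only illustrates that semantics; the block names and values are hypothetical, and this is not how a compiler actually evaluates SSA form.

```python
def phi(inputs):
    """Model a φ-node as a map from predecessor block to the SSA name
    current along that edge; evaluating it selects the input for the
    edge actually taken."""
    def select(came_from, env):
        return env[inputs[came_from]]
    return select

# x2 <- phi(x0 from B2, x1 from B3), as in the example above
x2 = phi({'B2': 'x0', 'B3': 'x1'})
value = x2('B2', {'x0': 10, 'x1': 20})   # control arrived from B2: x0's value
```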

Optimization

This thesis focuses on the compiler's optimizer. Actually, the term optimization is somewhat misleading: it implies that the code produced is optimal. In reality, the optimizer transforms the intermediate representation in a way that is believed to improve the program. Typically, improvements focus on the execution speed of the program; however, other improvements, such as reducing memory requirements, are also possible. There is no guarantee that the resulting code cannot be improved. In fact, it is possible that the optimizer will make the program worse. Therefore, many tradeoffs must be considered when designing an optimizer. The fundamental responsibility of the optimizer is to preserve observational equivalence; in other words, the optimized program must produce the same results as the unoptimized program for all possible inputs.

A typical optimizer is organized as a sequence of passes, or optimizations. Each pass tries to either improve the running time of the program or decrease its space requirements. Some passes, or all passes, may be repeated. Each optimization performs a specific task. Examples include:

- discover and propagate a constant value
- move a computation to a less frequently executed place
- specialize a computation based on context
- discover a redundant computation and remove it
- remove code that is useless or unreachable
- combine several instructions into a single, more powerful instruction

The order in which the transformations are performed is important for several reasons. One optimization may require code in a certain shape. One optimization may discover facts that another needs. One optimization may introduce opportunities, or destroy opportunities, for another optimization.

Classification of Optimizations

When discussing a particular optimization, it is often helpful to compare it to other optimizations. To do this effectively, we must understand which optimizations are good candidates for comparison. The first classification is based on the assumptions made about the target machine:

- Machine independent: assumes no knowledge of the target machine.
- Machine dependent: uses a specific feature of the target machine.

There are very few optimizations that are truly independent of the target machine. For example, removing a redundant computation would improve the execution time of the program on most machines. However, this change may increase the register pressure beyond what the target machine can support; if so, the register allocator will be forced to insert spill code that may be more expensive than the original computation. As a rule of thumb, machine-independent optimizations apply to a broad class of machines, and they ignore many of the constraints present in real machines. Examples include removing redundant computations and moving instructions to less frequently executed locations. Optimizations whose primary focus is to take advantage of some feature of a particular architecture are called machine dependent. Examples include rewriting memory operations to take advantage of specific addressing modes and reordering instructions to reduce the number of pipeline stalls.

The second classification is based on the scope of the optimization: how much of the program must be analyzed.

- Local: analyze and transform a single basic block.
- Global: analyze and transform a single procedure.
- Interprocedural: analyze and transform the whole program.

As we shall see, there are several scopes that fall between these major levels. For example, in Chapter 2 we will discuss extended basic blocks, which lie between local and global.

Safety, Opportunity, and Profitability

The primary criteria for evaluating optimizations are safety, opportunity, and profitability. Safety deals with the issue of whether or not an optimization will destroy the observational equivalence property. An unsafe optimization has the potential to change the observable behavior of a program. Unsafe optimizations should never be applied.

Opportunity is concerned with the amount of effort involved in identifying locations within the program where the transformation can be applied. The intermediate representation plays a significant role in this aspect of optimization. For example, locating the loops is rather simple using an abstract syntax tree, but it is more difficult if the program is represented by a control-flow graph. On the other hand, the programmer may have created a loop using gotos, which will not be visible in the abstract syntax tree. Another common way to identify opportunities is data-flow analysis: compile-time reasoning about the run-time flow of values.

Profitability deals with the amount of improvement expected from applying a transformation. The profitability of an optimization can be computed in a number of ways. Some transformations, such as dead code elimination (removing computations whose values are never used), are always profitable. Other transformations, such as removing redundant computations, are simply assumed to be profitable; we saw how this assumption can be violated in the previous section. The profitability of an optimization might be calculated at compile time or at run time. If the calculation shows that the transformation is profitable, then it will be applied. When the calculation is performed at run time, the compiler must generate code with and without applying the transformation. Often, the compiler writer must consider the tradeoffs between opportunity and profitability when deciding which optimizations to implement. For example, a transformation with large amounts of compile time required to find opportunities and relatively small payoffs in profitability would not be very attractive. On the other hand, a transformation where opportunities are easy to find and profitability is large would be very attractive.

Redundancy Elimination

This research focuses on new techniques for compiler-based redundancy elimination. Optimizing compilers often attempt to eliminate redundant computations, either by removing instructions or by moving instructions to less frequently executed locations. Historically, the algorithms aimed at removing instructions have been designed independently from those aimed at moving instructions. Usually, there is one optimization pass that attempts to determine when two instructions compute the same value and then decides if one of the instructions can be eliminated. A second optimization pass determines a set of locations where an instruction would compute the same result, and it selects the one that is expected to be the least frequently executed. This approach suffers from a phase-ordering problem, as well as the problem that no information is shared between the two passes. We believe the correct approach to redundancy elimination is a single optimization with two steps:

1. Determine which computations in the program compute the same value, and identify the values computed in the routine. We will refer to this step as value numbering, because we assign numbers to values so that two values have the same number if the compiler can prove they are equal.

2. Use the value numbers to remove instructions or move them to less frequently executed locations.

One advantage of formulating the problem in this manner is that it provides a good separation of concerns. Step 1 encodes the knowledge that it discovers into the choice of specific value numbers. Step 2 can rely on the numbers encoded in the name space, or in a set of tables, as a basis for its reasoning. Thus, we can select the algorithm for step one independently of the selection of the algorithm for step two. This research has improved upon the best known solution to each of these steps.

Organization of the Thesis

The remainder of this thesis is organized as follows. Chapter 2 will discuss the hash-based approach to value numbering. Chapter 3 will present a global algorithm for value numbering that is based on partitioning. Chapter 4 will present an original algorithm that combines the advantages of hash-based value numbering and value partitioning. Chapter 5 will explain techniques for removing instructions from a routine, and Chapter 6 will present techniques for moving instructions to less frequently executed locations. Chapter 7 will present a new algorithm that improves upon previous code motion frameworks by taking advantage of the results of value numbering. Chapter 8 will discuss a new technique for applying code motion techniques to relieve register pressure and to improve register allocation. All of the techniques described in this thesis have been implemented using the compiler developed by the Massively Scalar Compiler Project at Rice University. An experimental comparison of the techniques will be presented in Chapter 9, and Chapter 10 will summarize the contributions and results of this thesis. We present an algorithm for operator strength reduction in Appendix A, because the algorithm relies on redundancy elimination and it is centered around finding the strongly connected components of the SSA graph, just like the value numbering algorithm in Chapter 4.

Chapter 2

Hash-Based Value Numbering

Value numbering is a code optimization technique with a long history in both literature and practice. Although the name was originally applied to a method for improving single basic blocks, it is now used to describe a collection of optimizations that vary in power and scope. In particular, value numbering accomplishes four objectives:

1. It assigns an identifying number, a value number, to each value computed by the code, in such a way that two values have the same number if the compiler can prove they are equal for all possible program inputs.

2. It recognizes certain algebraic identities, like i + 0 = i and j × 1 = j, and uses them to simplify the code and to expand the set of values known to be equal.

3. It uses value numbers to find redundant computations and remove them.

4. It discovers constant values, evaluates expressions whose operands are constants, and propagates them through the code.
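The second objective can be sketched as a small simplifier that recognizes a handful of identities before an expression is hashed, so that, for example, i + 0 receives the same value number as i. The operator set and the tuple encoding below are illustrative assumptions, not the thesis's implementation.

```python
def simplify(op, a, b):
    """Apply a few algebraic identities to an expression before
    hashing. Returns a single operand (or constant) when the
    expression simplifies, or the original (op, a, b) triple when
    it does not."""
    if op == '+':
        if a == 0: return b          # 0 + b  ->  b
        if b == 0: return a          # a + 0  ->  a
    if op == '*':
        if a == 1: return b          # 1 * b  ->  b
        if b == 1: return a          # a * 1  ->  a
        if a == 0 or b == 0: return 0
    return (op, a, b)
```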

Cocke and Schwartz describe a local technique that uses hashing to discover redundant computations and fold constants. We believe the technique was originally invented by Balke at Computer Sciences Corporation. Each unique value is identified by its value number. Two computations in a basic block have the same value number if they are provably equal. In the literature, this technique and its derivatives are called value numbering.

The algorithm is relatively simple, and in practice it is very fast. For each instruction, from top to bottom in the block, it hashes the operator and the value numbers of the operands to obtain the unique value number that corresponds to the computation's value. If the value has already been computed in the block, the expression will already exist in the table, and the recomputation can be replaced with a reference to the earlier computation. Any operator with known-constant arguments is evaluated, and the resulting value is used to replace subsequent references. The algorithm is easily extended to account for commutativity and simple algebraic identities without affecting its complexity.
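The loop just described can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: the (dest, op, arg1, arg2) encoding, the use of the first defining name as a value's number, and the restriction to + and * are simplifying assumptions.

```python
def local_value_number(block):
    """Hash-based local value numbering for one basic block.

    Commutative operators are canonicalized by sorting operand value
    numbers; expressions over constants are folded. A redundant
    computation is rewritten as a copy from the first name that
    computed the value."""
    table = {}     # (op, vn1, vn2) -> first defining name
    vn = {}        # name -> value number (a name, or an int constant)
    out = []
    for dest, op, a, b in block:
        va = vn.get(a, a)
        vb = vn.get(b, b)
        if isinstance(va, int) and isinstance(vb, int):
            const = {'+': va + vb, '*': va * vb}[op]   # fold constants
            vn[dest] = const
            out.append((dest, 'const', const, None))
            continue
        if op in ('+', '*'):                           # commutative
            key = (op, *sorted((va, vb), key=str))
        else:
            key = (op, va, vb)
        if key in table:                               # redundant
            vn[dest] = vn[table[key]]
            out.append((dest, 'copy', table[key], None))
        else:
            table[key] = dest
            vn[dest] = dest
            out.append((dest, op, a, b))
    return out
```

For example, y + x is recognized as a recomputation of an earlier x + y, and an operator whose arguments are both constants folds to a constant.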

As variables get assigned new values, the compiler must carefully keep track of the location of each expression in the hash table. Consider the code fragment on the left side of Figure 2.1. At the assignment to C, the expression X + Y is found in the hash table, but it is available in B and not in A, since A has been redefined. At the assignment to D, the situation is worse: X + Y is in the hash table, but it is not available anywhere. We can handle this by attaching a list of variables to each expression in the hash table and carefully keeping it up to date.

As described, the technique works for single basic blocks. It can also be applied to an expanded scope called an extended basic block. An extended basic block is a sequence of blocks B1, B2, ..., Bn where Bi-1 is the only predecessor of Bi for 1 < i ≤ n, and where B1 does not have a unique predecessor. Notice that a block can be a member of more than one extended basic block. For example, consider an if-then-else construct like the one shown in Figure 2.2, where block B1 is contained in two extended basic blocks, (B1, B2) and (B1, B3). Value numbering over extended blocks works precisely because each value that flows into a block must flow from its predecessor and nowhere else. The blocks in the sequence can be processed by initializing their hash tables with the results of processing the previous block. In general, the extended basic blocks in a flow graph form a forest, which suggests that a scoped hash table, similar to one that would be used for nested-scope languages, would be appropriate. Rather than copying the hash table, new entries can be removed after a block is processed. In reality, the compiler must do more than delete information

    A ← X + Y          A0 ← X0 + Y0
    B ← X + Y          B0 ← X0 + Y0
    A ← ...            A1 ← ...
    C ← X + Y          C0 ← X0 + Y0
    B ← ...            B1 ← ...
    C ← ...            C1 ← ...
    D ← X + Y          D0 ← X0 + Y0

    (a) Original       (b) SSA form

Figure 2.1: Value numbering example

         B1
        /  \
      B2    B3
        \  /
         B4

Figure 2.2: CFG for if-then-else construct

added by the new block: it must restore the name list for each expression and the mapping from variables to value numbers. In practice, this adds a fair amount of overhead and complication to the algorithm, but it does not change its asymptotic complexity.
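The scoped hash table suggested above can be sketched as a stack of dictionaries: entering a block pushes an empty scope, and leaving it pops that scope, discarding the block's entries without copying the whole table. This is an illustrative sketch, not the thesis's data structure.

```python
class ScopedTable:
    """Scoped hash table for value numbering over extended basic
    blocks (or, later, the dominator tree)."""

    def __init__(self):
        self.scopes = [{}]          # outermost scope

    def push(self):
        """Open a new scope on entry to a block."""
        self.scopes.append({})

    def pop(self):
        """Discard the current block's entries on exit."""
        self.scopes.pop()

    def insert(self, key, value):
        self.scopes[-1][key] = value

    def lookup(self, key):
        for scope in reversed(self.scopes):   # innermost scope wins
            if key in scope:
                return scope[key]
        return None
```

A walk of the extended-basic-block tree calls push on entry to each block and pop on exit, so entries added by an inner block never leak into its siblings.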

Static Single Assignment Form

Many of the difficulties encountered during value numbering of extended basic blocks can be overcome by constructing the static single assignment (SSA) form of the routine. The critical property of SSA that we require is the naming discipline that it imposes on the code. Each SSA name is assigned a value by exactly one operation in a routine; therefore, no name is ever reassigned, and no expression ever becomes inaccessible. The advantage of this approach becomes apparent if the code in Figure 2.1 is converted to SSA form. At the assignment to C0, the expression X0 + Y0 can be replaced by A0, because the second assignment to A was given the name A1. Similarly, the expression at the assignment to D0 can be replaced by A0. Also, the transition from single to extended basic blocks is simpler, because we can in fact use a scoped hash table where only the new entries must be removed.

Dominator-Tree Value Numbering

The concept of dominance is very important in program optimization and the construction of SSA form. In a flow graph, if node X appears on every path from the start node to node Y, then X dominates Y. If X dominates Y and X ≠ Y, then X strictly dominates Y. The immediate dominator of Y, idom(Y), is the closest strict dominator of Y. In the routine's dominator tree, the parent of each node is its immediate dominator. Notice that all nodes that dominate a node X are ancestors of X in the dominator tree.
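For reference, dominator sets can be computed by the standard iterative data-flow method, after which idom(Y) is the closest strict dominator of Y. The sketch below iterates full sets to a fixed point for clarity; a production compiler would use a faster algorithm.

```python
def dominators(succ, entry):
    """Iterative dominator computation: Dom(entry) = {entry}, and
    Dom(n) = {n} ∪ intersection of Dom(p) over predecessors p,
    iterated to a fixed point. succ maps node -> successor list."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, ss in succ.items():
        for s in ss:
            pred[s].add(n)
    dom = {n: set(nodes) for n in nodes}      # start optimistic
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if pred[n]:
                new = {n} | set.intersection(*(dom[p] for p in pred[n]))
            else:
                new = {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom
```

On the diamond CFG of Figure 2.2, for example, B1 dominates every block, while neither B2 nor B3 dominates the join block.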

Aside from the naming discipline imposed, another key feature of SSA form is the information it provides about the way values flow into each basic block. A value can enter a block B in one of two ways: either it is defined by a φ-node at the start of B, or it flows through B's parent in the dominator tree. Notice that for extended basic blocks, Bi-1 is the immediate dominator of Bi for 1 < i ≤ n. These observations led us to an algorithm for value numbering over the dominator tree. Bob Morgan of DEC also made these observations and encouraged us to pursue this approach.

The algorithm processes each block by initializing the hash table with the information resulting from value numbering its parent in the dominator tree. To accomplish this, we again use a scoped hash table. The value numbering proceeds by recursively walking the dominator tree. Figure 2.3 shows high-level pseudo-code for the algorithm.

To simplify the implementation of the algorithm, the SSA name of the first occurrence of an expression in this path in the dominator tree becomes the expression's value number. When a redundant computation of the expression is found, the compiler removes the operation and replaces all uses of the defined SSA name with the expression's value number. The compiler can use this replacement scheme over a limited region of code: in blocks dominated by the operation, and in parameters to φ-nodes in the dominance frontier of the operation.¹ In both cases, control must flow through the block where the first evaluation occurred, defining the SSA name's value.

The φ-nodes require special treatment. Before the compiler can analyze the φ-nodes in a block, it must have previously assigned value numbers to all of their inputs. This is not possible in all cases; specifically, any φ-node input whose value flows through a back edge cannot have a value number. If any of the parameters of a φ-node have not been assigned a value number, then the compiler cannot analyze the φ-node, and it must assign a unique new value number to the result. The following

The dominancefrontier of no de X is the set of no des Y such that X dominates a predecessor of

Y but X do es not strictly dominate Y ie DFX fY jP Pred Y XP and X Y g

procedure DVNT(Block b)
    Mark the beginning of a new scope
    for each φ-node p (for name n) in b
        if p is meaningless or redundant
            Put the value number for p into VN[n]
            Remove p
        else
            VN[n] ← n
            Add p to the hash table
    for each assignment a of the form "n ← exp" in b
        if exp is found in the hash table
            Put the value number for exp into VN[n]
            Remove a
        else
            VN[n] ← n
            Add exp to the hash table
    for each successor s of b
        Adjust the φ-node inputs in s
    for each child c of b in the dominator tree
        DVNT(c)
    Clean up the hash table after leaving this scope

Figure: Dominator-tree value numbering

two conditions guarantee that all φ-node parameters in a block have been assigned value numbers:

1. When DVNT is called recursively for the children of block b in the dominator tree, the children must be processed in reverse postorder. This ensures that all of a block's predecessors are processed before the block itself, unless the predecessor is connected by a back edge relative to the DFS tree.

2. The block must have no incoming back edges.

If the above conditions are met, we can analyze the φ-nodes in a block and decide whether they can be eliminated. A φ-node can be eliminated if it is meaningless or redundant. A φ-node is meaningless if all its parameters have the same value number; a meaningless φ-node can be removed if the references to its result are replaced with the value number of its input parameters. A φ-node is redundant if it computes the same value as another φ-node in the same block. The compiler can identify redundant φ-nodes using a hashing scheme analogous to the one used for expressions. Without additional information about the conditions controlling the execution of different blocks, the compiler cannot compare φ-nodes in different blocks.

After value numbering the φ-nodes and instructions in a block, the algorithm visits each successor block and updates any φ-node inputs that come from the current block. This involves determining which φ-node parameter corresponds to input from the current block and overwriting the parameter with its value number. Notice the resemblance between this step and the corresponding step in the SSA construction algorithm. This step must be performed before value numbering any of the block's children in the dominator tree if the compiler is going to analyze φ-nodes.
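The recursive walk just described can be sketched in executable form. The following Python sketch is illustrative, not the thesis's implementation: it assumes a tiny invented IR in which each block holds (name, op, operand...) tuples, it uses the scoped table to pop entries on the way back up the dominator tree, and it omits φ-node processing and the successor update.

```python
# Hypothetical sketch of the dominator-tree walk: names and the IR are
# invented for illustration.  Each block maps to a list of
# (name, op, operand...) tuples; phi-nodes and successor updates are omitted.
class DVNT:
    def __init__(self, dom_children):
        self.dom_children = dom_children   # block -> children in the dominator tree
        self.vn = {}                       # SSA name -> value number (an SSA name)
        self.scopes = []                   # stack: keys inserted by each block

    def run(self, blocks, root):
        self.table = {}                    # expression key -> value number
        self.blocks = blocks
        self._visit(root)

    def _visit(self, b):
        self.scopes.append([])             # mark the beginning of a new scope
        kept = []
        for name, op, *args in self.blocks[b]:
            # Hash on the value numbers of the operands, not their names.
            key = (op,) + tuple(self.vn.get(a, a) for a in args)
            if key in self.table:          # redundant: record the VN, drop the op
                self.vn[name] = self.table[key]
            else:                          # first occurrence: VN is the SSA name
                self.vn[name] = name
                self.table[key] = name
                self.scopes[-1].append(key)
                kept.append((name, op, *args))
        self.blocks[b] = kept
        for c in self.dom_children.get(b, []):
            self._visit(c)
        for key in self.scopes.pop():      # clean up after leaving this scope
            del self.table[key]
```

On a two-block example where the first block dominates the second, a recomputation of a + b in the dominated block is deleted and its name maps to the dominating definition; entries inserted by one sibling are popped before the other sibling is visited, so siblings do not share entries.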

To illustrate how the algorithm works, we will apply it to the code fragment in the figure below. The first block processed will be B. Since none of the expressions on the right-hand sides of the assignments have been seen, the names u, v, and w will be assigned their own SSA names as their value numbers.

The next block processed will be B. Since the expression c + d was defined in the first block, which dominates this one, we can delete the two assignments in this block, assigning the value number for both x and y to be v. Before we finish processing this block, we must fill in the φ-node parameters in its successor block. The first argument of the φ-nodes there corresponds to input from this block, so we replace u, x, and y with u, v, and v, respectively.

Figure: Dominator-tree value numbering example (the code before and after, with the resulting value numbers)

The next block visited redefines u, x, and y. Since every right-hand-side expression has already been seen, we assign the value numbers for u, x, and y to be u, w, and w, respectively, and remove the assignments. To finish processing this block, we fill in the second parameter of the φ-nodes in its successor with u, w, and w, respectively.

The final block processed will be the join block. The first step is to examine the φ-nodes. Notice that we are able to examine the φ-nodes only because we processed the root's children in the dominator tree in reverse postorder and because there are no back edges flowing into this block. The φ-node defining u is meaningless: because all its parameters are equal, they have the same value number. Therefore, we eliminate the φ-node by assigning u the value number of its input. Notice that this φ-node was made meaningless by eliminating the only assignment to u in a block with the join block in its dominance frontier. In other words, when we eliminated that assignment to u, we eliminated the reason the φ-node for u was inserted during the construction of SSA form. The second φ-node combines the values v and w. Since this is the first appearance of a φ-node with these parameters, x is assigned its SSA name as its value number. The φ-node defining y is redundant because it is equal to the one defining x; therefore, we eliminate this φ-node by assigning y the value number x. When processing the assignments in the block, we replace each operand by its value number. This results in the expression u + x in the assignment to z. The final assignment to u is eliminated by giving it the value number of the original u.

Notice that if we applied single-basic-block value numbering to this example, the only redundancies we could remove would be the two assignments to y. If we applied extended-basic-block value numbering, we could also remove the two assignments to x and the redefinition of u. Only dominator-tree value numbering can remove the φ-node for u, the φ-node for y, and the final assignment to u.

Incorporating Value Numbering into SSA Construction

We have described dominator-tree value numbering as it would be applied to routines already in SSA form. However, it is possible to incorporate value numbering into the SSA construction process. There is a great deal of similarity between the value numbering process and the renaming process during SSA construction. The renaming process can be modified as follows to accomplish renaming and value numbering simultaneously:

1. For each name in the original program, a stack is maintained that contains the subscripts used to replace uses of that name. To accomplish value numbering, these stacks will contain value numbers.

2. Before inventing a new name for each φ-node or assignment, we first check whether it can be eliminated. If so, we push the value number of the φ-node or assignment onto the stack for the defined name.

The algorithm for dominator-tree value numbering during SSA construction is presented in the figure below.

Unified Hash Table

A further improvement to dominator-tree hash-based value numbering is possible. We walk the dominator tree using a unified table (i.e., a single hash table for the entire routine) and replace each SSA name with its value number. The figure below illustrates how this technique differs from dominator-tree value numbering. Since blocks B and C are siblings in the dominator tree, the entry for a + b would be removed from the scoped hash table after processing block B and before processing block C; therefore, the two occurrences of the expression would be assigned different value numbers. On the other hand, no hash-table entries are removed when using a unified table, which allows both occurrences of a + b to be assigned the same value number.

Using a unified hash table has a very important algorithmic consequence: replacements cannot be performed online, because the table no longer reflects availability. Thus, we cannot immediately remove expressions found in the table. In the example, it would be unsafe to remove the computation of a + b from block C. However, computations that are simplified, such as the meaningless φ-node in block D, may be removed. Since we cannot remove all redundancies during value numbering, we must use a second pass over the code to perform replacement; we can use any of the techniques described in later chapters to eliminate the actual redundancies.

Strictly speaking, the unified hash table algorithm is not a global technique, because it only works on acyclic subgraphs. In particular, it cannot analyze φ-nodes in blocks with incoming back edges, and therefore it must assign a unique value number to any φ-node in such a block.
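The difference from the scoped walk can be seen in a few lines. This Python sketch is hypothetical and reuses an invented tuple IR: because entries are never removed, both sibling occurrences of a + b receive the same value number, while no code is deleted here, since the table no longer proves availability.

```python
# Hypothetical sketch of the unified-table walk: a single table for the whole
# routine, entries never removed, and congruence recorded without deleting code.
def unified_vn(blocks, dom_children, root):
    table, vn = {}, {}
    def visit(b):
        for name, op, *args in blocks[b]:
            key = (op,) + tuple(vn.get(a, a) for a in args)
            vn[name] = table.setdefault(key, name)   # share the number, never pop
        for c in dom_children.get(b, []):
            visit(c)
    visit(root)
    return vn

# Siblings B and C both compute a + b: with a unified table the two
# occurrences receive the same value number, though neither is removed here.
vn = unified_vn({'A': [], 'B': [('x', '+', 'a', 'b')], 'C': [('y', '+', 'a', 'b')]},
                {'A': ['B', 'C']}, 'A')
assert vn['y'] == vn['x']
```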

procedure rename_and_value_number(Block b)
    Mark the beginning of a new scope
    for each φ-node p (for name n) in b
        if p is meaningless or redundant
            Push the value number for p onto S[n]
            Remove p
        else
            Invent a new value number v for n
            Push v onto S[n]
            Add p to the hash table
    for each assignment a of the form "n ← exp" in b
        if exp is found in the hash table
            Push the value number for exp onto S[n]
            Remove a
        else
            Invent a new value number v for n
            Push v onto S[n]
            Add exp to the hash table
    for each successor s of b
        Adjust the φ-node inputs in s
    for each child c of b in the dominator tree
        rename_and_value_number(c)
    Clean up the hash table after leaving this scope
    for each φ-node or assignment a in the original b
        for each name n defined by a
            pop(S[n])

Figure: Value numbering during SSA construction

Figure: Unified hash table (before and after). Blocks B and C are siblings dominated by A; each computes x ← a + b, and block D merges them with a φ-node and defines y from x.

Interaction with Other Optimizations

We should point out that eliminating more redundancies does not necessarily result in reduced execution time. This effect is a result of the way optimizations interact. The main interactions are with register allocation and with combining instructions via copy folding or peephole optimization. Each replacement affects register allocation because it has the potential of shortening the live ranges of its operands and lengthening the live range of its result. Because the precise impact of a replacement on the lifetimes of values depends completely on context, the impact on demand for registers is difficult to assess. In a three-address intermediate code, each replacement has two opportunities to shorten a live range and one opportunity to extend a live range. We will discuss this issue further in a later chapter.

The interaction between value numbering and other optimizations can also affect the execution time of the optimized program. The example in the figure below illustrates how removing more redundancies may not result in improved execution time. The code in block B loads the value of the second element of a common block called foo, and the code in a later block loads the first element of the same common block. Compared to value numbering over single basic blocks, value numbering over extended basic blocks will remove more redundancies. In particular, the computation of register r is not needed because the same value is in another register. However, after the replacement in the definition of r, the earlier definition is no longer used in its own block. That definition is now partially dead because it is used along the path through block

Figure: Interaction with other optimizations (the original program; the code after value numbering over single basic blocks and over extended basic blocks; and the final code for each)

B but not along the path through the other successor. If the path through block B is taken at run time, the computation of r will be unused. On the other hand, value numbering over single basic blocks did not remove the definition of r, and the later definition can be removed by dead code elimination. The result is that both paths through the CFG are as short as possible. Other optimizations that fold or combine instructions, such as constant propagation or peephole optimization, can produce analogous results.

Data Structures

The data structures needed to support the hash-based value numbering algorithm are relatively simple: the array VN maps SSA names to value numbers, and the hash table maps expressions to value numbers.

We maintain our hash tables using overflow chaining. Further, since we value number during a walk of the dominator tree, we need a scoped hash table, similar to the style sometimes used to support compilation of a lexically scoped language. To accomplish this, we need an efficient way to remove the entries inserted during the processing of a particular block. Each entry in the table will contain pointers to two other entries:

1. the next entry in the same overflow chain, and

2. the next entry in the current scope.

At each nesting level, a variable called scope will point to the start of the list of entries in the current scope. Removing these entries is then a simple matter of traversing this list. The hash table data structure is shown in the figure below. In our implementation, there are two hash tables: one for expressions and one for φ-nodes.
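A minimal sketch of such a scoped table, assuming tuple keys and Python objects in place of the compiler's entries; the structure is illustrative, not the thesis's implementation. Each entry carries the two links described above, so closing a scope unlinks exactly the entries that scope inserted.

```python
# Hypothetical sketch of the scoped hash table: each entry links to the next
# entry in its overflow chain and to the next entry inserted in its scope.
class Entry:
    __slots__ = ('key', 'value', 'next_in_bucket', 'next_in_scope')
    def __init__(self, key, value, next_in_bucket, next_in_scope):
        self.key, self.value = key, value
        self.next_in_bucket = next_in_bucket
        self.next_in_scope = next_in_scope

class ScopedTable:
    def __init__(self, nbuckets=64):
        self.buckets = [None] * nbuckets   # heads of the overflow chains
        self.scopes = []                   # heads of the per-scope entry lists

    def open_scope(self):
        self.scopes.append(None)

    def insert(self, key, value):
        b = hash(key) % len(self.buckets)
        e = Entry(key, value, self.buckets[b], self.scopes[-1])
        self.buckets[b] = e                # push onto the overflow chain
        self.scopes[-1] = e                # and onto the current scope's list

    def lookup(self, key):
        e = self.buckets[hash(key) % len(self.buckets)]
        while e is not None:
            if e.key == key:
                return e.value
            e = e.next_in_bucket
        return None

    def close_scope(self):
        e = self.scopes.pop()
        while e is not None:               # walk the scope list; each entry is
            b = hash(e.key) % len(self.buckets)   # at the front of its chain
            self.buckets[b] = e.next_in_bucket
            e = e.next_in_scope
```

Because scopes nest with the dominator-tree walk, the entries of the scope being closed are always at the front of their overflow chains, so unlinking costs constant time per entry.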

Extensions

There are several extensions that can be applied to hash-based value numbering. A very simple extension is to handle commutative operators more flexibly. Whenever an expression with a commutative operator is encountered, we sort its operands based on their value numbers before searching the hash table. Thus, expressions containing a commutative operator are always put into a canonical form before processing.

There are other operators that technically are not commutative but that can also be handled more effectively using this technique. For example, x ≤ y should produce

Figure: Data structure for the scoped hash table (buckets with overflow chains, plus per-scope entry lists headed by the scope variable)

the same result as y ≥ x; thus, we can place the expression in a canonical form by sorting the operands and reversing the sense of the operator if necessary.
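A sketch of this canonicalization step, assuming value numbers that compare under an ordinary ordering; the operator sets here are illustrative, not the compiler's opcode list.

```python
# Hypothetical sketch: normalize operand order before the table lookup.
COMMUTATIVE = {'+', '*', 'and', 'or', 'xor', '==', '!='}
FLIP = {'<': '>', '>': '<', '<=': '>=', '>=': '<='}   # reverse the sense

def canonical_key(op, lhs, rhs):
    """Build a hash key with operands sorted by value number."""
    if op in COMMUTATIVE and rhs < lhs:
        lhs, rhs = rhs, lhs
    elif op in FLIP and rhs < lhs:
        lhs, rhs = rhs, lhs
        op = FLIP[op]                      # e.g. x <= y becomes y >= x
    return (op, lhs, rhs)

# x + y and y + x hash identically; x <= y matches y >= x.
assert canonical_key('+', 'v2', 'v1') == canonical_key('+', 'v1', 'v2')
assert canonical_key('<=', 'v2', 'v1') == canonical_key('>=', 'v1', 'v2')
```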

We can significantly improve the effectiveness of the technique by incorporating constant folding and algebraic simplification. To accomplish this, the algorithm must keep track of which values represent compile-time constants, and of their values. Since the algorithm operates on SSA form, the only data structures required for this extension are:

1. a bit vector representing the values known to be constant, and

2. an array mapping SSA names to their values (only those names with their entry set in the bit vector will have valid entries in this array).

Whenever an expression is found with constant operands, the expression is evaluated, and it is overwritten with the constant value. There are other simplifications that can be made when only one of the operands is a constant (e.g., x × 1 = x) or when we know that two operands are equal (e.g., x − x = 0). The complete list of simplifications used in our compiler is shown in the table below.
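The folding and simplification checks can be sketched as follows, with the bit vector and value array collapsed into a single Python dictionary for brevity; the three identities shown are a small illustrative subset, and all names are invented.

```python
# Hypothetical sketch of folding during value numbering: `const` maps a
# value number to its known compile-time constant, standing in for the
# bit vector plus value array described in the text.
import operator

OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def simplify(op, l, r, const):
    """Return a folded constant, a simpler value number, or None."""
    if l in const and r in const:           # both operands constant: evaluate
        return OPS[op](const[l], const[r])
    if op == '+' and const.get(r) == 0:     # x + 0 = x
        return l
    if op == '*' and const.get(r) == 1:     # x * 1 = x
        return l
    if op == '-' and l == r:                # x - x = 0 (equal value numbers)
        return 0
    return None

assert simplify('*', 'c1', 'c2', {'c1': 2, 'c2': 3}) == 6
assert simplify('+', 'v1', 'c0', {'c0': 0}) == 'v1'
assert simplify('-', 'v1', 'v1', {}) == 0
```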

Another improvement we can make to this technique is to trace values into and out of memory. This is accomplished via the tags in our intermediate representation. Each tag represents a distinct memory location, but certain tags may be aliased to each other. Operations that can affect memory are marked with a list of referenced

Table: Algebraic simplifications

    x + 0 = x
    x or x = x (bitwise)
    x xor x = 0 (bitwise)
    x and x = x (bitwise)
    x or 0 = x (bitwise)
    x xor 0 = x (bitwise)
    x and 0 = 0 (bitwise)
    max(x, x) = x
    min(x, x) = x
    SIGN(x, x) = x (sign transfer†)
    DIM(x, x) = 0 (positive difference‡)
    x = x is TRUE
    x ≤ x is TRUE
    x ≥ x is TRUE
    x ≠ x is FALSE
    x < x is FALSE
    x > x is FALSE
    x − x = 0
    x − 0 = x
    −(−x) = x
    SHIFT(x, 0) = x
    x ÷ x = 1, if x ≠ 0
    x × 1 = x
    x ÷ 1 = x
    x mod x = 0

† SIGN(x, y) = ABS(x) if y ≥ 0, and −ABS(x) otherwise.

‡ DIM(x, y) = x − y if x > y, and 0 otherwise.

tags, a list of defined tags, or both. For load and store operations, the first tag in each list is the one that appears in the source; the others are possible aliases. During the conversion to SSA form, all tags are given subscripts to give each one a unique definition point. To trace values through memory, we must hash load and store operations consistently: we convert all load opcodes into the corresponding store opcodes before hashing. Memory operations are hashed using another table, similar to the ones described earlier, called the tag table.

We value number a load operation by searching the hash table for a register that already contains the result. This can happen in two ways:

1. We could have previously loaded the value into a register.

2. We could have previously stored into this location from a register, without overwriting the location before we reach this load.

If such a register is found, we use that register in place of the register being loaded. Otherwise, we add an entry to the tag table showing that this register contains the value at this memory location.

When we encounter a store operation, we want to determine whether it is writing a value to a location that already holds that value. This is often the case for the stores inserted just before subroutine calls. We look up the value stored in the memory location and compare it to the value being stored; if the two values are equal, the store operation can be deleted.
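The load and store handling can be sketched as follows. The sketch is hypothetical: it assumes a single SSA-subscripted tag per location, ignores aliases and address operands, and hashes loads under a store key as the text describes.

```python
# Hypothetical sketch of value numbering through memory.  Loads are hashed
# under a ('store', tag) key so a load can match either an earlier load or
# an earlier store of the same location.
class MemoryVN:
    def __init__(self):
        self.tag_table = {}                  # ('store', tag) -> register VN

    def load(self, reg, tag):
        """Value number `reg <- load tag`; return the register to use."""
        key = ('store', tag)                 # convert load to the store opcode
        if key in self.tag_table:
            return self.tag_table[key]       # the value is already in a register
        self.tag_table[key] = reg
        return reg

    def store(self, reg_vn, before_tag, after_tag):
        """Process `store reg -> location`; True means it can be deleted."""
        if self.tag_table.get(('store', before_tag)) == reg_vn:
            return True                      # rewriting the value already there
        self.tag_table[('store', after_tag)] = reg_vn
        return False

m = MemoryVN()
assert m.load('r1', 'x0') == 'r1'            # first load of x
assert m.load('r2', 'x0') == 'r1'            # second load proves r2 = r1
assert m.store('r3', 'x0', 'x1') is False    # a new value is written to x
assert m.load('r4', 'x1') == 'r3'            # later load finds the stored value
assert m.store('r3', 'x1', 'x2') is True     # storing the same value back
```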

The example in the figure below illustrates how this process works. Notice that the two loads from x will produce the same value. When the first load is processed, we convert the load operator into the corresponding store and search the hash table for the expression store(x). The expression will not be found, so an entry is added to the table with the name of the loaded register. When the next load is processed, we will find the expression in the table and thus prove that the two registers are equal. Next, the program stores a new value into location x. The store operation is marked with a list of both referenced and defined tags: the referenced list contains the "before" name of each tag, and the defined list contains the "after" name of each tag. If the value number of the "before" name is the same as the value number of the value being written to this memory location, then the store is redundant and can be removed. We add

In our compiler, load and store operations are distinguished by the type of the value and the addressing mode.

Figure: Example program containing loads and stores (two loads of x, a computation of a new value, a store of that value to x, a third load of x, and a store of the same value back to x)

the expression store(x), with the name of the stored register, to the table. This entry is then used to prove that the register loaded next is equal to the register just stored. Finally, the last store into x can be deleted, because the expression store(x) and the register being stored have the same value number.

For simplicity, we have ignored the address calculation used to access the variable in our example. For scalar variables, it is irrelevant. On the other hand, the address must be considered when processing array variables. In reality, we consider the address to be another operand of the store expression (e.g., store(addr, a)). By doing this, we can distinguish between accesses to different elements of an array. Assume that the variable x in the example is an array rather than a scalar. If each access is to the same element, value numbering may be able to determine that the addresses for each access are equal, and the redundancies will be eliminated as before. However, if each access is to a different element, the address operands of the expressions will not match, and the program will remain unchanged.

Summary

This chapter describes the hash-based approach to value numbering. The technique was first applied to single basic blocks and later to extended basic blocks. We have shown that the use of static single assignment form simplifies the technique and provides opportunities for further improvements. We have presented new algorithms for hash-based value numbering over a routine's dominator tree and using a unified hash table. When applied to single basic blocks, extended basic blocks, or the dominator tree, value numbering is performed online: redundancies are removed as soon as they are identified. The unified hash table approach is an offline algorithm; it relies on the techniques described in later chapters to eliminate the redundancies. Another contribution of this chapter is the ability to trace values through memory and to remove any redundancies identified there.

Hash-based value numbering has several advantages: it is easy to understand and to implement, it can easily handle constant folding and algebraic identities, and it runs in expected linear time. The unified hash table algorithm is almost global, but it can fail to discover redundancies where values flow through a back edge in the CFG.

Chapter

Value Partitioning

Alpern, Wegman, and Zadeck describe a global approach to value numbering that we call value partitioning. In this chapter, we extend this algorithm to handle commutative operations and to eliminate redundant store operations. The algorithm operates on the static single assignment (SSA) form of the routine. In contrast to hash-based approaches, this technique partitions values into congruence classes. Two values are congruent if:

1. they are computed by the same opcode, and

2. their corresponding operands are congruent.

Since the definition of congruence is recursive, there will be routines where the solution is not unique. A trivial solution would be to set each value in the routine to be congruent only to itself. However, the solution we seek is the maximal fixed point: the solution that contains the most congruent values. The maximal fixed point partitions the values in the routine into congruence classes. The primary advantage of value partitioning over hash-based value numbering is that it is a global algorithm. However, it accomplishes only the first objective of value numbering: it does not perform constant folding or algebraic simplification, and it does not remove redundant computations. Redundant computations are removed by a separate algorithm.

We partition the values in the routine into congruence classes using a variation of the algorithm by Hopcroft for minimizing a finite-state machine. We use the term partition to refer to a set of congruence classes such that each value (SSA name) in the routine is in exactly one class. The partitioning of values is accomplished by starting with an initial partition and iteratively refining it until it stabilizes. In the initial partition, all values defined by the same opcode are in the same congruence class. Of course, the sets of values defined by some opcodes must be divided immediately. For example, the frame opcode in our compiler defines a set

Place all values computed by the same opcode in the same congruence class
worklist ← the classes in the initial partition
while worklist ≠ ∅
    Select and delete an arbitrary class c from worklist
    for each position p of a use of some x ∈ c
        touched ← ∅
        for each x ∈ c
            Add all uses of x in position p to touched
        for each class s such that ∅ ⊂ (s ∩ touched) ⊂ s
            Create a new class n ← s ∩ touched
            s ← s − n
            if s ∈ worklist
                Add n to worklist
            else
                Add the smaller of n and s to worklist

Figure: Partitioning algorithm

of distinct values that are passed to the routine in registers. Therefore, all the values defined by the frame opcode are placed in different classes in the initial partition.

The partition is refined using a worklist algorithm. The variable called worklist contains classes that might cause other classes to split. When a class is removed from worklist, we visit the uses of all its members. The touched set records the values defined by the uses that are visited at each iteration. Notice that we only visit uses in the same position, because we must treat all opcodes as if they were not commutative; commutative operations are handled by an extension to the partitioning algorithm described below. Any class s (s stands for split) with a proper subset of its members in touched (∅ ⊂ s ∩ touched ⊂ s) must be split into two classes. We create a new class n containing the touched members of s, and we remove the members of n from s. Splitting class s may cause other classes in the partition to split, so we must update worklist. There are two cases to consider:

Case 1: If s was already in worklist, the partition was not stable with respect to s, so we must leave s in worklist and also add n to worklist.

Case 2: If s was not in worklist, the partition was stable with respect to s before it was split. Thus, any class that would be split as a result of removing s from worklist would be split the same way as a result of removing n from worklist. Since we are free to choose between s and n, we select the one that can be processed in a smaller amount of time; therefore, we add the smaller of s and n to worklist. As we shall see, this step is fundamentally important in achieving our desired running time.

Once we have refined the partition with respect to class c, we will not examine c again unless it is split. Pseudocode for the partitioning algorithm is shown in the figure above.
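The refinement loop can be sketched directly from this description. The sketch below is hypothetical: it uses plain Python sets, so it reproduces only the splitting logic, not the efficient data structures discussed later. Definitions are (name, opcode, operands) triples, and distinct nullary "parameter" opcodes keep incongruent inputs apart, playing the role of the frame opcode.

```python
# Hypothetical sketch of worklist-driven partition refinement over a tiny
# SSA "graph" of (name, opcode, operands) triples.
def value_partition(defs):
    uses = {}                                # (operand, position) -> user names
    arity = 0
    for name, op, args in defs:
        arity = max(arity, len(args))
        for p, a in enumerate(args):
            uses.setdefault((a, p), set()).add(name)
    initial = {}                             # initial partition: one class per opcode
    for name, op, args in defs:
        initial.setdefault(op, set()).add(name)
    part = {frozenset(c) for c in initial.values()}
    worklist = set(part)
    while worklist:
        c = worklist.pop()
        for p in range(arity):               # touch uses position by position
            touched = set()
            for x in c:
                touched |= uses.get((x, p), set())
            for s in [s for s in part if s & touched and not s <= touched]:
                n = frozenset(s & touched)   # split s around the touched members
                rest = frozenset(s - touched)
                part.remove(s)
                part |= {n, rest}
                if s in worklist:
                    worklist.remove(s)
                    worklist |= {n, rest}
                else:
                    worklist.add(min(n, rest, key=len))   # the smaller half
    return part
```

On the subtraction example discussed next, with X and Y as distinct parameters, refinement drives every value into its own class; when two definitions share an opcode and congruent operands, such as two computations of X + X, their class survives intact.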

To illustrate how the algorithm operates, consider the code fragment in the figure below, whose initial partition assumes that X is not congruent to Y. Notice that all names defined by the subtraction opcode are initially in the same class. Let the class containing X be the first removed from worklist. Touching the uses of X in the first operand position will result in A being removed from its class; touching the uses of X in the second position will result in B being removed from its class. The partition after one iteration is shown in the figure below. Next, let the class containing A be removed from worklist. Touching the uses of A in the first position will result in C being removed from its class. Touching the uses of A in the second position will not cause any class to split, because the touched set is an entire class. Since all values are now in different classes, no further splitting can occur. However, since the algorithm has no way to determine this, it will not terminate until worklist is empty. The final partition is also shown in the figure below.

To understand the importance of touching only uses in the same position, consider again the code fragment and its initial partition. Touching all uses of X in any position will result in A and B being removed from their class and put into a new class together. If the algorithm continues in this fashion, no further splitting will occur; the resulting partition is shown in the figure below. Notice that we have erroneously proven that A is congruent to B and C is congruent to D.

A ← X − Y
B ← Y − X
C ← A − B
D ← B − A

Figure: Partitioning example

Figure: Partitioning steps for the example program (the initial partition, the partition after one iteration, and the final partition)

Figure: Incorrect partition when positions are ignored

Complexity

From the pseudocode given in the figure, it is not clear what the running time of the algorithm will be. We claim that the partition can be computed in O(E log N) time, where E is the number of edges and N is the number of nodes in the routine's SSA graph. However, only a careful implementation of the data structures will achieve the desired time bound. To discover the operations our data structure must support, and their complexities, we analyze the algorithm from the bottom up. Adding the smaller of n and s to worklist can be performed in constant time if we can determine the size of a class in constant time; therefore, the if statement that updates worklist requires constant time. The splitting of a class, creating n and removing its members from s, can be performed in O(‖n‖) time if we can remove an element from class s and insert it into class n in constant time. The crucial part of the implementation is to perform the loop over the classes to be split in O(‖touched‖) time; this will be discussed in detail below. The loop that accumulates touched can be performed in O(‖c‖) time. We assume that the number of uses in any operation is bounded by a constant, so the maximum position of a use is also bounded by a constant; therefore, we can ignore the loop over positions.

The SSA graph has a node for each definition, and edges flow from definitions to uses.

We use the notation ‖n‖ to represent the size of the set n.

Operation                                       Complexity
Determine the number of elements in a class     O(1)
Determine which class contains a name           O(1)
Remove a member of a class                      O(1)
Add a member to a class                         O(1)
Add a new class to the partition                O(1)
Iterate through the members of class c          O(‖c‖)

Figure: Operations supported by the partition

Now we are ready to analyze the running time of the entire algorithm. Consider the number of times a class containing a particular element x can be removed from worklist. Each time such a class is chosen, it must be smaller than half the size of the last class containing x that was removed from worklist, because we only add the smaller of n and s. Therefore, a class containing x can be removed from worklist at most O(log N) times. Suppose the cost of each execution of the splitting loop is charged to each x in c according to the number of uses of x in position p. Then the cost for each x is O(‖USES(x)‖ log N), where USES(x) is the set of uses of x. Since E is the set of all uses of every x, the total cost of the algorithm is O(E log N). The figure above summarizes the time complexity required for the various operations that will be performed on the partition.

Data Structures to Support Partitioning

To support the complexity requirements given above, each class will contain the following information:

1. the number of members, and

2. a doubly-linked list of members.

We also maintain a lookup table that holds, for each element in the partition, a pointer to its list node and its class number. This enables us to locate and remove an element from a class in constant time. The figure below shows how the data structure would look for an item with SSA name x.

Figure: Data structures for representing the partition (the classes array, each class with a member count and a doubly-linked member list, and the lookup table mapping each SSA name to its list node and class number)

Refining the Partition

Recall that in our initial presentation of the partitioning algorithm, we used a set called touched. This set contained all uses, in some position, of members of a class; we then searched for classes with a proper subset of their members in touched and split them. An implementation of the algorithm must employ an efficient technique for determining which classes to split, and which members to remove, at each iteration. We claimed above that this step could be performed in O(‖touched‖) time. To accomplish this, we do not represent the touched set itself. Instead, we keep the intersection of touched and each class in the partition. This information is kept in an array called intersections: each element of intersections contains the intersection of touched with the corresponding element of classes. We must also keep track of which classes have a non-empty intersections entry. When an item is touched during the process of refining the partition, it is moved out of its class and into the intersections array. Recall that we are only interested in classes with a proper subset of their members touched; therefore, if all members of a class are touched, they are returned to their original class.

The figure below depicts an example partition with an SSA name x in some class s. The entry in lookup[x] indicates the class number s and points to the node for x in the members list of classes[s]. If x is touched, the node will be removed from that list and appended to the members list of intersections[s]. Once all the uses in position p have been touched, we can determine which classes must be split by iterating through each class s with a non-empty list at intersections[s]. To accomplish this, we must carefully maintain the set of classes to be split. We have already removed the necessary members from classes[s] and placed them in intersections[s]; thus, we can simply create a new class containing the members of intersections[s] and iterate through its members list, updating the class number field of the lookup array. The entire process requires O(‖touched‖) time.

Handling Commutative Operations

Commutative op erations must b e handled with an extension to the partitioning

algorithm The idea is to ignore the p osition when touching a use of a member of

a congruence class Now instead of splitting classes based on whichmemb ers were



In other words, we split all classes s such that ∅ ⊂ (s ∩ touched) ⊂ s.

[Diagram omitted: the classes and intersections arrays of member lists, and the lookup array mapping each SSA name to its class number and its node on a members list.]

Figure: Data structures for refining the partition

 



A ← X + Y        Initial partition: {A, B, C, D}  {X}  {Y}
B ← Y + X        Final partition:   {A, B}  {C}  {D}  {X}  {Y}
C ← X + X
D ← Y + Y

Figure: Commutativity example

touched or not touched, we split classes based on which members were touched zero, one, or two times. Consider the example code fragment in Figure, and assume that X ≠ Y. The initial partition is shown in Figure. Let the class containing X be the first removed from the worklist. Touching the uses of X will result in A and B being touched once and C being touched twice. Therefore, A and B will be placed in a class together, and C will be in a class by itself. Let the class containing Y be the next one removed from the worklist. Touching the uses of Y will result in A and B being touched once and D being touched twice. Therefore, no further splitting of classes is required. The final partition is shown in Figure.

Now assume that X = Y. The initial partition is shown in Figure. Let the class containing X and Y be the first removed from the worklist. Touching the uses of X and Y will result in A, B, C, and D all being touched twice. Therefore, no further splitting of classes is required, and the initial partition is also the final partition.

Recall that when we touched an item for non-commutative operations, we moved the item out of its current class and into the intersections array. The touched_once and touched_twice arrays are used like the intersections array. The first time an item is touched, it is moved from its original class into touched_once. The second time it is touched, it is moved from touched_once to touched_twice. We must be careful to keep track of which classes have a non-empty entry in touched_once or touched_twice, just as we do for the intersections array.

Figure depicts the partition data structures, assuming the variable A from Figure is in some class. The first time A is touched, it will be removed from the members list of classes and appended to the members list of touched_once. The second time A is touched, it will be removed from touched_once and appended to touched_twice. Each time the item is touched, the num_touches field of the lookup array will be incremented.
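The counting variant of the split can be sketched as follows. The helper name is hypothetical; in the real data structures the groups are maintained incrementally as members migrate between classes, touched_once, and touched_twice, rather than computed after the fact.

```python
# Sketch: for commutative operations, a class is split by how many
# times each member was touched (0, 1, or 2) while the uses of a
# worklist class were being touched, ignoring operand position.

def split_by_touch_count(class_members, touch_counts):
    groups = {}                       # touch count -> members with that count
    for m in class_members:
        groups.setdefault(touch_counts.get(m, 0), []).append(m)
    return list(groups.values())      # one resulting class per distinct count

# Touching the uses of X in A = X + Y, B = Y + X, C = X + X touches
# A and B once and C twice; D (which does not use X) is untouched.
parts = split_by_touch_count(['A', 'B', 'C', 'D'], {'A': 1, 'B': 1, 'C': 2})
```

This reproduces the example above: {A, B, C, D} splits into {A, B}, {C}, and {D}.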

  

{X, Y}        {A, B, C, D}

Figure: Partition for second commutativity example

[Diagram omitted: the classes, touched_once, and touched_twice arrays of member lists, and the lookup array with its num_touches field.]

Figure: Data structures for handling commutative operations

Eliminating Redundant Stores

Redundant store operations write the same value to a memory location that was previously written there. Therefore, they do not alter the contents of memory, and they can be eliminated from the routine, but this requires an extension to the original algorithm. Redundant stores should not be confused with dead stores, which write a value to memory that is never subsequently read. Handling redundant scalar stores requires an extension to the partitioning algorithm. Consider the program fragment in Figure. The frame operation defines the initial values for the memory tags x and y. During the conversion to SSA form, all tags were given subscripts to give each one a unique definition point. The store operations have two tags associated with them. The first tag is the "before" value of the memory location, and the second is the "after" value of the memory location. In our example, the store from r₁ is redundant, while the store from r₂ is not. We would like to check for congruence between the before and after tag values to determine if a store is redundant. Unfortunately, the tag x₀ is defined by the frame operation, and the tag x₁ is defined by a store operation. Therefore, they will be in different classes in the initial partition. The initial partition is shown in Figure. Clearly, x₀ can never be congruent to x₁ using the unmodified partitioning algorithm.

We must treat scalar load and store operations as copies from a register to memory, or vice versa. Since copying a value does not change it, we must ensure that the source and destination of a copy will remain in the same congruence class throughout the partitioning process. To accomplish this, we keep a list of copies for

frame ⇒ x₀, y₀
r₁ ← load x₀
r₂ ← …
store r₁ ⇒ x₀, x₁
store r₂ ⇒ y₀, y₁

Figure: Example program for redundant-store elimination



Actually, these are lists of tags: the first tag is the one that appears in the source, and the others are possible aliases. In this example, the lists have length one.

 

[Diagram omitted: the initial partition groups tags by defining operation, so x₀ and x₁ fall into different classes.]

Figure: Initial partition for redundant-store elimination example

[Diagram omitted: with copy lists, r₁, x₀, and x₁ share one class, while y₀ remains in a class apart from r₂ and y₁.]

Figure: Partition to enable redundant-store elimination

each item in the partition. During the process of refining the partition, the copy list for an item moves from class to class with the original item. In our example routine, r₁ is treated as a copy of x₀, and x₁ is treated as a copy of r₁; also, y₁ is a copy of r₂. Given this scheme, the initial partition is also the final partition (see Figure). Notice that x₀ is congruent to x₁ but y₀ is not congruent to y₁; thus, we can eliminate the first store but not the second.

Figure depicts the data structures representing the partition when x₀ from Figure is in class number five. The items r₁ and x₁ are in the copy list for x₀. Notice that the entries in lookup for r₁ and x₁ point to the node for x₀, and the T entry in the is_copy field indicates that they are copies. This tells the partitioning algorithm to ignore them during the refinement process.
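Once the copy-list mechanism has allowed before and after tags to share a congruence class, the redundancy test itself is a direct comparison. A sketch with hypothetical names, assuming a completed partition:

```python
# Sketch: a store is redundant when the value number of its "before"
# memory tag equals the value number of its "after" tag. `stores` is a
# list of (before_tag, after_tag) pairs; `value_number` maps each tag
# to its congruence-class number from a completed partitioning.

def redundant_stores(stores, value_number):
    return [i for i, (before, after) in enumerate(stores)
            if value_number[before] == value_number[after]]
```

In the example program, x₀ and x₁ end up congruent while y₀ and y₁ do not, so only the first store is reported as redundant.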

Summary

This chapter describes in detail the value partitioning algorithm of Alpern, Wegman, and Zadeck. This algorithm uses a variation of Hopcroft's DFA minimization algorithm to partition the values in a routine into congruence classes. The implementation of value partitioning is described in detail. We present extensions to the algorithm to handle commutative operations and to eliminate redundant store operations.

When compared to hash-based value numbering, value partitioning has the advantage that it is global. However, value partitioning has some disadvantages: it is difficult to implement, it cannot handle constant folding and algebraic identities, and it runs in O(E log N) time.

[Diagram omitted: r₁ and x₁ appear on the copy list of the node for x₀; their lookup entries point to that node, with T in the is_copy field.]

Figure: Data structures for redundant-store elimination

Chapter

SCC-Based Value Numbering

In earlier chapters we described two competing value numbering techniques: hashing and partitioning. The hashing techniques are easy to understand and implement, and they can easily handle constant folding and algebraic identities (e.g., x - x = 0). Their prime drawback is that they are not global techniques. The partitioning techniques are global, but they cannot easily handle constant folding and algebraic identities. As a result of their shortcomings, both of these techniques can fail to discover some crucial equivalences. In this chapter we describe a new technique for assigning value numbers that combines the advantages of both: it is easy to understand and implement, it can easily handle constant folding and algebraic identities, and it is global. We refer to this new technique as SCC-based value numbering because it is centered around the strongly connected components of the static single assignment graph.

Shortcomings of Previous Techniques

Assume that X and Y are known to be equal in the code fragment in Figure. Then the partitioning algorithm will find A congruent to B and C congruent to D. More careful reasoning would show that they are not just congruent by pairs, but also that they all have the value zero. Unfortunately, partitioning cannot discover that fact. On the other hand, the hash-based approach will easily conclude that if X = Y, then A, B, C, and D are all zero.

A ← X - Y
B ← Y - X
C ← A - B
D ← B - A

Figure: Improved by hash-based techniques

The critical difference between the hashing and partitioning algorithms identified by this example is their notion of equivalence. The hash-based approach proves equivalences based on values, while the partitioning technique considers only congruent computations to be equivalent. The code in this example hides the redundancy behind an algebraic identity; only the techniques based on value equivalence will discover the common subexpression here.

Now consider the code fragment in Figure. If we apply any of the hash-based approaches to this example, none of them will be able to prove that X₁ is equal to Y₁. This is because, at the time a value number must be assigned to X₁ and Y₁, none of these techniques have visited X₂ or Y₂. They must therefore assign different value numbers to X₁ and Y₁. However, the partitioning technique will prove that X₁ is congruent to Y₁, and thus that X₂ is congruent to Y₂. The key feature of the partitioning algorithm which makes this possible is its initial optimistic assumption that all values defined by the same operator are congruent. It then proceeds to disprove the instances where the assumption is false. In contrast, the hash-based approaches begin with the pessimistic assumption that no values are equal and proceed to prove as many equalities as possible.

The shortcomings of hash-based value numbering and value partitioning suggest a need for a value numbering algorithm that combines the advantages of both techniques. Such an algorithm would combine the ability to perform constant folding and algebraic simplification with the ability to make optimistic assumptions and later disprove them. Click presents an extension to value partitioning that includes constant folding, algebraic simplification, and unreachable code elimination. He presents two versions of the algorithm: the straightforward version runs in O(N²) time, and the complex version runs in O(E log N) time, where N and E are the number of

X₀ ← 1
Y₀ ← 1
while (…)
    X₁ ← φ(X₀, X₂)
    Y₁ ← φ(Y₀, Y₂)
    X₂ ← X₁ + 1
    Y₂ ← Y₁ + 1

Figure: Improved by partitioning techniques

nodes and edges in the routine's intermediate representation graph. His intermediate representation contains the edges in the SSA graph plus some edges used for control dependences. The complex version can miss some congruences between nodes that would be proven congruent by the straightforward algorithm. The problem occurs when the operands of a φ node are assumed congruent and later proven not congruent: when the class containing the operands is split into two pieces, the algorithm arbitrarily places the φ node in one of the pieces. Thus, another φ node with the same operands might be placed in the other piece.

This chapter presents an algorithm, called SCC-based value numbering, that more closely resembles hash-based value numbering than value partitioning. SCC-based value numbering is simpler to implement than value partitioning, and it runs in O(N · D(SSA)) time, where N is the number of SSA names and D(SSA) is the loop connectedness of the SSA graph. The loop connectedness of a graph is the maximum number of back edges on any acyclic path; this number can be as large as O(N). Knuth showed that for control-flow graphs of real Fortran programs it is bounded in practice by three. Although we are concerned with the loop connectedness of the SSA graph, we also expect it to be small: in our test suite, the maximum number of iterations required by the SCC algorithm is four.

We will first present a simplified version, called the RPO algorithm, that is easier to understand. We will prove the correctness and time bounds for this algorithm, and then we will present SCC-based value numbering as an extension with the same asymptotic complexity. In practice, it is more efficient than the RPO algorithm.

The RPO Algorithm

The algorithm in Figure is called the RPO algorithm because it operates on the routine in reverse postorder. We will assume for simplicity that all definitions in the routine are of the form x ← y op z, where op can be any operation in the intermediate representation or a φ node. Let x(i) represent the i-th operand of the expression defining x, and let x.op represent the operator that defines x. Additionally, we say that x(i) represents a back edge if the value flows along a back edge in the CFG. The VN array maps SSA names to value numbers. Each value number represents a set of SSA names, i.e., those names with the same entry in the VN array; therefore, a value number is itself an SSA name. For clarity, we will surround an SSA name that represents a value number with angle brackets, e.g., ⟨x⟩. The lookup function

for all SSA names i
    VN[i] ← ⊤
do
    changed ← FALSE
    for all blocks b in reverse postorder
        for all definitions x in b
            expr ← ⟨VN[x(1)], x.op, VN[x(2)]⟩
            temp ← lookup(expr, x)
            if VN[x] ≠ temp
                changed ← TRUE
                VN[x] ← temp
    remove all entries from the hash table
while changed

Figure: The RPO algorithm

searches the hash table for the expression ⟨VN[x(1)], x.op, VN[x(2)]⟩. If the expression is found, it returns the name of the expression; otherwise, it adds the expression to the table with name ⟨x⟩.
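A compact executable sketch of the RPO algorithm, under simplifying assumptions: the definitions are presented already sorted in reverse postorder, each is a (name, op, operands) triple, and an integer operand stands in for its own value number. The ⊤ sentinel and the per-pass table clearing mirror the pseudocode above.

```python
# Sketch of the RPO algorithm: iterate over the definitions in reverse
# postorder, hashing <operand value numbers, op> to a representative
# SSA name, until no value number changes.

TOP = '<top>'   # the initial "congruent to everything" value number

def rpo_value_numbering(defs_in_rpo):
    vn = {name: TOP for name, _, _ in defs_in_rpo}
    changed = True
    while changed:
        changed = False
        table = {}                                    # emptied on every pass
        for name, op, operands in defs_in_rpo:
            key = (op,) + tuple(vn.get(o, o) for o in operands)
            temp = table.setdefault(key, name)        # lookup(expr, x)
            if vn[name] != temp:
                changed = True
                vn[name] = temp
    return vn
```

On the loop example above (two identical induction variables built from φ nodes), the sketch assigns x₁ and y₁ the same value number, even though hashing alone could not.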

The RPO algorithm computes a sequence of equivalence relations ∼_i that partition the set of SSA names. We say that x ∼_i y if and only if, after the i-th iteration of the RPO algorithm, VN[x] = VN[y]. The relations satisfy:

    x ∼_0 y    for all x and y

    x ∼_{i+1} y    if and only if
        x.op = y.op, and
        x(e) ∼_{i+1} y(e) for all operands e that are non-back edges, and
        x(e) ∼_i y(e) for all operands e that are back edges.

We refer to a partition by the equivalence relation that produces it. We say that ∼_j is a refinement of ∼_i if and only if there are no congruences in ∼_j that are not in ∼_i; that is, x ∼_j y implies x ∼_i y. In other words, ∼_j can be derived from ∼_i by breaking congruences. Given the partition ∼_i, the algorithm computes ∼_{i+1} in expected O(N) time, where N is the number of SSA names in the routine. The following theorem shows that each iteration refines the partition.

Theorem: x ∼_{i+1} y implies x ∼_i y.

Proof: The proof is by induction on i.

Basis (i = 0): By definition, x ∼_0 y for all x and y.

Induction step (i > 0): Suppose not; let x be the SSA name with the smallest RPO number such that the assumption is false: x ∼_{i+1} y and x ≁_i y. Consider the reasons why x ≁_i y.

Case x.op ≠ y.op: This implies that x ≁_{i+1} y, a contradiction.

Case x(e) ≁_i y(e) for some non-back edge e: Since x ∼_{i+1} y, we have x(e) ∼_{i+1} y(e), which means that x(e) is a node where the assumption is false, and it has a smaller RPO number than x, a contradiction.

Case x(e) ≁_{i-1} y(e) for some back edge e: By the induction hypothesis, x(e) ≁_i y(e), which implies that x ≁_{i+1} y, a contradiction.

Corollary: The RPO algorithm must terminate, and it finds the maximal fixed point of the congruence relation computed by value partitioning.

Proof: Each step produces a refinement of the partition, and refinement cannot continue indefinitely. Further, value partitioning finds the maximal fixed point of the equivalence relation

    x ∼ y    if and only if    x.op = y.op and x(e) ∼ y(e) for all operands e.

Since the RPO algorithm begins with all SSA names congruent, we must converge to the same fixed point as value partitioning.

To understand how quickly the algorithm terminates, we must understand how values are proven not to be congruent. Since we process the blocks in reverse postorder, back edges play a key role in determining the number of iterations required. The following lemma characterizes the iteration on which two SSA names are determined not to be congruent.

Lemma: If x ∼_i y and x ≁_{i+1} y, then there is a (possibly empty) sequence of inputs

    e_1, e_2, …, e_n

with b_j the number of back edges in e_1 … e_j and b_n ≤ i, such that

    x ≁_{i+1} y
    x(e_1) ≁_{i+1-b_1} y(e_1)
    ⋮
    x(e_1)…(e_n) ≁_{i+1-b_n} y(e_1)…(e_n)

Proof: The proof is by induction on i.

Basis (i = 0): Use the empty sequence.

Induction step (i > 0): Let p_1, …, p_m be the sequence of pairs (x, y) with x ∼_i y and x ≁_{i+1} y, ordered by the minimum RPO number of x and y. We will proceed by induction on j, the index of the pair in this sequence.

Basis (j = 1): Consider the reasons why x ≁_{i+1} y.

Case x.op ≠ y.op: This cannot occur, because we know that x ∼_i y.

Case x(e) ≁_{i+1} y(e) for some non-back edge e: This cannot occur, because either x(e) will have a smaller RPO number than x, or y(e) will have a smaller RPO number than y.

Case x(e) ≁_i y(e) for some back edge e: The sequence consists of e followed by the sequence for the pair (x(e), y(e)), which we know exists by the induction hypothesis for i.

Induction step (j > 1): Consider the reasons why x ≁_{i+1} y.

Case x.op ≠ y.op: This cannot occur, because we know that x ∼_i y.

Case x(e) ≁_{i+1} y(e) for some non-back edge e: The sequence consists of e followed by the sequence for the pair (x(e), y(e)), which we know exists by the induction hypothesis for j.

Case x(e) ≁_i y(e) for some back edge e: The sequence consists of e followed by the sequence for the pair (x(e), y(e)), which we know exists by the induction hypothesis for i.

Now we can prove the algorithm's running time: it terminates in at most D(SSA) + 2 iterations, where D(SSA) is the loop connectedness (the maximum number of back edges on any acyclic path) of the SSA graph.

Theorem: x ∼_{D(SSA)+1} y implies x ∼_{D(SSA)+2} y.

Proof: Suppose not; let x be the SSA name with the smallest RPO number such that x ∼_{D(SSA)+1} y and x ≁_{D(SSA)+2} y. According to the Lemma, there is a sequence of inputs such that

    x ≁_{D(SSA)+2} y
    x(e_1) ≁_{D(SSA)+2-b_1} y(e_1)
    ⋮
    x(e_1)…(e_n) ≁_{D(SSA)+2-b_n} y(e_1)…(e_n)

This sequence contains more than D(SSA) back edges, so it must contain a cycle. Since x has the smallest RPO number, it must be included in a cycle. Therefore, x ≁_i y for some i ≤ D(SSA) + 1. By the Theorem above, x ≁_{D(SSA)+1} y, a contradiction.

Corollary: The RPO algorithm terminates in at most D(SSA) + 2 passes.

Proof: Since the ∼_{D(SSA)+2} partition is the same as the ∼_{D(SSA)+1} partition, the changed flag will remain FALSE throughout iteration D(SSA) + 2, and the algorithm will terminate.

Extensions

Since our algorithm uses hashing, we can easily extend it to include constant folding and algebraic simplification. We do this by associating a value from the constant propagation lattice {⊤, ⊥} ∪ Z with each SSA name. This framework will discover at least as many congruences as hash-based value numbering or value partitioning.

Under this extended framework, an element can fall D(SSA) + 1 times with respect to the value numbering lattice and twice with respect to the constant propagation lattice, so the height of the aggregate lattice is D(SSA) + 3. However, since each element falls in both frameworks on the first iteration, any element can fall at most D(SSA) + 2 times. Therefore, the extended algorithm must terminate in D(SSA) + 3 iterations.

The example in Figure requires 2(D(SSA) + 1) iterations for the algorithm to terminate. The back edges are shown with bold arrows. After the first iteration, all nodes are believed to be constants; notice that both the i names and the j names are assigned the same constant. On each subsequent iteration, another name is determined not to be constant, or another assumed equality is disproven, until the partition stabilizes and the algorithm terminates on the sixth iteration. Intuitively, the algorithm takes D(SSA) + 1 passes to prove that i and j are not the constant, and thus cannot be equal, and then it takes another D(SSA) + 1 passes to propagate this fact.
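The hashing step extended with folding can be sketched as follows. The names are illustrative only; the constant itself serves as the value number of a folded expression, and only the x - x = 0 identity is shown among the algebraic simplifications.

```python
# Sketch: lookup extended with constant folding and one algebraic
# identity. Operands `a` and `b` are value numbers; an int value
# number means "known constant".

def simplify(op, a, b):
    if isinstance(a, int) and isinstance(b, int):        # constant folding
        return {'add': a + b, 'sub': a - b, 'mul': a * b}[op]
    if op == 'sub' and a == b:                           # x - x = 0
        return 0
    return None

def value_number(name, op, a, b, table):
    folded = simplify(op, a, b)
    if folded is not None:
        return folded        # the constant serves as the value number
    return table.setdefault((op, a, b), name)            # ordinary hashing
```

Two expressions with congruent operands hash to the same entry, while foldable expressions bypass the table entirely.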

Discussion

We have shown that the RPO algorithm finds at least as many congruences as hash-based value numbering or value partitioning, in O(N · D(SSA)) time. Kam and Ullman showed that iterative data-flow analysis, for a large class of data-flow frameworks, requires a number of passes over the CFG proportional to D(CFG). This number can be as large as O(B), where B is the number of blocks in the CFG, but it is believed that in practice it is bounded by a small constant. We expect that for most programs D(CFG) ≥ D(SSA). However, the program in Figure is an example where this is not true. The back edges are shown with bold arrows; notice that D(SSA) exceeds D(CFG) in this example. Further, we could make D(SSA) even larger by adding variables in the same pattern as j and k. Despite this potential, the maximum number of iterations required by SCC-based value numbering is four for any routine in our test suite.

(The final iteration checks the stability of the analysis.)

The SCC Algorithm

 

[Figure omitted: an SSA graph whose back edges are drawn in bold, together with the table of value numbers assigned to each name after each iteration.]

Figure: Example requiring 2(D(SSA) + 1) iterations

[Figure omitted: a control-flow graph and its SSA graph for a routine with variables i, j, and k; back edges are drawn in bold, and the SSA graph contains more back edges on an acyclic path than the CFG does.]

Figure: Example with D(CFG) < D(SSA)

To make the algorithm more efficient in practice, we operate on the SSA graph instead of the control-flow graph. If no cycles are present in the SSA graph, we can simply process the nodes in a single reverse postorder walk; that ordering guarantees that all operands of an expression are visited before the expression itself must be processed. If cycles are present, then no such ordering exists. We can identify the strongly connected components (SCCs) of the graph and treat each SCC as a single node. A strongly connected component is a maximal collection of nodes such that, given any pair of nodes in the collection, there is a path from each node to the other. After collapsing each SCC into a single node, the resulting graph must be acyclic. When we traverse the nodes of this graph in reverse postorder, we know that all operands of a node are processed before the node itself. A node that is not part of any cycle requires no special processing, but a node that represents a strongly connected component requires special processing.

DFS(node)
    node.DFSnum ← nextDFSnum++
    node.visited ← TRUE
    node.low ← node.DFSnum
    PUSH(node)
    for each o ∈ {operands of node}
        if not o.visited
            DFS(o)
            node.low ← MIN(node.low, o.low)
        if o.DFSnum < node.DFSnum and o ∈ stack
            node.low ← MIN(o.DFSnum, node.low)
    if node.low = node.DFSnum
        SCC ← ∅
        do
            x ← POP()
            SCC ← SCC ∪ {x}
        while x ≠ node
        ProcessSCC(SCC)

Figure: Tarjan's SCC-finding algorithm

This observation led us to an improved algorithm, called SCC-based value numbering because it concentrates on the strongly connected components of the SSA graph. The algorithm works in conjunction with Tarjan's depth-first algorithm for finding SCCs, shown in Figure. Tarjan's algorithm uses a stack to determine which nodes are in the same SCC: nodes not contained in any cycle are popped singly, while all the nodes in the same SCC are popped together. Tarjan's algorithm has an interesting property: when a collection of nodes (possibly containing only a single node) is popped from the stack, all of the operands outside the collection have already been popped. Therefore, we process the nodes as they are popped from the stack. When a single node is popped, we know that we have assigned value numbers to the operands of the corresponding expression; thus, we can examine the expression and assign a value number to this node. When a collection of nodes representing an SCC is popped, we know that we have assigned value numbers to any operands outside the SCC. The members of the SCC require special handling in order to perform value numbering.
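A runnable rendering of Tarjan's algorithm as used here. Bookkeeping details vary between presentations; this recursive sketch assumes graphs shallow enough not to exhaust the runtime stack, and it emits each SCC through a caller-supplied process_scc callback, mirroring ProcessSCC.

```python
# Sketch of Tarjan's SCC finder over an SSA graph given as
# node -> list of operand nodes. SCCs are emitted in an order where
# every operand outside a component is popped before the component.

def tarjan(graph, process_scc):
    dfsnum, low, on_stack, stack = {}, {}, set(), []
    counter = [0]

    def dfs(node):
        dfsnum[node] = low[node] = counter[0]
        counter[0] += 1
        stack.append(node)
        on_stack.add(node)
        for o in graph[node]:
            if o not in dfsnum:               # unvisited operand
                dfs(o)
                low[node] = min(low[node], low[o])
            elif o in on_stack:               # operand still on the stack
                low[node] = min(low[node], dfsnum[o])
        if low[node] == dfsnum[node]:         # node is the root of an SCC
            scc = []
            while True:
                x = stack.pop()
                on_stack.discard(x)
                scc.append(x)
                if x == node:
                    break
            process_scc(scc)

    for n in graph:
        if n not in dfsnum:
            dfs(n)
```

For an induction variable i₁ ← φ(i₀, i₂), i₂ ← i₁ + 1, the initial value i₀ is popped singly before the cycle {i₁, i₂} is popped as one component.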

initialize the optimistic and valid tables
for all nodes n
    n.valnum ← ⊤
while there is an unvisited node n
    DFS(n)                                /* see Figure */

ProcessSCC(SCC)
    if SCC has a single member n
        Valnum(n, valid)
    else
        do
            changed ← FALSE
            for each n ∈ SCC in reverse postorder
                Valnum(n, optimistic)
        while changed
        for each n ∈ SCC in reverse postorder
            Valnum(n, valid)

Valnum(node, table)
    expr ← ⟨node(1).valnum, node.op, node(2).valnum⟩
    try to simplify expr
    temp ← lookup(expr, table, node.SSAname)
    if node.valnum ≠ temp
        changed ← TRUE
        node.valnum ← temp

Figure: SCC-based value numbering algorithm

(Reverse postorder numbers are assigned with respect to the control-flow graph.)

Since we cannot remove the entries from the hash table after each pass, as the RPO algorithm does, we use two hash tables. The valid table contains only facts that are known to be true, while the optimistic table contains facts that may later be disproven. Single nodes are processed once using the valid table, and collections of nodes are processed iteratively using the optimistic table. Figure shows the algorithm for SCC-based value numbering. The first step is to initialize the optimistic and valid tables and to assign ⊤ as the value number for each node in the SSA graph. A value number of ⊤ is considered congruent to everything; it indicates that this node has not yet been examined. Next, we repeatedly apply Tarjan's algorithm to any unvisited node in the graph. As Tarjan's algorithm identifies strongly connected components, it calls ProcessSCC. This function decides if the SCC is a single node or a collection of nodes. Single nodes are processed using the valid table. Collections of nodes are processed by iterating in reverse postorder with respect to the CFG, using the optimistic table; after the iteration stabilizes, the nodes are processed one final time using the valid table. The Valnum function processes each node. It first tries to simplify the expression that the node represents. For ordinary instructions, we perform constant folding and algebraic simplification. For φ nodes, simplification can be performed if all the operands are equal. We can also simplify φ nodes of the form φ(⟨x⟩, ⊤) to ⟨x⟩. Intuitively, this allows the initial value of a variable to fall through the φ node and enter the loop. Since we are iterating in reverse postorder, ⊤ can only appear as an argument to a φ node, not as an operand of an instruction.
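The two φ simplifications just described can be sketched together (hypothetical helper): a φ node simplifies exactly when its non-⊤ operand value numbers all agree, which covers both "all operands equal" and "φ(⟨x⟩, ⊤) becomes ⟨x⟩".

```python
# Sketch: simplify a phi node given the value numbers of its operands.
# TOP is the not-yet-examined value number, congruent to everything.

TOP = '<top>'

def simplify_phi(operand_vns):
    seen = {v for v in operand_vns if v != TOP}
    if len(seen) == 1:            # all non-TOP operands agree
        return next(iter(seen))
    return None                   # no simplification possible
```

During the first optimistic pass over an SCC, the back-edge operand of the loop-header φ is still ⊤, so the initial value falls through into the loop.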

Example

To further clarify the algorithm, consider how it would proceed if given the example in Figure. The values i₀ and j₀ are not contained in any cycle, so they will be assigned value numbers before either of the SCCs. Assume they are each given the value number ⟨i₀⟩, and that the SCC containing i₁ and i₂ is processed next. During the first pass over the SCC, the φ node φ(⟨i₀⟩, ⊤) will be simplified optimistically to ⟨i₀⟩ (recall that ⊤ is considered congruent to everything), and i₁ will be given value number ⟨i₀⟩. Then the expression defining i₂ can be folded to a constant, and an entry mapping that constant to value number ⟨i₂⟩ will be added to the optimistic table. (Remember that expressions are formed from an operator and the value numbers of the operands, not the operands themselves.) During the second pass, the expression

 

i₀ ← 1
j₀ ← 1
while (…)
    i₁ ← φ(i₀, i₂)
    j₁ ← φ(j₀, j₂)
    i₂ ← i₁ + 1
    j₂ ← j₁ + 1

Figure: Example with equal induction variables (SSA form; the corresponding SSA graph is omitted)

φ(⟨i₀⟩, ⟨i₂⟩) cannot be simplified, so an entry mapping that expression to ⟨i₁⟩ is added to the optimistic table. Next, an entry mapping ⟨i₁⟩ + 1 to ⟨i₂⟩ will be added to the optimistic table. At this point, the value numbers have stabilized, so we make a final pass using the valid table. During this pass, we add entries mapping φ(⟨i₀⟩, ⟨i₂⟩) to ⟨i₁⟩ and mapping ⟨i₁⟩ + 1 to ⟨i₂⟩ to the valid table. Notice that the optimistic entry mapping the constant to ⟨i₂⟩ is not added to the valid table; this assumption has been disproven.

The next step is to process the SCC containing j₁ and j₂. During the first pass, the expression φ(⟨i₀⟩, ⊤) will be simplified to ⟨i₀⟩, and j₁ will be given the value number ⟨i₀⟩. Next, the expression ⟨i₀⟩ + 1 can be folded to the same constant; it will be found in the optimistic table with value number ⟨i₂⟩. During the second pass, the expression φ(⟨i₀⟩, ⟨i₂⟩) will be found in the optimistic table with value number ⟨i₁⟩, and ⟨i₁⟩ + 1 will be found with value number ⟨i₂⟩. At this point, the value numbers have stabilized, so we process the SCC using the valid table. Since entries already exist mapping φ(⟨i₀⟩, ⟨i₂⟩) to ⟨i₁⟩ and ⟨i₁⟩ + 1 to ⟨i₂⟩, no new entries will be added to the valid table. Thus, the algorithm has determined that i₁ = j₁ and i₂ = j₂.

The contents of the optimistic and valid tables is an important issue that merits further discussion. The primary function of the optimistic table is to hold assumptions that may later be disproven; in contrast, the valid table represents only those facts that are proven. Notice that, in processing the example in Figure, entries for ⟨i₁⟩ and ⟨i₂⟩ were added to both the optimistic and valid tables. On the other hand, the entry mapping the constant to ⟨i₂⟩ was only added to the optimistic table. This entry represents an optimistic assumption that was disproven. It remains in the table because it is needed for the analysis of the second SCC, the one containing j₁ and j₂. In some sense, that entry marks the trail that the analysis must take in order to prove that the two SCCs are equivalent. It is also possible that the constant appears somewhere else in the routine. If so, we cannot give it the name ⟨i₂⟩; instead, we add an entry to the valid table mapping it to a different name.

Recall that after the iteration stabilizes, we make one additional pass over the SCC using the valid table. The need to place expressions in both tables arises from constant folding and algebraic simplification. These transformations eliminate edges from the SSA graph; thus, an SCC can be transformed into a collection of nodes that is no longer strongly connected. If this happens, we want to test the nodes in this new collection for congruence with other nodes that were processed using the valid table. The example in Figure illustrates how this might happen. The dashed edges in the SSA graph will be broken while iteratively analyzing the SCC containing i₁ and i₂, and when the iteration stabilizes, the algorithm has discovered this simplification. However, the analysis was performed using the optimistic table, where there is no entry mapping the simplified expression to ⟨j₁⟩; this entry is only in the valid table. Similarly, the entry mapping the expression that defines j₂ to ⟨j₂⟩ exists only in the valid table. Therefore, our algorithm has not yet discovered that i₁ = j₁ and i₂ = j₂. The additional pass over i₁ and i₂ using the valid table enables SCC-based value numbering to discover these two congruences.

Uninitialized Values

The handling of uninitialized values during value numbering is a complicated issue. Our implementation of SSA construction will remove an operation if it references an operand that is uninitialized on all paths. However, it is possible that a value could be initialized along some paths but not others. If this is the case, a special value U will represent an uninitialized input to a φ node. Any value numbering algorithm must correctly handle U when processing the routine. Our initial assumption was that, since the behavior of an uninitialized value is undefined, the compiler is free to do just about anything. Therefore, we considered U to be congruent to every other value, just like ⊤. Further study showed that this assumption may not be correct; whether or not it is correct depends on the source language definition. The example in Figure demonstrates that this assumption can violate the principle of least astonishment. In this example, the variable j is used to record the last

 

[Figure omitted: SSA form and SSA graph for the example; the SSA-graph edges that constant folding and simplification will break are drawn dashed.]

Figure: Example with edges removed from SCC

[Figure omitted: control-flow graph and SSA graph for the example; the φ node defining j inside the loop takes the uninitialized value U as the input along the loop-entry edge.]

Figure: Example with uninitialized values

iteration on which some condition is true. If the condition is never true, the value of j remains uninitialized on exit from the loop. Since the value of j is not initialized before entering the loop, the corresponding parameter of the φ node for j is U. If we assume that U is congruent to everything, SCC-based value numbering will go on to prove that j = i; in other words, the value of j would become the last iteration of the loop, independent of the condition. We correct this problem by simply changing our assumptions about U: instead of treating U like ⊤, it must be treated as a distinct value number that is not congruent to anything. If we do this in our example, the algorithm will prove that j ≠ i.

Summary

This chapter presents an original algorithm for value numbering that combines the advantages of hash-based value numbering and value partitioning: it is easy to understand and to implement, it can handle constant folding and algebraic identities, and it is global. We call the algorithm SCC-based value numbering because it is centered around the strongly connected components of the SSA graph. It runs in O(N · D(SSA)) time, where N is the number of nodes and D(SSA) is the loop connectedness of the SSA graph. The loop connectedness can be as large as N, but it is believed that in practice D(SSA) is bounded by a small constant; the maximum number of iterations required for any routine in our test suite is four. We prove that the algorithm finds at least as many congruences as hash-based value numbering and value partitioning.

Chapter

Code Removal

The previous chapters explain how to renumber the registers and nodes so that
congruent values are given the same number. However, renumbering alone will
not improve the running time of the routine; we must also remove the redundant
computations. This chapter will present two techniques for removing code from a
routine.

Dominator-Based Removal: The technique suggested by Alpern, Wegman, and
Zadeck is to remove computations that are dominated by another definition of
the same value number. Figure shows an example routine that we can
improve with this method. Since one computation of z dominates the
other, the second computation can be removed.

AVAIL-Based Removal: The classical approach is to compute the set of available
expressions (AVAIL) and to remove computations that are in the AVAIL set

[Figure: Program improved by dominator-based removal (before and after CFGs)]



(Footnote: In Chapter, only the unified hash table algorithm will provide the consistent naming that we require.)

[Figure: Program improved by AVAIL-based removal (before and after CFGs)]

at the point where they appear in the routine. This approach uses dataflow
analysis to determine the set of expressions available along all paths from
the start of the routine. Notice that the calculation of z in Figure will be
removed because it is in the AVAIL set. In fact, any computation that would be
removed by dominator-based removal would also be removed by AVAIL-based
removal. However, there are improvements that can be made by the AVAIL-based
technique that are not possible using dominators. Consider the routine
in Figure. Since z is calculated in both predecessors of the join block, it is in the AVAIL set
there; thus, the calculation of z in the join block can be removed. However, since neither
predecessor dominates the join block, dominator-based removal could not improve this routine.

Dominator-Based Removal

To perform dominator-based removal, we consider each set of names with the
same value number and look for pairs of members where one dominates the other.
To make the algorithm efficient, we bucket sort the members of the set based on the
preorder index, in the dominator tree, of the block where they are computed. A naive
bucket sorting algorithm would keep a list of items defined for each block. Assume
that items x and y are congruent and both are computed in block B. The bucket
sorting algorithm would place them in a list indexed by the preorder index of B (see
Figure).

[Figure: Naive bucket sorting algorithm]

[Figure: Better bucket sorting algorithm]

However, we can improve upon the naive algorithm. If more than one member
is computed in the same block, only the one computed earliest in the block must be
considered, because it dominates all the others. Thus, instead of an array of lists,

we can keep an array of SSA names. When an item is inserted into an entry that
already contains an item, we can select the one that is computed earlier in the block.
This item will be written into the array entry, and the operation computing the other
will be removed immediately. Assume that items x and y are congruent, both
are computed in block B, and x is computed earlier than y. The bucket sorting
algorithm would replace y with x in the entry indexed by the preorder index of B
(see Figure).

Once we have found the item computed earliest in each block, we can compare
pairs of elements in the array and decide if one dominates the other. This decision is
based on an ancestor test in the dominator tree. The entire process can be done in
time proportional to the size of the set.

During dominator-based removal, we will need to perform an ancestor test on the
dominator tree. To do this in constant time, we must know the preorder index and
the number of descendants of each block in the dominator tree. We can decide if
block b1 dominates block b2 by performing an ancestor test in the dominator tree.
Let p1 and p2 be the preorder indices of b1 and b2, respectively, and let ND be the
number of descendants of b1. Then b1 dominates b2 if and only if

    p1 ≤ p2 ≤ p1 + ND
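As an illustration, the constant-time ancestor test might be sketched as follows (a hypothetical helper; it assumes preorder indices and descendant counts have already been computed for the dominator tree):

```python
def dominates(p1, nd1, p2):
    """Return True if the block with preorder index p1 and nd1
    descendants in the dominator tree dominates the block with
    preorder index p2 (a node dominates itself)."""
    return p1 <= p2 <= p1 + nd1

# A root with preorder index 0 and 4 descendants dominates every
# block whose preorder index lies in the range [0, 4].
assert dominates(0, 4, 3)
assert not dominates(2, 1, 0)
```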

We use a two-finger algorithm for comparing items defined in different blocks.
We consider adjacent pairs of non-empty entries in the Blocks array. The b1 variable
points to the first block of the pair in consideration; the b2 variable points to the
second block. These two pointers move through the Blocks array until each pair of
adjacent blocks has been compared. For each pair, we check if b1 dominates b2. If
so, we remove the instruction in the entry for block b2; otherwise, we set b1 to point
to block b2. The final step is to move b2 to the next non-empty entry in the Blocks
array.
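Under the stated assumptions (a Blocks array indexed by preorder number, the constant-time ancestor test via preorder index and descendant count, and a hypothetical remove callback), the two-finger pass might be sketched as:

```python
def remove_dominated(blocks, num_desc, remove):
    """Two-finger pass over a Blocks array indexed by preorder number.
    blocks[p]: the earliest congruent instruction computed in the block
    with preorder index p, or None.  num_desc[p]: that block's
    descendant count in the dominator tree.  remove(instr) deletes a
    redundant instruction."""
    entries = [p for p in range(len(blocks)) if blocks[p] is not None]
    k = 0
    while k + 1 < len(entries):
        b1, b2 = entries[k], entries[k + 1]
        if b1 <= b2 <= b1 + num_desc[b1]:   # ancestor test: b1 dominates b2
            remove(blocks[b2])              # b2's copy is redundant
            entries.pop(k + 1)              # keep the first finger on b1
        else:
            k += 1                          # advance: b1 becomes the old b2

# Block with preorder 1 has 3 descendants (preorder 2..4), so it
# dominates both later defining blocks.
blocks = [None, 'x', 'y', None, 'z']
num_desc = [0, 3, 0, 0, 0]
removed = []
remove_dominated(blocks, num_desc, removed.append)
assert removed == ['y', 'z']
```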

AVAIL-Based Removal

The classical approach to redundancy elimination is to remove computations in the
set of available expressions (AVAIL) at the point where they appear in the routine.
This approach uses dataflow analysis to determine the set of expressions available
along all paths from the start of the routine. An expression is available at a point if it is computed along every path from the beginning of the routine to that point. If an operation computes
a value already in the set of available expressions, then it can be removed. First, we
compute the AVAIL set for each block; then we remove any operations whose result
is in the set.

Properties of the value-numbered SSA form let us simplify the formulation of
AVAIL. The traditional dataflow equations deal with the formal identity of lexical
names, while our equations deal with identical values. This is a very important
distinction. We need not consider the killed set for a block, because no values are
redefined in SSA form, and value numbering preserves this property. Consider the
code fragment in Figure. Under the traditional dataflow framework, the assignment
to X would kill the Z expression. However, if the assignment to X caused
the two assignments to Z to have different values, then they would be assigned
different names. Since value numbering has determined that the two assignments to Z
are congruent, the second one is redundant and can be removed. The only way the
intervening assignment will be given the name X is if the value computed is equal to
the definition of X that reaches the first assignment to Z.

The simplified dataflow equations are shown in Figure. Notice that the equation
for AVOUT(i) does not include a term for the expressions killed in block i. As we
shall see in a later chapter, the killed set requires a great deal of time to compute:
O(N²) time, where N is the number of value numbers. In our framework, we simply add
the set of values defined in the block, defined(i), to the set of values available at the
beginning of the block, AVIN(i). The set defined(i) is the set of value numbers with a
definition in block i. This is a superset of the set of values generated in i, gen(i), which
does not include any expressions whose definition is followed by a modification of one
of its subexpressions. Further, defined(i) is much easier to compute than gen(i).

Once we have computed AVAIL for each block, we are ready to remove instructions
from the routine. We step through the blocks and use the AVIN set as a guide for

    Z ← X + Y
    X ← ...
    Z ← X + Y

Figure: Example program



    AVIN(i)  = ∅                              if i is the entry block
             = ⋂_{j ∈ pred(i)} AVOUT(j)       otherwise

    AVOUT(i) = AVIN(i) ∪ defined(i)

Figure: Dataflow equations for AVAIL-based removal

removing instructions. We can remove any instruction whose result is in AVIN. If
the instruction is not removed, we add its definition to the AVIN set, because it is
available to instructions later in the block.
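A minimal sketch of this simplified framework, assuming a hypothetical CFG representation (block ids, predecessor lists, and per-block ordered lists of defined value numbers), could look like:

```python
def avail_based_removal(blocks, preds, entry, defs):
    """blocks: list of block ids; preds[b]: predecessor ids of b;
    defs[b]: ordered list of value numbers defined in b.
    AVOUT(b) = AVIN(b) | defined(b); no kill term is needed because
    value numbering over SSA never redefines a value.
    Returns the set of (block, value) pairs that can be removed."""
    universe = set().union(*defs.values()) if defs else set()
    avin = {b: (set() if b == entry else set(universe)) for b in blocks}
    changed = True
    while changed:                  # simple round-robin iteration
        changed = False
        for b in blocks:
            if b == entry:
                continue
            new = set(universe)
            for p in preds[b]:
                new &= avin[p] | set(defs[p])   # intersect AVOUT(p)
            if new != avin[b]:
                avin[b] = new
                changed = True
    removable = set()
    for b in blocks:
        avail = set(avin[b])
        for v in defs[b]:
            if v in avail:
                removable.add((b, v))   # value already available here
            else:
                avail.add(v)            # now available later in b
    return removable

# Diamond CFG: z defined at the top and again at the join block.
blocks = ['B0', 'B1', 'B2', 'B3']
preds = {'B0': [], 'B1': ['B0'], 'B2': ['B0'], 'B3': ['B1', 'B2']}
defs = {'B0': ['z'], 'B1': [], 'B2': [], 'B3': ['z']}
assert avail_based_removal(blocks, preds, 'B0', defs) == {('B3', 'z')}
```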

Summary

This chapter presents two techniques for removing redundancies. Dominator-based
removal eliminates computations that are dominated by another computation with
the same value number. AVAIL-based removal is an improvement that relies on
dataflow analysis to discover the set of available expressions and remove instructions whose
value is in the set. We improve the dataflow framework by taking advantage of the
properties of the value-numbered SSA form. The improved framework deals with
identical values rather than lexical names. It is simpler and faster to compute, and it
can remove more instructions than the traditional framework.

Chapter

Code Motion

The compiler can eliminate redundancies not only by removing computations but also
by moving computations to less frequently executed locations. Many techniques rely
on dataflow analysis to determine the set of locations where each computation will
produce the same value and to select the ones that are expected to be least frequently
executed.

Partial Redundancy Elimination

Partial redundancy elimination (PRE) is an optimization introduced by Morel and
Renvoise that combines common subexpression elimination with loop invariant code
motion. Partially redundant computations are redundant along some, but not
necessarily all, execution paths. In general, PRE moves code upward in the routine
to the earliest point where the computation would produce the same value, without
lengthening any path through the program. Notice that the computation of z in
Figure is redundant along all paths to the join block, so it will be removed by PRE.
On the other hand, the routine in Figure cannot be improved using AVAIL-based

[Figure: Program improved by partial redundancy elimination (before and after CFGs)]

removal, because z is not available along one of the paths. The calculation of
z is computed twice along one path but only once along the other;
therefore, it is considered partially redundant. PRE can move the computation
of z into the earlier block. This will shorten the path along which z was computed
twice and leave the length of the other path unchanged.

Lazy Code Motion

Knoop, Rüthing, and Steffen describe a descendant of PRE called lazy code motion
(LCM). Drechsler and Stadel present a variation of this technique that
they claim is more practical. The dataflow equations for this framework are
shown in the figures below. One advantage that Drechsler and Stadel's framework
has over Knoop et al. is that it never inserts instructions in the middle of a block.
Knoop et al. maintain an entry and an exit point for each expression in each block.
This requires a substantial amount of memory, on the order of N × B pointers to
instructions, where N is the number of expressions and B is the number of blocks.

LCM avoids the unnecessary code motion inherent in PRE. This feature is important
when code motion interacts with register allocation and other optimizations.
Each replacement affects register allocation because it has the potential of shortening
the live ranges of its operands and lengthening the live range of its result. Because
the precise impact of a replacement on the lifetimes of values depends completely on
context, the impact on demand for registers is difficult to assess. In a three-address
intermediate code, each replacement has two opportunities to shorten a live range and
one opportunity to extend a live range. We will study this issue further in a later chapter.

Figure gives an example of how unnecessary code motion can impact peephole
optimization. The original routine shows a sequence of three instructions to perform a
test and a branch. If PRE moves the first two instructions to each of the predecessor
blocks, it has not changed the length of any path through the routine. However,
notice the effect that this unnecessary code motion has on peephole optimization. This
optimization can combine the EQ and the BR instructions to produce a BReq instruction
if there is a single definition reaching the BR. The unnecessary code motion has resulted
in two definitions that reach the BR instruction; therefore, the instructions cannot
be combined.

The common thread among all of these techniques is that they solve a sequence
of dataflow equations that drive code motion. The first step of the analysis is to

[Figure: Unnecessary code motion (CMP/EQ/BR sequence): the original code, the code after PRE, and the code after peephole optimization]

assign a unique number (bit-vector index) to each expression in the program. In
our implementation, we enforce a naming scheme on the symbolic registers in the
intermediate representation. The first rule is that multiple computations of the same
register must compute the same expression, and all computations of the same expression
must define the same register. In other words, each symbolic register defines
exactly one lexical expression. This rule allows us to use the symbolic register name
as the bit-vector index for each expression. Secondly, register numbers must be
assigned bottom-up in each expression tree, in increasing order. In other words, the
register number defined by each expression must be higher than the register numbers
of the operands. This rule allows us to simplify the analysis to determine which
values depend on each other, and it allows us to insert more than one instruction in
the same place in the correct order. Code that obeys these rules can be obtained in
a number of ways. First, a carefully coded front end can produce intermediate code
in this form by hashing each expression to determine the symbolic register that it
must define. This approach may not be practical because it forces every front end
to produce code in the proper form. Another drawback is that code motion can only
be run once, and it must be the first optimization applied. A better approach is to
enforce the naming scheme during value numbering. Hash-based value numbering
will obey these rules if we simply assign value numbers in increasing order. Value
partitioning and SCC-based value numbering require a separate pass over the code to
enforce the naming scheme.
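A hash-based sketch of such a naming scheme, assuming expressions are represented as an operator plus operand registers (all names here are hypothetical):

```python
class Namer:
    """Assign one symbolic register per lexical expression, with
    register numbers increasing bottom-up (operands before results)."""
    def __init__(self):
        self.table = {}      # (op, operands) -> register number
        self.next_reg = 0

    def name(self, op, *operands):
        key = (op, operands)
        if key not in self.table:
            self.table[key] = self.next_reg   # new bit-vector index
            self.next_reg += 1
        return self.table[key]

n = Namer()
a = n.name('leaf', 'a')
x = n.name('leaf', 'x')
e1 = n.name('+', a, x)
# rule 1: the same expression always defines the same register
assert e1 == n.name('+', a, x)
# rule 2: results carry higher numbers than their operands
assert e1 > a and e1 > x
```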

The next step is to determine which computations are eligible for motion and
which are not. Morel and Renvoise refer to these as expressions and variables,
respectively; Knoop et al. call them terms and variables. We will refer to them as
candidates for code motion and fixed values. For each candidate, we must determine
its subexpressions. This step is where we take advantage of the second rule in our
naming scheme. We walk each expression tree bottom-up by simply considering each
symbolic register in increasing order. At the point when we examine a symbolic
register, we have already determined the subexpressions for each of its operands; therefore,
the set of subexpressions for the current register is the union of the subexpressions
of its operands. Once we have computed the subexpressions for each candidate, we
are ready to compute the dependences for each fixed value. We say that a candidate
c depends on fixed value f if f is in the set of subexpressions of c, S(c). To compute
the dependences, we examine each S(c); for each fixed value f in S(c), we add c to the
set of dependences for f. This process requires O(N²) time, where N is the number
of names.

Once we have computed the dependences for each fixed value, we are ready to
compute the local predicates required for dataflow analysis. Each predicate is
represented by a bit vector attached to each block. The first local predicate is altered,
the set of candidates whose value is changed in the block. This is the union of the
dependences of every fixed value defined in the block. The comp predicate represents
the set of candidates defined in the block that are not later altered. The antloc set
contains the candidates that are defined before they are altered in the block. We
compute all three predicates with a single pass over the block, examining each
definition in order. If the definition is a fixed value f, we add the set of dependences for f
to the altered set and remove any dependences from the comp set. If the definition is
a candidate c, we add c to the comp set, and we add c to antloc if it is not in altered.
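The single pass described above might be sketched as follows, assuming a hypothetical block representation in which each definition is tagged as a fixed value or a candidate, and a precomputed dependence map:

```python
def local_predicates(block, deps):
    """Single pass computing the local predicates for one block.
    block: ordered definitions, each ('fixed', name) or ('cand', name).
    deps[f]: set of candidates whose subexpressions include fixed f.
    Returns (altered, comp, antloc)."""
    altered, comp, antloc = set(), set(), set()
    for kind, x in block:
        if kind == 'fixed':
            altered |= deps.get(x, set())  # dependants of x change value
            comp -= deps.get(x, set())     # ...so they are later altered
        else:
            comp.add(x)                    # defined and not yet re-altered
            if x not in altered:
                antloc.add(x)              # defined before being altered
    return altered, comp, antloc

# A fixed value f is defined, then candidate c1, which depends on f:
# c1 is altered, computed after the alteration (comp), but not antloc.
block = [('fixed', 'f'), ('cand', 'c1')]
assert local_predicates(block, {'f': {'c1'}}) == ({'c1'}, {'c1'}, set())
```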

The dataflow analysis uses the local predicates to solve a series of equations (see
the figures below). The predicates used to modify the routine are INSERT(i,j) for
each edge (i, j) and DELETE(i) for each block i. When inserting instructions, we take
advantage of our second rule for naming symbolic registers: register numbers must
be assigned bottom-up in each expression tree. We ensure that no instruction is
placed before the definition of its operands by inserting the members of INSERT(i,j) in
increasing order.

Critical Edges

In reality, we want to avoid inserting instructions on edges whenever possible.
Drechsler and Stadel point out that we can avoid this if critical edges have been
split. A critical edge is an edge between a block with multiple successors and a block
with multiple predecessors; i.e., (i, j) is a critical edge if and only if |succ(i)| > 1
and |pred(j)| > 1. If edge (i, j) is the only predecessor of j, then LATERIN(j) =
LATER(i,j), so INSERT(i,j) = ∅. Further, if edge (i, j) is the only successor of i, then we
can insert the candidates in INSERT(i,j) at the end of block i. In general, we insert

    ⋃_{j ∈ succ(i)} INSERT(i,j)

at the end of block i rather than on the edge.

Critical edges can be removed by splitting: inserting an empty basic block along
the edge. Figure shows a critical edge and how it could be split. It is not always

(Footnote: The analogous predicate in Drechsler and Stadel is TRANSP.)



    AVIN(i)   = ∅                                   if i is the entry block
              = ⋂_{j ∈ pred(i)} AVOUT(j)            otherwise

    AVOUT(i)  = (AVIN(i) − altered(i)) ∪ comp(i)

                            Availability

    ANTOUT(i) = ∅                                   if i is the exit block
              = ⋂_{j ∈ succ(i)} ANTIN(j)            otherwise

    ANTIN(i)  = (ANTOUT(i) − altered(i)) ∪ antloc(i)

                           Anticipatability

    EARLIEST(i,j) = ANTIN(j) ∩ ¬AVOUT(i)                              if i is the entry block
                  = ANTIN(j) ∩ ¬AVOUT(i) ∩ (altered(i) ∪ ¬ANTOUT(i))  otherwise

                              Earliest

Figure: Dataflow equations for lazy code motion (Part 1)



    LATERIN(j) = ∅                                  if j is the entry block
               = ⋂_{i ∈ pred(j)} LATER(i,j)         otherwise

    LATER(i,j) = EARLIEST(i,j) ∪ (LATERIN(i) − ANTLOC(i))

                               Later

    INSERT(i,j) = LATER(i,j) − LATERIN(j)

    DELETE(i)   = ∅                                 if i is the entry block
                = ANTLOC(i) − LATERIN(i)            otherwise

                             Placement

Figure: Dataflow equations for lazy code motion (Part 2)

[Figure: Splitting a critical edge (before and after)]

possible to split critical edges. If the predecessor ends in a jump-register instruction,
we cannot rewrite the code so that it will jump to the new block. This problem led
us to look at the sources of jump-register instructions. In our compiler, they are
generated for computed gotos in Fortran and for switch statements in C. In both
cases, the code contains a specific sequence of instructions:

1. Compute the offset into a table of addresses.

2. Load the address from the table.

3. Jump to the location using a jump-register instruction.

Given this sequence of instructions, we can split a critical edge if each jump-register
instruction uses a different table and if we know which entry in the table corresponds
to the critical edge. If so, we simply change the address in the table to point to
the synthetic block. For this reason, we added a new opcode to our intermediate
language called jump table, which tells the optimizer that the destination of the
jump was loaded from a table; therefore, the optimizer can split any critical edges
leaving the block. This approach allows us to split all critical edges generated by C
or Fortran. However, languages with first-class labels (e.g., continuations in Scheme)
can generate jump-register instructions that cannot be analyzed in this manner. In
general, the inability to split all critical edges means that the code motion framework
may be forced to place code on paths where it did not originally exist. However, the
lazy code motion framework described by Knoop et al. can produce incorrect code
for routines containing unsplit critical edges.
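A sketch of edge splitting on a hypothetical successor-map representation of the CFG (ignoring the jump-register complication discussed above):

```python
def split_critical_edges(succ):
    """succ: dict mapping each block to its list of successors.
    Insert an empty synthetic block on every critical edge (source has
    several successors and target has several predecessors).  Returns
    the new successor map; synthetic blocks are named 'i->j'."""
    pred = {}
    for i, ss in succ.items():
        for j in ss:
            pred.setdefault(j, []).append(i)
    out = {b: list(ss) for b, ss in succ.items()}
    for i, ss in succ.items():
        for j in ss:
            if len(ss) > 1 and len(pred.get(j, [])) > 1:   # critical edge
                mid = f"{i}->{j}"
                out[i] = [mid if s == j else s for s in out[i]]
                out[mid] = [j]          # synthetic block falls through to j
    return out

# A -> C is critical: A has two successors and C has two predecessors.
new = split_critical_edges({'A': ['B', 'C'], 'B': ['C'], 'C': []})
assert new['A'] == ['B', 'A->C'] and new['A->C'] == ['C']
```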

Moving LOAD Instructions

Significant improvements to the performance of optimized code can be gained by
including load instructions in the set of candidates. This is accomplished via the
tags in our intermediate representation. Tags are described in detail in a later chapter;
the important feature of tags is that each instruction that might reference or define
a memory location is labeled with the tag for that location. We extend the dataflow
analysis to include tags as well as registers. We must increase the size of the bit vectors
to hold both registers and tags, and we extend the analysis of subexpressions and
dependences to include tags. Suppose that a particular load instruction references
tag t and defines register r. We consider t to be an operand of r, so r is in the
set of dependences for t. During the local analysis, whenever a definition of t is
encountered, r will be added to the altered set for the block. Therefore, the load
instruction cannot move past any definition of its tag. Note that the register operands
controlling the address of the load instruction will also inhibit its movement. Thus,
a load whose address varies with each iteration of a loop cannot be moved out of the
loop, even if the tag is not modified inside the loop.
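A small sketch of this idea, under the assumption that dependences on tags and on registers are kept in one hypothetical deps map:

```python
def altered_with_tags(block, deps):
    """block: ordered instructions; ('store', tag) writes the memory
    location named by tag, ('def', reg) defines a register, and
    ('load', reg, tag) reads memory.  deps[x] maps a name (register
    or tag) to the candidates depending on it.  Because a load's tag
    is treated as one of its operands, a store to the tag adds the
    load's result register to the altered set."""
    altered = set()
    for instr in block:
        kind, name = instr[0], instr[-1]
        if kind in ('store', 'def'):       # loads alter nothing themselves
            altered |= deps.get(name, set())
    return altered

# r1 = load [tagA]; a later store to tagA puts r1 in altered,
# so the load cannot move past the store.
block = [('load', 'r1', 'tA'), ('store', 'tA')]
assert altered_with_tags(block, {'tA': {'r1'}}) == {'r1'}
```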

We would like to allow store instructions to move as well. However, this extension
is not as straightforward as it might seem. The problem lies in the fact that
antidependences are not explicitly represented by tags. The example in Figure
shows an antidependence between the reference of A(i) and the definition of A.
If this antidependence is ignored, the definition of A will be incorrectly moved
outside the loop, because the address is loop invariant and the location is not written
inside the loop. However, this motion is clearly unsafe, because it will alter the value
computed in the sum variable.

Summary

This chapter presents techniques for moving computations to less frequently executed
locations. Partial redundancy elimination combines loop invariant code motion with
common subexpression elimination. Partially redundant computations are redundant
along some, but not necessarily all, execution paths. Lazy code motion is a descendant
of partial redundancy elimination that avoids unnecessary code motion. We have
extended these algorithms to allow motion of load instructions.

    sum = 0                       sum = 0
    do i = 1, N                   A(…) = …
      sum = sum + A(i)            do i = 1, N
      A(…) = …                      sum = sum + A(i)
    enddo                         enddo

      Before               Antidependence Ignored

Figure: Incorrect motion of a store instruction

(Footnote: An antidependence enforces the ordering between a read and a subsequent write of the same location.)

Chapter

Value-Driven Code Motion

Value-driven code motion (VDCM) is an improvement to classical code motion
techniques that takes advantage of the results of global value numbering. Traditional
dataflow analysis frameworks must assume that every definition produces a distinct
value; therefore, an instruction cannot move past a definition of one of its subexpressions.
This restriction can be relaxed when certain definitions are known to produce
redundant values. This information is discovered during value numbering, but previous
techniques for code motion have not exploited it. By understanding how code
motion interacts with global value numbering, we can simplify and improve the code
motion framework, at the price of constraining the order of the optimizations. Our
approach is to modify the dataflow framework to account for the assumption that
each definition represents a value rather than a lexical name. This approach can be
applied to a variety of dataflow frameworks. In particular, this chapter focuses on
the lazy code motion framework presented earlier. That algorithm is provably optimal; this
chapter shows that, by changing our assumptions about the shape of the input
program, we can produce a technique that both eliminates more redundancies and runs
more efficiently.

We can prepare the routine for VDCM using any of the value numbering algorithms
described in the earlier chapters. In general, each algorithm discovers a different
set of equivalences. The important feature that these algorithms share is that they
can rewrite the names in the entire routine consistently to reflect the equivalences
discovered.

The ability of the compiler to perform code motion is influenced heavily by the
shape of the input program. Briggs and Cooper showed that global reassociation
followed by value partitioning will transform code into a form that makes PRE more
effective. Further improvements are still possible. The focus of this chapter is to
extend the dataflow framework to operate on value equivalences rather than lexical
names, just as we extended the framework for available expressions earlier.
Because values are never killed, a computation of an expression can move across
a definition of one of its operands if the value of that operand is available at the
point where the computation is placed. In other words, a computation can be placed
anywhere that the values of its operands are available.

(Footnote: In Chapter, only the unified hash table algorithm will provide the consistent naming that we require.)

VDCM Algorithm

The first step of VDCM is to find available expressions, as described earlier.
The only local predicate required for this framework is defined(b), the set of values
defined in block b. Using the results of the available expressions calculation, we can
compute the other predicates needed for the remaining dataflow frameworks. These
predicates take on a different meaning under VDCM because we are dealing with
values rather than lexical names.

The predicate altered(b) represents the set of values that cannot move past some
definition in block b. In our framework, an instruction can move past the definition
of one of its operands; however, an instruction cannot move to a point where the
values of its operands cannot be made available. The concept of ready is defined
recursively for each value v and each program point p. We say that v is ready at p if its
fixed operands are available and its candidate operands are ready at p. Given the
set of available values at point p, we can compute the set of ready values as follows:
initially, the set of ready values is the set of available values; then we traverse each
expression tree bottom-up and add any expression with all of its operands in the set.
Finally, the set of values altered in block b is simply the set of values that are ready
at the end of b but not at the beginning. In other words, a value is altered in block
b if at least one of its subexpressions is computed in block b for the first time along
some path in the CFG.

The algorithm for computing altered(b) is shown in Figure. Intuitively, altered(b)
for block b is derived by first computing the set of ready values at the beginning
and end of the block and then finding the difference. In practice, altered(b) can be
computed more efficiently by starting with AVOUT(b) − AVIN(b), then traversing each
expression tree bottom-up and adding any expression with one of its operands in the
set. This requires O(N) time, where N is the number of names.

    for each block b
        altered(b) ← AVOUT(b) − AVIN(b)
        for each expression e, in bottom-up order
            for each operand o of e
                if o ∈ altered(b)
                    altered(b) ← altered(b) ∪ {e}

Figure: Algorithm for computing altered using values

    for each expression e, in bottom-up order
        for each operand o of e
            S(e) ← S(e) ∪ S(o) ∪ {o}
        for each fixed value f ∈ S(e)
            D(f) ← D(f) ∪ {e}

    for each block b
        for each definition x in b
            if x is a fixed value
                altered(b) ← altered(b) ∪ D(x)

Figure: Algorithm for computing altered using lexical names

Compare this approach with the technique for computing the altered set using
lexical names, shown in Figure. First, the subexpressions for each candidate
and the set of dependences for each fixed value must be computed. To compute
subexpressions, the analyzer must traverse each expression tree bottom-up; for each
expression, it computes the union of the subexpressions of its operands. To compute
the set of dependences, it examines the subexpressions S(e) for each expression e;
for each fixed value f ∈ S(e), it adds e to the set of dependences for f. This process
requires O(N²) time, where N is the number of expressions. Finally, for each block
b, the analyzer computes altered(b) as the union of the dependences for any fixed value
defined in b.

To clarify the differences between the algorithms for computing altered, consider
the expression tree in Figure.

[Figure: Expression tree (e3 combines e1 and e2; e1 has operands a and x, and e2 has operands b and y)]

First, consider the value-driven algorithm. We will
assume that the fixed value a is in AVOUT(b) − AVIN(b). The algorithm will visit each
expression in bottom-up order. The expression e1 will be considered first; since one
of its operands, a, is in the altered set, e1 will be added. Since none of the operands
of e2 are in altered, e2 will not be added. Finally, e3 will be added to altered because
e1 is a member of the set. Therefore, altered = {e1, e3}.

Now we will apply the algorithm based on lexical names to the expression tree
in Figure. The first step is to compute the subexpressions for each expression
in bottom-up order. The set of subexpressions for e1, S(e1), is {a, x}. Similarly,
S(e2) = {b, y}, and finally S(e3) = {e1, a, x, e2, b, y}. Notice that we have performed N
bit-vector operations, each of size N, where N is the number of expressions; therefore,
computing subexpressions requires O(N²) time. The next step is to compute the set
of dependences for each fixed value. Since a is a member of S(e1) and S(e3), the
dependences for a, D(a), will be {e1, e3}. Similarly, D(x) = {e1, e3}, D(b) = {e2, e3},
and D(y) = {e2, e3}. Finally, we examine each definition in the block. When we
encounter a definition of a, we add D(a) = {e1, e3} to the altered set.
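The value-driven side of this example can be reproduced with a short sketch (names e1, e2, e3, a, x, b, y as in the figure; the seed set stands for AVOUT(b) − AVIN(b)):

```python
def altered_by_values(start, exprs):
    """Value-driven altered: seed the set with AVOUT(b) - AVIN(b),
    then walk the expression trees bottom-up, adding any expression
    that has an operand already in the set.
    exprs: list of (name, operands) in bottom-up order."""
    altered = set(start)
    for name, operands in exprs:
        if any(o in altered for o in operands):
            altered.add(name)
    return altered

# e1 over {a, x}, e2 over {b, y}, e3 over {e1, e2}; the fixed value a
# seeds the set, so e1 and then e3 become altered, but e2 does not.
exprs = [('e1', ('a', 'x')), ('e2', ('b', 'y')), ('e3', ('e1', 'e2'))]
assert altered_by_values({'a'}, exprs) == {'a', 'e1', 'e3'}
```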

The other predicate needed for each block b is the set of expressions that are
locally anticipatable, antloc(b). Under the traditional framework, this is the set of
expressions e computed in b before any of e's operands are modified in b. However,
under the value-driven framework, antloc(b) is the set of values computed in b whose
definition could legally be placed at the beginning of b. Any value computed in b
can be computed at the beginning of b if that value is not altered in b. Therefore,
we define antloc(b) as the set of values that are computed but not altered in b, i.e.,
defined(b) − altered(b).

Given the value-driven versions of available expressions, altered, and locally
anticipatable, the code motion proceeds as before. Specifically, we compute the set of
values to insert on each edge (i.e., INSERT(i,j) for each edge e = (i, j)) and the set of
values to delete from each block (i.e., DELETE(b) for each block b).

Examples

The example in Figure demonstrates how VDCM is more powerful than LCM.
The traditional framework would compute the set of subexpressions for r1 as {x}
and for r2 as {r1, x}. Therefore, the set of dependences for x would be {r1, r2}, and
the store to x in the loop block must be assumed to alter the values of both r1 and r2
(i.e., altered = {r1, r2} for that block). However, value numbering has discovered that the
value of r2 depends only on the value of r1, and it is independent of the value stored
into x inside the loop. In other words, r2 must be the absolute value of the value of
x on entry to the loop.

We will now explain how value-driven code motion would analyze this example.
The first step is to compute available expressions using the equations given earlier.
The solution to these equations shows that r1 is available on entry and exit of the loop
block (r1 ∈ AVIN and r1 ∈ AVOUT). We initialize the set altered with AVOUT − AVIN.
Since r1 ∉ AVOUT − AVIN, r2 ∉ altered, and VDCM is able to move
the definition of r2 outside the loop. On the other hand, LCM must leave the definition
inside the loop.

Figure demonstrates another example where VDCM is more powerful than
LCM. Notice that the code stores the value of r1 into A(i) and later loads the value
back into a register without changing the value of i. Value numbering has determined
that the load will produce the same value as r1; therefore, VDCM will remove the
load instruction. On the other hand, LCM must assume that the store affects the
value of the load, so it cannot remove it.

Summary

This chapter presents a new approach to dataflow analysis that takes advantage of facts discovered during value numbering. Traditional dataflow analysis frameworks operate on lexical names, while our framework uses values. We apply this important distinction to the framework for lazy code motion. Despite the fact that lazy code motion is provably optimal, we can eliminate more redundancies using our technique.

Before:                    After:
    r <- load x                r <- load x
                               r <- abs r
  B:                         B:
    store x                    store x
    r <- abs r

Figure  VDCM example

    r
    store A(i), r
    r <- load A(i)

Figure  Another VDCM example

Further, our algorithm runs faster than lazy code motion because the computation of the altered set for each block is greatly simplified.

Chapter

Relief of Register Pressure

This chapter will discuss the use of redundancy elimination techniques to relieve register pressure, the demand for registers at any point in a routine. It has been argued that eliminating redundancies will always increase register pressure. However, it is also possible that eliminating redundancies can reduce register pressure. The example in Figure demonstrates how this can happen. When the second computation of x is removed, the lifetime of the first computation of x is lengthened. On the other hand, the lifetimes of y and z are shortened because their last use is now earlier in the program. The net effect is to reduce the number of live registers by one at the marked point. There are two other issues that further complicate the situation:

1. Increasing register pressure does not affect performance unless the pressure becomes greater than the number of physical registers on the target machine.

2. Other optimizations that run after redundancy elimination can change the register pressure in unpredictable ways.

Before: two computations of x from y and z; y and z are live at the marked point.
After: the second computation of x is removed; only x is live at the marked point.

Figure  Example where redundancy elimination decreases register pressure

Columns: Routine; Before Optimization; Before Redundancy Elimination; After Redundancy Elimination; After Optimization.

Routines: fpppp, twldrv, deseco, supp, subb, tomcatv, efill, prophy, saturr, ddeflu, iniset, bilan, debflu, debico, repvid, inisla, paroi, bilsla, pastem, dyeh, drepvi, colbur, ihbtr, yeh, inithx, integr, cardeb, orgpar, gamgen, heat, inter, svd, decomp, rkfs, spline, fehl, fmin.

Table  Register pressure at various points

Table shows the register pressure at various points during optimization for the routines in each benchmark suite in which the register pressure is above the threshold. Each entry in the table represents the maximum register pressure at any point in the routine. In almost all cases, redundancy elimination increased register pressure. However, in only a handful of the routines did the register pressure cross the register threshold during the redundancy elimination phase. The register pressure in some of the routines is reduced by eliminating redundancies; it is reduced significantly for fpppp. Finally, the later optimization passes change the register pressure in unpredictable ways: usually the pressure is raised, but it is lowered in a few of the routines.

One approach to relief of register pressure might be to somehow limit redundancy elimination so that it does not raise the register pressure above the number of physical registers. However, this would still not account for the increase in register pressure caused by later optimizations. Therefore, we feel that the most effective way to relieve register pressure will be a transformation that runs immediately before register allocation. Another advantage of this approach is that it does not depend on the algorithm used for redundancy elimination. The idea is to insert instructions that reduce the register pressure but are cheaper than the load and store operations that will be inserted during register allocation. Further, these instructions will be placed on a more global basis than the spill code. In some sense, this transformation will undo the adverse effects of redundancy elimination. This transformation will operate as follows:

1. Perform value numbering to identify expressions with the same value.

2. Identify locations in the routine where the pressure is greater than R, the number of physical registers on the target machine. (We consider this a good estimate of the number of physical registers on a typical machine.) This is accomplished using traditional live-variable analysis. The register pressure at each point in the routine is estimated by the size of the live set. The true register pressure in the block may change as a result of the copy coalescing that occurs during register allocation.

3. Heuristically identify calculations that will decrease the pressure. For each block where the register pressure is too high, we search for expressions whose result and operands are live at the beginning and end of the block but not mentioned inside the block. Compile-time constants are particularly attractive because they have no register operands. If such a calculation were placed at the end of the block, the register pressure inside the block would decrease by one.

4. Select the location where inserting these calculations will result in the least run-time impact. We will place instructions as late as possible without crossing a use of the result or lengthening any path at a greater nesting depth.

5. Apply dead code elimination to remove any definitions that have been masked by inserted instructions.

Identifying Blocks with High Register Pressure

We locate regions of high register pressure using dataflow analysis to identify live registers. For each block, we begin with the LIVEOUT set and examine the instructions in reverse order. At each instruction, we remove any defined registers from the set and insert any used registers. The register pressure is simply the size of the set at any point. We record the maximum register pressure for any point inside each block.
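The backward scan just described can be sketched as follows; the representation of an instruction as a (defs, uses) pair of register sets is an assumption made for illustration:

```python
# Sketch: estimate the maximum register pressure inside one block by
# walking its instructions in reverse, starting from the LIVEOUT set.
def max_pressure(instructions, liveout):
    """instructions: list of (defs, uses) register-set pairs, in block order;
    liveout: registers live on exit from the block."""
    live = set(liveout)
    peak = len(live)
    for defs, uses in reversed(instructions):
        live -= set(defs)   # values defined here are not live above this point
        live |= set(uses)   # operands must be live before this instruction
        peak = max(peak, len(live))
    return peak
```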

Choosing Expressions to Relieve Pressure

Once we have identified the blocks with high register pressure, we are ready to select expressions to relieve the pressure. We begin with the set of expressions whose result is live across the block and whose result and operands are not touched within the block. For each expression in this set, we must also ensure that its operands are live on exit from the block. Otherwise, inserting the expression at the end of the block would shorten the live range of the result while lengthening the live range of one or more of the operands. The net result would be unchanged or even increased register pressure.
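The candidate test described above can be sketched as a per-expression set check; the (result, operands) pairs and the set names are illustrative assumptions, not the thesis data structures:

```python
# Sketch of the candidate filter: an expression qualifies only if its
# result is live across the block, its result and operands are live on
# exit, and none of these registers is touched inside the block.
def candidates(exprs, livein, liveout, touched):
    """exprs: hypothetical (result, operands) pairs; livein, liveout, and
    touched are register sets for the block under consideration."""
    chosen = []
    for result, operands in exprs:
        regs = {result} | set(operands)
        if result in livein and regs <= liveout and not (regs & touched):
            chosen.append((result, operands))
    return chosen
```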

Once we have narrowed down the choice of expressions, we heuristically choose one expression at a time until the register pressure for the block drops to R or the set becomes empty. We have identified several heuristics:

Heuristic 1: Choose expressions in any order.

Heuristic 2: Give priority to registers constrained in predecessors. The example in Figure motivates this heuristic. It shows three consecutive blocks with high register pressure. If three different expressions are chosen to relieve pressure, then three instructions will be needed. However, if the same expression is chosen for each block, only one instruction will be needed. During the next step, the correct location for the calculation will be chosen at the end of the third block.

Figure  Motivating example for heuristic 2 (left: a different expression inserted in each of three high-pressure blocks; right: the same expression inserted once)

Heuristic 3: Allow existing definitions to participate in code motion. Figure shows how inserting an expression at the end of a block with high pressure can render another definition of the expression partially dead. If we treat both definitions as candidates for code motion, we will eliminate the partially dead computation.

Heuristic 4: Limit code motion to expressions actually inserted. Figure shows how heuristic 3 can move an expression into an empty basic block that would otherwise have been removed by a later optimization pass. The result is that the instruction is not executed on the final iteration of the loop, at the cost of executing an extra jump on every iteration. This problem occurs in the gamgen routine. For this reason, it is often useful to limit the expressions to be moved to only those involved in relieving pressure.

Heuristic 5: Give priority to compile-time constants. Because these expressions have no register operands, we can insert them without worrying about lengthening the live range of the operands. This heuristic is similar to the

Figure  Motivating example for heuristic 3 (panels: Before; Expression inserted; Move expressions forward)

Figure  Motivating example for heuristic 4 (before and after: an expression moved into an otherwise empty basic block)

technique of Briggs, Cooper, and Torczon. The key difference is where the code to compute the constant is inserted. Rematerialization is attempted for each live range that is spilled during register allocation; any use within the live range that is shown to be a constant will be preceded by a load-immediate instruction rather than a load. If all uses are handled in this fashion, then no store instructions are needed at the definitions within the live range. In contrast, this heuristic inserts a load-immediate instruction at the end of a block with high pressure and moves that instruction forward based on the analysis described in the next section.

Heuristic 6: Give priority based on nesting depth. Since expressions are moved as close as possible to their uses, we look for expressions whose uses are at the outermost nesting depth.

Selecting Locations to Insert Instructions

We select locations to insert each expression using dataflow analysis similar to the second phase of lazy code motion (see Chapter). We move each expression as late as possible, subject to the following:

1. Do not pass a use of the expression. Passing a use would force that use to obtain its value from a definition before the block where we are attempting to relieve pressure.

RELIEFIN_i  =  empty set, if i is the entry block;
               intersection of RELIEFOUT_j over j in pred(i)
               with depth(i) <= depth(j), otherwise

RELIEFOUT_i = (RELIEFIN_i - exposed_i) union relief_i

INSERT_{i,j} = (RELIEFOUT_i - RELIEFIN_j) intersect LIVEIN_j

Figure  Dataflow equations for relief of register pressure

2. Do not lengthen any path at a greater nesting depth. The restriction of not lengthening any path can prohibit conditionally executed expressions from moving out of loops. On the other hand, allowing any path to be lengthened can result in expressions moving into loops. Therefore, we have chosen to compromise.

The dataflow equations for selecting insertion points are given in Figure. The local predicates required for each block i are exposed_i and relief_i. They represent expressions with an exposed use in i and expressions initially inserted in i, respectively. Notice that we only consider the predecessors with an equal or greater nesting depth when computing RELIEFIN_i.
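A round-robin iteration to a fixed point over these equations might look as follows; the representations of blocks, predecessors, and nesting depths are illustrative assumptions:

```python
# Sketch: iterate the RELIEFIN/RELIEFOUT equations to a fixed point.
# Predecessors at a shallower nesting depth are ignored, as in the
# equations above.
def relief_sets(blocks, preds, depth, exposed, relief, entry):
    reliefin = {b: set() for b in blocks}
    reliefout = {b: set(relief[b]) for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == entry:
                new_in = set()
            else:
                ps = [p for p in preds[b] if depth[p] >= depth[b]]
                new_in = (set.intersection(*(reliefout[p] for p in ps))
                          if ps else set())
            new_out = (new_in - exposed[b]) | relief[b]
            if new_in != reliefin[b] or new_out != reliefout[b]:
                reliefin[b], reliefout[b] = new_in, new_out
                changed = True
    return reliefin, reliefout
```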

Recall that the dataflow framework for lazy code motion computes a set of expressions to insert on each edge of the CFG. In Section we showed that inserting expressions on edges was not necessary if critical edges in the graph had been split. The same is true in this framework.

Results

Table shows the register pressure for several routines using each of the heuristics described above. Since the heuristics can only relieve register pressure that exists across basic blocks, they are not effective on fpppp, which contains only one basic block. Also, the gamgen, heat, inter, and fmin routines are not affected because their register pressure after optimization is below the register threshold. When the heuristics are applied, each one reduces the pressure significantly. However, in only a few of the routines was the pressure lowered below its level achieved before redundancy elimination (see Table). Many of the heuristics reduce the maximum pressure to the same level. If the maximum pressure is high, this means that all possible expressions were chosen and there is simply no decision for the heuristics to make. However, this does not mean that each heuristic will change the routine in the same way. Recall that register pressure can vary throughout a routine and that even though there are blocks where the pressure cannot be brought below the threshold, there are other blocks where the pressure can be relieved. The differences between the heuristics become evident in blocks where the pressure is near the threshold. In these blocks, different heuristics will choose different expressions, and different behavior will be observed at run time. The register pressure inside the iniset, ihbtr, inithx, and cardeb routines was reduced to a point where it can be allocated on a register machine without spill code. We are not able to achieve this goal on every routine simply because there are not enough expressions whose insertion would relieve pressure. Chapter will present data showing the improvements in instruction counts that result from each of the heuristics.

Summary

We have presented a technique that can reduce register pressure by introducing redundancies into a routine. As a result, less spill code will be inserted during register allocation, and the running time of the routine is reduced. Intuitively, the technique proceeds by identifying blocks with high register pressure in the routine, heuristically choosing expressions that will reduce register pressure if inserted at the end of the block, and moving definitions as close to their uses as possible.

One key aspect of this technique is that definitions cannot move past a use. This includes uses in register-to-register copy instructions. Since many of these instructions are removed during the coalescing phase of register allocation, further opportunities

Columns: Routine; Without Relief; one column per heuristic.

Routines: fpppp, twldrv, deseco, supp, subb, tomcatv, efill, prophy, saturr, ddeflu, iniset, bilan, debflu, debico, repvid, inisla, paroi, bilsla, pastem, dyeh, drepvi, colbur, ihbtr, yeh, inithx, integr, cardeb, orgpar, svd, decomp, rkfs, spline, fehl.

Table  Register pressure using relief heuristics

for movement may be created. This observation leads us to believe that this technique might be more effective as a part of the register allocator rather than a separate pass that runs before allocation. Such an allocator could insert expressions and perform forward motion following each application of the coalescing phase. Forward motion could be applied not only to expressions inserted during the current phase, but also to those expressions inserted in previous phases. Thus, any expression whose forward motion was blocked by a copy that was subsequently coalesced could be moved closer to its true uses.

Another advantage of incorporating this technique into the register allocator is that code motion could be applied to the spill code. Rather than inserting a store instruction after each definition in a live range and a load instruction before each use, the allocator could identify regions of high pressure and insert a store at the beginning and a load at the end. Then the load instructions could be moved forward as far as possible using the existing dataflow framework, and the store instructions could be moved as early as possible using an analogous framework.

Chapter

Experimental Results

A theoretical comparison of the redundancy elimination techniques reveals that SCC-based value numbering and value-driven code motion are never worse than their competitors in terms of eliminating redundancies. However, an equally important question is how much this theoretical distinction matters in practice. Practical considerations force us to ask two key questions:

1. How often do opportunities for improved redundancy elimination arise?

2. How does improved redundancy elimination interact with other optimization passes?

To assess the real impact of these techniques, we have implemented all of the optimizations in our experimental Fortran compiler.

Experimental Compiler

We have built an experimental compiler centered around our intermediate language, called ILOC (pronounced "eye-lock"). ILOC is a pseudo-assembly language for a RISC machine with an arbitrary number of symbolic registers; load and store operations are provided to access memory, and all computations operate on symbolic registers. Figure shows a sample routine written in ILOC. Each basic block in the routine is identified by its label, and each block ends with an explicit flow-of-control operation. The routine begins with a FRAME operation that defines the registers passed from the calling routine. It ends with a RTN operation. Each operation is labeled with a line number in the source code. Each operation consists of an opcode, a list of constants, a list of referenced registers, and a list of defined registers. Any of these lists can be empty. A special symbol indicates the beginning of the list of defined registers. Constant integers appear with no prefix, memory locations appear with a distinguished first character, and registers appear with the letter r as the first character. All opcodes that define registers have a type indicated by the first letter

bubble FRAME r r r r

iSLDor n r r

iSLI r r

iLDI r

iADDI r r

iSLI r r

iCMP r r r

BRlt LL LL r

LL iADDI r r

iADD r r r

iADD r r r

iADDI r r

JMPl LL

LL iADDI r r

iCMP r r r

BRlt L LL r

L ii r r

JMPl LL

LL ii r r

ii r r

JMPl LL

iLDor a r r LL

iLDor a r r

iCMP r r r

BRlt LL LL r

LL ii r r

JMPl LL

LL iADDI r r

iCMP r r r

BRge LL LL r

LL iLDor a r r

iLDor a r r

iSTor a r r

iSTor a r r

iCMP r r r

BRge L LL r

L ii r r

JMPl LL

LL RTN r

Figure Sample ILOC routine

of the opcode. The following data types are supported: byte, integer, floating point, double precision, complex, and double-precision complex. In this example, all registers are integers. Load and store operations support two addressing modes. The offset-register mode (e.g., iLDor) computes the address by adding a constant offset to the value in a register. The register-register mode (e.g., iLDrr) computes the address by adding the values in two registers.
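A reader for this operation format can be sketched as below. The symbol separating used registers from defined registers was lost in this copy of the text, so the sketch assumes "=>"; it also treats every non-register token before the separator as a constant or memory label, and ignores block labels:

```python
# Sketch of a parser for one ILOC operation line, assuming the format:
# opcode, then constants / memory labels / source registers, then an
# assumed "=>" marker, then the defined registers.
def parse_iloc(line):
    tokens = line.split()
    opcode, rest = tokens[0], tokens[1:]
    if '=>' in rest:
        arrow = rest.index('=>')
        srcs, defs = rest[:arrow], rest[arrow + 1:]
    else:
        srcs, defs = rest, []
    consts = [t for t in srcs if not t.startswith('r')]  # constants and labels
    uses = [t for t in srcs if t.startswith('r')]        # referenced registers
    return opcode, consts, uses, defs
```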

The front end translates Fortran into ILOC. The optimizer is organized as a collection of Unix filters that consume and produce ILOC. This design allows us to easily apply optimizations in almost any order. The back end produces C code instrumented to count the number of ILOC instructions executed. The number of instructions executed is a good, but not perfect, indication of the program's execution time. The following details are overlooked:

1. Not all instructions require the same number of cycles to execute.

2. The effects of a back end are not measured. In particular, the effects of register allocation and scheduling might change the results. Solving these problems globally requires a heuristic approximation to an NP-complete problem. The heuristic chosen must be tailored to work well with the code expected from the optimizer.

3. Machine features such as pipelines and memory hierarchies are not taken into account.

The common thread among these shortcomings is that they all depend on features of a specific target machine. Therefore, we feel that our experiments are useful in measuring the effectiveness of machine-independent optimizations.

Tests Performed

In this experiment, we optimized routines from the Spec benchmark suite and from Forsythe, Malcolm, and Moler's book on numerical methods. We refer to the latter as the FMM benchmark. Each routine was optimized in several different ways by varying the type of value numbering and code removal/motion. Table shows the possible combinations. The entries in the table distinguish optimizations that existed prior to the beginning of this research from entries that represent contributions of this research.

To achieve accurate comparisons, we varied only the type of redundancy elimination performed. Routines were optimized with the sequence of global reassociation,

Columns: Single Basic Blocks; Extended Basic Blocks; Dominator-Based Removal; AVAIL-Based Removal; Lazy Code Motion; Value-Driven Code Motion.
Rows: Hash-based; Partitioning; SCC.

Table  Tests performed

redundancy elimination (different in each test), global constant propagation, operator strength reduction (see Appendix A), redundancy elimination, global constant propagation, global peephole optimization, dead code elimination (Section), copy coalescing, and a pass to eliminate empty basic blocks.

Raw Instruction Counts

Figures through present ILOC instruction counts for each combination in Table. Table presents a guide for these figures. Remember that the instruction counts show the performance of the different techniques in the context of an optimizing compiler. Thus, even though one technique may remove more redundancies than another, subsequent optimization passes may negate this improvement.

Type of redundancy elimination                Spec     FMM
Hash-based (vary code removal/motion)         Figure   Figure
Partitioning (vary code removal/motion)       Figure   Figure
SCC-based (vary code removal/motion)          Figure   Figure

Table  Key to raw instruction count figures

These optimizations are repeated to clean up after strength reduction.

Hash Based Single Extended Dominator AVAIL LCM VDCM tomcatv 391116123 324009765 321398525 321398525 227904139 226608739 twldrv 77380898 65204805 62457308 62394396 63462668 61658575 gamgen 102266 86685 86282 86282 86256 86256 iniset 84119 38115 38115 38115 38077 38077 deseco 17500 16271 15622 15608 14863 14823 debflu 5164 4855 4553 4553 4364 4330 prophy 5026 4342 3777 3757 3939 3939 pastem 4817 4078 3570 3570 3582 3576 fpppp 3925 3922 3922 3922 3990 3922 repvid 3858 3427 3074 3021 2859 2846 bilan 3670 3721 3419 4034 3452 3452 paroi 3591 3574 3495 3503 3724 3601 debico 3567 3393 3364 3335 3061 3061 inithx 3121 2874 2552 2552 2628 2620 integr 2479 2338 2124 2124 2444 2444 sgemv 1236 1042 1035 1035 791 791 cardeb 1132 1076 1029 1042 785 785 sgemm 1130 982 981 981 834 834 inideb 982 920 891 891 788 775 supp 896 812 693 693 693 693 saxpy 859 759 759 759 467 467 ddeflu 809 773 741 739 704 701 subb 630 539 539 539 539 539 fmtset 604 365 360 360 343 360 ihbtr 492 450 456 456 478 487 drepvi 372 329 272 268 267 267 x21y21 358 299 299 299 263 263 saturr 316 318 244 242 243 243 fmtgen 280 232 205 200 176 176 efill 270 251 216 213 215 215 si 246 177 164 164 165 165 heat 215 212 178 178 177 177 dcoera 171 153 144 144 155 144 lclear 165 127 127 127 91 91 orgpar 154 134 129 129 117 117 yeh 147 142 124 124 134 124 colbur 132 130 117 117 117 117 coeray 131 113 104 104 100 96 drigl 124 120 115 115 115 115 lissag 106 89 89 89 82 82 aclear 101 79 79 79 59 59 sortie 90 88 81 81 78 78 sigma 54 48 48 48 48 48 hmoy 27 23 23 23 23 23 dyeh 25 25 25 25 25 25 vgjyey 22 22 20 20 20 20 arret 19 19 19 19 19 19 inter 10 10 10 10 10 10 bilsla 666666

inisla 666666

Figure  Number of ILOC operations for hash-based value numbering techniques (Spec benchmark)

Partitioning Dominator AVAIL LCM VDCM tomcatv 429513447 429513447 228685858 227390458 twldrv 65643989 65581077 66494004 64689911 gamgen 152725 152725 137858 137858 iniset 38115 38115 38077 38077 deseco 15117 15066 14723 14710 debflu 4987 4987 4569 4535 prophy 3419 3447 4045 4045 pastem 3717 3717 3675 3669 fpppp 3988 3988 4056 3988 repvid 3215 3162 2915 2915 bilan 3560 3560 3544 3544 paroi 4147 4151 4145 4039 debico 3809 3627 3375 3375 inithx 2653 2653 2682 2675 integr 2124 2124 2444 2444 sgemv 1834 1834 1197 1197 cardeb 1540 1525 884 884 sgemm 1230 1230 838 838 inideb 1175 1175 916 916 supp 693 693 693 693 saxpy 1009 1009 472 472 ddeflu 762 761 726 725 subb 539 539 539 539 fmtset 416 416 354 371 ihbtr 465 465 475 484 drepvi 305 301 300 300 x21y21 299 299 263 263 saturr 244 242 243 243 fmtgen 214 215 180 180 efill 219 215 217 217 si 169 169 170 170 heat 181 181 180 180 dcoera 149 149 160 149 lclear 127 127 91 91 orgpar 132 132 118 118 yeh 124 124 134 124 colbur 129 129 129 129 coeray 109 109 105 101 drigl 129 129 129 129 lissag 102 102 89 89 aclear 79 79 59 59 sortie 85 85 82 82 sigma 48 48 48 48 hmoy 34 34 34 34 dyeh 26 26 26 26 vgjyey 20 20 20 20 arret 19 19 19 19 inter 13 13 13 13 bilsla 6666

inisla 6666

Figure  Number of ILOC operations for value partitioning techniques (Spec benchmark)

SCC-Based Dominator AVAIL LCM VDCM tomcatv 321393425 321393425 227899039 226603639 twldrv 63070603 63007691 64169355 62365262 gamgen 83855 83855 86256 86256 iniset 38115 38115 38077 38077 deseco 14018 14018 14462 14423 debflu 4553 4553 4364 4330 prophy 3297 3325 3923 3923 pastem 3544 3544 3566 3550 fpppp 3922 3922 3990 3922 repvid 2912 2859 2779 2766 bilan 3055 3055 3369 3369 paroi 3497 3505 3726 3603 debico 3070 3070 2963 2963 inithx 2552 2552 2627 2620 integr 2124 2124 2444 2444 sgemv 1035 1035 791 791 cardeb 1029 1042 785 785 sgemm 981 981 834 834 inideb 892 892 789 776 supp 693 693 693 693 saxpy 759 759 467 467 ddeflu 735 734 702 701 subb 539 539 539 539 fmtset 360 360 343 360 ihbtr 432 432 478 476 drepvi 272 268 267 267 x21y21 299 299 263 263 saturr 244 242 243 243 fmtgen 205 200 176 176 efill 216 213 215 215 si 164 164 165 165 heat 178 178 177 177 dcoera 144 144 155 144 lclear 127 127 91 91 orgpar 129 129 117 117 yeh 124 124 134 124 colbur 117 117 117 117 coeray 104 104 100 96 drigl 115 115 115 115 lissag 90 90 83 83 aclear 79 79 59 59 sortie 81 81 78 78 sigma 48 48 48 48 hmoy 23 23 23 23 dyeh 25 25 25 25 vgjyey 20 20 20 20 arret 19 19 19 19 inter 10 10 10 10 bilsla 6666

inisla 6666

Figure  Number of ILOC operations for SCC-based value numbering techniques (Spec benchmark)

Routine   Single  Extended  Dominator  AVAIL  LCM   VDCM
svd        6392    5502      5081      5059   4445  4281
fmin       2072    1138       941       941    880   881
zeroin     1418     912       836       836    748   748
spline     1054     937       928       928    895   886
decomp      847     724       714       707    658   657
fehl        744     708       704       704    646   646
rkfs        552     384       332       332    279   279
urand       226     225       223       223    225   223
solve       208     189       186       186    187   185
seval       121     113        78        78     77    77
rkf45        35      35        35        35     35    35

Figure  Number of ILOC operations for hash-based value numbering techniques (FMM benchmark)

Routine   Dominator  AVAIL  LCM   VDCM
svd        5362      5340   4532  4384
fmin        941       941    880   881
zeroin      836       836    748   748
spline      972       972    924   915
decomp      713       706    659   659
fehl        704       704    646   646
rkfs        332       332    279   279
urand       223       223    225   223
solve       208       208    196   196
seval        82        82     81    81
rkf45        48        48     48    48

Figure  Number of ILOC operations for value partitioning techniques (FMM benchmark)

Routine   Dominator  AVAIL  LCM   VDCM
svd        5061      5039   4444  4281
fmin        941       941    880   881
zeroin      836       836    748   748
spline      928       928    895   886
decomp      711       704    658   657
fehl        704       704    646   646
rkfs        332       332    279   279
urand       223       223    225   223
solve       186       186    187   185
seval        78        78     77    77
rkf45        35        35     35    35

Figure  Number of ILOC operations for SCC-based value numbering techniques (FMM benchmark)

Normalized Instruction Counts

Table provides a guide to Figures through. Each of these figures compares normalized execution times for either a row or column in Table.

Comparison with Previous State of the Art

Since each of the techniques described in this thesis contributes in its own way to run-time improvements, we made a final comparison showing the sum of all contributions. Our reference point will be the combination of value partitioning and lazy code motion as presented by Briggs and Cooper. We compare this combination with SCC-based value numbering and value-driven code motion. The results are shown in Figures and.

Compile Times

We compared the time required by each of the redundancy elimination techniques for some of the larger routines in the test suite. The number of blocks, SSA names, and operations are given to indicate the size of the routine being optimized.

Table compares the compile times of the three value numbering techniques. The SCC technique is significantly faster than partitioning; it is competitive with hashing until a routine has enough SCCs to make iteration its dominant behavior.

We also compared the compile times required by each of the code removal and motion techniques. Table compares the running times of the code removal techniques.

Type of redundancy elimination                      Spec     FMM
Hash-based (vary code removal/motion)               Figure   Figure
Partitioning (vary code removal/motion)             Figure   Figure
SCC-based (vary code removal/motion)                Figure   Figure
Dominator-based removal (vary value numbering)      Figure   Figure
AVAIL-based removal (vary value numbering)          Figure   Figure
Lazy code motion (vary value numbering)             Figure   Figure
Value-driven code motion (vary value numbering)     Figure   Figure

Table  Key to normalized instruction count figures

Figure  Comparison of hash-based value numbering techniques (Spec benchmark): normalized instruction counts for Single, Extended, Dominator, AVAIL, LCM, and VDCM across the Spec routines.

Figure  Comparison of value partitioning techniques (Spec benchmark): normalized instruction counts for Dominator, AVAIL, LCM, and VDCM across the Spec routines.

Figure  Comparison of SCC-based value numbering techniques (Spec benchmark): normalized instruction counts for Dominator, AVAIL, LCM, and VDCM across the Spec routines.

Figure  Comparison of value numbering techniques using dominator-based removal (Spec benchmark): normalized instruction counts for Partitioning, Hash-based, and SCC-based across the Spec routines.

Figure  Comparison of value numbering techniques using AVAIL-based removal (Spec benchmark): normalized instruction counts for Partitioning, Hash-based, and SCC-based across the Spec routines.

1.2 1 Partitioning 0.8 Hash-based 0.6 SCC-based 0.4 0.2 0 tomcatv twldrv gamgen iniset deseco debflu prophy pastem fpppp repvid

1.2 1 Partitioning 0.8 Hash-based 0.6 SCC-based 0.4 0.2 0 bilan paroi debico inithx integr sgemv cardeb sgemm inideb supp

1.2 1 Partitioning 0.8 Hash-based 0.6 SCC-based 0.4 0.2 0 saxpy ddeflu subb fmtset ihbtr drepvi x21y21 saturr fmtgen efill

1.2 1 Partitioning 0.8 Hash-based 0.6 SCC-based 0.4 0.2 0 si heat dcoera lclear orgpar yeh colbur coeray drigl lissag

1.2 1 Partitioning 0.8 Hash-based 0.6 SCC-based 0.4 0.2 0

aclear sortie sigma hmoy dyeh vgjyey arret inter bilsla inisla

Figure Comparison of value numb ering

techniques using lazy co de motion Sp ec b enchmark

1.2 1 Partitioning 0.8 Hash-based 0.6 SCC-based 0.4 0.2 0 tomcatv twldrv gamgen iniset deseco debflu prophy pastem fpppp repvid

1.2 1 Partitioning 0.8 Hash-based 0.6 SCC-based 0.4 0.2 0 bilan paroi debico inithx integr sgemv cardeb sgemm inideb supp

1.2 1 Partitioning 0.8 Hash-based 0.6 SCC-based 0.4 0.2 0 saxpy ddeflu subb fmtset ihbtr drepvi x21y21 saturr fmtgen efill

1.2 1 Partitioning 0.8 Hash-based 0.6 SCC-based 0.4 0.2 0 si heat dcoera lclear orgpar yeh colbur coeray drigl lissag

1.2 1 Partitioning 0.8 Hash-based 0.6 SCC-based 0.4 0.2 0

aclear sortie sigma hmoy dyeh vgjyey arret inter bilsla inisla

Figure Comparison of value numb ering techniques

using valuedriven co de moion Sp ec b enchmark

Figure: Comparison of hash-based value numbering techniques, FMM benchmark (bar chart; series: Single, Extended, Dominator, AVAIL, LCM, VDCM; routines: svd, fmin, zeroin, spline, decomp, fehl, rkfs, urand, solve, seval, rkf45).

Figure: Comparison of value partitioning techniques, FMM benchmark (bar chart; series: Dominator, AVAIL, LCM, VDCM).

Figure: Comparison of SCC-based value numbering techniques, FMM benchmark (bar chart; series: Dominator, AVAIL, LCM, VDCM).

Figure: Comparison of value numbering techniques using dominator-based removal, FMM benchmark (bar chart; series: Partitioning, Hash-based, SCC-based).

Figure: Comparison of value numbering techniques using AVAIL-based removal, FMM benchmark (bar chart; series: Partitioning, Hash-based, SCC-based).

Figure: Comparison of value numbering techniques using lazy code motion, FMM benchmark (bar chart; series: Partitioning, Hash-based, SCC-based).

Figure: Comparison of value numbering techniques using value-driven code motion, FMM benchmark (bar chart; series: Partitioning, Hash-based, SCC-based).

A comparison of the running times of the code motion techniques appears in the compile-times table below. Since the differences in running times are determined primarily by the size of the bit vectors and the time required to compute the altered set for each block, these times are included in the comparison. Notice that the bit-vector sizes used by VDCM are larger than those used by LCM; this is because there are more values than lexical names (i.e., the same name can have many values). However, the reduction in the time required to compute the altered sets more than compensates for this difference: in every instance, VDCM ran faster than LCM.
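The bit-vector machinery behind these timings can be sketched compactly. The following Python sketch represents each set over a universe of expressions as an integer bit vector, as data-flow-based code motion implementations typically do; the set contents and function names are illustrative, not taken from the thesis implementation.

```python
# Minimal sketch of bit-vector sets for data-flow-based code motion.
# Each set over a universe of n expressions is a Python int;
# bit i set means expression i is a member.

def make_set(indices):
    """Build a bit vector from a collection of expression indices."""
    s = 0
    for i in indices:
        s |= 1 << i
    return s

def difference(a, killed):
    """Remove the expressions altered (killed) in a block."""
    return a & ~killed

def avail_out(avail_in, altered, computed):
    """One forward step: available on exit = (available on entry
    and not altered) plus expressions computed in this block."""
    return difference(avail_in, altered) | computed

avail_in = make_set([0, 1, 3])
out = avail_out(avail_in, altered=make_set([1]), computed=make_set([2]))
```

The cost trade-off discussed above falls out of this representation: a value-based framework needs longer bit vectors (more values than names), but its altered sets are cheaper to compute.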

Table: Compile times of value numbering techniques. Columns: routine, blocks, SSA names, operations, hash-based, SCC-based, partitioning. Routines measured: tomcatv, ddeflu, debflu, deseco, twldrv, fpppp.

Figure: Comparison with previous state of the art (Partitioning/LCM versus SCC/VDCM), Spec benchmark.

Figure: Comparison with previous state of the art (Partitioning/LCM versus SCC/VDCM), FMM benchmark.

Table: Compile times of code removal techniques. Columns: routine, blocks, SSA names, operations, dominator, AVAIL. Routines measured: tomcatv, ddeflu, debflu, deseco, twldrv.

Table: Compile times of code motion techniques. For each of LCM and VDCM, the columns give set size, time to compute the altered sets, and total time. Routines measured: tomcatv, ddeflu, debflu, deseco, twldrv.

Relief of Register Pressure

The relief-heuristic figures below compare each of the heuristics for relief of register pressure (described in an earlier chapter) to allocation without relief for the routines in our test suite. Since our goal is to replace the memory operations that would be inserted during register allocation with less expensive computations, we perform comparisons with load and store operations weighted by successively larger factors. As the weight is increased, further improvements are seen until they eventually level off. The code generated in this experiment was instrumented not only to count the total number of instructions executed, but also to count the number of load and store instructions executed. We then use this secondary count to vary the weight of memory operations. In our experiments, several of the heuristics perform consistently well.

Summary

This chapter presents experimental data comparing the effectiveness of the techniques presented in this thesis. Experiments are run in the context of our optimizing compiler, and the number of ILOC operations executed is measured. Each experiment varies only a single pass in the sequence of optimizations.

The first set of experiments compares the effectiveness of various redundancy elimination techniques. In general, each refinement to the technique results in an improvement in the results. When comparing techniques for value numbering, we see that hash-based value numbering almost always eliminates significantly more redundancies than value partitioning, and SCC-based value numbering removes slightly more redundancies than hash-based value numbering. When comparing the techniques for code removal and motion, we see significant improvements when moving from single basic blocks to extended basic blocks, and again to dominator-based removal. One surprising aspect of this study is that the differences between dominator-based removal and AVAIL-based removal are small in practice; AVAIL-based removal is only slightly better. The differences between AVAIL-based removal and lazy code motion are significant, because moving invariant code out of loops provides a great deal of improvement. However, there are some examples where AVAIL-based removal is better than lazy code motion, because it is based on values while lazy code motion is based on lexical names. The differences between lazy code motion and value-driven code motion are small, because lazy code motion is a provably optimal technique.

Figure: Comparison of relief heuristics, Spec benchmark, first LOAD/STORE weight (series: Without, Heuristic 0 through Heuristic 5).

Figure: Comparison of relief heuristics, Spec benchmark, second LOAD/STORE weight (same series).

Figure: Comparison of relief heuristics, FMM benchmark, first LOAD/STORE weight (same series).

Figure: Comparison of relief heuristics, FMM benchmark, second LOAD/STORE weight (same series).

Anomalous behavior is seen in a few instances. Three effects account for these results:

Removing more redundancies can result in partially dead code.

Changes in the live ranges can inhibit copy coalescing.

Improved redundancy elimination introduces more opportunities for operator strength reduction (see Appendix A).

Negative results from improved redundancy elimination are never more than a few percent, while positive results can be considerably larger.

The second set of experiments compares the heuristics for relief of register pressure. The amount of improvement depends on the cost of load and store instructions relative to register-to-register instructions: the higher the cost, the greater the improvement. When register pressure is high, each of the heuristics will improve performance, except in a few anomalous cases. Substantial improvements are seen when the cost of memory operations is three times the cost of register-to-register operations. However, decreases in performance are also seen in a few cases.

Recommendations

When choosing algorithms to implement in a compiler, the developer must weigh the tradeoffs among three areas:

Execution time of the optimized program

Execution time of the compiler

Amount of programmer time required for implementation

For each of these areas, this thesis has provided the information necessary for the compiler writer to make an informed decision about which form of redundancy elimination to implement. The decision should be based on engineering judgment and the amount of importance placed on each of the above areas.

For the value numbering phase, the choice is among hash-based value numbering, value partitioning, and SCC-based value numbering. Value partitioning can be eliminated from consideration because it is more difficult to implement, it runs slower, and it eliminates fewer redundancies in practice than the other two options. The choice between hash-based and SCC-based value numbering is not so clear. Even though SCC-based value numbering will find more redundancies, our experiments indicate that the improvements made in practice are small. On the other hand, hash-based value numbering can run significantly faster than SCC-based value numbering. We can characterize the types of redundancies identified by SCC-based value numbering that are not discovered by hash-based value numbering as equal values involved in a cycle, e.g., two induction variables with the same initial value that are incremented by the same amount on each trip through a loop. We do not expect to see these types of redundancies in code written by programmers; in fact, in our experiments, no code generated by the front end contains them. However, certain transformations applied by the compiler can introduce them. In our experiments, the operator strength reduction pass can introduce them, even though it is fairly careful not to introduce redundancies. Other optimizations, such as loop unrolling and transformations intended to improve cache performance, can also introduce them. If any of these transformations are performed before redundancy elimination, then the compiler writer might give more weight to SCC-based value numbering.

For the code motion phase, the choice is between lazy code motion and value-driven code motion. Value-driven code motion is the clear winner because it wins in all three areas. On the other hand, if an implementation of lazy code motion already exists and programmer time is at a premium, the compiler writer might consider using the existing implementation and accepting the loss in optimization time and performance.

Selecting which heuristic is appropriate for relief of register pressure will depend on the characteristics of the target machine as well as the register allocator. In our experiments, several of the heuristics perform consistently well. On large routines, one heuristic performs better than the others; however, it can increase the running time of small routines, because it can move code into an empty block that could otherwise have been removed. The result is an increase in the number of jump instructions executed. If the target processor has an instruction prefetch buffer with zero-cost jumps, then this effect will not matter, and in that case this heuristic would be the clear choice. On the other hand, a simpler heuristic is easy to implement and did not increase the running time of any routine in our test suite. There may be some combination of compiler and processor where this heuristic performs best.

Summary of Contributions

Optimizing compilers apply a sequence of transformations to a routine in order to improve the performance of the target program. One or more of these passes is devoted to redundancy elimination. The primary focus of this thesis has been to improve upon known techniques for redundancy elimination. Redundancies can be eliminated by removing instructions or moving them to less frequently executed locations. In the past, the algorithms for removing computations have been designed and implemented independently from the algorithms for moving instructions. The primary contribution of this thesis is to unify these techniques: by understanding how these two optimizations interact, we can simplify each of them, and the resulting combination will be more powerful than the sum of the two parts.

Value numbering attempts to discover which instructions compute the same value. It assigns numbers to values in such a way that two values are assigned the same number if the compiler can prove they are equal. There are three main approaches to value numbering:

Hash-based value numbering. This algorithm uses a hash table to map expressions to value numbers. It was originally applied to single basic blocks and extended basic blocks. We have extended this technique to operate over the dominator tree and to use a unified hash table.

Value partitioning. This algorithm uses a variation of Hopcroft's DFA minimization algorithm to partition the values in a routine into congruence classes. We have extended this technique to handle commutative operations and to eliminate redundant stores.

SCC-based value numbering. This is an original technique that combines the features of hash-based value numbering and value partitioning. It is easy to understand and to implement, it can handle constant folding and algebraic identities, and it is global. We can prove that the algorithm finds at least as many congruences as hash-based value numbering or value partitioning.
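To make the first approach concrete, here is a minimal Python sketch of hash-based value numbering over a single straight-line block. The instruction encoding, function names, and the commutative-key normalization are our own illustrative choices, not the thesis implementation; the core idea is the same: two expressions that hash to the same (opcode, value-number) key are provably equal.

```python
# Hash-based value numbering for one basic block (a sketch).
# Each instruction is (dest, op, left, right). Expressions are keyed
# by (op, vn(left), vn(right)); a hit in the table means the value
# was already computed, so the instruction is redundant.

def value_number(block):
    table = {}        # (op, vn1, vn2) -> value number
    var_vn = {}       # name -> current value number
    next_vn = [0]
    redundant = []    # destinations whose value already exists

    def vn_of(operand):
        # Unseen operands (constants, incoming variables) get fresh VNs.
        if operand not in var_vn:
            var_vn[operand] = next_vn[0]
            next_vn[0] += 1
        return var_vn[operand]

    for dest, op, a, b in block:
        key = (op, vn_of(a), vn_of(b))
        if op in ('+', '*'):               # commutative: normalize key
            key = (op,) + tuple(sorted(key[1:]))
        if key in table:
            redundant.append(dest)         # could be rewritten as a copy
            var_vn[dest] = table[key]
        else:
            table[key] = next_vn[0]
            var_vn[dest] = next_vn[0]
            next_vn[0] += 1
    return redundant

block = [('t1', '+', 'a', 'b'),
         ('t2', '+', 'b', 'a'),   # same value as t1, by commutativity
         ('t3', '*', 'a', 'b')]
```

Extending this single-table scheme over the dominator tree, as the text describes, amounts to scoping the table so entries from dominating blocks remain visible in dominated ones.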

Value numbering will renumber the registers and φ-nodes in a routine so that congruent values are given the same number. However, renumbering alone will not improve the running time of the routine; we must remove computations or move them to less frequently executed locations:

Dominator-based removal. This technique was suggested by Alpern, Wegman, and Zadeck. It removes instructions that are dominated by a congruent computation.

AVAIL-based removal. This technique relies on data-flow analysis to discover the set of available expressions and remove instructions whose value is in the set. We have improved this technique so that it operates on values rather than lexical names.

Partial redundancy elimination / lazy code motion. These algorithms combine loop-invariant code motion with common subexpression elimination. They rely on data-flow analysis to move instructions. We have extended these techniques to allow motion of load instructions.

Value-driven code motion. This original approach to data-flow analysis is based on values rather than lexical names. Traditional data-flow frameworks cannot move an instruction past a definition of one of its subexpressions. This restriction can be relaxed when value numbering has identified the computations that produce redundant values. By understanding how code motion interacts with value numbering, we can simplify and improve the code motion framework. This algorithm is faster than lazy code motion, and it can eliminate more redundancies.
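The data-flow analysis underlying AVAIL-based removal can be sketched as a small fixed-point solver. The diamond-shaped CFG and the GEN/KILL sets below are invented for illustration; the recurrence itself is the classical available-expressions formulation (meet over predecessors by intersection).

```python
# Available expressions to a fixed point (a sketch).
# AVAIL_in(b)  = intersection over predecessors p of AVAIL_out(p)
# AVAIL_out(b) = (AVAIL_in(b) - KILL(b)) | GEN(b)
# Sets are int bit vectors over a universe of expressions.

def solve_avail(preds, gen, kill, universe):
    avail_in = {b: universe for b in preds}
    changed = True
    while changed:
        changed = False
        for b in preds:
            if not preds[b]:
                new_in = 0                  # nothing available at entry
            else:
                new_in = universe
                for p in preds[b]:
                    new_in &= (avail_in[p] & ~kill[p]) | gen[p]
            if new_in != avail_in[b]:
                avail_in[b] = new_in
                changed = True
    return avail_in

# Diamond CFG: entry -> b1, b2 -> join. Expression 0 is computed on
# both arms, expression 1 on only one, so only bit 0 reaches the join.
preds = {'entry': [], 'b1': ['entry'], 'b2': ['entry'], 'join': ['b1', 'b2']}
gen   = {'entry': 0, 'b1': 0b11, 'b2': 0b01, 'join': 0}
kill  = {'entry': 0, 'b1': 0,    'b2': 0,    'join': 0}
avail = solve_avail(preds, gen, kill, universe=0b11)
```

In the value-based variant the thesis advocates, the bit positions index values rather than lexical expressions, which is what lets redundant computations survive intervening redefinitions of their operands' names.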

Redundancy elimination can change the live ranges in a routine, which can potentially have a negative impact on register allocation. We have invented a new technique to relieve register pressure by reintroducing redundancies into the routine. As a result, less spill code is inserted during register allocation, and the running time of the routine is reduced.

A theoretical comparison of the redundancy elimination techniques reveals that SCC-based value numbering and value-driven code motion are never worse than their competitors in terms of eliminating redundancies. However, an equally important question is how much this theoretical distinction matters in practice; the preceding chapter experimentally compares all of the techniques described in this thesis.

Appendix A gives a new algorithm for operator strength reduction that improves on existing algorithms in several ways. It takes advantage of the properties of SSA form and the shape of a routine after redundancy elimination.

Appendix A

Operator Strength Reduction

Operator strength reduction is a transformation that a compiler uses to replace costly (strong) instructions with cheaper (weaker) ones. The algorithm replaces an iterated series of strong computations with an equivalent series of weaker computations. The classic example replaces certain multiplication operations inside a loop with equivalent addition operations. This case arises routinely in loop-based array address calculations, and many other operations can be reduced in this manner; Allen, Cocke, and Kennedy provide a detailed catalog of such reductions.

Strength reduction has been an important transformation for two principal reasons. First, multiplying integers has usually taken longer than adding them. This made strength reduction profitable; the amount of improvement varied with the relative costs of addition and multiplication. Second, strength reduction decreased the overhead introduced by translation from a high-level language down to assembly code. Opportunities for this transformation are frequently introduced by the compiler as part of address translation for array elements. In part, strength reduction's popularity stems from the fact that these computations are plentiful, stylized, and, in a very real sense, outside the programmer's concern.

In the future, we may see microprocessors where an integer multiply and an integer add both take a single cycle. On such a machine, strength reduction will still have a role to play. In combination with algebraic reassociation, strength reduction may let the compiler use fewer induction variables in a loop, lowering both the operation count inside the loop and the demand for registers. This effect may be especially pronounced in code that has been automatically blocked to improve locality.

This chapter presents a new algorithm for performing strength reduction. It produces results similar to those of Allen, Cocke, and Kennedy's classic algorithm. By assuming some specific prior optimizations and operating on the SSA form of the procedure, we have derived a method that (1) is simple to understand and to implement, (2) relies on the dominator tree, which must be computed during SSA construction, rather than on the loop structure of the program, which can be costly to compute, (3) avoids instantiating the sets of induction variables and region constants required by other algorithms, (4) processes each candidate instruction immediately rather than maintaining a worklist, and (5) greatly simplifies linear function test replacement. Its asymptotic complexity is, in the worst case, the same as that of the Allen, Cocke, and Kennedy algorithm.

Figure: Example. Three columns show the source code, the low-level intermediate code, and the pruned SSA form of a loop that sums the elements of an array a.

Opportunities for strength reduction arise routinely from details that the compiler inserts as it converts from a source-level representation to a machine-level representation. To see this, consider the simple Fortran code fragment shown in the example figure above. The left column shows source code; the central column shows a low-level intermediate-code version of the same loop. Notice the four-instruction sequence that begins at the label L: the compiler inserted this code, with its multiply, as the expansion of a(i). The right column shows the code in pruned SSA form.
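The effect of reducing such an address computation can be sketched directly. The 4-byte element size and base-address arithmetic below are illustrative assumptions, not details taken from the figure; the point is that the per-iteration multiply becomes a running addition while producing the same address sequence.

```python
# Array-address calculation before and after strength reduction.
# Assumed layout: a(i) lives at base + (i - 1) * 4 for 4-byte elements.

def addresses_with_multiply(base, n):
    # Original loop: one multiply per iteration, as in the expansion
    # of a(i) in the intermediate code.
    return [base + (i - 1) * 4 for i in range(1, n + 1)]

def addresses_reduced(base, n):
    # After strength reduction: a temporary holds base + (i - 1) * 4
    # and is bumped by the element size each trip through the loop.
    out, t = [], base
    for _ in range(n):
        out.append(t)
        t += 4            # the weaker add replaces the multiply
    return out
```

Both functions enumerate the same addresses; only the cost per iteration differs.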

The left column of the transformed-code figure below shows the result of strength reduction and the optimizations described later in this appendix. The compiler created a new variable, t, to hold the value of the address expression derived from i and the base address of a. We will describe an algorithm to automate this process inside a compiler.

Of course, further improvement may be possible. For example, if the only remaining use for i is in the control-flow tests that govern the loop, the compiler could reformulate the tests to use t, making the instructions that define i useless, or dead. This transformation is called linear function test replacement (LFTR). The results of applying LFTR appear in the right column of the figure.

Figure: Transformed code. The left column shows the loop after strength reduction; the right column shows it after linear function test replacement.

The Algorithm

Preliminary Transformations

To simplify the algorithm, we assume that some prior optimization has been performed. This preprocessing simplifies the task of strength reduction by encoding certain facts in the shape of the code. For example, after invariant code has been moved out of loops, we can easily identify loop-invariant values based on the location of their definition.

Earlier chapters describe the algorithms for redundancy elimination, which combine loop-invariant code motion and common subexpression elimination. We perform global reassociation and global renaming prior to redundancy elimination.

Because our algorithms for redundancy elimination will not move conditionally executed code out of loops, it is possible that some loop-invariant code will remain inside a loop. We account for this by defining a region constant to be either a compile-time constant or a value whose definition is outside the loop. This does not identify all possible region constants, but our experiments indicate that the missed opportunities are insignificant in practice. Since we consider compile-time constants to be region constants, we require some form of constant propagation to identify as many constants as possible; we use Wegman and Zadeck's sparse conditional constant algorithm.

We construct the pruned SSA form of the program. In the program's SSA graph, each node represents an operation or a φ-node, and edges flow from uses to definitions. The SSA graph can be built from the resulting program by adding the use-definition chains, which can be represented as a lookup table indexed by SSA names. The SSA-graph figure below shows the SSA graph for the example program.

Finding Region Constants and Induction Variables

Previous strength reduction algorithms have been centered around loops or regions inside a procedure, detected using Tarjan's flow-graph reducibility algorithm. Given a region r, a node in the SSA graph is a region constant with respect to r if its value does not change inside r. A variable is an induction variable with respect to region r if, within r, its value is only incremented or decremented by a region constant. We avoid the need to build the loop tree by using the dominator tree that was constructed during the conversion to SSA form. We take advantage of two key properties of SSA form to identify region constants and induction variables:

After invariant code has been moved out of loops, each region constant with respect to region r will either be a compile-time constant, or its definition will strictly dominate every block in r. The SSA construction algorithm ensures this; if it were not true, the SSA form would have a φ-node inside r.

Every induction variable forms a strongly connected component (SCC) in the SSA graph. Notice that the converse is not true, i.e., not every SCC represents an induction variable. In the SSA-graph figure, the SCC containing the definitions of sum does not represent an induction variable, because the loaded value added to it is not a region constant.

The idea of finding induction variables as SCCs of the SSA graph is due to Wolfe. To discover the SCCs, we use Tarjan's algorithm based on depth-first search, shown in the figure below. It uses a stack to determine which nodes are in the same SCC: nodes not contained in any cycle are popped singly, while all the nodes in the same SCC are popped together.

Tarjan's algorithm has an interesting property: when a collection of one or more nodes is popped from the stack, all of the operands referenced in those nodes that are defined outside the collection have already been popped. We can capitalize on

 

Figure: SSA graph for the example program.

DFS(node)
    node.DFSnum ← nextDFSnum++
    node.visited ← TRUE
    node.low ← node.DFSnum
    PUSH(node)
    for each o ∈ {operands of node}
        if not o.visited
            DFS(o)
            node.low ← MIN(node.low, o.low)
        if o.DFSnum < node.DFSnum and o ∈ stack
            node.low ← MIN(o.DFSnum, node.low)
    if node.low = node.DFSnum
        SCC ← ∅
        do
            x ← POP()
            SCC ← SCC ∪ {x}
        while x ≠ node
        ProcessSCC(SCC)

Figure: Tarjan's SCC-finding algorithm

while there is an unvisited node n
    DFS(n)    (see the figure above)

ProcessSCC(SCC)
    if SCC has a single member n
        if n is of the form x ← iv × rc, x ← rc × iv, x ← iv + rc, or x ← rc + iv
            Replace(n, iv, rc)
        else
            n.header ← NULL
    else
        ClassifyIV(SCC)

RegionConst(name, header)
    return name.op = load-immediate or name.block ≫ header

ClassifyIV(SCC)
    for each n ∈ SCC
        if header.RPOnum > n.block.RPOnum
            header ← n.block
    for each n ∈ SCC
        if n.op ∉ {φ, +, −, copy}
            SCC is not an induction variable
        else
            for each o ∈ {operands of n}
                if o ∉ SCC and not RegionConst(o, header)
                    SCC is not an induction variable
    if SCC is an induction variable
        for each n ∈ SCC
            n.header ← header
    else
        for each n ∈ SCC
            if n is of the form x ← iv × rc, x ← rc × iv, x ← iv + rc, or x ← rc + iv
                Replace(n, iv, rc)
            else
                n.header ← NULL

Figure: Operator strength reduction algorithm

this observation and process the nodes as they are popped from the stack. In our method, the SCC finder drives the entire strength reduction process.
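The SCC-driven structure described above can be rendered as runnable Python on an explicit adjacency list standing in for the SSA graph (node names and the graph itself are illustrative). As in the text, nodes on no cycle are popped as singleton components, and the point where each component is recorded is where ProcessSCC would run.

```python
# Tarjan's SCC algorithm over an SSA-graph-like adjacency list
# (node -> list of operand nodes).

def tarjan_sccs(graph):
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []
    counter = [0]

    def dfs(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                dfs(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:          # edge back into the stack
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:           # v is the root of an SCC
            scc = []
            while True:
                x = stack.pop()
                on_stack.discard(x)
                scc.append(x)
                if x == v:
                    break
            sccs.append(scc)             # ProcessSCC would run here

    for v in graph:
        if v not in index:
            dfs(v)
    return sccs

# i1 and i2 form a cycle (a phi and an add, say); t is defined from
# i1 but lies on no cycle, so it pops as a singleton.
sccs = tarjan_sccs({'i1': ['i2'], 'i2': ['i1'], 't': ['i1']})
```

Because operands outside a popped component have already been popped, processing each component at the marked point sees fully classified operands, which is exactly the property the strength reducer exploits.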

The operator strength reduction figure above shows the algorithm for identifying induction variables. As each SCC is identified, the algorithm decides immediately whether it represents an induction variable. The test is simple and efficient. The first step is to consider all basic blocks containing a definition inside the SCC and to identify the header block: the block with the smallest reverse-postorder number. We use this information to identify region constants with respect to this SCC; a region constant must either be a compile-time constant, or its definition must strictly dominate the header block. The RegionConst function implements the test to determine whether an SSA name is a region constant with respect to a header block. The next step is to check that each operation and φ-node in the SCC has the proper form. For φ-nodes, each argument must be either a member of the SCC or a region constant. For addition operations, one operand must be a member of the SCC, and the other operand must be a region constant. For subtraction operations, the left operand must be a member of the SCC, and the right operand must be a region constant. The only other permissible operation is a copy. If the SCC is determined to be an induction variable, we label each node with a pointer to the header block. Otherwise, we check whether any of the members are candidates for reduction.

For each node that is not a member of an SCC, we can decide immediately whether it is a candidate for strength reduction and perform the code replacement described in the next section. This is an improvement over previous algorithms, which require a worklist of candidate instructions. For simplicity, we restrict our discussion to instructions of the following forms:

x ← i × j    x ← j × i    x ← i + j    x ← j + i

where i is an induction variable and j is a region constant with respect to the header block for i. Allen, Cocke, and Kennedy describe a variety of other candidate types; these are straightforward extensions to the technique. If an operation is a candidate for reduction, the Replace function described in the next section is invoked immediately. Since this function transforms x into an induction variable, x is labeled as an induction variable with the same header block as i. This allows further reduction of operations using x.



(Footnote: We use the notation B1 ≫ B2 to signify that B1 strictly dominates B2.)

 

Code Replacement

Once we have found a candidate instruction of the form x ← i × j, we update the code so that the multiply is no longer performed inside the loop. The compiler creates a new SCC in the SSA graph and replaces the instruction with a copy from the node representing the value of i × j. This process is handled by three mutually recursive functions, shown in the figure below:

Replace: Replace the current operation with a copy from its reduced counterpart.

Reduce: Insert code to strength-reduce an induction variable, and return the SSA name of the result.

Apply: Insert an instruction to apply an opcode to two operands, and return the SSA name of the result. Simplifications such as constant folding are performed if possible.

The replacement process is supported by a hash table that tracks the results of reduction. This prevents us from performing the same reduction twice and causes the recursion in Reduce to terminate. Access to the hash table is through two functions:

search takes an expression (an opcode and two operands) and returns the SSA name of the result.

add takes an expression and the name of its result and adds an entry to the table.

The Replace function is straightforward. It provides the top-level call to the recursive function Reduce and replaces the current operation with a copy. The resulting operation must be an induction variable.

The Reduce function is responsible for adding the appropriate operations to the procedure. The first step is to check the hash table for the desired result. If the result is already in the hash table, then no additional instructions are needed, and Reduce returns the SSA name of the result. Otherwise, it must copy the operation or φ-node that defines the induction variable and assign a new name to the result; the copyDef function does this. Next, Reduce considers each argument of the copy. If the argument is defined inside the SCC, Reduce invokes itself recursively on that argument. Arguments defined outside the SCC are either the initial value of the induction variable or the value by which the induction variable is incremented. The initial value must be an argument of a φ-node, and the increment value must be an

Replace(node, iv, rc)
    result ← Reduce(node.op, iv, rc)
    Replace node with a copy from result
    node.header ← iv.header

SSAname Reduce(opcode, iv, rc)
    result ← search(opcode, iv, rc)
    if result is not found
        result ← inventName()
        add(opcode, iv, rc, result)
        newDef ← copyDef(iv, result)
        newDef.header ← iv.header
        for each operand o of newDef
            if o.header = iv.header
                Replace o with Reduce(opcode, o, rc)
            else if opcode = × or newDef.op = φ
                Replace o with Apply(opcode, o, rc)
    return result

SSAname Apply(opcode, op1, op2)
    result ← search(opcode, op1, op2)
    if result is not found
        if op1.header ≠ NULL and RegionConst(op2, op1.header)
            result ← Reduce(opcode, op1, op2)
        else if op2.header ≠ NULL and RegionConst(op1, op2.header)
            result ← Reduce(opcode, op2, op1)
        else
            result ← inventName()
            add(opcode, op1, op2, result)
            Choose the location where the operation will be inserted
            Decide if constant folding is possible
            Create newOper at the desired location
            newOper.header ← NULL
    return result

Figure A: Code replacement functions

operand of an instruction. The reduction is always applied to the initial value, but the reduction is applied to the increment only if we are reducing a multiply. In other words, when the candidate is an add or subtract instruction, we modify only the initial value; but if the candidate is a multiply, we modify both the initial value and the increment. Therefore, Reduce invokes Apply on arguments defined outside the SCC only if we are reducing a multiply or if we are processing the arguments of a φ-node.
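The operand rule in the preceding paragraph, reduce operands inside the SCC, and Apply to outside operands only for a multiply or a φ-node, can be sketched as follows. All names are hypothetical stand-ins: reduce_fn and apply_fn stand for the Reduce and Apply routines of Figure A, and Operand is a minimal node type for illustration.

```python
class Operand:
    """Minimal SSA operand: `header` is its SCC's header block, or None."""
    def __init__(self, name, header=None):
        self.name = name
        self.header = header

def rewrite_operand(o, opcode, iv_header, newdef_is_phi,
                    reduce_fn, apply_fn, rc):
    """Return the replacement for one operand of the copied definition."""
    if o.header == iv_header:
        # Defined inside the SCC: reduce it recursively.
        return reduce_fn(opcode, o, rc)
    if opcode == '*' or newdef_is_phi:
        # Initial value (phi argument), or the increment of a multiply.
        return apply_fn(opcode, o, rc)
    # Increment of an add/subtract candidate: left unchanged.
    return o
```

The third branch is what keeps add and subtract candidates cheap: their increments need no new code at all.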

The Apply function is conceptually simple, although there are a few details that must be considered. The basic function of Apply is to create an operation with the desired result. We rely on the hash table to determine whether such an operation already exists. It is possible that the operation we are about to create is itself a candidate for strength reduction; if so, we perform the reduction immediately by calling Reduce. This case often arises from triangular loops, where the bounds of an inner loop are a function of the index of an outer loop.

Before inserting the operation, the algorithm must select a location where it is legal to do so. Rather than construct landing pads before the loop being reduced, our algorithm relies on dominance information created during SSA construction. Intuitively, the instruction must go into a block that is dominated by the definitions of both operands. If one of the operands is a constant, we may have to copy its definition to satisfy this condition. Otherwise, both operands must be region constants, so their definitions dominate the header block. One operand must be a descendant of the other in the dominator tree, so the operation can be inserted immediately after the descendant. This avoids the need for landing pads; indeed, it may place the operation in a less deeply nested location than the landing pad.
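The dominance argument above can be sketched directly: given the blocks holding the two operands' definitions and a dominance oracle, the new operation goes after whichever definition is the descendant. The dominates callback is an assumed precomputed relation (for example, derived from the dominator tree built during SSA construction); the function name is illustrative.

```python
def insertion_point(def1_block, def2_block, dominates):
    """Return the block that should receive the new operation.
    Both operands are region constants, so each definition dominates the
    loop header; one definition's block must therefore dominate the
    other's, and the operation goes after the dominated (descendant) one."""
    if dominates(def1_block, def2_block):
        return def2_block   # def2 is the descendant in the dominator tree
    if dominates(def2_block, def1_block):
        return def1_block
    raise ValueError("operand definitions must be ordered by dominance")
```

Since both definitions dominate the same header block, the dominator tree guarantees one of the two tests succeeds, so the error case cannot arise for legal inputs.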

A Example

As an example of how the code replacement functions operate, we will apply them to the SSA graph in Figure A. The first candidate identified is t; we invoke Replace with iv = i and rc = a. The hash table search in Reduce will fail, so the first SSA name invented will be osr. We add this entry to the hash table and create a copy of the φ-node for i. Next, we process the arguments of the copy. Since the first argument of i is a region constant, we replace it with the result of Apply, which will perform constant folding and return the SSA name of the folded value. The second argument of i is an induction variable, so we recursively invoke Reduce. Since no match is found

[SSA graph omitted: φ-nodes for the osr and sum values, copies defining the t temporaries, and a load, distributed across the blocks B.]

Figure A: After operator strength reduction

in the hash table, we invent a new SSA name, add an entry to the hash table, and copy the operation for i. The first argument is the region constant, which is left unchanged. The second argument is i, which is an induction variable; the recursive call to Reduce will produce a match in the hash table as the result. At this point, the calls to Reduce finish, and the SSA name osr is returned to Replace. We replace the operation defining t with a copy from osr. We label t as an induction variable, so that the remaining t temporaries will also be identified as candidates for strength reduction. Figure A shows the SSA graph of our example program after operator strength reduction is completed.

A Running Time

The time required to identify the induction variables and region constants in an SSA graph is O(N + E), where N is the number of nodes and E is the number of edges. The Replace function performs work that is proportional to the size of the SCC containing the induction variable, which can be as large as O(N). Since Replace can be invoked O(N) times, the worst-case running time is O(N²). This seems expensive; unfortunately, it is necessary. Figure A shows a program that generates this worst-

[Code listings omitted: "Original code" and "Transformed code" versions of a loop whose strength reduction inserts a quadratic number of updates.]

Figure A: A worst-case example

case behavior in the replacement step: it requires the introduction of a quadratic number of updates. Note that this behavior is a function of the code being transformed, not of any particular details of our algorithm; any algorithm that performs strength reduction on this code will have this behavior. In practice, experience with strength reduction suggests that this problem does not arise. In fact, we have not seen this problem mentioned in the literature. Since the amount of work is proportional to the number of instructions inserted, any algorithm for strength reduction that reduces these cases will have the same or worse complexity.

A Linear Function Test Replacement

After strength reduction, the code often contains induction variables whose sole use is to govern control flow. If so, linear function test replacement can convert them into dead code. The compiler should look for comparisons between an induction variable and a region constant. For example, the comparison if i … goto L in our example program (see Figure A) could be replaced with an equivalent comparison of osr against a rewritten constant. This transformation is called linear function test replacement (LFTR).

Previous methods would search the hash table for an expression containing the induction variable referenced in the comparison. In the example in Figures A and A, a chain of reductions was applied to the node for i. If LFTR is to be effective, we must follow the entire chain quickly. To facilitate this process, Reduce records the reductions it performs on each node in the SSA graph. Each reduction is represented by an edge from a node to its strength-reduced counterpart, labeled with the opcode and the region constant of the reduction. When a candidate for LFTR is found, it is a simple matter of traversing these edges, inserting code to compute the new region constant for the test, and updating the compare instruction. We use two procedures to support this process:

FollowEdges - Follow the LFTR edges and return the SSA name of the last one in the chain.

ApplyEdges - Apply the operations represented by the LFTR edges to a region constant and return the SSA name of the result.
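A minimal sketch of the two helpers, assuming each SSA node carries at most one outgoing LFTR edge recorded by Reduce as an (opcode, region constant, reduced node) triple. For illustration, apply_edges folds the edge operations over a numeric region constant rather than emitting code through Apply; the SSANode class and field names are assumptions.

```python
class SSANode:
    def __init__(self, name):
        self.name = name
        self.lftr_edge = None   # (opcode, region_constant, reduced_node)

def follow_edges(node):
    """Follow the chain of LFTR edges; return the final reduced node."""
    while node.lftr_edge is not None:
        _, _, node = node.lftr_edge
    return node

def apply_edges(node, rc):
    """Apply each reduction along the chain to a region constant.
    Here the opcodes are evaluated numerically for illustration; the real
    routine would emit code via Apply and return an SSA name."""
    ops = {'*': lambda a, b: a * b, '+': lambda a, b: a + b}
    while node.lftr_edge is not None:
        opcode, edge_rc, node = node.lftr_edge
        rc = ops[opcode](rc, edge_rc)
    return rc
```

Because each edge stores both the opcode and the region constant of its reduction, the two traversals recover the new comparison without ever consulting the hash table.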

The ApplyEdges function can be easily implemented using the Apply function described in Section A. For each LFTR candidate, we replace the induction variable with the result of FollowEdges, and we replace the region constant with the result of ApplyEdges. In the example in Figure A, there are two sets of these edges:

[Edge diagram omitted: two chains of LFTR edges link i and the osr nodes, each edge labeled with the opcode and region constant (involving a) of its reduction.]

To transform the test on i, we replace i with the result of FollowEdges (yielding osr), and we replace the region constant with the result of ApplyEdges. Notice that LFTR renders the original induction variable dead; subsequent optimizations should remove the associated instructions.

A Follow-up Transformations

The algorithm presented here operates in a compiler that performs a suite of optimization passes. To provide a good separation of concerns, we leave much of the cleaning up to other well-known optimizations that should be run after operator strength reduction.

Since operator strength reduction has the potential to introduce equal induction variables, we need either value partitioning or SCC-based value numbering (described in earlier chapters) to detect these equalities. Hash-based value numbering (also described earlier) will be unable to detect the equality of SCCs, because they contain values flowing through back edges.

The SSA graph in Figure A contains a great deal of dead code. This is because many of the use-definition edges in the original SSA graph have been changed, resulting in orphaned nodes. We rely on a separate pass of dead code elimination to remove these instructions (see the earlier section on dead code elimination).

Many of the copies introduced during strength reduction can be eliminated. For example, the copy into t in Figure A can be eliminated if the load into t uses the value of osr directly. We rely on the copy coalescing phase of a Chaitin-style graph coloring register allocator to accomplish this task.

A Previous Work

Reduction of operator strength has a long history in the literature. The classic method is presented in a paper by Allen, Cocke, and Kennedy; it, in turn, builds on earlier work by Cocke and Kennedy. These algorithms transform one loop at a time, working outward through each loop nest, making passes to generate def-use chains, find loops and insert landing pads, find region constants and induction variables, and perform the actual reduction and instruction replacement. Linear function test replacement is a separate pass for each loop. Chase extended the Allen, Cocke, and Kennedy method to reduce more additions.

A second family of techniques has grown up around the literature of data-flow analysis. These methods use the careful code placement calculations developed for code motion to perform strength reduction. They avoid the control-flow analysis used in the Allen, Cocke, and Kennedy methods; our algorithm uses properties of SSA for the same purpose. Their placement techniques avoid lengthening execution paths; our algorithm cannot make the same claim. Their principal limitation is that they work from a simpler notion of a region constant: only literal constants can be found. The Allen, Cocke, and Kennedy-style techniques, including ours, include loop-invariant values as region constants.

Paige has looked at reducing a number of set operators, using multiset discrimination as an alternative to hashing to avoid its worst-case behavior. Sites looked at the related issue of minimizing the number of loop induction variables. Markstein, Markstein, and Zadeck, in a chapter for a forthcoming ACM Press book, present an algorithm that combines strength reduction, expression reassociation, and code motion. Our work has treated these issues separately.

A Summary

This chapter presents a simple and elegant algorithm for performing reduction of operator strength. The results of applying our method are similar to those achieved by the Allen, Cocke, and Kennedy algorithm. Our technique relies on prior optimizations and properties of the SSA graph to produce an algorithm that is simple to understand and to implement. It relies on the dominator tree, which must be computed during SSA construction, rather than the loop structure of the program, which can be costly to compute; it avoids instantiating the sets of induction variables and region constants required by other algorithms; it processes each candidate instruction immediately, rather than maintaining a worklist; and it greatly simplifies linear function test replacement. The result is an efficient algorithm that is both easy to understand and easy to implement.

Bibliography

Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Massachusetts.

Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley.

Frances E. Allen, John Cocke, and Ken Kennedy. Reduction of operator strength. In Steven S. Muchnick and Neil D. Jones, editors, Program Flow Analysis: Theory and Applications. Prentice-Hall.

Bowen Alpern, Mark N. Wegman, and F. Kenneth Zadeck. Detecting equality of variables in programs. In Conference Record of the Fifteenth Annual ACM Symposium on Principles of Programming Languages, San Diego, California, January.

Preston Briggs and Keith D. Cooper. Effective partial redundancy elimination. SIGPLAN Notices, June. Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation.

Preston Briggs, Keith D. Cooper, and Linda Torczon. Rematerialization. SIGPLAN Notices, July. Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation.

Preston Briggs, Keith D. Cooper, and Linda Torczon. Improvements to graph coloring register allocation. ACM Transactions on Programming Languages and Systems, May.

Jiazhen Cai and Robert Paige. Look ma, no hashing, and no arrays neither. In Conference Record of the Eighteenth Annual ACM Symposium on Principles of Programming Languages, Orlando, Florida, January.

David Callahan, Steve Carr, and Ken Kennedy. Improving register allocation for subscripted variables. SIGPLAN Notices, June. Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation.

Steve Carr, Kathryn S. McKinley, and Chau-Wen Tseng. Compiler optimizations for improving data locality. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, California.

Gregory J. Chaitin, Marc A. Auslander, Ashok K. Chandra, John Cocke, Martin E. Hopkins, and Peter W. Markstein. Register allocation via coloring. Computer Languages, January.

David R. Chase. Brief survey of optimizations. Unpublished paper.

Cliff Click. Combining Analyses, Combining Optimizations. PhD thesis, Rice University.

John Cocke. Global common subexpression elimination. SIGPLAN Notices, July. Proceedings of a Symposium on Compiler Optimization.

John Cocke and Ken Kennedy. An algorithm for reduction of operator strength. Communications of the ACM, November.

John Cocke and Peter Markstein. Measurement of program improvement algorithms. In Proceedings of Information Processing. North-Holland Publishing Company.

John Cocke and Jacob T. Schwartz. Programming languages and their compilers: Preliminary notes. Technical report, Courant Institute of Mathematical Sciences, New York University.

Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, October.

Dhananjay M. Dhamdhere. On algorithms for operator strength reduction. Communications of the ACM, May.

Dhananjay M. Dhamdhere. A new algorithm for composite hoisting and strength reduction. International Journal of Computer Mathematics.

Karl-Heinz Drechsler and Manfred P. Stadel. A solution to a problem with Morel and Renvoise's "Global optimization by suppression of partial redundancies". ACM Transactions on Programming Languages and Systems, October.

Karl-Heinz Drechsler and Manfred P. Stadel. A variation of Knoop, Rüthing, and Steffen's lazy code motion. SIGPLAN Notices, May.

George E. Forsythe, Michael A. Malcolm, and Cleve B. Moler. Computer Methods for Mathematical Computations. Prentice-Hall, Englewood Cliffs, New Jersey.

Patricia C. Goldberg. A comparison of certain optimization techniques. In Rustin, editor, Design and Optimization of Compilers. Prentice-Hall.

Matthew S. Hecht. Flow Analysis of Computer Programs. Programming Languages Series. Elsevier North-Holland, Inc., New York, NY.

J. R. Issac and Dhananjay M. Dhamdhere. A composite algorithm for strength reduction and code movement. International Journal of Computer and Information Sciences.

John B. Kam and Jeffrey D. Ullman. Global data flow analysis and iterative algorithms. Journal of the ACM, January.

Ken Kennedy. Reduction in strength using hashed temporaries. SETL Newsletter, Courant Institute of Mathematical Sciences, New York University, March.

Jens Knoop, Oliver Rüthing, and Bernhard Steffen. Lazy code motion. SIGPLAN Notices, July. Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation.

Jens Knoop, Oliver Rüthing, and Bernhard Steffen. Lazy strength reduction. Journal of Programming Languages.

Jens Knoop, Oliver Rüthing, and Bernhard Steffen. Optimal code motion: Theory and practice. ACM Transactions on Programming Languages and Systems, July.

Donald E. Knuth. An empirical study of Fortran programs. Software: Practice and Experience.

Thomas Lengauer and Robert Endre Tarjan. A fast algorithm for finding dominators in a flowgraph. ACM Transactions on Programming Languages and Systems, July.

Etienne Morel and Claude Renvoise. Global optimization by suppression of partial redundancies. Communications of the ACM, February.

Robert Paige and Shaye Koenig. Finite differencing of computable expressions. ACM Transactions on Programming Languages and Systems, July.

Robert Paige and Jacob T. Schwartz. Reduction in strength of high level operations. In Conference Record of the Fourth ACM Symposium on Principles of Programming Languages, Los Angeles, California, January.

Gordon D. Plotkin. Call-by-name, call-by-value, and the λ-calculus. Theoretical Computer Science.

Lori L. Pollock. An Approach to Incremental Compilation of Optimized Code. PhD thesis, University of Pittsburgh.

John H. Reif. Symbolic programming analysis in almost linear time. In Conference Record of the Fifth Annual ACM Symposium on Principles of Programming Languages, Tucson, Arizona, January.

Vatsa Santhanam. Register reassociation in PA-RISC compilers. Hewlett-Packard Journal, June.

Richard L. Sites. The compilation of loop induction expressions. ACM Transactions on Programming Languages and Systems, July.

Robert E. Tarjan. Depth first search and linear graph algorithms. SIAM Journal on Computing, June.

Robert Endre Tarjan. Testing flow graph reducibility. Journal of Computer and System Sciences.

Mark N. Wegman and F. Kenneth Zadeck. Constant propagation with conditional branches. In Conference Record of the Twelfth Annual ACM Symposium on Principles of Programming Languages, New Orleans, Louisiana, January.

Mark N. Wegman and F. Kenneth Zadeck. Constant propagation with conditional branches. ACM Transactions on Programming Languages and Systems, April.

Michael E. Wolf and Monica S. Lam. A data locality optimizing algorithm. SIGPLAN Notices, June. Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation.

Michael Wolfe. Beyond induction variables. SIGPLAN Notices, July. Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation.