c

Copyright by

John Bradley Plevyak

OPTIMIZATION OF OBJECTORIENTED AND CONCURRENT PROGRAMS

BY

JOHN BRADLEY PLEVYAK

BA University of California Berkeley

THESIS

Submitted in partial fulllment of the requirements

for the degree of Do ctor of Philosophy in Computer Science

in the Graduate College of the

University of Illinois at UrbanaChampaign Urbana Illinois

Abstract

High level programming language features have long b een seen as improving programmer ef

ciency at some cost in program eciency When features such as ob jectorientation and

negrained concurrency which greatly simplify expression of complex programs are used par

simoniously their eectiveness is mitigated It is my thesis that these features can b e imple

mented eciently through interpro cedural analysis and transformation By sp ecializing their

implementation to the contexts in which they are used the programs eciency is not adversely

aected by the exibility of the language The sp ecic contributions herein are an adaptive

ow analysis for practical precise analysis of ob jectoriented programs a cloning algorithm

for building sp ecialized versions of general abstractions a set of optimizations for removing

ob jectoriented and negrained concurrency overhead and a hybrid sequentialparallel ex

ecution mo del which adapts to the availability of data The eectiveness of this framework

has b een empirically validated on standard b enchmarks It is publicly available as part of the

Illinois Concert system httpwwwcsagcsuiucedu iii

To Cindy iv

Acknowledgments

I would like to thank everyone at the University of Illinois at UrbanaChampaign who made this

p ossible I would esp ecially like to thank Vijay Karamcheti my ocemate colleague and friend

for his help supp ort knowledge and patience Xingbin Zhang for putting up with my tirades

ab out the right way to build a compiler Julian Dolby with whom I shared many a thought

and cup of coee Scott Pakin and the rest of the Concurrent Systems Architecture group for

many interesting discussions some technical and my advisor Andrew Chien for information

help time and giving me the opp ortunity to pursue so many fascinating directions I would

like to thank my committee for appreciating this work and Uday Reddy for conversations and

his patience with my lack of pro clivity for rigor

Life in Chambana would have b een unb earable if it were not for my friends there Esp ecially

Keith keep er of the b ox George towny Von Bridget Chris Steve Mahesh Tara Ulrike Je

Igor and Ellen Bob and the rest of the Illinois Sailing Club I also want to mention The Inn

The Blind Pig Thats Rentertainment the several Korean restaurants and trips to the BVI as

havens and escap es from a dull Midwest corn and cow town

I would like to thank my dear friends in California for their supp ort while I have b een away

pursuing this work Esp ecially Jo e Patty Chris and Anita for putting me up when I came

to visit Kian and Aaron for coming to visit me in this wasteland John Craig Greg Deedee

Lily Mark Tracy Dan Gina for many a past and future day of fun Kelly Simon and Justin

for reminding me of what is imp ortant Kevin Kari Terry Linda the Ashlands crew and all

the rest

I would like to thank my family for their supp ort understanding and patience Dave and

Cameron Candy and Skip and my dear mother and father Words cannot express

And most of all I would like to thank Cindy my true love v

Preface

The reasonable man adapts himself to the world the unreasonable one p ersists

in trying to adapt the world to himself Therefore all progress dep ends on the

unreasonable man

George Bernard Shaw

This thesis had b oth a conception and a birth The conception o ccurred when I discovered

Smalltalk in A thought was planted that expressiveness could b e joined with simplicity

and clarity However b eing unreasonable I desired eciency and concurrency as well The

birth o ccurred in when I b egan work on what would later b e the Concert pro ject whose

purp ose was to combine the simplicity of Smalltalk with the p ervasive concurrency of Actors

and yet automagically b e as ecient as FORTRAN on distributed memory MIMD machines

At the time the b est pure ob jectoriented systems were several times slower than C and

sharedmemory vector and SIMD machines reigned supreme To day concurrent ob jectoriented

programming has b een p opularized and networks of workstations NOWs are the rage To day

the sequential eciency of the Concert system matches C exceeds that of C and the parallel

eciency approaches FORTRAN with message passing

This thesis is organized into four parts The rst part is background Chapters and If

you are already familiar with ob jectoriented and concurrent ob jectoriented concepts you may

still wish to read the summaries Sections and which dene some useful terms The

second part describ es the compilation framework in general Chapter and the Illinois Concert

system in particular The third part comprises the b o dy of this thesis For readers interested

in particular analyses or transformations I have tried to make Chapters through relatively

indep endent by including some background information and references back to the p ertinent

earlier parts of this thesis Finally part four Chapter discusses adaptive sequentialparallel

execution vi

Table of Contents

Chapter

Intro duction

Thesis Summary

The Concert System

Results

Contributions

Organization

Programming Mo del

Ob jectOrientation

Abstract Data Typ es

Polymorphism

Inheritance

Dynamic Dispatch

Parametric Polymorphism

Implementation Issues

Concurrency

Concurrent Statements

Concurrent Ob jects

Naming Lo cation and Distribution

Implementation Issues

Languages

Concurrent Aggregates CA vii

Illinois Concert C ICC

Example Language

Related Work

Ob jectOriented Programming

Concurrent Ob jectOriented Programming

Parallel C

Summary

Execution Mo del

Hardware

Micropro cessor

Distributed Memory Multicomputer

Implementation Issues

Software

Threads

Ob jects

Slots

Tags

Lo cks

Messages

Global Shared Name Space

Scheduling

Implementation Issues

Sp ecialization and Memory Map Conformance

Load Balance and Data Distribution

Memory Allo cation and Garbage Collection

Runtime

Related Work

Summary

The Compilation Framework

Concert viii

Ob jective

Philosophy

Parts of the System

Timetable

Compiler

Overview

Retargetable Front End

Program Graph Construction

Analysis

Transformation

Co de Generation

Runtime

Target Machines

SPARCWorkstation

Thinking Machines Corp oration Connection Machine

Cray Research TD

Related Work

Summary

Adaptive Flow Analysis

Background

Constraintbased Analysis

Program Representation

Contours

Adaptive Analysis Overview

The Analysis Step

Denitions

Analysis Step Driver

Lo cal Flow Graph Construction

Global Flow Graph

Restrictions ix

Imprecision and Polymorphism

Adaptation

Splitting

Selecting Contours

Metho d Contour Splitting

Ob ject Contour Splitting

Remaining Issues

Recursion

Termination and Complexity

Other Data Flow Problems

Implementation

Data Structures

Accessors

Arrays

First Class Continuations

Closures

Performance and Results

Test Suite

Analysis

Precision

Time Complexity

Space Complexity

Related Work

Summary

Cloning

Example Matrix Multiply

Clones and Contours

Overview of the Algorithm

Information from Analysis

Mo died Dynamic Dispatch Mechanism x

Selecting Clones

Creating Clones

Performance and Results

Clone Selection

Dynamic Dispatch

Site Counts

Numb er of Invo cations

Co de Size

Discussion

Related Work

Summary

Optimization of Ob jectOriented Programs

Eciency Problems

Metho d Size

Data Dep endent Control Flow

Aliasing

Optimization Overview

Benchmarks

Invo cation Optimization

Static Binding

Sp eculation

Inlining

Unb oxing

Instance Variable to Static Single Assignment Conversion

Array Aliasing

General Optimizations

Overall Performance and Results

Metho dology

Pro cedural Co des Overall Results

Pro cedural Co des Individual Results xi

Ob jectOriented Co des Overall Results

Ob jectOriented Co des Individual Results

Related Work

Summary

Optimization of Concurrent Ob jectOriented Programs

Simple Lo ck Optimization

Access Subsumption

Stateless Metho ds

Sp eculative Inlining Revisited

Access Regions

Extending Access Regions

Safety

Adding Statements to a Region

Making a New Region

Caching

Lo cal Variables

Instance Variables

Globals

Touch Optimization Futures

Pushing

Grouping

Exp erimental Results

Related Work

Summary

Acknowledgments

Hybrid SequentialParallel Execution

Adaptation

Hybrid Execution

Overview

Parallel Invo cations xii

Sequential Invo cations

Wrapp er Functions and Proxy Contexts

Evaluation

Sequential Performance

Parallel Performance

Related Work

Summary

Acknowledgments

Conclusions

Thesis Summary

Final Thoughts

App endix

A Annotations

B Raw Flow Analysis Data

C Raw OOP Data

D CA Standard Prologue

Bibliography

Index

Vita xiii

List of Figures

Execution Mo del Example

Abstract Class left and Concrete Implementation right

Polymorphic Variables in Lisp left and C right

Polymorphic Containers in Lisp left and C right

Polymorphic Functions Lisp Smalltalk left and C right

Sup erclass base class left and Sub class derived class right

Dynamic Dispatch Unfolded

Function left and Class right Templates

Generic Class

Potential Optimization Example

Inlining of Metho ds virtual functions

Concurrency Example

TreeStructured Concurrency

Other Synchronization Structures

Sequential Lo op of Concurrent Blo cks left and Concurrent Lo op right

Lo cal Consistency Examples

Concurrency Control Examples

Example Distributed Concurrent Abstraction

Examples In Box left and Outstanding Requests right

Concurrency Control Boundaries Example

Synchronization b etween Concurrent Ob jects

Fib onacci in Concurrent Aggregates

Counter Collection Aggregate in CA xiv

Atomic Op erations

Micropro cessor Blo ck Diagram

Micropro cessor Pip eline

Hardware Mo del

Software to Hardware Mapping

Stacks of Frames vs Trees of Contexts

Context Layout

Futures Examples

Continuations and Touches

Use of Continuations

Counting Futures and Continuations

Ob ject Layout

Slot Abstraction

Ob jectbased Access Control

Invo cation Sequence

Global Name Translation

Class Memory Map Conformance

Ob ject Memory Maps Which do not Conform

Concert Philosophy A Hierarchy of Mechanisms

Concert Overview Phases and Intermediate Representations

Core Language Example

Data Dep endencies Example

Co de b efore left and after right SSU Conversion

Order of Transformations

Analysis on Static Single Use Form

Adaptive Analysis Driver

The Flow Graph

Functions on the Flow Graph

Analysis Step Driver xv

Restrictions

Metho d Polymorphism

Polymorphic Ob jects

Selecting Contours

Selecting Contours Pseudo Co de

Metho d Splitting for Integers and Floats

Ob ject Splitting for Imprecision at l

Ob ject Splitting Example

Data Flow and Assignment Sets Example

Paths and Splittability Example

Eciency of Typ e Inference Algorithms

Contours p er Metho d

Matrix Multiplication Example

Cloning Optimization Example

Primary Cloning Information

Cloning Optimization Information

Limitation of Standard Dispatch Mechanism

Cloning Selection Drivers pseudo co de

Contour Equivalence Functions pseudo co de

Partitioning of innerproduct contours induces repartitioning of mm

contours

Example Requiring Repartitioning of Contours

Sp ecialization of Matrix Multiply Example

Selection of Concrete Typ es Class Clones

Selection of Metho d Clones

Dynamic Dispatch Sites

Percent of Total Dynamic

Percent of Remaining Dynamic

Total Numb er of Invo cations

Eect of Cloning on Co de Size xvi

Aliasing Example

Static and Dynamic Invo cation Sites in the Optimized Stanford Integer

Benchmarks

Static and Dynamic Invo cation Sites in the Optimized Stanford Ob jectoriented

Benchmarks

Sp eed Comparison for Static Binding in the Stanford Ob jectoriented

Benchmarks

Sp eculation Example

Static Arity of Dispatch Sites in the Stanford Ob jectoriented Benchmarks

Dynamic Arity of Dispatch Sites in the Stanford Ob jectoriented Benchmarks

Eects of Inlining on Execution Time

Calling Convention Conversion

Instance Variable Transformation Example

Example of Common Sub expression Elimination of Array Op erations

Geometric Mean of Execution Times Relative to C O for the Stanford Integer

Benchmarks for a Range of Optimization Settings

Performance relative to C O on the Stanford Integer Benchmarks

Geometric Mean of Execution Times Relative to C O for the Ob ject

oriented Benchmarks for a Range of Optimization Settings

Performance of CA and C relative to C O on OOP Benchmarks

General Form of Sp eculative Inlined Invo cation

Livermore Lo ops Kernel Co de and Access Regions

Adding j i to an Access Region

Livermore Lo ops Kernel with New Region

Livermore Lo ops Kernel After Merge

Livermore Lo ops Kernel After Hoist

Kernel After Lifting

Example of Caching and Partitions

Caching of Instance Variables

Temp orally Constant Globals xvii

Global Temp oral Inconsistency

Temp orally Constant Globals

Touch Placement for Latency Hiding

Data Frontier for Touch Insertion

Data Frontier for Lo ops

Touches and Active State

Latency Hiding and Multiple Touches

Cumulative Eect of Optimizations on Kernel

Performance on Livermore Lo ops

Performance Dierence COOPCC

Data Ob ject Layout Graph

Distributed Computation Structure

Generated Co de Structure for a Parallel Metho d Version

Parallel Invo cation Schema Graphic

Invo cation Schema Calling Interfaces

Co de Structure for a Nonblo cking Sequential Metho d

Mayblo ck Calling Schema

Mayblo ck Schema Graphic

Continuation Passing Calling Schema

Pseudo Co de for Continuation Creation

Continuation Passing Schema Graphic

Nonblo cking Wrapp er

Mayblo ck Wrapp er

Continuation Passing Wrapp er

A Annotations Example

A Annotations Propagation Example xviii

List of Tables

Runtime Op erations

Development of the Concert System

Core Language Statements

Lo cal Flow Graph No des

Lo cal Constraint Examples

Standard Optimizations

Invo cation Schemas

Continuation Passing Caller Information

Sequential Performance

Execution Results SOR

Execution Results MDForce

Execution Results EMD

B Results of Iterative Flow Analysis

C Raw Data for Stanford Integer Benchmarks

C Raw Data for Stanford Integer Benchmarks cont

C Raw Data for Stanford OOP Benchmarks

C Raw Data for Stanford OOP Benchmarks cont

C Raw Data for Stanford OOP Benchmarks cont cont xix

Chapter

Intro duction

The fact is that civilization requires slaves The Greeks were quite right here

Unless there are slaves to do the ugly horrible uninteresting work culture and

contemplation b ecome almost imp ossible Human slavery is wrong insecure de

moralizing On mechanical slavery on the slavery of the machine the future of the

world dep ends

Oscar Wilde The Soul of Man Under Socialism

The key to constructing large software systems is to write and reason through abstrac

tions leaving the details of the implementation to the compiler This follows from what I

call the programmers uncertainty principle This principle holds that when the limits of hu

man understanding are reached something must go either the big picture or the small picture

Ob jectorientation and negrain concurrency are to ols for reasoning ab out the big picture

This thesis is ab out automating the details the small picture

Ob jectoriented programming OOP is the byword of software engineering promising to

increase pro ductivity through abstraction and software reuse Concurrent ob jectoriented pro

gramming COOP applies those to ols to parallelism and distribution with applications from

sup ercomputing to web browsers It is an axiom of this thesis that ob jectoriented programming

and negrained concurrency are The Right Thing Instead of b elab oring the p oint I

defer to the references provided at the end of this thesis the readers intuition and the wisdom

of future generations

OOP and COOP pro duce programs with structure quite dierent from standard pro cedural

co des but with the same high demands for eciency Traditional lo cal optimization techniques

are p o orly suited to the dynamic nature of OOP and COOP co des The abstractions which free

programmers from implementation details hide information from compilers and can result in

p o or p erformance

It is my thesis that

Ob jectoriented programming and concurrent ob jectoriented programming can b e

made ecient through interpro cedural analysis and transformation

That is programs written in natural OOP and COOP style can b e made as fast as the

equivalent programs written in conventional languages and styles ie pro cedural C

The core of this work is the recognition that as programs pass from abstract descriptions

of p otential b ehavior to concrete events more information b ecomes available ab out what might

happ en until the p otential collapses into a singularity of what did happ en The key to building

an ecient computation is to take advantage of this information as so on as it b ecomes available

when transformation is cheap er For example knowing the typ e size and lo cation of data

and the set of op erations to p erform enables the compiler to statically schedule the computers

resources for maximum eciency If some information is not known for example the size of

the data some scheduling must b e done dynamically which is generally more costly and the

less information available the less ecient the computation

This emphasis on information leads to an optimization framework based on four elements

analysis The discovery of information by examination of the program text As in real life

much interesting information is predicated if A is true then B is true

sp ecialization The optimization of a program section for particular conditions eg A is

true may involve replicating the section for several dierent conditions

sp eculation The testing of a prop erty eg is A true followed by sp ecialization

adaptation Mo dication of b ehavior based on accumulated information

Analysis is capable of determining information early in the programs life cycle when it can

b e applied with maximum impact however this information is often predicated or incomplete

For example analysis might indicated that at a particular program p oint a piece of data might

have one of three values ab and c Early in the programs execution the data might take

on one of the values a later taking on the others Moreover the analysis might have additional

information for the case if the piece of data has the value a How do es analysis determine

this information and what go o d is it The answer to the rst question is found in Chapter

which contains a discussion of adaptive ow analysis Essentially this analysis symb olically

executes the program then examines the information obtained adapts itself to answer questions

ab out things it did not resolve and reanalyzes The information obtained by analysis is used

throughout compilation and execution

One use for the information pro duced by analysis is to sp ecialize p ortions of the program for

dierent situations In describing the desired program b ehavior ob jectoriented programmers

use general abstractions such as sets of ob jects again and again in a variety of circumstances

However the very exibility which makes these abstractions useful under many conditions can

make them inecient in particular situations By automatically building sp ecialized versions of

the general abstraction for those cases which are p erformance critical b oth exible expression

and ecient implementation can b e achieved

In programs the op erations p erformed often dep end on the values of the input data There

fore it is not p ossible to know everything ab out the execution of the program from the program

text However analysis can often delineate the range of p ossible b ehavior enabling to building

of sp ecialized co de for dierent p ossibilities These sp ecialized versions are based on sp ecula

tions which must b e veried b efore they are actually executed typically by testing at run time

These tests can b e costly so careful management of sp eculative information is imp ortant for

overall eciency

Finally run time information can b e used to adapt the implementation of the program for

higher eciency For example a p ortion of the program which often deals with data stored on

distant parts of the machine should overlap remote access with other op erations On the other

hand parts of the program which use lo cal data should pro ceed sequentially through their

op erations eliminating the overhead of juggling outstanding remote requests for data This

situation is analogous to prescribing a sp ecialized training regimen to an athlete with sp ecial

needs In this case the running program adapts itself to the lo cation of data

These four elements analysis sp ecialization sp eculation and adaptation are useful in

turn as the program passes from abstract description to concrete events Analysis involves

no run time cost but often provides only predicated information Sp ecialization can improve

eciency but may involve an increase in the size of the program as it replicates co de for the

dierent predicated conditions Sp eculation involves p otentially costly run time tests but it

can determine information unavailable at compile time Finally the program can adapt to

b ehavioral trends in b ehavior by collecting run time information and mo difying its activities

accordingly

These elements are the central concepts in this thesis and they app ear b oth as the central

concept of their own chapters analysis Chapter sp ecialization Chapter sp eculation

Chapter and adaptation Chapter as well as in combination and in supp orting roles

in these and other chapters The contents of these chapters are summarized in Section

b elow The contributions represented by this work in relation to the state of art are describ ed

in Section and the overall organization of this thesis is discussed in Section

Thesis Summary

This thesis describ es a optimization framework containing novel techniques for contextsensitive

analysis sp ecialization of abstractions sp eculative optimization and adaptive execution It

b egins by describing ob jectorientated and negrained concurrent programming mo dels in

tro ducing terms and providing a basis for understanding the diculty of pro ducing ecient

implementations for such programs Then it describ es the execution mo del through which

the program is mapp ed to the hardware Next it describ es the compilation framework which

p erforms this mapping This framework consists of a new contextsensitive analysis which is

b oth precise and practical a new cloning algorithm for constructing sp ecialized versions of

abstractions and a collection of individual optimizations for ob jectoriented and negrained

concurrent programs Finally a new hybrid sequentialparallel execution enables the program

to adapt to the lo cation of data at run time

Programming Mo del

The programming mo del is a simple pure ob jectoriented programming mo del with ob jects

data metho ds op erations and classes which link them It makes no distinctions are made

b etween primitive typ es eg integers and user dened typ es eg Set and requires no typ e

declarations prototyp es or canonical hints eg inline To this ob jectoriented mo del ne

grained concurrency is added in a manner consistent with the principles of simplicity and

abstraction The concurrency mo del has three key features

a shared name space

dynamic thread creation and

ob ject level access control

A shared namespace allows programmers to separate data layout and functional correct

ness Dynamic thread creation allows programmers to express the natural concurrency of the

application leaving the system to map it to the underlying machine and ob jectlevel access

control provides a basic mutual exclusion mechanism which can b e used to construct larger

atomic op erations and synchronization structures

Execution Mo del

The execution mo del is based on a set of conventional singlethreaded pro cessing elements with

a memory hierarchy where some memory is less costly to access ie lo cal In general only

ob jects lo cal to a pro cessing element are op erated on directly The system synthesizes the global

namespace of the programming mo del by detecting and mapping op erations on remote ob jects

to communication b etween pro cessing elements Concurrency in the programming mo del is

achieved by multithreading the pro cessing elements in software and parallel execution across

no des Thus each pro cessing element can b e viewed as a sequential machine augmented with

runtime primitives for naming lo cking lo cation and concurrency control This mo del supp orts

existing massively parallel pro cessors and networks of workstations

ObjectD ObjectA ObjectC

remote messages local messages ObjectE ObjectB threads context switches

processing element processing element

Figure Execution Mo del Example

Each ob ject has a global name a set of lo cks to implement access control and a queue

for ready and susp ended threads which store their state in contexts heapallo cated activation

records When a message is sent to an ob ject a future is created representing the return value

and a thread is started to compute the future value When the return value is required by the

initial thread the future is touched and the thread susp ends until the value is present Thus

a logical thread may split then rejoin or seem to migrate from pro cessor to pro cessor as in

Figure

Compilation Framework

The compilation framework starts with a new interpro cedural contextsensitive ow analysis

which breaks through abstraction barriers Classes and functions are then replicated cloned

for the contexts in which they are used An iterative cloning algorithm rebuilds the cloned call

graph using a mo died dispatch mechanism Using the information xed by cloning classes

and functions are sp ecialized removing much of the overhead of ob jectorientation instance

memb er and lo cal variables are unb oxed metho ds virtual functions are statically b ound and

inlining is p erformed sp eculatively based on the class of ob jects or function p ointers for OOP

and lo cation or lo ck availability for COOP Other optimizations include conversion of memb er

and global variables to lo cals lifting and merging of access regions removal of redundant lo ck

and lo cality check op erations and redundant array op eration removal

Analysis

The ow analysis is a new context sensitive interpro cedural analysis which adapts to the struc

ture of the program to eciently derive information at a cost prop ortional to the precision of

the information obtained Flow analysis of ob jectoriented programs is complicated by the in

teraction of data values and control ow information through dynamic dispatch and imp erative

up date of instance variables To address these problems this analysis combines simultaneous

data and control ow analysis with iterative adaptation to the structure of the program A sim

ple less control and data ow sensitive analysis is used to determine where more precise analysis

is needed Contours are used to summarize stack frames and groups of ob jects Adaptation is

achieved through the contour representation which describ es the summarization mapping It is

extended lo cally to provide more precision and then the program is reanalyzed

Cloning

Cloning builds sp ecialized versions of classes and metho ds for optimization purp oses It b egins

with the results of ow analysis including the call graph and the set of contours These contours

are partitioned into prototypical clones based on optimization criteria Ob ject contours are

partitioned into concrete typ es implementation classes and metho d contours are partitioned

into metho d clones Next an iterative algorithm is applied which repartitions the contours

until the call graph is realizable until the ob jects can b e created of correct concrete typ es and

the correct clones can b e invoked for each invo cation site The standard dynamic dispatch

mechanism which selects the desired metho d based on the selector and class of the target is

mo died to b e context sensitive using an invo cation site identier

Optimization of Ob jectoriented Programs

Several sources of ineciency are evident for ob jectoriented programs including abstraction

b oundaries and p olymorphism small metho d size high invo cation density data dep endent in

vo cations data access overhead and p otential aliasing Invo cation density is addressed through

invo cation optimizations static binding sp eculation and inlining Data access overhead is

addressed by unb oxing and the elimination of p ointerbased accesses through conversion of

instance variables to Static Single Assignment SSA form Array alias analysis and a suite

of standard low level optimizations are also p erformed with the resulting implementations as

ecient as conventional C implementations

Optimization of Concurrent Ob jectoriented Programs

Concurrent ob jectoriented programs also suer from ineciency in particular resulting from

a lack of information ab out the lo cation and lo cking status of ob jects Several optimizations

address these problems Lo ck op erations are optimized by taking advantage information pro

vided by the the call graph ab out when the access rights required by a metho d have already

b een acquired when the metho d is called Also analysis can recognize stateless methods which

do not required access at all Sp eculative inlining inserts tests around inlined co de These

tests intro duce access regions in which access to an ob ject has b een granted These regions are

transformed to amortize the cost of sp eculation and to increase p otential for conventional op

timizations Memory hierarchy trac is optimized by using ow analysis information and that

provided by access regions to cache data at higher levels of the memory hierarchy Likewise

distributed global variables are optimized by using the call graph to detect temporal ly constant

globals and by caching their value Finally synchronization of threads is is optimized by careful

placement of touches

Hybrid Execution

The program is implemented using hybrid sequentialparallel execution which enables the pro

gram to adapt at run time to the concurrency structure of the program and the lo cation and

availability of data Hybrid execution provides separately optimized sequential and parallel

versions of metho ds When p ossible metho ds are executed sequentially using FIFO First In

First Out order on a stack with low overhead The parallel versions store their lo cal data in

p ersistent heapbased contexts and are sp ecialized for generating parallel work and for hiding

the latency of long running andor nonlo cal op erations One of three sequential calling conven

tion of increasing exibility is selected automatically for each metho d based on interpro cedural

analysis enabling the use of sophisticated synchronization structures at no cost to other parts

of the program

The Concert System

The framework and algorithms describ ed in this thesis were develop ed as part of the Concert

pro ject and implemented as part of the Concert system Section The Concert system is

a complete programming system for developing high p erformance concurrent ob jectoriented

programs for execution on large scale parallel machines In addition to the compiler which

emb o dies the optimizations in this thesis there is a runtime sp ecialized for each target platform

an emulator for quick turnaround debugging a debugger and a standard library The results

rep orted in this thesis are for implementations pro duced by the Concert system

Results

This thesis demonstrates a numb er of dierent results for the various analyses and optimizations

However this is a real and complete system Each part of the framework dep ends on the other

parts For example the analysis Chapter alone do es not improve the program Cloning

Section dep ends on ow analysis but again do es little in itself to improve eciency The

optimizations in Chapters and which do improve eciency fundamentally dep end on ow

analysis and cloning Therefore the overall p erformance results are presented in later chapters

Chapter demonstrates that a pure dynamicallytyp ed ob jectoriented language can obtain

the same eciency as C for a numb er of standard b enchmarks Moreover it shows that the

same language can b e more ecient than C as compiled with a standard C compiler

G Likewise Chapter demonstrates that for the Livermore Lo ops essentially all the

overhead of a shared namespace and ob jectbased protection scheme can b e eliminated so that

the COOP programs are as ecient as C when the data is available Finally Chapter shows

that negrained concurrency can b e implemented eciently with adaptive sequentialparallel

execution Such hybrid execution can approach optimal eciency given by the total work and

communication overhead implied by the data layout

Contributions

The general contribution of this thesis is an optimization framework for ob jectoriented and

negrained concurrent languages Individual contributions are an adaptive ow analysis for

practical precise analysis of ob jectoriented programs a cloning algorithm for building sp e

cialized versions of general abstractions a set of optimizations for removing ob jectoriented

and negrained concurrency overhead and a hybrid sequentialparallel execution mo del

which adapts to the availability of data

A new adaptive ow analysis which is a signicant extension over previous

work When initially published it was demonstrated to b e more ecient than previous analyses

and it was the only analysis for ob jectoriented programs capable of handling arbitrarily

complex typ e structures It remains the only practical and demonstrated analysis capable

of analyzing b oth p olymorphic metho ds and p olymorphic classes in the presence of imp erative

up date ie real ob jectoriented programs as opp osed to functional or functional ob jectoriented

programs The most p owerful comparable analysis is limited to p olymorphic metho ds and

has not b een used for context sensitive optimization in context sensitive information was

summarized b efore optimization

A new cloning algorithm which represents the rst application of whole program

cloning to ob jectoriented programs It solves several unique problems including sp ecialization

of classes including data layout mo dication of the dispatch mechanism for context sensitivity

and the discover of a realizable call graph The closest comparable work by Co op er and Hall

is directed to handling sp ecialization of FORTRAN based on values of

parameters and is limited to forward data ow problems In contrast the new cloning algorithm

handles a more general class of data ow problems and includes data structure sp ecialization

OOP and COOP sp ecic optimizations which include several new or substantially new

The optimizations of lo cks access regions and touches Sections and are new

The application of ow analysis and cloning information to the problem of decreasing memory

hierarchy trac is similar to conventional alias analysis but the use of access region information

is new However the substantial contribution of this work is the demonstration over a suite

of standard programs that the set of optimizations describ ed are sucient to enable a pure

dynamicallytyp ed language to match the eciency of C Previous systems were

several times slower than C

A new hybrid sequentialparallel execution mo del which diers from its predecessors

in that it provides two separately optimized versions of co de one optimized for sequential

eciency and one for parallel eciency It also provides a hierarchy of calling conventions of

increasing exibility and cost These are selected automatically by the system based on the

requirements of the co de As a result hybrid execution is capable of matching the sp eed of

C when the required data is available and yet still provides supp ort for continuation passing

Section and latency hiding where required

Details of these contributions app ear in Chapters and and resp ectively

Organization

This thesis is organized into four parts The rst part is background material on the OOP and

COOP programming and execution mo dels Chapters and The second part describ es the

compilation framework in general Chapter and the Illinois Concert system in particular

The third part comprises the b o dy It describ es adaptive ow analysis Chapter cloning

Chapter and optimization of OOP and COOP Chapters and resp ectively Finally

part four Chapter discusses adaptive sequentialparallel execution

Chapter

Programming Mo del

The best way to do research is to make a radical assumption and then assume its

true For me I use the assumption that objectoriented programming is the way to

go

Bill Joy

This chapter describ es ob jectedoriented and negrained concurrent programming concepts

terminology and languages It is not intended to b e a comprehensive discussion of these topics

but rather to dene clearly the terms used in this thesis Thus the fo cus is on features as they

aect optimization and ultimately performance Three concurrent ob jectoriented languages

are discussed Two Concurrent Aggregates CA and Illinois Concert C ICC are

supp orted by the Concert system The last is the simple p edagogical language used in examples

throughout this thesis

Ob jectOrientation

Ob jectoriented programming is characterized by information hiding and reuse of sp ecications

Ob jectoriented programming languages contain sp ecial features for these purp oses including

abstract data typ es for encapsulation p olymorphism for sp ecication over abstractions and

inheritance for successive renement of sp ecications These language features in turn are

supp orted by implementation techniques which are discussed in Section

For want of a b etter term a specication is a part of a program which may b e incomplete

by itself eg an abstract class C template or generic function

Abstract Data Typ es

The ob jective of abstract data typ es is to encapsulate information ab out the implementation

of an ob ject They describ e an interface consisting of a set of op erations These op erations

are abstract in the sense that their implementations are opaque they are describ ed by the

information they require and pro duce but not by the steps they p erform For example compare

the two C stack classes in Figure The class on the left represents an abstract data

typ e It describ es only the abstract op erations push and pop the parentheses indicate

that these are functions On the other hand the class on the right describ es a concrete

implementation of a stack as an array and count of elements

class ConcreteStack

int data class AbstractStack

int ndata void pushint x

void pushint x datandata x int pop

int pop return datandata

Figure Abstract Class left and Concrete Implementation right

This goal of information hiding or encapsulation has two imp ortant ramications First

the interface should by as simple as p ossible In particular extraneous information such as

optimization annotations and implementation directives should b e avoided Notions such as

the lo cation of data function calls referencing and dereferencing should not b e part of the

abstraction Second abstract data typ es can b e implemented by any data structure supp orting

the op erations Thus the more general the abstraction the more options are available for

implementation In Chapter we will see how the compiler can leverage this exibility to

build high p erformance implementations

Polymorphism

Polymorphism is the ability of a sp ecication to op erate on any abstraction which provides the

required abstract b ehavior This enables sp ecications to b e reused and factored into shared

and unshared parts Sets of abstract op erations are called signatures For example

The appropriate sp ecialized parts are selected by dynamic dispatch Section

the abstract class on the left in Figure denes a signature containing the push and pop

op erations

The concept of p olymorphism applies to variables and by extension to classes and func

tions containing them Polymorphic variables can contain any ob ject supp orting the op erations

p erformed on the variable The set of these op erations forms a signature and the variable

can contain any ob ject which conforms to that signature Classes which contain p olymorphic

instance variables dene p olymorphic ob jects and metho ds which take p olymorphic arguments

are p olymorphic functions We will see in later chapters that these two typ es of p olymor

phic sp ecications require sp ecial techniques if they are to b e analyzed and optimized by the

compiler

Polymorphic Variables

Polymorphic variables are lo cations which can store any ob ject conforming to some signature

Two p olymorphic variable declarations app ear in Figure one on the left side for Lisp

and one on the right for C The Lisp variable a can b e assigned ob jects of any typ e facilities

exist in Lisp for arbitrarily limiting the range of ob jects assigned The C variable b can b e

assigned a p ointer to any ob ject which is of class B or a sub class of derived from B

B b NULL

defvar a nil

Figure Polymorphic Variables in Lisp left and C right

Polymorphic Classes

A common use of p olymorphic variables is container classes which can hold ob jects of more

than one class For example the two p olymorphic stacks in Figure can b e used to contain

various kinds of ob jects In the case of the Lisp CLOS class on the left the stack may

contain ob jects of any class even dierent classes simultaneously The only op erations the stack

p erforms on its contents assigning and moving are supp orted by all Lisp ob jects hence the

signature of the contents conforms to any ob ject The C class right denes a stack which

In the future we will use subclass to mean nonstrict sub class that is a sub class of B

may include B

can contain ob jects of any sub class of A Dening a stack in C which can contain any typ e

of ob ject requires templates and is discussed in Section

class Stack

defclass Stack

void pushA e

defun push s Stack e

A pop

defun pop s Stack

Figure Polymorphic Containers in Lisp left and C right

Polymorphic Functions

Polymorphic functions op erate on arguments conforming to some signatures Thus we can have

the p olymorphic functions dbl in Figure which can op erate on any ob ject supp orting the

op eration The Lisp co de top left denes a function over all typ es supp orted by the

function The Smalltalk co de b ottom left denes a method in C parlance a virtual

member function applicable to any class supp orting the metho d In b oth Lisp and Smalltalk

p olymorphic functions can b e used as any other function They can b e passed as values dbl

Lisp and dbl Smalltalk and stored in variables The ability to treat functions as rst class

citizens is a signicant source of expressive p ower which complicates analysis

subclass polymorphic C

polymorphic function Lisp

Addable dblAddable a

defun dbl a

return a a

a a

signature polymorphism C

polymorphic method Smalltalk

signature S S operatorS

dbl

S dblS a return a a

self self

Figure Polymorphic Functions Lisp Smalltalk left and C right

C has several mechanisms supp orting p olymorphism Figure right provides some

examples The primary mechanism in C for p olymorphism is sub classing The function

dbl top right is applicable to ob jects of any sub class of Addable This mechanism cannot

b e used with primitive typ es int float etc which are not part of the class hierarchy A

mechanism has b een prop osed which extends the C language to enable the denition of

dbl directly in terms of signatures b ottom This has the added b enet of separating the

typing and inheritance Section Likewise C templates describ e a set of

monomorphic functions an element of which is instantiated for each use These are covered in

Section

Inheritance

Inheritance is a language construct for building hierarchies Inheritance can b e applied b oth to

interfaces signatures and implementation b ehavior Applied to interfaces inheritance can

b e used to create a new signature with all the op erations of an old signature and some additional

op erations Similarly a new implementation can b e built by inheriting the b ehavior of a more

general one and then adding to or redening part of the b ehavior For example consider the up

down counter on the left in Figure Supp ose we wish to create a mo died sp ecication which

records the highest p oint reached We simply sub class the class UpDown to b e UpDownCount and

redene the b ehavior of the up metho d The new up metho d calls the old up metho d

and up dates max

class UpDownCount UpDown

int max class UpDown

void up int val

UpDownup void up

if maxval maxval void down

Figure Sup erclass base class left and Sub class derived class right

Dynamic Dispatch

Dynamic dispatch virtual function call is the selection of the function to b e executed based

on the run time class of the target ob ject and the selector or generic function name This

allows function calls to exhibit dierent b ehavior based on the class of an ob jects and supp orts

programming with p olymorphism Each metho d virtual function is an implementation of the

generic function sp ecic to ob jects of a particular class

For an example of dynamic dispatch recall the two classes dened in Figure which b oth

dene the up metho d If up is invoked on a p olymorphic variable a which might contain an

ob ject either of the two classes UpDown or UpDownCount left side of Figure the appropriate

Viewing signatures as typ es the new signature is a subtyp e of the old

version will b e selected at runtime The eect will b e as if a sequence of conditionals selected

the appropriate version of up right side Transformations for minimizing dynamic dispatch

overhead are discussed in Chapters and

void funcUpDown a

if aclass UpDown

void funcUpDown a

aUpDownup

aup

else if aclass UpDownCount

aUpDownCountup

Figure Dynamic Dispatch Unfolded

Parametric Polymorphism

Parametric p olymorphism is a limited form of p olymorphism which allows classes and func

tions to b e parameterized by the typ es of ob jects they supp ort Like inheritance parametric

p olymorphism can b e used for b oth interfaces and implementations For interfaces parametric

p olymorphism like typ e declarations constrain variables allowing the programmer to more

easily reason ab out the correctness of co de For implementations parametric p olymorphism

builds a sp ecialized version of a sp ecication for a particular situations as indicated by the

values of the parameters

Templates in C and generics in Ada use parametric p olymorphism simul

taneously for b oth interfaces and implementations For example in Figure the template

left describ es a set of the dbl functions applicable to any typ e supp orting the op erator

However in C no general dbl function exists hence it cannot b e passed as an argument

or assigned to a variable Instead a sp ecic function is generated from the sp ecication for

each use Particular monomorphic instances are derived by applying a general sp ecication to

a set of parameters in this case derived from the calling environment at compile time

Data typ es can also b e sp ecied using parametric p olymorphism In Figure the class

Stack is parameterized over the element typ e When the stack is created it is explicitly instan

tiated with the contents of typ e E b ottom right of Figure In C instantiation typically

involves the creation of sp ecialized copies of templated co de This replication can expand the

size of the executable program tremendously ie cause co de bloat While the stack

sp ecication is p olymorphic each instance is as if it were declared with a particular typ e and

template polymorphism C

template polymorphism C

template class E

template class A

class Stack

A dblA a

void pushE e

return a a

E pop

int i dbl

Stackint s

Figure Function left and Class right Templates

therefore nominally monomorphic These instances are still sub ject to sub class p olymorphism

and the asso ciated ineciencies Section b elow Techniques for automatically discovering

and optimizing parametric p olymorphism and managing co de size are discussed in Chapter

Implementation Issues

Abstractions app ear in the programming mo del as classes and functions If these are imple

mented directly each class and function would have a single unique implementation and each

function eg empty constructors would have to b e called This can lead to very inecient

co de For example take the generic class dened in Figure and the two uses one with

an int and one with a double One implementation would b e to create a single version of the

class with an indirection to a separately stored and tagged a eld Another implementation

might build sp ecial classes for each uses of A and sp ecialize all the co de on ob jects of these

classes

class A

a instance variable

Aaa aaa constructor initializing a to aa

func method

xy two instances

Figure Generic Class

Functions abstract the op erations of the executing program and direct implementations

which cross these these abstraction b oundaries can b e inecient While the function call itself

may require several instructions including dynamic dispatch the greatest cost results from

the loss of p otential optimization For example Figure uses the generic class A dened

in Figure The while lo op on the left contains seven function calls three read and one

write accessor call for Aa doit Afunc and done If we have the denitions for

A doit and func we can remove six of those calls pro ducing the co de on right However

ob jectorientation complicates such inlining of metho ds

doitxy xy

Afunc a

A xintargyfloatarg

let tx intargty floatarg

while donexy

while donetxty

ya doitxfuncya

ty tx

ya ty

Figure Potential Optimization Example

In general inlining of metho ds requires information ab out the actual as opp osed to de

clared typ e of ob jects Recall Figure which denes two classes UpDown and UpDownCount

Now consider the co de on the left in Figure The function func can b e called on UpDown

or UpDownCount ob jects This prevents the co de for up from b eing inlined directly The typ e

declarations of C do not solve this problem In the C co de on the right func is simi

larly called on ob jects of these typ es In general inlining such calls requires program analysis

Section and transformation Section

void funcUpDown x

funcx

xup

xup

funcnew UpDown

funcnew UpDownCount

Figure Inlining of Metho ds virtual functions

Concurrency

This thesis is concerned with concurrent programming languages and their execution on par

allel hardware not parallel programming languages Parallelism is two op erations happ ening

An accessor is a metho d which simply returns or sets an instance variable

This example is written in the language describ ed in Section

at the same time a situation reected in the physical world by the simultaneous action of two

dierent pieces of hardware Concurrency is a lo oser term indicating that active p erio ds of

the two op erations may overlap Concurrency allows but do es not require parallelism and is

more natural in the abstract world of programming where it is often desirable to abstract away

the mapping of conceptual op erations to physical hardware For example concurrency includes

where a single lo cus of control b ounces b etween two tasks so that b oth are active

but only one of which is physically executing at any given time The key asp ect of concurrency

is that that it is nonbinding so a p erfectly valid execution of two concurrent op erations is

sequential execute one wait for it to complete then execute the other

Under this denition a compiler can easily pro duce ecient sequential co de for any level

of concurrency in the program sp ecication by simply ignoring it In fact b ecause two

concurrent tasks can b e executed in any order the more concurrency in the sp ecication the

greater implementation freedom for the compiler For example in Figure the co de to the

left sp ecies a traditional sequence of four op erations There is precisely one way to execute

these calls correctly The co de to the right sp ecies four concurrent op erations There are

twentyfour four factorial correct orders of execution for these calls

conc

operation operation

operation operation

operation operation

operation operation

Figure Concurrency Example

Finegrained concurrent programs are simpler in a information theoretic sense since they

do not sp ecify unnecessary sequencing In a more practical sense mo dern sup erscalar and

pip elined micropro cessors are highly parallel Section Optimizing compilers for such

pro cessors must work hard to remove excess sequentiality from sp ecications in order to pro duce

ecient co de for sequential languages Thus at some level all programs executing on mo dern

hardware are negrained concurrent

Concurrent Statements

Concurrency can either b e treestructured or irregular Treestructured concurrency is so named

b ecause the task graph where the no des are tasks and the edges are dep endencies b etween

tasks is a tree In Figure a single task top has created four subtasks b ottom through

concurrent calls p ossibly message sends to concurrent ob jects These subtasks must synchro

nize with their parent up on completion This notication of termination makes treestructured

concurrency easier to reason ab out analyze and optimize than irregular concurrency where

the task can form a general graph For this reason many of the analyses and transformations

discussed in this thesis assume treestructured concurrency

task call

reply

Figure TreeStructured Concurrency

While treestructured concurrent child tasks generally complete b efore their parents this

is not always the case A subtask which has no outstanding subtasks may delegate the re

sp onsibility of synchronizing with the parent to its last subtask This allows the formation of

synchronization structures such as forwarding and barriers as in Figure The subtask for

warded to assumes the resp onsibility to synchronize with the parent much like a tail recursive

call Similarly all the tasks waiting on the barrier use it to synchronize with their parent

task call reply

Forward Barrier

Figure Other Synchronization Structures

There are two main mechanisms for generating treestructured concurrency concurrent

blo cks and concurrent lo ops Within concurrent blo cks statements are partially ordered to

preserve a consistent view of lo cal data

Concurrent Blo cks

As in Figure a set of statements can b e declared concurrent The blo ck completes only

when all the enclosing statements have completed Again concurrency is not mandatory if

the statements are simple arithmetic op erations for example they may b e executed in some

sequential order Note that even without concurrency this exibility can provide extra freedom

for instruction scheduling see Section

Concurrent Lo ops

Concurrent lo ops declare that sub ject to the conditions expressed in Section the itera

tions of the lo op are concurrent As a convenience to the programmer ICC Section

implicitly declares any concurrent lo op b o dy to b e concurrent Thus the lo op on the left of

Figure is a sequential lo op each iteration of which is a set of concurrent statements In

this case only one operation will ever b e active at any time On the other hand all the

statements in all the iterations of the lo op on the right are concurrent Thus MAX invo cations

of operation might b e active simultaneously

conc for iiMAXi for iiMAXi conc

operation operation

operation operation

operation operation

Figure Sequential Lo op of Concurrent Blo cks left and Concurrent Lo op right

Lo cal Consistency

Concurrent objects ensure the consistency of their internal state by controlling concurrent access to that state. In order to help ensure the consistency of function-local state, operations against it are not allowed to interfere with each other. The result of any such computation must be that which would be produced by execution of the dependent chain in normal sequential execution order. This preserves sequential semantics for function-local operations by inducing a partial order over statements.

For example, the three statements on the left in Figure are dependent, since L3 depends on L2 (L3 reads the value of i which L2 writes) and L2 depends on L1 (it reads the value of i which L1 previously wrote). Thus the argument of func will be the result of the sequential execution of these three statements. However, these statements need not actually be executed sequentially. On the right of Figure , only the statements L2 and L3 are dependent, in that order. In particular, statements L1 and L2 are dependent on the original assignment of i, but are otherwise independent. Such write-after-read dependencies are often called false dependencies for this reason.

    // left
    conc {
      i = h;        // L1
      i = i + j;    // L2
      func(i);      // L3
    }

    // right
    conc {
      func1(i);     // L1
      i = j;        // L2
      func2(i);     // L3
    }

Figure: Local Consistency Examples

Concurrent Objects

Concurrent objects are abstract data types which provide a consistent interface in a concurrent environment. They control concurrent access to their instance variables, provide for distributed consistency, and make concurrency guarantees which enable the programmer to reason about progress.

Concurrency Control

In order to provide a consistent interface, concurrent objects must control which messages are processed concurrently. For instance, a concurrent abstraction representing a collection cannot compute the number of elements it contains at the same time that it is adding or deleting elements. Many control mechanisms are possible. The simplest is to allow an object to process only one message at a time. More complex schemes enable messages to be processed based on the state they access or through set inclusion.

The Concurrent Aggregates (Section ) language subscribes to the model that a single message can be processed by a given object at one time. This was found to be restrictive both for the programmer and the optimizer (see Section ). ICC++ (Section ) takes a more permissive view, stating simply that intermediate object states cannot be seen. Essentially, this requires that the result of a set of operations conform to some serialization. For example, in Figure , the calls to f2 and f3 can go on concurrently, since they both read x.

    class A {
      x = 1;
      y = 1;
      f1() { x = x + y; }
      f2() { y = y + x; }
      f3() { func(x); }
    };

    A a;
    conc {
      a.f1();
      a.f2();
      a.f3();
    }

Figure: Concurrency Control Examples

However, neither f1 and f2 nor f1 and f3 can go on concurrently, since f1 writes x while both the calls to f2 and f3 read x. This allows the programmer to control nondeterminism without requiring the program to be overspecified. The final values of a.x and a.y are nondeterministic, but the sum of x and y must be 5.

Distributed Consistency

Object-oriented programs are composed of a number of interacting objects. In order to reason about the composite behavior of a group of objects, it must be possible to compose a set of transactions (messages) on a group of objects into a single transaction. The simplest mechanism for distributed consistency is to provide a single object whose abstract type represents the composite behavior. For example, Figure shows a concurrent hash table abstraction which is implemented with a set of bucket objects. Since interactions with the buckets are moderated by the hash table object, consistency can be maintained over the abstraction.
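A minimal sketch of such an abstraction in the example language (hypothetical; HashBucket, insert, hash, and NBUCKETS are invented names):

    class HashBucket {
      insert(key, value) { ... }
    };

    class HashTable {
      buckets;    // a collection of NBUCKETS HashBucket objects
      insert(key, value) {
        // All interactions with the buckets are moderated by the
        // hash table object, so consistency can be maintained over
        // the abstraction as a whole.
        buckets[hash(key)].insert(key, value);
      }
    };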

Concurrency Guarantees

Some programs require concurrent objects to be able to overlap the processing of certain messages in order to prevent deadlock. These concurrency guarantees can be as simple as requiring that an object be able to send a message to itself. Given an object consistency model in which only a single message can be executed at one time, an object sending a message to itself would result in deadlock (e.g., as in CC++). Stronger concurrency guarantees increase the expressiveness of the language, as more programs which do not deadlock are possible. However, weaker guarantees grant the scheduler more flexibility, so concurrency guarantees must be balanced against implementation cost. For example, guarantees such as strong and weak fairness can be expensive to implement, since they require sophisticated scheduling.

(More relaxed consistency models are possible; for example, one in which a consistent set of old values can be read would allow the example above to result in x = y = 2 instead of a sum of 5; see Section .)

Figure: Example Distributed Concurrent Abstraction (diagram: a concurrent hash table abstraction mediating access to a set of hash bucket objects)

Concurrent Aggregates provides for guaranteed concurrency between different objects and the ability to invoke methods on the current object (self; this in C++). Moreover, CA guarantees that any message which is sent will eventually be processed. This has proven very expensive for applications with recursive inner loops, even when global analysis is used to carefully place scheduling operations. As a result, a compiler option allows this, the fairness guarantee, to be disabled; ICC++ makes no such guarantee. Concurrency guarantees are discussed further in Sections and .

Actors

Since they can operate asynchronously, concurrent objects are sometimes called actors. Likewise, the invocation of a method on a concurrent object is sometimes called a message because, like a piece of mail, it need not be responded to immediately, and more than one can be outstanding at one time. For example, messages sent to an actor (Figure , on the left) build up until the actor can process them. Likewise, an actor (right) can send a number of messages off at the same time.

(The version of the Concert compiler described here does not break iterative infinite loops for local scheduling, though it does break recursion.)

Figure: Examples: In-Box (left) and Outstanding Requests (right) (diagram: an actor, a concurrent object, with a mailbox of waiting messages, and an actor with a number of outstanding messages)

Naming, Location, and Distribution

The fine-grained concurrent model does not include the location of objects or their distribution on the target platform within the language semantics. That is, the location of objects does not influence the meaning of a program. Programmer-level names of objects do not include their location, which is managed transparently. A discussion of the execution model which underlies the implementation of the transparent global shared name space on distributed memory hardware appears in Chapter . Implementation and optimization of this model are discussed in Chapters and .

Implementation Issues

Fine-grained concurrent programs can be very inefficient if implemented naively. These inefficiencies can result from the flexibility of the programming model. In particular, protecting the consistency of objects, providing location independence, and simply managing the high degree of concurrency provided by programs written to this model can be very expensive.

Concurrency Control Boundaries

All objects mediate access to their internal state through an abstraction boundary. Likewise, concurrent objects mediate concurrent access to their state through concurrency control boundaries. These boundaries are more expensive to pass through than abstraction boundaries, since the accessing task may have to be delayed. For example, in Figure , the code on the left defines three objects, two of which concurrently send a message to the third. As we can see on the right, one of the messages nondeterministically has to be delayed. The interaction of scheduling and concurrency control boundaries is discussed in Sections and .

    class A {
      msg() { ... }
      send(a) { a.msg(); }
    };

    A x, y, z;
    conc {
      y.send(x);
      z.send(x);
    }

Figure: Concurrency Control Boundaries Example (diagram: the first sender object y is ACTIVE, the second sender object z is WAITING, and the receiver object x holds the deferred message in its incoming queue)

Locality Boundaries

The fine-grained concurrent programming model does not include the notion of location of objects (Section ). However, modern parallel and distributed computers are Non-Uniform Memory Access (NUMA): the cost to access nearer data is less than the cost to access data further away. As we will see in Chapter , the cost of access to data can vary tremendously. In order to provide good performance, most accesses must be to near data. This induces regions of locality: groups of objects which may access each other at lower cost. While the programmer need not manage these regions explicitly, portions of the program which are to be executed in parallel must necessarily span them, requiring the compiler to manage them. Chapter discusses the impact of crossing these locality boundaries.

Excess Concurrency

The fine-grained concurrent model encourages maximal expression of concurrency. Conceptually, each interaction between concurrent activities requires a synchronization. This includes message sends, which must determine if the target object is capable of receiving the message, and scheduling, when a delayed message becomes active. An example of a set of concurrent objects x, y, and z synchronizing appears in Figure . Because y is processing a message from z when x sends its message, x's message is delayed. When the amount of concurrency exceeds that required, unnecessary synchronization can lead to inefficiency. The interaction of excess concurrency and scheduling is discussed in Chapter .

Figure: Synchronization between Concurrent Objects (diagram: a timeline of objects x, y, and z exchanging messages, with scheduling and synchronization points and active and inactive periods)

Languages

This section discusses languages which embody the concurrent object-oriented programming model. The two languages supported by the Concert system (Section ) are Concurrent Aggregates (CA) and Illinois Concert C++ (ICC++). While superficially very different (CA has a Lisp-like syntax, while ICC++ is derived from C++), the compilation techniques required to obtain efficiency for each language are almost identical. The language which will be used in the examples throughout this thesis, a blend of the two, reflects this. However, since the particularities of the two languages supported by Concert influenced the implementation of the analyses and transformations, as well as the test suite used in their evaluation, they are discussed here.

Concurrent Aggregates (CA)

Concurrent Aggregates (CA) is a simple, untyped, pure object-oriented language. It is closely related to Concurrent Smalltalk (CST), which itself is closely related to Smalltalk. The syntax of CA is reminiscent of Lisp: parentheses group each expression and declaration. It has single inheritance, no type declarations, and first-class selectors and messages (essentially a reification of a message send as a vector of arguments). It is pure in the sense that everything is an object, including integers, floating point numbers, and globals. However, unlike Smalltalk, control flow is built in: it has conditionals (if) and while loops. Furthermore, all statements, expressions, and arguments are assumed to be evaluated concurrently unless surrounded with an explicit sequential block. Figure contains the naive Fibonacci program. Note that the recursive calls are concurrent.

    (method integer fib
      (reply (if (< n 2)
                 n
                 (+ (fib (- n 1)) (fib (- n 2))))))

    (method osystem initialmessage
      (reply (fib ...)))

Figure: Fibonacci in Concurrent Aggregates

Concurrent Aggregates includes homogeneous collections of objects which collaborate to form an unserialized parallel abstraction. These are called aggregates, hence the name. Figure contains an example of the use of aggregates in CA. The aggregate top-level form defines an aggregate class with nr_reps representatives (elements). The value field in each element is initialized to zero by the forall concurrent for loop. Intra-aggregate addressing is through the sibling method. A count message sent to the aggregate is vectored to an arbitrary element, where it sequentially updates the local value and replies. The reply is used for synchronization: the sender of count knows the accumulation operation has been completed when it receives the reply. The sum method begins the summation on the first element. Each element accumulates to the total sum, and the last element returns the total count over all the elements to the original caller of sum.

Illinois Concert C++ (ICC++)

ICC++ is a fine-grained concurrent dialect of C++ possessing the characteristics described in this chapter. It has concurrent blocks and loops (Sections and ), collections, and object-based concurrency control (Section ). It differs from CA and the example language in that type declarations are required for all variables, functions, and classes. However, this information is simply discarded after semantic checks in the front end by the compiler, which calculates more precise information in the analysis phase (Section ).

(As of Version ; we hope to change this in the future.)

    (aggregate counters value)                  ;; aggregate class definition
    (parameters nr_reps)                        ;; initialization parameters
    (initial nr_reps                            ;; initialization code: nr_reps elements
      (forall index from 0 below (group_size)
        (set_value (sibling group index) 0)))

    (handler counters (count val)               ;; handler (method) definition
      (sequential
        (set_value self (+ val (value self)))
        (reply val)))

    (handler counters (sum)                     ;; pass message on to first sibling
      (forward (sum_internal (sibling group 0) 0)))

    (handler counters (sum_internal sum)
      (let ((new_sum (+ sum (value self)))
            (next_index (+ my_index 1)))
        (if (< next_index (group_size))
            (forward (sum_internal (sibling group next_index) new_sum))
            (reply new_sum))))

Figure: Counter Collection Aggregate in CA

One of the goals of ICC++ was to be as compatible with C++ as possible. Nevertheless, ICC++ departs from C++ where such departures are necessary to preserve consistency in a concurrent environment. In particular, ICC++ does not allow pointer arithmetic, pointers into objects or to fundamental types, or interconversion between pointers and arrays. ICC++ also requires accessor functions to be used to access all member variables; these functions are automatically defined. Furthermore, operations such as increment which are normally considered to be atomic are implemented as a single transaction. Figure shows the atomic ICC++ increment and a C/C++ style pointer-based atomic increment (illegal in ICC++, as it requires a pointer into an object, breaking the object abstraction). Since C++ is a sequential language, the final example, a C++ style non-atomic increment, is equivalent in C++ but not in ICC++, where a concurrent operation setting the value of count between the read and the write might be lost.

    // ICC++ atomic increment
    self.count++;

    // C/C++ style pointer-based atomic increment
    int *i = &this->ARBITRARY->count;
    (*i)++;

    // C++ style non-atomic increment
    Counter *element = this->ARBITRARY;
    int i = element->count;
    element->count = i + 1;

Figure: Atomic Operations

Example Language

Sections and point out several ways in which C++ fails to support the principles of object-oriented programming. Moreover, C++ includes a number of features which are either overly complex (overload resolution), dangerous (casts), or simply poorly considered (access control, templates). To wit, the examples in this thesis will be in a variation of C++/ICC++ with the following changes:

- Methods need not be defined in the class definitions. (This is a violation of encapsulation for which C++ substitutes access control.)

- No C++ style access control (i.e., private, protected, and public).

- No type declarations; the let pseudo-type can be used to introduce new bindings. (The weak typing system of C++ is neither safe, because of casts, nor powerful; C++ substitutes templates.)

- No templates. Instead, polymorphic functions are specified by omitting type declarations.

- No pointers, references, or inline objects. Instead, we follow the Smalltalk/Scheme model, where all objects are by reference and any copies must be made explicitly.

- All instance variables are accessed through accessor functions. For a write of instance variable a, the accessor is operator a=.

- Methods (member functions) without arguments do not require parentheses (i.e., use a.f instead of a.f()). These parentheses violate encapsulation, since they differentiate system- and user-defined accessor functions, revealing the implementation.

- Compound statements can contain a final unterminated expression which is the value of the statement (i.e., { f1(); f2(); g } has g as its value).

- The return value of a function without a return is the value (if any) of its compound statement.

- Tuples, comma-separated lists of values which are essentially anonymous records, can be used to return multiple values from a function (e.g., (x, y) = func(), as in ICC++).

The resulting language has the look of C++ and the clarity and simplicity of Smalltalk or Scheme. Moreover, as we will see in succeeding chapters, it can be compiled to execute with the efficiency of C++ (Chapters and ). That is not to say that the techniques that will be discussed are not applicable to C++, only that many of C++'s features are not necessary to obtain that efficiency.
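For concreteness, a small class in this example language might be written as follows (a hypothetical fragment illustrating the conventions above; Point, norm, and moved are invented names):

    class Point {
      x;
      y;
      norm() { x*x + y*y }      // the final expression is the value returned
      moved(dx, dy) {
        let nx = x + dx;        // let introduces a binding, no type declaration
        let ny = y + dy;
        nx, ny                  // a tuple returning multiple values
      }
    };

    let n = p.norm;             // no parentheses on an argumentless method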

Related Work

The subjects of this chapter are represented in the literature by three traditions. First is object-oriented programming alone. Then there is the concurrent object-oriented tradition, including Actors. Finally, the subgroup of C++-based concurrent object-oriented languages is so large and diverse that it deserves separate coverage.

Object-Oriented Programming

Object-oriented programming was initially popularized in the United States by Smalltalk and Flavors. More recently, C++ has become very popular for general purpose programming in industry. Object-oriented languages are very diverse, but they generally fall along a continuum between static and dynamic. Generally speaking, the more static the language, the more properties of its programs are determined at compile time. For example, Ada has strong typing, which constrains the types of objects at compile time, while in Smalltalk all data is dynamically typed: the type safety of operations is checked at run time. Similarly, C++ does not provide for automatic garbage collection, while Modula-3, Sather, and Eiffel provide it as an option, and it is an integral part of Lisp and Smalltalk. Self is perhaps the most dynamic object-oriented language, allowing dynamic modification of the delegation-style method lookup path at runtime. Generally speaking, the more dynamic the language, the more powerful, and the more difficult to compile for efficient execution.

Concurrent Object-Oriented Programming

Concurrent object-oriented programming extends the object-oriented paradigm for concurrent, parallel, and/or distributed computing. The Actors model is based on a simple but powerful semantics, and a number of different systems using this model have been created. For example, languages based on Actors include the dialects of ABCL, HAL, ACORE, and Rosette. These systems are based on asynchronous messages. Other languages have added stronger typing systems, for example Cantor and POOL-T. Several forms of concurrent Smalltalk have been created, including Concurrent Smalltalk and the language CST, which CA resembles. Sather, an Eiffel-like language, has a more traditional parallel extension called pSather. More recently, the language Ocore has been developed by the Real World Computing Partnership.

Parallel C++

The other large body of COOP research has centered around parallel extensions to C++. This work can be roughly divided into two groups: data (object) parallel and task parallel. Languages in the data parallel group include pC++ and C**. In these languages, the operations specified in a single thread of computation may be executed on disjoint data without interaction. These operations are expressed over aggregations of data (objects). This differs markedly from more general parallel languages, which allow the user to specify more than one logically concurrent thread of control as well as interactions between threads.

There are many task parallel extensions to C++, which are divided into those based on fork-join and semaphore concurrency, object-based concurrency, and extensible systems. Systems based on fork-join and semaphore concurrency, including ES-Kit, Presto, COOL, CC++, and CHARM, provide facilities for programmers to construct objects containing threads and objects which protect their state. However, these systems do not provide language-level object consistency, nor do they provide mechanisms for building multi-object abstractions.

ES-Kit C++, Mentat, CHARM, and Compositional C++ are all medium-grained, explicitly task parallel languages where the user controls grain size. None of these systems has developed a global optimization framework, probably owing to a desire to leverage existing C++ compiler technology. On the other hand, the subject of this thesis is automatic optimization of fine-grained concurrency through global analysis and transformation.

Summary

Object-oriented programming is the process of describing abstractions. Through polymorphism, any abstraction can be used which supports the required set of operations (a signature). Through inheritance, one abstraction can extend the definition of another. Fine-grained concurrency enables the programmer to specify which operations may be executed in parallel. Fine-grained concurrent object-oriented programs consist of a set of concurrent objects which interact by sending messages to each other. These objects encapsulate their state in a concurrent environment through object-based concurrency control. Together, object-oriented programming and fine-grained concurrency ease the task of writing and understanding programs by enabling abstractions to be built with well-defined behavior in a concurrent environment, enabling local reasoning about the behavior and meaning of programs. However, programming systems which implement these abstractions directly can be very inefficient.

Chapter

Execution Model

The villainy you teach me I will execute, and it shall go hard but I will better the instruction.

William Shakespeare, The Merchant of Venice

An execution model abstracts the execution of a program. It is the medium of compilation by which the high-level programming language is mapped to the low-level hardware, and a way for the compiler writer to reason about the efficiency of that mapping. The execution model presented in this chapter consists of a model of the hardware (Section ) of the target platform (i.e., CPU, network), a model of the software implementation (Section ), and a runtime interface (Section ) which connects the two. From these we infer a cost model which motivates the optimizations in later chapters. As we will see, those optimizations sometimes break through these simple models when necessary for efficiency.

Hardware

The hardware model describes the target platforms and is used by the compiler to model the cost of operations. The goals of the model are portability and scalability: to provide accurate cost estimates for a number of different systems of different size. It has two parts: the sequential microprocessor, whose characteristics are of interest in the evaluation of optimization for OOP (Chapter ), and the parallel machine model, for the evaluation of COOP-specific optimizations (Chapter ). Since the design of large-scale parallel machines has not stabilized, we assume the least common denominator: a collection of commodity computers connected by a communication medium. This induces a simple two-level locality model (i.e., local vs. remote). The impact of particular hardware features is considered in the literature.

Microprocessor

The vast majority of computers today, including large-scale parallel machines, are constructed from commodity microprocessors. Such microprocessors consist of a number of functional units for arithmetic, memory, and control flow operations, and a memory hierarchy. The memory hierarchy may contain registers, reservation stations, intermediate values in the pipeline, register windows, hardware thread contexts, level one, two, and three caches, and finally local memory. Figure contains a block diagram of an example microprocessor. Assuming that it is work-conserving, an execution is efficient if it can keep the functional units busy. Two factors fundamentally limit efficiency: effective use of the memory hierarchy and control flow ambiguities. Higher levels in the memory hierarchy can deliver more data per time unit; hence, given the ability of the functional units to sink large amounts of data, efficiency depends on how effectively the program uses those higher levels. Primarily this means that a program should use the very highest level (registers) for most operations. As a secondary consideration, the program should exhibit temporal locality of memory access.

Figure: Microprocessor Block Diagram (diagram: execution units and registers backed by level 1, 2, and 3 caches)

Control flow ambiguities occur when the processor cannot predict the target of a branch instruction. This forces the processor to speculate on the target, with incorrect speculations resulting in a waste of execution resources. Take, for example, the pipeline of the Intel Pentium Pro in Figure . This processor executes many instructions at once in a pipeline twelve clock cycles deep, and a mispredicted branch typically incurs a latency of many cycles. For a processor which can dispatch three instructions per cycle to five functional units, this penalty is substantial. Dynamic dispatch is a prime culprit (see Section below), contributing to control flow ambiguity and inefficient use of processor resources.

Figure: Microprocessor Pipeline (diagram: the twelve pipeline stages, from branch target buffer and instruction cache access through x86 decode, µop generation, register renaming, register set access, execute, and retire)

Distributed Memory Multicomputer

Multiple microprocessors are composed to form a parallel computer. Each of these processing elements has some local memory (possibly overlapping with other processors) which it can access more quickly than remote memory, which is generally local to other elements. In this model, it is not important whether or not the hardware supports a single shared address space or cache coherence for remote memory; these access mechanisms are issues for the software implementation. What is important is that memory access cost is non-uniform. This model conforms to single-processor or SMP (symmetric multiprocessing) nodes spanned by an interconnection network. Figure diagrams such a multicomputer with two-dimensional interconnect.

In the simplest case, mapping the program onto the hardware involves mapping objects to local memory and methods to threads (Section ) on the processing element associated with that memory. This mapping is expressed in Figure . Each thread logically operates within, in order: an object, a processing element, and a local memory space. When a thread sends a message, the processing element on which it is located may switch to a new thread. Since switches involve flushing the higher levels of the memory hierarchy in order to make room for data for the new thread, processor efficiency dictates that they should be minimized. However, active threads running in parallel generate work (additional threads) which is used to keep the nodes in a parallel machine busy, so some number of thread switches are generally required for parallel efficiency.

Figure: Hardware Model (diagram: processing elements (PEs) with local memories spanned by a hierarchy of interconnects)

Figure: Software to Hardware Mapping (diagram: objects mapped to local memory spaces and threads to processing elements, with local and remote messages and context switches)

Implementation Issues

The hardware model indicates several potential sources of inefficiency from object-orientation and concurrency. Dynamic dispatch (Section ) induces control flow ambiguities which, among other things, inhibit functional unit utilization. Object centrism (Section ) encourages persistent memory-based computation over the use of temporary register-based data. Finally, distributing the computation across a distributed memory machine implies communication between local address spaces, which is more expensive than local memory operations.

Control Flow Ambiguities

Object-oriented programming encourages the use of abstract interfaces, inheritance, and polymorphic methods. The implementation of these features at run time by dynamic dispatch leads to control flow ambiguities, because the code to be executed depends on the type of an object or the value of a selector variable. Eliminating dynamic dispatch requires analyzing the program to determine under what circumstances control will flow in which direction. The code can then be transformed such that specialized versions are used when these conditions are static (i.e., known at compile time). The dynamic dispatch can then be replaced with a direct call, eliminating the ambiguity. Chapters and are concerned with such analysis and transformation.

Memory Hierarchy Traffic

Efficient use of the memory hierarchy requires most data to be allocated in registers. Object-oriented programming encourages the use of dynamically allocated, indefinite extent objects which are accessed indirectly by pointers. As a result, the instance variables of an object are potentially aliased memory locations. In order to allocate instance variables to registers and eliminate the memory traffic associated with accessing them, the compiler must show that, during their allocation to registers, the variables cannot be read or written through some other pointer. Chapter considers this optimization in detail. Likewise, threads should also use registers for local data. However, these registers must be flushed to memory at context switches. In order to minimize this memory traffic, the compiler groups and minimizes context switch points (see Chapter ).

Communication

The cost of communication consists of two factors: overhead and latency. The overhead of communication can be reduced by specializing the communication mechanisms using compile-time information. For example, directly executing a method from the communication buffer incurs less overhead than using a general purpose interface (Section ). Latency is a function of the underlying communication hardware and the thread scheduling algorithm. By immediately scheduling messages which arrive at a node, latency can be reduced (Section ). The remaining latency can be hidden by performing multiple long-latency operations (message sends) concurrently, context switching, and then restarting when all the results have arrived (Figure ).

Software

The soft droppes of rain perce the hard marble

John Lily

The execution of a fine-grained concurrent object-oriented program can be viewed as the interaction of software constructs which implement program-level constructs. There are three main program-level constructs: threads (Section ), objects (Section ), and messages (Section ). Threads are associated with contexts (Section ), which hold the temporary data used by the threads. Threads synchronize using futures (Section ), which promise a value to be delivered later, and continuations (Section ), which deliver the value. Objects are composed of slots (Section ), which can contain polymorphic variables or tagged data locations. Objects protect this state from race conditions with locks (Section ). They interact by asynchronous method invocation: message passing (Section ).

Threads

Concurrency is implemented at run time as a collection of fine-grained, short-lived threads. The cost of creating, blocking, and resuming these threads sets a lower bound on the number of instructions which they must contain for the program to be considered efficient. For example, if the average thread executes one thousand instructions but creating a thread requires two thousand instructions, only one-third of all instructions will be doing useful work. Thus, fine-grained threads are not implemented at the operating system level, which would require an expensive change of hardware protection domain for scheduling. Instead, all the operations on threads (creation, suspension, and resumption) are implemented at the user level. Efficient implementation of fine-grained threads is discussed in detail in Chapter .
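A minimal sketch of what user-level thread creation amounts to (invented names; not the Concert runtime interface):

    struct Thread {
      void (*code)(void *);   // the method to run
      void *context;          // heap-allocated arguments and temporaries
      Thread *next;           // link in the scheduler queue
    };
    Thread *run_queue = nullptr;  // per-processor queue, managed at user level

    void spawn(void (*code)(void *), void *context) {
      // No operating system call and no protection domain crossing:
      // creating a thread is just an allocation and an enqueue.
      Thread *t = new Thread{code, context, run_queue};
      run_queue = t;
    }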

Contexts

A context is a non-LIFO (Last In, First Out) store for thread-local state. Contexts can be thought of as heap-allocated stack frames. In sequential computation, a method must complete before its caller can continue. In fine-grained concurrent computation, the caller may continue and invoke additional methods. The storage for the temporary data for the second method cannot be allocated on a simple stack, since the first method has not yet completed. In general, the concurrent model induces a forest of contexts, as opposed to a stack of frames (see Figure ).

Figure: Stacks of Frames vs. Trees of Contexts (diagram: stack-based frames contrasted with heap-based contexts)

A context is similar to a stack frame with additional fields peculiar to object-orientation and concurrency. Figure diagrams the model implementation of a context. Method is a reference to a method descriptor, which is used during scheduling to determine the locking requirements and code of the method, and, along with the Program Counter, by the garbage collector to determine the types of unboxed temporary variables. Object refers to the object on which the method was invoked; when the method exits, any locks acquired on this object are released. Program Counter is the location within the method where the thread last suspended. Continuation is the continuation for this method and, much like a return address, forms the linkage with the calling method. Finally, Arguments and Temporaries contain tagged or unboxed data used by the thread (see Section ).

Figure: Context Layout (fields: Method, Object, WaitingQ.Next, Program Counter, Continuation, Arguments, Temporaries)
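In C, such a context might be declared as follows (a sketch mirroring the figure; the field types are assumptions, not the Concert implementation):

    struct MethodDesc; struct Object; struct Continuation; struct Slot;

    struct Context {
      struct MethodDesc   *method;          // locking requirements, code, GC type maps
      struct Object       *object;          // receiver; its locks are released on method exit
      struct Context      *waiting_q_next;  // link when queued waiting on an object
      unsigned             program_counter; // where the thread last suspended
      struct Continuation *continuation;    // linkage with the calling method
      struct Slot         *args_and_temps;  // tagged or unboxed arguments and temporaries
    };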

This model implementation is only a starting point for optimization. Cloning (Chapter ) and inlining (Chapter ) break the connection between methods and the code bodies with which contexts are associated. Speculative inlining (Chapter ) and access regions (Chapter ) break the connection between objects and contexts by allowing the thread associated with a context to operate on many objects, acquiring and releasing locks as required. For a variety of reasons, including portability, interfacing with external sequential code, efficiency, and resource reuse, a stack-based mechanism can be preferable. Chapter describes a hybrid stack-heap scheme which preserves the benefits of both systems by breaking the connection between contexts and threads, only creating contexts for threads that require scheduling.

Scheduling

When threads are scheduled affects the execution in several ways. First, it determines the dynamic task structure and the amount of parallelism. A bushy task tree provides additional units of computation for load balancing and latency hiding, but scheduling and synchronizing these tasks incurs overhead. Second, when a thread is scheduled affects the dependent threads waiting for its result. If the result is not returned quickly, the dependent threads will be flushed from the higher levels of the memory hierarchy, and they will then have to be reloaded at some cost. Finally, when one thread is created by another on the same processor, the second thread can be scheduled immediately, enabling the threads to communicate through the highest levels of the memory hierarchy (i.e., registers). This motivates an eager scheduling model (the default) and the hybrid execution model presented in Chapter .

Futures

New threads are logically created for each concurrent invocation (Section ). These threads synchronize with their parent thread using futures. Futures are essentially promises made by a task to provide a result, possibly at some later time. In a fine-grained concurrent model they are implicit, unlike in MultiLisp and Mul-T, where their insertion is the responsibility of the programmer. They are automatically inserted so as to enforce user-specified concurrency constraints (sequential blocks) and local data flow (Section ). For example, in Figure , the calls to func1 and func2 promise the results of their calculations as a and b respectively. These calls can be implemented as concurrent threads whose results may be computed at any time. The parent thread continues to the call func3, which requires the results a and b. When these results are ready, func3 can begin executing. Finally, when func3 completes, func4 can begin executing.

    conc {
      a = func1();
      b = func2();
    }
    c = func3(a, b);
    func4(c);

Figure: Futures Examples (diagram: func1 and func2 execute as concurrent tasks which write the futures a and b; func3 reads a and b and writes c, which func4 reads)

Continuations

When a future is originally created it is empty, and the right to determine the value of the future is called a continuation. The continuation is essentially a small closure which is applied to provide the value of the future. Continuations are first-class citizens: they can be passed to another method or thread, and they can be stored in memory. For example, in Concurrent Aggregates the continuation can be accessed through the pseudo-variable requester or passed on to another invocation using the forward construct (see Section ), while in ICC++ it is accessible directly as an object called reply, available in every method. When a thread wishes to test a future, to see if its value has arrived yet, it touches the future. If the value has not arrived, the thread must suspend.

    // source:
    conc {
      a = func1();
    }
    b = func2(a);

    // pseudocode:
    (aFuture, aContinuation) = MAKE_FUTURE(a);
    MAKE_THREAD(func1, aContinuation);
    if (TOUCH(aFuture)) SUSPEND;
    (bFuture, bContinuation) = MAKE_FUTURE(b);
    MAKE_THREAD(func2(a), bContinuation);

Figure: Continuations and Touches

Consider the code on the left in Figure . The method func2 depends on the value of a. The pseudocode on the right describes the steps taken by the abstraction to implement this dependence. First, a is made into an empty future and a continuation is created to return a value to that future. Second, a thread is created to execute func1 and passed the continuation for a. Concurrent with the execution of func1, the future a is touched. If the value of a is not present, the thread suspends. In any case, when a is available, its value is passed to the method func2. Of course, since building futures and threads and suspending is expensive, the most general forms of these operations are rarely performed. Optimizing these operations is the topic of Chapters and .

    // invocation:
    b = func1(a);

    // as written:
    func1(a) { return func2(a); }
    func2(a) { ... return aa; }

    // with the hidden continuations made explicit:
    func1(a, continuation) { func2(a, continuation); }
    func2(a, continuation) { ... continuation(aa); }

Figure: Use of Continuations

Continuations can be used like normal methods, except that they complete immediately (clearly, we cannot use a future to wait until the first future has received its value) and return no value. Figure shows a method invocation on the left. In the center is the definition of the method as written by the programmer. On the right is the same definition with the hidden continuation variables made explicit. Notice how the right to determine the value of b is forwarded from func1 to func2. The continuation is eventually used by func2 to return the final result to b.

Counting Futures and Continuations

Counting futures are futures which represent a number of outstanding values. Along with counting continuations, counting futures enable a thread to synchronize once with an arbitrary (determined at runtime) number of threads, much like the future sets of MultiLisp. On the left in Figure , a concurrent loop invokes func on a number of objects. The loop cannot complete until all of the invocations have completed. On the right, a future cFuture is created with an initial count of zero. Each time an invocation is made, the number of outstanding results is incremented. The TOUCH operation, when applied to a counting future, checks that all the results have returned, suspending the thread if this is not the case. The implementation implications of counting continuations are considered in more detail in Chapter .

    // source:
    conc for (i = 0; i < n; i++)
      o[i].func();

    // pseudocode:
    cFuture = ZERO_COUNT;
    LOOP {
      cContinuation = INC(cFuture);    // one more outstanding result
      MAKE_THREAD(o[i].func, cContinuation);
    }
    if (TOUCH(cFuture)) SUSPEND;

Figure: Counting Futures and Continuations

Objects

Objects are represented by a region of memory laid out with fields for instance variables, synchronization, and scheduling (Section ). Each instance variable field is called a slot (Section ). These slots may be tagged with the type of the contents, enabling them to represent polymorphic instance variables (Section ). Race conditions resulting from concurrent access to these slots are prevented by locks (Section ). When a method cannot be scheduled immediately, for example if it cannot acquire the locks it requires, it is delayed in a queue of threads associated with the object.

Object Layout

Concurrent objects are in general more heavyweight than their sequential counterparts. In addition to the class descriptor (virtual function table pointer) required for object-oriented dispatch, they must maintain lock information and a queue of outstanding blocked messages. Also, the compiler must be able to generate code to access instance variables and find object pointers for garbage collection purposes. Figure shows the layout of an object. Both instance variables and array elements are optional and can be included in any object. Since the programming model does not permit the interconversion of pointers and arrays, arrays are like other objects and can contain instance variables. The instance variables and array elements may or may not be tagged slots (Figure below).

Slots

Figure: Object Layout (fields: Class ID/VFTP, Waiting Queue, Locks, Instance Variables, Array Elements)

A slot is a data location which is tagged so that the type of the contents can be determined at run time. Slots can contain immediate values (integers, floating point numbers, system constants), global names, local object pointers, continuations, and futures (Section ). Figure gives an example of how a slot might be declared in C. The slot tags are used for dynamic dispatch

and by the garbage collector (Section ) to determine if a slot contains a pointer. For example, Concurrent Aggregates is a pure object-oriented language in which a variable can store an integer one moment and a reference to an object the next. Tag manipulations can be expensive, and slots can occupy more space than unboxed (untagged) values. Section discusses eliminating these inefficiencies through unboxing.

    struct Slot {
      enum { INT, FLOAT, FUTURE, GLOBALPTR, LOCALPTR } tag;
      union {
        int i;
        float f;
        Future fut;
        Continuation cont;
        GlobalPtr gptr;
        LocalPtr lptr;
      } value;
    };

Figure: Slot Abstraction

Tags

Slots are used to store the values of polymorphic variables and the return values of concurrent invocations which require futures. Slots contain tags which indicate the type of data they contain. These tags differentiate fundamental types (int, float, etc.), global and local names, collections, selectors, continuations, and futures (both empty and full). When analysis can show that a slot can contain only one type of data at any one time, the slot can be converted into unboxed data: the tags are removed and space is allocated only for the contents, allowing the removal of tag manipulation operations and the recovery of the tag space.
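For example, a guarded read of an integer from a slot might look like the following sketch (using the Slot declaration above; type_error is a hypothetical handler):

    extern void type_error(void);

    int slot_int(struct Slot *s) {
      if (s->tag != INT)      // tag check: a dynamic type test
        type_error();
      return s->value.i;
    }
    // If analysis shows the slot only ever holds an int, the tag check,
    // the tag field, and the boxing can all be removed, leaving a plain int.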

Locks

Object-based access control is the underlying mechanism used to ensure consistency as required by the programming model. While the consistency model is described in terms of visible state changes, the implementation is based on atomicity. That is, the internal state changes of a transaction (method) are hidden by prohibiting other transactions from happening concurrently. While in some cases it is possible to ensure this by analyzing the control flow of the program, in general an object must be locked and conflicting messages delayed. Concurrent methods have the general form of the code on the left in Figure .

    A::foo() {
      LOCK(self);
      // ... operate on self ...
      UNLOCK(self);
    }

Figure: Object-based Access Control (diagram: an incoming message A::foo arrives at the receiver object; the active method holds the lock while waiting methods are queued in the object's waiting queue)

When a method begins executing, a thread is created which locks the target object to prevent interference from other threads. Locks can be checked, taken, and released, requiring access to and manipulation of the lock fields stored in the object. In the case of a check, the new thread may be required to suspend pending availability of the lock. Such messages, arriving at a busy object, are delayed in the waiting queue until the currently active thread frees the lock.

The region of code over which a thread has access to the object is called the access region. There are two main ways to remove lock operations. First, one locking operation subsumes another when the second occurs only when the first lock operation has succeeded; such lock operations are unnecessary and can be removed. Second, consecutive acquisitions and releases of the same lock can be grouped and the intermediate release-take pairs removed. Optimizations concerning lock subsumption detection and manipulations of access regions, in order to minimize the number of lock and scheduling operations, are discussed in Chapter .
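As a sketch of these two transformations (hypothetical methods f and g on the same object; LOCK and UNLOCK as in the figure above):

    f() { LOCK(self); ... g(); ... UNLOCK(self); }
    g() { LOCK(self); ... UNLOCK(self); }

    // Subsumption: if g runs only while f holds the lock, g's lock
    // operations are unnecessary and can be removed.
    // Grouping: in a sequence UNLOCK(self); LOCK(self); between
    // consecutive operations on the same object, the intermediate
    // release-take pair can be deleted, merging the two access
    // regions into one.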

Messages

An invocation (message send) in the fine-grained concurrent object-oriented model consists of a number of steps, illustrated in Figure . These steps are abstract: using the techniques described in this thesis, many of them can be optimized away or performed in a different order. First, the global name is translated and the target address space determined. Then the message is constructed from the arguments and the continuation is built. Next, the message is transferred to the target address space. Some time later the message is scheduled. Finally, the message is dynamically dispatched and the appropriate code begins executing.

Figure: Invocation Sequence (diagram: from the invoking object to the target object: (1) translate name, (2) create message, (3) transfer message, (4) schedule/initiate, (5) dynamic dispatch)

Global Shared Name Space

A shared name space means that there is a single name space and that the names are always valid no matter where the data is located. Many concurrent object-oriented systems distinguish the hardware (local) and global name spaces (e.g., Split-C, CC++, Charm, Mentat). In such systems, only global names can be passed between nodes and used to refer to objects on different nodes of a distributed memory machine (Section ). A shared name space greatly simplifies programming by enabling the expression of the algorithm to be separated from the location of data. Global names can be implemented as a node number and a local memory address (of the current or initial location of the object), or as an index into a distributed hash table.
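The first representation might look like the following sketch (illustrative only; LocalPtr stands for a node-local address type, as in the slot declaration above):

    struct GlobalName {
      int      node;     // node of the current (or initial) location
      LocalPtr address;  // address within that node's local memory
    };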

    // program level:
    a.foo();

    // implementation:
    if (NODE(a) == THIS_NODE) {
      o = LOCAL_POINTER(a);
      INVOKE(o, foo);
    } else {
      SEND_MESSAGE(NODE(a), a, foo);
    }

Figure: Global Name Translation

For distributed memory machines, both with and without hardware support for a shared name space, it is desirable to manage the location of threads with respect to the data they are operating on, in order to ensure locality of access. This involves a translation from global names to local names, which are valid within a single node and typically represented by an address in the local memory. The translation mechanisms are provided by the runtime system (Section ). For example, in Figure , the code on the left shows a program-level invocation. On the right, the global name a is first checked to see if it corresponds to a local object, and either a local invocation is made or a remote message is sent. Chapter discusses automatic management of locality and the optimization of translation operations.

Scheduling

Scheduling is the process of selecting which tasks run when. When a message arrives at the object, the default behavior is to execute it immediately, acquiring any required locks. If the locks cannot be acquired, the message is delayed. The scheduler maintains a queue per object in which waiting messages are stored in the order in which they arrived. When locks are released on the object, waiting messages attempt to acquire their locks and execute.

The scheduler must support the model of fairness required by the programming model. In this thesis we assume a restricted version of weak fairness: messages sent to objects will be handled eventually, assuming that methods holding locks on the object eventually terminate, that threads executing on the processing element eventually block, and that the result of the computation depends on the result of the message. These conditions can be violated by the program deadlocking, looping infinitely, or not synchronizing on the termination of an operation. The compiler must not induce these conditions so long as the programmer relies only on the concurrency guarantees provided by the programming model. This affects the locking optimizations in Chapter .

Dispatch Mechanism

The code executed as a result of a method invocation depends on the dynamic type of the target objects (Section ). A dispatch mechanism selects the code to be executed using a set of dispatch criteria derived from the calling environment, which can include the signature, the declared and actual type of one or more arguments, and the name of the method. Several dispatch mechanism implementations are possible, including run-time searching of the method dictionary, class-based tables, hash tables, inline caches, and polymorphic inline caches. The caching mechanisms attempt to reduce the amortized cost of dispatch by providing a fast path for related (temporally proximate) dispatches.

Perhaps the most common implementation (that of C++) is a table of methods indexed by the order in which methods are declared in a class. This method has the disadvantage that methods to be dispatched on must be defined in some shared superclass, a restriction C++ imposes in the type system. This restriction is isomorphic to the object layout problem (Section ). Faster mechanisms can also be used when the values of some of the criteria are known at compile time, and the fastest mechanism is to avoid the call entirely and inline the code. Chapters and are concerned with these optimizations.
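As an illustration (a simplified sketch of class-based table dispatch, not the Concert mechanism; all names are invented):

    struct Object;
    enum { NMETHODS = 8 };                       // illustrative table size
    struct VTable { void (*methods[NMETHODS])(Object *); };
    struct Object { VTable *vtable; };

    // Dynamic dispatch: an indexed load plus an indirect call whose
    // target the hardware may mispredict.
    void invoke(Object *o, int method_index) {
      o->vtable->methods[method_index](o);
    }
    // When the receiver's class is known at compile time, the indirect
    // call can be replaced with a direct call, or the method inlined.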

Implementation Issues

The facilities provided by the execution model are sufficient, but their generality makes them expensive. Variables implemented as slots require additional space, and time to manipulate the tags. For non-polymorphic variables, the tags and operations can be elided. Likewise, thread creation, scheduling, and synchronization operations should be avoided whenever possible. Three other issues are memory map conformance (Section ), load balance and data distribution (Section ), and garbage collection (Section ).

Specialization and Memory Map Conformance

When an object of a particular class is created, it can be used in a number of different ways. If it is an array, it may be created to hold some arbitrary number of elements or a compile-time constant number. Its instance variables may be used to hold single types of data or those of several types. Access to its internal state may be entirely mediated by some surrounding object, or the object may be capable of receiving messages directly. These differences represent optimization opportunities for which the object may be specialized. Sometimes it is possible to statically determine these properties over an entire class of objects, in which case the class and all code manipulating objects of that class can be specialized. Inheritance, however, introduces additional complexities.

Memory map conformance is a property of the layout of objects within classes such that code compiled to manipulate objects of the superclass type can be used on objects of the subclass type. For example, consider a class A which defines a single instance variable a, and a class B which inherits from A and defines an additional instance variable b; the compiler can ensure that the location of a within objects of type A and B will be the same. This is illustrated in Figure . This property enables code to be shared between the superclass A and the subclass B, but it inhibits some optimizations.

    class A {
      a;
    };
    class B : A {
      b;
    };

Figure: Class Memory Map Conformance (diagram: objects of both classes begin with Class ID, Waiting Queue, and Locks fields; a is at the same offset in A and B objects, and B objects append b)

Memory map conformance can also be an issue for objects in a single class. For example, if a particular instance variable of an object is known to be a compile-time constant, the space for that instance variable need not be allocated. However, this can lead to difficulties, since the code generated to manipulate the object may be shared with other objects which require the instance variable. Consider the two objects in Figure . The memory map of x could be altered to remove the instance variable a, since this variable is a compile-time constant. However, the code which manipulates these objects must then access b at different locations. Performing such optimizations, both those which preserve memory map conformance and those which do not, is one of the subjects of Chapter .

    class A {
      a;
      b;
    };
    foo(o) { return o.b + o.b; }

    conc {
      A x; x.a = 1; x.b = ...;    // x.a is a compile-time constant
      A y; y.a = g; y.b = ...;    // y.a is not
      foo(x);
      foo(y);
    }

Figure: Object Memory Maps Which do not Conform (diagram: object x's memory map omits the constant instance variable a while object y's retains it, so b lies at a different offset in each)

Load Balance and Data Distribution

The efficiency of a parallel program depends on its processor efficiency (the efficiency of execution on a processor) and its parallel efficiency (the number of processors that the program can keep busy consistently). This thesis is concerned primarily with processor efficiency, under the assumption that the work and data are distributed. There are many ways to distribute work and data and balance the work load across the machine. Generally, these techniques involve data shipping (caching, object migration, etc.), function shipping (remote procedure calls, distributed loops), or both.

To abstract the problem of processor efficiency, we assume a simple model where methods execute local to the object on which they are invoked and objects are distributed across the machine. Load balance is assumed to be maintained by some combination of static placement and data and function shipping mediated by the runtime system. The location where a thread should be created (that of the target object) is mediated by the runtime system through an interface (Section ) used by the compiler-generated code.

Memory Allocation and Garbage Collection

Object-oriented languages encourage dynamic creation and disposal of objects at run time instead of static allocation of them at compile time. It follows that the efficiency of these operations can greatly affect the overall performance of the application. As we will see in Chapter , the efficiency of at least one simple benchmark doubles when the memory management mechanism is improved. The nature of the object memory map affects memory allocation and garbage collection. For example, if an object does not have an array portion, or if the array portion is known to be of a particular size, a custom bin-based memory allocator can be used. Similarly, in order for the garbage collector to find all the reachable objects in the system, it must be able to determine which data represent object pointers. Therefore, if objects and contexts (see Section ) are tagged by type descriptors, individual tags on the instance variables and temporary variables can be eliminated in favor of information stored in the type descriptor.

Runtime

The compiled code interacts with the runtime system through an abstract interface which hides many of the details of the underlying hardware. It is the runtime system which defines the translation and message transfer operations. Table summarizes the runtime interface. Again, many of these operations may be optimized away. The first set (INVOKE and REPLY) is for invoking methods, sending messages, and replying to continuations. The next set (LOCK_OBJECT and LOCAL_POINTER) exposes the object consistency (lock) and global-to-local name translation mechanisms. The next set (CONTEXT_SLOT through OBJ) allows translation to and from unboxed values and operations on heap-based contexts. Continuation manipulation (MAKE_CONTINUATION and TOUCH) makes up the next set. Then we have object operations, creation (NEW_OBJECT) and intra-collection addressing (COL_TO_REP), and finally operations on globals. These operations will be considered in more detail in the chapters which discuss their optimization.

Related Work

At the hardware level, there have been a number of systems which were constructed especially to run object-oriented languages, including the Xerox Dorado and Berkeley SOAR. The J-Machine and the MDP were designed to run concurrent object-oriented programs, and use the COSMOS operating system as a runtime environment on top of the hardware. Likewise, ABCL has been implemented on the hybrid dataflow machine EM-4. Recently, however, economy of scale and advances in compiler technology have favored commodity microprocessors, which have proven cost-effective.

Operation            Variations and Related Operations
INVOKE               INVOKE_IMMED, INVOKE_LOCAL
REPLY                REPLY_WITH_MSG
LOCK_OBJECT          UNLOCK_OBJECT, LOCKED
LOCAL_POINTER
CONTEXT_SLOT         INST_VAR, CONTINUATION_SLOT, ARGUMENT, THE_MSG, THE_SUPER
OBJ                  INT, FLOAT, GLOBAL_NAME
MAKE_CONTINUATION    MAKE_COUNT_CONTINUATION
TOUCH                MTOUCH, TOUCH_BEGIN, SWITCH, TOUCH_END, MTOUCH_BEGIN
NEW_OBJECT           NEW_LOCAL_OBJECT, NEW_LOCAL_OBJECT_SIZE, NEW_ARRAY,
                     NEW_LOCAL_ARRAY, NEW_UNBOXED_ARRAY, NEW_LOCAL_UNBOXED_ARRAY,
                     NEW_AGGREGATE, NEW_LOCAL_AGGREGATE, NEW_UNIQUE, NULL_OBJECT
COL_TO_REP           COL_TO_PHYSICAL_REP
READ_GLOBAL          READ_GLOBAL_SEQUENTIAL, READ_GLOBAL_SEQUENTIAL_UNBOXED_INT,
                     READ_GLOBAL_SEQUENTIAL_UNBOXED_OBJ, DISTRIBUTE_GLOBAL

Table: Runtime Operations

For example, ABCL/f implemented on the AP1000, and the Concert system on the Thinking Machines CM-5 and Cray T3D, have all proven effective, often thanks to efficient runtime systems.

There is a host of material on software models for sequential object-oriented languages, most notably Smalltalk and Self. These models differ from that discussed in this chapter with respect to support for distribution and concurrency. There are also many concurrent object-oriented systems. Of particular interest are Concurrent Smalltalk, ABCL/1, ABCL/R, and ABCL/f. Many of the concepts in this chapter are found in these systems, but the specific mechanisms, especially for synchronization, differ.

Summary

The execution model is the compilation target and cost model for the compiler. It is composed of a set of execution abstractions of the hardware, software, and runtime system. The hardware model is based on commodity microprocessors spanned by an interconnection network. This model indicates several potential sources of inefficiency, including control flow ambiguities resulting from dynamic dispatch, memory hierarchy traffic, and communication overhead. The software model describes model implementations for threads, objects, and messages. Threads store their state in contexts, which are non-LIFO stack frames. Threads synchronize with other threads using futures, which promise a value to be provided later by a continuation. Objects store their instance variables in slots, which may be tagged with the type of their contents. The software model describes the physical layout of contexts and objects and the implications for memory allocation and garbage collection. It also describes the method invocation sequence and dynamic dispatch mechanism. The hardware and the software interact through a runtime interface.

Chapter

The Compilation Framework

"Computers are useless. They can only give you answers."
Pablo Picasso

This chapter describes the compilation framework and the Concert System in which it is implemented. The Concert System is a complete development system for concurrent object-oriented programs, consisting of a compiler, runtime system, and various support tools and libraries. The Concert vision, introduced earlier, is to combine high-level expression of programs with efficient implementation. The Concert compiler achieves this through aggressive analysis and transformation, and through collaboration with an efficient runtime system. Working together, the compiler and runtime can efficiently map both Concurrent Aggregates (CA) and ICC++ to a variety of target machines, all described in the sections that follow.

Concert

The Illinois Concert System is a project of the Concurrent Systems Architecture Group. Its goal is to produce portable, high performance implementations of concurrent object-oriented languages on parallel machines. The philosophy of the project is that programs should be written at a high level, exposing all the fine-grained concurrency, and that the language implementation should tune the execution grain size to that supported efficiently by the target platform. The Concert system consists of a compiler, a runtime specialized to the underlying hardware, an emulator for quick-turnaround debugging, a debugger, and a standard library. The Concert system has been under continuous development since the project began.

Objective

The overriding objective of the Concert project is to show that concurrent object-oriented programs can be executed efficiently on stock hardware. We believe that empirical evaluation must be the ultimate arbiter, and that necessity is the most effective way to establish priorities. We established two constraints for our system. First, it should provide high performance. Since the techniques required to improve performance change dramatically as we approach the performance of conventional languages on uniprocessors, high performance is of both practical and academic value. The second constraint is that the system should be general and portable, applicable to a range of concurrent object-oriented languages and target parallel machines. Portability is important from both a practical as well as an academic standpoint. As a practical matter, parallel machines have diverse characteristics and short lifetimes; while we were among its early users, the CM-5 went out of production shortly after our port was complete. From an academic standpoint, portability demonstrates generality of approach.

Philosophy

The philosophy of Concert is that programmers should be concerned with natural expression of their programs, and the system should be concerned with producing an efficient implementation. From our point of view, natural expression involves high level abstractions and explicit concurrency. We believe that the compiler should map these dynamic high level abstractions to static implementations at the earliest point possible, e.g., partially evaluating expressions with respect to known parameters. Explicitly concurrent programming languages are strictly more expressive than sequential languages, since they can model nondeterministic algorithms. We believe the programmer should express the natural concurrency in the algorithm, and that it is the responsibility of the system to tune the realized concurrency to that which can be efficiently supported by the target platform.

Our approach to tuning the execution grain size is to specialize the computation relative to the properties of the program as they become known. Some properties are known statically, in the program text, at compile time. Others require the program to be transformed; for example, replicating code executed under different conditions can be beneficial because it enables creation of unique versions with fixed, known properties. Still other properties must be tested at runtime: we can compile different versions of code specialized for these dynamic properties, and the version executed is selected at runtime. In any case, taking advantage of these properties as early as possible is desirable, since implementation cost increases as we move closer to the hardware. This induces a hierarchy of mechanisms, pictured in the figure below. By fixing properties as early as possible, we separate the dynamism of the algorithm from that of the language used for expressing the algorithm.
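As an illustration of compiling multiple versions and selecting among them at run time, the following hypothetical C++ sketch (not Concert's actual generated code; all names are assumptions) guards a statically bound, inlined fast path with a run-time class test and falls back to dynamically bound dispatch:

    #include <cstdio>

    struct Object { int class_id; };
    struct Point  { Object hdr; int x, y; };
    enum { CLASS_POINT = 1 };

    // General path: dynamically bound dispatch through a per-class table.
    typedef int (*GetXFn)(Object*);
    int generic_get_x(Object* o) { return reinterpret_cast<Point*>(o)->x; }
    GetXFn dispatch_table[2] = { nullptr, generic_get_x };

    int get_x(Object* o) {
      if (o->class_id == CLASS_POINT)
        return reinterpret_cast<Point*>(o)->x;   // specialized, inlined fast path
      return dispatch_table[o->class_id](o);     // general, dynamically bound path
    }

    int main() {
      Point p = { {CLASS_POINT}, 3, 4 };
      std::printf("%d\n", get_x(&p.hdr));        // prints 3
      return 0;
    }

The fast path fixes the dynamic property (the receiver's class) as early as possible, while the slow path preserves the full flexibility of the language.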

[Figure: Concert Philosophy, A Hierarchy of Mechanisms. From early information to late information, the program is successively lowered through: static analysis and intraprocedural transformation; static speculation and interprocedural transformation; dynamic speculation and runtime transformation; generic code and general runtime primitives; and machine code and runtime primitives, down to the target platform.]

In order to provide the system maximal flexibility, the runtime system provides a hierarchy of primitives of varying degrees of flexibility. For example, the compiler can select a primitive which statically binds the code to be executed for an invocation, if the code is known at compile time, or one which dynamically binds the code at run time. In turn, the compiler provides information obtained by static analysis to the runtime system for use in runtime transformations like caching and load balancing. Thus, the Concert philosophy is that the compiler and the runtime should work together, each leveraging the strengths of the other.

Parts of the System

The Concert System is composed of a compiler, runtime system, emulator, symbolic debugger, and a set of libraries. The compiler and runtime are discussed in the sections below. The emulator works within the Lisp environment by interpreting the core language intermediate representation (described below). The ParaSight debugger provides source level debugging of concurrent object-oriented programs on workstations. The standard library provides reusable code for two and three dimensional grids, barriers, distributed counters, combining trees, data parallel computations, and other standard data structures and algorithms.

Timetable

The Concert System has undergone a staged development, described in the table below. The project originally derived from the thesis work of my advisor, Andrew Chien. That system consisted of a translator and runtime for uniprocessor workstations. Simple programs executed on that system ran approximately six orders of magnitude (one million times) slower than unoptimized C. In the first, exploratory stage of work, we added inheritance and dynamic dispatch to the Concurrent Aggregates language, developed a runtime interface for the CM-5, and prototyped an interprocedural flow analysis. This stage is described in the table by the row labeled Pre-Concert. Based on this experience, we began a new design, Concert, including a new compiler framework and runtime interface.

Version       Language Changes   Major Compiler Changes         Other Major Changes
Pre-Concert   CA+OO              proto-Flow Analysis            proto-CM-5 runtime
Concert       set                CFG/SSA-based Flow Analysis    CM-5 runtime
Concert                          Primitive Inlining             Emulator, Debugger
Concert       functions          Speculative Inlining           Parallel GC
Concert       annotations        PDG-based, Access Regions      T3D runtime, Hybrid
                                                                Execution, User
                                                                Distributions
Concert       ICC++              Object Inlining

Table: Development of the Concert System

The first version of Concert was an internal release capable of executing the test suite developed for the pre-Concert system. The internal representation was a Control Flow Graph (CFG) in Static Single Assignment (SSA) form. This version was considerably faster than the pre-Concert system, owing to a more efficient runtime system. The first external release improved on this by inlining primitive operations (e.g., integer add). A subsequent version added speculative inlining (described later): runtime locality and lock tests conditioning inlined code. This brought performance to within a small factor of C for many codes. Closing the remaining gap required more radical changes, including shifting to a new internal form, the Program Dependence Graph (PDG). It is the work required for this last factor which occupies the later chapters of this thesis.

Compiler

The Concert compiler employs aggressive static analysis and transformations, exploiting information from the high level COOP semantics and the low level load/operate/store semantics of the programming model. For example, subsumption of object-level access control operations (discussed later) requires interprocedural analysis and transformation at the level of the call graph. Likewise, object-level access regions effectively guarantee lack of aliasing, which can be used for instruction scheduling and register allocation at the lowest level. To exploit these levels, the Concert compiler provides a compilation framework with eight representations of the program, from source code to target code, so that optimizations can be applied at the most convenient level. This framework is composed of five phases: parsing and semantic processing, program graph construction, analysis, transformation, and code generation. Excluding the two parsers, the compiler is approximately thirty-five thousand lines of Common Lisp/CLOS.

Overview

The Concert compiler is structured as a pipeline of phases with several feedback loops. The figure below gives this structure along with the intermediate representations used in each phase. The parsing and semantic processing phase consists of reading the program in the source language into an Abstract Syntax Tree (AST) and computing various attributes. Its product is the program translated into the core language intermediate form. This form is then, during the program graph construction phase, translated into a Program Dependence Graph (PDG) in Static Single Assignment (SSA) form by way of a Control Flow Graph (CFG). The program graph intermediate form is used by the analysis and transformation phases. Finally, during code generation, the control flow graph is rebuilt, the program is translated into Register Transfer Language (RTL), and target code is generated.

[Figure: Concert Overview, Phases and Intermediate Representations. Program source code (CA, ICC++) flows through parsing and semantic processing (Abstract Syntax Tree, core language), program graph construction (program graph: CFG, CFG+SSA, PDG+SSA), analysis and transformation in a feedback loop (program graph: PDG+SSA), and code generation (program graph: CFG, RTL), producing generated code.]

Retargetable Front End

One of the goals of Concert is portability of the system: the ability to compile a wide range of concurrent object-oriented languages. In order to meet this goal, we designed a retargetable front end. Since we wanted to preserve high level information, instead of immediate translation to RTL (e.g., as in GCC) the front end target is a simple core language. The language-specific front ends for CA and ICC++ pass a set of methods to the program graph construction phase. These methods are built from statements in the core language over symbols, globally unique identifiers.

Symbols

Symbols are passed as a simple list or flat table of descriptions, each consisting of a unique identifier and a number of optional fields. There are two types of symbols: variables and classes. The identifier is used both within symbol descriptions and within the core code to represent the symbol. It must be unique; program-language-level scopes must be flattened. Internally, the identifier is represented as a reference to a symbol object. There is one optional field applicable to both variable and class symbols:

field   type     description
name    string   The printable name of the symbol (e.g., the variable name
                 used by the debugger).

There are several fields applicable only to variables:

field     type                 description
value     string, integer, or  Indicates that this variable is a constant
          floating point       and gives the value.
          number
type      class symbol         The type used for dispatching purposes. For
                               the self identifier argument, this is the
                               class in which the method is defined. For
                               other values, this is the type which the
                               variable should be considered to be of for
                               dispatching purposes (i.e., for super in
                               Smalltalk this is the superclass).
self      boolean              Indicates that this is a syntactic self
                               send. In both CA and ICC++, invocations made
                               directly on self or this are considered to
                               be part of the same transaction.
global    boolean              Indicates that this is a global variable. An
                               initialization body will be executed before
                               the program proper.
counter   boolean              Indicates that this variable should use a
                               counting continuation (described later).

And there are others applicable only to classes:

field                type               description
instance variables   list of symbol     The instance variables for this
                     identifiers        class. Any inherited instance
                                        variables should be included here.
superclass           symbol identifier  The superclass, used by dynamic
                                        dispatch for method lookup.

Core Language

A program in the core language is a set of methods, each in the form of an invocation template and a statement. The invocation template has the form of a send statement where the selector is a string constant and the parameters are the arguments. There are five statements in the core language, with two variations and one tag. These are presented in the table below.

statement    arguments           description
send         list of symbols     An invocation.
move         symbol              Assignment of primitive types, including
                                 pointers and symbol references.
if           symbol, statement,  Conditional.
             statement
while,       symbol, statement   A while loop; concwhile indicates that the
concwhile                        iterations can execute concurrently.
seq, conc    list of statements  A block of statements; conc indicates that
                                 they can execute concurrently.
future       list of symbols     Tag which indicates that the actual argument
                                 is a continuation for the future values of
                                 the symbols.

Table: Core Language Statements

The figure below is an example of the translation of the polymorphic function dbl. The class hierarchy is rooted at rootclass; all classes have rootclass as a superclass. There are four symbols, including the argument x, whose type is used for dispatch, the selector dbl, and a temporary value temp. The method defines the invocation template, including the hidden continuation argument (described earlier). In the body, the two send statements are nominally concurrent, though they will have to execute sequentially because the second requires the result of the first.

    ;; polymorphic function to double numbers
    dbl(x) = x + x

    ;; core language translation
    symbol class rootclass
    symbol variable x (type rootclass)
    symbol variable dbl (value "dbl")
    symbol variable temp
    method (dbl x c)
      conc
        send (+ x x) (future temp)
        send (reply c temp)

Figure: Core Language Example

Thus the core language is a bare-bones concurrent object-oriented language, similar to Smalltalk but without even that language's syntactic conveniences. While it is quite powerful, it has restrictions. Control flow is restricted to if and while to simplify program graph construction. Functions are restricted to the top level, so local data cannot be accessed by more than one function. This enables later phases to easily preserve the consistency of local data. Both CA and ICC++ are translated into the core language in their respective parsing and semantic processing phases.

Parsing and Semantic Processing: CA

The CA front end is built using the PArser GENerator for Common Lisp by Ken Traub, part of the Dataflow Compiler Substrate for the MIT Id compiler. The parser generates an abstract syntax tree on which inherited and synthetic attributes are defined by Lisp functions and computed on demand. Since CA is quite close to the core language, this front end is concerned mainly with flattening scopes and constructing accessor functions and initialization code for objects and globals. As a result it is quite small, requiring approximately two thousand lines of Lisp code.

Parsing and Semantic Processing: ICC++

The ICC++ front end was written by Julian Dolby with the help of Hao-Hua Chu, and is derived from the CPPP parser from Brown University. It is built using a modified version of the Zebu parser by Joachim H. Laubsch, combined with several recursive descent prepasses in the lexer. The parser generates an abstract syntax tree on which substantial computation must be done for type checking, insertion of coercion operations, and overload resolution. ICC++ is not as similar to the core language as CA. In particular, ICC++ allows more general control flow (break, continue, return, and goto), which is translated into conditionals and while loops. Nevertheless, the code related to core language translation is a small portion, twenty percent or three thousand lines, of the ICC++ front end, which requires some fourteen thousand lines of Lisp code.

Program Graph Construction

The program graph intermediate form is a variation on the Program Dependence Graph (PDG) in Static Single Assignment (SSA) form. Construction of the program graph generally follows standard algorithms, except for building CFG data dependences and Static Single Use (SSU) form, both described below. A program graph node is created for each core language statement, presenting two views: one with fields for sets of rvals (values read by this statement) and lvals (values written), and one with only args, asynchronous message arguments including an explicit continuation. Local and traditional sequential optimizations are implemented in terms of rvals and lvals, while interprocedural and concurrency-minded optimizations operate on the continuation-passing view. These nodes are embedded in four graphs. The Control Flow Graph (CFG) is constructed directly from the core language. Forward and backward dominator graphs are computed. The Control Dependence Graph (CDG) is built, and the program is then translated into SSU form using a variant of the standard SSA construction algorithm. Note that since the core language contains only while loops, the CDG is a tree.
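A sketch of such a two-view node follows (illustrative C++; the Concert compiler itself is written in Common Lisp, and these field names are assumptions):

    #include <vector>

    struct Var;  // a program variable (one per assignment/use in SSU form)

    // One node per core-language statement, presenting two views.
    struct ProgramGraphNode {
      // View 1: values read (rvals) and written (lvals), used by local
      // and traditional sequential optimizations.
      std::vector<Var*> rvals;
      std::vector<Var*> lvals;

      // View 2: asynchronous message arguments, including the explicit
      // continuation, used by interprocedural and concurrency-minded
      // optimizations on the continuation-passing view.
      std::vector<Var*> args;
      Var* continuation = nullptr;
    };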

CFG Data Dependencies

We record CFG data dependencies in the program graph for user-specified ordering of non-local operations. They are computed based on the semantics of concurrent and sequential blocks and loops. These in turn depend on the read-after-write relationships between local variables. For example, two send statements might otherwise be concurrent, save that one is required to precede a write to a variable which is read in a statement required to precede the second (see the figure below). When the program is converted into Static Single Use form, the dependencies for move statements are dropped, since they are implicit in the def-use relationships.

    ;; The two sends must be executed in sequence
    conc
      seq
        send meth1 (a) (future x)
        move x b        ;; write b
      seq
        move b c        ;; read b
        send meth2 (d) (future y)

Figure: Data Dependencies Example

Static Single Use Form

Static Single Use (SSU) form is a variant on Static Single Assignment (SSA) form. SSA form inserts Φ-nodes, essentially assignments with multiple right hand sides, where control flow merges. For example, after a conditional, a Φ-node renames variables assigned in either branch. This ensures that each variable will appear on the left hand side of only one assignment. SSU form adds split nodes which, analogous to Φ-nodes, rename variables which are read along different control flow paths. These nodes appear before conditionals. Consider the code in the figure below. In the code on the left, a is used under two different conditions, when the condition is true and when it is false, and it is assigned twice. SSU form renames the variable for each use and assignment. In addition to simplifying the construction of the flow graph, this renaming prevents interference between transfer functions during flow analysis (next chapter).

    Before:                     After:
    a = ...                     a1 = ...
    if (condition)              (a2, a3) = split(a1)
      a = a.meth1()             if (condition)
    else                          a4 = a2.meth1()
      a = a.meth2()             else
                                  a5 = a3.meth2()
                                a6 = Φ(a4, a5)

Figure: Code before (left) and after (right) SSU Conversion

The Standard Prologue

Concurrent Aggregates is a very simple, pure object-oriented language. All program data are objects, and all operations are message sends. Before any user code is compiled, the compilation environment is augmented by compiling a standard prologue (Appendix D). This prologue contains the built-in classes and functions accessible to the user. For example, the integer and float classes and their operations are defined there. Likewise, arrays, strings, globals, the constants true and false, and even the nil object are all defined in the standard prologue, and their definitions can be overridden by any user program. For example, bounds checking on arrays is implemented simply by overriding the at and put operations on the array class. At the lowest level, these operations are defined in terms of primitives whose properties are known to the compiler. Given this flexibility, the performance numbers reported above for the prototype version of Concert are not surprising: a naive implementation of integer addition as a fully concurrent, dynamically dispatched message send would result in extremely inefficient programs.

Primitive control flow, because of its ubiquity, is not handled through messages. However, nothing in the system prevents the programmer from using Smalltalk-style True and False classes and ifTrue:ifFalse: selectors to do so.

Analysis

The goal of the analysis phase is to determine flow sensitive information by constructing an interprocedural data and control flow graph. The nodes of this graph are program variables created under particular conditions, and the arcs describe the flow of data. Because the values of data (classes or implementation types) affect the flow of control in object-oriented programs, the analysis determines control and data flow simultaneously. Among the types of information analyzed for in this phase are the implementation types of data, the call graph, sharing (aliasing) patterns, constants, stateless methods, and lock subsumption (discussed later). The algorithm, described in detail in the next chapter, is iterative, constructing successively refined approximations.

Transformation

The transformation phase consists of three types of transformations: intraprocedural, which operate within a procedure; interprocedural, which operate between procedures; and whole program. The figure below shows these transformations and the approximate order in which they are applied. The intraprocedural transformations are applied periodically to simplify the program during interprocedural transformation. Cloning, general optimizations for object-oriented programs, and optimizations specific to concurrent object-oriented programs are each discussed in later chapters.

Code Generation

Code generation is the last phase, in which the graph-based intermediate form is converted into an executable form. First, the control flow graph is reconstituted from the partial order of execution specified by the CDG and data dependences. Next, SSA assignments are moved from the conditionals, where they are attached for convenience of transformation, into the CFG. Touches are then inserted to enforce data dependences. The hybrid execution model (described in a later chapter) requires separately optimized sequential and parallel versions of methods to be created. This is accomplished by translating the program nodes in the CFG into RTL. Registers are allocated over the RTL, which is finally written out as a set of macros. These macros are converted by the runtime interface preprocessor into C, which we use as a very slow but portable assembler.

[Figure: Order of Transformations. From analysis, the whole program transformation (cloning) is followed, in feedback, by interprocedural and intraprocedural transformations: write-once globals, global constant propagation, constant folding, invariant lifting, if-conversion, global common subexpression elimination, inlining, access region extension, strength reduction, and dead code elimination, leading to code generation.]

Runtime

The runtime manages processor and memory resources. It provides functions for communication, thread management, load balancing, scheduling, and memory management, including global garbage collection. The interface, described earlier, is given as a set of m4 and cpp macros over a linkable library. The runtime is written in C and assembly language and consists of approximately twenty thousand lines of code.

Target Machines

Three different target platforms are available for executing Concert programs and are used in the evaluation of the compiler. The first is SPARC-based uniprocessor workstations from Sun Microsystems Computer Company, used to evaluate sequential efficiency. The second is the SPARC-based distributed memory Connection Machine CM-5 from Thinking Machines Corporation. This machine shares the same instruction set architecture with the workstation and includes direct processor-to-processor messaging. The last platform is Cray Research's T3D, which provides hardware support for a shared address space. These last two platforms are used to evaluate parallel overhead and performance.

SPARC Workstation

The SPARC workstation which we will use for evaluation is a SPARCstation with SuperCache. This machine contains a SuperSPARC-II processor with a one megabyte second-level cache, memory on the MBus, and peripherals on the SBus. The SuperSPARC-II is capable of issuing three instructions per clock cycle; it has a 20-Kbyte instruction cache, a 16-Kbyte data cache, and a 64-entry TLB with hardware page-table walking, and it is fully SPARC Version 8 compliant. The workstation runs Solaris, and the tests were conducted in single-user mode.

Thinking Machines Corporation Connection Machine CM-5

The Connection Machine CM-5 used is a SPARC-based machine at the National Center for Supercomputing Applications (NCSA). It is a distributed memory parallel machine with a memory-mapped network interface residing on the main memory bus (MBus). Five-word packets are injected into the network by storing the data and destination node into designated addresses; likewise, packets are received by polling a designated address. The processors are single-issue SPARC units with an external cache. The network is a fat tree. While the CM-5 contains vector units, Concert does not make use of them.

Cray Research T3D

The T3D machine used is a DEC Alpha-based machine at the Pittsburgh Supercomputing Center. It is a distributed memory parallel machine with hardware support for a global address space: individual processors can read or write the address space of other processors directly. The runtime system builds an efficient messaging layer upon hardware atomic swap, write, and prefetch queues. The processors are dual-issue superscalar units with an on-chip, direct-mapped data cache. The network is a three dimensional torus; its cross-section bandwidth depends on the exact topology.

Related Work

Instead of surveying the breadth of compiler research, this section concentrates on those systems which either tackle similar problems in terms of programming and execution model, or which employ related optimization strategies. In the area of pure dynamically-typed languages, the various versions of the Self system are notable for their approach, which consists of capturing and preserving dynamic information. Earlier versions of the system, described by Chambers, guessed the types of data objects and inserted run time checks to verify the guesses. Inlining was used to increase the dynamic range of these checks, and transformations preserved the information. This approach proved brittle. In later versions of the Self system, Hölzle used profiling information to produce more robust performance. The Cecil system tackles the same problem by examining the class hierarchy to find methods which are not overridden, and by using profiling information to specialize multiple-dispatch object-oriented programs.

In the realm of concurrent object-oriented programming, there have been a number of systems targeted specifically to fine-grained concurrency. The CST (Concurrent Smalltalk) compiler is for a language largely similar to CA, but it did not perform global restructuring; similar work concentrates largely on high performance runtime facilities. The HAL system originally provided only simple translation; recent versions have provided type inference, specialization of Actors constructs, and an efficient runtime system. However, direct comparison is difficult because of differences between programming models. Earlier systems concentrated on efficient runtime systems and mappings to the hardware, or on language issues.

Summary

The Concert philosophy is that programmers should be concerned with natural expression of their programs, and the programming system should produce an efficient implementation. The goal of the Concert project is to produce portable, high performance implementations of concurrent object-oriented languages on parallel machines. The compiler embodies a compilation framework with five phases: parsing and semantic processing, program graph construction, analysis, transformation, and code generation. During these phases the program is translated into several intermediate forms, including an Abstract Syntax Tree (AST), a simple core language, a Control Flow Graph (CFG), a Program Dependence Graph (PDG), and Register Transfer Language (RTL). In addition, the program undergoes a translation to Static Single Assignment (SSA) form. The analysis and transformation phases include feedback loops and operate on the PDG in SSA form. A runtime system provides an abstraction of the hardware and is used as the target for the compiler. Tests of the Concert system are conducted on SPARC workstations, the Thinking Machines CM-5, and the Cray T3D.

Chapter

Adaptive Flow Analysis

"You can't prove anything about a program written in C or FORTRAN. It's really just Peek and Poke with some syntactic sugar."
Bill Joy, The UNIX-Haters Handbook

Control and data flow information forms the basis of program transformation. Without this information it would be impossible to know whether or not a given transformation preserved the meaning of the program. Object-oriented programs are particularly difficult to analyze because the meaning of a statement depends on context, in particular on the classes of the objects involved. This chapter presents a context sensitive interprocedural analysis which adapts to the structure of the program to efficiently derive flow information at a cost proportional to the precision of the information obtained. Moreover, the results are applicable to such optimizations as static binding, inlining, and unboxing.

This chapter is organized as follows. We first describe basic flow analysis and define notation, then introduce adaptive analysis. The contour abstraction and basic algorithm are presented next and extended to the adaptive algorithm. Recursion and termination are then discussed, followed by the implementation and empirical results. Finally, we cover related work and summarize.

Background

Object-orientation increases both the importance of flow information and the difficulty of acquiring it. Object-orientation introduces an additional level of indirection: at an invocation site, the code executed depends not only on the selector (generic function name) but on the class of the target object as well. Thus, not only do data values depend on the flow of control, but the flow of control depends upon the values of data: the classes of objects and the values of selector (function) variables. To optimize the program we need both good data flow information and good control flow information. Likewise, to analyze the program, flow analysis must simultaneously derive control and data flow information.
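The following generic C++ fragment (an illustration, not Concert source) shows the circularity: the code executed at the call depends on the class of the receiver, which in turn depends on the data flow reaching the call.

    #include <cstdio>

    struct Shape  { virtual double area() const = 0; virtual ~Shape() {} };
    struct Circle : Shape { double area() const override { return 12.57; } };
    struct Square : Shape { double area() const override { return 4.0; } };

    double report(const Shape* s) {
      // Control flow (which area() body runs) depends on the class of *s,
      // and the class of *s depends on the data flow reaching this call.
      return s->area();
    }

    int main() {
      Circle c; Square q;
      std::printf("%f %f\n", report(&c), report(&q));
      return 0;
    }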

Constraint-based Analysis

Context sensitive flow analysis is a simultaneous control and data flow analysis which combines elements of abstract interpretation and data flow analysis. The flow graph is constructed by abstract interpretation, and the approximations of data values are updated by propagation along the edges of the graph. Such techniques have been called constraint-based because the flow graph resembles a constraint network, where the edges are constraints and the nodes are variables. For example, a flow analysis to determine the set of classes whose instances a variable might take on generates flow edges when an object of class C is created, indicating that the result must be in a set containing at least C, i.e. {C}. Using [[v]] to denote the set of classes for variable v, and N_v and N_C to denote the flow graph nodes for v and C respectively, the constraints and corresponding flow edges for object creation and assignment statements are:

program text    constraint       flow edge
x = new C       [[x]] ⊇ {C}      N_C → N_x
x = y           [[x]] ⊇ [[y]]    N_y → N_x
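A minimal sketch of the propagation implied by these two constraints follows (illustrative C++; the node names and structure are assumptions, not the thesis's implementation, which is written in Common Lisp):

    #include <cstdio>
    #include <map>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
      using Node = std::string;
      // Flow edges: x = y yields N_y -> N_x; z = x yields N_x -> N_z.
      std::vector<std::pair<Node, Node>> edges = { {"y", "x"}, {"x", "z"} };
      std::map<Node, std::set<std::string>> cls;
      cls["y"].insert("C");  // y = new C  =>  [[y]] contains C

      // Propagate class sets along flow edges to a fixed point.
      for (bool changed = true; changed; ) {
        changed = false;
        for (const auto& e : edges)
          for (const auto& c : cls[e.first])
            if (cls[e.second].insert(c).second) changed = true;
      }
      for (const auto& c : cls["z"])
        std::printf("[[z]] contains %s\n", c.c_str());  // prints: [[z]] contains C
      return 0;
    }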

When an invocation site is encountered during construction of the flow graph, the data flow values at the invocation site are used to conservatively approximate the methods (functions) which may be invoked, based on the dispatch semantics of the language. For example, in a single-dispatch object-oriented language, the methods are determined by the data flow values of the selector and of the target object argument (the object to which the message was sent; in C++, the o in o->method()). The flow graph computes an approximation to these reaching problems. For example, given an invocation site with reaching selectors S and a target object with reaching classes C, flow edges are constructed for all methods in the cross product S × C, allowing for inheritance.

Since abstract interpretation of the invocations (construction of the flow graph) uses the data flow values (the current solution) to determine the methods invoked, the data flow values are updated concurrently with flow graph construction. (This approximation is only safe when the analysis is complete, which requires the analysis to track changes to the approximation.) The meet function for the data flow values is union, and the data flow transfer functions are modeled by constraints on the values derived by abstract interpretation. For example, the constraint on the set of class names Class for the flow graph node o, representing the target object with incoming data flow arcs from nodes o_i, at an invocation site with possible methods M ⊆ S × C, would be:

    Class(o) = ∪_i Class(o_i) ∩ { c | (s, c) ∈ M }

Likewise, constants, primitive functions, and tests for equivalence with singleton objects like NIL produce constraints which affect the data flow values at nodes.

In the theory of ow analysis the language to b e analyzed is rst given an exact semantics

which is essentially an interpreter The contours in such a semantics represent the call stack

and determine variable bindings For practical analysis cost and precision are balanced by

using abstract contours which represent some set of exact contours A contour abstraction

can therefore vary from coarse one contour p er metho d to ne one contour p er call frame

Since ow graph no des are created for each lo cal variable for each contour a separate context

sensitive solution is obtained for each calling context represented by a contour Thus contours

determine b oth the complexity cost of the analysis as well as the precision

For example, 0th-order Control-Flow Analysis (0CFA) uses one contour per method, while 1CFA uses one for each call site. With each additional level of caller context sensitivity (defined below), the precision of the information obtained increases, but the cost increases as well. In general, if we take a to be the average number of invocation sites for each method and S to be the number of statements in the program, the cost of N-CFA is O(S · a^N), i.e., exponential in N. Thus a fixed contour abstraction is cost effective only for very small levels of caller sensitivity.

Program Representation

The program is represented in Static Single Use (SSU) form, described earlier. This form creates a unique variable for each assignment under each set of local control flow conditions. Without these additional variables, constraints could not be safely applied to limit the values propagated through a node (variable), since a constraint might apply only to the variable under one set of control flow conditions. For example, on the left side of the figure below, the variable a variously holds instances of class A and an integer, to which are applied geta and + respectively. If the transfer function required that the type of a contain only those classes to which both geta and + can be applied, analysis would incorrectly report that no type could be found for a. SSA conversion prevents this problem by creating new variables for each assignment of a.

    Assignments under                Uses under
    different conditions:            different conditions:
    if (...)                         a = ...
      a = new A                      if (...)
      ... a.geta() ...                 ... a.geta() ...
    else                             else
      a = 1                            ... a + 1 ...
      ... a + 1 ...                  ... a ...

Figure: Analysis on Static Single Use Form

A more common problem is presented by the use (reading) of a variable under different conditions. In order to prevent these conflicts, SSU form creates new variables for each use along different control flow paths. For example, on the right of the figure above, the two methods geta and + are applied to a variable in the two branches of a conditional. As with the two assigned values, a cannot be safely restricted to classes supporting both + and geta. Instead, the a in the left and right branches are restricted separately, and the a following the conditional has the meet (union) value of the two restricted a's. The result is that the final value may be of a class which supports either + or geta.

The key to the precision of context sensitive ow analysis is the contour abstraction the

mapping of contours to the invo cation environments of a metho d Many dierent mappings

are p ossible and discovery of ecient contour abstractions sp ecic to individual programs is

the sub ject of this chapter Clearly only those contours which represent usefully distinct

environments need b e created Assume that the goal of ow analysis is to determine the classes

of ob jects contained in variables and the co de executed at invo cation sites see Section

for discussion of other ow problems Environments are comp osed of the values of the ar

guments at the time of the invo cation The classes and metho ds asso ciated with arguments

therefore represent a useful basis of distinction For example the Cartesian Pro duct Algorithm

dierentiates contours based on the crosspro duct of the classes of the arguments Like

wise Jagannathan and Weeks dierentiate contours based on the abstract values of the

arguments

The Cartesian Pro duct Algorithm and Jagannathan and Weeks solutions determine the

contour abstraction immediately That is in a single pass of analysis at the p oint where

a contour is required the abstraction is selected and xed This has two problems First

the class of an argument do es not capture all of its useful distinctions The class may b e

p olymorphic Section so that dierent ob jects of that class may have instance variables

containing ob jects of dierent classes For example a instance of the List class may b e a

list of integer a list of oating p oint numb ers or a list of lists In order to capture these

distinctions the complexity of the domain of values must b e increased instead of just classes

ob jects might b e represented by their class and the classes of their instance variables This could

b e extended some numb er of levels and even to recursive and conditional typ es incurring

the concomitant cost

The second problem is that object-oriented languages are imperative: objects are bits of potentially aliased (albeit encapsulated) state. This means that the alias structure of the entire store (all data) might affect the classes of objects accessed within a method. Thus, the problem of determining the code executed at invocation sites in object-oriented programs is equivalent to the alias analysis problem. A typical method for finding safe approximations of the may-alias problem is to summarize groups of objects based on their creation point. This technique is applicable to flow analysis of object-oriented programs; for example, one approach differentiates objects by the statement at which they were created. However, this is simply a fixed level abstraction, which can be logically extended to N levels by differentiating objects based on the callers of the method containing the creation statement, increasing cost exponentially.

Our solution is to use a flexible abstraction which can be extended to different levels for different parts of the program being analyzed. This enables efficient analysis because, while it does not decrease the exponent, it does break up the cost into components. For example, the cost for N-level caller differentiation, where each i is a region of local extension, falls from O(S · a^N) to

    Σ_i S_i · a_i^(N_i)

Each S_i corresponds to a set of polymorphic methods and each N_i to the depth required to analyze S_i. The advantage of adaptation is in the precision with which invocation sites are mapped to the methods which might be invoked. When the number of such methods is close to one, a_i is close to one, bringing the cost of analysis close to O(S), i.e. linear, and enabling the invocation sites to be statically bound in the implementation.
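As a purely illustrative calculation (the figures are assumptions, not measurements from the thesis), suppose S = 10,000 statements and a_i = 3 invocation sites per method everywhere. Uniform 2CFA would cost on the order of S · a² = 90,000 units, whereas if only a single polymorphic region of S_1 = 500 statements requires depth N_1 = 2 and the remaining 9,500 statements are analyzed at depth 0, the adaptive cost is about 9,500 · 3⁰ + 500 · 3² = 14,000 units, roughly a sixfold saving.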

Adaptive Analysis Overview

"It's quite possible to have a mixed abstraction, where the features of the program determine whether a precise, expensive abstraction will be used for a given contour or an approximate, cheap one will be used instead. In fact, we could employ a kind of iterative deepening strategy, where the results of a very cheap, very approximate analysis would determine the abstractions used at different points in a second pass, providing precise abstractions only at the places they are required."
Olin Shivers, Thesis Extensions

Adaptive analysis proceeds stepwise, by analysis and extension of the contour abstraction. The high level driver is given in the figure below. The reason that these two steps are performed separately is that the structure of the flow graph is determined both by the contour abstraction and by the data flow values, which in turn are determined by the structure of the flow graph. Thus, changing the contour abstraction during an analysis step would either not increase precision (if the old, conservative data flow values were preserved) or require invalidating and recomputing affected flow graph values and structure. By delaying changing the contour abstraction until the iteration has finished, we can use all the information produced to guide the changes.

The Analysis Step

Adaptive flow analysis consists of iteratively applying two steps: analysis and incremental precision extension. The analysis step constructs the flow graph while maintaining the updated data flow values at the nodes.

    void analysis_driver() {
      do {
        analysis_step();
        extension_step();
      } until (no further extension is made);
    }

Figure: Adaptive Analysis Driver

SSU form, which induces an explicit local data flow graph, simplifies this construction relative to other analyses. In order to further simplify the algorithm, we assume that instance variables are accessed via accessor methods and that no variables are captured from surrounding scopes.

Definitions

The definition of the flow graph appears in the figure below. Each constant, expression, and definition in the program is associated with a Label. Local variables, which are assigned only once, are associated with the expression which assigns them. Instance variables, which can be assigned multiple times, are associated with the label of their definition. These labels are used to uniquely identify the flow graph Node representing the corresponding variable, and to represent selectors and classes in data flow values.

    N ∈ Nodes = P(Node)                   n ∈ Node = Label × Contour
    E ∈ Edges = P(Edge)                   e ∈ Edge = Node × Node
    C ∈ Contours = P(Contour)             c ∈ Contour = N
    V ∈ Values = Node → Value             v ∈ Value = P(Node)
    R ∈ Restricts = Contour → Restrict    r ∈ Restrict = Value^n → Value
    I ∈ Invokes = Node → Invoke           i ∈ Invoke = P(Contour)

Figure: The Flow Graph

Contours are unique identifiers representing abstract calling environments; we use the natural numbers N, where 0 is the top level environment. The Value of a node is the set of nodes representing the values (constants, selectors, or object contours) which reach that node. Each contour Restricts the values its parameters can take on; these restrictions represent qualities of the set of abstract calling environments which the contour represents (see below). The Invokes function represents the abstract call graph by mapping invocation nodes to invoked contours.

The algorithm uses several functions to move around on the flow graph and extract information. Flow and Back move along the edges of the flow graph, taking a node and returning the set of nodes in the forward and backward data flow directions respectively. Selectors takes a node and returns the set of selectors (labels of generic function names or primitive functions) in its value; that is, it finds the value of the node with respect to the reaching-selectors problem. Likewise, Class takes a node and returns the set of labels for class names or primitive classes, and Object returns the set of contours for the constructor methods of objects referenced by the node. Constants are defined in all contours, so they are represented arbitrarily by the contour for the top level environment.

    Flow(n)      = { m | (n, m) ∈ E }
    Back(n)      = { m | (m, n) ∈ E }
    Selectors(n) = { l | (l, c) ∈ V(n), l is a primitive function or selector }
    Class(n)     = { l | (l, c) ∈ V(n), l is a primitive class or class name }
    Object(n)    = { c | (l, c) ∈ V(n), l is a primitive class or class name }
    Name(v)      = { l | (l, c) ∈ v }

Figure: Functions on the Flow Graph

Analysis Step Driver

Each analysis step begins by placing an interprocedural call graph edge (Edge) for the program entry point onto a worklist of Edges. As each Edge is processed, new Edges are found to be reachable and placed on the worklist. When the worklist is empty, the analysis step has finished. The figure below contains the pseudo-code for the analysis step driver.

    void analysis_step() {
      while (worklist is not empty) {
        extract edge from worklist;
        if (edge has no contour) {
          find contour for edge;
          create local flow graph;
        }
        attach edge caller to callee;
      }
    }

Figure: Analysis Step Driver

For each call edge, if the edge does not have a contour from a previous analysis iteration, a contour is selected and the flow graph representing the local data flow inside the called method, with respect to this contour, is constructed. The local flow graph is attached to the flow graph of the calling method at the parameters and return values. Then this local flow graph is attached to the global flow graph at global variables. Global variables are not unique with respect to method contours, and are represented by a single flow graph node. If the contour is not new, only the connections for the parameters and return value need be created, since the existing local flow graph will be used to summarize both invocations.

Local Flow Graph Construction

The local flow graph consists of nodes representing the local, instance, and global variables, and arcs representing data flow between them. The nodes of this graph are defined in the table below. The node for a local variable is determined by its label and the method contour. The nodes for an instance variable are determined by the Object contours of self, the target object of the accessor containing the instance variable. Global variables are unique, and are determined by the contour representing the top level environment.

local      (l, c)    where c is the method contour
instance   (l, o)    where o ∈ Object(self, c)
global     (l, 0)    where 0 is the top level environment

Table: Local Flow Graph Nodes
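In code, node identity under this scheme might look like the following (illustrative C++; the types are assumptions, not the thesis's implementation):

    #include <map>
    #include <set>
    #include <utility>

    using Label = int;    // unique id of a definition or expression
    using Contour = int;  // abstract calling environment; 0 is the top level
    using NodeKey = std::pair<Label, Contour>;

    struct NodeData { std::set<int> value; };  // data flow value, etc.

    std::map<NodeKey, NodeData> nodes;

    // A local variable: its label under the method contour.
    NodeData& local_node(Label l, Contour method) { return nodes[{l, method}]; }
    // An instance variable: one node per object contour of self.
    NodeData& instance_node(Label l, Contour object) { return nodes[{l, object}]; }
    // A global variable: unique, defined in the top-level environment.
    NodeData& global_node(Label l) { return nodes[{l, 0}]; }

Keying nodes by (Label, Contour) is what gives each calling context its own copy of each variable, and hence a separate context sensitive solution.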

The flow graph edges are those induced by data flow, including the SSU assignments and reads and writes of instance and global variables. Data flow to or from an instance variable is construed to be flow to or from all of the nodes representing that instance variable.

Constraints are imposed on the values of local variables based on how the variables are used. There are three types of local constraints: those for constants, those induced by the use of a variable in a dispatch position, and those induced by primitive functions. Constants are constrained to take on the appropriate abstract value; for example, integers, floating point numbers, and strings are all immutable, so we say that they are created a priori in the top level environment, and their Class values must include integer, float, and string respectively. Objects in the dispatch position are constrained to be of a class supporting the selectors which are applied to them, as described above. Primitive functions can impose arbitrary constraints on the classes of their parameters. The table below provides some examples of local constraints.

1                    Class((1, 0)) ⊇ {integer}
1.0                  Class((1.0, 0)) ⊇ {float}
o.f(...)             Class((o, c)) ⊆ classes supporting f
integer add(a, b)    Class((a, c)) ⊆ {integer}, Class((b, c)) ⊆ {integer}

Table: Local Constraint Examples

Class is a derived quantity, so the constraints are reflected on the Values of nodes. For example, the value of the node for an integer constant must include integer. Likewise, V((o, c)) is constrained not to include any (n, c') such that n is a class name outside the set of those supporting f, independent of the value of c'.

Global Flow Graph

The global flow graph is the collection of local flow graphs interconnected during the analysis step. For each call edge, the set of possibly invoked methods is determined. Then contours are selected to abstract the environments of the invoked methods. These contours impose constraints on the values of arguments in dispatch positions. Next, flow edges are constructed between the nodes in the calling method representing the actual arguments and those in the callee method representing the formal parameters. Finally, any new call edges are added to the worklist.

The called methods are drawn from the applicable methods with the selectors which reach the invocation. Methods are applicable when an object of the appropriate class can reach each argument; that is, for arguments a_i and parameters p_i, the intersection of Class(a_i) and { l | (l, c) ∈ R(p_i) } is non-empty. Likewise, the selectors reaching the selector argument, Selectors of that argument, must include the name of the selector of the method. Selection of contours is discussed in more detail below.

When the values of arguments change, new methods and contours become reachable, requiring edges to be added to the worklist. Since the data flow graph is maintained up to date, new edges may be created anywhere in the global flow graph in response to the addition of a single local constraint. For example, the addition of an integer constant constraint can cause the values of all reachable variables to contain integer, which in turn can cause a large number of new edges to be created for each invocation site where those variables appear in the dispatch position. Thus, each time the value of a node changes, all affected invocations must be examined and any new edges added to the worklist. (Our implementation attaches a list of such invocation sites to each node.)
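A sketch of this bookkeeping (illustrative C++; names are assumptions): each node records the invocation sites whose dispatch reads its value, so a change re-examines exactly the affected sites.

    #include <set>
    #include <vector>

    struct Invocation;  // an invocation site in the flow graph

    struct FlowNode {
      std::set<int> value;                  // current abstract value (labels)
      std::vector<Invocation*> dependents;  // invocation sites reading this node
    };

    std::vector<Invocation*> worklist_sites;

    // Add a label to a node's value; if it is new, every dependent
    // invocation site must be re-examined for newly feasible call edges.
    void add_value(FlowNode& n, int label) {
      if (n.value.insert(label).second)
        for (Invocation* site : n.dependents)
          worklist_sites.push_back(site);
    }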

Restrictions

Constraints imposed on parameter nodes correspond to the feasible call edges under the dispatch semantics (described above) and the contour abstraction (Restricts). Thus the values of the parameters are subject to the restriction V(p_i) ∩ R(p_i). For example, the figure below shows, on the left, a polymorphic method which is applied to two arguments which could be either integers or floating point numbers. On the right are several different sets of contour restrictions: the first provides a single contour covering all cases, the second is more specific, and the last provides for all four possible combinations of classes. Thus restrictions enable the use of separate contours for different combinations of values. Since any given variable can only hold one value at one time, separate analysis for different combinations is safe so long as each element of the cross product of values is represented by some contour (see below). This is achieved in an alternative contour representation by single-value based analysis of curried functions.

[Figure: Restrictions. On the left, a method c = f(a, b), where each of a and b is assigned an integer (i) in one branch of a conditional and a float (f) in the other. On the right, candidate contour restriction sets: a single contour ({i,f}, {i,f}); two contours ({i}, {i,f}) and ({f}, {i,f}); or four contours ({i}, {i}), ({i}, {f}), ({f}, {i}), and ({f}, {f}).]

Imprecision and Polymorphism

To simplify the exposition of the algorithm, we differentiate method imprecision from object imprecision. Imprecisions are flow graph nodes whose values are not singleton sets. Method imprecisions are those of nodes defined by the surrounding method's contour (local variables). Object imprecisions refer to nodes defined by object contours (instance variables). Imprecisions can result from a number of sources, including incomplete input, flow insensitivity, and, for mutable locations, temporal insensitivity. This analysis focuses on the second sort, which often results from the use of polymorphic methods or objects. The level of polymorphism is the depth of the polymorphic method call path or polymorphic reference path (see the figures below). 0CFA handles no polymorphism, 1CFA handles one level, and this algorithm adapts to handle different levels of polymorphism in different areas of the program.

[Figure: Polymorphic Objects. A tuple class with slots l and r and an accessor left; two instances, a and b, are created holding elements of different classes, making the instance variables of tuple polymorphic.]

[Figure: Method Polymorphism. A recursive method power(x, y), computing x * power(x, y - 1) down to a base case one(x), is applied through a.left and b.left to arguments of different classes, making power polymorphic.]

Adaptation

Adaptive flow analysis uses the results of the previous iteration (starting with 0CFA) to extend the contour abstraction for the next iteration. After each iteration, the contour abstraction is extended by splitting the set of invocations associated with a contour to differentiate uses of the method or class it represents. A new analysis iteration starts by clearing the values V and the edges E which make up the flow graph. However, the abstract call graph I, which captures the local levels of context sensitivity, is preserved. In this way the analysis adapts to the structure of the program across iterations.

Splitting

Splitting divides contours, increasing the number of flow graph nodes and potentially eliminating imprecisions from the analysis results. Splitting polymorphic methods (method splitting) divides the invocations associated with a method contour over a number of smaller, more specific contours. Splitting polymorphic objects (object splitting) divides the invocations associated with the creation of objects of a particular class over a number of contours representing subsets of the instances which are used in different ways.

The simplest form of splitting relies on argument values, selecting a contour the values of whose formal parameters most closely match those of the invocation arguments. Invocations are processed in order, so the arguments have approximations of their final values when the contours for each invocation are selected. To minimize the number of analysis iterations, this partial information is used to eagerly split method contours, i.e., to select more precise contours to represent the abstract calling environments of new edges. Similarly, we can eagerly split contours representing objects. However, since the selection of object contours occurs at the point where the objects are created, before the instance variables are used, it is generally less effective. Eager splitting occurs as part of contour selection.

Selecting Contours

When an invocation is encountered, the set of applicable methods is determined and the contours are selected. There are two goals for contour selection. First, the contour should not be overly general; that is, the contour should not be used to abstract invocations for which different information is available, since a conservative approximation of the information of all invocations will be used for analysis of the contour. Second, contours should be shared whenever possible. Clearly we could satisfy the first goal by selecting a new contour for every invocation, but then the analysis would never terminate. Sharing contours effectively is the key to efficient analysis.

For a given target method, the conditions of dispatch induce constraints which are applied to the values of the arguments to determine the values which will flow into the parameters for this invocation. While a contour could be created for each element of the cross product of entering values, w = ∏_i v_i, this would be expensive and in general prevent termination. Instead, we select contours based on information from the last iteration, and then eagerly split contours based on the Label component (Selectors or Class) of the argument values, leaving splitting based on the contour component (Object) of the values to be done non-eagerly. For example, the figure below shows three invocations (left) with different argument values (right). The argument values of the first two invocations differ only in the contour component, so they will share the same contour, while the value of the first argument of the last invocation has a different label (the class B) and will not share the same contour.

    func(new A, new A)     {(A, c1)}, {(A, c2)}
    func(new A, new A)     {(A, c3)}, {(A, c4)}
    func(new B, new A)     {(B, c5)}, {(A, c6)}

Figure: Selecting Contours

The contours for an invocation from node n are selected in three steps, given in the pseudocode below. First, from the cached contours I(n), we select those whose restriction cross product ∏_i R(c)_i intersects w, favoring those which intersect the smallest number of elements, and remove those elements from w. For any remaining elements of w, we select from all contours associated with the method those whose restrictions intersect w. Finally, we form subsets out of any remaining elements by applying Name to each parameter value, ∏_i Name(v_i), and create contours for each identical result, with the singleton Names as restrictions. Intuitively, these contours are insensitive to the particular contours reaching their parameters, but are eagerly differentiated with respect to the names of the methods or classes reaching those parameters.

As a special case, a unique contour is always created for accessor methods for each Object value of the node representing the target object. This simplifies the exposition of the algorithm by eliminating the boundary case where accessor method contours require method splitting to isolate accessor operations to particular object contours. This enables a cleaner separation between method and object contour splitting.

    select_contours:
        for an invocation node n with arguments a_i
            w = ∏_i v_i   where v_i = V(a_i)
            for each c ∈ I(n) such that w ∩ ∏_i R(c)_i is not empty
                select c
                w = w − ∏_i R(c)_i
            while w is not empty
                select c, a new contour with restrictions NAME(w_0) for some w_0 ∈ w
                w = w − { x | NAME(x) = NAME(w_0) }
        where NAME(x) = ∏_i Name(x_i)

Figure: Selecting Contours (pseudocode)
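The following C++ sketch renders the same selection loop executably. The encoding of values as (name, contour) pairs and of restrictions as sets of name tuples is an assumption made for illustration, and the preference for contours intersecting the fewest elements is omitted.

    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    // An abstract value element: a class or selector Name plus its contour
    // component; w is a set of tuples, one element per argument position.
    using Elem  = std::pair<std::string, int>;   // (Name, contour id)
    using Tuple = std::vector<Elem>;

    // A contour's restriction, compared by Name only: contours are
    // insensitive to the contour component reaching their parameters.
    struct Contour { std::set<std::vector<std::string>> restriction; };

    static std::vector<std::string> names(const Tuple& t) {
        std::vector<std::string> ns;
        for (const Elem& e : t) ns.push_back(e.first);
        return ns;
    }

    std::vector<Contour*> select_contours(std::vector<Tuple> w,
                                          std::vector<Contour*>& cached) {
        std::vector<Contour*> selected;
        // Steps 1-2: reuse existing contours whose restrictions intersect
        // w, removing the covered elements from w.
        for (Contour* c : cached) {
            bool used = false;
            for (auto it = w.begin(); it != w.end();) {
                if (c->restriction.count(names(*it))) { it = w.erase(it); used = true; }
                else ++it;
            }
            if (used) selected.push_back(c);
        }
        // Step 3: one fresh contour per distinct Name tuple left in w,
        // with the singleton Name tuple as its restriction.
        while (!w.empty()) {
            std::vector<std::string> n0 = names(w.front());
            Contour* c = new Contour{{n0}};
            std::vector<Tuple> rest;
            for (const Tuple& t : w)
                if (names(t) != n0) rest.push_back(t);  // keep non-matching
            w.swap(rest);
            selected.push_back(c);
            cached.push_back(c);
        }
        return selected;
    }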

Method Contour Splitting

Splitting method contours enables separate information to be obtained for different uses of the method. The idea is to examine the data values after an iteration, find situations where the caller argument values of a call edge are more precise than the values of the corresponding callee parameters, and build new contours for those cases. For example, if the value of one of the arguments for a particular invocation is a strict subset of all other corresponding arguments' values, a new contour is created for that invocation. In the new contour, the corresponding formal parameter will have the more precise subset value.

Since the domain of values is recursive, splitting for every difference produces a nonterminating analysis. Moreover, not all differences in argument values are meaningful. For example, object contours for a particular class definition may distinguish subsets of the class's instances which are important to only a fraction of methods. So instead of splitting for every difference in values, we start from a specific imprecision which we wish to eliminate (e.g., where a type check, boxing operation, or dynamic dispatch would be required) and look for the imprecisions which caused it. This goal-driven analysis also has the advantage that resource use scales with the precision demanded.

Using the functions described earlier, we traverse the flow graph. Starting from the point of imprecision, we look back for a set of confluences of values (in data flow terms, meets a ⊓ b where a ≠ b). Given a node n with imprecise Val (one of Selectors, Object, or Class), we find the least set of confluences Conf(n, Val) which obeys:

    Conf(n, Val) = { n }                                 if ∃ n′ ∈ Back(n) . Val(n′) ≠ Val(n)
    Conf(n, Val) = ∪ { Conf(n′, Val) | n′ ∈ Back(n) }    otherwise

The contours of parameter nodes in Conf(n, Val) represent the first order contribution to the imprecision, and splitting them is the first way in which the imprecision may be eliminated. Imprecision can also arise from interprocedural control flow ambiguity, due to secondary imprecisions in other arguments. For example, since the result value is the meet (union) of the results of all the possible methods invoked, if the set of selectors reaching an invocation is imprecise, the result can be imprecise. Similarly, the parameters of the invoked methods might be imprecise as a result of the extraneous flow edges. Lastly, imprecisions in object contours can lead to imprecise results of instance variable accesses. We extend Conf(n, Val) to Conf′(n, Val) to handle these cases, where i is an invocation:

    Conf′(n, Val) = Conf(n, Val) ∪
        { n″ | n′ ∈ Conf(n, Val), |Val(n′)| > 1, n′ is an argument or return variable of i,
               n″ ∈ Conf′(DispatchArgument(i), Class)
                  ∪ Conf′(SelfArgument(i), Object)
                  ∪ Conf′(SelectorArgument(i), Selectors) }

The three occurrences of Conf′ on the right hand side account for the effects of imprecise dispatch arguments, imprecise object contours, and imprecise reaching selectors, respectively.

The newly split contours are distinguished in the next analysis step through changes to the abstract call graph I or through additional restrictions R. (The term confluence is not to be confused with the Church-Rosser property, though both draw on the common definition from the Oxford English Dictionary, second edition: confluence, "A flowing together; the junction and union of two or more streams or moving fluids.") For confluences (Conf(n, Val) nonempty) at a node n = (l, c), we create new contours C′ with identical restrictions, ∀ c′ ∈ C′ . R(c′) = R(c), but with I mapping callers with identical values to separate contours, i.e., V(v ∈ Back((l, c))) ⊆ ∪_{c′ ∈ C′} V((l, c′)). For an imprecision |Val(n)| > 1 at an argument node n at position i, we create new contours with identical invokes, ∀ l ∈ Label . I(l, c′) = I(l, c), and modify the restrictions r ∈ R(c′) to differentiate the elements of Val(n), e.g., so that |V(r_i)| = 1. Different contours will then be selected in the next iteration.
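The backward search for confluences can be sketched as a small graph walk; the integer-set representation of Val and the visited-set cycle guard below are illustrative assumptions, not the implementation's structures.

    #include <set>
    #include <vector>

    struct FlowNode {
        std::set<int> val;              // abstract value under the chosen Val
        std::vector<FlowNode*> back;    // backward flow edges
    };

    // Find the least set of confluences behind node n: walk backward until
    // some predecessor's value differs from the current node's value; that
    // node is a meet point. Visited marking keeps the walk finite on cycles.
    void conf(FlowNode* n, std::set<FlowNode*>& out, std::set<FlowNode*>& seen) {
        if (!seen.insert(n).second) return;              // already visited
        for (FlowNode* p : n->back)
            if (p->val != n->val) { out.insert(n); return; }  // confluence at n
        for (FlowNode* p : n->back)
            conf(p, out, seen);                          // keep looking back
    }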

[Figure: Method Splitting for Integers and Floats. Before splitting, the calls power(1,2) and power(1.0,2) share contour 1, in which x : {integer,float}, y : {integer}, and return : {integer,float}. Method splitting yields contour 1 with x : {integer}, y : {integer}, return : {integer}, and contour 2 with x : {float}, y : {integer}, return : {float}.]

The figure above illustrates method splitting involving the power method from the method polymorphism example. Note that this particular situation would have been taken care of by eager splitting; however, rather than complicate the example, we will simply imagine that it was not. At the left, the actual arguments for the formal parameters x and y coming from power(1,2) and power(1.0,2) have different Classes, so there is a confluence. The imprecision manifests itself in a confluence at the first argument, {integer,float}, when it is clear that the value for the first call is integer and for the second float. Splitting introduces two sets of nodes, (x1, y1) and (x2, y2), eliminating the confluence.

Object Contour Splitting

Object splitting partitions contours based on the usage of the objects they represent. It is more complex than method splitting because the point of confluence (the instance variable) is separated from the point at which the contour was created in the flow graph. In fact, since object contours flow through the graph, splitting the object contour alone is not enough: we must ensure the new contours remain distinct as they flow through the flow graph. Note that this requires splitting intermediate methods, like the identity method, which make no use of the instance variables of the argument. However, these extra contours are coalesced by cloning (next chapter) and do not affect optimization.

[Figure: Object Splitting for Imprecision at l. Before splitting, the creation points tuple(1,2) and tuple(1.0,2) both produce object contour 1 (Object((l1,0)) = {1}, Object((l2,0)) = {1}), so the instance variable l has value {integer,float}. After splitting, the creation points produce contours 1 and 2 (Object((l1,0)) = {1}, Object((l2,0)) = {2}), whose l values are {integer} and {float} respectively.]

The figure above is an example of object splitting based on the polymorphic objects example shown earlier. On the left, the two creation points tuple(1,2) and tuple(1.0,2) produce the same contour. As a result, the value of the instance variable l is {integer,float}. Splitting the object contour discriminates the two cases, producing precise results for both.

Again we start with an imprecision we wish to eliminate. Object splitting involves four operations:

1. Identifying the assignments to the instance variable which give rise to the imprecision.
2. Identifying the paths in the flow graph which the instance variable's contour took from its creation point to the assignments.
3. Ensuring that these paths are distinct.
4. Dividing the object contour into a set of contours.

The first step is to identify the conflicting assignments to the same instance variable with the same contour. Next we find the flow paths from the creation of the contour c which defines the instance variable node n (e.g., n = (l, c)) to the assignments. These paths must be disjoint to propagate the distinct contours we introduce by object splitting, or we will fail to remove the imprecision: all variables after the point where the paths have conjoined will hold both contours, and, for instance, the read accessors will return the union of the values of the corresponding instance variable for both contours. We ensure disjointness by using method and object splitting along the flow paths where necessary. When the paths are disjoint, the conflicting values are assigned into different contours by splitting the original object contour.

    class A {
      a;
      operator=a(value) { a = value; }
    }
    let x = A()       (creation labeled c)
        y = A()       (creation labeled d)
    x.a = 1           (invocation labeled e)
    y.a = 1.0         (invocation labeled f)

Figure: Object Splitting Example

To illustrate the algorithm we will use the example in the figure above. In this example, two instances of class A are created. The write accessor method operator=a is then used to set the a instance variable of each to a different type of number.

Identifying the Assignments. First, the nodes which are assigned (i.e., have a flow edge to) the imprecise instance variable are found. These are grouped so that all the nodes in each group have identical values with respect to the type of the imprecision, indicated by the parameterizing function Val (again one of Selectors, Object, or Class). We define the function AssignSets(n, Val), which takes a node n (an imprecise instance variable) and returns a set s of sets of nodes s_i, each of which represents a different use of the instance variable:

    AssignSets(n, Val) = s   where   ∪_i s_i = Back(n)   and   ∀ s_i ∈ s . ∀ n′, n″ ∈ s_i . Val(n′) = Val(n″)

The nodes in Back(n) are the right hand sides of assignments to the imprecise instance variable. The figure below shows the flow graph for our example and the assignment sets derived.

Identifying the Paths. Next we compute the flow paths which the instance variable's contour took from its creation point to the assignments. For each element a of AssignSets(n, Val), we find the self nodes Self(a) of the accessor methods which contain the assignments a. The Object values of these nodes are the contours c which define the instance variable node n, i.e., n = (l, c). Then we compute the paths taken by the contours from their creation point, the node new(c), to Self(a). These paths must be distinct in order to eliminate the imprecision. Intuitively, if the contours' paths merge, they will be applied to the same methods with the same values.

[Figure: Data Flow and Assignment Sets Example. Name data flow: V((1,0)) : {Integer}, V((1.0,0)) : {Float}, V((value,2)) : {Integer}, V((value,3)) : {Float}, V((a,1)) : {Integer,Float}. Contour data flow: V((x,0)) : {1}, V((y,0)) : {1}, V((self,1)) : {1}, V((self,2)) : {1}, V((self,3)) : {1}. AssignSets(a, Name) = {{value_2}, {value_3}}.]

    a ∈ AssignSets(n, Val)
    Self(a) = { self_c | (l, c) ∈ a }
    Path(a) = Closure(Back, Self(a))

We compute the paths Path(a) for each assignment set a by taking the closure of the function Back over the set containing the self nodes for the methods containing the assignments. The self nodes are found with the function Self(a) by extracting the method contour from a. The paths Path(a) are those which would be taken by a new contour created to eliminate the portion of the imprecision Val(a). For example, in the figure above, contour 1 travels to value_2 through self_2 and x. Since this path must be distinct from the other paths, the appearance of a node on more than one of these paths represents a secondary imprecision. For each node we need to know the subset of paths in which it is contained:

    AllPaths(n, Val) = { p | p ∈ Path(a), a ∈ AssignSets(n, Val) }
    NodePaths(n′, n, Val) = { ps | n′ ∈ ps, ps ∈ AllPaths(n, Val) }

We define the function AllPaths(n, Val) to be all the paths for all the assignment sets. Further, we define NodePaths(n′, n, Val) to be the subset of all the paths on which a particular node n′ lies.
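The path computation itself is a backward closure; a minimal sketch, assuming a node type carrying only backward edges:

    #include <set>
    #include <vector>

    struct FNode { std::vector<FNode*> back; };

    // Path(a): the backward closure of Back over the self nodes of the
    // methods containing the assignment set a. Every node reached lies on
    // the path the object contour took from its creation point.
    std::set<FNode*> path(const std::set<FNode*>& self_nodes) {
        std::set<FNode*> closure = self_nodes;
        std::vector<FNode*> work(self_nodes.begin(), self_nodes.end());
        while (!work.empty()) {
            FNode* n = work.back(); work.pop_back();
            for (FNode* p : n->back)
                if (closure.insert(p).second)   // newly reached node
                    work.push_back(p);
        }
        return closure;
    }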

Ensuring Discrimination. Using the paths determined above, we now apply the confluence finding algorithm recursively to determine the confluences of the potential contours represented by these paths. However, the paths are defined by the assignments and join at the creation point, whereas the other values are distinct when created but join at merges in the flow graph. Thus the path can be thought of as flowing backward in the data flow graph. This requires modification of the Conf function:

    FlowOrBack(Val) = Flow    if Val = Path
                      Back    otherwise

    Conf(n, Val) = { n }                                           if ∃ n′ ∈ FlowOrBack(Val)(n) . Val(n′) ≠ Val(n)
    Conf(n, Val) = ∪ { Conf(n′, Val) | n′ ∈ FlowOrBack(Val)(n) }   otherwise

The new Conf uses the FlowOrBack(Val) function, which is either Back, as before, or Flow when Val refers to the paths. AssignSets requires an analogous change, and the rest of the algorithm is identical.

    Path({value_2}) = { self_2, x, c }
    Path({value_3}) = { self_3, y, d }
    Splittable(1, a, Class) = { {c}, {d} }

Figure: Paths and Splittability Example

Resolving the Imprecision. The last step is the actual splitting of the object contours. When two or more paths do not share any nodes, the contour is split, and a new contour is created for each path or set of paths not sharing nodes. The figure above provides an example of a contour which is determined to be splittable. The new contours will cause the accessor methods, and the node representing the instance variable at the point of the imprecision, to split, removing the imprecision. The function Splittable(c, n, Val) determines the subsets of creation points for contour c which can be profitably split for the imprecision at node n of type Val:

    Splittable(c, n, Val) = { t | t ⊆ s, ∀ n′ ∈ t . NodePaths(n′, n, Val) ≠ AllPaths(n, Val),
                                  ∀ n′, n″ ∈ t . NodePaths(n′, n, Val) = NodePaths(n″, n, Val) }
    where s = { n′ | n′ ∈ p, p ∈ AllPaths(n, Val), Back(n′) = { self_c } }

Using s, the nodes which represent the creation points for the object contour c, we determine the subsets of creation points whose paths are not disjoint. Since the creation points are the ends of the paths, nondisjointness implies equality, and the union of these subsets will be s. Further, since creation points correspond to invocations on the class object (creation function), Splittable(c, n, Val) computes the sets of invocations for the new contours. Thus, if Splittable(c, n, Val) is empty, that is, every creation point lies on all of the paths in AllPaths(n, Val), the object contour cannot be profitably split. When the object contour is split, the newly created contours are substituted for the original in the restrictions for argument nodes along the corresponding paths. Thus the new contours will follow the distinct paths in the next iteration, and their instance variables will be assigned a subset of the values of the original, removing the imprecision.
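A sketch of this grouping of creation points by the exact set of paths they lie on follows; representing paths as node sets is an illustrative assumption.

    #include <map>
    #include <set>
    #include <vector>

    using Node = int;
    using Path = std::set<Node>;

    // Group creation points by the set of paths each lies on; points that
    // lie on every path cannot be discriminated and are dropped, so an
    // empty result means the contour is not profitably splittable.
    std::vector<std::set<Node>> splittable(const std::set<Node>& creation_points,
                                           const std::vector<Path>& all_paths) {
        std::map<std::set<int>, std::set<Node>> groups;  // path-index set -> points
        for (Node c : creation_points) {
            std::set<int> on;                            // which paths contain c
            for (int i = 0; i < (int)all_paths.size(); ++i)
                if (all_paths[i].count(c)) on.insert(i);
            if (on.size() == all_paths.size()) continue; // lies on all paths
            groups[on].insert(c);
        }
        std::vector<std::set<Node>> result;
        for (const auto& g : groups) result.push_back(g.second);
        return result;
    }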

Remaining Issues

In this section we discuss recursion, termination, and complexity issues, and the applicability of this analysis to other data flow problems. Since the contour representation used by adaptive flow analysis is not static but recursively defined, recursion in the program being analyzed requires special handling.

Recursion

Since the definition of contours is recursive, ensuring termination requires limiting the number of contours produced by recursive program structures. There are three types of such structures:

1. Recursive methods
2. Recursive data structures
3. Method-data recursion

The first two types are normal recursive methods and recursive types. The third type represents the case where a recursive method creates objects on which it is later invoked. This is the case for such common programming idioms as insertion into a linked list. While contours are represented by unique identifiers, their uniqueness is determined by their callers (I) and their restrictions (R). The first two types of recursive structures induce other contours by invocation (I), while the third induces them through restrictions (R).

After each iteration, and before splitting, we identify the strongly connected components (SCCs) in the graph whose nodes are the contours and whose edges connect:

    a contour c and the contours it invokes, { c′ | c′ ∈ I(l, c) }
    a contour c and the contours it restricts, { c′ | (l, c) ∈ R(c′)_i }

The SCCs in this graph contain the contours which have a part in defining each other. To prevent nontermination, we do not allow invocations or restrictions between contours in the same SCC to cause splitting. Furthermore, invocations into recursive cycles can also lead to nontermination, as recursive cycles are successively "peeled". These invocations are also prohibited from splitting beyond a constant level (in our implementation, two levels). Allowing invocations on the cycle to split to a constant level enables analysis of recursive structures with a period less than or equal to the constant, since contours can form cycles up to that length.
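The SCC computation itself is standard; a compact sketch using Tarjan's algorithm over an assumed adjacency-list encoding of the combined invocation and restriction edges:

    #include <algorithm>
    #include <vector>

    // Contours are vertices; successor edges combine the invocation (I)
    // and restriction (R) relations described above.
    struct ContourGraph { std::vector<std::vector<int>> succ; };

    struct Tarjan {
        const ContourGraph& g;
        std::vector<int> index, low, scc, stack;
        std::vector<bool> on;
        int next = 0, comps = 0;
        explicit Tarjan(const ContourGraph& graph)
            : g(graph), index(g.succ.size(), -1), low(g.succ.size()),
              scc(g.succ.size(), -1), on(g.succ.size(), false) {
            for (int v = 0; v < (int)g.succ.size(); ++v)
                if (index[v] < 0) visit(v);
        }
        void visit(int v) {
            index[v] = low[v] = next++;
            stack.push_back(v); on[v] = true;
            for (int w : g.succ[v]) {
                if (index[w] < 0) { visit(w); low[v] = std::min(low[v], low[w]); }
                else if (on[w])   { low[v] = std::min(low[v], index[w]); }
            }
            if (low[v] == index[v]) {            // v roots an SCC
                for (int w = -1; w != v;) {
                    w = stack.back(); stack.pop_back();
                    on[w] = false; scc[w] = comps;
                }
                ++comps;
            }
        }
    };

    // Splitting between contours a and b is allowed only across SCCs.
    bool may_split(const Tarjan& t, int a, int b) { return t.scc[a] != t.scc[b]; }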

Termination and Complexity

Termination is ensured by limiting the number of contours produced by recursion. However, since Nodes and Values are defined by labels (program points) and contours, which in turn can be distinguished by their values at each argument, the theoretical complexity is exponential. Nevertheless, in practice we have found the complexity to be related to both the size and the levels of polymorphism of the analyzed program. Furthermore, we have found that the level of polymorphism in programs increases relatively slowly with program size, and the complexity of analysis along with it (see the empirical evaluation later in this chapter).

Other Data Flow Problems

Adaptive flow analysis can be applied to other data flow problems where the precision of deep flow sensitivity is desired. There are two classes of such problems: those that flow forward through only local variables and parameters, and those that flow through instance variables. Since each additional problem adds another dimension to the splitting criteria, additional levels of recursive unfolding are required to prevent the solution of earlier problems from interfering with any unfolding of recursive methods required for additional problems. To prevent such interference, additional problems should be solved in order; that is, all iterations required for one problem should be completed before any splitting is done for any later problem.

For those problems which flow only through local variables and parameters, only method splitting need be considered. If the domain is small, for example the distinction between local and remote objects, the data flow problem can simply be added as a new type of imprecision, i.e., bound to Val as above. If the domain is larger, for example constants, splitting should be limited to those imprecisions likely to be of importance for optimization. Since adaptation is goal driven, integrating optimization criteria is simply a matter of choosing the initial imprecisions.

The second class of problems are those for which flow sensitivity is required for values flowing through instance variables. The resulting information can be used to specialize the memory layout of generic classes. For example, analysis of a list class which shows that, for a subset of creation points, no destructive updates (i.e., set-cdr!) are performed can be used to CDR-code (allocate the contents in consecutive memory locations). Since all imprecisions at instance variables are resolved or become imprecisions in the paths of potential object contours, no special mechanism is required to add additional problems. However, large domains may require the use of optimization criteria.

Implementation

The full implementation of this analysis involved engineering decisions which are of sufficient interest to warrant discussion here. As construed, this analysis makes some simplifying assumptions about the structure of the program (e.g., the use of accessor methods). Also, certain language features are not specifically covered: arrays, closures, and first class continuations. To make the impact of these assumptions more tangible, the main data structures of our implementation are presented and related to the assumptions about program structure. Finally, support for arrays, first class continuations, and closures is discussed.

Data Structures

The four main data structures describe the interprocedural data and control flow graphs. These are flow graph nodes, interprocedural call edges, method contours, and object contours. The two types of contours are separated in the implementation because they require different handling for splitting, recursion, etc. To simplify matters, each method contour is associated with a single object contour; that is, we automatically split the method contour based on the object contour of the target object.

Flow Graph Node

id: The unique identifier of the node, composed of a program variable (label) and method contour and/or object contour.
flow: Set of nodes in the forward direction in the flow graph.
back: Set of nodes in the backward direction in the flow graph.
upper-type: Constraints limiting the type of the variable, e.g., by the use of the variable as the target of a message send.
lower-type: The current estimate of the variable's type, used to determine the applicable methods; derived from object-contours below.
object-contours: The current estimate of where the objects referenced by the variable could have been created.
selectors: The reaching selectors (generic function names).
continued-value: The representative return value nodes for continuations.
path-sets: The paths this variable is on which might carry a needed object contour.
arg-of-message: The invocation nodes which might generate new edges for changes in this variable.

The id is a tuple containing fields for the program variable and both method and object contours, since some variables are determined by only an object contour (instance variables) or neither (global variables). New outgoing edges resulting from changes in the data flow values of arguments in dynamic dispatch positions are triggered by checking the arg-of-message field when a node's value changes.

Interprocedural Call Edge

invoking-statement: The statement from which the edge originated.
source-contour: The method contour from which the edge originated.
object-contour: The object contour of the target object, which determines the contour.
contour: The contour on which the edge is incident.
parameters: The nodes representing the invocation parameters.

The unique identifier of an edge is a tuple (invoking-statement, source-contour, object-contour, selector); that is, it contains the flow sensitive invocation site and the inputs to the dispatch function. This identifier is used to recover the edge in successive analysis iterations from a hash table. Since the recovered edge contains the contour, this table represents the I (Invokes) map described earlier, which is preserved from iteration to iteration.
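In C++, these two records might be rendered as follows; the concrete field types are assumptions based on the descriptions above (the implementation itself, as noted later, was written in Common Lisp).

    #include <cstddef>
    #include <functional>
    #include <set>
    #include <string>
    #include <vector>

    struct MethodContour;
    struct ObjectContour;
    struct InvocationNode;

    // Flow graph node: a program variable plus its method and/or object
    // contour (instance variables have only an object contour, globals
    // neither), with forward/backward edges and current value estimates.
    struct FlowGraphNode {
        std::string variable;                         // program variable label
        MethodContour* method_contour = nullptr;      // may be absent
        ObjectContour* object_contour = nullptr;      // may be absent
        std::set<FlowGraphNode*> flow, back;          // flow edges
        std::set<std::string> selectors;              // reaching selectors
        std::set<ObjectContour*> object_contours;     // creation estimate
        std::vector<InvocationNode*> arg_of_message;  // sites to re-examine
    };

    // Interprocedural call edge, keyed by the flow sensitive invocation
    // site plus the inputs to the dispatch function, so it can be found
    // again in later iterations (the preserved I map).
    struct CallEdge {
        const void* invoking_statement;
        MethodContour* source_contour;
        ObjectContour* object_contour;
        std::string selector;
        MethodContour* callee;                        // contour edge lands on
        std::vector<FlowGraphNode*> parameters;
    };

    struct EdgeKeyHash {                              // hash of the id tuple
        size_t operator()(const CallEdge& e) const {
            size_t h = std::hash<const void*>()(e.invoking_statement);
            h = h * 31 + std::hash<void*>()(e.source_contour);
            h = h * 31 + std::hash<void*>()(e.object_contour);
            h = h * 31 + std::hash<std::string>()(e.selector);
            return h;
        }
    };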

Method Contour

method: The method for which this contour represents a set of activations.
restrictions: The values which can pass into the formal parameters of this contour.
edges: The edges representing invocations on this contour.
invokes-edges: The edges representing invocations from this contour.
creation-points: The nodes at which objects are created within this contour.
recursive-set: The set of recursive cycles containing this contour.

Method contours have an identity which is dictated both by their method and restrictions, as well as by their position in the cached interprocedural call graph, as dictated by edges. The invokes-edges, creation-points, and recursive-set fields are used to cache information used in handling recursion.

Object Contour

creation-points: The nodes representing the context sensitive statements where the objects described by this contour are created.
method-contours: The method contours determined by this object contour.
containing-contours: The object contours in whose instance variables objects with this contour are stored.
recursive-set: The set of recursive cycles containing this contour.

Object contours are uniquely determined by the creation points they represent. The method-contours field is used to quickly determine which nodes may be affected by a split of the creation set during the clearing of values between iterations. It is also used, along with containing-contours and recursive-set, to handle recursion.

Accessors

The analysis as described requires all instance variables to be accessed through accessor methods. The reasons for this are twofold. First, they allow the functions which walk the data flow graph to map from a node back to program statements. Instance variables are not associated with a method contour; hence, without an interposing local variable, it would be difficult to find the methods responsible for a data flow arc between two instance variables. Second, if instance variable accesses were allowed in arbitrary methods, the object contour determining the instance variable node might come from a parameter, a return value, or another instance variable. This would require additional rules, both to generate the flow arcs and in the Conf′ function, to determine the possible cause of an imprecision. Requiring the use of accessors is somewhat conservative. In fact, the implementation supports direct access to instance variables within all methods for single dispatch languages, so long as an intermediate local variable is used for assignments between instance variables.

Arrays

Analysis of the contents of arrays is handled analogously to instance variables. Since the analysis is temporally insensitive for instance variables (all reads and writes are to a single node representing the instance variable, independent of where they take place in the program), and since a node summarizes all instances represented by an object contour, having a single node summarize all accesses to all elements of an array is a safe, conservative technique. That is, the contents of arrays can be analyzed homogeneously, as a single instance variable, using a special Label to represent the array contents. However, the two dimensions of precision, temporal and element-wise, are amenable to additional techniques. An example of temporal sensitivity is covered later.

Element-wise, separate variables can be used to represent subsets of elements. For example, the first class messages of CA are essentially vectors of arguments which can be manipulated as arrays and then sent as messages. These message values are analyzed by using a separate node to represent each argument at a known offset, and a node representing all other elements. Accesses using constant indices are applied to the appropriate element, while accesses with unknown indices are applied to all, including the node for other elements.
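A sketch of this element-wise summarization follows; the lookup interface is an illustrative assumption.

    #include <map>
    #include <vector>

    struct ArrayNode;  // flow graph node summarizing some array elements

    // Element-wise array summary: one node per argument at a known offset,
    // plus a single node standing for all other elements.
    struct ArraySummary {
        std::map<int, ArrayNode*> known;  // nodes for constant offsets
        ArrayNode* others;                // node for all remaining elements

        // A read/write with a constant index touches exactly one node; an
        // unknown index conservatively touches every node.
        std::vector<ArrayNode*> touched(bool index_known, int index) {
            if (index_known) {
                auto it = known.find(index);
                return { it != known.end() ? it->second : others };
            }
            std::vector<ArrayNode*> all;
            for (auto& kv : known) all.push_back(kv.second);
            all.push_back(others);
            return all;
        }
    };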

First Class Continuations

First class continuations are essentially objects which are used to return values to a future. Since they are ubiquitous (used for every return value in our COOP programming model), the implementation uses a special, temporally sensitive mechanism instead of the standard object contour splitting mechanism. The values returned by continuations are represented by secondary nodes attached to the primary node representing the continuation proper. That is, a set of secondary flow graphs is created, running parallel to the flow of the continuation but in the opposite direction. The values in the secondary graph are those to which the continuation is applied, and the standard update mechanism carries them back to the invocation site, where they flow into the return value of the invocation expression.

Closures

Closures are methods which scope mutable variables. The current implementation does not handle closures directly. The primary reason is that COOP languages cannot allow mutable local variables to be scoped by other methods: this would violate encapsulation of local state and allow race conditions unconstrained by the concurrency control mechanisms. (ICC++ supports multiple return values.) However, closures are easily modeled as objects. The method contour of the surrounding scope represents the object contour for the scoped variables. Thus closures could be handled by extending the notion of contours to include a map from the Labels of those variables to Contours, and by using data splitting to extend the precision of captured variables.

Performance and Results

In this section we present an empirical study on a selection of programs.

Test Suite

The test programs are concurrent object-oriented codes written by a variety of authors with differing levels of experience with object-oriented programming. They range in size from kernels to small applications. They all make use of polymorphism for code reuse and abstraction.

    Program       ion   network   circuit   pic   mandel   tsp   richards   mmult   poly   test
    User Lines
    Total Lines

The first three programs simulate the flow of ions across a biological membrane (ion), a queueing network (network), and an analog circuit (circuit). pic performs a particle-in-cell calculation, and mandel computes the Mandelbrot set using a dynamic algorithm. The tsp program solves the traveling salesman problem. richards is an operating system simulator used to benchmark the Self system. The last three programs are kernels representing uses of polymorphic libraries: mmult multiplies integer and floating point matrices, poly evaluates integer and floating point polynomials, and test is a synthetic code which uses a multi-level polymorphic data structure. All the programs were compiled with the standard CA prologue (Appendix D).

Analysis

We implemented three different analysis algorithms: 0-CFA, with one flow graph node per program variable; OPS, with contours distinguished by their immediate caller (i.e., one level of caller-based splitting for methods and objects); and the adaptive flow analysis described in this chapter (AFA). We compared these algorithms based on precision, time complexity, and space complexity.

    Algorithm   Progs Typed   Progs Failed   Type Checks   Runtime (secs)
    AFA
    OPS
    0-CFA

Precision

We use two criteria for precision: typing (assignment of types such that run time type checks are not required) and elimination of dynamic dispatch. In this section we cover the former, leaving the latter for a later section. The table above shows that 0-CFA was unable to type even simple programs. OPS fared little better, typing only three of nine programs. However, AFA was able to type all the programs. The times are for our implementation in CMU Common Lisp/PCL on a Sparc.

Time Complexity

The figure below shows the time taken by the three algorithms, which were implemented in the same framework using identical data structures. Note that the speed of AFA compares favorably to that of OPS in two of the three cases where they were both able to type the program. This is because AFA focuses its effort only on areas of the program where it is required. However, when AFA produces better information, it requires more time.

    Program   Lines   OPS Typed   Time (sec)   AFA/OPS
    ion                  NO
    circuit              NO
    pic                  NO
    tsp                  NO
    mmult                NO
    test                 NO
    network              YES
    mandel               YES
    poly                 YES

Figure: Efficiency of Type Inference Algorithms

Space Complexity

We compare space complexity by examining the number of contours used per method (the number of nodes used by each algorithm is reported in Appendix B). The figure below shows the number of contours required by AFA and OPS as a multiple of the number of methods in the program. While additional contours can result in greater precision, AFA's goal directed splitting reduces the number required for a given level of precision.

[Figure: Contours per Method. A bar chart comparing AFA and OPS, in contours per method (0 to 5), for the programs pic, tsp, ion, circuit, mandel, and network.]

Related Work

Control and data flow information is vital to optimizing compilers for high level languages. It is useful for constant, copy, and lambda propagation; static binding, inlining, and speculative inlining; type recovery; safety analysis; customization; specialization and cloning; and other interprocedural optimizations.

The use of nonstandard abstract semantic interpretation for flow analysis in Scheme by Olin Shivers provides a good basis for this and other work on practical type inference. In particular, the ideas of a call context cache to approximate interprocedural data flow, and of reflow semantics to enable incremental improvements in the solution, foreshadow this work. Recently, Stefanescu and Zhou, as well as Jagannathan and Weeks, have provided simplified frameworks for flow analysis. However, these frameworks are theoretical in nature, with no provisions for managing cost, and are not suitable for a practical implementation.

Iterative type analysis and message splitting using run time testing are conceptually similar techniques developed in the SELF compiler. However, iterative type analysis does not type an entire program, only small regions. Essentially, these techniques simply preserve the information obtained by runtime checks inserted into the code. Later work by Hölzle on the SELF compiler uses the results of polymorphic inline caches (i.e., profiling) to determine likely run time types, inserting type tests to ensure that the expected case actually occurs.

Type inference in object-oriented languages in particular has been studied for many years. Constraint-based type inference is described by Palsberg, Schwartzbach, and Oxhøj. Their approach was limited to a single level of discrimination and motivated our efforts to develop an extendible approach. Agesen extended the basic one level approach to handle the features of SELF. His technique is limited to eager splitting and is incapable of handling polymorphic data structures which are destructively updated, as a result of his single pass approach.

The soft typing system of Cartwright and Fagan extends Hindley-Milner style type inference to support union and recursive types, as well as to insert type checks. To this, Aiken, Wimmers, and Lakshman add conditional and intersection types, enabling the incorporation of flow sensitive information. However, these systems are for languages which are purely functional, where the question of assignment does not arise, and extensions to imperative languages are not fully developed. Lastly, our algorithm shares some features with the closure analysis and binding time analysis phases used in self-applicative partial evaluators, again for purely functional languages.

Summary

Flow analysis of object-oriented programs is complicated by the interaction of data values and control flow information through dynamic dispatch and imperative update of instance variables. This chapter has presented a flow analysis technique which combines simultaneous data and control flow analysis with iterative adaptation to the structure of the program. Essentially, a simpler, less flow and data sensitive analysis is used to determine where more precise analysis is needed. The contour representation (a summarization of stack frames or groups of objects) is extended locally to provide more precision, and the program is reanalyzed. Using a number of COOP programs, this adaptive analysis is shown to be both effective and efficient.

Chapter

Cloning

    Property was thus appalled,
    That the self was not the same;
    Single nature's double name
    Neither two nor one was called.

    Shakespeare, The Phoenix and the Turtle

Cloning is the process of building specialized versions of classes and methods from generic specifications in order to improve efficiency. These versions are specialized with respect to the manner in which they are used by the programmer and/or implemented (their context). In particular, polymorphic classes and methods are specialized into a set of monomorphic classes and methods whose storage maps, dispatch tables, and call bindings are optimized for the corresponding classes.

This chapter presents a cloning algorithm which attempts to maximize optimization opportunities while minimizing code replication. The number of clones is minimized by creating only those dictated by a given set of optimization criteria. Example criteria are minimization of dynamic dispatch and maximization of unboxing within performance critical portions of the code. These clones are shared across the program to limit overall code expansion. The algorithm is both efficient and effective. In our study it produces modest code size increases while statically binding most invocations and, through inlining, eliminating many invocations overall.

The structure of this chapter is as follows. The next section introduces a matrix multiply example which will be used in this and later chapters. We then describe how the contours produced by context sensitive flow analysis (previous chapter) are used to guide cloning, and present a modified dispatch mechanism required to make it possible for the call graph containing the specialized clones to be realized at run time. Selection of clones is covered next, followed by optimization of the clones, including data layout and call bindings. Finally, a study of the performance of the algorithm and its effectiveness for optimization appears at the end of the chapter.

Example: Matrix Multiply

In order to illustrate the cumulative effects of cloning and the later OOP- and COOP-specific optimizations, the figure below contains an example which is threaded through this thesis. The multiplication of encapsulated two dimensional matrix objects was selected because it is simple and illustrates many sources of inefficiency. While not a conventional object-oriented example, this code illustrates both polymorphism and several levels of encapsulation. Moreover, this program could be written cleanly in FORTRAN (or assembly language, for that matter) with good performance. However, our goal is to have both high level abstraction and low level performance. This thesis demonstrates how to build an implementation with the performance and loop structure of the efficient procedural algorithm, while retaining the abstraction and expression of object-oriented programming.

The code in the figure declares the two-dimensional polymorphic array class Array2D and the innerproduct and matrix multiply (mm) methods. The Array2D class encapsulates a contiguously allocated two dimensional array object. The method at accesses an element of that array by the standard technique of linearizing the indices. These methods are then used to multiply two arrays containing integers (aI and bI) into an integer array result (ri), and two arrays containing floats (aF and bF) into a float array result (rf).

Clones and Contours

The result of flow analysis (previous chapter) is context sensitive information, where a context is given in terms of call paths for methods and creation points for objects. More precisely, a context c is a method m invoked from a statement s in context c′ on an object created at statement s′ in context c″. (The recursion in this definition is headed off by having main invoked from a distinguished context.)

    class Array2D : Array {
      rows;
      cols;
      at(i, j);
      at_put(i, j, value);
      innerproduct(a, b, i, j);
      mm(a, b);
    }

    Array2D::innerproduct(a, b, i, j) {
      let result = a.at(i, 0) * b.at(0, j);
      for (let k = 1; k < a.cols; k++)
        result += a.at(i, k) * b.at(k, j);
      at_put(i, j, result);
    }

    Array2D::mm(a, b) {
      for (let i = 0; i < b.rows; i++)
        for (let j = 0; j < a.cols; j++)
          innerproduct(a, b, i, j);
    }

    Array2D::at(i, j) { return self[cols * i + j]; }

    Array2D::at_put(i, j, value) { self[cols * i + j] = value; }

    main() {
      Array2D aI, bI = ...;   // L x L integer matrices
      Array2D aF, bF = ...;   // L x L float matrices
      ri.mm(aI, bI);
      rf.mm(aF, bF);
    }

Figure: Matrix Multiplication Example

For each context, the analysis provides the classes which each variable might point to and, likewise, for the object creation points, the classes of the instance variables of objects created there. Furthermore, the analysis provides an interprocedural call graph over the contexts. The cloning phase uses this information to decide which contexts should be instantiated as unique methods and which sets of objects should be instantiated as unique classes. Essentially, the analysis phase treats the user's methods and classes as a set of uninstantiated templates and determines how they might be instantiated automatically by cloning for both efficient and compact code.

For the matrix multiply example, the analysis determines that the objects aI and bI are of class Array2D and contain integers (call them Array2Dint), and that aF and bF contain floats (Array2Dfloat). Furthermore, it shows that there are two versions of mm: one called on aI,bI which operates entirely on Array2Dint, and another called on aF,bF which operates on Array2Dfloat. Within these two versions of mm the corresponding versions of innerproduct are called, and within them the corresponding versions of at and at_put. Thus the at within innerproduct on Array2Dint is known to return an integer.

This information is sufficient to build specialized, optimized versions of classes and methods. For example, in the figure below, the generic class Array2D is used to hold integers. The method foreach is used to invoke the method for a given selector on each element of the array. By cloning special versions of Array2D and foreach, the code on the right can be generated. The integer elements of the array are unboxed (the class identification tag removed). The loops have been merged and the double operation converted to a shift. Instead of requiring hundreds of dynamically dispatched method invocations, multiplications, and indirections, the operation to double every element requires only one statically bound function call.

    dbl(i) { return i + i; }

    foreach(a, f) {
      for (let i = 0; i < a.rows; i++)
        for (let j = 0; j < a.cols; j++)
          a.at_put(i, j, f(a.at(i, j)));
    }

    Array2D a = ...;
    foreach(a, dbl);

    foreach_dbl(a) {
      for (let i = 0; i < a.rows * a.cols; i++)
        a[i] = a[i] << 1;
    }

    Array2Dint a = ...;
    foreach_dbl(a);

Figure: Cloning Optimization Example

While all the information required to make these transformations is supplied by the analysis, the analysis results cannot be used directly for cloning. First, the natural candidates for replication, contours, are too numerous. Second, contours can be distinguished during analysis by elements of the calling context which are not covered by the standard dispatch mechanism. Thus cloning requires a modification to the dynamic dispatch mechanism, and the power of this mechanism must be balanced against any additional cost. A good cloning algorithm enables efficiency to be balanced with code size.

This is a result both of the manner in which the analysis discovers the program structure and of the relative complexity of the data and control flow information required by analysis as compared to that needed for specific optimizations. That is, what the analysis discovers in one part of the program may not be relevant locally, but may enable optimization of another part of the program.

Overview of the Algorithm

The cloning algorithm proceeds by applying optimization criteria (e.g., maximize static binding) to the contours provided by analysis. These criteria induce partitions over the contours based on the information available for optimization. These partitions are candidate clones. The partitions are then iteratively refined (broken into smaller partitions) subject to the requirement that the call graph of the program represented by the partitions (clones) is realizable; that is, the modified dispatch mechanism must be capable of selecting the appropriate clone for each invocation site. Below we present an efficient modified dispatch mechanism which requires at most one additional instruction, and then cover constructing the initial partitions and their refinement. First we introduce the data structures used by the algorithm.

Information from Analysis

The information produced by analysis is imported by the cloning algorithm as a set of functions. The first figure below provides a terse description of the functions used by the cloning algorithm. Contours are the basic element of context sensitivity: each method contour represents some abstract set of method activations summarized by the analysis, and each class contour summarizes some abstract set of objects analyzed as a unit. Each contour is associated with its particular method or class. Each method contains a set of statements representing invocations (inv_sites) and the creation points of new objects (creation_points). Finally, each method contains a set of variables, and each class contains a set of instance variables.

    method_contours          the set of method contours produced by analysis
    class_contours           the set of class contours produced by analysis
    method(m)                the method associated with method contour m
    class(c)                 the class associated with class contour c
    inv_sites(m)             the invocation statements in method m
    variables(m)             the variables in method m
    creation_points(m)       the object creation statements in method m
    instance_variables(c)    the instance variables of class c

Figure: Primary Cloning Information

Each contour corresponds to a unique set of context sensitive information. For example, each contour determines the types of variables and the binding of invocation sites. The second figure below describes the functions used by the cloning algorithm to access this information. A creation statement and a method contour uniquely determine the class contour of objects created at that statement in that context (created_contour). Likewise, an invocation site and a method contour determine the call binding: whether or not the site can be statically bound. For instance and local variables, the class or method contour, respectively, determines the boxing: whether or not the variable can be unboxed. Finally, the interprocedural call graph (call_graph_edges) is represented as a set of edges for each invocation statement. Each edge represents an invoke of a method contour (the callee). Each edge is further associated with a particular selector (generic function name) and the class contour of the target object (object).

    created_contour(s,m)    the class contour of objects created at statement s in the
                            context of method contour m
    binding(s,m)            the set of methods which might be invoked at statement s in
                            method contour m
    boxing(v,m)             the data layout (e.g., raw integer, raw floating point number,
                            boxed integer, or pointer) required for variable v in method
                            contour m
    call_graph_edges(s)     the set of call edges for statement s
    callee(e)               the callee method contour for edge e
    selector(e)             the selector (generic function name) whose availability at the
                            call site in part induced the edge e
    object(e)               the class contour whose availability at the invocation site in
                            part induced the edge e

Figure: Cloning Optimization Information

The information in these figures represents only a small subset of that which can be determined by our adaptive context sensitive analysis. In general, the optimization criteria can depend on any analyzed property. Also, analysis may distinguish method contours by arbitrary aspects of the calling environment, including the contours from which they were invoked, the types of all the arguments, and other criteria. As a result, a call graph on contours cannot, in general, be realized by the standard dispatch mechanism.

Modified Dynamic Dispatch Mechanism

Cloning modifies the call graph by replicating subgraphs which are then called by only a subset of the previous callers. The information within the cloned subgraphs is then specialized for that subset. If an invocation site within a specialized subgraph can only invoke a single target method, that site can be statically bound (connected directly to the appropriate clone). However, if the invocation site requires a dynamic dispatch, the standard single-dispatch mechanism (i.e., the one used by C++ or Smalltalk) is in general insufficient to distinguish the correct callee clone. The problem is that this dispatch mechanism determines the method to be executed based on the selector (generic function name) and runtime class of the target object, (selector, class), and these are identical for all clones of a given method. The figure below illustrates one limitation.

    class Stream { ... }
    class StringStream : Stream { ... }
    class Shape { ... }
    class Square : Shape { ... }
    class Circle : Shape { ... }

    Stream::print(Shape o) { ... }

    main() {
      Object o = new Circle;
      Stream s;
      if (...) s = new StringStream;
      else s = new Stream;
      s.print(o);
      o = new Square;
      s.print(o);
    }

    CLONE Stream::print(Square o)
    CLONE Stream::print(Circle o)
    CLONE StringStream::print(Square o)
    CLONE StringStream::print(Circle o)

Figure: Limitation of the Standard Dispatch Mechanism

In the figure, the print method in the Stream class takes a single argument o which is either a Circle or a Square. Since the variable s can be either a Stream or a StringStream, the invocation requires dynamic dispatch. However, the standard dispatch mechanism only dispatches on the selector and the class of the target, and hence cannot select between the versions of Stream::print cloned based on the class of the parameter o: one for Square and one for Circle. A more powerful dispatch mechanism is required to handle this case.

To address this problem, an invocation site specific dispatch mechanism is used. Each site is given an identifier which is used during dynamic dispatch to distinguish the appropriate callee clone for each selector and target object type pair. In our example, the invocation site information would allow us to select the version of print for Circle at the first site and that for Square at the second. It should be noted that even multiple-dispatch is not sufficient, since the distinguishing factor could be anything related to the calling environment, including how the return value will be used in the future.

A second problem is that cloning partitions the objects in user defined classes into concrete types for which more specialized information is available. Concrete types are essentially implementation classes describing, for example, a special memory layout specific to some subset of objects of a particular user class. The new dispatch mechanism may need to distinguish these concrete types. For the example, this case would occur if we had constructed specialized versions of Circle, such as BigCircle and SmallCircle, with their own memory layouts.

The modified dispatch mechanism uses (site, selector, concrete_type) to select the method to be executed. Since only a single dimension is added, this mechanism is the smallest extension sufficient to select the correct clone and, unlike multiple-dispatch, is independent of the number of arguments. This mechanism can be implemented to induce no overhead when the selector is known (i.e., when the selector does not come from a variable), and only one instruction otherwise, since the site can be added into the selector to form a single index into a virtual function table.
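A minimal sketch of the resulting lookup, assuming a flat virtual function table indexed by concrete type and by the sum of the selector and site identifiers:

    #include <cstddef>
    #include <vector>

    using Fn = void (*)(void* self);

    // One row per concrete type; columns indexed by selector plus a
    // per-site offset, so cloned callees remain a single table lookup.
    struct DispatchTable {
        std::vector<std::vector<Fn>> table;  // [concrete type][selector + site]

        Fn lookup(int concrete_type, int selector, int site_offset) const {
            // When the selector is a compile time constant, selector +
            // site_offset folds into one immediate index, costing no
            // extra instruction over ordinary single dispatch.
            return table[concrete_type][selector + site_offset];
        }
    };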

Selecting Clones

Clones are selected by partitioning method contours, and concrete types by partitioning class contours. The initial set of partitions is determined by optimization criteria such as minimization of dynamic dispatch and maximization of unboxing. These partitions represent potential concrete types and clones (versions of methods amenable to special optimization) which are then iteratively refined until the cloned call graph is realizable by the new dispatch mechanism.

The overall algorithm is presented in the first pseudocode figure below. It is based on two functions: one which determines if two method contours can share a clone (are equivalent), and an analogous function for class contours. These functions, shown in the second figure, induce partitions over their respective contours. repartition computes these partitions by grouping the contours such that all the contours in a partition are equivalent. Initially this equivalence corresponds to that induced by the optimization criteria. Since a finer partition of class contours can induce a finer partition of method contours, and vice versa, we repeat the process until a fixed point is reached.

    clone_selection() {
      // establish initial partitions
      method_partition = new List;
      forall m in method_contours do
        partition(m) = method_partition;
      class_partition = new List;
      forall c in class_contours do
        partition(c) = class_partition;
      // refine for equivalence and realizability
      while (!fixed_point) {
        repartition(method_contours, method_contours_equivalent);
        check_class_contours_used_for_dispatch();
        repartition(class_contours, class_contours_equivalent);
      }
    }

    repartition(part, equivalent) {
      result = new List;
      result.add(new List(part.first));
      forall e in part.rest do
        if exists s in result such that (forall r in s, equivalent(e, r))
          then s.add(e);
          else result.add(new List(e));
    }

Figure: Cloning Selection Drivers (pseudocode)

Termination is ensured because the number of contours is finite and the partitioning proceeds monotonically (see the equivalence functions below, under the comment "monotonicity").

The initial partitions are built based on optimization criteria by the contour equivalence functions. To maximize static binding, we examine each invocation site in the method for the two contours, and if they would bind to different clones (method contour partitions) or different sets of clones, we declare the two contours not equivalent. Similarly, for representation optimizations (unboxing), if a variable within two method contours, or an instance variable within two class contours, has different efficient representations (unboxed or inlined objects), we declare them not equivalent. This is because grouping the contours would prevent optimization. The code to check these optimization criteria appears in the figure below, indicated by the "optimization criteria" comment. Standard techniques for profiling or frequency estimation can be applied to maximize the benefits of optimization while limiting code expansion by ignoring optimization of non-performance-critical code.

To ensure that the call graph is realizable by the mo died dispatch mechanism further

renement of the partitions may b e required aecting b oth metho d and class partitions This

o ccurs when the dispatch mechanism is not able to resolve a unique metho d at an invo cation site

Figure shows graphically for the matrix multiply example how the decision to sp ecialize

innerproduct by partitioning the contours for ArrayDint and ArrayDfloat induces a

repartitioning of contours of and ultimately the sp ecialization of mm

boolean method_contours_equivalent(a, b)
    return
        partition(a) == partition(b)                      // monotonicity
        and forall s in inv_sites(method(a)) do           // optimization criteria
                bindings(s, a) == bindings(s, b)
        and forall v in variables(method(a)) do
                boxing(v, a) == boxing(v, b)
        and forall c in creation_points(method(a)) do     // realizability
                class_contour(c, a) == class_contour(c, b)

boolean class_contours_equivalent(a, b)
    return
        partition(a) == partition(b)                      // monotonicity
        and forall v in instance_variables(class(a)) do   // optimization criteria
                boxing(v, a) == boxing(v, b)
        and not (b in not_equivalent(a))                  // realizability

check_class_contours_used_for_dispatch()
    foreach s in inv_sites do
        foreach e1, e2 in call_graph_edges(s) do
            if partition(callee(e1)) != partition(callee(e2))
               and selector(e1) == selector(e2)
               and partition(object(e1)) == partition(object(e2))
                make_not_equivalent(object(e1), object(e2))

make_not_equivalent(a, b)
    not_equivalent(a).add(b)
    not_equivalent(b).add(a)

Figure: Contour Equivalence Functions (pseudocode)

Since the dispatch mechanism uses the concrete type (class contour partition) to select the target method, if the invocation site and selector are the same, the two class contours must be in different partitions in order to be able to resolve the appropriate method. Consider the case, in the example figure below, of optimizing the binding of print in the method print_contents to Circle::print for circle containers and Square::print for square containers at site 3. Since the invocation site and selector are identical, the concrete type of c must be used to distinguish the correct version. Thus the method contour partition of print_contents has induced a class contour partition of Container, distinguishing those instances for which o is a Circle from those for which o is a Square. The function which checks this condition, and ensures that two such class contours will be non-equivalent, is check_class_contours_used_for_dispatch in the figure above.

[Figure: Partitioning of innerproduct contours induces repartitioning of mm contours. In the depicted call graph, ri.mm(aI,bI) and rf.mm(aF,bF) reach separate contours of Array2D::mm(), which invoke the Array2Dint and Array2Dfloat contours of Array2D::innerproduct().]

class Container { Object o; }

void Container::print_contents() { this.o.print(); }

Container create() { return new Container; }

main() {
    Container a = create();        // site 1
    Container b = create();        // site 2
    a.o = new Circle;
    b.o = new Square;
    Container c = a;
    if (...) c = b;
    c.print_contents();            // site 3
}

Figure: Example Requiring Repartitioning of Contours

Similarly, class contour partitions can induce method contour partitions. Class contours are defined by their creation point (the creating statement and surrounding method contour). Since the partitions of class contours will be the concrete types used by the dispatch mechanism, objects must be tagged at their creation points with their concrete type. This means that two method contours cannot be in the same partition if they define different class contour partitions. For example, in the figure above we have partitioned the class contour for Container based on the type of o (Circle or Square). In order to tag Circle containers and Square containers as different concrete types, enabling the dispatch mechanism to select between them, we must repartition the method contours for create, separating those called from site 1 from those invoked from site 2. Thus the class contour partition of Container has induced a method contour partition of create. This is checked by the function method_contours_equivalent under the 'realizability' comment in the contour equivalence figure above.
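As an illustration of creation-point tagging, here is a hedged C++ sketch (the names Object, concrete_type, and the two create clones are hypothetical, not the Concert runtime) showing why create must be cloned per call site: the two clones differ only in the concrete-type tag they install.

    // The two clones of create() are identical except for the tag, which is
    // what lets the dispatch mechanism tell the two kinds of containers apart.
    struct Object { int concrete_type; };
    struct Container : Object { Object *o; };

    enum { CONTAINER_OF_CIRCLE = 0, CONTAINER_OF_SQUARE = 1 };

    Container *create_clone_for_site1() {          // clone called from site 1
        Container *c = new Container;
        c->concrete_type = CONTAINER_OF_CIRCLE;    // tag at the creation point
        return c;
    }

    Container *create_clone_for_site2() {          // clone called from site 2
        Container *c = new Container;
        c->concrete_type = CONTAINER_OF_SQUARE;
        return c;
    }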

Creating Clones

A method clone is created for each method contour partition, and a concrete type for each class contour partition. For each method clone, the code of the original method is duplicated and the data flow information updated to reflect only that for the contours in its partition. The invocation sites and variables have the more precise information dictated by the optimization criteria, enabling a wide variety of optimizations (see the later chapters for a detailed discussion and experimental results). In particular, the statically bound invocation sites are connected to the appropriate clone and are amenable to inlining. Methods which contain creation points are modified so that the created objects are tagged with the appropriate concrete type instead of the original class. Finally, the modified dispatch tables are constructed. Invocations which require dynamic dispatch are assigned identifiers. For each edge in the interprocedural call graph from these sites, an entry is made into the dispatch table mapping (site, selector, concrete type) to the appropriate clone.
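The construction of these entries can be pictured with a small C++ sketch (illustrative; the Edge record and the identifier types are assumptions, not the Concert data structures): each call-graph edge out of a dynamically dispatched site contributes one (site, selector, concrete type) to clone entry.

    #include <map>
    #include <tuple>

    using SiteId = int;
    using Selector = int;
    using ConcreteType = int;
    using CloneId = int;
    using Key = std::tuple<SiteId, Selector, ConcreteType>;

    // One call-graph edge from a dynamically dispatched invocation site.
    struct Edge { SiteId site; Selector sel; ConcreteType type; CloneId clone; };

    std::map<Key, CloneId> build_dispatch_table(const Edge *edges, int n) {
        std::map<Key, CloneId> table;
        for (int i = 0; i < n; i++)
            table[Key{edges[i].site, edges[i].sel, edges[i].type}] = edges[i].clone;
        return table;
    }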

main()
    ri.mm(aI, bI) -> Array2Dint::mm()
                         -> Array2Dint::innerproduct()
                                -> Array2Dint::at()
                                -> Array2Dint::at_put()
    rf.mm(aF, bF) -> Array2Dfloat::mm()
                         -> Array2Dfloat::innerproduct()
                                -> Array2Dfloat::at()
                                -> Array2Dfloat::at_put()

Figure: Specialization of Matrix Multiply Example

For our example, the specialized classes Array2Dint and Array2Dfloat are created with the knowledge of the types of inner, outer, and the array elements. The constructors for aI, bI and aF, bF are specialized to create objects of these new classes. The access methods inner and outer are specialized to extract unboxed integers. Likewise, the specialized versions of at, at_put, innerproduct, and mm are created. Finally, the call graph is updated so that, for instance, innerproduct on Array2Dfloat invokes at_put on Array2Dfloat, as in the figure above.
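The net effect on the example is roughly what a C++ programmer would obtain by hand with templates; the following sketch is only an analogy (the thesis language is not C++, and names such as inner_product are illustrative), but it shows the unboxed elements and statically bound accessors that the clones provide automatically.

    #include <vector>

    // One specialized class per element type, with packed (unboxed) elements
    // and statically bound at()/at_put().
    template <typename T>
    struct Array2D {
        int rows, cols;
        std::vector<T> data;
        Array2D(int r, int c) : rows(r), cols(c), data(r * c) {}
        T    at(int i, int j) const     { return data[i * cols + j]; }
        void at_put(int i, int j, T v)  { data[i * cols + j] = v; }
    };

    template <typename T>
    T inner_product(const Array2D<T> &a, const Array2D<T> &b, int i, int j) {
        T sum = T();
        for (int k = 0; k < a.cols; k++)
            sum += a.at(i, k) * b.at(k, j);   // statically bound, inlinable
        return sum;
    }

    int main() {
        Array2D<int>   aI(4, 4), bI(4, 4);    // corresponds to the int clones
        Array2D<float> aF(4, 4), bF(4, 4);    // corresponds to the float clones
        (void)inner_product(aI, bI, 0, 0);
        (void)inner_product(aF, bF, 0, 0);
    }

Cloning derives this specialization automatically from creation and calling contexts, without requiring the programmer to commit to templates.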

Performance and Results

In this section we describe the results of applying the cloning algorithm to the test suite described earlier. The programs were analyzed with the flow-sensitive interprocedural analysis described in the flow analysis chapter. The number of clones produced and the effects of cloning on dynamic dispatch, procedure calls, and code size are reported.

Clone Selection

To evaluate clone selection, initial contour partitions were generated using aggressive optimization criteria. One criterion is to remove as many dynamic dispatches as possible, regardless of the number of times the statement is executed. The second criterion is to optimize the representation of as many arrays and local integer and floating point variables as possible by unboxing. We applied these criteria and evaluated the number of concrete types and method clones produced. To demonstrate that clone selection was able to combine contours not required for optimization, we also report the number of contours produced by the analysis.

It should be noted that the number of contours produced by an analysis is only superficially related to the quality of information it produces and to the difficulty of selecting clones based on that information. In theory, flow analyses produce O(N), O(N^2), O(N^3), or more contours for a program of size N, and can require large amounts of space. The adaptive analysis of the earlier chapter creates contours in response to imprecisions discovered in previous iterations. As a result, it is relatively conservative with respect to the number of contours it creates.

Selection of Concrete Types (Class Clones)

The number of user classes analyzed, the class contours, and the number of concrete types produced by the selection algorithm are reported below.

Program          ion  network  circuit  pic  mandel  tsp  richards  mmult  poly  test
Program Classes
Class Contours
Concrete Types

Figure: Selection of Concrete Types (Class Clones)

The data in the figure show that the number of class contours is much greater than the number of user-defined classes. However, the number of concrete types selected by our cloning algorithm is closer to the number of user classes. This is because not all the contours distinguished by the analysis are required for optimization. In particular, when all invocations on objects corresponding to some class contour are statically bound, the dispatch mechanism does not need a concrete type for dispatch, and no distinct concrete type is created. Methods for such objects are simply specialized for the class contour and statically bound.

Selection of Method Clones

The number of reachable user methods (used, as opposed to simply defined, in the program), the analyzed method contours, the clones selected by our algorithm, and the final number of methods after inlining appear below. The inlining criteria (discussed in a later section) are based on the size of the source and target methods as well as a static estimation of the invocation frequency. When all invocations of a method are inlined, that method is eliminated from the program.

Program         ion  network  circuit  pic  mandel  tsp  richards  mmult  poly  test
Methods
Contours
Clones
After Inlining

Figure: Selection of Method Clones

Again, the analysis creates many more method contours than user-defined methods. However, the selection algorithm chooses only those required for optimization, in most cases ending with only somewhat more than the number of user-defined methods. Moreover, since many invocation sites can be statically bound after cloning, many of the smaller methods can be inlined at all their callers. Thus the number of methods which remain after cloning and inlining is actually smaller than the number of methods in the original programs.

Dynamic Dispatch

Dynamic dispatch (virtual function calls) was described earlier. Static binding is the process of transforming dynamic dispatches into regular function calls. Cloning enables static binding by creating versions of code specialized for the classes of data they operate on. We compare three optimization levels: unoptimized, optimized, and cloning. The unoptimized code represents the lower bound on efficiency, indicating the number of methods and messages required by a naive implementation where only accessors are inlined. The optimized (0CFA) version uses customization to create specialized versions of methods for each target object class, and statically binds all methods for which there is only one possible target method.

Site Counts

In the figure below we report the number of dynamic dispatch sites in the final code. Overall, the number of dynamic dispatch sites in the optimized codes is almost identical to that in the unoptimized code. The differences result from inlining and dead code elimination. On the other hand, very few dynamic dispatch sites remain in the cloning codes. As we will see in later chapters, elimination of these sites enables many optimizations.

Without cloning, all the programs but two contain a number of dynamic dispatch sites: mandel is primarily numerical and does not use polymorphism, and in test the selectors are unique, enabling invocations to be statically bound even without sophisticated analysis. With cloning, only one program has more than two dynamic dispatch sites. Those dispatches which remain correspond to the true polymorphism in the programs and cannot be statically bound to single methods. For instance, in richards (the OS simulator), the single remaining dispatch is in the task dispatcher where the simulated tasks are executed. Since the tasks are data dependent, this dynamic dispatch cannot be eliminated.

[Figure: Dynamic Dispatch Sites. Bar chart (y-axis: percent, 0% to 160%) comparing the unoptimized, optimized, and cloning versions across ion, pic, tsp, test, poly, mmult, circuit, mandel, and network.]

Event Counts

The runtime counts in the figure below demonstrate the effectiveness of cloning for elimination of dynamic dispatch during program execution. The test suite was optimized and then executed on a sample input, and the number of invocations, both dynamically dispatched and statically bound, was collected. The number of dynamic dispatches is reported as a percentage of those occurring in the unoptimized code. While global analysis and optimization alone are able to statically bind many invocations, reducing the counts substantially, cloning is able to statically bind many more. Moreover, once the number of invocations is reduced by inlining, those remaining in the optimized case are frequently dynamic dispatches. The second figure below shows the number of dynamic dispatches as a percentage of the remaining invocations. This shows that optimization of the optimized code is limited by dynamic dispatches, which inhibit inlining. In contrast, cloning keeps dynamic dispatches to a small fraction of the total number of invocations. Note that this graph should not be used to compare the absolute number of dynamic dispatches, since the total number of invocations in the cloned version is less than that in the optimized version.

[Figure: Percent of Total Dynamic Dispatches. Bar chart (y-axis: percent of dynamic dispatches, 0 to 20) comparing the optimized and cloning versions across pic, ion, tsp, test, poly, circuit, mmult, mandel, network, and richards.]

Number of Invocations

In the figure below we report the total number of invocations, static and dynamic, after optimization. For the baseline we use the number of invocations in the unoptimized version. Global analysis and inlining eliminate a large fraction of the invocations, and in some cases cloning eliminates more. Better use of frequency information, combined with the greater number of statically bound methods in the cloning version, might reduce the number of calls even further.

Code Size

One important measure of the effectiveness of clone selection is the final code size. A figure below compares the resulting code size before and after cloning. The cloned programs usually increase in size by only a modest amount. The relatively large increase in ion is the result of extensive use of first-class selectors (virtual function pointers in C++) for program output. Code size expansion can be reduced by using profiling or frequency estimation to restrict cloning to the parts of the program which execute the most. Since the output phase is only executed once, such restrictions would have helped for ion.

[Figure: Percent of Remaining Dynamic Dispatches. Bar chart (y-axis: percent of dynamic dispatches, 0 to 35) comparing the optimized and cloning versions across pic, ion, tsp, test, poly, circuit, mmult, mandel, network, and richards.]

Discussion

Pure object-oriented languages rely on polymorphism and dynamic binding to express generic abstractions. In C++, templates give the programmer explicit control over how and when code is replicated and/or shared. In order to avoid code bloat, C++ programmers must use derivation (inheritance) to ensure code sharing among different types; as Bjarne Stroustrup said, "People who do not use a technique like this in C++ (or in other languages with similar facilities for type parameterization) have found that replicated code can cost megabytes of code space even in moderate size programs." Cloning uses optimization criteria, interprocedural analysis, and transformation to automatically generate efficient specialized data structures and code which reflect the actual application structure. However, the generic versions can be used to share code for non-performance-critical parts of the application. Moreover, these performance tuning considerations are decoupled from the higher level expression of the program, simplifying coding and increasing the potential for reuse.

[Figure: Total Number of Invocations. Bar chart (y-axis: invokes, as a percentage of the baseline, 0 to 100) comparing the optimized and cloning versions across pic, tsp, ion, test, poly, circuit, mmult, mandel, network, and richards.]

Related Work

Cooper presents general interprocedural analysis and optimization techniques. Whole-program (global) analysis is used to construct the call graph and solve a number of data flow problems. Transformation techniques are described to increase the availability of this information through linkage optimization, including cloning. However, this work does not address clone minimization. Cooper and Hall present comprehensive interprocedural compilation techniques and cloning for FORTRAN. This work is general over forward data flow problems and presents mechanisms for preserving information across clones and minimizing their number. However, concrete types are not a forward data flow problem. Hall determines initial clones by propagation of clone vectors containing potentially interesting information, which are merged using state vectors of important information into the final clones. We handle forward flow problems in a similar manner, but rely on global propagation to determine the final clones for recursive methods.

Several different approaches have been used to reduce the overhead of object-orientation. Customization is a simple form of cloning whereby a method is cloned for each subclass

[Figure: Effect of Cloning on Code Size. Bar chart (y-axis: size of object code in K, 0 to 1500) comparing the optimized and cloning versions across pic, tsp, ion, test, poly, circuit, mmult, mandel, network, and richards.]

which inherits it. This enables invocations on self (or this, in C++ terminology) to be statically bound. Another simple approach is to statically bind invocations when there is only one possible method. This idea was extended by Calder and Grunwald through if-conversion, essentially a static version of polymorphic inline caches. This work also shares some similarities with that done for the Self and Cecil languages. Chambers and Ungar used splitting, essentially an intraprocedural cloning of basic blocks, to preserve type information within a function. Early work on Smalltalk used inline caches to exploit type locality. Hölzle and Ungar have shown that the information obtained by polymorphic inline caches can be used to speculatively inline methods. While run time tests are still required, various techniques are presented to preserve the resulting type information. None of these approaches uses global analysis and transformation to eliminate the run time checks, nor to preserve general global data flow information. More recently, Dean, Chambers, and Grove have used information collected at run time to specialize methods with respect to argument types. While this can remove dynamic dispatches across method invocations, it does not handle polymorphic instance variables. Finally, Agesen and Hölzle have recently used the results of global analysis in the Self compiler. However, the information for all the contours of each customized method is combined before being used by the optimizer.

The cloning algorithm we have presented is general enough to enable optimization based on any data flow information provided by global flow analysis. All that is required is that the contour equivalence functions be modified to reflect the new optimization criteria. We have successfully used optimization criteria for increasing the availability of interprocedural constants, integrating subobjects, and separating algorithm phases with this cloning algorithm. However, efficient cloning for such information requires estimating its potential use for optimization. Interested readers are referred to the related work for a discussion of such issues.

Summary

Cloning builds specialized versions of classes and methods for optimization purposes. It begins with the results of flow analysis: the call graph and a set of contours. These contours are partitioned into prototypical clones based on optimization criteria: object contours are partitioned into concrete types, and method contours are partitioned into method clones. Next, an iterative algorithm is applied which repartitions the contours until the call graph is realizable, that is, until the objects can be created with correct concrete types and the correct clones can be invoked for each invocation site. The standard dynamic dispatch mechanism, which selects the desired method based on the selector and the class of the target, must be modified to be context sensitive. The new dispatch mechanism uses an invocation site identifier during dynamic dispatch; this identifier can typically be folded into the selector. A study of nine object-oriented programs demonstrates that the vast majority of invocations can be statically bound to a single method through cloning, with modest code size expansion.

Chapter

Optimization of Object-Oriented Programs

One does not know, cannot know, the best that is in one.

Nietzsche, Beyond Good and Evil

This chapter describes a range of general and object-orientation-specific optimizations and demonstrates, for a set of standard benchmarks, that they are sufficient to enable a pure dynamically-typed object-oriented language to match the performance of C (GCC) and beat that of C++ (G++). The optimizations in this chapter occur after flow analysis and cloning (described in the preceding chapters). The first sections below map the potential inefficiencies of the programming and execution models to particular optimization problems, and provide an overview of the solutions to these problems, which are then covered in detail in the remaining sections.

The later sections discuss optimization of invocations, including static binding, if-conversion, inlining, and speculative inlining; optimization of low level data access through unboxing, the removal of type tags and the operations which manipulate them; the promotion of instance variables to local variables using interprocedural aliasing information; and, finally, a suite of general optimizations necessary to extract the final modicum of performance.

Efficiency Problems

Earlier we pointed out that abstraction boundaries and polymorphism are potential sources of inefficiency in the programming model. Crossing abstraction boundaries trivially maps to method invocations (virtual function calls, as in C++), which are more expensive than inline operations. Likewise, polymorphic objects must be handled indirectly (i.e., using pointers), increasing potential aliasing. Finally, there is the impact of control flow ambiguities resulting from dynamic dispatch. Thus OOP programs have more, smaller methods, more data-dependent control flow, and more potential aliases than procedural programs. These features decrease performance and, moreover, their effects compound. For example, the large number of data-dependent invocations increases aliasing ambiguity, which in turn increases register spill at the large number of invocation sites.

Method Size

Small method size in object-oriented programs is a result of encapsulation and programming by difference, which are supported by methods and inheritance respectively. Methods describe the interface to the object, physically embodying the abstraction boundary. Using inheritance, the programmer partitions the program into methods representing a general solution and a set of variation points which are delimited by method boundaries.

The effects of object-orientation on program characteristics have been confirmed empirically by comparison of C and C++. Calder et al. found that the instructions-to-invocation ratio for C++ was less than half that of C. Moreover, the basic block sizes for C++ were slightly smaller than those of C. In addition to the overhead of the method invocation itself, small method and basic block sizes make it harder for modern microprocessors to extract the instruction level parallelism they depend on for high performance. In CA and other pure object-oriented languages like Self and Smalltalk, the invocation density is even higher.

In addition to the direct cost of the method invocations themselves, small method size decreases the effectiveness of register allocation and instruction scheduling, both critical to performance on modern microprocessors. To increase the size of methods, the bodies of small methods need to be inlined, spliced into their caller where they can be optimized in context.

Data-Dependent Control Flow

Since polymorphic variables may at different times refer to objects of several classes, the methods invoked on them are dependent on the concrete type of the object. In general, the method executed is determined by a combination of the selector (generic function name) and the actual class of the object, both of which may vary at run time. This data dependence of control flow complicates inlining and increases the cost of method invocations. Much data dependence can be eliminated through a combination of global analysis and cloning. However, transforming the program to take advantage of the additional information, and to eliminate the remaining data dependence, requires special optimization: invocation sites are statically bound or inlined speculatively based on the selector and class of the target object.

Aliasing

Since objects are referenced indirectly, they, and consequently their instance (member) variables, are potentially aliased. As a result, these variables generally cannot be cached in registers across method invocations or assignments through pointers. Since instance variables are implicitly scoped in C++ (they need not be accessed through the this pointer), this performance consequence may not be readily apparent to the programmer constructing or using the abstraction. Moreover, the potential alias problem is exacerbated by the high method invocation frequency.

class Array2D {
    rows;
    cols;
    ...
}

Array2D::at(i, j) { return self[i * cols + j]; }

for (j = 0; j < a.rows; j++)
    for (i = 0; i < a.cols; i++)
        b.compute(a.at(j, i));

Figure: Aliasing Example

An example of this effect appears in the figure above, which is derived from the matrix multiply example. In isolation, the method cols requires a memory access to retrieve the value of the cols instance variable. Inlined into the for loop, the load of cols cannot be hoisted above the loop as an invariant unless the compiler can prove that a.cols cannot be changed by compute. The interprocedural call graph, discussed below, is needed to make this determination. Similarly, approximations of alias information for arrays can also be used for optimization.
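The licensed transformation can be sketched in C++ (illustrative names; compute is assumed, per the interprocedural analysis, not to write a.cols):

    struct Array2D { int rows, cols; };

    static void compute(int /*elem*/) { /* provably does not touch a.cols */ }

    void before(Array2D &a, int j) {
        for (int i = 0; i < a.cols; i++)    // a.cols reloaded every iteration
            compute(j * a.cols + i);
    }

    void after(Array2D &a, int j) {
        const int cols = a.cols;            // hoisted invariant, register-allocatable
        for (int i = 0; i < cols; i++)
            compute(j * cols + i);
    }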

Optimization Overview

Analysis provides the information, and cloning makes it available, so that the compiler can convert method invocations into lower level operations which are amenable to conventional optimizations. The specialized clones and concrete types produced by cloning are similar to C++ template instantiations, in that the information they contain has been made more precise by code replication. However, high method invocation density and the large number of pointer-based data accesses still result in inefficient code.

The problem of invocation density is addressed through a set of invocation optimizations including static binding, speculation, and inlining. Data access overhead is addressed by unboxing and by eliminating pointer-based accesses through conversion of instance variables to Static Single Assignment form and array alias analysis. Finally, a suite of standard low level optimizations is performed to eliminate the residue of high level abstractions.

Benchmarks

In order to illustrate and evaluate the optimizations in this chapter, a set of benchmarks is used. The Stanford Integer Benchmarks consist of bubble sort (bubble), integer matrix multiply (intmm), a permutation generator (perm), a puzzle solver (puzzle), the N-queens problem (queens), the sieve of Eratosthenes (sieve), the towers of Hanoi (towers), and a program to construct a random binary tree (tree). The Stanford OOP benchmarks include Richards, an operating system simulator which creates a number of different tasks which are stored in a queue and periodically executed, and Delta Blue, a constraint solver which builds a network, solves it a number of times, and removes the constraints. Two different test cases are provided for Delta Blue: Chain, which builds a chain of Equal constraints, and Projection, which builds two sets of variables related by ScaleOffset constraints. These benchmarks and the testing methodology are discussed further below.

Invocation Optimization

Since object-oriented programming produces code with a large number of invocations and uses a relatively expensive calling mechanism, invocation optimization is of particular importance. First, the call mechanism can be optimized by static binding, the conversion of a dynamically dispatched invocation site to a static call (a normal C-style call). This is possible only when it can be proven that only one method may be called from that invocation site. When that is not the case, we can speculate as to which method will be called, insert code to verify the speculation, and use a static call if we are correct. Last, for statically bound sites, the body of the called method can be inserted inline, eliminating the invocation overhead and allowing the code to be specialized for the specific calling context.
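The three transformations can be pictured on a single call site in C++ (a hedged sketch; Shape, Circle, and the dynamic_cast guard are illustrative stand-ins for the compiler-inserted test, not the Concert mechanism itself):

    #include <cstdio>

    struct Shape  { virtual ~Shape() {} virtual double area() const = 0; };
    struct Circle : Shape {
        double r;
        explicit Circle(double r) : r(r) {}
        double area() const { return 3.14159265 * r * r; }
    };

    double use(const Shape &s) {
        double a = s.area();                  // 1. dynamic dispatch through a table
        if (const Circle *c = dynamic_cast<const Circle *>(&s))
            a = c->Circle::area();            // 2. speculation: verify, then bind statically
        if (const Circle *c = dynamic_cast<const Circle *>(&s))
            a = 3.14159265 * c->r * c->r;     // 3. inlining: body spliced in under the guard
        return a;
    }

    int main() { std::printf("%f\n", use(Circle(2.0))); }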

Static Binding

General (dynamically dispatched) method invocations can be converted to direct function calls when it can be determined that only one method could possibly be called. In C++ this is trivially the case when a method is not declared virtual, is static, or is never overridden in a subclass. As we will see later, this information is in general insufficient to enable an efficient implementation. In a dynamically typed language (e.g., Smalltalk, CA, or the language used in this thesis), a dynamic dispatch can only be transformed without analysis when the selector is constant and there is only one method with that name.

Global analysis and cloning are capable of resolving many dynamic dispatch sites to a single method; their effectiveness on a set of general object-oriented programs is discussed later. For programs in which all the polymorphism is parametric, determined by static parameterization of classes and methods with respect to their creation or calling environment, these techniques allow static binding of all invocations. The figure below shows the number of static and dynamic dispatch sites in the Stanford Integer Benchmarks after analysis and cloning. Since these are procedural codes translated into an object-oriented style, it is not surprising that all invocations can be statically bound.

[Figure: Static and Dynamic Invocation Sites in the Optimized Stanford Integer Benchmarks. Bar chart (y-axis: count of sites, 0 to 120) of statically versus dynamically bound sites for trees, perm, sieve, intmm, puzzle, towers, bubble, and queens.]

For the Stanford object-oriented benchmarks, the results show that the programs do require runtime dynamic dispatch. Nevertheless, very few dynamic dispatch sites remain in the generated code. The figure below shows the number of statically and dynamically bound invocation sites remaining after optimization. In fact, each code contains only a single invocation site which requires dynamic dispatch once cloning has created special versions of classes and methods for each instance of parametric polymorphism. The reason that proj and chain seem to have a number of dynamic dispatch sites is that the single site is inlined in several places.

Static binding can have a significant impact on performance compared to using a general dispatch mechanism. In the speed comparison figure below, we compare the performance of the Stanford Benchmarks (Integer and OOP) with and without static binding. In this study, all other optimizations are enabled; in particular, inlining is used when possible. However, when an invocation is required, a general dynamic dispatch mechanism is used. In Concert, the general dispatch mechanism requires arguments to be boxed (see the unboxing section below), and the method lookup uses a bucket hash table. When the correct method is found, a wrapper interfaces the boxed arguments to the unboxed values used in the method.

[Figure: Static and Dynamic Invocation Sites in the Optimized Stanford Object-oriented Benchmarks. Bar chart (y-axis: count of sites, 0 to 120) of statically versus dynamically bound sites for proj, chain, and richards.]

Speculation

Even a thought, even a possibility, can shatter us and transform us.

Nietzsche, Eternal Recurrence

Speculation is optimization which takes into account probability and relative cost to optimize average-case performance. The possibility that an event could occur, for instance an invocation of a particular method at a particular point in the code, is used to transform that code. Instead of using a general calling mechanism (e.g., table-driven indirection) to obtain the target method, a conditional is inserted. In the two branches of the conditional, some of the possibilities have been eliminated, refining the information available and potentially enabling optimization. This idea will be used extensively in a later chapter, but first let us examine an example.

In the figure below, on the left, the method foreach is defined which takes a selector f. If it could be determined that the selector argument was dbl, all the method invocations could be eliminated, resulting in much more efficient code. While in general analysis and cloning cannot always resolve such arguments to a single method, they can reduce the number of possibilities, both of selectors and of target object classes. In such cases it is often profitable to speculate: for example, to insert a conditional to check the value of f and, if it is dbl, to execute the optimized code.

[Figure: Speed Comparison for Static Binding in the Stanford Object-oriented Benchmarks. Bar chart (y-axis: execution time in seconds, 0 to 220) of statically versus dynamically bound versions for proj, chain, and richards.]

dbl(i) { i + i }

Original:
    foreach(a, f) {
        for (let i = 0; i < a.size; i++)
            a[i] = f(a[i]);
    }

Speculated:
    foreach(a, f) {
        if (f == dbl)
            for (let i = 0; i < a.size; i++)
                a[i] = a[i] + a[i];
        else
            for (let i = 0; i < a.size; i++)
                a[i] = f(a[i]);
    }

Figure: Speculation Example

In the figure below, the average number of target methods which might possibly be invoked at a dispatch site (the static arity) is reported for the Stanford object-oriented benchmarks. The graph on the left presents the counts of invocation sites and possible targets, while the graph on the right presents the ratio of targets to invocations. Inlining has eliminated many of the statically bound invocation sites, which would have an arity of one. Nevertheless, the low ratio indicates that the majority of the remaining sites are still statically bound.

The second figure below reports the average arity of invocations during execution of the Stanford object-oriented benchmarks (the dynamic arity). Since the programs require a dynamic dispatch in the inner loop, on the evaluation selector of the constraint network node in the case of Delta Blue, and on the class of the simulated task in Richards, the dynamic arity is higher than the static arity. In particular, through inlining, the inner loop of Richards has been reduced to a single method containing a dynamic dispatch at the core.

[Figure: Static Arity of Dispatch Sites in the Stanford Object-oriented Benchmarks. Paired bar charts for proj, chain, and richards: counts of call sites and dynamically bound targets (left, 0 to 500), and arity as the ratio of targets to sites (right, 0.0 to 2.0).]

[Figure: Dynamic Arity of Dispatch Sites in the Stanford Object-oriented Benchmarks. Paired bar charts for proj, chain, and richards: counts of call sites and call targets in millions (left, 0 to 5), and arity as the ratio of targets to sites (right, 0.0 to 2.2).]

These dynamic dispatches are necessary, as they result from non-parametric polymorphism in the data structures.

Through speculation, inserting conditionals in the place of dynamic dispatch, the overhead of the general invocation mechanism can be avoided. As we have demonstrated, the number of such invocation sites tends to be small, as does their arity; thus the effect on the overall code size should be small. The main advantage of speculation comes from inlining the invocations under the conditionals.

Inlining

Inlining is the process of replacing a statically bound method invocation with the body of the called method. This can improve the performance of code both by removing the method invocation overhead and by enabling the body of the method to be optimized in context. In order to prevent excessive code expansion, inlining is performed based on heuristics which attempt to balance performance and code size. The Concert compiler uses a combination of static estimation and size constraints to decide when to inline, eliminating the cost of crossing a procedure boundary.

Since many small method bodies contain only a few instructions, inlining is of particular importance for object-oriented programs. The figure below compares the speed of the fully optimized programs to those for which only accessors (methods which access instance variables) and primitive operations (e.g., integer add) have been inlined. This latter case corresponds roughly to the level of optimization available from simple C++ implementations.

[Figure: Effects of Inlining on Execution Time. Bar chart (y-axis: normalized execution time, 0 to 10) comparing the optimized and no-inlining versions for proj, trees, perm, sieve, chain, intmm, puzzle, towers, bubble, queens, and richards.]

The interprocedural call graph can be used to determine when a method has been inlined at all invocation sites, and to eliminate the method. As we saw earlier, inlining need not result in a large code size increase. For example, both versions of the innerproduct method will be inlined into their corresponding mm invocation sites, and the methods will be eliminated since they have no other callers.

When speculation is combined with inlining, the inlined code is placed under a conditional, which can prevent some optimizations. A later chapter describes optimizations which expand these conditionals to encompass larger portions of code, removing overhead and enabling additional optimization. These transformations, along with dead code elimination, perform the same function as splitting, by preserving information obtained by runtime type or selector checks within methods.

Unboxing

Unboxing converts a tagged slot into an untagged data location. This decreases memory requirements and eliminates the overhead of tag manipulations. To unbox a piece of data it is not necessary to know the exact type (e.g., pointer to a Point object or pointer to a Circle object), only the primitive type (e.g., integer, pointer, floating point number). This information is provided by analysis and cloning. There are four kinds of variables which can be unboxed: local variables, instance variables, array elements, and arguments.
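For concreteness, the following C++ sketch contrasts a boxed (tagged) slot with its unboxed counterpart; the low-bit tagging scheme and the names are assumptions for illustration and not necessarily the Concert representation.

    #include <cstdint>

    // A boxed slot keeps a tag in the low bit to distinguish raw integers
    // from pointers.
    struct BoxedSlot {
        uintptr_t bits;
        static BoxedSlot from_int(intptr_t v) { return BoxedSlot{(uintptr_t(v) << 1) | 1}; }
        bool     is_int() const { return (bits & 1) != 0; }
        intptr_t as_int() const { return intptr_t(bits) >> 1; }  // strip the tag
    };

    // After unboxing, analysis has proven the slot always holds an integer,
    // so the tag and its manipulations disappear entirely.
    using UnboxedSlot = intptr_t;

    intptr_t add_boxed(BoxedSlot a, BoxedSlot b) {
        return a.as_int() + b.as_int();   // two untag operations plus an add
    }
    intptr_t add_unboxed(UnboxedSlot a, UnboxedSlot b) {
        return a + b;                     // a single machine add
    }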

Local Variables

Unboxing local variables requires building a new memory map for the context containing the local state. This memory map describes which locations the runtime and garbage collection system can expect to contain pointers and which will be tagged. Unboxed local variables can be allocated to registers. A later chapter considers the tradeoff between increasing the active state and decreasing context switch time in a concurrent object-oriented system.

Instance Variables and Arrays

Likewise, unboxing of instance variables requires building a new memory map. Cloning can produce several concrete types corresponding to a single class. However, specializing the memory maps, or failing to update tag fields appropriately, can result in the inability to share a single version of method code across these concrete types. The same is true for superclass methods, which objects of a subclass may not be able to share if they modify their memory map in an unconformant way.

Arrays can be unboxed just as instance variables can, and entail the same conformance problems. Both CA and ICC++ provide arrays as objects, with the array part stored after the instance variables. The array part can then be considered as a final instance variable. Conformance is based on the start location of the array part and on whether the elements are tagged or packed.
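A C++ sketch of this layout (the field names and the malloc-based allocation are assumptions for illustration): the array part begins at a fixed offset after the instance variables, so conformance reduces to that offset and the element packing.

    #include <cstdlib>
    #include <new>

    struct IntArray2DObject {
        int rows, cols;   // instance variables
        // the packed (unboxed) array part begins immediately after them
        int *elements() { return reinterpret_cast<int *>(this + 1); }
    };

    IntArray2DObject *make(int rows, int cols) {
        void *p = std::malloc(sizeof(IntArray2DObject) +
                              sizeof(int) * std::size_t(rows) * std::size_t(cols));
        return new (p) IntArray2DObject{rows, cols};
    }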

Arguments

Unboxing of arguments requires changing the calling convention. In the presence of dynamic dispatch, it is possible for the caller not to have the precise type information which is implied by the dispatch criteria. For example, in the figure below, the increment (inc) methods can take unboxed arguments, but at the invoking site the variable x is polymorphic. If the compiler decides to unbox the self argument for inc, either speculation or a trampoline must be used to map the boxed calling convention onto the unboxed one. Calling convention conversion is discussed in a later chapter.

int::inc(self) { ... }
float::inc(self) { ... }

let x = ...;
y = x.inc();

Figure: Calling Convention Conversion
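A minimal C++ sketch of such a trampoline (the tagging scheme and names are assumed for illustration, not the Concert implementation): the optimized clone takes an unboxed self, and the trampoline bridges calls that arrive from polymorphic sites with boxed arguments.

    #include <cstdint>

    struct Boxed { uintptr_t bits; };                    // tagged value
    static intptr_t untag(Boxed b) { return intptr_t(b.bits) >> 1; }
    static Boxed    tag(intptr_t v) { return Boxed{(uintptr_t(v) << 1) | 1}; }

    intptr_t int_inc_unboxed(intptr_t self) {    // clone with unboxed convention
        return self + 1;
    }

    Boxed int_inc_trampoline(Boxed self) {       // boxed convention, for callers
        return tag(int_inc_unboxed(untag(self)));  // that only know x is polymorphic
    }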

Instance Variable to Static Single Assignment Conversion

Since instance variables are ubiquitous, accessed by reference, and potentially aliased in object-oriented programs, they must be loaded from and stored into memory frequently, increasing memory hierarchy traffic and representing a large potential overhead. Using the interprocedural call graph and the object creation context information provided by the analysis, we estimate whether a method invocation or instance variable access might alias a given instance variable. Since good encapsulation disallows pointers into objects, only other accesses to the same instance variable of objects created at the same point as the instance variable in question can alias it. The interprocedural call graph enables us to approximate the statements reached by a method invocation. Instance variables unaliased over a range of statements are transformed into locals and can be allocated to registers.

Likewise, global variables can be transformed into local variables. Since global variables are uniquely named and cannot be aliased, their sharing patterns can be easily determined from the interprocedural call graph. A later section describes related optimizations of global variables for concurrent object-oriented languages.

Original:
    Array2D::mm(a, b) {
        for (i = 0; i < b.rows; i++)
            for (j = 0; j < a.cols; j++)
                innerproduct(a, b, i, j);
    }

Transformed:
    Array2D::mm(a, b) {
        tmp_rows = b.rows;
        tmp_cols = a.cols;
        for (i = 0; i < tmp_rows; i++)
            for (j = 0; j < tmp_cols; j++)
                innerproduct(a, b, i, j);
    }

Figure: Instance Variable Transformation Example

In our example, the cols and rows instance variables are part of the Array2D object, which is potentially aliased. However, using the call graph we can determine that, for the objects created at the creation sites in the example, these instance variables are not changed within any method invoked from mm. Thus, as in the figure above, we can transform the instance variables to the local temporaries tmp_cols and tmp_rows and hoist them out of the loop.

Array Aliasing

In the same way that alias information enables transformation of instance variables into locals, it enables optimization of array accesses. Since good encapsulation prevents pointers into the middle of arrays, absolute and/or symbolic analysis can be used to determine that array accesses do not conflict. We use a simple creation point test to estimate interprocedural array aliasing, and combine it with simple symbolic analysis to enable array references to be lifted and common-subexpression eliminated.

For example, in the figure below, from the bubble sort benchmark, the inner loop contains two array reads in the conditional and two in the body (left). We can determine by simple analysis over the call tree that the array cannot be written between the first invocation of a.at(i) and the second. Thus we can lift and common-subexpression eliminate the a.at(i) in the loop (right).

Original:
    if (a.at(i) > a.at(i+1)) {
        tmp = a.at(i);
        a.at_put(i, a.at(i+1));
        a.at_put(i+1, tmp);
    }

Transformed:
    tmp_i  = a.at(i);
    tmp_i1 = a.at(i+1);
    if (tmp_i > tmp_i1) {
        a.at_put(i, tmp_i1);
        a.at_put(i+1, tmp_i);
    }

Figure: Example of Common Subexpression Elimination of Array Operations

General Optimizations

Since the abstractions of object-oriented programming are often used to hide the representation of data, ostensibly simple operations may require many instructions. For example, the CA language does not support native multidimensional arrays. These are constructed out of single-dimension arrays, with instance variables containing the dimension sizes and with linearization methods, as in the aliasing example above. When multidimensional arrays are used in loops, the instance variables can be transformed into local variables, and the linearization operations strength-reduced and moved outside the loop. With these optimizations, matrix multiply of multidimensional arrays in CA is as fast as C, even though the array operations are abstracted and ostensibly require much more work.
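A C++ sketch of the linearization and its strength reduction (assuming the usual row-major index i * cols + j; names are illustrative):

    // Before: a multiply on every element access.
    static int sum_before(const int *data, int rows, int cols) {
        int s = 0;
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                s += data[i * cols + j];
        return s;
    }

    // After: the multiply is replaced by a running offset, leaving only
    // additions in the inner loop.
    static int sum_after(const int *data, int rows, int cols) {
        int s = 0;
        int base = 0;                       // strength-reduced offset
        for (int i = 0; i < rows; i++, base += cols)
            for (int j = 0; j < cols; j++)
                s += data[base + j];
        return s;
    }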

Since these general optimizations are a standard part of most optimizing compilers, they are only summarized in the table below. More information can be found in the large body of compiler design literature.

Overall Performance and Results

We use a standard benchmark suite to evaluate and compare the performance of CA, C, and C++. The Stanford Integer Benchmarks, Richards, and Delta Blue were used to evaluate the Self language by Chambers and later by Hölzle. The Stanford Integer Benchmarks are small procedural codes. The CA versions use encapsulated objects for the primary data structures but otherwise follow the C code structure. Richards and Delta Blue both use polymorphism, some of which can be removed at compile time by templates or cloning (i.e., parametric polymorphism) and some of which cannot (the task queue and constraint network respectively). The C++ codes are annotated by declaring functions virtual only when necessary, including inline accessors, and, in Delta Blue, by the use of a List template. The CA versions of these codes follow the C++ encapsulation and code structures.

Constant Propagation            The realization that when a variable is assigned only a
                                single value, it must be that value, applied transitively.

Constant Folding                Executing an operation whose inputs are compile time
                                constants at compile time to produce a new constant.
                                Algebraic properties are normally used to enable more
                                constant folding in languages, like CA and ICC++, which
                                allow it.

Dead Code Elimination           Removal of code which cannot be executed, normally
                                because the conditions predicating its execution have
                                been determined not to hold. This is often the result
                                of inlining and specialization.

Calling Convention              Unboxing of arguments and removal of unused arguments
Optimization                    from the method interface.

Common Subexpression            The recognition that two computations compute the same
Elimination                     value, and substitution of the result of one for that
                                of the other, enabling dead code elimination of the
                                other.

Global Value Numbering          The extension of common subexpression elimination
                                across basic blocks.

Invariant Lifting               The removal of a computation from a loop which does not
                                depend on any values computed in the loop, enabling the
                                value to be reused for each iteration.

Strength Reduction              The transformation of a scaling operation in a loop
                                into a set of successive additions.

Table: Standard Optimizations


Methodology

We compare the performance of CA codes translated from the Stanford sources. Our compiler uses the GNU compiler as a back end, enabling us to control for instruction selection and scheduling differences by using the same version for both the back end of the Concert compiler and the C and C++ benchmarks. All tests were conducted on an unloaded MHz SPARCStation with SuperCache, running Solaris. We present both individual and summarized results for the procedural and object-oriented benchmarks. (Our thanks to Craig Chambers and Urs Hölzle for making the codes available.) The individual results are the average execution time over repeated runs of each benchmark at each optimization setting, normalized to the execution time of C/C++ at -O2. The summarized results are the geometric means of the normalized times over all the benchmarks at each optimization setting. This effectively constructs a synthetic workload in which each benchmark runs for the same amount of time for C/C++ at -O2.

[Figure: Geometric Mean of Execution Times Relative to C -O2 for the Stanford Integer Benchmarks for a Range of Optimization Settings. Bar chart (y-axis: time vs. C -O2, 0.0 to 8.0) with bars for C -O2, C inline, CA base, and CA with instance variable promotion, inlining, cloning, analysis, array aliasing, and standard optimizations successively disabled.]

Procedural Codes: Overall Results

The figure above graphically summarizes the execution time of the Stanford Integer Benchmarks under various optimization settings, relative to the C optimized at -O2. Overall, the results show that at full optimization the performance of the dynamically-typed pure object-oriented codes matches that of C. The different bars report the cumulative effect of disabling optimizations. In order to make a comparison with C easier, the 'no inlining' bar does not prevent inlining of accessors or of operations on primitive data types (e.g., integer add) which would automatically be inlined in C. Also, the 'no analysis' bar only disables flow sensitivity. Flow-insensitive analysis provides information roughly comparable to type declarations, particularly with regard to C primitive data types.

Each optimization contributes to the overall performance, which would otherwise be an order of magnitude less than C. Context-sensitive flow analysis alone provides a factor of two, by allowing more primitive operations to be inlined and method invocations to be statically bound. Cloning contributes another factor of two, for essentially the same reasons, by making monomorphic versions of polymorphic code. Inlining operates on statically bound invocations, more than doubling performance by eliminating invocation overhead. Transforming instance variables to locals enables many of the standard optimizations, which, together with array alias analysis, provide the last twenty percent of overall performance.

These results demonstrate that for such procedural kernels the cost of the unused flexibility of object-oriented features can be eliminated. The remaining differences in performance reflect the low level optimizations favored by the GCC compiler, and the code structure, more than any inherent language advantage. For example, GCC strength-reduces array accesses based on sizeof(int) using a simple heuristic which is easily confused by the RTL-like output of the Concert compiler. Likewise, computing booleans into intermediates can inhibit direct use of condition codes. On the other hand, GCC only inlines functions which appear previously in the same file, and the standard malloc routine is relatively inefficient.

[Figure: Performance relative to C -O2 on the Stanford Integer Benchmarks. Per-benchmark bar chart (y-axis: time vs. C -O2, 0.0 to 5.0) for intmm, perm, sieve, trees, bubble, puzzle, queens, and towers, comparing CA base, CA with successive optimizations disabled (no arrayalias, no standard, no instvars, no inlining, no cloning, no analysis), C inline, and C -O2.]

Procedural Codes: Individual Results

The figure above reports the individual performance of the benchmarks. Overall, the results are split, with CA base outperforming C -O2 in six cases and C -O2 outperforming CA base in two. C inline increases that number by one, adding towers (a highly recursive code) to the list for which C is faster. Since these codes are largely monomorphic, they can be analyzed easily, and they differ only in their control structures and method boundaries. The CA codes are faster because of relatively more aggressive inlining, except trees, which allocates many objects. Even though the CA code garbage collects eleven times during each run, it is still more efficient than malloc. On the other hand, CA does not support break, and the resulting additional condition in while loops drops CA performance on puzzle by almost a factor of two. Discounting these special cases, individual benchmark performance is nearly identical.

Different optimizations had larger effects on different benchmarks, indicating their individual importance. The bubble sort (bubble) and permutation (perm) programs are heavily dependent on inlining for the swap, which provides most of the performance. The trees and puzzle programs benefit directly from instance variable promotion, in trees' case because of the heavy use of the left and right child instance variables. Matrix multiply (intmm) and puzzle are loop-based and depend on the standard optimizations, in particular strength reduction, which is enabled by instance variable promotion for the inner loop dimension. Finally, towers and puzzle benefit from array alias analysis because the code repeatedly accesses the elements at the same array offsets.

[Figure: Geometric Mean of Execution Times Relative to C++ -O2 for the Object-oriented Benchmarks for a Range of Optimization Settings. Bar chart (y-axis: time vs. C++ -O2, 0.0 to 5.0; bars of 5.78 and 13.8 exceed the scale) for CA base, C++ -O2, C++ inline, C++ all virtual, C++ no inlining, C++ no inlining all virtual, and CA with successive optimizations disabled (no instvars, no inlining, no cloning, no analysis, no arrayalias, no standard).]

Object-Oriented Codes: Overall Results

The figure above graphically summarizes the execution time for the object-oriented benchmarks for CA at seven optimization settings and for the C++ versions at five. Overall, the Concert compiler produces much better performance, a factor of four improvement, than the C++ compiler, even with user-provided annotations and automatic inlining enabled (-O2). Again the individual optimizations made their contributions, starting from an initial performance point approximately an order of magnitude worse than C++. Context-sensitive flow analysis provided more than the factor of two seen for the procedural codes, indicating its relative importance for object-oriented codes; cloning was likewise responsible for a substantial factor. Inlining contributed a factor of four, again showing its relative importance for OO codes. Finally, standard low level optimizations enabled by instance variable promotion contributed a factor of two.

We evaluated the performance of the benchmarks with the C++ compiler at five optimization settings: the base (-O2), automatic inlining, without any inlining, with all virtual functions, and without any inlining and with all virtual functions. Automatic inlining improved performance by approximately two percent, indicating that most of the automatically inlinable functions had been annotated. Disabling either inlining or the static binding annotations reduced performance, and with both disabled performance decreased further. Performance of the CA code without inlining was comparable to that of the C++ compiler without inlining. However, as we will see in the next section, the individual results vary, indicating this is probably just coincidental.

[Figure: Performance of CA and C++ relative to C++ -O2 on the OOP Benchmarks. Per-benchmark bar chart (y-axis: time vs. C++ -O2, 0.0 to 3.0) for Chain, Projection, and Richards, comparing CA base, CA with successive optimizations disabled (no arrayalias, no standard, no instvars, no inlining, no cloning, no analysis), and the C++ settings (inline, -O2, no inlining, all virtual, no inlining all virtual).]

Object-Oriented Codes: Individual Results

The figure above reports the results for the individual benchmarks, Richards and Delta Blue (Chain and Projection), under various optimization settings of both the Concert compiler and the C++ compiler. The CA versions vary from almost six times faster (Chain) to twenty-five percent faster (Richards) than C++ at -O2. In the case of Delta Blue this difference is attributable to many invocations of small methods. For example, in object-oriented fashion, Delta Blue uses a general List container object with a method which applies a selector (function pointer) across its elements (e.g., do: in Smalltalk or map in Scheme). The Concert compiler clones and inlines both invocation sites, turning this into a simple C-style loop containing operations directly on the elements, while the C++ compiler does not. For Richards, the performance difference is primarily a result of optimizations enabled by instance variable promotion. Richards uses an object to encapsulate its current state, including the head of the task queue, and manipulation of this state makes up the largest part of the execution time.

The different C++ optimization settings produced different results for the different benchmarks, much as they did for CA. For Richards, disabling inlining decreased performance measurably. However, making all methods virtual has no effect on performance at all. This is because Richards is largely a procedural code where the central switch statement has been replaced with a dynamic dispatch (the run method on Task objects). On the other hand, Delta Blue uses accessor methods and other small methods for encapsulation of object state, so for both Chain and Projection disabling inlining decreases performance. Furthermore, since Delta Blue does most computation through methods, making all methods virtual (which effectively prevents inlining of methods as well) reduces performance substantially, and the performance is unchanged when inlining is disabled as well.

These results show that, for these benchmarks, the overhead from object-orientation can be removed automatically. Furthermore, they show that the annotations provided by the C++ programmer are not sufficient to optimize the programs.

Related Work

Speculative inlining and specialization with respect to runtime checks is discussed in detail by Craig Chambers and Urs Hölzle in their respective theses on the Self system. This work was based either on guesses as to the type of a particular object or on runtime feedback (profiling), instead of interprocedural analysis. The focus of Chambers' work was preserving the information acquired through runtime checks over larger bodies of code, similar in purpose to the access region manipulations described in Chapter. Earlier accounts of these optimizations in the Self system are also available. Simple speculative optimization by if-conversion was discussed by Brad Calder and Dirk Grunwald as a replacement for table-indirected dynamic dispatch in cases where only a small set of methods of the same name existed, based on examining the class hierarchy. Recently, interest has turned to dynamic compilation using compiler-supplied templates. This work uses a combination of static and dynamic information and runtime checks to select optimized versions of code. Again, the emphasis is on simple analysis combined with profiling information. Optimization of object-oriented calling mechanisms has been discussed extensively; two relevant sources are Hölzle for the pure object-oriented language Self and Stroustrup for C++. The advantage of the system described in this chapter is that flow analysis and cloning can often determine the arity of a method invocation site to be a small number, often one. This results in a shift in the focus of optimization toward static binding and inlining, since the general dispatch mechanism is rarely used. Unboxing has also been popular for many years in the Lisp community; most recently, researchers in ML have discovered it in the context of polymorphic types. Because these ML systems are type-based and support separate compilation, they have neither complete knowledge of nor access to the whole program, preventing them from performing the sort of global transformations described in this chapter. Finally, the Standard Template Library has been proposed as a solution to the problem of optimizing polymorphic code in C++. In this system the notions of type parameterization and code replication are combined. The result is a system which provides specialized versions of code by massive code replication, largely out of the hands of the compiler. Given that modern computers are memory-bandwidth limited, this approach is of dubious value.

Summary

Abstraction boundaries and polymorphism in object-oriented programs are potential sources of inefficiency. Moreover, small method size, high invocation density, data-dependent invocations, data access overhead, and aliasing problems compound these inefficiencies. Invocation density is addressed through invocation optimizations: static binding, speculation, and inlining. Data access overhead is addressed by unboxing and by the elimination of pointer-based accesses through conversion of instance variables to Static Single Assignment (SSA) form. Array alias analysis and a suite of standard low-level optimizations are also performed. Results for the Stanford Integer Benchmarks demonstrate that a pure dynamically-typed object-oriented language implemented by Concert can be as efficient as C. Moreover, the Concert system implementations of the Stanford object-oriented benchmarks were much more efficient than the C++ implementation (G++) at the highest optimization level.

Chapter

Optimization of Concurrent Object-Oriented Programs

This chapter describes optimizations which are specific to distributed and concurrent object-oriented programs. Many of these are targeted at aspects of the execution model presented in Chapter, including object access control operations (Section). Section revisits the topic of speculative inlining in the context of locality and lock optimizations. Sections and are concerned with extending the dynamic range of speculation and with optimization of the use of the memory hierarchy, respectively. Section discusses touch placement. Finally, Section demonstrates the effectiveness of many of these transformations on the Livermore Loops.

Simple Lock Optimization

Our execution model requires locks for every method which accesses instance variables of the target object (Section). While these semantics support data abstraction in a concurrent environment, given the frequency of method invocations in object-oriented programs (Section) a direct implementation of this model implies a frequency of locking operations which is a serious source of inefficiency. Two optimizations which can eliminate unnecessary lock operations in many cases are access subsumption and the exploitation of stateless methods. The latter are also useful since no object is actually required for the method to execute, allowing such methods to execute anywhere on the target machine.

Access Subsumption

Access subsumption avoids redundant acquisition of locks that must have been previously acquired above in the call graph. So long as the calling method does not complete before the callee (i.e. tree-structured concurrency, Section), the calling method's access rights will subsume the callee's, making it unnecessary for the callee to acquire locks. Access subsumption optimization uses the call graph provided by analysis and cloning to recognize cases where a method is called from another method which has already acquired the locks required by the first. Alternatively, if not all callers acquire the needed locks, two versions of the method, a flag, or alternate entry points can be used to separate the cases or to pass along the contextual information.
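As a sketch of the two-entry-point approach (the names and the Account type are hypothetical; LOCK_OBJECT and UNLOCK_OBJECT are the lock primitives used in the figures below), a method can be compiled with a locked entry that skips acquisition for callers whose access rights already subsume it:

typedef struct { int balance; } Account;

/* entry used when the caller's access rights subsume the callee's:
   the lock on self is already held above in the call graph */
static void deposit_locked(Account *self, int amount) {
  self->balance += amount;        /* no lock operations needed */
}

/* entry for callers that do not hold the lock: acquire, delegate, release */
static void deposit(Account *self, int amount) {
  LOCK_OBJECT(self);
  deposit_locked(self, amount);
  UNLOCK_OBJECT(self);
}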

Stateless Methods

Stateless methods do not, on their own, access any state of the target object, and therefore need neither execute local to the object nor acquire any locks to ensure correct semantics. In many cases stateless methods merely validate arguments or add default parameters. Often they were not stateless initially, but after cloning and interprocedural constant propagation they become stateless. For example, consider the at method in Figure. If this method is cloned for arrays of a particular dimensionality, the cols instance variable becomes constant, and the two-dimensional at becomes stateless, simply passing a computed value to its inherited at method. Likewise, aggregates within CA and collections within ICC++ are objects which provide message vectoring and indexing capabilities based on invariant distributed information. While not technically stateless, these methods can be executed anywhere and do not require locks. Similarly, methods which only read temporally constant globals are stateless. Stateless methods do not require locks and can be inlined without concern for locality.
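As an illustration, a minimal C sketch of the at example (the types and the constant 64 are assumptions for illustration; the actual method appears in the figure cited above):

typedef struct { int *elems; } Array1D;
typedef struct { Array1D base; int cols; } Array2D;

static int at1d(Array1D *self, int k) { return self->elems[k]; }

/* general two-dimensional at: reads the cols instance variable */
static int at2d(Array2D *self, int i, int j) {
  return at1d(&self->base, i * self->cols + j);
}

/* clone for cols == 64 after interprocedural constant propagation:
   it reads no instance state itself (the inherited one-dimensional at
   performs the actual access), so it is stateless and needs no lock */
static int at2d_64(Array2D *self, int i, int j) {
  return at1d(&self->base, i * 64 + j);
}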

Speculative Inlining Revisited

Speculative inlining uses runtime checks to condition the execution of inlined code. In Section, speculation was used to inline code when the selector or concrete type of the target object was not known at compile time. In this chapter we are concerned with runtime properties resulting from distributed and concurrent execution: locality and locking. A method may only be inlined if any required data is available, the target object is local, and any required lock is available.

if (selector == SELECTOR_AT &&          /* runtime checks */
    LOCAL_POINTER(X) &&
    CONCRETE_TYPE(X) == CLASS_ARRAY &&
    LOCK_OBJECT(X)) {
  /* inlined method body of at: access region of X */
  UNLOCK_OBJECT(X);                     /* release resources */
} else
  INVOKE(selector, X, i);               /* fallback code */

Figure: General Form of Speculative Inlined Invocation

In general, a speculatively inlined method will have the form in Figure. The first condition checks that the selector is that of the method to be inlined (SELECTOR_AT). The second (LOCAL_POINTER) checks that the target object is local. Since the state of the object, including its concrete type and the values of its lock fields, will not be available unless the object is local, this check must be conducted first. If the object is local, the concrete type is verified to be that of the method to be inlined (CLASS_ARRAY). The last check attempts to acquire the necessary locks. In ICC++ an additional parameter is required giving the lock mask, since the current ICC++ implementation provides individual locks for each instance variable (in fact, a separate lock need only be provided for sets of instance variables accessed as a unit). If these checks succeed, the body of the method is inlined. For a given speculative inline it may be possible to omit some or all of these checks; in this chapter we will assume that methods have been statically bound, and omit the selector and concrete type checks. The region of code in which access to the object has been obtained is called an access region.

Access Regions

Access regions are created for each inlined operation on objects which require them. Figure shows the twelfth kernel of the Livermore Loops as an example. The C code for the kernel (on the left) contains a doubly nested loop surrounding three accesses to two objects. On the right is a graphical representation of the access regions induced by speculative inlining of the access operations. The form of the graph is the Program Dependence Graph (PDG). The squares represent basic blocks (sets of statements which are all guaranteed to execute if any one executes), the ovals represent conditionals, and the circles represent access regions. The lines represent control decisions, followed when the condition is true (T) or false (F); multiple lines with the same condition value are all followed for that value.

for (l = 1; l <= loop; l++)
  for (k = 0; k < n; k++)
    x[k] = y[k+1] - y[k];

[PDG graphic: two nested while loops; under the inner loop, three access regions (read: y, read: y, write: x), each guarded by a test (test: y, test: y, test: x) with T edges into the region and F edges into an invoke fallback (invoke: y, invoke: y, invoke: x).]

Figure: Livermore Loops Kernel 12 Code and Access Regions

Extending Access Regions

Entering the access regions introduced by speculative inlining requires costly runtime checks. Since methods are often small, in unoptimized code access regions are entered frequently and the overhead can be severe. Consider the loop in Figure: three runtime checks are issued for each iteration of the inner loop. In order to reduce the overhead of these checks, the dynamic extent of the access regions should be expanded. This not only reduces the runtime check overhead but also produces larger basic blocks for the general optimizations of Section.

Since the Concert compiler normalizes all loops to while loops, the PDG forms a tree. Thus access regions are properly nested, with locks being acquired and released at the same nesting level. This means that all operations on access regions can be broken down into operations which move statements into a region and operations which create new, empty regions with some set of conditions (e.g. speculatively acquiring access to a set of objects). The statements which are moved into a region may include other regions, conditionals, and loops.

In this section we first consider correctness issues in extending the dynamic extent of access regions (Section). We then describe three particular transformations: adding statements to access regions (Section), merging of adjacent access regions (Section), and lifting access regions over loops and conditionals (Section).

Safety

Optimizations which expand access regions must not change the meaning of the program: they must be safe. In particular, they must preserve the original exclusivity properties and may not introduce deadlock. For example, transformations may not make it possible for the operations given by two separate methods to interfere, reading and writing the same variables over the same period of time. Likewise, transformations cannot hold a resource and then try to acquire that resource again with a blocking operation, inducing deadlock. These properties will be discussed for the individual optimizations.

Only the safety of those transformations which are peculiar to access regions is discussed here. In particular, the safety of moving a statement into both branches of a conditional, of breaking a two-sided conditional into two one-sided conditionals, and of eliminating code is not considered, as these are covered in standard compiler texts.

Adding Statements to a Region

Entrance criteria for a region condition the enclosed storage accesses by tests for locality and access control conditions. Furthermore, the resources represented by the locks held on the objects within the region are held over the region. Hence there are three types of statements: functional statements (those which do not access storage or hold resources), statements which access storage, and statements which hold resources (additional regions and blocking primitive operations).

Statements which are functional cannot interfere, nor can they capture shared resources: the registers and execution units they require are managed by the compiler. Hence they can be moved safely into any region. For a storage access to be moved into a region, the tests for the destination region must subsume the tests for the storage access. Furthermore, if storage accesses for the same object from two distinct regions are moved into the region, they must be relatively exclusive; one way to achieve this is to serialize the operations within the region. Finally, statements which hold resources cannot be moved into a region if doing so will induce a cycle in the resource graph, and hence deadlock. Lock cycles are prevented by merging the access regions (Section).

Primitives which hold resources (e.g. input/output) can also give rise to deadlock. However, they are not common in performance-critical sections and can be conservatively excluded from the region. Furthermore, standard deadlock prevention techniques like ordering resources can be applied using the conservative call graph and alias information provided by flow analysis (Chapter).

Beyond these basic constraints, access-region expanding optimizations must also ensure that we do not move operations into an object's access region which could affect or be affected by the locality or locked status of the object. For example, we cannot move in any operation which might migrate the object, nor any operation which might directly or indirectly require its own lock on the object.

if (LOCAL_POINTER(X) &&
    LOCK_OBJECT(X)) {
  /* inlined method body of at: access region */
  j = i;
  UNLOCK_OBJECT(X);      /* release resources */
} else {
  INVOKE(at, X, i);      /* fallback code */
  j = i;
}

Figure: Adding j = i to an Access Region

Figure shows a simple functional statement, j = i, added to the access region from Figure. Since the statement is functional, no additional conditions are required for entrance into the region. The statement is added both to the access region and to the fallback code, preserving the meaning of the program under different dynamic conditions.

Making a New Region

An empty access region consists solely of a test of access conditions and a temporary acquisition of resources. Such regions do not change the meaning of the program unless they introduce new deadlocks. Such deadlocks would arise from new dependences between locks, and can be prevented by testing and obtaining all required locks atomically; the runtime (Section) provides multi-locking atomic primitives. Thus, if a set of resources is available, they are acquired. Since the empty region contains no statements, whether or not they are acquired, the resources are immediately released. Thus new regions can be created without introducing deadlocks, and statements can be moved into the region as above.

[PDG graphic: a new, empty access region guarded by test: y,x (with an empty fallback) inserted in the loop body before the three original regions (read: y, read: y, write: x, each with its own test and invoke fallback).]

Figure: Livermore Loops Kernel 12 with New Region

New regions are created with access conditions for a set of operations which can be profitably optimized as a unit. For example, to optimize the statements enclosed in the three regions from Figure, the access conditions would require testing the properties of both x and y. Figure shows the creation of such a new region; note that both the access region and the fallback code blocks are empty.
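A sketch of such an empty region, in the pseudo-C of the figures (the multi-object lock primitive is assumed to be one of the runtime's atomic multi-locking primitives):

if (LOCAL_POINTER(x) && LOCAL_POINTER(y) &&
    LOCK_OBJECTS(x, y)) {   /* test and acquire all required locks atomically */
  /* empty access region: statements are moved in by later transformations */
  UNLOCK_OBJECTS(x, y);     /* nothing was done, so release immediately */
} else {
  /* empty fallback block */
}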

Merging Access Regions

Merging access regions combines the access conditions for two regions and merges the access and fallback code blocks. First, a new region is created with the conjunction of the access conditions. Then, using the partial order of execution derived from local data flow and the CFG data dependences (Section) in the PDG, the statements which must execute between the two regions are determined. These statements are moved into both the access region and the fallback block for the new region. If all such statements can be moved, the regions can be merged. Finally, the access and fallback statements are moved from the two initial regions into their respective branches of the new region.

The combined access conditions represent the conjunction of those for the original regions. If the new conditions attempt to acquire the locks on a single object twice, the attempt will fail, preserving mutual exclusion. However, if we know the two objects are the same, we can take out a single set of locks and ensure mutual exclusion through relative atomicity, by sequencing the operations from the two regions so that they do not interleave. This requires a must-alias determination, which need only be conservative since the fallback code is completely general. The statements from the access and fallback blocks of two such regions are given an order consistent with the partial order of execution when they are placed into their respective branches of the new region.

[PDG graphic: the loop body now contains a single access region guarded by test: y, x; the optimized block contains read: y, read: y, write: x, and the fallback block contains invoke: y, invoke: y, invoke: x.]

Figure: Livermore Loops Kernel 12 After Merge

The result of merging the regions in Figure appears in Figure. The new region introduced in Figure has absorbed all the statements from the other regions. These regions can then be eliminated safely, following the logic by which new empty regions were introduced, without changing the meaning of the program. The three conditionals have been merged into a single access condition, and the three optimized and fallback blocks into a single optimized block and a single fallback block.
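Rendered as code, the merged region for one iteration of the kernel body might look like the following sketch (same pseudo-primitives as the figures; the serialized accesses replace the three separate regions):

if (LOCAL_POINTER(x) && LOCAL_POINTER(y) &&
    LOCK_OBJECTS(x, y)) {        /* conjunction of the access conditions */
  x[k] = y[k+1] - y[k];          /* the three accesses, serialized */
  UNLOCK_OBJECTS(x, y);
} else {                         /* fully general fallback */
  t1 = INVOKE(at, y, k+1);
  t2 = INVOKE(at, y, k);
  INVOKE(put_at, x, k, t1 - t2);
}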

Lifting Access Regions

The PDG is a tree whose interior nodes are conditionals and while loops. Lifting access regions higher in this tree can improve efficiency by enabling runtime testing overhead to be removed from loop bodies. Lifting access regions over conditionals can enable further merging and lifting operations, which also increase efficiency. The key to lifting access regions is to ensure that all the statements under the PDG node to be lifted over can be safely added to the region. Using a bottom-up traversal of the PDG, at each level we attempt to merge adjacent access regions until only one remains within the control dependence region. We then attempt to move any remaining statements into the single access region. If there is one access region and no other statements in a control dependence region, that region can be lifted over the conditional or while loop.

[PDG graphic: the access region, guarded by test: y, x, has been hoisted above the two while loops; the optimized loop nest contains read: y, read: y, write: x, and the fallback loop nest contains the corresponding invokes.]

Figure: Livermore Loops Kernel 12 After Hoist

Consider the two types of control structures in our PDG: while loops and conditionals. While loops have a single control dependence region under them, corresponding to the loop body. If this control dependence region is entirely contained in a single access region, the loop header can be moved into the access region. This results in an access region which is lifted over the loop. For single-armed conditionals the transformation is analogous. So long as loops eventually terminate, the resources acquired by the access region will be released; fortunately, the notion of fairness supported by Concert requires loops to terminate. A stronger notion of fairness would require periodically releasing the locks at points consistent with the required mutual exclusion of the initial regions.

Conditionals with two arms require the two regions to be merged and lifted simultaneously. Logically, we break such conditionals into two one-armed conditionals, the second with the negation of the original condition. Then the two access regions are lifted over the conditionals. Next, the two access regions are merged. Finally, the two one-armed conditionals within the access region are recombined into a single two-armed conditional. The result is that the region has been lifted over the two-armed conditional. In practice these operations can be combined and the entire transformation done at once.

if (LOCAL_POINTER(x) && LOCAL_POINTER(y) &&
    LOCK_OBJECTS(x, y)) {
  for (l = 1; l <= loop; l++)
    for (k = 0; k < n; k++)
      x[k] = y[k+1] - y[k];
  UNLOCK_OBJECTS(x, y);
} else {
  for (l = 1; l <= loop; l++)
    for (k = 0; k < n; k++) {
      t1 = INVOKE(at, y, k+1);
      t2 = INVOKE(at, y, k);
      INVOKE(put_at, x, k, t1 - t2);
    }
}

Figure: Kernel 12 After Lifting

Access Region Optimization

Determining which regions can be most profitably merged requires information about the probability that access conditions will hold. For example, a block of code which operates on three objects, two of which are usually local and one of which is usually remote, should be implemented with two access regions: one for the two usually local objects and one for the usually remote object. This is because the optimized path is only executed when all the conditions are satisfied at once. Since the cost of the general case (blocking for a lock or a remote message) is large, the optimization extracts high efficiency from the optimized path at a relatively small increase in cost along the unoptimized path. However, including a condition which is often unsatisfied prevents the optimization of all affected code. Of course, additional levels of speculation can be used in the fallback block, at the cost of additional code expansion.

Once the access regions have been expanded, the code consists of larger regions of optimizable sequential code. If the program spends the majority of its time in these regions, it will be very nearly as efficient as a sequential implementation. Applying these transformations to the code on the left of Figure results in the code structure in Figure. The final result is an optimized loop body under two loops, conditioned by a single combined access condition, and the fallback code under a copy of the same two loops. When both x and y are local, the loop nest in the access region is identical to a sequential program. An example of the resulting code appears in Figure.

Caching

Caching is the storage of a piece of data higher in the memory hierarchy (Section), where it can be accessed with greater efficiency. There are three classes of data which can be cached under different conditions: globals (Section), local variables, and instance variables.

Our programming model disallows pointers to local variables, since such pointers break encapsulation of the local state (Chapter). If this were not the case, aliasing problems combined with concurrency might require a local variable to be reloaded from memory for each operation. Instead, the programming model allows multiple return arguments, which satisfies most uses of such pointers. In the remaining cases the data element can be represented by an object. Since objects are potentially aliased, by extension an aliasing problem exists for instance variables. Fortunately, locks and the information provided by flow analysis can be used to effectively test for and conservatively approximate object aliasing.
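For instance, where a C program might pass a pointer to a local (divmod(a, b, &rem)), the model returns multiple results instead; a minimal C rendering of the idea (names hypothetical):

typedef struct { int quot; int rem; } DivMod;

/* both results are returned by value: no pointer into the caller's
   locals escapes, so local state stays unaliased and cacheable */
static DivMod divmod(int a, int b) {
  DivMod r = { a / b, a % b };
  return r;
}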

Local Variables

Local variables are part of the state of a thread. Because threads can block for indeterminate time, in general local variable data must be stored in a persistent store like the heap. Since accessing such data can be expensive, local variables are cached in registers where possible. Two pieces of information are required to cache local variables: the type of the data, which determines the sort of register to use, and the lifetime of the value. This lifetime differs from normal lifetime computations because it is dependent on possible context switch points, where the data must be stored to the persistent store.

Exploiting lifetimes across basic blocks is discussed in detail elsewhere; essentially, the idea is to partition the control flow graph into contiguous sets of instructions delimited by possible context switch points (touches). (Chapter considers the tradeoff between caching data and saving data back in the heap.) For each partition, the set of variables active in that partition is cached in registers. At partition boundaries the compiler generates code mapping the active set of one partition to the other, storing or loading variables between registers and the heap as appropriate. Figure shows an example with two partitions and the code to load and store the local variables. If the context switch does occur, all variables will have been stored to the heap and all required variables will be loaded on resumption.

/* retrieve a and b from the store */
func(a);
b = a + b;
func(a);
func(a);
/* store b to the store */
POSSIBLE CONTEXT SWITCH
/* if switched, retrieve a from the store */
func(a);
func(a);

Figure: Example of Caching and Partitions

Instance Variables

In Section, aliasing information derived from flow analysis is used to convert instance variables into Static Single Assignment form. This enables them to be allocated to registers over parts of their lifetimes. Access regions provide aliasing information which can be used for the same purpose. Within an access region, all accesses to the data of locked objects must be through known pointers. Hence, by exploiting access regions, object state can be cached in registers safely, eliminating memory accesses and requiring only a single update at the end of the access region or at a subsequent method invocation on the object in question.

On the left in Figure, a two-dimensional array is accessed by linearization (Section). The code on the right uses the properties of access regions to cache cols in a local temporary, temp. Specialized versions of methods can also be created which take the instance variables in registers. Such data structure transformations are made possible through cloning (Chapter), which allows specialized versions of methods to be created based on characteristics of the calling environment.

/* original: linearized indexing reads a.cols on every iteration */
if (LOCAL_POINTER(a) && LOCK_OBJECT(a)) {
  for (let i = 0; i < a.rows; i++)
    for (let j = 0; j < a.cols; j++)
      ... a[a.cols * i + j] ...;
  UNLOCK_OBJECT(a);
} else
  ... fallback ...;

/* after caching: cols held in the local temporary temp */
if (LOCAL_POINTER(a) && LOCK_OBJECT(a)) {
  temp = a.cols;
  for (let i = 0; i < a.rows; i++)
    for (let j = 0; j < temp; j++)
      ... a[temp * i + j] ...;
  UNLOCK_OBJECT(a);
} else
  ... fallback ...;

Figure: Caching of Instance Variables

This optimization saves a memory reference in the innermost loop and enables other optimizations such as strength reduction. Even if there were an assignment to cols within the loop, so long as it did not involve the object a, this transformation would be safe. The access conditions act as a dynamic alias check, ensuring that no other accesses to the object will occur except through the designated reference.

Globals

In order to reduce the volume of communication in a distributed memory machine, variables which are used frequently but do not change their value over some period of time should be replicated in memory closer to each processing element. This is done by detecting a set of reads delimited by synchronizations from the surrounding write operations, inserting a distribution operation, and redirecting the reads to local copies.

In a programming model with tree-structured concurrency, time can be defined in terms of the synchronizations at the start and end of a subtree of tasks. Thus globals which are constant in all concurrent subtrees are constant over some period of time. For example, in Figure the task on the left writes x but is synchronized with the two concurrent task subtrees on the right, both of which read x. Thus x does not change after the synchronization.

[Graphic: a task tree; the left task writes x = ..., and a child sync separates it from two concurrent subtrees, both of which read x (... = x); over those subtrees x is constant.]

Figure: Temporally Constant Globals

This situation of temporally constant globals can be detected by a specialized interprocedural analysis using the interprocedural call graph produced by flow analysis (Chapter). The objective is to detect when some group of reads is delimited by synchronization points from the writes. The initial synchronization is the point where the value becomes fixed, after which it can be replicated in each local memory. The final synchronization prevents causal inconsistencies.

To see the causal inconsistency more clearly, it is necessary to flatten the call tree into a task graph. In Figure, the write at A is separated by a synchronization (shown as an arrow) from the reads at B and C; however, the write at D is unsynchronized. In theory, both B and C could use the value produced by A. If D and B are assigned to the hardware so that they share the same local memory, D could update the value of y which B reads, while C would read the replicated value from A. Because B is synchronized with C, this forms a temporal inconsistency: B sees the world after D, while C, which should occur after, sees the world before D. We might try to eliminate this problem by separating the replicated and updatable data, but that leads to a second type of inconsistency.

[Task-graph graphic: writes y = ... at D, x = ... at A, and x = ... at E; reads ... = x at B and C and ... = y at B; an arrow marks the synchronization separating A from B and C, while D is unsynchronized.]

Figure: Global Temporal Inconsistency

The second sort of inconsistency comes from indirect communication through shared data. Consider the variable y in Figure. Here we have an analogous situation: the values of y and x are synchronized through D, hence B cannot use the replicated value of x together with the unreplicated y. Clearly there are situations where the requirement of the final synchronization can be avoided; however, this optimization has proven useful even with that restriction. As we will see in Sections and, scheduling and atomicity can be exploited to further reduce communication costs for temporally constant data.

Thus, when a set of delimited read operations is found, a distribution operation is inserted and the reads are redirected to the local copies of the data. Figure shows a group of read operations optimized in this manner.

[Graphic: after the initial synchronization, a Replicate X operation distributes x; the reads at B and C become reads of x_local, each node's local copy.]

Figure: Temporally Constant Globals
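In generated-code terms, the transformation might look like the following sketch (the REPLICATE primitive and the x_local copy are assumptions for illustration):

/* after the initial synchronization, x is temporally constant:
   distribute it once into each node's local memory... */
REPLICATE(x, &x_local);

/* ...and redirect the delimited reads to the local copy */
a = x_local;
b = x_local + c;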

Touch Optimization: Futures

Touches are the points at which a thread synchronizes with the results of asynchronous message sends. Their placement affects the number of context switches and the amount of state which must be saved at each context switch point. In order to minimize the number of context switches and maximize the effectiveness of latency hiding, touches are pushed forward in the program and grouped.

Pushing

The ability of a concurrent program to withstand remote operation latency is dependent on the number of outstanding concurrent asynchronous operations and the distance in time between when the operations are initiated and when the results are required. Touches are the points where the results are demanded, and they must occur before the results are used. Placing touches as far from the point where the future was created as possible increases both the number and the duration of asynchronous operations.

Figure shows two alternative touch placements. On the left, each operation is initiated and its results required in turn (i.e. sequentially). On the right, both operations are initiated first, allowing them to overlap in time before any result is required. In this way the program pays the maximum of the two operation latencies instead of the sum.

Touches are inserted along the data frontier. The data frontier is the last set of points in the control flow graph which (possibly redundantly) dominates all of the uses of the data.

/* left: sequential, each operation initiated and touched in turn */
xFuture, xCont = MAKE_FUTURE(x);
INVOKE(meth, a, xCont);
TOUCH(x);
yFuture, yCont = MAKE_FUTURE(y);
INVOKE(meth, b, yCont);
TOUCH(y);
z = x + y;

/* right: overlapped, both operations initiated before any result is required */
xFuture, xCont = MAKE_FUTURE(x);
yFuture, yCont = MAKE_FUTURE(y);
INVOKE(meth, a, xCont);
INVOKE(meth, b, yCont);
TOUCH(x);
TOUCH(y);
z = x + y;

Figure: Touch Placement for Latency Hiding

[CFG graphics: three control flow graphs, each starting from INVOKE(...CONTINUATION(x)...) and leading to uses of x; the data frontier is marked at the last points dominating all uses of x, and where conditionals (T/F branches) intervene, the frontier includes a point on each path.]

Figure: Data Frontier for Touch Insertion

Consider the control flow graphs in Figure. Within a basic block, the frontier is the single point before the data is used (left). If a conditional or loop is encountered, the frontier may include several points. Because touches are idempotent, it is safe to touch a value which has already been touched; hence the data frontier may include more than one point along some paths through the program (right).
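A small sketch, in the pseudo-C of the earlier figures, of how idempotent touches permit a frontier point on each path:

INVOKE(meth, a, xCont);   /* initiate the asynchronous operation */
if (cond) {
  TOUCH(x);               /* frontier point on this path */
  use(x);
}
TOUCH(x);                 /* idempotent: safe even if already touched above */
use(x);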

When control flows to a higher level in the PDG (i.e. exits a loop or conditional branch), a node is encountered, causing the touch to be inserted before the loop or branch is completed. Counting continuations are an exception (Section). Counting continuations are used to synchronize the termination of all the bodies of a parallel loop. The creation of a normal continuation for a variable which had not yet been touched would result in the destruction of the future and a failure to synchronize with the first result value. However, since counting continuations record the number of outstanding result values, multiple continuations can be constructed for the same variable, and a single touch placed outside the loop is used for synchronization.

Figure shows a loop (left) with a standard continuation: the variable is touched before the iteration terminates. In the case of the counting continuation (right), the variable is touched outside the loop, synchronizing once for all iterations.

[CFG graphics: on the left, INVOKE(...CONTINUATION(x)...) with the data frontier inside the loop body; on the right, INVOKE(...COUNT_CONTINUATION(x)...) with the data frontier after the loop exit.]

Figure: Data Frontier for Loops

There is an inherent tradeoff between latency hiding and active thread state: the more asynchronous invocations are outstanding, the more active state must be maintained. For example, in Figure the code on the left touches the variables x and y early. This allows it to resolve the variable q, making x and y inactive. Alternatively, the code on the right waits for all of the variables x, y, and z to be computed before doing any computation. The code on the left requires less active state, but the code on the right hides latency better.

/* left: touch x and y early, resolving q and making x and y inactive */
xFuture, xCont = MAKE_FUTURE(x);
yFuture, yCont = MAKE_FUTURE(y);
INVOKE(meth, a, xCont);
INVOKE(meth, b, yCont);
TOUCH(x);
TOUCH(y);
q = x + y;
zFuture, zCont = MAKE_FUTURE(z);
INVOKE(meth, c, zCont);
TOUCH(z);
r = z + q;

/* right: initiate all three operations before doing any computation */
xFuture, xCont = MAKE_FUTURE(x);
yFuture, yCont = MAKE_FUTURE(y);
zFuture, zCont = MAKE_FUTURE(z);
INVOKE(meth, a, xCont);
INVOKE(meth, b, yCont);
INVOKE(meth, c, zCont);
TOUCH(x);
TOUCH(y);
TOUCH(z);
q = x + y;
r = z + q;

Figure: Touches and Active State

Grouping

To minimize the number of times a thread is restarted as a result of the value of a future becoming available, touches are grouped so that the thread restarts once for a set of values. Figure illustrates how multiple outstanding messages and a single multi-touch operation are used. In a fine-grained concurrent language, the order of the invocations of meth on a, b, and c is not strictly specified. This enables the compiler to pull the message sends up and push the touches down.

v1 = a.meth();
vFut1, vContinuation1 = MAKE_FUTURE(v1);
v2 = b.meth();
vFut2, vContinuation2 = MAKE_FUTURE(v2);
v3 = c.meth();
vFut3, vContinuation3 = MAKE_FUTURE(v3);
if (MULTIPLE_TOUCH(vFut1, vFut2, vFut3)) SUSPEND;
func(v1, v2, v3);

Figure: Latency Hiding and Multiple Touches

Experimental Results

To test the effectiveness of these transformations, we compare the performance of our concurrent object-oriented system to C. For comparison, the Livermore Loops, a set of numerical kernels, are used to measure efficiency through computation rate. The Livermore Loops are used for three reasons. First, they are a traditional benchmark for sequential compilers. Second, they are extremely sensitive to any inefficiency: any extra operations in the inner loop can dramatically reduce performance. Third, the granularity of each method (the amount of work per method invocation) is very small: many methods simply compute the array index or load or store a single element. This makes the elimination of overhead especially important. All reported numbers are for the Livermore kernel workload at single precision, run on a SparcStation II. The COOP execution times were collected with the UNIX time facility using high iteration counts and are accurate to within a few percent.

The COOP programs are written in a natural object-oriented style. Multidimensional arrays were created by subclassing a single-dimensional array and using methods to linearize the indexing operations (see Section). Since our COOP programming model does not allow pointers, the programmer cannot bypass the encapsulation of the arrays, as is typically done in C programs to obtain efficiency. We compare our COOP system's performance against the native C version of the Livermore kernels compiled by the same GNU CC compiler as we used for the Concert backend, minimizing differences in low-level optimizations like instruction selection and scheduling.

To illustrate the effect of the different optimizations, we applied each in turn to Kernel 12. As with the results presented in Chapter, a full suite of conventional low-level optimizations was also applied; these have been separated out. These numbers are presented in Figure. Each transformation produces a significant performance improvement.

Traditional optimizations alone (none) achieve only several kiloFLOPS (thousands of floating point operations per second). Speculative inlining produced an eighteenfold performance improvement. Merging access regions increased performance by yet another factor, while lifting them brings this to MFLOPS. After applying the standard low-level optimizations, the final results essentially match the native C code: the performance of the COOP code and that of the native C code are quite close. Furthermore, this performance was achieved for COOP code written in a natural object-oriented style, using two-dimensional arrays implemented by extending one-dimensional arrays. The performance results for all of the Livermore kernels are in Figure.

[Figure: Cumulative Effect of Optimizations on Kernel 12. Bar chart; y-axis MFLOPS, 0 to 1.6; settings include none, inline, access, hoist, and cache.]

[Figure: bar chart; y-axis MFLOPS, 0 to 8; x-axis Kernel 1 through 24; series COOP and C.]

Figure: Performance on Livermore Loops

The COOP implementation was faster on five kernels, the C implementation was faster on six, and the remaining thirteen were essentially the same. For the codes where the C compiler gave superior performance, the differences are the result of idiomatic array manipulation optimizations in the GNU C compiler and some deficiencies in our strength reduction optimization (it uses extra registers and does not work for operations under conditionals in loops, as in Kernel). The GNU C compiler manipulates arrays without the use of integer multiplication, dramatically improving performance on the SparcStation II, which uses a multiply-step instruction. Where the COOP system was faster, the major factor was again low-level optimizations: by recognizing more general patterns instead of idioms, we were able to apply some optimizations where the GNU C compiler was unable to.

Overall, these results demonstrate that the potential overhead of the COOP model can be eliminated. The remaining differences are the result of differences in standard low-level optimization. Furthermore, in some cases the COOP model has exposed additional opportunities for optimization. For example, dynamic aliasing information is explicit in the access regions and could be used to reorder memory operations within the loops, potentially increasing the efficiency even further.

[Figure: bar chart; y-axis % Performance Difference, -100 to 100; x-axis Kernel 1 through 24.]

Figure: Performance Difference, COOP vs. CC

Related Work

There are few high-performance fine-grained COOP language implementations; many COOP systems use COOP simply as a coordination language for algorithmic cores written in more conventional languages like C or FORTRAN. Nevertheless, there are systems which transform COOP programs for greater efficiency. The HAL system supports Actors-style programming, which differs somewhat in synchronization and concurrency introduction from the Concert COOP style, with a high degree of flexibility and efficiency. The family of ABCL implementations uses a variety of different techniques and strategies to obtain performance, with an emphasis on the efficient implementation of various language features. Exclusivity and deadlock issues appear in the concurrent systems framework of critical regions, monitors, and deadlock prevention. Our inlining and access region lifting techniques draw on the efforts at Rice on Fortran D, in particular the ideas on the importance of interprocedural optimizations within loops. Also, combining and lifting access regions resemble invariant lifting and the lifting and blocking of communication in parallel Fortran. However, access region optimizations have special safety requirements and require managing fallback code and manipulating entrance criteria.

Summary

This chapter discusses a number of optimizations specific to concurrent object-oriented programming. Lock operations can be optimized by taking advantage of access subsumption, when the structure of the call graph ensures that the access rights required by a method will have already been acquired when the method is called, and by recognizing stateless methods, which do not require access at all. Speculative inlining introduces access regions, regions of code over which access to an object has been granted. These regions can be transformed so as to amortize the cost of speculation and to increase the size of access regions for conventional optimizations. Optimization of memory hierarchy traffic is of particular importance for COOP codes, where much of the data is accessed by indirection. Flow analysis and the information provided by obtaining access to objects can be used to cache data at higher levels of the memory hierarchy for efficiency. Distributed global variables can likewise be optimized by using the call graph to detect temporally constant globals and by caching their value on each node. Synchronization of threads is another potential source of inefficiency, which can be optimized by careful placement of touches. The programmer can also provide locality and locking information, which must be propagated to the points of use in the intermediate representation. It is shown that the Livermore Loops written in a natural COOP style can be made as efficient as C when the data is locally available and the required locks are available.

Acknowledgments

Caching of local variables, summarized in Section, is the work of Xingbin Zhang, who contributed greatly to this work in general; it is included here only for completeness.

Chapter

Hybrid Sequential-Parallel Execution

The hybrid parallel-sequential execution model presented in this chapter adapts for parallel or sequential execution and provides a hierarchy of calling schemas of increasing power and cost. Together these enable hybrid execution to achieve both high sequential efficiency, when the required data is available, and latency hiding and parallelism generation where required. Section describes how irregular programs can benefit from the adaptive nature of hybrid execution. Section describes hybrid execution: the parallel and sequential versions of methods, four sequential schemas, and the wrapper functions used to match different calling conventions. Finally, Section evaluates the effectiveness of hybrid execution for invocation-intensive codes and for regular and irregular applications.

Adaptation

Irregular and dynamic programs, such as molecular dynamics, particle simulations, and adaptive mesh refinement, have a data distribution which cannot, in general, be predicted statically. In addition, modern algorithms for such problems depend increasingly on sophisticated data structures to achieve high efficiency. Moreover, runtime techniques like dynamic data shipping (for increased data locality) and dynamic function shipping (for load distribution) disrupt the static data locality relationships. As a result, a program implementation must adapt to the irregular and dynamic structure of the data, exploiting locality where available, to achieve high performance.

Adaptation is a flexible technique which enables the program to take advantage of good data distributions provided by the programmer or the runtime system. The data distribution can even be modified at runtime based on evolving program information. The execution of the program, including thread creation, messaging, and synchronization, adapts to the new data distribution, providing immediate improvement. Figure shows an example of data (the circles) distributed around a parallel machine (the squares). Relationships between the pieces of data are described by lines between them. This is a good data layout, since groups of tightly coupled objects are on the same node.

Figure: Data Object Layout Graph

An efficient computation over the data minimizes the communication between nodes. For example, in Figure the tree of threads is distributed over the machine, with portions at the leaves executing on colocated objects. This structure enables specialized sequential code to be executed for sets of threads near the leaves, and specialized parallel code for those near the root. Such specialized code can be optimized to its task, so that the algorithm core will be as fast as conventional sequential code, whereas the parallel code will efficiently spawn parallelism and hide latency.

The sequential code is optimized by assuming that threads will complete immediately. This enables temporary storage to be reused immediately, expensive scheduling operations to be avoided, and cheap linkage mechanisms to be used. The normal program stack call and return mechanisms can be used, enabling efficiency matching conventional sequential code. If the thread does not complete, the running program adapts by spinning off a parallel thread lazily.

Figure: Distributed Computation Structure

Since the overhead of creating a thread is high compared to that of a conventional call, this strategy produces low overhead even when fall back to the parallel version is common.

Hybrid execution is the complement of hardware shared memory, which adapts the location of data to suit the location of computation; hybrid execution can adapt the location of the computation to the location of the data. Moreover, the units of communication and computation are controlled by the compiler and runtime system. Where the shared memory system must migrate a cache line to the requesting processor, hybrid execution can adapt to the movement of data and functions of arbitrary size. Also, while shared memory hardware allows a limited number of outstanding data requests based on a non-blocking cache, the number of outstanding operations and their latency has no fixed limit with hybrid execution.

While a static execution model might provide good performance for a particular computational structure, the hybrid execution model, with its adaptation and hierarchy of invocation schemas, can provide good performance for many structures. Using the results of flow analysis (Chapter), the compiler specializes the calling conventions based on the synchronization features required by the called method. Furthermore, the runtime provides a hierarchy of runtime primitives of increasing cost and complexity, enabling the compiler to select the most efficient mechanism for a given circumstance.

Hybrid Execution

Hybrid execution adapts to the computational structure of a running program by providing separately optimized sequential and parallel code. The sequential code executes method invocations representing potentially independent threads in LIFO (last in, first out) order and schedules them immediately. This eliminates the overhead normally associated with thread creation, scheduling, and synchronization. The parallel code spins off independent threads, generating parallel work and hiding the latency of concurrent operations.

The goals of hybrid sequential-parallel execution are efficiency, flexibility, portability, and support for a range of data layouts through runtime adaptation. This execution scheme is meant to supplement static compile-time techniques such as static data placement and code specialization (Chapter). In order to support these goals, hybrid execution uses sequential and parallel versions of methods and four distinct invocation schemas. These schemas range from cheap, simple, and limited to general, complex, and more expensive. To avoid confusion, invocations in the concurrent object-oriented programming model are called method invocations, and implementation-level C calls are called function calls.

Overview

For each method there are two versions: a parallel version optimized for latency hiding and parallelism generation using a heap context, and a sequential version optimized for efficient sequential execution using the stack. The parallel version is completely general, capable of handling remote invocations and suspension, but can be inefficient when the generality is not required. The sequential version comes in three flavors of increasing generality. These different versions and flavors use different calling conventions to handle synchronization, return values, and the reclamation of activation records. Table describes these cases.

Case                      Basic Operation
Parallel                  Most general schema; all arguments/linkage through the heap;
                          frame reclamation based on function termination
Sequential:
  Nonblocking             Regular C call/return
  May-Block               Regular call; check return code to either continue
                          computation or peel stack frames to the heap
  Continuation-Passing    Extension of May-Block which enables
                          forwarding on the stack

Table: Invocation Schemas

In practice, one of the versions may not actually be generated if it is deemed unnecessary for either correctness or efficiency.

In the remainder of this section, the mapping from method invocations to C function calls is described. Since the Concert compiler backend generates C, this description roughly parallels the output of the compiler. First the parallel invocation mechanism is described (Section), followed by the flavors of sequential invocation (Section). Finally, Section discusses the proxy contexts and wrapper functions which are used to handle certain boundary cases.

method(args) {
  Slot a, b, c;
  INVOKE(methodA, &a);
  INVOKE(methodB, &b);
  INVOKE(methodC, &c);
  /* continue heap execution */
  if (touch(&a, &b, &c)) {
    store_state();
    suspend();
  }
  /* use values in a, b, c */
  REPLY(continuation, return_value);
}

Figure: Generated Code Structure for a Parallel Method Version

Parallel Invocations

The parallel invocation schema is a conservative implementation of the general case, allocating the activation record on the heap and passing the arguments through the heap as well. Parallel invocations create threads which preserve their state between context switches in the heap activation record. This is the execution model described in Section. By storing inactive temporary values in the heap-based record (the context), the cost of suspension is minimized. Such suspensions occur while waiting for the result of:

  a remote invocation,
  an invocation on a locked object,
  a blocking primitive (e.g. I/O), or
  a local invocation which itself has suspended.

Suspension and fall back from the stack invocation schemas are described below.

Figure shows an example of a parallel version of a method which makes several parallel method invocations and synchronizes on their return with a single touch. A pointer is provided to the slot corresponding to the return value. If the runtime system decides to create a new thread for the invocation, it will create a continuation and convert the slot into a future (Section). If the task is scheduled immediately, the continuation/future pair need not be created, and the value can be written directly into the return location with minimal overhead.
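A sketch of the slot mechanism (types and names assumed for illustration):

typedef struct { int full; int value; } Slot;

/* immediate scheduling: the callee writes straight into the caller's slot */
static void reply_immediate(Slot *s, int v) {
  s->value = v;
  s->full  = 1;
}

/* if the runtime instead creates a new thread, the slot is first converted
   into a future, and a continuation referring to it is passed to the callee */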

[Graphic: parallel invocation schema in four steps, (i) through (iv). In all cases linkage must be created between caller A and callee B: 1. create a context and continuation for B and a future for A, and write the arguments; 2. the return value is passed through the continuation to the future.]

Figure: Parallel Invocation Schema

Figure gives a graphical representation of the parallel calling schema. On a call to B, a thread is created for B along with a heap-based context. A future/continuation pair is created for the return value: the future is created in A, and the continuation is stored in B. Both threads are now free to execute in parallel. When B completes, it uses the continuation to pass the return value to the future in A.

Since the invocations execute in parallel, the return values can arrive in any order. A set of futures is touched at one time to avoid unnecessary restarts of the activation when not all of the needed values are available (Section). Concurrency is generated both across parallel calls and between caller and callee, and latency is masked by enabling several invocations from the same method to proceed concurrently. Thus parallel method versions are optimized for concurrency generation and latency hiding.

Sequential Invocations

There are three different schemas for sequential method versions, each requiring a different calling convention (see Figure). Since determining the correct schema can depend on nonlocal, transitive properties, the interprocedural call graph provided by flow analysis (Chapter) is used to conservatively determine the blocking and continuation requirements of methods and to select the appropriate schema. Since only one sequential version of each method is generated, this classification determines the calling convention used when the method is invoked.

Nonblocking:           val = non_blocking_method(...)
May-Block:             callee_context = may_block_method(&return_val, ...)
Continuation-Passing:  caller_context = cont_passing_method(&return_val, caller_info, ...)

Figure: Invocation Schema Calling Interfaces

The criteria for selection among the sequential method versions are as follows; a sketch of the selection pass appears below. If the method and all of its callees cannot block, then the Non-blocking version is used; in this case, the function return value can be used to convey the future value. When it cannot be shown that blocking will not occur, but the callee does not require a continuation, the May-block version is used; in this case, we optimistically assume the method will not block and allocate any required context lazily, as described in Section . Finally, the Continuation Passing version is used if the callee may require the continuation of a future in the caller's as yet uncreated context; in this case, both context and continuation are created lazily. Lazy creation of continuations is described in Section .
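The classification can be viewed as a conservative propagation over the interprocedural call graph. The sketch below is a simplified illustration under assumed types (CallGraphNode and its fields are not the Concert compiler's actual data structures), and the fixed-point handling needed for recursive call graphs is elided.

/* Sketch: classify each method by the most general schema required
 * by the method itself or any of its callees. */
typedef enum { NON_BLOCKING, MAY_BLOCK, CONT_PASSING } Schema;

typedef struct CallGraphNode {
  int manipulates_continuation;   /* stores or forwards its continuation */
  int may_block_locally;          /* e.g., remote access, locks, or I/O */
  int ncallees;
  struct CallGraphNode **callees;
} CallGraphNode;

Schema classify(CallGraphNode *m) {
  if (m->manipulates_continuation)
    return CONT_PASSING;
  Schema s = m->may_block_locally ? MAY_BLOCK : NON_BLOCKING;
  for (int i = 0; i < m->ncallees; i++) {
    Schema c = classify(m->callees[i]);  /* memoize for cyclic graphs */
    if (c > s)
      s = c;                             /* take the more general schema */
  }
  return s;
}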

Non-blocking: Standard Call

When the compiler determines that a method will not block, a standard C procedure invocation is used. Since this situation is determined over the call graph, entire non-blocking subgraphs are executed with no overhead. Thus, those portions of the program which do not require the flexibility of the full concurrent object-oriented programming model are not penalized.

int method(args) {
  variable1 = method1(...);
  variable2 = method2(...);
  variable3 = method3(...);
  /* ... use values in variable1, variable2, variable3 ... */
  return return_value;
}
Figure Code Structure for a Non-blocking Sequential Method

Figure shows the structure of a non-blocking sequential method. Return values are assigned directly to variables, which are then used normally, and the result of the method itself is returned directly. Logically, the threads representing the invoked methods have been statically scheduled immediately. Since they cannot block, fairness (Section ) is not a problem.

May-block: Lazy Context Allocation

In the may-block case, the calling schema assumes that the method will complete immediately and, if it blocks, that it will create its context lazily. Linkage is provided by having the caller create a future and continuation for the result value and place the continuation in the newly created context.

callee_context = may_block_method(&return_val, ...);
if (callee_context != NULL) {                /* fallback code */
  context = create_context();
  callee_context->continuation = make_continuation(context, ...);
  save_state_to_heap(context);
  return context;                            /* propagate blocking */
}
Figure May-block Calling Schema

Figure shows an example of the may-block calling schema. This schema distinguishes two outcomes for the callee: successful completion and blocking. If the callee runs to completion, a NULL value is returned. If the callee blocks, it allocates a heap context, stores its state, and returns a pointer to that context. Since the C return value is used to indicate completion, the result of the method is returned through a pointer passed in as an additional argument. (Attempting to use long long for this purpose in GCC resulted in unnecessarily inefficient code.) A native code implementation would likely use an additional register for the result instead.

In the example, on successful completion the caller extracts the actual return value from return_val. If the method blocks, the callee context is returned. This context is used to set up the linkage between caller and callee: the caller creates a continuation for the future value of the result and places that continuation in the callee context. This process can cascade, as the sketch below illustrates: the caller may, if necessary, create its own context, revert to its parallel method version, and return its own context to its caller.
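The following is a minimal sketch of the caller's side of this cascade; the helper names (create_context, make_continuation, save_state_to_heap) and the context layout are assumptions for illustration, and RESULT_SLOT stands for the result's location in the caller's context.

/* Sketch: caller-side handling for the may-block schema. */
typedef double Val;
typedef struct Context { struct Continuation *continuation; } Context;
extern Context *may_block_method(Val *return_val);
extern Context *create_context(void);
extern struct Continuation *make_continuation(Context *ctx, int slot);
extern void save_state_to_heap(Context *ctx);
extern void use(Val v);
enum { RESULT_SLOT = 0 };              /* result's slot in our context */

Context *caller(void) {
  Val result;
  Context *callee_ctx = may_block_method(&result);
  if (callee_ctx == NULL) {            /* callee completed on the stack */
    use(result);
    return NULL;
  }
  /* Callee blocked: create our own heap context, store a continuation
   * for the result in the callee's context, and propagate our context
   * to our own caller, reverting to the parallel version. */
  Context *my_ctx = create_context();
  callee_ctx->continuation = make_continuation(my_ctx, RESULT_SLOT);
  save_state_to_heap(my_ctx);
  return my_ctx;
}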

[Figure: left, panels (i)-(iii), the callee B completes on the stack without blocking; right, panels (i)-(iii), the callee blocks, and linkage must be created between caller and callee: (1) create the callee context (B) and return; (2) create the caller context (A); (3) store the continuation in the callee context.]
Figure May-block Schema Graphic

Figure shows an example of the calling schema for the may-block case in action. The figure on the left shows successful completion: the local state for thread B is stored in a stack frame, and since the thread completes before A is rescheduled, the stack frame can be deallocated in normal LIFO order. The figure on the right shows the stack unwinding when the call cannot be completed: B was allocated on the stack but then blocked. Since the stack must be managed in LIFO order, the frame of the blocked thread must be flushed to the heap. In this example, A also flushes itself to the heap and writes a continuation into the heap context of B. If A did not require the result of B (if the result was to be ignored or forwarded back to A's caller), A could fill in the continuation and continue executing off the stack.

Thus, a sequence of may-block method invocations can run to completion on the stack, or unwind off the stack and complete their execution in the heap. The fallback code creates the callee's context, saves local state into it, and propagates the fall back by returning this context to its caller, which then sets up the linkage.

Continuation Passing: Lazy Continuation Creation

Explicit continuation passing (Section ) can improve the composability of concurrent programs. However, creating continuations and using them to return results is expensive. Moreover, continuation passing requires more information to make the linkage. Normally, if an invocation is being executed on the stack, the callee's continuation is implicit. In the may-block case, the callee returns a pointer to its context, into which the caller writes the continuation. In the continuation-passing case, the callee may require the continuation immediately so that it can pass it on. Furthermore, since one of our goals is to execute forwarded invocations on the stack, lazy allocation of the continuation is essential.

The continuation-passing schema (see Figure ) uses an additional parameter, caller_info, which along with the return_val_ptr encodes the information necessary to determine what to do should the continuation be needed. The caller_info field is not used if the callee does not need to manipulate the continuation directly (e.g., store it or pass it off-node). In the case of local forwarding, the caller_info information is simply passed along. If a method needs the continuation, the information is used to create it. The caller_info indicates whether the context containing the continuation's future has already been created, the context's size if it has not, the location of the return value within the context, and whether the continuation was forwarded. Table describes the caller_info information in detail.

As with the may-block schema, the result of the method invocation is passed back in one of two ways. If the continuation is not needed (i.e., the continuation is not explicitly manipulated), the result is passed back using the return_val_ptr: the method simply writes the result through return_val_ptr and passes NULL return values back to its caller. The caller of the first continuation-passing method (the root of the forwarding chain) receives this NULL value and looks in return_val for the result. Thus, local continuation passing is executed completely on the stack.

If, on the other hand, the continuation is required, caller_info is consulted. There are four cases, which are handled by the fallback code. First, if the continuation was initially forwarded, the context must already exist, as must the continuation, which is always stored

Context *root_method(return_val_ptr, ...) {
  caller_context = cont_passing_intermed(return_val_ptr, ...,
                                         make_caller_info(root_func));
  if (caller_context != NULL) {
    save_state_to_heap(caller_context);
    return caller_context;               /* propagate blocking */
  }
  ...
}

Context *cont_passing_intermed(return_val_ptr, ..., caller_info) {
  caller_context = cont_passing_method(return_val_ptr, ..., caller_info);
  ...
  return caller_context;
}

Context *cont_passing_method(return_val_ptr, ..., caller_info) {
  if (can_return_value) {
    *return_val_ptr = value;
    return NULL;
  } else {                               /* need continuation */
    caller_context = create_context_from_caller_info(return_val_ptr,
                                                     caller_info);
    my_context = create_context();
    my_context->continuation = make_continuation(caller_context,
                                                 caller_info);
    save_state_to_heap(my_context);
    return caller_context;
  }
}
Figure Continuation Passing Calling Schema

at a fixed location in heap contexts. The continuation is extracted by subtracting the return location offset in caller_info from the return_val_ptr, adding on the fixed location offset, and dereferencing. Second, if the context already exists but the continuation does not, the continuation is created for a new future at return_val_ptr. The future needs a reference to the context in order to restart the thread; it computes this reference using the return_val_ptr and the return location offset in caller_info. Finally, if the context does not exist, it is created based on the size information from caller_info, and the continuation is created for a future at the return value offset. The callee now has the continuation, with which it may do what it needs. Figure contains pseudo code describing these cases.

context_flag             Indicates whether or not the heap context to which the
                         result of this method invocation should go has already
                         been created.
forward_flag             Indicates whether or not the continuation was forwarded
                         from the initial context pointed to by return_val_ptr;
                         context_flag is always true when forward_flag is true.
future_flag              Indicates that the future has already been created.
                         This makes the continuation creation operation
                         idempotent, increasing placement flexibility.
count_flag               Indicates that the continuation is a counting
                         continuation. This allows many continuations to be
                         forwarded on the stack, then on-node, from a parallel
                         loop.
null_flag                Indicates that the continuation is null and simply
                         consumes the result.
return location offset   The offset within the heap context where the future for
                         the method invocation result should be created. Along
                         with the return_val_ptr, it can be used to calculate
                         the pointer to the context when context_flag is true.
method                   A descriptor which is used to create the context to
                         which the result will be sent.
Table Continuation Passing Caller Information
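One plausible encoding of this information is sketched below; the bit-field packing, field widths, and type names are assumptions for illustration, not the layout actually used by the Concert runtime. The macro also shows the pointer arithmetic described above for recovering the context from the return value pointer.

/* Sketch of a caller_info encoding; fields follow the table above. */
typedef struct Context Context;
typedef struct MethodDescriptor MethodDescriptor;

typedef struct {
  unsigned context_flag : 1;  /* destination heap context already exists */
  unsigned forward_flag : 1;  /* continuation forwarded (implies context_flag) */
  unsigned future_flag  : 1;  /* future already created: creation is idempotent */
  unsigned count_flag   : 1;  /* counting continuation (parallel loops) */
  unsigned null_flag    : 1;  /* continuation simply consumes the result */
  unsigned return_location_offset : 16;  /* offset of the result future */
  MethodDescriptor *method;   /* creates the context if it does not exist */
} CallerInfo;

/* Recover the context pointer from the return value pointer when
 * context_flag is true. */
#define EXTRACT_CONTEXT(rvp, ci) \
  ((Context *)((char *)(rvp) - (ci).return_location_offset))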

When the callee completes, it indicates that it required the continuation by passing the context of the continuation's future back to its caller. Note that in the case of forwarding, this is not the caller's context. When a forwarding invocation returns a non-NULL value, the caller may continue to execute, but must ultimately return the context pointer to its caller regardless of whether or not it completes. This is because the context may be that of its caller, which will then fall back to its parallel version.

Figure shows a graphical example of the continuation passing schema. The thread A (root_method from Figure ) is the root of a continuation forwarding chain, in which we can imagine that cont_passing_intermed is an intermediate function through which the continuation is forwarded to thread B (cont_passing_method). On the left (normal completion), the caller_info and return_val_ptr are passed as a pair from A through to B, where the return_val_ptr is used to return the result directly from B to A. On the right, B requires the continuation and must create the context for A in order to build it. When A's context pointer is eventually returned to A, it stores its state and reverts to its parallel version.

make_continuation_passing_continuation(return_val_ptr, caller_info) {
  if (caller_info.forward_flag)
    return extract_continuation(extract_context(return_val_ptr,
                                                caller_info));
  else {
    context = NULL;
    if (caller_info.context_flag)
      context = extract_context(return_val_ptr, caller_info);
    else
      context = create_context(caller_info.method);
    return make_continuation(caller_info.count_flag,
                             context,
                             caller_info.return_location_offset);
  }
}
Figure Pseudo Code for Continuation Creation

Wrapper Functions and Proxy Contexts

Calling the sequential version of a method from the runtime, or from a method using a different schema, can require impedance matching: interfacing the available information to the desired interface. For example, when a message arrives at a node, it contains a continuation for the result. If the appropriate stack-based schema is non-blocking, a wrapper function is used which calls the sequential version, obtains the result, and passes it to the calling method through the continuation. This example is illustrated in Figure . The result is checked to verify that a value was returned (which will not be the case in a purely reactive computation) and is then passed to the waiting future by way of the continuation.

The wrapper function takes either a vector of arguments or a communication buffer and invokes the stack-based version of a method with the appropriate calling convention. In this manner, a remote message can be processed entirely on the stack; if the continuation is forwarded, it may pass through several nodes and finally respond to the initial caller, all without allocating a heap context.

Figure illustrates the case for may-block methods. The result value is checked as in the non-blocking case and passed through the continuation if available. Furthermore, if the callee blocks, the continuation is placed in the callee's context. Note that both the result value and the callee context pointer are checked in any case, since the result may or may not be returned independent of whether or not the method blocks.
[Figure: left, panels (i)-(iii), the callee completes without blocking: caller_info is passed through the call chain and the return value is passed back. Right, panels (i)-(iv), the callee requires its own continuation: (1) create the caller's (A) context and return; (2) create the callee's (B) context; (3) create and store the callee's continuation (create linkage); (4) save the caller state to the created context; (5) the callee (eventually) returns the value to the caller.]
Figure Continuation Passing Schema Graphic

void nonblocking_msg_wrapper(Slot *buff) {   /* Non-blocking */
  result_val = nonblocking_method(buff);
  if (!EMPTY(result_val))
    reply(CONTINUATION(buff), result_val);
}
Figure Non-blocking Wrapper

Finally, for continuation passing (Figure ), a proxy context and a caller_info parameter are created which indicate that the context exists and that the continuation was forwarded. Thus, if the continuation is required, it is extracted from the proxy context (see Section ). This proxy context technique is also used when an arbitrary continuation, perhaps one stored in a data structure, is passed by one method to another which requires a return_val and caller_info pair. This can occur with user-defined synchronization structures like barriers.

Evaluation

This section evaluates the effectiveness of hybrid parallel/sequential execution. First, the performance of hybrid execution is compared to the performance of straight sequential execution for the same programs written in C. This comparison is made using a variety of synchronization structures. Then, parallel performance is examined by measuring the improvement over straight heap-based execution and by demonstrating the ability of the execution model to adapt

void mayblock_msg_wrapper(Slot *buff) {      /* May-block */
  Slot result_val = EMPTY_SLOT;
  Context *callee_context = mayblock_method(&result_val, buff);
  if (!EMPTY(result_val))
    reply(CONTINUATION(buff), result_val);
  if (callee_context != NULL)
    callee_context->continuation = CONTINUATION(buff);
}
Figure May-block Wrapper

void cont_passing_msg_wrapper(Slot *buff) {  /* Continuation Passing */
  Proxy proxy_context;
  proxy_context.result_val = EMPTY;
  proxy_context.continuation = CONTINUATION(buff);
  CallerInfo caller_info = PROXY_CALLER_INFO;
  Context *caller_context =
    cont_passing_method(&proxy_context.result_val, buff, caller_info);
  if (!EMPTY(proxy_context.result_val))
    reply(CONTINUATION(buff), proxy_context.result_val);
}
Figure Continuation Passing Wrapper

to different data locality characteristics for both regular and irregular parallel programs. The sequential experiments were conducted on SPARC workstations, and the parallel experiments on the CM-5 and the T3D (Section ).

Sequential Performance

Sequential performance is evaluated for a set of invocation-intensive benchmark programs. Table presents the results. The sequential execution times (in seconds) using the hybrid mechanisms are compared with times for a parallel-only version and for the original C versions of the programs. The hybrid versions are evaluated under varying degrees of flexibility: the one-interface version uses only the Continuation-passing interface, while the three-interface version uses all three interfaces. Finally, Seq-opt is a version which eliminates parallelization overheads.

Using the most flexible hybrid version (three interfaces), all programs run significantly faster than the heap-only versions and achieve close to the performance of a comparable C

[Table (numeric entries not reproduced): sequential execution times in seconds for fib, tak, qsort, nqueens, and list traversal, comparing the Parallel version, the hybrid Parallel/Sequential versions (one, two, and three interfaces, and Seq-opt), and the original C program. a: without the forwarding optimization supported by the Continuation-passing interface.]
Table Sequential Performance

program, despite the concurrent semantics. Several programs required all three interfaces to achieve comparable performance. The remaining overhead is due to parallelization; Seq-opt shows the performance when this overhead is eliminated.

The parallel and hybrid versions can be run directly on parallel machines. They include parallelization overhead in the form of name translation, locality, and concurrency checks (Chapter ). Since speculative inlining (Section ) is required to obtain good performance from a fine-grained model, it was used as one of the optimizations contributing to these results. Because speculative inlining lowers the overall invocation frequency, it decreases the impact of our hybrid execution model; however, this effect is mitigated by inlining only one level of recursion.

The three hybrid versions provide benefits in different cases. Each level of flexibility enables additional operations to be executed on the stack, at some increase in cost. Allowing all three stack interface versions (i.e., the non-blocking, may-block, and continuation-passing call schemas) improves performance compared to always using only the most general continuation-passing interface. The cases where the performance using two interfaces is worse than that using only one interface are an anomaly arising from an improper alignment of invocation arguments, causing them to be spilled to the stack instead of being passed in registers. In summary, the hybrid mechanism enables C-like performance when data is locally accessible.

The relative performance of fib and tak is a result of the comparatively aggressive inlining of our compiler.

Parallel Performance

This section considers three application kernels which characterize the parallel performance of the hybrid mechanisms as compared to straight heap-based execution. The performance is characterized with respect to the effect of data locality. First, a regular code, Successive Over-Relaxation (SOR), is considered, and the performance improvement resulting from the use of the hybrid mechanisms is shown to increase with the amount of data locality, reaching close to the theoretical maximum for the given locality. Then two irregular codes, MDForce and EM3D, are considered, and the ability of the hybrid mechanisms to adapt to available data locality in the presence of irregular computation and communication structures is demonstrated.

Regular Parallel Code: SOR

Successive Over-Relaxation (SOR) is an indirect method for numerically solving partial differential equations on a grid. The algorithm evaluates the new value of a grid point according to a point stencil and consists of two half-iterations, as sketched below: in the first half-iteration, a new value is computed for each grid point, and in the second half-iteration, the grid point is updated with the computed value. To characterize the impact of data locality, the grid size is fixed, and various block sizes are considered for a block-cyclic distribution of the grid over the grid of processors. These different data layouts result in different ratios of local to remote method invocations and correspond to differing amounts of data locality.
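As a concrete illustration of the two half-iterations, the sketch below computes one SOR step over a two-dimensional grid; the 5-point stencil, the grid size N, and the relaxation factor omega are assumptions for illustration, not parameters taken from the experiments.

/* Sketch of one SOR iteration as two half-iterations. */
enum { N = 128 };                        /* illustrative grid size */

void sor_iteration(double g[N][N], double tmp[N][N], double omega) {
  /* First half-iteration: compute a new value for each grid point. */
  for (int i = 1; i < N - 1; i++)
    for (int j = 1; j < N - 1; j++)
      tmp[i][j] = (1.0 - omega) * g[i][j] +
                  omega * 0.25 * (g[i-1][j] + g[i+1][j] +
                                  g[i][j-1] + g[i][j+1]);
  /* Second half-iteration: update the grid with the computed values. */
  for (int i = 1; i < N - 1; i++)
    for (int j = 1; j < N - 1; j++)
      g[i][j] = tmp[i][j];
}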

[Table (numeric entries not reproduced): for each of five block sizes, the ratio of local to remote invocations and the Parallel and Hybrid execution times in seconds, with the Parallel/Hybrid ratio, on the CM-5 and the T3D.]
Table Execution Results: SOR

Table shows the performance of the hybrid mechanisms on the CM-5 and the T3D for five choices of the block size; both machines used the same node configuration. The parallel execution times are for SOR on a fixed grid over a fixed number of iterations. The performance of the hybrid mechanisms is compared with a parallel-only version for varying amounts of data locality, indicated by Block Size for a block-cyclic distribution of the grid; Local vs. Remote gives the ratio of local to remote method invocations for the given block size.

The results in Table show that the improvement from hybrid execution is proportional to data locality: the speedup of the hybrid mechanisms over the parallel version increases with the fraction of local invocations. These numbers are in the neighborhood of the theoretical peak values, which are determined by the relative costs of useful work, invocation overhead, and remote communication. For example, factoring out the useful work in the SOR block layout on the CM-5, the maximum possible speedup is determined by the average cost of a remote invocation relative to a local heap invocation. The measured value approaches this maximum, indicating that hybrid execution efficiently adapts to available locality.

Irregular Parallel Code: MDForce

MDForce is the kernel of the non-bonded force computation phase of a molecular dynamics simulation of proteins. The computation iterates over a set of atom pairs that are within a spatial cutoff radius. Each iteration updates the force fields of neighboring atoms using their current coordinates, resulting in irregular data access patterns because of the spatial nature of data sharing. The implementation reduces the communication demands of the kernel by caching the coordinates of remote atoms and combining force increments.

[Table (numeric entries not reproduced): for the Random and Block data layouts, the ratio of local to remote invocations and the Parallel and Hybrid execution times in seconds, with the Parallel/Hybrid ratio, on the CM-5 and the T3D.]
Table Execution Results: MDForce

Table shows the performance of the MDForce kernel: parallel execution times on matching node configurations of the CM-5 and the T3D. The performance of the hybrid mechanisms is compared with a parallel-only version for a low-locality (random) and a high-locality data layout. The random layout uniformly distributes atoms over the nodes, ignoring the spatial distribution of atoms; the spatial layout uses orthogonal recursive bisection to group together spatially proximate atoms.

Because of poor locality, the execution time for the random distribution is dominated by communication overhead. Since communication costs remain unchanged by the choice of invocation mechanism, the speedup in this case is small. For the spatially blocked distribution, the hybrid mechanisms enable the computation to adapt dynamically to data locality, yielding speedups on both the CM-5 and the T3D.

As with SOR, when run-time checks determine that both atoms of an atom pair are local and the computation is small, it is speculatively inlined. When an atom is found to be remote but its coordinates are in the cache, the computation completes on the stack without incurring parallel invocation overhead. When communication is required, the stack invocation falls back to the parallel version to enable multithreading for latency tolerance. Furthermore, hybrid execution provides a performance improvement for communication, because the invoked method can execute directly from the message handler. Thus, hybrid execution adapts to the data layout and synchronization structures of the program.

Irregular Parallel Code: EM3D

EM3D is an application kernel which models the propagation of electromagnetic waves. The data structure is a graph containing nodes for the electric field and for the magnetic field, with edges between nodes of different types. A simple linear function is computed at each node based on the values carried along the edges. Three versions of the EM3D code are used to evaluate the ability of the hybrid model to adapt to different communication and synchronization structures; since they are intended to examine invocation mechanisms, elaborate communication blocking mechanisms are not used. The first version, pull, reads values directly from remote nodes. The second version, push, writes values to the computing node, updating from the remote nodes each timestep. Finally, in the forward version, the updates are done by forwarding a single message through the nodes requiring the update. The three styles are sketched below.
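The following sketch contrasts the three update styles; the Node type and the remote_read, remote_write, and forward_message primitives are assumptions for illustration rather than the kernel's actual interfaces.

/* Sketch of the three EM3D update styles over a bipartite graph. */
typedef struct Node {
  double value;
  int degree;
  double *weight;
  struct Node **neighbor;
} Node;
extern double remote_read(Node *n);
extern void remote_write(Node *n, double v);
extern void forward_message(double v, Node **nodes, int n);

void update_pull(Node *n) {        /* pull: read remote values directly */
  double v = 0.0;
  for (int i = 0; i < n->degree; i++)
    v += n->weight[i] * remote_read(n->neighbor[i]);
  n->value = v;
}

void update_push(Node *n) {        /* push: write our value to the
                                      nodes that read it */
  for (int i = 0; i < n->degree; i++)
    remote_write(n->neighbor[i], n->value);
}

void update_forward(Node *n) {     /* forward: one message visits all
                                      nodes requiring the update */
  forward_message(n->value, n->neighbor, n->degree);
}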

Table describes the performance of the three versions of EM3D on the CM-5 and the T3D. The parallel execution times are reported for a graph of fixed size and node degree over a fixed number of iterations. The performance of the three versions of the algorithm using the hybrid mechanisms is

[Table (numeric entries not reproduced): for the EM3D pull, push, and forward versions, the ratio of local to remote invocations and the Parallel and Hybrid execution times in seconds, with the Parallel/Hybrid ratio, on the CM-5 and the T3D.]
Table Execution Results: EM3D

compared with parallel-only versions, both for random node placement (with low locality) and for a placement with high locality.

The results show that the hybrid scheme is capable of improving performance for different communication and synchronization structures, in both the high and low data locality cases. For low locality, efficiency is increased because on-node requests are handled directly from the message buffer without requiring the allocation of a heap context. When locality is high, hybrid execution executes the local portions of the computation entirely on the stack. Hybrid execution yields speedups ranging from unity to nearly four times, achieving superior performance in all but one case, where the continuation-passing schema is used with extremely low locality on the CM-5; in addition to the cost of fallback, this combination produces the worst case for our scheduler on the CM-5.

Overall, the pull version provides the best absolute performance, since it computes directly from the values it retrieves rather than using intermediate storage. The forward version requires longer update messages than push, but fewer replies. On the CM-5, replies are inexpensive (a single packet), so the cost of forward's longer messages overwhelms the cost of the larger number of replies required by push. However, on the T3D, the decrease in overall message count enables forward to perform better than push for low locality. Also, the C compiler on the CM-5 performs better on the unstructured output of our compiler than the T3D compiler does; as a result, the cost of the additional operations required by push and forward has less of an impact than messaging overhead on the T3D. Thus, for high locality, the hybrid mechanism is most beneficial for pull on the CM-5 and for forward on the T3D, where local computation and messaging dominate, respectively.

Related Work

Several languages supporting explicit futures on shared-memory machines have focused on restricting concurrency for efficiency. However, unlike our programming model, where almost all values define futures, futures occur less frequently in these systems, decreasing the importance of their optimization. More recently, lazy task creation, leapfrogging, and other schemes have addressed the load balancing problems resulting from serialization by stealing work from previously deferred stack frames. However, none of these approaches deals with locality constraints arising from data placements and local address spaces on distributed memory machines.

Several recent thread management systems have been targeted to distributed memory machines. Two of them, Olden and Stacklets, use the same mechanism for work generation and communication; furthermore, they require specialized calling conventions, limiting their portability. StackThreads, used by ABCL/f, has a portable implementation. However, this system also uses a single calling convention and allocates futures separately from the context, so an additional memory reference is required to touch futures. Also, its single version of each method cannot be fully optimized for both parallel and sequential execution. The HAL system provides specialized calling conventions for different synchronization idioms. However, it is not adaptive and does not provide separate sequential and parallel versions, concentrating instead on the efficiency of individual operations.

Summary

Irregular and dynamic programs require an execution model which can adapt to the run-time computation and communication structure. Adaptation is also critical for supporting run-time techniques such as data and function shipping. Hybrid execution uses separately optimized sequential and parallel versions of various levels of flexibility and efficiency to adapt at run time to data layout and computational structure. Sequential efficiency can be achieved for COOP codes by dynamically aggregating the many small threads into larger threads executed in LIFO fashion using a conventional stack. Parallel efficiency is achieved by generating parallelism and hiding latency, creating threads using heap-based contexts. Hybrid execution uses specialized parallel and sequential versions of each method, and four distinct invocation schemas of varying complexity and cost, to execute efficiently. For call-intensive kernels, it is demonstrated that hybrid execution is very nearly as efficient as straight sequential C code, and the different calling schemas contribute to this efficiency. Hybrid execution approaches optimal efficiency for regular problems and effectively adapts to irregular problems with complex communication and synchronization structures. It is demonstrated on the CM-5 and the T3D that hybrid execution provides performance benefits proportional to the amount of data locality: both sequential efficiency when data is available, and parallel efficiency for work generation and latency hiding.

Acknowledgments

The SOR and MDForce source code, numbers, and results analysis in this chapter are the work of Vijay Karamcheti and Xingbin Zhang, respectively. They appear here to illustrate the effectiveness of hybrid execution on a wider range of programs.

Chapter

Conclusions

OOP and COOP can be made efficient through interprocedural analysis and transformation.
Thesis

The proof of this thesis has been a long process, spread over many years and involving many parts, the utility of which is not necessarily obvious in a vacuum. The goal of the Concert project is ambitious: to start with a highly abstract programming model and build a programming system to rival, in absolute efficiency, state-of-the-art systems for conventional languages with decades of high-performance tradition. My part of that goal was sequential efficiency, that is, computational efficiency discounting load balance and the availability of data, two enormous areas of research in themselves. This part alone required developing new analysis and transformation techniques, which proved useful for tackling the other problems as well. Furthermore, because we chose absolute efficiency as our goal, many small optimizations for common boundary cases were required, as well as a host of standard transformations available in any optimizing compiler.

Building a complete system was extremely satisfying in the sense that I am confident we have explored a good part of the problem space and developed an understanding far beyond the sterile and often deceptive lines of theory. But it has been disturbing as well. Much of what we have learned is that there is no panacea, no quick fix; a workable solution requires many pieces which depend strongly on the capabilities of the others. Very simple programming systems of immense power and efficiency can be constructed, but no shortcuts are possible, and the solution is not incremental. It requires rethinking the interaction between language and implementation, and reformulating the compilation process. This translation of viewpoint does not fit easily with either the academic or the commercial mindset. It does not divulge its secrets to simplified problems or yield to incremental solutions. The work presented here only begins to describe what will ultimately be required to fully realize simple, powerful parallel programming systems. It will be years, and probably decades, before such systems become common.

Thesis Summary

Object-oriented programming (OOP) is the byword of software engineering, promising to increase productivity through abstraction and software reuse. Concurrent object-oriented programming (COOP) applies those tools to parallelism and distribution, with applications from supercomputers to web browsers. These technologies produce programs with very different structures than standard procedural codes, but with the same high demands for efficiency. However, localized compilation techniques are poorly suited to the dynamic nature of OOP and COOP codes: the abstractions which free programmers from implementation details hide information from the compiler, resulting in conservative implementations and poor performance. Through interprocedural analysis and transformation, specialized implementations of these abstractions can be constructed so that OOP and COOP can be as efficient as conventional procedural programming. I have developed a compilation framework using novel optimization techniques for context-sensitive analysis and specialization of abstractions. This framework maps flexible, dynamic programs to efficient, static implementations. For standard benchmarks, including the Stanford Integer, Richards, and DeltaBlue benchmarks and the Livermore Loops, these techniques produce implementations as efficient as those written in C, and more efficient than those written in C++ and compiled with standard compilers.

The framework starts with a new interprocedural, context-sensitive flow analysis which breaks through abstraction barriers. Classes and functions are then cloned for the contexts in which they are used; an iterative cloning algorithm rebuilds the cloned call graph using a modified dispatch mechanism. Using the information made static by cloning, classes and functions are specialized: member (instance) and local variables are unboxed, virtual functions (methods) are statically bound, and inlining is performed speculatively, based on the class of objects or function pointers for OOP and on location or lock availability for COOP. Other optimizations include promotion of member and global variables to locals, lifting and merging of access regions, removal of redundant lock and locality check operations, and removal of redundant array operations. Finally, the program is mapped to a new hybrid stack- and heap-based execution model suitable for distributed-memory multicomputers.

The general contribution of this thesis is an optimization framework for object-oriented and fine-grained concurrent languages. The individual contributions are: a new iterative flow analysis for object-oriented programs which is both practical and more precise than previous analyses; a new cloning algorithm which extends the state of the art in cloning to general flow problems and object-oriented programs; a set of novel optimizations for removing object-oriented and fine-grained concurrency overhead; and a new hybrid stack-heap execution model which provides two separately optimized versions of code, enabling it to adapt to the location of data.

Final Thoughts

While the work described in this thesis forms a cohesive whole, it is by no means the final, closed solution. There is a great deal left to be done, foremost perhaps in the area of interprocedural analysis and transformation. Adaptive analysis techniques with time bounds for particular program structures need to be developed. These must be able to handle the problem of structure analysis: the determination of the structure of imperatively manipulated, pointer-based data structures directly from the program text. The structure analysis problem subsumes alias analysis, an outstanding problem of great complexity and importance.

Transformation must go farther, to break the boundaries between compilation, calculation, communication, and data. Translation between these different modes of computation should be in the province of the programming system. For example, data can be migrated, cached, recomputed, and encoded in control flow. Communication can similarly be transformed: mapped into control flow, cached, or replaced with one of the possible results indicated by nondeterminism. Calculations can be replaced by partially computed, inferred, and speculated values. Finally, run-time compilation can blur the boundary between static and dynamic data and code.

Nondeterminism needs to be recognized as an essential and desirable property of algorithmic description: a tool for describing a range of acceptable behavior. In contrast, deterministic solutions are overspecified and have no physical analogue in the real world, where simultaneity is a reality. Nondeterminism is an enormously powerful transformational tool, enabling the implementation to select a path to the answer based on efficiency. Such tradeoffs are organic and natural for the vast majority of computer users. As computing emerges from the cold and brittle world of mechanical logics into the warmth of the art and information age, the nature of computation will change.

On a similar note, consistency models, which have heretofore been relegated to the mechanics of hardware cache consistency, need to be expanded to the level of the programming model. We live in a world where performance is limited by the ability of the system to deliver and maintain the consistency of data, where distribution and simultaneity have obliterated the notions of absolute time and absolute location, and where facts are relative. Internal consistency for a given system, encapsulation of assumptions, predicated answers, and answers as of a given time are all common in the natural systems with which the computing fabric is ever more intertwined. The science of computation must embrace these concepts, because they are the future of computing.

With the recent explosion of interest in network computing, the world wide web, and Java, it is clear that at its core computing is parallel and distributed. It is also clear that the old fork/join and semaphore model has become hopelessly dated. The rise of portable executable formats and of just-in-time and whole-program compilation has upset the traditional view of compilers. This area, which had stagnated under endless discussions of yet another parsing algorithm and an obsession with speeding translation through clever coding schemes, has been given new life. Unfortunately, the current climate of compulsive paper chase discourages revolutionary developments, and deep thought in general. For this we suffer, as the same old tired ideas, cast in new terms and superficially evaluated, clog the literature, numbing minds and wearying spirits. In this age where time-to-market is king, industry, even with all its shortsightedness, has surpassed academia in innovation and progressiveness. The ivory towers, like the castles of provincial lords at the close of feudal times, are threatened by progress and the rising power of the merchant class. It is my hope that academia can rise to the challenge and, using the tower's height and relative peace for far sight and contemplation, refine, perfect, and lay the groundwork for the next revolution. I hope this thesis makes a contribution to that end.

And what is writ is writ,
Would it were worthier. (Lord Byron)

Appendix A

Annotations

In order to provide the compiler and runtime with information about the placement and accessibility of objects, the Concert system supports, among others, locality and locking annotations. These annotations are not meant to change the meaning of the program, but rather to enable the programmer to supply information which might not be easily accessible to the compiler and/or runtime, and to enable the compiler writer to test the usefulness of particular pieces of information.

a = new(LOCAL) A;
b = new(LOCAL) B;
c = new C;
a.x = b;
d = flag ? a : b;
e = flag ? a : c;
if (cond) (local nolock e).foo();
Figure A: Annotations Example

Figure A shows a block of code which uses annotations and allocation directives. Here, a and b are both allocated local to the current object (self). As a result, both a and b are relatively local to each other and to self. Conservatively, we can then infer that d will be relatively local to self as well. We may not infer that e is relatively local to self, since c may be allocated on any node. However, we can infer that a.x will be relatively local to a, so long as this is the only assignment to x. Finally, in the last line, the conditional cond is evaluated, and the compiler is directed that e is local and that no lock is required to access it, a situation implied by cond.

b = local a;
if (...)
  d = local c;
else
  e = local c;
Figure A: Annotations Propagation Example

Annotation propagation is a dataflow problem. Annotations are propagated along the local dataflow arcs using a conservative merge: when a reference to a local object merges with a reference to a remote object, the resulting reference is to a remote object. Since the annotations specify a property of a value, they flow backward as well: if an annotation has been placed on every value into which another value may flow, the original value must have that property as well. For example, in Figure A, a and b are subject to the same conditions of execution; therefore, b must be local in the block containing a. Likewise, d and e are annotated to be local; since one or the other branch must be executed, it must be true that c is local in the surrounding block. Care must be taken when using data or function migration: for example, if the unspecified condition in Figure A moved the object c so that it became local, c need not be local before the conditional. In such cases, annotations should not be propagated across operations which can change the properties being annotated. A sketch of the forward direction of this propagation appears below.
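The sketch treats locality as a two-point lattice and iterates the conservative merge to a fixed point; the Node type and its fields are assumptions for illustration, not the Concert compiler's data structures.

/* Sketch: forward annotation propagation as iterative dataflow. */
typedef enum { LOCAL_REF, REMOTE_REF } Locality;

typedef struct Node {
  Locality locality;
  int npreds;
  struct Node **preds;       /* dataflow predecessors */
} Node;

/* Conservative merge: local meets remote yields remote. */
static Locality meet(Locality x, Locality y) {
  return (x == LOCAL_REF && y == LOCAL_REF) ? LOCAL_REF : REMOTE_REF;
}

void propagate(Node *nodes, int n) {
  int changed = 1;
  while (changed) {          /* iterate to a fixed point */
    changed = 0;
    for (int i = 0; i < n; i++)
      for (int j = 0; j < nodes[i].npreds; j++) {
        Locality m = meet(nodes[i].locality,
                          nodes[i].preds[j]->locality);
        if (m != nodes[i].locality) {
          nodes[i].locality = m;
          changed = 1;
        }
      }
  }
}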

Appendix B

Raw Flow Analysis Data

The results of analysis for the three algorithms on a variety of complete, kernel, and synthetic programs appear in Table B. IFA refers to our incremental inference algorithm, OPS refers to the inference algorithm of an earlier system, and CFA refers to a standard flow-insensitive algorithm which allocates exactly one type variable per static program variable.

The number of Passes is determined by the algorithms automatically, when each determines that no run-time type errors are possible. Nodes is the number of flow graph nodes used by the algorithm. Invokes is the number of invocations (abstract calls) analyzed. Contours is the number of contours; in CFA this corresponds to the number of methods. A program can be Typed by an algorithm if it can prove an absence of run-time type errors; Checks is the number of type checks which must be made to ensure such an absence. The number of imprecisions (Im) indicates the number of nodes which were not resolved to a singleton value. The implementation is approximately lines of largely unoptimized Common Lisp/CLOS, and Time (in seconds) is reported for CMU Common Lisp/PCL on a Sparc.

[Table columns: Program, Lines, Passes, Nodes, Invokes, Contours, Typed, Checks, Im, Time; the numeric values were not reproduced. The surviving Typed results are:]

Program     IFA    OPS    CFA
ion         YES    NO     NO
network     YES    YES    NO
circuit     YES    NO     NO
pic         YES    NO     NO
mandel      YES    YES    NO
tsp         YES    NO     NO
mmult       YES    NO     NO
poly        YES    YES    NO
test        YES    NO     NO

Table B: Results of Iterative Flow Analysis

Appendix C

Raw OOP Data

[Table (values not reproduced): execution times for the Stanford Integer Benchmarks (bubble, intmm, perm, puzzle, queens, sieve, towers, trees) under the CA baseline, CA no array-alias, CA no standard, and CA no instance-variable configurations.]
Table C: Raw Data for Stanford Integer Benchmarks

[Table (values not reproduced): execution times for the Stanford Integer Benchmarks (bubble, intmm, perm, puzzle, queens, sieve, towers, trees) under the CA no inlining, CA no cloning, and CA no analysis configurations, and for C with inlining and C with -O.]
Table C: Raw Data for Stanford Integer Benchmarks (cont.)

[Table (values not reproduced): execution times for the Stanford OOP Benchmarks (Projection, Chain, Richards) under the CA base, CA no array-alias, CA no standard, and CA no instance-variable configurations.]
Table C: Raw Data for Stanford OOP Benchmarks

[Table (values not reproduced): execution times for the Stanford OOP Benchmarks (Projection, Chain, Richards) under the CA no inlining, CA no cloning, and CA no analysis configurations, and for C with inlining and C with -O.]
Table C: Raw Data for Stanford OOP Benchmarks (cont.)

[Table (values not reproduced): execution times for the Stanford OOP Benchmarks (Projection, Chain, Richards) for C++ with no inlining, C++ with all functions virtual, and C++ with no inlining and all functions virtual.]
Table C: Raw Data for Stanford OOP Benchmarks (cont.)

Appendix D

CA Standard Prologue

Standard Prologue for Concurrent Aggregates

Builtin Classes must be declared early as they are used to define

constants

ROOTCLASS everything inherits from this

class rootclass

method rootclass primitive handled internally

method rootclass new noexclusion

method rootclass newlocal noexclusion

method rootclass eq a noexclusion

reply primitive rootequal self a

method rootclass neq a noexclusion

reply primitive rootnotequal self a

method rootclass assertclass assertedclass

reply primitive assertclass self assertedclass

method rootclass checktype v

seq primitive checktype v self reply done

method rootclass checknull v

seq primitive checknull v self reply done

function rootclass break noexclusion

let con global console

if eq self con

reply primitive debuggerbreak self

reply primitive debuggerbreak con

function rootclass numproc noexclusion

reply primitive numproc

function rootclass procid noexclusion

reply primitive procid

function rootclass iflocal obj noexclusion

reply primitive checkiflocal obj

CONTINUATIONCLASS

class continuationclass rootclass

method continuationclass reply value handled internally

ROOTAGG all aggregates inherit from this

aggregate rootagg rootclass group groupsize myindex physize

parameters sz

initial sz

unoptimized

method rootagg sibling i noexclusion unoptimized

if boundscheck i groupsize

reply primitive aggregatetorepresentative self i

SIBLINGBOUNDSERROR global console self groupsize i

optimized

method rootagg sibling i noexclusion optimized

reply primitive aggregatetorepresentative self i

method rootagg physicalsibling i noexclusion

reply primitive aggtophysicalrep self i

method rootagg physicalgroupsize noexclusion

reply primitive physicalgroupsize self

only works on local objects

method rootagg iflogical noexclusion

reply primitive checkiflogical self

NULLCLASS

class nullclass rootclass

method nullclass eq a noexclusion

reply primitive nullequal self a

method nullclass neq a noexclusion

reply primitive nullnotequal self a

unoptimized

method nullclass rest unoptimized

NILMESSAGE global console

optimized

method nullclass rest optimized

OSYSTEM

class osystem rootclass

GLOBALCLASS

class globalclass rootclass state

method globalclass global

reply state self

method globalclass setglobal val

seq setstate self val

reply state self

STRINGCONSTANT

class stringconstant rootclass

SELECTOR

class selector stringconstant

INTEGER

class integer rootclass

method integer i noexclusion

reply primitive integeradd self i

method integer i noexclusion

reply primitive integersubtract self i

method integer i noexclusion

reply primitive integermultiply self i

method integer i noexclusion

reply primitive integerdivide self i

method integer i noexclusion

reply primitive integergreaterthan self i

method integer i noexclusion

reply primitive integerlessthan self i

method integer i noexclusion

reply primitive integergreaterthanorequalto self i

method integer i noexclusion

reply primitive integerlessthanorequalto self i

method integer i noexclusion

reply primitive integerequalto self i

method integer i noexclusion

reply primitive integernotequalto self i

method integer mod i noexclusion

reply primitive integermodulo self i

method integer lshift i noexclusion

reply primitive integerleftshift self i

method integer rshift i noexclusion

reply primitive integerrightshift self i

method integer and i noexclusion

reply primitive integerbitwiseand self i

method integer or i noexclusion

reply primitive integerbitwiseor self i

method integer not noexclusion

reply primitive integernot self

method integer intfloat noexclusion

reply primitive integertofloat self

method integer min i noexclusion

if self i reply i

reply self

method integer max i noexclusion

if self i reply self

reply i

method integer ash i noexclusion

if i reply lshift self i

reply rshift self i

method integer noexclusion

reply self

method integer boundscheck size noexclusion

reply and self self size

FLOAT

class float rootclass

method float f noexclusion

reply primitive floatadd self f

method float f noexclusion

reply primitive floatsubtract self f

method float f noexclusion

reply primitive floatmultiply self f

method float f noexclusion

reply primitive floatdivide self f

method float f noexclusion

reply primitive floatgreaterthan self f

method float f noexclusion

reply primitive floatlessthan self f

method float f noexclusion

reply primitive floatequalto self f

method float f noexclusion

reply primitive floatnotequalto self f

method float sin noexclusion

reply primitive floatsin self

method float cos noexclusion

reply primitive floatcos self

method float tan noexclusion

reply primitive floattan self

method float sqrt noexclusion

reply primitive floatsquareroot self

method float exp noexclusion

reply primitive floatexp self

method float pow f noexclusion

reply primitive floatpow self f

method float log noexclusion

reply primitive floatlog self

method float asin noexclusion

reply primitive floatarcsin self

method float acos noexclusion

reply primitive floatarccos self

method float atan noexclusion

reply primitive floatarctan self

method float atan f noexclusion

reply primitive floatarctan self f

method float ceil noexclusion

reply primitive floatceil self

method float floor noexclusion

reply primitive floatfloor self

method float floatint noexclusion

reply primitive floattointeger self

method float min i noexclusion

if self i reply i

reply self

method float max i noexclusion

if self i reply self

reply i

ARRAY

class array rootclass size

unoptimized versions

method array at index unoptimized

if boundscheck index size self

reply primitive arrayat self index

ARRAYBOUNDSERROR global console

method array putat val index unoptimized

if boundscheck index size self

reply primitive arrayputat self val index

ARRAYBOUNDSERROR global console

method array putat val index unoptimized

if boundscheck index size self

reply primitive arrayputat self val index

ARRAYBOUNDSERROR global console

optimized versions with no bounds check to match FORTRAN and C

method array at index optimized

reply primitive arrayat self index

method array putat val index optimized

reply primitive arrayputat self val index

method array putat val index optimized

reply primitive arrayputat self val index

included for backward compatibility

method array atput val index

reply putat self val index

MESSAGECLASS

class messageclass array selector receiver continuation noreaderwriter

method messageclass msgat pos noexclusion

reply primitive messageat self pos

method messageclass msgputat val pos noexclusion

reply primitive messageputat self val pos

method messageclass msgatput val pos noexclusion

reply msgputat self val pos

method messageclass send noexclusion

forward primitive messagesend self

method messageclass sendto to noexclusion

forward primitive messagesendto self to

method messageclass getrequester noexclusion

reply primitive messagerequester self

method messageclass setrequester to noexclusion

reply primitive messagesetrequester self to

method messageclass getreceiver noexclusion

reply primitive messagereceiver self

method messageclass setreceiver to noexclusion

reply primitive messagesetreceiver self to

method messageclass getselector noexclusion

reply primitive messageselector self

method messageclass setselector to noexclusion

reply primitive messagesetselector self to

RAWARRAY

class rawarray array

STRING

class string rawarray

parameters isize

initial initial super isize

STREAMCLASS

class streamclass rawarray id status

parameters name

initial setid self primitive openfile name self

method streamclass fileopen name noexclusion

reply new streamclass name bytes NULL

method streamclass id noexclusion

reply id

method streamclass close noexclusion

forward primitive closefile self

method streamclass readint noexclusion

forward primitive readinteger self

method streamclass readfloat noexclusion

forward primitive readfloat self

method streamclass writeint i noexclusion

forward primitive writeinteger self i

method streamclass writefloat f noexclusion

forward primitive writefloat self f

method streamclass writestring s noexclusion

forward primitive writestring self s

method streamclass writeobject o noexclusion

forward primitive writeobject self o

method streamclass endoffile noexclusion

forward primitive endoffile self

CONSOLECLASS

class consoleclass streamclass

parameters name

initial initialstreamclass super name

method consoleclass id noexclusion

reply id

method consoleclass rest noexclusion

forward primitive writemessage self msg

INSTRCOUNTERAGG and auxiliary classes

function rootclass increment instrcounterobj

reply

primitive incrementinstrcounter

storageobj sibling instrcounterobj procid

function rootclass incrementby instrcounterobj value

reply

primitive incrementinstrcounter

storageobj sibling instrcounterobj procid value

function rootclass readcountinstrcounterobj

reply read instrcounterobj

function rootclass readlocalcountinstrcounterobj nodeid

reply readlocal sibling instrcounterobj nodeid

aggregate instrcounteragg summaryobj storageobj

parameters isize summary storagesize

initial isize

forall i from below groupsize

init sibling self i summary storagesize

handler instrcounteragg create

let summaryobj create instrsummaryclass

aggobj new instrcounteragg numproc summaryobj

reply aggobj

handler instrcounteragg init summary size

seq

setsummaryobj self summary

setstorageobj self newlocal array size

primitive initinstrcounter storageobj self

reply DONE

handler instrcounteragg reset noexclusion

seq

forall i from below groupsize

resetmyself sibling self i

reset summaryobj self

reply DONE

handler instrcounteragg resetmyself noexclusion

seq

primitive resetinstrcounter storageobj self

reply DONE

handler instrcounteragg read noexclusion

let count

seq

forall i from below groupsize

set count count readlocal sibling self i

reply count

handler instrcounteragg readlocal noexclusion

reply primitive readinstrcounter storageobj self

handler instrcounteragg min noexclusion

seq

syncsummary self

reply min summaryobj self

handler instrcounteragg max noexclusion

seq

syncsummary self

reply max summaryobj self

handler instrcounteragg mean noexclusion

seq

syncsummary self

reply mean summaryobj self

handler instrcounteragg stdev noexclusion

seq

syncsummary self

reply stdev summaryobj self

handler instrcounteragg summarize stream title noexclusion

let totalcount

seq

writestring stream title

writestring stream tNodetCountn

forall i from below groupsize

let count readlocalcount self i

seq

writestring stream t

writeint stream i

writestring stream t

writeint stream count

writestring stream n

set totalcount totalcount count

inc summaryobj self intfloat count

writestring stream tTotalt

writeint stream totalcount

writestring stream ntMint

writefloat stream min summaryobj self

writestring stream tMaxt

writefloat stream max summaryobj self

writestring stream ntMeant

writefloat stream mean summaryobj self

writestring stream tStdevt

writefloat stream stdev summaryobj self

writestring stream n

reply DONE

handler instrcounteragg syncsummary noexclusion

seq

reset summaryobj self

forall i from below groupsize

inc summaryobj self intfloat readlocal sibling self i

reply DONE

INSTRSUMMARYCLASS

class instrsummaryclass array

initial primitive initinstrsummary self

method instrsummaryclass create

reply new instrsummaryclass

method instrsummaryclass inc value

reply primitive incinstrsummary self value

method instrsummaryclass min

reply primitive instrsummarymin self

method instrsummaryclass max

reply primitive instrsummarymax self

method instrsummaryclass mean

reply primitive instrsummarymean self

method instrsummaryclass stdev

reply primitive instrsummarystdev self

method instrsummaryclass reset

reply primitive resetinstrsummary self

INSTRSTOPWATCHCLASS

class instrstopwatchclass rootclass starttime elapsedtime

initial setstarttime self setelapsedtime self

method instrstopwatchclass create

reply new instrstopwatchclass

method instrstopwatchclass start

setstarttime self primitive now

reply elapsedtime self

method instrstopwatchclass stop

seq

setelapsedtime self

elapsedtime self primitive now starttime self

reply elapsedtime self

method instrstopwatchclass read

reply elapsedtime self

method instrstopwatchclass reset

let oldelapsed elapsedtime self

setstarttime self

setelapsedtime self

reply oldelapsed

TRACE-CLASS

(class trace-class ())

(method trace-class (rest msg)
  (reply (primitive write-event msg)))

SYSTEM CONSTANTS

(constant true)
(constant false)

PREDEFINED GLOBALS

(global nil (new null-class))
(global console (new console-class))   ; console-stream: same as stream-class
(global trace-monitor (new trace-class))

Vita

Date of Birth: December            Citizenship: USA
W Phone:
H Phone:
Email: jplevyak@cs.uiuc.edu
WWW: http://www-csag.cs.uiuc.edu/individual/jplevyak
Address: Department of Computer Science
         University of Illinois at Urbana-Champaign
         West Springfield Avenue
         Urbana, Illinois

Research Interests: Programming languages, compilers, and computer
architecture; particularly object-oriented and fine-grained concurrent
programming languages, interprocedural flow analysis and optimization, and
parallel and distributed computers.

Academic History

Ph.D., Computer Science, University of Illinois at Urbana-Champaign
B.A., Computer Science, University of California at Berkeley

Professional History

Research Assistant, Concurrent Systems Architecture Group (-present)
Research Assistant, National Center for Supercomputing Applications
Programmer, Ampex Corporation
Programmer, Ilex Corporation

Awards: C.W. Gear Outstanding Graduate Student Award

Professional Societies

Member, Association for Computing Machinery (ACM)
Member, Institute of Electrical and Electronics Engineers (IEEE)

Professional Activities

Referee (partial list):
Object-Oriented Programming Languages and Systems (OOPSLA)
Static Analysis Symposium (SAS)
Journal of Parallel and Distributed Computing (JPDC)
International Parallel Processing Symposium (IPPS)
International Symposium on Object Technologies for Advanced Software (ISOTAS)

System/Software Releases

Illinois Concert System, versions from CSAG

    The Concert System is a complete development environment for fine-grained
    concurrent object-oriented programs. I am the architect and principal
    author of the Concert compiler, which embodies the research described in
    my thesis. It is available from http://www-csag.cs.uiuc.edu.

DTM, versions from NCSA

    The Data Transfer Mechanism (DTM) is a message-passing facility designed
    to facilitate the creation of sophisticated distributed applications in a
    heterogeneous environment. It is available from http://www.ncsa.uiuc.edu.

ACE, versions from Ampex Corporation

    A commercial video edit controller, available as a product.

Additional Information

Classes: Computer Systems Organization (CS), Introduction to VLSI Design (CS),
Program Verification (CS), Fine-grained Systems Design and Programming (CS),
Actor Systems (CS), Object-Oriented Programming and Design (CS), Programming
Language Semantics (CS), Topics in Graph and Geometric Algorithms (CS),
Genetic Programming (GE), State in Programming Languages (CS), Topics in
Compiler Construction (CS)

Seminar Classes

Research Issues in Massively Parallel Machines (semesters)
Programming Languages Seminar (semester)
Actor Systems (semester)
Parallelizing Compilers and Program Optimization Seminar (semester)
Seminar on Languages and Compilers for High-Performance Computing (semester)

Lecture Series

Center for Supercomputing Research and Development Colloquia (periodic)
Computer Science Department Seminars and Lectures (periodic)
Department of Electrical and Computer Engineering Graduate Seminars (periodic)

Teaching/Lecturing

I have lectured in many of the seminar classes, every semester in group
meetings, and twice for my advisor's classes. I am comfortable preparing and
lecturing on my own and others' material.

Other Classes Attended

Art History (ARTHI, ARTHI): Baroque, Art Nouveau
English (ENGL, ENGL): Topics in Modern American Literature

Other Occasional Audits: Linguistics, History, Molecular Dynamics

Other Activities: Tennis, Wallyball