An Efficient and General Implementation of Futures on

Large Scale Shared-Memory Multiprocessors

A Dissertation

Presented to

The Faculty of the Graduate School of Arts and Sciences

Brandeis University

Department of Computer Science

James S. Miller, Advisor

In Partial Fulfillment

of the Requirements of the Degree of

Doctor of Philosophy

by

Marc Feeley

April 1993

This dissertation, directed and approved by the candidate's committee, has been accepted and approved by the Graduate Faculty of Brandeis University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Dean, Graduate School of Arts and Sciences

Dissertation Committee

Dr. James S. Miller, chair (Digital Equipment Corporation)

Prof. Harry Mairson

Prof. Timothy Hickey

Prof. David Waltz

Dr. Robert H. Halstead, Jr. (Digital Equipment Corporation)

Copyright 1993 by

Marc Feeley

Abstract

An Efficient and General Implementation of Futures on Large Scale

Shared-Memory Multiprocessors

A dissertation presented to the Faculty of the Graduate School of

Arts and Sciences of Brandeis University, Waltham, Massachusetts

by Marc Feeley

This thesis describes a high-performance implementation technique for Multilisp's "future" parallelism construct. This method addresses the non-uniform memory access (NUMA) problem inherent in large scale shared-memory multiprocessors. The technique is based on lazy task creation (LTC), a dynamic task partitioning mechanism that dramatically reduces the cost of task creation and consequently makes it possible to exploit fine grain parallelism. In LTC, idle processors get work to do by stealing tasks from other processors. A previously proposed implementation of LTC is the shared-memory (SM) protocol. The main disadvantage of the SM protocol is that it requires the stack to be cached suboptimally on cache-incoherent machines. This thesis proposes a new implementation technique for LTC that allows full caching of the stack: the message-passing (MP) protocol. Idle processors ask for work by sending "work request" messages to other processors. After receiving such a message, a processor checks its private stack and task queue, and sends back a task if one is available. The message-passing protocol has the added benefits of a lower task creation cost and simpler algorithms. Extensive experiments evaluate the performance of both protocols on large shared-memory multiprocessors (a BBN GP1000 and a BBN TC2000). The results show that the MP protocol is consistently better than the SM protocol. The difference in performance is as high as a factor of two when a cache is available, and smaller, though still in the MP protocol's favor, when a cache is not available. In addition, the thesis shows that the semantics of the Multilisp language does not have to be impoverished to attain good performance. The laziness of LTC can be exploited to support, at virtually no cost, several programming features, including the Katz-Weise continuation semantics with legitimacy, dynamic scoping, and fairness.

Acknowledgements

Cette thèse est dédiée à mes grand-parents, Rose et Émile Monna, pour l'amour que j'ai pour eux. (This thesis is dedicated to my grandparents, Rose and Émile Monna, for the love I have for them.)

I wish to thank my family, my friends and colleagues, without whom this thesis would not have been possible.

Special thanks go to Jim Miller, my thesis advisor, for giving me the freedom to explore my ideas at my own pace. He has gone beyond the call of duty to see me through with my degree.

Bert Halstead's words of encouragement gave me the confidence that my ideas were interesting and worth writing about. Thank you Bert.

Sabine Bergler deserves special thanks for taking care of me.

To Chris, Mauricio, Harry, Emmanuel, Don, Shyam, Larry, Xiru, Mary and Paulo: thank you for making my stay at Brandeis so enjoyable.

Finally, I wish to thank the Natural Sciences and Engineering Research Council of Canada and the Université de Montréal for financial support, and Michigan State University, Argonne National Laboratory, Lawrence Livermore National Laboratory and the MIT AI Laboratory for the use of their computers.

Contents

Introduction
    Motivation
    Why Multilisp
    Fundamental Issues
    Architecture
        Shared-Memory MIMD Computers
        Non-Uniform Memory Access
        Sharing Data
        Caches
        Memory Consistency
        The GP1000 and TC2000 Computers
    Memory Management
    Dynamic Partitioning
        Eager Task Creation
        Lazy Task Creation
    Overview

Background
    Scheme's Legacy
        First-Class Continuations
        Continuation Passing Style
        Programming with Continuations
    Multilisp's Model of Parallelism
        FUTURE and TOUCH
        Placeholders
        Spawning Trees
    Types of Parallelism
        Pipeline Parallelism
        Fork-Join Parallelism
        Divide and Conquer Parallelism
    Implementing Eager Task Creation
        The Work Queue
        FUTURE and TOUCH
        Scheme Encoding
        Chasing vs. No Chasing
        Critical Sections
        Centralized vs. Distributed Work Queue
    Fairness of Scheduling
    Dynamic Scoping
    Continuation Semantics
        Original Semantics
        MultiScheme Semantics
        Katz-Weise Continuations
        Katz-Weise Continuations with Legitimacy
        Implementing Legitimacy
        Speculation Barriers
        The Cost of Supporting Legitimacy
    Benchmark Programs
        abisort
        allpairs
        fib
        mm
        mst
        poly
        qsort
        queens
        rantree
        scan
        sum
        tridiag
    The Performance of ETC

Lazy Task Creation
    Overview of LTC Scheduling
        Task Stealing Behavior
        Task Suspension Behavior
    Continuations for Futures
        Procedure Calling Convention
        Unlimited Extent Continuations
        Continuation Heapification
        Parsing Continuations
        Implementing First-Class Continuations
    The LTC Mechanism
        The Lazy Task Queue
        Pushing and Popping Lazy Tasks
        Stealing Lazy Tasks
        The Dynamic Environment Queue
        The Problem of Overflow
        The Heavyweight Task Queue
        Supporting Weaker Continuation Semantics
        Synchronizing Access to the Task Stack
    The Shared-Memory Protocol
        Avoiding Hardware Locks
        Cost of a Future on GP1000
        Impact of Memory Hierarchy on Performance
    The Message-Passing Protocol
        Really Lazy Task Creation
        Communicating Steal Requests
        Potential Problems with the MP Protocol
        Code Generated for SM and MP Protocols
    Summary

Polling Efficiently
    The Problem of Procedure Calls
    Code Structure
    Call-Return Polling
    Short Lived Procedures
    Balanced Polling
        Subproblem Calls
        Reduction Calls
    Minimal Polling
    Handling Join Points
    Polling in Gambit
    Results
    Summary

Experiments
    Experimental Setting
    Overhead of Exposing Parallelism
        Overhead on GP1000
        Overhead on TC2000
    Speedup Characteristics
        Speedup on GP1000
        Speedup on TC2000
    Effect of Interrupt Latency
    Cost of Supporting Legitimacy
    Summary

Conclusion
    Future Work

A Source Code for Parallel Benchmarks
    abisort
    allpairs
    fib
    mm
    mst
    poly
    qsort
    queens
    rantree
    scan
    sum
    tridiag

B Execution Profiles for Parallel Benchmarks
    abisort
    allpairs
    fib
    mm
    mst
    poly
    qsort
    queens
    rantree
    scan
    sum
    tridiag

List of Tables

Costs of memory hierarchy for the GP1000 and the TC2000
Characteristics of parallel benchmark programs running on GP1000
Size of closure for each future in the benchmark programs
Cost of operations involved in task stealing
Measurements of memory access behavior of benchmark programs
Overhead of polling methods on GP1000
Performance of SM protocol on GP1000
Performance of MP protocol on GP1000
Performance of SM protocol on TC2000
Performance of MP protocol on TC2000
Performance of MP protocol on GP1000 with I = …
Performance of MP protocol on GP1000 with I = …
Overhead of supporting legitimacy with and without speculation barrier on GP1000

List of Figures

The shared-memory MIMD computer used in this thesis
Non-local exit using call/cc
Parallel map definition and spawning trees
Parallel vector map
Scheme encoding of Multilisp core
Procedures needed to support Multilisp core
Exception system based on dynamic scoping and call/cc
Implementation of dynamic scoping with tail recursive call/cc
MultiScheme's implementation of the future special form
A sample use of futures and call/cc
A future body's continuation called multiple times
Exception processing with futures
The Katz-Weise implementation of futures
An application of speculation barriers
Fork-join algorithms and their legitimacy chain in the absence of chain collapsing
General case of legitimacy chain collapsing for fork-join algorithms
Fib and a poor variant obtained by unrolling the recursion
The task stack
Continuation representation and operations
Underflow and heapification algorithms
Resuming a heavyweight task
The LTQ and the steal operation
The task stealing mechanism
The implementation of dyn-bind
The DEQ and its use in recovering a stolen task's dynamic environment
Code sequence for a future under the SM protocol
Thief side of the SM protocol
Victim side of the SM protocol
Relative importance of stack and heap accesses of benchmark programs
Thief side of the MP protocol
Victim side of the MP protocol
Assembly code generated for fib
The foreach procedure and its corresponding code graph
Two instances of short lived procedures
The maximal delta method
Procedure return invariants in balanced polling
Compilation rules for balanced polling
Minimal polling for the recursive procedure sum and a tail recursive variant
Speedup curves for fib, queens, rantree and mm on GP1000
Speedup curves for scan, sum, tridiag and allpairs on GP1000
Speedup curves for abisort, mst, qsort and poly on GP1000
Speedup curves for fib, queens, rantree and mm on TC2000
Speedup curves for scan, sum, tridiag and allpairs on TC2000
Speedup curves for abisort, mst, qsort and poly on TC2000
Task creation behavior of MP protocol on GP1000
Task suspension behavior of MP protocol on GP1000

Chapter 1

Introduction

This work is about the design of an efficient implementation strategy for Multilisp's "future" parallelism construct on large shared-memory multiprocessors. A strategy known as lazy task creation is used as a starting point for this work. Two implementations of lazy task creation, one based on a shared-memory paradigm and the other based on a message-passing paradigm, are explained and compared by extensive experiments with a large number of benchmarks. The result can be summarized as follows:

An implementation of lazy task creation based on a message-passing paradigm is superior to one based on a shared-memory paradigm because it is

1. simpler to implement,

2. more flexible, and

3. more efficient in nearly all situations, because it allows full caching of the stack; on machines that lack coherent caches the difference in performance is as much as a factor of two on the TC2000 multiprocessor.

In addition, this work shows how to efficiently implement two important language features in the presence of futures: dynamic scoping and first-class continuations. An efficient polling method designed to support message passing is also described and evaluated.

This thesis provides a detailed account of this result.


Motivation

As applications become bigger and more demanding, it is hard to resist the seductive qualities associated with parallel processing. All too often, however, application writers are disillusioned when they discover that their carefully rewritten application running on a parallel computer is barely faster, if not slower, than it was when running on a cheaper uniprocessor machine.

Poor performance can be caused by a combination of factors. The degree of parallelism in the algorithms is one of the most important factors because it puts a strict upper bound on the performance achievable by the program. Some algorithms have a limited amount of parallelism, and thus it is not possible to increase performance beyond a certain size of machine. Moreover, even algorithms that scale up well with the size of the machine (i.e., yield a speedup roughly equal to the number of processors) may still have poor absolute performance if the parallel algorithm's hidden constant is large when compared to a sequential algorithm.

Another factor is the technological lag that the hardware of parallel machines often suffers. This is due to the smaller market and longer design times of parallel machines when compared to mainstream uniprocessor machines. This lag can be expected to decrease as parallel systems become more common.

The importance of these two factors can be minimized to some extent by careful algorithm design and coding, and the use of state of the art hardware. However, there still remains another hurdle to overcome: the inherent inefficiency of the language implementation. Clearly, the language features needed to support parallelism must be implemented well to exploit the concurrency available in the application. It is just as important, however, for the sequential constructs to be efficient, since they account for a high proportion of a program's code. There is little incentive to use a parallel machine with n processors if the implementation runs sequential programs on one processor n times slower than when a non-parallel language is used. This explains the lack of popularity of interpreter based implementations of Multilisp, which run purely sequential code much slower than compiler based implementations of Lisp. Interestingly, the language implementations with poor absolute performance usually have excellent relative performance (i.e., self-relative speedup). This is because the aspects of the system that are critical to performance, such as memory latency and task spawning costs, are masked by the huge overhead of interpretation (usually many times slower than compiled code).


Absolute performance is a major concern in this thesis. For this reason, the Multilisp implementation techniques proposed here are evaluated in the context of a production quality implementation. To perform experiments, a highly efficient Scheme compiler called Gambit [Feeley and Miller] is used as a platform into which the implementation techniques are integrated and tested. This is to ensure that the setting is realistic and that performance-critical issues are not overlooked. Typically, the code generated by Gambit for sequential programs is only modestly slower, and sometimes faster, than code generated by optimizing C compilers for equivalent C programs. Multilisp is a sufficiently general programming language to be considered as a substitute for conventional languages for many sequential programming tasks. The results of this thesis will make it even more attractive to choose Multilisp over other languages, since it also allows efficient parallel programming.

Why Multilisp

Supercomputers have traditionally been employed for scientific purposes, so it isn't surprising that numerical applications have been the focus of most of the parallel processing research. However, the need for high performance is no longer bound exclusively to scientific applications, as time-consuming symbolic applications become more widespread. These include applications such as expert systems, databases, simulation, typesetting, compilation, CAD systems and user interfaces.

The growing need for high-performance parallel symbolic processing systems is the initial motivation for this work. Multilisp suggests itself naturally since it is a member of the Lisp family of symbolic processing languages. It was designed by Halstead [Halstead] as an extension of Scheme with a few additional constructs to deal with parallelism. The most important of these is the future special form, whose origin can be traced back to [Baker and Hewitt].

From its inception, the purpose of Multilisp has been to provide a testbed for experimentation in the design and implementation of parallel symbolic processing systems. Through the years it has evolved along several distinct paths to accommodate novel uses of the language. The first implementation of Multilisp was Concert Multilisp, which ran on a custom designed multiprocessor [Halstead; Halstead et al.]. Multilisp's model of parallel computation has become increasingly popular, and some of its features have now been adopted by other parallel Lisp systems. This includes both academic research systems, such as QLisp [Gabriel and McCarthy; Goldman and Gabriel], MultiScheme [Miller], Mul-T [Kranz et al.], Gambit [Feeley and Miller], PaiLisp [Ito and Matsui], Spur Lisp [Zorn et al.], Butterfly Portable Standard Lisp [Swanson et al.] and Concurrent Scheme [Kessler and Swanson; Kessler et al.], as well as commercially available systems such as BBN Lisp [Steinberg et al.], Allegro Common Lisp [Franz] and Top Level Common Lisp [Murray]. The future construct is actually quite general, and it has been used in more conventional languages such as C [Callahan and Smith].

Fundamental Issues

Assuming that speed of computation is the main objective, the job of a Multilisp implementor can be seen as an optimization problem constrained by three factors:

1. The semantics of the language.

2. The characteristics of the target machine.

3. The expected use of the system (i.e., applications).

Each instance of these factors defines a particular implementation context. It is the task of the designer to devise the most efficient implementation strategies that correctly realize the given language semantics on the target machine. It is also important to consider the target applications, because it is through these that the features of the system that are most critical for high performance can be identified. They also form the ultimate measure of success of an implementation as a whole.

To explore the entire spectrum of implementation contexts for Multilisp would be a daunting task, well beyond the scope of this work. Rather, contexts that are most likely to be useful in the present or the near future are examined. Emphasis is put on language features, multiprocessor architectures and programming styles that have acquired some popularity. The semantics of Multilisp and applications are discussed in greater depth in Chapter 2.

Architecture

Inherent limitations of the target machine are inevitable facts of life for the implementor of any language. To adequately address the issue of performance, it is crucial to determine the salient features and weaknesses of the target architecture. This is especially true for parallel machines, because of the vast disparity in parallel architectures.

Shared-Memory MIMD Computers

The multiple instruction stream, multiple data stream (MIMD) shared-memory multiprocessor computer is used as the target architecture for this work. This choice is fueled by, on the one hand, the popularity and availability of these machines and, on the other, the similarity with the programming model adopted by Multilisp.

There are two major architectural requirements imposed by Multilisp. The first is the possibility for processors to act independently from one another. This is needed because Multilisp expresses parallelism through control parallelism; that is, it is possible to express concurrency between heterogeneous computations. Separate instruction streams operating on separate data are thus needed to execute these computations in parallel. The second requirement is the existence of a shared memory. In Multilisp, as in most other Lisps, all objects exist in a single address space that is visible to all parts of the program. There are no a priori restrictions on which procedures or tasks can access a given object.

The shared-memory architecture has been severely criticized by some. The most important objection is that the cost of accessing the shared memory must grow with the size of the machine. Thus large machines will suffer from high latencies for references to shared memory.

This fact is duly acknowledged, but must be put in perspective. Programs which offer a limited amount of parallelism only need to be run on machines whose size matches that parallelism. Secondly, the existence of a shared memory does not imply that the programs make an important use of it. Message-passing paradigms can easily and efficiently be implemented on top of a shared memory (for example, see [LeBlanc and Markatos]). However, implementing shared memory on conventional message-passing machines is impractical, because shared-memory operations are usually fine grained whereas message-passing operations are typically optimized to manipulate large chunks of data. Programs with irregular and dynamically changing communication patterns have a legitimate need for shared memory. These programs are often found in symbolic processing applications, which need to traverse linked data structures such as lists, trees and graphs. Implementing these programs on a message-passing machine would be prohibitively expensive. Finally, it is expected that scalable caching techniques will hide the high latencies of large shared memory to some extent. Caching issues are explored later in this chapter.

[Figure 1.1: The shared-memory MIMD computer used in this thesis. Each processing node contains a processor, a cache, a private memory and a shared memory; the nodes are linked by an interconnection network.]

Non-Uniform Memory Access

The model of the shared-memory MIMD architecture used in this thesis is shown in Figure 1.1. A machine is composed of a number of processing nodes, each of which has a processor and three forms of memory: cache memory, private memory and shared memory. Each processor has direct access to its own private and shared memory (i.e., local memory) and, through the use of the interconnection network, has access to the shared memory of other processors (i.e., remote memory). The shared memory is physically distributed across the machine, while private memory is only visible to its associated processor.

This is a non-uniform memory access (NUMA) architecture because the cost of memory references is not constant. The cost depends on the type of memory being referenced and its distance from the processor. A reference to the cache is thus cheaper than a reference to local memory, which in turn is cheaper than a reference to remote memory. The NUMA model is interesting because it reflects realistic properties of the architecture, as explained next.


Sharing Data

An important characteristic of data is the extent to which it must be shared. The following classification will be used for the different types of data:

- Private data is data that does not need to be communicated to other processors. A simple example of private data is temporary values which are produced and used by the same program section.

- Single writer shared data is accessible to more than one processor, but it is only mutated by a distinguished processor: the owner of the data.

- Multiple writer shared data is accessible to more than one processor and can be mutated by any of these processors.

These types of data have different storage requirements. Private data is the least restrictive (it could reside in the same storage as shared data) and multiple writer shared data is the most restrictive. These differences are a source of optimization for the architecture, which can implement each type in a different way and at a different cost. Thus, computers are often designed with various forms of private storage. Since a processor has exclusive access to this storage, it can be implemented efficiently, because there is no need for an arbitration mechanism or multiple data paths. The processor's registers are an extreme instance of private storage. Shared data is more expensive because it must be stored in a location that is accessible to all processors. Single writer and multiple writer shared data are distinguished because they offer different caching possibilities.

Caches

Caches are a well known mechanism to enhance the performance of memory. A property shared by almost all programs is that memory references are unevenly distributed: a large proportion of all references are to a small proportion of the data. This observation has led to the design of multilevel memory systems. The idea is to place frequently accessed data in a fast memory (a cache) in order to reduce the average time needed for a reference. If the cache is large enough and the application's reference pattern is well behaved, then the cache will service most of the references. A memory hierarchy can have several levels of caches, but only a single one will be considered here.


Caches are quickly becoming a necessity to fully harness the power of modern processors. Current RISC processors have a cycle time that is much smaller than that of the fastest memory chips. Processors with cycle times of a few nanoseconds will soon be available, but it is unlikely that the speed of large RAM chips will ever be close to that of the processor (for example, DRAM chips currently have cycle times of several tens of nanoseconds at best). Cache memories are much faster than main memory because, due to their small size, they can be put on the same chip as the processor (or at least close to it), and it is permissible to use faster circuitry even if it is more expensive. The speed difference between these two types of memories varies from system to system, but it is not uncommon for cache memory to be many times faster than main memory. Clearly, it is a good idea to design a system so that it maximizes cache usage. The benefits of caching on a range of programs are explored further in Chapter 3.

An important feature of caches is that they operate automatically. The programmer does not have to explicitly state where a particular piece of data should go. The accesses to memory are monitored, and a copy of the frequently accessed data is kept in the cache. The first reference to a piece of data that is not in the cache (i.e., a cache miss) actually references the memory, but subsequent references are potentially much faster because a copy has been put in the cache. When space is needed in the cache, older pieces of data are selectively purged from the cache according to a particular replacement policy (e.g., random or least-recently used (LRU)).

The performance of a cache depends on h, the probability of a cache hit (also called the hit rate), and on L_cache and L_main, the latency of an access to the cache and to main memory respectively. The average access latency L_mem is given by

    L_mem = h · L_cache + (1 − h) · L_main

Clearly, a high hit rate is advantageous, since a value near one makes it appear as though the memory can respond at the speed of the cache. There are many ways to improve the hit rate. The size of the cache can be increased; given the high cost of cache memory, this may be a cost effective solution only up to a certain point. Another technique is to reorganize the program so that references to a particular datum are closer in time. The probability of a datum being resident in the cache is higher if it has been referenced recently, and even more so if LRU replacement is used. Finally, it is sometimes preferable to disable the caching of data whose referencing pattern is such that it does not gain much by caching. Caching such data is detrimental, because it causes the frequently used data to be purged from the cache, thus decreasing the hit rate.


Two caching strategies have been popular in uniprocessor computers: copy-back and write-through caching. These strategies differ in how writes to memory are handled.

Copy-back caching handles a write by only modifying the copy in the cache. The memory will eventually receive the correct value when the datum is purged from the cache after a cache miss (this is called a write-back). The expense of writes is thus attributed to cache misses. If there are very few cache misses, writes to memory are essentially the same cost as reads.

Write-through caching bypasses the cache and performs the write to main memory. However, the state of the cache is modified to reflect the new content of memory. If the address being written to is resident in the cache, it is simply updated. Otherwise the datum is added to the cache, most probably causing an entry to be purged. (The datum could also be disregarded, i.e., not entered in the cache. This might be preferable for applications which rarely read the locations recently written to, such as when initializing or updating a large data structure.) In addition to h, L_cache and L_main, the performance of write-through caching depends on the read ratio r, the proportion of all memory references which are reads. The average access latency for write-through caching is thus

    L_mem = r (h · L_cache + (1 − h) · L_main) + (1 − r) · L_main
          = r h · L_cache + (1 − r h) · L_main

Note that here h is the hit rate for reads only. The two caching methods have the same performance when r = 1, but write-through caching quickly degrades as the number of writes increases.
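To make the two formulas concrete, here is a small Scheme sketch (the latency and hit rate numbers are purely illustrative) that computes the average access latency under each policy:

    ;; Copy-back: L_mem = h*L_cache + (1-h)*L_main
    (define (copy-back-latency h l-cache l-main)
      (+ (* h l-cache) (* (- 1 h) l-main)))

    ;; Write-through: L_mem = r*h*L_cache + (1 - r*h)*L_main
    ;; where r is the proportion of reads and h the read hit rate.
    (define (write-through-latency r h l-cache l-main)
      (+ (* r h l-cache) (* (- 1 (* r h)) l-main)))

    ;; With a 1 cycle cache, a 20 cycle main memory, a 95% hit rate
    ;; and 80% reads:
    (copy-back-latency .95 1 20)        ; => 1.95
    (write-through-latency .8 .95 1 20) ; => 5.56

Even with a high hit rate, the write-through policy pays the full main memory latency on every write, which is why copy-back caching is the preferred policy for data that does not need to be shared.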

Memory Consistency

The notion of a single monolithic shared memory is a convenient abstraction to write and reason about programs. However, caching, if not done properly, may violate this abstraction, because memory consistency between processors is not preserved. For private data there is no consistency problem caused by caching, since all references go through the cache. For single writer shared data it is possible to maintain consistency by using write-through caching. The processor owning the data uses write-through caching, and the readers disable the caching of the data. Consistency is preserved because the memory always has the correct value for the datum and the readers always access the memory when they reference the datum (of course, this means that only the owner of the data benefits from the cache). Unfortunately, write-through caching by itself is not sufficiently powerful to maintain consistency for multiple writer shared data. The problem is that the perception of the memory state can be different from processor to processor if each one has cached the same datum in its own cache and mutated it in a different way. For example, under copy-back and write-through caching, when two processors A and B read variable x, a copy of x will exist in A's cache and another in B's. If A then mutates x, B still believes that x has the original value.

There are two approaches to the memory consistency problem. The first is to put the responsibility of consistency on the programmer or compiler by providing a less rigid consistency model. At appropriate points in the program, special operations must be added to flush or invalidate some of the entries in the caches. In the terminology of [Gharachorloo et al.], the strictest consistency model is sequential consistency. In this model, memory behaves as though only one access is serviced at a time (i.e., accesses are sequential); thus any read request returns the last value written. In processor consistency, writes can be delayed an arbitrary but finite amount of time, as long as the writes from any given processor are performed in the same order as they were issued by that processor (there are no ordering restrictions between processors). This model can be implemented more efficiently than sequential consistency because it allows some form of pipelining and caching of the writes. Machines implementing processor consistency usually have a write barrier instruction which waits until the memory has processed all of that processor's writes. The weak consistency and release consistency models [Dubois and Scheurich] are still weaker and more efficient. They guarantee consistency only at synchronization points in the program. In other words, lock and unlock operations (or similar synchronization operations) are barriers which wait until the memory has processed all pending transactions. In these models, reads and writes can be buffered between synchronization operations.

An orthogonal approach to the consistency problem is to design specialized hardware that maintains consistency between the caches and memory. In the previous example, this would mean that when A mutates x, the new value for x is written to memory (as in write-through caching) and B's cache (and any other cache holding a copy of x) is notified to either invalidate or update the appropriate entry. This is relatively easy to perform on bus-based architectures, because all caches and memory are immediately aware of all transactions (they are directly connected to the shared bus). So called "snoopy caches" [Goodman] are based on this principle. Unfortunately, bus-based architectures do not scale well, because the bus has a limited bandwidth. Typically, bus-based machines are designed with just enough processors to match the bandwidth of the bus. For example, the bus in the Encore Multimax can support only a limited number of fairly low-power processors. (Interestingly, even though it uses snoopy caches, the Multimax only implements weak consistency.)

Maintaining consistency on scalable architectures is much harder. Currently, most scalable cache designs are based on directories [Censier and Feautrier]. With each datum is kept a list of the caches that are holding a copy of the datum and that must be notified of any mutation. If n processors are holding a datum in their cache, then a mutation by one processor will require at least n − 1 messages to be sent to notify the caches. The moment at which these notifications are sent depends on the consistency model being used. Scalable cache designs usually do not implement strict consistency, in order to exploit buffering and pipelining of writes. The main drawbacks of directory based methods are the added memory needed for the directory, and the added intercache traffic, which reduces the effective bandwidth of the interconnection network. Fortunately, it seems that in typical applications most of the shared data is shared by a very small number of processors [Lenoski et al.; O'Krafka and Newton]. Limited directory caching methods, such as [Chaiken et al.], take advantage of this fact to reduce the space for the directory by only allowing a small number of copies of a datum to exist at any given point in time.

However, there are certain forms of sharing that inevitably lead to poor cache performance. One such case is when two or more processors are very frequently writing to the same memory location, perhaps to implement some kind of fine-grain communication through shared memory. This causes thrashing in directory based methods, because a substantial amount of time is spent sending messages between the caches. This poor performance is not surprising, since caches are helpful only if there is locality of reference to exploit. If the goal is to exchange data as quickly as possible between the processors, caching is of little use, since network latency will be unavoidable.

The moral here is that specialized hardware for memory consistency is not the solution to all data sharing problems. Specialized hardware can only help if the program has well behaved data usage patterns. When designing algorithms, it is unreasonable to assume an efficient consistent shared memory simply because the machine supports it in hardware. The costs will vary according to how the data needs to be shared. As a general rule, algorithms should be designed to promote locality of reference and rely as little as possible on a strict consistency model and on multiple writer shared data.



The GP1000 and TC2000 Computers

Data sharing issues play a central role in this thesis. The multilevel memory system of the architectural model chosen here (i.e., Figure 1.1) reflects the importance of data sharing issues by making the costs of sharing explicit. In this model, caches do not automatically preserve consistency. It is only by segregating the various types of data, and using the appropriate caching policy, that consistency is maintained. It is assumed that the caches can operate in copy-back and write-through mode on selected areas of memory. Because private memory always contains private data, it is cached with the most efficient caching policy: copy-back caching. Single writer shared data is cached using write-through caching by the owner of the data and is not cached by the other processors. Finally, multiple writer shared data is not cached in any way.

This model is attractive because building such a machine is relatively inexpensive using current technology, yet it has a high potential performance. Each node in the architecture corresponds roughly to a modern uniprocessor computer. The only extra hardware needed to build a complete machine is that for the interconnect and its interface to the processing nodes. The TC2000 computer [BBN], manufactured by BBN Computers and introduced in 1989, matches this structure very closely. A scalable multistage butterfly network is used for the interconnection network. There is a single local memory per node that is partitioned into shared and private sections by system calls to the operating system. Other system calls allow the selection of the caching policy for each memory block allocated. The GP1000 computer [BBN], also by BBN, has a very similar architecture but uses older technology: the TC2000 uses M88100 processors, whereas the GP1000 uses the much slower M68020 processors. The GP1000 also suffers from a slower interconnection network (approximately half the bandwidth of the TC2000's) and the lack of a data cache (each processor does, however, have a small instruction cache). These two computers are used throughout the thesis to do measurements and to compare different implementation strategies. Because scalability is an important issue, large machines were used: a GP1000 at Michigan State University and a TC2000 at Argonne National Laboratory. To serve as a guide, the costs of the memory hierarchy for these computers are given in Table 1.1. The timings correspond to the latency for referencing a single word at each level of the hierarchy. (These costs were measured with benchmarks specially designed to test the memory. As reported in [BBN], the timing depends on many parameters, such as the caching policy in use, the type of access (read or write), the size of machine and the contention on the interconnection network. The timings in the table are the average time between reads and writes; caching was inhibited when measuring local and remote memory costs.)


[Table 1.1: Costs of memory hierarchy for the GP1000 and the TC2000. For each machine the table gives the latency in microseconds, and the relative latency, of cache, local and remote references; the numeric entries are not recoverable.]

Note that the cache on the TC2000 is faster than local memory by only a small factor; many systems currently have caches that perform much better than this. Also note that the latency of a butterfly network grows logarithmically with the number of processors. Machines with several hundred processors would thus have roughly the same relative costs for the memory hierarchy.

Memory Management

The design of a high-performance Multilisp system is a complex task where many, often conflicting, issues have to be addressed. Clearly an implementor must worry about how to best implement the parallelism constructs themselves, but it is important to realize that the support of parallelism has an impact on the sequential parts of the language as well. High-performance techniques used in uniprocessor implementations of Lisp cannot always be carried over to Multilisp as is, either because they become inefficient in a multiprocessor environment or, even worse, they do not work at all due to the presence of concurrency.

As should be clear from the previous section, one of the most important problems to tackle for a NUMA architecture is that of memory management. Lisp, and symbolic processing in general, relies heavily on the manipulation of data structures and on their dynamic creation. The costs of allocating, referencing and deallocating objects are thus major components of the overall performance of the system. For a language like Multilisp, where data is implicitly shared, memory management is tricky to implement efficiently because, in general, data must be accessible to all the processors and be mutable by all the processors.

In order to keep the reference costs low, a memory management policy for a NUMA architecture must strive to physically locate the shared data close to the processor that needs to access the data most frequently. For the TC2000 this means that data should reside in the cache or the local memory of the processor most frequently accessing the data. This is the proximity issue.

Another important goal is to arrange the data so that contention is minimized. Contention occurs when more than one processor is trying to access the same shared resource, such as a memory bank or a path in the interconnection network. The resource becomes a bottleneck to performance, because requests must be serviced sequentially. Contention can be inherent in the algorithm, when expressed explicitly as a critical section, but it can also appear insidiously because of some particularity of the language implementation or target machine. For example, a simple allocation strategy for vectors is to reserve the space for all elements in a given memory bank. In such a situation, the references to different elements of the vector are forced to be done sequentially, even if they are all logically concurrent. The same problem occurs when unrelated data values are referenced simultaneously and they happen to have been allocated in the same memory bank. Certain shared-memory machines, such as the BBN Monarch [Rettberg et al.] and IBM RP3 [Pfister et al.], avoid some contention problems by using combining networks, which combine similar requests to the same memory location (e.g., read, clear, add a constant). However, combining networks are ineffective for contention to unrelated data. A simple and general approach to minimize contention is to scatter the data among all the memory banks. If the referencing pattern is uniformly distributed, the probability that two references are to the same of the n memory banks is 1/n. Unfortunately, this strategy compromises proximity, because the probability that a reference is to remote memory is (n − 1)/n, which approaches 1 for a large machine.

There are basically two extreme ways in which the proximity and contention issues can be handled: the placement of objects in memory can be left to the user, or be done automatically by the implementation. User controlled placement can be expressed in several ways, including declarations and the use of specialized data manipulation operators. Automatic placement has the advantage of preserving the high-level nature of the language; that is, the user does not need to know the details of the target machine. However, there is just so much that can be expected of automatic techniques and, at least for special purpose applications, the user can have knowledge of memory reference patterns that is next to impossible for the compiler to infer automatically.

It is important to distinguish two classes of data. User data is data explicitly created and referenced by the data manipulation procedures of the language (e.g., cons, car and set-car!). Internal data corresponds to data used internally by the implementation to support the language. Internal data includes:

- Environment frames
- Continuation frames
- Closures
- Cells for mutable variables
- Global variables
- Tasks
- Constants
- Program code

Because these data structures are used in well defined ways under the control of the implementation, it is possible to design special purpose memory management policies for them. For instance, local contention free accesses to the program code and constants are possible if they are copied to the private memory of each processor when the program is loaded.

Both user data and internal data are important to optimize in a system. However, this thesis concentrates on the management of internal data, and in particular the data structures that are involved in dynamic partitioning. The placement of user data is not considered here.

Dynamic Partitioning

One of the most fundamental operations performed by any parallel system is the distribution of work throughout the system. Each processor has to be aware of the computations it is required to do, and at what time. The overall goal is to have the best usage of the processing resources, that is, to have the greatest number of processors doing useful things. Partitioning consists of dividing the program's total workload into smaller tasks that can be assigned to the processors for concurrent execution. A prerequisite to partitioning is of course knowing which pieces of the program can be done concurrently. Since in Multilisp concurrency is stated explicitly by the user, it will be assumed here that the only source of concurrency is the future construct. Thus, in the expression (+ x y), the concurrency possible in the evaluation of the arguments to + will be disregarded, because it is not expressed with a future.
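As an illustration, here is a sketch in the style of the fib benchmark used later in the thesis (the exact touching behavior of strict primitives such as + varies between Multilisp implementations and is discussed in Chapter 2):

    ;; Parallel fib: the recursive call on (- n 1) is exposed as a
    ;; potential task with FUTURE; the call on (- n 2) is performed
    ;; by the current task.  TOUCH synchronizes on the result.
    (define (pfib n)
      (if (< n 2)
          n
          (let ((left (future (pfib (- n 1)))))
            (+ (pfib (- n 2))
               (touch left)))))

Without the future annotation, pfib would be an ordinary sequential program; the annotation is the only source of concurrency the partitioner sees.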


Partitioning can be done once and for all before the program is run. This static partitioning has the advantage of being simple to conduct when the program naturally decomposes into a fixed number of equal sized tasks. It also permits some compilation optimizations, because important information, such as the particular assignment of tasks to processors, the intertask communication pattern and the type of communication, can sometimes be known at compile time. Programs with a regular computational structure are good candidates for static partitioning.

Dynamic partitioning relegates the partitioning decisions to when the program is running. This approach is more general, because it can be applied to programs with complex concurrency structures and also to programs whose concurrency is dependent on the input data set. This generality is needed for Multilisp, because the arbitrary concurrency structures expressible with the future construct cannot be handled by static partitioning methods. Another advantage is that better partitioning decisions can be made, because more information is available at run time. The size of the machine (number of processors and memory size) is an important parameter that may not be known at compile time. There are other, equally important but more subtle, partitioning parameters that are only available at run time. For example, the number of active tasks and idle processors at a given point in time are useful indicators of partitioning needs.

In a way, dynamic partitioning has the ability to adapt to its execution environment, whereas static partitioning is stuck with irreversible compile time decisions that are based on predictions of what the execution environment will be. Adaptability is crucial to account for the varying computational nature of certain programs. Parallel sort is a good example to illustrate this point. The sort may have more or less concurrency depending on the data set size (i.e., the number of items to sort) and the cost of comparing two items. These parameters can vary in the same program if the sort is called multiple times. Concurrency can also be affected by the initial ordering of the items: the sort algorithm might degenerate to a sequential algorithm for some orderings and be perfectly parallel for others. Large programs add another dimension to the argument. Large programs are typically composed of several smaller independent modules. Concurrency can occur inside a module, between purely sequential modules, and also between internally concurrent modules. It is quite possible that an internally concurrent module, such as parallel sort, has to execute by itself at some point and concurrently with other modules at some other point. The partitioning requirements may vary greatly between these two cases. At one extreme, no partitioning is needed for the sort if the other modules are doing long sequential computations and there happen to be n of them on an n processor machine.


The main inconvenience of dynamic partitioning is that it adds a run time overhead. Dynamic partitioning is administrative work that gets added to the operations strictly required by the program (i.e., the mandatory work). Tasks are created to enable concurrent execution, but each task created adds a cost in time and space, because its state has to be maintained throughout its life (this includes task creation, activation, suspension and termination). A dynamic partitioning strategy must find some compromise between the benefit of added concurrency and the drawback of added overhead. Some have avoided this problem to some extent by relying on specialized hardware to reduce the cost of managing tasks. Dataflow machines [Srini; Arvind and Nikhil] and multithreaded architectures [Halstead and Fujita; Nikhil et al.; Agarwal] fall in this category. However, software methods are attractive because they offer portability and low hardware cost. This thesis explores software methods for lowering the cost of task management in the context of the Multilisp language.

In a strict sense, partitioning only refers to the way the program gets divided up into tasks. This definition is not very useful for Multilisp, because each evaluation of a future leads to the creation of a new task: there are no partitioning decisions to be made. However, choices are available at another level. There can be several representations for tasks, each having its own set of features and management costs. The appropriate representation for a particular task will depend on many factors but, as a general rule, it will be best to select the one with the lowest cost that has all the required features. Partitioning has a broad sense in this thesis: it refers to the choice of representation that is used for the tasks in the program and the way that they are managed.

An important parameter affecting the performance of dynamic partitioning is the granularity of parallelism, G, of the program. G is defined as the average duration of a task:

    G = T_seq / N_task

Here N_task is the total number of tasks created by the program, and T_seq is the duration of the program when all task operations are removed (i.e., T_seq is the mandatory work). When the task operations are present, the work required for the program is T_seq plus some task management overhead, T_task, for each task created:

    T_par = T_seq + N_task · T_task

T_task contains the time to create, start and terminate a task. The total work required to run the program on an n processor machine, T_total(n), will be T_par plus some amount that accounts for all other parallelism overheads, including the costs of transferring tasks between processors, synchronizing tasks, sharing user data and being idle. The run time on n processors is thus T_total(n)/n. The efficiency, E, of the processors is the proportion of the time they spend doing mandatory work. G and T_task are important parameters because they put an upper bound on efficiency:

    E = T_seq / T_total(n) ≤ T_seq / (T_seq + N_task · T_task) = 1 / (1 + T_task / G)

This equation suggests that efficiency is a function of the relative size of G with respect to T_task. Higher efficiency can be obtained either by increasing G or decreasing T_task.
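A small Scheme sketch of this bound (the numbers are illustrative only) shows how quickly efficiency collapses once tasks become finer than the task management cost:

    ;; Upper bound on efficiency: E <= 1 / (1 + T_task/G)
    (define (max-efficiency t-task g)
      (/ 1 (+ 1 (/ t-task g))))

    (max-efficiency 500. 5000.) ; => ~.91  (G is 10x the overhead)
    (max-efficiency 500. 500.)  ; => .5    (G equals the overhead)
    (max-efficiency 500. 50.)   ; => ~.09  (very fine grain tasks)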

Eager Task Creation

A well known dynamic partitioning method is eager task creation (ETC). Its main advantage is simplicity. Only a single representation for tasks exists in ETC: the heavyweight task object. Unfortunately, the task management cost for heavyweight tasks is relatively high, on the order of hundreds of machine instructions. A coarse granularity is thus required to get good performance; for example, the granularity must be at least in the hundreds of machine instructions to achieve better than 50% efficiency. This makes the programming task that much more difficult, because granularity must be taken into account when designing programs. Moreover, coarse grain programs have less parallelism (fewer tasks), so there is a risk that they will only perform well on small machines. Finally, some programs are hard to express with coarse grain parallelism.
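To suggest where the cost comes from, here is a minimal sketch of how a future might expand under ETC; the names make-placeholder, determine!, make-task, run-task, work-queue-put! and work-queue-get are hypothetical stand-ins for the mechanisms described in Chapter 2:

    ;; Eager task creation: every FUTURE immediately allocates a
    ;; placeholder for the result and a heavyweight task object,
    ;; and enqueues the task on the work queue.
    (define (future-etc thunk)
      (let ((ph (make-placeholder)))
        (work-queue-put!
         (make-task
          (lambda () (determine! ph (thunk))))) ; compute, then fill
        ph))                                    ; return placeholder

    ;; Idle processors loop, running whatever tasks are available.
    (define (scheduler-loop)
      (run-task (work-queue-get)) ; blocks until a task is available
      (scheduler-loop))

Every future thus pays for a placeholder, a task object and two work queue operations, whether or not the extra parallelism is ever needed; this is the cost that LTC, described next, avoids.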

Lazy Task Creation

A more efficient partitioning method, called lazy task creation (LTC), is explored in this thesis. In addition to the heavyweight task representation, LTC uses a much cheaper lightweight representation. The method is described in detail in Chapter 3, but a general description is given here to explain some of the issues.

LTC lowers the average task management cost by creating only as many heavyweight tasks as necessary to keep all processors working. To do this, each processor maintains a local data structure, the lazy task queue (LTQ), that indicates the availability of tasks on that processor. When the program asks for the creation of a task, the LTQ is updated to indicate the presence of this new task. This operation is efficient because a lightweight task representation is used. A lightweight task preserves enough information to recreate the heavyweight task later on if needed: each entry in the LTQ is a pointer into the stack marking the boundary of that task's stack. The beauty of LTC is that when the processor becomes idle it can get work from its own LTQ at a low cost and completely avoid the creation of a heavyweight task. When the LTQ is empty, the processor must instead find a task to resume from some other processor's LTQ. It is only in this case that a high cost is paid to create a heavyweight task and transfer it between processors.
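The following minimal sketch conveys the LTQ discipline for a single processor; it uses a plain Scheme vector with head and tail indices, and it deliberately ignores the synchronization between owner and thief, which is precisely what distinguishes the two protocols described next:

    ;; Each LTQ entry is a pointer into the stack marking a task
    ;; boundary.  The owner pushes and pops at the tail (cheap);
    ;; a thief removes the oldest entry at the head.
    (define ltq  (make-vector 1000))
    (define head 0)   ; next entry a thief would take
    (define tail 0)   ; next free slot for the owner

    (define (push-lazy-task! stack-ptr)  ; executed on each FUTURE
      (vector-set! ltq tail stack-ptr)
      (set! tail (+ tail 1)))

    (define (pop-lazy-task!)             ; owner reclaims its own task
      (and (< head tail)
           (begin (set! tail (- tail 1))
                  (vector-ref ltq tail))))

    (define (steal-lazy-task!)           ; thief takes the oldest task
      (and (< head tail)
           (let ((t (vector-ref ltq head)))
             (set! head (+ head 1))
             t)))

Pushing and popping are only a few instructions each, which is what makes fine grain futures affordable; only a successful steal triggers the expensive creation of a heavyweight task.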

Shared-Memory Protocol

But how exactly does this interaction take place? The protocol adopted in [Mohr] uses a shared-memory paradigm. The stack and LTQ of all processors are directly accessible to all processors (i.e., they are shared data). When processor A needs to get work from processor B, it directly manipulates B's LTQ and stack to extract a task. This approach has unfortunate consequences. First of all, access and mutation of the LTQ must be arbitrated, because several processors may be competing for access. This means that the cost of lightweight task creation is higher than might have been expected, because synchronization operations are needed to ensure that accesses to the LTQ are mutually exclusive. This may be tolerable in certain contexts, since the overhead cost will be high only for parallel programs with fine grain parallelism. The second consequence is much more serious. The protocol assumes that the stack and LTQ are in consistent memory; therefore, they cannot be cached as efficiently as private data. This can have a severe impact on performance, because the stack is one of the most intensively used internal data structures. The cost is also unrelated to the use of parallelism: sequential programs will suffer just as much as parallel ones. It is preferable for the stack to be a private resource so that copy-back caching can be used, as is the case for sequential implementations of Lisp.

Message-Passing Protocol

The stack and LTQ can be made private by adopting a message-passing protocol for work distribution. When A needs to get work from B, it sends a request for work to B. Upon receiving this message, B checks its LTQ for an available task and, if one is available, sends it back to A. Since the LTQ and stack are only accessed locally, there is no need for synchronization operations when updating them. Lightweight task creation is thus cheaper than with the shared-memory protocol. This allows very fine grain parallelism to be efficient. Sequential code also benefits, because copy-back caching can now be used for the stack.

Although it is promising, the message-passing protocol introduces some new issues. How is the communication mechanism implemented, and what is its cost? The latency of the communication is also a factor: can the processor respond fast enough to minimize the idle time of the requesting processor?
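The shape of the MP protocol can be sketched as follows; send-message!, poll-message, message-sender, self-processor, wait-for-reply and heavyweight-task are hypothetical names standing in for the mechanisms detailed in Chapters 3 and 4:

    ;; Victim side: at safe points the processor polls for steal
    ;; requests.  On a request it removes the oldest lazy task from
    ;; its (private) LTQ, converts it to a heavyweight task and
    ;; sends it back, or sends a refusal if no task is available.
    (define (poll-for-steal-requests!)
      (let ((msg (poll-message)))          ; #f if nothing pending
        (if msg
            (let ((task (steal-lazy-task!)))
              (send-message! (message-sender msg)
                             (if task
                                 (heavyweight-task task)
                                 'no-task))))))

    ;; Thief side: an idle processor picks a victim, asks it for
    ;; work and waits for the reply.
    (define (request-work! victim)
      (send-message! victim (list 'steal-request (self-processor)))
      (wait-for-reply))

Because only the owner ever touches its LTQ and stack, no locks are needed; the price is that the victim must poll often enough to keep the thief's idle time short, which is why an efficient polling method is needed.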

Overview

The thesis is organized in six chapters. Chapter 2 gives a description of the Multilisp language and its traditional implementation using ETC. Some fine points of its semantics are discussed to clarify the constraints that must be met by the partitioning methods. Finally, the benchmarks used for later experiments are presented.

Chapter 3 provides a detailed description of the shared-memory and message-passing implementations of LTC. It is shown how support for dynamic scoping, continuations and fairness can be added to LTC. This chapter also examines the memory usage characteristics of the benchmark programs to evaluate the benefits of caching.

Chapter 4 concentrates on the communication mechanism required by the message-passing protocol. An efficient software implementation is described and evaluated.

Chapter 5 compares the two LTC protocols. The performance of both protocols is measured on several benchmarks and under numerous conditions.

The closing chapter summarizes the results of the thesis and suggests some future lines of research.

Chapter 2

Background

Before discussing the implementation of the future construct, it is necessary to establish the set of features that must be supported by the implementation. This is particularly important because there is no formal standard for the Multilisp language; nearly every implementation has its own peculiarities. This thesis takes the pragmatic view that Multilisp is defined by the set of features common to a number of implementations.

Choosing the set of supported features is a delicate process that is similar in many ways to language design itself. The set should not be limited to the features that are strictly common to all implementations, as this would be ridiculously restrictive. Features that have acquired a certain level of acceptance in the field should also be included. On the other hand, it is wise to select a small set of features that interact in a coherent, well defined way in order to provide a programming model with few surprises.

The chapter starts off by giving a definition of the Multilisp semantics targeted by this work. This includes the future construct, common to all Multilisp implementations, and also two useful features of sequential Lisps which pose special problems in a parallel setting: dynamic scoping and first-class continuations. The ETC implementation of this semantics is then presented. The chapter ends with a description of some Multilisp programs later used to evaluate and compare various implementation strategies.

Scheme's Legacy

Multilisp inherits its sequential programming features from the Scheme dialect of Lisp [IEEE Std]. Scheme was designed to be a relatively small and simple language with exceptional expressive power. There are few rules and restrictions for forming expressions in Scheme, yet most of the major programming paradigms can conveniently be expressed with it. This is not surprising, since the language is based on the theory of the lambda calculus.

There are six basic types of expressions in Scheme: constant, variable reference, assignment, conditional, procedure abstraction (lambda-expression), and procedure call. All the other types of expressions can be derived from the basic types, and this is in fact how they are defined in the standards [IEEE Std; R4RS]. Being able to reduce a program to the basic expressions is helpful both as an implementation technique and as a means to understand programs and prove some of their properties. It is also a considerable advantage for any extension effort, such as Multilisp, because the interaction of the extensions with the language can be more carefully analyzed by limiting the study to the basic types of expressions.

Scheme offers a rich set of data types including numbers, symbols, lists, vectors, procedures, characters, and strings. There are also several predefined primitives to operate on these data types, including procedures to create, destructure, and mutate data. Although Lisp-like languages have a historical inclination towards symbolic processing applications, the elaborate support of numerical types in Scheme makes it a candidate for numerical applications as well.

There has been an effort in Scheme to make the language as uniform as possible. All types of objects in Scheme share some basic properties that make them first-class values. Any object can be used as an argument to procedures, returned as the result of procedures, stored in data structures, and assigned to variables. Departing from Lisp tradition, Scheme evaluates the operator position of procedure calls like any other expression and does not impose any particular ordering on the evaluation of arguments to procedures. The let and let* special forms are handy to force a particular ordering when it is needed; this is what is done in the examples.

Objects have unlimited extent: they conceptually exist forever after they have been created. In general this means that objects must be allocated in the heap. When there is no space left in the heap, the system automatically invokes the process of garbage collection to reclaim the heap space allocated to objects that are no longer needed for the rest of the computation. In certain circumstances it is possible at compile time to detect that an object is no longer needed past a certain point in the program. The compiler can then use a specialized allocation policy, such as a stack, and explicitly perform the deallocation. This reduces the frequency and cost of garbage collection.


Scheme relies solely on static scoping as a method to resolve variable names. An identifier refers to the variable with the same name in the innermost block that lexically contains the reference and declares the variable. If no such block exists, the identifier refers to a variable in the global environment. This naming rule corresponds to that of block structured languages such as Pascal and Algol. Dynamic scoping is an alternative method that has been traditionally used in other Lisps. The identity of variables is not based purely on the lexical characteristics of the program available at compile time, but rather depends on the control path taken by the program at run time. Although dynamic scoping has its specialized uses (e.g., see the section on dynamic scoping below), its pervasive use is not generally viewed as promoting modularity. In addition, efficient implementation of dynamic scoping is often based on shallow binding, a strategy that is not well suited for parallel execution. Static scoping permits the use of certain compilation techniques, such as data flow analysis, that are difficult or impossible to perform with dynamically scoped variables, because the analysis would have to be done on the entire program.

In Scheme, procedures are viewed as first-class values and thus have the same basic properties as the other data types. With first-class procedures many programming techniques are easily implemented. Higher order functions, lazy evaluation, streams, and object-oriented programming can all be done using first-class procedures (for example, see Adams and Rees, Friedman et al.). Procedures created by lambda-expressions are usually called closures to distinguish them from predefined procedures. The static scoping rules require all closures to carry, at least conceptually, the set of variables to which they might refer (the closed variables). Consequently, variables have unlimited extent and cannot generally be allocated in a stack-like fashion as in more conventional languages. Closures pose additional problems in a parallel setting. Because closures are just another data structure, contention may happen if several processors are simultaneously calling the same closure. A typical situation would be the parallel application of a closure to a set of values. Some optimizations can avoid contention in some cases. For example, closures with no closed variables, such as globally defined procedures, are essentially constant, so they can be created and copied to all processors when the program is loaded. Lambda-lifting can also eliminate the need to create closures by explicitly passing the closed variables between procedures. Both of these techniques are used in Gambit. However, the general case remains hard to solve, as it is equivalent to the problem of data sharing. For this reason, true closures have been avoided as much as possible in the benchmarks.

In accord with the goal of simplicity, the only way to transfer control in Scheme is through the use of procedure calls. All types of recursion, whether they correspond to an iteration or not, are expressed as procedure calls. There are two types of calls. If the value returned by a call is immediately returned by the procedure containing the call, it is a reduction call. Otherwise, the call is a subproblem call. All implementations are required to be properly tail recursive. That is, they must guarantee that loops expressed recursively do not cause the program to run out of memory. In implementation terms, this means that reduction calls must not retain the current procedure's activation frame (the local variables and return address) past the actual transfer of control to the called procedure.
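For instance (an illustrative example, not from the thesis), the first definition below makes a subproblem call, since + consumes the result of the recursive call, while the second makes only reduction calls and therefore runs in constant stack space:

(define (length1 lst)                ; subproblem call: + consumes the result
  (if (pair? lst)
      (+ 1 (length1 (cdr lst)))
      0))

(define (length2 lst n)              ; reduction call: properly tail recursive
  (if (pair? lst)
      (length2 (cdr lst) (+ n 1))
      n))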

Scheme is a call by value, or applicative order, language. The evaluation of the program is forced to follow an ordering that evaluates all arguments to a procedure before the procedure is entered. The opposite policy, call by need or normal order evaluation, doesn't evaluate any of the arguments to a procedure when the procedure is called. Evaluation occurs when a strict operator, such as addition, needs the actual value. Data transfer operations, such as parameter passing and creation of data structures, are not considered to be strict. Both policies have advantages. Programs using normal order evaluation sometimes terminate when their applicative order counterparts do not. On the other hand, applicative order is often more efficient. In Scheme it is possible to get the equivalent of normal order evaluation by using the delay special form to delay evaluation and by redefining the primitive procedures so that they force the evaluation of the arguments in which they are strict. The future construct is the dual of the delay special form, giving eager evaluation instead of lazy evaluation.
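The duality can be illustrated as follows (an illustrative sketch):

(define p (delay (* 6 7)))    ; lazy: the body is not evaluated yet
(force p)                     ; => 42, evaluated on demand

(define q (FUTURE (* 6 7)))   ; eager: the body may run concurrently
(TOUCH q)                     ; => 42, waiting if the value is not yet ready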

Scheme supports various flavors of side-effects, such as assignment, data structure mutation, and input/output operations. Thus it is considered to be an imperative programming language where sequencing of operations is a necessary concept. Nevertheless, Scheme contains a powerful functional subset which can be used for purely functional programming. Some algorithms are naturally expressed in a functional way; some others are expressed better with the use of side-effects. In Scheme both paradigms can appear in the same program, and the programmer can choose which best matches his needs at any given point. It is however a good idea to limit the scope of side-effects by hiding them through abstraction barriers. For example, a sorting procedure can have a functional specification even if it uses side-effects internally. In practice it seems that Scheme favors a mostly functional style of programming where side-effects are used with discretion. This style of programming lends itself well to parallelism, because subproblems are often independent and are thus possible targets for concurrent evaluation.

(Delay only exists in the Revised Report on Scheme (R4RS), not in the IEEE standard.)


First-Class Continuations

Perhaps Scheme's most unusual feature is the availability of first-class continuation objects. Continuations have been used in the past to express the denotational semantics of programming languages such as Algol and Scheme itself [R4RS; Clinger]. Most programming languages use continuations, but they are usually hidden, whereas in Scheme they can be manipulated explicitly. First-class continuations are useful to implement advanced control structures that would be hard to express otherwise.

Intuitively, a continuation represents the state of a suspended computation. The power of continuations stems from the ability to reinstate a computation at any moment, and possibly multiple times. It is convenient to think of a continuation as a procedure that restores the corresponding computation when it is called. Often it is necessary to influence the computation that is being restored. This is done by passing parameters to the continuation. Continuations typically have a single parameter (the return value), but some continuations may take none or more than one parameter.

Continuation Passing Style

Continuations are best understood by examining the underlying mechanism of evaluation. Each expression in the program is the producer of a value that is to be consumed by some computation: the expression's continuation. For example, in (f x) the procedure f is the consumer of the value produced by the expression x. Each expression can be viewed as being implemented by an internal procedure whose purpose is to compute the value of the expression and send it to the consumer computation. Thus, one of the parameters of this internal procedure is a continuation which takes a single argument: the value of the expression.

This model of evaluation gives rise to a programming style called continuation passing style, or CPS. CPS was originally used as a compilation technique for Scheme [Steele], but CPS is equally useful to explain how continuations work. The interest of CPS is that programs written in this style are expressed in a restricted variant of Scheme, yet all Scheme programs can be converted to CPS. An important byproduct of CPS conversion is that procedure calls never have to return; they are always reductions and can thus be viewed as jumps that pass arguments.

(define (map-sqrt lst)
  (call-with-current-continuation
    (lambda (cont)
      (map (lambda (x) (if (negative? x) (cont #f) (sqrt x)))
           lst))))

Figure: Non-local exit using call/cc

The CPS conversion process consists of adding a continuation as an extra argument to each procedure call and adding a corresponding parameter to all procedures. Primitive procedures must also be redefined to obey this protocol. The continuation argument specifies the computation that will consume the result of the procedure being called. For subproblem calls, the continuation argument is a single argument closure representing the computation that remains to be done by the caller when the called procedure logically returns. For reduction calls, the continuation argument is the same as the caller's continuation, thus implementing proper tail recursion. Wherever a procedure would normally return a value (other than by a reduction call), a jump to the continuation argument is performed instead.
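As an illustration (not taken from the thesis), here is a procedure and a CPS-converted form of it; k is the added continuation parameter, and *-cps and +-cps are CPS versions of the primitives:

;; Direct style:
(define (add-squares x y)
  (+ (* x x) (* y y)))

;; CPS: every call passes an explicit continuation and never returns.
(define (add-squares-cps k x y)
  (*-cps (lambda (x2)                 ; continuation for (* x x)
           (*-cps (lambda (y2)        ; continuation for (* y y)
                    (+-cps k x2 y2))  ; reduction: the caller's k is reused
                  y y))
         x x))

(define (*-cps k a b) (k (* a b)))
(define (+-cps k a b) (k (+ a b)))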

In Scheme, access to the implicit continuation is provided by the predefined procedure call-with-current-continuation, abbreviated call/cc. A single argument procedure must be passed as the sole argument of call/cc. When it is called, call/cc takes its own implicit continuation, converts it into a Scheme procedure, and passes it to its procedure argument. The CPS definition of call/cc is simply

CPS-call/cc = (lambda (k proc) (proc k (lambda (dummy-k x) (k x))))

Note that there are two ways in which the captured continuation k can be invoked: either proc calls the continuation it was passed as an argument, or proc returns normally.
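Both cases can be observed directly (illustrative examples):

(call/cc (lambda (k) (+ 1 2)))        ; proc returns normally        => 3
(call/cc (lambda (k) (+ 1 (k 10))))   ; proc invokes k; the pending
                                      ; addition is abandoned        => 10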

Programming with Continuations

Several control constructs can be built around call/cc. A typical application is for non-local exit and exception processing, which are normally done in Lisp using the special forms catch and throw. In Scheme this can be done by saving the current continuation before entering a block of code. An exit from the block occurs either when the block terminates normally or when the saved continuation is called. An example of this is given in the figure above. The procedure map-sqrt returns a list containing the square root of every item in a list, but only if they are all nonnegative. The value #f is returned if any item is negative. To do this, map-sqrt binds its continuation to cont. A call to cont thus corresponds to a return from map-sqrt. When a negative value is detected by map-sqrt, the processing of the rest of the list is bypassed by the call (cont #f), which immediately causes map-sqrt to return #f.

Call/cc, however, is more versatile than Lisp's catch and throw, because it does not restrict the transfer of control to a parent computation. Thus it is possible to directly transfer control between two different branches of the call tree. This characteristic can be exploited to implement specialized control structures such as backtracking [Haynes], coroutines [Haynes et al.], and multitasking [Wand]. A less frequent but possible use of continuations is to reenter a computation that has already completed (see Rozas for an application).

The generality of first-class continuations comes at a price: a more complex programming model. In many languages, including Lisp, procedure calls have dynamic extent. This means that every entry of a procedure is balanced by a corresponding exit (normal or not). This is not the case in Scheme, because the computation performed in a procedure can be restarted multiple times, and thus a procedure can exit more than once even if it is called only once. Because the programmer's intuition often fails when dealing directly with continuations, it is sometimes helpful to build abstraction barriers that offer restricted versions of call/cc (for example, see Friedman and Haynes).

First-class continuations also cause an implementation problem. If procedures have dynamic extent, continuations can easily be represented by a single stack of control frames (i.e., return addresses). Control frames get allocated when procedures are called and deallocated when procedures return, in a last-in first-out (LIFO) fashion. This form of garbage collection is possible because control frames cannot be referenced after the corresponding procedure returns. The unlimited extent of continuations in Scheme means that a more general garbage collection mechanism for control frames must be used, because a procedure's control frame might still be needed after the procedure returns. At least in some cases, control frames must be allocated on the heap. A common implementation strategy is to allocate all control frames on the stack, as though they had dynamic extent, and to move them to the heap only when their extent is no longer known to be purely dynamic (usually at the moment a continuation is captured by a call/cc). This way the efficiency of stack allocation is obtained for programs that do not make use of first-class continuations. This strategy is described in detail in a later section.

The next section examines the problems that arise when continuations are used in a parallel setting.

Multilisp's Model of Parallelism

Parallel programming languages can be classified according to the level of awareness of parallelism required by the programmer when writing programs. At one end of the scale there are languages with implicit parallelism that rely exclusively on the ability of the system to detect and exploit the parallelism available in programs. In these languages, the compiler must analyze the program to determine what parts can and should be executed concurrently. In general this is a hard task for imperative languages because of the existence of side-effects. Even in the absence of side-effects, the compilation may be difficult if an algorithmic transformation is required to obtain a sufficiently parallel algorithm.

Multilisp is at the other end of the scale. Parallelism is explicitly introduced by the programmer through the use of the future construct. The future construct marks the parts of the program where concurrent evaluation is allowed. Of course, this style has its price: the burden put on the programmer for specifying concurrency and the possibility of error (i.e., incorrectly specifying concurrency). The advantage of this approach is that it provides more control over the program's execution. The programmer can specify concurrency at places which might escape an automatic analysis, and can choose to disregard some forms of concurrency if it is judged that the cost of exploiting the concurrency is greater than what is gained.

This level of control is useful for the programmer wanting to experiment with various ways of parallelizing a program. It is also appropriate when Multilisp is considered as the object code of a compiler for a higher level parallel language. Such a compiler could be aware of where parallelism is both possible and desirable and emit code with appropriately placed futures; Gray is a good example of this application.

FUTURE and TOUCH

Futures are expressed as (FUTURE expr), where expr is called the future's body. The future construct behaves like the identity function in the sense that its value is the value of its body. However, the body is conceptually evaluated concurrently with the future's continuation. The only restriction to this concurrency comes as a result of the ordering dependencies imposed by the strict operations in the program. When the value of a future is used in a strict operation, the operation can only be performed after the evaluation of the future's body. For example, in the expression

(let ((x (FUTURE (f 1))))
  (g (+ x (f 2))))

the evaluation of (f 1) is done concurrently with the evaluation of (f 2). Because + is a strict operation in both of its arguments, the addition and the call of the procedure g can only occur after the evaluation of (f 1) has completed.

As long as they respect the temporal ordering imposed by the strict operations, the operations required to compute the body of a future are subject to arbitrary interleaving with the operations performed by the future's continuation. Because Multilisp allows unrestricted side-effects, it is an indeterminate language: separate runs of the same program can potentially generate different results. As a simple example, consider the expression

(let ((x 0))
  (FUTURE (set! x 1))
  x)

The evaluation of this expression can either return 0 or 1, depending on whether the reference to x happens to be done before or after the assignment to x.

In certain circumstances a program needs to impose special control dependencies in addition to those given by the data dependencies of the program. Such control dependencies are only required in imperative parts of the program to enforce a certain ordering of side-effects. For example, it might be important to guarantee that some restructuring of a database has completed before some other processing of the database is performed. For this purpose, Multilisp provides the primitive procedure TOUCH, which behaves like a strict identity function. TOUCH can be viewed as the fundamental strictness operation: all other strict operations use TOUCH internally.
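For example (an illustrative sketch), a strict version of an operation can be derived by touching its arguments:

;; A strict addition: both arguments are touched before the addition is
;; performed, so the caller waits if either argument is an undetermined
;; placeholder.
(define (plus a b)
  (+ (TOUCH a) (TOUCH b)))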

In order to show clearly where the TOUCH operations are needed, the code examples and benchmark programs that follow include explicit calls to TOUCH.

(To be precise, the steps required to bind x, evaluate g and (f 2), and enter the procedure can also be done concurrently with the evaluation of (f 1).)

(Indeterminacy also exists in Scheme, but at a different level. In a procedure call, the arguments and the operator position can be evaluated in any order, but sequentially, that is, with no overlap in time. The following expression has two possible values, 0 and 1:

(let ((x 0))
  (car (cons x (set! x 1)))))


Placeholders

A more traditional description of futures consists of introducing a new type of object, the placeholder, that is used to synchronize the computation of a future's body with the touching of its value [Miller]. When a future is evaluated, it returns a placeholder as a representative of the value of the body. A placeholder can be in one of two states. It is undetermined initially, and for as long as the evaluation of the future's body has not completed. When the evaluation of the body is finished, the resulting value is stored in the placeholder object, which is then said to be determined. Using placeholder objects, TOUCH has an obvious definition: if the argument is not a placeholder, just return it; otherwise, wait until the placeholder is determined and then return its value.

It is important to understand that placeholders are used here as an artifice to explain how futures work. Although placeholders are commonly used in Multilisp systems, an implementation is free to choose any method that gives the same result. Even if placeholders are present in the system, the user can be totally unaware of their existence if the implementation does not provide constructs to manipulate them directly. This is the view adopted by Gambit.

Spawning Trees

It is sometimes useful to represent the effects of evaluating futures and touching placeholders by a diagram, the spawning tree, which shows the state of the concurrent computations as a function of time. A spawning tree resulting from the evaluation of a single future looks like:

[Diagram: two horizontal time lines, the future's continuation above and its body below, forked at the point where the future is evaluated.]

A computation is represented by a horizontal line whose extent corresponds to its duration. A dashed vertical line marks the evaluation of the future; at that point a new computation corresponding to the body of the future is started. Arrows are used to express the data dependencies introduced by the TOUCH operation. An arrow links the computation that determined a placeholder with the computations that touch it (a computation can point to several others). The tail of an arrow indicates the point where a placeholder was determined, whereas the head indicates the point where the TOUCH was requested. If an undetermined placeholder was touched, the arrow will point backwards in time, indicating that the touching computation had to wait.

A second representation of spawning trees used here is as a rooted tree. Each node of the tree represents a future, and the children of a node are the futures dynamically nested in the body of the corresponding future. The root of the tree corresponds to a virtual future in which the program is executed.

Types of Parallelism

Parallelism comes in many flavors. Control parallelism occurs when different parts of an algorithm can be done simultaneously. Data parallelism occurs when different data values can be processed concurrently. The advantage of data parallelism is that it scales well: larger data sets will offer more parallelism and thus provide better opportunities for speedup. In control parallelism, the degree of parallelism is in principle limited by the structure of the algorithm. For this reason, data parallelism is more useful than control parallelism for large scale computations.

The future construct is appealing because it can be used to express several types of parallelism.

Pipeline Parallelism

Pipeline parallelism is a special case of control parallelism where the processing of data is overlapped with the processing of the result. Pipeline parallelism is the primitive form of parallelism provided by the future construct. It enables the production of a value by the future's body to be done concurrently with the consumption of the value by the future's continuation.

Pipeline parallelism is particularly useful when processing a data structure built incrementally, such as a list of values. At any given point in time, the part of the data structure that has been computed by the producer is available for processing by the consumer computation. An example of this is the procedure pmap, as defined in the figure below.

(define (pmap proc lst)
  (if (pair? lst)
      (let ((tail (FUTURE (pmap proc (cdr lst)))))
        (let ((val (proc (car lst))))
          (cons val tail)))
      '()))

(a) basic definition

[Spawning tree diagrams omitted: (b) spawning tree for the basic definition; (c) spawning tree for the variant with (FUTURE (proc (car lst))); (d) spawning tree for the variant with (cons val (TOUCH tail)).]

Figure: Parallel map definition and spawning trees

Pmap is a parallel version of map, which applies a procedure to each element of a list and returns the list of results. Parallelism has been introduced by allowing the tail of the resulting list to be generated while the first element is computed and used by pmap's caller. Because cons is a non-strict operator, it immediately returns a pair with a placeholder as its tail after proc has been called on the first element. The first element is thus immediately available for processing by the consumer. It is only when the consumer needs to access the tail that a synchronization must take place, possibly suspending the consumer until the next pair in the list is generated.

A variant of pmap with even more potential for parallelism is obtained by also wrapping a future around the call to proc. This allows the computation of the first element to overlap pmap's continuation. The difference in behavior is best visualized by examining the spawning tree for these two variants of pmap. The figure shows the spawning trees for a call of pmap on a three-element list. Parentheses have been added in these diagrams to indicate entry and exit of pmap. As is clear from the two upper spawning trees, the extra future allows more computations to overlap. Whether this added parallelism is actually beneficial will depend on the task granularity, the spawning cost, the number of processors, and the way in which pmap's result is used by the continuation.

Pmap's parallelism is not easy to classify. At first glance it seems that it is an instance of control parallelism, because it expresses concurrency between two different computations: the continuation and the application of the procedure to an element of the list. However, this control parallelism is not static. Pmap calls itself recursively, so the parallelism varies with the length of the list. When viewed globally, pmap exhibits data parallelism because it expresses the parallel application of a procedure to a set of values. If the task granularity is large enough, the processing of longer lists will offer more parallelism.

Fork-Join Parallelism

The above variants of pmap are said to export concurrency, because some of the work logically started inside pmap may be in progress after the procedure has returned.

(The shorter definition

(define (pmap proc lst)
  (if (pair? lst)
      (cons (proc (car lst)) (FUTURE (pmap proc (cdr lst))))
      '()))

is not equivalent, because the two possible orderings of the evaluation of the arguments to cons do not give the same parallelism behavior.)

Exported concurrency is a nuisance for some programming styles. If proc performs some side-effects on a global state, the computation following pmap cannot assume that they have all been done. Some explicit synchronization is needed to guarantee that all of pmap's futures are done. In the simple case where proc does not itself export any concurrency, this synchronization can be done by walking the resulting list and touching all values that are the result of a future. A more elegant solution is to include the required synchronization inside pmap. This is easily achieved by having the future's extent match that of the procedure's body. In other words, the procedure is written so that each future (the fork) is balanced with a corresponding TOUCH (the join) executed before the procedure returns. This is a trivial change to pmap: a TOUCH is added around the second argument to cons, i.e., (cons val (TOUCH tail)). The spawning tree resulting from this variant of pmap is shown in the figure above, panel (d).
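The resulting fork-join definition reads as follows (reconstructed from the description above):

(define (pmap proc lst)
  (if (pair? lst)
      (let ((tail (FUTURE (pmap proc (cdr lst)))))    ; fork
        (let ((val (proc (car lst))))
          (cons val (TOUCH tail))))                   ; join before returning
      '()))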

Divide and Conquer Parallelism

An unfortunate characteristic of pmap is that it scales poorly, due to the inherently sequential nature of lists. The processing of an n element list requires at least n sequential steps just to traverse the list. No matter how quickly each element can be processed, the time required to process n elements will be Ω(n). This may be of little consequence when task granularity is large and lists are short, but massively parallel applications are bound to suffer more.

For this reason, it is preferable to use scalable data structures, such as trees and arrays, when lists would create a bottleneck. But this is not the only step to take. As long as futures are started sequentially, such as in a loop, a bottleneck will be present. A divide and conquer (DAC) paradigm can be used to start futures faster, allowing n futures to be started in O(log n) time. This is actually the best that can be expected of the future construct, because each future splits a thread of computation into two.

Pvmap, shown in the figure below, is a DAC version of pmap that works on vectors. The input elements are stored in a vector which is mutated to construct the result. The vector is divided in two and the mapping is performed recursively on both parts. When a single element is obtained, the mapped procedure is applied to the value and the result is stored back in the vector. To avoid allocating new vectors, subvectors are represented by two indices, lo and hi, which denote the subvector's extent. Because it uses a fork-join paradigm, all side-effects will be finished when pvmap returns. Note also that the TOUCH is used only for synchronization: the actual value of sync is irrelevant.

(define (pvmap proc vect)
  (define (map-range proc lo hi)
    (if (= (+ lo 1) hi)
        (vector-set! vect lo (proc (vector-ref vect lo)))
        (let ((mid (quotient (+ lo hi) 2)))
          (let ((sync (FUTURE (map-range proc mid hi))))
            (map-range proc lo mid)
            (TOUCH sync)))))
  (map-range proc 0 (vector-length vect))
  vect)

(a) definition

[Spawning tree diagram omitted: (b) spawning tree for (pvmap f v).]

Figure: Parallel vector map

Multilisp programs are frequently organized around DAC parallelism. Not only is it a fundamental technique for constructing parallel algorithms [Mou], it also blends naturally with the recursive algorithms and data structures commonly found in Lisp and symbolic processing. Several of the parallel benchmarks used in this thesis (described at the end of this chapter) are based on DAC parallelism.

Implementing Eager Task Creation

This section describes the eager task creation (ETC) implementation of futures. It will serve both as a reference implementation and as a basis on which lazy task creation is built. A few implementation details have been omitted for the sake of clarity; a more elaborate description can be found in Miller.

As might be expected, the implementation of a Multilisp system is in many ways similar to that of a multitasking operating system. At the heart of both are utilities to support the management of various processing resources. For the management of the processors, an important concept is that of the task, which is an abstract representation of a computation in progress. A program first starts out with a single root task in charge of performing the computation required by the program. Tasks are created and terminated dynamically as the computation progresses, possibly causing the number of tasks to exceed the number of processors in the machine.

The task abstraction is supported by the scheduler, whose job is to run tasks by assigning them to processors. A task can be in one of three states. It is running when it is being executed by some processor. It is ready (or runnable) if it is only waiting for the scheduler to assign it to a processor. Finally, it is blocked if some event must occur before it is allowed to run.

Eager task creation (ETC) is a straightforward dynamic partitioning method that has been used in several implementations of Multilisp [Halstead; Miller; Swanson et al.; Kranz et al.]. With ETC there is a single representation for tasks: the heavyweight task object. This is a heap allocated object with a number of fields that describe the state of the computation associated with the task. When the task needs to be started or resumed, its state is restored by reading the fields of the corresponding task object. When a task needs to be suspended, the task object is updated to reflect the current state of the task.

(The definition of heavyweight tasks used here is not the same as the common meaning in operating systems, i.e., a process with its own address space. Here, heavyweight task simply means a representation that is more expensive than the one used for lazy task creation.)

The most important information retained in a task object is the continuation: it indicates where control must return when the task is resumed. Task continuations differ from first-class continuations in that they do not need to be given a result to continue with; they are zero argument procedures. Also, the full generality of first-class continuations is not necessary for task continuations, since they are invoked at most once. Other fields can be added to task objects to support special language features, but they are not strictly required for implementing futures. In fact, an implementation could simply use continuations to represent tasks. Nevertheless, task objects will be used here to make the algorithms more general.

The Work Queue

ETC lends itself well to self scheduling, where each processor is responsible for scheduling tasks to itself. All processors share a global queue, the work queue, that contains the set of runnable tasks. When a processor becomes idle, typically after a task blocks or terminates, it removes a task from the work queue and starts running it. If there are none available, the processor just keeps on trying until one is added to the work queue by some other processor. Self scheduling has the advantage of automatically balancing the load across the processors. As explained below, the work queue can be distributed, but for now it is assumed to be a single centralized queue.

FUTURE and TOUCH

Tasks are created through the evaluation of futures. When a task (the parent) evaluates (FUTURE expr), it creates a placeholder object to represent the value of expr and then creates a child task whose role is to compute expr and determine the placeholder with the resulting value. The child task is added to the work queue to make it runnable, and the placeholder is returned as the result of the future. Thus the parent task immediately starts working on the continuation, using the placeholder as a substitute for the value of expr, while the child task waits in the work queue until it can be started by an idle processor.

Placeholder objects can be represented by a structure containing three slots: the state, the value, and the waiting queue. The meaning of the state and value slots is obvious. The waiting queue is used to record the tasks that have become blocked because they need to wait until the placeholder has a value. When the placeholder gets determined, the tasks that are in the waiting queue are transferred to the work queue, because they are now ready to run. When a task touches an undetermined placeholder, it is suspended and added to the placeholder's waiting queue. The processor is now idle and must find a new task to run from the work queue. When the blocked task later resumes (inside the TOUCH), the placeholder's value is fetched and returned.

Scheme Encoding

A Scheme encoding of these algorithms is given in the first figure below; the definition of the support procedures is given in the second. Note that the code is schematic and does not address all atomicity issues.

Idle is the procedure that is run by processors in need of work. When the program starts up, all processors call idle, except for the single processor that is running the root task. Idle continually tries to remove a ready task from the work queue. To implement TOUCH, each processor must keep track of its currently running task. When a task is found, resume-task is called. The task becomes the current task of that processor, and it is restarted by calling its associated continuation. It is assumed that each processor has a private storage area to store the currently running task; the procedures current-task and current-task-set! access this storage.

The future special form can be thought of as a derived form that expands into a call to make-FUTURE. Its only argument is a nullary procedure (a thunk) that contains the future's body. The expression (FUTURE expr) is really an abbreviation for the procedure call (make-FUTURE (lambda () expr)). Make-FUTURE first creates an undetermined placeholder to represent the body's value and then creates a child task. The child task is set up so that its continuation, when called by resume-task, will compute the value of the body by calling the thunk. The procedure end-body contains the work to be done after the body is computed. End-body calls test-and-determine! to determine the result placeholder with the body's value. Control then goes back to idle. Note that end-body signals an error when a placeholder is determined more than once. This might happen if a continuation captured by a call/cc in the body is invoked after the body has already returned.

Test-and-determine! is an atomic operation similar in spirit to the traditional test-and-set operation. It tests if a placeholder is determined and, if it isn't, the placeholder gets determined to the second parameter and true is returned to indicate success. Otherwise, the placeholder remains as is and false is returned. When a placeholder is determined, the tasks on its waiting queue are transferred to the work queue, thus making them runnable.

(define (idle)
  (if (queue-empty? (work-queue))
      (idle)
      (resume-task (queue-get! (work-queue)))))

(define (resume-task task)
  (current-task-set! task)
  ((task-continuation task)))

(define (make-FUTURE thunk)
  (let ((res-ph (make-ph)))
    (let ((child (make-task (lambda () (end-body res-ph (thunk))))))
      (queue-put! (work-queue) child)
      res-ph)))

(define (end-body res-ph result)
  (if (test-and-determine! res-ph (TOUCH result))
      (idle)
      (error "placeholder previously determined")))

(define (test-and-determine! ph val)
  (if (ph-determined? ph)
      #f
      (begin
        (determine! ph val)
        #t)))

(define (determine! ph val)
  (ph-value-set! ph val)
  (ph-determined-set! ph #t)
  (queue-append! (work-queue) (ph-queue ph)))

(define (TOUCH x)
  (if (ph? x)
      (if (ph-determined? x) (ph-value x) (TOUCH-undet x))
      x))

(define (TOUCH-undet ph)
  (call-with-current-continuation
    (lambda (cont)
      (let ((task (current-task)))
        (task-continuation-set! task
          (lambda ()
            (cont (if (ph? ph) (ph-value ph) ph))))
        (queue-put! (ph-queue ph) task)
        (idle)))))

Figure: Scheme encoding of the Multilisp core

Operations on queues:
  (queue-empty? q)              Tests if q is empty.
  (queue-get! q)                Removes and returns the item at q's head.
  (queue-put! q x)              Adds x to q's tail.
  (queue-append! q1 q2)         Transfers all items from q2 to q1's tail.

Operations on placeholders:
  (make-ph)                     Creates and returns an undetermined placeholder.
  (ph? x)                       Tests if x is a placeholder.
  (ph-determined? ph)           Tests the state of ph.
  (ph-determined-set! ph x)     Sets the state of ph.
  (ph-value ph)                 Returns the value of ph.
  (ph-value-set! ph x)          Sets the value of ph.
  (ph-queue ph)                 Returns the waiting queue of ph.

Operations on tasks:
  (make-task c)                 Creates and returns a task whose continuation is c.
  (task-continuation t)         Returns t's continuation.
  (task-continuation-set! t c)  Sets t's continuation to c.

Operations on the processor's local state:
  (current-task)                Returns the task currently running on the processor.
  (current-task-set! t)         Sets the task currently running on the processor to t.

Other operations:
  (work-queue)                  Returns the work queue.

Figure: Procedures needed to support the Multilisp core
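A minimal sketch of the placeholder operations, representing a placeholder as a tagged vector (the representation is illustrative; make-queue is assumed to create an empty queue):

(define (make-ph) (vector 'ph #f #f (make-queue))) ; tag, state, value, waiting queue

(define (ph? x)
  (and (vector? x)
       (= (vector-length x) 4)
       (eq? (vector-ref x 0) 'ph)))

(define (ph-determined? ph)       (vector-ref ph 1))
(define (ph-determined-set! ph x) (vector-set! ph 1 x))
(define (ph-value ph)             (vector-ref ph 2))
(define (ph-value-set! ph x)      (vector-set! ph 2 x))
(define (ph-queue ph)             (vector-ref ph 3))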

Touching is implemented by TOUCH and TOUCH-undet. TOUCH-undet handles the case where the value to be touched is an undetermined placeholder. When an undetermined placeholder is being touched, the current task must be suspended and put on the placeholder's waiting queue. This is done by a call to call/cc, which captures TOUCH's continuation. Note that since this continuation is guaranteed to be called at most once, a less general but more efficient version of call/cc could be used. The task is then put on the placeholder's waiting queue so that it can later be made runnable by test-and-determine!. As the current task is now blocked, control is transferred to idle to move on to some other piece of work. When the task is resumed, the placeholder's value will be returned to TOUCH's continuation.

Chasing vs No Chasing

An interesting issue is whether placeholders should be allowed to be determined with other placeholders. If this is permitted, the touching of a placeholder must perform the recursive touching of its value. This chasing process can be expensive if the chain of placeholders is long. This happens in programs where the future bodies often return placeholders and placeholders are touched multiple times.

The alternative strict method requires that placeholders be only determined with non-placeholders. The code in the figure above implements the strict method. A chasing implementation is obtained by removing the TOUCH in end-body and by making the continuation installed by TOUCH-undet re-touch the placeholder when the task resumes.
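Concretely, the chasing variant might read (a sketch based on the figure above):

;; Chasing variant: placeholders may be determined with placeholders,
;; so a blocked task must re-touch when it resumes, following the chain.
(define (end-body res-ph result)
  (if (test-and-determine! res-ph result)    ; the TOUCH is removed here
      (idle)
      (error "placeholder previously determined")))

(define (TOUCH-undet ph)
  (call-with-current-continuation
    (lambda (cont)
      (let ((task (current-task)))
        (task-continuation-set! task
          (lambda () (cont (TOUCH ph))))     ; re-touch: chases the chain
        (queue-put! (ph-queue ph) task)
        (idle)))))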

The drawback of the strict method is that the number of blocked tasks will increase in the cases where chasing would be required. It may also restrict concurrency because it has an additional control dependency. Neither of these methods is clearly superior to the other in all contexts. Fortunately, both methods can coexist in the same system as long as the two types of placeholders are distinguished and the appropriate touching and determining mechanisms are called. Having two types of placeholders is useful to implement legitimacy (discussed in a later section).

Unless otherwise noted, the strict method will be assumed, because it is conceptually simpler (i.e., determined placeholders are guaranteed to have a non-placeholder value) and it gives a shorter code sequence for inline calls to TOUCH.

Critical Sections

Various implementation details have been omitted from the above description. One problem that must be addressed is the possible race conditions in these algorithms. Several processors may simultaneously attempt to mutate the work queue or a placeholder. To preserve the integrity of these data structures, some operations must appear to be mutually exclusive. This is usually done by introducing locks in the data structures to control access to them. Spin locks are sufficient because the critical sections consist of only a few instructions (a sketch of a spin lock follows the list below). The operations that must be protected are:

1. Testing and removing a task from the work queue, when a processor is idle.

2. Adding a task to the work queue, when a future is evaluated.

3. Checking the state of a placeholder and adding a task to a placeholder's waiting queue, when an undetermined placeholder is touched.

4. Changing the state and value of a placeholder, when a placeholder gets determined.
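A spin lock can be sketched as follows (test-and-set! and lock-clear! are hypothetical primitives standing in for whatever atomic operations the hardware provides):

;; Acquire: spin until the atomic test-and-set! succeeds.
(define (spin-lock! lock)
  (if (test-and-set! lock)   ; hypothetical: atomically sets the lock and
      #t                     ; returns #t if it was previously clear
      (spin-lock! lock)))    ; otherwise busy-wait and retry

;; Release: clear the lock.
(define (spin-unlock! lock)
  (lock-clear! lock))        ; hypothetical primitive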

Garbage collection adds another complication. If the value of placeholders is assumed to be immutable, it is perfectly valid to replace any reference to a determined placeholder by the placeholder's value. This optimization, called splicing, can in principle be done at any moment, but usually it is performed by the garbage collector. The advantage of splicing is that subsequent calls to TOUCH will be faster, because the dereferencing of the placeholder is avoided (this is particularly helpful to reduce the cost of chasing). Consequently, the implementation must prevent the splicing of the placeholder currently being manipulated. Several techniques are possible, such as temporarily disabling the garbage collector or temporarily marking the placeholder as non-spliceable. The test (ph? ph) in TOUCH-undet is needed to account for the splicing of the touched placeholder. Aside from this test, the code in the figure does not include the operations required to prevent splicing.

Centralized vs Distributed Work Queue

A potential source of inefficiency in the scheduler is caused by the centralized work queue accessed by all processors. The contention for the work queue may become an important bottleneck as the number of processors is increased. Each access to the work queue is mutually exclusive, so all operations on the work queue get sequentialized. The time it takes to add and remove a task from the work queue puts an upper bound on the rate at which tasks can be created and resumed. Clearly, it would be preferable if this rate scaled up with the number of processors.

A common solution is to distribute the work queue. Each processor has its own work queue which it uses to make tasks runnable. These work queues are accessible from all processors. When a processor is looking for work, it first looks for runnable tasks in its own work queue, and goes on to search the work queues of other processors only if its own work queue is empty. This reduces contention and remote memory traffic, and also improves locality, since tasks restarted from the local work queue are likely to have been created locally.
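A sketch of the idle loop adapted to distributed work queues (local-work-queue, work-queue-of, and number-of-processors are hypothetical accessors; atomicity issues are ignored, as in the figures above):

(define (idle)
  (if (queue-empty? (local-work-queue))
      (search-other-queues 0)
      (resume-task (queue-get! (local-work-queue)))))

(define (search-other-queues p)
  (cond ((= p (number-of-processors))
         (idle))                                         ; none found: retry
        ((queue-empty? (work-queue-of p))
         (search-other-queues (+ p 1)))                  ; try the next processor
        (else
         (resume-task (queue-get! (work-queue-of p)))))) ; take a remote task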

Fairness of Scheduling

Another important consideration is fairness of scheduling. In a fair system, a task's computation is guaranteed to progress as long as the task is runnable. In other words, there is a finite amount of time between a task becoming runnable and it actually running on a processor.

Fairness can be implemented by preventing a task from running longer than a certain stretch of time (the quantum) without giving all other runnable tasks a chance to run as well. The scheduler effectively cycles through all runnable tasks, giving each of them a quantum of time to advance their computation. At regular time intervals, all processors receive a preemption interrupt to signal that the quantum has expired. Upon receiving this interrupt, a processor suspends the currently running task, puts it at the tail of the work queue, and then resumes the task at the head.
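The preemption handler can be sketched in the same style as TOUCH-undet (an illustrative sketch using the procedures of the figures above):

;; On a preemption interrupt: suspend the current task, requeue it at the
;; tail of the work queue, and go look for the task at the head.
(define (preempt!)
  (call-with-current-continuation
    (lambda (cont)
      (let ((task (current-task)))
        (task-continuation-set! task (lambda () (cont #f)))
        (queue-put! (work-queue) task)
        (idle)))))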

In a system with a centralized work queue, at least min(n, r) tasks are resumed every quantum, where n is the number of processors and r is the number of runnable tasks. It follows that a task will start running in no more than ⌊r′/n⌋ quantums, where r′ is the number of runnable tasks at the time the task was made runnable. If r′ does not vary much, the tasks will get an even share of the processors (roughly the power of n/r′ processor per task if r′ ≥ n).

In a system with a distributed work queue, at least one task is resumed from every work queue every quantum. A task will thus start running in no more than q quantums, where q is the length of the local work queue at the time the task was made runnable. Thus the processing power given to tasks residing on a processor is evenly distributed, but the processing power of tasks residing on different processors may be substantially different.

(It is assumed that the quantum is large enough so that the effects of contention on the work queue are negligible.)

The original Multilisp semantics [Halstead] had a scheduling policy that was fair as long as all tasks were of finite duration. The only guarantee made by the scheduler was that a runnable task would run if there were no other runnable tasks. Under the finite task assumption, this implies that all tasks will eventually run. Finiteness is a reasonable assumption for Multilisp programs, since it is common to design parallel programs by annotating terminating sequential programs with futures. In sequential programs, all expressions evaluated correspond to mandatory work that needs to be done to compute the result of the program. Any execution order for the tasks will compute the correct result as long as it respects the basic ordering imposed by the strict operations. However, there are special situations where true fairness is useful.

Programs are sometimes organized around tasks that conceptually never terminate. One example is the client-server model, where each task implements a particular service for some clients. Server tasks receive requests from the clients and send back a reply for each request serviced. Each server task is in an infinite receive-compute-respond loop. Without a fair scheduler, a set of server tasks could monopolize all the processors if they continually have requests to service; other server tasks would never get a chance to run. A multiuser Multilisp system can be viewed as an instance of this model: the clients are the users and the server tasks are the read-eval-print loops.

Another application of fairness is to support speculative computation. A computation is speculative if it is not yet known to contribute to the program's result. Speculative computation arises naturally in search problems where multiple solutions may exist but only one is needed. Several search paths can be explored in parallel, and as soon as a solution is found the search can be stopped. This form of computation, which Osborne calls multiple approach speculative computation, is known in parallel logic programming as OR-parallelism. If the likelihood of finding a solution in any given path is fairly similar, then it is reasonable to spend an equal effort searching each path. This is easily approximated by a fair scheduler which timeslices tasks from a centralized work queue.

However, the solutions are typically not distributed equally among the search paths. The paths that are likely to lead quickly to a solution should be searched more eagerly than others. Thus, a system aimed at general speculative computation should provide some finer level of control over the scheduler, such as a mechanism to assign priorities to the speculative tasks. Because there is currently no consensus as to which level of control is best, this thesis does not investigate the implementation of such priority mechanisms. Fairness of scheduling plays a minor role in this thesis; Chapter 3 shows that lazy task creation can support fairness.


Dynamic Scoping

Multilisp uses static scoping as its primary variable management discipline. Static scoping has the advantage of clarity, because the identity of a variable only depends on the program's local structure, not its runtime behavior. With the exception of global variables, a variable can only be accessed by an expression textually contained in the binding form that declares the variable.

Static scoping is not well suited for certain applications. Sometimes it is necessary to pass an argument to one or several procedures far down in the call tree, such as the default output port or the exception handler. Such arguments must either be passed in global variables or be passed as explicit arguments from each procedure to the next in the call chain. The first solution is not appropriate in a parallel system because of the possible conflict between tasks. The second solution clearly lacks modularity, because each procedure must be aware of the arguments passed from parent procedures to all its descendants.

Dynamic scoping offers an elegant solution. A dynamically scoped variable can be accessed by any computation performed during the evaluation of the body of the binding form that declares the variable. In a sense, dynamic variables are implicit parameters to all procedures. The set of bindings, the dynamic environment, is passed implicitly by each procedure to its children in the call tree. A given binding is thus only visible in the call tree that stems from the binding form, with the exception of the subtrees where the binding is shadowed by a new binding of the same variable.

There are several possible constructs to express dynamic scoping. For the sake of simplicity, two special forms are used here. The form (dyn-bind id val body) introduces a new binding of the dynamic variable id to the value val for the duration of the body. The form (dyn-ref id) returns the value of the dynamic variable id in the current dynamic environment. Note that id is not evaluated, and that lexically scoped variables and dynamic variables exist in separate namespaces. The figure below shows a typical use of dynamic scoping to implement a simple exception system. The dynamic variable EXCEPTION-HANDLER contains a single argument procedure that is called with an error message when an error is detected. The procedure catch-exceptions takes a thunk as argument and calls it in a dynamic environment where EXCEPTION-HANDLER is bound to the continuation of catch-exceptions. Thus, the call to the exception handler in raise-exception will immediately exit from catch-exceptions with the error message as its result; for example, a call of map-sqrt on a list containing a negative number returns the string "domain error".

(An obvious extension would be an assignment construct.)

(define (catch-exceptions thunk)
  (call-with-current-continuation
    (lambda (abort)
      (dyn-bind EXCEPTION-HANDLER abort (thunk)))))

(define (raise-exception msg)
  ((dyn-ref EXCEPTION-HANDLER) msg))

(define (square-root x)
  (if (negative? x)
      (raise-exception "domain error")
      (sqrt x)))

(define (map-sqrt lst)
  (catch-exceptions
    (lambda () (map square-root lst))))

Figure: Exception system based on dynamic scoping and call/cc

An implication of the above semantics is that dynamic environments are associated with continuations. All continuations carry with them the dynamic environment that was in effect when they were created (i.e., due to the evaluation of some subproblem call). When a continuation is invoked, the captured dynamic environment becomes the current dynamic environment. Dyn-bind creates a new dynamic environment for the evaluation of the body simply by adding a new binding to the current dynamic environment. This new binding remains in effect only for the duration of the body, because the continuation invoked to exit the body (normally dyn-bind's continuation, but possibly some continuation captured with call/cc outside the body) will restore the dynamic environment to the appropriate value. In implementation terms, this implies that each subproblem call must save the dynamic environment on the stack prior to the call and restore it upon return.

Because the save/restore pair is added to all subproblem calls, this may result in an unacceptably high overhead. Notice that in normal situations the dynamic environment does not actually change when a continuation is invoked: only dyn-bind's continuation and continuations captured by call/cc might be invoked from a different dynamic environment. An alternative approach is thus to put the save/restore pair only around the evaluation of dyn-bind's body and around calls to call/cc. This approach offers more efficient subproblem calls, but also has the unfortunate consequence that call/cc and dyn-bind are no longer properly tail recursive: call/cc's procedure argument and dyn-bind's body are not reductions because their continuation contains a new continuation frame. The loss of proper tail recursion for dyn-bind is probably not very troublesome (most Lisp systems implement the dynamic binding construct with similar save/restore pairs), but it is harder to justify for call/cc. The following procedure, for instance, would then run out of memory as it loops:

(define (loop) (call-with-current-continuation (lambda (k) (loop))))

To preserve call/cc's tail recursive property, call/cc can be redefined as shown in the figure below. It is assumed that the state of the dynamic environment is maintained in a global data structure accessible through the procedures current-dyn-env and current-dyn-env-set!. The implementation exploits the invariant that procedures always invoke their implicit continuation with the same dynamic environment that existed when they were called. Thus a normal return from the call to proc in call/cc invokes the captured continuation with the correct dynamic environment. An abnormal return to cont is only possible by calling the closure passed to proc. This closure explicitly restores the correct dynamic environment before invoking the captured continuation.

Parallel processing raises additional implementation issues. In order for the future construct's semantics to be as nonintrusive as possible, the dynamic environment used for the evaluation of the future's body should be the same as the one in effect when the future itself was evaluated. Consequently, the parent task must save the dynamic environment into the child task and the child task must restore this environment when it starts running. This adds an overhead to task creation, suspension and resumption.
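As a rough illustration, here is a minimal sketch (not the thesis's actual code) of how this could be arranged, using the current-dyn-env and current-dyn-env-set! procedures introduced above; make-FUTURE stands for a future-spawning procedure like the ones shown in this chapter:

(define (make-FUTURE-dyn thunk)            ; hypothetical wrapper
  (let ((env (current-dyn-env)))           ; parent captures its dynamic environment
    (make-FUTURE
      (lambda ()
        (current-dyn-env-set! env)         ; child restores it before running the body
        (thunk)))))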

Another issue is the representation of dynamic environments. A popular approach in uniprocessor Lisps is shallow binding. The environment is represented as a table of cells; each cell holds the current value of a dynamic variable. A new binding is introduced by saving the current value of the cell on a stack and assigning the new value to the cell. Upon exit from the binding construct, the previous binding is restored by popping the old value off the stack. Thus dyn-bind and dyn-ref are constant time operations. However, saving the entire dynamic environment (i.e. the operation current-dyn-env) is expensive because it implies a copy of the binding table. An alternative approach, shown in the figure below, is deep binding. The dynamic environment is represented as a stack of bindings (i.e. an association list). Dyn-bind simply adds a new binding at the head of the list and dyn-ref searches the list for the most recent binding of the variable. Unfortunately, the cost of dyn-ref is O(b), where b is the number of bindings in the environment. This may be expensive if b is large and the variables looked up are those that were bound early. (Efficiency can be improved somewhat by adding a cache to hold the value of recently accessed variables; for example, see Rozas and Miller.)



(define (call-with-current-continuation proc)
  (primitive-call-with-current-continuation
    (lambda (cont)
      (proc (let ((env (current-dyn-env)))
              (lambda (val)
                (current-dyn-env-set! env)
                (cont val)))))))

The special forms (dyn-ref id) and (dyn-bind id val body) expand into

(current-dyn-env-lookup 'id)

and

(begin
  (current-dyn-env-push! 'id val)
  (let ((result body))
    (current-dyn-env-pop!)
    result))

respectively. Definitions for deep binding:

(define (current-dyn-env-lookup id)
  (cdr (assq id (current-dyn-env))))

(define (current-dyn-env-push! id val)
  (current-dyn-env-set! (cons (cons id val) (current-dyn-env))))

(define (current-dyn-env-pop!)
  (current-dyn-env-set! (cdr (current-dyn-env))))

Figure: Implementation of dynamic scoping with tail recursive call/cc.


On the other hand, current-dyn-env only requires a single pointer copy, so the overhead for call/cc and task operations is minimal. Deep binding is adequate when dynamic variables are referenced infrequently, for example if their main purpose is to support the exception processing system. Yet another approach is to represent environments with balanced search trees (e.g. AVL trees), thus permitting O(log n) cost for dyn-bind and dyn-ref, where n is the number of variables bound in the environment, and constant cost for current-dyn-env and current-dyn-env-set!. It isn't clear which of these last two representations is most efficient in practice. The deep binding approach has been used in this work for simplicity, but the implementation strategies explained in the next chapter are equally applicable to the search tree representation.
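To make the shallow binding alternative equally concrete, here is a minimal sketch (hypothetical names, with each dynamic variable owning a mutable cell represented as a one-element vector; a parallel system would also need per-task save stacks):

(define save-stack '())                                     ; stack of saved values

(define (shallow-bind cell val thunk)
  (set! save-stack (cons (vector-ref cell 0) save-stack))   ; save old value
  (vector-set! cell 0 val)                                  ; install new value
  (let ((result (thunk)))
    (vector-set! cell 0 (car save-stack))                   ; restore old value
    (set! save-stack (cdr save-stack))
    result))

(define (shallow-ref cell) (vector-ref cell 0))             ; dyn-ref: constant time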

Continuation Semantics

Continuations also present special problems in a parallel setting. It isn't clear what the terminal continuation of a child task should be. This continuation is the one that is passed to the body of the future; in other words, what should be done with the value returned by the body? This is an important question because the approach chosen will specify the behavior of first-class continuations in the presence of futures.

Original Semantics

Several approaches have been proposed. In the original Multilisp definition (Halstead), the body's value was used to determine the placeholder created for the future and the task was simply terminated. This is the semantics implemented by the code presented earlier. (Multilisp was not designed to support first-class continuations, so it isn't surprising that the original semantics does not interact well with them.)

MultiScheme Semantics

MultiScheme adopted a subtly different model for continuations. The child task and placeholder created by a future are conceptually linked: the placeholder is called the goal of the task and the task is the placeholder's owner (the term "motivated task" was used in Miller). This linkage was introduced to permit the garbage collection of tasks.


(define (make-FUTURE thunk)
  (let ((res-ph (make-ph)))
    (let ((child (make-task
                   (lambda () (end-body (thunk)))
                   res-ph)))
      (queue-put! work-queue child)
      res-ph)))

(define (end-body result)
  (let ((res-ph (task-goal-ph current-task)))
    (if (test-and-determine! res-ph (TOUCH result))
        (idle)
        (error "placeholder previously determined"))))

Figure: MultiScheme's implementation of the future special form.

Finding the value of the future's body is seen as the task's sole reason for existence. Since the goal placeholder is the representative of this value, the owner task can safely be terminated if the placeholder is known to be unnecessary for the rest of the computation.

The implementation of this semantics is given in the figure above. Note that the procedure make-task now takes two arguments: the continuation and the goal placeholder. Also note that end-body takes only one argument, because the placeholder to determine implicitly comes from the task executing end-body (i.e. the current task). The goal placeholder is now embedded in the child task instead of the terminal continuation, as is done in the original semantics. This is an important distinction because a task can replace its current continuation by a completely different one by calling a continuation created by call/cc; the goal placeholder, however, never changes. Interestingly, the original and MultiScheme implementations are equivalent in the absence of call/cc. This is because in such a case the only task that can execute a given continuation is the task created with that continuation. Taking the placeholder to determine from the continuation (as in the original semantics) or from the task object (as in MultiScheme) will give the same placeholder because of the one-to-one correspondence between continuations and tasks.

The figure below gives an example where the two implementations differ. Here two tasks, T1 and T2, are involved in addition to the root task. The corresponding placeholders are Ph1 and Ph2. The call to call/cc binds k to T1's continuation; thus k corresponds to a call to end-body. With the original implementation of futures, k contains an implicit reference to Ph1.


(define x
  (TOUCH (FUTURE
           (call-with-current-continuation
             (lambda (k)
               (+ 1 (TOUCH (FUTURE (k 0)))))))))

Figure: A sample use of futures and call/cc.

When T2 calls k, Ph1 gets determined to 0. Following this, the root task can return from the first TOUCH and consequently x gets bound to 0. Note that T1 is suspended indefinitely on the second TOUCH because Ph2 never gets determined.

With MultiScheme's implementation of futures, a call to k determines the goal placeholder of the current task. Since it is T2 that is calling k, Ph2 gets determined to 0. T1 then proceeds from the second TOUCH, adds 1, and calls k with 1 (the lambda-expression's body implicitly calls k). This time it is T1 that is calling k, so Ph1 gets determined to 1. Finally, the root task can return from the first TOUCH, binding x to 1.

Katz-Weise Continuations

A nice feature of futures is that in typical purely functional programs they can be added around any expression without changing the result of the program. In other words, futures are equivalent to an identity operator when only the result of the computation is considered; futures only affect the order of evaluation. This suggests an attractive mode of programming: first write a correct functional program without any futures, and then explore various placements of futures to turn the program into an efficient parallel one.

Unfortunately, the original and MultiScheme semantics for continuations do not permit this for all purely functional programs, because inserting futures in a program that uses call/cc can alter the result computed. For MultiScheme this should be clear from the previous example. For the original semantics, all is fine as long as the future body's continuation is invoked at most once (including the normal return from the body). To explain what happens when the continuation is called multiple times, consider the contrived expression in the figure below. In this expression, the continuation created by call/cc is called exactly twice. Assume for the moment that the TOUCH and FUTURE operations are not present: y will get bound to the continuation created by call/cc, the continuation that takes a value and binds y to it.


(define x
  (let ((y (TOUCH (FUTURE
                    (call-with-current-continuation
                      (lambda (k) k))))))
    (if (number? y)
        y
        (y 1))))

Figure: A future body's continuation called multiple times.

Since at this point y is not a number, the continuation is restarted with 1, thus binding y to 1. Since y is now a number, it is returned and x gets defined to 1.

When TOUCH and FUTURE are present, an undetermined placeholder will be created and a child task created to evaluate the call/cc. The continuation captured here (i.e. k) corresponds to the task's continuation, that is, a call to end-body. The placeholder will get determined to this continuation and, through the TOUCH, y gets bound to it. However, when this continuation is called, an attempt is made to determine the placeholder a second time (this time with 1) and then to terminate the current task. This is clearly an error, because a placeholder cannot represent more than one value and deadlock would occur since all tasks would have terminated.

An interesting implementation of futures that solves this problem was proposed by Katz and Weise (Katz and Weise). The idea is to preserve the link between the future body's continuation and the future's continuation. On the first return to the body's continuation, the placeholder gets determined and the task is terminated, as in the original semantics. However, on every other return, the body's continuation acts exactly like the future's continuation, as if the future had never existed.

Katz-Weise Continuations with Legitimacy

Unfortunately, this approach does not solve all interaction problems between first-class continuations and futures. It is still possible to write purely functional programs that do not return the same value when futures are added. Consider the program in the figure below, which is a simplified form of exception processing. If the future special form is not present, a value of 1 is returned, because the call (abort 1) is done first, bypassing the body of the let and the binding of dummy. With the future, a child task is created to evaluate (abort 1) and the parent task implicitly returns 2 to abort's continuation (by returning normally from the call/cc body).


(call-with-current-continuation
  (lambda (abort)
    (let ((dummy (FUTURE (abort 1))))
      2)))

Figure: Exception processing with futures.

Each task exits the call/cc with its own belief of the result: the parent task with 2 and the child task with 1. In general, this means that multiple tasks may return to the program's root continuation. One of these tasks has the right result (i.e. the same result as a sequential version of the program), but which task? Choosing the first task to arrive at the program's root continuation is not a valid technique because of the race condition involved.

The solution proposed in (Katz and Weise) introduces the concept of legitimacy. A particular sequence of evaluation steps (a thread) is legitimate if and only if it is executed by the sequential version of the program. Legitimacy is thus a characteristic that depends on the control flow of the program. It can be derived from the fact that the root thread is legitimate and from the causality rules inherent in the sequential subset of the language. In particular, if a thread is legitimate and it returns from expr with the value v, then the thread corresponding to the execution of expr's continuation with the value v is also legitimate. This rule naturally extends to the future special form by attaching legitimacy to tasks: after a child task is spawned by (FUTURE expr), the parent task is legitimate if and only if the corresponding placeholder gets determined by a legitimate task. The parent task's legitimacy is thus equal to the legitimacy of the task that gets to determine the placeholder. Note that the child task inherits the legitimacy of its parent at the moment of the task spawn. As an example, consider the following program, which involves three tasks: T1, T2 and the root task Troot.

(let ((x (FUTURE expr1))
      (y (FUTURE expr2)))
  expr3)

After spawning the tasks T1 and T2, the root task will evaluate expr3. The root task is legitimate if and only if the first task to return from expr2 is legitimate. This fact can be expressed by the constraint

    Legit(Troot) = Legit(Det(Ph2))

That is, the legitimacy of the root task is equal to the legitimacy of the task that determines the placeholder created for T2. Similarly, task T2 is legitimate if and only if the first task to return from expr1 is legitimate:

    Legit(T2) = Legit(Det(Ph1))

In the event that it is T2 that returns first from expr2 (i.e. Det(Ph2) = T2), the root task's legitimacy will become equal to the legitimacy of the first task returning from expr1. That is,

    Legit(Troot) = Legit(T2) = Legit(Det(Ph1))

This illustrates that a task's legitimacy at a given point in time is represented by a chain that models the legitimacy dependencies inferred up to that point. Initially the links between tasks are unknown; as tasks terminate and determine placeholders, the links get filled in. The gaps in the chain correspond to future bodies that have not yet returned normally. Abnormal exits from the body of a future can create independent chains that never get connected to the legitimate chain. Note that there is at all times exactly one legitimate task in the system. All other tasks can be viewed as speculative tasks, because there is no guarantee that they actually contribute to the computation at hand. At the moment of its death, the legitimate task will turn one of the speculative tasks into the legitimate task.

Implementing Legitimacy

An implementation of the Katz-Weise semantics with legitimacy is shown in the figure below. The legitimacy chain is conveniently implemented with placeholders. Each task has a legitimacy flag, represented by a placeholder. The root task is initially legitimate, so its legitimacy flag is a non-placeholder. When a child task is created, its legitimacy flag is taken from the parent task. Since the parent task is going to invoke the future's continuation, its legitimacy flag is replaced by a newly created undetermined placeholder, leg-ph, which represents the as of yet unknown legitimacy of the first task to return from the future's body (which might not be the child). Leg-ph must also be embedded in the body's continuation. When this continuation is returned to (which corresponds to a call to end-body), the result placeholder gets determined and the legitimacy chain is extended by unifying leg-ph with the current task's legitimacy flag.


(define (make-FUTURE thunk)
  (call-with-current-continuation
    (lambda (k)
      (let ((res-ph (make-ph))
            (leg-ph (make-ph))
            (parent current-task))
        (let ((child (make-task
                       (lambda () (end-body k res-ph leg-ph (thunk)))
                       (task-legitimacy parent))))
          (task-legitimacy-set! parent leg-ph)
          (queue-put! work-queue child)
          res-ph)))))

(define (end-body k res-ph leg-ph result)
  (if (test-and-determine! res-ph (TOUCH result))
      (begin
        (determine! leg-ph (task-legitimacy current-task))
        (idle))
      (k result)))

(define (speculation-barrier)
  (TOUCH (task-legitimacy current-task)))

Figure: The Katz-Weise implementation of futures.

Speculation Barriers

A straightforward use of legitimacy is to prevent speculative tasks from terminating the program, only allowing the legitimate task to do this. This speculation barrier can be accomplished simply by touching the task's legitimacy flag at the program's terminal continuation. Conceptually, this touch walks down as far as it can in the task's legitimacy chain and blocks until the task is known to be legitimate. Only the legitimate task is allowed to proceed beyond the touch; the other tasks are suspended indefinitely.

Using a speculation barrier at the very tail of a program guarantees that the correct result will be returned, but it does little to prevent speculative tasks from consuming processing resources. It is possible to add speculation barriers at well chosen places in the program to limit the extent of speculative parallelism. Even though this reduces the amount of parallelism in the program, it may yield a more efficient program because a higher proportion of the time will be spent doing mandatory work. A case where this might be useful is given in the figure below. For simplicity, it is assumed that map processes the values from head to tail (the Scheme language does not impose a particular ordering).


(define (map-sqrt lst)
  (call-with-current-continuation
    (lambda (abort)
      (map (lambda (x)
             (FUTURE
               (if (negative? x) (abort x) (sqrt x))))
           lst))))

(define (map-sqrt-with-barrier lst)
  (let ((result (map-sqrt lst)))
    (speculation-barrier)
    result))

Figure: An application of speculation barriers.

For each value in the list, map-sqrt spawns a task to compute the square root of the value, and returns a list of the results. In a sequential version of the program (i.e. if the future is absent), the first negative value is returned by map-sqrt. In the parallel version, the root task and all tasks processing negative values will return from map-sqrt. Map-sqrt-with-barrier obtains the same result as the sequential version by using a speculation barrier after the call to map-sqrt: only the task processing the first negative value will be legitimate and will cross the barrier. Since this task bypasses the determining of its result placeholder, its parent's legitimacy flag will remain undetermined forever. All the tasks spawned by the parent and its children after the legitimate task will have undetermined legitimacy flags; consequently, these tasks will get suspended when they reach the barrier.

The Cost of Supporting Legitimacy

The cost of supporting legitimacy is an important issue. Speculation barriers are certainly useful to express some programs, but many programs have no need for them, in particular those that only contain mandatory tasks. Consequently, it is important to evaluate the cost of supporting legitimacy in both contexts.

For programs which contain speculation barriers, one concern is the space occupied by tasks suspended at barriers. A careful study of the Katz-Weise implementation above reveals that these tasks are only retained if they might become legitimate. These tasks are suspended on leg-ph, which is only accessible through the child's terminal continuation. In the previous exception-processing example, this continuation was discarded when (abort 1) was called by the child. Since leg-ph is unreachable, it will eventually get garbage collected along with the tasks suspended on it. On the other hand, if the child's continuation had been saved prior to the call to abort (by calling call/cc and saving the continuation away), it would not be possible to garbage collect the suspended tasks because leg-ph would still be reachable. This is clearly the correct behavior, since any number of the suspended tasks could still become legitimate, for example if the saved continuation is invoked by the legitimate task.

Two other costs are legitimacy testing and propagation. The cost of legitimacy propagation is particularly important because it is paid even by programs that do not use legitimacy, or that use it infrequently. In the implementation above, the current task's legitimacy placeholder is propagated directly to the next task in the chain (the call to determine! in end-body). Legitimacy propagation is thus constant cost, but legitimacy testing can be expensive: a program which spawns n mandatory tasks, thus creating a legitimacy chain with n placeholders, will require O(n) time to test legitimacy at the program's termination (the task spawning strategy, whether it is a sequential loop or DAC loop, is irrelevant).

Another approach is to touch the current task's legitimacy before propagating it to the next task. In other words, the task waits to be legitimate before marking the next task as legitimate. Legitimacy testing is then constant cost, but legitimacy propagation is expensive for two reasons: it is inherently sequential and it produces frequent task switches. Because of the touch, a particular legitimacy placeholder in the chain can only be determined after the previous legitimacy placeholder has been determined. This implies that the last task will at best be marked as legitimate on the order of n steps after the first task. Also, any task terminating before its predecessor in the chain will have to be suspended and eventually resumed just to set the next legitimacy placeholder.

A better strategy is to shrink the legitimacy chain as the computation progresses. All the links in the chain will have to be followed, but this can be done in parallel. The method uses a collapse operation that walks a chain of placeholders and returns its tail element (i.e. either an undetermined placeholder or a non-placeholder). This operation is added to end-body so that the current task propagates its collapsed legitimacy chain to the next task. Nothing is gained if a task terminates before its predecessor, but if it terminates afterwards, one or more links in the chain will get removed for the benefit of the successor tasks. But how frequently will it be possible to collapse the chain?
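Before turning to that question, here is a minimal sketch of such a collapse operation, assuming placeholder?, ph-determined? and ph-value are the (hypothetical) accessors of the placeholder representation:

(define (collapse x)
  ;; Walk a chain of determined placeholders and return its tail element:
  ;; either an undetermined placeholder or a non-placeholder.
  (if (and (placeholder? x) (ph-determined? x))
      (collapse (ph-value x))
      x))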

Clearly, the order of task termination has a direct influence on the collapsing of the chain.


(define (fj1 n)
  (if (= n 0)
      1
      (let ((l (FUTURE (fj1 (- n 1))))
            (r (FUTURE (fj1 (- n 1)))))
        (+ (TOUCH l) (TOUCH r)))))

(define (fj2 n)
  (if (= n 0)
      1
      (let ((l (FUTURE (fj2 (- n 1))))
            (r (fj2 (- n 1))))
        (+ (TOUCH l) r))))

[Diagram: the spawning trees of fj1 and fj2, with nodes numbered according to a postfix walk and arrows representing the links of the legitimacy chain.]

Figure: Fork-join algorithms and their legitimacy chain in the absence of chain collapsing.

An important case to consider is fork-join parallel algorithms, which impose a strict termination order on tasks. In fork-join algorithms, a parent task P sequentially spawns a certain number of children C_1 to C_k and later touches the results of the children before terminating. In the absence of collapsing, the legitimacy chain corresponds to a postfix walk of the spawning tree. The figure above illustrates this for two fork-join procedures, fj1 and fj2. Each node corresponds to a task in the spawning tree. The nodes are numbered according to a postfix walk of the tree (the left child is spawned first) and the arrows represent links of the legitimacy chain (e.g. task 2 is legitimate if task 1 is legitimate). Note that the link coming out of task i is only filled in when task i terminates. Due to the fork-join nature of the program, all tasks in the spawning tree rooted at task i will have terminated when task i terminates. This implies that when task i terminates, all links of the legitimacy chain enclosed in task i's spawning tree are known and can be collapsed. In the worst case this collapsing will stop at L_i, the leftmost task in task i's spawning tree. In other words, task i will set task i+1's legitimacy link to L_i. But, as shown in the next figure, if i = C_j (i.e. i is the j-th child of P) then either i+1 = P or i+1 = L_C(j+1). It follows that the collapsing of the links in the legitimacy chain between P and L_P takes at most k sequential steps after all children are done. Given that the spawning of the children by P takes k time anyway, the cost of propagating legitimacy does not change the complexity of the program. There is only a constant overhead per task created; this overhead is rather low, since it amounts to following one link of the legitimacy chain per task spawned.


[Diagram: the general case, showing a parent task P, its children C_1 ... C_k, the spawning subtree below each child, and the links of the legitimacy chain between them.]

Figure: General case of legitimacy chain collapsing for fork-join algorithms.

This result holds for any fork-join algorithm, regardless of how well balanced the spawning tree is, including the fork-join DAC procedures fj1 and fj2 above as well as the linear fork-join procedure pmap presented earlier.

Benchmark Programs

In order to guide the design process and provide a basis for evaluating and comparing the performance of the implementation strategies, it is important to identify the salient characteristics of the target applications. Following common practice, a set of benchmark programs was selected as representative of typical applications of Multilisp. These benchmark programs are used throughout the thesis for various evaluation purposes.

The biggest flaw of these benchmarks is their small size. Real applications will probably be much longer and more complex; characteristics such as locality of reference, paging, task granularity and available parallelism may be substantially different. Small programs are no substitute for the real thing; they can only serve as rough models of real applications. The main advantage of small programs is that they usually stress a well defined part of the system, so the measurements can be interpreted more readily.

Both sequential and parallel benchmarks were used. The sequential benchmarks are mostly taken from the Gabriel suite (Gabriel), which has traditionally been used to evaluate implementations of Lisp. To these benchmarks were added four sequential benchmarks: compiler (the Gambit compiler), conform (a type checker), earley (a parser) and peval (a partial evaluator). These are sizeable programs that achieve some useful purpose; compiler, the largest, contains several thousand lines of Scheme code. Note that for some measurements it was not possible to run compiler due to lack of memory.

There are twelve parallel benchmarks. Half of these were originally written in Mul-T by Eric Mohr as part of his PhD thesis work (Mohr). To these were added a few classical parallel programs (matrix multiplication, parallel prefix and parallel reduction) and programs based on pipeline parallelism (polynomial multiplication and quicksort). A general description of the parallel benchmarks is given next. None of the benchmarks require the Katz-Weise continuation semantics or legitimacy; a later chapter evaluates their cost in another way. Appendix A contains some additional details, including the source code and compilation options. Appendix B contains execution profiles for the benchmarks; these indicate the activity of the processors as a function of time, thus allowing a better visualization of the programs' behavior.

abisort

This program sorts n integers using the adaptive bitonic sort algorithm (Bilardi and Nicolau). This algorithm is optimal in the sense that on the PRAM-EREW (Parallel Random Access Machine with Exclusive Read, Exclusive Write memory) theoretical model it runs in O(n log n / p) time, where p is the number of processors and p is at most on the order of n/(log n · log log n). To achieve this performance, abisort stores the sequence of elements in a bitonic tree, which is a full binary tree with the property that many elements can be logically exchanged by a small number of pointer exchanges. To sort a tree, both subtrees are first sorted recursively in parallel and then they are merged. The advantage of this algorithm over mergesort is that the merging of bitonic trees can be done in parallel. Both the recursive sorting phase and the merging phase are based on parallel fork-join DAC algorithms. Abisort puts high demands on the memory interconnect because it frequently references and mutates the shared bitonic tree data structure.

allpairs

This program computes the shortest paths between all pairs of n nodes using a parallel version of Floyd's algorithm. The input is a square distance matrix D, where D_ij is the length of the edge between nodes i and j. The algorithm goes through n steps, each of which updates D in place based on its current state. At the beginning of the k-th step, D_ij represents the length of the shortest path from i to j that does not go through any node greater than or equal to k. The update operation consists of replacing, for each possible i and j, D_ij by D_ik + D_kj if that value is smaller. Since D_kk is always zero, neither row k nor column k of D will change during the k-th step; consequently, all update operations of a given step can be done concurrently. Parallelizing both the loop on i and the loop on j would have resulted in an unnecessarily fine task granularity, so only the outermost of the two loops was done in parallel, by a parallel fork-join DAC loop. The computation thus consists of a sequence of n steps, each of which contains n tasks. The execution profile for this program looks like a comb, where each tooth corresponds to one step of the outer loop. Allpairs has the coarsest task granularity and the highest run time of all the benchmarks.
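As an illustration, a single step k of this computation might look like the following sketch (par-for, matrix-ref and matrix-set! are hypothetical helpers; par-for is assumed to iterate over i with a parallel fork-join DAC loop):

(define (floyd-step! D n k)
  (par-for 0 n                            ; parallel loop over the rows
    (lambda (i)
      (do ((j 0 (+ j 1)))                 ; sequential loop over the columns
          ((= j n))
        (let ((via-k (+ (matrix-ref D i k) (matrix-ref D k j))))
          (if (< via-k (matrix-ref D i j))
              (matrix-set! D i j via-k)))))))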

fib

This program computes F_n, the n-th Fibonacci number, using the straightforward but obviously inefficient doubly recursive algorithm. It is a very compute intensive benchmark which does not reference any heap allocated data. Fib is interesting to examine because it can serve as a model for fine grain fork-join DAC algorithms; fib has the finest task granularity of all the benchmarks. The spawning tree is fairly bushy but is not perfectly balanced. The imbalance follows the golden ratio: each subtree has roughly 62% more tasks on the fat branch than on the other branch.

mm

This program multiplies two square matrices of integers. The standard algorithm with three nested loops is used. All these loops can be parallelized, but only the two outermost loops were turned into parallel fork-join DAC loops. The program thus involves fairly coarse grain tasks, each of which is in charge of computing one of the entries in the result matrix.

mst

This program computes the minimum spanning tree of an n node graph. A parallel version of Prim's algorithm is used. The input is a symmetric distance matrix D, where D_ij is the length of the edge between node i and node j. The algorithm constructs the minimum spanning tree incrementally in n - 1 steps. It starts with a set of nodes containing a single node and, at each step, it adds to this set the node not yet in the set that is closest to one of the nodes in the set. In order to find the closest node quickly, each node not yet in the set remembers the shortest edge that connects it to the set. This shortest connecting edge must be recomputed when a new node is added to the set. The k-th step is a loop over n - k nodes that first recomputes each node's shortest connecting edge, based on the last node added to the set, and then finds the shortest of these edges. Mst performs this loop in parallel using a parallel fork-join DAC loop. Note that the degree of parallelism decreases with time (this is clearly visible in the execution profile); the k-th step involves n - k tasks.

poly

This program computes the square of a polynomial in x with integer coefficients. The resulting polynomial is then evaluated for a certain value of x; this ensures that the computation of all coefficients has finished. Polynomials are represented as a list of coefficients. The product of two polynomials P and Q, with coefficients P_0 ... P_n and Q_0 ... Q_m, is obtained by first computing the product of P and Q_1 ... Q_m (Q with its first coefficient removed) and then adding this result, shifted by one position, to P scaled by Q_0. The following diagram shows the unfolded recursion for computing R = P·Q for small n and m.

[Diagram: the unfolded recursion; each row performs the multiply-and-add steps P_i·Q_j and feeds its list of coefficients to the next row, producing the coefficients R_0 ... R_(n+m).]

This algorithm is coded with two loops. The inner loop does the operations corresponding to a row in the above diagram; it combines the scaling and summing operations in a single multiply-and-add step. The result of the inner loop is the list of coefficients to be added by the next row. Poly exploits the parallelism available in the inner loop in a way similar to the procedure pmap: the multiply-and-add step corresponding to P_i·Q_j is done after spawning a task to process the rest of row j. Consequently, there is one task per multiply-and-add step.

Moreover, the processing of the rows is pipelined: the processing of row j+1 can start before the processing of row j is finished. An alternative algorithm is to spawn a task for each coefficient of R; task k computes

    R_k = sum for j = max(0, k-n) to min(k, m) of P_(k-j) · Q_j

Because it spawns fewer tasks (O(n+m) instead of O(n·m)), this algorithm is probably more efficient. However, the first algorithm was chosen because it is more representative of applications with fine grain pipeline parallelism.
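Since pmap itself is defined in an earlier part of the thesis, here is a minimal sketch of the pipelined spawning style the text refers to (a reconstruction, not the thesis's exact code); the head of the result list is available to the consumer before the rest has been computed:

(define (pmap f lst)
  (if (null? lst)
      '()
      (let ((rest (FUTURE (pmap f (cdr lst)))))  ; spawn processing of the rest
        (cons (f (car lst)) rest))))             ; then do the local step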

qsort

This program sorts a list of randomly ordered integers using a parallel version of the Quicksort algorithm. The list's head element is used to construct two sublists with the remaining elements: a list of the smaller values and a list of the not smaller values. The two partitions are then sorted in parallel. The partitioning procedure uses a pipeline parallelism technique similar to the procedure pmap: the beginning of the partition is available to the continuation before the rest of the list has been partitioned. This means that the sorting of a partition can start as soon as the first element of the partition is generated. Although there are more efficient parallel sorting algorithms (e.g. abisort), qsort is interesting to consider because it combines pipeline parallelism and DAC parallelism.
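A minimal sketch of such a pipelined partitioning procedure follows (a reconstruction under stated assumptions, not the benchmark's actual code); the cdr of each produced pair is a placeholder that the consumer touches when it walks the list:

(define (filter-pipelined keep? lst)
  (cond ((null? lst) '())
        ((keep? (car lst))
         ;; produce this element now, fill in the rest of the partition later
         (cons (car lst) (FUTURE (filter-pipelined keep? (cdr lst)))))
        (else (filter-pipelined keep? (cdr lst)))))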

queens

This program computes the number of solutions to the n-queens problem. It is based on a recursive procedure which, given a placement of k queens on the first k rows, computes the number of legal ways the remaining n - k queens can be placed (a queen must not be on the same row, column or diagonal as another queen). For each valid position of a queen on the next row, the procedure spawns a task that calls the procedure recursively with the new placement. The number of solutions in each branch is finally summed up. Bit vectors are used to efficiently encode the current placement of queens; as a consequence, queens does not access any heap allocated data structure. The call tree is not well balanced: most branches of the search tree lead to dead ends quickly. Queens is a good model for combinatorial search problems, such as the traveling salesman problem and the searching of game trees.
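The following sketch reconstructs the general bit-vector technique (not the benchmark's actual code; bitwise operators and arithmetic-shift as in Gambit/SRFI-60 are assumed): cols, d1 and d2 are bit masks of the columns and diagonals attacked by the queens already placed, and a task is spawned for each valid position on the current row.

(define (nqueens n)
  (let ((full (- (arithmetic-shift 1 n) 1)))       ; n ones
    (let try ((cols 0) (d1 0) (d2 0))
      (if (= cols full)
          1                                        ; all n queens placed
          (let loop ((free (bitwise-and full
                                        (bitwise-not (bitwise-ior cols d1 d2))))
                     (counts '()))
            (if (= free 0)
                (apply + (map (lambda (p) (TOUCH p)) counts))
                (let ((bit (bitwise-and free (- free))))  ; lowest free position
                  (loop (bitwise-and free (bitwise-not bit))
                        (cons (FUTURE
                                (try (bitwise-ior cols bit)
                                     (arithmetic-shift (bitwise-ior d1 bit) 1)
                                     (arithmetic-shift (bitwise-ior d2 bit) -1)))
                              counts)))))))))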


rantree

This program models the traversal of a random binary tree. The branching factor is such that the subnodes of a node are uniformly distributed in the left and right branches, and path length from the root roughly follows a normal curve distribution. Like queens, rantree uses fork-join DAC parallelism; it does not access any heap allocated data and the call tree is not well balanced.

scan

This program computes the parallel prefix sum of a vector of integers. The vector is modified in place: a given element is replaced by the sum of itself and all preceding elements in the vector. Scan is based on the odd-even parallel prefix algorithm, illustrated by the following diagram.

[Diagram: Parallel Prefix Sum; arrows show each odd-indexed element being summed with its predecessor, the recursive application to the odd-indexed subvector, and each even-indexed element being summed with its predecessor.]

The first step is to sum every element at an odd index with its immediate predecessor. The parallel prefix algorithm is then applied recursively to the subvector consisting of the elements with an odd index. Finally, every element with an even index is summed with the preceding element, if it exists. When the recursion is unfolded, this algorithm consists of two passes over the vector using tree-like reference patterns. In the Multilisp encoding, the first pass is performed by the combining phase of a parallel fork-join DAC loop, whereas the second pass is performed by the dividing phase of a second parallel fork-join DAC loop. These two passes are clearly visible on the execution profile.


sum

This program computes the reduction using + of integers stored in a vector. A parallel fork-join DAC algorithm is used: the vector is logically subdivided in two, both halves are then processed recursively in parallel, and finally the two resulting sums are added. Sum is the finest grain program that accesses heap allocated data. It serves as a model for fine grain data parallel computations, such as the reduction of a set of values or the mapping of a function on a set of values.
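The heart of the benchmark might look like this sketch (the vector and index arguments are assumptions; this is not the benchmark's actual code):

(define (psum v lo hi)
  (if (= lo hi)
      (vector-ref v lo)
      (let* ((mid (quotient (+ lo hi) 2))
             (left (FUTURE (psum v lo mid)))   ; first half in parallel
             (right (psum v (+ mid 1) hi)))    ; second half locally
        (+ (TOUCH left) right))))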

tridiag

This program solves a tridiagonal system of equations. The computation proceeds in two sequential phases: the reduction of the system by the method of cyclic reduction (Hockney and Jesshope), followed by backsubstitution. Cyclic reduction takes a tridiagonal system of order n, i.e. n equations over the variables x_0 to x_(n-1) where n = 2^k - 1, and produces a reduced tridiagonal system of order 2^(k-1) - 1. For each odd numbered equation i, the equations i-1, i and i+1 are combined in such a way as to eliminate the variables x_(i-1) and x_(i+1); the resulting equation only contains the variables x_(i-2), x_i and x_(i+2), as shown here:

    Tridiagonal system                              Reduced system

    B_0·x_0 + C_0·x_1                 = Y_0
    A_1·x_0 + B_1·x_1 + C_1·x_2       = Y_1         B_1'·x_1 + C_1'·x_3            = Y_1'
    A_2·x_1 + B_2·x_2 + C_2·x_3       = Y_2
    A_3·x_2 + B_3·x_3 + C_3·x_4       = Y_3         A_3'·x_1 + B_3'·x_3 + C_3'·x_5 = Y_3'
    ...                                             ...
    A_(n-1)·x_(n-2) + B_(n-1)·x_(n-1) = Y_(n-1)     A_(n-2)'·x_(n-4) + B_(n-2)'·x_(n-2) = Y_(n-2)'

The reduction process is applied to the reduced system until a single equation of the form b·x_((n-1)/2) = y is obtained (this takes k - 1 reductions). Note that because equation i will not be needed later, it can be replaced by the new equation; in other words, the reductions produce an equivalent set of n equations. The solution for x_((n-1)/2) is then backsubstituted to find the values of the two quarter-point variables, and so on recursively; after k - 1 backsubstitutions the value of all variables is obtained.


The backsubstitution is implemented with a single tree-like DAC method. The reductions could be directly parallelized by performing a sequence of k - 1 parallel fork-join DAC loops, but tridiag uses a clever tree-like method that has fewer synchronization constraints.

The Performance of ETC

The main problem with ETC is the high cost of manipulating heavyweight tasks. This section evaluates the best performance that can be expected of ETC for typical programs.

The total work performed by a Multilisp program when run on an n processor machine (i.e. the product of the run time and n) is

    T_total(n) = T_seq · O_expose · O_exploit(n)

T_seq, O_expose and O_exploit(n) all depend on the program. T_seq corresponds to the run time of a sequential version of the program (the parallel program with futures and touches removed). The overhead of parallelism is split into two components. O_expose represents the overhead of exposing the parallelism to the system; it reflects the extra work performed by the futures and touches in the program with respect to the sequential version. The product T_seq · O_expose is thus the run time of the parallel program on one processor, i.e. T_par. The extra work is the sum of the costs of each future and touch executed by the program:

    O_expose = 1 + ( sum(i = 1 to N_future) T_future_i + sum(i = 1 to N_touch) T_touch_i ) / T_seq

N_future and N_touch are respectively the number of futures and touches evaluated by the program. T_future_i and T_touch_i are respectively the cost of the i-th future and touch operations when only one processor is being used. In general, the costs of these operations are not constant because they depend on several factors, including the task scheduling order (which might vary from one run to the next), the compiler's ability to generate special case code for the operation given its particular location in the program, and the complexity of the task to be created, suspended or resumed.

(Overheads are expressed as multipliers: an overhead of x indicates that the amount of work, or other measure, is larger by a factor of x, so an overhead below 1 indicates a decrease. The term "an overhead of x%" is used to denote small overheads; it means an overhead of 1 + x/100.)


For evaluating best case performance, it is useful to define a minimum cost for futures and touches: T_future_min and T_touch_min respectively. This leads to the following lower bound on O_expose, expressed as a function of these minimum costs and the program's granularity:

    O_expose >= 1 + ( N_future · T_future_min + N_touch · T_touch_min ) / T_seq

G is a measure of the program's granularity: it is the average amount of computation performed by each task, G = T_seq / N_future. When N_touch = N_future, the bound simplifies to 1 + (T_future_min + T_touch_min) / G.
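As a purely illustrative example (these numbers are hypothetical, not measurements from the thesis): for a program with N_touch = N_future and a granularity of G = 1000 μsec, taking T_future_min = T_touch_min = 50 μsec gives

    O_expose >= 1 + (50 + 50) / 1000 = 1.1

that is, an overhead of at least 10%.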

The second part of the parallelism overhead, O_exploit(n), indicates how well the program's parallelism is exploited by the system. It corresponds to the additional work performed when running the parallel program on an n processor machine. O_exploit(n) contains the following costs not present in O_expose: memory interconnect contention and processor starvation (i.e. lack of tasks to run). Processor starvation is dependent both on the program's degree of parallelism and on the scheduler's speed at assigning runnable tasks to idle processors. In addition, O_exploit(n) reflects the variation in scheduling order, which might cause an increase or decrease in the number of tasks suspended and resumed. By definition, O_exploit(1) = 1.

In ETC, T_future_min is relatively high. If it is assumed that all tasks created eventually run and terminate (true of programs with only mandatory tasks, i.e. those that perform all the work of their sequential counterpart, which is the case for all the parallel benchmarks), then T_future_min is the cost of creating, starting and terminating a heavyweight task. The bare minimum work caused by the evaluation of a future corresponds to the following sequence:

1. Creating a closure for the future's body.

In make-FUTURE:

2. Creating the result placeholder (and its associated lock and waiting queue).
3. Creating the child's initial continuation.
4. Creating the child task object.
5. Locking the work queue.
6. Enqueuing the child on the work queue.
7. Unlocking the work queue.


In idle:

8. Locking the work queue.
9. Dequeuing the child from the work queue.
10. Unlocking the work queue.
11. Restoring the child's continuation.

In determine:

12. Locking the result placeholder.
13. Setting the placeholder's value and determined flag.
14. Checking for suspended tasks to reactivate.
15. Unlocking the placeholder.

This sequence does not include the operations for dynamic scoping, Katz-Weise continuation semantics and legitimacy. A few tricks can be used to improve the efficiency of this sequence. The heap allocations of steps 1 through 4 can be combined to reduce the cost of checking for heap overflow; in fact, nothing prevents the closure, placeholder, task object and initial continuation from being the same physical object. This reduces the effectiveness of garbage collection (all objects are retained for as long as any of them is reachable) but it does lessen the object formatting overhead. The use of local work queues also permits some optimization of the locking and unlocking of the work queue. To simplify step 13 and the touch operation, a special value can be assigned to the placeholder's value slot to indicate that it is undetermined.

Even with all these optimizations, the sequence and its associated control flow will translate into a moderate number of machine instructions. The performance of previous implementations of ETC seems to confirm this lower bound. The Mul-T system was carefully designed to minimize the cost of ETC (Kranz et al.); even so, when run on an Encore Multimax, Mul-T requires on the order of a hundred machine instructions to implement the sequence (the actual cost depends on the number of closed variables, their location, etc.). Other compiler based systems require even more instructions: Portable Standard Lisp on the GP1000 (Swanson et al.) and QLisp on an Alliant FX (Goldman and Gabriel) both take several times as many.

With this lower bound on T_future_min, it is possible to get a lower bound on O_expose from the value of G. The left part of the table below gives the values of G, T_seq, N_future and N_touch measured for the benchmark programs when run on the GP1000 with a single processor. The benchmarks have been ordered by increasing granularity.


Table: Characteristics of the parallel benchmark programs running on the GP1000. The left part gives, for each program, the granularity G (in μsec), T_seq, N_future and N_touch, measured with a single processor; the right part gives the lower bound on O_expose for several values of T_future_min (in μsec). The programs, in order of increasing granularity, are: fib, sum, qsort, scan, queens, rantree, abisort, poly, mst, tridiag, mm and allpairs. [The numeric entries of the table were not recoverable.]

Note that the number of futures is equal to the number of touches for all benchmarks based on fork-join parallelism (all benchmarks except qsort and poly). The right part of the table gives the lower bound on O_expose computed from G and various values of T_future_min. According to this table, an optimized version of ETC will have an overhead that spans a range from essentially nonexistent to fairly sizeable: as the granularity decreases, the overhead increases, becoming substantial for the finest grain programs. This overhead is moreover a conservative estimate; Mul-T's implementation of ETC gives a higher measured value of O_expose for fib (Mohr). Whether this is an acceptable overhead or not for typical programs is of course a subjective matter. However, it is clear that a high overhead for fine grain programs will have an impact on the style of programming adopted by users.

There will be a high incentive to design programs with coarse grain parallelism, even if there exists a natural fine grain solution. Frequently it is possible to manually transform a fine grain program into a coarser grain program by grouping several small tasks into a single one that executes them sequentially; this is akin to unrolling loops by hand in sequential languages to reduce the loop management overhead. This type of transformation has several drawbacks. If the task grouping is artificial, the program becomes more complex and harder to maintain.


Original fine grain version:

(define (fib n)
  (if (< n 2)
      n
      (let ((x (FUTURE (fib (- n 1))))
            (y (fib (- n 2))))
        (+ (TOUCH x) y))))

Variant with the recursion unrolled once:

(define (fib n)
  (if (< n 2)
      n
      (+ (fib2 (- n 1)) (fib2 (- n 2)))))

(define (fib2 n)
  (if (< n 2)
      n
      (let ((x (FUTURE (fib (- n 1))))
            (y (fib (- n 2))))
        (+ (TOUCH x) y))))

[Execution profiles for "fib.elog" and "fib-unroll.elog" on 32 processors: percentage of processor time spent in each activity (interrupt, working, idle, touch, determine, stealing) as a function of time in msec.]

Figure: Fib and a poor variant obtained by unrolling the recursion.

An overhead cost must also be expected if task grouping is managed dynamically by user code, as is the case for the depth and height cutoff methods proposed for tree-like computations by Weening (Weening). The transformation is also error prone: logical bugs as well as performance problems can be introduced by the user. For example, the recursion of fib can be unrolled once, as shown in the figure above, to double the task granularity. One might expect the program to be more efficient because of the lower task management overhead, but in reality it performs poorly because a sequential dependency has been introduced; this can be seen clearly in the execution profiles. Finally, the program will be less portable, because the selection of an appropriate granularity depends on several parameters of the run time environment (number of processors, task operation costs, shared memory costs, etc.).

The problem with a high task management cost is not so much that it prevents the user from attaining good performance. The problem is that the language cannot realistically be viewed as a high-level language, because the user must program at a low level to attain good performance: selecting the right granularity for a program can quickly become the user's overriding concern.

The next chapter explores a more efficient approach to task management, called lazy task creation, with which the cost of evaluating a future is very small (on the order of a few μsec on the GP1000). The table above can be used to approximate the overhead of this approach from its T_future_min.


The finest grain program (i.e. fib) should then have a value of O_expose close to the corresponding lower bound in the table. Note that the table gives a lower bound and that the actual overhead will be somewhat larger; a later chapter contains the measured values of O_expose for the benchmarks. With such a small overhead, the user has virtually no incentive to avoid fine grain tasks and thus has added liberty in the programming styles that can be used.


Chapter

Lazy Task Creation

Several plausible semantics for Multilisp were compared in the preceding chapter. The Katz-Weise semantics with legitimacy is attractive because it provides an elegant interaction between futures and continuations; in addition, dynamic scoping and fairness of scheduling are desirable features. Unfortunately, ETC is not an adequate implementation of futures because its performance is poor on fine grain programs.

This chapter explores lazy task creation (LTC), an alternative task creation mechanism that is more efficient than ETC, especially for fine grain programs. The LTC mechanism described here supports the Multilisp semantics given above. Two variants of LTC are examined: one that assumes an efficient shared memory and one that does not. As confirmed in a later chapter, both variants have roughly the same performance when consistent shared memory is efficient, but when this is not the case, for example on large scale multiprocessors, the latter variant permits a more efficient execution, faster by as much as a factor of two on the TC2000.

In this chapter, algorithms are given in pseudo-C. Assembly code is also used to explain the details of the code sequences generated by the compiler.

Overview of LTC Scheduling

This section explains the scheduling policy adopted by LTC and its benefits.

Task execution order has a direct impact on performance; the implementation must choose an ordering that minimizes the task management overheads. There are four places where an implementation has liberty as to which task to run next:

1. Task spawning
2. Task termination
3. Task suspension
4. Preemption (interruption)

Only the first two situations are examined here; the last two are discussed in later sections. Any runnable task can be run next in these four situations. However, only the subsets of runnable tasks that are most promising are considered in the following discussion. In particular, the task to run next is preferentially selected from the local work queue, because this will promote locality and reduce contention. When the local work queue is empty, a task must be stolen from another processor's work queue. Task stealing is the only way for work to get distributed between processors. The two processors involved in a task steal are the thief processor and the victim processor.

When a task is spawned, one of two tasks can be run next by the spawning processor: the child task or the parent task. The ETC implementation described in the preceding chapter uses parent first scheduling: when a future is evaluated, the child task is made to wait for an available processor whereas the parent task immediately starts executing the future's continuation. LTC uses the reverse scheduling order, child first scheduling: the child's execution is started immediately by the spawning processor and the parent is delayed until a processor is ready to run it.

The use of child first scheduling in Multilisp has important advantages. First, it tends to reduce the number of task suspensions caused by touches. The child is computing a value that is used by the future's continuation. Since the parent gets delayed with respect to the child, there is a higher likelihood that the child will have completed when its result is first touched by the parent or one of its other descendants.

When a task terminates, however, there is no incentive to delay its parent any further. In fact, now that the task's result is known, it makes sense to execute the parent next: since the parent consumes the value just computed, it is less likely that it will get suspended. This policy will be called parent next scheduling.

Child first scheduling combines naturally with parent next scheduling to give an efficient stack-like scheduling policy: LIFO scheduling. The set of runnable tasks on a processor is kept in a stack, the task stack, associated with that processor (see the figure below). The main operations available on the task stack are task push, task pop and task steal.


[Diagram: the task stack; PUSH and POP operate at the top, where the youngest task sits, and the oldest task is at the bottom.]

Figure: The task stack.

When a task is spawned, the parent is simply pushed onto the task stack and control goes to the child. When a task terminates, the parent is necessarily on top of the task stack if it hasn't been run yet (this assumes that processors can steal, but cannot push, a task onto another processor's task stack). If the parent is still there, it gets popped from the task stack and executed by the same processor that pushed it.
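The following sketch shows the intended asymmetry of these operations (a hypothetical representation, with all synchronization omitted; the SM and MP protocols of this chapter exist precisely to make these accesses safe and cheap): push and pop act at the top of the stack, steal at the bottom.

(define (make-task-stack size)
  (vector (make-vector size) 0 0))        ; entries, bottom index, top index

(define (task-push! ts task)
  (let ((top (vector-ref ts 2)))
    (vector-set! (vector-ref ts 0) top task)
    (vector-set! ts 2 (+ top 1))))

(define (task-pop! ts)                    ; owner: take the youngest task
  (let ((top (- (vector-ref ts 2) 1)))
    (if (< top (vector-ref ts 1))
        #f                                ; empty (or all remaining tasks stolen)
        (begin (vector-set! ts 2 top)
               (vector-ref (vector-ref ts 0) top)))))

(define (task-steal! ts)                  ; thief: take the oldest task
  (let ((bot (vector-ref ts 1)))
    (if (>= bot (vector-ref ts 2))
        #f                                ; nothing to steal
        (begin (vector-set! ts 1 (+ bot 1))
               (vector-ref (vector-ref ts 0) bot)))))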

LIFO scheduling yields a task execution order very similar to that of the program with futures removed. In fact, the execution order is identical when no task is ever stolen from the task stack. This happens, for example, when the machine has a single processor or when all processors have enough local work to keep them busy. In this situation there are no task suspensions, because the only computation that might touch a task's placeholder (i.e. the continuation) necessarily follows the termination of the task.

Task Stealing Behavior

Under LIFO scheduling, tasks could be stolen from either end of the task stack. Tasks are always stolen from the task stack's bottom in LTC. It is interesting to see why this bottom stealing is preferable to top stealing. Top stealing might seem better for the same reason as child first scheduling: favoring the execution of younger tasks should reduce the likelihood of suspension in older tasks.

However, this analysis does not take into account that older tasks generally run longer before termination or suspension than younger tasks. For DAC programs with balanced spawning trees, the task size will decrease geometrically with the task stack depth: when a child task is pushed onto the task stack, the amount of work it contains is a fraction f of the amount remaining in the parent.

(The amount of work remaining in a task is all the work remaining before its termination, including the work contained in the tasks that it will spawn. In a well balanced binary DAC program such as sum, f will be close to 1/2; for fib, which has an imbalanced spawning tree, f is about 0.62. An f close to 1 approximates loop-based parallel algorithms such as pmap.)


th i

i removed child from a task has f times the work of that task and collectively a task

d P

f

d

i

the amount of f and the d descendants b elow it on the task stack have

i

f

work This means that the amount of work in the oldest task is approximately equal to

0

that of its youngest d d log f descendants Consequently the amount

f

of work T remaining in the oldest task is equal to the work in all other tasks on

oldest

the task stack except a constant number of the oldest tasks The task stealing overhead

0

will b e higher for top stealing b ecause it requires at least d times more task steals than

b ottom stealing to distribute T units of work In reality the number of steals will

oldest

0

b e higher than d b ecause the victim is continuously replenishing the task stack with

small tasks as the thief is stealing them The probability of stealing a task close to the

leaves of the spawning tree is relatively high
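For example, with f = 1/2 the descendants of a task hold 1/2, 1/4, 1/8, ... times its work, so even all of them together barely match the task itself; d' is then about d - 1, and a top stealing thief must perform on the order of d steals to acquire the amount of work that a single bottom steal acquires.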

Individual task steals are also faster with bottom stealing, because there are two nearly independent ways to access the task stack. A processor can push or pop a task from its local task stack while some other processor is simultaneously stealing a task. This parallelism, which is no more than a degree of 2, enables tasks to be created and started faster. In addition, better caching of the task stack top is possible because it is single writer shared data, as opposed to multiple writer shared data for top stealing.

Mohr [Mohr] has analyzed the task stealing behavior of bottom stealing for tree-like DAC parallel programs. He has derived an upper bound of p^2 h task steals for programs with binary spawning trees of height h running on a machine with p processors. This upper bound relies on the use of polite stealing: in polite stealing, a processor whose last steal was from victim V must try to steal from all other processors before stealing again from V. An outline of Mohr's proof follows.

At any given point in time, a processor i is either idle (and is trying to steal a task) or is in charge of running the tasks in some subtree of the spawning tree. Call h_i the height of processor i's subtree (h_i = 0 when it is idle) and H the maximum height of all subtrees (H = max_i h_i). After a task is stolen from processor i, both the victim and the thief will be in charge of subtrees of height h_i - 1. Note that to decrease H by one, it is necessary to steal a task from all processors i with h_i = H. Polite stealing guarantees that all these processors will have been tried by a given processor in no more than p steals or steal attempts. Because up to p processors might be attempting to steal tasks, it will take no more than p^2 steals to steal at least one task from each processor with h_i = H. When H reaches zero, no tasks are left to steal. Consequently, no more than p^2 h steals can occur.

In the absence of polite stealing, O(2^h) steals can occur (potentially all tasks are stolen). Although polite stealing insures the upper bound of p^2 h steals, it isn't clear that this makes a difference in practice. Mohr ran programs with and without polite stealing for a wide range of values of h and p. The number of steals was comparable (usually within … to …) and only in extreme cases was there a noticeable advantage to use polite stealing (a factor of … to … for high h and p). Gambit uses polite stealing, with the particularity that each processor has a probing order generated randomly when the system is loaded. This was done in an effort to reduce interference between competing thief processors: with a sequential probing order there is a potential loss of parallelism because several thieves might become synchronized, following each other in lockstep. A sketch of such a probing order follows.
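The following sketch shows one way to set up and use such a randomized probing order; the names probe_order, init_probe_order and next_victim are illustrative assumptions, not Gambit's actual code:

#include <stdlib.h>

#define P 64                        /* number of processors (assumed) */

int probe_order[P-1];               /* a random permutation of the others */
int probe_index = 0;

void init_probe_order (int self)    /* called once, when the system is loaded */
{
  int i, j, t, n = 0;
  for (i = 0; i < P; i++)
    if (i != self) probe_order[n++] = i;
  for (i = n - 1; i > 0; i--)       /* Fisher-Yates shuffle */
  { j = rand () % (i + 1);
    t = probe_order[i]; probe_order[i] = probe_order[j]; probe_order[j] = t;
  }
}

int next_victim (void)              /* cycle through the permutation: all other */
{                                   /* processors are probed before any victim */
  int v = probe_order[probe_index]; /* is probed again (polite stealing) */
  probe_index = (probe_index + 1) % (P - 1);
  return v;
}

Because each processor shuffles independently, thieves probe in different orders and are unlikely to chase the same victims in lockstep.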

Task Suspension Behavior

Bottom stealing also leads to fewer task suspensions. To simplify the analysis, it is assumed that tasks touch the value of their children just before termination and that there are only two processors.

When bottom stealing, T_oldest time units will elapse before the first touch that might cause a suspension. The d' youngest tasks are not affected by the steal, so in this time period they will have a suspension-free execution. When f <= 1/2 there is necessarily no task suspension, because all the descendants have terminated when the touch is performed. A single suspension occurs when f > 1/2 and the steal happened not too late after the first descendant was spawned.

When top stealing, there are at least d' tasks that might suspend in the same time period. The likelihood of suspension increases with the depth of the task, due to a combination of two factors: first, deeper tasks have less work and, second, it is faster to remove tasks from the local task stack than to steal them from other processors (the costs are respectively T_local and T_steal). Let T_task be the amount of work remaining in the stolen task and T_child the work remaining in its currently running child. The stolen task will terminate or get suspended in T_steal + T_task time, whereas its parent will touch its value in T_child + T_local + T_task/f time (the processor will finish executing the child and then locally resume the stolen task's parent). A suspension occurs in either of the following cases:

1. T_steal + T_task < T_child  (the stolen task gets suspended)

2. T_steal + T_task > T_child + T_local + T_task/f  (the stolen task's parent gets suspended)

The second case is highly likely for fine grain DAC programs because, as the depth of the task increases, T_task and T_child become negligible when compared to T_steal, and it is always the task closest to the leaves of the spawning tree that is being stolen.
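To see how lopsided the second inequality becomes near the leaves, take some illustrative numbers (chosen here for exposition, not measured values): T_task = T_child = 10, T_local = 10, T_steal = 1000 and f = 1/2. The stolen task determines its result at T_steal + T_task = 1010 time units, but its parent touches that result at T_child + T_local + T_task/f = 40, so the parent suspends long before the result is available.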

Continuations for Futures

Continuations play a central role in the implementation of futures. A task's state is mostly composed of a continuation. In addition, the Katz-Weise semantics, as defined in Figure, requires that the future's continuation be captured and shared between the child and parent tasks. Consequently, the efficiency of continuation operations and futures are intimately tied. This section describes the implementation of continuations on top of which LTC will be implemented.

Conceptually, a continuation is a chain of frames. Each frame corresponds to some subproblem call that is currently pending completion. A frame contains the context required to perform the computation that follows the corresponding subproblem call. The frame includes temporary values and variables (or alternatively an environment pointer) and also contains a parent continuation. The parent continuation is used when the procedure containing the subproblem call exits, by a normal return or a reduction call. This link is what gives the stack structure to continuations. Note that in some situations the parent continuation is never used and could be removed from the frame by a smart compiler. For simplicity, it is assumed that the parent continuation is always present in the frame. The oldest frame's parent is the root continuation, which is special in that it has no parent. The root continuation symbolizes the end of the program.

Several strategies for implementing continuations have been described and compared by Clinger et al. Their results suggest that the incremental stack/heap strategy is more efficient than the other strategies in most cases, and not noticeably slower than the other strategies in extreme cases. With the exception of a few details, this is the strategy used by Gambit.

This is permissible if the subproblem call is done inside an infinite loop. For example, in the following definition, the frame for the subproblem call to g need not contain f's continuation because f never returns:

(define (f)
  (g)
  (f))


Procedure Calling Convention

Since continuations are manipulated at every procedure call and return, it is important to have efficient support for these common operations. The incremental stack/heap strategy puts very few constraints on procedure calling conventions. This means that the presence of unlimited extent continuations in the language does not impose a special runtime overhead.

Parameters can be passed in any location, typically in registers and/or on the stack, and a procedure can return simply by jumping to the return address passed to the procedure by the caller. Within a procedure, the stack can be used freely to allocate temporary values and local variable bindings.

Continuation frames created at subproblem calls are always allocated from the run time stack, as is normally done for other languages. The procedure that allocated a frame is responsible for its deallocation from the stack. Deallocation occurs at some point before the procedure is exited, by a normal return or a reduction call. This insures that, at the subproblem call's return point, the continuation frame created for the call is still topmost on the stack. A procedure's continuation is thus a combination of two values: the return address and the value of the stack pointer. Note that the return address passed to a procedure is always contained in any continuation frame it creates.

Unlimited Extent Continuations

This implementation can be extended to support unlimited extent continuations. The continuation is split into two parts: the most recently created frames of a continuation are on the stack, and the oldest frames reside in the heap. This situation is depicted in Figure, where frame i is created by procedure p_i and ret_i is the return address into p_i. The implicit continuation passed to a procedure is represented by a triplet (SP, RET, UNDERFLOW_CONT). The stack pointer SP points to the topmost frame on the stack and the return register RET contains the return address. (RET could also be passed on the stack, but it is simpler to think of it as being contained in a dedicated register; Gambit actually dedicates a register for the return address.) UNDERFLOW_CONT corresponds to the heap continuation and it contains two fields: link, a pointer to the topmost heap frame, and ret, the return address for the topmost heap frame.

(Note that the semantics of continuations in Scheme require that there be only one instance of any variable allocated. To support this, it is common to create a cell in the heap for each mutable variable. The extra dereference needed to access mutable variables adds an overhead whose importance will depend on the program; however, there is no overhead for functional programs.)

[Figure: Continuation representation and operations. The current continuation before and after heapification, represented by the triplet (SP, RET, UNDERFLOW_CONT) over stack and heap frames.]

Note that the stack frames are only linked conceptually; in reality they are allocated contiguously on the stack. On the other hand, heap frames are independent objects in a format suitable for garbage collection, and explicit links between them are maintained.

The link between the stack frames and the heap frames is preserved in a special way. This link is traversed when a procedure returns to its continuation and the stack is empty. This is called a stack underflow. When the stack underflows, the topmost heap frame must be copied back to the stack so that the return point can access the content of the continuation frame in a normal manner. This is the only frame that is immediately needed; the older heap frames get restored one at a time by subsequent underflows.

A special mechanism is used to avoid having to check explicitly for stack underflow at every procedure return. The return address logically attached to the oldest stack frame is stored in UNDERFLOW_CONT.ret; in its place, the continuation frame contains a pointer to the underflow handler. This handler consequently gets called by the normal procedure return mechanism when the stack underflows. The handler performs the following sequence of steps: the correct return address is extracted from UNDERFLOW_CONT.ret, the topmost heap frame is copied to the stack, UNDERFLOW_CONT is updated to represent the parent heap frame, the return address in the stack frame is replaced by the underflow handler (to prepare it for underflow), and finally control is returned to the correct return address. The cost for an underflow is thus dependent on the frame size, which in typical cases is fairly small. For example, the largest frame size for the parallel benchmarks is … slots and the average, measured statically, is just below …. An underflow should thus be fairly cheap for these programs (between … and … instructions) if the underflow handler and heap frame format are chosen carefully.

Continuation Heapification

Heap continuations are created by the process of heapification. Heapification transforms the current continuation into one that only contains heap frames. The stack frames are transferred one by one to the heap, with the appropriate links between them. The oldest stack frame must be handled specially: when it is copied, its return address is first recovered from UNDERFLOW_CONT.ret and its parent link is obtained from UNDERFLOW_CONT.link. Finally, the stack is cleared by resetting SP to the bottom of stack, and RET and UNDERFLOW_CONT are updated to reflect the new location of the continuation. The current continuation before and after heapification are logically equivalent; only the representation changes.

Parsing Continuations

One complication with the underflow and heapification mechanisms is that it must be possible to parse the stack, to know where each frame begins and ends and also which frame slot contains the return address. One way to achieve this is to associate the description of a frame's layout (length and return address location) with the return address of the subproblem call that created the frame. The frame descriptor can, for example, be stored just before the return point, as is done by Hieb et al. RET can then be used to get the size of the topmost stack frame and the location of its return address. The return address in this frame in turn gives the size of the next frame, and so on. (The ability to parse the stack is also useful to implement introspective tools such as debuggers and profilers.)

The heapification and underflow mechanisms can now be described in detail. The algorithms are given in Figure. In these algorithms, two functions are used to parse the continuation: frame_size(r) and ret_adr_offs(r) return respectively the size and the return address offset of the continuation frame associated with return address r. It is assumed that all data structures grow towards higher addresses and that, in all drawings, addresses grow towards the top of the page.

Implementing First-Class Continuations

First-class continuations can easily be implemented with the heapification mechanism. Call/cc first heapifies its implicit continuation and then packages up UNDERFLOW_CONT in a new closure. When called, this closure discards the current continuation by resetting SP to the bottom of stack, restores the new continuation by setting UNDERFLOW_CONT to the saved value, and then jumps to the underflow handler to transfer control to the return point. Support for dynamic scoping is a simple addition to this mechanism: the current dynamic environment is saved in the closure at the moment of the call/cc and is restored just before jumping to the underflow handler. These steps are sketched below.
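The following sketch restates those steps in the style of the figures below; call_cc_capture, invoke_continuation and alloc_captured_cont are illustrative names, and the closure layout is an assumption:

typedef struct {             /* what the call/cc closure captures */
  frame *link;               /* the heapified UNDERFLOW_CONT ... */
  instr *ret;
  value denv;                /* ... and the dynamic environment */
} captured_cont;

captured_cont *call_cc_capture ()
{
  captured_cont *k;
  heapification ();                 /* current cont is now entirely in heap */
  k = alloc_captured_cont ();
  k->link = UNDERFLOW_CONT.link;    /* package up UNDERFLOW_CONT */
  k->ret  = UNDERFLOW_CONT.ret;
  k->denv = CURRENT_DYNAMIC_ENV;    /* for dynamic scoping */
  return k;
}

invoke_continuation (k, result)     /* run when the closure is called */
captured_cont *k;
value result;
{
  SP = bottom_of_stack;             /* discard current continuation */
  UNDERFLOW_CONT.link = k->link;    /* restore captured continuation */
  UNDERFLOW_CONT.ret  = k->ret;
  CURRENT_DYNAMIC_ENV = k->denv;    /* restore dynamic environment */
  result_location = result;         /* pass value to the return point */
  underflow ();                     /* transfer control */
}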

Heapification might seem to be doing more work than strictly required by call/cc. By leaving the stack in its original state after its content is copied to the heap, some returns would become cheaper, because the restoration of the frames by the underflow mechanism would be avoided. However, new costs in space and time would be introduced.

typedef struct frm {                      /* heap frame format */
  struct frm *link;                       /* parent frame pointer */
  value slots[1];                         /* content of frame */
} frame;

value *SP;                                /* stack pointer */
instr *RET;                               /* return register */
struct { frame *link; instr *ret; } UNDERFLOW_CONT;

underflow ()
{
  frame *f = UNDERFLOW_CONT.link;         /* get topmost heap frame */
  instr *r = UNDERFLOW_CONT.ret;          /* get return address */
  int i;
  for (i = 1; i <= frame_size (r); i++)   /* copy frame to stack */
    SP[i] = f->slots[i];
  UNDERFLOW_CONT.link = f->link;          /* prepare for underflow */
  UNDERFLOW_CONT.ret = SP[ret_adr_offs (r)];
  SP[ret_adr_offs (r)] = underflow;
  SP += frame_size (r);                   /* update stack pointer */
  jump_to (r);                            /* jump to return point */
}

heapification ()
{
  if (RET != underflow)                   /* check for empty stack */
  { heapify_frame (SP, RET);
    SP = bottom_of_stack;                 /* clear stack */
    RET = underflow;
  }
}

heapify_frame (s, r)
value *s;
instr *r;
{
  value *b = s - frame_size (r);          /* compute frame's base */
  frame *f = alloc_frame (frame_size (r));/* allocate heap frame */
  instr *p = b[ret_adr_offs (r)];         /* get parent ret adr */
  int i;
  if (p == underflow)                     /* oldest frame? */
    b[ret_adr_offs (r)] = UNDERFLOW_CONT.ret;
  else
    heapify_frame (b, p);
  for (i = 1; i <= frame_size (r); i++)   /* copy frame content */
    f->slots[i] = b[i];
  f->link = UNDERFLOW_CONT.link;          /* link frame to parent */
  UNDERFLOW_CONT.link = f;                /* update UNDERFLOW_CONT */
  UNDERFLOW_CONT.ret = r;
}

Figure: Underflow and heapification algorithms

The new costs come from the fact that there could now be multiple copies of the same stack frame. This occurs when multiple continuations which share the same tail are captured. Programs with nested calls to call/cc, such as those typically found in backtracking algorithms and exception processing, exhibit this behavior. As an example, consider this definition for f:

(define (f n)
  (if (zero? n)
      0
      (call-with-current-continuation
        (lambda (cont)
          (f (- n 1))))))

Note that the call (f n) calls call/cc n times. If there are k stack frames in the continuation for the call (f n), nk + n^2/2 heap frames will be created. The sharing properties of heapification are much better, because there is at most one heap copy of any continuation frame. In the example, only k + n heap frames will be created: a savings of a factor of O(n). The same reasoning holds for nested futures when they are implemented with call/cc, as is the case for the implementation of the Katz-Weise semantics shown in Figure.

The LTC Mechanism

An important benefit of combining LIFO scheduling and bottom stealing is that it promotes stack-like execution. For fork-join DAC programs, entire subtrees of the spawning tree get executed in an uninterrupted stack-like fashion, because it is the older tasks that get stolen (those closer to the spawning tree's root). Since the tasks in these subtrees are exactly those that are not stolen, they will be called nonstolen tasks. Stack-like execution stops only when the oldest nonstolen task terminates (the one at the nonstolen subtree's root).

LTC presupposes that this stack-like execution is the predominant execution order. In other words, LTC speculates that most tasks are not stolen. Several task spawning steps are only required if the task is stolen. Referring to Figure, these steps include the heapification of the parent continuation, the call to call/cc, and the creation and manipulation of the task's result and legitimacy placeholders (the calls to make-ph). LTC postpones these steps until it is known that the task is stolen; this explains the name lazy task creation. In summary, nonstolen tasks completely avoid these steps, whereas stolen tasks perform these steps when the task is stolen.

To achieve this, LTC uses a lightweight task representation. When a future is evaluated, a lightweight task representation of the parent task is pushed on the task stack. The task stack push and pop operations, which are the only operations needed for a purely stack-like execution, can be implemented at a very low cost with this representation. Moreover, there is enough information in a lightweight task to recreate the corresponding heavyweight task object if the task is ever stolen from the task stack.

The rest of this section is a more detailed description of the LTC mechanism. The important issue of synchronization between the thief and victim is discussed in the section that follows.

The Lazy Task Queue

The task stack is represented by a group of three stack-like data structures: the run time stack, the lazy task queue (LTQ) and the dynamic environment queue (DEQ). The same terminology as Mohr has been used when possible, for consistency. The term lazy task refers to a task in the lightweight representation, i.e. a task contained in the task stack. These three data structures are really double ended queues which are mostly used as stacks. Items can be pushed and popped from the tail of these queues; items can also be removed from the head. For efficiency, the entries are laid out contiguously in memory. For the LTQ and DEQ, two pointers indicate the extent of the queue: the head and tail.

The run time stack contains the continuation frames of all the tasks in the task stack. The LTQ and DEQ contain pointers to continuation frames in the run time stack. The DEQ, which is only needed to support dynamic scoping, is explained in Section. The purpose of the LTQ is to keep track of each lazy task's continuation. For each lazy task in the task stack there is exactly one pointer on the LTQ. Each pointer points to the first continuation frame of the corresponding future's continuation. The before part of Figure shows a possible state of the LTQ and run time stack on entry to procedure p9 after a call to procedure p1:

(define (p1) (p2))
(define (p2) (p3))
(define (p3) (FUTURE (p4)))
(define (p4) (p5))
(define (p5) (FUTURE (p6)))
(define (p6) (p7))
(define (p7) (FUTURE (p8)))
(define (p8) (p9))
(define (p9) ...)

The LTQ's TAIL points to the youngest entry on the LTQ, whereas HEAD points just below the oldest entry. Thus the LTQ is nonempty if and only if HEAD < TAIL; otherwise the LTQ is empty and HEAD = TAIL. The same is true for the DEQ, with the pointers DEQ_HEAD and DEQ_TAIL.

Pushing and Popping Lazy Tasks

The task stack's push and pop operations translate into a small number of steps. When a future is evaluated, the thunk representing the future's body is called as a subproblem. The continuation frame created on the run time stack for this call corresponds to the first frame of the parent task's continuation. To indicate the presence of the parent task on the task stack, a pointer to the continuation frame (i.e. SP) is pushed on the LTQ (thereby incrementing TAIL) upon entering the thunk. This pointer is used by the steal operation to recreate the parent task. The processor has effectively queued the parent on the task stack and is now running the child. When the thunk returns, the LTQ is either empty, indicating that the parent was stolen, or not, indicating that the parent is still on the LTQ. If the LTQ is not empty, the parent task gets resumed in parent next fashion. Note that at this point both SP and the topmost pointer on the LTQ point to the parent's continuation frame. To pop the parent task, it is sufficient to place an instruction that decrements TAIL at the subproblem call's return point. After decrementing TAIL, the processor has effectively terminated the child and resumed the parent. The body's result has been transferred from the child to the parent without having to create a placeholder. Moreover, legitimacy propagation costs nothing, because the parent task's legitimacy before and after executing the child are identical. A single legitimacy flag, CURRENT_LEGITIMACY, is needed per processor; it logically corresponds to the legitimacy of the task currently running on that processor. Similarly, each processor has a CURRENT_DYNAMIC_ENV variable that is always bound to the dynamic environment of the currently running task. There is no need to change this variable when a lazy task is pushed or popped from the task stack. The handling of a stolen parent is explained in the next section; a sketch of the push and pop fast path is shown below.
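The following is a minimal sketch of that fast path in the style of the later figures, deliberately ignoring the thief/victim synchronization addressed at the end of this chapter (the TAIL register and LTQ layout are as described above):

  /* evaluation of (FUTURE body): push the parent task */
  RET = ret_point;       /* future body's return address */
  *++SP = RET;           /* first frame of the parent's continuation */
  *++TAIL = SP;          /* push pointer to that frame on the LTQ */
  /* ... execute the future's body ... */
ret_point:
  TAIL--;                /* pop: parent was still there (no steal) */
  RET = *SP--;           /* resume parent at its saved return address */

With a steal possible, the decrement of TAIL must be checked against HEAD; this is precisely what the SM protocol's SM_attempt_pop, shown later, does.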

It would seem that most of the work to push a task on the task stack goes into two operations: the creation of the closure for the body and the creation of the continuation frame. However, these operations do not really constitute an important overhead with respect to a purely sequential execution of the program.

Firstly, it isn't necessary to heap allocate the closure, because its single call site is known. It is more efficient to lambda-lift the closure so that the closed variables are passed to the body as parameters. Frequently these variables are already in registers, so they can be left as is for the body to use. As shown in Table, most of the benchmarks require little or no work to setup the closed variables for the body because they are already in registers (Gambit does a good job at allocating variables to registers). A system could be designed to avoid any copying by directly accessing the closed variables in the parent continuation frame. However, this would create dependencies between frames which are hard to manage; in particular, heapification would become more complex and expensive because the frames can't be separated.

[Table: Size of closure for each future in the benchmark programs. Columns: number of closed variables for each future, and number copied. Programs: abisort, allpairs, fib, mm, mst, poly, qsort, queens, rantree, scan, sum, tridiag.]

Secondly, the continuation frame created by the future can be reused by the future's body. Futures are typically subproblems and have a procedure call as their body (all the futures in the benchmarks are like this). A sequential version of the program would create a continuation frame for the call just before the procedure is invoked. The same continuation frame is created by the future, but there is no need to create another frame for the call in the body, since it is now a reduction call. The only difference is that the frame is created before the arguments to the procedure are evaluated rather than afterwards, but the cost will be the same.

resume_task (t)
task *t;
{
  CURRENT_TASK = t;
  UNDERFLOW_CONT.link = CURRENT_TASK->cont.link;
  UNDERFLOW_CONT.ret = CURRENT_TASK->cont.ret;
  CURRENT_DYNAMIC_ENV = CURRENT_TASK->cont.denv;
  result_location = CURRENT_TASK->cont.val;
  CURRENT_LEGITIMACY = CURRENT_TASK->leg_flag;
  SP = bottom_of_stack;
  TAIL = bottom_of_LTQ;
  HEAD = bottom_of_LTQ;
  DEQ_TAIL = bottom_of_DEQ;
  DEQ_HEAD = bottom_of_DEQ;
  underflow ();
}

Figure: Resuming a heavyweight task

Stealing Lazy Tasks

When a thief processor steals a lazy task from a victim processor's task stack, it removes the oldest entry on the LTQ (thereby incrementing HEAD) and then must do three things: recreate the parent task as a heavyweight task object, notify the victim so that it knows the oldest lazy task is no longer on the task stack, and finally resume the parent task.

A heavyweight task is represented with a structure containing five fields:

  cont.link
  cont.ret
  cont.denv
  cont.val
  leg_flag

The first four fields describe the task's continuation: cont.link is a pointer to the continuation frames in the heap, cont.ret is the continuation's return address, cont.denv is the continuation's dynamic environment, and cont.val is the value passed to the continuation when the task is resumed. The fifth field, leg_flag, is the task's legitimacy flag. Resuming a heavyweight task is performed by the steps in Figure. Note that variables are local to the processor unless explicitly marked otherwise; the notation P.v, where P is a processor, will be used to denote P's local variable v. Thus resume_task first sets the processor's current task and, after initializing the task stack, uses the underflow mechanism to restore the task's continuation. The value in cont.val is passed to the continuation by setting result_location. It is assumed that all continuations, including those for futures, receive their result in this location (result_location is a machine register in Gambit). This restriction could be lifted by parameterizing the result location by the return point, that is UNDERFLOW_CONT.ret; this would require adding a field to the frame descriptor.

Figure will help illustrate the effect of a steal on the LTQ and run time stack. The pointer p removed from the victim's LTQ points to the first continuation frame of the corresponding task (frame … in the figure). To ease its manipulation, the task's continuation is first heapified, from this continuation frame down to the next frame having the underflow handler as its return address. This is achieved by the call

  heapify_frame (p, r);

where r corresponds to the return address associated with frame p (i.e. ret in the example). In addition, r must be replaced by a pointer to the underflow handler, so that the child invokes UNDERFLOW_CONT when it is done. An important issue is how to locate r from p, but for now this operation will be hidden in the procedure swap_child_ret_adr_with_underflow(p) that sets r to underflow and returns its previous value. (This may not be this simple, because all return addresses must be parsable. Gambit always generates a secondary return point along with each future body return point, at a constant distance from it; the secondary return point contains a jump to the instruction that follows the popping of the parent task. A sketch follows.) The victim's current continuation is now logically the same as before; only the representation has changed.

After being heapified, the future body's continuation is in UNDERFLOW_CONT. Note that UNDERFLOW_CONT.ret contains the address of the subproblem's return point. The first instruction at this address is the one which decrements TAIL. The only purpose of this instruction is to pop the parent task on a parent next transition, and it shouldn't be executed in any other case. The future's continuation is reconstructed by adjusting UNDERFLOW_CONT.ret so that it points to the following instruction (i.e. ret′ in the example). At this point UNDERFLOW_CONT corresponds to the parent task's continuation (k in Figure). The thief can now use this continuation to create a heavyweight task representation of the parent. The cont.link and cont.ret fields are initialized directly from UNDERFLOW_CONT. An undetermined placeholder, res_ph, is also created to represent the result of the future. Res_ph is stored in the field cont.val, so that it will get passed to the parent's continuation. To represent the parent task's legitimacy, another undetermined placeholder, leg_ph, is created and stored in the field leg_flag. The field cont.denv is initialized to the dynamic environment in effect when the task was pushed on the task stack (the next section explains how this is done).
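The constant-distance trick can be sketched as follows; SECONDARY_RET_OFFSET is an assumed constant, not Gambit's actual value:

instr *future_secondary_ret_adr (r)    /* r: a future body's return point */
instr *r;
{
  /* the secondary return point is emitted at a fixed distance from the
     primary one, so locating it is simple pointer arithmetic */
  return r + SECONDARY_RET_OFFSET;
}

The secondary return point simply skips the TAIL-decrementing instruction, so a restored parent continuation never pops a lazy task that is no longer on the LTQ.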

[Figure: The LTQ and the steal operation. The victim's stack, LTQ, UNDERFLOW_CONT and heap before and after a steal, showing the heapified parent continuation, the end_frame, and the res_ph and leg_ph placeholders.]

task *steal_task (p)
value *p;
{
  instr *r = swap_child_ret_adr_with_underflow (p); /* update child's ret adr */
  task *parent;
  frame *end_frame;
  heapify_frame (p, r);                      /* heapify parent's cont */
  parent = alloc_task ();                    /* allocate heavyweight task */
  end_frame = alloc_frame (3);               /* allocate end_frame */
  parent->cont.link = UNDERFLOW_CONT.link;   /* setup parent's cont */
  parent->cont.ret = future_secondary_ret_adr (r); /* using secondary ret adr */
  parent->cont.denv = recover_dyn_env (p);   /* setup task's dynamic env */
  parent->cont.val = alloc_ph ();            /* allocate result ph */
  parent->leg_flag = alloc_ph ();            /* allocate legitimacy ph */
  end_frame->link = parent->cont.link;       /* setup end_frame */
  end_frame->slots[1] = parent->cont.ret;
  end_frame->slots[2] = parent->cont.val;
  end_frame->slots[3] = parent->leg_flag;
  UNDERFLOW_CONT.link = end_frame;           /* setup UNDERFLOW_CONT */
  UNDERFLOW_CONT.ret = end_body;
  return parent;
}

Figure: The task stealing mechanism

The thief will resume the parent task by a call to resume_task. Before doing this, however, the victim's underflow continuation must be changed so that it will take the appropriate action when it returns from the child. Note that this new continuation will be invoked with the result of the future's body. Consequently, this continuation must logically correspond to procedure end_body of Figure. The first time it is called, end_body uses the result it is passed to determine the placeholder res_ph, and the task is terminated after propagating the task's legitimacy (i.e. CURRENT_LEGITIMACY) to leg_ph. Subsequently, the result is simply passed on to the parent continuation. This functionality is obtained by pushing a new continuation frame, end_frame, to the front of the continuation in UNDERFLOW_CONT. End_frame corresponds to the continuation frame created for the call to the thunk in Figure. Thus UNDERFLOW_CONT.ret is set to that call's return address, which is essentially a call to procedure end_body. End_frame contains the following values needed by end_body: the parent task's continuation and the placeholders res_ph and leg_ph. The after part of Figure shows the system's state just before the thief resumes the parent task. Figure gives the complete task stealing mechanism, except for removing p from the LTQ.

The Dynamic Environment Queue

For every task that is stolen, it is necessary to know what the dynamic environment was when the task was pushed on the task stack. When the recreated task is resumed by the thief, CURRENT_DYNAMIC_ENV will be set to that dynamic environment, thus restoring it to its previous state.

A straightforward solution is to store the value of the dynamic environment in the future's continuation frame. In other words, CURRENT_DYNAMIC_ENV is pushed on the stack on entry to the future body's thunk. Unfortunately, this adds an overhead to all futures, independently of how heavily dynamic scoping is actually used (if at all).

It would be preferable if the cost of supporting dynamic scoping was only related to how heavily it is used. This can be achieved by a lazy mechanism that recreates a task's dynamic environment when it is stolen. It is assumed that the dynamic binding construct dyn-bind creates a new continuation for the evaluation of its body, as in Figure. The continuation frame contains prev_env, the dynamic environment that was in effect when dyn-bind's evaluation was started. Since a change of the dynamic environment is always indicated by one of these frames, the following invariants will hold:

1. The dynamic environment E_f associated with a continuation frame f is equal to the prev_env field of the first dynamic binding continuation frame above f on the stack.

2. If there is no dynamic binding continuation frame above f, then E_f is equal to CURRENT_DYNAMIC_ENV.

The DEQ provides an efficient mechanism to find the first dynamic binding continuation frame above the stolen task's continuation frame. For each dynamic binding continuation frame on the stack there is exactly one entry in the DEQ: a pointer to the frame. The pointer is pushed onto the DEQ just before evaluating the body and is popped after the body, as shown in Figure (this code uses the association list representation of dynamic environments, but the search tree representation could also be used).

A stolen task's dynamic environment is easily recovered with the DEQ. If the frame pointer removed from the LTQ is p, a linear or binary search can locate the lowest pointer on the DEQ that is larger than p. Figure shows how this is done. Note that a linear search, as shown, is acceptable because its cost is of the same order as the cost of heapifying the stolen task's continuation (i.e. there are no more entries skipped on the DEQ than there are frames heapified).

dyn_bind (id, val, body)
value id, val;
instr *body;
{
  *++SP = RET;                       /* create continuation frame */
  *++SP = CURRENT_DYNAMIC_ENV;       /* setup prev_env */
  *++DEQ_TAIL = SP;                  /* push frame pointer onto DEQ */
  CURRENT_DYNAMIC_ENV =              /* install new dynamic env */
    cons (cons (id, val), CURRENT_DYNAMIC_ENV);
  RET = env_restore;                 /* execute body */
  jump_to (body);
}

env_restore ()
{
  if (DEQ_TAIL != DEQ_HEAD)          /* pop frame pointer from DEQ */
    DEQ_TAIL--;
  CURRENT_DYNAMIC_ENV = *SP--;       /* restore dyn env to prev_env */
  RET = *SP--;                       /* return from dyn_bind */
  jump_to (RET);
}

Figure: The implementation of dyn_bind

The cost of supporting dynamic scoping can be attributed entirely to the use of dyn-bind (i.e. the cost is O(n), where n is the number of dyn-binds evaluated). For each dyn-bind evaluated, a few instructions in dyn_bind are needed to maintain the DEQ, and a few more instructions are needed in recover_dyn_env to skip its entry on the DEQ if it is part of a stolen task's continuation (a DEQ entry is never skipped more than once).

The Problem of Overflow

Because the LTQ, DEQ and run time stack are of finite size, an important concern is the detection and handling of overflows. A useful invariant of these structures is that the combined number of entries in the LTQ and DEQ is never more than the number of frames in the stack. Since each frame contains at least one slot (for the return address), the space occupied by the LTQ and DEQ is never more than the space occupied by the stack. If these structures are allocated in two equal sized areas, one for the LTQ and DEQ growing towards each other and one for the stack, then the stack will always overflow before the LTQ and DEQ. Thus it is only necessary to check for stack overflow. Chapter explains how stack overflows can be detected efficiently.

(define (p1) (dyn-bind y … (p2)))
(define (p2) (FUTURE (p3)))
(define (p3) (p4))
(define (p4) (FUTURE (dyn-bind z … (p5))))
(define (p5) (FUTURE (p6)))
(define (p6) (p7))
(define (p7) ...)

[Figure: The stack, LTQ and DEQ for the above program. Each dyn-bind frame holds prev_env and the env_restore return address, and the DEQ entries point to these frames.]

value recover_dyn_env (p)
value *p;
{
  while (DEQ_HEAD != DEQ_TAIL && DEQ_HEAD[1] <= p) /* skip entries below p */
    DEQ_HEAD++;
  if (DEQ_HEAD == DEQ_TAIL)
    return CURRENT_DYNAMIC_ENV;
  else
    return *DEQ_HEAD[1];             /* get frame's prev_env */
}

Figure: The DEQ and its use in recovering a stolen task's dynamic environment

A stack overflow could simply cause the program to signal an error or to terminate. This approach puts a strict limit on the depth of the call chain, so it is inappropriate for a language like Lisp, where recursion is used liberally. A more elegant approach, that removes this restriction, is to heapify the current continuation and then clear the stack, LTQ and DEQ. Note that, because the stack might contain lazy tasks, this heapification is special, as discussed in the next section. Subsequent computation will reuse the stack and possibly cause some other stack overflows. The continuation thus migrates to the heap incrementally, and it is only when there is no space left in the heap that an error is signalled.

The Heavyweight Task Queue

In general, the current continuation might contain lazy tasks when it is heapified. The four situations where this happens are:

1. Task suspension (for touching an undetermined placeholder)
2. Task switch caused by a preemption interrupt
3. Stack overflow
4. Call/cc

In these situations, something has to be done with the lazy tasks currently on the stack so that they remain runnable and independent. Since the lightweight representation is no longer adequate for these tasks, they are converted to the heavyweight representation and added to the processor's heavyweight task queue (HTQ). This queue contains all the heavyweight tasks runnable on that processor. It is in this queue that suspended tasks are put when the placeholder they are waiting on gets determined. Before heapifying the current continuation, the processor will in essence steal all lazy tasks on its own task stack, by calling steal_task(*++HEAD) while HEAD != TAIL, and add the resulting tasks to its HTQ, as sketched below.
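A minimal sketch of that conversion, using the procedures defined earlier (enqueue_HTQ is an assumed helper for adding a task to the HTQ):

convert_lazy_tasks ()              /* called before a special heapification */
{
  while (HEAD != TAIL)             /* oldest lazy task first */
    enqueue_HTQ (steal_task (*++HEAD));
  /* the stack can now be heapified and cleared; the tasks live on,
     independently, in the HTQ */
}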

But is this the best thing to do in the case of a task suspension? The only task that has to be suspended is the currently running task, so it seems wasteful to remove all lazy tasks. The topmost lazy task could simply be recreated and resumed (i.e. popped from the task stack) after adding the current task on the placeholder's waiting queue. Mohr's system [Mohr] uses this approach, which he calls tail-biting, even though he concedes that it "goes against our preference for oldest-first scheduling, since we have effectively created a task at the newest potential fork point. Performance can suffer because this task is more likely to have small granularity; also, further blocking may result, possibly leading to the dismantling of the entire lazy task queue."

Tail-biting offers no savings when supporting the Katz-Weise semantics, because the parent continuation must be saved in the suspended task. Thus the whole stack needs to be heapified anyway. In addition, by immediately moving all lazy tasks to the HTQ on a task suspension, and by managing the HTQ as a FIFO structure, the same scheduling order as bottom stealing is obtained (oldest task first). There is also greater liberty as to which task to run next after the suspension. Gambit uses the following heuristic for choosing the next task: if x is the placeholder that caused the suspension, then the child task associated with x (i.e. x's owner task) is resumed if it is runnable, otherwise the processor goes idle. (The link to the owner task is recorded in x when the parent task is stolen.) Conversely, when a task terminates after determining placeholder y, one of the tasks waiting on y will be resumed if there is one; otherwise the parent task associated with y is resumed if it is runnable. (A link to the parent task is recorded in end_frame when the parent task is stolen.) These heuristics promote an execution order close to the program's data dependencies, so they tend to reduce the number of task suspensions.

Since there are two sources of runnable tasks per processor, the HTQ and the task stack, idle processors could obtain a runnable task from either source. Gambit, however, checks the HTQ first and then the task stack, because this promotes the LIFO scheduling order, it avoids allocating new heavyweight tasks, and it is faster (the heavyweight tasks can be resumed immediately).

Another advantage of managing the HTQ as a FIFO structure is that scheduling will be fair, because all runnable tasks, including the lazy tasks on the task stack, are guaranteed to start running in a finite amount of time. On every preemption interrupt, all lazy tasks and the current task are transferred to the HTQ and the first task on the HTQ is resumed. Consequently, if there are m tasks in the task stack and n tasks in the HTQ at the moment of the preemption interrupt, then these m + n tasks will get at least one quantum out of the next m + n quantums.

Supporting Weaker Continuation Semantics

The task stealing algorithm can be modified to accommodate any of the other continuation semantics described in Section. These weaker semantics offer a lower cost for task stealing, because they avoid some steps.

Firstly, since these semantics do not support legitimacy, they do not need to create the legitimacy placeholder (and, of course, the parent task and end_frame need not contain the leg_flag and leg_ph fields). Also, legitimacy propagation in end_body is not needed.

Secondly, the parent task's continuation is not needed in end_frame. In fact, end_frame, just like the root continuation frame, has no parent continuation. (To preserve the format of frames and avoid a special case in the underflow handler, it is best if these frames contain a dummy parent continuation.) For the original Multilisp semantics, end_frame will only contain the result placeholder res_ph; it is the only parameter passed to the procedure end_body, apart from the body's result.

For the MultiScheme semantics, end_body only takes the body's result as a parameter. Consequently, end_frame contains no pertinent information and can simply be preallocated once and for all at program startup. Nevertheless, the result placeholder is needed by the child task, so an extra field, goal_ph, must be added to heavyweight task objects. At the time of the steal, the parent task's goal placeholder is initialized from the child's goal placeholder, and the result placeholder becomes the new goal placeholder of the child, i.e.

  parent->goal_ph = CURRENT_TASK->goal_ph;
  CURRENT_TASK->goal_ph = res_ph;

The steps avoided by the weaker continuation semantics do not amount to much: perhaps a saving of the order of … to … machine instructions per steal. A more promising source of saving is the handling of the parent continuation. Since only the parent task needs this continuation, and it is immediately going to be restored by the thief, it seems useless to heapify the continuation. The steal operation could transfer the continuation frames from the victim's stack to the thief's stack in a single block, with a block transfer or similar operation. When heapifying the continuation, two copies of the frames are done: once to the heap (for heapification) and once to the stack (because of underflow). Moreover, these copies are more complex to perform than a block transfer of the stack, because of the frame formatting and underflow handler overheads.

Upon closer examination, neither method is clearly superior to the other. Firstly, communication between the thief and victim processors is more important than the complexity of the algorithms. Assuming the thief actually returns through all the continuation frames, the frames only need to be transferred once between the processors in either method. When using heapification, one of the transfers will be between processors and one between local memory and the cache (assuming the stack lives mostly in the cache). Since interprocessor communication is an order of magnitude more expensive than local memory accesses, both methods will have roughly similar performance.

Secondly, the thief might not use all of the parent continuation frames. In such a case, a block transfer will do more work than strictly required. When using heapification, only the frames which are needed are transferred, since frames are restored on demand. This can make a big difference in some programs, in particular when a given task spawns several children deep in some recursion. To explain this case, consider the following variant of pmap:

(define (pmap proc lst)
  (if (pair? lst)
      (let ((val (FUTURE (proc (car lst)))))
        (let ((tail (pmap proc (cdr lst))))
          (cons (TOUCH val) tail)))
      '()))

Assume the root task calls pmap with a continuation containing k stack frames. Note that the continuation of the i-th evaluation of the future contains k + i frames. Also note that the only task that ever gets stolen with LTC is the root task. If the list is of length n and there are n steals, a total of

    sum(i=1..n) (k + i)  =  nk + n(n+1)/2

frames are transferred between processors when using the block transfer method. The cost is lower by a factor of O(n) when the parent continuation is heapified on every steal. On the first steal, k frames are heapified and the topmost is transferred and restored by the thief. Subsequent steals will heapify two frames (one for the recursive call to pmap and one for the call to the future's thunk) and a single frame will be transferred and restored. Finally, in the unwinding of the recursive calls to pmap, n frames will be transferred and restored. The total is about 2n + k heapified frames, 2n restored frames and 2n frames transferred between processors.
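For example, with illustrative values k = 10 and n = 100, the block transfer method moves nk + n(n+1)/2 = 6050 frames between processors, whereas heapification heapifies about 2n + k = 210 frames and transfers only 2n = 200 between processors: the O(n) difference is entirely in interprocessor traffic.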

Synchronizing Access to the Task Stack

In the above description of LTC, a critical issue was not addressed: the synchronization of the processors. This is an issue because multiple processors (including the victim) might try to simultaneously remove the same task from the task stack. Some synchronization is needed to resolve this race condition.

The case of multiple thieves can be prevented by associating a steal lock with every processor. A processor wanting to steal from a victim first acquires the victim's steal lock before attempting to steal a task. The lock is released when the attempt is finished, so there is never more than one thief trying to steal from a given victim, as sketched below.
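A sketch of this arbitration (try_acquire_lock, a nonblocking acquire, is an assumed primitive; SM_attempt_steal is the thief-side procedure given later in this section):

task *try_steal (V)
processor *V;
{
  task *t = NULL;
  if (try_acquire_lock (V->STEAL_LOCK))  /* at most one thief per victim */
  { t = SM_attempt_steal (V);            /* may still fail and return NULL */
    release_lock (V->STEAL_LOCK);
  }
  return t;                              /* NULL: no task obtained */
}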

The only remaining race condition occurs when the victim's task stack contains a single task and the thief tries to steal the task while the victim is trying to pop it. The term protocol refers to how the thief and victim processors interact to avoid conflicts when accessing the task stack. Two protocols are explored here: the shared-memory (SM) protocol and the message-passing (MP) protocol.

The Shared-Memory Protocol

The SM protocol tries to maximize concurrency between the thief and victim by minimizing the interference of the thief on the victim's current execution. The victim does not cooperate with the thief; rather, the responsibility of stealing falls entirely on the thief (a cute analogy is that the thief is behaving like a pickpocket, trying to stay unnoticed by its victim). Thus it is the thief that executes the steps in Figure. The problems with this approach are explained throughout the description of the SM protocol that follows.

The first problem is that, at the moment of a steal, the thief has no way of knowing where the child's return address r is, because the victim could be in any of several states (this problem shows up in swap_child_ret_adr_with_underflow(p)). The return address is only on the victim's stack if the child is in the process of executing a subproblem call. Even if the procedure calling convention required that r be passed on the stack in a predetermined slot (e.g. the first), there would be a problem because, when r is invoked to return from the future's body, r will first get popped from the stack before the parent task is popped. This race condition, between the thief mutating r and the victim invoking r, can be handled in the following way. Instead of having the thief mutate r, to bring the victim to call underflow when it returns from the child, the detection of a stolen parent task is done explicitly by the victim at the future's return point. The test at the return point will cause a branch to the underflow handler if the parent was stolen. Nevertheless, the thief must still know the value of r to reconstruct the parent's continuation. A simple solution is to save the value of r inside the future's continuation frame, just before pushing the lazy task on the LTQ. Thus the thief can get the value of r by indirecting p.

Before stealing a task, the thief must first verify that one is present, that is, check if HEAD < TAIL. However, this only tests the instantaneous presence of a task, because nothing prevents the victim from immediately decrementing TAIL as part of the popping of a lazy task. To prevent this from happening, each LTQ entry could be augmented with a popping lock that controls the popping of the corresponding task. The victim acquires the popping lock under TAIL before decrementing TAIL, and the thief acquires the popping lock under HEAD before testing for the presence of a task. If a task is present (i.e. HEAD < TAIL), the thief is certain that this condition will remain true until the popping lock is released, because the victim cannot decrement TAIL from HEAD + 1 to HEAD. Note that locking is not needed for pushing a lazy task, since this can't cause a race with the thief as long as TAIL is updated after the entry is written to the LTQ. To complete the stealing of the task, the thief increments HEAD, recreates the task by calling steal_task(*HEAD), and releases the popping lock it acquired. Unfortunately, the cost of lock operations on some machines is an order of magnitude more expensive than typical instructions. For example, the acquisition of a lock on the GP1000 is done through a system call that takes … µsecs, the equivalent of roughly … instructions. Accessing the locks would constitute the dominant cost of a future, because a lock is needed on every task pop. The next section explains how hardware locks can be avoided.

A major problem with the SM protocol is that the task stack and related data structures must be accessible to all processors. This includes the following data structures:

1. the runtime stack and UNDERFLOW_CONT,
2. the LTQ and its HEAD and TAIL pointers, and
3. CURRENT_DYNAMIC_ENV, the DEQ, and its DEQ_HEAD and DEQ_TAIL pointers.

The problem is that these data structures must be in shared memory and can't be cached optimally. The victim processor would have faster access to these data structures if they were private data. This is the prime motivation for the MP protocol described in Section. Two of these data structures can nevertheless be private, even with the SM protocol: the TAIL and DEQ_TAIL pointers. Since this is achieved in a similar way for both pointers, it will only be explained for TAIL. The idea is to maintain the following invariant: all LTQ entries above TAIL contain a special marker, for example a NULL pointer (all LTQ entries are initialized with this value). This means that, for all X > HEAD, X <= TAIL if and only if *X != NULL. The thief can thus replace the test HEAD < TAIL by *(HEAD + 1) != NULL. The victim can keep TAIL in the most convenient place; Gambit dedicates one of the processor registers. Pushing and popping an entry on the LTQ each require a single memory write to the LTQ (SP and NULL respectively) and an adjustment of TAIL. The code sequences for this method are given in the next section.

  RET = ret_point;       /* setup future body's return address */
  *++SP = RET;           /* save ret adr in continuation frame */
  *++TAIL = SP;          /* push parent task on LTQ */
  /* ... future's body ... */
ret_point:
  SM_attempt_pop ();     /* pop parent task if still there */
secondary_ret_point:
  RET = *SP--;           /* pop ret adr from continuation frame */

Figure: Code sequence for a future under the SM protocol

Avoiding Hardware Locks

Hardware locks can be avoided in the task popping operation by implementing the popping locks with any of several software lock algorithms based on shared variables, such as Dekker's algorithm [Dijkstra] and Peterson's algorithm [Peterson]. The same basic principles used by these algorithms can be adapted to design a special purpose synchronization mechanism for LTC, as described next. With the exception of the previously mentioned method to make TAIL private, this algorithm is similar to the one described in [Mohr]. The only atomic operations in these algorithms are the memory references and lock operations; increments and decrements do not have to be atomic.

The mechanism arbitrates access to the task stack during task steal and task pop operations, using only the pointers HEAD and TAIL and a lock governing mutation of HEAD (i.e. HEAD_LOCK). Note that HEAD_LOCK can be either a hardware or software lock but, because it is used infrequently in the popping operation, it doesn't really matter which type it is. The task stealing and popping operations are implemented by the procedures SM_attempt_steal and SM_attempt_pop respectively (the code is given in Figures, with the lines numbered for reference). These procedures attempt to remove a task from the task stack and indicate if the attempt was successful. SM_attempt_steal indicates failure by returning NULL; otherwise it returns a heavyweight task object corresponding to the stolen task. SM_attempt_pop indicates failure by calling the underflow handler directly; otherwise control returns to the caller. The code sequence generated for a future calls SM_attempt_pop at the future's return point, as shown in Figure. The performance of the popping operation can be improved by inlining the instructions of procedure SM_attempt_pop at the return point, or at least the first two instructions, which are the most frequently executed.

task *SM_attempt_steal (V)              /* V is victim processor */
processor *V;
{
  value *p;                             /* entry obtained from V's LTQ */
1:  if (V->HEAD[1] == NULL)             /* nothing to steal if LTQ empty */
      return NULL;
2:  acquire_lock (V->HEAD_LOCK);        /* get right to increment HEAD */
3:  V->HEAD++;                          /* increment HEAD */
4:  p = *V->HEAD;                       /* get entry from LTQ */
5:  if (p != NULL)                      /* check for conflict */
6:  { task *parent = steal_task (V, p); /* won race, recreate parent */
7:    release_lock (V->HEAD_LOCK);      /* done with HEAD */
8:    return parent;                    /* indicate success */
    }
9:  V->HEAD--;                          /* lost race, undo increment */
10: release_lock (V->HEAD_LOCK);        /* done with HEAD */
11: return NULL;                        /* indicate failure */
}

Figure: Thief side of the SM protocol

SM_attempt_pop ()
{
12: *TAIL-- = NULL;                     /* remove topmost LTQ entry */
13: if (HEAD > TAIL)                    /* check for possible conflict */
    { boolean thief_won;
14:   acquire_lock (HEAD_LOCK);         /* prevent thief from mutating HEAD */
15:   thief_won = (HEAD > TAIL);        /* definitive conflict check */
16:   release_lock (HEAD_LOCK);
17:   if (thief_won)                    /* if thief won race */
18:   { *++TAIL = SP;                   /* restore LTQ top */
19:     underflow ();                   /* jump to end_body */
      }
    }
}

Figure: Victim side of the SM protocol

In SM_attempt_steal, steal_task needs to know which task stack to access, so it is called with the victim processor as an extra argument. Also note that the operation swap_child_ret_adr_with_underflow(p) used by steal_task is equivalent to *p; the child's return address is not mutated.

Clearly, there is no possible conflict between the thief and victim when the task stack contains more than one task. The thief can increment HEAD and take the lowest entry on the LTQ at the same time that the victim voids the topmost entry (by writing NULL) and decrements TAIL. A conflict can only occur if calls to SM_attempt_steal and SM_attempt_pop overlap in time and the task stack contains a single task, that is HEAD + 1 = TAIL. The idea is to let the thief and victim blindly access the LTQ as though there was no conflict, thereby adjusting HEAD and TAIL, and only then check to see if there is a conflict, that is, check if HEAD > TAIL. When a conflict is detected, one of the two processors is selected as the winner of the race for the task and it returns success. The other processor undoes its mutation of the LTQ and returns failure. The thief detects success very simply: it is the winner if and only if the entry it reads from the LTQ at line 4 is not NULL. This entry can only become NULL if the victim voids it by executing line 12. The two possible orderings of these lines are considered next.

1. Thief executes line 4 before the victim executes line 1.

The thief has won the race. It will recreate the parent task and return it from SM_attempt_steal. Note that from this point on, HEAD will never point lower than the entry that was removed (HEAD can only increase). When the victim eventually executes line 1 with TAIL pointing to the removed entry, it will decrement TAIL to below HEAD, and consequently line 2 will detect the conflict. Line 4 of SM_attempt_pop will find the same result, so the victim will conclude that the parent was stolen and will jump to end-body.

2. Victim executes line 1 before the thief executes line 4.

The thief will lose the race because it will read NULL at line 4. Consequently, the thief will restore HEAD to its previous value at line 8. There are two subcases, depending on what the thief is doing when the victim executes line 2.

(a) Thief is not between lines 3 and 8 when the victim executes line 2.

The thief has either not yet tried to remove the entry, or has restored HEAD to the value it had just before line 3. Thus HEAD <= TAIL when line 2 is executed. The victim sees no conflict and declares success by returning from SM_attempt_pop.

(b) Thief is between lines 3 and 8 when the victim executes line 2.

The thief has not yet restored HEAD to its original value, so HEAD > TAIL. The victim thus detects a possible conflict at line 2. The reason for acquiring HEADLOCK at line 3 of SM_attempt_pop is to make sure that the thief is not between lines 2 and 9 of SM_attempt_steal when the test at line 4 of SM_attempt_pop is executed. At that point the thief will have restored HEAD, and will not mutate HEAD again because HEADLOCK is locked. Line 4 thus sees HEAD <= TAIL, causing SM_attempt_pop to return successfully. The role of line 1 of SM_attempt_steal is to ensure that the victim eventually acquires the lock at line 3 in systems where locks are not fair: it prevents new thieves from crossing line 1, so eventually the victim will be the only processor trying to lock HEADLOCK. It also avoids the overhead of attempting to steal from a processor with an empty task stack.

Thus the SM protocol satisfies the following correctness criteria:

Safety: Either the thief or the victim, but not both, will remove a given entry from the LTQ.

Liveness: An attempt to remove an entry will eventually indicate failure or success (i.e. deadlock and livelock are impossible).

Cost of a Future on the GP1000

This section describes the details of the GP1000 implementation of the SM protocol and evaluates the costs related to the evaluation of a future on that machine. As explained above, the cost of a future depends on many parameters, but mostly on whether the corresponding parent task is stolen or not.

Parent Task is not Stolen

If the parent is not stolen, the cost is simply that of pushing and popping a lazy task. Pushing a lazy task requires four steps: setting up the body's return address, setting up the arguments to the body (the closed variables), pushing the return address to the stack, and pushing the stack pointer to the LTQ. The first step typically replaces the same step that would be required in a sequential version of the program to evaluate the body (assuming it is a procedure call), so it won't be counted as overhead. Often the second step requires no instructions because the arguments are already in a location

accessible to the body (e.g. in the registers). Only the last two steps are necessary extra work with respect to a sequential version of the program. Popping a lazy task takes two steps: popping and voiding the topmost entry on the LTQ, and checking for a conflict. The popping of the return address from the stack has no cost because it can be combined with the deallocation of the continuation frame by the future's continuation.

To get a concise code sequence on the GP1000, some of the special addressing modes of the MC68020 processor were used, in particular predecrement and postincrement indirect addressing. TAIL, SP and RET are all kept in address registers. The two required steps in the lazy task push translate into two instructions, and a lazy task pop translates into three instructions, as shown below.

       movl  RET,sp@-         ; push return address to stack
       movl  sp,TAIL@+        ; push stack pointer to LTQ
       ...                    ; code for future's body
  retpoint:
       clrl  TAIL@-           ; pop and void entry on LTQ
       cmpl  HEAD,TAIL        ; compare head and tail
       bcs   conflict         ; jump to handler if conflict
  secondary_retpoint:
       ...                    ; code for future's continuation

(Here TAIL and RET denote the dedicated address registers that hold them.)

Note that the stack grows downward on the MC68020. Of the five instructions, three are writes to shared memory; the whole sequence accounts for a run time of a few microseconds. The assembly code generated for the SM protocol when compiling the fib benchmark is given later in this chapter.

Parent Task is Stolen

To the above cost must be added the extra work performed as a consequence of the steal. Assuming that there is always a single return from the future's body, the thief and victim will perform the following operations.

Thief:

1. Heapify the parent continuation.

2. Find the parent's dynamic environment.

3. Allocate new objects. This includes the allocation and initialization of the parent task, result and legitimacy placeholders, and end-frame.

4. Resume the parent task. Note that only the first continuation frame needs to be restored.

  Operation                                  Instruction count
  ---------------------------------------------------------------------
  steal_task (excluding heapify_frame
    and recover_dyn_env)                     a constant
  heapify_frame                              grows linearly with f and s
  recover_dyn_env                            grows linearly with b
  resume_task (excluding underflow)          a constant
  underflow                                  grows linearly with s'
  determine                                  a constant when w = 0;
                                             grows linearly with w otherwise
  idle (only accounts for search)            a constant when n = 0;
                                             grows linearly with n otherwise

Table: Cost of the operations involved in task stealing

Victim:

1. Invoke end-body. This is performed by the underflow handler.

2. Terminate the child. The result and legitimacy placeholders get determined, and then control goes to idle.

3. Find new work. The victim must find a runnable task to resume. The task either comes from the victim's HTQ or is stolen from another processor.

In addition, there is a cost for restoring the other frames of the parent continuation heapified in step 1 of the thief's operations. This is done at least in part by the thief, but maybe also by some other processors if the parent task migrates to other processors.

The table above gives the cost of the operations involved in task stealing; the costs correspond to the number of machine instructions executed in Gambit's encoding of the algorithms. In this table, f is the number of frames heapified (which is the number of frames separating the future from the enclosing future), s is the number of values on the stack, b is the number of dynamic variable bindings that were added to the dynamic environment since the enclosing future, s' is the size of the continuation frame to restore, w is the number of tasks on the placeholder's waiting queue, and n is the number of processors that were considered in the search for a runnable task (n = 0 when the task is found in the local HTQ). Note that these costs do not account for the location (i.e. local vs. remote memory) of the data being accessed.

From the table can be derived the approximate costs associated with the victim (T_victim), the thief (T_thief) and the processors that restore the parent's continuation (T_underflow). T_victim covers the underflow that invokes end-body, the determining of the result and legitimacy placeholders (a term in w), and the search for new work (a term in n). T_thief is a linear function of f, s and b, combining the costs of steal_task, heapify_frame, recover_dyn_env and resume_task. T_underflow is a linear function of f and s', the cost of the underflow handler calls that restore the remaining frames of the heapified continuation.

The minimal cost corresponds to the smallest values of f, s, b, w and n. In a more realistic situation the frames will be larger and more numerous, so the cost of heapification and underflow will increase, and with larger values of s and f the total cost grows by a corresponding number of instructions.

Impact of Memory Hierarchy on Performance

An unfortunate requirement of the SM protocol is that all processors must have access to the task stack's data structures, in particular the runtime stack and LTQ. Making these structures accessible to all processors has a cost, because it precludes the use of the more efficient caching policies. The runtime stack and the LTQ are read and written by the victim but are only read by thief processors; thus they are single writer shared data and can be cached by the victim using the write-through caching policy, as explained in an earlier section. This however is not as efficient as the copy-back caching policy normally used in single processor implementations of Lisp. For typical Lisp programs, caching of the stack will likely be an important factor, since the stack is one of the most intensely accessed data structures. Caching of the LTQ will also be an important factor for parallel programs with small task granularity, because each evaluation of a future causes a few memory writes to the LTQ and stack (three in the SM protocol). Although this may not seem like much at first sight, the cost of a memory write to a write-through cached location on modern processors, such as the M88000 processors in the TC2000, is many times larger than the cost of a non-memory instruction or a cache hit (read or write to a copy-back cached location). Note that this is not an issue on the GP1000, which lacks a data cache.

But how large is the performance loss due to a suboptimal caching policy? To better understand the importance of caching on performance, it is useful to analyze the memory access behavior of typical programs. The run time of a Lisp program can be broken down into the time spent accessing data in memory and the time spent on

pure computation. Memory accesses can further be broken down into two categories: accesses to the stack and accesses to the heap. Thus a program is described by the three parameters S (stack), H (heap) and C (pure computation), which represent the proportion of total run time spent on each category of instructions (S + H + C = 1). For reference purposes, these parameters are defined with respect to an implementation where the stack and heap are not cached (i.e. all accesses go to local memory).

Some experiments were conducted to measure the value of S, H and C for each benchmark program on both the GP1000 and TC2000. All these programs were run on a single processor as sequential programs (futures and touches were removed from the parallel benchmarks). The run time of each program was measured in three different settings. The first run was with the stack and heap located in non-cached local memory. The second run was with the stack located in remote memory (on another processor) so that each access to the stack would cost more. The final run was with the heap in remote memory. The three run times are respectively T, T_S and T_H. Now, since the relative cost R of a remote access with respect to a local access is known on each machine, a system of three linear equations is obtained:

    S + H + C = 1
    S*R + H + C = T_S / T
    S + H*R + C = T_H / T
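Subtracting the first equation from each of the others isolates S and H. The following is a minimal sketch of the solution (the procedure name and example values are hypothetical, not taken from the thesis):

    ; Solve the three equations above for S, H and C, given the baseline
    ; run time T, the stack-remote and heap-remote run times Ts and Th,
    ; and the remote/local cost ratio R.
    (define (solve-shc T Ts Th R)
      (let* ((S (/ (- (/ Ts T) 1) (- R 1)))  ; eq. 2 minus eq. 1
             (H (/ (- (/ Th T) 1) (- R 1)))  ; eq. 3 minus eq. 1
             (C (- 1 S H)))                  ; from eq. 1
        (list S H C)))

    ; Example with made-up numbers: R = 11, stack-remote run 3 times
    ; slower, heap-remote run 1.5 times slower:
    (solve-shc 1 3 3/2 11)  ; => (1/5 1/20 3/4)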

This system can easily be solved to find the values of S, H and C. Note that this model does not take into account factors such as the pipelining of instructions by the processor and the difference in cost between reads and writes. Also note that the values are dependent on the quality of the code generated by the compiler, but because an optimizing compiler was used, the measurements are representative of a high-performance system. As a sanity check, the values of S, H and C obtained on the TC2000 were used to predict the run time of each program when the stack is cached with the copy-back policy. Assuming that the cache hit ratio for the stack is close to 100% (which is reasonable due to the high locality of stack accesses), the run time should be T*(S/K + H + C), where K is the relative cost of a local memory access with respect to a cache access. For most programs the prediction was within a few percent of the actual run time; only three programs (fib, mm and sum) showed a larger difference. This suggests that the values obtained for S, H and C are reasonably close to reality.
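As a check of the formula itself, the prediction amounts to (a small sketch under the same assumptions as above):

    ; Predicted run time when the stack is copy-back cached: stack
    ; accesses become K times cheaper, heap and computation unchanged.
    (define (predict-runtime T S H C K)
      (* T (+ (/ S K) H C)))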

[Table: Measurements of memory access behavior of benchmark programs. For each program (boyer, browse, cpstak, dderiv, deriv, destruct, div, puzzle, tak, takl, traverse, triangle, compiler, conform, earley, peval, abisort, allpairs, fib, mm, mst, poly, qsort, queens, rantree, scan, sum and tridiag) the table gives S, H, C and O_RemHeap on the GP1000, and S, H, C, O_RemHeap, O_None and O_WT on the TC2000.]

Table: Measurements of memory access behavior of benchmark programs

[Figure: two plots, one for the GP1000 and one for the TC2000, positioning each benchmark program in S-H space; the names of the parallel benchmarks are boxed.]

Figure: Relative importance of stack and heap accesses of benchmark programs

These additional measurements were also taken:

O_RemHeap: the overhead of locating the heap in remote memory rather than local memory, when the stack is cached optimally (i.e. no caching on the GP1000 and copy-back caching on the TC2000). This value is a good indicator of the overhead that will appear, due to the sharing of user data, if the program is run in parallel (assuming user data gets distributed uniformly to all processors, the number of processors is large, and there is little contention).

O_None (TC2000 only): the overhead of not caching the stack, rather than using copy-back caching.

O_WT (TC2000 only): the overhead of caching the stack with write-through caching rather than with copy-back caching.

The measurements are given in the table above, and the figure above presents this data in a more readable form (plots in S-H space).

A few observations can be made from the figure. Firstly, most of the programs access the stack more often than the heap (i.e. all the programs below the S = H line). This tendency is even more pronounced for the parallel benchmarks (i.e. the boxed names in the plots). This is to be expected, since the majority of the parallel benchmarks are based on recursive DAC algorithms.

Secondly, the importance of memory accesses is greater on the TC2000 than on the GP1000 (i.e. the position of a given program on the S-H plane is further from the origin). This is in agreement with the well known fact that modern processors need caches, and a high hit rate, to keep them going at peak speed. Most of the programs actually spend more time accessing memory than doing pure computation when run on the TC2000 (C is below one half). As indicated by column O_None of the table, copy-back caching the stack provides an important performance gain, in some cases a large one.

The last column in the table, O_WT, is of special interest because it reflects the cost of suboptimally caching the stack to support the SM protocol. The overhead of using write-through caching rather than copy-back caching can be substantial, and both the median and the average overheads are higher for the parallel benchmarks than for the sequential ones. Note also that the cache on the TC2000 is not very fast (only a modest factor faster than local memory). Some machines have caches which operate several times faster, with a corresponding increase in O_WT. The objective of the MP protocol is to avoid this overhead altogether.

The Message-Passing Protocol

If the role of the thief in the SM protocol is analogous to a pickpocket, in the MP protocol stealing a task is analogous to a holdup, because the victim actively cooperates with the thief. To initiate a task steal, the thief sends a steal request message to the victim and starts waiting for a reply. The victim eventually interrupts its current execution and calls a steal request handler routine to process the message. This handler checks the task stack and, if a lazy task is available, recreates the oldest task and sends it back to the thief. Otherwise, a failure message is sent back to the thief, which must then try stealing from some other processor. The victim then resumes the interrupted computation.

There are several advantages to this protocol. Firstly, it relies less on an efficient shared memory. All the data structures comprising the task stack are private to each processor. The stack, LTQ, DEQ and associated pointers can all be cached with copy-back caching. All programs which use the stack and/or dynamic scoping will thus benefit, whether they are sequential or parallel. Parallel programs will in addition benefit from the caching of the LTQ, which reduces the cost of pushing and popping lazy tasks.

Secondly, it is possible to handle the race condition more efficiently than in the SM protocol, because all task removals from the task stack are performed by its owner. Preventing the race condition between task steals and task pops is as simple as inhibiting interrupts for the duration of the task pop. This can be achieved by adding a pair of instructions around the task popping sequence to disable and then reenable interrupts to the processor. The method used by Gambit is to detect interrupts via polling, and never check for interrupts inside the popping sequence (efficient polling is explained in the next chapter). There are other methods that have no direct overhead. For example, in the instruction interpretation method [Appel], the hardware interrupt handler checks to see if the interrupted instruction is in an uninterruptible section (i.e. a popping sequence). If it is, the rest of the section is interpreted by the interrupt handler before the interrupt is serviced. Other zero cost techniques are described in [Feeley].

Thirdly, the operation swap_child_ret_adr_with_underflow(p) can be implemented according to its original specification (i.e. an actual mutation of the child's return address), thus avoiding the push of the body's return address to the stack and the explicit check for underflow at the future's return point. The sequence generated for a future only has to push an entry to the LTQ before evaluating the body, and to decrement TAIL at the future's return point. Doing this in the SM protocol was not possible because the thief could not know where the victim had stored the return address r. In the MP protocol, r can be located in several ways:

1. Scanning the stack downward from the top. The system can be designed so that the steal request handler is always called in the same way as a subproblem call. This is fairly easy to do when the system detects interrupts through polling, because the call to the handler is a subproblem call. For a system that uses hardware interrupts it is more complex, but still possible. (For example, a table could be set up with a description of the register allocation for every instruction in the program; this description indicates, among other things, where the parent return address is located when the instruction is executed. The table is used by the handler to build a correctly formatted continuation frame for the return to the interrupted code.) Thus, when the handler is executed, SP and RET can be used to parse the content of the stack. The handler can walk back through the frames until the frame directly above p is found. At this point the format of this frame is known, so r can be accessed directly. This approach may be expensive, since there can be an arbitrary number of frames above p at the moment the steal request is received.

2. Scanning the stack upward from p. Assuming the handler is always called as a subproblem, either r has been saved to the stack by the child's outermost subproblem call, or it has been saved in the continuation frame for the call to the handler. Thus, when the handler is executed, r will necessarily be the first return address above p on the stack (i.e. the return address in the frame directly above p). An upward search of the stack, starting from p and stopping at the first return address, will locate r. It is assumed here that the values on the stack are tagged, at least to the extent of allowing return addresses to be distinguished from other values. It is also assumed that return addresses are not first-class objects and that return addresses are never saved to more than one location. Achieving this might require a close coupling of the steal request handler, interrupt system and compiler. The cost of finding r with this method is O(n), where n is the size of the frame above p. This method is used by Gambit. Gambit makes an effort to lessen the cost of the search by using heuristics that favor the saving of the return address in the lower end of continuation frames.
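As a concrete illustration, here is a minimal sketch of the upward scan. The helpers stack-ref and return-address? are assumed (they are not Gambit's actual primitives), and the stack is modeled as a vector of tagged values indexed from the bottom:

    ; Scan upward from the task boundary p until the first return
    ; address; assumes stack slots hold tagged values so return
    ; addresses can be recognized.
    (define (find-ret-adr stack p)
      (let loop ((i p))
        (if (return-address? (stack-ref stack i))
            i                    ; r is the first return address above p
            (loop (+ i 1)))))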

Finally, in the MP protocol it is the victim that is in charge of creating the parent task, its continuation and related structures. By allocating these structures in the victim's local memory, steal_task avoids remote memory accesses and thus completes faster than in the SM protocol. Remote memory accesses are performed by the thief when it resumes the task, but strictly on demand. The parent task may actually start executing sooner than with the SM protocol, because only the parent task object and its first continuation frame need to be transferred from victim to thief. The total number of remote memory accesses may also be smaller if the parent's continuation is not used fully by the thief, for example if the parent task migrates to another processor.

The disadvantages of the MP protocol are examined later in this chapter.

Really Lazy Task Creation

The basic idea of LTC is to defer the creation of heavyweight tasks to the moment they are known to be required, that is, when they are stolen. This usually saves a lot of work because non-stolen tasks are handled at very low cost, and the cost of stealing a task is roughly the same as creating a heavyweight task in the first place. In the MP protocol, the cost of a non-stolen task is two instructions. This cost can actually be removed completely by doing more work when the task is stolen. Notice that the only purpose of the LTQ is to facilitate the reverse parsing of the stack (i.e. from bottom to top) to find the task continuation boundary of the lowest task. Finding the task continuation boundaries can however be done by parsing the stack from top to bottom and checking for return addresses to future return points. As explained previously, this parsing can be done by the steal request handler. The problem with this method is that the cost of stealing is not bounded, since all the stack must be parsed. Fine grain programs with shallow recursions may nevertheless perform better with this method if most tasks are not stolen. Due to its worst-case behavior, and the fact that it saves only two inexpensive instructions, this method is not very appealing for general use.
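To make the alternative concrete, here is a minimal sketch of such a full-stack parse. The helpers stack-ref and future-ret-point? are assumed, and the stack is modeled as a vector indexed from 0 at the bottom up to sp at the top:

    ; Scan the whole stack from the top down and return the index of the
    ; lowest return address that points to a future return point (the
    ; boundary of the oldest task), or #f if there is none.  The cost is
    ; proportional to the stack depth, hence unbounded.
    (define (find-lowest-boundary stack sp)
      (let loop ((i sp) (lowest #f))
        (if (< i 0)
            lowest
            (loop (- i 1)
                  (if (future-ret-point? (stack-ref stack i))
                      i
                      lowest)))))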

Communicating Steal Requests

The algorithms for the thief and victim sides of the MP protocol are shown in the two figures below. Even though they are based on a message-passing paradigm, these algorithms implement the communication using the shared variables THIEF and REPLY. In addition, the parent task is also communicated through shared memory. The victim's THIEF variable is set by the thief so that the victim can tell which processor has sent the steal request. It is also used to indicate the presence of a steal request (there is a steal request when THIEF != NULL). A thief's REPLY variable is set by the victim in response to a steal request. After the thief has sent a request, it busy-waits until the victim responds by setting the REPLY variable to the task that was stolen, or

  task MP_attempt_steal( V )            /* V is victim processor */
  { processor V;
1:  REPLY = NONE_YET;                   /* initialize with special marker */
2:  V->THIEF = CURRENT_PROCESSOR;       /* tell victim who the thief is */
3:  raise_interrupt( V );               /* get victim to process the request */
4:  while (REPLY == NONE_YET) ;         /* busy-wait until victim replies */
5:  return REPLY; }

Figure: Thief side of the MP protocol

  interrupt_handler()
  {
1:  if (THIEF != NULL)                  /* check for a steal request */
    {                                   /* the steal request handler */
2:    processor T = THIEF;              /* get pointer to thief */
3:    THIEF = NULL;                     /* set it up for next request */
4:    if (HEAD != TAIL)                 /* anything on the task stack? */
5:      T->REPLY = steal_task( *HEAD ); /* send oldest task to thief */
      else
6:      T->REPLY = NULL;                /* indicate failure to thief */
    }
    ...                                 /* check other sources of interrupts */
  }

Figure: Victim side of the MP protocol

to NULL if the victim had an empty task stack. Note that the interrupt handler can get invoked for other reasons than the call to raise_interrupt at line 3 (assuming all types of interrupts go through interrupt_handler). This means that the victim might detect the steal request at line 1 of the handler as soon as line 2 of MP_attempt_steal is executed. Consequently, it is important for the thief to initialize REPLY before line 2. THIEF must also be reset (line 3 of the handler) before the reply is sent back. In the reverse order, a deadlock might occur if a second steal attempt executes line 2 before THIEF is reset: the victim would be unaware of the second request and would never send a reply back to the thief; the thief would thus busy-wait forever.

The implementation of raise_interrupt will depend on the interrupt handling mechanism. If polling is used, then raise_interrupt can simply raise the victim's interrupt flag; the cost is that of a remote memory access. (The advantage of having REPLY in the thief's local memory is that the busy-waiting does not create any traffic on the memory interconnect.) Sometime after this, the victim will detect the interrupt and call interrupt_handler. Note that this requires the interrupt flag to be multiple writer shared data, so it can't be cached by the victim or any other processor. Other systems send interrupts to other processors through dedicated hardware in the interconnect (the CM-5, for example). Sending an interrupt on these systems might require a system call. Clearly, the cost will vary according to the features of the machine and operating system.
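Under the polling approach, the request side is just one remote write. A minimal sketch (the flag accessor name is assumed, not Gambit's actual one):

    ; Raise the victim's interrupt flag, which lives in shared memory;
    ; the victim notices it at its next interrupt check and calls the
    ; handler.  processor-intr-flag-set! is a hypothetical accessor.
    (define (raise-interrupt victim)
      (processor-intr-flag-set! victim #t))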

Potential Problems with the MP Protocol

The MP protocol has a number of characteristics that enhance performance, but also some others that degrade it. This section examines the detrimental aspects and briefly discusses their severity. An important question is whether the performance gains are more important than the losses. This question will not be answered fully here, because there are too many performance related parameters to consider; a later chapter will instead evaluate the performance of the MP and SM protocols experimentally.

Busy-Waiting

The most obvious problem with the MP protocol is that the busy-wait for the reply wastes processing resources. The total time wasted by the thief is the time it takes before the victim sends back the reply. This is the steal latency. The steal latency is the sum of the time needed by the victim to detect the steal request (T_detect) and the time to process the request (T_process). If the request is successful, T_process is roughly the time required to call steal_task (T_steal_task); otherwise T_process is close to zero.

The time wasted by the busy-wait must be put in context. If the steal is successful, the thief receives a task after wasting T_detect + T_steal_task of its time and taking T_steal_task time away from the victim, so the total amount of work expended to get the task is T_detect + 2*T_steal_task. If T_work is the time the thief spends running the stolen task before another task needs to be stolen, the overhead costs for stealing the task in the MP and SM protocols are:

    O_MP = (T_detect + 2*T_steal_task) / T_work

    O_SM = T_steal_task / T_work

O_MP and O_SM are hard to compare, because T_steal_task for the SM protocol is larger than for the MP protocol due to the additional remote memory accesses. If the penalty of a remote memory access is sufficiently low, O_SM will be lower than O_MP. However, the difference will be small when T_work is large relative to T_steal_task and T_detect. This is helped by the fact that LTC tends to increase the effective granularity of programs (i.e. the granularity of heavyweight tasks), and T_work is directly related to the effective granularity. However, an increase in the number of processors tends to decrease the effective granularity, thus increasing the importance of O_MP relative to O_SM.
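Expressed as a small sketch, using the quantities just defined:

    ; Overhead of obtaining one stolen task, per unit of useful work,
    ; following the two expressions above.
    (define (o-mp t-detect t-steal-task t-work)
      (/ (+ t-detect (* 2 t-steal-task)) t-work))

    (define (o-sm t-steal-task t-work)  ; note: t-steal-task is larger
      (/ t-steal-task t-work))          ; under SM (remote accesses)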

Speed of Work Distribution

The speed at which work gets distributed to the processors is dependent on the steal latency. Distributing work quickly is crucial to fully exploit the machine's parallelism. It is especially important at the beginning of the program (or, more precisely, at every transition from sequential to parallel execution), because all processors are idle except one. Reducing the steal latency not only gets processors working sooner, but also allows these processors to generate new tasks sooner for other processors. The MP protocol has a potentially smaller steal latency than the SM protocol, but only if T_detect is kept small. Unfortunately, minimizing T_detect may increase the cost of other parts of the system, thus creating a tradeoff situation. As explained in the next chapter, polling will become more expensive because interrupts need to be checked more frequently.

Interrupt Overhead

Finally, the cost of failed steal requests is a concern, because the victim pays a high price for getting interrupted but this serves no useful purpose. The victim might get requests at such a high rate that it does nothing else but process steal requests. For example, a continuous stream of steal requests will be received by the victim if it is executing sequential code and all other processors are idle. The problem here is that processors are too secretive. No information about the task stack is shared with other processors, so the only way for a thief to know if the victim has some work is to send it a steal request.

A simple solution is to have each processor regularly save out HEAD and TAIL in a predetermined shared-memory location. Before attempting a steal, the thief checks the copy of HEAD and TAIL in shared memory to see if a task might be available. For thief processors this snapshot only reflects a previous state of the task stack, but if it is updated frequently enough, its correlation to the current state will be high. If the snapshot indicates a non-empty task stack, it is thus likely that the steal attempt will be successful. Gambit always keeps HEAD in shared memory, so it does not need to be saved out (this does not affect performance because the victim accesses HEAD infrequently); TAIL is saved out on every interrupt check.
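The thief-side filter then amounts to a check of the published snapshot. A minimal sketch (snapshot-head and snapshot-tail are assumed accessors for the shared-memory copies):

    ; Consult the victim's last published snapshot of HEAD and TAIL
    ; before paying the price of a steal request.  The snapshot may be
    ; stale, but a non-empty snapshot makes success likely.
    (define (worth-stealing? victim)
      (not (= (snapshot-head victim)
              (snapshot-tail victim))))  ; empty iff HEAD = TAIL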

Unfortunately, this strategy reduces the speed of work distribution, because thieves can only become aware of a task's presence at the next interrupt check. Performance is not affected if the task stack was not empty at the last interrupt check. However, if the task stack was empty, the newly created task can at best be stolen at the second following interrupt check: the first interrupt check will announce the task's presence to the thieves, and the steal request will be handled, at best, at the second interrupt check. Since a processor's task stack is empty immediately after it has stolen a task, it is important to have a low interrupt check latency so that work can spread quickly to idle processors.

Code Generated for SM and MP Protocols

This section compares the code generated for a small program when using the SM and MP protocols on the GP1000. The program used here is the benchmark fib. The figure below shows the MC68020 assembly code generated for fib under each protocol.

The following information will be useful to understand the code. Integer objects are 8 times their value, because the three lower bits are used for the type tag. When fib is called, the return address and the parameter n are passed in registers (the same data register that carries n is also used to return fib's result). Several registers have a dedicated role: one address register contains TAIL, another points to the interrupt flag and processor local data, one data register holds a mask to test for placeholder objects, and another holds a private counter used to perform interrupt checks intermittently (this counter is explained in the next chapter).

The boxed parts of the figure contain the instructions that relate to polling and the parallelization of fib; the rest of the code is identical in both protocols (except for one instruction, which differs due to one of the compiler's stack allocation optimizations). A sequential version of fib is obtained by removing the boxed parts from the code. One parallelization cost common to both protocols is the touch operation: of its three instructions, only the first two are executed when a non-placeholder is touched. The most important difference between the protocols is in the lazy task push and pop operations. Together these take two instructions in the MP protocol,

    (define (fib n)
      (if (< n 2)
          n
          (let ((f1 (FUTURE (fib (- n 1))))
                (f2 (fib (- n 2))))
            (+ (TOUCH f1) f2))))

[Figure: MC68020 assembly code generated for fib, shown side by side for the message-passing protocol and the shared-memory protocol. The boxed parts of each listing are the instructions that relate to polling and to the parallelization of fib: the lazy task push (one instruction under MP, two under SM), the intermittent interrupt check (whose body under MP contains one extra instruction, to save out TAIL), the lazy task pop (one instruction under MP; a clear, a compare and a conditional branch to the conflict handler under SM) and the three instruction TOUCH sequence.]

Figure: Assembly code generated for fib

compared to the five instructions required in the SM protocol. Notice that in both protocols one label is the future's return point and a second is the secondary return point, which jumps past the popping sequence (the frame description information has been removed from the code for clarity). The other difference is in the interrupt check sequence. The code for the MP protocol has one more instruction, to save out TAIL. However, this instruction is in the body of the interrupt check sequence, which is executed only on a small fraction of the checks (when the private counter runs out). The only accesses to shared memory in the MP protocol are in the body of the interrupt check sequence: a test of the interrupt flag and the saving of TAIL.

Summary

ETC is not an adequate implementation of futures, because the overhead of creating a heavyweight task for each future is too high for fine grain programs. LTC postpones the creation of the heavyweight task until it is known to be required. This only happens when another processor needs work, or there is a task suspension, a preemption interrupt, a stack overflow, or a call to call/cc. To do this, LTC uses a lightweight task representation that contains enough information to recreate the corresponding heavyweight task. Lightweight tasks are put in a local task stack that is accessed by three operations: push, pop and steal. A future translates to pushing the parent task onto the task stack, evaluating the future's body, and then popping the parent task to resume it (assuming it is still on the task stack). Since a task is essentially a continuation, a future is nothing more than a special procedure call. The task stack is the runtime stack plus a table (the LTQ) that indicates the extent of each continuation on the stack. In principle, the push and pop operations are only one instruction apiece. The Katz-Weise continuation semantics and dynamic scoping have no cost for non-stolen tasks, because the associated support operations (i.e. copying the future's continuation and the dynamic environment) can also be postponed to the time of the steal.

Thief processors access the task stack from the bottom (the oldest task is stolen first). In divide-and-conquer algorithms this has the advantage of reducing the number of task steals required, because the task containing the most work is transferred between processors.

A critical issue is which processor extracts the task from the task stack at the time of a steal. In the shared-memory (SM) protocol, the thief accesses the victim's stack and LTQ directly to steal the task. Careful synchronization between the thief and victim is needed to avoid a steal and pop of the same task. An unfortunate consequence of

the SM protocol is that the stack and LTQ must be accessible to all processors, so they can't be cached optimally on a machine such as the TC2000. This suboptimal caching of the stack causes a sizeable overhead, because the stack is one of the most frequently accessed data structures. In the message-passing (MP) protocol, the stack and LTQ are only accessed by the owner processor, so they can be fully cached. The thief sends a work request message to the victim, which sends back a task from its task stack if one is available. One of the important issues for the MP protocol is the interrupt latency. If it is too large, then the thief will lose precious time busy-waiting, and it will hinder the exploitation of the machine's parallelism because work distribution will be slow.

Chapter

Polling Efficiently

The message-passing implementation of LTC relies on a mechanism to communicate messages asynchronously from one processor to another. Such a mechanism must have the ability to interrupt a processor at any time. Conceivably, this could be done using some special feature of the hardware (e.g. interrupt lines of the processor) or the operating system (e.g. the Unix signal system). Unfortunately, these solutions are not very portable, and a suitable performance cannot be guaranteed across a range of machines. Instead, it is better to consider software methods that are portable and provide a finer control of performance.

The idea behind software methods is rather simple. Each processor has a flag in shared memory that indicates whether or not that particular processor has a pending interrupt. The processor periodically checks (i.e. polls) this flag and traps to an interrupt handling procedure when it discovers that the flag has been raised. The interrupt check code necessary for polling the flag is added by the compiler to the normal stream of instructions required for the program. This unfortunately means that there is an overhead cost for any program, even if interrupts never occur. Minimizing this overhead is thus an important goal.

In theory, the compiler could arbitrarily reduce the polling overhead O_poll by decreasing the proportion of executed interrupt checks with respect to the normal instructions executed by the program. If all instructions take unit time, then O_poll = N_poll / N_instr, where N_poll is the number of interrupt checks executed and N_instr is the number of non interrupt check instructions executed. This strategy lowers the frequency of interrupt checking and consequently increases the time between an interrupt request and the actual acknowledgement by the processor. Average latency L and polling overhead are inversely related by

    L = (N_poll + N_instr) / N_poll = 1 + 1/O_poll

Note that interrupt latency here refers to the time interval between interrupt checks, and not the time between an interrupt request and its acknowledgement. Here, latency is expressed in number of instructions. To account for non-unit time instructions, latency can be expressed in units of time or number of machine cycles. This leads to the definitions O_poll = T_poll / T_instr and

    L = (T_poll + T_instr) / N_poll

where T_poll is the total time spent on interrupt checks and T_instr the time spent on other instructions. If an interrupt check takes k units of time on average, then

    L = k * (1 + 1/O_poll)

To simplify the discussion, all instructions will be assumed to take unit time.
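For instance (hypothetical numbers): with interrupt checks costing k = 2 time units and a target polling overhead of 5%, the average latency is 2 * (1 + 1/0.05) = 42 time units between checks. A one-line sketch of this arithmetic:

    ; Average latency between interrupt checks, given the per-check cost
    ; k and the polling overhead o-poll (both example values are made up).
    (define (average-latency k o-poll)
      (* k (+ 1 (/ 1 o-poll))))

    (average-latency 2 1/20)  ; => 42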

As explained in the previous chapter, increasing the interrupt latency is detrimental to parallel programs, because it will take longer to respond to steal requests. This limits the rate at which work can get distributed to other processors. Thus there is a tradeoff between overhead and latency. High latency is preferable for sequential code because the polling overhead is low, and low latency is best for parallel code because parallelism can be exploited better. The importance of latency is actually more subtle than this simple statement suggests. A high latency may be appropriate for applications where tasks often suspend on undetermined placeholders. Tasks that become ready following a determine are made available to other processors by placing them on the HTQ. The HTQ is conveniently accessed through shared memory, making it impervious to interrupt latency. If most of the tasks migrate in this fashion to the HTQ, a low latency may not significantly improve the rate of work distribution.

An optimal latency for all programs does not exist, because the ratio of sequential to parallel code differs from program to program. The compiler could select a latency that suits the needs of the particular program or procedure being compiled. Even if the compiler had enough information to make such a decision, this strategy is still questionable. Latency requirements vary at runtime, as the program switches back and forth between a sequential and parallel mode of execution. A procedure might be called both when latency requirements are low and high, and so a fixed polling frequency will give suboptimal performance. One could imagine having multiple versions of each procedure with varying polling frequencies, but this introduces new problems.

Instead of further exploring such ad hoc strategies, this chapter addresses the problem of efficiently achieving a particular latency with the use of polling. It will be assumed that code duplication is not permitted. The next chapter explores the effect of interrupt latency on the performance of the parallel benchmark programs. The results indicate that a particular choice of latency performs well for a wide range of programs.

The Problem of Procedure Calls

Although polling seems simple enough to implement, there is a complication. Normally, programs are not composed of a single stream of instructions. If this were the case, the compiler could simply count the instructions it emits and insert an interrupt check after every so many instructions. Branches and procedure calls can alter the flow of control in unpredictable ways, and so it isn't clear how the compiler can achieve a constant number of instructions between interrupt checks. A reasonable compromise is to ask of the compiler to emit interrupt checks such that a given latency L_max is never exceeded.

Code Structure

To explore the problem further, it is convenient to introduce a formalism to describe the structure of a procedure's code. In general, the code of a procedure can be viewed as a graph of basic blocks of instructions. There are two special types of basic blocks: entry points and return points. There is a single entry point per procedure, and one return point for each procedure call in subproblem position.

The only place where branches are allowed is as the last instruction of a basic block. There are four types of branches: local branches (possibly conditional) to other basic blocks of the same procedure, tail calls to procedures (i.e. reductions), non-tail calls to procedures (i.e. subproblems), and returns from procedures. Local branches and non-tail calls are not allowed to form cycles, and thus they impose a DAG structure on the code. Loops can only be expressed with tail calls.

Note that subproblem and reduction calls always jump to entry points, and that procedure returns always jump to return points. These restrictions are important because they simplify the analysis of a program's control flow.

The figure below gives the graph for the procedure foreach, which contains all four types of branches. Returns and tail calls have been represented with dotted lines because they do not correspond to DAG edges. Solid lines are used for subproblem calls, to highlight the fact that, just like direct branches, it is known where control continues after the procedure returns (if it returns at all). The generality of the DAG is only needed to express the sharing of code. For the moment, it is sufficient to make the simplifying assumption that the DAG has been converted into a tree by duplicating each shared branch. The handling of shared code is described later in this chapter.

    (define (foreach f l)
      (if (null? l)
          #f
          (begin
            (f (car l))
            (foreach f (cdr l)))))

[The accompanying graph shows the basic blocks of foreach: the entry point tests (null? l); one branch is a procedure return, while the other performs the subproblem call (f (car l)), whose return point then makes the tail call (foreach f (cdr l)).]

Figure: The foreach procedure and its corresponding code graph

A necessary condition for any polling strategy is that an inline sequence of more than L_max instructions is never generated without an intervening interrupt check. The compiler can exploit the code structure for this purpose. A locally connected section is any subset of the basic blocks that is connected by local branches only (for example, the three basic blocks at the top of the figure above, or the bottom one). For any instruction i in a locally connected section, it is easy to determine what instructions are on the path to i from the section's root. These instructions are exactly those that are executed at runtime before i. Thus, for any instruction in a locally connected section, the compiler can tell how far back the last interrupt check occurred (assuming there is one on the same path from that section's root). The number of instructions that separate an instruction from the previous interrupt check is called the instruction's delta. When the delta reaches L_max, an interrupt check is inserted by the compiler before the instruction.

Call-Return Polling

Polling strategies differ in how the transition between locally connected sections is handled. Call-return polling is a simple polling strategy that consists of putting an interrupt check as the very first instruction of each section's root. Since the root of a section is either the entry point of the procedure or the return point of a subproblem call, this corresponds to polling on procedure call and return.

(For instructions that are not preceded by an interrupt check in the same section, the definition of delta will vary according to the polling strategy.)

    (define (make-person name age gender) (vector name age gender))
    (define (person-name x)   (vector-ref x 0))
    (define (person-age x)    (vector-ref x 1))
    (define (person-gender x) (vector-ref x 2))

    (define (sum vect l h)             ; sum vector from l to h
      (if (= l h)
          (vector-ref vect l)
          (let* ((mid (quotient (+ l h) 2))
                 (lo (sum vect l mid))
                 (hi (sum vect (+ mid 1) h)))
            (+ lo hi))))

Figure: Two instances of short lived procedures

There are several variations on this theme. The interrupt check at the return point can be removed if checks are put on all return branches. Similarly, the interrupt check at the entry point can be replaced by checks on branches to procedures (both tail calls and non-tail calls). The four possible variations give equivalent dynamic behavior (i.e. the same number of interrupt checks executed), but one may be preferable to the others if it yields more compact code. This depends on the particular code generation techniques used by the compiler and the programs being compiled. Compactness of code is not a big issue here, so it won't be considered further.

Short Lived Procedures

Unfortunately, call-return polling can break down in certain circumstances. The worst case occurs when procedures are short lived, that is, they return shortly after being called. At least two interrupt checks are performed per procedure call in subproblem position (once on entry and once on exit), and one if it is a reduction. This is a significant overhead if the procedure contains few instructions. This would not be a serious problem in languages that promote the use of large procedures, but in Lisp it is common to arrange programs into many short procedures.

Two instances of this style, typified in the figure above, are the implementation of data abstractions and divide and conquer algorithms. The latter situation is especially relevant, because in Multilisp, parallelism is frequently expressed using divide and conquer algorithms. In binary divide and conquer algorithms, at least half of the recursive calls

[Figure: The maximal delta method. A procedure P is entered from several call sites with different deltas; the delta assumed at P's entry is m, the maximum over all call sites, and interrupt checks (dark rectangles) are placed so that no path exceeds L_max instructions.]

Figure: The maximal delta method

correspond to the base case. If the algorithm is fine grained, such as the procedure sum, the overhead of polling will be noticeable because all the leaf calls are short lived.

Putting an interrupt check at every section's root is a very conservative method that doesn't take the structure of the program into account. If it is known that a procedure P is always called when delta is equal to n, then the compiler could infer that the first instruction in P has a delta of n. This would introduce a grace period of L_max - n instructions at P's entry point, during which interrupt checks are not needed. A similar statement holds for return points. Note that this yields a perfect placement of interrupt checks if it is carried out at all procedure entry and return points: interrupt checks occur exactly every L_max instructions.

A more realistic solution is needed to handle the case where procedures and return points are called in different contexts (i.e. from call sites with different deltas). A simple extension to the previous method is to use m instead of n, where m is the maximum delta of all call sites to P (and similarly for return points). This maximal delta method is illustrated in the figure above, where dark rectangles are used to represent interrupt check instructions. Note that delta now represents an upper bound on the number of non interrupt check instructions preceding an instruction. The maximal delta method is not an ideal solution, for two reasons. First, it forces all control paths through P to have an early interrupt check in P if just one call site to P has a high delta. It would be much better if each procedure call paid its own way, meaning that polling should be put on the call sites with high deltas. Not only would this improve P's grace period, it would put the interrupt check where it causes the least overhead, because a high delta at a call site is a sign of a high number of normal instructions preceding it.

(For simplicity, it is assumed here that all paths to P are equiprobable.)

A second shortcoming of this method is that the source and destination of procedure calls have to be known at compile time. In Scheme this information is not generally available, although one could reasonably argue that with the use of programmer annotations and/or control flow analysis, the destination of most procedure calls could be inferred by the compiler for typical programs. However, the destination of returns is harder to determine, because it would require a full dataflow analysis of the program, and in general there are multiple return points for each procedure. The existence of higher order functions is another source of difficulty.

Balanced Polling

This section presents a general solution that does not rely on any knowledge of the control flow of the program. The method could be extended with appropriate rules (such as maximal delta) to better handle the cases where control flow information is available, but this is not considered here.

The idea is to define polling state invariants for procedure entry and exit. The polling strategy expects these invariants to be true at the entry and return points of all procedures, and consequently must arrange for them to be true at procedure calls and returns.

Specifically, the invariant at procedure entry is that interrupts have been checked at most L_max - E instructions ago. Here E, the grace period at entry points, is constant for all procedures. In other words, delta is defined to be L_max - E at entry points. The invariant at procedure return is more complex: either delta is less than E, or the path from the entry point to the return instruction is at most E instructions. These invariants are represented in the figure below. Procedure P has two branches that illustrate the two cases for procedure return. Note that a procedure can be exited by a procedure return as well as by a reduction call; for now, reduction calls will be ignored to simplify the discussion.

Subproblem Calls

These invariants have important implications. To begin with, short lived procedures are handled well, because there is no need to check interrupts on any path that returns quickly without a call to another procedure (i.e. with fewer than E non-call instructions). This corresponds to the rightmost path in the figure below.

[Figure: Procedure return invariants in balanced polling. Call sites reach P's entry point with at most L_max - E instructions since the last interrupt check. Each path from the entry point to a procedure return either contains an interrupt check followed by at most E instructions, or is at most E instructions long in total.]

Figure: Procedure return invariants in balanced polling

Moreover, the delta at return points can be defined as E plus the delta of the corresponding call point. This can be confirmed by considering the two possible cases. Assume procedure P1 does a subproblem call to procedure P2, which eventually returns back to P1 via a procedure return in P2, i.e.

    P1  --subproblem call-->  P2  --procedure return-->  P1

Either the last interrupt check was in P2, so by definition delta at the return point in P1 is less than E; alternatively, P2 was short lived and didn't check interrupts, so there are at most E instructions that separate the call site in P1 from the return point in P1. As far as polling is concerned, a procedure called in subproblem position can be viewed as an interrupt check free sequence of E instructions. The compilation rule here is that if delta at a call point exceeds L_max - E, then an interrupt check is inserted at the call.

This rule means that up to floor(L_max/E) subproblem procedure calls can be done in sequence without any interrupt checking. To see why, consider the scenario where the first call is immediately preceded by an interrupt check. At the return point, delta is equal to E. If the instructions for argument setup and branch are ignored, delta at the nth return point is n*E. Only when this reaches L_max is an interrupt check needed.
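For example (hypothetical values): with L_max = 90 and E = 30, up to floor(90/30) = 3 consecutive subproblem calls can go unchecked; the interrupt check then falls at the third return point.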

Reduction Calls

As described, the polling strategy does not handle reduction procedure calls (tail calls) very gracefully. The case to consider here is when a subproblem call is to a procedure which exits via a series of tail calls finally ending in a procedure return, i.e.

    P  --subproblem call-->  P1  --reduction call-->  P2  --> ... -->  Pn  --procedure return-->  P

An interrupt check must always be put at a reduction call point, to guard against the case where the called procedure returns quickly without checking interrupts (as in Pn-1 calling Pn). Note that the return point in P can have a delta as low as E. Note also that Pn might execute as many as E non interrupt check instructions before returning to the return point in P. Thus it is not valid for Pn-1 to jump to Pn with a delta greater than zero, because this would violate the polling invariant at the return point in P.

The treatment of reductions can be improved by introducing a new parameter R, and consequently adjusting the polling invariants to support it. R is defined as the largest admissible delta at a reduction call. Thus an interrupt check is put on any reduction call whose delta would otherwise be greater than R. Note that the same polling behavior as before is obtained by setting R to 0. The polling constraints for reduction calls can be relaxed by increasing the value of R. R can be as high as L_max - E, because a reduction call might be to a procedure that doesn't check interrupts for as many as E instructions.

A new invariant for return points has to be formulated to accommodate R. The delta at return points must now be assumed to be at least E + R, to account for the case explained previously: a chain of reduction calls from P2 to Pn ending in a procedure return to P1. That is, on return to P1 there could be up to E instructions in Pn plus as many as R instructions at the tail of Pn-1 since the last interrupt check. When the compiler encounters a subproblem procedure call, it sets the delta at the return point to E plus the largest value between R and the delta for the corresponding call point. If this value is greater than L_max, an interrupt check is first put at the call site and the delta at the return point is set to E + R. The introduction of R also makes it possible to relax the invariant for procedure returns: since the delta for return points is at least E + R, a delta as high as E + R can be tolerated at procedure returns without requiring an interrupt check.

With these new invariants there can be up to ⌊(L_max - R)/E⌋ subproblem procedure calls in sequence without interrupt checks. This polling strategy will be called balanced polling. A summary of the compilation rules for balanced polling is given in the figure below.

The two constants E and R must be chosen carefully to achieve good performance. Small values for E and R increase the number of interrupt checks for short-lived procedures and tail-recursive procedures respectively. On the other hand, high values increase the number of interrupt checks in code with many subproblem procedure calls (e.g. recursive procedures). Choosing E = R = ⌊L_max/k⌋ for a small constant k is a reasonable compromise and gives good performance in practice. This suggests that there are typically only a few subproblem procedure calls per procedure in the benchmark programs (see the results below).

Minimal Polling

The choice of L_max is also an issue. A high L_max will give a low polling overhead. However, it is important to realize that there is a limit to how low the polling overhead can be made by increasing the value of L_max. This is due to the conservative nature of the strategy.

[Figure: Compilation rules for balanced polling.]

    Location                 Action by compiler
    ------------------------------------------------------------------
    Entry point              delta = L_max - E
    Non-branch instruction   if delta = L_max then add interrupt check;
                             delta = delta + 1 for the next instruction
    Subproblem call          if delta > L_max - E then add interrupt check;
                             delta = E + max(R, delta) for the return point
    Reduction call           if delta > R then add interrupt check
    Procedure return         if delta > E + R, and there is an interrupt
                             check on the path from the procedure's entry
                             point, then add interrupt check
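These rules translate directly into delta bookkeeping in the code generator. The following is a minimal sketch of that bookkeeping (not Gambit's actual code; the parameter values and the emit-check! hook are assumptions for illustration):

(define L-max 120)   ; assumed values; the thesis' settings were lost
(define E 30)
(define R 30)

;; Each rule takes the current delta and a thunk that emits an
;; interrupt check, and returns the delta to use at the next point.

(define (delta-at-entry)
  (- L-max E))

(define (delta-after-instruction delta emit-check!)   ; non-branch instr.
  (if (= delta L-max)
      (begin (emit-check!) 1)        ; check resets the count
      (+ delta 1)))

(define (delta-at-return-point delta emit-check!)     ; subproblem call
  (if (> delta (- L-max E))
      (begin (emit-check!) (+ E R))
      (+ E (max R delta))))

(define (check-reduction-call! delta emit-check!)     ; reduction call
  (if (> delta R)
      (emit-check!)))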

Whatever the values of L_max, E and R, at least one interrupt check is generated between the entry point and the first procedure call: delta is L_max - E on entry to a procedure, so clearly the first call (reduction or subproblem) must be preceded by an interrupt check. Similarly, there is at least one interrupt check between any return point and the exit of the procedure (return or reduction call), because delta at any return point is at least E + R. These two types of paths are the only ones that are a necessary part of any unbounded-length path. Thus it is sufficient to have one interrupt check on each of these paths to guarantee that all possible control paths have a bounded number of instructions between interrupt checks. This minimal polling strategy is useful because its overhead is a lower bound that can be used to evaluate other techniques.

An example of minimal polling for the procedure sum and the tail-recursive variant trsum is presented in the figure below. For the call (sum v l h) there are exactly h - l interrupt checks executed, or nearly one interrupt check per procedure call, assuming h - l is a power of two. By comparison, checking interrupts at procedure entry and exit would execute twice as many interrupt checks (two per procedure call). However, for the tail-recursive procedure trsum, both methods are essentially equivalent, with one interrupt check per iteration.

[Figure: Minimal polling for the recursive procedure sum and a tail-recursive variant trsum, showing the compiled code of both procedures annotated with the interrupt-check sites chosen by minimal polling.]
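The source of the two procedures can be reconstructed from the fragments that survive in the figure (vector-ref, quotient, the l/h/mid and s/i variables); the following is a plausible reconstruction rather than a verbatim copy of the original code:

(define (sum vect l h)                 ; recursive: binary splitting
  (if (= l h)
      (vector-ref vect l)
      (let ((mid (quotient (+ l h) 2)))
        (+ (sum vect l mid)
           (sum vect (+ mid 1) h)))))

(define (trsum vect s i)               ; tail-recursive: running total in s
  (if (< i 0)
      s
      (trsum vect
             (+ s (vector-ref vect i))
             (- i 1))))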


It is interesting to note that balanced polling is more general than minimal polling and call-return polling: these can be emulated by judiciously choosing E, R and L_max. Minimal polling is obtained when E and R are arbitrarily large and L_max is arbitrarily larger still; an interrupt check is put at the first call, and another one is put at the return or reduction call that follows the last return point. Call-return polling occurs when E, R and L_max are made as small as possible; this places interrupt checks at all entry points and return points.

Handling Join Points

It has been assumed that the code of procedures is in the form of a tree. However, the compilation of conditionals (e.g. and, or, if and cond in subproblem position) introduces join points that give a DAG structure to the code. Certain optimization techniques, such as common code elimination, can also produce join points to express the sharing of identical code branches. A simple approach for join points is to use the maximal delta method: the delta at the join point is the maximum delta of all branches to the join point. Although this is not an optimal strategy, its performance on the benchmark programs seems sufficiently good to be content with it.

Polling in Gambit

Polling is a general mechanism that can serve many purposes. In Gambit, polling is used for:

- Stack overflow detection.

- Interprocessor communication (for stealing work).

- Preemption interruption (for multitasking).

- Intertask communication (for interrupting tasks).

- Barrier synchronization (e.g. for synchronizing all processors for a garbage collection and to copy objects to the private memory of every processor).

A special technique is used to check all these cases with a single test. The interrupt flag in shared memory is really a pointer that is normally set to point to the end of the area available for the stack. An interrupt check consists of comparing the flag to the current stack pointer and jumping to an out-of-line handler when the stack pointer exceeds this limit. A processor can be interrupted by setting the flag to a value that forces this situation (e.g. 0). The interrupt handler can then use some other flags to discriminate between the possible sources of interrupt.
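A minimal sketch of this single-test scheme (the names are assumptions, and a one-element vector stands in for the shared-memory flag):

(define interrupt-flag (vector 0))     ; the flag, in "shared memory"

(define (set-stack-limit! limit)       ; normal state: end of stack area
  (vector-set! interrupt-flag 0 limit))

(define (post-interrupt!)              ; force the next check to trap
  (vector-set! interrupt-flag 0 0))

(define (interrupt-check! sp handle-interrupt)
  (if (> sp (vector-ref interrupt-flag 0)) ; one compare covers both cases
      (handle-interrupt)))

A stack overflow and a posted interrupt both make the same comparison fail, so the common path costs a single test.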

Although it can be done with a single test, the interrupt check may still be relatively expensive due to the reference to shared memory. Increasing L_max is not a viable solution because the polling frequency can't be lowered beyond a certain point. To provide a finer level of control, interrupts can be checked intermittently. Polling instructions generated by the compiler represent virtual interrupt check points, and an actual interrupt check occurs only every so many virtual checks. This new parameter is the intermittency factor and is called I. Intermittent checking is easily implemented by a private counter that is decremented at every virtual check; when it reaches zero, it is reset to I and the interrupt check is performed. The average cost of an interrupt check will thus be the cost of updating and checking the counter plus one I-th of the cost of checking the interrupt flag.
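A sketch of intermittent checking built on the check above (the value of I actually used was lost in extraction; 10 below is just an assumption):

(define I 10)                          ; intermittency factor (assumed)
(define countdown I)                   ; private, per-processor counter

(define (virtual-check! sp handle-interrupt)
  (set! countdown (- countdown 1))     ; cheap private-memory work
  (if (= countdown 0)
      (begin
        (set! countdown I)
        (interrupt-check! sp handle-interrupt)))) ; shared-memory test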

An interesting optimization occurs here. Balanced polling has a tendency to put the interrupt checks at branch points. An interrupt check itself involves a branch instruction, so in many cases it is possible to combine the two branches into a single one. Moreover, several machines have a combined decrement-and-branch instruction that helps reduce the cost even further. All these ideas are implemented in Gambit.

Results

To have a better idea of the polling overhead that can be expected from these polling methods, it is important to measure the overhead on actual programs. Two situations are especially interesting to evaluate: the overhead on typical programs, and on pathological programs that are meant to exhibit the best and worst performance.

Several programs and polling methods were tested. The programs were run on the GP1000 using a single processor. Each program was compiled in four different ways: with no interrupt checks, with minimal polling, with call-return polling, and with balanced polling. For balanced polling, L_max was set to a range of values, with E and R set at ⌊L_max/k⌋ as above and a fixed intermittency factor I. The average run time over ten runs was taken for each situation. The polling overhead of minimal polling over the program compiled with no interrupt checks is reported in the first column of the table below. The overhead for the other polling methods is expressed relative to the overhead of minimal polling; thus a relative overhead of 2 means that the

overhead is twice that of minimal polling. Overheads lower than one can be explained by a combination of factors: timing inaccuracies, and degradation of instruction-cache performance due to the different loading location of the programs. The table also gives the average latency obtained with minimal polling and with balanced polling at two settings of L_max. The latency for compiler is not shown because the number of interrupt checks executed was not available to measure it: the program must be compiled with a statistics-gathering option, which increases the size of the code so much that it no longer fits on the GP1000.

The program tight, shown below, was designed to exhibit worst-case behavior:

(define (tight n)        ; reconstructed: the test and constants were lost
  (if (> n 0)
      (tight (- n 1))))

It is a tight loop that doesn't do anything except update a loop counter. There are only two instructions executed on every iteration, a decrement and a conditional branch, so interrupt checks will clearly add a high overhead. For most polling methods the overhead is about the same; in the case of balanced polling at the smallest L_max setting, the overhead is roughly twice that, because two interrupt checks get added to every loop (a consequence of the E and R settings).

The program unfolded is the same loop as tight but unfolded many times: a long inline sequence of decrements followed by one conditional branch instruction. The polling methods do well on this program (minimal and call-return polling have a low overhead) because procedure calls are relatively infrequent and it is easy to handle the inline sequence of instructions. As expected for balanced polling, increasing L_max decreased the overhead; L_max would have to be higher still to reduce the overhead all the way to that of minimal polling, since at the settings measured there are two interrupt checks per loop.

The other programs are from the standard set of benchmarks. The parallel programs were compiled as sequential programs (i.e. with futures and touches removed) to factor out the overhead of supporting parallelism.

The results for these programs indicate that minimal polling outperforms call-return polling in nearly all cases, sometimes by as much as a factor of four but by a smaller factor on average. The largest differences occur for fine-grain recursive programs (e.g. tak and fib) and programs with a profusion of data-abstraction procedures (e.g. conform). The performance of balanced polling is rather poor for small values of L_max (two to three times the overhead of minimal polling at the smallest setting).

[Table: Overhead of polling methods on the GP1000. For each program the table gives the run time and overhead of minimal polling relative to code compiled with no interrupt checks, the relative overhead of call-return polling, and the relative overhead of balanced polling (E = R = ⌊L_max/k⌋) for several values of L_max, together with the average polling latency L. Programs: tight, unfolded, boyer, browse, cpstak, dderiv, deriv, destruct, div, puzzle, tak, takl, traverse, triangle, compiler, conform, earley, peval, abisort, allpairs, fib, mm, mst, poly, qsort, queens, rantree, scan, sum, tridiag. The numeric entries were lost in extraction.]

However, balanced polling gives performance close to minimal polling when L_max is high: the average overhead becomes modest for high values of L_max, and the highest overheads are for the fine-grain recursive programs.

Summary

Interrupts can be detected by the processor's hardware interrupt system or by polling. Polling has the advantage of simplicity and portability. A common claim is that polling is not appropriate for a high-performance system because it has a high overhead. This chapter described the balanced polling method, whose overhead is almost half that of the more straightforward call-return polling method. Balanced polling as implemented on the GP1000 still has a noticeable average overhead. This overhead seems rather high, but it can be explained by the high quality of the code generated by Gambit and the poor instruction set of the GP1000's processors. Systems with a compiler that generates less tight code, or with a processor that permits a lower-cost code sequence for an interrupt check (for example a fast compare-and-trap-on-condition instruction), would have a correspondingly lower overhead for polling.

Clearly, the processor's hardware interrupt system should be used to implement the MP protocol if the interrupt latency and overhead are low enough and the state of the processor at the time of interrupt can be recovered conveniently. If not, polling is at least a viable alternative.


Chapter

Experiments

Performance is the main design objective of the implementation strategies presented in this thesis. In most cases a purely theoretical performance analysis is not satisfying, because it must abstract away many real issues to make the analysis manageable. The goal of this chapter is to evaluate performance using experiments. Concrete evidence for the following claims is given:

- Exposing parallelism with LTC is relatively inexpensive when the MP protocol is used; the worst-case overhead occurs when programs are very fine grain.

- In the absence of a cache, the overhead of exposing parallelism with the SM protocol is about twice that of the MP protocol. When a cache is available, the overhead for the SM protocol can be higher than a factor of two.

- LTC scales well to large shared-memory multiprocessors. The two protocols have very similar speedup characteristics when a cache is not present.

- The MP protocol has speedup characteristics that are consistently better than the SM protocol's on multiprocessors with caches. The difference in performance when using a large number of processors is as high as a factor of two on the TC2000.

- The steal request latency can be relatively large without adversely affecting the MP protocol's performance.

- Supporting the Katz-Weise semantics and legitimacy generally has a negligible impact on performance.


Experimental Setting

Several experiments were conducted to evaluate and compare the various implementation strategies. The experiments consisted of running each benchmark program in a particular context and measuring some of its characteristics. The context depended on the following parameters.

- Machine and compiler. The experiments were performed on the GP1000 and TC2000 multiprocessors. The GP1000's processors are MC68020s and the TC2000's are MC88100s; only the TC2000 has a data cache. Each machine has its own version of the compiler, but the frontends are the same. The backend for the GP1000 generates highly optimized native code, whereas the version for the TC2000 generates portable C code which must subsequently be compiled with a C compiler. The price to pay for this portability is a slowdown by a constant factor over native code, depending on the program. The slowdown is a result of extra pure-computation instructions; the number of memory accesses would, however, be the same in a native code implementation. This means that the importance of the TC2000's memory hierarchy is lower than it would be if the backend generated native code. Consequently, the results obtained with the GP1000 are more representative of a high-performance compiler, and the results obtained on the TC2000 are more representative of a modern multiprocessor with a low-cost memory hierarchy.

A severe handicap of these machines is the small size of physical memory: the local memory on each processor is only a few Mbytes. Since this memory holds the operating system's code and data structures as well as the program's code, little space is left for the program's heap. Allocating virtual memory is not a solution, because it adversely affects the performance of garbage collection and also because it doesn't scale well (page faults are handled by a small set of processors dedicated to this purpose). To minimize these problems, the benchmarks were chosen so that the data they allocate fits in the heap without causing any garbage collection. In an effort to reduce the number of page faults, the benchmarks perform a few dry runs before the run actually measured. Nevertheless, some memory-intensive programs (allpairs and poly in particular) consistently caused page faults due to their poor locality of reference.

- Number of processors. One of the goals of this thesis is to show that LTC scales well to large shared-memory multiprocessors. For this reason the experiments were conducted on the largest machines that were accessible: a GP1000 at Michigan State University and a TC2000 at Argonne National Laboratory. These are multiuser machines where processors are dynamically allocated into partitions at the time the program is launched by the user. The program is only aware of the processors in its partition, but because the memory interconnect is a butterfly network shared by all the partitions, the contention on the network depends on the other programs running on the machine. To minimize this effect, experiments were performed at off-peak hours and the average of several runs was usually taken. However, it was difficult to find times when large partitions could be allocated, so it was necessary to limit the number of experiments and runs for the larger partitions; this explains, at least in part, the greater variation in the results on large partitions. The largest partitions used approached the full size of each machine.

Another problem afflicts large partitions. Each processor on the GP1000 and TC2000 has a limited-size TLB (translation lookaside buffer) for holding the mapping information that is used to translate virtual addresses to physical addresses. The TLB is managed like a cache and has a limited number of entries, each of which maps one page of the program's virtual address space. When a memory reference is to a page not currently mapped by the TLB, a translation fault occurs and the operating system must load the appropriate mapping information into the TLB from a table in memory. Translation faults must be avoided, because they are handled in software and are relatively expensive. Programs with poor locality of reference whose working set exceeds the TLB's reach will cause frequent translation faults. Unfortunately, several of the benchmarks have poor locality, because they distribute user data evenly across the machine to reduce contention. The working set of these programs increases with the number of processors, and thrashing occurs when the working set exceeds the number of pages the TLB can map; the exact point where this starts happening depends on the program. Moreover, poor locality is inherent in the search for a task to steal, which possibly flushes several entries from the TLB that are part of the stolen task's working set. The importance of this factor will increase with the number of processors and the scarcity of tasks to steal.

- Polling parameters. Balanced polling, with the same E, R and L_max settings as in the previous chapter, was used for all experiments. The steal request latency was controlled by changing the polling intermittency factor I. Unless otherwise indicated, I was set to the value used in the previous chapter to evaluate the polling methods.

- Stealing protocol. Both the SM and MP protocols were tested.

- Continuation semantics. Two continuation semantics were used: the original Multilisp semantics and the Katz-Weise semantics. On the GP1000 the original semantics was used with the SM protocol and the Katz-Weise semantics was used with the MP protocol. The TC2000 used the original semantics for both protocols. For the original semantics, the transfer of the stolen task's continuation was performed with a single block-transfer operation. The Katz-Weise semantics was implemented with heapification.

- Legitimacy. Unless otherwise indicated, legitimacy was not supported.

Overhead of Exposing Parallelism

O_expose corresponds to the cost of exposing the parallelism to the system. Part of this cost comes from the futures and touches added to the sequential program to parallelize it. The other part of the cost is a consequence of the less efficient caching policy that is needed for the SM protocol. Recall that T_seq is the run time of a sequential version of the program (the parallel program with futures and touches removed) and T_par is the run time of the parallel program on one processor. T_par, T_seq and O_expose are related by the equation

by the equation

T

par

O

expose

T

seq

To evaluate O the run time was measured on a single pro cessor partition with

expose

the program compiled with and without futures and touches giving T and T

par seq

resp ectively T and O are given on the left side of Tables through

par expose

The first two tables are for the SM and MP protocols on the GP1000, and the last two tables are for the SM and MP protocols on the TC2000. On the TC2000 the stack was write-through cached for measuring the SM protocol's T_par, and copyback cached for measuring T_seq and the MP protocol's T_par.

Notice that for nearly all programs the SM protocol has an O_expose larger than the MP protocol's. The only exceptions are the programs mm and abisort on the GP1000.

Overhead on GP1000

On the GP1000, O_expose is closely dependent on G, the task granularity, and n, the number of closed variables that must be copied for the future's body; earlier tables give the value of G and n for each benchmark. O_expose is approximately 1 + c(n)/G, writing c(n) for a per-future time cost that grows linearly with n and is about twice as large for the SM protocol as for the MP protocol. This is consistent with the costs measured earlier for the lightweight task push-and-pop sequence (higher for the SM protocol than for the MP protocol) and for a touch (most programs have the same number of touches and futures). For the SM protocol, O_expose is at its lowest value for allpairs, the program with the largest granularity; the highest overhead is for fib, the program with the smallest granularity. For the MP protocol, allpairs and fib also yield the lowest and highest overheads, at about half the overhead of the SM protocol.

Overhead on TC2000

On the TC2000, O_expose for the MP protocol falls in essentially the same range as on the GP1000. However, O_expose for the SM protocol is much larger. The highest overhead is for fib, which runs substantially slower than the sequential version of the program; for the MP protocol the overhead for fib is far smaller. The large difference in overheads is mostly due to the SM protocol's use of write-through caching for the stack and LTQ. According to the O_WT column of the corresponding table, write-through caching of the stack already accounts for a sizable overhead on sequential fib; the additional overhead of the parallel version is attributable mostly to the three stack and LTQ writes performed for each future. On the other hand, the overhead of coarse-grain programs is closer to O_WT itself; allpairs, for example, has an O_expose close to its O_WT.

Speedup Characteristics

The right side of the tables provides some information on the parallel behavior of the programs. The programs were run on increasingly large partitions to see how well they exploit parallelism. For the GP1000, three measurements were taken: the run time of the program, the number of heavyweight tasks created, and the number of task suspensions that occurred.

[Table: Performance of the SM protocol on the GP1000. For each program (fib, queens, rantree, mm, scan, sum, tridiag, allpairs, abisort, mst, qsort, poly) the table gives T_par and O_expose, and the speedup S, task creation ratio TC and task suspension ratio TS for each partition size. Numeric entries were lost in extraction.]

[Table: Performance of the MP protocol on the GP1000; same layout as the previous table.]

[Table: Performance of the SM protocol on the TC2000; per-program T_par, O_expose, and speedup for each partition size.]

[Table: Performance of the MP protocol on the TC2000; same layout as the previous table.]

Each entry in the GP1000 tables contains three values computed from these measurements:

- S: the program's speedup over the sequential version of the program (i.e. with futures and touches removed and, on the TC2000, run with copyback caching of the stack):

      S = T_seq / (run time)

- TC: the proportion of lightweight tasks that were transformed into heavyweight tasks:

      TC = (number of heavyweight tasks created) / N_future

- TS: the number of task suspensions expressed relative to the number of lightweight tasks:

      TS = (number of task suspensions) / N_future
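For instance (purely illustrative numbers, not taken from the tables): if T_seq = 100 s, a 16-processor run takes 8 s, N_future = 1,000,000, and the run created 5,000 heavyweight tasks and suffered 500 suspensions, then S = 100/8 = 12.5, TC = 5000/1000000 = 0.5%, and TS = 500/1000000 = 0.05%.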

Note that a few of the benchmarks (allpairs, mst, poly and qsort) did not run properly with the SM protocol on the GP1000. The tables for the TC2000 only contain the speedup. The speedup data is reproduced as speedup curves in the figures below. The speedup curves for the GP1000 also contain data for runs of the MP protocol with higher and lower intermittency factors; for now only the curves for the default intermittency factor are considered. TC and TS for the MP protocol on the GP1000 are also plotted as a function of the number of processors in two further figures. The benchmark programs can be roughly classified in three groups according to the shape of their speedup curves:

- Parallel and compute bound (fib, queens, rantree): These programs do not access memory. The speedup curve is initially close to linear speedup and gradually diverges from it as the number of processors increases; in other words, the first derivative of the curve starts at 1 and the second derivative is negative. The flattening out of the curve as the number of processors increases is explained by Amdahl's law (i.e. each program has a maximal speedup).

[Footnote: The bug has stumped me to this day. I suspect that it is a race condition I introduced in the assembly language encoding of the algorithms; Gambit's kernel contains a large amount of hand-optimized assembly code. After obtaining a working version of the SM protocol on the TC2000 (written in C), I convinced myself that the problem was not algorithmic. The problem may also be related to a known bug in the parallel garbage collection algorithm.]

[Figure: Speedup curves for fib, queens, rantree and mm on the GP1000. Each panel plots speedup S against the number of processors for the SM protocol and for the MP protocol at several intermittency factors I, with the average polling latency L noted for each curve.]

[Figure: Speedup curves for scan, sum, tridiag and allpairs on the GP1000; same format as the previous figure.]

[Figure: Speedup curves for abisort, mst, qsort and poly on the GP1000; same format as the previous figure.]

[Figure: Speedup curves for fib, queens, rantree and mm on the TC2000, comparing the SM and MP protocols.]

[Figure: Speedup curves for scan, sum, tridiag and allpairs on the TC2000, comparing the SM and MP protocols.]

[Figure: Speedup curves for abisort, mst, qsort and poly on the TC2000, comparing the SM and MP protocols.]

[Figure: Task creation behavior (TC) of the MP protocol on the GP1000, plotted against the number of processors for fib, queens, rantree, mm, scan, sum, tridiag, allpairs, abisort, mst, qsort and poly.]

[Figure: Task suspension behavior (TS) of the MP protocol on the GP1000, plotted against the number of processors for the same programs.]


- Parallel and memory accessing (abisort, allpairs, mm, scan, sum, tridiag): These programs access memory to various extents. The speedup curve for these programs is S-like, i.e. the second derivative is initially positive and then negative. A good example is abisort. The initial bend in the curve is explained by the increase in cost for accessing shared user data, which is distributed evenly across the machine. A memory access has a probability of (n - 1)/n of being to remote memory, where n is the number of processors, so the average cost of an access to shared user data is (L + (n - 1)R)/n, where R is the cost of a remote memory access and L is the cost of a local memory access. The bend in the curve is consequently more pronounced for programs which spend a high proportion of their time accessing the heap (e.g. abisort, allpairs and mm).
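To see the shape this formula produces, take illustrative costs (assumed, not measured): L = 1 µs and R = 10 µs. The average access cost (1 + 10(n - 1))/n is 1 µs for n = 1, 5.5 µs for n = 2, and approaches 10 µs as n grows. Most of the slowdown from remote accesses is thus already incurred at small n; once this cost saturates, adding processors helps almost linearly again, giving the S-like curve.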

Poorly parallel mst poly qsort These are programs whose algorithms do

not contain much parallelism or that contain a form of parallelism that is not well

suited for LTC The sp eedup curves for these programs are mostly at b ecause

little of the parallelism is exploited Generally the curve starts going down after

a certain number of pro cessors b ecause no more parallelism can b e exploited but

other costs such as contention and memory interconnect trac increase

Speedup on GP1000

On the GP1000 it is striking how similar the tables and speedup curves are for the SM and MP protocols. The speedup, the number of tasks created, and the number of task suspensions are normally within a few percent of each other. Nevertheless, the MP protocol typically has a slightly higher speedup, especially for the fine-grain programs. This can be explained by the fact that the difference in O_expose between protocols is larger for fine-grain programs.

Recall that on the GP1000 the SM protocol is using the original continuation semantics and the MP protocol is using the Katz-Weise semantics without legitimacy support. Since the speedup characteristics for both protocols are so similar, it follows that the additional work needed to support the Katz-Weise semantics (mostly that of heapification) is globally negligible. The cost of supporting legitimacy is examined in a later section.

For both protocols the number of heavyweight tasks created by most programs is a small fraction of what ETC would have created: even when fib is run on the largest partitions, only a tiny proportion of the lightweight tasks are transformed into heavyweight tasks. As suggested by the curves in the TC figure above, beyond small partitions TC increases roughly linearly with the number of processors. The notable exceptions are allpairs, mst and poly, whose TC levels off as it nears 1, and qsort, whose TC first goes up roughly as the square of the number of processors before leveling off as it nears 1. All programs except mm, allpairs, mst, poly and qsort keep a low TC even on the largest partitions.

The high TC of these programs can be explained by their coarse granularity and low degree of parallelism (except qsort, which is explained later). These programs create relatively few lightweight tasks, so proportionately more of them need to be stolen to keep the processors working. An extreme example is allpairs, which on each iteration creates only a fixed small number of lightweight tasks, bounding its maximum parallelism; it isn't surprising that on a large partition nearly all of its tasks get stolen to balance the load across the machine.

The reason why TC is high for qsort (and also poly) is that most of the stolen tasks perform very little work, i.e. T_work is only a few instructions. Most of qsort's stolen tasks perform a single call to cons before they terminate, and a handful of similarly simple operations are performed by poly's stolen tasks. Thieves that have just stolen a task will soon be looking for new tasks to steal, so the lightweight tasks that are created are likely to get stolen. Qsort's poor speedup is explained by its high TC and low T_work, combined with its fine granularity G and heavy remote memory usage O_RemHeap.

Similarly, the TS figure above suggests that beyond small partitions the number of task suspensions increases fairly linearly with the number of processors for most programs. The notable exceptions are allpairs, mst and poly, which have a fairly constant TS on the larger partitions.

Speedup on TC2000

On the TC2000 the speedup curves for the MP protocol have a shape similar to those for the MP protocol on the GP1000. The actual speedup is, however, slightly higher on the TC2000. This is probably due to the TC2000's faster memory system combined with the lower quality of the code generated by the compiler, which makes the memory system appear even faster. These factors reduce the relative importance of task management operations and memory accesses. Consequently, a native code implementation on the TC2000 would have a lower speedup but higher absolute performance.

The SM protocol, however, has a consistently lower speedup than the MP protocol. Each protocol's speedup curve starts off at 1/O_expose on one processor (for its respective O_expose) and, as the number of processors increases, the curves tend

to get closer. Programs with good speedup characteristics (e.g. fib and sum) maintain a roughly constant distance between the speedup curves; in other words, the ratio of their run times stays close to the ratio of their O_expose. On the other hand, programs with poor speedup characteristics (e.g. mst and qsort) have speedup curves that become colinear at a high number of processors. This can be explained by the progressive decrease of mandatory work being performed by the program: the main cause of the overhead O_expose, that is, suboptimally caching the stack and task stack, mostly affects the performance of the mandatory work. The relative importance of suboptimally caching the stack will thus decrease as the programs spend more and more time being idle and/or accessing remote memory.

The only point where the speedup curves cross is for qsort on the largest partition. However, the same thing should be expected for other benchmarks on larger partitions because, as the number of processors increases, the benefits of caching decrease whereas the speed of work distribution becomes more critical to performance. Since the SM protocol has a lower steal latency, it will likely outperform the MP protocol on very large partitions. Note however that this might happen at a point where the efficiency (i.e. the ratio of the speedup and the number of processors) is so low that it is not cost effective. For instance, qsort attains its best speedup under the MP protocol on a smaller partition than it does under the SM protocol.

Effect of Interrupt Latency

In order to study the effect of the interrupt latency on the performance of the MP protocol, the programs were tested on the GP1000 with intermittency factors lower and higher than the default I used in the previous experiments. These changes in I cause the interrupt latency to decrease and to increase roughly in proportion. The two tables below contain, for each program, the value of T_par and O_expose and, for each partition size, S, TC and TS. The GP1000 speedup figures above include the curves for each setting of I and also give L, the average interrupt latency (L is T_par divided by the number of interrupt checks executed). Note that the average time before an interrupt is detected, T_detect, is L/2.

The settings for I were chosen so that T_detect would be roughly comparable to the cost of stealing a task, T_task-steal. Experimental measurements put T_task-steal in a range that depends on the program; with the low setting of I, T_detect is normally a fraction of T_task-steal, and with the high setting it is normally larger.

[Table: Performance of the MP protocol on the GP1000 with the lower intermittency factor; per-program T_par, O_expose, and S, TC, TS for each partition size. Numeric entries were lost in extraction.]

[Table: Performance of the MP protocol on the GP1000 with the higher intermittency factor; same layout as the previous table.]

Overall, the speedup curves indicate that the setting of I does not significantly affect performance. For small partitions the speedup curves for the highest I are consistently, but only slightly, better than those for smaller values of I; this is simply due to the slightly lower polling overhead at high I. As the number of processors increases and the program's work distribution requirements become more critical, the performance for the lower values of I improves and eventually surpasses the performance of the highest setting; the only exception is fib, which on the largest partition is still a little faster with the highest I. On large partitions most programs perform best with an intermediate setting of I, but the performance of the other settings is very close: the difference in performance between the settings on the largest partition is small, with the exception of allpairs and mst. It is interesting to note, however, that good performance is obtained for all settings of I such that L is less than T_task-steal (allpairs and mst at the highest I are on the borderline).

Cost of Supporting Legitimacy

The previous experiments were performed with a version of the MP protocol that did not contain support for legitimacy. To evaluate the cost of supporting legitimacy, the appropriate operations were added to the task management algorithms (i.e. the creation of the legitimacy placeholder, its installation in the stolen task and end-frame, and the legitimacy propagation and chain collapsing in endbody). The programs were run on the GP1000 with increasingly large partitions. Two runs were performed: one with and one without a speculation barrier at the end of the program. The run time was measured and compared to the run time of the version lacking legitimacy support. The overhead, i.e. the ratio of run times, is shown in the table below.

The results clearly show that for all programs based on fork-join algorithms the cost of supporting legitimacy is negligible; in fact, it can hardly be measured at all, being below the noise level of the measurements. The collapsing of the legitimacy chain appears to be working out as expected for fork-join algorithms. Only the programs qsort and poly, which are based on pipeline parallelism, have measurable overheads. The overheads increase with the number of processors, indicating that the legitimacy chain is getting longer and its collapsing is getting more expensive. The highest overhead is for poly on the largest partition when a speculation barrier is present; without the speculation barrier the overhead is a little lower.

[Table: Overhead of supporting legitimacy, with and without a speculation barrier, on the GP1000; per-program overhead ratios for each partition size (fib, queens, mm, scan, rantree, sum, tridiag, allpairs, abisort, mst, poly, qsort). Numeric entries were lost in extraction.]

Summary

This chapter has evaluated the performance of the SM and MP protocol implementations of LTC on large shared-memory multiprocessors. Experiments were conducted with several benchmark programs on the GP1000 multiprocessor, which lacks a data cache, and the TC2000, which has a data cache. The results show that:

- The parallelization cost is low. The overhead of parallelizing a sequential program by adding futures and touches is typically small when using the MP protocol. For the SM protocol the overhead is twice as large when a cache is not available; when a cache is available, the overhead is much more important (up to a factor of two on the TC2000), because the SM protocol must cache the stack and LTQ suboptimally.

- LTC scales well. Programs with a high degree of parallelism have fairly linear speedup with respect to the sequential version of the program. The SM and MP protocols have almost identical speedup curves when a cache is not available. When a cache is available, the speedup curve for the MP protocol is consistently better, due to the difference in caching policy. However, this difference gradually

decreases as the number of processors increases, because the caching policy becomes less important: it has no influence on the idle time and remote-memory access time, which increase with the number of processors.

- Interrupt latency can be relatively high. For the MP protocol, an interrupt latency as high as the time to steal a task provides adequate performance. On the GP1000 the run time is then usually within a few percent of the run time for the best latency.

- Supporting the Katz-Weise semantics and legitimacy generally has a negligible impact on performance. There was no noticeable performance difference between a version of the system that supported the Katz-Weise semantics and one that did not. This indicates that the additional cost of heapification is low relative to the other costs of stealing, in particular the remote memory references needed to transfer the task between processors. The cost of legitimacy propagation and testing is also very low: the overhead for fork-join programs is too low to measure, and programs with a less restrictive task termination order exhibit a measurable but small overhead.

Chapter

Conclusion

The initial goal of this work was the implementation of a high-performance Multilisp system. Earlier implementations of Multilisp, such as Concert Multilisp (Halstead) and MultiScheme (Miller), gave interesting self-relative speedups, but because they were based on interpreters it was not clear that the same speedups would apply to a production-quality system. As a first step of this work, a highly optimizing compiler for Scheme was developed to provide a realistic setting for exploring new implementation strategies for Multilisp and evaluating their performance. This effort resulted in Gambit (Feeley and Miller), currently the best Scheme compiler in terms of the performance of the code generated.

The system was ported to the GP1000 and TC2000 multiprocessors, and support for Multilisp's parallelism constructs was added to the compiler. Initially the eager task creation (ETC) method was used to implement futures, but it was soon clear that the overhead of task creation would be too high for fine-grain programs, as explained earlier. Work on the lazy task creation (LTC) mechanism was triggered by a comment on lazy futures in Kranz et al. LTC postpones the creation of a task until it needs to be transferred to another processor (the thief). Consequently, the overhead of task creation is mostly dependent on the work distribution needs of the program and not so much on the program's granularity. For divide-and-conquer programs, LTC has the nice property of transferring large pieces of work and roughly balancing the work between the thief and victim processors. This helps reduce the number of task transfers needed to keep processors busy. Most tasks end up being executed locally at low cost.

Eric Mohr independently explored the LTC mechanism with the Mul-T system on the Encore Multimax multiprocessor (a UMA computer) and ended up using a version of the shared-memory (SM) protocol very similar to the one used here (Mohr). In the SM protocol, thief processors directly access the stack of other processors to steal tasks. This thesis extends his results in several ways:

- Experience on large machines. Experiments on a large GP1000 with a wide range of benchmarks provide concrete evidence that LTC scales well to large machines and that good speedup is possible for realistic programs.

- Support of a rich semantics. The semantics of the Multilisp language does not have to be impoverished to attain good performance. In fact, the laziness of LTC can be exploited to implement several programming features at low cost. These include:

  - The Katz-Weise continuation semantics with legitimacy, which provides an elegant semantics for first-class continuations.

  - Dynamic scoping.

  - Fairness.

Better implementation of the SM proto col A slightly faster implemen

tation of the SM proto col was developed It requires fewer instructions fewer

memory references and is simpler to prove correct

The message-passing (MP) protocol. The main problem with the SM protocol is that all processors must have access to the runtime stack. On machines lacking coherent caches, such as the TC2000, the stack can only be cached in write-through mode instead of the more efficient copy-back mode. This affects the speed of computation in general; the parallel and sequential parts of the programs both suffer. A study of several benchmarks earlier in the thesis shows that the stack is one of the most frequently accessed data structures and that the difference in caching policy can account for an important difference in performance, as high as a factor of two on the TC2000.

In the MP protocol the stack is a private data structure that can be cached optimally. To obtain a task to run, a thief processor sends a work request message to the victim processor. When the request is serviced, the victim accesses its own stack to remove a lazy task and packages it in a heavyweight task that is sent back to the thief. This approach would appear to depend on a low latency interrupt mechanism, such as polling, but in fact the experiments indicate that performance is close to optimal when the interrupt latency is comparable to the time required to perform the task steal.
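In outline, the victim's side of this handshake behaves like the following toy sketch in sequential Scheme. Everything here is illustrative: the deque of thunks, the procedure names, and the direct call that stands in for the message round trip are inventions of the sketch; the real protocol works on the victim's runtime stack and is driven by interrupts.

(define lazy-task-deque '())            ; victim's private pile of lazy tasks

(define (push-lazy-task! thunk)         ; done at each FUTURE (very cheap)
  (set! lazy-task-deque (cons thunk lazy-task-deque)))

(define (serve-steal-request!)          ; run by the victim when interrupted
  (if (null? lazy-task-deque)
      #f                                ; no task available for the thief
      (let ((rev (reverse lazy-task-deque)))
        (set! lazy-task-deque (reverse (cdr rev)))
        (car rev))))                    ; oldest task, packaged for the thief

(define (attempt-steal!)                ; the thief's side of the handshake
  (let ((task (serve-steal-request!)))  ; stands in for the message round trip
    (if task (task) 'no-work)))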


Future Work

The results of this thesis suggest that task partitioning can be done efficiently on machines that lack an efficient shared memory. Coherent caches are not really required, as shown by the MP protocol implementation of LTC. There is thus hope that, at least for some problems, Multilisp can run efficiently on distributed-memory machines. A machine like the Thinking Machines CM-5, which lacks a shared memory but provides a fast message-passing system, would be an ideal candidate.

One of the shortcomings of LTC as implemented here is that it does not address the data partitioning problem. The scheduling algorithm makes no attempt to run a task on, or close to, the processor that contains the data it accesses. As shown earlier in the thesis, a substantial performance loss is attributable to the remote memory accesses to user data, on the GP1000 as well as on the TC2000. Coherent caches may help reduce this problem on shared-memory machines, but the penalty on distributed-memory machines will be much higher.

Another problem is the overhead of touching. Contrary to Multilisp's original specification, this work has assumed that touches are inserted explicitly by the user. This is hard to do for programs with complex data dependencies. It would be more convenient for the user if touches were inserted automatically by the compiler. Adding a touch on each strict operation is a poor solution because it causes a high overhead: on the GP1000 it slows down typical programs considerably, although a lower overhead may be possible on modern processors, which are optimized for register operations. A better solution would be for the compiler to do a data flow analysis of the program to identify all the strict operations that might be passed a placeholder. Control-flow and data-flow analysis techniques such as Shivers' [Shivers] would be a good starting point.
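To make the cost concrete, here is a sketch in ordinary Scheme of what a touch amounts to. The three-slot placeholder representation is invented for the sketch; the system's actual representation is different, and a real touch of an undetermined placeholder suspends the task instead of signalling an error.

(define (make-placeholder) (vector 'placeholder #f #f))

(define (determine! ph val)             ; give the placeholder its value
  (vector-set! ph 2 val)
  (vector-set! ph 1 #t))

(define (touch x)                       ; strict operations apply this first
  (if (and (vector? x) (eq? (vector-ref x 0) 'placeholder))
      (if (vector-ref x 1)
          (vector-ref x 2)
          (error "the task would suspend here"))
      x))

The naive strategy turns (car (cdr l)) into (car (touch (cdr (touch l)))); the analysis would prove that most arguments can never be placeholders and delete the corresponding touches.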


Appendix A

Source Code for Parallel Benchmarks

This appendix contains the source code for the parallel benchmark programs used in the preceding chapters; a general description of these programs was given earlier. Half of the programs were originally written in Mul-T by Eric Mohr as part of his PhD thesis work [Mohr]. These programs were translated to Scheme with superficial changes to suit Gambit's particular features. These changes include:

- Macro definitions. Gambit uses the nonstandard construct define-macro.

- The definition of record structures. Gambit does not have a predefined construct for defining structures; plain vectors were used instead.

The performance of abisort, allpairs and mst was improved by partially evaluating the programs by hand. The algorithms are the same, but some of the procedure abstractions were removed by replacing procedure definitions by macro definitions, as illustrated below.
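For example, an accessor such as node-left from the abisort program is turned from a procedure into an equivalent macro, so that every call site expands into the raw vector access:

(define (node-left x) (vector-ref x 0))          ; original procedure

(define-macro (node-left x) `(vector-ref ,x 0))  ; partially evaluated form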

The programs abisort, rantree and tridiag originally had a few uses of a nonstandard construct to return multiple values. Since Gambit does not have such a feature, the multiple value returns were reformulated in standard Scheme. This only affects rantree's performance, because the two other programs used multiple value returns exclusively in the initialization phase, which is not measured.

Tridiag, which solves a set of equations, uses a data set only half the normal size. This data set just barely fits in the memory available on a single processor node of the GP1000 (only part of each node's memory is available for the heap after Gambit has started). This makes it possible to evaluate the program in a uniprocessor configuration, which is useful to generate speedup curves. All other programs were run with the same data set size in order to make direct comparisons easier.

The new programs fall into two main classes. The programs mm (matrix multiplication), scan (parallel prefix operation on a vector) and sum (parallel reduction operation on a vector) are based on divide-and-conquer algorithms. The program poly (polynomial multiplication) implements a form of pipeline parallelism, and qsort (quicksort) is a combination of pipeline and divide-and-conquer parallelism.

The programs were modified in certain places to address shared-memory problems. To lessen contention to shared data in vectors, the nonstandard procedures make-cvector and cvector-ref were used instead of the corresponding standard vector operations. A cvector is a vector with immutable elements (i.e. a constant vector). When a cvector is created, it is copied to the local memory of each processor. Access to a cvector is thus both contention free and fast (as fast as a local memory reference). However, access to the elements of a cvector may still exhibit some contention and remote memory reference latency if the elements are memory allocated structures, as is the case in tridiag, the only program that uses cvectors.
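The interface can be mimicked in portable Scheme as follows. This sketch only captures the functional behavior that the benchmarks rely on (make-cvector filling each slot by calling a thunk, cvector-ref reading one); the per-processor replication itself requires the system's support.

(define (make-cvector n thunk)          ; element i is (thunk), never mutated
  (let ((v (make-vector n #f)))
    (do ((i 0 (+ i 1)))
        ((= i n) v)
      (vector-set! v i (thunk)))))

(define cvector-ref vector-ref)         ; reading is the only operation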

When the shared data was in mutable vectors (i.e. in the programs allpairs, mm, mst, scan and sum), the nonstandard procedures make-dvector, dvector-ref and dvector-set! were used instead of the corresponding standard vector operations. A dvector is a vector whose entries are evenly allocated across the machine (i.e. a distributed vector): if entry i is in the local memory of processor j, then entry i+1 is on processor j+1 (modulo the number of processors). On an n processor machine, a reference to the vector will correspond to a local memory reference with probability 1/n and to a remote reference with probability (n-1)/n. This means that the average cost of an access to a dvector increases with the number of processors, quickly approaching the cost of a remote reference. Dvectors have good contention characteristics because during a given cycle there can be as many accesses to dvectors as there are processors. The average number of contention free accesses will be lower, but this is more of an academic question since in general processors do not all access memory at the same moment.
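The expected cost implied by these probabilities is easy to work out. In the sketch below the two cost parameters are illustrative stand-ins for the measured local and remote latencies, and entry 0 is assumed to live on processor 0:

(define (dvector-home i n-procs)        ; processor holding entry i
  (modulo i n-procs))

(define (average-dvector-cost n-procs local-cost remote-cost)
  (/ (+ local-cost (* (- n-procs 1) remote-cost)) ; 1/n local, (n-1)/n remote
     n-procs))

For example, (average-dvector-cost 64 1. 10.) returns 9.859375, already close to the remote cost.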

Record structures were similarly distributed where possible (i.e. in the programs abisort, mst and tridiag). This was done with a call to the procedure make-vector-chain, which builds a chain of fixed size vectors that are evenly distributed across the machine.
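A portable approximation of make-vector-chain is sketched below; the (count, size) argument convention and the use of slot 0 as the link are inferred from the benchmarks' use of the chain. What cannot be expressed portably is the key property that each vector of the chain resides in the local memory of a different processor.

(define (make-vector-chain n size)
  (let loop ((i 0) (chain #f))
    (if (= i n)
        chain
        (let ((v (make-vector size #f)))
          (vector-set! v 0 chain)       ; slot 0 links to the rest
          (loop (+ i 1) v)))))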

The creation of all these special data structures happens once and for all in the initialization phase of the programs; thus it doesn't contribute to the measurements. Memory allocation in the main part of the program only occurs for qsort and poly, and is done with the standard cons procedure. This means that space is allocated in the local memory of the processor doing the allocation.

The programs were all compiled with special declarations meant to improve performance. All references to predefined variables, such as cons and car, were assumed to be to the corresponding primitive procedure. This essentially means that inline code was generated for calls to simple predefined procedures. All arithmetic operations were assumed to be on small integers (fixnums), except for the program poly, which uses generic arithmetic.
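In Gambit, declarations of this kind plausibly take the following form (the exact declaration list used for the measurements is not reproduced here; poly replaces (fixnum) by (generic), as its listing shows):

(declare (standard-bindings)  ; calls to cons, car, etc. become inline code
         (fixnum))            ; arithmetic is assumed to be on small integers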

In the code that follows, FUTURE and TOUCH are set in upper case to make them stand out. The last line of each program is a call to the macro benchmark, which starts the run. The subforms passed to benchmark are, in order: the name of the program, the expression used to initialize the input data, and the expression that starts the part of the program being measured. A brief description is included with each program.
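A sketch of what the benchmark macro might expand into is given below. The actual macro belongs to the measurement harness and also deals with processor startup and result reporting; the clock procedure name is an assumption of the sketch.

(define-macro (benchmark name init run)
  `(begin
     ,init                              ; build the input data (not measured)
     (let ((start (real-time)))         ; assumed clock procedure
       ,run                             ; the measured part of the program
       (display (list ',name (- (real-time) start)))
       (newline))))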

A.1 abisort

This program sorts integers using the adaptive bitonic sort algorithm described in [Bilardi and Nicolau].

(define-macro (make-node) `(make-vector 3 #f))

(define-macro (node-left x) `(vector-ref ,x 0))
(define-macro (node-value x) `(vector-ref ,x 1))
(define-macro (node-right x) `(vector-ref ,x 2))
(define-macro (node-left-set! x v) `(vector-set! ,x 0 ,v))
(define-macro (node-value-set! x v) `(vector-set! ,x 1 ,v))
(define-macro (node-right-set! x v) `(vector-set! ,x 2 ,v))

(define-macro (swap-left! l r)
  `(let ((temp (node-left ,l)))
     (node-left-set! ,l (node-left ,r))
     (node-left-set! ,r temp)))

(define-macro (swap-right! l r)
  `(let ((temp (node-right ,l)))
     (node-right-set! ,l (node-right ,r))
     (node-right-set! ,r temp)))

(define-macro (fixup-tree-1 root up)
  `(let loop ((pl (node-left ,root))
              (pr (node-right ,root)))
     (if pl
         (compare-and-swap pl pr ,up
           ;; swap right subtrees, search path goes left
           (begin (swap-right! pl pr)
                  (loop (node-left pl) (node-left pr)))
           ;; search path goes right
           (loop (node-right pl) (node-right pr))))))

(define-macro (fixup-tree-2 root up)
  `(let loop ((pl (node-left ,root))
              (pr (node-right ,root)))
     (if pl
         (compare-and-swap pl pr ,up
           ;; swap left subtrees, search path goes right
           (begin (swap-left! pl pr)
                  (loop (node-right pl) (node-right pr)))
           ;; search path goes left
           (loop (node-left pl) (node-left pr))))))

(define-macro (compare-and-swap node1 node2 up true false)
  `(let ((v1 (node-value ,node1))
         (v2 (node-value ,node2)))
     (cond ((if ,up (> v1 v2) (< v1 v2))
            (node-value-set! ,node1 v2)
            (node-value-set! ,node2 v1)
            ,true)
           (else
            ,false))))

(define-macro (pbimerge root spare up)
  `(let loop ((root ,root) (spare ,spare))
     (compare-and-swap root spare ,up
       (fixup-tree-1 root ,up)
       (fixup-tree-2 root ,up))
     (cond ((node-left root)
            (let ((left-half (FUTURE (loop (node-left root) root))))
              (loop (node-right root) spare)
              (TOUCH left-half))))))

(define (pbisort-up root spare)
  (let ((left (node-left root)))
    (if left
        (let ((left-half (FUTURE (pbisort-up left root))))
          (pbisort-down (node-right root) spare)
          (TOUCH left-half)
          (pbimerge root spare #t))
        (compare-and-swap root spare #t #t #f))))

(define (pbisort-down root spare)
  (let ((left (node-left root)))
    (if left
        (let ((left-half (FUTURE (pbisort-down left root))))
          (pbisort-up (node-right root) spare)
          (TOUCH left-half)
          (pbimerge root spare #f))
        (compare-and-swap root spare #f #t #f))))

(define (new-node l r v)
  (let ((node (make-node)))
    (node-left-set! node l)
    (node-right-set! node r)
    (node-value-set! node v)
    node))

(define node-chain #f)

(define (init-node-chain n) ; make a chain of 3-element vects
  (set! node-chain (make-vector-chain n 3)))

(define (make-node)
  (let ((node node-chain))
    (set! node-chain (vector-ref node 0))
    node))

(define (make-inorder-tree depth)
  (let loop ((i 1) (depth depth))
    (if (= depth 0)
        (cons (new-node #f #f i) i)
        (let* ((x (loop i (- depth 1)))
               (ltree (car x))
               (limax (cdr x))
               (y (loop (+ limax 2) (- depth 1)))
               (rtree (car y))
               (rimax (cdr y)))
          (cons (new-node ltree rtree (+ limax 1)) rimax)))))

(define r #f)
(define s #f)
(define k ...) ; tree depth; value elided

(define (init)
  (init-node-chain (expt 2 k))
  (let* ((x (make-inorder-tree k))
         (root (car x))
         (imax (cdr x))
         (spare (new-node #f #f (+ imax 1))))
    (set! r root)
    (set! s spare)))

(benchmark ABISORT (init) (pbisort-up r s))

A.2 allpairs

This program computes the shortest paths between all pairs of nodes of a graph, using a parallel version of Floyd's algorithm.

(define-macro (doall var lo hi . body)
  `(let loop ((,var ,lo) (hi ,hi))
     (if (= ,var hi)
         (let () ,@body)
         (let* ((mid (quotient (+ ,var hi) 2))
                (lo-half (FUTURE (loop ,var mid))))
           (loop (+ mid 1) hi)
           (TOUCH lo-half)))))

(define (apsp-par a n)
  (let ((n-1 (- n 1)))
    (do ((k 0 (+ k 1)))
        ((= k n))
      (let ((kn (* k n)))
        (doall i 0 n-1
          (let* ((in (* i n))
                 (ink (+ in k)))
            (do ((j 0 (+ j 1)))
                ((= j n))
              (let ((kpath (+ (dvector-ref a ink)
                              (dvector-ref a (+ kn j))))
                    (inj (+ in j)))
                (if (< kpath (dvector-ref a inj))
                    (dvector-set! a inj kpath))))))))))

(define (make-linear-adjacency-matrix n)
  (let ((a (make-dvector (* n n) (quotient most-positive-fixnum 2))))
    (dvector-set! a 0 0)
    (do ((i 1 (+ i 1)))
        ((= i n) a)
      (dvector-set! a (+ (* i n) i) 0)
      (dvector-set! a (+ (* i n) (- i 1)) 1)
      (dvector-set! a (+ (* (- i 1) n) i) 1))))

(define a #f)
(define n ...)

(define (init)
  (set! a (make-linear-adjacency-matrix n)))

(benchmark ALLPAIRS (init) (apsp-par a n))

A.3 fib

This program computes F(n), the nth Fibonacci number, using the standard doubly recursive algorithm.

(define (pfib n)
  (let fib ((n n))
    (if (< n 2)
        n
        (let* ((f1 (FUTURE (fib (- n 1))))
               (f2 (fib (- n 2))))
          (+ (TOUCH f1) f2)))))

(benchmark FIB #f (pfib ...))

A.4 mm

This program multiplies two square matrices of integers.

(define (mm m1 m2 m3) ; m3 = m1 * m2

  (define (compute-entry row col) ; loop to compute inner product
    (let loop ((i (* row n))
               (j col)
               (sum 0))
      (if (< j (* n n))
          (loop (+ i 1)
                (+ j n)
                (+ sum (* (dvector-ref m1 i) (dvector-ref m2 j))))
          (dvector-set! m3 (+ (* row n) col) sum))))

  (define (compute-cols-between row i j) ; DAC over columns
    (if (= i j)
        (compute-entry row i)
        (let* ((mid (quotient (+ i j) 2))
               (half1 (FUTURE (compute-cols-between row i mid)))
               (half2 (compute-cols-between row (+ mid 1) j)))
          (TOUCH half1))))

  (define (compute-rows-between i j) ; DAC over rows
    (if (= i j)
        (compute-cols-between i 0 (- n 1))
        (let* ((mid (quotient (+ i j) 2))
               (half1 (FUTURE (compute-rows-between i mid)))
               (half2 (compute-rows-between (+ mid 1) j)))
          (TOUCH half1))))

  (compute-rows-between 0 (- n 1)))

(define m1 #f)
(define m2 #f)
(define m3 #f)
(define n ...)

(define (init)
  (set! m1 (make-dvector (* n n) ...))
  (set! m2 (make-dvector (* n n) ...))
  (set! m3 (make-dvector (* n n) #f)))

(benchmark MM (init) (mm m1 m2 m3))

A.5 mst

This program computes the minimum spanning tree of a graph. A parallel version of Prim's algorithm is used.

(define-macro (make-city) `(make-vector 4 #f))

(define-macro (city-x x) `(vector-ref ,x 0))
(define-macro (city-y x) `(vector-ref ,x 1))
(define-macro (city-closest x) `(vector-ref ,x 2))
(define-macro (city-distance x) `(vector-ref ,x 3))
(define-macro (city-x-set! x v) `(vector-set! ,x 0 ,v))
(define-macro (city-y-set! x v) `(vector-set! ,x 1 ,v))
(define-macro (city-closest-set! x v) `(vector-set! ,x 2 ,v))
(define-macro (city-distance-set! x v) `(vector-set! ,x 3 ,v))

(define (new-city x y closest distance)
  (let ((city (make-city)))
    (city-x-set! city x)
    (city-y-set! city y)
    (city-closest-set! city closest)
    (city-distance-set! city distance)
    city))

(define (prim cities ncities find-closest-city)
  (let* ((maxi (- ncities 1))
         (target (dvector-ref cities maxi)))
    (city-closest-set! target target) ; makes drawing easier
    (let loop ((maxi maxi)
               (target target))
      (if (= maxi 0)
          (add-last-city (dvector-ref cities 0) target)
          (let* ((closesti (find-closest-city cities maxi target))
                 (newcity (dvector-ref cities closesti)))
            (dvector-set! cities closesti (dvector-ref cities maxi))
            (dvector-set! cities maxi newcity)
            (loop (- maxi 1) newcity))))))

(define (add-last-city city newcity)
  (let ((newdist (distance city newcity))
        (olddist (city-distance city)))
    (cond ((< newdist olddist)
           (city-distance-set! city newdist)
           (city-closest-set! city newcity)))))

(define (distance c1 c2)
  (let ((dx (- (city-x c1) (city-x c2)))
        (dy (- (city-y c1) (city-y c2))))
    (+ (* dx dx) (* dy dy))))

(define-macro (combine-interval-ptree lo hi f combine)
  `(let ((lo ,lo) (hi ,hi))
     (let* ((n (+ (- hi lo) 1))
            (adjust (- lo 1))
            (first-leaf (+ (quotient n 2) 1))
            (treeval
             (let loop ((i 1))
               (cond ((< i first-leaf)
                      (let* ((left (FUTURE (loop (* i 2))))
                             (right (,combine (loop (+ (* i 2) 1))
                                              (,f (+ i adjust)))))
                        (,combine (TOUCH left) right)))
                     (else
                      (,f (+ i adjust)))))))
       (if (even? n)
           (,combine treeval (,f hi))
           treeval))))

(define (find-closest-city-ptree cities maxi newcity)
  (combine-interval-ptree 0 (- maxi 1)
    (lambda (i) (update-city i cities newcity))
    (lambda (i1 i2)
      (if (< (city-distance (dvector-ref cities i1))
             (city-distance (dvector-ref cities i2)))
          i1
          i2))))

(define (update-city i cities newcity)
  (let* ((city (dvector-ref cities i))
         (newdist (distance city newcity))
         (olddist (city-distance city)))
    (cond ((< newdist olddist)
           (city-distance-set! city newdist)
           (city-closest-set! city newcity)))
    i))

(define city-chain #f)

(define (init-city-chain n) ; make a chain of 4-element vects
  (set! city-chain (make-vector-chain n 4)))

(define (make-city)
  (let ((city city-chain))
    (set! city-chain (vector-ref city 0))
    city))

(define random (make-random))
(define random-range ...)

(define (make-random-vector-of-cities n)
  (let ((cities (make-dvector n #f)))
    (do ((i 0 (+ i 1)))
        ((= i n) cities)
      (dvector-set! cities i
        (new-city (modulo (random) random-range)
                  (modulo (random) random-range)
                  #f
                  most-positive-fixnum)))))

(define c #f)
(define n ...)

(define (init)
  (init-city-chain n)
  (set! c (make-random-vector-of-cities n)))

(benchmark MST (init) (prim c n find-closest-city-ptree))

A.6 poly

This program computes the square of a polynomial of x with integer coefficients and evaluates the resulting polynomial for a certain value of x.

(declare (generic)) ; use generic arithmetic

(define (poly* p1 p2) ; compute p1*p2
  (if (or (null? p1) (null? p2))
      '()
      (poly*k (cons 0 (poly* p1 (cdr p2)))
              p1
              (car p2))))

(define (poly*k p1 p2 k) ; compute p1+p2*k
  (if (null? p2)
      p1
      (if (null? p1)
          (let ((rest (FUTURE (poly*k '() (cdr p2) k))))
            (cons (* (car p2) k) rest))
          (let ((rest (FUTURE (poly*k (TOUCH (cdr p1)) (cdr p2) k))))
            (cons (+ (car p1) (* (car p2) k)) rest)))))

(define (poly-eval p x) ; compute value of p at x
  (let loop ((p p) (y 1) (sum 0))
    (if (pair? p)
        (loop (TOUCH (cdr p)) (* x y) (+ sum (* (car p) y)))
        sum)))

(define p ...) ; the input polynomial (terms elided)

(benchmark POLY #f (poly-eval (poly* p p) ...))

A.7 qsort

This program sorts a list of integers using a parallel version of the Quicksort algorithm.

(define (qsort lst)

  (define-macro (filter keep lst)
    `(let loop ((lst ,lst))
       (let ((lst (TOUCH lst)))
         (if (pair? lst)
             (let ((head (car lst)))
               (if (,keep head)
                   (cons head (FUTURE (loop (cdr lst))))
                   (loop (cdr lst))))
             '()))))

  (define (qs lst tail)
    (if (pair? lst)
        (let ((pivot (car lst))
              (other (cdr lst)))
          (let ((sorted-larger
                 (FUTURE (qs (filter (lambda (x) (not (< x pivot))) other)
                             tail))))
            (qs (filter (lambda (x) (< x pivot)) other)
                (cons pivot sorted-larger))))
        tail))

  (qs lst '()))

(define (walk lst)
  (let loop ((lst lst))
    (let ((lst (TOUCH lst)))
      (if (pair? lst)
          (loop (cdr lst))
          lst))))

(define l ...) ; randomized list of numbers

(benchmark QSORT #f (walk (qsort l)))

A.8 queens

This program computes the number of solutions to the n-queens problem.

(define (queens n)
  (let try ((rows-left n)
            (free-diag1 -1) ; all bits set
            (free-diag2 -1)
            (free-cols (- (ash 1 n) 1))) ; bits 0 to n-1 set
    (let ((free (logand free-cols (logand free-diag1 free-diag2))))
      (let loop ((col 1))
        (cond ((> col free)
               0)
              ((= (logand col free) 0)
               (loop (* col 2)))
              ((= rows-left 1)
               (+ 1 (loop (* col 2))))
              (else
               (let* ((sub-solns
                       (FUTURE
                        (try (- rows-left 1)
                             (+ (ash (- free-diag1 col) 1) 1)
                             (ash (- free-diag2 col) -1)
                             (- free-cols col))))
                      (other-solns (loop (* col 2))))
                 (+ (TOUCH sub-solns) other-solns))))))))

(benchmark QUEENS #f (queens ...))

A.9 rantree

This program models the traversal of a random binary tree.

(define (lehmer-left seed) (* seed #xFACE))
(define (lehmer-right seed) (* seed #xFEED))

(define (pseudo-random-tree n)
  (let loop ((n n) (seed ...))
    (cond ((< n 2)
           n)
          ((< seed 0) ; branch
           (let* ((ln (modulo seed n))
                  (rn (- n ln))
                  (left (FUTURE (loop ln (lehmer-left seed))))
                  (right (loop rn (lehmer-right seed))))
             (+ (TOUCH left) right)))
          (else
           (loop (- n 1) (lehmer-left seed))))))

(benchmark RANTREE #f (pseudo-random-tree ...))

A.10 scan

This program computes the parallel prefix sum of a vector of integers. The vector is modified in place: a given element is replaced by the sum of itself and all preceding elements.
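As a reference for this specification (and not the parallel algorithm used below), a sequential version on an ordinary vector is:

(define (sequential-scan! v)            ; in-place prefix sum
  (do ((i 1 (+ i 1)))
      ((= i (vector-length v)) v)
    (vector-set! v i (+ (vector-ref v i)
                        (vector-ref v (- i 1))))))

For example, (sequential-scan! (vector 1 2 3 4)) returns #(1 3 6 10).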

(define-macro (scan f c v)
  `(let ((c ,c) (v ,v))
     (let ((n (dvector-length v)))

       (define (pass1 i j) ; up sweep: compute partial sums
         (if (< i j)
             (let* ((m (quotient (+ i j) 2))
                    (left (FUTURE (pass1 i m)))
                    (right (pass1 (+ m 1) j))
                    (result (,f (TOUCH left) right)))
               (dvector-set! v j result)
               result)
             (dvector-ref v j)))

       (define (pass2 i j c) ; down sweep: distribute prefixes
         (if (< i j)
             (let* ((m (quotient (+ i j) 2))
                    (left (FUTURE (pass2 i m c)))
                    (cc (,f c (dvector-ref v m))))
               (pass2 (+ m 1) j cc)
               (dvector-set! v m cc)
               (TOUCH left))))

       (if (> n 1)
           (let ((j (- n 1)))
             (pass1 0 j)
             (pass2 0 j c)
             (dvector-set! v j (,f c (dvector-ref v j))))))))

(define (scan1 c v) (scan + c v))

(define v #f)
(define n ...)

(define (init)
  (set! v (make-dvector n ...)))

(benchmark SCAN (init) (scan1 0 v))

A.11 sum

This program computes the sum of a vector of integers.

(define (sum vect l h) ; sum vector from l to h
  (if (= l h)
      (dvector-ref vect l)
      (let* ((mid (quotient (+ l h) 2))
             (lo (FUTURE (sum vect l mid)))
             (hi (sum vect (+ mid 1) h)))
        (+ (TOUCH lo) hi))))

(define v #f)
(define n ...)

(define (init)
  (set! v (make-dvector n ...)))

(benchmark SUM (init) (sum v 0 (- n 1)))

A.12 tridiag

This program solves a tridiagonal system of equations.

(define-macro (a obj) `(vector-ref ,obj 0))
(define-macro (b obj) `(vector-ref ,obj 1))
(define-macro (c obj) `(vector-ref ,obj 2))
(define-macro (y obj) `(vector-ref ,obj 3))
(define-macro (x obj) `(vector-ref ,obj 4))
(define-macro (a-set! obj v) `(vector-set! ,obj 0 ,v))
(define-macro (b-set! obj v) `(vector-set! ,obj 1 ,v))
(define-macro (c-set! obj v) `(vector-set! ,obj 2 ,v))
(define-macro (y-set! obj v) `(vector-set! ,obj 3 ,v))
(define-macro (x-set! obj v) `(vector-set! ,obj 4 ,v))

(define (reduce-par equ imid)

  (define (reduce-equation i delta)
    (let* ((equi-left (cvector-ref equ (- i delta)))
           (equi-right (cvector-ref equ (+ i delta)))
           (equi (cvector-ref equ i))
           (e (quotient (- (a equi)) (b equi-left)))
           (f (quotient (- (c equi)) (b equi-right))))
      (a-set! equi (* e (a equi-left)))
      (c-set! equi (* f (c equi-right)))
      (b-set! equi (+ (b equi)
                      (* e (c equi-left))
                      (* f (a equi-right))))
      (y-set! equi (+ (y equi)
                      (* e (y equi-left))
                      (* f (y equi-right))))))

  (let dobranch ((i imid)
                 (delta (quotient imid 2)))
    (if (> delta 1)
        (begin
          (reduce-equation i delta)
          (let* ((ileft (- i delta))
                 (iright (+ i delta))
                 (l (FUTURE (dobranch ileft (quotient delta 2)))))
            (dobranch iright (quotient delta 2))
            (TOUCH l)))
        (do ((d delta (- d 1)))
            ((< d 1))
          (reduce-equation i d)))))

(define (back-solve-par equ imid)
  (let loop ((i imid) (delta imid))
    (let ((equi (cvector-ref equ i)))
      (x-set! equi (quotient (- (y equi)
                                (* (a equi)
                                   (x (cvector-ref equ (- i delta))))
                                (* (c equi)
                                   (x (cvector-ref equ (+ i delta)))))
                             (b equi))))
    (if (> delta 1)
        (let* ((newdelta (quotient delta 2))
               (l (FUTURE (loop (- i newdelta) newdelta))))
          (loop (+ i newdelta) newdelta)
          (TOUCH l)))))

(define abcyx-chain #f)

(define (init-abcyx-chain n) ; make a chain of 5-element vects
  (set! abcyx-chain (make-vector-chain n 5)))

(define (make-abcyx)
  (let ((node abcyx-chain))
    (set! abcyx-chain (vector-ref node 0))
    node))

(define n #f)
(define imid #f)
(define equ #f)
(define k ...) ; problem size exponent; value elided

(define (init1)
  (let ((size (expt 2 k)))
    (set! n size)
    (set! imid (quotient n 2))
    (init-abcyx-chain (+ n 1))
    (set! equ (make-cvector (+ n 1) make-abcyx))))

(define (init2) ; coefficient constants elided below
  (do ((i (- n 1) (- i 1)))
      ((= i 0))
    (let ((equi (cvector-ref equ i)))
      (a-set! equi ...)
      (b-set! equi ...)
      (c-set! equi ...)
      (y-set! equi ...)
      (x-set! equi ...)))
  (let ((equ0 (cvector-ref equ 0)))
    (a-set! equ0 ...)
    (b-set! equ0 ...)
    (c-set! equ0 ...)
    (y-set! equ0 ...))
  (let ((equn (cvector-ref equ n)))
    (a-set! equn ...)
    (b-set! equn ...)
    (c-set! equn ...)
    (y-set! equn ...)))

(define (run)
  (reduce-par equ imid)
  (back-solve-par equ imid))

(benchmark TRIDIAG (begin (init1) (init2)) (run))

Appendix B

Execution Profiles for Parallel Benchmarks

This appendix contains execution profiles for each of the parallel benchmarks of Appendix A. An execution profile is a plot representing the activity of the processors as a function of time. Profiles are useful to visualize the behavior of parallel programs. They are also an invaluable tool to detect performance related problems with algorithms and the language implementation.

To generate the profiles, the programs were compiled with the default polling settings. The message-passing protocol supporting the Katz-Weise continuation semantics and legitimacy was used, but fairness was disabled. The programs were run on the GP1000 with 64 processors. Processors can be in one of six distinctive states in the message-passing protocol:

1. Interrupt: the processor is servicing a steal request. This state accounts for heapifying the parent continuation, creating the task, the result and legitimacy placeholders, and responding to the thief.

2. Working: the processor is running the main body of the program (i.e. user code). This accounts not only for all the work that is strictly required by a sequential version of a program, but also includes the following extra work needed to support parallelism: pushing and popping lazy tasks, checking for placeholders as part of TOUCH, waiting for references to remote memory, and restoring continuations. (Measuring all these cases independently would be useful; unfortunately it is impossible to do in an unintrusive way, which is why they are grouped together in one state. Time spent in the working state can thus only serve as an approximation of the work required by a sequential version of the program.)

3. Idle: the processor is looking for work but hasn't yet found an available task in a work queue or a victim processor to interrupt.

4. Touching an undetermined placeholder: an undetermined placeholder was touched. This state indicates the suspension of a task.

5. Determine: a placeholder is being determined prior to the termination of a task.

6. Stealing: the processor has found a victim processor, sent a steal request, and is waiting for a response. The cost of restarting the task is also included, except for restoring the task's continuation.

Only certain transitions between these states are possible, as defined by the following diagram.

[State transition diagram: a processor moves from idle to stealing, from stealing to working, and from working back to idle through the determine or touch-undet states; it can also move directly from idle to working. Steal request interrupts are serviced from the idle and working states.]

Note that it is possible to go directly from the idle state to the working state. This happens when a task is taken from a processor's HTQ. Also note that interrupts can only be serviced in the idle state and in the working state.

For the profiles to be significant, it is important to minimize the impact of monitoring on the behavior of the system. The profiles were obtained by having each processor log an event in a table in local memory whenever there was a state transition. The extra code needed to do this is confined to the runtime system; user code is not changed in any way. Each event indicates the state being entered and the current time, taken from a real time clock with a resolution of a few microseconds. These tables were then dumped to disk for later processing by the analysis program generating the profiles. The cost of logging an event in this way is a few microseconds, which is relatively small compared to the typical duration of states.
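In outline, the logging amounts to the following sketch; the representation of the log and the procedure names are invented here, and the real code preallocates the table in local memory and reads the hardware clock directly.

(define event-log '())                  ; one table per processor, local memory

(define (log-transition! state now)     ; called at every state change
  (set! event-log (cons (cons state now) event-log)))

For example, (log-transition! 'stealing 12345) records that the stealing state was entered at time 12345.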

A profile is divided into three sections. The top part displays the instantaneous activity of the machine, that is, what proportion of all the processors are in each state as a function of time (time is always expressed in milliseconds). Below this is the global activity chart. It indicates what percentage of the run time is spent in each of the states; in other words, it gives the area covered by each state in the instantaneous activity chart. The bottom section consists of state duration histograms, one for every state. Each histogram indicates the distribution of state durations and also the average duration. (The time spent servicing interrupts is ignored when computing the duration of the working and idle states.) Note that each state is represented by a different shade of gray. To help distinguish the shades, the states are always in the same order: from bottom to top in the instantaneous activity chart, and from left to right in the global activity chart.

For each benchmark two profiles are given. The first is for the complete run and the second is a closeup of the beginning of the run.

B.1 abisort

[Execution profiles for abisort (file "abisort-mp.elog", 64 processors): the complete run of about 800 msec and a closeup of the first 10 msec. Each profile shows the instantaneous activity, the global activity chart, and the state duration histograms for the states interrupt, working, idle, touch_undet, determine and stealing.]

B.2 allpairs

[Execution profiles for allpairs (file "allpairs-mp.elog", 64 processors): the complete run of about 3000 msec and a closeup of the first 60 msec.]

B.3 fib

[Execution profiles for fib (file "fib-mp.elog", 64 processors): the complete run of about 28 msec and a closeup of the first 3 msec.]

B.4 mm

[Execution profiles for mm (file "mm-mp.elog", 64 processors): the complete run of about 90 msec and a closeup of the first 5 msec.]

B.5 mst

[Execution profiles for mst (file "mst-mp.elog", 64 processors): the complete run of about 15 seconds and a closeup of the first 40 msec.]

B.6 poly

[Execution profiles for poly (file "poly-mp.elog", 64 processors): the complete run of about 1800 msec and a closeup of the first 100 msec.]

B.7 qsort

[Execution profiles for qsort (file "qsort-mp.elog", 64 processors): the complete run of about 190 msec and a closeup of the first 10 msec.]

B.8 queens

[Execution profiles for queens (file "queens-mp.elog", 64 processors): the complete run of about 55 msec and a closeup of the first 5 msec.]

B.9 rantree

[Execution profiles for rantree (file "rantree-mp.elog", 64 processors): the complete run of about 28 msec and a closeup of the first 5 msec.]

B.10 scan

[Execution profiles for scan (file "scan-mp.elog", 64 processors): the complete run of about 70 msec and a closeup of the first 5 msec.]

B.11 sum

[Execution profiles for sum (file "sum-mp.elog", 64 processors): the complete run of about 28 msec and a closeup of the first 5 msec.]

B.12 tridiag

[Execution profiles for tridiag (file "tridiag-mp.elog", 64 processors): the complete run of about 300 msec and a closeup of the first 10 msec.]

Bibliography

[Adams and Rees] N. Adams and J. Rees. Object-oriented programming in Scheme. In Conference Record of the ACM Conference on Lisp and Functional Programming, August.

[Agarwal] A. Agarwal. Performance tradeoffs in multithreaded processors. Technical Report MIT/LCS/TR, Massachusetts Institute of Technology, Cambridge, MA, April.

[Appel] A. W. Appel. Allocation without locking. Software Practice and Experience, July.

[Arvind and Nikhil] Arvind and R. S. Nikhil. Executing a program on the MIT tagged-token dataflow architecture. IEEE Transactions on Computers, March.

[Baker and Hewitt] H. Baker and C. Hewitt. The incremental garbage collection of processes. Technical Report AI Memo, Mass. Inst. of Technology, Artificial Intelligence Laboratory, March.

[BBN a] BBN Advanced Computers Inc., Cambridge, MA. Inside the GP1000.

[BBN b] BBN Advanced Computers Inc., Cambridge, MA. Inside the TC2000 Computer.

[Bilardi and Nicolau] G. Bilardi and A. Nicolau. Adaptive bitonic sorting: An optimal parallel algorithm for shared-memory machines. SIAM Journal on Computing, April.

[Callahan and Smith] D. Callahan and B. Smith. A future-based parallel language for a general-purpose highly-parallel computer. In Papers from the Second Workshop on Languages and Compilers for Parallel Computing, University of Illinois at Urbana-Champaign.

[Censier and Feautrier] L. M. Censier and P. Feautrier. A new solution to coherence problems in multicache systems. IEEE Transactions on Computers, December.

[Chaiken et al.] D. Chaiken, J. Kubiatowicz, and A. Agarwal. LimitLESS directories: A scalable cache coherence scheme. In ASPLOS-IV: Architectural Support for Programming Languages and Operating Systems.

[Clinger et al.] W. Clinger, A. Hartheimer, and E. Ost. Implementation strategies for continuations. In Conference Record of the ACM Conference on Lisp and Functional Programming, Snowbird, UT, July.

[Clinger] W. Clinger. The Scheme 311 compiler: an exercise in denotational semantics. In Conference Record of the ACM Symposium on Lisp and Functional Programming.

[Dijkstra] E. W. Dijkstra. Co-operating sequential processes. In Programming Languages. Academic Press.

[Dubois and Scheurich] M. Dubois and C. Scheurich. Memory access dependencies in shared-memory multiprocessors. IEEE Transactions on Software Engineering, June.

[Feeley and Miller] M. Feeley and J. S. Miller. A parallel virtual machine for efficient Scheme compilation. In Proceedings of the ACM Conference on Lisp and Functional Programming, Nice, France, June.

[Feeley] M. Feeley. Polling efficiently on stock hardware. In Proceedings of the ACM Conference on Functional Programming Languages and Computer Architecture.

[Franz] Franz Inc., Berkeley, CA. Allegro CL User Manual.

[Friedman and Haynes] D. P. Friedman and C. T. Haynes. Constraining control. In Proceedings of the Twelfth Annual Symposium on Principles of Programming Languages, New Orleans, LA, January. ACM.

[Friedman et al.] D. P. Friedman, M. Wand, and C. T. Haynes. Essentials of Programming Languages. MIT Press and McGraw-Hill.

[Gabriel and McCarthy] R. P. Gabriel and J. McCarthy. Queue-based multiprocessing Lisp. In Conference Record of the ACM Symposium on Lisp and Functional Programming, Austin, TX, August.

[Gabriel] R. P. Gabriel. Performance and Evaluation of Lisp Systems. Research Reports and Notes, Computer Systems Series. MIT Press, Cambridge, MA.

[Gharachorloo et al.] K. Gharachorloo, A. Gupta, and J. Hennessy. Performance evaluation of memory consistency models for shared-memory multiprocessors. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, April.

[Goldman and Gabriel] R. Goldman and R. P. Gabriel. Preliminary results with the initial implementation of Qlisp. In Conference Record of the ACM Conference on Lisp and Functional Programming, Snowbird, UT, July.

[Goodman] J. R. Goodman. Using cache memory to reduce processor-memory traffic. In Proceedings of the 10th International Symposium on Computer Architecture, June.

[Gray] S. L. Gray. Using futures to exploit parallelism in Lisp. Master's thesis, Mass. Inst. of Technology.

[Halstead and Fujita] R. Halstead and T. Fujita. MASA: A multithreaded processor architecture for parallel symbolic computing. In Proceedings of the 15th Annual International Symposium on Computer Architecture.

[Halstead et al.] R. Halstead, T. Anderson, R. Osborne, and T. Sterling. Concert: Design of a multiprocessor development system. In International Symposium on Computer Architecture, June.

[Halstead a] R. Halstead. Implementation of Multilisp: Lisp on a multiprocessor. In Conference Record of the ACM Symposium on Lisp and Functional Programming, Austin, TX, August.

[Halstead b] R. Halstead. Multilisp: A language for concurrent symbolic computation. ACM Trans. on Prog. Languages and Systems, October.

[Halstead c] R. Halstead. Overview of Concert Multilisp: A multiprocessor symbolic computing system. ACM Computer Architecture News, March.

[Haynes et al.] C. T. Haynes, D. P. Friedman, and M. Wand. Continuations and coroutines. In Conference Record of the ACM Symposium on Lisp and Functional Programming, Austin, TX.

[Haynes] Christopher T. Haynes. Logic continuations. In Proceedings of the Third International Conference on Logic Programming. Springer-Verlag, July.

[Hieb et al.] Robert Hieb, R. Kent Dybvig, and Carl Bruggeman. Representing control in the presence of first-class continuations. In ACM SIGPLAN Conf. on Programming Language Design and Implementation, White Plains, New York, June.

[Hockney and Jesshope] R. W. Hockney and C. R. Jesshope. Parallel Computers. Adam Hilger, Bristol and Philadelphia.

[IEEE] IEEE Std 1178-1990. IEEE Standard for the Scheme Programming Language. Institute of Electrical and Electronic Engineers, Inc., New York, NY.

[Ito and Matsui] T. Ito and M. Matsui. A parallel Lisp language PaiLisp and its kernel specification. In Parallel Lisp: Languages and Systems. Springer-Verlag.

[Katz and Weise] M. Katz and D. Weise. Continuing into the future: on the interaction of futures and first-class continuations. In Proceedings of the ACM Conference on Lisp and Functional Programming, Nice, France, June.

[Kessler and Swanson] R. Kessler and M. Swanson. Concurrent Scheme. In Parallel Lisp: Languages and Systems. Springer-Verlag.

[Kessler et al.] R. Kessler, H. Carr, L. Stroller, and M. Swanson. Implementing concurrent Scheme for the Mayfly distributed parallel processing system. Lisp and Symbolic Computation: An International Journal.

[Kranz et al.] D. Kranz, R. Halstead, and E. Mohr. Mul-T: A high-performance parallel Lisp. In ACM SIGPLAN Conf. on Programming Language Design and Implementation, June.

[LeBlanc and Markatos] T. J. LeBlanc and E. P. Markatos. Shared memory vs. message passing in shared-memory multiprocessors. Technical report, University of Rochester, April.

[Lenoski et al.] D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. The Stanford Dash multiprocessor. IEEE Computer, March.

[Miller a] J. S. Miller. MultiScheme: A Parallel Processing System Based on MIT Scheme. PhD thesis, Mass. Inst. of Technology, August. Available as an MIT/LCS technical report.

[Miller b] J. S. Miller. Implementing a Scheme-based parallel processing system. International Journal of Parallel Processing, October.

[Mohr] E. Mohr. Dynamic Partitioning of Parallel Lisp Programs. PhD thesis, Yale University, Department of Computer Science, October.

[Mou] Z. G. Mou. A formal model of divide-and-conquer and its parallel realization. Computer science research report (PhD dissertation), Yale University.

[Murray] K. Murray. The future of Common Lisp: Higher performance through parallelism. In The First European Conference on the Practical Application of Lisp, Cambridge, UK, March.

[Nikhil et al.] R. S. Nikhil, G. M. Papadopoulos, and Arvind. *T: A multithreaded massively parallel architecture. Computation Structures Group Memo, Mass. Inst. of Technology, Laboratory for Computer Science, Cambridge, MA, November.

[O'Krafka and Newton] B. W. O'Krafka and A. R. Newton. An empirical evaluation of two memory-efficient directory methods. In Proceedings of the 17th Annual International Symposium on Computer Architecture. ACM, May.

[Osborne] R. Osborne. Speculative Computation in Multilisp. PhD thesis, Mass. Inst. of Technology. Available as an MIT/LCS technical report.

[Peterson] G. L. Peterson. Myths about the mutual exclusion problem. Information Processing Letters.

[Pfister et al.] G. F. Pfister, W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuliffe, E. A. Melton, V. A. Norton, and J. Weiss. The IBM Research Parallel Processor Prototype (RP3): Introduction and architecture. In International Conference on Parallel Processing.

[R3RS] Revised^3 report on the algorithmic language Scheme. ACM Sigplan Notices, December.

[R4RS] Revised^4 report on the algorithmic language Scheme. Technical Report MIT AI Memo 848b, Mass. Inst. of Technology, Cambridge, Mass., November.

[Rettberg et al.] R. D. Rettberg, W. R. Crowther, P. P. Carvey, and R. S. Tomlinson. The Monarch parallel processor hardware design. IEEE Computer, April.

[Rozas and Miller] G. Rozas and J. S. Miller. Free variables and first-class environments. Lisp and Symbolic Computation: An International Journal.

[Rozas] G. Rozas. A computational model for observation in quantum mechanics. Master's thesis, Mass. Inst. of Technology. Available as an MIT AI technical report.

[Shivers a] O. Shivers. Control flow analysis in Scheme. In ACM SIGPLAN Conf. on Programming Language Design and Implementation, Atlanta, Georgia, June.

[Shivers b] O. Shivers. Data-flow analysis and type recovery in Scheme. In Peter Lee, editor, Topics in Advanced Language Implementation. The MIT Press, Cambridge, Mass.

[Srini] V. P. Srini. An architectural comparison of dataflow systems. IEEE Computer, March.

[Steele] G. L. Steele. Rabbit: a compiler for Scheme. MIT AI Memo 474, Massachusetts Institute of Technology, Cambridge, Mass., May.

[Steinberg et al.] S. Steinberg, D. Allen, L. Bagnall, and C. Scott. The Butterfly Lisp system. In Proc. AAAI, Philadelphia, PA, August.

[Swanson et al.] M. Swanson, R. Kessler, and G. Lindstrom. An implementation of portable standard Lisp on the BBN Butterfly. In Conference Record of the ACM Conference on Lisp and Functional Programming, Snowbird, UT, July.

[Wand] M. Wand. Continuation-based program transformation strategies. Journal of the ACM.

[Weening] J. S. Weening. Parallel Execution of Lisp Programs. PhD thesis, Stanford University, Department of Computer Science. Available as a STAN-CS technical report.

[Zorn et al.] B. Zorn, P. Hilfinger, K. Ho, J. Larus, and L. Semenzato. Features for multiprocessing in SPUR Lisp. Technical Report UCB/CSD, University of California Computer Science Division (EECS), March.