
Lazy Threads: Compiler and Runtime Structures for

Fine-Grained Parallel Programming

by

Seth Copen Goldstein

B.S.E. (Princeton University)

M.S.E. (University of California, Berkeley)

A dissertation submitted in partial satisfaction of the

requirements for the degree of

Doctor of Philosophy

in

Computer Science

in the

GRADUATE DIVISION

of the

UNIVERSITY of CALIFORNIA at BERKELEY

Committee in charge:

Professor David E. Culler, Chair

Professor Susan L. Graham

Professor Paul McEuen

Professor Katherine A. Yelick

Fall

Abstract

Lazy Threads: Compiler and Runtime Structures for Fine-Grained Parallel Programming

by

Seth Copen Goldstein

Doctor of Philosophy in Computer Science

University of California, Berkeley

Professor David E. Culler, Chair

Many modern parallel languages support dynamic creation of threads or require multithreading in their implementations. The threads describe the logical parallelism in the program. For ease of expression and better resource utilization, the logical parallelism in a program often exceeds the physical parallelism of the machine and leads to applications with many fine-grained threads. In practice, however, most logical threads need not be independent threads. Instead, they could be run as sequential calls, which are inherently cheaper than independent threads. The challenge is that one cannot generally predict which logical threads can be implemented as sequential calls. In lazy multithreading systems, each logical thread begins execution sequentially, with the attendant efficient stack management and direct transfer of control and data. Only if a thread truly must execute in parallel does it get its own thread of control.

This dissertation presents new implementation techniques for lazy multithreading systems on conventional machines and compares a variety of design choices. We develop an abstract machine that makes explicit the design decisions for achieving lazy multithreading. We introduce new options on each of the four axes in the design space: the storage model, the thread representation, the disconnection method, and the queueing mechanism. Stacklets, our new storage model, allow parallel calls to maintain the invariants of sequential calls. Thread seeds, our new thread representation, allow threads to be stolen without requiring thread migration or shared memory. Lazy-disconnect, our new disconnection method, does not restrict the use of pointers. Implicit and lazy queueing, our two new queueing mechanisms, eliminate the need for explicit bookkeeping. Additionally, we develop a core set of compilation techniques and runtime primitives that form the basis for the efficient implementation of any design point.

We have evaluated the different approaches by incorporating them into a compiler for an explicitly parallel extension of Split-C. We show that there exist points in the design space (e.g., stacklets, thread seeds, lazy-disconnect, and lazy queueing) for which fine-grained parallelism can be efficiently supported even on distributed-memory machines, thus allowing programmers the freedom to specify the parallelism in a program without concern for excessive overhead.


Contents

List of Figures

List of Tables

Introduction
    Motivation
    The Goal
    Contributions
    Road Map

Multithreaded Systems
    The Multithreaded Model
    Using Fork for Thread Creation
    Defining the Potentially Parallel Call
    Problem Statement

Multithreaded Abstract Machine
    The Multithreaded Abstract Machine (MAM)
        Threads
        Processors and MAM
        Thread Operations
        Inlets and Inlet Operations
        Thread Scheduling
        Discussion
    Sequential Call and Return
    MAM/DF: Supporting Direct Fork
        The MAM/DF Scheduler
        Thread Operations in MAM/DF
        Continuation Stealing
        Thread Seeds and Seed Activation
        Closures
        Discussion
    MAM/DS: Supporting Direct Return
        MAM/DS Operations
        Discussion
    The Lazy Multithreaded Abstract Machine (LMAM)
    Disconnection and the Storage Model
        Eager-disconnect
        Lazy-disconnect
    Summary

Storage Models
    Storing a Thread's Internal State
    Remote Fork
    Linked Frames
        Operations on Linked Frames
        Linked Frame Timings
    Multiple Stacks
        Stack Layout and Stubs
        Stack Implementation
        Operations on Multiple Stacks
        Multiple Stacks Timings
    Spaghetti Stacks
        Spaghetti Stack Layout
        Spaghetti Stack Operations
        Spaghetti Stack Timings
    Stacklets
        Stacklet Layout
        Stacklet Operations
        Stacklet Stubs
        Compilation
        Timings
        Discussion
    Mixed Storage Model
    The Memory System
    Summary

Implementing Control
    Representing Threads
        Sequential Call
        The Parallel-Ready Sequential Call
        Seeds
        Closures
        Summary
    The Ready and Parent Queues
        The Ready Queue
        The Parent Queue
        The Implicit Queue and Implicit Seeds
        The Explicit Queue and Explicit Seeds
        Interaction Between the Queue and Suspension
    Disconnection and Parent-Controlled Return Continuations
    Compilation Strategy
        Lazy-disconnect
        Eager-disconnect
        Synchronizers
        Lazy Parent Queue
    Costs in the Seed Model
        The Implicit Parent Queue
        The Explicit Parent Queue
        The Lazy Parent Queue
    Continuation Stealing
        Control Disconnection
        Migration
    Integrating the Control and Storage Models
        Linked Frames
        Stacks and Stacklets
        Spaghetti Stacks
    Discussion

Programming Languages
    Split-C+threads
        Example Split-C+threads Program
        Thread Extensions to Split-C
        Forkset Statement
        Pcall Statement
        Fork Statement
        Start Statement
        Suspend Statement
        Yield Statement
        Function Types
        Manipulating Threads
        Split-Phase Memory Operations and Threads
        Other Examples
        Lazy Thread Style
        I-Structures and Strands
        Split-C+threads Compiler
    Id

Empirical Results
    Experimental Setup
    Compiler Integration
        Eager Threading
        Comparison to Sequential Code
        Register Windows vs. Flat Register Model
    Overall Performance Comparisons
        Comparing Memory Models
        Thread Representations
        Disconnection
        Queueing
    Running on the NOW
        Using Copy-on-Suspend
    Suspend and Steal Stream Entry Points
    Comparing Code Size
    Id Comparison
    Summary

Related Work

Conclusions

Appendix A: Glossary

Bibliography


List of Figures

Examples of potentially parallel calls.
One example implementation of a logically unbounded stack assigned to each thread. The entire structure is called a cactus stack.
Logical task graphs.
The syntax of the formal descriptions.
Formal definition of a thread as a tuple.
Formal definition of a processor as a tuple.
Formal definition of the MAM as a tuple.
Example translation of a function into pseudocode for the MAM.
The thread exit instruction.
The yield instruction.
The suspend operation.
The fork operation for MAM.
The send instruction.
The ireturn instruction.
The legal state transitions for a thread in MAM.
The enable instruction.
Describes how an idle processor gets work from the ready queue.
Redefinition of the machine and thread for MAM/DF.
The legal state transitions for a thread in the MAM/DF without work stealing.
The fork operation under MAM/DF.
The dfork operation under MAM/DF transfers control directly to the forked child.
The dexit operation under MAM/DF transfers control directly to the parent.
The suspend operation under MAM/DF for a lazy thread.
The new state transitions for a thread in MAM/DF using continuation stealing.
Formal semantics for stealing work using continuation stealing.
Three possible resumption points after executing dfork.
Example thread seed and seed code fragments associated with dforks.
Work stealing through seed activation.
Example translation of a function into pseudocode for the MAM/DF using thread seeds.
Example of seed activation.
The seed return instruction is used to complete a seed routine.
Example translation of a function into pseudocode for the MAM/DS using thread seeds.
Redefinition of a thread for MAM/DS.
A pseudocode sequence for an lfork with its two associated return paths.
The lfork operation in MAM/DS.
The lreturn operation in MAM/DS when the parent and child are connected and the parent has not been resumed since it lforked the child.
The suspend operation in MAM/DS.
The lreturn operation when the parent and child have been disconnected by a continuation-stealing operation.
The lreturn operation in MAM/DS when a thread seed in the parent has been activated.
The lfork operation in LMAM with multiple stacks.
Eager-disconnect used to disconnect a child from its parent when a linked storage model is used.
An example of eager-disconnect when the child is copied to a new stack.
An example of eager-disconnect when the parent portion of the stack is copied to a new stack.
The suspend operation under LMAM/MS and LMAM/S using child copy for eager-disconnect.
An example of how a parent and its child are disconnected by linking new frames to the parent.
The suspend operation for LMAM/MS and LMAM/S using lazy-disconnect.
The lfork operation under LMAM/MS using lazy-disconnect after a thread has been disconnected from a child.
A cactus stack using linked frames.
A cactus stack using multiple stacks.
Layout of a stack with its stub.
Allocating a sequential call or an lfork on a stack.
Allocating a new stack when a fork is performed.
Disconnection using child-copy for a suspending child.
Disconnection caused by seed activation.
Disconnection caused by continuation stealing.
Two example cactus stacks using spaghetti stacks.
Allocation of a new frame on a spaghetti stack.
Deallocating a child that has run to completion.
Deallocating a child that has suspended but is currently at the top of the spaghetti stack.
Deallocating a child that has suspended and is in the middle of the stack.
A cactus stack using stacklets.
The basic form of a stacklet.
The result of a sequential call which does not overflow the stacklet.
The result of a fork, or of a sequential call which overflows the stacklet.
A remote fork leaves the current stacklet unchanged and allocates a new stacklet on another processor.
Pseudocode for a parallel-ready sequential call and its associated return entry points.
Example of a closure and the code to handle it.
An example of a cactus stack split across two processors with the implicit parent queue embedded in it.
Example of when a fork generates a thread seed and when it does not.
Example of how an explicit seed is stored on the parent queue.
A comparison of the different implementations of suspend depending upon the type of parent queue.
Example implementation of two lforks followed by a join.
Each forkset is compiled into three code streams.
An example of applying the synchronizer transformation to two forks followed by a join.
The table on the left shows how a seed is represented before and after conversion. The routines that aid in conversion are shown on the right.
Example implementation of a parallel-ready sequential call with its associated inlets in the continuation-stealing control model.
A parallel-ready sequential call with its associated steal and suspend routines, helper inlets, and routines for handling thread migration.
A local-thread stub that can be used for either stacks or stacklets.
Code fragments used to implement a minimal-overhead spaghetti stack.
An implementation of the Fibonacci function in Split-C+threads.
The new keywords defined in Split-C+threads.
New syntax for the threaded extensions in Split-C+threads.
Example uses of forkset.
Implementation of I-structures in Split-C+threads using suspend.
The flow graph for the sequential and suspend code streams for a forkable function compiled with an explicit parent queue, the stacklet storage model, lazy-disconnect, and thread seeds.
The grain microbenchmark.
Improvement shown by lazy multithreading relative to eager multithreading.
Memory required for eager and lazy multithreading.
Slowdown of uniprocessor multithreaded code running serially over GCC code.
Average slowdown of multithreaded code running sequentially on one processor over GCC code.
Comparison of the three memory models.
Comparison of continuation stealing and seed activation when all threads run to completion.
The execution-time ratio between the continuation-stealing model and the thread-seed model when the first pcall of the forkset suspends.
Comparing the effect of which of the ten pcalls in a forkset suspends.
Comparing thread representations as the percentage of leaves that suspend increases.
Comparison of eager and lazy disconnection when all the threads run to completion.
Comparison of eager and lazy disconnection as the number of leaves suspending changes.
Comparison of eager and lazy disconnection when the first pcall in a forkset with two or ten pcalls suspends.
Comparison of the three queueing methods (explicit, implicit, and lazy) for grain when all threads run to completion.
Comparison of the queueing mechanisms as the number of threads suspending increases.
Speedup of grain on the NOW relative to a version compiled with GCC on a single processor.
A breakdown of the overhead when running on one processor of the NOW.
Speedup of lazy threads (stacklets, thread seeds, lazy disconnection, and explicit queue) on the CM-5 compared to the sequential C implementation, as a function of granularity.
The effect of reducing the polling frequency.
Speedup of grain for the explicit and lazy queue models.
Speedup of an unbalanced grain where there is more work in the second pcall of the forkset.
Speedup of an unbalanced grain where there is more work in the first pcall of the forkset.
Speedup and efficiency achieved with and without copy-on-suspend.
The amount of code dilation for each point in the design space.
Which points in the design space are effective, and why.


List of Tables

Comparison of features supported by different programming systems.
Times for the primitive operations using linked frames.
Times for the primitive operations using multiple stacks.
Times for the primitive operations using spaghetti stacks.
Times for the primitive operations using stacklets.
Times for the primitive operations using a sequential stack for purely sequential calls; stacklets for potentially parallel and suspendable sequential calls with moderate-sized frames; and the heap for the rest.
The cost of the basic operations using an implicit parent queue.
The cost of the basic operations using an explicit parent queue.
The cost of the basic operations using a lazy parent queue.
Comparison of basic thread operations using library packages versus compiler integration.
Comparison of grain compiled by GCC for register windows and for a flat register model.
The minimum slowdowns relative to GCC-compiled executables for the different memory models, and where in the design space they occur, for a given grain size (in instructions).
Dynamic runtime, in seconds, on a SparcStation for the Id benchmark programs under the TAM model and lazy threads with multiple strands.

Chapter 1

Introduction

Motivation

As parallel computing enters its adolescence, its success depends in large part on the performance of parallel programs in the non-scientific domain. Typically, these programs are less structured in their control flow, resulting in irregular parallelism and an increased need to support multiple dynamically created threads of control. Hence, the success of parallel computing hinges on the efficient implementation of multithreading. This dissertation shows how to efficiently implement multithreading on standard architectures using novel compilation techniques, making it possible to tackle a broad class of problems in modern parallel computing. The goal of this thesis is to present and analyze an efficient multithreading system.

Many modern parallel languages provide methods for dynamically creating multiple independent threads of control, such as forks, parallel calls, futures, object methods, and non-strict evaluation of argument expressions. These threads describe the logical parallelism in the program, i.e., the parallelism that would be present if there were an infinite number of processors to execute the threads. The language implementation maps the dynamic collection of threads onto the set of physical processors executing the program, either by providing its own language-specific scheduling mechanisms or by using a general threads package. These languages stand in contrast to languages with a single logical thread of control, such as High Performance Fortran, or a fixed set of threads, such as Split-C and MPI, which are typically targeted to scientific applications.


There are many reasons to have the logical parallelism of the program exceed the physical parallelism of the machine, including ease of expression and better resource utilization in the presence of synchronization delays, load imbalance, and long communication latency. When threads are available to the programmer, they can be used to cleanly express multiple goals, coroutines, and parallelism within a single program. Threads can be used by the programmer or compiler to hide the communication latency of remote memory references. More importantly, they can be used to hide the potentially long latency involved in accessing synchronizing data structures or other synchronization constructs. Load balance can also be improved when many threads are distributed among fewer processors on a parallel system. Finally, the semantics of the language or the synchronization primitives may allow dependencies to be expressed in such a way that progress can be made only by interleaving multiple threads, effectively running them in parallel even on a single processor.

To support unbounded logical parallelism, regardless of the language model, the underlying execution model must have three key features. First, the control model must allow the dynamic creation of multiple threads with independent lifetimes. Second, the model must allow switching between these threads in response to synchronization events and long-latency operations. Third, since each thread may require an unbounded stack, the storage model is a tree of stacks called a cactus stack. Unfortunately, a parallel call or thread fork is fundamentally more expensive than a sequential call because of the storage management, data transfer, control transfer, scheduling, and synchronization involved. Much previous work has sought to reduce this cost by using a combination of compiler techniques and clever runtime representations, or by supporting fine-grained parallel execution directly in hardware. These approaches, among others, have been used in implementing parallel programming languages such as ABCL, CC++, Charm, Cid, Cilk, Concert, Id, Mul-T, and Olden. In some cases, the cost of the fork is reduced by severely restricting what can be done in a thread. Lazy Task Creation, implemented in Mul-T, is the most successful in reducing the cost of a fork. However, in all of these approaches, a fork remains substantially more expensive than a simple sequential call.


The Goal

Our goal is to support an unrestricted parallel thread model and yet reduce the cost of thread creation and termination to little more than the cost of a sequential call and return. We also want to reduce the cost of switching and scheduling threads so that the fine-grained parallelism needed to implement unstructured parallel programs will not adversely affect the overall performance of these programs.

We observe that fine-grained parallel languages promote the use of small threads that are often short-lived, on the order of a single function call. However, the advantages of fine-grained parallelism can be overwhelmed by the overhead of creating a new thread, which is inherently a more complex and expensive operation than a sequential call. Fortunately, logically parallel calls can often be run as sequential calls. For example, once all the processors are busy, there may be no need to spawn additional work, and in the vast majority of cases the logic of the program permits the child to run to completion while the parent is suspended. Thus, a parallel call should be viewed as a potentially parallel call, that is, a point in the computation where an independent thread may, but will not necessarily, be created. The problem is that, in general, when the thread is created there is no way to know whether it will complete without suspending or whether it will need to run in parallel with its parent.

In Figure 1.1 we show three examples of forks that are potentially parallel calls. In Figure 1.1(a), fork fib(n-1) is a potentially parallel call because, if all the processors are busy, fib(n-1) can run to completion while the invoker is suspended. In Figure 1.1(b), the thread created by fork consumer(queue) can run to completion if the shared queue q is never empty and getItem(q) never suspends. However, it is a potentially parallel call because, if the queue becomes empty, the thread running consumer will suspend, and it needs to run in parallel with a thread that fills up the shared queue. The last fork, of task in Figure 1.1(c), is a potentially parallel call because if iptr points to another processor, then the get operation x = *iptr can suspend, causing the thread running task to suspend. If iptr points to a location on the same processor as is running the thread, then the operation will complete immediately; in other words, it can run to completion in serial with its parent.

We implement a potentially parallel call almost exactly like a stack-based sequential call to exploit the fact that most such calls complete without suspending. This differs from a typical implementation, which assumes all parallel calls will run in parallel.


int fib(int n)
{
    int x, y;
    if (n < 2) return n;
    forkset {
        /* control leaves forkset when all forks in forkset have completed */
        x = fork fib(n-1);
        y = fork fib(n-2);
    }
    return x+y;
}

(a) fork fib(n-1)

void consumer(SharedQueue q)             void task(int *global iptr)
{                                        {
    int item;                                int x;
    while ((item = getItem(q)) != END)       x = *iptr;
        ...                                  ...
}                                        }

(b) fork consumer(queue)                 (c) fork task(int *global i)

Figure 1.1: Examples of potentially parallel calls.

The typical implementation allocates a child frame on the heap, stores the arguments into the frame, and schedules the child. Later, the child is run, stores its results in its parent's frame, schedules the parent, and returns the child frame to the heap store. We, on the other hand, implement the potentially parallel call as a parallel-ready sequential call. We call our overall approach lazy threads. The call allocates the child frame on the stack of the parent, like a sequential call. As with a sequential call, control is transferred directly to the child (suspending the parent), arguments are transferred in registers, and, if the child returns without suspending, results are transferred in registers when control is returned to the parent. Moreover, even if the child suspends, the parent is resumed and continues to execute without copying the stack. Instead, the suspending child assumes control of the parent's stack, and further children called by the parent are executed on stacks of their own. In other words, the suspending child steals the thread out from under its parent. If work is needed by another processor, the parent can be resumed to spawn children on that processor.


The key to an effective implementation of lazy threads is to invoke the child (and suspend the parent) without performing any bookkeeping operations above those required by a sequential call, and yet allow the parent to continue before its lazy child completes. In order to reach this goal, we need to pay attention to four components of the invocation and suspension processes:

1. How the thread state is stored. We investigate four storage models: linked frames, multiple stacks, spaghetti stacks, and stacklets.

2. How work in the parent is represented. We study three different representations for work in the parent: thread seeds, closures, and continuations.

3. How work is enqueued so it can later be found. We investigate three queueing methods: implicit queues, explicit queues, and lazy queues.

4. How the child and parent are disconnected so that the parent may continue before the child completes. We study two disconnection methods: eager-disconnect and lazy-disconnect.

The advantage of lazy threads is that when they run sequentially, i.e., when they run to completion without suspending, they have nearly the same efficiency as a sequential call. Yet the cost of elevating a lazy thread into an independent thread of control is close to that of executing it in parallel outright.

Unlike previous approaches to lazy parallelism (e.g., load-based inlining and Lazy Task Creation), our code-generation strategy avoids bookkeeping operations like creating task descriptors, initializing synchronization variables, or even explicitly enqueueing tasks. Furthermore, our approach does not limit the class of legal programs, as opposed to some previous approaches. Through careful attention to the program representation, we pay almost nothing for the ability to elevate a sequential call into a full thread on demand. This contrasts with previous approaches to lazy parallelism based on continuation stealing, in which the parent continues in its own thread, forcing stack copies and migration.


Contributions

The main contribution of this thesis is the development of a set of primitives that can be used to compile fine-grained parallel programs efficiently.

- We introduce new memory management primitives, called stacklets and stubs, to more efficiently manage the cactus stack.

- We introduce new control primitives, called thread seeds and parent-controlled return continuations, to reduce the cost of scheduling and synchronizing threads.

- We analyze, for the first time, the four-dimensional design space for multithreaded implementations: the storage model (linked frames, multiple stacks, spaghetti stacks, and stacklets), the thread representation (continuations, thread seeds, and closures), the queue mechanism (implicit, explicit, or lazy), and the disconnection method (eager or lazy). We show how each point in the space can be implemented using the same set of primitives.

- Using the techniques presented in this thesis, we reimplement several earlier approaches, allowing us to compare, for the first time, the different points in the design space in the same environment.

- We develop a new parallel language, Split-C+threads, which is one of the source languages for our experiments.

- We develop a compiler for Split-C+threads, a multithreaded extension of Split-C, by implementing our compilation framework in GCC.

- We implement a new back end for the Id compiler, a second source language for our experiments, which produces significantly faster executables than previous approaches.

Road Map

Chapter 2 defines the type of multithreaded systems on which this thesis focuses. Chapter 3 presents an abstract multithreaded machine which embodies the principal concepts behind these systems. We use the abstract machine to explore the four axes of the design space: storage models, disconnection methods, thread representations, and queueing methods. In Chapter 4, we describe the implementations of the different possible storage models and how they affect the overall efficiency of multithreading. In Chapter 5, we describe the mechanisms used by the different control models and show how they interact with the storage models. Chapter 6 describes the two source languages, Split-C+threads and Id, used in our experiments, and the techniques used in their compilation. Chapter 7 presents experimental results which compare the different multithreaded implementations using our compilation techniques and the mechanisms introduced in Chapters 4 and 5. Chapter 8 discusses related work. Finally, Chapter 9 contains a summary and concluding remarks. A glossary of special terminology used in this dissertation can be found in Appendix A.

Chapter 2

Multithreaded Systems

In this chapter we define the scope of the multithreaded systems that we study in this dissertation. We begin by defining the attributes of a multithreaded system. We then describe the language constructs for thread creation and define the potentially parallel calls which can be optimized by our system.

The Multithreaded Model

A multithreaded system is one that supports multiple threads of control, where each thread is a single sequential flow of control. We classify multithreaded systems into three categories: limited multithreading systems, virtual software processor systems, and virtual hardware processor systems. Limited multithreading systems are those that support multiple threads of control, but in which each thread must run to completion or does not have an unbounded stack. This thesis concerns itself primarily with virtualizing a software processor, i.e., a multithreaded system in which each thread is a virtual processor with its own locus of control and unbounded stack, but which does not support preemption. A thread in a virtual hardware processor system is a virtual processor in every sense, i.e., each thread has its own locus of control and an unbounded stack, and the system supports preemption.

There are five features that such a system must support to virtualize a hardware processor:

1. Creation. It must be able to create new threads dynamically during the course of the program. This is in direct contrast to the single-program multiple-data (SPMD)


System           Dynamic creation   Lifetime      Suspension   Stack       Preemption
Split-C          no                 fixed         no           Unbounded   no
Cilk             Yes                parent        no           N/A         no
Filaments        Yes                parent        no           N/A         no
CThreads         Yes                independent   yes          fixed       no
NewThreads       Yes                independent   yes          fixed       no
Chorus           Yes                independent   yes          fixed       no
Solaris Threads  Yes                independent   yes          fixed       yes
Nexus            Yes                independent   yes          fixed       yes
Lazy Threads     Yes                independent   yes          unbounded   no

Table 2.1: Comparison of features supported by different programming systems. N/A indicates a feature which is not applicable; in particular, in systems which do not support suspension, threads always run to completion, so each thread does not need its own unbounded stack and all threads can share a single stack.

        D
        |
        C     F   G
        |      \ /
        B       E
         \     /
          \   /
            A

Figure 2.1: One example implementation of a logically unbounded stack assigned to each thread. The entire structure is called a cactus stack.

programming model, where the number of threads is fixed throughout the life of a program (e.g., in Split-C and Fortran-MPI).

2. Lifetime. Each thread may have an independent lifetime. If the lifetime of a thread is limited to the scope of its parent, or is fixed for the duration of the program, then there are types of programs (described later in this chapter) which cannot be represented.

3. Stack. Each thread has a logically unbounded stack. Many multithreaded systems limit the size of a thread's stack (Chorus, QuickThreads, etc.), which in turn restricts the kinds of operations that a thread may perform. In contrast, we provide a logically unbounded stack. This does not require that the stack be implemented in


contiguous memory. Figure 2.1 shows a logically unbounded stack implemented by linking heap-allocated frames for each function to the frame of the invoking function. Empirically, this structure is tall and thin, so we call it a cactus stack.

4. Suspension. Threads may suspend and later be resumed. This implies that they can interact with and depend on each other. Systems that do not support suspension (e.g., Cilk and Filaments) severely restrict the set of programs that can be expressed. For instance, these systems cannot use threads in producer-consumer relationships.

5. Preemption. A thread may be preempted by another thread. Some multithreaded systems also allow preemption of threads, which allows fair scheduling of threads. Without preemption, threads may never get a chance to execute, and without some thought, programs can become livelocked. While preemption is essential for operating system tasks, it is less important for application programs.

If a system supports preemption, then a thread is not guaranteed to run indefinitely. In fact, the operating system may at any time remove a thread from its processor and start another thread. This is possibly the most controversial feature of multithreaded systems. While required in systems where threads are competing for resources, it may be a burden in systems that support multithreading among cooperating threads. In the former case, if threads are non-preemptive, then one thread could block competing threads from ever executing. In the latter case, however, preemption is not needed to guarantee that a thread will run. For example, a thread that is computing a result needed by other threads will always get to run, because the threads needing the result will eventually block. In addition, if preemption exists, then explicit action is required to ensure atomicity of operations. In the absence of preemption, a thread makes all decisions as to when it can be interrupted, so atomicity is easier to guarantee. However, care must be taken to avoid constructs like spin-locks.

In the multithreaded systems we consider, logical parallelism is independent of physical parallelism. A multithreaded system does not specify whether the individual threads are run on many hardware processors, on a single hardware processor by time-slicing, or on a hardware-multithreaded machine.

(On a distributed memory machine, where local memory is owned by a processor and one processor cannot access another processor's local memory, atomicity is easily provided by ignoring the network. On shared memory machines this is harder, but for nonpreemptive systems compilation techniques can provide atomicity when required.)

In short, a multithreaded system supports the dynamic creation of threads, each of which is suspendable, has an independent lifetime, and has a possibly unbounded stack. In addition, the system may allow threads to be preempted. In this thesis we are concerned with virtualizing the processor for a single cooperating task, and thus we concentrate on multithreaded systems that do not support preemption. Our multithreaded system is aimed at application programs, not operating system tasks.

Using Fork for Thread Creation

Many constructs have been developed to create and manage multiple threads of control, such as forks, parallel calls, futures, active objects, etc. These constructs may be exposed to the programmer in library calls (as in PThreads) or as part of the language (as in Mul-T and Cilk), or they may be hidden from the programmer but used by the compiler to implement the language (as in Id).

Fork is the fundamental primitive for creating a new thread. When executed, it creates a new thread, which we call the child thread, that runs concurrently with the thread that executed the fork, which we call the parent thread. While there are many methods to synchronize these threads, the most common is a join. The join acts as a barrier for a group of threads: when they have all executed the join, they join to become a single thread. All but one of the participating threads cease to exist after control has passed the point of the join. We call the threads that are synchronized by a single join a forkset.

All of the other methods for creating threads can be implemented by means of fork. Henceforth we use fork and parallel call interchangeably, and only they are mentioned explicitly.
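For comparison, the same fork/join pattern written against an eager thread library (POSIX threads here) looks like the sketch below; every fork pays for a fully independent thread, which is the cost lazy multithreading seeks to avoid whenever the child could have run as a sequential call. The function names and workload are illustrative only.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in for the child's real work: double the argument in place. */
static void *child(void *arg) {
    int *p = arg;
    *p *= 2;
    return NULL;
}

/* Fork two children and join them; the two children form one forkset. */
int fork_and_join(int x, int y) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, child, &x);   /* fork */
    pthread_create(&t2, NULL, child, &y);   /* fork */
    pthread_join(t1, NULL);                 /* join: barrier for the forkset */
    pthread_join(t2, NULL);
    return x + y;                           /* both children have terminated */
}
```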

Defining the Potentially Parallel Call

In this section we define the potentially parallel call by looking at the three kinds of threads that can be created in a multithreaded system: downward, upward, and daemon threads. This thesis concerns itself mainly with optimizing the potentially parallel call that

[Figure: three logical task graphs with numbered thread nodes 1 through 6. (a) Only downward threads. (b) Downward threads and one upward thread, thread 3. (c) Downward threads and one daemon thread, thread 5.]

Figure: Logical task graphs. Each node is a thread; each node has an edge to its children and to the youngest ancestor alive at the time of its death. Thread 3 is an upward thread, thread 5 is a daemon thread, and a node with no incoming edge is a root thread.

creates a downward thread. We also show how threads that ultimately require concurrency, including upward and daemon threads, can be implemented efficiently.

The logical task graph represents the interaction among all of the threads created during the life of a multithreaded program. Each node in the graph represents a thread. The directed edges represent the transfer of control between threads (see Figure). Each node has an edge to each of its children and to the youngest ancestor that is alive at its death. If no outgoing edge exists for a node, then the thread it represents was still executing at the end of the program. If no incoming edge ends at a node, then it is a root thread, i.e., one of the threads created at program startup. There can be more than one root thread.

By distinguishing three kinds of threads (downward, upward, and daemon) we are able to precisely define the potentially parallel call. A downward thread is one that terminates before its parent terminates; its ancestor edge points to its parent. All the threads in Figure (a) are downward threads. An upward thread is one which terminates after its parent terminates. For example, thread 3 in Figure (b) is an upward thread, indicated by the upward edge going from thread 3 to an ancestor other than its parent. A daemon thread is one which is not a root thread but is still alive at the end of the program, i.e., it has no outgoing edges. For example, in Figure (c), thread 5 is a daemon thread.

When there are neither upward threads nor daemon threads created during an execution, the logical task graph equals its transpose, i.e., every edge participates in a simple cycle of length two. In this case the only synchronization primitive needed for control is the join primitive. If a downward thread does not suspend, then the fork that created it could have been replaced with a sequential call. In other words, downward threads are those created by potentially parallel calls.

(The name "upward" comes about since the characteristics of upward threads mirror those of upward funargs. The name "daemon" is borrowed from Java.)

On the other hand, daemon and upward threads require that independent threads be created, even if none of them suspend. The creation of these threads always incurs the cost of a fork. Daemon and upward threads are not created with potentially parallel calls, but with parallel calls.

Problem Statement

This thesis presents a design space for implementing multithreaded systems without preemption. We introduce a sufficient set of primitives so that any point in the design space may be implemented using our new primitives. The primitives and implementation techniques that we propose reduce the cost of a potentially parallel call to nearly that of a sequential call, without restricting the parallelism inherent in the program. Using our primitives we implement a system (in some cases a previously proposed system) at each point in the design space.

(Similar to tail recursion elimination, some upward threads can be treated as downward threads. If an upward thread is to be joined with one of its ancestors, then before it terminates it must be given a continuation to the ancestor with which it will join. If the parent is a downward thread and creates a child thread as its last operation, it can give the child the parent's own return continuation and then terminate itself. In other words, the parent can wait for its child to complete before returning itself. For this reason we treat the apparently upward child thread as a downward thread.)

Chapter

Multithreaded Abstract Machine

In this chapter we describe the primitives needed to efficiently implement potentially parallel calls on a multithreaded system. We divide these primitives into two classes: storage primitives and control primitives. Storage primitives are used to manage the thread state and to maintain the global cactus stack. Control primitives are used to manage control flow through thread creation, suspension, termination, etc. These primitives support the potentially parallel call without restricting parallelism, yet bring the cost of thread creation and termination down to essentially the cost of the sequential call and return primitives.

As we formalize the multithreaded system, we introduce the abstract primitives that will be used in our implementations. We discuss the thread representation mechanisms: threads, closures, thread seeds, and continuations. We present the two basic methods for disconnecting a lazy thread from its parent and elevating the child to an independent thread: eager-disconnect and lazy-disconnect. We then formalize the four basic storage models that can be used to store thread state: linked frames, multiple stacks, stacklets, and spaghetti stacks. The different queueing methods are discussed in Chapter.

Our exposition proceeds from a basic machine (MAM) to one which efficiently executes potentially parallel calls (LMAM) in four steps. We begin by formalizing a basic multithreaded abstract machine (MAM). We next show how to make forks similar to calls by introducing a direct fork operation (MAM/DF). Next we introduce an abstract machine that makes the termination operation similar to the return mechanism (MAM/DS). Finally, we incorporate more efficient storage management into a lazy multithreaded abstract machine (LMAM). This last abstract machine has all the primitives necessary for efficient potentially parallel calls.

CHAPTER MULTITHREADED ABSTRACT MACHINE

⟨⟩        Used to enclose a tuple.
inst(ip)  The instruction at address ip.
·         Indicates that the field has no valid value.
[...]     Represents a sequence which has been defined only on the values given, but which can expand infinitely (e.g., a stack).
∗         Can be any value of the correct type.
∅         Used for fields with a null value.
M[x]      The contents of memory location x.
r_x       Register number x.
x ← y     x is assigned the value y.
x ≡ y     x is defined to be y.
x = y     x has the value y.

Figure: The syntax of the formal descriptions.

While there are many ways to distribute work among processors, we focus on work stealing. Work stealing is a method of pulling work (i.e., threads) to idle processors. One of the focuses of this work is the representation of threads that are assigned to idle processors. We shall describe two methods of work stealing: continuation stealing and seed activation. Both continuation stealing and seed activation continue work in the parent while its child is still alive. Continuation stealing resumes the parent by resuming the parent's current continuation. Seed activation resumes work generated by the parent; the work is a child thread that would have been created if the parent's continuation had been stolen and the parent immediately executed a fork.

This chapter introduces a substantial amount of new terminology. The reader may want to refer to the glossary in Appendix A. It is suggested that on first reading the formal definitions and rewrite rules be skipped.

The Multithreaded Abstract Machine (MAM)

In this section we define a multithreaded abstract machine (MAM) and the components that make up all multithreaded systems (e.g., threads and processors). In MAM each thread operation is implemented in the most naive and direct manner possible. This will serve as the basis for the refinements required to make multithreading more efficient. It also models the multithreaded systems as implemented in much of the previous work.


Thread ≡ ⟨ip, sp, t_p, σ, s, i_p, l⟩

Where ip  is valid when the thread is not running and represents the address at which to continue in the thread;
      sp  is valid when the thread is not running and represents the current stack pointer for the thread;
      t_p is a pointer to its parent thread. If it has no parent thread (the case with daemon threads), then it is ∅;
      σ   represents the current thread status, {ready, running, idle}. At any instance a thread can be in only one state;
      s   is the thread's stack;
      i_p is the thread's return inlet, an instruction address. It points to the inlet used to return results to the parent. This field combined with t_p comprise the return register. (Inlets are defined in Section.);
      l   is either unlocked or locked. A thread is locked to allow certain operations to be carried out atomically.

Figure: Formal definition of a thread as a tuple. The return register is broken down into its two components, the parent thread and the inlet address (fields three and six, respectively). The state of the thread is included in the tuple. Finally, a lock bit is included.
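A direct C rendering of the thread tuple might look like the following; the field names track the figure, but the concrete types (and the use of raw pointers for the stack and inlet) are assumptions for illustration, not the thesis representation.

```c
#include <assert.h>
#include <stddef.h>

enum status { READY, RUNNING, IDLE };    /* the sigma field */

typedef struct mam_thread {
    void *ip;                /* resume address; valid when not running */
    void *sp;                /* saved stack pointer; valid when not running */
    struct mam_thread *tp;   /* parent thread; NULL for daemon threads */
    enum status st;          /* ready, running, or idle */
    void *stack;             /* the thread's logically unbounded stack */
    void *inlet;             /* return inlet; (tp, inlet) forms the return register */
    int locked;              /* lock bit for atomic thread operations */
} mam_thread;
```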

Threads

In this work a thread is a locus of control that can perform calls to arbitrary nesting depth, suspend at any point, and fork additional threads. Threads are scheduled independently and are nonpreemptive.

Each thread has an associated instruction pointer (ip), stack pointer (sp), stack (s), and return register (t_p and i_p). The instruction pointer points to the next instruction to be executed in the thread. The stack pointer points to the top of the thread's stack and the end of the current activation frame. The return register is used to return results to the parent and contains a continuation to an inlet in the thread's parent.

The general registers are associated with the processor, not the thread. In particular, the general registers used to carry out the computation must be explicitly saved or restored when a thread exits or is entered.

The formal definition of a thread, as shown in Figure, differs from the description in the previous paragraph to facilitate the definition of the operational semantics of the MAM. In particular, each thread has an associated lock flag (l) and state field (σ). The lock flag is used to describe which thread and inlet operations must execute atomically with respect to other operations. The state field is used to describe the scheduling status of a thread: running indicates that the thread is assigned to a processor and currently executing; ready indicates that the thread is on the MAM ready queue and can be assigned to a processor; and idle indicates that the thread is neither running nor ready. These states and thread scheduling transitions are fully described in Section.

Of course, any data from the thread needed for the operation of MAM can be stored in the thread's stack. We represent the thread as a separate tuple in order to make the rewrite semantics easier to understand.

Processor ≡ ⟨ip, sp, t, R⟩

Where ip is the instruction pointer;
      sp is the stack pointer;
      t  is the thread being executed by the processor. If no thread is being executed, then the processor executes the dispatch thread, which is indicated by the ip field having the address wait;
      R  is the set of general-purpose registers {r_0, ..., r_n}.

Figure: Formal definition of a processor as a tuple.

Processors and MAM

A processor is the underlying hardware that threads execute on; it is what the threads are virtualizing. Each processor has an instruction pointer (ip), a stack pointer (sp), a running thread (t), and a set of general-purpose registers (R). The thread currently running on the processor has access to all the registers in R. We represent a processor as a tuple, as shown in Figure. Notice that the processor does not have a stack associated with it, but a stack pointer. The stack, which is pointed to by sp, is the one contained in t.



MAM is composed of a single global address space (M and S) which is accessible by all threads, a set of (possibly only one) processors P, and a ready queue Q. The ready queue contains the set of threads that are ready to execute, i.e., those in the ready state. We represent the MAM formally as a tuple, as shown in Figure.

(Despite its name, the ready queue is not necessarily a queue, and nothing in the semantics of MAM depends on it having any particular structure. It may also be distributed among the processors. We use the terms enqueue and dequeue to mean add to or remove from a data structure, without requiring that the element be added at or removed from a particular location in the data structure. The address space may be composed of distributed or shared memory.)

Machine ≡ ⟨P, T, Q, S, M⟩

Where P is the set of processors;
      T is the set of all threads;
      Q is the ready queue, i.e., the set of threads in the ready state;
      S is the set of currently allocated stacks. Each s ∈ S can grow infinitely; ∪_{s∈S} s ⊆ M, and ∀s, t ∈ S with s ≠ t, s ∩ t = ∅;
      M is the entire global memory, including the heap and all of the stacks.

Figure: Formal definition of the MAM as a tuple.

The code for a program on MAM is divided into two classes: codeblock and inlet. The codeblock code performs the actual work of the threads. The inlet code, described in Section, is the portion of the program that performs inter-thread communication. The program text is divided in this manner to highlight the fact that the codeblock code carries out the computation in the program, while the inlet code is generated by the compiler to handle the communication.

We concern ourselves here with the instructions that affect multithreading. The instructions that may appear in a codeblock are fork, exit, suspend, yield, and send. These instructions create, terminate, suspend, yield, and send data to a thread, respectively. The inlet instructions are recv, enable, and ireturn, which handle incoming communication to a thread. Standard instructions (e.g., arithmetic, logical, memory, control transfer, etc.) may be used in either codeblock or inlet code. (There is no explicit join instruction, as join is synthesized from more basic instructions.)

In order to give the reader an overview of how the individual instructions work together, we now present an example program with a translation for the MAM. The reader is not expected to understand all the details of the individual instructions. Figure shows an example translation of a simple function with two forks and a join. The first lines are


int function() {
    int a,b,c;
    a = fork X();
    b = fork Y();
    join;
    c = a+b;
    return c;
}

 1: function:
 2:   set synch=2
 3:   fork inlet1 = X()
 4:   fork inlet2 = Y()
 5:   suspend
 6:   // enable will restart here.
 7:   add c = a + b
 8:   send tp, ip, c
 9:   exit
10: inlet1:
11:   recv a
12:   add synch = synch - 1
13:   cmp synch, 0
14:   be enab
15:   ireturn
16:
17: inlet2:
18:   recv b
19:   add synch = synch - 1
20:   cmp synch, 0
21:   be enab
22:   ireturn
23:
24: enab:
25:   enable
26:   ireturn

(Lines 1-9 are the codeblock code; lines 10-26 are the inlet code.)

Figure: Example translation of a function with two forks into pseudocode for the MAM. t_p is the parent thread; i_p is the parent inlet.

the codeblock code. They handle all the operations in the user code except for receiving the results from the spawned children. Lines 10-26 are the inlet code, which consists of two inlets. Each handles the result from a child thread and also schedules the thread when appropriate.

The join is accomplished by setting a synchronization variable, synch, to the number of forks in the forkset (in this case, two) in Line 2. When the children return, they decrement the synchronization variable (Lines 12 and 19), and if the variable is zero they enable the thread (Line 25), indicating that the join has succeeded.

The function starts by setting the synchronization variable, and then the forks are performed, followed by a suspend, which causes the runtime system to spawn another thread that is ready to run (i.e., a thread in the ready state). When the two children finish, the codeblock is resumed at Line 6. The thread finishes by sending the result c to an inlet i_p of its parent thread t_p. The parent thread and inlet are both stored in the thread state when the thread is created.


⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, ∗⟩
    and inst(ip) = exit
    and t = ⟨·, ·, t_p, running, s, i, ∗⟩
⟹ ⟨P, T′, Q, S′, M⟩ where p = ⟨wait, ·, ·, ∗⟩
    and T′ = T − {t}
    and S′ = S − {s}

Figure: The thread exit instruction.

Each fork specifies the inlet that will be invoked when the child thread issues the send instruction. For example, inlet1 on Line 10 is executed when the first child (the one executing the function X) finishes. The inlet stores the result into a, as specified in the recv instruction on Line 11. Next the synchronization variable is decremented and tested against zero. If it is not zero, the inlet ends with an ireturn on Line 15. If the synchronization variable is zero, then both threads have finished and the codeblock can continue. The thread is made ready to run by the enable in Line 25.

Thread Operations

In addition to standard sequential operations, there are four thread operations that can be performed by a thread: fork, exit, yield, and suspend. fork creates a new thread. exit terminates a thread and frees all the resources associated with it (see Figure). yield allows a thread to give up its processor by placing itself on the ready queue (see Figure). suspend causes a thread to give up its processor and marks the thread as idle (see Figure). There is no intrinsic join operation, nor is there any intrinsic operation for testing the status of a thread, but both can be synthesized.

fork creates and initializes a new thread (see Figure). fork has two mandatory arguments, the start-address and the inlet-address, and an optional list of parameters that is passed to the new thread. We call the thread executing the instruction the parent thread and the newly created thread the child thread. When fork executes, the child thread's ip is set to the start-address and its sp is set to the base of its newly allocated stack. It is placed on the ready queue, and its parent continuation is set to the return continuation specified

(A more general model, such as TAM, separates thread creation from the passing of arguments. In the more general model, fork can only create the thread; send instructions are then used later to pass the new thread its arguments. TAM requires this more general model.)


⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, ∗⟩
    and inst(ip) = yield
    and t = ⟨·, ·, t_p, running, s, i, ∗⟩
    and t′ ∈ Q
    and t′ = ⟨ip′, sp′, t′_p, ready, s′, i′, ∗⟩
⟹ ⟨P, T, Q′, S, M⟩ where p = ⟨ip′, sp′, t′, ∗⟩
    and t = ⟨ip + 1, sp, t_p, ready, s, i, ∗⟩
    and t′ = ⟨·, ·, t′_p, running, s′, i′, ∗⟩
    and Q′ = Q ∪ {t} − {t′}

⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, ∗⟩
    and inst(ip) = yield
    and Q = ∅
⟹ ⟨P, T, Q, S, M⟩ where p = ⟨ip + 1, sp, t, ∗⟩

Figure: The yield instruction. The first rule applies when there is at least one other thread that is ready to run. The second rule applies when there are no other threads in the ready queue; it is effectively a NOP.

⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, ∗⟩
    and inst(ip) = suspend
    and t = ⟨·, ·, t_p, running, s, i, ∗⟩
⟹ ⟨P, T, Q, S, M⟩ where p = ⟨wait, ·, ·, ∗⟩
    and t = ⟨ip + 1, sp, t_p, idle, s, i, ∗⟩

Figure: The suspend operation. A new thread is scheduled on the processor by the Idle operation (see Section).

⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, R⟩
    and inst(ip) = fork i, adr, arg_1, arg_2, ..., arg_n
    and t = ⟨·, ·, ·, running, ·, ·, ∗⟩
⟹ ⟨P, T′, Q′, S′, M⟩ where p = ⟨ip + 1, sp, t, R⟩
    and T′ = T ∪ {t_new}
    and Q′ = Q ∪ {t_new}
    and S′ = S ∪ {s_new}
    and t_new = ⟨adr, s_new[n], t, ready, s_new, i, ∗⟩
    and s_new = [arg_1, arg_2, ..., arg_n]

Figure: The fork operation for MAM.


in the fork. If there are any arguments passed to the child thread, they are put at the base of the child thread's stack. The sp is set to the first free location in the child thread's stack. The child thread enters the ready state. Finally, control is passed to the next instruction in the parent thread.

In MAM there are no intrinsic synchronization instructions. Instead, synchronization instructions are synthesized from the primitives of MAM. Since we are concentrating on downward threads, the most important synchronization instruction is join, which synchronizes a parent with a set of children that it forked. The parent continues only after all of its children have reached the join, i.e., when they have all completed. To implement this, each time a parent forks a child (or a set of children) it increments a counter variable called a join counter. The inlet that the parent passes to its child decrements the same join counter, making the parent ready with enable if the counter is zero. The join operation is thus a test on the join counter: if it is not zero, then the parent will suspend.
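The join-counter protocol just described can be sketched as ordinary C; here the inlets are plain functions, the enable step is modeled by a flag, and all names are illustrative rather than taken from MAM.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int join_counter;   /* outstanding children in the forkset */
    bool enabled;       /* models enable: parent is ready to resume */
    int a, b;           /* slots the inlets store results into */
} parent_frame;

/* Inlet for the first child: store the result, then decrement the
   join counter; when it reaches zero, the join has succeeded. */
void inlet1(parent_frame *p, int result) {
    p->a = result;
    if (--p->join_counter == 0)
        p->enabled = true;   /* enable */
}

/* Inlet for the second child. */
void inlet2(parent_frame *p, int result) {
    p->b = result;
    if (--p->join_counter == 0)
        p->enabled = true;   /* enable */
}
```

Only the last inlet to run performs the enable, which is exactly the test-on-the-join-counter behavior of the synthesized join.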

Inlets and Inlet Operations

Inlets and their associated instructions generalize the data-transfer portion of the sequential return instruction for threads. A conventional sequential return instruction implicitly performs two tasks: it transfers control from the child to the parent, and it transfers results from the child to the parent in the processor registers. The calling convention specifies which processor registers (maybe just one) contain the results to be returned to the parent. The code executed in the parent by the return instruction (the code following the call that invoked the child that is issuing the return) may use these results immediately, or it may store them in the activation frame for later use. Just as a sequential call has a code fragment that follows it, in MAM each fork is associated with an inlet.

An inlet is a code fragment that processes data received from another thread, which sends the data to the inlet with a send instruction. We call the thread that issues the send the sending thread, and the thread to which the data is sent the destination thread. Because data and control are not transferred simultaneously, the inlet typically has to perform some synchronization task in addition to storing the data in the destination thread.

(This works because inlets run atomically with respect to each other and the codeblock instructions. Of course, inlets may be used to receive data other than results from another thread; they are a general mechanism that enables one thread to communicate with another.)


⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, R⟩
    and inst(ip) = send t′, adr, arg_1, arg_2, ..., arg_n
    and t = ⟨·, ·, ·, running, ·, i, ∗⟩
    and t′ ∈ T
    and t′ = ⟨·, ·, ·, ·, s′, ·, l⟩
    and l = unlocked
    and inst(adr) = recv slot_1, slot_2, ..., slot_n
⟹ ⟨P, T, Q, S, M′⟩ where p = ⟨adr + 1, sp, t′, R⟩
    and s′[slot_x] ← arg_x, ∀x, 1 ≤ x ≤ n
    and M[sp] ← ip + 1
    and M[sp + 1] ← t
    and l ← locked

Figure: The send instruction.

The inlet is run in the sender's thread, but it affects the destination thread's state. In other words, a new thread of control is not started for the inlet. Although an inlet runs in the sender's thread, and therefore on the sender's processor, it can be viewed as preempting the destination thread. Thus thread operations like fork cannot appear in an inlet. Inlets run atomically with respect to the destination thread and other inlets.

MAM defines four instructions that are specifically aimed at inlets: send, recv, ireturn, and enable. send transfers values to another thread. recv is always the first instruction in an inlet; it indicates where the data values specified in the corresponding send are stored in the destination thread's frame. ireturn is used to return control from the inlet routine back to the sending thread. Finally, enable places an idle thread in the ready queue if it is not already there.

Typically, a thread in MAM ends with a send followed by an exit. The send transfers the data from the child to its parent. This invokes an inlet, which stores the data from the child into the parent frame and then performs a synchronization operation. If the inlet determines that the parent should be placed on the ready queue, then it executes an enable. The inlet ends with an ireturn, which transfers control back to the sender. The exit then terminates the child thread.
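On a shared-memory implementation, the send/inlet/ireturn sequence can be viewed as a locked call into the destination thread, as in the sketch below. This is an assumed rendering for illustration, not the MAM primitives themselves; the struct fields and function names are hypothetical.

```c
#include <assert.h>

typedef struct thr thr;
typedef void (*inlet_fn)(thr *dest, int value);

struct thr {
    int locked;         /* models the lock field l of the thread tuple */
    int result_slot;    /* frame slot the recv stores into */
    int join_counter;   /* synchronization state updated by the inlet */
};

/* send: lock the destination, run the inlet on the sender's processor,
   then unlock (the ireturn) so control returns to the sender. */
void send(thr *dest, inlet_fn inlet, int value) {
    dest->locked = 1;        /* inlets run atomically w.r.t. dest */
    inlet(dest, value);
    dest->locked = 0;        /* ireturn */
}

/* An inlet: store the data (the recv), then do the synchronization work. */
void example_inlet(thr *dest, int value) {
    dest->result_slot = value;
    dest->join_counter--;
}
```

Note that no new thread of control is created: the inlet is just code executed on the sender's processor against the destination's state.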

send specifies the destination thread, the inlet to execute, and the data to be sent to the destination thread. As shown in Figure, a send must be matched with a recv at the inlet address. recv and send must have the same number and types of arguments.


⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, R⟩
    and inst(ip) = ireturn
    and t = ⟨·, ·, ·, ·, ·, ·, l⟩
    and l = locked
    and M[sp] = ip′
    and M[sp + 1] = t′
⟹ ⟨P, T, Q, S, M⟩ where p = ⟨ip′, sp, t′, R⟩
    and l ← unlocked

Figure: The ireturn instruction.

In other words, a send/recv pair is used by threads, in place of the processor registers used by sequential return, to transfer data from the child to the parent. send carries out several operations atomically. First, it locks the destination thread to prevent other inlets and codeblock instructions from executing for the destination thread. It then saves the sender's thread and ip on the sender's stack. Next, it stores the arguments of send in the appropriate slots of the destination thread's stack, as specified by recv. Finally, it sets the sender processor's ip to the instruction following the recv. send may appear only in a codeblock.

After recv completes, the inlet continues to execute instructions, which may access the destination thread's state. Among these may be enable (see Figure) and ireturn. ireturn returns control from the inlet back to the sender's thread, unlocking the destination thread in the process (see Figure). ireturn may only appear in an inlet.

Thread Scheduling

During the life of a thread it may be in one of three states: ready, running, or idle. A ready thread is one that the scheduler may run when there is a processor available. A running thread is one that is currently assigned to a processor. An idle thread is waiting on some event to become ready. The state transitions for a thread are shown in Figure.

(For readers familiar with active messages, the setting of a lock may be confusing. The use of a lock is strictly to facilitate the definition of the rewrite rules. In fact, the lock is implicit on networked machines, because inlets by construction run atomically with respect to threads and other inlets.

To avoid unnecessary complexity, without loss of generality in the definition of MAM, we disallow inlets from issuing send instructions. In particular, if inlets are allowed to issue sends, then some mechanism is needed to prevent deadlock. See for a discussion on allowing inlets to issue sends and how it relates to distributed memory machines.)


[Figure: state diagram over the three states Ready, Running, and Idle. Transitions: "Created by parent" enters Ready; "Selected by scheduler" takes Ready to Running; "Yield" takes Running to Ready; "Suspend" takes Running to Idle; "Enabled by inlet" takes Idle to Ready; "Returned to parent" leaves Running.]

Figure: The legal state transitions for a thread in MAM.

⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨ip, sp, t, R⟩
    and inst(ip) = enable
    and t = ⟨ip′, sp′, t_p, σ, s, i, ∗⟩
⟹ ⟨P, T, Q′, S, M⟩ where p = ⟨ip + 1, sp, t, R⟩
    and t = ⟨ip′, sp′, t_p, σ′, s, i, ∗⟩
    and σ′ = ready if σ = idle; σ otherwise
    and Q′ = Q ∪ {t} if σ = idle; Q otherwise

Figure: The enable instruction, which is always executed in an inlet. This instruction is a NOP if t is not in the idle state.

⟨P, T, Q, S, M⟩ where p ∈ P
    and p = ⟨wait, ·, ·, ∗⟩
    and t ∈ Q
    and t = ⟨ip, sp, t_p, ready, s, i, ∗⟩
⟹ ⟨P, T, Q′, S, M⟩ where p = ⟨ip, sp, t, ∗⟩
    and t = ⟨·, ·, t_p, running, s, i, ∗⟩
    and Q′ = Q − {t}

Figure: Describes how an idle processor gets work from the ready queue.


When an idle thread becomes enabled (i.e., when some thread sends a message to one of its inlets, which executes enable), it becomes a ready thread by being placed on the ready queue (see Figure). When a processor is idle, it takes work by retrieving a thread from the ready queue (see Figure). The processor's ip and sp are then loaded with the thread's ip and sp. The processor continues to execute instructions from its running thread until the thread suspends (becomes idle), yields (becomes ready, placing itself on the ready queue), or exits. When the thread exits, it frees its stack and ceases to exist.
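The scheduling transitions above can be modeled by a small dispatch loop over the ready queue. The sketch below deliberately ignores stacks and registers, and uses a LIFO array for the queue, which the semantics permit since no particular order is required; all names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

enum state { READY, RUNNING, IDLE };

typedef struct {
    enum state st;
} mam_thread;

/* The ready "queue"; MAM does not require any particular order. */
#define MAXQ 8
static mam_thread *queue[MAXQ];
static int qlen = 0;

void make_ready(mam_thread *t) {    /* enqueue a ready thread */
    t->st = READY;
    queue[qlen++] = t;
}

/* An idle processor dequeues any ready thread and runs it;
   returns NULL when the processor must stay in its wait loop. */
mam_thread *dispatch(void) {
    if (qlen == 0)
        return NULL;
    mam_thread *t = queue[--qlen];
    t->st = RUNNING;
    return t;
}

void suspend(mam_thread *t) { t->st = IDLE; }   /* wait on an event */

void enable(mam_thread *t) {    /* NOP unless the thread is idle */
    if (t->st == IDLE)
        make_ready(t);
}
```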

Discussion

The abstract machine described here is the basis for most multithreaded systems that exist today. It can be implemented on a physical machine in many ways. In one such implementation, each thread is described by a data structure that contains the three special-purpose registers (ip, sp, and parent) and a pointer to a stack. A thread is mapped onto a processor by loading its ip and sp into the processor's ip and sp and, if necessary, loading the registers from the register save area. Before a thread gives up the processor, the registers are saved in the register save area, and likewise the ip and sp are saved. The fork operation creates a thread data structure, allocates a new stack, initializes the registers, and places the thread on the ready queue. The ready queue can be implemented as a queue, bag, or stack; the choice of ready queue implementation is beyond the scope of this thesis. The other thread operations are easily implemented.
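The naive implementation above can be sketched concretely. The struct layout, field names, FIFO discipline, and stack size below are illustrative assumptions for this sketch, not the thesis's actual data structures:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical thread descriptor for the naive MAM implementation:
   the three special-purpose registers, a private stack, a register
   save area, and a link for the ready queue. */
typedef struct thread {
    void  *ip;             /* saved instruction pointer */
    char  *sp;             /* saved stack pointer */
    struct thread *parent; /* special-purpose parent register */
    char  *stack_base;     /* base of this thread's private stack */
    char   regs[64];       /* register save area */
    struct thread *next;   /* link for the ready queue */
} thread_t;

/* The ready queue here is a FIFO; as the text notes, a bag or a
   stack would satisfy the abstract machine equally well. */
typedef struct { thread_t *head, *tail; } ready_queue_t;

static void enqueue(ready_queue_t *q, thread_t *t) {
    t->next = NULL;
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
}

static thread_t *dequeue(ready_queue_t *q) {
    thread_t *t = q->head;
    if (t) { q->head = t->next; if (!q->head) q->tail = NULL; }
    return t;
}

/* fork in the naive implementation: allocate a stack and a
   descriptor, then place the new thread on the ready queue. */
static thread_t *naive_fork(ready_queue_t *q, void *entry_ip,
                            thread_t *parent, size_t stack_size) {
    thread_t *t = malloc(sizeof *t);
    t->stack_base = malloc(stack_size);
    t->ip = entry_ip;
    t->sp = t->stack_base + stack_size;  /* stack grows down */
    t->parent = parent;
    enqueue(q, t);
    return t;
}
```

Note that every potentially parallel call pays for two allocations and a queue operation here, which is exactly the cost the lazy-thread machinery of the following sections avoids.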

This naive implementation of MAM satisfies the requirements of a multithreaded system, but it is more costly than necessary, since it treats every potentially parallel call as a fork of an independent thread. There are four areas of potential inefficiency:

- There may be many unnecessary enqueue and dequeue operations. For example, when a parent forks a thread and immediately executes a join operation, the child is put on the ready queue and the parent suspends, whereupon the child is pulled off the ready queue and finally executed. Logically, however, the child could have been run immediately.

- The processor registers are not used for transferring data (arguments on call and results on return) between parent and child. This inefficiency arises because MAM separates the transfer of data (the arguments in a fork or the results in a send) from the transfer of control (the actual execution of the child or parent).

- The registers may be unnecessarily saved and restored. For example, the system loads registers when the thread starts, even though none are active. (The description does not mandate a particular register save policy.)

- The thread may not need an entire stack.

Sequential Call and Return

Before considering a more efficient abstract machine, let us review sequential call and return. Observe that the efficiency of a sequential call derives from two cooperating factors. First, the parent is suspended upon call, and all of its descendants have completed upon return. Second, data and control are transferred together on call and return. The first condition implies that storage allocation for the activation frames involves only adjusting the stack pointer. The second condition means that arguments and return values can be passed in registers and that no explicit synchronization is required.

When a sequential call is executed, it passes its child a return address, which is a continuation to the remaining work in the parent. In fact, since the parent cannot continue until its child returns, the return address is a continuation to the rest of the program.

When a sequential return is executed, it returns control to its parent through the return address. We can easily change the destination address because it is stored in memory. This indirect jump provides the only significant flexibility in the sequential call-and-return sequence, and we exploit it to realize a fork, and later a thread return, which behaves like a sequential call and return.
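The indirect jump through a stored return address is the hook the rest of the chapter exploits. As a rough illustration, the sketch below models the return-address slot as an explicit function pointer in memory rather than a hardware stack slot; redirecting the "return" is then just storing a different code pointer before the child finishes:

```c
#include <assert.h>

/* A continuation that receives the child's result. */
typedef void (*cont_t)(int result);

static int observed;
static void normal_return(int r) { observed = r; }   /* original caller */
static void stolen_return(int r) { observed = -r; }  /* redirected target */

/* The "child": finishes and jumps indirectly through the stored
   return continuation, exactly as a return instruction would. */
static void child(int x, cont_t *ret_slot) {
    (*ret_slot)(x + 1);  /* indirect jump through memory */
}

static int run(int redirect) {
    cont_t ret = normal_return;
    if (redirect) ret = stolen_return;  /* overwrite the return address */
    child(41, &ret);
    return observed;
}
```

Here run(0) yields 42 through the normal path, while run(1) shows the same child delivering its result to a different destination, with no change to the child itself.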

MAM-DF: Supporting Direct Fork

In this section we present a multithreaded abstract machine with direct fork (MAM-DF), which eliminates two of the inefficiencies present in MAM when a child runs to completion. First, we eliminate extra scheduling (enqueueing and dequeueing) operations by directly scheduling the child on invocation and directly scheduling the parent when the child terminates. Second, we transfer data and control simultaneously to the child, so that we can use registers when a parent invokes a child. The key difference between MAM and MAM-DF is that MAM-DF introduces a new way to create a thread: the dfork instruction. A dfork does not create an independent thread, but rather a lazy thread. In this section we define lazy threads and introduce the three possible new representations for threads: continuations, thread seeds, and closures.

The MAM-DF dfork behaves more like a sequential call, transferring control directly to its child on call and receiving control when the child exits. When a parent forks a child, instead of creating a new independent thread for the child and placing it on the ready queue, it places itself on a parent queue and immediately begins execution of the child, in a lazy thread, on its processor. When a child completes without suspending, it returns control of the processor to its parent, which continues execution at the instruction after the fork, just as if the fork had been a sequential call.

MAM-DF keeps MAM's fork and exit instructions and adds dfork and dexit, the direct-scheduled versions of fork and exit. fork and exit are used to create and terminate upward and daemon threads. dfork and dexit are used for downward threads: dfork implements a potentially parallel call, and dexit terminates it.

Threads created by dfork are called lazy threads. A lazy thread is not independent on creation, but can become independent in one of three ways. First, it may suspend. Second, it may yield. Third, an idle processor may choose to steal the parent, running the parent concurrently with the child. In all three cases, the relationship between the parent and child changes so that both may run concurrently as independent threads. When a lazy thread becomes independent, it, its parent, or both must be modified to reflect the fact that the child will no longer be returning to a waiting parent.

When a child thread is a lazy thread, we say that it is connected to its parent. If, through a suspend, yield, or steal operation, it becomes an independent thread, we say that it is disconnected. A thread started by fork is also an independent thread. In the rest of the dissertation we use the terms "independent thread" and "lazy thread" when the context is not sufficient to disambiguate our use of the word "thread."

Even though dfork transfers control directly to its child, we must allow for the fact that control may return to the parent before the child has completed; e.g., the child may suspend on a synchronization event. Thus MAM-DF must be flexible enough to start a lazy thread which is expected to run to completion but instead suspends before doing so.


Machine = <P, T, Q, Z, S, M>, where Z is the parent queue.
Thread = <ip, sp, t_p, sigma, s, i, f, l>, where sigma represents the current thread status, sigma in {ready, running, idle, ischild, haschild}, and f is a flag, either true or false, that indicates whether the thread is connected to its parent (true) or is an independent thread (false).

Figure: Redefinition of the machine and thread for MAM-DF. All elements not explicitly defined remain the same as in MAM (see Figure).

To handle this, MAM-DF has a parent queue in addition to the ready queue. Furthermore, threads have an additional field, the connected flag, and can take on two additional states, ischild and haschild. See Figure for a description of the changes to the Machine and Thread tuples.

In the sections that follow we describe the new operations in MAM-DF and the significant changes to the original MAM operations, outline the new scheduling methodology, and then address thread representations and new methods of work stealing.

The MAM-DF Scheduler

In MAM-DF, threads may be scheduled by other threads or by the general scheduler. There are two queues of threads that are ready to run: a ready queue and a parent queue. The ready queue, as in MAM, holds threads in the ready state. These are scheduled the same way they are scheduled in MAM, using a rule similar to the MAM rule in Figure, which assigns a thread in the ready queue to an idle processor. Threads in the parent queue are all in the haschild state and are connected to the child they last invoked with a dfork. Threads in the parent queue are scheduled either directly, by an operation (suspend, yield, or dexit), or through work stealing.

(The parent queue, despite its name, need not be a queue; nothing in the semantics of MAM-DF requires it to have any particular structure. The change to the MAM rule is to add the connected flag to the Thread tuples.)


[State-transition diagram: states Has-Child, Running, Is-Child, Idle, and Ready, each annotated with the value of the connected flag; transitions are labeled "creates child (dfork)", "child exited / returned to parent (dexit)", "suspend", "yield", "created by parent (dfork)", "created by parent (fork)", "enabled by inlet", and "scheduled by general scheduler".]

Figure: The legal state transitions for a thread in MAM-DF without work stealing. Each node in the graph represents a valid state and the value of the connected flag. The solid lines indicate transitions due to the execution of an instruction. The dashed line indicates a transition caused by an outside influence, in this case the general scheduler.

Figure shows the state transitions for a thread. Only threads in the ischild and running states are actually assigned to processors and executing. Threads in the haschild state are on the parent queue, while threads in the ready state are on the ready queue. Essentially, MAM-DF separates the running state of MAM into two states (running and ischild) and the ready state of MAM into three states (ready, haschild with connected flag true, and haschild with connected flag false).

If a child is created by dfork then, unless it suspends or yields, it will remain on the left of the diagram, continuously moving between ischild and haschild. Once a thread becomes independent, it moves to one of the states on the right of the diagram. Only a thread in running or ischild can return to its parent, because only threads in these states are actually executing on a processor.

Thread Operations in MAM-DF

In MAM-DF, fork creates a new independent thread, just as it did in MAM. However, as shown in Figure, the connected flag is set to false to indicate that the thread is independent and will not return control directly to its parent. We sometimes refer to fork as an eager fork, since it always creates an independent thread.


Figure: The fork operation under MAM-DF. When a processor p executing thread t encounters fork i, adr, arg_1 ... arg_n, a new thread t_new is created in the ready state with its connected flag set to false; a new stack s_new is allocated and the arguments are stored on it; and t_new is added to the thread set T, the ready queue Q, and the stack set S. The forking thread continues at the instruction after the fork.

Figure: The dfork operation under MAM-DF transfers control directly to the forked child. When a processor p executing thread t (in state running or ischild) encounters dfork i, adr, arg_1 ... arg_n, a new thread t_new is created in the ischild state with its connected flag set to true and a new stack s_new; the arguments are loaded into the processor registers (r_x = arg_x for x = 1 ... n); the parent t is placed on the parent queue Z in the haschild state; and the processor begins executing t_new at adr. T, S, and Z are extended accordingly.

The new dfork operation, shown in Figure, creates, and then transfers control directly to, a new child: a lazy thread. It does so by loading the arguments of the fork into the processor registers, saving a pointer to the parent thread on the parent queue, and then assigning the processor to the newly created child thread. The state of the new thread is ischild and its connected flag is set to true, indicating that it is connected to, and can return control directly to, its parent. The state of the parent becomes haschild, indicating that it is no longer executing and is on the parent queue.

This abstract machine assumes a caller-save protocol for saving registers. It is assumed that before the dfork is executed there were instructions that saved the active registers in the parent's stack. Upon return, the registers will be restored. It should be noted that, in order for the definition of this machine to make sense, there must be significant compiler interaction.


Figure: The dexit operation under MAM-DF transfers control directly to the parent. The first rule applies when the child completes without suspending, i.e., while still in the ischild state: the child t (connected, ischild) is terminated, its stack is removed from S and the thread from T, its parent t_p is removed from the parent queue Z, and the processor resumes t_p at the parent's saved ip and sp; the parent's new state is ischild if its own connected flag is true, and running otherwise. The second rule applies when the thread has been disconnected from its parent: the thread (running, with connected flag false) is terminated, its resources are freed, and the processor becomes idle, waiting for the general scheduler.

For example, a thread started by fork puts its arguments on the stack, while a thread started by dfork puts them in registers.

The dexit operation exits the thread and returns control from a child thread to its parent (see Figure). If the child thread is in the ischild state (the parent has to be in the haschild state), it is terminated and the parent thread is restarted on the same processor without any intervening operations by the general scheduler. The parent thread returns to the state it was in before the dfork of the child. This state is either running or ischild, depending upon the state of the parent's connected flag. If the child thread is in the running state, dexit behaves like exit: it deallocates the thread's resources, but invokes the general scheduler on the processor.

(Recall that in MAM-DF, control and data are passed simultaneously only on call. On return, data is returned with a send and control with dexit.)
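The dfork/dexit fast path can be sketched as a simulation of the state transitions alone (no real stacks, registers, or instruction pointers); the type and function names below are invented for illustration:

```c
#include <assert.h>

enum st { RUNNING, IS_CHILD, HAS_CHILD };

typedef struct lthread {
    enum st state;
    int connected;           /* true while the child can return directly */
    struct lthread *parent;
} lthread;

#define PQ_MAX 16
static lthread *parent_queue[PQ_MAX];
static int pq_top;

/* dfork: the parent places itself on the parent queue and the child
   starts immediately, connected, on the same processor. */
static void dfork(lthread *parent, lthread *child) {
    parent->state = HAS_CHILD;
    parent_queue[pq_top++] = parent;
    child->state = IS_CHILD;
    child->connected = 1;
    child->parent = parent;
}

/* dexit for a connected child: pop the parent and resume it directly,
   with no trip through the general scheduler.  The parent returns to
   the state it had before the dfork.  Returns NULL for a disconnected
   child, which must instead invoke the general scheduler. */
static lthread *dexit(lthread *child) {
    if (child->state == IS_CHILD && child->connected) {
        lthread *p = parent_queue[--pq_top];
        p->state = p->connected ? IS_CHILD : RUNNING;
        return p;
    }
    return 0;
}
```

The point of the sketch is the fast path: a dfork followed by a dexit is a push and a pop on the parent queue, with control flowing exactly as in a sequential call.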


Figure: The suspend operation under MAM-DF for a lazy thread. When a connected child t in the ischild state executes suspend, its connected flag is set to false and its state becomes idle; its parent t_p is removed from the parent queue Z and resumed on the processor at the parent's saved ip and sp, in state ischild if the parent's own connected flag is true and running otherwise.

suspend and yield also behave differently depending upon whether the thread executing them is in the ischild or running state. If the thread is in the running state, the MAM versions of these operations apply (see Figures). If, however, the thread is in the ischild state, the child and parent threads must be disconnected. After a lazy thread suspends or yields, it becomes independent of its parent, and the parent may execute another dfork. Since a parent cannot have two children that expect to schedule the parent directly when they return, the suspending child is changed so that when it returns it will not schedule the parent directly. This change is indicated by setting the connected flag to false (see Figure).
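A minimal sketch of the disconnect performed when a lazy thread suspends, again as state bookkeeping only (names are invented; the saving of the child's ip and sp is left implicit):

```c
#include <assert.h>

enum { LT_RUNNING, LT_IS_CHILD, LT_HAS_CHILD, LT_IDLE };

typedef struct { int state; int connected; } lt;

/* suspend on a connected child: sever the link first, so a later
   dexit by this child will not try to resume the parent, then mark
   the child idle and resume the parent in its pre-dfork state. */
static void lazy_suspend(lt *child, lt *parent) {
    child->connected = 0;    /* child is now an independent thread */
    child->state = LT_IDLE;  /* waits until an inlet executes enable */
    parent->state = parent->connected ? LT_IS_CHILD : LT_RUNNING;
}
```

After this, the parent is free to dfork another child, which is precisely why the connected flag must be cleared before the parent resumes.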

Continuation Stealing

Continuation stealing is one of two methods by which a thread in the parent queue can resume execution before its child completes. (The other, seed activation, is described in the next section.) Continuation stealing resumes execution of a parent at its current continuation and breaks the connection between a parent and its child. In other words, it turns the child into an independent thread, removes the parent from the parent queue, and then resumes the parent at its current ip. Figure shows the state transitions for MAM-DF when continuation stealing is used to effect work stealing.

The continuation that is stolen is the same one that the child would have used had it returned to the parent with dexit. This continuation exists because the child was started with dfork, which puts a pointer to the parent on the parent queue.


[State-transition diagram: as in the earlier figure, with additional "parent stolen" transitions out of the two Has-Child states.]

Figure: The new state transitions for a thread in MAM-DF using continuation stealing. Here we highlight the new transitions introduced when a thread's continuation is stolen.

Figure: The rule for stealing work using continuation stealing when the child of the stolen thread is not currently running, i.e., the child is in the haschild state. An idle processor removes a thread t from the parent queue Z, disconnects t's child t_c (its connected flag becomes false, making it an independent thread), and resumes t at its saved ip and sp, in state ischild if t's own connected flag is true and running otherwise.

The ip and sp fields in the parent thread are the continuation. If the child returns after the parent is stolen, it exits without scheduling the parent, because it is now an independent thread.

Figure shows the semantics of a continuation-stealing operation when the child of the stolen thread is not currently running on the processor. The resulting configuration is very similar to the configuration of a parent after a child has suspended. The effect of continuation stealing is similar to that of a child suspending, i.e., the parent migrates. The main difference is that the child remains on its processor, while the parent and its ancestors will now be scheduled on what was the idle processor.
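The stealing rule's state changes can be sketched as follows (bookkeeping only, with the saved continuation itself left implicit; names are invented):

```c
#include <assert.h>

enum { T_RUNNING, T_IS_CHILD, T_HAS_CHILD };

typedef struct thr {
    int state;
    int connected;
    struct thr *last_child;  /* the child it last invoked with dfork */
} thr;

/* Continuation stealing: disconnect the child, then resume the
   parent at its saved ip/sp on the idle processor.  Returns the
   thread the idle processor should now run. */
static thr *steal_continuation(thr *parent) {
    parent->last_child->connected = 0;  /* child is now independent */
    parent->state = parent->connected ? T_IS_CHILD : T_RUNNING;
    return parent;
}
```

As in the rule above, the child is untouched apart from its connected flag: it keeps its processor, while the parent migrates to the formerly idle one.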


    case (a)           case (b)           case (c)
    dfork foo()        dfork foo()        dfork foo()
    sequential code    dfork bar()        join
    ...                ...
    join

Figure: Three possible resumption points after executing dfork foo().

Any thread in the parent queue can be scheduled using continuation stealing. However, not all of them have useful work to perform. There are three different situations, shown in Figure, that the parent can be in when it is stolen. In the first two cases the parent will execute useful work, either by running the sequential code, as in case (a), or by creating a new thread, as in case (b). In case (c), however, it will suspend immediately after being scheduled, because the join fails.

Thread Seeds and Seed Activation

A thread seed represents the work in a parent thread that can be forked off into a new thread. When a parent forks a child, it leaves behind a thread seed, which is a continuation up to the next fork in the parent. Seed activation is used by a processor looking for work, i.e., an idle processor. The idle processor finds a parent in the parent queue, which means that the parent has a descendant that is currently executing on a processor. The processor running the parent's descendant is interrupted and begins to activate the thread seed, i.e., it executes the continuation previously left behind in the parent, which forks the next child in the parent. This new child is picked up by the idle processor, the one that initiated the seed activation. In other words, seed activation is a method of making progress in the program by causing a thread in the parent queue to create a new thread. The new thread can then be stolen by another processor. As described below, seed activation uses thread seeds to instantiate nascent threads. A thread seed is the first part of a continuation to the rest of the program: it is the part that spawns a new thread.

Nascent Threads

A nascent thread is a thread that is ready to start execution but has not yet done so. Nascent threads exist because dfork causes control to be transferred from the parent even when the parent has more work that it can perform (the instructions at its return continuation).


    thread t:                 a_steal: // steal fragment following 1st dfork
       dfork foo()               fork bar()
    a: dfork bar()               sreturn
    b: dfork baz()            b_steal: // steal fragment following 2nd dfork
       join                      fork baz()
                                 sreturn
    find(a) = a_steal
    find(b) = b_steal

Figure: Example thread seed and steal code fragments associated with dforks. The thread seed shown is simply the first two fields of the thread tuple (the context pointer and the main code pointer) after thread t executes dfork foo(). There is no seed for the last dfork. Note that the code, if any, represented by the ellipsis is copied to the steal fragments.

Informally, threads that would have been created in MAM, but have not yet been created in MAM-DF, are nascent threads. Formally, a nascent thread is a thread in the logical task graph which is a younger sibling of a thread started by dfork. For example, dfork bar() and dfork baz() are nascent threads after the execution of dfork foo() in Figure.

Thread Seeds

A thread seed consists of a continuation to the rest of the program and one or more partial continuations. A partial continuation is a continuation not to the rest of the program, but only to the portion of the program up to the next fork. It is given by a pointer to a context (i.e., an activation frame) and a set of related code pointers, where each of the code pointers can be derived at link time from a single code pointer. The code pointer used to derive the other code pointers in the seed is the main code pointer. The main code pointer is a continuation to the rest of the program. The derived code pointers are partial continuations that point to code that is a copy of the code at the main code pointer, up to and including the main code pointer. In other words, a thread seed can be viewed as a set of continuations, where each continuation in the set has the same context and each of the code pointers in the set can be derived from the main code pointer. Or, more concretely, it is a set of entry points in a function, each of which performs a different but related task.

(A thread s is a sibling of thread t if s and t have the same parent; s is a younger sibling of t if t was created before s.)


In MAM-DF, the thread seed is represented by a pointer to a thread. The thread's sp is the context, and the thread's ip is the main code pointer from which all the other code pointers in the thread seed can be derived. Figure shows the thread tuple and the portion of it that is the thread seed that represents the nascent thread that will be created by dfork bar().

There is a thread seed for every dfork that has at least one dfork between it and its associated join (see Figure). Each thread seed in MAM-DF has two code pointers. The main code pointer points to the instruction following its dfork. This is the ip that the parent is set to after executing the dfork associated with the thread seed. Thus the main code pointer for a thread seed is the same instruction pointer that would be saved by a sequential call instruction. In fact, the creation of a thread seed at runtime is nothing more than saving the return address of the call to the child thread. The second code pointer points to a code fragment for seed activation, called the steal code fragment. The steal code fragment is used to create the new work upon a work-stealing request.

Figure has a section of a codeblock which includes the steal fragments for the thread seeds associated with the first and second forks. The function find, which is computed at compile or link time, returns the derived code pointers from the main code pointer. This function can be as simple as address arithmetic.
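A thread seed and an address-arithmetic find might look like the sketch below. The fixed STEAL_OFFSET layout is an invented assumption for illustration; a real compiler would arrange whatever layout lets the steal fragment be located from the return address:

```c
#include <assert.h>

/* A thread seed: the first two fields of the thread tuple. */
typedef struct {
    void *context;        /* the parent's activation frame (its sp) */
    char *main_code_ptr;  /* continuation to the rest of the program */
} thread_seed;

/* Hypothetical fixed code layout: each steal fragment is emitted a
   constant distance past its dfork's return address, so deriving the
   second code pointer is a single addition. */
enum { STEAL_OFFSET = 64 };

static char *find_steal(const thread_seed *s) {
    return s->main_code_ptr + STEAL_OFFSET;
}
```

With such a layout, creating a seed at runtime really is just saving a return address; the derived pointers cost nothing until a steal occurs.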

Seed Activation

The key characteristic of seed activation is that a nascent thread is elevated into an independent thread in the context of its parent. This contrasts with continuation stealing, where the parent itself is resumed. With continuation stealing the parent is migrated, as opposed to seed activation, where the new thread is migrated.

Seed activation (see Figure) takes several steps. We describe the procedure with reference to Figures. First, a thread (t_p in Figure) is chosen and temporarily removed from the parent queue. The thread then begins executing, not at its current ip, but at the seed routine pointed to by the second code pointer, computed by find(ip) in Figure. The seed routine starts at the steal fragment in Figure. This is the machine state immediately after work stealing occurs in Figure. The processor that executes the seed routine is the one currently executing the lazy thread descended from the thread that owns the seed. In the example, processor p is executing t_X, the first child of t_p.


Figure: Work stealing through seed activation. An idle processor selects a thread t from the parent queue Z whose descendant is currently executing (in the ischild state) on some running processor. The running processor is interrupted: its current ip and thread are saved on the stack, and it begins executing t's steal code fragment, seed(ip), in the context of t, which remains in the haschild state.

    int function()
    {
      int a,b,c;
      a = fork X();
      b = fork Y();
      join;
      c = a+b;
      return c;
    }

    Codeblock code:
    1:  function:
    2:    set synch=2
    3:    dfork inlet1 = X()
    3a:   // code, if any, between dforks
    4:    dfork inlet2 = Y()
    5a:   cmp synch,0
    5b:   be continue
    5c:   suspend
    6:  continue:  // enable will restart here.
    7:    add c = a + b
    8:    send tq, iq, c
    9:    dexit

    27: X_steal:
    27a:  // duplicate of code in 3a
    28:   fork inlet2 = Y()
    29:   sreturn continue

    find(4) = 27

    Inlet code:
    10: inlet1:
    11:   recv a
    12:   add synch = synch - 1
    13:   cmp synch, 0
    14:   be enab
    15:   ireturn
    17: inlet2:
    18:   recv b
    19:   add synch = synch - 1
    20:   cmp synch, 0
    21:   be enab
    22:   ireturn
    24: enab:
    25:   enable
    26:   ireturn

Figure: Example translation of a function with two forks into pseudocode for MAM-DF using thread seeds. t_q is the parent thread; i_q is the parent inlet.


[Four MAM-DF machine-state snapshots, showing the processor p and the thread tuples for t_p, t_X, and (in the later states) t_new as the steal proceeds.]

Figure: Example of seed activation, assuming t_X is executing function X on processor p. The numbers in the first field of each thread tuple represent the instruction pointer, keyed to the line numbers in Figure. The first state occurs right before work stealing happens; the second is right after work stealing occurs (here find returns the X_steal fragment); the third is after the fork occurs; the last is after the sreturn has executed.

Figure: The seed return instruction (sreturn) is used to complete a seed routine. When the interrupted processor executes sreturn ip_new, the parent thread t is returned to the parent queue Z with its ip updated to ip_new (the instruction following the dfork associated with the seed routine), and the processor resumes the interrupted child at the ip and thread previously saved on the stack.


In other words, the idle processor causes a running processor to be interrupted. The seed routine executes and forks off a new child, which is mapped to the idle processor through the MAM idle rule (see Figure). Finally, the sreturn at the end of the seed routine executes, which returns the parent thread to the parent queue and sets the parent thread's ip to the instruction following the dfork associated with the seed routine that was executed.

The seed routine finishes by executing an sreturn (see Figure). sreturn returns control of the processor to the descendant that was interrupted, and also returns the thread to the parent queue, updating the ip of the thread to reflect the fact that it has made progress. When the parent is next resumed, either by its child (through dexit, suspend, or yield) or by another steal request, it will continue execution at the point after the dfork of the nascent thread it just activated.

Seed activation requires support from the compiler in two ways. First, the compiler must construct the seed routines and the find function for every dfork. Second, it must create two entry points for every thread: one for dfork and one for fork. Because fork creates the child, stores the arguments, and then continues in the parent, the fork entry point loads the arguments from the stack into registers and then continues at the dfork entry point, which assumes that the arguments are in registers. Two entry points are necessary because a routine which will be dforked in the main computation path may be forked from a seed routine.

Closures

The main drawback to thread seeds, as compared to continuation stealing, arises when a dfork is followed by sequential code (case (a) in Figure). The sequential code cannot be represented by a thread seed, so if a dfork starts the first thread, the total amount of parallelism available is reduced. In order to solve this problem, we introduce another representation for a thread: the closure.

We introduce a new fork operation, cfork, that creates a closure, which becomes an independent thread when executed. The closure contains the necessary data (the instruction pointer and arguments specified in the fork instruction) to start the thread later. Closures are enqueued on a separate closure queue.

If a single fork is followed by sequential code which can run in parallel with the forked child, we use cfork to create a closure instead of creating a lazy thread or an


independent thread. When the join is reached, one of two things happens to the closure created for the child. The closure created by the cfork may have been stolen by another processor, in which case the join fails or succeeds depending on the child's completion. Otherwise, the closure is still awaiting execution, in which case the join fails and the closure is instantiated.
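A closure and cfork might be sketched as below. The closure-queue layout and the fixed two-argument signature are illustrative assumptions; a real implementation would store arbitrary arguments alongside the code pointer:

```c
#include <assert.h>

/* A closure: enough data (code pointer plus arguments) to start the
   thread later, as created by cfork. */
typedef struct {
    int (*fn)(int, int);  /* entry point specified by cfork */
    int args[2];          /* saved arguments */
} closure;

#define CQ_MAX 8
static closure cq[CQ_MAX];  /* the separate closure queue */
static int cq_len;

static void cfork(int (*fn)(int, int), int a, int b) {
    closure c = { fn, { a, b } };
    cq[cq_len++] = c;
}

/* The join side, in the case where no closure was stolen: each
   closure still awaiting execution is instantiated in place. */
static int join_run_pending(void) {
    int sum = 0;
    while (cq_len > 0) {
        closure c = cq[--cq_len];
        sum += c.fn(c.args[0], c.args[1]);
    }
    return sum;
}

static int add_pair(int a, int b) { return a + b; }
```

Unlike a thread seed, the closure carries its arguments with it, so the sequential code after the fork can run in the parent while the closure waits to be stolen or instantiated.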

Discussion

MAM-DF eliminates some of the overhead of multithreading by allowing a parent thread to schedule its child thread directly. This reduces scheduling costs, by avoiding a general-purpose enqueue and dequeue, and reduces memory costs, by allowing arguments to be passed in registers. It also exposes the different policies allowed for representing work (threads that can be stolen): continuations, thread seeds, and closures.

Seed activation in MAM-DF creates an independent thread, leaving a parent's preexisting child still connected to the parent. We could have chosen to disconnect the preexisting child and create the new child with a dfork. This more closely parallels what must happen with continuation stealing, where the preexisting child is disconnected from the parent and the parent continues, which causes new children to be created as lazy threads. We have chosen to use fork in seed activation for reasons discussed in a later chapter.

MAM-DF still has scheduling inefficiencies. A child cannot return control and data simultaneously to its parent. Instead, as in MAM, a child must return the results through an inlet and then return control with dexit. Furthermore, if both children execute sequentially, i.e., they are never disconnected from their parent, we perform unnecessary synchronization. We now define a machine that allows us to eliminate this overhead.

MAM-DS: Supporting Direct Return

In this section we modify our multithreaded abstract machine to allow a thread to directly schedule its child on creation and its parent on return. This optimization allows control transfer between threads to more accurately mimic that of a sequential call and thus, when the semantics allow, to obtain similar efficiency. In particular, registers can be used to transfer both arguments and results.

We present new MAM-DS mechanisms which grant a lazy thread that runs to completion the ability to return results when it returns control to its parent.


dierence b etween MAMDS and MAMDF is that in MAMDS the thread exit op eration

lreturn can sp ecify return results which will b e transferred in registers directly to the

parent as the thread exits This change makes lazy thread exit similar to a sequential return

instruction The creation of lazy threads also changes In MAMDS lfork creates a lazy

thread and it diers from dfork in that an inlet is not sp ecied Instead the instruction

following the lfork is implied as the inlet just as in a sequential call

To understand the ramifications of these changes we need to analyze the actions carried out on return from a child thread. A return operation consists of three independent operations. First, it must return the results to its caller. Second, it must update the synchronization status in the parent, which may have multiple outstanding children. Third, it must ensure that the parent restarts at the proper continuation address. In MAM and MAM-DF the first two steps are carried out by invoking an inlet on the parent thread. The third step is achieved because the parent maintains an ip, which is the address at which it will restart when control is transferred to it, either from the general scheduler, from a work-stealing request, or directly from the child.

If we compare these three independent operations to the sequential return instruction, we see that a return instruction combines all three actions into a single operation. It returns control to the parent at its return address, passing results in registers. The instructions at the return address act as the inlet and save the results in the parent context. Then it continues to execute instructions in the parent. There is no synchronization step, since sequential calls always run to completion. The return address acts as both the inlet address and the continuation address.

Figure shows a translation of a function with two forks into pseudocode for MAM-DS using thread seeds. Lines 1-7a are executed when both children run to completion. The child created in line 3 returns to line 3a. No explicit synchronization is performed, and arguments and results are transferred directly in registers.

If the first child, X, suspends, control is transferred to line 30. This code fragment (lines 30-40) sets up the synchronization counter (line 31), changes X's inlet address (line 32), forks Y, and then handles Y's return (lines 33-35). The synchronization counter is set here because the order of return for the two children is no longer known. For the same reason, X's inlet is changed to inlet1, where it stores its result and performs synchronization on return. The reader should convince herself that whatever the order of X's and Y's return,


  Source:
    int function() {
      int a,b,c;
      a = fork X();
      b = fork Y();
      join;
      c = a+b;
      return c;
    }

  Codeblock code:
     1: function:
     3:   lfork slot0 = X()
     3a:  store a <- result
     4:   lfork slot1 = Y()
     4b:  store b <- result
     6: continue:
     7:   add c = a + b
     7a:  lreturn c

    Suspend stream:
    30: X_suspend:
    31:   set synch=2
    32:   set slot0 <- inlet1
    33:   lfork inlet2 = Y()
    34:   store b <- result
    35:   add synch = synch - 1
    36: join:
    37:   cmp synch,0
    38:   be continue
    39:   suspend
    40:   goto continue

    Steal stream:
    41: X_steal:
    42:   set synch=2
    43:   set slot0 <- inlet1
    44:   fork inlet2 = Y()
    45:   sreturn join

  Inlet code:
    10: inlet1:
    11:   recv a
    12:   add synch = synch - 1
    13:   cmp synch, 0
    14:   be enab
    15:   ireturn

    17: inlet2:
    18:   recv b
    19:   add synch = synch - 1
    20:   cmp synch, 0
    21:   be enab
    22:   ireturn

    24: enab:
    25:   enable
    26:   ireturn

  find(3a, suspend) = 30
  find(3a, steal)   = 38

Figure: Example translation of a function with two forks into pseudocode for MAM-DS using thread seeds. slot0 and slot1 are slots in the thread's activation frame used to hold inlet addresses. result is the register used to pass results from callees to callers.

Figure: Redefinition of a thread for MAM-DS, where i is either an inlet address or an index into the parent's inlet return vector. All elements not defined here remain the same as in MAM-DF.

the proper synchronization is performed, and eventually lines 6-7a will be executed, finishing the function.
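The synchronization-counter protocol the inlets implement can be modeled in a few lines of C. This is an illustrative sketch, not code from the dissertation: the names frame_t, inlet1, inlet2, and join_body are assumptions, with synch playing the role of the counter in the example.

```c
#include <assert.h>

/* Hypothetical model of the figure's synchronization counter: each
 * returning child runs its inlet, which stores the result and
 * decrements synch; the join is enabled only when synch reaches 0. */
typedef struct {
    int a, b, c;   /* results of the two children */
    int synch;     /* outstanding-children counter */
} frame_t;

/* inlet1: receives X's result; returns nonzero when the join is enabled. */
static int inlet1(frame_t *f, int result) {
    f->a = result;
    return --f->synch == 0;
}

/* inlet2: receives Y's result. */
static int inlet2(frame_t *f, int result) {
    f->b = result;
    return --f->synch == 0;
}

/* join body: runs only after both inlets have fired. */
static int join_body(frame_t *f) {
    return f->c = f->a + f->b;
}
```

Whichever child returns second enables the join, which is exactly why the counter must be initialized before the order of returns becomes unknown.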

If a steal request arrives while X is executing, the steal code fragment (lines 41-45) is executed. This code fragment activates the seed for Y as an independent thread, also setting up synchronization and changing X's inlet address.

From the previous example we see that in MAM-DS each potentially parallel call has two associated return paths: one for the case when the child runs to completion, and the other for the case when the child has been disconnected from the parent. If the child runs to completion, it can use a mechanism similar to a sequential return, i.e., it returns to the instruction following the fork. If disconnected, the inlet approach of MAM is used.


To support this dual-return strategy in MAM-DS we introduce indirect inlets. An indirect inlet is simply an inlet invoked through an indirect jump. The key behind this change is that the indirect inlet address can be changed by the parent to reflect the parent's current state. We redefine the return register in a thread to be a pointer to an indirect inlet address (see Section ). We also redefine the find mapping: find(adr, type) maps the ip at adr to a code pointer determined by type, which is one of suspend or steal.

Thread seeds in MAM-DS have three code pointers: the main code pointer, the suspend code pointer, and the steal code pointer. When thread seeds are used to represent the nascent threads, find maps the thread's ip (the main code pointer of the thread seed) to one of two code pointers: find(ip, suspend) returns the suspend code pointer, and find(ip, steal) returns the steal code pointer.

When continuations are used to represent the remaining work in a thread, find maps the ip in a thread into the continuation that will be stolen. In this case, find ignores the second argument.
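The three code pointers of a thread seed and the find mapping can be sketched in C as a small lookup table. Everything here — the struct and function names and the table-based lookup — is an illustrative assumption; a real compiler would emit this mapping statically next to the generated code rather than search a table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of a thread seed and the find() mapping. */
typedef void (*code_ptr)(void);

typedef enum { FIND_SUSPEND, FIND_STEAL } find_type;

typedef struct {
    code_ptr main_cp;     /* address following the lfork (the seed's ip) */
    code_ptr suspend_cp;  /* code run when the current child suspends    */
    code_ptr steal_cp;    /* code run on a work-stealing request         */
} thread_seed;

/* find(ip, type): map the seed's ip to its suspend or steal code pointer. */
static code_ptr find(thread_seed *seeds, size_t n,
                     code_ptr ip, find_type type) {
    for (size_t i = 0; i < n; i++)
        if (seeds[i].main_cp == ip)
            return type == FIND_SUSPEND ? seeds[i].suspend_cp
                                        : seeds[i].steal_cp;
    return NULL;  /* no nascent thread at this ip */
}

/* Stand-ins for compiler-generated code fragments. */
static void after_lfork(void) {}
static void on_suspend(void) {}
static void on_steal(void)   {}
```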

MAM-DS Operations

MAM-DS replaces dfork with lfork and dexit with lreturn. lfork, like dfork, creates a lazy thread and transfers control directly to the new thread. lreturn simultaneously returns control and data from the child to its parent when possible.

Indirect Inlets

The main challenge in implementing MAM-DS is to support, with a single instruction, two different return paths: one for return when the child is connected to the parent, and one for return when the parent and child have been disconnected. Instead of passing to the child the address of the inlet the child should invoke upon return, lfork passes the child the address of a location in the parent's frame which contains the inlet address. We call this location the indirect inlet address. Thus inlet invocation is performed via an

By using a mapping function to derive the continuation to be stolen from the ip, we blur the distinction between continuation stealing and seed activation. In previous work the actual continuation to be stolen is posted on a queue, so that two different continuations need to be saved: one used by the child for return, and one used by the system for suspension and work stealing. The mapping function eliminates the need for two continuations and allows us to focus on the more important difference between continuation stealing and seed activation, viz., how they treat the connection between the parent and child.


    lfork i = routine(args)      ; side effect of lfork:
                                 ;   indirect inlet address <- rtc_inlet
  rtc_inlet:
    store results
  continue:
    more code ...

  suspend_label:                 ; side effect of suspend:
    setup synchronization        ;   indirect inlet address <- inlet_after_suspend
    goto continue

  inlet_after_suspend:
    store results
    perform synchronization
    ireturn

  convertReturn(rtc_inlet) = inlet_after_suspend
  find(rtc_inlet, suspend) = suspend_label

Figure: A pseudocode sequence for an lfork with its two associated return paths. If the child runs to completion, it returns to rtc_inlet. Otherwise it runs the suspend routine and later returns to inlet_after_suspend. This example uses continuation stealing.

Inlet invocation is thus performed via an indirect jump through the indirect inlet address. The key behind this change is that the indirect inlet address can be changed by the parent to reflect the parent's current state.

When the child is invoked, its indirect inlet address is set to the address of an inlet that behaves like the code after a sequential call (see rtc_inlet in Figure ). If the child runs to completion, no synchronization is performed, and the rtc inlet runs when the child returns control to the parent, storing the results and continuing. If the parent is resumed before the child returns (e.g., the child suspends), then the indirect inlet address is changed by the routine at suspend_label to contain a new inlet. The new inlet (inlet_after_suspend in Figure ) stores the results and then performs the necessary synchronization. The inlet used after a child suspends must end with an ireturn. This ensures that control is transferred to the proper place when the inlet finishes.
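The two return paths through a single indirect call can be sketched in C by modeling the indirect inlet address as a function-pointer slot in the parent's frame. All names here are illustrative assumptions, not the dissertation's implementation; the point is only that the child always returns the same way, while the parent swaps what that return means.

```c
#include <assert.h>

/* Hypothetical sketch of an indirect inlet: the parent's frame holds
 * one slot -- the indirect inlet address -- and the child always
 * returns through it with an indirect call. */
typedef struct parent_frame parent_frame;
typedef void (*inlet_fn)(parent_frame *, int result);

struct parent_frame {
    inlet_fn inlet;    /* the indirect inlet address             */
    int result;
    int synch;         /* used only on the suspended path        */
    int synchronized;  /* set when the post-suspend inlet fires  */
};

/* rtc_inlet: run-to-completion path; behaves like the code after a
 * sequential call -- store the result, no synchronization. */
static void rtc_inlet(parent_frame *p, int result) {
    p->result = result;
}

/* inlet_after_suspend: installed once the child has suspended;
 * stores the result and performs the synchronization step. */
static void inlet_after_suspend(parent_frame *p, int result) {
    p->result = result;
    if (--p->synch == 0)
        p->synchronized = 1;
}

/* suspend's side effect: redirect the child's future return. */
static void on_suspend(parent_frame *p) {
    p->synch = 1;
    p->inlet = inlet_after_suspend;
}

/* lreturn: the child returns through whatever inlet is installed. */
static void lreturn(parent_frame *p, int result) {
    p->inlet(p, result);
}
```

The child's lreturn is identical in both cases; only the contents of the slot differ, which is what lets a single instruction serve both return paths.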

Changes to Support Indirect Inlets

To support this new return mechanism, many of the operations in MAM-DF are changed to manipulate the indirect inlet addresses: lfork passes its child an indirect inlet address, send can take an indirect inlet address, lreturn uses the indirect inlet address for return, suspend changes the indirect inlet stored in the indirect inlet address, and a work-stealing request also changes the indirect inlet. lfork (see Figure ) sets the indirect

There is a unique indirect inlet address for each outstanding child. The number of indirect inlet addresses in a thread's frame is the maximum, over all joins in the thread, of the number of children that the join is synchronizing.


Figure: The lfork operation in MAM-DS.

Figure: The lreturn operation in MAM-DS when the parent and child are connected and the parent has not been resumed since it lforked the child.

inlet address to point at a code fragment immediately following the lfork. This is where a thread that runs to completion will return. send can either transfer data to an inlet specified by an address or to an inlet specified by an indirect inlet address.

The actions undertaken by lreturn vary according to the state of the child and the parent. If the child runs to completion and its parent has never had work stolen from it, lreturn mimics a sequential return (see Figure ). It reclaims the resources of the thread executing the lreturn and returns results in registers, and control, to its parent by


Figure: The suspend operation in MAM-DS.

invoking the inlet at its indirect inlet address. Thus, when lreturn returns from a child that executed to completion, the indirect inlet address acts like the return address in a sequential call. We discuss the other two cases for lreturn below.

suspend (see Figure ) resumes the parent at the continuation indicated by find(ip, suspend). The reason we cannot continue at the parent's ip, as we did in MAM-DF, is that the code immediately following the lfork is the return inlet, which cannot be executed until the child thread actually exits. In addition to resuming the parent, suspend also changes the indirect inlet address for the parent's outstanding lazy child to an inlet that receives results for children that have suspended (see Figure ). The work-stealing operations, like suspend, also change the indirect inlet address for the parent's outstanding lazy child.

Resuming the Parent

When continuation stealing is used to resume the parent, the parent and child are disconnected, and the parent continues at the point in the codeblock that follows the lfork's associated inlet (see Figure ). When the child finally returns to the parent, it cannot resume the parent, since the parent was already resumed when its continuation was stolen. Instead, lreturn runs the inlet in the indirect inlet address and frees the thread's resources, including the processor. Since the inlet ends with ireturn, we can ensure that


Figure: The lreturn operation when the parent and child have been disconnected by a continuation-stealing operation.

the processor is idled by making ireturn go to the general scheduler, represented by the address wait in Figure .

If seed activation is used to get work from the parent, then the routine at the suspend or steal code pointer will activate the nascent thread that the seed represents. After the nascent thread has been activated, the parent's ip will point to the next thread seed, not to the inlet immediately following the lfork that activated the thread. The reason that the parent's ip does not point to the activated thread's inlet is that the parent can next be resumed either by one of its outstanding children or by a work-stealing request. We can see that once a parent has activated a seed, all of its children must be started with inlets that store their results, perform synchronization, and then continue in the parent at its current ip. Thus we lose some of the benefit of lreturn for threads that were activated from seeds: though we pass the results from the child to the parent in registers, they may immediately be stored in the parent frame.

The behavior of lreturn for a thread whose parent has activated a seed ensures that when the inlet executes ireturn, the parent will continue at the proper point, which is the parent's current ip. It achieves this by placing the parent thread and ip on the parent thread's stack before executing the inlet. When the inlet executes ireturn, the parent will continue at the proper point (see Figure ).

The rules in Figure , Figure , and Figure cause a processor with an ip of wait to get work from the ready queue, from the parent queue by continuation stealing, and from the parent queue by seed activation, respectively.


Figure: The lreturn operation in MAM-DS when a thread seed in the parent has been activated.

Discussion

MAM-DS allows a potentially parallel call that runs to completion to take advantage of the sequential call and return mechanisms without limiting parallelism. It transfers control and data simultaneously on both invocation and termination.

This machine also clarifies the distinction between continuation stealing and seed activation. When continuation stealing is used to continue the parent before its child returns, the connection between the parent and the child is broken. This causes the child to use the two-step return process of an independent thread: lreturn behaves like a send followed by an exit.

When seed activation is used to continue the parent, it becomes a slave to its children. In other words, each of its returning children causes it to activate a thread seed until no more remain. When no more thread seeds remain, subsequent returning children fail at the join until the last child returns. When the last child returns, it continues the parent after the join.


The Lazy Multithreaded Abstract Machine (LMAM)

With the lazy multithreaded abstract machine (LMAM) we complete our goal of a multithreaded abstract machine that can execute a potentially parallel call with nearly the efficiency of a sequential call. In LMAM, an lfork starts a new lazy thread on the same stack as its parent thread. Combined with the direct scheduling of threads on call and return introduced in MAM-DS, this allows us to transfer arguments to the child in registers, allocate the child's frame on the stack of the parent, and, if the thread runs to completion, return the results in registers. In this section we introduce the two disconnection methods and the four storage models.

If a thread needs to run concurrently with its parent, because the child executes a suspend or the parent is stolen, then an independent thread is created for the child. This involves disconnecting the control between the parent and child and disconnecting the stack (i.e., creating a new stack) so the child or parent can continue. The disconnection operation depends on the underlying storage model used to implement the cactus stack.

There are four storage models that we consider: linked frames, multiple stacks, stacklets, and spaghetti stacks. We divide the storage models into two broad classes: linked storage models and stack storage models. In the linked storage models (linked frames and spaghetti stacks), links between frames are explicit: every thread frame has an explicit link to its parent. In these models disconnection is easy because, in some sense, the threads already have their own stacks. In the stack storage models (multiple stacks and stacklets), the links between the frames can be implicit or explicit. The implicit links are between lazy threads and their parents; they are implicit since the parent of a lazy thread can be found by doing arithmetic on a stack pointer. The explicit links are between the stacks of the independent threads.

By fixing the storage model for threads, LMAM also dictates how the indirect inlet address is implemented. The indirect inlet address is stored in the same slot of the parent's activation frame used for a sequential call's return address. The link between a child thread and its parent is the pointer from the child to the indirect inlet address in the parent that the child will use to return to the parent. In other words, the return register is the link from a child to a parent. If this link is implicit, then the return register is never stored explicitly.


Figure: The lfork operation in LMAM with multiple stacks.

We thus have two operational semantics for LMAM. The first describes LMAM implemented with linked frames or spaghetti stacks; since the cactus stack is maintained with links, this semantics mirrors that of MAM-DS. The second describes LMAM implemented with multiple stacks or stacklets.

Disconnection and the Storage Model

When a lazy thread is elevated to an independent thread, it must be disconnected from its parent. There are two disconnection methods: eager and lazy. Eager-disconnect allows the parent to invoke children on its abstract stack in exactly the same manner as it did before it had a disconnected child. Lazy-disconnect leaves the current abstract stack untouched and forces the parent to allocate new children differently than it did before it had a disconnected child.

In order to understand the difference between lazy and eager-disconnect, consider two of the actions taken by an lfork in LMAM (see Figure ). First, it saves the indirect inlet address in the child's return register. This acts as the link between the child thread and its parent. Second, it creates an activation frame for the child. Because the lfork operation always stores the indirect inlet address in the same slot of the parent, we will have to copy the disconnected child's indirect inlet address to a new slot. This allows the parent to invoke its children in the same manner, independent of whether it has a disconnected


Figure: Eager-disconnect used to disconnect a child from its parent when a linked storage model is used. In this case only the indirect inlet address needs to be copied.

Figure: An example of eager-disconnect when the child is copied to a new stack.

child. Thus there must be a separate slot in the activation frame for each dfork in a fork-set. The child's return register also has to be updated. We call this method eager-disconnect (see Figure ). Of course, in the stack storage models, eager-disconnect also has to copy the child's activation frame onto another stack.

If we leave the child's indirect inlet address in its original place, then any future child invoked by the parent has to link to another slot in the parent's frame for its indirect inlet address. Lazy-disconnect does not copy the indirect inlet address. It also does not copy the child's activation frame, even in the stack storage models.

Eager-disconnect

For all storage models, eager-disconnect copies the child's indirect inlet address to another slot in the parent's frame. In addition, in the stack models it must split and


Figure: An example of eager-disconnect when the parent portion of the stack is copied to a new stack.

Figure: The suspend operation under LMAM-MS and LMAM-S, using child copy for eager-disconnect.

copy the stack. Eager-disconnect by child or parent copying causes the stack to be split and a portion of it to be copied to a newly allocated stack at the time of disconnection. Child-copying copies the portion of the stack used by the child to a newly allocated stack (see Figure ). Parent-copying copies all but the child's frame to a newly allocated stack (see Figure ). In either case, since we copy the local data to a new location, we must either forbid pointers to data in a frame, or scan memory and update any pointers to the region being copied. Eager-disconnect is currently employed by most lazy thread systems.

We can also copy only some of the parent's ancestors and leave behind frames which forward data to the new location.


The suspend operation in Figure is an example of child-copying that splits the stack. In addition to changing the state of the child and parent, updating the indirect inlet address, and scheduling the parent, it allocates a new stack and copies the data for the child thread to the new stack.
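Child-copying can be sketched in C as a split of the shared stack. This is a simplified, hypothetical model (an upward-growing byte stack with illustrative names); as noted above, a real implementation must also fix up any pointers into the moved frame, which the sketch omits.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of child-copying eager-disconnect: the bytes
 * belonging to the child's frame are copied from the shared stack
 * onto a freshly allocated stack, and the shared stack's top is
 * rolled back so the parent can keep growing in place. */
typedef struct {
    char *base;   /* lowest address of the stack's storage  */
    char *top;    /* first unused byte (stack grows upward) */
} astack;

static astack astack_new(size_t bytes) {
    astack s;
    s.base = malloc(bytes);
    s.top = s.base;
    return s;
}

/* Split off the child's frame (the topmost child_bytes) to a new stack. */
static astack disconnect_child(astack *shared, size_t child_bytes) {
    astack child = astack_new(child_bytes);
    shared->top -= child_bytes;               /* child's frame starts here */
    memcpy(child.base, shared->top, child_bytes);
    child.top = child.base + child_bytes;
    return child;                             /* parent keeps `shared` */
}
```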

The advantage of eager-disconnect is that the price is paid only once, at the time of the disconnect operation. The disadvantage is that in the stack storage models this one-time cost can be quite high and, even more importantly, pointers to data in a frame cannot be passed to other threads.

Lazy-disconnect

Lazy-disconnect, a new method introduced in this thesis, logically disconnects the parent and child without requiring any copying at the time of disconnection. It allows us to use pointers to data in a frame without requiring that we scan memory to update them at disconnect time. Lazy-disconnect is the analog of lazy invocation. Just as a lazy thread call does not eagerly create a new thread, lazy-disconnect does not eagerly separate a child and parent, instead causing the parent to create a new stack for its future children in the same fork-set.

The lazy method does not perform any stack operations in any of the storage models when disconnection occurs, but instead incurs the cost of allocating a new stack for every fork or call made by the parent after it has been disconnected from its child. In other words, lazy-disconnect causes the memory model to devolve into the linked-frame model after a parent has been disconnected from its child. When the logical tree of stacks is cactus-like, having the parent create new stacks for each future child is not onerous, as the children are able to invoke their children using the more efficient stack- or stacklet-based calls.

In short, lazy-disconnect performs disconnection without copying either the child or the parent. Instead, the child steals the stack, causing all future allocation requests in the parent to be fulfilled by allocating a new stack (see Figure ). To support this method of disconnection we change the definition of the stack to be a pair consisting of top and the former stack, ⟨top, stack⟩; top points to the highest location used in the stack itself.


Figure: An example of how a parent and its child are disconnected by linking new frames to the parent.

Figure: The suspend operation for LMAM-MS and LMAM-S using lazy-disconnect.

When top equals sp, lfork works as it does in LMAM-DS. After disconnection, top does not equal sp (see Figure ). In this case lfork must allocate a new stack for the child thread (see Figure ).
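The ⟨top, stack⟩ pair and the resulting allocation rule can be sketched in C. The sketch is hypothetical (names and sizes are illustrative): while top still equals the caller's sp, a child frame is a cheap bump allocation on the same stack; once they differ, the stack has been stolen and the parent's next child gets a fresh stack.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch of lazy-disconnect's allocation rule.  A stack
 * is the pair <top, data>; top records the highest location in use. */
typedef struct lstack {
    char  *data;
    size_t top;          /* highest location used in the stack */
} lstack;

static lstack *lstack_new(size_t bytes) {
    lstack *s = malloc(sizeof *s);
    s->data = malloc(bytes);
    s->top = 0;
    return s;
}

/* Allocate a child frame of `bytes`; *stk may be replaced by a
 * freshly allocated stack if the current one has been stolen.
 * (The fixed 4096-byte size of the fresh stack is illustrative.) */
static char *alloc_child(lstack **stk, size_t sp, size_t bytes) {
    if ((*stk)->top == sp) {          /* still connected: extend in place */
        char *frame = (*stk)->data + sp;
        (*stk)->top = sp + bytes;
        return frame;
    }
    /* disconnected: future children go on a fresh stack */
    lstack *fresh = lstack_new(4096);
    fresh->top = bytes;
    *stk = fresh;
    return fresh->data;
}
```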

Summary

This chapter culminates with the description of a lazy multithreading abstract machine which eliminates all of the unnecessary overhead found in MAM when potentially parallel calls run to completion.

There are no enqueue, dequeue, or scheduler operations when lfork is used to create a lazy thread.


Figure: The lfork operation under LMAM-MS using lazy-disconnect, after a thread has been disconnected from a child.

- Processor registers are used to transfer arguments to the child and results back to the parent.

- Unnecessary register saves and restores are eliminated, since the thread is never scheduled by the general scheduler except when it has previously suspended.

- Lazy threads do not require their own stack.

- No synchronization operations are performed between the parent and its children when the children are scheduled sequentially.

Chapter

Storage Models

This chapter describes the different concrete implementations for the storage portion of the abstract machine presented in the last chapter. We investigate four storage models for a global cactus stack: linked frames, multiple stacks, spaghetti stacks, and stacklets. Although the storage model and the control model interact, we delay a detailed discussion of this interaction until the next chapter. We explain how each model maintains the logical global cactus stack, how it allocates, deallocates, and disconnects frames, and what tradeoffs it introduces. There are four kinds of calls (sequential, lazy fork, fork, and remote fork), each of which maps onto a different kind of allocation request in each storage model.

For each storage model we discuss four tradeoffs: allocation efficiency, internal fragmentation, external fragmentation, and ease of implementation. Of these, the most important is the cost of allocating and deallocating an activation frame for the four different types of calls. Internal fragmentation, caused by leaving unused space within the basic allocation unit, is also important. Less important is external fragmentation, which results from allocating activation frames in different regions of the memory space. Finally, implementation details make certain storage-model variations impractical on some machines.

The models fall into two broad categories: linked storage models and stack storage models. The linked storage models include the linked frames and spaghetti stack models, which represent the parent-child relationship between lazy threads with explicit links. The stack storage models include the multiple stacks and stacklets models, which represent the parent-child relationship between lazy threads implicitly. Another view is that these four models span a spectrum of possible implementations, with linked frames on one end


and multiple stacks on the other. In between, spaghetti stacks and stacklets offer different compromises, with spaghetti stacks being closer to linked frames and stacklets closer to multiple stacks.

Storing a Thread's Internal State

In this section we discuss storage requirements for a thread's state variables: its ip, indirect inlet addresses for its disconnected children, and its portion of the ready queue.

In the previous chapter we made an artificial distinction between the thread's stack and its state (i.e., its ip, sp, parent pointer, and return register). In fact, when a thread is not running, it stores the thread state in its stack. A lazy thread also has state, although it may not have its own stack. It does, however, always have a base frame on some stack. It is in this base frame that we store the lazy thread's state.

Since all threads, lazy or independent, have a unique base frame, we use the address of the base frame to identify a thread. We sometimes find it convenient simply to refer to a thread by its frame.

In all of the memory models presented, each activation frame includes a field to store the thread's ip. Recall that a thread's ip is the address of the instruction at which it will begin operation when it is next scheduled, either by the scheduler or by an outstanding child. The return register is made up of the parent pointer and the indirect inlet address in the parent. For lazy threads and sequential calls, this is the return address of the child, which is the same as the inlet address.

As discussed in Section , there must be a sufficient number of slots in the activation frame to hold the indirect inlet addresses of children that may be disconnected. In eager-disconnect, the slots are used to store the indirect inlet addresses for disconnected children. In lazy-disconnect, the slots are used to store the indirect inlet addresses of children invoked after disconnection.

In the concrete implementation of LMAM we distribute the ready queue among the independent threads. Since the ready queue can include each independent thread at most once, it can be implemented by allocating a fixed number of frame slots per independent thread. We implement the ready queue as p linked lists, where p is the number of processors in the system. Thus we allocate one slot per thread activation frame and one ready-queue

Section discusses the rationale for splitting the return register between the parent and the child.


Figure: A cactus stack using linked frames. Each frame has an explicit link to its parent.

pointer per processor. Since lazy threads can become independent, we also reserve a slot in each lazy thread activation frame.

To summarize every mo del requires that an ip an indirect inlet address and a

slot for the ready queue b e stored in each activation frame For reasons given in Chapter

the return register is not directly realized The parent p ointer is included in the activation

frame only in the linked frames and spaghetti stacks mo dels The sp is not stored in the

threads state in concrete implementations of any of the mo dels

Remote Fork

Remote fork, used only on distributed-memory machines, is handled by sending a message to the remote processor, which performs the allocation request and deposits the arguments into the newly allocated frame. Since local fork has exactly the same storage requirements as remote fork, we discuss only local fork in this chapter.

Linked Frames

We first discuss mapping the cactus stack onto linked frames. This approach is the simplest and, as we shall see, the least attractive. It is also the most commonly used, e.g., in Cilk, Id, and Mul-T.

The cactus stack is represented explicitly by the links between the frames, as shown in the figure above. Each activation frame, whether it is the start of a new thread or a sequential call, is allocated in a separate heap-allocated frame. The link between a child and its parent is explicitly part of the activation frame.


The linked-frames mapping imposes an expensive allocation and deallocation operation on every call and return. It produces no internal fragmentation, but it can produce external fragmentation, which can have negative effects on the processor cache and TLB, since frames related by control may be in completely different areas of the memory system.

Operations on Linked Frames

When using linked frames, all of the allocation requests (fork, lfork, and call) are treated uniformly. We allocate a child frame and link it to its parent frame. exit, lreturn, and return free the frame and return it to a pool of unused frames, which may be reused for subsequent allocation requests. Since each activation frame is physically separate from every other, disconnecting a lazy thread from its parent never requires copying.

The linked-frame mapping is simple to implement. However, it does nothing to optimize for the most frequent operations: the sequential call and the potentially parallel call that completes without suspension.
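The pool-based allocate and free paths described above can be sketched as follows. This is a minimal illustration, not the dissertation's actual implementation; the field and function names are assumed:

```c
#include <stdlib.h>

/* Sketch of a linked-frames allocator: freed frames go on a free list
   and are reused before calling malloc again. */
typedef struct frame {
    struct frame *parent;  /* explicit link forming the cactus stack */
    struct frame *next;    /* free-list link, valid only when freed  */
} frame_t;

static frame_t *free_pool = NULL;

static frame_t *alloc_frame(frame_t *parent) {
    frame_t *f;
    if (free_pool) {               /* fast path: reuse a freed frame */
        f = free_pool;
        free_pool = f->next;
    } else {
        f = malloc(sizeof *f);     /* slow path: go to the allocator */
    }
    f->parent = parent;
    return f;
}

static void free_frame(frame_t *f) {
    f->next = free_pool;           /* push back onto the pool */
    free_pool = f;
}
```

Because freed frames are pushed onto the pool, a free immediately followed by an allocation reuses the same memory, which is what makes the assumption that pool hits dominate plausible for call-intensive programs.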

Linked Frame Timings

Here we discuss the cost of the basic operations. We present instruction counts whenever possible and cycle counts for library operations. We divide the instructions into three categories: memory reference instructions, branch instructions, and register-only instructions. We contend that on modern processors the most important metric is memory reference instructions; thus we try to minimize the number of memory reference instructions. In the cases where we depend on a library call to do some of the work, we measure the cycle count to allow application to other processors. In all cases we are concerned only with the costs of storage management; control costs are covered in the next chapter. In particular, we do not include the costs of manipulating the indirect inlet address on disconnection, since it depends on the queuing method chosen.

The timings are presented in the table below. The first two lines of the table contain the cost of performing a sequential call and return in C. This is used as a comparison against which we can judge the effectiveness of the linked-frame model. The next two lines show the cost of performing a sequential call in the linked-frame model. We then show the storage costs of allocating, scheduling, and freeing a thread. The allocation cost does not include

(Footnote: In all cases we give the cost of a call with no arguments and a return with no results.)


Table: Times for the primitive operations using linked frames, broken into memory, branch, and register-only instruction counts for each of the following operations: C call, C return, sequential call, sequential return, create thread from pool, schedule thread, free thread, create lazy thread, lazy return, eager-disconnect, and create lazy thread after lazy-disconnect. These times do not include any control costs. They assume frames are allocated from a pool of frames and do not include the cost of a malloc.

the cost of performing a malloc to allocate the thread. Instead, our allocation policy is to put freed frames in a pool of frames which can be reused. We assume that requests satisfied from the pool dominate those that are not, and thus do not include the cost of doing a malloc. Finally, we show the storage costs for lazy threads.

As we see from the table, the linked-frame model imposes a significant penalty on sequential calls over that of a traditional stack model. Furthermore, sequential calls have no advantage over forks. While lfork has no special advantage over fork in terms of storage, it also has no extra cost in the case when the child is disconnected from the parent.

Like call, return is also significantly more expensive than in the traditional stack model. lreturn is more expensive than exit because it includes the cost of loading the frame pointer for the parent frame in addition to freeing the frame.

The data bears out the reasoning that the linked-frame model is not very efficient for sequential and potentially parallel calls. In addition, it creates external fragmentation. In spite of these problems it is often used because it is easy to implement.

(Footnote: This operation could be made more efficient by using a heap pointer, as suggested by Appel. However, Appel's method requires garbage collection or a reclamation method that makes the resulting cactus stack look very similar to our spaghetti stacks.)


Figure: A cactus stack using multiple stacks. There are only explicit links from the stubs at the base of a stack to the parent of the thread in the base frame of the stack.

Multiple Stacks

The next approach we discuss is that of mapping the cactus stack onto multiple stacks. Each thread is assigned its own contiguous stack. The cactus stack is maintained with explicit links between stacks and implicit links within each stack (see the figure above). The link between a lazy thread and its parent is implicit because a lazy thread is allocated on the same stack as its parent. The parent pointer is not stored, but is calculated by doing arithmetic on the stack pointer.

The main advantage of this approach is that an allocation request for call and lfork has the same cost as a sequential call in a single-threaded system. The disadvantages are that it limits the number of threads in the system (due to the large size of each stack), creates internal fragmentation, and increases the cost of creating an independent thread.

Multiple stacks have never, to our knowledge, been used in a lazy multithreaded system, but variations of the multiple-stacks approach have been used in many thread packages, including pthreads, NewThreads, and Solaris Threads, and in some languages, e.g., Cid. All of these systems fix the size of the stack, and none provides protection against stack overflows.

Stack Layout and Stubs

Each stack in the multiple-stacks approach is divided into two regions: the stub and the frame area (see the figure below). The stub contains thread-specific data and maintains the global cactus stack by linking the individual stacks to each other. The frame area contains the activation frames of children invoked by either sequential calls or lforks. Thus the


Figure: Layout of a stack with its stub. From top to bottom: free space, the frame area, and the stub area, which holds the stub handler address, a slot for saving the sp, the bottom frame's parent pointer, and the bottom frame's return address.

cactus stack is maintained both explicitly, with links from the stack stubs to the parent thread of the bottom frame in the frame area, and implicitly, between lazy threads in the frame area.

The stack stub serves to maintain the explicit links needed between threads. It also stores an ip that can be accessed by the thread above it on the stack. When the thread returns, instead of testing its own state, it does an lreturn that assumes that the thread is lazy and is still connected to its parent. lreturn invokes the stub routine pointed to by the stub's ip. This routine, using the data in the stub's frame, then performs the necessary operations to return to the thread's parent.

The stack stub maintains information common to all the activation frames on its stack. When none of the threads on the stack is running, the stub contains the sp for the stack. The sp stored in the stub is loaded into the processor's sp when any of the threads in the stack is scheduled by the general scheduler. Every activation frame can calculate the stub for the stack on which it lives. The stub also contains the parent pointer for the bottom frame in the stack. Since all other threads on the stack are descendants of the bottom frame, this is the only explicit link needed.
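If stacks are aligned, power-of-two-sized regions (as a later footnote suggests), any frame can locate its stub by masking off the low-order bits of an address within the stack. The sketch below illustrates this under those alignment assumptions; the constant and function name are illustrative:

```c
#include <stdint.h>

#define STACK_SIZE 4096  /* stack size, assumed to be a power of two */

/* Given any address inside a stack, find the stub at the base of that
   stack by clearing the low-order bits of the address. */
static inline uintptr_t stub_of(uintptr_t addr) {
    return addr & ~(uintptr_t)(STACK_SIZE - 1);
}
```

The mask compiles to one or two register-only instructions, which is why the parent pointer need not be stored in every frame.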

Stubs allow every thread, lazy or independent, to use the sequential return mechanism of lreturn even though a cactus stack is being used. The stack stub stores all the data needed for the bottom frame to return to its parent. When a new stack is allocated, the parent's return address and stack pointer are saved in the stub, and a return address to the stub handler is given to the child. When the bottom frame in a stack executes a return, it does not return to its caller. Instead, it returns to the stub handler. The stub handler

(Footnote: This can easily be accomplished by fixing the size of a stack to be a power of two and using an exclusive-or and a logical-and instruction.)


performs stack deallocation and, using the data in the stack stub, carries out the necessary actions to return control to the parent, e.g., setting the sp to the parent's stack.
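The data a stub must carry for this return path can be summarized in a small record. The layout below is a sketch with assumed field names, not the actual structure used in the implementation:

```c
#include <assert.h>

/* Sketch of a stack-stub record. The bottom frame's lreturn jumps to
   handler_ip instead of testing its own state; the handler then uses
   the remaining fields to restore the parent's context. */
typedef struct stub {
    void *handler_ip;     /* address of the stub routine to enter       */
    char *saved_sp;       /* sp for this stack when it is not running   */
    void *parent_frame;   /* parent pointer for the bottom frame        */
    void *parent_retadr;  /* the bottom frame's saved return address    */
} stub_t;
```

These four fields correspond directly to the slots shown in the stack-layout figure above.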

There are two different stub handlers in the multiple-stacks model: the local-fork stub handler, which performs the actions described in the previous paragraph, and the remote-fork stub handler. The remote-fork stub handler uses indirect active messages to return data and control to the parent's message handler, which in turn is responsible for integrating the data into the parent frame and indicating to the parent that its child has returned.

Stack Implementation

Since we do not have infinite memory in which to allocate stacks, we must decide how to simulate a virtually infinite stack. Many thread packages do this by allocating a relatively small (on the order of kilobytes) fixed stack for each thread. This works well for some programs on some data sets, but not at all for those that use a great deal of automatic data or have deeply nested routines. There are, however, many programs which, because the stack size depends upon the input data set, require large stacks.

Problems occur because these thread packages have no stack-overflow detection. Stack overflow is especially problematic in multithreaded systems. If stacks are adjacent, a thread that overflows will write to the base of another thread's stack, which causes errors to occur when the other thread is scheduled and finally returns to the frame at the base of its stack. Even if the stacks are not adjacent, the error caused by a stack overflow may not surface until later in the program's execution.

Stack-overflow protection can be added in one of three ways. First, sequential calls can check for stack overflow, which makes this model exactly like stacklets (described later in this chapter). Second, a guard page can be allocated for each stack. Third, a guard page can be created dynamically whenever a thread is scheduled. The latter two methods require that stacks be allocated on page boundaries and be multiples of the system page size. If a static guard page is allocated, then a page is wasted for each stack. With the dynamic method, the cost of switching to a new thread is raised substantially.

(Footnote: Guard pages require that the operating system allow the user to change the write protection of a page in memory, which is not supported on some machines. It is probably for this reason that thread packages do not support overflow detection.)


Figure: Allocating a sequential call or an lfork on a stack.

If there is already a copying garbage collector in the system, we can use it to avoid allocating large fixed regions. Instead, we can allocate a small stack and set a guard. When the stack overflows, we reallocate it to a larger region, copying the data and adjusting any pointers to the data in the stack as appropriate. Of course, pointer adjustment would have to happen across the entire address space.

Without garbage collection, we have three different ways to implement multiple stacks. The fixed scheme allocates a small fixed stack per thread. The fixed scheme with static guard pages associates a static guard page with each stack. The fixed scheme with dynamic guard pages sets the guard page on each stack when the thread associated with it is running. The first scheme can be implemented using standard memory allocation primitives like malloc. The latter schemes require the ability to manipulate the read/write permissions of virtual pages.

Operations on Multiple Stacks

In this section we discuss the implementation of the operations using multiple stacks. The multiple-stacks model exploits the fact that each thread has its own stack. Sequential call and lazy fork both allocate the child frame on the stack of the parent. As a result, when a lazy thread needs to become independent, we must perform some kind of operation to disconnect the parent and child threads.

Allocation

The allocation operation for call or lfork allocates the child frame on the same stack as the caller. The operation is carried out by the child frame by adjusting the sp (see the figure above).


Figure: Allocating a new stack when a fork is performed.

The fork operation requires that a new stack be allocated and linked to the caller's stack (see the figure above). The actual allocation requests will differ depending upon which stack scheme is used, but they are similar in all cases. If there is a free stack which has already been allocated, it is reused. Otherwise a new stack area is allocated from the system using the appropriate library call (i.e., malloc or mmap), and if a static guard page is being used, it is set.

Deallocation

When a lazy thread or sequential call exits or returns, it deallocates its frame area by adjusting the sp. When a thread exits, it returns its stack to the free pool of stacks.

Scheduling

Unlike the linked-frame representation, the multiple-stacks model requires work to be performed whenever a thread is scheduled. In particular, the dynamic guard page scheme requires that the page above the stack be set so that if an overflow occurs it will cause a fault. When a thread suspends, exits, or yields, the guard page is unprotected. In addition, in all the schemes the sp has to be set from the stub data.

Disconnection

Since a lazy thread is allocated on the same stack as its parent, if the lazy thread suspends it must be converted into an independent thread. This process requires that the parent and child each have their own logical stack. We now discuss the two options for disconnection, lazy and eager, identified earlier.




The lazy method avoids copying and allows pointers to automatic variables by forcing all subsequent allocations by the parent onto new stacks. This means that subsequent sequential calls and lazy forks will always allocate a new stack for their frames. Since a suspending child is left unchanged, the ip that a parent presents to its future children must reside in a separate slot in the frame.

Eager-disconnect is significantly harder to implement than lazy-disconnect. When we copy an activation frame, we must be careful not to invalidate any pointers to the copied frames. Even if we disallow pointers to automatic variables within a frame, we must correctly maintain the pointers that comprise the cactus stack and the ready queue. Although we can update parent pointers that are on the local processor relatively inexpensively, updating remote parent pointers is costly and can lead to race conditions which are difficult and expensive to alleviate. To ensure that frames with remote children are not copied, we force any frame with a remote child to live at the bottom of a stack, creating a new stack if necessary, and enforce the invariant that the bottom frame in a stack is never copied.

This invariant implies that when a lazy thread suspends or yields, it must be disconnected from its parent by copying itself and its descendants to a new stack. This child-copying may require updating local pointers, but never remote pointers. Since child-copying is used whenever a lazy thread suspends or yields, we also have the invariant that whenever a child suspends it is at the top of the stack.

Child-copying creates a new stack for the children copied and a stub to link the new child stack to the parent stack. The resulting configuration is exactly the same as if the bottom child of those copied had been eagerly forked.

When a child suspends and its parent is not on the parent queue, the suspend transfers control to an ancestor of the child, forcing many frames to be copied at once. The pointers to each of these frames will have to be updated. The first figure below shows the effects of two suspend operations. In the first, the parent is on the parent queue and one frame is copied. In the second, the parent is not on the parent queue, forcing multiple copies.

The implementation of eager-disconnect differs depending upon the control model. When seed activation is used to steal work, a thread having its seed stolen is copied to a new stack to ensure that it is the bottom frame. The second figure below shows the effect of activating a seed from a lazy thread.

(Footnote: An automatic variable is one local in scope to a function, not to be confused with memory local to a processor.)


Figure: Disconnection using child-copy for a suspending child. First D suspends and is copied to a new stack; then C suspends when B has no work. Because B is not on the parent queue, control is passed to A, forcing both B and C to be copied to a new stack, and C's parent pointer is updated.

Figure: Disconnection caused by seed activation also uses child-copy. Thread B has a seed stolen, causing it to be copied to a new stack to maintain the invariant that a thread with a remote child is at the bottom of a stack.


Figure: Disconnection caused by continuation stealing. B is stolen by another processor and migrates, leaving behind a forwarding frame.

Since continuation stealing results in migration, we need to ensure that a forwarding frame is left behind after the frame is copied (see the figure above). When a child of the stolen frame returns, the forwarding frame sends the results to the parent's new location. When the stolen thread completes, it sends its results back to the forwarding frame, which returns to the stolen frame's parent and then exits.

Multiple Stacks Timings

As the timings in the table below show, the multiple-stacks mapping introduces no overhead for sequential calls, lforks, or forks. However, when independent threads are needed, a significant overhead is incurred. For all schemes, the cost of creating, suspending, and freeing a thread includes the cost of the stub used to link the thread to its parent. Notice that even though stubs are used, the cost of using threads in the multiple-stacks model is about the same as that in the linked-frame model (compare the linked-frames table above).

Overflow protection introduces the largest overhead in the multiple-stacks model. The dynamic scheme is clearly unacceptable because the call to protect the page on all scheduling operations costs four orders of magnitude more than any other operation. The static guard page avoids this problem at the cost of wasting memory and increasing the cost of allocating a stack from the system. Timing malloc and mprotect is imprecise at best.


Table: Times for the primitive operations using multiple stacks, broken into memory, branch, register-only, and system instruction counts for each of the following operations: C call, C return, sequential call, sequential return, create thread from pool, suspend thread (fixed), suspend thread (dynamic), schedule thread (fixed), schedule thread (dynamic), free thread (fixed), free thread (dynamic), create lazy thread, lazy return, eager-disconnect of an n-word frame, and create lazy thread after lazy-disconnect. These times do not include any control costs. They assume stacks are allocated from a pool of stacks and exclude the cost of a malloc and mmap. "Fixed" denotes the fixed and static guard schemes; "dynamic" denotes the dynamic guard page scheme. System calls are included as cycle counts. The eager-disconnect cost is the minimum cost.

When a static guard page is used, however, it increases the cost of allocating a new stack substantially.

The other significant increase in cost imposed by the multiple-stacks model is due to eager-disconnect. Eager-disconnect causes at least n memory references, where n is the size of the frames copied. In addition, all pointers to the frames have to be updated, including the pointers in the ready queue and the links from parents to their children.

Although sequential and potentially parallel calls are efficiently implemented in the multiple-stacks model, the inefficiencies inherent in supporting overflow protection and eager-disconnect make it a less than ideal candidate for a storage model. It introduces both internal and external fragmentation, and it is non-portable.


Figure: Two examples of how the cactus stack could be represented using spaghetti stacks. In the first, the cactus stack is local to a single processor. In the second, it is split across processors.

Spaghetti Stacks

Spaghetti stacks are a compromise between the fork-centric linked-frame approach and the call-centric stack approach. By removing the requirement that all free space in a stack be above the current frame, the cactus stack can be implemented in a single stack. This means that no storage-related disconnection operations are needed when a lazy thread becomes independent.

The use of a single interwoven stack for multiple environments was first introduced by Bobrow and Wegbreit. It has since been used in a more simplified form in Olden and in implementing functional languages. In all cases it has been accompanied by a garbage collector to perform compaction.

In a spaghetti stack, every allocation request is fulfilled by allocating space at the top of the stack. A frame in the middle of the stack can exit, however, and create free space in the middle of the stack. The spaghetti-stack model thus requires a method to reclaim this fragmented freed space. It also requires that the global cactus stack be maintained explicitly, with links between every child and its parent (see the figure above).

Unlike the linked-frame and multiple-stacks approaches, in which each frame or stack is associated with a thread, spaghetti stacks are assigned to processors in the system, so that each processor has its own spaghetti stack. The global cactus stack is maintained with global pointers between child frames on one processor and their parent frames on remote processors.


Figure: Allocation of a new frame on a spaghetti stack.

The main advantage of spaghetti stacks over multiple stacks is that no storage-related disconnection operation is needed. In addition, allocation requests are relatively inexpensive compared to the linked-frame approach. The disadvantages are that it can be expensive to reclaim freed space and that internal fragmentation can consume a lot of memory.

Spaghetti Stack Layout

In addition to the stack pointer and frame pointer used in a single-threaded system, a spaghetti stack requires a top pointer (top), which points to the first free word above the last word used in the stack, i.e., the location to which sp would point in a traditional stack.

The cactus stack is maintained by links in each frame on the stack. A link from parent to child, even for sequentially called children, is required because the child frame may not be adjacent to its parent (see the figure above).

Spaghetti Stack Operations

Allocation

The allocation operation is identical for call, lfork, and fork. The child allocates space for its frame by adding to top (see the figure above). It saves the current sp in the base of its frame and then sets the sp to the top of its frame and the fp to the base of its frame.

(Footnote: Remote fork allocates the remote stub in the same manner and then performs a local fork to allocate the thread.)
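The allocation step just described can be sketched as a pointer-bumping routine. This is a simplified model that assumes an upward-growing stack and uses illustrative field names, not the compiler's actual code:

```c
#include <stddef.h>

/* Sketch of a per-processor spaghetti stack. */
typedef struct {
    char *top;  /* first free word above the last word used */
    char *sp;   /* current stack pointer                    */
    char *fp;   /* current frame pointer                    */
} spaghetti_t;

/* Allocate a child frame of `bytes` bytes at the top of the stack.
   The caller's sp is saved in the base of the new frame, forming the
   explicit parent link the spaghetti stack requires. */
static char *spaghetti_alloc(spaghetti_t *s, size_t bytes) {
    char *frame = s->top;
    s->top += bytes;
    *(char **)frame = s->sp;  /* save caller's sp in the frame base */
    s->sp = s->top;           /* sp now points above the new frame  */
    s->fp = frame;            /* fp points at the base of the frame */
    return frame;
}
```

Because each frame saves the caller's sp in its base, the parent-to-child link survives even when the two frames are later separated by free space.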


Figure: Deallocating a child that has run to completion.

Figure: Deallocating a child that has suspended but is currently at the top of the spaghetti stack.

Deallocation

When a child returns or exits, it attempts to deallocate its frame. There are three possibilities: the child has run to completion; the child has suspended and is at the top of the stack; or the child has suspended and is not at the top of the stack.

If the child has run to completion, then it must be at the top of the spaghetti stack. In this case it deallocates its frame by setting top to the base of its frame (see the first figure above).

If the child previously suspended but is currently at the top of the stack, it checks the area below it to see if it is active or free. If it is active, then the child sets top to the base of its frame. If the area below has been marked as free, then it sets top to the base of the free area (see the second figure above).

If the child is not at the top of the stack, then it can mark its frame as free, but it cannot actually reclaim the space, since there are active frames above it in the stack. In

(Footnote: This is only true if work distribution is performed by work stealing. If remote fork is used to create threads on other processors of a distributed-memory machine, then we must consider additional cases.)


Figure: Deallocating a child that has suspended and is in the middle of the stack. The child coalesces any free space below it into its frame.

this case the child marks its frame as free and checks to see if the frame below it is free. If the frame below is free, then both frames are merged into one free area. Later, when the frame above it exits, it will deallocate itself and the current frame, including any of the frames with which it was merged, as described in the previous paragraph (see the figure above).

Whenever a frame that is not at the top of the stack attempts to deallocate itself, it must have a suspended frame above it. This invariant lets us use PCRCs, as described earlier, to eliminate any reclamation overhead for frames that run to completion. In other words, if a frame runs to completion, it need not perform any checks but can blindly set top to the base of its frame.
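The middle-of-stack case can be sketched with a small frame header. This is an illustrative model with assumed field names, not the actual runtime structure:

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed per-frame header for the sketch. */
typedef struct frame {
    struct frame *below;  /* frame or free area directly below, or NULL */
    bool          free;   /* marked free but not yet reclaimed          */
} frame_t;

/* A frame in the middle of the stack exits: mark it free and, if the
   area below is already free, merge the two into one free area.
   Returns the base of the resulting free area. */
static frame_t *mark_and_coalesce(frame_t *f) {
    f->free = true;
    if (f->below != NULL && f->below->free)
        return f->below;  /* merged with the free area below */
    return f;
}
```

A frame that ran to completion skips all of this: by the invariant above, it is at the top of the stack and simply resets top.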

Spaghetti Stack Timings

As the table below shows, spaghetti stacks are a good compromise between linked frames and multiple stacks. In all cases they are less expensive than linked frames, and all thread operations are cheaper on spaghetti stacks than on multiple stacks.

A sequential call or lfork on a spaghetti stack incurs the cost of saving a link to its parent, but it avoids allocation from a pool, which is expensive. It is only slightly more expensive than a sequential call on a stack.

The thread operations shown here include the cost of using a stub for independent threads and for lazy threads invoked after either eager- or lazy-disconnect. Stubs allow us to reclaim the memory in the spaghetti stack without having to perform a test on every return.


Table: Times for the primitive operations using spaghetti stacks, broken into memory, branch, and register-only instruction counts for each of the following operations: C call, C return, sequential call, sequential return, create thread, schedule thread, free thread, create lazy thread, lazy return, eager-disconnect, and create lazy thread after disconnect. These times do not include any control costs.


Figure: A cactus stack using stacklets.

Stacklets

Stacklets, introduced in our earlier work, effect a different compromise between the linked-frame and stack approaches. Whereas spaghetti stacks incur extra cost when a function returns, to handle reclamation of free space, stacklets maintain a stack invariant but incur extra cost on function call, to check for overflow.

A stacklet is a region of contiguous memory on a single processor that can store several activation frames. Unlike spaghetti stacks, each stacklet is managed like a sequential stack. As in the stack model, the global cactus stack is maintained implicitly between frames in a single stacklet and explicitly with links between stacklets (see the figure above).

Stacklet Layout

Each stacklet is divided into two regions: the stub and the frame area. The stub contains data that maintains the global cactus stack by linking the individual stacklets to each other. The frame area contains the activation frames (see the figure below).

Stacklet Operations

Stacklets exploit the fact that each independent thread begins at the base of a new stacklet, while lazy threads are allocated on the same stacklet as their parent. Thus lazy thread forks and sequential calls both perform a sequential allocation. A sequential allocation is one that requests space on the same stack as the caller. The child performs the allocation and determines whether the new frame fits in the current stacklet. If it does, the sp is updated appropriately. If not, a new stacklet is allocated and the child frame is allocated on the new stacklet. (See the figures below.)


Figure: The basic form of a stacklet. The frame area holds the running frame and free space; the stub holds the address used to enter the stub routine, storage for the sp, storage for the parent pointer, and storage for the parent's return address.

Figure: The result of a sequential call which does not overflow the stacklet.

Figure: The result of a fork or a sequential call which overflows the stacklet.

An eager fork also causes the child to be allocated on a new stacklet. We can either run the child in the new stacklet immediately or schedule the child for later execution. In the former case, sp points to the child stacklet. In the latter case, the sp remains unchanged after the allocation.


Figure: A remote fork leaves the current stacklet unchanged and allocates a new stacklet on another processor.

For a remote fork there are no stacklet operations on the local processor. Instead, a message is sent to a remote processor with the child routine's address and arguments (see Figure).

The overhead of checking for stacklet overflow in a sequential call is three register-based instructions and a branch, which will usually be successfully predicted. If the stacklet overflows, a new stacklet is allocated. This cost is amortized over the many invocations that will run in the stacklet.
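The allocation path described above can be sketched in C. This is a minimal illustration, not the dissertation's code: the stacklet size, the field names, and the use of malloc in place of a stacklet pool are all assumptions made for clarity.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stacklet layout: a stub followed by a frame area. */
#define STACKLET_SIZE 4096

typedef struct stacklet {
    struct stacklet *parent;    /* stub: link in the global cactus stack */
    char *sp;                   /* stub: next free byte in the frame area */
    char frames[STACKLET_SIZE]; /* frame area */
} stacklet_t;

stacklet_t *new_stacklet(stacklet_t *parent) {
    stacklet_t *s = malloc(sizeof *s);  /* a real runtime would use a pool */
    s->parent = parent;
    s->sp = s->frames;
    return s;
}

/* Sequential allocation: the child asks for space on the caller's
   stacklet; only on overflow is a new stacklet allocated.  The common
   case is one comparison and a usually well-predicted branch. */
char *alloc_frame(stacklet_t **cur, size_t size) {
    stacklet_t *s = *cur;
    if (s->sp + size <= s->frames + STACKLET_SIZE) {  /* fits: common case */
        char *frame = s->sp;
        s->sp += size;
        return frame;
    }
    *cur = new_stacklet(s);       /* overflow: rare, cost amortized */
    return alloc_frame(cur, size);
}
```

A run of small sequential calls stays on one stacklet; only the call that overflows pays for a new one.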

Stacklet Stubs

As with multiple stacks, stacklet stubs are used to link the individual stacklets that form the global cactus stack. The stub handler performs stacklet deallocation and, using the data in the stacklet stub, carries out the necessary actions to return control to the parent.
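The stub handler's job can be sketched as follows. The field names and the function-pointer stand-in for a saved return address are assumptions for illustration; the dissertation's stub also saves the parent's sp.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stub contents; not the dissertation's actual layout. */
typedef struct stacklet {
    struct stacklet *parent_stacklet;  /* stacklet holding the parent frame */
    void (*parent_retadr)(void);       /* parent's saved return address */
} stacklet_t;

static stacklet_t *current;   /* this processor's current stacklet */
static int resumed_parent;

static void parent_code(void) { resumed_parent = 1; }

/* The stub handler runs when the bottom frame of a stacklet returns:
   it deallocates the stacklet and, using the data saved in the stub,
   returns control to the parent. */
void stub_handler(void) {
    stacklet_t *dead = current;
    void (*ret)(void) = dead->parent_retadr;
    current = dead->parent_stacklet;
    free(dead);
    ret();   /* transfer control back to the parent */
}
```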

Compilation

To reduce the cost of frame allocation even further, we construct a call graph which enables us to determine, for all non-recursive function calls, whether an overflow check is needed. Each function has two entry points: one that checks for stacklet overflow and another that does not. If the compiler can determine that no check is needed, it uses the latter entry point. This analysis may insert code to perform a stacklet allocation to guarantee that future children will not need to perform any overflow checks.


Table: Instruction counts (memory references, branches, and register instructions) for the primitive operations using stacklets: C call; C return; sequential call; sequential call with overflow; sequential return; create thread from pool; schedule thread; free thread; create lazy thread; create lazy thread, no check; lazy return; eager-disconnect of an n-word frame; and create lazy thread after lazy-disconnect. These times do not include any control costs. They assume stacklets are allocated from a pool of stacks and exclude the cost of a malloc.

Timings

Stacklets, like spaghetti stacks, offer a compromise between linked frames and multiple stacks. As seen in the timings in Table, sequential call and lfork behave similarly and require no memory references. However, when a stacklet overflows or a new thread is created, we must create a new stacklet from the pool, which costs slightly more than creating a new linked frame. Like multiple stacks, eager-disconnect is expensive.

We reject the use of guard pages to detect stacklet overflow because of the cost, in cycles, of taking an overflow fault. The use of guard pages reduces the cost of sequential call and lfork by one branch and four register instructions. Thus, for the guard-page scheme to pay off, an overflow would have to be correspondingly rare.

Discussion

The main disadvantage of stacklets is that if a frame is at the top of a stacklet, each of its children must be allocated on a new stacklet. This boundary case is similar to the overflow case when using register windows. We can reduce the probability that this occurs


Table: Instruction counts (memory references, branches, and register instructions) for the primitive operations in the mixed model: C call; C return; sequential call; sequential return; create thread (stacklet); schedule thread; free thread (stacklet); create lazy thread (stacklet); create lazy thread (stacklet, no check); lazy return (stacklet); create lazy thread after lazy-disconnect; create thread or lazy thread (heap); free thread (heap); and lazy return (heap). Times are for the primitive operations using a sequential stack for purely sequential calls, stacklets for potentially parallel and suspendable sequential calls with moderately sized frames, and the heap for the rest. These times do not include any control costs. They assume stacklets and heap-allocated frames are allocated from a pool and exclude the cost of a malloc.

by increasing the size of each stacklet. Additionally, the compiler optimization mentioned in Section eliminates many of the places where this boundary case can occur.

Mixed Storage Model

In all of the memory models except the multiple stacks model, sequential call and lfork cost more than a sequential call on a stack. For this reason we propose a fifth alternative, the mixed storage model, which combines a stack with stacklets and linked frames in a single system. In such a system, activation frames can be stored on the stack, a stacklet, or the heap, depending upon the type of call and the requirements of the child. The compiler chooses the least expensive storage mechanism for each activation frame. The least versatile and least expensive method is to use a stack, which is available only for purely sequential calls, i.e., for calls that can never suspend. The most expensive method is to store each frame in a heap-allocated memory segment. A heap-allocated frame store requires a link from child to parent, which is implicit when using a stack.


Between these two extremes, we use stacklets for frames that are allocated by forks. A stacklet combines the efficiency of a stack with the flexibility of heap-allocated frames for those frames that are, or may become, independent threads of control. This method allows us to assign to each thread its own logically unbounded stack without having to manage multiple stacks or require garbage collection. Our compilation strategy allows us to use a sequential stack for sequential activation frames, stacklets for small to moderately sized thread frames, and the heap for large thread frames.
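The compiler's choice among the three mechanisms can be summarized by a small classification routine. This is a sketch only: the size cutoff and the names are assumptions, and the real decision is made statically at compile time rather than by a runtime function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical compile-time classification for the mixed storage model. */
typedef enum { ON_STACK, ON_STACKLET, ON_HEAP } storage_t;

#define LARGE_FRAME 2048   /* assumed cutoff for "large" thread frames */

/* can_suspend: may this call ever suspend (i.e., is it potentially
   parallel)?  frame_size: bytes of activation-frame storage needed. */
storage_t choose_storage(int can_suspend, size_t frame_size) {
    if (!can_suspend)
        return ON_STACK;        /* purely sequential: cheapest */
    if (frame_size <= LARGE_FRAME)
        return ON_STACKLET;     /* small/moderate thread frame */
    return ON_HEAP;             /* large thread frame: most flexible */
}
```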

The Memory System

Here we specify the possible memory hierarchies of the concrete implementation of the abstract machine. While we can implement LMAM on either a shared memory machine (SMM) or a distributed memory machine (DMM), either an MPP or a NOW, the underlying memory system will significantly affect the implementation and cost of the different mechanisms described here. There are three main areas of impact: locks, thread migration, and split-phase operations.

Many of the operations described have to be performed atomically. For instance, lreturn must remove the parent from the parent queue and return control to the parent atomically. If it does not, a steal operation may steal the parent thread from the parent queue at the same time. On a DMM using Active Messages, the steal operation is only initiated when the network is polled for a message. Since we control when the network is polled, we avoid the explicit manipulation of locks. On an SMM, one of three approaches must be used: the parent queue must be locked on every lfork and lreturn, multiple queues must be maintained, or active messages for work-stealing requests must be simulated. With all but the last technique, SMM implementations will introduce significant overhead.

On an SMM it is easy to migrate a thread, while on a DMM the migration of threads requires sending all of the data in the thread's frame across the network. Since local pointers are not valid on another processor, we require either that all pointers be global or that no pointers be placed in the frame. When seed activation is used to represent nascent threads, there is an alternative, for it is easy to migrate a nascent thread. A nascent thread has no data associated with it except its arguments. Thus, when a nascent thread is activated, it can be activated either locally or remotely.

(Footnote: The only synchronization point is the buffer used to store the steal requests. However, when a steal request is inserted into the buffer, the processor inserting the message is by definition idle.)

Finally, on a DMM, split-phase memory operations are often used to hide the network latency. A split-phase read operation consists of two messages: a request and a response. The request must specify the address of the remote location and the address of the local destination. The response then stores the value into the local destination. If we allow child- or parent-copying disconnection, then the local destination may change between the request and the response if the destination is in a frame that is copied due to disconnection. Thus, on a DMM, neither child- nor parent-copying can be used unless we forbid thread operations during split-phase operations.
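The two split-phase messages can be sketched as plain structs. The message layouts and handler names are assumptions standing in for an Active-Message-style transport; the point is that the request carries the local destination address, which must remain valid until the response arrives.

```c
#include <assert.h>
#include <stdint.h>

/* Request: names a remote source and a local destination. */
typedef struct {
    uint64_t *remote_addr;  /* address on the remote processor */
    uint64_t *local_dest;   /* where the response should store the value */
} read_request_t;

/* Response: carries the value back along with the destination. */
typedef struct {
    uint64_t *local_dest;
    uint64_t  value;
} read_response_t;

/* Remote side: turn a request into a response. */
read_response_t handle_request(read_request_t req) {
    read_response_t resp = { req.local_dest, *req.remote_addr };
    return resp;
}

/* Local side: the response handler completes the read.  If the frame
   holding local_dest were copied by disconnection in the meantime, this
   store would hit a stale location, which is why child- and
   parent-copying are forbidden during split-phase operations. */
void handle_response(read_response_t resp) {
    *resp.local_dest = resp.value;
}
```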

In the remaining chapters we will focus on distributed memory multiprocessors. For this reason we will look only at systems that do not allow thread migration and that perform lazy-disconnect. We will also generally skirt the issue of atomicity, since we can easily guarantee it by polling the network, controlling when handlers are permitted to execute.

Summary

In this chapter we have investigated five storage models for implementing thread state. The multiple stacks model is the most expensive for thread operations and the least expensive for lazy thread and sequential operations. However, since we do not use garbage collection, it does not provide an infinite logical stack per thread. For this reason we will not study it further.

Of the remaining four models, linked frames are the most expensive for sequential operations and moderately effective for thread operations. The mixed model, stacklets, and spaghetti stacks strike a compromise between sequential and thread operations.

In all cases, the timings we have presented depend heavily on integrating thread operations directly into the compiler. By integrating knowledge of the storage model into the compiler, we are able to construct a function prolog and epilog that is tailored to the specific model being used. This is carried even further for the mixed model, where the size of the function's activation frame determines whether it is heap-allocated, stack-allocated, or stacklet-allocated. Furthermore, we discussed how the underlying memory model of the machine affects how atomicity is achieved.

(Footnote: If an argument is a pointer, it has to be a global pointer.)


In all cases we have ignored the effects that the different thread representations and queuing methods impose on the storage model. We consider these effects in the next chapter.

Chapter

Implementing Control

In this chapter we present the control portion of a concrete implementation of the abstract machinery introduced in Chapter for the support of a lazy thread fork. We begin with the thread seed model and present the different means of invoking threads and representing nascent threads. We then explore the different ways in which nascent threads can be queued and dequeued. This leads us into a discussion of the implementation of disconnection. Before proceeding to discuss the continuation-stealing model, we show how to reduce the cost of synchronization and how to use lazy parent queues.

Representing Threads

As we saw in Chapter, the inclusion of a lazy thread fork requires that threads, or potential threads, be represented in many forms. What would have been a thread in MAM can begin life as a nascent thread, become a lazy thread, and even be transformed by disconnection into an independent thread. In this section we describe the concrete representations of nascent threads as thread seeds and closures, and the implementation of an lfork as a parallel ready sequential call.

Sequential Call

Although a sequential call does not start a new thread, it is the basis upon which we build all the representations for lazy threads. In this chapter we are concerned mainly with the transfer of control between parent and child. For a sequential call, the child begins executing by saving its return address. Instead of storing the return address in the child frame, the child stores the return address at a fixed offset from the top of the parent stack frame. The key feature we exploit is that the return address is in the same relative location in every parent frame.

The child returns to the parent by doing an indirect jump through the location that contains the return address. We exploit the flexibility of the indirect jump to implement lfork as a sequential call.
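The convention can be sketched in C with a function pointer standing in for the saved return address; the struct layout and names are assumptions for illustration, since the real mechanism operates on machine addresses.

```c
#include <assert.h>

/* The child stores its return address at a fixed slot in the PARENT's
   frame and returns by an indirect jump through that slot. */
typedef struct frame frame_t;
struct frame {
    void (*retadr)(frame_t *parent);  /* same relative slot in every frame */
    int x;                            /* some parent state */
};

/* Code executed when the child returns (the "return address"). */
static void after_call(frame_t *f) { f->x = 99; }

void child(frame_t *parent, void (*ret)(frame_t *)) {
    parent->retadr = ret;   /* child saves its return address in the parent */
    /* ... child body runs here ... */
    parent->retadr(parent); /* indirect jump back to the parent */
}
```

Because the return is an indirect jump through a slot the parent owns, the parent can later redirect it, which is exactly what lfork exploits.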

The Parallel Ready Sequential Call

The parallel ready sequential call implements the abstract lfork operation. It starts its child thread, a lazy thread, with a sequential call. The child stores its return address, just as a sequentially called child does, in the parent's frame. The child returns to the parent using the sequential return instruction.

To support disconnection, every return address has associated with it the addresses of two seed routines. These routines, findadr_susp and findadr_steal (see Section), are used to handle child suspension and the receipt of a work-stealing request, respectively. There are many alternative implementation strategies for associating the seed routines with the return address: inserting a jump table or addresses between the call and the normal return, defining fixed offsets, or a central hash table, to name a few. In all cases, given a return address, the seed routines should be easily computed, and if they are not needed they should not interfere with sequential execution.

For flexibility and ease of implementation, we insert jump instructions to the seed routines between the call instruction and the code that is executed when the child returns. Figure shows a sample instruction sequence for a parallel ready sequential call. Every thread codeblock begins with an adjustment to the return address, just as in the Sparc. This adjusted return address points to the code immediately following the seed routines associated with the lfork that invoked it (see the prolog code in Figure).

The decision between implementations of the find functions should be based on the cost of adjusting the return address, the cost of finding the seed routines, whether branch prediction is used on return, and the behavior of the cache. An ideal implementation allows a seed routine to be found and executed quickly without interfering with sequential execution.

(Footnote: Leaf functions are not required to store their return address, but for the purposes of this discussion we assume they do.)


A sample lfork instruction sequence:

        ...
        call child              // lfork child
        jmp steal-fragment
        jmp suspend-fragment
    ret:                        // return here
        ...

A sample function prolog/epilog:

        retadr = retadr + 2           // function prolog
        save retadr in a frame slot
        ...                           // codeblock for child
        load retadr from frame slot   // function epilog
        jmp retadr

Figure: Pseudocode for a parallel ready sequential call and its associated return entry points.

As the sequential case occurs more frequently than finding a seed, an implementation should favor non-interference over finding the seed.

Inserting a jump table between the call and return requires the return address to be adjusted on every call and allows the seed routines to be found by doing address arithmetic. This is a particularly good method on architectures like the Sparc, which allow the return address to be adjusted as part of the return instruction, making the adjustment free. However, if the processor predicts return branches based on the address of the call instruction, then adjusting the return pointer causes a misprediction on return, increasing the cost of a sequential return. The inserted table also makes the sequential code larger, possibly leading to an increased instruction cache miss rate.

An alternate method to implement find is for the compiler/linker to construct a hash table that maps return addresses to the appropriate seed routines. This method produces no interference with sequential calls: the return address does not need to be manipulated and the code size is not increased. It does, however, increase the cost of finding seed routines.

A more complicated approach is to place the seed routines at a large fixed offset from the return address. This method does not interfere with the sequential call, and the seed routines can be found simply by using address arithmetic. However, the compiler/linker must interleave unrelated code and seed routines in order not to waste code space.

Seeds

A thread seed represents a nascent thread. It is a pair of pointers: a frame pointer and a code pointer. The addresses of the suspend and steal routines can be derived from the code pointer.

A seed is activated by loading the frame pointer and executing one of the routines associated with the code pointer. The suspend routine creates a new lazy thread. The steal routine starts a new independent thread and then exits. Both routines update the parent thread state to reflect the seed's activation.
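The seed pair and its two activation paths can be sketched as follows. In the real implementation the suspend and steal routines are derived from the code pointer by address arithmetic; modeling that derivation with a struct of function pointers is an assumption made for clarity, as are all names here.

```c
#include <assert.h>

/* Stand-in for "the routines derivable from the code pointer". */
typedef struct {
    void (*suspend_routine)(void *frame); /* creates a new lazy thread */
    void (*steal_routine)(void *frame);   /* starts an independent thread */
} seed_code_t;

/* A thread seed: a frame pointer and a code pointer. */
typedef struct {
    void        *frame;  /* parent frame holding the seed's arguments */
    seed_code_t *code;
} thread_seed_t;

static int last_activation;  /* demonstration only */
static void demo_suspend(void *frame) { (void)frame; last_activation = 1; }
static void demo_steal(void *frame)   { (void)frame; last_activation = 2; }
static seed_code_t demo_code = { demo_suspend, demo_steal };

/* Activation: load the frame pointer, run one of the two routines.
   Both would also update the parent thread's state. */
void activate_on_suspend(thread_seed_t s) { s.code->suspend_routine(s.frame); }
void activate_on_steal(thread_seed_t s)   { s.code->steal_routine(s.frame); }
```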

The arguments used to start the nascent thread, if any, are stored in the parent thread frame, i.e., the frame pointed to by the frame pointer in the seed. These arguments cannot be modified between enqueuing and activating the seed. If the program can modify the arguments, a closure is used instead of a seed to represent the thread.

Closures

We use a closure when the arguments to the fork of a nascent thread may change in the parent between the time the fork is represented as a nascent thread and the time that the nascent thread is instantiated (see Figure). It consists of two pointers: a code pointer and an environment pointer. The code pointer, like the code pointer in a seed, is used to derive the suspend and steal routines, which activate the closure.

The environment is a heap-allocated region of memory that contains the arguments and a pointer to the frame that created the closure. The mutable arguments are placed in the environment. Any arguments guaranteed not to change are left in the frame.

When a closure is activated, the activation routine uses the frame pointer in the environment to change the state of the parent thread. The environment, and possibly the frame, are used as the source of the arguments to the new thread. The closure is freed when the new thread is created.
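The snapshot behavior that motivates closures can be sketched directly: the mutable argument is copied into a heap-allocated environment, so later writes in the parent no longer affect the eventual fork. All names and the single-argument environment are illustrative assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/* Environment: the mutable argument plus a link to the creating frame. */
typedef struct {
    void *parent_frame;  /* frame that created the closure */
    int   a;             /* snapshot of the mutable argument */
} env_t;

/* Closure: a code pointer (modeled as a function pointer) and an
   environment pointer. */
typedef struct {
    void (*activate)(env_t *env);
    env_t *env;
} closure_t;

static int forked_with;   /* demonstration only */
static void demo_activate(env_t *e) { forked_with = e->a; }

closure_t make_closure(void *parent_frame, int a,
                       void (*activate)(env_t *)) {
    env_t *env = malloc(sizeof *env);
    env->parent_frame = parent_frame;
    env->a = a;          /* later writes to 'a' in the parent are invisible */
    closure_t c = { activate, env };
    return c;
}

void activate_closure(closure_t c) {
    c.activate(c.env);   /* start the thread from the snapshot */
    free(c.env);         /* closure is freed once the thread is created */
}
```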

Summary

In the seed model we have two ways to represent nascent threads, seeds and closures, and one way to execute a lazy thread, the parallel ready sequential call. It is no coincidence that the return address of a parallel ready sequential call is a code pointer that could be used for a seed. In fact, we have also arranged for the code pointer to be stored in the parent's frame. The result is that the child has a pointer to a location which stores the code pointer. In other words, the child has a thread seed to the remaining work in the parent.

In the next section we discuss the ways in which work (threads, thread seeds, and closures) is queued, found, and then executed in the system. Since closures are queued and dequeued in the same manner as thread seeds, we discuss only thread seeds. We show how thread seeds and parallel ready sequential calls interact and affect the parent queue and its representations.

The Ready and Parent Queues

The abstract machine has two queues of threads that are ready to run: the ready queue and the parent queue. The former contains threads in the ready state, which are scheduled by the general scheduler. The latter contains threads that have passed control to a child, i.e., threads in the has-child state.

The Ready Queue

In order to bound the amount of memory that the ready queue consumes, we distribute it among the activation frames. Any data structure that allows us to do this would be sufficient, and we choose to represent the ready queue as a linked list exhibiting LIFO behavior. Thus the ready queue is maintained for each processor by a global register which points to the first ready frame. Each frame in turn points to the next frame in the ready queue.
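The distributed ready queue amounts to a per-processor intrusive LIFO list. This is a minimal sketch under assumed names; in the real system the head lives in a reserved machine register and the link field is a slot in the activation frame.

```c
#include <assert.h>
#include <stddef.h>

/* Each frame carries the link used while it sits on the ready queue,
   so the queue consumes no storage beyond the frames themselves. */
typedef struct frame frame_t;
struct frame {
    frame_t *next_ready;  /* valid only while on the ready queue */
    int id;
};

static frame_t *ready_head;  /* the per-processor "global register" */

void push_ready(frame_t *f) {   /* LIFO enqueue */
    f->next_ready = ready_head;
    ready_head = f;
}

frame_t *pop_ready(void) {      /* schedule the most recently readied */
    frame_t *f = ready_head;
    if (f) ready_head = f->next_ready;
    return f;
}
```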


Our implementation does not depend on the data structure chosen to represent the ready queue. This is important because the data structure can have a significant effect on execution time and storage requirements. The effect of the data structure for the ready queue is orthogonal to the goals of this thesis.

The Parent Queue

Unlike the ready queue, which is involved with independent threads only (i.e., threads that have already suspended and need to be scheduled), the parent queue holds nascent threads. More precisely, lfork requires that nascent threads in the parent be represented by enqueuing the parent thread on the parent queue. This means that we have to pay more attention to the cost of enqueuing and dequeuing, or we lose the benefit of lazy threads.

A thread seed can be enqueued on either an implicit queue (see Section) or an explicit queue (see Section). When a thread seed is implicitly enqueued we call it an implicit seed, and when it is explicitly enqueued we call it an explicit seed.

Whether implicit or explicit, the parent queue is implemented here as a stack. Thus, when an item is enqueued on the parent queue, it is placed at the top of the parent queue. Furthermore, dequeue removes the most recently enqueued item.

The Implicit Queue and Implicit Seeds

The parallel ready sequential call uses the call mechanism itself to implicitly enqueue the parent's seed as it transfers control to the child. The seed is enqueued in the parent frame by the act of calling the child. When an implicit queue is used, we establish the invariant that a child always returns control to its parent, even if it suspends. This allows us to use the return address produced by the parallel ready sequential call as the parent's thread seed, embedding the parent queue in the cactus stack (see Figure).

The implicit queue is thus woven throughout the activation frames and managed implicitly through the call and return operations, eliminating the cost of managing the queue whenever the child runs to completion. We call a return address enqueued by a parallel ready sequential call an implicit seed. When a fork that is translated into a parallel ready sequential call is followed by another fork, the implicit seed created by the first of


Figure: An example of a cactus stack split across two processors (Processor A and Processor B) with the implicit parent queue embedded in it. The threads on the parent queue are guaranteed to be ancestors of the currently running thread. (The figure's legend distinguishes the running thread on a processor; a thread in the ready or idle state; a thread on the parent queue with a thread seed; and a thread in the has-child state with no work.)

For the code

    fork X();
    fork Y();
    join;

the first fork creates an implicit thread seed, while the last fork creates an implicit seed that is used only to transfer control to the parent (a "no work" seed).

Figure: Example of when a fork generates a thread seed and when it does not.



these two forks is a thread seed (see Figure). If the code following the parallel ready sequential call does not contain a fork, then the implicit seed is used only to transfer control from the frame in which it resides to the frame's parent.

The return address of the child thread stored in the parent's frame is an implicit seed that can be activated by either a suspending child or a work-stealing request. On suspension, a child returns control to its parent not at the return address but at the seed routine for suspension, i.e., findip_suspend. This removes the parent from the implicit parent queue and starts the suspend routine, which activates the seed and starts the next thread. If a work-stealing request occurs, it finds the implicit seed by searching the stack for a thread seed. Only the frames from the running thread to the root of the cactus stack need to be searched. All the other threads must be either idle or ready, which means they cannot have any nascent threads. When it finds an implicit seed that is a thread seed, it invokes the steal routine found in the seed's frame. Since an implicit seed is stored in the frame that will invoke the nascent child thread, the seed itself is represented by a single word.

(Footnote: The parallel call does not have to immediately follow the parallel ready sequential call. It only must precede a join.)

There are two drawbacks to an implicit queue, both of which derive from the fact that it contains only seeds created by parallel ready sequential calls. First, an implicit seed may be created even when there is no nascent thread in the parent. For example, the last lfork before a join creates an implicit seed that indicates that the parent has no threads available. In this case, the suspend routine causes the parent to suspend, recursively invoking its parent's suspend routine. Likewise, the steal routine for such an implicit seed invokes its parent's steal routine in search of a nascent thread to activate. The entire path from the currently executing child to the root may have to be scanned to verify that there are no nascent threads to invoke. The other disadvantage is that implicit seeds can only be created for nascent threads that follow a parallel ready sequential call. This means that parallel loops must be turned into loops of function calls and that the sequential code following a lone fork must be encapsulated in a function call. Despite these two disadvantages, implicit seeds are useful because they cost nothing to enqueue, a side effect of invoking the lazy thread that precedes them.

The Explicit Queue and Explicit Seeds

An explicit queue is a data structure, separate from the cactus stack, that contains thread seeds. A thread seed on such a queue is called an explicit seed. An explicit queue solves the two problems of implicit queues by enqueuing the parent thread on an explicit parent queue. When an explicit parent queue exists, a suspending thread or a work-stealing request may immediately find work by removing a seed from the parent queue.

Of course, explicit seeds may be enqueued on the parent queue at any time, not just when a parallel ready sequential call is made. This makes them useful not only for avoiding the stack walking needed by implicit seeds, but also for handling parallel loops and the case of a lone fork followed by sequential code. In order to handle parallel loops, we enqueue a seed before the loop starts which points to a routine that will fork off an iteration of the loop. After the seed routine creates the new thread, it replaces the seed in the queue. When the loop is finished, it removes the seed. The lone fork is represented as a


The code associated with an explicit seed in a thread frame:

        ...
        goto steal-fragment
        goto suspend-fragment
    explSeed:               // normal return continues here
        ...
    steal-fragment:
        sp = sp - offset    // steal routine continues here
        ...

Figure: Example of how an explicit seed is stored on the parent queue. The parent queue contains a pointer to a location (explSeed) in the thread's frame. The location contains a code pointer from which the seed routines can be derived. When invoked, the seed routines calculate the thread frame's stack pointer based on the offset, known at compile time, from the location to the frame's top.

nascent thread and enqueued by the parent before it starts executing the sequential code. If no other processor has activated the seed when the parent reaches the join, it activates the seed and then continues. In this way we never limit the parallelism in the original program, even though we are using seed activation as the underlying control model.

We implement the explicit queue as a fixed-size array that is managed by two global registers, the front and back pointers. We enqueue new seeds onto the front of the queue. If the seed is enqueued due to a parallel ready sequential call, then when the call returns it will pop the seed off the queue. If instead the child suspends, we also pop the seed off the front of the queue. On the other hand, a steal request takes a seed from the back of the queue. This allows work to be stolen from lower down in the cactus stack, and thus the work is more likely to be larger-grained.
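The front/back discipline can be sketched as a small double-ended array queue. The fixed size, index representation, and names are assumptions; the real queue is managed through two reserved registers and can be expanded if needed.

```c
#include <assert.h>
#include <stddef.h>

/* Explicit parent queue: lforks push and pop at the front; steal
   requests take from the back, so stolen work comes from lower in the
   cactus stack and tends to be larger-grained. */
#define QUEUE_SIZE 64

static void *seeds[QUEUE_SIZE];
static int front;   /* next free slot at the front */
static int back;    /* index of the oldest live seed */

void push_front(void *seed) { seeds[front++] = seed; }

void *pop_front(void) {         /* local pop: newest seed */
    return front > back ? seeds[--front] : NULL;
}

void *steal_back(void) {        /* remote steal: oldest seed */
    return front > back ? seeds[back++] : NULL;
}
```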

An explicit seed is a single pointer which points to a slot in the frame of the thread that created it. The slot in the frame contains the code pointer from which the seed routines can be derived. For example, if seeds are represented by a jump table (see Figure), then the code pointer for an explicit seed points to the instruction after the second jump (see Figure). The seed routines are invoked with the address of the seed itself, from which it is possible to calculate the base of the frame.

(Footnote: Though this appears to fix the size of the queue, it can be dynamically expanded if needed. In any case, empirical results show that a small queue, on the order of words, is sufficient for most programs.)


                      Implicit Queue          Explicit Queue

    Suspend:          goto (*retadr)-1        goto (*popSeed())-1

    Implicit Seed:    goto steal-routine
                      goto suspend-routine
                      // continue

    Explicit Seed:                            goto steal-routine
                                              goto suspend-routine
                                              popSeed()
                                              // continue

    No Work Seed:     goto (*retadr)-2
                      goto (*retadr)-1

Figure: A comparison of the different implementations of suspend depending upon the type of parent queue. Empty entries indicate that no such seed exists for the model. The units of address arithmetic are words.

Interaction Between the Queue and Suspension

The implementation of the suspend operation depends upon which representation of the parent queue we choose. If an implicit queue is used, then the child always returns to its parent, which then decides which nascent thread to activate. If only the explicit queue is used, then the child immediately transfers control to its youngest ancestor with work, i.e., to the frame pointed to by the thread seed at the front of the queue. Figure shows how suspend is implemented and what kinds of seeds are used depending upon the type of queue in the implementation.

When we have an implicit queue in the system, the child always returns to the parent, which means that sometimes the child returns to a thread that has no nascent threads. In this case, the seed left behind by the parallel ready sequential call that invoked the child is like the No Work Seed in Figure. As we can see from the figure, this seed recursively suspends the parent to its parent.

If we have only an explicit queue, then a child can never return to a parent without nascent threads. This is because such a parent would not have pushed a seed on the queue in the first place.

We cannot use both queues in the system at the same time. The reason is that a child would never know if its parent is enqueued implicitly, and thus would always have to return to its parent. Yet if the parent enqueued a seed explicitly, it would not know if it was reached via an explicit dequeue (so that its seed is no longer on the explicit queue) or from a child suspending with its seed still on the queue. While this could be fixed by setting and testing a flag, a more basic problem is that there is no explicit handle to the implicit queue. If for some reason a thread is scheduled which is not the child of the top parent on the implicit queue, then we lose the link to the implicit queue, possibly resulting in deadlock. In Section we show how a lazy parent queue can be built which combines the best features of both the explicit and implicit queues.

Disconnection and Parent-Controlled Return Continuations

Disconnection is a key component of a lazy multithreaded system. It allows a thread to start a child as a lazy thread and later, if necessary, to disconnect from it, turning the child into an independent thread. In this section we introduce the key mechanism that allows us to perform disconnection, parent-controlled return continuations, and discuss the differences between lazy-disconnect and eager-disconnect.

The address that a parallel ready sequential call gives its child acts like a sequential return address in that it serves as both the inlet address and the continuation address for the parent. When the child suspends, we need to break this bond between inlet and continuation addresses. In terms of the abstract machine, the parent must have a valid ip, its continuation, and the indirect inlet address for its child must be updated.

Three facts help us in this task. First, disconnection occurs only when the parent is resumed by suspension or by a work-stealing request. Second, the return instruction is an indirect jump, which we perform through a location in the parent frame. Third, as discussed in Section , every parallel ready sequential call has associated with it two inlets: one for normal return and one for return after disconnection.

When the parent disconnects from its child, it must change the indirect inlet address for the child. It does this by changing the return address for the child to its post-disconnection inlet. Since the parent controls the return continuation for the child, we call this mechanism parent-controlled return continuations (pcrcs). Later, when the child completes, it will run the inlet that reflects the fact that it has been disconnected from its parent. We are not done yet, since when the inlet returns we must then continue execution in the parent at its current (at the time of return) ip.
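The pcrc mechanism can be sketched concretely. Below is a minimal C simulation, not the dissertation's generated code: function pointers stand in for return addresses, and the names (frame_t, normal_ret, post_dis_ret, at_join) are illustrative.

```c
/* A sketch of parent-controlled return continuations (pcrcs).
 * A single slot in the parent frame is the child's return address:
 * before disconnection it is both inlet and continuation; on
 * disconnection the parent rewrites it and records a resume address. */
#include <stddef.h>

typedef struct frame frame_t;
typedef void (*inlet_t)(frame_t *, int);

struct frame {
    inlet_t child_ret;    /* the pcrc: inlet + continuation in one slot */
    inlet_t resume;       /* used only after disconnection              */
    int     x;            /* destination of the child's return value    */
    int     join_counter;
    int     reached_join;
};

/* Sequential case: the return address is both inlet and continuation. */
static void normal_ret(frame_t *f, int v) { f->x = v; f->reached_join = 1; }

static void at_join(frame_t *f, int v) { (void)v; f->reached_join = 1; }

/* Post-disconnection inlet: store value, synchronize, then resume. */
static void post_dis_ret(frame_t *f, int v) {
    f->x = v;
    if (--f->join_counter == 0)
        f->resume(f, 0);
}

/* The parent disconnects from its child: swap the pcrc and set a
 * resume address, the two steps described in the text. */
static void disconnect(frame_t *f) {
    f->child_ret    = post_dis_ret;
    f->resume       = at_join;
    f->join_counter = 1;
}

int pcrc_demo(void) {
    frame_t f = { normal_ret, NULL, 0, 0, 0 };
    disconnect(&f);          /* child suspended; parent rewrites pcrc  */
    f.child_ret(&f, 42);     /* child finally returns through the pcrc */
    return f.reached_join && f.x == 42;
}
```

Before disconnection the single slot child_ret serves both roles at once; disconnection splits them by rewriting the slot and recording a separate resume address.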


[Figure panels: source code for two lforks followed by a join; frame layout before disconnection; frame layout after lazy-disconnection; pseudo-code for the implementation.]

Figure: Example implementation of two lforks followed by a join. It shows the state of the thread frame before and after disconnection. It also shows the frames upon execution of the code up to the dashed arrows.

We call the ip at which we continue the resume address. The resume address is stored in a predetermined field in the parent's frame. It is used only when a parent has been disconnected from its child. Thus, when disconnection occurs, two things happen. First, the parent changes the pcrc to reflect the new inlet to be used by the child. Second, the parent stores a continuation that reflects the new state of the parent in its resume address field. The resume address now points to the code following the seed it is activating.

In other words, some action caused the parent to disconnect itself from its child. This action causes the parent to instantiate the nascent thread in the parent. When the child finally does return to the parent, we do not want it executing the lfork of the instantiated nascent thread. Instead, it begins execution at the resume address.

Figure shows an implementation of the lfork operation with its associated pcrc manipulation instructions under lazy-disconnect. If the child created by the lfork of foo runs to completion, it returns to ret, stores the return value to x, and then forks bar. Before disconnection, no synchronization operations are performed. Also, the return address is both the inlet, which stores the return value into x, and the continuation address (the ip in the abstract machine).

After disconnection (the code shown here is for the case of the child suspending), we see that the inlet used by the child has been changed to post_dis_ret, explicit synchronization is performed, and the resumeAdr field in the frame is set to cont.


[Figure panels: source code for a forkset; the sequential code stream; the suspend code stream; the steal code stream.]

Figure: Each forkset is compiled into three code streams. The code fragments shown here do not include indirect inlet address manipulation.

In other words, the return address is now simply an indirect inlet address, and the ip of the abstract semantics is now represented by the resumeAdr field.

A byproduct of lazy-disconnect is that, after disconnection, the parent thread executes all the remaining lforks before the next join differently than if no disconnection had occurred. This is because after disconnection any progress in the parent has to be reflected in its resume address. Thus, the code from the second fork to the matching join in a series of forks has to be duplicated.

Compilation Strategy

The compilation strategy is to create three different code streams for each forkset (see Figure ): sequential, suspend, and steal. The sequential code stream handles the case when all the parallel ready sequential calls run to completion. In this stream each lfork is implemented as a call followed by the appropriate thread seed. The join operation is, of course, not implemented, since no synchronization is needed if none of the calls is ever disconnected from the parent.

The suspend code stream is entered when one of the original lazy threads suspends. In this stream each lfork is also implemented as a call followed by a thread seed. Unlike the sequential stream, synchronization is performed explicitly, and after the lazy thread returns we find the next location to execute by jumping through the resumeAdr field. When this


stream is entered, the suspending call will set up the synchronization counter and also modify its indirect inlet address to point to an explicit inlet, as in Figure . The join is explicitly implemented in this stream.

The steal code stream is entered when a steal request arrives during execution of one of the parent's children. The steal routine is executed, which forks off the next child. This causes all future lforks to be executed in the suspend stream. Since this stream is only entered via a steal request, there is no provision for forked children to return and execute anything other than their inlet. The future children are started when the currently outstanding child finishes and jumps through the resumeAdr field.

A slight optimization involves checking the number of outstanding children before activating a seed. If all the children have completed, then execution continues back in the original sequence of code. If there are outstanding children, then the seed is activated with the post-disconnection code.
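This check can be sketched as follows. It is an illustrative fragment, with stream_t and the outstanding counter as assumed names rather than the actual generated code.

```c
/* Sketch of the seed-activation optimization described above: check
 * the number of outstanding children before activating a seed, and
 * fall back to the cheap sequential stream when all have completed. */

typedef enum { SEQ_STREAM, SUSP_STREAM } stream_t;

typedef struct {
    int outstanding;            /* children not yet returned */
} frame_t;

static stream_t activate_seed(const frame_t *f) {
    if (f->outstanding == 0)
        return SEQ_STREAM;      /* everyone returned: no synchronization */
    return SUSP_STREAM;         /* activate the seed with the
                                   post-disconnection code */
}
```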

Lazy-disconnect

Lazy-disconnect is mainly a way to avoid thread frame copying when threads are disconnected from their parents. However, it also has an effect on the control model, since the frame slot in a child's parent that holds its return address is never moved.

If the parent has been disconnected, its return address field contains the inlet for the first child from which it was disconnected. Any further children that are invoked must have their own return address fields. Thus, each seed activation is assigned a unique field in the frame to store the return address. Even though this field is never changed, we use it so that the child may return with a return instruction. This eliminates the need for a child to determine how it was invoked, making the concrete implementation of the abstract machine simpler than the rewrite rules would seem to indicate. The inlets for such children adjust the stack pointer in the same manner as an explicit seed routine (see Figure ).

Eager-disconnect

[Footnote: The children started in the duplicated code stream are invoked lazily, but their inlets always end with a jump through the resume address; thus we never have to change their return addresses.]
When eager-disconnect is used, we copy the disconnecting child's return address to a new slot so that additional children can use the regular return address. This


means that a parent has to maintain a pointer to its connected child. The parent uses this pointer to change the child's pointer to the return address in the parent slot. Figure shows the configuration before and after disconnection using eager-disconnect with the linked-frames storage model.

The result of eager-disconnect is that the return address for a thread, even one invoked in the suspend code stream, can change locations. Eager-disconnect also increases the cost of the parallel ready sequential call in the linked storage models, since we have to store a pointer from the parent to the child. The pointer is used to update the parent pointer in the child to reflect the new location of the child's return address.
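Under the linked-frames model, this bookkeeping can be sketched as follows. It is a simplified illustration with assumed structure names (parent_t, child_t); the real implementation manipulates return-address slots in activation frames rather than heap structs.

```c
/* A sketch of eager-disconnect's bookkeeping: the parent keeps a
 * pointer to its connected child; on disconnection it copies the
 * child's return-address slot to a fresh slot and updates the
 * child's pointer to that slot. */
#include <stddef.h>

typedef struct child  child_t;
typedef struct parent parent_t;

struct child {
    void **ret_slot;        /* points at the slot in the parent that
                               holds this child's return address */
};

struct parent {
    void    *ret_for_child; /* regular return-address slot            */
    void    *moved_ret;     /* new slot for the disconnected child    */
    child_t *connected;     /* the one currently connected child      */
};

static void eager_disconnect(parent_t *p) {
    /* copy the return address so the regular slot is free again */
    p->moved_ret = p->ret_for_child;
    /* the extra cost: a store through the parent-to-child pointer,
       so the child now returns through the relocated slot */
    p->connected->ret_slot = &p->moved_ret;
    p->connected = NULL;    /* no connected child any more */
}
```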

Synchronizers

In this section we describe a compiler optimization, the synchronizer transformation, which eliminates synchronization costs for forks and joins by always combining synchronization with return, even for disconnected threads. While there is no synchronization cost when all the forked children of a parent run to completion, we do incur synchronization costs once any one of them causes disconnection. We can minimize this synchronization cost by extending the use of pcrcs with judicious code duplication and by exploiting the flexibility of the indirect jump in the return instruction.

A synchronizer is an inlet which implicitly performs synchronization. The idea is to change the return address of a child to reflect the synchronization state of the parent. In this way neither extra synchronization variables nor tests are needed. The amount of code duplicated is small, since we need to copy only the inlet code fragments. Here we describe the synchronizer transformation for two forks followed by a join.

In the previous section we showed how to encode the inlet address and continuation address in the return address for the parallel ready sequential call. The return address for a child was changed at most once, by the parent, if the child suspended. Here we extend the encoding to include the synchronization state of the parent. If none of the children suspends, then each will return, and the parent will call the next child until the last child returns and the parent executes the code following the join.

If any child suspends, then not only must the inlet for that child change to reflect a new continuation address, but each subsequent fork must be supplied a return continuation that indicates that there is an outstanding child. Furthermore, when a child that previously


[Figure panels: main program text for two lforks followed by a join; the inlets which handle return and suspension; the auxiliary inlets which handle synchronization.]

Figure: An example of applying the synchronizer transformation to two forks followed by a join. Each inlet i is represented by two labels: a suspension label i.susp and a return label i.ret. In the pseudo-code we expose the operations that set a child's return address; pcrc(X) denotes the return address for the child X.

suspended returns, it must update the return continuations for each of the outstanding children to reflect the new synchronization state. In the worst case this can require O(n) stores for n forks. The synchronizer transformation also requires O(2ⁿ) synchronizers for n forks. For this reason we apply the synchronizer transformation only when n = 2.

In the case where two forks are followed by a join, the transformation is straightforward and cost-effective. As shown in Figure , the synchronizer transformation generates five inlets for the two forks: the first two (in the middle of the figure) are used when no suspension occurs, and the last three (at the right of the figure) are used when suspension does occur.

There are three cases to consider for the fork of X. First, if it returns without suspending, it returns to an inlet (inlet_X1) that starts the fork of Y. Second and third, if it has suspended, then when it returns either the second call will have been completed or it will still be executing. In the former case its inlet (inlet_X2) should continue with the code after the join. In the latter, it should suspend the parent (inlet_X3).

There are two cases to consider for the second fork, the fork of Y: either X has completed when Y returns, or X's thread is outstanding. If X has completed, then the inlet (inlet_Y1) continues with the code after the join. If X's thread is still outstanding, then the inlet (inlet_Y2) changes the pcrc for X to inlet_X2 and suspends the parent.


When no suspensions occur, inlets inlet_X1 and inlet_Y1 are executed and the runtime cost of synchronization is zero. If the first fork suspends, it starts the second fork with inlet_Y2 and sets its own inlet to inlet_X3. If X (Y) returns next, it sets the inlet for Y (X) to inlet_Y1 (inlet_X2), which allows the last thread that returns to execute the code immediately after the join without performing any tests. In this case the runtime synchronization cost is two stores, which is less than the cost of explicit synchronization. This optimization allows us to use the combined return address of a sequential call even after disconnection.
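The two-fork synchronizer can be simulated with function pointers standing in for return addresses. The inlet numbering follows the figure's five-inlet scheme, but the simulation itself is ours, not the dissertation's generated code.

```c
/* A sketch of the two-fork synchronizer: the parent encodes its
 * synchronization state entirely in the children's return addresses
 * (pcrcs), so the last child to return falls directly into the code
 * after the join with no counter test. */
#include <stddef.h>

struct parent;
typedef void (*inlet_t)(struct parent *);

struct parent {
    inlet_t pcrc_X, pcrc_Y;  /* the children's return addresses */
    int past_join;           /* set when the join is passed     */
};

/* Final inlets: the last returner continues after the join, no test. */
static void inlet_X2(struct parent *p) { p->past_join = 1; }
static void inlet_Y1(struct parent *p) { p->past_join = 1; }

/* Intermediate inlets: rewrite the *other* child's pcrc to its final
 * inlet, i.e. the "two stores" of runtime synchronization cost. */
static void inlet_X3(struct parent *p) { p->pcrc_Y = inlet_Y1; }
static void inlet_Y2(struct parent *p) { p->pcrc_X = inlet_X2; }

/* Both children suspended; install the post-suspension inlets. */
static void setup_after_suspend(struct parent *p) {
    p->pcrc_X = inlet_X3;
    p->pcrc_Y = inlet_Y2;
    p->past_join = 0;
}

/* Drive one interleaving: X returns first, or Y returns first. */
static int run(int x_first) {
    struct parent p;
    setup_after_suspend(&p);
    if (x_first) { p.pcrc_X(&p); p.pcrc_Y(&p); }
    else         { p.pcrc_Y(&p); p.pcrc_X(&p); }
    return p.past_join;
}
```

Whichever child returns first performs one store (rewriting the other child's pcrc); the second returner falls straight into the code after the join without any test.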

Lazy Parent Queue

Before leaving seed activation and examining how we implement LMAM using continuation stealing, we revisit the representations for the parent queue. We have shown that there are advantages to the explicit seed queue, but it imposes a higher cost on threads that run to completion. Here we show how pcrcs allow us to combine the advantages of an explicit queue with the less expensive implicit queue in the same system.

The basic idea is that every seed starts off as an implicit seed. Then, if a work-stealing request arrives or a child suspends to a parent with no nascent threads, the system converts the current implicit seed queue into an explicit one. The resulting explicit queue maintains the same ordering of threads as the implicit queue, i.e., the oldest ancestor's seed is at the bottom of the queue. After this conversion the implicit queue is empty, so the possibility of losing a handle to it if we transfer to another thread does not arise. Further, when a thread that was in the current implicit queue suspends, it obtains work from the explicit queue rather than traversing the tree to find work. In addition, a work-stealing request can easily obtain work from the bottom of the newly created explicit queue.

Figure shows how a seed is converted from an implicit seed into an explicit one. If a work-stealing request arrives, it checks the explicit queue. If it is empty, the system invokes the steal entry point for the seed at the top of the implicit queue, which causes the convert routine to start. The convert routine traverses the implicit queue, converting each thread's seed to its explicit version. When the root of the cactus stack or an already converted thread is encountered, the recursion unwinds and the newly created explicit seeds are stored in the order they would have been stored if explicit seeds had been used from the start. When all have returned, if the explicit queue is not empty, the steal


[Figure panels: seed representations before and after conversion, for a seed representing a nascent thread and for a seed when no nascent thread exists in the thread; the convert_routine, steal_routine, and conv_and_suspend helper routines.]

Figure: The table on the left shows how a seed is represented before and after conversion. The routines that aid in conversion are shown on the right.

entry point is invoked for the seed at the bottom of the queue. Otherwise, it reports that there is no work. Likewise, if a child suspends and its parent has no nascent threads, it invokes the conv_and_suspend routine, which converts the implicit queue into an explicit one and then jumps to the top seed on the queue.

This entire scheme hinges on the use of pcrcs. Before conversion, the return address points to the pre-conversion implicit seed. After conversion, the return address of each converted child points to the converted explicit seed.

Once the conversion routine invokes a parent that has already been converted, we terminate the stack walk. Since we never pull seeds off the middle of the implicit queue, we know that once we have reached a seed that has already been converted, we have converted the entire implicit queue. As shown in Figure , the entire implicit queue is in one branch of the cactus stack.

The lazy parent queue also allows us to plant an explicit seed. The only caveat is that if we plant an explicit seed, we must convert the implicit queue below it into an explicit one.
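The conversion walk can be sketched as a recursion over the chain of parents. This is an illustrative model with assumed names (frame_t, seed_queue_t); the real convert routine walks pcrc return addresses on the cactus stack rather than an explicit parent pointer.

```c
/* A sketch of the lazy parent queue's convert routine: walk up the
 * implicit queue (the chain of parents), recurse first so the oldest
 * ancestor's seed is pushed first, and stop at the root or at an
 * already-converted frame. */

#define MAX_SEEDS 64

typedef struct frame {
    struct frame *parent;   /* implicit-queue link (one cactus branch) */
    int converted;          /* the convert flag from the figure        */
    int seed;               /* stand-in for this frame's nascent thread */
} frame_t;

typedef struct {
    int seeds[MAX_SEEDS];
    int top;
} seed_queue_t;

static void push_seed(seed_queue_t *q, int s) { q->seeds[q->top++] = s; }

/* Recursively convert: ancestors first, so the ordering matches what
 * an explicit queue would have contained from the start. */
static void convert(frame_t *f, seed_queue_t *q) {
    if (f == 0 || f->converted)
        return;             /* root reached, or already converted */
    f->converted = 1;
    convert(f->parent, q);  /* enqueue older seeds first */
    push_seed(q, f->seed);
}
```

Because the recursion descends before pushing, the oldest ancestor's seed lands at the bottom of the explicit queue, and the converted flag makes a second conversion a no-op.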

Costs in the Seed Model

In this section we present the control-related costs of performing the basic operations (thread creation, suspension, etc.) in light of the various options presented in this chapter. Our main metric is the extra cost of an lfork above that of a traditional sequential


call, e.g., a C function call. Since a lazy thread may end up suspending, we also define a secondary cost, the disconnection cost, as the incremental cost of converting a lazy thread call into an independent one rather than creating it as an independent thread ab initio.

There are three parts to the cost of an lfork: creation, linkage, and instantiation. Creation cost is the cost of creating the nascent form of the thread. Linkage is the cost of enqueuing the representation so that it can later be found and instantiated. The instantiation cost is the cost of turning the representation into an executing thread. We also compare lfork to a sequential call and to an eager fork.

As in our discussion of the storage model, we include the costs of lfork, lreturn, suspend, disconnection, and lfork after disconnection. We break down our analysis by the parent queue representation: implicit, explicit, or lazy. We further break down the differences by the kind of disconnection (lazy or eager) used to transform lazy threads into independent ones.

We describe the costs of the operations in terms of m, the cost of a memory instruction; b, the cost of a branch; r, the cost of a register-only instruction; and d, the number of frames traversed before a thread seed is found.

The Implicit Parent Queue

When the parent queue is implicit, the creation and linkage costs of an lfork are always zero, since the seeds are created and enqueued implicitly by calling the child. The overhead for an lreturn is also zero.

The cost of suspension has two components. First, a seed must be located. Second, the parent must be disconnected from its child. Finding a seed requires d indirect jumps, each of which jumps to the suspend routine and loads the proper frame pointer, resulting in a total of d(m + b + r) operations. When a work-stealing request arrives, it too must find a seed and perform a disconnection operation, all of which takes exactly the same amount of time as suspension.

Regardless of the disconnection method used, a synchronization counter must be initialized and a return address must be stored for the child, taking m + r steps. During lazy-disconnect no other steps are performed. In eager-disconnect we also have to change the child's pointer to the return address, incurring an additional cost of m + r.
[Footnote: Loading the frame pointer is either a register or a memory operation, depending on the storage model.]
[Footnote: Recall that we are only counting the control portion of the eager-disconnect operation.]


                             Disconnection model
Operation                    lazy             eager
lfork                        0                0
lreturn                      0                0
suspend                      d(m + b + r)     d(m + b + r)
disconnection                m + r            2(m + r)
lfork after disconnect       2m + 2r + b      2m + 2r + b

Table: The cost of the basic operations using an implicit parent queue. The cost of lfork also includes the cost of running its inlet on return. m is the cost of a memory reference, b is the cost of a branch, r is the cost of a register operation, and d is the number of frames checked before work is found.

lforks after disconnection require setting a resume address and updating a synchronization counter before invoking the child with a parallel ready sequential call, incurring a cost of m + r. On return, the inlet must also update the synchronization state and then finish by jumping through the resume address, incurring a cost of m + r + b. These results are summarized in Table .

The Explicit Parent Queue

When the explicit parent queue is used, an lfork costs more than in the implicit queue because of the linkage cost. Every seed has to be enqueued, which costs m + r, assuming the head of the seed queue is kept in a register. However, suspend and work stealing can find a thread seed in O(1) time. Dequeuing a seed is a stack pop-and-throw-away operation, which takes r instructions. When a lazy thread runs to completion, it has an overhead of m + r for the push and pop operations.

The cost of disconnection remains the same as in the implicit queue model. An lfork after disconnection, however, incurs an additional cost of r for enqueuing the seed. The enqueue operation lacks a memory reference because the parent must already have its seed on the queue if it is executing an lfork after disconnection. Since the seed model requires that all future executions of the parent begin with a jump through the resume address, the seed which is already on the queue is sufficient to represent the remaining nascent threads in the parent. The enqueuing of a seed after disconnection is effectively an un-pop operation.

The costs of the basic operations using the explicit parent queue are summarized in Table .


                             Disconnection model
Operation                    lazy             eager
lfork                        m + r            m + r
lreturn                      0                0
suspend                      m + r + b        m + r + b
disconnection                m + r            2(m + r)
lfork after disconnect       2m + 3r + b      2m + 3r + b

Table: The cost of the basic operations using an explicit parent queue.

                             Disconnection model
Operation                    lazy             eager
lfork                        0                0
lreturn                      0                0
suspend before convert       d(m + b + r)     d(m + b + r)
suspend after convert        m + b + r        m + b + r
disconnection                m + r            2(m + r)
lfork after disconnect       2m + 3r + b      2m + 3r + b

Table: The cost of the basic operations using a lazy parent queue.

The Lazy Parent Queue

The lazy parent queue has the same linkage costs as the implicit queue model, but on average the same suspension costs as the explicit queue model (see Table ). When a thread suspends, it first invokes the conversion routine, which walks down the tree, causing all thread seeds which represent nascent threads to be enqueued on the explicit queue, until it reaches the root of the tree or a seed that has already been converted. Then the original thread executes the suspend by doing an explicit dequeue. The convert operation requires d(m + b + r) operations, which includes checking the convert flag, saving a new seed into the converted parent, executing the calls and returns to descend and ascend the tree, and finally saving the new seed on the queue. After disconnection, the model behaves exactly like the explicit queue model. This is because a disconnected thread has already been converted to the explicit queue.


[Figure panels: a parallel ready sequential call with its return inlet; the associated suspend_routine, which continues the parent in the duplicated code immediately after the call to the child; the associated post-disconnection inlet, which goes directly to the join instead of jumping through a resume address.]

Figure: Example implementation of a parallel ready sequential call with its associated inlets in the continuation-stealing control model.

Continuation Stealing

We now address how to implement the control of the abstract machine LMAM when using continuation stealing instead of seed activation. We use the same basic mechanisms (e.g., a parallel ready sequential call, multiple return entry points, and pcrcs) to implement the continuation-stealing model. The main difference is that the continuation-stealing model lacks representations for nascent threads. Instead, when disconnection occurs, execution continues with the remainder of the parent.

Every series of forks followed by a join is, as in the seed model, duplicated. One series handles the case when no disconnection occurs, and the other handles the forks that occur after disconnection. The only exception to this rule is when a single fork is followed by a large portion of sequential code. In this case we do not duplicate the code, but synchronize explicitly. The reason for this exception is that we see no advantage to duplicating lengthy sections of code to save only four memory operations and a single test.

Control Disconnection

A parallel ready sequential call, as in the seed activation model, is associated with three return entry points: one for stealing, one for suspension of the child, and one for normal return. If a child suspends, it invokes its parent through the suspension entry point. This causes the pcrc to be changed so that when the child returns (i.e., when it completes) it returns to its post-disconnection inlet. It then disconnects the parent from its child and continues execution in the parent.


[Figure panels: a parallel ready sequential call with its return inlet; the associated suspend_routine, which continues the parent in the duplicated code immediately after the call to the child; the associated post-disentanglement inlet, which goes directly to the join instead of jumping through a resume address; the associated steal_routine, which sets the migrated flag and ships the frame to its new processor; the recv_result inlet, to which the frame returns its results when it completes, and which places the frame on the ready queue; and the forward2parent routine, run when the frame is scheduled, which returns the results to the thread's parent.]

Figure: A parallel ready sequential call with its associated steal and suspend routines, helper inlets, and routines for handling thread migration.

Unlike the seed model, however, continuation stealing does not require us to store a resume address in the parent's frame. With continuation stealing the parent continues executing after disconnection, so the continuation address for the parent is actually found in the processor's ip. Instead of resuming the parent at a resume address, the post-disconnection inlet jumps directly to the join. If there are children outstanding, the join fails and the parent suspends, continuing execution in its parent (see Figure ). Note that even if the parent of a suspended child has no nascent thread, it begins execution as a running thread.
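The post-disconnection inlet's direct jump to the join can be sketched as follows. frame_t and its field names are illustrative, and "suspend" is modeled as setting a flag.

```c
/* A sketch of the continuation-stealing post-disconnection inlet:
 * instead of jumping through a stored resume address, it decrements
 * the join counter and falls directly into the join; if children are
 * still outstanding, the parent suspends. */

typedef struct {
    int join_counter;   /* outstanding children  */
    int results;        /* accumulated results   */
    int suspended;      /* parent had to suspend */
    int past_join;      /* join succeeded        */
} frame_t;

/* The join itself: a counter test, no indirect jump needed. */
static void join_point(frame_t *f) {
    if (f->join_counter > 0)
        f->suspended = 1;   /* join fails; suspend to the parent */
    else
        f->past_join = 1;   /* continue after the join */
}

/* post_dis_ret from the figure: store results, synchronize, then
 * go directly to the join. */
static void post_dis_ret(frame_t *f, int v) {
    f->results += v;
    f->join_counter--;
    join_point(f);
}
```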

Migration

In the seed model, once a thread has been started on a processor it remains on that processor; only nascent threads are migrated between processors. Thus a parent always knows if its children are remote and can customize its return inlet to handle remote communication. Since there are no nascent threads in the continuation-stealing model, independent threads must migrate among processors to satisfy a work-stealing request. Aside from the storage-related costs, the main effect of this is that child threads may migrate without the intervention or knowledge of the parent.


When a thread migrates to another processor, the frame that used to hold this thread remains behind. It is used to forward results from its outstanding children to the migrated thread, and to forward the result received from the migrated frame to its parent. To implement this forwarding function we use a combination of pcrcs and a flag.

When a thread is migrated to another processor, it may have outstanding children on the current processor. These children must have their results forwarded to the migrated frame on the new processor. The code fragments suspend_routine and post_dis_ret in Figure show how this is handled. suspend_routine sets the migrated flag to false, indicating that no migration has taken place. The steal_routine sets the migrated flag to true and then migrates the frame. When the child returns, it uses the post_dis_ret inlet. This inlet first checks the migrated flag. If it is true, it sends the data to the new frame. Otherwise, it behaves normally. When the thread itself completes, it sends a message with the data back to its original frame, which causes recv_results to execute. recv_results stores the results and enqueues the frame on the ready queue. When the frame is run, it returns to its parent as if it were completing normally. Thus a parent does not have to notify its children when it migrates.
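The forwarding protocol can be modeled with a flag and a mailbox standing in for the network. The names (steal_routine, post_dis_ret, mailbox) follow the figure, but the simulation itself is ours.

```c
/* A sketch of result forwarding after migration: the stay-behind frame
 * keeps a migrated flag; its post-disconnection inlet either handles a
 * child's result locally or forwards it to the frame's new processor. */

enum { HOME = 0, AWAY = 1 };

static int mailbox[2];          /* per-processor result mailbox */

typedef struct {
    int migrated;               /* set by the steal_routine    */
    int where;                  /* processor holding the frame */
    int local_result;
} frame_t;

/* steal_routine: mark the frame migrated and ship it away. */
static void steal_routine(frame_t *f) {
    f->migrated = 1;
    f->where = AWAY;
}

/* post_dis_ret: forward to the new location if migrated,
 * otherwise behave like a normal inlet. */
static void post_dis_ret(frame_t *f, int v) {
    if (f->migrated)
        mailbox[f->where] = v;  /* send results to the new location */
    else
        f->local_result = v;
}
```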

Integrating the Control and Storage Models

We have now described all the control mechanisms used to implement LMAM. In this section we relate them to the storage models presented in the previous chapter.

Linked Frames

Since the linked-frames mapping uses only explicit links, it imposes no special needs on the control model.

Stacks and Stacklets

Both stacks and stacklets have implicit and explicit links between activation frames, which require us to use stubs to construct the global cactus stack. Several stub routines are required by both control models. The two most important stubs assist in returning control and data from the child at the bottom of the frame area to its parent. When the child is local, it uses the local-thread stub, which simply finds the parent frame and


[Figure: the local-thread stub frame with its saved parentSp, parentTop, and retAdr fields, and its steal and suspend routines, each of which restores sp (and, with stacklets, top) from the stub frame and jumps through the saved retAdr to the matching entry point.]

Figure: A local-thread stub that can be used for either stacks or stacklets.

invokes the proper inlet in the parent. When the child is on a different processor from its parent, it uses the remote-thread stub, which sends the data to the parent's processor.
The local-thread stub's return address is similar to the implicit seed planted by a parent with no nascent threads in it. In other words, it returns control to the proper return entry point in the parent, depending on which return entry point the child uses to enter the stub (see Figure ).

A remote-thread stub's return address is similar to a converted implicit seed planted by a parent with no nascent threads. Since the stub has no local parent, it transfers control to the general scheduler if its child suspends. Unlike the local-thread stub, the remote-thread stub is a custom routine created by the compiler for each codeblock. This is because it must both marshal the arguments from the parent and send the results back using active messages customized for the routine in question.

Both kinds of stubs indicate that the thread at the bottom of the frame area is an
independent thread. Thus, if we are using a lazy parent queue, the conversion routine will
not proceed past the stub.

Spaghetti Stacks

A spaghetti stack presents special implementation challenges because of the need
to limit the cost of reclaiming space on the stack. Since we do not have a garbage collector,
we must reclaim the space as part of executing the program. Here we show how pcrcs can


[Figure: code fragments and a stack diagram. The fragments show (a) a parallel-ready sequential call with suspension, (b) a local fork operation, which reserves space for and fills in the stub and, upon return, does not reset top as an lfork does, (c) the function prologue and epilogue, (d) return_stub, the stub used by frames before their first suspension, and (e) suspend_stub, the stub used by frames that have suspended, which coalesces frames. The stack diagram shows free space, the child frame, the stub, the parent frame, and other frames after the child has executed a suspend operation.]

Figure: Code fragments used to implement a minimal-overhead spaghetti stack. On
the left is the invocation routine for lfork and the function prologue and epilogue. The stack
in the middle shows the result of a parent forking a child. The routines to support suspend
and subsequent forks are on the right.

lower the reclamation cost to that of a stack when children run to completion, and we show
how to keep the cost low when they suspend.

The scheduling policy of LMAM on a distributed-memory machine implies that
when a child runs to completion it will be at the top of the spaghetti stack. Furthermore,
a parallel-ready sequential call is always made from a thread that is at the top of the stack.
Thus there is no need to store a link between a child and its parent if the child is invoked
by a parallel-ready sequential call. The figure above shows code for a parallel-ready sequential
call that takes advantage of this invariant. For children that run to completion, the only
additional overhead is the use of a temporary register, psp, which is needed by the coalesce
routines to calculate the size of the frame. The size of a frame is sp - psp at the time of
return.

When a thread suspends, top must be set, since sp no longer points to the top of
the spaghetti stack. This setting of top occurs in the parent's suspend routine, since we
know, because of pcrcs, that the suspend routine is executed only on the first suspension
of a child. Once a child has suspended, it can no longer reclaim the space for its frame on
exit. However, since the function epilogue does not touch top, it will not reclaim the space


in any case. Instead, reclamation is handled either by the post-disconnection inlet for the
child or upon the return of the frame above it, which will have a stub since it must have
been started with a fork.

A child invoked by a fork needs a link to its parent. This is provided in a stub
which is created by the parent as part of the fork operation (see the routine at the right of
the figure above). The stub for such a routine is initially set to return_stub, which reclaims the
space for the child upon exit. If, however, the child suspends, it changes the stub routine
to suspend_stub, which coalesces the frames and frees them if they are at the top of the
stack.

In short, by relying on the invariants created by our representation of the abstract
parent queue and the implementation of the lfork, fork, and suspend operations, we are
able to reduce the cost of using a spaghetti stack to within one register move more than
that of a normal stack for a parallel-ready sequential call.
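The reclamation discipline described above can be condensed into a toy model (our own illustration; names and sizes are invented): sp marks the current frame boundary, top is the high-water mark protecting suspended frames, and psp is the one extra register a parallel-ready sequential call spends.

```c
#include <assert.h>

static int sp = 0;   /* current top of the spaghetti stack (in words) */
static int top = 0;  /* high-water mark protecting suspended frames */

/* A parallel-ready sequential call in miniature: record the parent's
 * sp in psp, push the child frame, and on return reclaim the frame's
 * (sp - psp) words simply by restoring sp.  If the child suspended,
 * the suspend routine raised top, so the epilogue leaves the frame's
 * space alone even though sp is restored. */
void call_child(int frame_words, int suspends)
{
    int psp = sp;          /* the one extra register move */
    sp += frame_words;     /* child frame occupies psp..sp */
    if (suspends)
        top = sp;          /* first suspension: protect the frame */
    sp = psp;              /* epilogue: pop the frame boundary */
}

int live_words(void) { return top > sp ? top : sp; }
```

A child that runs to completion costs nothing beyond the psp move; a suspended child leaves its words alive above top, to be coalesced later.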

Discussion

This chapter has introduced the mechanisms and compilation strategy needed to
implement the control necessary to realize the lazy multithreaded abstract machine
presented in Chapter . The four main mechanisms are the parallel-ready sequential call, the
pcrc, the realization of the find function, and the realization of the parent queue.
Additionally, we presented a method of eliminating synchronization costs with synchronizers.

The importance of these mechanisms, combined with the compilation strategy of
creating three code streams for each forkset, is the convergence of the implementations
independent of the policy decisions (i.e., thread seeds vs. continuations, eager-disconnect
vs. lazy-disconnect, implicit vs. explicit vs. lazy queue). The three code streams, combined
with the above mechanisms, allow for the elimination of all synchronization costs whether
thread seeds or continuations are used to represent the remaining work in a parent of a
lazy thread. Furthermore, all bookkeeping operations can be avoided when lazy threads run
to completion if an implicit or lazy queue is used to queue thread seeds or continuations.
In fact, we find that the policy decisions have less impact than would appear to be the case
from previous literature, because we have used the appropriate implementation techniques.

Chapter

Programming Languages

In this chapter we describe two very different parallel programming languages for
which we have implemented compilers that use the lazy-threads model described in this
dissertation: Split-C+threads and Id. Split-C+threads, developed by the author, is an
explicitly parallel language based on Split-C, itself based on the imperative language C.
Id is an implicitly parallel functional language based on the dataflow
model of computation.

Split-C+threads

Split-C+threads is a multithreaded extension to Split-C. Before introducing the
thread extensions in Split-C+threads, we briefly describe the features of Split-C that affect
our extension and the compilation of Split-C+threads.

Split-C is a parallel extension of C intended for high-performance programming
on distributed-memory multiprocessors. Split-C, like C, has a clear cost model that allows
the program to be optimized in the presence of the communication overhead, latency, and
bandwidth inherent in distributed-memory multiprocessors. It follows an SPMD model,
starting a program on each of the processors at the same point in a common code image.

Split-C provides a global address space and allows any processor to access an object
anywhere in that space, but each processor owns a portion of the global address space. The
portion of the global address space that a processor owns is its local region. Each processor's
local region contains the processor's entire stack and a portion of the global heap.

CHAPTER PROGRAMMING LANGUAGES

Two pointer types are provided, reflecting the cost difference between local and
global accesses. Global pointers reference the entire address space, while standard pointers
reference only the portion owned by the accessing processor.

Split-C includes a split-phase assignment operator, :=. Split-phase assignment
allows computation and communication to overlap. The operator initiates the assignment
but does not wait for it to complete. The sync statement waits for all outstanding split-phase
assignments to complete. Thus the request to get a value from a location, or put a
value into a location, is separated from the completion of the operation.
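The split-phase discipline can be simulated in plain C (our own sketch; the names are illustrative and are not Split-C's runtime API): a get is initiated and recorded as outstanding, the processor keeps computing, and sync may proceed only once every transfer has landed.

```c
#include <assert.h>

/* Simulation of split-phase gets: initiation records the transfer,
 * delivery happens later, and sync checks the outstanding count. */
typedef struct { int *dst; int val; } pending_t;

static pending_t pending[16];
static int outstanding = 0;

void split_get(int *dst, const int *remote_src)
{
    /* initiate: remember the transfer, but do not deliver it yet */
    pending[outstanding].dst = dst;
    pending[outstanding].val = *remote_src;
    outstanding++;
}

void network_poll(void)        /* deliveries arrive asynchronously */
{
    while (outstanding > 0) {
        pending_t p = pending[--outstanding];
        *p.dst = p.val;
    }
}

int sync_ready(void) { return outstanding == 0; }
```

The gap between `split_get` and `network_poll` is exactly the window in which computation overlaps communication.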

The last attribute of Split-C that we are concerned with is the signaling store,
denoted by :-. A store updates a global location but does not provide any acknowledgement
of its completion to the issuing processor. Completion of a collection of such stores is
detected globally using all_store_sync, executed by all processors, or locally by store_sync.

Split-C+threads extends Split-C by allowing a programmer to move between the
SPMD model and a multithreaded programming model. When a Split-C+threads program
starts, it begins in the SPMD mode of Split-C. However, the programmer may at any time
initiate multithreading with the all_start_threads construct. When all of the threads
have terminated, the program does an implicit barrier among all the processors and again
reenters the SPMD model of Split-C.

We begin our specification of Split-C+threads with a simple example and then
proceed to describe the syntax and semantics of the new operators and how they interact
with the Split-C global address space memory model.

Example Split-C+threads Program

In this section we present an example Split-C+threads program for naively
computing the nth Fibonacci number (see the figure below). This example shows how to start a
multithreaded segment in a Split-C program, how to declare a function that can become an
independent thread, and how to invoke these threads in parallel.

The program starts execution with the invocation of splitc_threads_main,
which is the main entry point for all Split-C+threads programs. All processors
assign x the value of the first argument. Then they all reach the all_start_threads
call, which indicates the beginning of a multithreading portion of the program. There is


    #include <threads.h>

    forkable int fib(int n)
    {
        int i;
        int j;

        if (n < 2)
            return n;
        forkset {
            pcall i = fib(n-1);
            pcall j = fib(n-2);
        }
        return i+j;
    }

    int
    splitc_threads_main(int argc, char **argv)
    {
        int x;
        int r;

        x = atoi(argv[1]);
        all_start_threads(0) r = fib(x);
        on_proc(0)
            printf("fib of %d = %d\n", x, r);
        return 0;
    }

Figure: An implementation of the Fibonacci function in Split-C+threads.

an implicit barrier before and after the all_start_threads statement. The argument to
the all_start_threads statement causes only processor 0 to start executing the fib call.

Instead of executing the call in the all_start_threads statement, the rest of the
processors enter the main thread-scheduling loop, which schedules ready threads. If there
are no ready threads, then a processor attempts to steal a thread from the parent queue of another
processor. The default behavior is to attempt to get work from a random processor.

Processor 0 begins by executing a call to the function fib. The function
is given the type modifier forkable to indicate that it can become an independent thread.
If n is greater than the base case, then the function in turn creates two new logical threads with the
pcall statements. The forkset statement introduces a synchronization block which can
contain pcall statements. A pcall statement indicates that the call on the right-hand side
can be executed in parallel. Execution will not continue out of the forkset block until all


    all_start_threads    suspend
    fixed                Thread
    forkable             ThreadId
    forkset              yield
    pcall

Figure: The new keywords defined in Split-C+threads.

    stmt             ::= forkset stmt
                       | pcall pcall-expression
                       | pcall fixed pcall-expression
                       | pcall-expression
                       | fork maybe-proc-expr pcall-expression
                       | all_start_threads maybe-proc-expr pcall-expression
                       | suspend
                       | yield

    pcall-expression ::= lvalue assign-op function-name ( arg-list )
                       | function-name ( arg-list )

    maybe-proc-expr  ::= expr
                       | (empty)

    expr             ::= ThreadId

    type-qualifier   ::= forkable

    type             ::= Thread

Figure: New syntax for the threaded extensions in Split-C+threads.

the pcalls in the forkset complete. In this case there are two pcalls in the forkset. After
the forkset completes (i.e., the two threads are joined with the parent), the thread returns
a value and terminates.


    (A) forkset {
            pcall i = fib(n-1);
            pcall j = fib(n-2);
        }

    (B) forkset {
            for (i = 0; i < n; i++)
                pcall sum += sum(args);
        }

    (C) forkset {
            pcall x = one();
            if (flag) pcall two();
            baz();
            pcall z = three();
        }

Figure: Example uses of forkset.

When the original call to fib made in the all_start_threads statement completes,
all the processors are signaled that the multithreading section has completed, and a
barrier is performed, after which they all continue SPMD-style with the next
statement. Note that the value of the call is only returned on processor 0.

Thread Extensions to Split-C

The main difference between Split-C and Split-C+threads is that Split-C+threads
has an explicit parallel call, or fork, operation: pcall. Scheduling capabilities are provided
by suspend and yield statements and some library calls. In this section we describe the basic
syntax and semantics of these statements. The new keywords defined in Split-C+threads
and the new syntax introduced by it are shown in the two figures above.

Forkset statement

The forkset statement introduces a statement block that can contain any
statement, including pcalls and forks, all of which must complete before control passes out
of the block. The forkset statement can only appear in a function that has been declared
forkable (see the discussion of function types below). The statements are executed in the order in which they appear
in the program text.

The figure above shows some example uses of forkset. In part (A), two threads
created by pcall are joined with their parent before control continues past the forkset
statement. In part (B), n threads are created, one at a time, by each iteration of the
for statement. The enclosing forkset joins all of the n threads with the parent thread. In
part (C), first a thread is started to execute one, then the if statement is executed,


followed by a sequential call to baz. When baz completes, the last pcall will execute. The
forkset joins all of the threads created before continuing in the parent.

Pcall Statement

The pcall statement initiates a potentially parallel call as a lazy thread. Every
pcall must be in the scope of a forkset statement. The function called must be declared
forkable (see the discussion of function types below).

There are three variants of the pcall statement. The first
creates a lazy thread which may be instantiated locally or stolen by another processor. This
variant of pcall is a concrete representation of an lfork.

In the second variant, the keyword fixed is used to indicate that the lazy thread started
cannot be stolen by another processor. The rationale behind a pcall which creates a lazy
thread that cannot be stolen is that it lets programmers initiate threads whose arguments
are pointers to local data. Although the thread cannot be stolen, it can
still suspend or help to hide latency.

The third variant is a sequential call to a forkable function. This call, although run
sequentially, can suspend. If a thread invoked this way suspends, it also causes its
parent to suspend.

Fork statement

The fork statement creates an independent thread. The programmer may specify
the processor on which the thread is created or leave it to the runtime system to place the
thread. In all cases, the new thread is created eagerly.

fork can be used to create downward threads (by placing it in a forkset), upward
threads, or daemon threads. In the latter two cases, the programmer has to construct her
own synchronization mechanism from the Split-C+threads primitives. The programmer
must be careful to detect the termination of the child. Otherwise, all_start_threads may
complete before the child thread itself terminates (see the next section).

Start Statement

When the all_start_threads statement is invoked, an implicit
barrier is performed, and then the program begins executing in multithreading mode. Before
this statement, and after control returns to the statement following it, the program is in
SPMD mode. If a processor is specified by this statement, then only the specified processor
receives an active thread; the other processors enter the scheduling loop and attempt to
steal work. If no processor is specified, then all the processors invoke the code in the
pcall-expression.

The use of all_start_threads allows a programmer to construct mixed-mode
programs. The program can switch back and forth between SPMD and multithreaded modes of
execution. Since all_start_threads allows every processor to start a function, some affinity
between data and threads can be created by starting functions on local data.

The multithreaded portion of the program started by all_start_threads comes
to an end when the function it calls completes. Control passes to the next statement in
the program after all the processors have completed their work and executed a barrier. If
only a single processor was started (i.e., an argument was specified to all_start_threads),
then it notifies the other processors that each of them should exit its scheduling loop and
perform a barrier. If all of the processors were started, then each enters a barrier when the
function it invoked finishes.

Since the only built-in synchronization is the join mechanism in the forkset
statement, the programmer must include a check that threads started with fork complete
before returning from the function invoked by all_start_threads. In other words,
all_start_threads does not itself check that threads created by fork terminate
before it completes.

Susp end Statement

The suspend statement causes the current thread to release the
processor and become idle. The next thread run is from the parent queue, or, if no threads
are in the parent queue, it is a thread from the ready queue. Which ready thread is run on
the processor depends on the scheduling policy and the control model of the compiler. The
default scheduling policy is for the last thread enqueued on the ready queue to be the next
one scheduled. If both the parent and ready queues are empty, the processor will initiate a
work-stealing request.


Yield Statement

The yield statement causes the current thread to release the
processor and be placed on the ready queue. If there is a nascent thread on the parent queue,
it is run next. If the parent queue is empty, then a thread from the ready queue is run. If
no other threads are available, then the thread executing the yield continues to execute.

Function Typ es

In order to distinguish between functions that can form threads and those that
are guaranteed to run to completion, we introduce a new type modifier, forkable. It is only
used to qualify function types. A function that has type modifier forkable may be used in
a pcall expression.

If a function is not forkable (i.e., it is sequential), then it may not use any of the
extensions presented here except the start statement. It is a runtime error
for a sequential function that has a forkable ancestor on the stack to execute the start
statement. A forkable function may not use barrier or any expressions involving signaling
stores.

A forkable function that takes local pointers as arguments may only be invoked
by the fixed variant of pcall or by a sequential call. This ensures that a thread with
local pointers does not start on a processor other than the one on which those pointers are
valid.

Manipulating Threads

The programmer can also construct her own synchronization primitives. ThreadId
always evaluates to a pointer to the current thread; it has type Thread. The function
enq_thread(x) will enqueue thread x on the ready queue. The current implementation of
Split-C+threads does not allow the programmer to declare any variables of type Thread when
eager-disconnect is in use with the stacklet memory model. The reason for this limitation is
that when threads are disconnected, the activation frame of a thread may be moved, which
in turn changes the thread's identifier. In other words, ThreadId may not remain the same
over the life of the thread for eager-disconnect under stacklets.


SplitPhase Memory Op erations and Threads

One of the main advantages of Split-C is that it allows the programmer to hide the
latency of remote references with split-phase assignments. Such an assignment is initiated
with one operation (e.g., := or :-) and concludes with another (e.g., sync or all_store_sync).
When there is at most one thread per processor this works quite well. However, when there
is more than one thread per processor, we must examine split-phase assignment more closely.

all_store_sync must be abandoned in this case. The basis for the operation of
these signaling stores is that all the threads in the system are participating in the operation.
At the very least, it requires the ability for all the threads in the system to execute a barrier,
which allows the synchronization counters to be reset. Then all the threads in the system
perform some operations, including the store operations, and finally they all wait until all
the stores have completed.

With the split-phase assignment operators there are two possible semantics. When
the sync operation occurs and the outstanding assignments have not yet completed, the
processor can either spin or suspend. If it spins, then it will wait for the operations to complete
before proceeding in the current thread. This closely matches the Split-C semantics, and
threads will not interfere with each other. If, on the other hand, the processor suspends,
which allows the latency of the remote operation to be hidden by running another thread,
then each thread must have its own synchronization counter. If threads did not have unique
counters, then an outstanding operation from one thread could cause another to wait even
though its own operation had completed. In practice, the latency is low enough that this only
occurs if the second thread initiates a split-phase assignment that is local to the processor
on which it executes.

We choose to have a processor suspend if a sync is executed when there are
outstanding stores. Additionally, we choose to have a unique counter per thread for split-phase
operations. store_sync can still be used.
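The need for per-thread counters can be made concrete with a small sketch (our own simplification, not the runtime's data structures): with one counter per thread, a thread whose operations have all landed can pass its sync even while a sibling's remote operation is still in flight.

```c
#include <assert.h>

/* Per-thread split-phase synchronization counter: a thread may pass
 * sync only when its own outstanding count is zero, independent of
 * any other thread's in-flight operations. */
typedef struct { int outstanding; } thread_sync_t;

void initiate(thread_sync_t *t) { t->outstanding++; }
void complete(thread_sync_t *t) { t->outstanding--; }
int  can_sync(const thread_sync_t *t) { return t->outstanding == 0; }
```

With a single shared counter, the assertion for thread B below would fail: A's unfinished remote get would hold B hostage even though B's own operation had completed.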


Other Examples

Here we present two additional examples of Split-C+threads programs. The first
shows how lazy threads can affect the style of programming. The second points out a
deficiency in Split-C+threads.

Lazy Thread Style

Lazy threads reduce the overhead of the thread call but do not, in and of
themselves, eliminate parallel overhead, even if each thread runs to completion. We define parallel
overhead as the overhead introduced into an application which enables it to run in parallel.
As we shall now see, there are ways to reduce parallel overhead.

In this example we show how to implement the nqueens program without parallel
overhead. The nqueens program finds all the placements of n queens on an n-by-n chessboard.
It recursively places a queen in a row and then forks off a thread to evaluate the placements
in the remaining portion of the chessboard. The key function in a sequential version is
below.

    int
    try_row(int *path, int n, int i)
    {
        int j, sum;

        if (i == n) return 1;        /* one successful placement */
        sum = 0;
        for (j = 0; j < n; j++) {
            if (safe(path, i, j)) {  /* can we put a queen in square (i,j)? */
                path[i] = j;
                sum += try_row(path, n, i+1);
            }
        }
        return sum;
    }

This loop tries to place a queen in every row j for the column i that this thread
is responsible for. path[i] is the row number of the queen in column i; safe(path, i, j)
checks to see if a queen may be placed in square (i,j) given the previous placements in path.
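To make the sequential kernel self-contained, here is a runnable version; the body of safe is our own straightforward implementation (the dissertation does not show it), rejecting a square that shares a row or a diagonal with any earlier column.

```c
#include <assert.h>

/* Our own safe(): square (i,j) is usable iff no earlier column k has
 * its queen in the same row or on the same diagonal. */
static int safe(const int *path, int i, int j)
{
    for (int k = 0; k < i; k++)
        if (path[k] == j ||
            path[k] - k == j - i ||   /* same "/" diagonal */
            path[k] + k == j + i)     /* same "\" diagonal */
            return 0;
    return 1;
}

int try_row(int *path, int n, int i)
{
    if (i == n) return 1;             /* one successful placement */
    int sum = 0;
    for (int j = 0; j < n; j++)
        if (safe(path, i, j)) {       /* queen in square (i,j)? */
            path[i] = j;
            sum += try_row(path, n, i + 1);
        }
    return sum;
}
```

Calling try_row(path, n, 0) counts all solutions for an n-by-n board.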

This code cannot be turned into a correct parallel program simply by making the
recursive call into a pcall. If that were the only change, then different threads would
overwrite inconsistent values into path. Each thread needs its own logical copy of path.


Furthermore, we would also need to make path a global pointer so that the try_row thread
could be stolen by another processor.

The correct, but costly, approach is to make a copy of path before every pcall.
We also make path a global pointer, and, if path points to a remote processor, then we
use bulk_get to copy the data from the remote processor onto the processor running the
function. However, this is intolerably inefficient, particularly since we know that most of the
pcalls will run to completion on the processor that invoked them.

Instead, at the start of the function, we test to see if MYPROC (the processor on which
the thread is running) is the same as the parent's processor. If they are the same, no copy
is made. If they are not, we need to get the data from the parent anyway, so we make a new
copy.

    forkable int
    try_row(int *global path, int n, int i)
    {
        int j, sum;
        int *lpath;

        if (i == n) return 1;
        if (toproc(path) != MYPROC) {   /* stolen thread: copy path */
            int *newpath = (int *)malloc(n * sizeof(*path));
            bulk_get(newpath, path, i * sizeof(*path));
            lpath = newpath;
        } else {
            lpath = (int *)path;
        }
        sum = 0;
        forkset {
            for (j = 0; j < n; j++) {
                if (safe(lpath, i, j)) {
                    lpath[i] = j;
                    pcall sum += try_row((int *global)lpath, n, i+1);
                }
            }
        }
        return sum;
    }

We have reduced the parallel overhead by copying the shared data only when an
independent thread is created by a remote fork operation. However, this example depends on
the fact that neither try_row nor safe ever suspends. If either suspends, then an independent
thread will be started on the same processor as its parent and the copy will never be
performed. In fact, a thread can suspend when a child is stolen and does not return by the
time the join is reached: the thread with the outstanding child will suspend to its parent.
Since threads can suspend, the code above will not work unless the compiler inserts a copy
operation in the suspend routine.

The general strategy is for any disconnection to cause a copy of the local data to
be made. We call this strategy copy-on-fork. In the future we plan to explore language
extensions and compiler analysis that will make this transformation automatic, even in the
case of suspension.

I-Structures and Strands

I-structures, first introduced in Id, are write-once data structures useful in a
variety of parallel programs. They provide synchronization between the producer and
the consumer on an element-by-element basis. The two operations defined on I-structures
are ifetch and istore, which respectively get and store a value to an I-structure element. If
an ifetch attempts to get the value of an empty element, the ifetch defers until an istore
stores a value into the element.

We can implement I-structures using the suspension mechanism in Split-C+threads,
as shown in the figure below. When an ifetch attempts to read the value of an empty element, the
ifetch thread is enqueued on the deferred list of that element. Later, when an istore stores
a value into that element, it enqueues all the threads on the deferred list onto the ready
queue. ifetch must be a forkable function, since it can suspend; istore is not a forkable
function, since it cannot suspend and does not perform any pcalls. One nice thing about this
implementation is that it uses the ifetch activation frame to construct the list of deferred
elements.
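The deferred-list protocol can also be exercised as a single-threaded C simulation (our own sketch, with invented names): instead of suspending a thread, a deferred read records where the value should eventually be delivered, and istore releases every waiting reader.

```c
#include <assert.h>
#include <stddef.h>

typedef struct defer { int *dst; struct defer *next; } defer_t;

typedef struct {
    int value;
    int full;              /* nonzero when full, zero when empty */
    defer_t *deferred;     /* readers waiting for the store */
} element_t;

/* ifetch analog: deliver the value now, or queue the reader.  The
 * caller supplies the list node, mirroring how the real ifetch uses
 * its own activation frame to build the deferred list. */
void sim_ifetch(element_t *e, int *dst, defer_t *node)
{
    if (e->full) { *dst = e->value; return; }
    node->dst = dst;
    node->next = e->deferred;
    e->deferred = node;    /* defer: analogous to suspending */
}

/* istore analog: fill the element and release every deferred reader */
void sim_istore(element_t *e, int v)
{
    e->value = v;
    e->full = 1;
    for (defer_t *d = e->deferred; d != NULL; d = d->next)
        *d->dst = v;
    e->deferred = NULL;
}
```

A reader arriving after the store completes immediately; readers arriving before it are all satisfied by the single store, which is the write-once discipline in miniature.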

Although I-structures are easily implemented in Split-C+threads, their full power
cannot be exploited. For example, in the code fragment below, we would like to be able to
execute the second set of ifetches even if one of the first two suspends.

    x  = ifetch(A, i);
    y1 = ifetch(B, i);
    z1 = x + y1;
    y2 = ifetch(B, i+1);    /* want to continue here */
    z2 = x + y2;
    z3 = y1 + y2;

We can arrange for such behavior by invoking each ifetch with a pcall and putting
the entire sequence in a forkset.


    typedef struct deferlist DeferList;
    struct deferlist {              /* list of readers waiting for a store */
        Thread t;
        DeferList *next;
    };

    typedef struct {                /* the structure of an I-structure element */
        int value;
        int flag;                   /* TRUE when full, FALSE when empty */
        DeferList *deferred;
    } Element, *Istructure;

    forkable int                    /* return the value of i[index] */
    ifetch(Istructure i, int index)
    {
        DeferList d;

        if (i[index].flag) return i[index].value;
        /* the element is empty, so defer */
        d.t = ThreadId;
        d.next = i[index].deferred;
        i[index].deferred = &d;
        suspend;
        return i[index].value;
    }

    void istore(Istructure i, int index, int value)
    {
        DeferList *d;
        DeferList *dnext;

        if (i[index].flag) abort();
        i[index].value = value;
        i[index].flag = TRUE;
        for (d = i[index].deferred; d != NULL; d = dnext) {
            dnext = d->next;
            enq(d->t);
        }
    }

Figure: Implementation of I-structures in Split-C+threads using suspend. When allocated,
the elements of an I-structure are initialized to have their flags set to FALSE and the
deferred field set to NULL.


    forkset {
        pcall x  = ifetch(A, i);
        pcall y1 = ifetch(B, i);
        pcall y2 = ifetch(B, i+1);
    }
    z1 = x + y1;
    z2 = x + y2;
    z3 = y1 + y2;

Although this works for the simple example above, it requires that all three fetches
complete before we can begin computing, even though we could start computing when any
two of x, y1, or y2 were returned. If the function to compute z1, z2, or z3 were more complex,
this could be a significant penalty.

A solution to this problem, borrowed from TAM, is to allow for logically
parallel execution within a function. We extend Split-C+threads to allow a function to
have multiple strands. A strand is simply a flow of control within a thread that cannot
suspend and may depend on other strands. Each strand is identified by a unique integer.
Furthermore, a strand may have a condition specified which determines which other strands
must have completed before it is allowed to run. The syntax of the strand statement follows.

    stmt             ::= strand strand-id strand-condition stmt
    strand-id        ::= integer-constant
                       | (empty)
    strand-condition ::= when ( logical-expr )
                       | (empty)

If strands were in Split-C+threads, we could solve our ifetch problem as follows.

    strand 1 x  = ifetch(A, i);
    strand 2 y1 = ifetch(B, i);
    strand 3 y2 = ifetch(B, i+1);
    strand 4 when (1 && 2) z1 = x + y1;
    strand 5 when (1 && 3) z2 = x + y2;
    strand 6 when (2 && 3) z3 = y1 + y2;

For those familiar with TAM: a strand is slightly more powerful than a TAM thread, because it can
branch, but is otherwise the same as a TAM thread.
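Strand sequencing with when-conditions amounts to dataflow scheduling inside a function. A sketch of such a scheduler (our own illustration, not the compiler's generated code) tracks completed strands in a bitmask and enables a strand once every strand it names has finished.

```c
#include <assert.h>

/* A strand is enabled once every strand in its when-condition has
 * completed; completion is tracked as a bitmask (bit i = strand i). */
typedef struct {
    unsigned needs;          /* bitmask of strands this one waits for */
    void (*body)(void);      /* the strand's code (may be NULL here) */
    int done;
} strand_t;

unsigned run_strands(strand_t *s, int n)
{
    unsigned completed = 0;
    int progress = 1;
    while (progress) {
        progress = 0;
        for (int i = 0; i < n; i++)
            if (!s[i].done && (s[i].needs & ~completed) == 0) {
                if (s[i].body) s[i].body();
                s[i].done = 1;
                completed |= 1u << i;
                progress = 1;
            }
    }
    return completed;        /* bit i set iff strand i ran */
}
```

Because a strand cannot suspend, running each enabled strand to completion before re-scanning is sufficient; no strand is ever left half-executed.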


The Split-C+threads Compiler

Split-C+threads is compiled using a modified GCC compiler. The compiler is
based on one which was already modified in two ways: it could compile Split-C, and it
produced abstract syntax trees for the entire function instead of for a single statement,
making the compilation process much simpler.

Our compiler first analyzes the structure of the pcalls to determine what code
needs to be duplicated and what optimizations can be applied to each forkset. The
optimizations we currently perform are centered around eliminating synchronization operations.
After the AST pass is complete, RTL is generated. We added several new RTL expressions,
the three most important being fork, seed, and reload expressions. The fork expression is
used to represent the pcalls. The seed expression is used to construct the thread seeds that
come after a fork. The reload expression indicates which registers need to be loaded when
a thread is resumed. We also added new instruction templates in the machine description
files. Finally, the function prologue and epilogue code is tailored to each storage model.

The synchronization optimization is aimed at removing the necessity of incrementing
a synchronization counter before a pcall is made in the suspend or steal code streams.
During construction of the AST, it is recognized that the pcalls are consecutive, and thus
we do not need to increment a synchronization counter for each pcall in the suspend stream.
Instead, the code executed on entering the suspend code stream sets the synchronization
counter to the number of threads that remain to be completed, i.e., the number of pcalls
left to be executed plus one for the suspending pcall. If the forkset contains conditional or
loop statements, then the synchronization counter is incremented before each pcall in the
suspend stream.

The most interesting challenge in implementing the compiler was to correctly
perform the flow analysis and register allocation in the presence of forks, which can branch
to one of three addresses: return, suspend, and work stealing. Our primary goal was to
introduce no new operations in the sequential portion of the code. In order to achieve this,
we ensure that during dataflow analysis the non-sequential code streams do not influence
the sequential code stream. To guarantee that liveness analysis is correct and minimal for
the non-sequential code streams, we construct phantom flow links between basic blocks to
represent the potential flow of control.

CHAPTER PROGRAMMING LANGUAGES

[Figure: control-flow graph for the sequential and suspend codestreams of a forkable function. The source code is:

    forkable int f(int x) {
      int a, b, c;
      forkset {
        pcall a = one(x-1);
        pcall b = two(x-2);
        pcall c = three(x-3);
      }
      return a+b+c;
    }

The sequential stream runs prolog; call one(x-1); call two(x-2); call three(x-3); return a+b+c; epilog. In the suspend stream, each call has a suspend routine that replaces its seed and lforks the remaining calls (lfork two(x-2), lfork three(x-3)), and an inlet that stores the returned result (into b or c), increments the parent queue, synchronizes, and jumps indirectly through the resume address toward the join.]

Figure: The flow graph for the sequential and suspend codestreams of a forkable function compiled with an explicit parent queue, the stacklet storage model, lazy-disconnect, and thread seeds. The dashed lines indicate phantom control-flow links.


A flavor of the compilation process can be seen by examining the figure, which shows the control-flow graph for a simple function with a forkset consisting of three consecutive forks. One group of blocks shows the sequential codestream and the remaining blocks show the suspend codestream. The steal codestream is omitted so that the example fits on one page, but it follows a structure analogous to the suspend codestream.

The first thing to note is that the three calls are not, as one might expect, in the same basic block. If the calls had truly been sequential, then the control-flow links to the suspend codestream would not exist. However, since each call can potentially transfer control to the suspend codestream, a pcall always ends a basic block. This is an artifact of our desire to reuse the infrastructure of the GCC compiler as much as possible. We represent the possible control transfer out of the sequential codestream by a thread-seed RTL expression, which contains a jump table with three addresses: one for sequential return, which points to the following instruction; one for suspension, which points to the suspend routine for the call in question; and one for steal requests, which points to the steal routine.

The most important ramification of the control-flow links from the sequential stream to the other streams is that liveness analysis causes all the variables used in the other streams to be marked as live at the beginning of the function. This is because GCC does not see how the variables are set, due to the strange control flow out of a call expression. We eliminate this effect by blocking data-flow information from the other codestreams back to the sequential stream. This works for two reasons. First, we know that once another codestream is entered, the sequential stream is abandoned until the join; the dark rectangles in the figure indicate the blocking of the data-flow information. Second, we construct phantom flow links within the other streams to ensure that the minimum set of live variables required at the join point is correctly maintained.

There are two kinds of phantom flow links. The first, indicated by the dashed lines in the figure, is used to indicate the set of possible destinations of an indirect jump. The second, indicated by dotted lines, is used to indicate the set of potential destinations that can be reached during execution.

In GCC, if an indirect jump is encountered in the RTL, then every labeled instruction is considered to be a possible destination. This is unnecessarily conservative, since we know the possible destinations of each of the indirect jumps, which are always jumps through the resume address. Thus we augmented the RTL expression for jump expressions to include a set of possible destinations. This improves code quality by limiting the number of variables considered live.

The second kind of phantom link is included to increase the number of variables considered to be live. Consider how thread seeds operate. Once a function has entered the sequential or steal codestreams, the destination of a jump through a resume address (in an inlet, for example) can be any of the remaining pcalls or the join, depending on which of the outstanding pcalls have been started. In order to make sure that the register allocator correctly saves the live variables and restores them on return, we must ensure that the compiler believes that, at the time of the lfork, the next possible block to be executed can be any of the following calls or the join. For this reason we add phantom links from the lfork blocks to each of the other lforks and to the inlets of the call that created them.

The compiler takes several options that allow us to compile code for the different memory models, the different parent-queue implementations, the different work-stealing models, and the different disconnection methods. In addition, it can produce code for either sequential or parallel machines.

Id

Id is an implicitly parallel language with synchronizing data structures. For the purposes of this dissertation its three salient features are that it is non-strict, it is executed leniently, and it has synchronizing data structures. Non-strictness means that a function can be started before it has received all of its arguments. Lenient execution means that a function is evaluated in parallel with its arguments, each evaluated as much as data dependencies permit. Synchronizing data structures, like I-structures, require significant overhead. These three features combine to require a mechanism like strands.

Id was originally implemented on custom dataflow machines, and later on conventional processors by compiling to abstract machines like TAM and P-RISC. Our Id compiler translates Id into TL0, which is the instruction set for the Threaded Abstract Machine (TAM). TL0 is then translated to C, with extensions to support lazy threads with strands.

Chapter

Empirical Results

One of the goals of this thesis is to compare fairly the ideas behind different multithreaded systems. To achieve this, we evaluate the points in the design space using executables produced by the same compiler and compilation techniques.

We first examine the benefits of integrating thread primitives into the compiler, as opposed to using a thread library. Compiler integration yields at least a twenty-five-fold improvement over thread libraries. We next examine the difference between eager and lazy threads, showing that lazy multithreading is always more efficient than eager threads. The rest of the chapter compares the different points in the design space of lazy multithreading.

We begin our analysis of the design space by showing how the different models perform when running code on a single processor. Even when running on a multiprocessor this is an important case, because when the logical parallelism in the program is unnecessary, the potentially parallel calls will be executed sequentially. A successful lazy threading system should execute such code without loss of efficiency relative to a sequentially compiled program. We examine every point in the design space: the models composed from three storage models (linked frames, spaghetti stacks, and stacklets), two thread representations (continuations and thread seeds), two disconnection methods (lazy-disconnect and eager-disconnect), and three queue mechanisms (implicit, explicit, and lazy). In our comparisons we focus on how the different policy decisions on each axis interact to form a complete system.

Each axis is assessed for its performance on multithreaded code that runs serially, and then on multithreaded code that requires parallelism. We then examine the behavior of lazy threads on a distributed-memory machine, the Berkeley NOW. Our last experiment looks at how lazy threads improve the performance of Id codes. A summary of the effectiveness of these models is shown in the figure at the end of the chapter.

With the exception of the comparison between lazy threads and eager threads, all performance numbers are comparisons between lazy threads as compiled by Split-C/threads and sequential programs compiled by GCC. Thus uniprocessor results are reported as slowdowns and multiprocessor results as speedups. In the comparison between eager and lazy threads, we report speedups for lazy threads over eager threads.

Experimental Setup

To evaluate the different approaches to lazy threading, we ran several microbenchmarks compiled by the Split-C/threads compiler for both uniprocessors and multiprocessors. The microbenchmarks allow us to isolate the behavior of the different components of a lazy threading system. The uniprocessor machines are used to analyze the behavior of the individual primitives. The multiprocessor experiments are used to validate the quality of the complete system in a real parallel environment.

For the uniprocessor experiments we use Sun UltraSparc workstations running Solaris. For our parallel experiments we use the Berkeley NOW system running the GAM active message layer. The Berkeley NOW is a cluster of Sun UltraSparc workstations running Solaris, with Myricom Lanai network interface cards connected by a large Myricom network. The round-trip message time using GAM is on the order of microseconds. Since GAM does not support interrupts, the NIC must be polled to determine if a message has arrived; a poll operation when there is no message costs the equivalent of a short sequence of Sparc register-based instructions.

Compiler Integration

This work differs from previous work on multithreading in that thread operations are integrated into the compiler instead of being library calls. This improves performance, as shown in the table below, at the cost of a slight reduction in portability. The fork-join time is the time it takes for the parent to fork and join on a single thread that does nothing but return. The context-switch time is the time it takes for one thread to suspend and another to begin execution.

                    CThreads (µs)    NewThreads (µs)    Split-C/threads (µs)
    fork-join
    context switch

Table: The performance of basic thread operations made with library calls, using the CThreads and NewThreads packages, versus having them integrated into the Split-C/threads compiler. Times are in microseconds on an UltraSparc.

Performance improves mainly because every call to a library function in a package must handle the general case, whereas Split-C/threads can customize each call to the context in which it appears. For instance, a context switch in a thread with no other statements does not have any live registers at the time of suspension, so a compiler implementation does not have to save any registers. On the other hand, a library call has no knowledge of the context of a call, so it must store every register at the time of suspension. This effect is amplified on the Sparc architecture, since the entire register window set, not just the live registers, must be saved.

When multithreading functionality is integrated into the compiler, every call performs exactly the work required by the operation. This results in a performance increase of at least 25 times over library implementations of multithreading.

Eager Threading

We now compare a compiler-integrated eager thread implementation to a lazy thread implementation within Split-C/threads. We use the microbenchmark grain, shown in the figure below and first used by Mohr, Kranz, and Halstead. The grain benchmark creates 2^depth threads. The leaf threads then perform grainsize instructions. We can vary grainsize to detect when the overhead due to multithreading becomes insignificant. The smaller the grain size is, the better the system performs on fine-grained programs.

The figure shows the speedup of lazy threading relative to eager threading for grain with different grain sizes, as the percentage of leaf threads suspending increases. We see that lazy threading is always more efficient than eager threading, even for large grain sizes and even when all the threads suspend. For small grain sizes we


    #include "threads.h"

    forkable int grain(int depth, int grainsize)
    {
      int i, j;

      if (depth == 0) {
        int k;
        int sum = 0;
        for (k = 0; k < grainsize; k++)
          sum = sum + 1;
        return sum;
      }
      forkset {
        pcall i = grain(depth-1, grainsize);
        pcall j = grain(depth-1, grainsize);
      }
      return i+j;
    }

Figure: The grain microbenchmark.

[Figure: "Performance of Lazy Threading Versus Eager Threading." Speedup of lazy over eager threading (1.0 to 3.5) plotted against the percentage of leaves suspending (0% to 100%), with one curve per grain size: 14, 608, 1208, 2408, 4808, and 9608 Sparc instructions.]

Figure: Improvement shown by lazy multithreading (heap storage model, thread-seed activation, lazy-disconnect, and explicit queue) over eager multithreading for the microbenchmark grain, for different grain sizes (in Sparc instructions) and different percentages of leaf threads suspending and resuming after all leaves have been created.


[Figure: two panels, "Memory Usage" and "Allocation Requests," plotting memory used (bytes, log scale) and allocation requests for eager and lazy multithreading, together with the eager/lazy ratio, against the percentage of leaves suspending (0.001% to 100%).]

Figure: Memory required, and allocation requests not satisfied by the pool manager, to store thread state for eager and lazy multithreading, as a function of the number of leaves suspending.

see the largest improvements, while for large grain sizes the improvements are modest. Even when the lazy threads are disconnected from their parents and have to run independently, as is the case when they suspend, we still see improvements, with the size of the gain depending upon the grain size. In no case is eager multithreading faster than lazy threading. The dips in speedup are an artifact of how the benchmark chooses threads to suspend: at one measured point every parent of a leaf thread suspends, because one of its children always suspends, and at another every grandparent of a leaf thread suspends.

In this and in all the benchmarks, the ready queue is actually a LIFO stack, so the first thread put on the ready queue is the last scheduled. For this benchmark, a leaf suspends after placing itself on the ready queue. Thus the suspended leaves are not scheduled until the last leaf thread is created.

When parallelism is not needed, and thus none of the leaves suspend, the lazy threading system uses half the memory that the eager system uses, since the children are not created until they are run (see the figure). However, as soon as the children begin to suspend, we find, as expected, that the memory requirements are essentially the same in the lazy and the eager systems when heap-allocated frames are used.

When this benchmark is run on a single processor, the lazy system never uses more memory than the eager system. However, there are cases when the eager system uses


    Grain Size    Register Windows    Flat    Slowdown

Table: Comparison of grain compiled by GCC for register windows and for a flat register model. Grain size is in Sparc instructions; times are in microseconds.

less memory than the lazy system, even though both systems schedule threads depth-first. Since the eager threading system always schedules children from the ready queue, suspended frames that are enabled by a remote processor are scheduled before all the lazy children are created. In this benchmark this would result in the enabled children finishing before all the leaves were created, thus reclaiming the memory they used. In all cases the lazy threading system creates all the leaves before any of the enabled leaves on the ready queue are run. Thus, under certain conditions, some programs can consume less memory under an eager threading system.

Comparison to Sequential Code

Perhaps the most important measure of success for any lazy threading system is the speed of a program when all of the potentially parallel calls run serially. If a lazy threading system performs well on this test, then programmers or language implementors can parallelize wherever possible without concern about threading overhead. Further, when all the potentially parallel calls run serially, they could have been written as sequential calls, and we can compare the program to a similar program compiled by a sequential compiler. In this section we compare grain compiled with different options by our lazy thread compiler to grain compiled with GCC.

Register Windows vs. Flat Register Model

Before comparing programs generated by the lazy threads compiler to those generated by GCC, we need to isolate the effects of register windows on the Sparc. The lazy threads compiler does not use register windows, so we make our comparisons against code produced by GCC for the flat register model (the -mflat option of GCC).


    Heap Model        Spaghetti Model    Stacklet Model
    Seed-Copy-Lazy    Seed-Link-Lazy     Cont-Copy-Lazy
    Seed-Copy-Lazy    Seed-Link-Lazy     Seed-Link-Impl
    Cont-Copy-Impl    Seed-Link-Lazy     Cont-Copy-Lazy
    Seed-Link-Impl    Seed-Link-Lazy     Cont-Copy-Lazy
    Seed-Copy-Impl    Seed-Link-Lazy     Cont-Copy-Lazy

Table: The minimum slowdowns relative to GCC-compiled executables for the different memory models, and the points in the design space at which they occur; each row corresponds to one grain size.

As the table above shows, the code produced for the flat register model is less efficient than the register-window model. There are two reasons for this inefficiency: the flat model uses callee save, and its prolog and epilog code requires more memory operations. Since callee save is in effect, more register save and restore operations occur in the flat register model. The flat register model also requires an additional two stores and two loads per function invocation compared with the equivalent code compiled for register windows. In retrospect, we should have optimized the code produced for GCC's flat model before introducing the lazy thread modifications. In the rest of this chapter we compare the lazy thread code to the flat register model.

Overall Performance Comparisons

The figure shows the slowdown of each of the models for different grain sizes. As expected, we see little difference between the models at large grain sizes and large differences at small grain sizes. When all the threads run to completion, as in this case, the two biggest factors are the memory model and the queueing method. The linked-frame memory model is significantly slower than the other memory models, running at half their speed for small grain sizes. The other major factor is the use of the explicit queueing method, which increases overhead independently of the other choices in the design space.

Comparing Memory Models

The most important improvement introduced by lazy threads is the use of stack-like structures to store thread state. As the figure shows, the overhead of the linked-frames model, when averaged over the other twelve points in the design space, is a factor of two to five times that of the other memory models. This occurs for two reasons: frame


[Figure: four panels, "Uni-processor Multithreaded Slowdown" for grain sizes of 8, 68, 608, and 6008 instructions, plotting the slowdown of every design point, grouped into the linked-frame, spaghetti-stack, and stacklet memory models.]

Figure: Slowdown of uniprocessor multithreaded code running serially, relative to GCC code. The four-letter abbreviations specify the memory model (H = linked frame, P = spaghetti, S = stacklets), the thread representation (C = continuations, S = thread seeds), the disconnection model (E = eager, L = lazy), and the queue model (E = explicit, I = implicit, L = lazy). Note the different scales of the Y axis on the four graphs.


[Figure: "Memory Model Average Overhead." Average slowdown (1.0 to 2.2) versus grain size (8, 14, 68, 608, and 6008 instructions) for the linked-frames, spaghetti, and stacklet memory models.]

Figure: Average slowdown of multithreaded code running sequentially on one processor, relative to GCC code. The slowdown is the average of all points in the design space for each memory model.

allocation is more expensive, and the child must perform a memory operation in order to find its parent.

The difference between spaghetti stacks and stacklets is less pronounced. Stacklets are slightly more efficient than spaghetti stacks because the overflow check is less expensive than saving and restoring a link between the child and the parent. However, stacklets may be more expensive than spaghetti stacks if many overflows occur. By adjusting the depth argument to grain, we can cause an overflow to occur for every leaf. A worst-case scenario is when all the allocations occur at the boundary: a function that recursively calls itself and finally calls another function, null, in a loop will take longer if overflows occur when null is invoked, even though null does nothing but return. In other words, the allocation for an overflow doubles the cost of a function call, which is expected, since it is similar to a linked-frame allocation.

To evaluate the effect of the memory model in isolation, we also look at a sequential grain compiled by the lazy thread compiler. The figure shows the slowdown of the multithreaded version of grain relative to GCC, of the sequential version of grain relative to GCC, and of the multithreaded version relative to the sequential version when both are compiled by the lazy threads compiler. The last data line shows the overhead of potentially parallel calls versus sequential calls within the lazy threading framework.


[Figure: three panels, one per memory model (linked-frame, spaghetti stack, stacklets), plotting the slowdowns LT/GCC, LT-Seq/GCC, and LT/LT-Seq for each combination of thread representation (Cont/Seed), disconnection (Eager/Lazy), and queue (Impl/Expl/Lazy).]

Figure: Comparison of the three memory models. LT/GCC shows the slowdown of the multithreaded grain relative to a sequential version compiled using GCC. LT-Seq/GCC compares a sequential version of grain compiled with Split-C/threads relative to GCC. LT/LT-Seq shows the slowdown introduced by turning sequential calls into pcalls when both are compiled by Split-C/threads.


As expected, the linked-frame memory model introduces most of its overhead not in handling potentially parallel calls but simply in managing the activation frames. On the other hand, both the spaghetti-stack and stacklet models show almost no overhead in the sequential version; all the overhead is introduced by the potentially parallel calls.

The interaction between the different axes of the design space is seen in the extra overhead of the explicit-queue models for both the spaghetti-stack and stacklet memory models. In both of these memory models the stack must be managed specially when a thread suspends. When explicit queueing is used, the parent of a suspending thread may not be notified that its child has suspended. Thus, to prepare for the possible suspension of a child, the compiler must initialize some frame slots on every call, resulting in extra overhead.

Having shown that the linked-frame memory model is always significantly slower than the other memory models, we eliminate it from the rest of the comparisons in this section.

Thread Representations

In this section we investigate the effect of the thread representation on performance. We begin by looking at grain running sequentially. We then turn our attention to how the choice of thread representation interacts with the compilation process. This interaction varies with the number of pcalls in a forkset and the data flow in the forkset. Finally, we show the effect of running threads in parallel on grain by varying the number of leaf threads that suspend.

For any choice of memory model, disconnection method, or queueing mechanism, there is no significant difference between thread seeds and continuations when all the potentially parallel calls run to completion (see the figure). This is expected, since we use the same techniques to implement both. When the potentially parallel call runs to completion, the continuation and the seed are both enqueued in the same way, and the pcall is implemented as a sequential call.

When a pcall in a forkset does not run to completion, the different thread representations can perform differently. If threads are represented with continuations, then when a child suspends, the continuation to the rest of the thread is stolen, and when the suspending child returns, it only tests the join to see if it is the last returning child. If it is


[Figure: "Thread Representations (grain size = 8), All Threads Run To Completion." Slowdown relative to GCC (LT/GCC) for continuations and thread seeds across the spaghetti-stack and stacklet models with each disconnection and queue choice.]

Figure: Comparison of continuation stealing and seed activation when all threads run to completion, across the design space, for grain with a grain size of 8 instructions. The slowdown is relative to a GCC-compiled sequential grain.

not, the join fails and the general scheduler is entered. Thus the control flow of the suspend and steal streams is identical to that of the sequential stream. When thread seeds are used to represent nascent threads, children that previously suspended will, when they complete and return to their parents, activate a thread seed if one exists. Thus the control flow in the suspend and steal streams is from every pcall to every other pcall in the forkset. The additional control paths present when thread seeds are used require that every inlet end with an indirect jump. Furthermore, every register that is live before all the pcalls must be restored before the indirect jump is executed.

To evaluate the difference in control flow, we compare three programs whose forksets differ in whether or not results are returned by the children threads. Each program consists of a loop that repeatedly calls a routine, caller, which consists entirely of one forkset with pcalls to another routine, test. The test routine either suspends and later returns, or returns immediately, depending on its argument. In the first program, voidvoid, both caller and test are void functions, and no values need to be restored or saved in any codestream. In the second program, void, test is a void function and caller returns a value, so one value needs to be restored and saved in both models. In the last program, return, both test and caller return values; thus every pcall in the suspend stream has to save and restore all the variables in the thread-seed model, while in the continuation-stealing model the nth call needs to save and restore n registers. As we see in the figure, the


[Figure: two panels, "Comparing Thread Representation," plotting the continuation/thread-seed execution-time ratio for the voidvoid, void, and return programs across the design-space models, when the first pcall out of two (left) and out of ten (right) suspends.]

Figure: The execution-time ratio between the continuation-stealing model and the thread-seed model when the points on the other axes are fixed and the first pcall of the forkset suspends. Lower ratios indicate that continuations perform better than thread seeds.

continuation-stealing model is more efficient for large forksets, while for small forksets there is no real difference between the two models. This data is consistent with the fact that the thread representation does not significantly affect the other axes of the design space. Interestingly, even though more save and restore operations are performed in the last program, return, under the thread-seed model, the difference in performance is smaller because there is slightly more work per thread.

The preceding figures show the worst case for thread seeds, in that the first pcall in the forkset suspends and all the remaining calls must run in the suspend stream. As the next figure shows, if a later pcall suspends, the difference between the two models disappears.

When we look at the behavior of grain in the figure, we see that the thread representation has little effect on the performance of the program, even as the number of threads suspending changes. The small difference is explained by two factors. First, there are only two pcalls in the forkset, so there is no difference in the control flow in the suspend stream. Second, the suspending threads are evenly distributed between the first and last pcalls in the forkset.

As more leaves suspend, continuations perform better than thread seeds by only a small margin. This is particularly true when the memory model has low overhead (e.g.,


[Figure: "Comparing Suspension Location." Continuation/thread-seed ratio across the design-space models when the 1st, 5th, or 10th of ten pcalls suspends.]

Figure: Comparing the effect of which of the ten pcalls in a forkset suspends.

[Figure: "Thread Representations for Grain of 8." Continuation/thread-seed ratio across the design-space models as the percentage of leaves suspending varies from 0.01% to 10%.]

Figure: Comparing thread representations for a grain size of 8 as the percentage of leaves that suspend increases. Missing bars indicate that the program was unable to run, because it ran out of memory or took more than three orders of magnitude longer than the other programs.

stacklets with lazy disconnection) and the implicit queue is used. When implicit queueing is used, the suspending thread always returns to its parent. When continuation stealing is in effect, the parent immediately suspends to its parent if it has no work; with thread seeds, in contrast, registers might be restored needlessly and a jump through the resume address would be performed before the parent suspends.


[Figure: slowdown of eager (LT/GCC-eager) and lazy (LT/GCC-lazy) disconnection for each storage, representation, and queueing model, all threads running to completion.]

Figure: Comparison of eager and lazy disconnection for grain with a grain size of 8 and all the threads running to completion.

Continuations and thread seeds perform equally well when all the threads in a forkset run to completion, or, when a thread does suspend, if there are few remaining pcalls left in the forkset. Since the total number of pcalls in a forkset is often very small (in most cases, two), both methods of representing threads should work equally well when executing multithreaded code on a uniprocessor.

Disconnection

In this section we compare eager-disconnect and lazy-disconnect. We first look at the cases when all threads run to completion in grain, and then when threads suspend in grain. Next we look at how the number of pcalls in a forkset affects the relative performance of the two disconnection methods.

As we see in the figure above, the different disconnection models have little effect on the sequential performance of potentially parallel calls. When eager-disconnect is used, the parent must be able to modify the child's link back to its parent. In both the linked-frame and stacklet models this requires an extra store in the function prolog of every child. There is no difference in the spaghetti model because, when explicit queueing is used, extra overhead is always required.

The real difference between the methods of disconnection shows up when the potentially parallel calls run in parallel. In this case we see an overwhelming disadvantage


[Figure: eager/lazy disconnection ratio by model as the percentage of leaves suspending varies from 0.01% to 10%; bars above 4 are truncated.]

Figure: Comparison of eager and lazy disconnection for a grain size of 8 as the number of leaves suspending changes. Higher bars indicate lazy-disconnect is better.

[Figure: left, eager/lazy disconnection ratio with a forkset of two pcalls; right, continuation/thread-seed ratio with a forkset of ten pcalls; legends distinguish void and return child types.]

Figure: Comparison of eager and lazy disconnection when the first pcall in a forkset with two or ten pcalls suspends.


to the eager-disconnect method for stacklets, since entire activation frames must be copied. The figure above compares the two methods as the number of leaves suspending changes. Eager-disconnect is significantly worse for the stacklet models even when as few as one out of 10,000 threads suspends. When more threads suspend, the programs run for so long (hours as opposed to seconds) that we omit the data. The only other significant difference between disconnection methods appears when spaghetti stacks are used with the explicit queue. In this case the eager-disconnect method requires extra setup on every call, whether or not it suspends.

As the figure above shows, when the forkset has only two pcalls, the overhead of copying the frames overwhelms the advantage of being able to run the next pcall in the same stacklet as the parent. For the other memory models there is no real difference between the two disconnection methods when the forkset has few pcalls.

We conclude that eager-disconnect should not be used with stacklets, or with spaghetti stacks if explicit queueing is used. Further, there is no advantage to eager-disconnect unless the majority of forksets have many pcalls in them.

Queueing

We now examine the behavior of the different queueing mechanisms when all the threads run to completion and when they suspend. As expected, the figure below shows that the explicit queue is more expensive than either the lazy or the implicit queue for threads that run to completion. The difference between the implicit and lazy queues is small when the threads run to completion.

When threads suspend, the explicit model becomes more attractive, and the implicit and lazy models diverge in performance. The figure below shows the improvement of the explicit queue model as more threads suspend. It also shows the destructive interaction between spaghetti stacks and the explicit queue. As discussed above, this arises because there is no guarantee that threads are notified when they must reclaim storage, so every activation frame must set and check some variables in the prolog and epilog.
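A minimal C sketch of that per-frame flag (the names are hypothetical) shows why the check is pure overhead for threads that never suspend:

```c
#include <stdbool.h>

/* Hypothetical spaghetti-stack frame header. With an explicit queue
 * there is no guarantee a suspended thread is told to reclaim its
 * storage, so every frame maintains a flag set in the prolog and
 * tested in the epilog, whether or not the thread ever suspends. */
struct sframe {
    bool must_reclaim;   /* set in the prolog, checked in the epilog */
};

static void prolog(struct sframe *f)     { f->must_reclaim = false; }
static void on_suspend(struct sframe *f) { f->must_reclaim = true;  }

/* Epilog: this test runs on every return. Its per-call cost is what
 * makes spaghetti stacks plus explicit queueing a poor combination. */
static int epilog(struct sframe *f) {
    return f->must_reclaim ? 1 /* free the dead frame */ : 0;
}
```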


[Figure: slowdown versus GCC of the explicit, implicit, and lazy queueing methods for each storage, representation, and disconnection model, all threads running to completion.]

Figure: Comparison of the three queueing methods (explicit, implicit, and lazy) for grain when all threads run to completion.

[Figure: left, explicit/lazy ratio; right, explicit/implicit ratio; both by model as the percentage of leaves suspending varies from 0.01% to 10%.]

Figure: Comparison of the queueing mechanisms as the number of threads suspending increases, for grain with a grain size of 8 instructions. Lower bars indicate that explicit queueing is preferred.


[Figure: speedup of grain 8 on up to 32 processors of the NOW; the spaghetti-with-explicit-queue, linked-frame, and implicit-queue models trail the rest.]

Figure: Speedup of grain (grain size 8 instructions) on the NOW, relative to a version compiled with GCC on a single processor, for the different models.

Running on the NOW

Having analyzed the behavior of the different points in the design space on a uniprocessor, we now investigate the performance of lazy thread programs on a distributed-memory multiprocessor. We limit our attention to those models that can be reasonably supported on a distributed-memory machine. Thus we do not look at continuation stealing or eager-disconnect. Continuation stealing requires migration of instantiated threads, which is costly and very difficult to implement. Eager-disconnect is clearly too costly, even on a single processor, to have any positive benefit on multiprocessors. We look at the implicit queueing method only briefly, because it promotes the distribution of the finest-grained work and thus does not scale well on multiprocessors.

The figure above shows the speedup of grain with a grain size of 8 instructions when the network is polled on every function call. Clearly the work distribution method implemented by the implicit queue is not efficient, as the programs compiled with the implicit queue run very poorly. As expected, the linked-frame storage model performs poorly: independently of the number of processors, it is substantially slower than the other storage models. The other

(Continuation stealing can be implemented without frame migration, but in that case, when a continuation is stolen, the parent must fork off its next child onto the remote processor, just as in seed activation. However, with continuation stealing the parent cannot be continued until its remote child returns, and it will then also have to fork off its next child onto a remote machine.)


[Figure: slowdown over GCC of GCC, GCC-Flat, LT-Sequential, LT-Uniprocessor, LT-Multi-No-Poll, and LT-Multiprocessor at several grain sizes, for the PSLL and SSLL models.]

Figure: A breakdown of the overhead when running on one processor of the NOW. LT-Sequential is the serial version of grain. LT-Uniprocessor is a multithreaded version of grain compiled for one processor. LT-Multi-No-Poll is compiled for an MPP but does not include any network polling. LT-Multiprocessor is an MPP version with polling.

models all perform well, achieving linear speedup relative to themselves. However, the lazy threads programs run four times slower on a single processor of the multiprocessor than on a uniprocessor.

The figure above shows where the performance is lost. Focusing on the smallest grain size, we pay an immediate penalty for using the -mflat option of GCC. Ignoring the linked-frames model, there is no additional penalty for using Split-C+threads to compile a sequential version of grain. However, turning the function calls into pcalls incurs an additional loss of performance. Compiling the multithreaded version for a multiprocessor adds no cost, but adding the instructions to poll the network interface more than triples the running time.

If we compare the performance of lazy threads on the NOW with that of the Thinking Machines CM-5, we see how the less expensive poll operation on the CM-5 boosts performance. The figure below shows the speedup of grain compiled by an earlier compiler, which used stacklets, thread seeds, lazy-disconnect, and explicit queueing. Even the finest grain sizes run at high efficiency. On the NOW, the more expensive poll operation reduces the overall performance of even the large grain sizes.

When we reduce the number of poll operations, efficiency increases dramatically. If we change the poll operation into a conditional poll, which checks the network only every n polls, we increase the efficiency of the program without implicitly increasing the grain


[Figure: speedup on up to 30 processors for granularities of 14, 38, 188, 608, 1208, and 3608 instructions.]

Figure: Speedup of lazy threads (stacklets, thread seeds, lazy disconnection, and explicit queue) on the CM-5, compared to the sequential C implementation, as a function of granularity.

[Figure: speedup as the polling period varies from 1 to 16384, for several processor counts.]

Figure: The effect of reducing the polling frequency on grain with a grain size of 8 instructions, using stacklets, thread seeds, lazy-disconnect, and lazy queueing.


[Figure: four panels showing speedup on up to 32 processors at grain sizes 8, 68, 608, and 6008 for the spaghetti and stacklet models with explicit and lazy queues.]

Figure: Speedup of grain at different grain sizes for the explicit and lazy queue models. In all cases the same polling period is used.

size of the work that can be stolen. The figure above shows how the speedup of grain changes as we increase the period between polls. Performance increases as we decrease the number of polls, until a point is reached when the long periods of ignoring the network interfere with load balancing. For the rest of our experiments we fixed the polling period accordingly.
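A conditional poll of this kind can be sketched in a few lines of C; the countdown scheme and names are assumptions for illustration, not the Split-C+threads implementation:

```c
/* Hypothetical conditional poll: the network is actually checked only
 * once every POLL_PERIOD calls, amortizing the cost of the expensive
 * poll operation on the NOW without changing program structure. */
#define POLL_PERIOD 1024

static int poll_countdown = POLL_PERIOD;
static int real_polls = 0;            /* stand-in for the network check */

static void maybe_poll(void) {
    if (--poll_countdown == 0) {
        poll_countdown = POLL_PERIOD;
        real_polls++;                 /* service the network here */
    }
}
```

The period trades responsiveness for overhead: too small and every call pays for a poll, too large and remote requests (and thus load balancing) wait too long.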

The figure above shows the speedup of grain on the NOW with this polling period. In all cases the combination of spaghetti stacks and the explicit queue performs poorly. The other combinations are roughly equivalent once the grain size is large enough. When run on all the processors, the efficiency ranges from its lowest value at a grain size of 8 instructions


[Figure: speedup on up to 32 processors for the spaghetti and stacklet models with explicit and lazy queues at grain sizes 8, 68, 608, and 6008, with the larger work stolen remotely.]

Figure: Speedup of an unbalanced grain where there is more work in the second pcall of the forkset.

[Figure: speedup on up to 32 processors for the same models and grain sizes, with the larger work kept local.]

Figure: Speedup of an unbalanced grain where there is more work in the first pcall of the forkset.


[Figure: left, speedup of nqueens on up to 32 processors; right, efficiency (roughly 0.64 to 0.76); legend: Copy-On-Suspend versus Always Copy.]

Figure: Speedup and efficiency achieved with and without the copy-on-suspend optimization for nqueens, using stacklets, thread seeds, lazy-disconnect, and the lazy queue.

to much higher values at larger grain sizes, showing that lazy threading is effective, when implemented correctly, even on distributed-memory machines.

Lazy threads also work well when the amount of work performed by the different threads is not uniform. The two figures above show the speedup of a modified grain in which the amount of work performed by a parent's children differs by a factor of two. In one version the first pcall performs twice as much work as the second; thus work that is stolen is always smaller than the work performed locally. In the other, the second pcall performs twice as much work as the first, leading to larger-grained work being stolen. When the grain size is small, we see a slight performance degradation when the work that can be stolen is smaller than the work available locally. While there is only a small difference between these two programs, the difference shows the asymmetry in the overall approach taken by lazy threads.

Using Copy-on-Suspend

In our final experiment using Split-C+threads, we examine the effect of the copy-on-suspend optimization as applied to the nqueens program. As the figure above shows, the copy-on-suspend optimization yields a noticeable improvement in performance. If the copy-on-suspend optimization is not applied to nqueens, there is a loss in performance for using lazy threads even on a single processor, because of the need to copy the shared data. See


the earlier discussion for details. The copy-on-suspend optimization eliminates the need to copy the shared data except when parallelism is actually needed.
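The optimization can be sketched in C (the data structure and names are hypothetical): the copy is made only on the suspension path, so a child that runs to completion reads the shared data in place:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared state for an nqueens-style search. */
struct board { int cells[8]; };

/* Copy-on-suspend, minimal sketch: a child that completes uses the
 * parent's board directly; only a child that actually suspends pays
 * for a private copy, so the common case does no copying at all. */
static struct board *maybe_copy_on_suspend(struct board *shared,
                                           int suspends) {
    if (!suspends)
        return shared;                  /* fast path: no copy */
    struct board *priv = malloc(sizeof *priv);
    memcpy(priv, shared, sizeof *priv); /* pay only when parallel */
    return priv;
}
```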

When running on a distributed multiprocessor there is the additional penalty of sending the data across the network. This imposes two costs: the explicit cost of sending the data, and the implicit cost of having to poll the network often enough to service the data requests. When compiled for a multiprocessor and run on one processor, the program runs at high efficiency, indicating that the cost of polling is relatively low. Since nqueens achieves substantial speedup on the full machine, the poll rate is sufficient to handle both load balancing and requests for data.

Suspend and Steal Stream Entry Points

As discussed earlier, for ease of experimentation we chose to implement the suspend and steal entry points associated with a seed or continuation as jump instructions placed after the call with which they are associated. This has a larger than expected impact on the performance of a pcall, due to the difference in the offset used by a return instruction. If we modify the assembly code by changing the jumps to nops and have the function return like an ordinary sequential call, we see no effect on the runtime, even though six extra instructions are executed on every return. If we then remove the nops, so the same instructions are executed, we see a measurable improvement in performance.
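At the machine level these entry points are reached by returning at different offsets from the saved return address; a portable C analogy (with hypothetical status codes, not the generated assembly) is a multiway branch placed directly after the call:

```c
/* Hypothetical return statuses a child can select. In the real
 * implementation the child returns to different offsets from the
 * saved return address rather than returning a code. */
enum ret { RET_NORMAL = 0, RET_SUSPEND = 1, RET_STEAL = 2 };

static int child_status;               /* set before the child returns */
static enum ret child(void) { return (enum ret)child_status; }

/* The caller's code stream directly after the call: a normal return
 * falls through to the sequential path, while suspension and stealing
 * reach dedicated entry points, mirroring the jump instructions
 * placed after the call instruction. */
static const char *pcall_site(void) {
    switch (child()) {
    case RET_NORMAL:  return "inlined";    /* sequential fall-through */
    case RET_SUSPEND: return "suspend";    /* suspend-stream entry */
    case RET_STEAL:   return "steal";      /* steal-stream entry */
    }
    return "?";
}
```

The offset trick makes the common (normal-return) path free: no status register is tested, and the alternate streams cost nothing until a child actually suspends or work is stolen.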

Comparing Code Size

Our compilation techniques incur additional overhead in the form of an increase in code size. Every call is duplicated for single-processor concurrency, or triplicated for a multiprocessor. If there is code between pcalls, it too must be copied. Furthermore, each call has associated with it the suspend and steal stream entry points, setup code to handle disconnection, and so on. The figure below compares the size of an object file for the grain function compiled by Split-C+threads, for a uniprocessor and for a multiprocessor, to the size of the same function compiled by GCC. The multithreaded versions are several times larger than the sequential code as compiled by GCC, with the multiprocessor version the larger of the two. The largest effect on code size comes from the queueing method chosen: the lazy queue has extra code to construct the explicit queue, and the implicit queue has


[Figure: code-size expansion factor over GCC, for LT compiled for a uniprocessor and for a multiprocessor, across all storage, representation, disconnection, and queueing models; the factor ranges up to about 5.]

Figure: The amount of code dilation for each point in the design space, for a forkset with 2 pcalls.

extra code to transfer control to the parent. In effect, we have traded off time for space as compared to the library implementation of eager multithreading.

Id Comparison

Our earlier work on lazy threading was motivated by previous work on efficiently implementing Id using a threaded abstract machine (TAM). The compilation process involved compiling Id to TL0, an assembly language for TAM, and then compiling TL0 to C. In TL0, every function call is implemented as a parallel fork, with the associated overhead of scheduling, synchronization, frame allocation, and use of memory to transfer arguments and results. To improve the performance of Id executables, we developed a new assembly language that supports lazy threading. We developed a lazy-thread-based compiler for it that used stacklets, thread seeds, lazy-disconnect, and explicit queueing; the new language also supports strands. This compiler compiled the code to C using techniques that became the basis for the Split-C+threads compiler. In the table below we show the results of this early work. We can see that even with the overhead of compiling to C, the need to support strands, and no control over register usage, we achieved a significant speedup over TL0.


Program     Short Description
Gamteb      Monte Carlo neutron transport
Paraffins   Enumerate isomers of paraffins
Simple      Hydrodynamics and heat conduction
MMT         Matrix multiply test

Table: Dynamic runtime in seconds on a SparcStation for the Id benchmark programs under the TAM model and under lazy threads with multiple strands (stacklets, thread seeds, lazy-disconnect, and explicit queue).

[Figure: the design-space grid of storage model (linked frames, spaghetti stacks, stacklets), queue mechanism (lazy, implicit, explicit), thread representation (seeds, continuations), and disconnection method (eager, lazy), with rejected points annotated: linked frames are slower than the other memory models; eager disconnection causes excessive copying with stacklets; explicit queueing adds extra overhead to spaghetti stacks; eager-disconnect requires frame migration or message forwarding and limits the use of local pointers on distributed-memory machines; continuations require message forwarding on distributed-memory machines; implicit queueing distributes work at too fine a grain for distributed-memory machines; the thread representation axis itself has no drawbacks.]

Figure: Which points in the design space are effective, and why. The white squares are the effective models.

Summary

In each of the preceding sections we examined one axis of the design space and analyzed the performance of the points on that axis with respect to the other axes. The figure above summarizes our conclusions. Each shaded region indicates a point in the space that we have rejected, either because the intersection of models leads to poor performance or because a particular point was poor compared to the other possible points on that axis. For example, eager-disconnect and stacklets clearly work against each other, and any implementation that uses both of them will perform poorly.

The figure also includes three additional reasons for rejecting particular models for use on distributed-memory machines. Eager-disconnect requires either frame migration or message forwarding on distributed-memory machines; it also excludes the use of pointers to automatic variables. Continuation stealing is excluded since it requires frame migration. Finally, implicit queueing distributes work at too fine a grain.

All three of the remaining models perform well, even on very fine-grained programs. Stacklets, thread seeds, lazy-disconnect, and lazy queueing work particularly well together, requiring almost no preparation overhead before a potentially parallel call is initiated and little work when parallelism is actually needed.
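This pay-only-for-what-you-use fast path can be caricatured in a few lines of C; the suspension flag and names below are hypothetical stand-ins for the mechanisms described above:

```c
/* Minimal sketch of the lazy-fork fast path: a potentially parallel
 * call starts as an ordinary sequential call; only when the child
 * reports suspension does the parent do the extra work of creating
 * real parallelism. All names here are illustrative only. */
static int suspended;                     /* set if the child suspends */

static int child(int n) { suspended = (n < 0); return n * 2; }

static int pcall(int n) {
    int r = child(n);                     /* plain sequential call */
    if (suspended) {
        /* slow path: disconnect, queue remaining work, and
         * synchronize -- paid only when parallelism is needed */
        r = 0;
    }
    return r;
}
```

A child that runs to completion pays only the cost of the sequential call plus the fall-through; the disconnection and queueing machinery is reached only on the suspension path.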

Chapter

Related Work

Attempts to accommodate logical parallelism include thread packages, compiler techniques, clever runtime representations, and direct hardware support for fine-grained parallel execution. These approaches have been used to implement many parallel languages, e.g., Mul-T, Id, CC++, Charm, Opus, Cilk, Olden, and Cid. The common goal is to reduce the overhead associated with managing the logical parallelism. While much of this work overlaps ours, none has combined all of the techniques described in this thesis. More importantly, none has started from the premise that all calls, parallel or sequential, can be initiated in exactly the same manner.

Our work grew out of previous efforts to implement the non-strict functional language Id for commodity parallel machines. Our earlier work developed a Threaded Abstract Machine (TAM), which serves as an intermediate compilation target. The two key differences between this work and TAM are that under TAM calls are always parallel and that, due to TAM's scheduling hierarchy, calling another function does not immediately transfer control.

Our lazy thread fork allows all calls to begin in the same way and creates only the required amount of concurrency. In the framework of previous work, it allows excess parallelism to degrade efficiently into a sequential call. Many other researchers have proposed schemes that deal lazily with excess parallelism. One of the simplest is load-based inlining, which uses the load characteristics of the parallel machine to decide, at the time a potentially parallel call is encountered, whether to execute it sequentially (inline it) or execute it in parallel. This has the advantage of dynamically increasing the granularity of the


program. However, these decisions are irrevocable, which can lead to serious load imbalances or deadlock.

Our approach builds on lazy task creation (LTC), which introduced the idea of lazy threads. LTC is based on linked frames, continuation stealing, eager-disconnect, and an explicit queue. LTC also uses work stealing to perform dynamic load balancing. These ideas were studied for Mul-T running on shared-memory machines. Since work stealing causes the parent to migrate to a new thread, LTC depends on the ability of the system to migrate activation frames. This either requires shared-memory hardware capabilities or restricts the kinds of pointers allowed in the language. Finally, LTC also depends on a garbage collector, which hides many of the costs of stack management.

Another proposed technique for improving LTC is leapfrogging, which uses multiple stacks, a limited form of continuation stealing, eager-disconnect, and explicit queues. Unlike the techniques we use, it restricts the use of continuation stealing in an attempt to reduce the cost of futures. Leapfrogging has been implemented using Cthreads, a lightweight thread package.

StackThreads uses both a stack and the heap for storing activation frames, in an attempt to reduce overhead for fine-grained programs running on a single processor. Activation frames are initially placed on the stack, and if a thread blocks, its activation frame is moved onto the heap, in effect implementing eager-disconnect on a hybrid memory model. Since the sequential call invariants are not enforced, StackThreads does not take advantage of passing control and data at the same time, reducing register usage and increasing synchronization overhead. They take a point of view diametrically opposed to ours in that all calls, sequential or parallel, use the same representation. This triples the direct function call/return overhead and prevents the use of registers.

A much simpler thread model is advocated in Shared Filaments and Distributed Filaments. A filament is an extremely lightweight thread that does not have a stack associated with it. This works well when a thread does not fork other threads. More general threads are supported with a single stack, because language restrictions make it impossible for a parent to be scheduled while any of its children are waiting for a synchronization event. Distributed Filaments are combined with a distributed shared-memory system. In the case of a remote page fault, communication can be overlapped with computation because each processor runs multiple scheduling threads.

Olden is based on spaghetti stacks, continuation stealing, eager-disconnect, and an explicit queue. In Olden, as in previous uses of spaghetti stacks, a garbage collector is used to reclaim freed frames. Olden's thread model is more powerful than ours, since in Olden threads can migrate. The idea is that a thread computation that follows references to unstructured heap-allocated data might increase locality if migration occurs. On the other hand, this model requires migrating the current call frame, as well as disallowing local pointers to other frames on the stack.

We use stacklets for efficient stack-based frame allocation in parallel programs. Previous work by Hieb et al. describes similar ideas for handling continuations efficiently, but it uses a garbage collector.

Our implementation technique for encoding the suspend and steal entry points for a thread seed or continuation builds on the use of multiple offsets from a single return address to handle special cases. This technique was used in SOAR. It was also applied to Self, which uses parent-controlled return continuations to handle debugging.

Finally, many lightweight thread packages have been developed. Cthreads is a runtime library that provides multiple threads of control and synchronization primitives for parallel programming at the level of the C language. Scheduler activations reduce the overhead by moving fine-grained threads completely to the user level and relying on the kernel only in infrequent cases. Synthesis is an operating-system kernel for a parallel and distributed computational environment that integrates dynamic load balancing and dynamic compilation techniques. Chant is a lightweight threads package used in the implementation of an HPF extension called Opus. Chant provides an interface for lightweight user-level threads that have the capability of communication and synchronization across separate address spaces. While user-level thread packages eliminate much of the overhead encountered in traditional operating-system thread packages, they are still not as lightweight as many of the systems mentioned above that use special runtime representations supported by the compiler. Since the primitives of thread packages are exposed at the library level, the compiler optimizations presented in this thesis are not possible for such systems.

Chapter

Conclusions

Many modern parallel languages support dynamic creation of threads or require multithreading in their implementations. In these languages it is desirable for the logical parallelism expressed by the threads to exceed the physical parallelism of the machine. In practice, most logical threads need not be independent threads. Instead, they can run as sequential calls, which are inherently cheaper than independent threads. The challenge to the language implementor is that one cannot generally predict which logical threads can be implemented as sequential calls. In lazy multithreading systems, each logical thread begins execution sequentially, with the attendant efficient stack management and direct transfer of control and data. Only if a thread must execute in parallel does it get its own thread of control.

This dissertation presents a design space for implementing multithreaded systems without preemption. We introduce a sufficient set of primitives so that any point in the design space may be implemented using our new primitives. The primitives and implementation techniques that we propose reduce the cost of a potentially parallel call to nearly that of a sequential call, without restricting the parallelism inherent in the program. Using our primitives, we implement a system (in some cases a previously proposed system) at each point in the design space.

We have shown that, using appropriate compiler techniques, we can provide a fast parallel call with nearly the full efficiency of a sequential call when the child thread executes locally and runs to completion without suspension. Such child threads occur frequently both with aggressively parallel languages, such as Id, and with more conservative languages, such as C with parallel calls (e.g., Split-C+threads).


The central idea behind a fast parallel call is to pay only for what is used. Thus a local fork is performed essentially as a sequential call, with the attendant efficient stack management and direct transfer of control and data. If the child actually suspends before completion, control is returned to the parent so that it can take appropriate action. Similarly, remote work is generated lazily.

Our compilation techniques, and the new points in the lazy thread design space we introduce, exploit the one bit of flexibility in the sequential call: the indirect jump on return. First, we exploit this flexibility by associating multiple return entry points with each return address. The extra entry points are linked to compiler-generated code streams that handle any necessary parallelism without advance preparation. Second, we exploit the indirect jump by using parent-controlled return continuations to eliminate the need for synchronization until the child and parent need to run in parallel. The new design points introduced (stacklets on the memory axis, thread seeds on the thread-representation axis, lazy-disconnect on the disconnection axis, and implicit and lazy queueing) also arise from exploiting the flexibility of the indirect jump on return.

We find that performance is best in implementations that strike a balance between preparation before the potentially parallel call and extra work when parallelism is actually needed. Our implementation of spaghetti stacks illustrates this point perfectly. If a lazy or implicit queue is used, then when a thread runs to completion no unnecessary check is made to see if reclamation is necessary. However, if a thread suspends, we use pcrcs to ensure that the check is performed when the thread eventually completes. This strikes a proper balance by avoiding run-time preparation and limiting the work necessary when parallelism is needed. However, when spaghetti stacks are used in conjunction with an explicit queue, we prepare for the worst case by performing the check on every return, making this an unattractive combination. Stacklets also achieve balance for storing thread state, while linked frames require too much preparation. Continuations and thread seeds, when implemented using multiple return entry points and pcrcs, both achieve this balance. Eager-disconnect, when combined with stacklets or spaghetti stacks and explicit queueing, requires too much work when parallelism is needed. Finally, of the three queueing methods, lazy queueing works well, while explicit queueing requires preparation on every parallel call and implicit queueing interacts badly with work stealing when parallelism is actually needed.
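A toy allocator makes the spaghetti-stack discipline concrete. This is a sketch under stated assumptions: the names, the fixed region size, and the absence of overflow handling are all simplifications, not the dissertation's implementation.

```c
#include <stddef.h>

/* A spaghetti stack grows by bumping a top pointer and has no general
 * free operation: frames are reclaimed only from the top. */

#define REGION_BYTES 4096
static unsigned char region[REGION_BYTES];
static size_t top;                      /* next free location */

static void *spaghetti_alloc(size_t n) {
    void *frame = &region[top];
    top += n;                           /* no per-frame bookkeeping */
    return frame;
}

/* Fast path: a child that ran to completion simply lowers the top
 * pointer. Only a suspended frame forces a reclamation check later,
 * which is the balance described above when pcrcs are used. */
static void spaghetti_free_top(size_t n) {
    top -= n;
}
```

With a lazy or implicit queue, the completion path is exactly this pointer decrement; the reclamation check is reached only through the retargeted return continuation of a frame that actually suspended.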

Our empirical studies using Split-C threads on the Berkeley NOW show that the combination of stacklets or spaghetti stacks, thread seeds, lazy-disconnect, and lazy queueing offers very good parallel performance and supports fine-grained parallelism even on a distributed-memory machine. In contrast, the linked-frame storage model, which is the most commonly used storage model for compiler-integrated multithreading, performs poorly. When concurrency is needed on a single processor, we find that continuations are also a reasonable choice for representing threads. The tradeoff between thread seeds and continuations on shared-memory machines remains unexplored; on such machines migration across address spaces is not necessary, removing the obstacle to using continuations.

This dissertation opens several areas for exploration. The most obvious is comparing the design space on shared-memory machines. Three other interesting areas are generalizing the use of thread seeds, the use of multiple code streams to support copy-on-suspend in parallel programs, and the efficient placement of network polling operations.

One way to view thread seeds is as an efficient way to handle exceptions without advance preparation. Broadly speaking, an exception is an infrequently encountered condition that must nonetheless be handled as part of normal program execution. In our case, the exception is the need to run a child in parallel. Perhaps thread seeds can be applied to a broader class of exceptions in languages that promote the use of explicit exception handling.

Our results depend heavily on generating multiple code streams for each forkset. We also see how multiple code streams enable the copy-on-suspend optimization, which eliminates unnecessary copying of shared data structures. Perhaps copy-on-suspend can be implemented automatically, and if so, multiple code streams may allow other optimizations.

Because our compilation strategy eliminates most of the overhead due to multithreading, the dominating overhead on a distributed-memory multiprocessor is the network poll. We currently take a very conservative approach and insert many more polling operations than necessary. An important optimization to investigate is how to reduce the number of polling operations while still guaranteeing correct network behavior.
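The conservative placement described above amounts to polling at every function entry and every loop back edge. The sketch below makes that cost visible; net_poll and the counter are stand-in names assumed for illustration, not the actual runtime's poll.

```c
/* Conservative polling placement: one poll per function entry plus one
 * per loop iteration. The counter stands in for the real network poll. */

static int polls;

static void net_poll(void) {
    polls++;        /* a real poll would drain pending network messages */
}

static int sum(const int *a, int n) {
    net_poll();                 /* poll at function entry */
    int s = 0;
    for (int i = 0; i < n; i++) {
        net_poll();             /* poll on each loop back edge */
        s += a[i];
    }
    return s;
}
```

Even this tiny loop performs n + 1 polls; eliminating the redundant ones without allowing unbounded poll-free intervals is exactly the optimization question raised above.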

In conclusion, we have shown that fine-grained parallelism can be efficiently supported, thus allowing programmers and back-end compilers the freedom to specify all the parallelism in a program without concern about excessive overhead.

Appendix A

Glossary

cactus stack
    The global tree of thread frames.

child
    The state of a frame that has been invoked with an lfork but has not been disconnected.

child thread
    The thread created by one of the fork operations (fork, dfork, or lfork). The thread executing the fork operation is the parent thread.

closure
    A data structure that represents a nascent thread, containing the arguments and the address of the thread's codeblock. The closure is self-contained and can be turned into an independent thread by the general scheduler.

codeblock
    The set of instructions that comprise the body of a function or a thread.

continuation stealing
    An operation that allows a parent thread to disconnect from its lazy child thread. It disconnects the parent by restarting the parent at its current continuation.

disconnection
    An operation that elevates a lazy thread into an independent thread. This may involve disconnecting the control link between the lazy thread and its parent, the storage used by the lazy thread and its parent, or both.

daemon thread
    A thread that remains alive until the end of the program.


dexit
    The instruction in MAM-DF that terminates a thread that was invoked by dfork.

dfork
    An instruction in MAM-DF that creates and directly schedules a child thread.

downward thread
    A thread that terminates before its parent terminates.

eager-disconnect
    The act of disconnecting a child from its parent by changing the parent frame so that future lazy threads may be invoked in the same manner as before disconnection. Compare to lazy-disconnect.

eager fork
    A fork that always creates an independent thread.

enable
    The instruction that ensures that a thread is on the ready queue and in the ready state.

exit
    The instruction that terminates a thread and idles the processor on which it was running.

fork
    The instruction that creates a new independent thread.

forkset
    A set of forks that are related by a common join.

frame area
    The area of a stack or stacklet that holds the activation frames for one independent thread: the base frame, its lazy thread descendants, and their sequential call frames.

has-child
    The state a parent thread enters when it lforks a lazy child thread.

idle thread
    A thread waiting on an event. It is always an independent thread.

independent thread
    A thread that is not lazy, i.e., one with its own logical stack.

indirect inlet address
    An address of a location in a thread's frame which contains the address of an inlet. The address is used by lreturn to find the inlet to run when the child terminates.


inlet
    A code fragment that processes data received from another thread. All inter-thread communication is handled by transferring data with send to an inlet.

ireturn
    The instruction used to return control from an inlet back to the sending thread.

is-child
    The state a lazy thread enters when it is created.

join counter
    A counter in the parent thread used in the synthesis of the join operation for threads.

lazy thread
    A thread that has been started by an lfork and is using its parent's stack.

lfork
    An instruction in MAM-DS and LMAM that creates and directly schedules a child thread.

lazy-disconnect
    The act of disconnecting a child from its parent without changing the parent's frame. It causes future lazy threads to be invoked differently than before disconnection. Compare to eager-disconnect.

linked frames
    A thread storage model in which every activation frame is fixed-sized and heap-allocated.

LMAM
    A Lazy Multithreaded Abstract Machine. A lazy thread is directly scheduled and can share the same stack as its parent thread.

lreturn
    An instruction in MAM-DS and LMAM that simultaneously returns control and data from a child thread to its parent.

nascent thread
    A thread that has not yet started, but will start the next time its parent is resumed.

MAM
    The basic multithreaded abstract machine, which supports only eager thread creation.

MAM-DF
    A multithreaded abstract machine with direct fork.


MAM-DS
    A multithreaded abstract machine with direct scheduling for both thread creation and termination: lfork directly calls its lazy child, and lreturn directly schedules its parent.

multiple stacks
    A thread storage model in which each thread is assigned its own stack; lforks and calls are allocated on the same stacks as their parents.

parallel overhead
    The overhead introduced by a parallel algorithm over its sequential algorithm.

parent queue
    The set of parent threads that have given control of their processors to their children. These threads are all in the has-child state.

parent thread
    The thread that executes a fork operation (fork, lfork, or dfork). The thread created by the fork is the child thread.

ready thread
    A thread in the ready state. It is always an independent thread.

recv
    The first instruction of an inlet. It describes the data expected by the inlet.

ready queue
    The set of independent threads that are in the ready state, i.e., that are ready to execute.

return register
    Contains the return continuation used by a child to return results to its parent. In MAM the return continuation is an inlet pointer and a thread pointer; it is redefined in MAM-DS to be the indirect inlet address.

running thread
    A thread in the running state. It is always an independent thread.

send
    The instruction used to transfer data from one thread to the inlet of another thread.

spaghetti stack
    A memory region that is unbounded in one direction but, unlike a stack, has no free operation. A spaghetti stack requires the use of a top pointer, which points to the next free location in the stack.

stacklet
    A fixed-sized, heap-allocated chunk of memory that behaves like a stack.


strands
    A flow of control within a function that may not suspend. Strands allow parallelism within a function.

stub
    A frame at the bottom of a stack or stacklet that is used to maintain the links between independent threads and the cactus stack.

suspend
    The instruction that causes a thread to give up its processor and enter the idle state. If the thread was lazy, this may also involve disconnecting the thread from its parent.

thread seed
    A representation for a nascent thread. It is a pointer to a context and a set of related code pointers, where each of the code pointers can be derived at link time from a single code pointer.

thread
    A locus of control which can perform calls to arbitrary nesting depth, suspend at any point, and fork additional threads. Threads are scheduled independently and are non-preemptive. Threads can be either independent or lazy.

upward thread
    A thread that terminates after its parent terminates.

yield
    The instruction that causes the thread executing it to be placed on the ready queue.

work stealing
    The act of assigning a thread that comes from the parent queue to an idle processor. Work stealing is accomplished through either continuation stealing or seed activation.

Bibliography

T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. Scheduler activations: effective kernel support for the user-level management of parallelism. ACM Transactions on Computer Systems, February.

Andrew W. Appel. Garbage collection can be faster than stack allocation. Information Processing Letters, January.

Arvind and D. E. Culler. Dataflow architectures. In Annual Reviews in Computer Science. Annual Reviews Inc., Palo Alto, CA. Reprinted in Dataflow and Reduction Architectures, S. S. Thakkar, editor, IEEE Computer Society Press.

Arvind, D. E. Culler, R. A. Iannucci, V. Kathail, K. Pingali, and R. E. Thomas. The Tagged Token Dataflow Architecture. Technical Report FLA memo, MIT Lab for Comp. Sci., Cambridge, MA, August.

Arvind, S. K. Heller, and R. S. Nikhil. Programming Generality and Parallel Computers. In Proc. of the Fourth Int. Symp. on Biological and Artificial Intelligence Systems, Trento, Italy, September. ESCOM, Leiden.

Arvind, R. S. Nikhil, and K. K. Pingali. I-Structures: Data Structures for Parallel Computing. ACM Transactions on Programming Languages and Systems, October.

R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, P. Lisiecki, K. H. Randall, A. Shaw, and Y. Zhou. Cilk reference manual. MIT Lab for Comp. Sci., Cambridge, MA, September.


R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: an efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, August.

R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the Thirty-Fifth Annual Symposium on Foundations of Computer Science (FOCS), Santa Fe, NM, November.

Daniel G. Bobrow and Ben Wegbreit. A model and stack implementation of multiple environments. Communications of the ACM.

N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W. Su. Myrinet: a gigabit-per-second local-area network. IEEE Micro, February.

M. C. Carlisle, A. Rogers, J. H. Reppy, and L. J. Hendren. Early experiences with Olden (parallel programming). In Languages and Compilers for Parallel Computing, International Workshop Proceedings. Springer-Verlag.

K. M. Chandy and C. Kesselman. Compositional C++: compositional parallel programming. In Languages and Compilers for Parallel Computing, International Workshop Proceedings. Springer-Verlag.

E. C. Cooper and R. P. Draves. C Threads. Technical Report CMU-CS, Carnegie Mellon University, February.

D. Culler, A. Dusseau, S. Goldstein, S. Lumetta, T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In Proceedings of Supercomputing, November.

D. E. Culler, S. C. Goldstein, K. E. Schauser, and T. von Eicken. TAM: a compiler-controlled threaded abstract machine. Journal of Parallel and Distributed Computing, July.

D. E. Culler, K. E. Schauser, and T. von Eicken. Two Fundamental Limits on Dataflow Multiprocessing. In Proceedings of the IFIP WG Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism, Orlando, FL. North-Holland, January.


David E. Culler, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Brent Chun, Steven Lumetta, Alan Mainwaring, Richard Martin, Chad Yoshikawa, and Frederick Wong. Parallel computing on the Berkeley NOW. To appear in Ninth Joint Symposium on Parallel Processing, Kobe, Japan.

D. R. Engler, D. K. Lowenthal, and G. R. Andrews. Shared Filaments: efficient fine-grain parallelism on shared-memory multiprocessors. Technical Report, University of Arizona, April.

J. E. Faust and H. M. Levy. The performance of an object-oriented threads package. SIGPLAN Notices, October.

Message Passing Interface Forum. MPI: a message-passing interface standard. International Journal of Supercomputer Applications and High Performance Computing, Fall/Winter.

I. Foster, C. Kesselman, and S. Tuecke. The Nexus approach to integrating multithreading and communication. Journal of Parallel and Distributed Computing, August.

V. W. Freeh, D. K. Lowenthal, and G. R. Andrews. Distributed Filaments: efficient fine-grain parallelism on a cluster of workstations. In Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Assoc.

S. C. Goldstein, K. E. Schauser, and D. E. Culler. Lazy Threads: Implementing a fast parallel call. Journal of Parallel and Distributed Computing, August.

Seth Copen Goldstein. The implementation of a threaded abstract machine. Master's thesis, Computer Science Division, University of California, Berkeley, CA, May. Technical Report UCB.

James Gosling, Bill Joy, and Guy Steele. The Java Language Specification. Sun Microsystems, August.

D. Grunwald, B. Calder, S. Vajracharya, and H. Srinivasan. Heaps o' Stacks: combined heap-based activation allocation for parallel programs. URL http://www.cs.colorado.edu/~grunwald.


M. Haines, D. Cronk, and P. Mehrotra. On the design of Chant: a talking threads package. In Proceedings Supercomputing. IEEE Computer Society Press.

E. A. Hauck and B. A. Dent. Burroughs' B6500/B7500 stack mechanism. In Proceedings of the AFIPS Spring Joint Computer Conference.

R. Hieb, R. Kent Dybvig, and C. Bruggeman. Representing control in the presence of first-class continuations. SIGPLAN Notices, June.

Guido Hogen and Rita Loogen. A New Stack Technique for the Management of Runtime Structures in Distributed Implementations. Aachener Informatik-Berichte, RWTH Aachen, Aachen, Germany.

U. Hölzle, C. Chambers, and D. Ungar. Debugging optimized code with dynamic deoptimization. SIGPLAN Notices, July.

IEEE. IEEE Standard for Threads Interface to POSIX. IEEE draft standard edition.

H. F. Jordan. Performance measurement on HEP: a pipelined MIMD computer. In Proc. of the Annual Int. Symp. on Comp. Arch., Stockholm, Sweden, June.

L. V. Kale and S. Krishnan. CHARM++: a portable concurrent object-oriented system based on C++. SIGPLAN Notices, October.

V. Karamcheti and A. Chien. Concert: efficient runtime support for concurrent object-oriented programming languages on stock hardware. In Proceedings Supercomputing. IEEE Computer Society Press, November.

David Keppel. Tools and techniques for building fast portable threads packages. Technical Report UW-CSE, Department of Computer Science and Engineering, University of Washington, May.

C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr., and M. Zosel. The High Performance Fortran Handbook. The MIT Press.

D. A. Kranz, R. H. Halstead, Jr., and E. Mohr. Mul-T: a high-performance parallel Lisp. SIGPLAN Notices, July.


R. P. Martin, A. M. Vahdat, D. E. Culler, and T. E. Anderson. Effects of communication latency, overhead, and bandwidth in a cluster architecture. In Proc. of the Intl. Symposium on Computer Architecture, June.

H. Massalin and C. Pu. Threads and input/output in the Synthesis kernel. In Twelfth ACM Symposium on Operating Systems Principles, December.

Dylan McNamee. Newthreads. http://www.cs.washington.edu/research/compiler/papers.d/nt.html.

P. Mehrotra and M. Haines. An overview of the Opus language and runtime system. In Languages and Compilers for Parallel Computing, International Workshop Proceedings. Springer-Verlag.

E. Mohr, D. A. Kranz, and R. H. Halstead, Jr. Lazy task creation: a technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, July.

R. S. Nikhil. Id version reference manual. Technical Report CSG Memo, MIT Lab for Comp. Sci., March.

R. S. Nikhil. Id Version Reference Manual. Technical Report CSG Memo (to appear), MIT Lab for Comp. Sci., Cambridge, MA.

R. S. Nikhil. A Multithreaded Implementation of Id using P-RISC Graphs. In Proc. Sixth Ann. Workshop on Languages and Compilers for Parallel Computing, Portland, Oregon, August.

R. S. Nikhil. Cid: a parallel, shared-memory C for distributed-memory machines. In Languages and Compilers for Parallel Computing, International Workshop Proceedings. Springer-Verlag.

R. S. Nikhil. A multithreaded implementation of Id using P-RISC graphs. In Languages and Compilers for Parallel Computing, International Workshop Proceedings. Springer-Verlag.

M. D. Noakes, D. A. Wallach, and W. J. Dally. The J-Machine multicomputer: an architectural evaluation. Computer Architecture News, May.


G. M. Papadopoulos and D. E. Culler. Monsoon: an Explicit Token-Store Architecture. In Proc. of the Annual Int. Symp. on Comp. Arch., Seattle, Washington, May.

A. Rogers, M. C. Carlisle, J. H. Reppy, and L. J. Hendren. Supporting dynamic data structures on distributed-memory machines. ACM Transactions on Programming Languages and Systems, March.

A. Rogers, J. Reppy, and L. Hendren. Supporting SPMD execution for dynamic data structures. In Languages and Compilers for Parallel Computing, International Workshop Proceedings. Springer-Verlag.

M. Rozier, V. Abrossimov, F. Armand, I. Boule, M. Gien, M. Guillemont, F. Herrman, C. Kaiser, S. Langlois, P. Leonard, and W. Neuhauser. Overview of the CHORUS distributed operating system. In Proceedings of the USENIX Workshop on Micro-Kernels and Other Kernel Architectures. USENIX Assoc.

Anurag Sah. Parallel language support on shared memory multiprocessors. Master's thesis, University of California, Berkeley, May.

Sun Microsystems, Inc. Solaris Threads.

K. Taura, S. Matsuoka, and A. Yonezawa. StackThreads: an abstract machine for scheduling fine-grain threads on stock CPUs. In Theory and Practice of Parallel Programming, International Workshop TPPP Proceedings. Springer-Verlag.

Thinking Machines Corporation, Cambridge, Massachusetts. The Connection Machine CM technical summary, January.

K. R. Traub. Implementation of Non-strict Functional Programming Languages. MIT Press.

D. M. Ungar. The design and evaluation of a high performance Smalltalk system. ACM Distinguished Dissertations. MIT Press.

M. T. Vandevoorde and E. S. Roberts. WorkCrews: an abstraction for controlling parallelism. International Journal of Parallel Programming, August.


T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active Messages: a mechanism for integrated communication and computation. In Proc. of the Intl. Symposium on Computer Architecture, Gold Coast, Australia, May.

D. B. Wagner and B. G. Calder. Leapfrogging: a portable technique for implementing efficient futures. SIGPLAN Notices, July.

Tim A. Wagner, Vance Maverick, Susan L. Graham, and Michael A. Harrison. Accurate static estimators for program optimization. In ACM SIGPLAN Conference on Programming Language Design and Implementation, Orlando, Florida.

A. Yonezawa. ABCL: an object-oriented concurrent system. MIT Press Series in Computer Systems. MIT Press.