Scheduling Multithreaded Computations

by

Robert D. Blumofe

The University of Texas at Austin

Charles E. Leiserson

MIT Laboratory for Computer Science

Abstract

This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is work stealing, in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies.

Specifically, our analysis shows that the expected time to execute a fully strict computation on $P$ processors using our work-stealing scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. Moreover, the space required by the execution is at most $S_1 P$, where $S_1$ is the minimum serial space requirement. We also show that the expected total communication of the algorithm is at most $O(P T_\infty (1 + n_d) S_{\max})$, where $S_{\max}$ is the size of the largest activation record of any thread and $n_d$ is the maximum number of times that any thread synchronizes with its parent. This communication bound justifies the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.

1 Introduction

For efficient execution of a dynamically growing multithreaded computation on a MIMD-style parallel computer, a scheduling algorithm must ensure that enough threads are active concurrently to keep the processors busy. Simultaneously, it should ensure that the number of concurrently active threads remains within reasonable limits so that memory requirements are not unduly large. Moreover, the scheduler should also try to maintain related threads on the same processor, if possible, so that communication between them can be minimized. Needless to say, achieving all these goals simultaneously can be difficult.

This research was supported in part by the Advanced Research Projects Agency under Contract N. This research was done while Robert D. Blumofe was at the MIT Laboratory for Computer Science and was supported in part by an ARPA High-Performance Computing Graduate Fellowship.

Two scheduling paradigms have arisen to address the problem of scheduling multithreaded computations: work sharing and work stealing. In work sharing, whenever a processor generates new threads, the scheduler attempts to migrate some of them to other processors in hopes of distributing the work to underutilized processors. In work stealing, however, underutilized processors take the initiative: they attempt to steal threads from other processors. Intuitively, the migration of threads occurs less frequently with work stealing than with work sharing, since when all processors have work to do, no threads are migrated by a work-stealing scheduler, but threads are always migrated by a work-sharing scheduler.

The work-stealing idea dates back at least as far as Burton and Sleep's research on parallel execution of functional programs and Halstead's implementation of Multilisp. These authors point out the heuristic benefits of work stealing with regards to space and communication. Since then, many researchers have implemented variants on this strategy. Rudolph, Slivkin-Allalouf, and Upfal analyzed a randomized work-stealing strategy for load balancing independent jobs on a parallel computer, and Karp and Zhang analyzed a randomized work-stealing strategy for parallel backtrack search. Recently, Zhang and Ortynski have obtained good bounds on the communication requirements of this algorithm.

In this paper, we present and analyze a work-stealing algorithm for scheduling fully strict (well-structured) multithreaded computations. This class of computations encompasses both backtrack search computations and divide-and-conquer computations, as well as dataflow computations in which threads may stall due to a data dependency. We analyze our algorithms in a stringent atomic-access model, similar to atomic message-passing models, in which concurrent accesses to the same data structure are serially queued by an adversary.

Our main contribution is a randomized work-stealing scheduling algorithm for fully strict multithreaded computations which is provably efficient in terms of time, space, and communication. We prove that the expected time to execute a fully strict computation on $P$ processors using our work-stealing scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. In addition, the space required by the execution is at most $S_1 P$, where $S_1$ is the minimum serial space requirement. These bounds are better than previous bounds for work-sharing schedulers, and the work-stealing scheduler is much simpler and eminently practical. Part of this improvement is due to our focusing on fully strict computations, as compared to the general strict computations studied in earlier work. We also prove that the expected total communication of the execution is at most $O(P T_\infty (1 + n_d) S_{\max})$, where $S_{\max}$ is the size of the largest activation record of any thread and $n_d$ is the maximum number of times that any thread synchronizes with its parent. This bound is existentially tight to within a constant factor, meeting the lower bound of Wu and Kung for communication in parallel divide-and-conquer. In contrast, work-sharing schedulers have nearly worst-case behavior for communication. Thus, our results bolster the folk wisdom that work stealing is superior to work sharing.

Others have studied and continue to study the problem of efficiently managing the space requirements of parallel computations. Culler and Arvind and Ruggiero and Sargeant give heuristics for limiting the space required by dataflow programs. Burton shows how to limit space in certain parallel computations without causing deadlock. More recently, Burton has developed and analyzed a scheduling algorithm with provably good time and space bounds. Blelloch, Gibbons, Matias, and Narlikar have also recently developed and analyzed scheduling algorithms with provably good time and space bounds. It is not yet clear whether any of these algorithms are as practical as work stealing.

The remainder of this paper is organized as follows. In Section 2, we review the graph-theoretic model of multithreaded computations introduced in earlier work, which provides a theoretical basis for analyzing schedulers. Section 3 gives a simple scheduling algorithm which uses a central queue. This busy-leaves algorithm forms the basis for our randomized work-stealing algorithm, which we present in Section 4. In Section 5, we introduce the atomic-access model that we use to analyze execution time and communication costs for the work-stealing algorithm, and we present and analyze a combinatorial balls and bins game that we use to derive a bound on the contention that arises in random work stealing. We then use this bound along with a delay-sequence argument in Section 6 to analyze the execution time and communication cost of the work-stealing algorithm. To conclude, in Section 7 we briefly discuss how the theoretical ideas in this paper have been applied to the Cilk programming language and runtime system, as well as make some concluding remarks.

2 A model of multithreaded computation

This section reprises the graph-theoretic model of multithreaded computation introduced in earlier work. We also define what it means for computations to be fully strict. We conclude with a statement of the greedy-scheduling theorem, which is an adaptation of theorems by Brent and Graham on dag scheduling.

A multithreaded computation is composed of a set of threads, each of which is a sequential ordering of unit-time instructions. The instructions are connected by dependency edges, which provide a partial ordering on which instructions must execute before which other instructions. In Figure 1, for example, each shaded block is a thread with circles representing instructions and the horizontal edges, called continue edges, representing the sequential ordering. Thread $\Gamma_5$ of this example contains 3 instructions: $v_{10}$, $v_{11}$, and $v_{12}$. The instructions of a thread must execute in this sequential order from the first (leftmost) instruction to the last (rightmost) instruction. In order to execute a thread, we allocate for it a chunk of memory, called an activation frame, that the instructions of the thread can use to store the values on which they compute.

A $P$-processor execution schedule for a multithreaded computation determines which processors of a $P$-processor parallel computer execute which instructions at each step. An execution schedule depends on the particular multithreaded computation and the number $P$ of processors. In any given step of an execution schedule, each processor executes at most one instruction.

During the course of its execution, a thread may create, or spawn, other threads. Spawning a thread is like a subroutine call, except that the spawning thread can operate concurrently with the spawned thread. We consider spawned threads to be children of the thread that did the spawning, and a thread may spawn as many children as it desires.

[Figure 1: A multithreaded computation. This computation contains 23 instructions $v_1, v_2, \ldots, v_{23}$ and 6 threads $\Gamma_1, \Gamma_2, \ldots, \Gamma_6$. In the figure, the root thread $\Gamma_1$ comprises $v_1, v_2, v_{16}, v_{17}, v_{21}, v_{22}, v_{23}$; thread $\Gamma_2$ comprises $v_3, v_6, v_9, v_{13}, v_{14}, v_{15}$; thread $\Gamma_6$ comprises $v_{18}, v_{19}, v_{20}$; and threads $\Gamma_3$, $\Gamma_4$, and $\Gamma_5$ comprise $v_4$-$v_5$, $v_7$-$v_8$, and $v_{10}$-$v_{12}$, respectively.]

In this way,

threads are organized into a spawn tree as indicated in Figure 1 by the downward-pointing, shaded dependency edges, called spawn edges, that connect threads to their spawned children. The spawn tree is the parallel analog of a call tree. In our example computation, the spawn tree's root thread $\Gamma_1$ has two children, $\Gamma_2$ and $\Gamma_6$, and thread $\Gamma_2$ has three children, $\Gamma_3$, $\Gamma_4$, and $\Gamma_5$. Threads $\Gamma_3$, $\Gamma_4$, $\Gamma_5$, and $\Gamma_6$, which have no children, are leaf threads.

Each spawn edge goes from a specific instruction (the instruction that actually does the spawn operation) in the parent thread to the first instruction of the child thread. An execution schedule must obey this edge in that no processor may execute an instruction in a spawned child thread until after the spawning instruction in the parent thread has been executed. In our example computation (Figure 1), due to the spawn edge $(v_6, v_7)$, instruction $v_7$ cannot be executed until after the spawning instruction $v_6$. Consistent with our unit-time model of instructions, a single instruction may spawn at most one child. When the spawning instruction executes, it allocates an activation frame for the new child thread. Once a thread has been spawned and its frame has been allocated, we say the thread is alive or living. When the last instruction of a thread executes, it deallocates its frame and the thread dies.

An execution schedule generally respects other dependencies besides those represented by continue and spawn edges. Consider an instruction that produces a data value to be consumed by another instruction. Such a producer-consumer relationship precludes the consuming instruction from executing until after the producing instruction. To enforce such orderings, other dependency edges, called join edges, may be required, as shown in Figure 1 by the curved edges. If the execution of a thread arrives at a consuming instruction before the producing instruction has executed, execution of the consuming thread cannot continue; the thread stalls. Once the producing instruction executes, the join dependency is resolved, which enables the consuming thread to resume its execution; the thread becomes ready. A multithreaded computation does not model the means by which join dependencies get resolved or by which unresolved join dependencies get detected. In an implementation, resolution and detection can be accomplished using mechanisms such as join counters, futures, or I-structures.

We make two technical assumptions regarding join edges. We first assume that each instruction has at most a constant number of join edges incident on it. This assumption is consistent with our unit-time model of instructions. The second assumption is that no join edges enter the instruction immediately following a spawn. This assumption means that when a parent thread spawns a child thread, the parent cannot immediately stall; it continues to be ready to execute for at least one more instruction.

An execution schedule must obey the constraints given by the spawn, continue, and join edges of the computation. These dependency edges form a directed graph of instructions, and no processor may execute an instruction until after all of the instruction's predecessors in this graph have been executed. So that execution schedules exist, this graph must be acyclic; that is, it must be a directed acyclic graph, or dag. At any given step of an execution schedule, an instruction is ready if all of its predecessors in the dag have been executed.

We make the simplifying assumption that a parent thread remains alive until all its children die, and thus, a thread does not deallocate its activation frame until all its children's frames have been deallocated. Although this assumption is not absolutely necessary, it gives the execution a natural structure, and it will simplify our analyses of space utilization. In accounting for space utilization, we also assume that the frames hold all the values used by the computation; there is no global storage available to the computation outside the frames (or if such storage is available, then we do not account for it). Therefore, the space used at a given time in executing a computation is the total size of all frames used by all living threads at that time, and the total space used in executing a computation is the maximum such value over the course of the execution.

To summarize, a multithreaded computation can be viewed as a dag of instructions connected by dependency edges. The instructions are connected by continue edges into threads, and the threads form a spawn tree with the spawn edges. When a thread is spawned, an activation frame is allocated, and this frame remains allocated as long as the thread remains alive. A living thread may be either ready or stalled due to an unresolved dependency.

A given multithreaded program, when run on a given input, can sometimes generate more than one multithreaded computation. In that case, we say the program is nondeterministic. If the same multithreaded computation is generated by the program on the input no matter how the computation is scheduled, then the program is deterministic. In this paper, we shall analyze multithreaded computations, not multithreaded programs. Specifically, we shall not worry about how the multithreaded computation is generated. Instead, we shall study its properties in an a posteriori fashion.

Because multithreaded computations with arbitrary dependencies can be impossible to schedule efficiently, we study subclasses of general multithreaded computations in which the kinds of synchronizations that can occur are restricted. A strict multithreaded computation is one in which all join edges from a thread go to an ancestor of the thread in the activation tree. In a strict computation, the only edge into a subtree (emanating from outside the subtree) is the spawn edge that spawns the subtree's root thread. For example, the computation of Figure 1 is strict, and the only edge into the subtree rooted at $\Gamma_2$ is the spawn edge $(v_2, v_3)$. Thus, strictness means that a thread cannot be invoked before all of its arguments are available, although the arguments can be garnered in parallel. A fully strict computation is one in which all join edges from a thread go to the thread's parent. A fully strict computation is, in a sense, a well-structured computation, in that all join edges from a subtree (of the spawn tree) emanate from the subtree's root. The example computation of Figure 1 is fully strict. Any multithreaded computation that can be executed in a depth-first manner on a single processor can be made either strict or fully strict by altering the dependency structure, possibly affecting the achievable parallelism, but not affecting the semantics of the computation.

We quantify and bound the execution time of a computation on a $P$-processor parallel computer in terms of the computation's work and critical-path length. We define the work of the computation to be the total number of instructions and the critical-path length to be the length of a longest directed path in the dag. Our example computation (Figure 1) has work 23 and critical-path length 10. For a given computation, let $T(X)$ denote the time to execute the computation using $P$-processor execution schedule $X$, and let

$$T_P = \min_X T(X)$$

denote the minimum execution time with $P$ processors, the minimum being taken over all $P$-processor execution schedules for the computation. Then $T_1$ is the work of the computation, since a 1-processor computer can only execute one instruction at each step, and $T_\infty$ is the critical-path length, since even with arbitrarily many processors, each instruction on a path must execute serially. Notice that we must have $T_P \geq T_1/P$, because $P$ processors can execute only $P$ instructions per time step, and of course, we must have $T_P \geq T_\infty$.
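To make these definitions concrete, the following short Python sketch (ours, not the paper's) computes the work and critical-path length of a dag given as an adjacency list; the five-instruction example dag is a hypothetical encoding.

    # A minimal sketch (not from the paper): compute the work T_1 and the
    # critical-path length T_inf of a computation dag. The dag maps each
    # unit-time instruction to the list of its successors.
    def work_and_critical_path(dag):
        work = len(dag)                     # T_1: total number of instructions
        memo = {}
        def longest_from(u):                # longest path, counted in
            if u not in memo:               # instructions, starting at u
                memo[u] = 1 + max((longest_from(v) for v in dag[u]), default=0)
            return memo[u]
        critical_path = max(longest_from(u) for u in dag)   # T_inf
        return work, critical_path

    # Hypothetical dag: v1 spawns a child c1, and c2 joins back to v3.
    example = {'v1': ['v2', 'c1'], 'v2': ['v3'], 'c1': ['c2'],
               'c2': ['v3'], 'v3': []}
    print(work_and_critical_path(example))  # prints (5, 4)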

Early work on dag scheduling by Brent and Graham shows that there exist $P$-processor execution schedules $X$ with $T(X) \leq T_1/P + T_\infty$. As the sum of two lower bounds, this upper bound is universally optimal to within a factor of 2. The following theorem, proved in earlier work, extends these results minimally to show that this upper bound on $T_P$ can be obtained by greedy schedules: those in which at each step of the execution, if at least $P$ instructions are ready, then $P$ instructions execute, and if fewer than $P$ instructions are ready, then all execute.

Theorem (The greedy-scheduling theorem): For any multithreaded computation with work $T_1$ and critical-path length $T_\infty$, and for any number $P$ of processors, any greedy $P$-processor execution schedule $X$ achieves $T(X) \leq T_1/P + T_\infty$.

Generally, we are interested in schedules that achieve linear speedup, that is, $T(X) = O(T_1/P)$. For a greedy schedule, linear speedup occurs when the parallelism, which we define to be $T_1/T_\infty$, satisfies $T_1/T_\infty \geq P$.
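For instance, applying the greedy bound to the example computation of Figure 1, which has work $T_1 = 23$ and critical-path length $T_\infty = 10$, with $P = 2$ processors gives

$$T(X) \leq \frac{T_1}{P} + T_\infty = \frac{23}{2} + 10 = 21.5,$$

which is within a factor of 2 of the lower bound $T_1/P = 11.5$. Here the parallelism is $T_1/T_\infty = 2.3 \geq P$, so a greedy schedule achieves linear speedup.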

To quantify the space used by a given execution schedule of a computation, we define the stack depth of a thread to be the sum of the sizes of the activation frames of all its ancestors, including itself. The stack depth of a multithreaded computation is the maximum stack depth of any of its threads. We shall denote by $S_1$ the minimum amount of space possible for any 1-processor execution of a multithreaded computation, which is equal to the stack depth of the computation. Let $S(X)$ denote the space used by a $P$-processor execution schedule $X$ of a multithreaded computation. We shall be interested in those execution schedules that exhibit at most linear expansion of space, that is, $S(X) = O(S_1 P)$, which is existentially optimal to within a constant factor.

3 The busy-leaves property

Once a thread $\Gamma$ has been spawned in a strict computation, a single processor can complete the execution of the entire subcomputation rooted at $\Gamma$, even if no other progress is made on other parts of the computation. In other words, from the time the thread $\Gamma$ is spawned until the time $\Gamma$ dies, there is always at least one thread from the subcomputation rooted at $\Gamma$ that is ready. In particular, no leaf thread in a strict multithreaded computation can stall. As we shall see, this property allows an execution schedule to keep the leaves busy. By combining this busy-leaves property with the greedy property, we derive execution schedules that simultaneously exhibit linear speedup and linear expansion of space.

In this section, we show that for any number $P$ of processors and any strict multithreaded computation with work $T_1$, critical-path length $T_\infty$, and stack depth $S_1$, there exists a $P$-processor execution schedule $X$ that achieves time $T(X) \leq T_1/P + T_\infty$ and space $S(X) \leq S_1 P$ simultaneously. We give a simple online $P$-processor parallel algorithm, the Busy-Leaves Algorithm, to compute such a schedule. This simple algorithm will form the basis for the randomized work-stealing algorithm presented in Section 4.

The Busy-Leaves Algorithm operates online in the following sense. Before the $t$th step, the algorithm has computed and executed the first $t - 1$ steps of the execution schedule. At the $t$th step, the algorithm uses only information from the portion of the computation revealed so far in the execution to compute and execute the $t$th step of the schedule. In particular, it does not use any information from instructions not yet executed or threads not yet spawned.

The Busy-Leaves Algorithm maintains all living threads in a single thread pool which is uniformly available to all $P$ processors. When spawns occur, new threads are added to this global pool, and when a processor needs work, it removes a ready thread from the pool. Though we describe the algorithm as a $P$-processor parallel algorithm, we shall not analyze it as such. Specifically, in computing the $t$th step of the schedule, we allow each processor to add threads to the thread pool and delete threads from it. Thus, we ignore the effects of processors contending for access to the pool. In fact, we shall only analyze properties of the schedule itself and ignore the cost incurred by the algorithm in computing the schedule. Scheduling overheads will be analyzed for the randomized work-stealing algorithm, however.

The Busy-Leaves Algorithm operates as follows. The algorithm begins with the root thread in the global thread pool and all processors idle. At the beginning of each step, each processor either is idle or has a thread to work on. Those processors that are idle begin the step by attempting to remove any ready thread from the pool. If there are sufficiently many ready threads in the pool to satisfy all of the idle processors, then every idle processor gets a ready thread to work on. Otherwise, some processors remain idle. Then, each processor that has a thread to work on executes the next instruction from that thread. In general, once a processor has a thread, call it $\Gamma_a$, to work on, it executes an instruction from $\Gamma_a$ at each step until the thread either spawns, stalls, or dies, in which case it performs according to the following rules.

➊ Spawns: If the thread $\Gamma_a$ spawns a child $\Gamma_b$, then the processor finishes the current step by returning $\Gamma_a$ to the thread pool. The processor begins the next step working on $\Gamma_b$.

➋ Stalls: If the thread $\Gamma_a$ stalls, then the processor finishes the current step by returning $\Gamma_a$ to the thread pool. The processor begins the next step idle.

➌ Dies: If the thread $\Gamma_a$ dies, then the processor finishes the current step by checking to see if $\Gamma_a$'s parent thread $\Gamma_b$ currently has any living children. If $\Gamma_b$ has no live children and no other processor is working on $\Gamma_b$, then the processor takes $\Gamma_b$ from the pool and begins the next step working on it. Otherwise, the processor begins the next step idle.

[Figure 2: A 2-processor execution schedule computed by the Busy-Leaves Algorithm for the computation of Figure 1 (table omitted). The schedule lists the living threads in the global thread pool at each step, just after each idle processor has removed a ready thread. It also lists the ready thread being worked on and the instruction executed by each of the two processors, $p_1$ and $p_2$, at each step. Living threads that are ready are listed in bold; the other living threads are stalled.]

Figure 2 illustrates these three rules in a 2-processor execution schedule computed by the Busy-Leaves Algorithm for the computation of Figure 1. Rule ➊: when a processor working on a thread executes a spawn, it places the parent back in the pool (where it may be picked up at the beginning of the next step by an idle processor) and begins the next step working on the child. Rule ➋: when a processor executes an instruction and its thread stalls, it returns the thread to the pool and begins the next step idle, remaining idle as long as the pool contains no ready threads. Rule ➌: when a processor executes the last instruction of a thread and the thread dies, the processor retrieves the thread's parent from the pool (provided the parent has no other living children and no processor is working on it) and begins the next step working on the parent.
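The three rules translate directly into code. The following Python sketch (ours; the thread objects, their fields, and the pool representation are hypothetical) performs one step of the Busy-Leaves Algorithm, ignoring contention for the pool just as the analysis does.

    # A minimal sketch of one step of the Busy-Leaves Algorithm. 'pool' is
    # the global set of living threads that no processor is working on.
    def busy_leaves_step(pool, processors):
        # Idle processors begin the step by grabbing any ready thread.
        for p in processors:
            if p.thread is None:
                ready = next((t for t in pool if t.is_ready()), None)
                if ready is not None:
                    pool.remove(ready)
                    p.thread = ready
        # Each busy processor executes one instruction and applies the rules.
        for p in processors:
            if p.thread is None:
                continue
            t = p.thread
            event, child = t.execute_next()  # 'spawn', 'stall', 'die', or 'ok'
            if event == 'spawn':             # Rule 1: park parent, run child
                pool.add(t)
                p.thread = child
            elif event == 'stall':           # Rule 2: park thread, go idle
                pool.add(t)
                p.thread = None
            elif event == 'die':             # Rule 3: resume the parent only
                parent = t.parent            # if it is a childless, unowned leaf
                if parent is not None and parent in pool \
                        and not parent.has_living_children():
                    pool.remove(parent)
                    p.thread = parent
                else:
                    p.thread = None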

Besides being greedy, for any strict computation, the schedule computed by the Busy-Leaves Algorithm maintains the busy-leaves property: at every time step during the execution, every leaf in the spawn subtree has a processor working on it. We define the spawn subtree at any time step $t$ to be the portion of the spawn tree consisting of just those threads that are alive at step $t$. To restate the busy-leaves property: at every time step, every living thread that has no living descendants has a processor working on it. We shall now prove this fact and show that it implies linear expansion of space. It is worth noting that not every multithreaded computation has a schedule that maintains the busy-leaves property, but every strict multithreaded computation does. We begin by showing that any schedule that maintains the busy-leaves property exhibits linear expansion of space.

Lemma: For any multithreaded computation with stack depth $S_1$, any $P$-processor execution schedule $X$ that maintains the busy-leaves property uses space bounded by $S(X) \leq S_1 P$.

Proof: The busy-leaves property implies that at all time steps $t$, the spawn subtree has at most $P$ leaves. For each such leaf, the space used by it and all of its ancestors is at most $S_1$, and therefore, the space in use at any time step $t$ is at most $S_1 P$.

For schedules that maintain the busy-leaves property, the upper bound $S(X) \leq S_1 P$ is conservative. By charging $S_1$ space for each busy leaf, we may be overcharging. For some computations, by knowing that the schedule preserves the busy-leaves property, we can appeal directly to the fact that the spawn subtree never has more than $P$ leaves to obtain tight bounds on space usage.

We finish this section by showing that for strict computations, the Busy-Leaves Algorithm computes a schedule that is both greedy and maintains the busy-leaves property.

Theorem: For any number $P$ of processors and any strict multithreaded computation with work $T_1$, critical-path length $T_\infty$, and stack depth $S_1$, the Busy-Leaves Algorithm computes a $P$-processor execution schedule $X$ whose execution time satisfies $T(X) \leq T_1/P + T_\infty$ and whose space satisfies $S(X) \leq S_1 P$.

Proof: The time bound follows directly from the greedy-scheduling theorem, since the Busy-Leaves Algorithm computes a greedy schedule. The space bound follows from the previous lemma if we can show that the Busy-Leaves Algorithm maintains the busy-leaves property. We prove this fact by induction on the number of steps. At the first step of the algorithm, the spawn subtree contains just the root thread, which is a leaf, and some processor is working on it. We must show that all of the algorithm rules preserve the busy-leaves property. When a processor has a thread $\Gamma_a$ to work on, it executes instructions from that thread until it either spawns, stalls, or dies. Rule ➊: If $\Gamma_a$ spawns a child $\Gamma_b$, then $\Gamma_a$ is not a leaf (even if it was before), and $\Gamma_b$ is a leaf. In this case, the processor works on $\Gamma_b$, so the new leaf is busy. Rule ➋: If $\Gamma_a$ stalls, then $\Gamma_a$ cannot be a leaf, since in a strict computation, the unresolved dependency must come from a descendant. Rule ➌: If $\Gamma_a$ dies, then its parent thread $\Gamma_b$ may turn into a leaf. In this case, the processor works on $\Gamma_b$ unless some other processor already is, so the new leaf is guaranteed to be busy.

We now know that every strict multithreaded computation has an efficient execution schedule, and we know how to find it. But these facts take us only so far. Execution schedules must be computed efficiently online, and though the Busy-Leaves Algorithm does compute efficient execution schedules and does operate online, it surely does not do so efficiently, except possibly in the case of small-scale symmetric multiprocessors. This lack of scalability is a consequence of employing a single centralized thread pool at which all processors must contend for access. In the next section, we present a distributed online scheduling algorithm, and in the following sections, we prove that it is both efficient and scalable.

4 A randomized work-stealing algorithm

In this section, we present an online, randomized work-stealing algorithm for scheduling multithreaded computations on a parallel computer. Also, we present an important structural lemma which is used at the end of this section to show that for fully strict computations, this algorithm causes at most a linear expansion of space. This lemma reappears in Section 6 to show that for fully strict computations, this algorithm achieves linear speedup and generates existentially optimal amounts of communication.

In the Work-Stealing Algorithm, the centralized thread pool of the Busy-Leaves Algorithm is distributed across the processors. Specifically, each processor maintains a ready deque data structure of threads. The ready deque has two ends: a top and a bottom. Threads can be inserted on the bottom and removed from either end. A processor treats its ready deque like a call stack, pushing and popping from the bottom. Threads that are migrated to other processors are removed from the top.

In general, a processor obtains work by removing the thread at the bottom of its ready deque. It starts working on the thread, call it $\Gamma_a$, and continues executing $\Gamma_a$'s instructions until $\Gamma_a$ spawns, stalls, dies, or enables a stalled thread, in which case it performs according to the following rules.

➊ Spawns: If the thread $\Gamma_a$ spawns a child $\Gamma_b$, then $\Gamma_a$ is placed on the bottom of the ready deque, and the processor commences work on $\Gamma_b$.

➋ Stalls: If the thread $\Gamma_a$ stalls, its processor checks the ready deque. If the deque contains any threads, then the processor removes and begins work on the bottommost thread. If the ready deque is empty, however, the processor begins work stealing: it steals the topmost thread from the ready deque of a randomly chosen processor and begins work on it. This work-stealing strategy is elaborated below.

➌ Dies: If the thread $\Gamma_a$ dies, then the processor follows rule ➋ as in the case of $\Gamma_a$ stalling.

➍ Enables: If the thread $\Gamma_a$ enables a stalled thread $\Gamma_b$, the now-ready thread $\Gamma_b$ is placed on the bottom of the ready deque of $\Gamma_a$'s processor.

A thread can simultaneously enable a stalled thread and stall or die, in which case we first perform rule ➍ for enabling, and then rule ➋ for stalling or rule ➌ for dying. Except for rule ➍ for the case when a thread enables a stalled thread, these rules are analogous to the rules of the Busy-Leaves Algorithm, and as we shall see, rule ➍ is needed to ensure that the algorithm maintains important structural properties, including the busy-leaves property.

The Work-Stealing Algorithm begins with all ready deques empty. The root thread of the multithreaded computation is placed in the ready deque of one processor, while the other processors start work stealing.

When a processor begins work stealing, it operates as follows. The processor becomes a thief and attempts to steal work from a victim processor chosen uniformly at random. The thief queries the ready deque of the victim, and if it is nonempty, the thief removes and begins work on the top thread. If the victim's ready deque is empty, however, the thief tries again, picking another victim at random.

[Figure 3: The structure of a processor's ready deque: the currently executing thread $\Gamma_0$ is at the bottom, with threads $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ above it. The black instruction in each thread indicates the thread's currently ready instruction. Only thread $\Gamma_k$ may have been worked on since it last spawned a child. The dashed edges are the deque edges introduced in Section 6.]
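In the same hypothetical Python vocabulary as the Busy-Leaves sketch above, one processor's loop looks roughly as follows; steal-request queueing, which the atomic-access model of Section 5 charges for, is not modeled.

    import random

    # A minimal sketch of one processor in the Work-Stealing Algorithm.
    # 'deques[i]' is processor i's ready deque, e.g. a collections.deque
    # with bottom = right end and top = left end.
    def work_stealing_loop(me, deques, done):
        while not done():
            if me.thread is None:                    # no work, so turn thief
                victim = random.choice(deques)       # uniformly random victim
                if victim:                           # nonempty: steal topmost
                    me.thread = victim.popleft()
                continue                             # empty: try again
            event, other = me.thread.execute_next()  # 'spawn', 'enable',
            if event == 'spawn':                     # 'stall', 'die', or 'ok'
                deques[me.id].append(me.thread)      # Rule 1: parent goes on
                me.thread = other                    # the bottom; run child
            elif event == 'enable':                  # Rule 4: re-enabled thread
                deques[me.id].append(other)          # goes on the bottom
            elif event in ('stall', 'die'):          # Rules 2 and 3: pop the
                mine = deques[me.id]                 # bottommost thread, else
                me.thread = mine.pop() if mine else None   # go steal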

We now state and prove an important lemma on the structure of threads in the ready deque of any processor during the execution of a fully strict computation. This lemma is used later in this section to analyze execution space and in Section 6 to analyze execution time and communication. Figure 3 illustrates the lemma.

Lemma: In the execution of any fully strict multithreaded computation by the Work-Stealing Algorithm, consider any processor $p$ and any given time step at which $p$ is working on a thread. Let $\Gamma_0$ be the thread that $p$ is working on, let $k$ be the number of threads in $p$'s ready deque, and let $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ denote the threads in $p$'s ready deque ordered from bottom to top, so that $\Gamma_1$ is the bottommost and $\Gamma_k$ is the topmost. If we have $k > 0$, then the threads in $p$'s ready deque satisfy the following properties:

➀ For $i = 1, 2, \ldots, k$, thread $\Gamma_i$ is the parent of $\Gamma_{i-1}$.

➁ If we have $k > 1$, then for $i = 1, 2, \ldots, k - 1$, thread $\Gamma_i$ has not been worked on since it spawned $\Gamma_{i-1}$.

Proof: The proof is a straightforward induction on execution time. Execution begins with the root thread in some processor's ready deque and all other ready deques empty, so the lemma vacuously holds at the outset. Now, consider any step of the algorithm at which processor $p$ executes an instruction from thread $\Gamma_0$. Let $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ denote the $k$ threads in $p$'s ready deque before the step, and suppose that either $k = 0$ or both properties hold. Let $\Gamma_0'$ denote the thread (if any) being worked on by $p$ after the step, and let $\Gamma_1', \Gamma_2', \ldots, \Gamma_{k'}'$ denote the $k'$ threads in $p$'s ready deque after the step. We now look at the rules of the algorithm and show that they all preserve the lemma; that is, either $k' = 0$ or both properties hold after the step.

Rule ➊: If $\Gamma_0$ spawns a child, then $p$ pushes $\Gamma_0$ onto the bottom of the ready deque and commences work on the child. Thus, $\Gamma_0'$ is the child, we have $k' = k + 1 > 0$, and for $j = 1, 2, \ldots, k'$, we have $\Gamma_j' = \Gamma_{j-1}$. See Figure 4. Now we can check both properties. Property ➀: If $k' > 1$, then for $j = 2, 3, \ldots, k'$, thread $\Gamma_j'$ is the parent of $\Gamma_{j-1}'$, since before the spawn we have $k > 0$, which means that for $i = 1, 2, \ldots, k$, thread $\Gamma_i$ is the parent of $\Gamma_{i-1}$. Moreover, $\Gamma_1' = \Gamma_0$ is obviously the parent of $\Gamma_0'$. Property ➁: If $k' > 1$, then for $j = 2, 3, \ldots, k' - 1$, thread $\Gamma_j'$ has not been worked on since it spawned $\Gamma_{j-1}'$, because before the spawn we have $k > 1$, which means that for $i = 1, 2, \ldots, k - 1$, thread $\Gamma_i$ has not been worked on since it spawned $\Gamma_{i-1}$. Finally, thread $\Gamma_1' = \Gamma_0$ has not been worked on since it spawned $\Gamma_0'$, because the spawn only just occurred.

[Figure 4: The ready deque of a processor (a) before and (b) after the thread $\Gamma_0$ that it is working on spawns a child: each $\Gamma_i$ in the deque becomes $\Gamma_{i+1}'$, and the newly spawned child becomes $\Gamma_0'$. Note that the threads $\Gamma_0$ and $\Gamma_0'$ are not actually in the deque; they are the threads being worked on before and after the spawn.]

Rules ➋ and ➌: If $\Gamma_0$ stalls or dies, then we have two cases to consider. If $k = 0$, the ready deque is empty, so the processor commences work stealing, and when the processor steals and begins work on a thread, we have $k' = 0$. If $k > 0$, the ready deque is not empty, so the processor pops the bottommost thread off the deque and commences work on it. Thus, we have $\Gamma_0' = \Gamma_1$ (the popped thread), and $k' = k - 1$, and for $j = 1, 2, \ldots, k'$, we have $\Gamma_j' = \Gamma_{j+1}$. See Figure 5. Now, if $k' > 0$, we can check both properties. Property ➀: For $j = 1, 2, \ldots, k'$, thread $\Gamma_j'$ is the parent of $\Gamma_{j-1}'$, since for $i = 1, 2, \ldots, k$, thread $\Gamma_i$ is the parent of $\Gamma_{i-1}$. Property ➁: If $k' > 1$, then for $j = 1, 2, \ldots, k' - 1$, thread $\Gamma_j'$ has not been worked on since it spawned $\Gamma_{j-1}'$, because before the stall or death we have $k > 1$, which means that for $i = 1, 2, \ldots, k - 1$, thread $\Gamma_i$ has not been worked on since it spawned $\Gamma_{i-1}$.

[Figure 5: The ready deque of a processor (a) before and (b) after the thread $\Gamma_0$ that it is working on dies: the bottommost thread $\Gamma_1$ becomes the new $\Gamma_0'$, and each remaining $\Gamma_{i+1}$ becomes $\Gamma_i'$. Note that the threads $\Gamma_0$ and $\Gamma_0'$ are not actually in the deque; they are the threads being worked on before and after the death.]

Rule ➍: If $\Gamma_0$ enables a stalled thread, then due to the fully strict condition, that previously stalled thread must be $\Gamma_0$'s parent. First, we observe that we must have $k = 0$: if we had $k > 0$, then the processor's ready deque would not be empty, and this parent thread would be bottommost in the ready deque; thus, this parent thread would be ready, and Rule ➍ would not apply. With $k = 0$, the ready deque is empty, and the processor places the parent thread on the bottom of the ready deque. We have $\Gamma_0' = \Gamma_0$ and $k' = k + 1 = 1$, with $\Gamma_1'$ denoting the newly enabled parent. We only have to check the first property. Property ➀: Thread $\Gamma_1'$ is obviously the parent of $\Gamma_0'$.

If some other processor steals a thread from processor $p$, then we must have $k > 0$, and after the steal we have $k' = k - 1$. If $k' > 0$ holds, then both properties are clearly preserved. All other actions by processor $p$, such as work stealing or executing an instruction that does not invoke any of the above rules, clearly preserve the lemma.

Before moving on, it is worth pointing out how it may happen that thread $\Gamma_k$ has been worked on since it spawned $\Gamma_{k-1}$, since Property ➁ excludes $\Gamma_k$. This situation arises when $\Gamma_k$ is stolen from processor $p$ and then stalls on its new processor. Later, $\Gamma_k$ is reenabled by $\Gamma_{k-1}$ and brought back to processor $p$'s ready deque. The key observation is that when $\Gamma_k$ is reenabled, processor $p$'s ready deque is empty and $p$ is working on $\Gamma_{k-1}$. The other threads $\Gamma_0, \Gamma_1, \ldots, \Gamma_{k-2}$ shown in Figure 3 were spawned after $\Gamma_k$ was reenabled.

We conclude this section by bounding the space used by the Work-Stealing Algorithm executing a fully strict computation.

Theorem: For any fully strict multithreaded computation with stack depth $S_1$, the Work-Stealing Algorithm run on a computer with $P$ processors uses at most $S_1 P$ space.

Proof: Like the Busy-Leaves Algorithm, the Work-Stealing Algorithm maintains the busy-leaves property: at every time step of the execution, every leaf in the current spawn subtree has a processor working on it. If we can establish this fact, then the space lemma of Section 3 completes the proof.

That the Work-Stealing Algorithm maintains the busy-leaves property is a simple consequence of the structural lemma above. At every time step, every leaf in the current spawn subtree must be ready and therefore must either have a processor working on it or be in the ready deque of some processor. But the structural lemma guarantees that no leaf thread sits in a processor's ready deque while the processor works on some other thread.

With the space bound in hand, we now turn attention to analyzing the time and communication bounds for the Work-Stealing Algorithm. Before we can proceed with this analysis, however, we must take care to define a model for coping with the contention that may arise when multiple thief processors simultaneously attempt to steal from the same victim.

5 Atomic accesses and the recycling game

This section presents the atomic-access model that we use to analyze contention during the execution of a multithreaded computation by the Work-Stealing Algorithm. We introduce a combinatorial balls and bins game, which we use to bound the total amount of delay incurred by random, asynchronous accesses in this model. We shall use the results of this section in Section 6, where we analyze the Work-Stealing Algorithm.

The atomic-access model is the machine model we use to analyze the Work-Stealing Algorithm. We assume that the machine is an asynchronous parallel computer with $P$ processors, and its memory can be either distributed or shared. Our analysis assumes that concurrent accesses to the same data structure are serially queued by an adversary, as in an atomic message-passing model. This assumption is more stringent than that in the model of Karp and Zhang: they assume that if concurrent steal requests are made to a deque in one time step, one request is satisfied and all the others are denied. In the atomic-access model, we also assume that one request is satisfied, but the others are queued by an adversary, rather than being denied. Moreover, from the collection of waiting requests for a given deque, the adversary gets to choose which is serviced and which continue to wait. The only constraint on the adversary is that if there is at least one request for a deque, then the adversary cannot choose that none be serviced.

The main result of this section is to show that if requests are made randomly by $P$ processors to $P$ deques, with each processor allowed at most one outstanding request, then the total amount of time that the processors spend waiting for their requests to be satisfied is likely to be proportional to the total number $M$ of requests, no matter which processors make the requests and no matter how the requests are distributed over time. In order to prove this result, we introduce a balls and bins game that models the effects of queueing by the adversary.

The $(P, M)$-recycling game is a combinatorial game played by the adversary, in which balls are tossed at random into bins. The parameter $P$ is the number of balls in the game, which is equal to the number of bins. The parameter $M$ is the total number of ball tosses executed by the adversary. Initially, all $P$ balls are in a reservoir separate from the $P$ bins. At each step of the game, the adversary executes the following two operations in sequence:

1. The adversary chooses some of the balls in the reservoir (possibly all and possibly none), and then for each of these balls, the adversary removes it from the reservoir, selects one of the $P$ bins uniformly and independently at random, and tosses the ball into it.

2. The adversary inspects each of the $P$ bins in turn, and for each bin that contains at least one ball, the adversary removes any one of the balls in the bin and returns it to the reservoir.

The adversary is permitted to make a total of $M$ ball tosses. The game ends when $M$ ball tosses have been made and all balls have been removed from the bins and placed back in the reservoir.
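The following simulation sketch (ours, not the paper's) plays the game against one particular adversary strategy, tossing every reservoir ball each step and servicing bins in FIFO order, and tallies the total delay defined just below.

    import random
    from collections import deque

    # A minimal sketch: play the (P, M)-recycling game with one simple
    # adversary strategy and measure the total delay, i.e., the sum over
    # steps of the number of balls left in bins after the step.
    def recycling_game(P, M, rng=None):
        rng = rng or random.Random(0)
        bins = [deque() for _ in range(P)]   # FIFO queue of balls per bin
        reservoir = list(range(P))           # initially all P balls are idle
        tosses = total_delay = 0
        while tosses < M or any(bins):
            # Operation 1: toss every reservoir ball into a random bin.
            while reservoir and tosses < M:
                bins[rng.randrange(P)].append(reservoir.pop())
                tosses += 1
            # Operation 2: every nonempty bin returns one ball.
            for b in bins:
                if b:
                    reservoir.append(b.popleft())
            total_delay += sum(len(b) for b in bins)
        return total_delay

    print(recycling_game(P=8, M=1000))  # the lemma below bounds E[delay] by M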

The recycling game models the servicing of steal requests by the Work-Stealing Algorithm. We can view each ball and each bin as being owned by a distinct processor. If a ball is in the reservoir, it means that the ball's owner is not making a steal request. If a ball is in a bin, it means that the ball's owner has made a steal request to the deque of the bin's owner, but the request has not yet been satisfied. When a ball is removed from a bin and returned to the reservoir, it means that the request has been serviced.

After each step $t$ of the game, there are some number $n_t$ of balls left in the bins, which correspond to steal requests that have not been satisfied. We shall be interested in the total delay $D = \sum_{t=1}^{T} n_t$, where $T$ is the total number of steps in the game. The goal of the adversary is to make the total delay as large as possible. The next lemma shows that despite the choices that the adversary makes about which balls to toss into bins and which to return to the reservoir, the total delay is unlikely to be large.

Lemma: For any $\varepsilon > 0$, with probability at least $1 - \varepsilon$, the total delay in the $(P, M)$-recycling game is $O(M + P \lg P + P \lg(1/\varepsilon))$. The expected total delay is at most $M$. In other words, the total delay incurred by $M$ random requests made by $P$ processors in the atomic-access model is $O(M + P \lg P + P \lg(1/\varepsilon))$ with probability at least $1 - \varepsilon$, and the expected total delay is at most $M$.

Proof: We first make the observation that the strategy by which the adversary chooses a ball from each bin is immaterial, and thus, we can assume that balls are queued in their bins in a first-in-first-out (FIFO) order. The adversary removes balls from the front of the queue, and when the adversary tosses a ball, it is placed on the back of the queue. If several balls are tossed into the same bin at the same step, they can be placed on the back of the queue in any order. The reason that assuming a FIFO discipline for queueing balls in a bin does not affect the total delay is that the number of balls in a given bin at a given step is the same no matter which ball is removed, and where balls are tossed has nothing to do with which ball is tossed.

For any given ball and any given step, the step either finishes with the ball in a bin or in the reservoir. Define the delay of ball $r$ to be the random variable $D_r$ denoting the total number of steps that finish with ball $r$ in a bin. Then we have

$$D = \sum_{r=1}^{P} D_r. \qquad (1)$$

Define the $i$th cycle of a ball to be those steps in which the ball remains in a bin from the $i$th time it is tossed until it is returned to the reservoir. Define also the $i$th delay of a ball to be the number of steps in its $i$th cycle.

We shall analyze the total delay by focusing, without loss of generality, on the delay of ball 1. If we let $m$ denote the number of times that ball 1 is tossed by the adversary, and for $i = 1, 2, \ldots, m$, we let $d_i$ be the random variable denoting the $i$th delay of ball 1, then we have $D_1 = \sum_{i=1}^{m} d_i$.

We say that the $i$th cycle of ball 1 is delayed by another ball $r$ if the $i$th toss of ball 1 places it in some bin $k$ and ball $r$ is removed from bin $k$ during the $i$th cycle of ball 1. Since the adversary follows the FIFO rule, it follows that the $i$th cycle of ball 1 can be delayed by another ball $r$ either once or not at all. Consequently, we can decompose each random variable $d_i$ into a sum $d_i = x_{i1} + x_{i2} + \cdots + x_{iP}$ of indicator random variables, where

$$x_{ir} = \begin{cases} 1 & \text{if the $i$th cycle of ball 1 is delayed by ball $r$,} \\ 0 & \text{otherwise.} \end{cases}$$

Thus, we have

$$D_1 = \sum_{i=1}^{m} \sum_{r=1}^{P} x_{ir}. \qquad (2)$$

[Footnote: Greg Plaxton of the University of Texas, Austin has improved this bound to $O(M)$ for the case when $1/\varepsilon$ is at most polynomial in $M$ and $P$.]

We now prove an important property of these indicator random variables. Consider any set $S$ of pairs $(i, r)$, each of which corresponds to the event that the $i$th cycle of ball 1 is delayed by ball $r$. For any such set $S$, we claim that

$$\Pr\Big\{ \bigwedge_{(i,r) \in S} (x_{ir} = 1) \Big\} \leq \Big(\frac{1}{P}\Big)^{|S|}. \qquad (3)$$

The crux of proving the claim is to show that

$$\Pr\Big\{ x_{ir} = 1 \,\Big|\, \bigwedge_{(i',r') \in S'} (x_{i'r'} = 1) \Big\} \leq \frac{1}{P}, \qquad (4)$$

where $S' = S - \{(i, r)\}$, whence the claim follows from Bayes's Theorem.

We can derive Inequality (4) from a careful analysis of dependencies. Because the adversary follows the FIFO rule, we have that $x_{ir} = 1$ only if, when the adversary executes the $i$th toss of ball 1, it falls into whatever bin contains ball $r$ (if any). A priori, this event happens with probability either $1/P$ or 0, and hence, with probability at most $1/P$. Conditioning on any collection of events relating which balls delay this or other cycles of ball 1 cannot increase this probability, as we now argue in two cases. In the first case, the indicator random variables $x_{i'r'}$, where $i' \neq i$, tell whether other cycles of ball 1 are delayed. This information tells nothing about where the $i$th toss of ball 1 goes. Therefore, these random variables are independent of $x_{ir}$, and thus the probability $1/P$ upper bound is not affected. In the second case, the indicator random variables $x_{ir'}$ tell whether the $i$th toss of ball 1 goes to the bin containing ball $r'$, but this information tells us nothing about whether it goes to the bin containing ball $r$, because the indicator random variables tell us nothing to relate where ball $r'$ and ball $r$ are located. Moreover, no collusion among the indicator random variables provides any more information, and thus Inequality (4) holds.

Equation (2) shows that the delay $D_1$ encountered by ball 1 throughout all of its cycles can be expressed as a sum of $mP$ indicator random variables. In order for $D_1$ to equal or exceed a given value $\Delta$, there must be some set containing $\Delta$ of these indicator random variables, each of which must be 1. For any specific such set, Inequality (3) says that the probability is at most $(1/P)^\Delta$ that all random variables in the set are 1. Since there are $\binom{mP}{\Delta} \leq (emP/\Delta)^\Delta$ such sets, where $e$ is the base of the natural logarithm, we have

$$\Pr\{D_1 \geq \Delta\} \leq \Big(\frac{emP}{\Delta}\Big)^\Delta \Big(\frac{1}{P}\Big)^\Delta = \Big(\frac{em}{\Delta}\Big)^\Delta \leq \frac{\varepsilon}{P}$$

whenever $\Delta \geq \max\{2em, \lg P + \lg(1/\varepsilon)\}$.

Although our analysis was performed for ball 1, it applies to any other ball as well. Consequently, for any given ball $r$, which is tossed $m_r$ times, the probability that its delay $D_r$ exceeds $\max\{2em_r, \lg P + \lg(1/\varepsilon)\}$ is at most $\varepsilon / P$. By Boole's inequality and Equation (1), it follows that, with probability at least $1 - \varepsilon$, the total delay $D$ is at most

$$D \leq \sum_{r=1}^{P} \max\{2em_r, \lg P + \lg(1/\varepsilon)\} = O(M + P \lg P + P \lg(1/\varepsilon)),$$

since $M = \sum_{r=1}^{P} m_r$.

The upper bound $\mathrm{E}[D] \leq M$ can be obtained as follows. Recall that each $D_r$ is the sum of $P m_r$ indicator random variables, each of which has expectation at most $1/P$. Therefore, by linearity of expectation, $\mathrm{E}[D_r] \leq m_r$. Using Equation (1) and again using linearity of expectation, we obtain $\mathrm{E}[D] \leq M$.

With this bound on the total delay incurred by $M$ random requests now in hand, we turn back to the Work-Stealing Algorithm.

6 Analysis of the work-stealing algorithm

In this section, we analyze the time and communication cost of executing a fully strict multithreaded computation with the Work-Stealing Algorithm. For any fully strict computation with work $T_1$ and critical-path length $T_\infty$, we show that the expected running time with $P$ processors, including scheduling overhead, is $T_1/P + O(T_\infty)$. Moreover, for any $\varepsilon > 0$, the execution time on $P$ processors is $T_1/P + O(T_\infty + \lg P + \lg(1/\varepsilon))$, with probability at least $1 - \varepsilon$. We also show that the expected total communication during the execution of a fully strict computation is $O(P T_\infty (1 + n_d) S_{\max})$, where $n_d$ is the maximum number of join edges from a thread to its parent and $S_{\max}$ is the largest size of any activation frame.

Unlike in the Busy-Leaves Algorithm, the ready pool in the Work-Stealing Algorithm is distributed, and so there is no contention at a centralized data structure. Nevertheless, it is still possible for contention to arise when several thieves happen to descend on the same victim simultaneously. In this case, as we have indicated in the previous section, we make the conservative assumption that an adversary serially queues the work-stealing requests. We further assume that it takes unit time for a processor to respond to a work-stealing request. This assumption can be relaxed without materially affecting the results, so that a work-stealing response takes any constant amount of time.

To analyze the running time of the Work-Stealing Algorithm executing a fully strict multithreaded computation with work $T_1$ and critical-path length $T_\infty$ on a computer with $P$ processors, we use an accounting argument. At each step of the algorithm, we collect $P$ dollars, one from each processor. At each step, each processor places its dollar in one of three buckets according to its actions at that step. If the processor executes an instruction at the step, then it places its dollar into the Work bucket. If the processor initiates a steal attempt at the step, then it places its dollar into the Steal bucket. And if the processor merely waits for a queued steal request at the step, then it places its dollar into the Wait bucket. We shall derive the running-time bound by bounding the number of dollars in each bucket at the end of the execution, summing these three bounds, and then dividing by $P$.
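In symbols, writing $T$ for the number of steps that the execution takes, the identity behind this accounting argument is

$$P \cdot T = (\text{dollars in Work}) + (\text{dollars in Steal}) + (\text{dollars in Wait}),$$

so dividing bounds on the three buckets by $P$ bounds the running time $T$.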

We first bound the total number of dollars in the Work bucket.

Lemma: The execution of a fully strict multithreaded computation with work $T_1$ by the Work-Stealing Algorithm on a computer with $P$ processors terminates with exactly $T_1$ dollars in the Work bucket.

Proof: A processor places a dollar in the Work bucket only when it executes an instruction. Thus, since there are $T_1$ instructions in the computation, the execution ends with exactly $T_1$ dollars in the Work bucket.

Bounding the total dollars in the Steal bucket requires a significantly more involved delay-sequence argument. We first introduce the notion of a round of work-steal attempts, and we must also define an augmented dag that we then use to define critical instructions. The idea is as follows. If, during the course of the execution, a large number of steals are attempted, then we can identify a sequence of instructions, the delay sequence, in the augmented dag such that each of these steal attempts was initiated while some instruction from the sequence was critical. We then show that a critical instruction is unlikely to remain critical across a modest number of steal attempts. We can then conclude that such a delay sequence is unlikely to occur, and therefore, an execution is unlikely to suffer a large number of steal attempts.

A round of steal attempts is a set of at least $3P$ but fewer than $4P$ consecutive steal attempts such that if a steal attempt that is initiated at time step $t$ occurs in a particular round, then all other steal attempts initiated at time step $t$ are also in the same round. We can partition all of the steal attempts that occur during an execution into rounds as follows. The first round contains all steal attempts initiated at time steps $1, 2, \ldots, t_1$, where $t_1$ is the earliest time such that at least $3P$ steal attempts were initiated at or before $t_1$. We say that the first round starts at time step 1 and ends at time step $t_1$. In general, if the $i$th round ends at time step $t_i$, then the $(i+1)$st round begins at time step $t_i + 1$ and ends at the earliest time step $t_{i+1}$ such that at least $3P$ steal attempts were initiated at time steps between $t_i + 1$ and $t_{i+1}$, inclusive. These steal attempts belong to round $i + 1$. By definition, each round contains at least $3P$ consecutive steal attempts. Moreover, since at most $P$ steal attempts can be initiated in a single time step, each round contains fewer than $4P$ steal attempts, and each round takes at least 3 steps.

The sequence of instructions that make up the delay sequence is defined with respect to an augmented dag obtained by modifying the original dag slightly. Let $G$ denote the original dag, that is, the dag consisting of the computation's instructions as vertices and its continue, spawn, and join edges as edges. The augmented dag $G'$ is the original dag $G$ together with some new edges, as follows. For every set of instructions $u$, $v$, and $w$ such that $(u, v)$ is a spawn edge and $(u, w)$ is a continue edge, the deque edge $(w, v)$ is placed in $G'$. These deque edges are shown dashed in Figure 3. In Section 2 we made the technical assumption that instruction $w$ has no incoming join edges, and so $G'$ is a dag. If $T_\infty$ is the length of a longest path in $G$, then the longest path in $G'$ has length at most $2T_\infty$. It is worth pointing out that $G'$ is only an analytical tool; the deque edges have no effect on the scheduling and execution of the computation by the Work-Stealing Algorithm.
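A sketch of this construction, in the same hypothetical Python vocabulary as the earlier sketches (edges are (source, target) pairs of instruction identifiers):

    # A minimal sketch (ours): form the augmented dag G' by adding a deque
    # edge (w, v) whenever some instruction u has a spawn edge (u, v) and a
    # continue edge (u, w).
    def augment_with_deque_edges(continue_edges, spawn_edges, join_edges):
        next_in_thread = dict(continue_edges)   # u -> w along continue edges
        deque_edges = [(next_in_thread[u], v)
                       for (u, v) in spawn_edges
                       if u in next_in_thread]  # guard: u might end its thread
        return continue_edges + spawn_edges + join_edges + deque_edges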

The deque edges are the key to defining critical instructions. At any time step during the execution, we say that an unexecuted instruction $v$ is critical if every instruction that precedes $v$, either directly or indirectly, in $G'$ has been executed; that is, if for every instruction $w$ such that there is a directed path from $w$ to $v$ in $G'$, instruction $w$ has been executed. A critical instruction must be ready, since $G'$ contains every edge of $G$, but a ready instruction may or may not be critical. Intuitively, the structural properties of a ready deque enumerated in the lemma of Section 4 guarantee that if a thread is deep in a ready deque, then its current instruction cannot be critical, because the predecessor of the thread's current instruction across the deque edge has not yet been executed.

We now formalize our definition of a delay sequence.

Definition: A delay sequence is a 3-tuple $(U, R, \Pi)$ satisfying the following conditions:

1. $U = (u_1, u_2, \ldots, u_L)$ is a maximal directed path in $G'$. Specifically, for $i = 1, 2, \ldots, L - 1$, the edge $(u_i, u_{i+1})$ belongs to $G'$; instruction $u_1$ has no incoming edges in $G'$ (instruction $u_1$ must be the first instruction of the root thread); and instruction $u_L$ has no outgoing edges in $G'$ (instruction $u_L$ must be the last instruction of the root thread).

2. $R$ is a positive integer number of steal-attempt rounds.

3. $\Pi = (\pi_1, \pi_1', \pi_2, \pi_2', \ldots, \pi_L, \pi_L')$ is a partition of the integer $R$, that is, $R = \sum_{i=1}^{L} (\pi_i + \pi_i')$, such that $\pi_i' \in \{0, 1\}$ for each $i = 1, 2, \ldots, L$.

The partition $\Pi$ induces a partition of a sequence of $R$ rounds as follows. The first piece of the partition corresponds to the first $\pi_1$ rounds. The second piece corresponds to the next $\pi_1'$ consecutive rounds after the first $\pi_1$ rounds. The third piece corresponds to the next $\pi_2$ consecutive rounds after the first $\pi_1 + \pi_1'$ rounds, and so on. We are interested primarily in the pieces corresponding to the $\pi_i$, not the $\pi_i'$, and so we define the $i$th group of rounds to be the $\pi_i$ consecutive rounds starting after the $r_i$th round, where $r_i = \sum_{j=1}^{i-1} (\pi_j + \pi_j')$. Because $\Pi$ is a partition of $R$ and $\pi_i' \in \{0, 1\}$ for $i = 1, 2, \ldots, L$, we have

$$\sum_{i=1}^{L} \pi_i \geq R - L.$$

We say that a given round of steal attempts occurs while instruction $v$ is critical if all of the steal attempts that comprise the round are initiated at time steps when $v$ is critical; in other words, $v$ must be critical throughout the entire round. A delay sequence $(U, R, \Pi)$ is said to occur during an execution if, for each $i = 1, 2, \ldots, L$, all $\pi_i$ rounds in the $i$th group occur while instruction $u_i$ is critical. In other words, $u_i$ must be critical throughout all $\pi_i$ rounds.

The following lemma states that if at least $R$ rounds take place during an execution, then some delay sequence $(U, R, \Pi)$ must occur. In particular, if we look at any execution in which at least $R$ rounds occur, then we can identify a path $U = (u_1, u_2, \ldots, u_L)$ in the dag $G'$ and a partition $\Pi = (\pi_1, \pi_1', \ldots, \pi_L, \pi_L')$ of the first $R$ rounds such that for each $i = 1, 2, \ldots, L$, all $\pi_i$ rounds in the $i$th group occur while $u_i$ is critical. Each $\pi_i'$ indicates whether $u_i$ is critical at the beginning of a round but gets executed before the round ends; such a round cannot be part of any group, because no instruction is critical throughout.

Lemma: Consider the execution of a fully strict multithreaded computation with critical-path length $T_\infty$ by the Work-Stealing Algorithm on a computer with $P$ processors. If at least $4PR$ steal attempts occur during the execution, then some delay sequence $(U, R, \Pi)$ must occur.

Proof. For a given execution in which at least $4PR$ steal attempts take place, we construct a delay sequence $(U, R, \Pi)$ and show that it occurs. With at least $4PR$ steal attempts, there must be at least $R$ rounds, since each round contains fewer than $4P$ steal attempts. We construct the delay sequence by first identifying a set of instructions on a directed path in $G'$ such that at every time step during the execution, one of these instructions is critical. Then we partition the first $R$ rounds according to when each round occurs relative to when each instruction on the path is critical.

To construct the path $U$, we work backwards from the last instruction of the root thread, which we denote by $v_1$. Let $v_{l_1}$ denote a (not necessarily immediate) predecessor instruction of $v_1$ in $G'$ with the latest execution time, and let $(v_{l_1}, v_{l_1 - 1}, \ldots, v_2, v_1)$ denote a directed path from $v_{l_1}$ to $v_1$ in $G'$. We extend this path back to the first instruction of the root thread by iterating this construction as follows. At the $i$th iteration we have an instruction $v_{l_i}$ and a directed path in $G'$ from $v_{l_i}$ to $v_1$. We let $v_{l_{i+1}}$ denote a predecessor of $v_{l_i}$ in $G'$ with the latest execution time, and we let $(v_{l_{i+1}}, v_{l_{i+1} - 1}, \ldots, v_{l_i + 1}, v_{l_i})$ denote a directed path from $v_{l_{i+1}}$ to $v_{l_i}$ in $G'$. We finish iterating the construction when we get to an iteration $k$ in which $v_{l_k}$ is the first instruction of the root thread. Our desired sequence is then $U = (u_1, u_2, \ldots, u_L)$, where $L = l_k$ and $u_i = v_{L-i+1}$ for $i = 1, 2, \ldots, L$. One can verify that at every time step of the execution, one of the $v_{l_i}$ is critical.

Now, to construct the partition $\Pi$, we partition the sequence of the first $R$ rounds according to when each round occurs. We would like our partition to be such that for each round among the first $R$ rounds, if the round occurs while some instruction $u_i$ is critical, then the round belongs to the $i$th group. Start with $u_1$, and let $\pi_1$ equal the number of rounds that occur while $u_1$ is critical. All of these rounds are consecutive at the beginning of the sequence, so these rounds comprise the 1st group; that is, they are the $\pi_1$ consecutive rounds starting after the $r_1 = 0$ first rounds. Next, if the round that immediately follows those first $\pi_1$ rounds begins after $u_1$ has been executed, then we set $\pi'_1 = 0$ and we go on to $u_2$. Otherwise, that round begins while $u_1$ is critical and ends after $u_1$ is executed (for otherwise it would be part of the first group), so we set $\pi'_1 = 1$ and we go on to $u_2$. For $u_2$, we let $\pi_2$ equal the number of rounds that occur while $u_2$ is critical. Note that all of these rounds are consecutive, beginning after the first $\pi_1 + \pi'_1$ rounds, so these rounds comprise the 2nd group. We continue in this fashion, letting each $\pi_i$ be the number of rounds that occur while $u_i$ is critical and letting each $\pi'_i$ be the number of rounds (either $0$ or $1$) that begin while $u_i$ is critical but do not end until after $u_i$ is executed. As an example, we may have a round that begins while $u_i$ is critical and then ends while $u_{i+2}$ is critical, and in this case we set $\pi'_i = 1$ and $\pi_{i+1} = 0$. In this example, the $(i+1)$st group is empty, so we also set $\pi'_{i+1} = 0$.

We conclude the proof by verifying that the $(U, R, \Pi)$ just constructed is a delay sequence and that it occurs. By construction, $U$ is a maximal path in $G'$. Now, considering $\Pi$, we observe that each round among the first $R$ rounds is counted exactly once, in either a $\pi_i$ or a $\pi'_i$, so $\Pi$ is indeed a partition of $R$. Moreover, for $i = 1, 2, \ldots, L$, at most one round can begin while the instruction $u_i$ is critical and end after $u_i$ is executed, so we have $\pi'_i \in \{0, 1\}$. Thus, $(U, R, \Pi)$ is a delay sequence. Finally, we observe that by construction, for $i = 1, 2, \ldots, L$, the $\pi_i$ rounds that comprise the $i$th group all occur while the instruction $u_i$ is critical. Therefore, the delay sequence $(U, R, \Pi)$ occurs.

We now establish that a critical instruction is unlikely to remain critical across a modest number of rounds. Specifically, we first show that a critical instruction must be the ready instruction of a thread that is near the top of its processor's ready deque. We then use this fact to show that after $O(1)$ rounds, a critical instruction is very likely to be executed.

Lemma. At every time step during the execution of a fully strict multithreaded computation by the Work-Stealing Algorithm, each critical instruction is the ready instruction of a thread that has at most $1$ thread above it in its processor's ready deque.

Proof. Consider any time step, and let $u$ be the critical instruction of a thread $\Gamma$. Since $u$ is critical, $\Gamma$ is ready. Hence, for some processor $p$, either $\Gamma$ is in $p$'s ready deque or $\Gamma$ is being worked on by $p$. If $\Gamma$ has more than $1$ thread above it in $p$'s ready deque, then the structural properties of ready deques enumerated earlier guarantee that each of the at least $2$ threads above $\Gamma$ in $p$'s ready deque is an ancestor of $\Gamma$. Let $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ denote $\Gamma$'s ancestor threads, where $\Gamma_1$ is the parent of $\Gamma$ and $\Gamma_k$ is the root thread. Further, for $i = 1, 2, \ldots, k$, let $u_i$ denote the instruction of thread $\Gamma_i$ that spawned thread $\Gamma_{i-1}$ (where $\Gamma_0$ denotes $\Gamma$ itself), and let $w_i$ denote $u_i$'s successor instruction in thread $\Gamma_i$. Because of the deque edges, each instruction $w_i$ is a predecessor of $u$ in $G'$, and consequently, since $u$ is critical, each instruction $w_i$ has been executed. Moreover, because each $w_i$ is the successor of the spawn instruction $u_i$ in thread $\Gamma_i$, each thread $\Gamma_i$ for $i = 1, 2, \ldots, k$ has been worked on since the time step at which it spawned thread $\Gamma_{i-1}$. But the same structural lemma guarantees that only the topmost thread in $p$'s ready deque can have this property. Thus, $\Gamma_1$ is the only thread that can possibly be above $\Gamma$ in $p$'s ready deque.

Lemma. Consider the execution of any fully strict multithreaded computation by the Work-Stealing Algorithm on a parallel computer with $P$ processors. For any instruction $v$ and any number $r \ge 2$ of steal-attempt rounds, the probability that any particular set of $r$ rounds occur while the instruction $v$ is critical is at most the probability that only $0$ or $1$ of the steal attempts initiated in the first $r - 1$ of these rounds choose $v$'s processor, which is at most $e^{-(r-1)}$.

Proof. Let $t_a$ denote the first time step at which instruction $v$ is critical, and let $p$ denote the processor in whose ready deque $v$'s thread resides at time step $t_a$. Consider any particular set of $r$ rounds, and suppose that they all occur while instruction $v$ is critical. Now, consider the steal attempts that comprise the first $r - 1$ of these rounds, of which there must be at least $3P(r-1)$. Let $t_b$ denote the time step just after the time step at which the last of these steal attempts is initiated. Note that because the last round, like every round, must take at least two (in fact, four) steps, the time step $t_b$ must occur before the time step at which instruction $v$ is executed.

We shall first show that of these $3P(r-1)$ steal attempts, each initiated while instruction $v$ is critical and at least two time steps before $v$ is executed, at most $1$ can choose processor $p$ as its target, for otherwise $v$ would be executed at or before $t_b$. Recall from the preceding lemma that instruction $v$ is the ready instruction of a thread $\Gamma$ that has at most $1$ thread above it in $p$'s ready deque as long as $v$ is critical.

If $\Gamma$ has no threads above it, then another thread cannot be placed above it until after instruction $v$ is executed, since only by processor $p$ executing instructions from $\Gamma$ can another thread be placed above $\Gamma$ in its ready deque. Consequently, if a steal attempt targeting processor $p$ is initiated at some time step $t \ge t_a$, we are guaranteed that instruction $v$ is executed no later than the time step at which that steal attempt is serviced, either by thread $\Gamma$ being stolen and executed or by $p$ executing the thread itself.

Now, suppose $\Gamma$ has one thread above it in $p$'s ready deque. In this case, if a steal attempt targeting processor $p$ is initiated at time step $t \ge t_a$, then the thread above $\Gamma$ gets stolen from $p$'s ready deque no later than the time step at which that attempt is serviced. Suppose further that another steal attempt targeting processor $p$ is initiated at time step $t'$, where $t < t' < t_b$. Then we know that a second steal will be serviced by $p$ at or before time step $t_b$. If this second steal gets thread $\Gamma$, then instruction $v$ must get executed at or before time step $t_b$, which is impossible since $v$ is executed after time step $t_b$. Consequently, this second steal must get the thread above $\Gamma$, the same thread that the first steal got. But this scenario can only occur if, in the intervening time period, that thread stalls and is subsequently reenabled by the execution of some instruction from thread $\Gamma$, in which case instruction $v$ must be executed before time step $t_b$, which is once again impossible.

Thus, we must have at least $3P(r-1)$ steal attempts, each initiated at a time step $t$ such that $t_a \le t < t_b$, at most $1$ of which targets processor $p$. The probability that either $0$ or $1$ of $3P(r-1)$ steal attempts chooses processor $p$ is

$$\left(1 - \frac{1}{P}\right)^{3P(r-1)} + \;3P(r-1)\cdot\frac{1}{P}\left(1 - \frac{1}{P}\right)^{3P(r-1)-1} \;\le\; \bigl(1 + 3(r-1)\bigr)\left(1 - \frac{1}{P}\right)^{3P(r-1)-1} \;\le\; \bigl(1 + 3(r-1)\bigr)\,e^{-3(r-1)+1/P} \;\le\; e^{-(r-1)}$$

for $r \ge 2$ and $P \ge 2$.
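A quick numeric sanity check of this bound (a sketch; the parameter values are illustrative, not from the paper):

```python
import math

def prob_at_most_one_hit(P, r):
    """Exact probability that 0 or 1 of 3P(r-1) independent steal attempts,
    each choosing a target uniformly among P processors, hit processor p."""
    n = 3 * P * (r - 1)
    zero_hits = (1 - 1 / P) ** n
    one_hit = n * (1 / P) * (1 - 1 / P) ** (n - 1)
    return zero_hits + one_hit

for P in (2, 16, 1024):
    for r in (2, 3, 8):
        assert prob_at_most_one_hit(P, r) <= math.exp(-(r - 1))
```

The worst case is large $P$, where the number of hits approaches a Poisson distribution with mean $3(r-1)$.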

We now complete the delay-sequence argument and bound the total dollars in the Steal bucket.

Lemma. Consider the execution of any fully strict multithreaded computation with critical-path length $T_\infty$ by the Work-Stealing Algorithm on a parallel computer with $P$ processors. For any $\epsilon > 0$, with probability at least $1 - \epsilon$, at most $O(P(T_\infty + \lg(1/\epsilon)))$ work-steal attempts occur. The expected number of steal attempts is $O(PT_\infty)$. In other words, with probability at least $1 - \epsilon$, the execution terminates with at most $O(P(T_\infty + \lg(1/\epsilon)))$ dollars in the Steal bucket, and the expected number of dollars in this bucket is $O(PT_\infty)$.

Proof. From the delay-sequence lemma above, we know that if at least $4PR$ steal attempts occur, then some delay sequence $(U, R, \Pi)$ must occur. Consider a particular delay sequence $(U, R, \Pi)$ having $U = (u_1, u_2, \ldots, u_L)$ and $\Pi = (\pi_1, \pi'_1, \pi_2, \pi'_2, \ldots, \pi_L, \pi'_L)$, with $L \le 2T_\infty$. We shall compute the probability that $(U, R, \Pi)$ occurs.

Such a sequence occurs if, for $i = 1, 2, \ldots, L$, each instruction $u_i$ is critical throughout all $\pi_i$ rounds in the $i$th group. From the preceding lemma, we know that the probability of the $\pi_i$ rounds in the $i$th group all occurring while a given instruction $u_i$ is critical is at most the probability that only $0$ or $1$ of the steal attempts initiated in the first $\pi_i - 1$ of these rounds choose $u_i$'s processor, which is at most $e^{-(\pi_i - 1)}$, provided $\pi_i \ge 2$. For those values of $i$ with $\pi_i < 2$, we shall use $1$ as an upper bound on this probability. Moreover, since the targets of the work-steal attempts in the $\pi_i$ rounds of the $i$th group are chosen independently from the targets chosen in other rounds, we can bound the probability of the particular delay sequence $(U, R, \Pi)$ occurring as follows:

$$\Pr\{(U, R, \Pi) \text{ occurs}\} \;\le\; \prod_{i=1}^{L} \Pr\{\text{the $\pi_i$ rounds of the $i$th group occur while $u_i$ is critical}\}$$

$$\le\; \prod_{\substack{1 \le i \le L \\ \pi_i \ge 2}} e^{-(\pi_i - 1)} \;=\; \exp\Biggl(-\sum_{\substack{1 \le i \le L \\ \pi_i \ge 2}} (\pi_i - 1)\Biggr) \;\le\; \exp\Biggl(2L - \sum_{i=1}^{L} \pi_i\Biggr) \;\le\; e^{-R + 3L},$$

where the second-to-last inequality holds because the groups with $\pi_i < 2$ contribute at most $L$ to the sum $\sum_{i=1}^{L} \pi_i$ and the subtracted $1$'s contribute at most another $L$, and where the last inequality follows from Inequality $(\ast)$.

To bound the probability of some delay sequence $(U, R, \Pi)$ occurring, we need to count the number of such delay sequences and multiply by the probability that a particular such sequence occurs. The directed path $U$ in the modified dag $G'$ starts at the first instruction of the root thread and ends at the last instruction of the root thread. If the original dag $G$ has degree $d$, then $G'$ has degree at most $d + 1$. Consistent with our unit-time assumption for instructions, we assume that the degree $d$ is a constant. Since the length of a longest path in $G'$ is at most $2T_\infty$, there are at most $(d+1)^{2T_\infty}$ ways of choosing the path $U = (u_1, u_2, \ldots, u_L)$. There are at most $\binom{R+2L}{2L} \le \binom{R+4T_\infty}{4T_\infty}$ ways to choose $\Pi$, since $\Pi$ partitions $R$ into $2L$ pieces. As we have just shown, a given delay sequence has at most an $e^{-R+3L} \le e^{-R+6T_\infty}$ chance of occurring. Multiplying these three factors together bounds the probability that any delay sequence $(U, R, \Pi)$ occurs by

$$\binom{R+4T_\infty}{4T_\infty}\,(d+1)^{2T_\infty}\,e^{-R+6T_\infty},$$

which is at most $\epsilon$ for $R = c(T_\infty \lg d + \lg(1/\epsilon))$, where $c$ is a sufficiently large constant. Thus, the probability that at least $4PR = 4Pc(T_\infty \lg d + \lg(1/\epsilon)) = O(P(T_\infty + \lg(1/\epsilon)))$ steal attempts occur is at most $\epsilon$. The expectation bound follows because the tail of the distribution decreases exponentially.
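The following sketch (illustrative parameters only, not values from the paper) searches numerically for a constant $c$ at which this failure probability drops below $\epsilon$:

```python
import math

def log_failure_bound(T, d, R):
    """log of binom(R+4T, 4T) * (d+1)^(2T) * e^(-R+6T)."""
    log_binom = (math.lgamma(R + 4 * T + 1) - math.lgamma(4 * T + 1)
                 - math.lgamma(R + 1))
    return log_binom + 2 * T * math.log(d + 1) - R + 6 * T

T, d, eps = 20, 2, 0.01
c = 1
while True:
    R = math.ceil(c * (T * math.log2(d) + math.log2(1 / eps)))
    if log_failure_bound(T, d, R) <= math.log(eps):
        break
    c += 1
print("c =", c, "R =", R)   # prints a modest constant c and the matching R
```

The exponential factor $e^{-R}$ eventually dominates the polynomial growth of the binomial coefficient, so the loop terminates.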

With bounds on the number of dollars in the Work and Steal buckets, we now state the theorem that bounds the total execution time for a fully strict multithreaded computation by the Work-Stealing Algorithm, and we complete the proof by bounding the number of dollars in the Wait bucket.

Theorem. Consider the execution of any fully strict multithreaded computation with work $T_1$ and critical-path length $T_\infty$ by the Work-Stealing Algorithm on a parallel computer with $P$ processors. The expected running time, including scheduling overhead, is $T_1/P + O(T_\infty)$. Moreover, for any $\epsilon > 0$, with probability at least $1 - \epsilon$, the execution time on $P$ processors is $T_1/P + O(T_\infty + \lg P + \lg(1/\epsilon))$.

Proof. The preceding lemmas bound the dollars in the Work and Steal buckets, so we now must bound the dollars in the Wait bucket. This bound is given by the recycling-game lemma, which bounds the total delay, that is, the total dollars in the Wait bucket, as a function of the number $M$ of steal attempts, that is, the total dollars in the Steal bucket. That lemma says that for any $\epsilon > 0$, with probability at least $1 - \epsilon$, the number of dollars in the Wait bucket is at most a constant times the number of dollars in the Steal bucket plus $O(P \lg P + P \lg(1/\epsilon))$, and the expected number of dollars in the Wait bucket is at most the number in the Steal bucket. We now add up the dollars in the three buckets and divide by $P$ to complete this proof.
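In expectation, the accounting works out as follows (a sketch of the arithmetic; the Work bucket holds exactly $T_1$ dollars, and the Steal and Wait buckets each hold $O(PT_\infty)$ dollars in expectation):

$$\mathrm{E}[T_P] \;\le\; \frac{1}{P}\Bigl(\underbrace{T_1}_{\text{Work}} \;+\; \underbrace{O(PT_\infty)}_{\text{Steal}} \;+\; \underbrace{O(PT_\infty)}_{\text{Wait}}\Bigr) \;=\; \frac{T_1}{P} + O(T_\infty).$$

The same accounting with the high-probability bounds for the Steal and Wait buckets yields the $T_1/P + O(T_\infty + \lg P + \lg(1/\epsilon))$ bound.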

The next theorem bounds the total amount of communication that a multithreaded computation executed by the Work-Stealing Algorithm performs in a distributed model. The analysis makes the assumption that at most a constant number of bytes need be communicated along a join edge to resolve the dependency.

Theorem. Consider the execution of any fully strict multithreaded computation with critical-path length $T_\infty$ by the Work-Stealing Algorithm on a parallel computer with $P$ processors. Then the total number of bytes communicated has expectation $O(PT_\infty(1 + n_d)S_{\max})$, where $n_d$ is the maximum number of join edges from a thread to its parent and $S_{\max}$ is the size in bytes of the largest activation frame in the computation. Moreover, for any $\epsilon > 0$, the probability is at least $1 - \epsilon$ that the total communication incurred is $O(P(T_\infty + \lg(1/\epsilon))(1 + n_d)S_{\max})$.

Proof. We prove the bound for the expectation; the high-probability bound is analogous. By our bucketing argument, the expected number of steal attempts is at most $O(PT_\infty)$. When a thread is stolen, the communication incurred is at most $S_{\max}$. Communication also occurs whenever a join edge enters a parent thread from one of its children and the parent has been stolen, but since each join edge accounts for at most a constant number of bytes, the communication incurred is at most $O(n_d)$ per steal. Finally, we can have communication when a child thread enables its parent and puts the parent into the child's processor's ready deque. This event can happen at most $n_d$ times for each time the parent is stolen, so the communication incurred is at most $n_d S_{\max}$ per steal. Thus, the expected total communication cost is $O(PT_\infty(1 + n_d)S_{\max})$.
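For a feel of the magnitudes (illustrative numbers, not taken from the paper), consider $P = 32$ processors, $T_\infty = 10^3$, $n_d = 2$, and $S_{\max} = 512$ bytes:

$$O\bigl(P\,T_\infty\,(1 + n_d)\,S_{\max}\bigr) \;=\; O\bigl(32 \cdot 10^3 \cdot 3 \cdot 512\bigr) \;\approx\; O(5 \times 10^7) \text{ bytes},$$

independent of the work $T_1$, which may be vastly larger.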

The communication bounds in this theorem are existentially tight, in that there exist fully strict computations that require $\Omega(PT_\infty(1 + n_d)S_{\max})$ total communication for any execution schedule that achieves linear speedup. This result follows directly from a theorem of Wu and Kung, who showed that divide-and-conquer computations, a special case of fully strict computations with $n_d = 1$, require this much communication.

[Footnote: With Plaxton's bound for the recycling-game lemma, this time bound becomes $T_1/P + O(T_\infty + \lg(1/\epsilon))$ whenever $1/\epsilon$ is at most polynomial in $M$ and $P$.]

In the case when we have $n_d = O(1)$ and the algorithm achieves linear expected speedup, that is, when $P = O(T_1/T_\infty)$, the total communication is at most $O(T_1 S_{\max})$. Moreover, if $P \ll T_1/T_\infty$, the total communication is much less than $T_1 S_{\max}$, which confirms the folk wisdom that work-stealing algorithms require much less communication than the possibly $\Omega(T_1 S_{\max})$ communication of work-sharing algorithms.

Conclusion

How practical are the methods analyzed in this paper? We have been actively engaged in building a C-based language called Cilk (pronounced "silk") for programming multithreaded computations. Cilk is derived from the PCM (Parallel Continuation Machine) system, which was itself partly inspired by the research reported here. The Cilk runtime system employs the Work-Stealing Algorithm described in this paper. Because Cilk employs a provably efficient scheduling algorithm, Cilk delivers guaranteed performance to user applications. Specifically, we have found empirically that the performance of an application written in the Cilk language can be predicted accurately using the model $T_1/P + T_\infty$.

The Cilk system currently runs on contemporary shared-memory multiprocessors, such as the Sun Enterprise, the Silicon Graphics Origin, the Intel Quad Pentium, and the DEC Alphaserver. Earlier versions of Cilk ran on the Thinking Machines CM-5 MPP, the Intel Paragon MPP, and the IBM SP-2. To date, applications written in Cilk include protein folding, graphic rendering, backtrack search, and the *Socrates chess program, which won second prize in the 1995 ICCA World Computer Chess Championship running on a 1824-node Paragon at Sandia National Laboratories. Our more recent chess program, Cilkchess, won the Dutch Open Computer Chess Tournament. A team programming in Cilk won First Prize (undefeated) in the ICFP Programming Contest sponsored by the International Conference on Functional Programming.

As part of our research, we have implemented a prototype runtime system for Cilk on networks of workstations. This runtime system, called Cilk-NOW, supports adaptive parallelism, whereby processors in a workstation environment can join a user's computation if they would otherwise be idle, and yet be available immediately to leave the computation when needed again by their owners. Cilk-NOW also supports transparent fault tolerance, meaning that the user's computation can proceed even in the face of processors crashing, and yet the programmer writes the code in a completely fault-oblivious fashion. A more recent distributed implementation for clusters of SMPs is described in Randall's thesis.

We have also investigated other topics related to Cilk, including distributed shared memory and debugging tools. Up-to-date information, papers, and software releases can be found on the World Wide Web at http://supertech.lcs.mit.edu/cilk.

For the case of shared-memory multiprocessors, we have recently generalized the time bound (but not the space or communication bounds) along two dimensions. First, we have shown that for arbitrary (not necessarily fully strict or even strict) multithreaded computations, the expected execution time is $O(T_1/P + T_\infty)$. This bound is based on a new structural lemma and an amortized analysis using a potential function. Second, we have developed a nonblocking implementation of the work-stealing algorithm, and we have analyzed its execution time for a multiprogrammed environment in which the computation executes on a set of processors that grows and shrinks over time. This growing and shrinking is controlled by an adversary. In the case where the adversary chooses not to grow or shrink the set of processors, the bound specializes to match our previous bound. The nonblocking work stealer has been implemented in the Hood user-level threads library. Up-to-date information, papers, and software releases can be found on the World Wide Web at http://www.cs.utexas.edu/users/hood.

Acknowledgments

Thanks to Bruce Maggs of Carnegie Mellon, who outlined the strategy of using a delay-sequence argument to prove the time bounds on the Work-Stealing Algorithm, which improved our previous bounds. Thanks to Greg Plaxton of the University of Texas at Austin for technical comments on our probabilistic analyses. Thanks to the anonymous referees, to Yanjun Zhang of Southern Methodist University, and to Warren Burton of Simon Fraser University for comments that improved the clarity of our paper. Thanks also to Arvind, Michael Halbherr, Chris Joerg, Bradley Kuszmaul, Keith Randall, and Yuli Zhou of MIT for helpful discussions.

References

[1] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998.

[2] Arvind, Rishiyur S. Nikhil, and Keshav K. Pingali. I-structures: Data structures for parallel computing. ACM Transactions on Programming Languages and Systems, October 1989.

[3] Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Santa Barbara, California, July 1995.

[4] Guy E. Blelloch, Phillip B. Gibbons, Yossi Matias, and Girija J. Narlikar. Space-efficient scheduling of parallelism with synchronization variables. In Proceedings of the Ninth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Newport, Rhode Island, June 1997.

[5] Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995. Also available as an MIT Laboratory for Computer Science Technical Report.

[6] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. An analysis of dag-consistent distributed shared-memory algorithms. In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Padua, Italy, June 1996.

[7] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. Dag-consistent distributed shared memory. In Proceedings of the Tenth International Parallel Processing Symposium (IPPS), Honolulu, Hawaii, April 1996.

[8] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, August 1996.

[9] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), Santa Fe, New Mexico, November 1994.

[10] Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. SIAM Journal on Computing, February 1998.

[11] Robert D. Blumofe and Philip A. Lisiecki. Adaptive and reliable parallel computing on networks of workstations. In Proceedings of the USENIX Annual Technical Conference on UNIX and Advanced Computing Systems, Anaheim, California, January 1997.

[12] Robert D. Blumofe and Dionisios Papadopoulos. The performance of work stealing in multiprogrammed environments. Technical report, Department of Computer Sciences, The University of Texas at Austin, May 1998.

[13] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, April 1974.

[14] F. Warren Burton. Storage management in virtual tree machines. IEEE Transactions on Computers, March 1988.

[15] F. Warren Burton. Guaranteeing good memory bounds for parallel programs. IEEE Transactions on Software Engineering, October 1996.

[16] F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Proceedings of the Conference on Functional Programming Languages and Computer Architecture, Portsmouth, New Hampshire, October 1981.

[17] Guang-Ien Cheng. Algorithms for data-race detection in multithreaded programs. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.

[18] Guang-Ien Cheng, Mingdong Feng, Charles E. Leiserson, Keith H. Randall, and Andrew F. Stark. Detecting data races in Cilk programs that use locks. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998.

[19] David E. Culler and Arvind. Resource requirements of dataflow programs. In Proceedings of the 15th Annual International Symposium on Computer Architecture (ISCA), Honolulu, Hawaii, May 1988. Also available as an MIT Laboratory for Computer Science Computation Structures Group Memo.

[20] Derek L. Eager, John Zahorjan, and Edward D. Lazowska. Speedup versus efficiency in parallel systems. IEEE Transactions on Computers, March 1989.

[21] R. Feldmann, P. Mysliwietz, and B. Monien. Game tree search on a massively parallel system. Advances in Computer Chess 7, 1993.

[22] Mingdong Feng and Charles E. Leiserson. Efficient detection of determinacy races in Cilk programs. In Proceedings of the Ninth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Newport, Rhode Island, June 1997.

[23] Raphael Finkel and Udi Manber. DIB: A distributed implementation of backtracking. ACM Transactions on Programming Languages and Systems, April 1987.

[24] Matteo Frigo. The weakest reasonable memory model. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, January 1998.

[25] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Montreal, Canada, June 1998.

[26] Matteo Frigo and Victor Luchangco. Computation-centric memory models. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998.

[27] R. L. Graham. Bounds for certain multiprocessing anomalies. The Bell System Technical Journal, November 1966.

[28] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, March 1969.

[29] Michael Halbherr, Yuli Zhou, and Chris F. Joerg. MIMD-style parallel programming with continuation-passing threads. In Proceedings of the 2nd International Workshop on Massive Parallelism: Hardware, Software, and Applications, Capri, Italy, September 1994.

[30] Robert H. Halstead, Jr. Implementation of Multilisp: Lisp on a multiprocessor. In Conference Record of the ACM Symposium on Lisp and Functional Programming, Austin, Texas, August 1984.

[31] Chris Joerg and Bradley C. Kuszmaul. Massively parallel chess. In Proceedings of the Third DIMACS Parallel Implementation Challenge, Rutgers University, New Jersey, October 1994.

[32] Christopher F. Joerg. The Cilk System for Parallel Multithreaded Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, January 1996.

[33] Richard M. Karp and Yanjun Zhang. Randomized parallel algorithms for backtrack search and branch-and-bound computation. Journal of the ACM, July 1993.

[34] Bradley C. Kuszmaul. Synchronized MIMD Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1994. Also available as an MIT Laboratory for Computer Science Technical Report.

[35] Philip Lisiecki. Macro-scheduling in the Cilk network of workstations environment. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1996.

[36] Pangfeng Liu, William Aiello, and Sandeep Bhatt. An atomic model for message-passing. In Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Velen, Germany, June 1993.

[37] Eric Mohr, David A. Kranz, and Robert H. Halstead, Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, July 1991.

[38] Vijay S. Pande, Christopher F. Joerg, Alexander Yu Grosberg, and Toyoichi Tanaka. Enumerations of the hamiltonian walks on a cubic sublattice. Journal of Physics A, 1994.

[39] Dionysios P. Papadopoulos. Hood: A user-level thread library for multiprogramming multiprocessors. Master's thesis, Department of Computer Sciences, The University of Texas at Austin, August 1998.

[40] C. Gregory Plaxton. Private communication.

[41] Abhiram Ranade. How to emulate shared memory. In Proceedings of the 28th Annual Symposium on Foundations of Computer Science (FOCS), Los Angeles, California, October 1987.

[42] Keith H. Randall. Cilk: Efficient Multithreaded Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.

[43] Larry Rudolph, Miriam Slivkin-Allalouf, and Eli Upfal. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Hilton Head, South Carolina, July 1991.

[44] Carlos A. Ruggiero and John Sargeant. Control of parallelism in the Manchester dataflow machine. In Functional Programming Languages and Computer Architecture, Lecture Notes in Computer Science, Springer-Verlag, 1987.

[45] Andrew F. Stark. Debugging multithreaded programs that incorporate user-level locking. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.

[46] Mark T. Vandevoorde and Eric S. Roberts. WorkCrews: An abstraction for controlling parallelism. International Journal of Parallel Programming, August 1988.

[47] I-Chen Wu and H. T. Kung. Communication complexity for parallel divide-and-conquer. In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science (FOCS), San Juan, Puerto Rico, October 1991.

[48] Y. Zhang and A. Ortynski. The efficiency of randomized parallel backtrack search. In Proceedings of the Sixth IEEE Symposium on Parallel and Distributed Processing, Dallas, Texas, October 1994.