Scheduling Multithreaded Computations

by

Robert D. Blumofe

The University of Texas at Austin

Charles E. Leiserson

MIT Laboratory for Computer Science

Abstract

This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is work stealing, in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies.

Specifically, our analysis shows that the expected time to execute a fully strict computation on $P$ processors using our work-stealing scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. Moreover, the space required by the execution is at most $S_1 P$, where $S_1$ is the minimum serial space requirement. We also show that the expected total communication of the algorithm is at most $O(P T_\infty (1 + n_d) S_{\max})$, where $S_{\max}$ is the size of the largest activation record of any thread and $n_d$ is the maximum number of times that any thread synchronizes with its parent. This communication bound justifies the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.

1 Introduction

For efficient execution of a dynamically growing multithreaded computation on a MIMD-style parallel computer, a scheduling algorithm must ensure that enough threads are active concurrently to keep the processors busy. Simultaneously, it should ensure that the number of concurrently active threads remains within reasonable limits so that memory requirements are not unduly large. Moreover, the scheduler should also try to maintain related threads on the same processor, if possible, so that communication between them can be minimized. Needless to say, achieving all these goals simultaneously can be difficult.

This research was supported in part by the Advanced Research Projects Agency under Contract N. This research was done while Robert D. Blumofe was at the MIT Laboratory for Computer Science and was supported in part by an ARPA High-Performance Computing Graduate Fellowship.

Two scheduling paradigms have arisen to address the problem of scheduling multithreaded computations: work sharing and work stealing. In work sharing, whenever a processor generates new threads, the scheduler attempts to migrate some of them to other processors in hopes of distributing the work to underutilized processors. In work stealing, however, underutilized processors take the initiative: they attempt to steal threads from other processors. Intuitively, the migration of threads occurs less frequently with work stealing than with work sharing, since when all processors have work to do, no threads are migrated by a work-stealing scheduler, but threads are always migrated by a work-sharing scheduler.

The work-stealing idea dates back at least as far as Burton and Sleep's research on parallel execution of functional programs and Halstead's implementation of Multilisp. These authors point out the heuristic benefits of work stealing with regards to space and communication. Since then, many researchers have implemented variants on this strategy. Rudolph, Slivkin-Allalouf, and Upfal analyzed a randomized work-stealing strategy for load balancing independent jobs on a parallel computer, and Karp and Zhang analyzed a randomized work-stealing strategy for parallel backtrack search. Recently, Zhang and Ortynski have obtained good bounds on the communication requirements of this algorithm.

In this paper, we present and analyze a work-stealing algorithm for scheduling fully strict (well-structured) multithreaded computations. This class of computations encompasses both backtrack search computations and divide-and-conquer computations, as well as dataflow computations in which threads may stall due to a data dependency. We analyze our algorithms in a stringent atomic-access model, similar to atomic message-passing models, in which concurrent accesses to the same data structure are serially queued by an adversary.

Our main contribution is a randomized work-stealing scheduling algorithm for fully strict multithreaded computations which is provably efficient in terms of time, space, and communication. We prove that the expected time to execute a fully strict computation on $P$ processors using our work-stealing scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. In addition, the space required by the execution is at most $S_1 P$, where $S_1$ is the minimum serial space requirement. These bounds are better than previous bounds for work-sharing schedulers, and the work-stealing scheduler is much simpler and eminently practical. Part of this improvement is due to our focusing on fully strict computations, as compared to the general strict computations studied in earlier work. We also prove that the expected total communication of the execution is at most $O(P T_\infty (1 + n_d) S_{\max})$, where $S_{\max}$ is the size of the largest activation record of any thread and $n_d$ is the maximum number of times that any thread synchronizes with its parent. This bound is existentially tight to within a constant factor, meeting the lower bound of Wu and Kung for communication in parallel divide-and-conquer. In contrast, work-sharing schedulers have nearly worst-case behavior for communication. Thus, our results bolster the folk wisdom that work stealing is superior to work sharing.

Others have studied and continue to study the problem of efficiently managing the space requirements of parallel computations. Culler and Arvind and Ruggiero and Sargeant give heuristics for limiting the space required by dataflow programs. Burton shows how to limit space in certain parallel computations without causing deadlock. More recently, Burton has developed and analyzed a scheduling algorithm with provably good time and space bounds. Blelloch, Gibbons, Matias, and Narlikar have also recently developed and analyzed scheduling algorithms with provably good time and space bounds. It is not yet clear whether any of these algorithms are as practical as work stealing.

The remainder of this paper is organized as follows. In Section 2, we review the graph-theoretic model of multithreaded computations introduced in earlier work, which provides a theoretical basis for analyzing schedulers. Section 3 gives a simple scheduling algorithm which uses a central queue. This busy-leaves algorithm forms the basis for our randomized work-stealing algorithm, which we present in Section 4. In Section 5, we introduce the atomic-access model that we use to analyze execution time and communication costs for the work-stealing algorithm, and we present and analyze a combinatorial balls and bins game that we use to derive a bound on the contention that arises in random work stealing. We then use this bound along with a delay-sequence argument in Section 6 to analyze the execution time and communication cost of the work-stealing algorithm. To conclude, in Section 7 we briefly discuss how the theoretical ideas in this paper have been applied to the Cilk programming language and runtime system, as well as make some concluding remarks.

2 A model of multithreaded computation

This section reprises the graph-theoretic model of multithreaded computation introduced in earlier work. We also define what it means for computations to be fully strict. We conclude with a statement of the greedy-scheduling theorem, which is an adaptation of theorems by Brent and Graham on dag scheduling.

A multithreaded computation is composed of a set of threads, each of which is a sequential ordering of unit-time instructions. The instructions are connected by dependency edges, which provide a partial ordering on which instructions must execute before which other instructions. In Figure 1, for example, each shaded block is a thread with circles representing instructions and the horizontal edges, called continue edges, representing the sequential ordering. Thread $\Gamma_5$ of this example contains 3 instructions: $v_{10}$, $v_{11}$, and $v_{12}$. The instructions of a thread must execute in this sequential order from the first (leftmost) instruction to the last (rightmost) instruction. In order to execute a thread, we allocate for it a chunk of memory, called an activation frame, that the instructions of the thread can use to store the values on which they compute.

A $P$-processor execution schedule for a multithreaded computation determines which processors of a $P$-processor parallel computer execute which instructions at each step. An execution schedule depends on the particular multithreaded computation and the number $P$ of processors. In any given step of an execution schedule, each processor executes at most one instruction.

During the course of its execution, a thread may create, or spawn, other threads. Spawning a thread is like a subroutine call, except that the spawning thread can operate concurrently with the spawned thread. We consider spawned threads to be children of the thread that did the spawning, and a thread may spawn as many children as it desires.

[Figure 1: A multithreaded computation. This computation contains 23 instructions $v_1, v_2, \ldots, v_{23}$ and 6 threads $\Gamma_1, \Gamma_2, \ldots, \Gamma_6$. In the figure, the root thread $\Gamma_1$ comprises $v_1, v_2, v_{16}, v_{17}, v_{21}, v_{22}, v_{23}$; thread $\Gamma_2$ comprises $v_3, v_6, v_9, v_{13}, v_{14}, v_{15}$; thread $\Gamma_6$ comprises $v_{18}, v_{19}, v_{20}$; and threads $\Gamma_3$, $\Gamma_4$, and $\Gamma_5$ comprise $v_4$-$v_5$, $v_7$-$v_8$, and $v_{10}$-$v_{12}$, respectively.]

In this way,

threads are organized into a spawn tree as indicated in Figure 1 by the downward-pointing, shaded dependency edges, called spawn edges, that connect threads to their spawned children. The spawn tree is the parallel analog of a call tree. In our example computation, the spawn tree's root thread $\Gamma_1$ has two children, $\Gamma_2$ and $\Gamma_6$, and thread $\Gamma_2$ has three children, $\Gamma_3$, $\Gamma_4$, and $\Gamma_5$. Threads $\Gamma_3$, $\Gamma_4$, $\Gamma_5$, and $\Gamma_6$, which have no children, are leaf threads.

Each spawn edge goes from a specific instruction (the instruction that actually does the spawn operation) in the parent thread to the first instruction of the child thread. An execution schedule must obey this edge in that no processor may execute an instruction in a spawned child thread until after the spawning instruction in the parent thread has been executed. In our example computation (Figure 1), due to the spawn edge $(v_6, v_7)$, instruction $v_7$ cannot be executed until after the spawning instruction $v_6$. Consistent with our unit-time model of instructions, a single instruction may spawn at most one child. When the spawning instruction executes, it allocates an activation frame for the new child thread. Once a thread has been spawned and its frame has been allocated, we say the thread is alive or living. When the last instruction of a thread executes, it deallocates its frame and the thread dies.

An execution schedule generally respects other dependencies besides those represented by continue and spawn edges. Consider an instruction that produces a data value to be consumed by another instruction. Such a producer-consumer relationship precludes the consuming instruction from executing until after the producing instruction. To enforce such orderings, other dependency edges, called join edges, may be required, as shown in Figure 1 by the curved edges. If the execution of a thread arrives at a consuming instruction before the producing instruction has executed, execution of the consuming thread cannot continue; the thread stalls. Once the producing instruction executes, the join dependency is resolved, which enables the consuming thread to resume its execution; the thread becomes ready. A multithreaded computation does not model the means by which join dependencies get resolved or by which unresolved join dependencies get detected. In an implementation, resolution and detection can be accomplished using mechanisms such as join counters, futures, or I-structures.

We make two technical assumptions regarding join edges. We first assume that each instruction has at most a constant number of join edges incident on it. This assumption is consistent with our unit-time model of instructions. The second assumption is that no join edges enter the instruction immediately following a spawn. This assumption means that when a parent thread spawns a child thread, the parent cannot immediately stall; it continues to be ready to execute for at least one more instruction.

An execution schedule must obey the constraints given by the spawn, continue, and join edges of the computation. These dependency edges form a directed graph of instructions, and no processor may execute an instruction until after all of the instruction's predecessors in this graph have been executed. So that execution schedules exist, this graph must be acyclic; that is, it must be a directed acyclic graph, or dag. At any given step of an execution schedule, an instruction is ready if all of its predecessors in the dag have been executed.

We make the simplifying assumption that a parent thread remains alive until all its children die, and thus, a thread does not deallocate its activation frame until all its children's frames have been deallocated. Although this assumption is not absolutely necessary, it gives the execution a natural structure, and it will simplify our analyses of space utilization. In accounting for space utilization, we also assume that the frames hold all the values used by the computation; there is no global storage available to the computation outside the frames (or if such storage is available, then we do not account for it). Therefore, the space used at a given time in executing a computation is the total size of all frames used by all living threads at that time, and the total space used in executing a computation is the maximum such value over the course of the execution.

To summarize, a multithreaded computation can be viewed as a dag of instructions connected by dependency edges. The instructions are connected by continue edges into threads, and the threads form a spawn tree with the spawn edges. When a thread is spawned, an activation frame is allocated, and this frame remains allocated as long as the thread remains alive. A living thread may be either ready or stalled due to an unresolved dependency.

A given multithreaded program, when run on a given input, can sometimes generate more than one multithreaded computation. In that case, we say the program is nondeterministic. If the same multithreaded computation is generated by the program on the input no matter how the computation is scheduled, then the program is deterministic. In this paper, we shall analyze multithreaded computations, not multithreaded programs. Specifically, we shall not worry about how the multithreaded computation is generated. Instead, we shall study its properties in an a posteriori fashion.

Because multithreaded computations with arbitrary dependencies can be impossible to schedule efficiently, we study subclasses of general multithreaded computations in which the kinds of synchronizations that can occur are restricted. A strict multithreaded computation is one in which all join edges from a thread go to an ancestor of the thread in the activation tree. In a strict computation, the only edge into a subtree (emanating from outside the subtree) is the spawn edge that spawns the subtree's root thread. For example, the computation of Figure 1 is strict, and the only edge into the subtree rooted at $\Gamma_2$ is the spawn edge $(v_2, v_3)$. Thus, strictness means that a thread cannot be invoked before all of its arguments are available, although the arguments can be garnered in parallel. A fully strict computation is one in which all join edges from a thread go to the thread's parent. A fully strict computation is, in a sense, a well-structured computation, in that all join edges from a subtree (of the spawn tree) emanate from the subtree's root. The example computation of Figure 1 is fully strict. Any multithreaded computation that can be executed in a depth-first manner on a single processor can be made either strict or fully strict by altering the dependency structure, possibly affecting the achievable parallelism, but not affecting the semantics of the computation.

We quantify and bound the execution time of a computation on a $P$-processor parallel computer in terms of the computation's work and critical-path length. We define the work of the computation to be the total number of instructions and the critical-path length to be the length of a longest directed path in the dag. Our example computation (Figure 1) has work 23 and critical-path length 10. For a given computation, let $T(X)$ denote the time to execute the computation using $P$-processor execution schedule $X$, and let

$$T_P = \min_X T(X)$$

denote the minimum execution time with $P$ processors, the minimum being taken over all $P$-processor execution schedules for the computation. Then $T_1$ is the work of the computation, since a 1-processor computer can only execute one instruction at each step, and $T_\infty$ is the critical-path length, since even with arbitrarily many processors, each instruction on a path must execute serially. Notice that we must have $T_P \geq T_1/P$, because $P$ processors can execute only $P$ instructions per time step, and of course, we must have $T_P \geq T_\infty$.
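To make these definitions concrete, the following short Python sketch (ours, not the paper's) computes the work and critical-path length of a dag given as an adjacency list; the five-instruction example dag is a hypothetical encoding.

    # A minimal sketch (not from the paper): compute the work T_1 and the
    # critical-path length T_inf of a computation dag. The dag maps each
    # unit-time instruction to the list of its successors.
    def work_and_critical_path(dag):
        work = len(dag)                     # T_1: total number of instructions
        memo = {}
        def longest_from(u):                # longest path, counted in
            if u not in memo:               # instructions, starting at u
                memo[u] = 1 + max((longest_from(v) for v in dag[u]), default=0)
            return memo[u]
        critical_path = max(longest_from(u) for u in dag)   # T_inf
        return work, critical_path

    # Hypothetical dag: v1 spawns a child c1, and c2 joins back to v3.
    example = {'v1': ['v2', 'c1'], 'v2': ['v3'], 'c1': ['c2'],
               'c2': ['v3'], 'v3': []}
    print(work_and_critical_path(example))  # prints (5, 4)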

Early work on dag scheduling by Brent and Graham shows that there exist $P$-processor execution schedules $X$ with $T(X) \leq T_1/P + T_\infty$. As the sum of two lower bounds, this upper bound is universally optimal to within a factor of 2. The following theorem, proved in earlier work, extends these results minimally to show that this upper bound on $T_P$ can be obtained by greedy schedules: those in which at each step of the execution, if at least $P$ instructions are ready, then $P$ instructions execute, and if fewer than $P$ instructions are ready, then all execute.

Theorem (The greedy-scheduling theorem): For any multithreaded computation with work $T_1$ and critical-path length $T_\infty$, and for any number $P$ of processors, any greedy $P$-processor execution schedule $X$ achieves $T(X) \leq T_1/P + T_\infty$.

Generally, we are interested in schedules that achieve linear speedup, that is, $T(X) = O(T_1/P)$. For a greedy schedule, linear speedup occurs when the parallelism, which we define to be $T_1/T_\infty$, satisfies $T_1/T_\infty \geq P$.
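For instance, applying the greedy bound to the example computation of Figure 1, which has work $T_1 = 23$ and critical-path length $T_\infty = 10$, with $P = 2$ processors gives

$$T(X) \leq \frac{T_1}{P} + T_\infty = \frac{23}{2} + 10 = 21.5,$$

which is within a factor of 2 of the lower bound $T_1/P = 11.5$. Here the parallelism is $T_1/T_\infty = 2.3 \geq P$, so a greedy schedule achieves linear speedup.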

To quantify the space used by a given execution schedule of a computation, we define the stack depth of a thread to be the sum of the sizes of the activation frames of all its ancestors, including itself. The stack depth of a multithreaded computation is the maximum stack depth of any of its threads. We shall denote by $S_1$ the minimum amount of space possible for any 1-processor execution of a multithreaded computation, which is equal to the stack depth of the computation. Let $S(X)$ denote the space used by a $P$-processor execution schedule $X$ of a multithreaded computation. We shall be interested in those execution schedules that exhibit at most linear expansion of space, that is, $S(X) = O(S_1 P)$, which is existentially optimal to within a constant factor.

3 The busy-leaves property

Once a thread $\Gamma$ has been spawned in a strict computation, a single processor can complete the execution of the entire subcomputation rooted at $\Gamma$, even if no other progress is made on other parts of the computation. In other words, from the time the thread $\Gamma$ is spawned until the time $\Gamma$ dies, there is always at least one thread from the subcomputation rooted at $\Gamma$ that is ready. In particular, no leaf thread in a strict multithreaded computation can stall. As we shall see, this property allows an execution schedule to keep the leaves busy. By combining this busy-leaves property with the greedy property, we derive execution schedules that simultaneously exhibit linear speedup and linear expansion of space.

In this section, we show that for any number $P$ of processors and any strict multithreaded computation with work $T_1$, critical-path length $T_\infty$, and stack depth $S_1$, there exists a $P$-processor execution schedule $X$ that achieves time $T(X) \leq T_1/P + T_\infty$ and space $S(X) \leq S_1 P$ simultaneously. We give a simple online $P$-processor parallel algorithm, the Busy-Leaves Algorithm, to compute such a schedule. This simple algorithm will form the basis for the randomized work-stealing algorithm presented in Section 4.

The Busy-Leaves Algorithm operates online in the following sense. Before the $t$th step, the algorithm has computed and executed the first $t - 1$ steps of the execution schedule. At the $t$th step, the algorithm uses only information from the portion of the computation revealed so far in the execution to compute and execute the $t$th step of the schedule. In particular, it does not use any information from instructions not yet executed or threads not yet spawned.

The Busy-Leaves Algorithm maintains all living threads in a single thread pool which is uniformly available to all $P$ processors. When spawns occur, new threads are added to this global pool, and when a processor needs work, it removes a ready thread from the pool. Though we describe the algorithm as a $P$-processor parallel algorithm, we shall not analyze it as such. Specifically, in computing the $t$th step of the schedule, we allow each processor to add threads to the thread pool and delete threads from it. Thus, we ignore the effects of processors contending for access to the pool. In fact, we shall only analyze properties of the schedule itself and ignore the cost incurred by the algorithm in computing the schedule. Scheduling overheads will be analyzed for the randomized work-stealing algorithm, however.

The Busy-Leaves Algorithm operates as follows. The algorithm begins with the root thread in the global thread pool and all processors idle. At the beginning of each step, each processor either is idle or has a thread to work on. Those processors that are idle begin the step by attempting to remove any ready thread from the pool. If there are sufficiently many ready threads in the pool to satisfy all of the idle processors, then every idle processor gets a ready thread to work on. Otherwise, some processors remain idle. Then, each processor that has a thread to work on executes the next instruction from that thread. In general, once a processor has a thread, call it $\Gamma_a$, to work on, it executes an instruction from $\Gamma_a$ at each step until the thread either spawns, stalls, or dies, in which case it performs according to the following rules.

➊ Spawns: If the thread $\Gamma_a$ spawns a child $\Gamma_b$, then the processor finishes the current step by returning $\Gamma_a$ to the thread pool. The processor begins the next step working on $\Gamma_b$.

➋ Stalls: If the thread $\Gamma_a$ stalls, then the processor finishes the current step by returning $\Gamma_a$ to the thread pool. The processor begins the next step idle.

➌ Dies: If the thread $\Gamma_a$ dies, then the processor finishes the current step by checking to see if $\Gamma_a$'s parent thread $\Gamma_b$ currently has any living children. If $\Gamma_b$ has no live children and no other processor is working on $\Gamma_b$, then the processor takes $\Gamma_b$ from the pool and begins the next step working on it. Otherwise, the processor begins the next step idle.

[Figure 2: A 2-processor execution schedule computed by the Busy-Leaves Algorithm for the computation of Figure 1 (table omitted). The schedule lists the living threads in the global thread pool at each step, just after each idle processor has removed a ready thread. It also lists the ready thread being worked on and the instruction executed by each of the two processors, $p_1$ and $p_2$, at each step. Living threads that are ready are listed in bold; the other living threads are stalled.]

Figure 2 illustrates these three rules in a 2-processor execution schedule computed by the Busy-Leaves Algorithm for the computation of Figure 1. Rule ➊: when a processor working on a thread executes a spawn, it places the parent back in the pool (where it may be picked up at the beginning of the next step by an idle processor) and begins the next step working on the child. Rule ➋: when a processor executes an instruction and its thread stalls, it returns the thread to the pool and begins the next step idle, remaining idle as long as the pool contains no ready threads. Rule ➌: when a processor executes the last instruction of a thread and the thread dies, the processor retrieves the thread's parent from the pool (provided the parent has no other living children and no processor is working on it) and begins the next step working on the parent.
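The three rules translate directly into code. The following Python sketch (ours; the thread objects, their fields, and the pool representation are hypothetical) performs one step of the Busy-Leaves Algorithm, ignoring contention for the pool just as the analysis does.

    # A minimal sketch of one step of the Busy-Leaves Algorithm. 'pool' is
    # the global set of living threads that no processor is working on.
    def busy_leaves_step(pool, processors):
        # Idle processors begin the step by grabbing any ready thread.
        for p in processors:
            if p.thread is None:
                ready = next((t for t in pool if t.is_ready()), None)
                if ready is not None:
                    pool.remove(ready)
                    p.thread = ready
        # Each busy processor executes one instruction and applies the rules.
        for p in processors:
            if p.thread is None:
                continue
            t = p.thread
            event, child = t.execute_next()  # 'spawn', 'stall', 'die', or 'ok'
            if event == 'spawn':             # Rule 1: park parent, run child
                pool.add(t)
                p.thread = child
            elif event == 'stall':           # Rule 2: park thread, go idle
                pool.add(t)
                p.thread = None
            elif event == 'die':             # Rule 3: resume the parent only
                parent = t.parent            # if it is a childless, unowned leaf
                if parent is not None and parent in pool \
                        and not parent.has_living_children():
                    pool.remove(parent)
                    p.thread = parent
                else:
                    p.thread = None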

Besides being greedy, for any strict computation, the schedule computed by the Busy-Leaves Algorithm maintains the busy-leaves property: at every time step during the execution, every leaf in the spawn subtree has a processor working on it. We define the spawn subtree at any time step $t$ to be the portion of the spawn tree consisting of just those threads that are alive at step $t$. To restate the busy-leaves property: at every time step, every living thread that has no living descendants has a processor working on it. We shall now prove this fact and show that it implies linear expansion of space. It is worth noting that not every multithreaded computation has a schedule that maintains the busy-leaves property, but every strict multithreaded computation does. We begin by showing that any schedule that maintains the busy-leaves property exhibits linear expansion of space.

Lemma: For any multithreaded computation with stack depth $S_1$, any $P$-processor execution schedule $X$ that maintains the busy-leaves property uses space bounded by $S(X) \leq S_1 P$.

Proof: The busy-leaves property implies that at all time steps $t$, the spawn subtree has at most $P$ leaves. For each such leaf, the space used by it and all of its ancestors is at most $S_1$, and therefore, the space in use at any time step $t$ is at most $S_1 P$.

For schedules that maintain the busy-leaves property, the upper bound $S(X) \leq S_1 P$ is conservative. By charging $S_1$ space for each busy leaf, we may be overcharging. For some computations, by knowing that the schedule preserves the busy-leaves property, we can appeal directly to the fact that the spawn subtree never has more than $P$ leaves to obtain tight bounds on space usage.

We finish this section by showing that for strict computations, the Busy-Leaves Algorithm computes a schedule that is both greedy and maintains the busy-leaves property.

Theorem: For any number $P$ of processors and any strict multithreaded computation with work $T_1$, critical-path length $T_\infty$, and stack depth $S_1$, the Busy-Leaves Algorithm computes a $P$-processor execution schedule $X$ whose execution time satisfies $T(X) \leq T_1/P + T_\infty$ and whose space satisfies $S(X) \leq S_1 P$.

Proof: The time bound follows directly from the greedy-scheduling theorem, since the Busy-Leaves Algorithm computes a greedy schedule. The space bound follows from the previous lemma if we can show that the Busy-Leaves Algorithm maintains the busy-leaves property. We prove this fact by induction on the number of steps. At the first step of the algorithm, the spawn subtree contains just the root thread, which is a leaf, and some processor is working on it. We must show that all of the algorithm rules preserve the busy-leaves property. When a processor has a thread $\Gamma_a$ to work on, it executes instructions from that thread until it either spawns, stalls, or dies. Rule ➊: If $\Gamma_a$ spawns a child $\Gamma_b$, then $\Gamma_a$ is not a leaf (even if it was before), and $\Gamma_b$ is a leaf. In this case, the processor works on $\Gamma_b$, so the new leaf is busy. Rule ➋: If $\Gamma_a$ stalls, then $\Gamma_a$ cannot be a leaf, since in a strict computation, the unresolved dependency must come from a descendant. Rule ➌: If $\Gamma_a$ dies, then its parent thread $\Gamma_b$ may turn into a leaf. In this case, the processor works on $\Gamma_b$ unless some other processor already is, so the new leaf is guaranteed to be busy.

We now know that every strict multithreaded computation has an efficient execution schedule, and we know how to find it. But these facts take us only so far. Execution schedules must be computed efficiently online, and though the Busy-Leaves Algorithm does compute efficient execution schedules and does operate online, it surely does not do so efficiently, except possibly in the case of small-scale symmetric multiprocessors. This lack of scalability is a consequence of employing a single centralized thread pool at which all processors must contend for access. In the next section, we present a distributed online scheduling algorithm, and in the following sections, we prove that it is both efficient and scalable.

4 A randomized work-stealing algorithm

In this section, we present an online, randomized work-stealing algorithm for scheduling multithreaded computations on a parallel computer. Also, we present an important structural lemma which is used at the end of this section to show that for fully strict computations, this algorithm causes at most a linear expansion of space. This lemma reappears in Section 6 to show that for fully strict computations, this algorithm achieves linear speedup and generates existentially optimal amounts of communication.

In the Work-Stealing Algorithm, the centralized thread pool of the Busy-Leaves Algorithm is distributed across the processors. Specifically, each processor maintains a ready deque data structure of threads. The ready deque has two ends: a top and a bottom. Threads can be inserted on the bottom and removed from either end. A processor treats its ready deque like a call stack, pushing and popping from the bottom. Threads that are migrated to other processors are removed from the top.

In general, a processor obtains work by removing the thread at the bottom of its ready deque. It starts working on the thread, call it $\Gamma_a$, and continues executing $\Gamma_a$'s instructions until $\Gamma_a$ spawns, stalls, dies, or enables a stalled thread, in which case it performs according to the following rules.

➊ Spawns: If the thread $\Gamma_a$ spawns a child $\Gamma_b$, then $\Gamma_a$ is placed on the bottom of the ready deque, and the processor commences work on $\Gamma_b$.

➋ Stalls: If the thread $\Gamma_a$ stalls, its processor checks the ready deque. If the deque contains any threads, then the processor removes and begins work on the bottommost thread. If the ready deque is empty, however, the processor begins work stealing: it steals the topmost thread from the ready deque of a randomly chosen processor and begins work on it. This work-stealing strategy is elaborated below.

➌ Dies: If the thread $\Gamma_a$ dies, then the processor follows rule ➋ as in the case of $\Gamma_a$ stalling.

➍ Enables: If the thread $\Gamma_a$ enables a stalled thread $\Gamma_b$, the now-ready thread $\Gamma_b$ is placed on the bottom of the ready deque of $\Gamma_a$'s processor.

A thread can simultaneously enable a stalled thread and stall or die, in which case we first perform rule ➍ for enabling, and then rule ➋ for stalling or rule ➌ for dying. Except for rule ➍ for the case when a thread enables a stalled thread, these rules are analogous to the rules of the Busy-Leaves Algorithm, and as we shall see, rule ➍ is needed to ensure that the algorithm maintains important structural properties, including the busy-leaves property.

The Work-Stealing Algorithm begins with all ready deques empty. The root thread of the multithreaded computation is placed in the ready deque of one processor, while the other processors start work stealing.

When a processor begins work stealing, it operates as follows. The processor becomes a thief and attempts to steal work from a victim processor chosen uniformly at random. The thief queries the ready deque of the victim, and if it is nonempty, the thief removes and begins work on the top thread. If the victim's ready deque is empty, however, the thief tries again, picking another victim at random.

[Figure 3: The structure of a processor's ready deque: the currently executing thread $\Gamma_0$ is at the bottom, with threads $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ above it. The black instruction in each thread indicates the thread's currently ready instruction. Only thread $\Gamma_k$ may have been worked on since it last spawned a child. The dashed edges are the deque edges introduced in Section 6.]
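In the same hypothetical Python vocabulary as the Busy-Leaves sketch above, one processor's loop looks roughly as follows; steal-request queueing, which the atomic-access model of Section 5 charges for, is not modeled.

    import random

    # A minimal sketch of one processor in the Work-Stealing Algorithm.
    # 'deques[i]' is processor i's ready deque, e.g. a collections.deque
    # with bottom = right end and top = left end.
    def work_stealing_loop(me, deques, done):
        while not done():
            if me.thread is None:                    # no work, so turn thief
                victim = random.choice(deques)       # uniformly random victim
                if victim:                           # nonempty: steal topmost
                    me.thread = victim.popleft()
                continue                             # empty: try again
            event, other = me.thread.execute_next()  # 'spawn', 'enable',
            if event == 'spawn':                     # 'stall', 'die', or 'ok'
                deques[me.id].append(me.thread)      # Rule 1: parent goes on
                me.thread = other                    # the bottom; run child
            elif event == 'enable':                  # Rule 4: re-enabled thread
                deques[me.id].append(other)          # goes on the bottom
            elif event in ('stall', 'die'):          # Rules 2 and 3: pop the
                mine = deques[me.id]                 # bottommost thread, else
                me.thread = mine.pop() if mine else None   # go steal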

We now state and prove an important lemma on the structure of threads in the ready deque of any processor during the execution of a fully strict computation. This lemma is used later in this section to analyze execution space and in Section 6 to analyze execution time and communication. Figure 3 illustrates the lemma.

Lemma: In the execution of any fully strict multithreaded computation by the Work-Stealing Algorithm, consider any processor $p$ and any given time step at which $p$ is working on a thread. Let $\Gamma_0$ be the thread that $p$ is working on, let $k$ be the number of threads in $p$'s ready deque, and let $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ denote the threads in $p$'s ready deque ordered from bottom to top, so that $\Gamma_1$ is the bottommost and $\Gamma_k$ is the topmost. If we have $k > 0$, then the threads in $p$'s ready deque satisfy the following properties:

➀ For $i = 1, 2, \ldots, k$, thread $\Gamma_i$ is the parent of $\Gamma_{i-1}$.

➁ If we have $k > 1$, then for $i = 1, 2, \ldots, k - 1$, thread $\Gamma_i$ has not been worked on since it spawned $\Gamma_{i-1}$.

Proof: The proof is a straightforward induction on execution time. Execution begins with the root thread in some processor's ready deque and all other ready deques empty, so the lemma vacuously holds at the outset. Now, consider any step of the algorithm at which processor $p$ executes an instruction from thread $\Gamma_0$. Let $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ denote the $k$ threads in $p$'s ready deque before the step, and suppose that either $k = 0$ or both properties hold. Let $\Gamma_0'$ denote the thread (if any) being worked on by $p$ after the step, and let $\Gamma_1', \Gamma_2', \ldots, \Gamma_{k'}'$ denote the $k'$ threads in $p$'s ready deque after the step. We now look at the rules of the algorithm and show that they all preserve the lemma; that is, either $k' = 0$ or both properties hold after the step.

Rule ➊: If $\Gamma_0$ spawns a child, then $p$ pushes $\Gamma_0$ onto the bottom of the ready deque and commences work on the child. Thus, $\Gamma_0'$ is the child, we have $k' = k + 1 > 0$, and for $j = 1, 2, \ldots, k'$, we have $\Gamma_j' = \Gamma_{j-1}$. See Figure 4. Now we can check both properties. Property ➀: If $k' > 1$, then for $j = 2, 3, \ldots, k'$, thread $\Gamma_j'$ is the parent of $\Gamma_{j-1}'$, since before the spawn we have $k > 0$, which means that for $i = 1, 2, \ldots, k$, thread $\Gamma_i$ is the parent of $\Gamma_{i-1}$. Moreover, $\Gamma_1' = \Gamma_0$ is obviously the parent of $\Gamma_0'$. Property ➁: If $k' > 1$, then for $j = 2, 3, \ldots, k' - 1$, thread $\Gamma_j'$ has not been worked on since it spawned $\Gamma_{j-1}'$, because before the spawn we have $k > 1$, which means that for $i = 1, 2, \ldots, k - 1$, thread $\Gamma_i$ has not been worked on since it spawned $\Gamma_{i-1}$. Finally, thread $\Gamma_1' = \Gamma_0$ has not been worked on since it spawned $\Gamma_0'$, because the spawn only just occurred.

[Figure 4: The ready deque of a processor (a) before and (b) after the thread $\Gamma_0$ that it is working on spawns a child: each $\Gamma_i$ in the deque becomes $\Gamma_{i+1}'$, and the newly spawned child becomes $\Gamma_0'$. Note that the threads $\Gamma_0$ and $\Gamma_0'$ are not actually in the deque; they are the threads being worked on before and after the spawn.]

Rules ➋ and ➌: If $\Gamma_0$ stalls or dies, then we have two cases to consider. If $k = 0$, the ready deque is empty, so the processor commences work stealing, and when the processor steals and begins work on a thread, we have $k' = 0$. If $k > 0$, the ready deque is not empty, so the processor pops the bottommost thread off the deque and commences work on it. Thus, we have $\Gamma_0' = \Gamma_1$ (the popped thread), and $k' = k - 1$, and for $j = 1, 2, \ldots, k'$, we have $\Gamma_j' = \Gamma_{j+1}$. See Figure 5. Now, if $k' > 0$, we can check both properties. Property ➀: For $j = 1, 2, \ldots, k'$, thread $\Gamma_j'$ is the parent of $\Gamma_{j-1}'$, since for $i = 1, 2, \ldots, k$, thread $\Gamma_i$ is the parent of $\Gamma_{i-1}$. Property ➁: If $k' > 1$, then for $j = 1, 2, \ldots, k' - 1$, thread $\Gamma_j'$ has not been worked on since it spawned $\Gamma_{j-1}'$, because before the stall or death we have $k > 1$, which means that for $i = 1, 2, \ldots, k - 1$, thread $\Gamma_i$ has not been worked on since it spawned $\Gamma_{i-1}$.

[Figure 5: The ready deque of a processor (a) before and (b) after the thread $\Gamma_0$ that it is working on dies: the bottommost thread $\Gamma_1$ becomes the new $\Gamma_0'$, and each remaining $\Gamma_{i+1}$ becomes $\Gamma_i'$. Note that the threads $\Gamma_0$ and $\Gamma_0'$ are not actually in the deque; they are the threads being worked on before and after the death.]

Rule ➍: If $\Gamma_0$ enables a stalled thread, then due to the fully strict condition, that previously stalled thread must be $\Gamma_0$'s parent. First, we observe that we must have $k = 0$: if we had $k > 0$, then the processor's ready deque would not be empty, and this parent thread would be bottommost in the ready deque; thus, this parent thread would be ready, and Rule ➍ would not apply. With $k = 0$, the ready deque is empty, and the processor places the parent thread on the bottom of the ready deque. We have $\Gamma_0' = \Gamma_0$ and $k' = k + 1 = 1$, with $\Gamma_1'$ denoting the newly enabled parent. We only have to check the first property. Property ➀: Thread $\Gamma_1'$ is obviously the parent of $\Gamma_0'$.

If some other processor steals a thread from processor $p$, then we must have $k > 0$, and after the steal we have $k' = k - 1$. If $k' > 0$ holds, then both properties are clearly preserved. All other actions by processor $p$, such as work stealing or executing an instruction that does not invoke any of the above rules, clearly preserve the lemma.

Before moving on, it is worth pointing out how it may happen that thread $\Gamma_k$ has been worked on since it spawned $\Gamma_{k-1}$, since Property ➁ excludes $\Gamma_k$. This situation arises when $\Gamma_k$ is stolen from processor $p$ and then stalls on its new processor. Later, $\Gamma_k$ is reenabled by $\Gamma_{k-1}$ and brought back to processor $p$'s ready deque. The key observation is that when $\Gamma_k$ is reenabled, processor $p$'s ready deque is empty and $p$ is working on $\Gamma_{k-1}$. The other threads $\Gamma_0, \Gamma_1, \ldots, \Gamma_{k-2}$ shown in Figure 3 were spawned after $\Gamma_k$ was reenabled.

We conclude this section by bounding the space used by the Work-Stealing Algorithm executing a fully strict computation.

Theorem: For any fully strict multithreaded computation with stack depth $S_1$, the Work-Stealing Algorithm run on a computer with $P$ processors uses at most $S_1 P$ space.

Proof: Like the Busy-Leaves Algorithm, the Work-Stealing Algorithm maintains the busy-leaves property: at every time step of the execution, every leaf in the current spawn subtree has a processor working on it. If we can establish this fact, then the space lemma of Section 3 completes the proof.

That the Work-Stealing Algorithm maintains the busy-leaves property is a simple consequence of the structural lemma above. At every time step, every leaf in the current spawn subtree must be ready and therefore must either have a processor working on it or be in the ready deque of some processor. But the structural lemma guarantees that no leaf thread sits in a processor's ready deque while the processor works on some other thread.

With the space bound in hand, we now turn attention to analyzing the time and communication bounds for the Work-Stealing Algorithm. Before we can proceed with this analysis, however, we must take care to define a model for coping with the contention that may arise when multiple thief processors simultaneously attempt to steal from the same victim.

5 Atomic accesses and the recycling game

This section presents the atomic-access model that we use to analyze contention during the execution of a multithreaded computation by the Work-Stealing Algorithm. We introduce a combinatorial balls and bins game, which we use to bound the total amount of delay incurred by random, asynchronous accesses in this model. We shall use the results of this section in Section 6, where we analyze the Work-Stealing Algorithm.

The atomic-access model is the machine model we use to analyze the Work-Stealing Algorithm. We assume that the machine is an asynchronous parallel computer with $P$ processors, and its memory can be either distributed or shared. Our analysis assumes that concurrent accesses to the same data structure are serially queued by an adversary, as in an atomic message-passing model. This assumption is more stringent than that in the model of Karp and Zhang: they assume that if concurrent steal requests are made to a deque in one time step, one request is satisfied and all the others are denied. In the atomic-access model, we also assume that one request is satisfied, but the others are queued by an adversary, rather than being denied. Moreover, from the collection of waiting requests for a given deque, the adversary gets to choose which is serviced and which continue to wait. The only constraint on the adversary is that if there is at least one request for a deque, then the adversary cannot choose that none be serviced.

The main result of this section is to show that if requests are made randomly by $P$ processors to $P$ deques, with each processor allowed at most one outstanding request, then the total amount of time that the processors spend waiting for their requests to be satisfied is likely to be proportional to the total number $M$ of requests, no matter which processors make the requests and no matter how the requests are distributed over time. In order to prove this result, we introduce a balls and bins game that models the effects of queueing by the adversary.

The $(P, M)$-recycling game is a combinatorial game played by the adversary, in which balls are tossed at random into bins. The parameter $P$ is the number of balls in the game, which is equal to the number of bins. The parameter $M$ is the total number of ball tosses executed by the adversary. Initially, all $P$ balls are in a reservoir separate from the $P$ bins. At each step of the game, the adversary executes the following two operations in sequence:

1. The adversary chooses some of the balls in the reservoir (possibly all and possibly none), and then for each of these balls, the adversary removes it from the reservoir, selects one of the $P$ bins uniformly and independently at random, and tosses the ball into it.

2. The adversary inspects each of the $P$ bins in turn, and for each bin that contains at least one ball, the adversary removes any one of the balls in the bin and returns it to the reservoir.

The adversary is permitted to make a total of $M$ ball tosses. The game ends when $M$ ball tosses have been made and all balls have been removed from the bins and placed back in the reservoir.
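The following simulation sketch (ours, not the paper's) plays the game against one particular adversary strategy, tossing every reservoir ball each step and servicing bins in FIFO order, and tallies the total delay defined just below.

    import random
    from collections import deque

    # A minimal sketch: play the (P, M)-recycling game with one simple
    # adversary strategy and measure the total delay, i.e., the sum over
    # steps of the number of balls left in bins after the step.
    def recycling_game(P, M, rng=None):
        rng = rng or random.Random(0)
        bins = [deque() for _ in range(P)]   # FIFO queue of balls per bin
        reservoir = list(range(P))           # initially all P balls are idle
        tosses = total_delay = 0
        while tosses < M or any(bins):
            # Operation 1: toss every reservoir ball into a random bin.
            while reservoir and tosses < M:
                bins[rng.randrange(P)].append(reservoir.pop())
                tosses += 1
            # Operation 2: every nonempty bin returns one ball.
            for b in bins:
                if b:
                    reservoir.append(b.popleft())
            total_delay += sum(len(b) for b in bins)
        return total_delay

    print(recycling_game(P=8, M=1000))  # the lemma below bounds E[delay] by M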

The recycling game models the servicing of steal requests by the Work-Stealing Algorithm. We can view each ball and each bin as being owned by a distinct processor. If a ball is in the reservoir, it means that the ball's owner is not making a steal request. If a ball is in a bin, it means that the ball's owner has made a steal request to the deque of the bin's owner, but the request has not yet been satisfied. When a ball is removed from a bin and returned to the reservoir, it means that the request has been serviced.

After each step $t$ of the game, there are some number $n_t$ of balls left in the bins, which correspond to steal requests that have not been satisfied. We shall be interested in the total delay $D = \sum_{t=1}^{T} n_t$, where $T$ is the total number of steps in the game. The goal of the adversary is to make the total delay as large as possible. The next lemma shows that despite the choices that the adversary makes about which balls to toss into bins and which to return to the reservoir, the total delay is unlikely to be large.

Lemma: For any $\varepsilon > 0$, with probability at least $1 - \varepsilon$, the total delay in the $(P, M)$-recycling game is $O(M + P \lg P + P \lg(1/\varepsilon))$. The expected total delay is at most $M$. In other words, the total delay incurred by $M$ random requests made by $P$ processors in the atomic-access model is $O(M + P \lg P + P \lg(1/\varepsilon))$ with probability at least $1 - \varepsilon$, and the expected total delay is at most $M$.

Proof: We first make the observation that the strategy by which the adversary chooses a ball from each bin is immaterial, and thus, we can assume that balls are queued in their bins in a first-in-first-out (FIFO) order. The adversary removes balls from the front of the queue, and when the adversary tosses a ball, it is placed on the back of the queue. If several balls are tossed into the same bin at the same step, they can be placed on the back of the queue in any order. The reason that assuming a FIFO discipline for queueing balls in a bin does not affect the total delay is that the number of balls in a given bin at a given step is the same no matter which ball is removed, and where balls are tossed has nothing to do with which ball is tossed.

For any given ball and any given step, the step either finishes with the ball in a bin or in the reservoir. Define the delay of ball $r$ to be the random variable $D_r$ denoting the total number of steps that finish with ball $r$ in a bin. Then we have

$$D = \sum_{r=1}^{P} D_r. \qquad (1)$$

Define the $i$th cycle of a ball to be those steps in which the ball remains in a bin from the $i$th time it is tossed until it is returned to the reservoir. Define also the $i$th delay of a ball to be the number of steps in its $i$th cycle.

We shall analyze the total delay by focusing, without loss of generality, on the delay of ball 1. If we let $m$ denote the number of times that ball 1 is tossed by the adversary, and for $i = 1, 2, \ldots, m$, we let $d_i$ be the random variable denoting the $i$th delay of ball 1, then we have $D_1 = \sum_{i=1}^{m} d_i$.

We say that the $i$th cycle of ball 1 is delayed by another ball $r$ if the $i$th toss of ball 1 places it in some bin $k$ and ball $r$ is removed from bin $k$ during the $i$th cycle of ball 1. Since the adversary follows the FIFO rule, it follows that the $i$th cycle of ball 1 can be delayed by another ball $r$ either once or not at all. Consequently, we can decompose each random variable $d_i$ into a sum $d_i = x_{i1} + x_{i2} + \cdots + x_{iP}$ of indicator random variables, where

$$x_{ir} = \begin{cases} 1 & \text{if the $i$th cycle of ball 1 is delayed by ball $r$,} \\ 0 & \text{otherwise.} \end{cases}$$

Thus, we have

$$D_1 = \sum_{i=1}^{m} \sum_{r=1}^{P} x_{ir}. \qquad (2)$$

[Footnote: Greg Plaxton of the University of Texas, Austin has improved this bound to $O(M)$ for the case when $1/\varepsilon$ is at most polynomial in $M$ and $P$.]

We now prove an important property of these indicator random variables. Consider any set $S$ of pairs $(i, r)$, each of which corresponds to the event that the $i$th cycle of ball 1 is delayed by ball $r$. For any such set $S$, we claim that

$$\Pr\Big\{ \bigwedge_{(i,r) \in S} (x_{ir} = 1) \Big\} \leq \Big(\frac{1}{P}\Big)^{|S|}. \qquad (3)$$

The crux of proving the claim is to show that

$$\Pr\Big\{ x_{ir} = 1 \,\Big|\, \bigwedge_{(i',r') \in S'} (x_{i'r'} = 1) \Big\} \leq \frac{1}{P}, \qquad (4)$$

where $S' = S - \{(i, r)\}$, whence the claim follows from Bayes's Theorem.

We can derive Inequality (4) from a careful analysis of dependencies. Because the adversary follows the FIFO rule, we have that $x_{ir} = 1$ only if, when the adversary executes the $i$th toss of ball 1, it falls into whatever bin contains ball $r$ (if any). A priori, this event happens with probability either $1/P$ or 0, and hence, with probability at most $1/P$. Conditioning on any collection of events relating which balls delay this or other cycles of ball 1 cannot increase this probability, as we now argue in two cases. In the first case, the indicator random variables $x_{i'r'}$, where $i' \neq i$, tell whether other cycles of ball 1 are delayed. This information tells nothing about where the $i$th toss of ball 1 goes. Therefore, these random variables are independent of $x_{ir}$, and thus the probability $1/P$ upper bound is not affected. In the second case, the indicator random variables $x_{ir'}$ tell whether the $i$th toss of ball 1 goes to the bin containing ball $r'$, but this information tells us nothing about whether it goes to the bin containing ball $r$, because the indicator random variables tell us nothing to relate where ball $r'$ and ball $r$ are located. Moreover, no collusion among the indicator random variables provides any more information, and thus Inequality (4) holds.

Equation (2) shows that the delay $D_1$ encountered by ball 1 throughout all of its cycles can be expressed as a sum of $mP$ indicator random variables. In order for $D_1$ to equal or exceed a given value $\Delta$, there must be some set containing $\Delta$ of these indicator random variables, each of which must be 1. For any specific such set, Inequality (3) says that the probability is at most $(1/P)^\Delta$ that all random variables in the set are 1. Since there are $\binom{mP}{\Delta} \leq (emP/\Delta)^\Delta$ such sets, where $e$ is the base of the natural logarithm, we have

$$\Pr\{D_1 \geq \Delta\} \leq \Big(\frac{emP}{\Delta}\Big)^\Delta \Big(\frac{1}{P}\Big)^\Delta = \Big(\frac{em}{\Delta}\Big)^\Delta \leq \frac{\varepsilon}{P}$$

whenever $\Delta \geq \max\{2em, \lg P + \lg(1/\varepsilon)\}$.

Although our analysis was performed for ball 1, it applies to any other ball as well. Consequently, for any given ball $r$, which is tossed $m_r$ times, the probability that its delay $D_r$ exceeds $\max\{2em_r, \lg P + \lg(1/\varepsilon)\}$ is at most $\varepsilon / P$. By Boole's inequality and Equation (1), it follows that, with probability at least $1 - \varepsilon$, the total delay $D$ is at most

$$D \leq \sum_{r=1}^{P} \max\{2em_r, \lg P + \lg(1/\varepsilon)\} = O(M + P \lg P + P \lg(1/\varepsilon)),$$

since $M = \sum_{r=1}^{P} m_r$.

The upper bound $\mathrm{E}[D] \leq M$ can be obtained as follows. Recall that each $D_r$ is the sum of $P m_r$ indicator random variables, each of which has expectation at most $1/P$. Therefore, by linearity of expectation, $\mathrm{E}[D_r] \leq m_r$. Using Equation (1) and again using linearity of expectation, we obtain $\mathrm{E}[D] \leq M$.

With this bound on the total delay incurred by $M$ random requests now in hand, we turn back to the Work-Stealing Algorithm.

6 Analysis of the work-stealing algorithm

In this section, we analyze the time and communication cost of executing a fully strict multithreaded computation with the Work-Stealing Algorithm. For any fully strict computation with work $T_1$ and critical-path length $T_\infty$, we show that the expected running time with $P$ processors, including scheduling overhead, is $T_1/P + O(T_\infty)$. Moreover, for any $\varepsilon > 0$, the execution time on $P$ processors is $T_1/P + O(T_\infty + \lg P + \lg(1/\varepsilon))$, with probability at least $1 - \varepsilon$. We also show that the expected total communication during the execution of a fully strict computation is $O(P T_\infty (1 + n_d) S_{\max})$, where $n_d$ is the maximum number of join edges from a thread to its parent and $S_{\max}$ is the largest size of any activation frame.

Unlike in the Busy-Leaves Algorithm, the ready pool in the Work-Stealing Algorithm is distributed, and so there is no contention at a centralized data structure. Nevertheless, it is still possible for contention to arise when several thieves happen to descend on the same victim simultaneously. In this case, as we have indicated in the previous section, we make the conservative assumption that an adversary serially queues the work-stealing requests. We further assume that it takes unit time for a processor to respond to a work-stealing request. This assumption can be relaxed without materially affecting the results, so that a work-stealing response takes any constant amount of time.

To analyze the running time of the Work-Stealing Algorithm executing a fully strict multithreaded computation with work $T_1$ and critical-path length $T_\infty$ on a computer with $P$ processors, we use an accounting argument. At each step of the algorithm, we collect $P$ dollars, one from each processor. At each step, each processor places its dollar in one of three buckets according to its actions at that step. If the processor executes an instruction at the step, then it places its dollar into the Work bucket. If the processor initiates a steal attempt at the step, then it places its dollar into the Steal bucket. And if the processor merely waits for a queued steal request at the step, then it places its dollar into the Wait bucket. We shall derive the running-time bound by bounding the number of dollars in each bucket at the end of the execution, summing these three bounds, and then dividing by $P$.
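In symbols, writing $T$ for the number of steps that the execution takes, the identity behind this accounting argument is

$$P \cdot T = (\text{dollars in Work}) + (\text{dollars in Steal}) + (\text{dollars in Wait}),$$

so dividing bounds on the three buckets by $P$ bounds the running time $T$.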

We first bound the total number of dollars in the Work bucket.

Lemma: The execution of a fully strict multithreaded computation with work $T_1$ by the Work-Stealing Algorithm on a computer with $P$ processors terminates with exactly $T_1$ dollars in the Work bucket.

Proof: A processor places a dollar in the Work bucket only when it executes an instruction. Thus, since there are $T_1$ instructions in the computation, the execution ends with exactly $T_1$ dollars in the Work bucket.

Bounding the total dollars in the Steal bucket requires a significantly more involved delay-sequence argument. We first introduce the notion of a round of work-steal attempts, and we must also define an augmented dag that we then use to define critical instructions. The idea is as follows. If, during the course of the execution, a large number of steals are attempted, then we can identify a sequence of instructions, the delay sequence, in the augmented dag such that each of these steal attempts was initiated while some instruction from the sequence was critical. We then show that a critical instruction is unlikely to remain critical across a modest number of steal attempts. We can then conclude that such a delay sequence is unlikely to occur, and therefore, an execution is unlikely to suffer a large number of steal attempts.

A round of steal attempts is a set of at least $3P$ but fewer than $4P$ consecutive steal attempts such that if a steal attempt that is initiated at time step $t$ occurs in a particular round, then all other steal attempts initiated at time step $t$ are also in the same round. We can partition all of the steal attempts that occur during an execution into rounds as follows. The first round contains all steal attempts initiated at time steps $1, 2, \ldots, t_1$, where $t_1$ is the earliest time such that at least $3P$ steal attempts were initiated at or before $t_1$. We say that the first round starts at time step 1 and ends at time step $t_1$. In general, if the $i$th round ends at time step $t_i$, then the $(i+1)$st round begins at time step $t_i + 1$ and ends at the earliest time step $t_{i+1}$ such that at least $3P$ steal attempts were initiated at time steps between $t_i + 1$ and $t_{i+1}$, inclusive. These steal attempts belong to round $i + 1$. By definition, each round contains at least $3P$ consecutive steal attempts. Moreover, since at most $P$ steal attempts can be initiated in a single time step, each round contains fewer than $4P$ steal attempts, and each round takes at least 3 steps.

The sequence of instructions that make up the delay sequence is defined with respect to an augmented dag obtained by modifying the original dag slightly. Let $G$ denote the original dag, that is, the dag consisting of the computation's instructions as vertices and its continue, spawn, and join edges as edges. The augmented dag $G'$ is the original dag $G$ together with some new edges, as follows. For every set of instructions $u$, $v$, and $w$ such that $(u, v)$ is a spawn edge and $(u, w)$ is a continue edge, the deque edge $(w, v)$ is placed in $G'$. These deque edges are shown dashed in Figure 3. In Section 2 we made the technical assumption that instruction $w$ has no incoming join edges, and so $G'$ is a dag. If $T_\infty$ is the length of a longest path in $G$, then the longest path in $G'$ has length at most $2T_\infty$. It is worth pointing out that $G'$ is only an analytical tool; the deque edges have no effect on the scheduling and execution of the computation by the Work-Stealing Algorithm.
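A sketch of this construction, in the same hypothetical Python vocabulary as the earlier sketches (edges are (source, target) pairs of instruction identifiers):

    # A minimal sketch (ours): form the augmented dag G' by adding a deque
    # edge (w, v) whenever some instruction u has a spawn edge (u, v) and a
    # continue edge (u, w).
    def augment_with_deque_edges(continue_edges, spawn_edges, join_edges):
        next_in_thread = dict(continue_edges)   # u -> w along continue edges
        deque_edges = [(next_in_thread[u], v)
                       for (u, v) in spawn_edges
                       if u in next_in_thread]  # guard: u might end its thread
        return continue_edges + spawn_edges + join_edges + deque_edges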

The deque edges are the key to defining critical instructions. At any time step during the execution, we say that an unexecuted instruction $v$ is critical if every instruction that precedes $v$, either directly or indirectly, in $G'$ has been executed; that is, if for every instruction $w$ such that there is a directed path from $w$ to $v$ in $G'$, instruction $w$ has been executed. A critical instruction must be ready, since $G'$ contains every edge of $G$, but a ready instruction may or may not be critical. Intuitively, the structural properties of a ready deque enumerated in the lemma of Section 4 guarantee that if a thread is deep in a ready deque, then its current instruction cannot be critical, because the predecessor of the thread's current instruction across the deque edge has not yet been executed.

We now formalize our definition of a delay sequence.

Definition: A delay sequence is a 3-tuple $(U, R, \Pi)$ satisfying the following conditions:

1. $U = (u_1, u_2, \ldots, u_L)$ is a maximal directed path in $G'$. Specifically, for $i = 1, 2, \ldots, L - 1$, the edge $(u_i, u_{i+1})$ belongs to $G'$; instruction $u_1$ has no incoming edges in $G'$ (instruction $u_1$ must be the first instruction of the root thread); and instruction $u_L$ has no outgoing edges in $G'$ (instruction $u_L$ must be the last instruction of the root thread).

2. $R$ is a positive integer number of steal-attempt rounds.

3. $\Pi = (\pi_1, \pi_1', \pi_2, \pi_2', \ldots, \pi_L, \pi_L')$ is a partition of the integer $R$, that is, $R = \sum_{i=1}^{L} (\pi_i + \pi_i')$, such that $\pi_i' \in \{0, 1\}$ for each $i = 1, 2, \ldots, L$.

The partition $\Pi$ induces a partition of a sequence of $R$ rounds as follows. The first piece of the partition corresponds to the first $\pi_1$ rounds. The second piece corresponds to the next $\pi_1'$ consecutive rounds after the first $\pi_1$ rounds. The third piece corresponds to the next $\pi_2$ consecutive rounds after the first $\pi_1 + \pi_1'$ rounds, and so on. We are interested primarily in the pieces corresponding to the $\pi_i$, not the $\pi_i'$, and so we define the $i$th group of rounds to be the $\pi_i$ consecutive rounds starting after the $r_i$th round, where $r_i = \sum_{j=1}^{i-1} (\pi_j + \pi_j')$. Because $\Pi$ is a partition of $R$ and $\pi_i' \in \{0, 1\}$ for $i = 1, 2, \ldots, L$, we have

$$\sum_{i=1}^{L} \pi_i \geq R - L.$$

We say that a given round of steal attempts occurs while instruction $v$ is critical if all of the steal attempts that comprise the round are initiated at time steps when $v$ is critical; in other words, $v$ must be critical throughout the entire round. A delay sequence $(U, R, \Pi)$ is said to occur during an execution if, for each $i = 1, 2, \ldots, L$, all $\pi_i$ rounds in the $i$th group occur while instruction $u_i$ is critical. In other words, $u_i$ must be critical throughout all $\pi_i$ rounds.

The following lemma states that if at least $R$ rounds take place during an execution, then some delay sequence $(U, R, \Pi)$ must occur. In particular, if we look at any execution in which at least $R$ rounds occur, then we can identify a path $U = (u_1, u_2, \ldots, u_L)$ in the dag $G'$ and a partition $\Pi = (\pi_1, \pi_1', \ldots, \pi_L, \pi_L')$ of the first $R$ rounds such that for each $i = 1, 2, \ldots, L$, all $\pi_i$ rounds in the $i$th group occur while $u_i$ is critical. Each $\pi_i'$ indicates whether $u_i$ is critical at the beginning of a round but gets executed before the round ends; such a round cannot be part of any group, because no instruction is critical throughout.

Lemma: Consider the execution of a fully strict multithreaded computation with critical-path length $T_\infty$ by the Work-Stealing Algorithm on a computer with $P$ processors. If at least $4PR$ steal attempts occur during the execution, then some delay sequence $(U, R, \Pi)$ must occur.

Proof. For a given execution in which at least $4PR$ steal attempts take place, we construct a delay sequence $(U, R, \Pi)$ and show that it occurs. With at least $4PR$ steal attempts, there must be at least $R$ rounds, since each round contains fewer than $4P$ steal attempts. We construct the delay sequence by first identifying a set of instructions on a directed path in $G'$ such that at every time step during the execution, one of these instructions is critical. Then we partition the first $R$ rounds according to when each round occurs relative to when each instruction on the path is critical.

To construct the path $U$, we work backwards from the last instruction of the root thread, which we denote by $v_1$. Let $v_{l_1}$ denote a (not necessarily immediate) predecessor instruction of $v_1$ in $G'$ with the latest execution time, and let $(v_{l_1}, v_{l_1 - 1}, \ldots, v_2, v_1)$ denote a directed path from $v_{l_1}$ to $v_1$ in $G'$. We extend this path back to the first instruction of the root thread by iterating this construction as follows. At the $i$th iteration we have an instruction $v_{l_i}$ and a directed path in $G'$ from $v_{l_i}$ to $v_1$. We let $v_{l_{i+1}}$ denote a predecessor of $v_{l_i}$ in $G'$ with the latest execution time, and we let $(v_{l_{i+1}}, v_{l_{i+1} - 1}, \ldots, v_{l_i + 1}, v_{l_i})$ denote a directed path from $v_{l_{i+1}}$ to $v_{l_i}$ in $G'$. We finish iterating the construction when we get to an iteration $k$ in which $v_{l_k}$ is the first instruction of the root thread. Our desired sequence is then $U = (u_1, u_2, \ldots, u_L)$, where $L = l_k$ and $u_i = v_{L-i+1}$ for $i = 1, 2, \ldots, L$. One can verify that at every time step of the execution, one of the $v_{l_i}$ is critical.

Now, to construct the partition $\Pi$, we partition the sequence of the first $R$ rounds according to when each round occurs. We would like our partition to be such that for each round among the first $R$ rounds, if the round occurs while some instruction $u_i$ is critical, then the round belongs to the $i$th group. Start with $u_1$, and let $\pi_1$ equal the number of rounds that occur while $u_1$ is critical. All of these rounds are consecutive at the beginning of the sequence, so these rounds comprise the 1st group; that is, they are the $\pi_1$ consecutive rounds starting after the $r_1 = 0$ first rounds. Next, if the round that immediately follows those first $\pi_1$ rounds begins after $u_1$ has been executed, then we set $\pi'_1 = 0$ and we go on to $u_2$. Otherwise, that round begins while $u_1$ is critical and ends after $u_1$ is executed (for otherwise it would be part of the first group), so we set $\pi'_1 = 1$ and we go on to $u_2$. For $u_2$, we let $\pi_2$ equal the number of rounds that occur while $u_2$ is critical. Note that all of these rounds are consecutive, beginning after the first $\pi_1 + \pi'_1$ rounds, so these rounds comprise the 2nd group. We continue in this fashion, letting each $\pi_i$ be the number of rounds that occur while $u_i$ is critical and letting each $\pi'_i$ be the number of rounds (either $0$ or $1$) that begin while $u_i$ is critical but do not end until after $u_i$ is executed. As an example, we may have a round that begins while $u_i$ is critical and then ends while $u_{i+2}$ is critical, and in this case we set $\pi'_i = 1$ and $\pi_{i+1} = 0$. In this example, the $(i+1)$st group is empty, so we also set $\pi'_{i+1} = 0$.

We conclude the proof by verifying that the $(U, R, \Pi)$ just constructed is a delay sequence and that it occurs. By construction, $U$ is a maximal path in $G'$. Now, considering $\Pi$, we observe that each round among the first $R$ rounds is counted exactly once, in either a $\pi_i$ or a $\pi'_i$, so $\Pi$ is indeed a partition of $R$. Moreover, for $i = 1, 2, \ldots, L$, at most one round can begin while the instruction $u_i$ is critical and end after $u_i$ is executed, so we have $\pi'_i \in \{0, 1\}$. Thus, $(U, R, \Pi)$ is a delay sequence. Finally, we observe that by construction, for $i = 1, 2, \ldots, L$, the $\pi_i$ rounds that comprise the $i$th group all occur while the instruction $u_i$ is critical. Therefore, the delay sequence $(U, R, \Pi)$ occurs.

We now establish that a critical instruction is unlikely to remain critical across a modest number of rounds. Specifically, we first show that a critical instruction must be the ready instruction of a thread that is near the top of its processor's ready deque. We then use this fact to show that after $O(1)$ rounds, a critical instruction is very likely to be executed.

Lemma. At every time step during the execution of a fully strict multithreaded computation by the Work-Stealing Algorithm, each critical instruction is the ready instruction of a thread that has at most $1$ thread above it in its processor's ready deque.

Proof. Consider any time step, and let $u$ be the critical instruction of a thread $\Gamma$. Since $u$ is critical, $\Gamma$ is ready. Hence, for some processor $p$, either $\Gamma$ is in $p$'s ready deque or $\Gamma$ is being worked on by $p$. If $\Gamma$ has more than $1$ thread above it in $p$'s ready deque, then the structural properties of ready deques enumerated earlier guarantee that each of the at least $2$ threads above $\Gamma$ in $p$'s ready deque is an ancestor of $\Gamma$. Let $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ denote $\Gamma$'s ancestor threads, where $\Gamma_1$ is the parent of $\Gamma$ and $\Gamma_k$ is the root thread. Further, for $i = 1, 2, \ldots, k$, let $u_i$ denote the instruction of thread $\Gamma_i$ that spawned thread $\Gamma_{i-1}$ (where $\Gamma_0$ denotes $\Gamma$ itself), and let $w_i$ denote $u_i$'s successor instruction in thread $\Gamma_i$. Because of the deque edges, each instruction $w_i$ is a predecessor of $u$ in $G'$, and consequently, since $u$ is critical, each instruction $w_i$ has been executed. Moreover, because each $w_i$ is the successor of the spawn instruction $u_i$ in thread $\Gamma_i$, each thread $\Gamma_i$ for $i = 1, 2, \ldots, k$ has been worked on since the time step at which it spawned thread $\Gamma_{i-1}$. But the same structural lemma guarantees that only the topmost thread in $p$'s ready deque can have this property. Thus, $\Gamma_1$ is the only thread that can possibly be above $\Gamma$ in $p$'s ready deque.

Lemma. Consider the execution of any fully strict multithreaded computation by the Work-Stealing Algorithm on a parallel computer with $P$ processors. For any instruction $v$ and any number $r \ge 2$ of steal-attempt rounds, the probability that any particular set of $r$ rounds occur while the instruction $v$ is critical is at most the probability that only $0$ or $1$ of the steal attempts initiated in the first $r - 1$ of these rounds choose $v$'s processor, which is at most $e^{-(r-1)}$.

Proof. Let $t_a$ denote the first time step at which instruction $v$ is critical, and let $p$ denote the processor in whose ready deque $v$'s thread resides at time step $t_a$. Consider any particular set of $r$ rounds, and suppose that they all occur while instruction $v$ is critical. Now, consider the steal attempts that comprise the first $r - 1$ of these rounds, of which there must be at least $3P(r-1)$. Let $t_b$ denote the time step just after the time step at which the last of these steal attempts is initiated. Note that because the last round, like every round, must take at least two (in fact, four) steps, the time step $t_b$ must occur before the time step at which instruction $v$ is executed.

We shall first show that of these $3P(r-1)$ steal attempts, each initiated while instruction $v$ is critical and at least two time steps before $v$ is executed, at most $1$ can choose processor $p$ as its target, for otherwise $v$ would be executed at or before $t_b$. Recall from the preceding lemma that instruction $v$ is the ready instruction of a thread $\Gamma$ that has at most $1$ thread above it in $p$'s ready deque as long as $v$ is critical.

If $\Gamma$ has no threads above it, then another thread cannot be placed above it until after instruction $v$ is executed, since only by processor $p$ executing instructions from $\Gamma$ can another thread be placed above $\Gamma$ in its ready deque. Consequently, if a steal attempt targeting processor $p$ is initiated at some time step $t \ge t_a$, we are guaranteed that instruction $v$ is executed no later than the time step at which that steal attempt is serviced, either by thread $\Gamma$ being stolen and executed or by $p$ executing the thread itself.

Now, suppose $\Gamma$ has one thread above it in $p$'s ready deque. In this case, if a steal attempt targeting processor $p$ is initiated at time step $t \ge t_a$, then the thread above $\Gamma$ gets stolen from $p$'s ready deque no later than the time step at which that attempt is serviced. Suppose further that another steal attempt targeting processor $p$ is initiated at time step $t'$, where $t < t' < t_b$. Then we know that a second steal will be serviced by $p$ at or before time step $t_b$. If this second steal gets thread $\Gamma$, then instruction $v$ must get executed at or before time step $t_b$, which is impossible since $v$ is executed after time step $t_b$. Consequently, this second steal must get the thread above $\Gamma$, the same thread that the first steal got. But this scenario can only occur if, in the intervening time period, that thread stalls and is subsequently reenabled by the execution of some instruction from thread $\Gamma$, in which case instruction $v$ must be executed before time step $t_b$, which is once again impossible.

Thus, we must have at least $3P(r-1)$ steal attempts, each initiated at a time step $t$ such that $t_a \le t < t_b$, at most $1$ of which targets processor $p$. The probability that either $0$ or $1$ of $3P(r-1)$ steal attempts chooses processor $p$ is

$$\left(1 - \frac{1}{P}\right)^{3P(r-1)} + \;3P(r-1)\cdot\frac{1}{P}\left(1 - \frac{1}{P}\right)^{3P(r-1)-1} \;\le\; \bigl(1 + 3(r-1)\bigr)\left(1 - \frac{1}{P}\right)^{3P(r-1)-1} \;\le\; \bigl(1 + 3(r-1)\bigr)\,e^{-3(r-1)+1/P} \;\le\; e^{-(r-1)}$$

for $r \ge 2$ and $P \ge 2$.
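A quick numeric sanity check of this bound (a sketch; the parameter values are illustrative, not from the paper):

```python
import math

def prob_at_most_one_hit(P, r):
    """Exact probability that 0 or 1 of 3P(r-1) independent steal attempts,
    each choosing a target uniformly among P processors, hit processor p."""
    n = 3 * P * (r - 1)
    zero_hits = (1 - 1 / P) ** n
    one_hit = n * (1 / P) * (1 - 1 / P) ** (n - 1)
    return zero_hits + one_hit

for P in (2, 16, 1024):
    for r in (2, 3, 8):
        assert prob_at_most_one_hit(P, r) <= math.exp(-(r - 1))
```

The worst case is large $P$, where the number of hits approaches a Poisson distribution with mean $3(r-1)$.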

We now complete the delay-sequence argument and bound the total dollars in the Steal bucket.

Lemma. Consider the execution of any fully strict multithreaded computation with critical-path length $T_\infty$ by the Work-Stealing Algorithm on a parallel computer with $P$ processors. For any $\epsilon > 0$, with probability at least $1 - \epsilon$, at most $O(P(T_\infty + \lg(1/\epsilon)))$ work-steal attempts occur. The expected number of steal attempts is $O(PT_\infty)$. In other words, with probability at least $1 - \epsilon$, the execution terminates with at most $O(P(T_\infty + \lg(1/\epsilon)))$ dollars in the Steal bucket, and the expected number of dollars in this bucket is $O(PT_\infty)$.

Proof. From the delay-sequence lemma above, we know that if at least $4PR$ steal attempts occur, then some delay sequence $(U, R, \Pi)$ must occur. Consider a particular delay sequence $(U, R, \Pi)$ having $U = (u_1, u_2, \ldots, u_L)$ and $\Pi = (\pi_1, \pi'_1, \pi_2, \pi'_2, \ldots, \pi_L, \pi'_L)$, with $L \le 2T_\infty$. We shall compute the probability that $(U, R, \Pi)$ occurs.

Such a sequence occurs if, for $i = 1, 2, \ldots, L$, each instruction $u_i$ is critical throughout all $\pi_i$ rounds in the $i$th group. From the preceding lemma, we know that the probability of the $\pi_i$ rounds in the $i$th group all occurring while a given instruction $u_i$ is critical is at most the probability that only $0$ or $1$ of the steal attempts initiated in the first $\pi_i - 1$ of these rounds choose $u_i$'s processor, which is at most $e^{-(\pi_i - 1)}$, provided $\pi_i \ge 2$. For those values of $i$ with $\pi_i < 2$, we shall use $1$ as an upper bound on this probability. Moreover, since the targets of the work-steal attempts in the $\pi_i$ rounds of the $i$th group are chosen independently from the targets chosen in other rounds, we can bound the probability of the particular delay sequence $(U, R, \Pi)$ occurring as follows:

$$\Pr\{(U, R, \Pi) \text{ occurs}\} \;\le\; \prod_{i=1}^{L} \Pr\{\text{the $\pi_i$ rounds of the $i$th group occur while $u_i$ is critical}\}$$

$$\le\; \prod_{\substack{1 \le i \le L \\ \pi_i \ge 2}} e^{-(\pi_i - 1)} \;=\; \exp\Biggl(-\sum_{\substack{1 \le i \le L \\ \pi_i \ge 2}} (\pi_i - 1)\Biggr) \;\le\; \exp\Biggl(2L - \sum_{i=1}^{L} \pi_i\Biggr) \;\le\; e^{-R + 3L},$$

where the second-to-last inequality holds because the groups with $\pi_i < 2$ contribute at most $L$ to the sum $\sum_{i=1}^{L} \pi_i$ and the subtracted $1$'s contribute at most another $L$, and where the last inequality follows from Inequality $(\ast)$.

To bound the probability of some delay sequence $(U, R, \Pi)$ occurring, we need to count the number of such delay sequences and multiply by the probability that a particular such sequence occurs. The directed path $U$ in the modified dag $G'$ starts at the first instruction of the root thread and ends at the last instruction of the root thread. If the original dag $G$ has degree $d$, then $G'$ has degree at most $d + 1$. Consistent with our unit-time assumption for instructions, we assume that the degree $d$ is a constant. Since the length of a longest path in $G'$ is at most $2T_\infty$, there are at most $(d+1)^{2T_\infty}$ ways of choosing the path $U = (u_1, u_2, \ldots, u_L)$. There are at most $\binom{R+2L}{2L} \le \binom{R+4T_\infty}{4T_\infty}$ ways to choose $\Pi$, since $\Pi$ partitions $R$ into $2L$ pieces. As we have just shown, a given delay sequence has at most an $e^{-R+3L} \le e^{-R+6T_\infty}$ chance of occurring. Multiplying these three factors together bounds the probability that any delay sequence $(U, R, \Pi)$ occurs by

$$\binom{R+4T_\infty}{4T_\infty}\,(d+1)^{2T_\infty}\,e^{-R+6T_\infty},$$

which is at most $\epsilon$ for $R = c(T_\infty \lg d + \lg(1/\epsilon))$, where $c$ is a sufficiently large constant. Thus, the probability that at least $4PR = 4Pc(T_\infty \lg d + \lg(1/\epsilon)) = O(P(T_\infty + \lg(1/\epsilon)))$ steal attempts occur is at most $\epsilon$. The expectation bound follows because the tail of the distribution decreases exponentially.
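The following sketch (illustrative parameters only, not values from the paper) searches numerically for a constant $c$ at which this failure probability drops below $\epsilon$:

```python
import math

def log_failure_bound(T, d, R):
    """log of binom(R+4T, 4T) * (d+1)^(2T) * e^(-R+6T)."""
    log_binom = (math.lgamma(R + 4 * T + 1) - math.lgamma(4 * T + 1)
                 - math.lgamma(R + 1))
    return log_binom + 2 * T * math.log(d + 1) - R + 6 * T

T, d, eps = 20, 2, 0.01
c = 1
while True:
    R = math.ceil(c * (T * math.log2(d) + math.log2(1 / eps)))
    if log_failure_bound(T, d, R) <= math.log(eps):
        break
    c += 1
print("c =", c, "R =", R)   # prints a modest constant c and the matching R
```

The exponential factor $e^{-R}$ eventually dominates the polynomial growth of the binomial coefficient, so the loop terminates.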

With bounds on the number of dollars in the Work and Steal buckets, we now state the theorem that bounds the total execution time for a fully strict multithreaded computation by the Work-Stealing Algorithm, and we complete the proof by bounding the number of dollars in the Wait bucket.

Theorem. Consider the execution of any fully strict multithreaded computation with work $T_1$ and critical-path length $T_\infty$ by the Work-Stealing Algorithm on a parallel computer with $P$ processors. The expected running time, including scheduling overhead, is $T_1/P + O(T_\infty)$. Moreover, for any $\epsilon > 0$, with probability at least $1 - \epsilon$, the execution time on $P$ processors is $T_1/P + O(T_\infty + \lg P + \lg(1/\epsilon))$.

Proof. The preceding lemmas bound the dollars in the Work and Steal buckets, so we now must bound the dollars in the Wait bucket. This bound is given by the recycling-game lemma, which bounds the total delay, that is, the total dollars in the Wait bucket, as a function of the number $M$ of steal attempts, that is, the total dollars in the Steal bucket. That lemma says that for any $\epsilon > 0$, with probability at least $1 - \epsilon$, the number of dollars in the Wait bucket is at most a constant times the number of dollars in the Steal bucket plus $O(P \lg P + P \lg(1/\epsilon))$, and the expected number of dollars in the Wait bucket is at most the number in the Steal bucket. We now add up the dollars in the three buckets and divide by $P$ to complete this proof.
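In expectation, the accounting works out as follows (a sketch of the arithmetic; the Work bucket holds exactly $T_1$ dollars, and the Steal and Wait buckets each hold $O(PT_\infty)$ dollars in expectation):

$$\mathrm{E}[T_P] \;\le\; \frac{1}{P}\Bigl(\underbrace{T_1}_{\text{Work}} \;+\; \underbrace{O(PT_\infty)}_{\text{Steal}} \;+\; \underbrace{O(PT_\infty)}_{\text{Wait}}\Bigr) \;=\; \frac{T_1}{P} + O(T_\infty).$$

The same accounting with the high-probability bounds for the Steal and Wait buckets yields the $T_1/P + O(T_\infty + \lg P + \lg(1/\epsilon))$ bound.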

The next theorem bounds the total amount of communication that a multithreaded computation executed by the Work-Stealing Algorithm performs in a distributed model. The analysis makes the assumption that at most a constant number of bytes need be communicated along a join edge to resolve the dependency.

Theorem. Consider the execution of any fully strict multithreaded computation with critical-path length $T_\infty$ by the Work-Stealing Algorithm on a parallel computer with $P$ processors. Then the total number of bytes communicated has expectation $O(PT_\infty(1 + n_d)S_{\max})$, where $n_d$ is the maximum number of join edges from a thread to its parent and $S_{\max}$ is the size in bytes of the largest activation frame in the computation. Moreover, for any $\epsilon > 0$, the probability is at least $1 - \epsilon$ that the total communication incurred is $O(P(T_\infty + \lg(1/\epsilon))(1 + n_d)S_{\max})$.

Proof. We prove the bound for the expectation; the high-probability bound is analogous. By our bucketing argument, the expected number of steal attempts is at most $O(PT_\infty)$. When a thread is stolen, the communication incurred is at most $S_{\max}$. Communication also occurs whenever a join edge enters a parent thread from one of its children and the parent has been stolen, but since each join edge accounts for at most a constant number of bytes, the communication incurred is at most $O(n_d)$ per steal. Finally, we can have communication when a child thread enables its parent and puts the parent into the child's processor's ready deque. This event can happen at most $n_d$ times for each time the parent is stolen, so the communication incurred is at most $n_d S_{\max}$ per steal. Thus, the expected total communication cost is $O(PT_\infty(1 + n_d)S_{\max})$.
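For a feel of the magnitudes (illustrative numbers, not taken from the paper), consider $P = 32$ processors, $T_\infty = 10^3$, $n_d = 2$, and $S_{\max} = 512$ bytes:

$$O\bigl(P\,T_\infty\,(1 + n_d)\,S_{\max}\bigr) \;=\; O\bigl(32 \cdot 10^3 \cdot 3 \cdot 512\bigr) \;\approx\; O(5 \times 10^7) \text{ bytes},$$

independent of the work $T_1$, which may be vastly larger.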

The communication bounds in this theorem are existentially tight, in that there exist fully strict computations that require $\Omega(PT_\infty(1 + n_d)S_{\max})$ total communication for any execution schedule that achieves linear speedup. This result follows directly from a theorem of Wu and Kung, who showed that divide-and-conquer computations, a special case of fully strict computations with $n_d = 1$, require this much communication.

[Footnote: With Plaxton's bound for the recycling-game lemma, this time bound becomes $T_1/P + O(T_\infty + \lg(1/\epsilon))$ whenever $1/\epsilon$ is at most polynomial in $M$ and $P$.]

In the case when we have $n_d = O(1)$ and the algorithm achieves linear expected speedup, that is, when $P = O(T_1/T_\infty)$, the total communication is at most $O(T_1 S_{\max})$. Moreover, if $P \ll T_1/T_\infty$, the total communication is much less than $T_1 S_{\max}$, which confirms the folk wisdom that work-stealing algorithms require much less communication than the possibly $\Omega(T_1 S_{\max})$ communication of work-sharing algorithms.

Conclusion

How practical are the methods analyzed in this paper? We have been actively engaged in building a C-based language called Cilk (pronounced "silk") for programming multithreaded computations. Cilk is derived from the PCM (Parallel Continuation Machine) system, which was itself partly inspired by the research reported here. The Cilk runtime system employs the Work-Stealing Algorithm described in this paper. Because Cilk employs a provably efficient scheduling algorithm, Cilk delivers guaranteed performance to user applications. Specifically, we have found empirically that the performance of an application written in the Cilk language can be predicted accurately using the model $T_1/P + T_\infty$.

The Cilk system currently runs on contemporary shared-memory multiprocessors, such as the Sun Enterprise, the Silicon Graphics Origin, the Intel Quad Pentium, and the DEC Alphaserver. Earlier versions of Cilk ran on the Thinking Machines CM-5 MPP, the Intel Paragon MPP, and the IBM SP-2. To date, applications written in Cilk include protein folding, graphic rendering, backtrack search, and the *Socrates chess program, which won second prize in the 1995 ICCA World Computer Chess Championship running on a 1824-node Paragon at Sandia National Laboratories. Our more recent chess program, Cilkchess, won the Dutch Open Computer Chess Tournament. A team programming in Cilk won First Prize (undefeated) in the ICFP Programming Contest sponsored by the International Conference on Functional Programming.

As part of our research, we have implemented a prototype runtime system for Cilk on networks of workstations. This runtime system, called Cilk-NOW, supports adaptive parallelism, whereby processors in a workstation environment can join a user's computation if they would otherwise be idle, and yet be available immediately to leave the computation when needed again by their owners. Cilk-NOW also supports transparent fault tolerance, meaning that the user's computation can proceed even in the face of processors crashing, and yet the programmer writes the code in a completely fault-oblivious fashion. A more recent distributed implementation for clusters of SMPs is described in Randall's thesis.

We have also investigated other topics related to Cilk, including distributed shared memory and debugging tools. Up-to-date information, papers, and software releases can be found on the World Wide Web at http://supertech.lcs.mit.edu/cilk.

For the case of shared-memory multiprocessors, we have recently generalized the time bound (but not the space or communication bounds) along two dimensions. First, we have shown that for arbitrary (not necessarily fully strict or even strict) multithreaded computations, the expected execution time is $O(T_1/P + T_\infty)$. This bound is based on a new structural lemma and an amortized analysis using a potential function. Second, we have developed a nonblocking implementation of the work-stealing algorithm, and we have analyzed its execution time for a multiprogrammed environment in which the computation executes on a set of processors that grows and shrinks over time. This growing and shrinking is controlled by an adversary. In the case where the adversary chooses not to grow or shrink the set of processors, the bound specializes to match our previous bound. The nonblocking work stealer has been implemented in the Hood user-level threads library. Up-to-date information, papers, and software releases can be found on the World Wide Web at http://www.cs.utexas.edu/users/hood.

Acknowledgments

Thanks to Bruce Maggs of Carnegie Mellon, who outlined the strategy of using a delay-sequence argument to prove the time bounds on the Work-Stealing Algorithm, which improved our previous bounds. Thanks to Greg Plaxton of the University of Texas at Austin for technical comments on our probabilistic analyses. Thanks to the anonymous referees, to Yanjun Zhang of Southern Methodist University, and to Warren Burton of Simon Fraser University for comments that improved the clarity of our paper. Thanks also to Arvind, Michael Halbherr, Chris Joerg, Bradley Kuszmaul, Keith Randall, and Yuli Zhou of MIT for helpful discussions.

References

[1] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998.

[2] Arvind, Rishiyur S. Nikhil, and Keshav K. Pingali. I-structures: Data structures for parallel computing. ACM Transactions on Programming Languages and Systems, October 1989.

[3] Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Santa Barbara, California, July 1995.

[4] Guy E. Blelloch, Phillip B. Gibbons, Yossi Matias, and Girija J. Narlikar. Space-efficient scheduling of parallelism with synchronization variables. In Proceedings of the Ninth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Newport, Rhode Island, June 1997.

[5] Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995. Also available as an MIT Laboratory for Computer Science Technical Report.

[6] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. An analysis of dag-consistent distributed shared-memory algorithms. In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Padua, Italy, June 1996.

[7] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. Dag-consistent distributed shared memory. In Proceedings of the Tenth International Parallel Processing Symposium (IPPS), Honolulu, Hawaii, April 1996.

[8] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, August 1996.

[9] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), Santa Fe, New Mexico, November 1994.

[10] Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. SIAM Journal on Computing, February 1998.

[11] Robert D. Blumofe and Philip A. Lisiecki. Adaptive and reliable parallel computing on networks of workstations. In Proceedings of the USENIX Annual Technical Conference on UNIX and Advanced Computing Systems, Anaheim, California, January 1997.

[12] Robert D. Blumofe and Dionisios Papadopoulos. The performance of work stealing in multiprogrammed environments. Technical report, Department of Computer Sciences, The University of Texas at Austin, May 1998.

[13] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, April 1974.

[14] F. Warren Burton. Storage management in virtual tree machines. IEEE Transactions on Computers, March 1988.

[15] F. Warren Burton. Guaranteeing good memory bounds for parallel programs. IEEE Transactions on Software Engineering, October 1996.

[16] F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Proceedings of the Conference on Functional Programming Languages and Computer Architecture, Portsmouth, New Hampshire, October 1981.

[17] Guang-Ien Cheng. Algorithms for data-race detection in multithreaded programs. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.

[18] Guang-Ien Cheng, Mingdong Feng, Charles E. Leiserson, Keith H. Randall, and Andrew F. Stark. Detecting data races in Cilk programs that use locks. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998.

[19] David E. Culler and Arvind. Resource requirements of dataflow programs. In Proceedings of the 15th Annual International Symposium on Computer Architecture (ISCA), Honolulu, Hawaii, May 1988. Also available as an MIT Laboratory for Computer Science Computation Structures Group Memo.

[20] Derek L. Eager, John Zahorjan, and Edward D. Lazowska. Speedup versus efficiency in parallel systems. IEEE Transactions on Computers, March 1989.

[21] R. Feldmann, P. Mysliwietz, and B. Monien. Game tree search on a massively parallel system. Advances in Computer Chess 7, 1993.

[22] Mingdong Feng and Charles E. Leiserson. Efficient detection of determinacy races in Cilk programs. In Proceedings of the Ninth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Newport, Rhode Island, June 1997.

[23] Raphael Finkel and Udi Manber. DIB: A distributed implementation of backtracking. ACM Transactions on Programming Languages and Systems, April 1987.

[24] Matteo Frigo. The weakest reasonable memory model. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, January 1998.

[25] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Montreal, Canada, June 1998.

[26] Matteo Frigo and Victor Luchangco. Computation-centric memory models. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998.

[27] R. L. Graham. Bounds for certain multiprocessing anomalies. The Bell System Technical Journal, November 1966.

[28] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, March 1969.

[29] Michael Halbherr, Yuli Zhou, and Chris F. Joerg. MIMD-style parallel programming with continuation-passing threads. In Proceedings of the 2nd International Workshop on Massive Parallelism: Hardware, Software, and Applications, Capri, Italy, September 1994.

[30] Robert H. Halstead, Jr. Implementation of Multilisp: Lisp on a multiprocessor. In Conference Record of the ACM Symposium on Lisp and Functional Programming, Austin, Texas, August 1984.

[31] Chris Joerg and Bradley C. Kuszmaul. Massively parallel chess. In Proceedings of the Third DIMACS Parallel Implementation Challenge, Rutgers University, New Jersey, October 1994.

[32] Christopher F. Joerg. The Cilk System for Parallel Multithreaded Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, January 1996.

[33] Richard M. Karp and Yanjun Zhang. Randomized parallel algorithms for backtrack search and branch-and-bound computation. Journal of the ACM, July 1993.

[34] Bradley C. Kuszmaul. Synchronized MIMD Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1994. Also available as an MIT Laboratory for Computer Science Technical Report.

[35] Philip Lisiecki. Macro-scheduling in the Cilk network of workstations environment. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1996.

[36] Pangfeng Liu, William Aiello, and Sandeep Bhatt. An atomic model for message-passing. In Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Velen, Germany, June 1993.

[37] Eric Mohr, David A. Kranz, and Robert H. Halstead, Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, July 1991.

[38] Vijay S. Pande, Christopher F. Joerg, Alexander Yu Grosberg, and Toyoichi Tanaka. Enumerations of the hamiltonian walks on a cubic sublattice. Journal of Physics A, 1994.

[39] Dionysios P. Papadopoulos. Hood: A user-level thread library for multiprogramming multiprocessors. Master's thesis, Department of Computer Sciences, The University of Texas at Austin, August 1998.

[40] C. Gregory Plaxton. Private communication.

[41] Abhiram Ranade. How to emulate shared memory. In Proceedings of the 28th Annual Symposium on Foundations of Computer Science (FOCS), Los Angeles, California, October 1987.

[42] Keith H. Randall. Cilk: Efficient Multithreaded Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.

[43] Larry Rudolph, Miriam Slivkin-Allalouf, and Eli Upfal. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Hilton Head, South Carolina, July 1991.

[44] Carlos A. Ruggiero and John Sargeant. Control of parallelism in the Manchester dataflow machine. In Functional Programming Languages and Computer Architecture, Lecture Notes in Computer Science, Springer-Verlag, 1987.

[45] Andrew F. Stark. Debugging multithreaded programs that incorporate user-level locking. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.

[46] Mark T. Vandevoorde and Eric S. Roberts. WorkCrews: An abstraction for controlling parallelism. International Journal of Parallel Programming, August 1988.

[47] I-Chen Wu and H. T. Kung. Communication complexity for parallel divide-and-conquer. In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science (FOCS), San Juan, Puerto Rico, October 1991.

[48] Y. Zhang and A. Ortynski. The efficiency of randomized parallel backtrack search. In Proceedings of the Sixth IEEE Symposium on Parallel and Distributed Processing, Dallas, Texas, October 1994.