Scheduling Parallel Programs by Work Stealing with Private Deques
Umut A. Acar, Department of Computer Science, Carnegie Mellon University, [email protected]
Arthur Charguéraud, Inria Saclay – Île-de-France & LRI, Université Paris Sud, CNRS, [email protected]
Mike Rainey, Max Planck Institute for Software Systems, [email protected]

Abstract

Work stealing has proven to be an effective method for scheduling fine-grained parallel programs on multicore computers. To achieve high performance, work stealing distributes tasks between concurrent queues, called deques, assigned to each processor. Each processor operates on its deque locally except when performing load balancing via steals. Unfortunately, concurrent deques suffer from two limitations: 1) local deque operations require expensive memory fences on modern weak-memory architectures; 2) they can be very difficult to extend to support various optimizations and the flexible forms of task distribution needed by many applications, e.g., those that do not fit nicely into the divide-and-conquer, nested data-parallel paradigm.

For these reasons, there has been much recent interest in implementations of work stealing with non-concurrent deques, where deques remain entirely private to each processor and load balancing is performed via message passing. Private deques eliminate the need for memory fences in local operations and enable the design and implementation of efficient techniques for reducing task-creation overheads and improving task distribution. These advantages, however, come at the cost of communication. It is not known whether work stealing with private deques enjoys the theoretical guarantees of concurrent deques, or whether it can be effective in practice.

In this paper, we propose two work-stealing algorithms with private deques and prove that they guarantee bounds similar to those of work stealing with concurrent deques. For the analysis, we use a probabilistic model and consider a new parameter, the branching depth of the computation. We present an implementation of the algorithm as a C++ library and show that it compares well to Cilk on a range of benchmarks. Since our approach relies on private deques, it enables flexible task-creation and task-distribution strategies. As a specific example, we show how to implement task coalescing and steal-half strategies, which can be important in fine-grained, non-divide-and-conquer algorithms such as graph algorithms, and apply them to the depth-first-search problem.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors – Run-time environments

Keywords: work stealing, nested parallelism, dynamic load balancing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PPoPP'13, February 23–27, 2013, Shenzhen, China. Copyright © 2013 ACM 978-1-4503-1922/13/02. $10.00

1. Introduction

As multicore computers (computers with chip multiprocessors) become mainstream, techniques for writing and executing parallel programs have become increasingly important. By allowing parallel programs to be written in a style similar to sequential programs, and by generating a plethora of parallel tasks, fine-grained parallelism, as elegantly exemplified by languages such as Cilk [21], NESL [5], parallel Haskell [30], and parallel ML [20], has emerged as a promising technique for parallel programming [31].

Executing fine-grained parallel programs with high performance requires overcoming a key challenge: efficient scheduling. Scheduling costs include the cost of creating a potentially very large number of parallel tasks, each of which can contain a tiny amount of actual work (e.g., a thousand cycles), and of distributing those tasks among the available processors so as to minimize the total run time. Over the course of the last decade, the randomized work-stealing algorithm popularized by Cilk [7, 21] has emerged as an effective scheduler for fine-grained parallel programs. (The idea of work stealing goes back to the 80's [9, 23].)

In work stealing, each processor maintains a deque (doubly-ended queue) of ready tasks to execute. By operating at the "bottom" end of its own deque, each processor treats its deque as a stack, mimicking sequential execution as it works locally. When a processor finds its deque empty, it acts globally by stealing the task at the top end of the deque of a victim, a randomly chosen processor. In theory, work stealing delivers close to optimal performance for a reasonably broad range of computations [8]. Furthermore, these theoretical results can be matched in practice by carefully designing scheduling algorithms and data structures. Key to achieving practical efficiency is a non-blocking deque data structure that prevents contention during concurrent operations. Arora et al. [2] proposed the first such data structure, for fixed-size deques. Hendler et al. [24] generalized that data structure to support unbounded deques; however, their algorithm could result in memory leaks. Chase and Lev [10] used circular buffers to obtain deques whose size can grow without memory leaks. Most current parallel programming systems that support this form of work stealing critically rely on these data structures.

While this style of randomized work stealing has been shown to be effective in many applications, previous research has also identified both algorithmic and practical limitations. A number of studies show that, in scheduling fine-grained applications, it is important to be flexible in choosing the task(s) to transfer during a steal. These studies show both experimentally [12, 15, 25, 36, 37] and theoretically [3, 34, 35] that, for certain important irregular graph computations, it is advantageous to transfer not just a single task at every steal, but multiple tasks. For instance, previous research found that the steal-half strategy [25], where a steal transfers half of the tasks from the victim's deque, can be more effective [12, 15, 37] than the "steal-one" approach. Another important practical limitation concerns the interaction between the concurrent deque data structures used to implement work stealing and modern memory models, which provide increasingly weak consistency guarantees. On weak memory models, concurrent deques require expensive memory fences, which can degrade performance significantly [17, 21, 32]; for example, Frigo et al. found that Cilk's work-stealing protocol spends half of its time executing the memory fence [21]. The final limitation concerns flexibility and generality. Due to their inherent complexity, non-blocking deques are difficult to extend to support sophisticated algorithms for creating and scheduling parallel tasks. For example, Hiraishi et al. [27] and Umatani et al. [42] used non-concurrent, private deques to implement their techniques for reducing task-creation overheads; Hendler et al.'s non-blocking, concurrent deques for steal-half work stealing require asymptotically non-constant atomic operations and work only for bounded, fixed-size deques [25]; Cong et al. found that a batching technique can reduce task-creation overheads, but were unable to combine it with the flexibility of the steal-half strategy [12] using private deques.

Due to these limitations, there has been much interest in work-stealing algorithms where non-concurrent, private deques replace the concurrent, shared deques, and processors explicitly communicate to balance load. In such algorithms, each processor keeps its deque private, operating on its bottom end as usual. When a processor finds its deque empty, instead of manipulating a remote deque concurrently, it sends a message to a randomly chosen victim processor and awaits a response, which either includes one or more tasks or indicates that the victim's deque is empty. In order to respond to messages, each processor periodically polls, often driven by calls from the scheduler.

Because our algorithms use only private deques, they eliminate memory fences from the common scheduling path. To balance load, they rely primarily on explicit communication between processors. To make this communication efficient, our algorithms exploit a key property of fine-grained parallel programs: calls into the scheduler are frequent because tasks are naturally small. Thus, most of the polling needed for communication can be performed by the scheduler without (hardware or software) interrupts, which, depending on the platform, may not be deliverable frequently and cheaply enough. Although the treatment of parallel programs with coarse granularity is outside the scope of this paper, we report on some preliminary investigations into the use of interrupts to make our algorithms robust in the face of potentially large tasks.

We give a proof of efficiency for both the sender- and receiver-initiated work-stealing algorithms using private deques. For our analysis, we consider a probabilistic model that takes into account the delays due to the interval between polling operations. We present a bound in terms of the work and depth (the traditional parameters used in the analysis of work stealing) and a new parameter, the branching depth, which measures the maximal number of branching nodes along a path in the computation DAG. The branching depth is similar to the traditional notion of depth but is often significantly smaller, because it counts only the fork nodes along a path, ignoring the sequential work performed in between. This parameter enables us to bound tightly the effect of polling delays on performance, showing that the algorithm performs close to a greedy scheduler even when the communication delay is quite large. Due to space restrictions we could not include all details of the proof, which can be found in the accompanying technical report, accessible online [1].