An Efficient Work-Stealing Scheduler for Task Dependency Graph

2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS) An Efficient Work-Stealing Scheduler for Task Dependency Graph Chun-Xun Lin Tsung-Wei Huang Martin D. F. Wong Department of Electrical Department of Electrical Department of Electrical and Computer Engineering, and Computer Engineering, and Computer Engineering, University of Illinois Urbana-Champaign, The University of Utah, University of Illinois Urbana-Champaign, Urbana, IL, USA Salt Lake City, Utah Urbana, IL, USA [email protected] [email protected] [email protected] Abstract—Work-stealing is a key component of many parallel stealing scheduler is not an easy job, especially when dealing task graph libraries such as Intel Threading Building Blocks with task dependency graph where the parallelism could be (TBB) FlowGraph, Microsoft Task Parallel Library (TPL) very irregular and unstructured. Due to the decentralized ar- Batch.Net, Cpp-Taskflow, and Nabbit. However, designing a correct and effective work-stealing scheduler is a notoriously chitecture, developers have to sort out many implementation difficult job, due to subtle implementation details of concur- details to efficiently manage workers such as deciding the rency controls and decentralized coordination between threads. number of steals attempted by a thief and how to mitigate This problem becomes even more challenging when striving the resource contention between workers. The problem is for optimal thread usage in handling parallel workloads even more challenging when considering throughput and with complex task graphs. As a result, we introduce in this paper an effective work-stealing scheduler for execution of energy efficiency, which have emerged as critical issues in task dependency graphs. Our scheduler adopts a simple and modern scheduler designs [4] [19]. The scheduler’s worker efficient strategy to adapt the number of working threads management can have a huge impact on these issues if it to available task parallelism at any time during the graph is not designed properly. For example, a straightforward execution. Our strategy is provably good in preventing resource method is to keep workers busy in waiting for tasks. underutilization and simultaneously minimizing resource waste when tasks are scarce. We have evaluated our scheduler on both Apparently, this method consumes too much resource and micro-benchmarks and a real-world circuit timing analysis can result in a number of problems, such as decreasing workload, and demonstrated promising results over existing the throughput of co-running multithreaded applications and methods in terms of runtime, energy efficiency, and throughput. low energy efficiency [4] [19]. Several methods have been Keywords-task dependency graph; work stealing; parallel proposed to remedy this deficiency, e.g., making thieves computing; scheduling; multithreading; relinquish their cores before stealing [5] or backing off a certain time [6], [13], or modifying OS kernel to directly I. INTRODUCTION manipulate CPU cores [4]. Nevertheless, these approaches still have drawbacks especially from the standpoints of Work stealing has been proved to be an efficient approach solution generality and performance scalability. for parallel task scheduling on multi-core systems and has In this paper, we propose an adaptive work-stealing sched- received wide research interest over the past few decades uler with provably good worker management for executing [1]–[12]. Several task-based parallel programming libraries task dependency graphs. Our scheduler employs a simple yet and language have adopted work-stealing scheduler as the effective strategy, in no need of sophisticated data structures, runtime for thread management and task dispatch such to adapt the number of working threads to dynamically as Intel Threading Building Blocks (TBB) [13], Cilk [2] generated task parallelism. Not only can our strategy prevent [14], X10 [15] [8], Nabbit [16], Microsoft Task Parallel resource from being underutilized which could degrade the Library (TPL) [17], and Golang [18]. The efficiency of performance, but it also avoids overly subscribing threads the work-stealing scheduler can be attributed to the way it or thieves when tasks are scarce. As a result, we can sig- manages the threads: The scheduler spawns multiple threads nificantly improve the overall system performance including (denoted as workers) on initialization. Each worker first runtime, energy usage, and throughput. We summarize our carries out tasks in its private queue. Then, a worker without contributions as follows: available tasks becomes a thief and randomly steals tasks • from others. By having thieves actively steal tasks, the An adaptive scheduling strategy: We developed an adap- scheduler is able to balance the workload and maximize tive scheduling strategy for executing task dependency the performance. However, implementing an efficient work- graph. The strategy is simple and easy to implement in a standalone fashion. It does not require sophisticated data structures or hardware-specific code. A wide range of parallel processing libraries and runtimes that leverage 978-1-7281-9074-7/20/$31.00 ©2020 IEEE 64 DOI 10.1109/ICPADS51040.2020.00018 work-stealing can benefit from our scheduler designs. execution only after all its predecessors finish. The execution • Provably good worker management: We proved of a task dependency graph begins from source nodes (nodes our scheduler can simultaneously prevent the under- without predecessors, e.g. node A) and ends when all sink subscription problem and the over-subscription problem nodes (nodes without successors, e.g. node G) finish. during the graph execution. When tasks are scarce, the Our scheduler consists of a set of workers, a master queue, number of wasteful thieves will drop within a bounded a lock and a notifier. On initialization the scheduler spawns interval to save resources for other threads both inside workers waiting for tasks. Each worker is equipped with a and outside the application. private queue to store tasks ready for execution. To begin • Improved system performance: We showed through bal- executing a task graph, users insert source nodes to the ancing working threads on top of dynamically generated master queue and notify workers via the notifier. Algorithm 1 task parallelisms can improve both application perfor- is the pseudo code of task insertion from users. A lock mance and overall system performance including runtime, (line 1) is used to protect the master queue for concurrent energy efficiency, and throughput. Improved system per- submissions of task graphs. When a worker completes a task, formance translates to better scalability. it automatically decrements the dependency of successive We evaluated the proposed scheduler on both micro- tasks and pushes new tasks to its queue whenever depen- benchmarks and real-world applications, and we show that dency is met. Each worker keeps one local cache of task for on a very large scale integration (VLSI) timing analyzer continuation. benchmark our scheduler can achieve up to 15% less runtime We implement the queue based on the Chase-Lev algo- and 36% less energy consumption over a baseline algorithm. rithm [20] [21]. Each worker can add and pop task tasks from one end of its queue, while other workers can steal II. ADAPTIVE WORK-STEALING SCHEDULER tasks from the other end. The notifier is a two-phase In this section we first introduce the task dependency synchronization protocol that efficiently supports: (1) putting graph model and present the details of the proposed work- a worker into sleep and (2) waking up one or all sleeping stealing scheduler. Then, we provide an analysis of our workers. We leverage the EventCount construct from the worker management to show its efficiency. Eigen library [22] as the notifier in our scheduler. The usage of EventCount is similar to a condition variable. A. Scheduler Overview The notifying thread sets a condition to true and then signals waiting workers via EventCount. On the other side, a Task Dependency Worker Queue worker first checks the condition and returns to work if A Push the condition is true. Otherwise, the worker updates the EventCount to indicate it is waiting and checks the C B Push Pop Pop condition again. If the condition is still false, the worker E is suspended via the EventCount. D G F Steal Steal B. Worker Management Each worker iterates the following two phases: Figure 1. A task dependency graph and the architecture of our scheduler. 1) Task exploitation: The worker repeatedly executes tasks from its queue and exploits new successor tasks whenever dependency is met. Algorithm 1: Task insertion from users 2) Task exploration: The worker drains out its queue and Input: tasks: source tasks turns into a thief to explore tasks by randomly stealing tasks from peer workers. If the worker successfully 1 lock(); steals a task, it returns to the task exploitation phase, 2 for t in tasks do or becomes a sleep candidate to wait for future tasks. 3 master queue.push(t); 4 end Algorithm 2 is the pseudo code of a worker’s top- 5 unlock(); level control flow. The exploit_task (line 5) and 6 notifier.notify one(); wait_for_task (line 6) functions correspond to the two phases, respectively. The worker exits the control flow when the scheduler terminates (wait_for_task returns Figure 1 shows a task dependency graph (left) and the false). Inside these two phases, our scheduler uses two architecture of the proposed scheduler (right). A task de- variables: num_actives

An Efficient Work-Stealing Scheduler for Task Dependency Graph

8A. Going Parallel

Correct and Efficient Work-Stealing for Weak Memory Models

Intel Threading Building Blocks

Lecture 15-16: Parallel Programming with Cilk

Intel Threading Building Blocks (Mike Voss)

Work Stealing for Interactive Services to Meet Target Latency ∗

Threading Building Blocks and Openmp Implementation Issues

Scheduling Parallel Programs by Work Stealing with Private Deques

Asymmetry-Aware Work-Stealing Runtimes

Randomized Work Stealing for Large Scale Soft Real-Time Systems

Parallelization in Rust with Fork-Join and Friends Creating the Forkjoin Framework

Work-Stealing Without the Baggage ∗