A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores

Marc Tchiboukdjian, Vincent Danjean, Thierry Gautier, Fabien Le Mentec, Bruno Raffin

Shared Cache of Multicore Processors

Three cache organizations:
1. One core with two cache levels: Core, L1, L2.
2. Multiple cores with private caches: each core has its own L1 and L2.
3. Multiple cores with private caches and one shared cache: each core has its own L1 and L2, and all cores share an L3.

With a shared cache, the sequential execution has an unfair advantage: it gets the whole shared cache to itself. Can we still get linear speedup?

Scheduling for Efficient Shared Cache Usage

Schedule the computation so that shared cache misses do not increase,
- with a work stealing scheduler allowing efficient dynamic load balancing,
- for parallel loops: no dependencies between tasks.

Target machine: Xeon Nehalem E5530 (2.4 GHz), 4 cores, each with a 32 KB L1 and a 256 KB L2, sharing an 8 MB L3.

Overview: Cache Efficient Work Stealing Scheduling for Parallel Loops

1. Standard Schedulers for Parallel Loops
2. New Scheduler Optimized for Shared Cache
3. Efficient Implementation of the Scheduler
4. Experiments
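As a quick aside (not from the talk): the cache hierarchy above can be checked on a given Linux machine. A minimal sketch, assuming glibc, whose _SC_LEVEL*_CACHE_SIZE sysconf names are nonstandard extensions:

    #include <stdio.h>
    #include <unistd.h>

    /* Print the cache sizes glibc reports; sysconf returns 0 or -1
       when a level is unknown on the running machine. */
    int main(void) {
        printf("L1 data cache: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
        printf("L2 cache:      %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
        printf("L3 cache:      %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
        printf("online cores:  %ld\n",       sysconf(_SC_NPROCESSORS_ONLN));
        return 0;
    }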
1. Standard Schedulers for Parallel Loops

Parallel Loops

Examples of parallel loops:
- OpenMP: #pragma omp parallel for
- TBB: parallel_for
- Cilk: parallel for
- Parallel STL: for_each or transform

Problem characteristics:
- Schedule n iterations on p cores.
- Iterations can be processed independently.
- The time to process one iteration can vary.

Static Scheduling of Parallel Loops

- Allocate n/p iterations to each core (e.g., OpenMP static scheduling).

Characteristics:
- Low overhead mechanism.
- Bad load balancing if the workload is irregular.

Dynamic Scheduling of Parallel Loops

OpenMP dynamic scheduling:
- Allocate iterations in chunks of size g.
- All chunks are stored in a centralized list.
- Each thread removes a chunk from the list and processes it.

Characteristics:
- Good load balancing.
- Contention on the list.
- Chunk creation overhead.

Scheduling Parallel Loops with Work Stealing

Work stealing:
- Each thread has its own list of tasks (= chunks).
- If its list is empty, a thread steals tasks from a randomly selected list.
- A binary tree of tasks minimizes the number of steals: one steal transfers half of the victim's iterations.

Characteristics:
- Good load balancing.
- Reduced contention.
- Task creation overhead.

Example of recursive splitting: [0, 127] splits into [0, 63] and [64, 127]; while the owner runs [0, 15] (with [16, 31] and [32, 63] ready), a thief steals [64, 127] and splits it in turn into [64, 95] and [96, 127], then [64, 79] and [80, 95]. (Slide legend: terminated, ready, and running tasks.)

Parallel Loops with Xkaapi

- Xkaapi is a work stealing library.
- Tasks are created only on a steal, which reduces task creation overhead.
- Cooperative stealing: the victim stops working to answer work requests.
- The victim can answer multiple requests at a time.

Pseudocode (cleaned from the slide):

    typedef struct {
        InputIterator ibeg;
        InputIterator iend;
    } Work_t; // Task

    void parallel_for(...) {
        while (iend != ibeg)
            do_work(ibeg++);
        // no more work -> become a thief
    }

    void splitter(num_req) {
        // divide the victim's remaining range among victim + num_req thieves
        size      = victim.iend - victim.ibeg;
        bloc      = size / (num_req + 1);
        local_end = victim.iend;
        while (num_req > 0) {
            thief->iend = local_end;
            thief->ibeg = local_end - bloc;
            local_end  -= bloc;
            --num_req;
        }
        victim.iend = local_end; // the victim keeps the front of the range
        // victim + thieves -> parallel_for
    }

Characteristics:
- Good load balancing.
- Low overhead mechanism.
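The splitter's arithmetic can be checked outside any runtime. A self-contained sketch with plain integer ranges standing in for Xkaapi's iterators (range_t, split, and the driver are illustrative names, not library API):

    #include <stdio.h>

    /* Toy stand-in for the slide's Work_t: a half-open range [ibeg, iend). */
    typedef struct { long ibeg, iend; } range_t;

    /* Mirror the slide's splitter: the victim and num_req thieves each get
       size / (num_req + 1) iterations, carved off the end of the range. */
    static void split(range_t *victim, range_t *thieves, int num_req) {
        long size      = victim->iend - victim->ibeg;
        long bloc      = size / (num_req + 1);
        long local_end = victim->iend;
        for (int i = 0; i < num_req; i++) {
            thieves[i].iend = local_end;
            thieves[i].ibeg = local_end - bloc;
            local_end      -= bloc;
        }
        victim->iend = local_end; /* victim keeps the front of the range */
    }

    int main(void) {
        range_t victim = { 0, 128 };
        range_t thieves[3];
        split(&victim, thieves, 3);
        printf("victim: [%ld, %ld)\n", victim.ibeg, victim.iend);
        for (int i = 0; i < 3; i++)
            printf("thief %d: [%ld, %ld)\n", i, thieves[i].ibeg, thieves[i].iend);
        return 0;
    }

For a victim holding [0, 128) and three requests, each party ends up with 32 iterations: the victim keeps [0, 32) and the thieves receive [96, 128), [64, 96), and [32, 64).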
2. New Scheduler Optimized for Shared Cache

Reuse Distances [Beyls and D'Hollander 01]

- The reuse distance of an access is the number of distinct elements accessed since the previous access to the same element.
- For a first access, the reuse distance is infinite.
- On a fully associative LRU cache of size C: reuse distance $\le C$ $\Rightarrow$ hit; reuse distance $> C$ $\Rightarrow$ miss.
- Let $h_d$ be the number of accesses with reuse distance $d$.
- Number of cache misses: $M(C) = \sum_{d=C+1}^{\infty} h_d$.

Example (cache of size C = 2; cache contents shown before each access):

    element access   A   B   A     B     B     C     B     A     C
    reuse distance   ∞   ∞   2     2     1     ∞     2     3     3
    cache content    ∅   A   A,B   A,B   A,B   A,B   B,C   B,C   A,B

Shared Cache Misses with the Classic Schedule

- Cores work on elements far away from each other in the iteration space.
- The good temporal locality of the sequential algorithm then means the cores work on distinct data.
- To one access by a core correspond p − 1 accesses to distinct data by the other cores, so every reuse distance is multiplied by p: $h^{par}_{p \cdot d} = h^{seq}_d$.
- Number of cache misses:
  $M_{par}(C) = \sum_{d=C+1}^{\infty} h^{par}_d = \sum_{d=C+1}^{\infty} h^{seq}_{d/p} = \sum_{d=C/p+1}^{\infty} h^{seq}_d = M_{seq}(C/p)$
- The shared cache behaves as if it were p times smaller: all sequential accesses with reuse distance in (C/p, C] become additional misses.

A Shared Cache Aware Schedule

- Make the cores work at distance at most m in the iteration space.
- Let r(m) be the number of accesses to distinct elements in a window of size m.
- Each reuse distance grows by at most r(m): $h^{win}_{d+r(m)} = h^{seq}_d$.
- Number of cache misses:
  $M_{win}(C) \le \sum_{d=C+1-r(m)}^{\infty} h^{seq}_d = M_{seq}(C) + \sum_{d=C+1-r(m)}^{C} h^{seq}_d$

Small overhead over the sequential execution: good sequential locality keeps r(m) small and makes $h^{seq}_d$ small for large d, so the additional term is small.

3. Efficient Implementation of the Scheduler

Implementing the Shared Cache Aware Schedule

Using a standard parallel for loop: StaticWindow
- Divide the iterations into n/m chunks of size m.
- Each chunk is processed in parallel with a standard parallel for.
- Two versions: pthread and Xkaapi.

Pthread version:
- Each thread processes 1/p of a chunk.
- All threads wait on a barrier after each chunk.

Xkaapi version:
- Each chunk is processed in parallel with a work-stealing parallel for.

Optimized implementation using Xkaapi: SlidingWindow
- Processing iteration i enables iteration i + m.
- The master thread works at the beginning of the sequence.
- On a steal, the master can give work
  - in the interval [ibeg, iend), like the other workers, or
  - in the interval [ilast, ibeg + m), enabled since the last steal.
- Layout along the sequence: processed elements | master work [ibeg, iend) | stolen work, ending at ilast | remaining elements, with the m-size window starting at ibeg.

    typedef struct {
        InputIterator ibeg;
        InputIterator iend;
    } Work_t; // Task

    typedef struct {
        InputIterator ibeg;
        InputIterator iend;
        InputIterator ilast; // [ilast, ibeg + m) enabled since the last steal
    } Master_Work_t; // Master Task
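How the master uses its extra interval when answering a steal is not spelled out beyond the bullets above; the following is a hypothetical sketch (plain long ranges, invented names, edge cases omitted), not Xkaapi's actual code:

    /* Hypothetical illustration of the SlidingWindow idea. */
    typedef struct { long ibeg, iend; } range_t;

    typedef struct {
        long ibeg, iend;  /* master's own work, like any worker  */
        long ilast;       /* end of the region handed out so far */
        long m;           /* window size                         */
    } master_t;

    /* Answer one steal request from the master's state: prefer the
       iterations enabled since the last steal, otherwise split the
       master's own range like a regular victim would. */
    static range_t master_answer_steal(master_t *w) {
        range_t r;
        long window_end = w->ibeg + w->m; /* nothing beyond this is enabled */
        if (w->ilast < window_end) {
            /* hand out the newly enabled interval [ilast, ibeg + m) */
            r.ibeg   = w->ilast;
            r.iend   = window_end;
            w->ilast = window_end;
        } else {
            /* split own range [ibeg, iend), giving away the back half */
            long bloc = (w->iend - w->ibeg) / 2;
            r.ibeg  = w->iend - bloc;
            r.iend  = w->iend;
            w->iend = r.ibeg;
        }
        return r;
    }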
4. Experiments

Application: Isosurface Extraction

[Figure: number of cache misses (×10^8, roughly 0.4 to 1.2) as a function of the cache size C, comparing M_seq(C/p) with M_seq(C); vertical marks at 2 MB and 8 MB.]

Isosurface extraction:
- A common scientific visualization filter.
- Memory bounded.

Algorithm:
- Iterate through all the cells in the mesh.
- Interpolate the surface inside each cell.
- Cells close in the mesh share points.
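To tie the application back to Section 3, here is a minimal sketch of the StaticWindow pthread scheme driving such a loop; process_cell is a hypothetical stand-in for the per-cell interpolation, and P and M are arbitrary choices:

    #include <pthread.h>
    #include <stdio.h>

    #define P 4        /* worker threads, matching the 4-core machine    */
    #define M 4096     /* window size m, to be tuned to the shared cache */

    enum { N = 1 << 20 };               /* total number of cells */
    static double cells[N];             /* one result per cell   */
    static pthread_barrier_t bar;

    /* Hypothetical per-cell work; the real filter interpolates a surface. */
    static void process_cell(long i) { cells[i] = 0.5 * (double)i; }

    /* StaticWindow, pthread flavor: thread tid processes 1/P of every
       M-size chunk, then waits on a barrier so all threads stay inside
       the same window of the iteration space. */
    static void *worker(void *arg) {
        long tid = (long)arg;
        for (long beg = 0; beg < N; beg += M) {
            long end   = beg + M < N ? beg + M : N;
            long share = (end - beg + P - 1) / P;
            long lo    = beg + tid * share;
            long hi    = lo + share < end ? lo + share : end;
            for (long i = lo; i < hi; i++)
                process_cell(i);
            pthread_barrier_wait(&bar);
        }
        return NULL;
    }

    int main(void) {
        pthread_t th[P];
        pthread_barrier_init(&bar, NULL, P);
        for (long t = 0; t < P; t++)
            pthread_create(&th[t], NULL, worker, (void *)t);
        for (long t = 0; t < P; t++)
            pthread_join(th[t], NULL);
        pthread_barrier_destroy(&bar);
        printf("cells[N-1] = %f\n", cells[N - 1]);
        return 0;
    }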