
Nowa: A Wait-Free Continuation-Stealing Concurrency Platform

Florian Schmaus, Nicolas Pfeiffer, Wolfgang Schröder-Preikschat
Friedrich-Alexander-University Erlangen-Nürnberg (FAU), Erlangen, Germany
{schmaus,pfeiffer,wosch}@cs.fau.de

Timo Hönig
Ruhr University Bochum (RUB), Bochum, Germany
[email protected]

Jörg Nolte
Brandenburg University of Technology Cottbus-Senftenberg (BTU), Cottbus, Germany
[email protected]

Abstract—It is an ongoing challenge to efficiently use parallelism with today's multi- and many-core processors. Scalability becomes more crucial than ever with the rapidly growing number of processing elements in many-core systems that operate in data centres and embedded domains. Guaranteeing scalability is often ensured by using fully-strict fork/join concurrency, which is the prevalent approach used by concurrency platforms like Cilk. The runtime systems employed by those platforms typically resort to lock-based synchronisation due to the complex interactions of data structures within the runtime. However, locking limits scalability severely. With the availability of commercial off-the-shelf systems with hundreds of logical cores, this is becoming a problem for an increasing number of systems.

This paper presents Nowa, a novel wait-free approach to arbitrate the plentiful concurrent strands managed by a concurrency platform's runtime system. The wait-free approach is enabled by exploiting inherent properties of fully-strict fork/join concurrency, and hence is potentially applicable for every continuation-stealing runtime system of a concurrency platform. We have implemented Nowa and compared it with existing runtime systems, including Cilk Plus and Threading Building Blocks (TBB), which employ a lock-based approach. Our evaluation results show that the wait-free implementation increases the performance by up to 1.64x compared to lock-based ones, on a system with 256 hardware threads. The performance increased by 1.17x on average, while no benchmark but one exhibited a performance regression. Compared against OpenMP tasks using Clang's libomp, Nowa outperforms OpenMP by 8.68x on average.

Index Terms—, concurrency platform, wait-free

[Figure 1: Comparison of runtime systems of different concurrency platforms (speedup of Nowa, Fibril, Cilk Plus, and TBB on the nqueens benchmark, mean serial time T_S = 5.004 ± 0.001 s, for up to 256 threads). Our wait-free Nowa system exhibits better speedup than the lock-based ones.]

I. INTRODUCTION

Driven by the rapidly rising number of computation cores, the way programmers can express concurrency in their code, and how this concurrency is efficiently transformed into parallelism, becomes decisive. Since it is cumbersome to introduce support for parallelism into serial programming languages [1], new programming languages are designed with primitives supporting parallelism from the start. As a result, concurrency control is part of the language specification. An example of this is the Go programming language. Besides the programming-language layer, Go also includes a runtime system, which is the essential driver of the parallel execution of the concurrency expressed in Go programs. The combination of a parallel programming language and an accompanying runtime system is called a concurrency platform.

The choice of concurrency platform has a direct impact on the execution of a parallel application. Figure 1 shows a comparison of concurrency platforms, including established ones like Cilk Plus or Threading Building Blocks (TBB). However, the performance of existing concurrency platforms is often limited by the lock-based synchronisation used within the platform's runtime system. In contrast, as can be seen in Figure 1, our Nowa runtime system unlocks additional performance on systems with hundreds of hardware threads, by combining our wait-free [2] approach with a work-stealing queue algorithm optimised for scalability. Given the growing number of cores at the hardware level, it is also safe to assume that the relevance of concurrency platforms will continue to increase in the future.

All concurrency platforms' joint property is that they consist of a programming-language layer and a runtime system. The programming-language layer's goal is to reduce the developer's burden of concurrent programming. This is done by providing various language constructs to define and coordinate concurrent tasks. Those constructs may deliberately limit the amount of concurrency that can unfold at a given point. For example, fully-strict concurrent computations [3] require all child tasks to finish before the parent task itself can finish.1

1 Contemporarily sometimes referred to as "Structured Concurrency".

© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Schmaus et al. "Nowa: A Wait-Free Continuation-Stealing Concurrency Platform". In: 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). May 2021. URL: https://ieeexplore.ieee.org/document/9460452 doi: 10.1109/IPDPS49936.2021.00044

While this may be seen as an undesirable restriction at first, it does not practically limit the achievable parallelism of an application, if the application contains a sufficient level of concurrency. It does, however, simplify reasoning about the program, thus enabling the derivation of space-, time- and communication-bounds [4].

  unsigned fib(unsigned n) {
    if (n < 2) return n;
    unsigned a = spawn fib(n - 1);
    unsigned b = fib(n - 2);
    sync;
    return a + b;
  }

Listing 1: Fibonacci function with concurrency keywords

Listing 1 shows keywords that a concurrency platform may introduce. Here, the implementation of the calculation of the n-th Fibonacci number is extended by two additional keywords: spawn and sync. Call expressions annotated with spawn indicate that the caller may continue to execute in parallel with the callee. A function whose body contains a spawn statement is called a spawning function. All potentially spawned tasks of a spawning function join at the sync expression (or at the return from the function). However, the spawn keyword merely expresses the potential for parallelism. The decision to lift this annotated concurrency into real parallelism can be made dynamically at runtime.

If fib() were naively implemented using POSIX threads, then the large amount of created threads would probably harm the system's performance. In contrast, when a concurrency platform is used to parallelise a computation, the programming-language layer and the runtime system cooperate to make ideal use of the available computational resources, without causing overutilisation of the same. This collaboration enables dynamic task parallelism and is a decisive characteristic of some concurrency platforms. This relieves developers from pondering about the negative impact of parallel over-decomposition. Examples of concurrency platforms include Cilk-5 [5], MIT's OpenCilk [6], Cilk Plus [7], OpenMP, TBB, X10 [8], and Go. They usually structure parallelism using the fork/join model and employ randomised work-stealing [4].

This paper presents Nowa, our non-blocking wait-free approach for task coordination in continuation-stealing concurrency platforms. To the best of our knowledge, this is the first wait-free approach. Our implementation of the Nowa runtime system employs randomised work-stealing, allows for dynamic task parallelism, and employs a practical solution to the cactus stack problem [9]. In our empirical study, the Nowa runtime system yielded promising results of up to 1.64x the performance of lock-based runtime systems, on a system with 256 hardware threads.

The remainder of this paper is organised as follows: The next section provides background information about the basic building blocks of concurrency platforms and related work. In Section III we describe the mechanics of continuation-stealing concurrency platforms, which are essential to understand our approach. In Section IV we present our wait-free approach for continuation-stealing platforms. We present an evaluation of our approach in Section V and conclude in Section VI.

II. BACKGROUND

The properties of randomised work-stealing make it particularly appealing for concurrency platforms. In contrast to scheduling with a global "ready-list", work-stealing provides every worker with a private queue, which other workers can steal from once their own queue runs out of work. This way, conflicts due to scheduling are distributed over workers, reducing contention and preventing a single hotspot from forming and becoming a bottleneck. Furthermore, there are proven bounds on computation and space requirements, guaranteeing linear speedup for an ideal work-stealing scheduler [4].

The concrete incarnation of a worker depends on the kind of concurrency platform. If the platform's runtime system solely resides in user-space, which is currently the case for most platforms, then workers are implemented based on kernel-level threads. If the platform is based on a spatially-multiplexing operating system [10], then workers are CPU cores that are exclusively assigned to the application.

A. Work-Stealing Queues

While every fully-synchronised queue could be used for work-stealing, a queue algorithm exploiting the unique properties of work-stealing is able to increase scheduling performance by reducing contention. We will refer to such queues as work-stealing queues.

One can summarise the common properties of work-stealing queues as follows: First, they are double-ended queues (dequeues), where one end is referred to as bottom and the other as top. The operations provided by work-stealing queues on their ends are not symmetric: While it is possible to remove items from both ends, it is only possible to append at the bottom of the queue. The bottom end is used exclusively by the queue's worker (remember that there is a queue for every worker) in a stack-like manner, where items are pushed to and popped. Consequently, the operations are named pushBottom() and popBottom(). Work-stealing attempts are solely performed on the other end of the queue, the top end. Thieves use the popTop() operation to steal work from other workers' queues, the victims. Figure 2 depicts the anatomy of work-stealing queues for thief and victim.

[Figure 2: Workers and associated work-stealing queues. Two workers, W1 and W2, own the queues Q_W1 and Q_W2; the owner operates on the bottom end with pushBottom() and popBottom(), while thieves "steal" from the top end with popTop().]

All queue operations must be partially multithread-safe. That is, popTop() can be used concurrently with itself and at most one of the two bottom operations on the same queue instance. However, since pushBottom() and popBottom() are only ever used by the same worker, work-stealing queues do not need to support concurrent bottom operations. Work-stealing queue algorithms exploit these properties.

B. Child-Stealing versus Continuation-Stealing

One of the most important aspects of a concurrency platform is the employed work-stealing scheme [11]. It can be either child-stealing or continuation-stealing (Figure 3), and influences, among other things, the order in which tasks are executed.

In child-stealing, a spawning function makes its child tasks available to be stolen. The example in Figure 3b shows that, at the fork point, W1 pushes its child task τ_c into its work-stealing queue and continues executing τ_p. This allows W2 to steal the child task τ_c from W1. While this scheme is very intuitive, it has multiple drawbacks. First, due to the potentially large amount of child tasks, those may need to be dynamically allocated. Hence the runtime system inherits the performance characteristics of the dynamic memory allocator, which often employs locks. Secondly, child-stealing tends to cause more frequent stack switches, which are expensive.

With continuation-stealing, the runtime system offers the continuation after the spawned child task to thieves. While this may appear as a subtle change, it has a strong impact. Most notably, if the continuation is not stolen, which is the typical case, the worker can proceed without a stack switch (Figure 3c). Furthermore, since, in fully-strict continuation-stealing, there is at most one pending continuation per spawning function, the storage for the continuation instance can be allocated on the stack without requiring the dynamic memory allocator. Figure 1 shows an example where a platform that employs child-stealing, TBB, performs much worse than the continuation-stealing ones.

[Figure 3: Child-Stealing versus Continuation-Stealing. (a) Parent task τ_p, executing "void parent() { ... spawn child(); ... sync; ... }", its continuation τ′_p, and the child τ_c; (b) child-stealing, with the cases "child stolen" and "child not stolen"; (c) continuation-stealing, with the cases "continuation stolen" and "continuation not stolen". Stack switches occur in the cases marked accordingly.]

C. The Cactus-Stack Problem

The last missing central piece of fork/join concurrency platforms is the cactus stack [12], a special parallel stack structure that forms as a result of the parent-child relationship between function instances and their stack frames, respectively. Stack frames of forked-off functions are allocated on separate linear stacks than their parent stack frame, but maintain a reference to it. This forms a tree-like data structure, where the linear stacks are the nodes, and the references are the edges.

Suspended stack frames (stack frames of function instances waiting at a sync point) at the bottom of a linear stack cause the stack to be blocked. Consequently, the worker executing on the blocked stack has to use a new empty stack for executing another task, until the sync condition of the suspended stack frame is fulfilled and the frame's execution can continue. In theory, this can lead to an impractically large stack space (S_P) consumption, with S_P ≤ D·P·S_1 pages of physical memory for a given program, where P is the number of threads used, S_1 is the maximum number of pages of stack space used by the serial execution of the program, and D is the forking depth, the maximum possible number of spawn points on any path from the root to a leaf in the cactus stack.

Attempts to find a solution for the cactus stack with a tighter bound for memory consumption, however, resulted in either sacrificing performance or interoperability between serial and parallel code. Finding a solution that satisfies all three criteria – low memory footprint, high speedup, and interoperability between serial and parallel code – is known as the cactus stack problem [13].

D. Related Work

To solve the cactus-stack problem, Yang and Mellor-Crummey presented a "practical solution" to the cactus stack problem, which they implemented in their runtime system Fibril [9]. Whenever a stack frame is suspended, the operating system is informed that the stack space below the suspended stack frame is not used anymore and physical pages are freed, while the virtual mapping is preserved, thus lowering only the physical space consumption. Cilk-5 [5] forwent serial-parallel interoperability to limit space consumption in favour of good performance. Cilk Plus [7] does not limit interoperability, but the number of usable stacks, thereby preventing workers from stealing once the limit is reached. Cilk-M [13] uses thread-local memory mappings to build cactus stacks that workers perceive as linear stacks.

Tapir [14] pushes the knowledge about concurrency established by fork-join parallelism into the compiler's intermediate representation to enable further optimizations. Acar et al. present a dynamic Scalable Non-Zero Indicator to coordinate nested parallelism [15]. While their approach is not limited to fully-strict parallelism, it depends on dynamic memory allocation.

One of the earliest described work-stealing queue algorithms is the THE queue2 used by Cilk-5 [5]. It is implemented as a bounded buffer with top and bottom indices. The THE queue already exploited the partial thread-safety requirement to elide locking if the queue's top and bottom pointers are non-conflicting, that is, if the top and bottom indices do not refer to the same queue element. Later, Arora et al. presented a non-blocking work-stealing queue algorithm, henceforth referred to as ABP queue [16]. The ABP queue uses atomic compare-and-swap (CAS) instructions to synchronise concurrent queue operations lock-free. Every popTop() attempt requires an atomic operation to arbitrate the access.

2 This queue algorithm is referred to as "THE protocol" in the according publication [5]. THE is an acronym for Tail, Head, and Exception.

THE queue can elide locking, the ABP queue is able to omit are represented by spawn vertices. Just like spawn at the the CAS instruction for popBottom(). Finally, pushBottom() programming-language level, a spawn vertex merely denotes never requires any atomic instruction. Modern architectures, the possibility that its sub-strands could be executed in parallel. however, may require a memory fence. Sync vertices ensure that the sync condition holds before The main disadvantage of the ABP queue is that its effective allowing the control-flow to proceed past them. In fully-strict capacity may decrease dynamically. Since the underlying array parallelism, the sync condition is that the number of active is not used as a ring buffer and pushBottom() and popTop() parallel strands Nτ , spawned by the current spawning function, only ever increment the indices, space freed by popTop() is zero: Nτ = 0. Since there is already an active strand cannot be used for adding new elements. The ABP queue upon entering the spawning function, the maximum value of mitigates this by resetting the top and bottom value if it Nτ is the number of spawn call sites plus one. Due to the becomes empty. However, the reduced effective-capacity con- dynamic nature of spawn points, Nτ is only incremented if a dition may persist for an extended period until the mitigation continuation is actually stolen, and hence might never exceed mechanism kicks in. In 2005 Chase et al. presented a work- its initial value. stealing queue [17] that does not have this drawback. Their Figure 4a shows an example of a function using spawn and queue algorithm, which we will refer to as CL queue, is sync to express concurrency. The function’s DAG is shown based on 64-bit counters that double as generation counters in Figure 4b, where the spawn and sync vertices appear as and indices for a ring buffer. filled dots. 
In our model, every spawn vertex has an indegree The publications introducing the THE, ABP, and CL queue of 1 and an outdegree of 2, and every sync vertex has an only briefly consider memory models and, thus, required indegree of > 1 and outdegree of 1. The two outgoing edges memory barriers. The first CL queue algorithm utilizing C11’s of a spawn vertex consists of precisely one child edge and one memory model was published in 2013 [18] to the best of our continuation edge. A child edge leads to a strand representing knowledge. Shortly after, Norris et al. discovered a bug in the a spawned function, whereas a continuation edge leads to published implementation using model checking [19]. Lock- the control-flow continuation after the spawning function call. free data structures are often implemented using overly strong Every spawning function’s DAG also has one main path: The ordering parameters [20]. Concurrency platform designers path beginning from the function’s start vertex to the end should strive to use relaxed memory barriers where they are vertex only following continuation edges. safe, which can significantly increase performance. B. Dynamic Strand-to-Worker Mapping III.CONTINUATION-STEALING CONCURRENCY The concurrency platform’s runtime systems dynamically PLATFORMS maps a fork/join program’s DAG to the available workers. A. The DAG Model of Continuation-Stealing During runtime, any worker processing a spawn point pushes the continuation after the spawning call site into its work- The conceptual between a concurrency platform’s stealing queue, admitting the continuation to be stolen by other programming-language layer and its runtime system is a workers. The worker then calls the spawned child function directed acyclic (DAG). We use the following DAG using the standard calling convention. Once the child function model based on three types of vertices: returns, the worker attempts to pop work from its work- Strand Vertices for a strand of serial execution. 
stealing queue using popBottom() (Figure 5, line 4). If the pop Spawn Vertices for a spawn-point within the program. operation returns a work item, then, due to the LIFO nature Sync Vertices for a sync-point within the program. of the bottom end of work-stealing queues, this item must Every strand consists of a sequence of one or more in- be the continuation that the worker previously pushed. The structions not involving any parallel control. Hence a strand worker can now safely discard this continuation and proceed vertex never forks into two sub-strands. Instead, such forks with the control-flow because it leads directly to the discarded W1 W2 W1 W2 W1 W2 1 1 1 1 foo(char[] data, int n) { 1 steal steal steal 2 char* p; anpath main 3 3 1 int x, y, z; 3 3 2 2 2 4 p= data; 3 pop() al pop() pop() 2 te 5 2 x= spawn a(p, n); s o 4 6 p= data+ n; n 4 steal 3 5 steal pop() 7 4 y= spawn b(p, n); pop() 4 8 p= data+(2* n); 5 4 5 5 5 pop() 9 z= c(p, n);

10 sync ;

11 return x+y+ z; 6 implicit explicit 12 } syncs 6 sync 6 6 6 (a) Spawning function foo() (b) DAG of foo() (c) Potential Mapping 1 (d) Potential Mapping 2 (e) Potential Mapping 3 Figure 4: The spawning function foo(), its DAG, and resulting potential strand-to-worker mappings (further exists)

continuation. If the pop operation returns no work item, then spawn(func) sync()

the continuation must have been stolen. In this case, the worker programming language layer

has to perform an implicit sync operation (Figure 5, line 5). runtime system layer Figures 4c to 4e demonstrate the outcome of three possible sync explicit sequences of events resulting from the execution of foo() 1 cont= contAfterSpawn(); (of Figure 4a), by showing the according strand-to-worker 2 pushBottom(cont); 3 func(); mapping. In Figure 4c, the second worker, W2, steals continu- 4 if (!popBottom()) ation #3 from the first worker (W1). Upon finishing executing 5 tryResume(); implicit sync a(), W1 attempts to pop the continuation #3 from its work- tryResume() stealing queue, which he just pushed into the queue before --Nτ popTop()  Nτ > 0 Nτ = 0 executing 2. W1 finds its queue empty, i.e. popBottom() (denoted in Figure 4 as pop() for brevity) does not return workStealing()  run() the continuation: now knows the continuation was stolen 1 popTop() and performs an implicit sync operation. This operation yields ++Nτ a negative result, which means that the worker is now out of resume() work and needs to resort to work-stealing. Meanwhile W2 executed the stolen continuation #3 and then, after pushing continuation #5, spawned b() (strand #4). Upon returning Figure 5: Coordination logic within the runtime system and the programming-language layer interface from b(), popBottom(), invoked by W2, yields the contin- uation 5, which the worker discards and simply continues. After finishing #5, the control-flow leads W2 into an explicit sync point. Here W2 finds that W1 already finished the only by W1, as in Figure 4d. spawned function a(), and the sync point is ready to proceed. The typical function interaction of a continuation-stealing In Figure 4d, W2 steals continuation #3. However, unlike runtime system is shown in Figure 5, which shows the central the previous example, W1 steals continuation #5 from W2. role of tryResume(): Every invoking control flow has either This leads to W2 finding its work-stealing queue empty after reached an explicit or implicit sync point. 
Hence one of executing b(), resulting in an implicit sync, which returns a the actions performed in this function is to decrement Nτ . negative result and hence leads W2 to the quest for work. Then, the resulting value of Nτ is inspected to see if the W1 performs an explicit sync operation due to encountering sync condition holds. Depending on the outcome, the worker the sync statement after executing 5, which returns a positive either resumes execution after the explicit sync point, or goes result: W1 is free to proceed with strand 6. over to steal work. If the sync condition is not satisfied Finally, in Figure 4e, nearly the same sequence of events, at an explicit sync point, the spawning function’s stack is as in Figure 4d, happened. However, W2 spend slightly more suspended (cf. Section II-C). As soon as popTop() returns time with strand #4. Thereby strand #4 becomes the last a continuation while performing work-stealing, the worker finishing strand before the sync point. This leads eventually to invokes the continuation’s run() function to resume the same. W2 being the worker where the sync condition holds. Thus, Since this means that a new parallel task has been forked, strand #6 after the explicit sync point is executed by W2, not Nτ is incremented by run() before calling the continuation’s int foo() { Worker Thief α int x, y, z; ω 0 0 popBottom() popTop() + x= spawn a(); + +     y= spawn b(); concurrenty sequentially incremented incremented z= c(); + sync; immutable tryResume() return x+y+ z; } continue continue --N run() ++N τ sequential work- τ N execution stealing Listing 3: Spawning function anatomy wrt. α and ω τ = 0 > τ 0 N resume() proceed after work-stealing continuation join-point Fibril’s case). However, both locks are briefly simultaneously acquired in line 10 of Listing 2, combining the two individual Figure 6: The data race between worker and thieves locks into the large critical section, displayed in Figure 6.

1 void random_steal(worker_t* w) { IV. WAIT-FREE SYNCHRONISATION OF STRANDSIN 2 worker_t* victim; fibril_t* f; CONCURRENCY PLATFORMS 3 while (1){ We now present Nowa, our wait-free approach to strand 4 victim= workers[rand()% NPROCS]; coordination in continuation-stealing runtime systems. The 5 lock(victim->deque) ; core idea of Nowa is the transformation of the hazardous race 6 f= steal(victim->deque); condition presented in Section III-C into a benign one. This 7 if (f== NULL) unlock(victim->deque) ; transformation lifts the requirement for acquiring a lock during 8 else break; the critical sections shown in Figure 6, allows to modify any 9 } counter via atomic instructions and, ultimately, turns the whole 10 lock(f) ; unlock(victim->deque) ; operation wait-free [2]. We will show that one can initialise 11 if (f->count++==0){ the counter used by the sync-condition with an arbitrarily large 12 f->count+=1; value and later reset it to Nτ at explicit sync points, enabling 13 f->stack= victim->stack; the transformation of the race condition. 14 }

15 unlock(f) ; A. Decomposition of Nτ N Listing 2: Excerpt showing the usage of locks in Fibril [9] One central insight Nowa is based on is that τ of a spawning function can be decomposed into two counters. The first being the number of actually forked tasks α, the second resume() function. being the number of joined tasks ω. At any point in time, the number of active parallel tasks is equal to the number of C. The Data Race between Worker and Thieves forked tasks minus the number of joined tasks, which leads to Equation 1. Within the just described scheme, employed by continuation-stealing runtime systems, there is a data race N = α − ω (1) that is not obvious at first glance: Any pop operation on τ a queue and the potential subsequent modification of Nτ Listing 3 shows the anatomy of a spawning function. We needs to be performed atomically. Figure 6 shows the critical observe that α is only ever incremented after the first spawn sections that need to be guarded by a mutex to eliminate the statement until the last spawn statement. After that, the value data race. Otherwise, a worker found the queue empty of α becomes effectively constant. On the other side, ω is also after calling popBottom() may observe Nτ to be zero after initially zero and will be potentially incremented from the first decrementing it, while there are, in fact, still active parallel spawn statement until a subsequent explicit sync statement. tasks. Such a worker would assume that the sync condition holds when this is not the case. B. The Wait-Free Nowa Approach 0 This could happen because a thief may already have stolen Nowa introduces a new variable Nτ defined as the continuation from the worker’s queue, using popTop(), 0 Nτ = Imax − ω (2) but did not yet increment Nτ . As a result, a worker may er- roneously assume that the sync condition holds and continues where Imax is the maximal value of the datatype of the sync- after the sync point. 
After that, the outcome of the program condition counter. During the first phase, before the explicit 0 execution is undefined. sync point is reached, Nowa uses Nτ instead of Nτ as Listing 2 shows how the runtime system of Fibril [9] the sync-condition counter. Since ω is zero at the start of a prevents this data race. It uses two locks, one for the queue and spawning function, Nowa initialises the sync-condition counter one for the data structure containing Nτ (fibril t.count in with Imax. As before, every time a parallel task finishes, the int foo() { sync-condition counter is decremented. However, since it was 0 α int x, y, z; Nτ initialised with Imax, workers will not erroneously observe the 0 Imax + x= spawn a(); sync condition. Once the explicit sync point is reached, Nowa + − concurrently y= spawn b(); decremented sequentially will restore the value of Nτ in the sync-condition counter, incremented − z= c(); Restore Nτ henceforth allowing workers to observe the sync condition. first phase from α and N 0 N τ sync; τ using Equation 5.

Nowa exploits four invariants found in fully-strict fork/join immutable return x+y+ z; Second phase starts. parallelism. Invariant I: N can only become zero once the τ } explicit sync point is reached. Until then, there is at least 0 one task still running: The one heading towards the explicit Listing 4: Spawning function anatomy wrt. α, Nτ and Nτ sync point. Invariant II: α is only ever incremented by the same single control-flow (which may be executed by different workers, but never in parallel) and hence does not need to be initialised by the runtime system with Imax, because the sync- 0 synchronised. The α incrementing control flow is along the condition counter represents Nτ , not Nτ , in the first phase. main path (as defined in Section III-A). Invariant III: Once This initialisation happens together with the initialisation of the explicit sync point is reached, no more steals will happen, the other members upon entering the spawning function. 0 and α becomes immutable. Workers only fork new parallel Listing 4 shows how our wait-free approach uses Nτ as the tasks along the main path, and once the explicit sync point is initial value of the sync-condition counter. Joining tasks are reached, the only control-flow on the main path is suspended. atomically decrementing its value, before checking if the sync- Invariant IV: Tasks trying to join with their peers only need to condition holds. Within the first phase, workers will observe a know if there are still active parallel tasks. The exact number positive sync-condition counter. Once the control flow reaches of outstanding active tasks, however, is irrelevant for this the explicit sync point, the value of Nτ is restored atomically operation. Joining tasks are only interested in a boolean is- in the sync-condition counter using Equation 5. Subsequent positive indication of Nτ . checks for the sync-condition will now be based on Nτ . 
It is sufficient for workers in the first phase to observe any positive value at the sync-condition counter due to Invariant I and Invariant IV. Since Nowa initialises the counter with Imax, a non-positive value could only be observed if more than Imax tasks spawned. As soon as the second phase is entered, by reaching the explicit sync point, we need to restore the actual value of the counter of active parallel tasks (Nτ) to give workers a chance to observe the sync condition. Note that Nτ can be safely used as the value for the sync-condition counter within the second phase, as the race condition between workers and thieves no longer exists because of Invariant III. To see how the restoration of Nτ is possible, we extend Equation 1 by adding Imax to both sides, yielding Equation 3. Now we solve this equation for Nτ:

    Nτ + Imax = α − ω + Imax                 (3)
    Nτ = α − ω + Imax − Imax
       = Imax − ω − Imax + α
       = (Imax − ω) − (Imax − α)             (4)
       = Nτ⁰ − (Imax − α)                    (5)

By substituting Imax − ω with Nτ⁰, as per Equation 2, we obtain Equation 5, which can be used to restore Nτ from Nτ⁰ and α upon entering the second phase.

Nowa's runtime system creates a stack object for every called spawning function. This object contains the state required for the coordination of the potentially spawned strands. Instead of a mutex and a counter for Nτ, Nowa's state includes a field for α and an atomic integer member for the sync-condition counter. As per Invariant II, the field for α does not need to be atomic.

C. Summoning Synergy Effects

A runtime system of a concurrency platform is a multi-layered structure, with the work-stealing queues at its core. The layer wrapping this core is responsible for coordinating the concurrent strands. Most runtime systems use lock-based synchronisation within the outer layer, and the partially-locked THE queue at the core. Nowa removes the locks of the outer layer with its wait-free approach. As a result, all parallel strands can instantaneously and simultaneously break through the outer layer towards the core, where they face the work-stealing queue. To keep those queues from becoming a bottleneck, we also use a highly-optimised variant of the CL queue within the Nowa runtime system. As explained in Section II-A, unlike the typically used THE queue, the CL queue is completely lock-free. Ultimately, due to the combination of the wait-free Nowa approach and the lock-free CL queue, a synergy effect manifests.

V. EVALUATION

We implemented Nowa in a runtime system and compared Nowa with Fibril, Cilk Plus, and TBB. We used Fibril as the starting point for Nowa, so the comparison with Fibril shows the direct impact of our wait-free approach. Cilk Plus and TBB are state-of-the-art concurrency platforms and, therefore, put the performance of Nowa into context.

The twelve benchmarks described in Table I were used to evaluate and compare the performance of Nowa, Fibril, Cilk Plus and TBB. These benchmarks have been used in previous publications about the cactus stack problem and Cilk and are adopted from Fibril. The source lines of code (SLOC) were counted using D. Wheeler's SLOCCount tool.
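The aggregation behind the speedup figures reported in this section — arithmetic mean of the serial runs, per-run speedups, and geometric means of speedup ratios when comparing two runtimes — can be sketched as follows; the helper names are ours, not taken from the evaluation scripts:

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Sketch (our own helper names) of the evaluation's aggregation scheme:
// serial runs are averaged arithmetically, each parallel run yields one
// speedup value, and runtime-vs-runtime averages use the geometric mean
// of per-benchmark speedup ratios.
double arithmetic_mean(const std::vector<double>& xs) {
    return std::accumulate(xs.begin(), xs.end(), 0.0) / xs.size();
}

// Speedup of one run on n threads: S = mean serial time / parallel time.
double speedup(double mean_serial_time, double parallel_time) {
    return mean_serial_time / parallel_time;
}

// Geometric mean of the speedup ratios a[i]/b[i] of two runtime systems
// over all benchmarks i (computed in log space for numerical stability).
double geo_mean_ratio(const std::vector<double>& a,
                      const std::vector<double>& b) {
    double log_sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        log_sum += std::log(a[i] / b[i]);
    return std::exp(log_sum / static_cast<double>(a.size()));
}
```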
Table I: Description of the 12 benchmarks

    Benchmark   Input              Description                       SLOC
    cholesky    4000/40000         Cholesky factorization            454
    fft         2^26               Fast Fourier transformation       3054
    fib         42                 Recursive Fibonacci               40
    heat        4096 × 1024        Jacobi heat diffusion             149
    integrate   10^4 (ε = 10^−9)   Quadrature adaptive integration   59
    knapsack    32                 Recursive knapsack                164
    lu          4096               LU-decomposition                  269
    matmul      2048               Matrix multiply                   114
    nqueens     14                 Count ways to place N queens      48
    quicksort   10^8               Parallel quicksort                66
    rectmul     4096               Rectangular matrix multiply       291
    strassen    4096               Strassen matrix multiply          621

All benchmarks were run on a NUMA system with two AMD EPYC 7702 CPUs at 2.00 GHz with boost enabled up to 3.35 GHz. Each CPU package has its own NUMA node and 64 cores with 2-way simultaneous multithreading (SMT), totaling 256 hardware threads. The total available main memory was 503 GiB. The evaluation was performed using Ubuntu Linux 18.04 with kernel version 4.15.0, GCC 7.5.0, Intel Cilk Plus RTS 7.4.0, and TBB 2017 Update 7. All code was compiled with -O2, as the effects of a higher optimization level are often indistinguishable from random noise [21]. The execution time measurements were performed from within the applications to avoid the inclusion of OS- and loader-induced process ramp-up time. Furthermore, we ensured that the system was quiescent during the measurements.

We used the unmodified benchmarks from Fibril except for knapsack and strassen. In knapsack a bug was fixed that resulted in a minimal problem size after the first run. And in strassen the serial portion was reduced and dynamic memory allocations were replaced by preallocated memory to minimise deviations due to memory management outside the runtime system's control. Nowa and Fibril were adjusted to not unmap unused stack space upon stack-frame suspension using madvise(). This adjustment allows for a fair comparison with Cilk Plus, OpenMP and TBB, which do not employ this technique. Furthermore, our results in Section V-B question the worthwhileness of this technique.

The evaluation was performed using 4 KiB memory pages and 1 MiB stacks. Besides the execution time using n threads (Tn), we also measured the serial execution time (TS) using the serial elision [5] of every benchmarked application. For every combination of n (respectively S) and runtime system, we ran the benchmark 51 times in total, with the first run being a warm-up run. To compare the runtime systems via their speedup values, we calculated the 50 serial executions' arithmetic mean per benchmark (T̄S). This mean serial execution time is used to determine 50 speedup values (S¹n, …, S⁵⁰n) for each runtime and benchmark using n threads, that is:

    Sᵐn = T̄S / Tᵐn

From those, we finally calculate the speedup's geometric mean S̄n and standard deviation, shown as line points and error bars in the graphs. Since we take the identical value for the serial execution in the speedup calculation across all runtime systems, the derived speedup values are comparable between the runtime systems. A higher speedup of a runtime A than runtime B also denotes a lower execution time than B. Hence, when we provide values for the average performance changes between runtime systems, we calculate the geometric mean of the speedup ratios of the related runtime systems.

A. Comparison of Runtime Performance

Figure 7 shows a comparison of the speedups achieved by Nowa, Fibril, Cilk Plus, and TBB on 1–256 threads.

Figure 7: Speedup using 1–256 hardware threads. Per-benchmark panels with mean serial times T̄S: cholesky 7.405 ± 0.216 s, fft 16.916 ± 0.051 s, fib 0.994 ± 0.025 s, heat 0.757 ± 0.089 s, integrate 2.959 ± 0.003 s, knapsack 0.385 ± 0.060 s, lu 15.561 ± 0.157 s, matmul 9.449 ± 0.146 s, nqueens 5.004 ± 0.001 s, quicksort 11.542 ± 0.007 s, rectmul 20.314 ± 0.039 s, strassen 14.193 ± 0.005 s.

Nowa performs very similar to Fibril on 256 threads in cholesky, fft, heat, lu, quicksort, rectmul, and strassen, indicating that the benchmark itself or the data processed might be the bottleneck in these benchmarks. In the case of cholesky, the achieved speedup drops from around 28 on 128 threads to about 20 on 256 threads. Tests have shown that this benchmark relies heavily on allocating and deallocating the stacks used to compute the work. Nowa and Fibril use small per-worker buffers of stacks and a global pool to recirculate stacks that changed ownership in the course of work-stealing. When put under stress by many workers, this single global pool can become a bottleneck and limit the achievable speedup. Improvements to the pool can dampen performance derogation in such instances. Furthermore, Nowa achieved speedups roughly comparable to Cilk Plus in fft, rectmul, and strassen, and to TBB in quicksort, on 256 threads.

The knapsack benchmark is the only benchmark where Nowa performed significantly worse than Fibril, Cilk Plus, and TBB. On 256 threads, the performance of Nowa was 0.36 × that of Fibril, 0.63 × that of Cilk Plus, and 0.21 × that of TBB. However, this benchmark tries to solve the 0/1 knapsack problem by employing a branch-and-bound technique. Hence it spawns a new task for every branch to traverse the branches of the search tree recursively. To bound the number of traversed branches, the recursion stops at branches where an approximation shows that no better result than the current best can be found. Therefore, the required work until the algorithm stops (and thus, the execution time) depends heavily on the executed tasks' ordering. As explained in Section II-B, the employed work-stealing strategy influences the ordering. For example, TBB uses child-stealing and executes forked-off functions in reverse order. Once we switch the order of spawn statements in knapsack's source code, the continuation-stealing runtime systems, including Nowa, outperform TBB. That means that knapsack would benefit from a code annotation hinting the concurrency platform which strand of execution to prefer after a spawn. It is, to our knowledge, the only benchmark from the evaluated benchmark suite where this is the case. Nevertheless, for comparison, we decided to show the results using the initial order of spawn statements. The random nature of randomised work-stealing also influences the executed tasks' ordering and leads to high variance of run-time in this benchmark. We also found that the memory layout of the knapsack executable has a tremendous impact on performance, changing the benchmark's run-time by up to an order of magnitude.

Nowa outperforms Fibril on 256 threads in fib, integrate, matmul, and nqueens by 1.62 ×, 1.37 ×, 1.29 ×, and 1.64 ×, respectively. In the fib, integrate, and nqueens benchmarks, the work of a created task is very small, and there is no shared data, making the runtime system the limiting factor of the achievable speedup. Because of this, these benchmarks are a useful tool for measuring the performance of the runtime system itself.

On average, Nowa improves performance by 1.06 × compared to Fibril, by 1.5 × compared to Cilk Plus, and by 3.02 × compared to TBB, on 256 threads. Since we find knapsack to be ill-suited as a benchmark, excluding the results of knapsack from calculating the average performance differences between runtime systems is more conclusive. Therefore, we exclude knapsack from now on when providing the average speedup ratio of two runtime systems. Without the knapsack benchmark, the average speedup increase achieved by Nowa over Fibril, Cilk Plus, and TBB is 1.17 ×, 1.62 ×, and 3.84 ×, respectively. Furthermore, the minimum and maximum performance of Nowa was 0.99 × and 1.64 × the performance of Fibril, 1.01 × and 3.47 × the performance of Cilk Plus, and 1.16 × and 25.3 × the performance of TBB.

B. Analysis of the Practical Cactus-Stack Solution

Yang et al. [9] presented a practical solution to the cactus-stack problem by instructing the kernel to free the physical page backing of the unused portion of a suspended stack. This is done using the madvise() syscall with the proper advice argument, MADV_DONTNEED or MADV_FREE in the case of Linux. The original results of the empirical study published by Yang et al. suggested that the use of madvise() with MADV_DONTNEED has no significant impact on performance. This stands in contrast with our findings, which indicate that the use of madvise() is associated with a performance penalty that is significant in some benchmarks. Even the use of MADV_FREE, which allows the kernel to free pages lazily, only improves this by a small margin. Figure 8 shows a comparison of the performance of Nowa with and without the use of madvise() with MADV_FREE. The speedup of Cilk Plus is added for reference. The benchmarks cholesky and lu showed the highest performance derogation of 0.18 × and 0.39 ×, respectively. On average, the performance decreased by 0.73 ×. Repeating the experiment with Fibril showed similar results with an average performance decrease of 0.75 ×.

Figure 8: Comparison of the impact of madvise() (panels: cholesky, lu, heat, fib, matmul, nqueens, integrate, rectmul; Nowa w/o madvise(), Nowa w/ madvise(), Cilk Plus).

As shown in Table II, the use of madvise() only lowers memory consumption slightly. Considering the harmful impact on performance, the low memory savings in our benchmarks, and the fact that modern systems have plenty of available main memory, we wonder whether the benefits of unmapping unused stack space justify its disadvantages.

Table II: RSS usage of Nowa wrt. the use of madvise()

    Benchmark                Max RSS (MiB)                   ∆ RSS
                             w/o madvise()   w/ madvise()
    cholesky                 117             109             −8
    fft                      2,073           2,073           0
    heat                     82              82              0
    lu                       146             145             −1
    matmul                   63              65              2
    quicksort                785             783             −2
    rectmul                  3,831           3,773           −58
    strassen                 7,638           7,638           0
    fib, integrate,          7               7               0
    knapsack, nqueens

C. The Work-Stealing Queue's Impact on Performance

As explained in Section IV-C, Nowa generates synergy effects by combining the wait-free approach with the CL work-stealing queue. To investigate the impact of this synergy, we created a second variant of Nowa which uses the THE queue, just like Fibril, instead of the CL queue, but otherwise still employs the wait-free Nowa approach for strand coordination. We then ran the benchmarks with those two Nowa variants.

Figure 9 shows an excerpt of the results of this comparison.

Figure 9: Comparison of the CL queue versus THE queue (panels: cholesky, fib, nqueens, matmul; Nowa (CL queue), Nowa (THE queue), Fibril).

If Nowa is using the THE queue, then the performance is comparable to Fibril's. However, as soon as Nowa makes use of the CL queue, the proper performance potential is unlocked in most benchmarks. For some benchmarks, like fib, this may even account for a performance increase of up to 1.72 ×. Generally, the performance of our Nowa runtime system using the THE queue was similar to Fibril, or even decreased slightly. This decrease is an indication that the THE queue becomes the bottleneck. However, only the wait-free aspect of our Nowa approach allows the use of the fully lock-free CL queue, which would not be possible without it, thus removing the bottleneck and unleashing performance.

D. Applicability

Besides Fibril [9], where our approach is applicable since we use Fibril's source code as the starting point for Nowa, we examined Intel's Cilk Plus runtime system [7] as per revision 39ad15a³, and MIT's OpenCilk [6] Cheetah runtime as per revision cb3d271⁴. Both runtimes use a similar locking approach as Fibril. Therefore we conclude that the scalability of both runtime systems is limited due to the lock and that they would benefit from our wait-free approach.

³bitbucket.org/intelcilkruntime/intel-cilk-runtime, 39ad15, 2018-01-24
⁴github.com/OpenCilk/cheetah, cb3d27, 2020-08-29

E. Comparison with OpenMP

As it is a dominant concurrency platform, especially in HPC, we also compared Nowa against OpenMP tasks. For this, we modified the benchmarks to use the omp task and omp taskwait pragmas as spawn/fork and sync/join primitives, respectively.

We included two OpenMP runtime systems in our evaluation: libgomp from GCC 7.5.0 and libomp from Clang 11.1.0. Every benchmark was compiled with GCC, allowing a fair comparison between all OpenMP and non-OpenMP runtime systems. In the case of libomp, this means that we instructed GCC to link against libomp. Due to the increased execution times when using libgomp, we reduced the number of successive benchmark runs to 13, including one warm-up run, and report the values of those 12 runs (instead of 50 + 1 runs). Furthermore, we only examined the execution times using 1, 64, 128, 192, and 256 hardware threads to keep the overall time to carry out the evaluation within reasonable bounds. To mimic Nowa's and Fibril's runtime systems' behaviour, we extended the omp task construct by the untied clause. Most continuation-stealing runtime systems do not restrict the target thread after a task's suspension, hence avoiding starvation scenarios. To show the impact of tied and untied tasks, we also evaluated libomp using tied tasks.

Figure 10 puts the two OpenMP runtime systems' performance in perspective with the already presented execution times from Nowa and TBB. Unlike Figure 7, we decided to use a logarithmic scale for the plots, mainly to accommodate libgomp's performance, which is consistently low, often exhibiting a speedup below one. While libomp does better in terms of speedup, it is slightly below the speedup achieved by TBB. Especially when tied tasks are used, libomp is elevated into the performance region of TBB in case of cholesky, fft, matmul, and rectmul. Only in strassen does libomp outperform TBB. For 7 of the 12 benchmarks, using tied tasks seems to improve the performance significantly, even though being tied to a thread limits the freedom of the runtime system's scheduler. In the case of strassen, libomp using tied tasks comes close to the speedup of Nowa.

The only benchmark where an OpenMP runtime system is able to outperform Nowa is quicksort, which could be due to scheduling artefacts (similar to the case of knapsack). But on average, Nowa is 8.68 × faster compared to libomp using untied tasks, and 5.47 × when using tied tasks.

To provide a more in-depth insight into the effects of untied and tied tasks, we show in Table III the average execution times and their standard deviations of Nowa and libomp with untied and tied tasks using 256 hardware threads. For fib, one of the most challenging benchmarks for a runtime system, Nowa is 116.69 × faster than libomp using untied tasks. The time libomp requires to execute fib more than halves once we switch from untied to tied tasks. Nevertheless, even when using tied tasks, Nowa can execute fib 52.92 × faster than libomp. It is also worth noting that for some benchmarks, like heat, nqueens, and quicksort, using libomp with tied tasks increases the execution time compared to untied tasks.

The evaluation reveals that libgomp has a severe performance issue when used with fine-grained task parallelism: Nowa can achieve 486.93 × the speedup of libgomp, on average. Despite libomp performing significantly better than libgomp, potentially due to its internal work-stealing scheduling, both OpenMP runtime systems cannot achieve the performance of Nowa when used for task parallelism.
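To illustrate the benchmark conversion described above, the following shows how a spawn/sync benchmark like fib maps onto omp task and omp taskwait with the untied clause. This is our own rendition for illustration, not the exact benchmark source:

```cpp
// Our own rendition of the benchmark conversion described above, not the
// exact benchmark source: `spawn` becomes an (untied) `omp task`, `sync`
// becomes `omp taskwait`.
long fib(int n) {
    if (n < 2) return n;
    long x, y;
    // Fork: the child task may be stolen and, being untied, may also be
    // resumed by any worker thread after a suspension.
    #pragma omp task untied shared(x) firstprivate(n)
    x = fib(n - 1);
    y = fib(n - 2);      // continuation work stays in the current task
    #pragma omp taskwait // join: wait for the spawned child task
    return x + y;
}
```

Compiled with -fopenmp and launched from a `#pragma omp parallel` / `#pragma omp single` region, this runs in parallel; without OpenMP the pragmas are ignored and the function degrades to its serial elision.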

Figure 10: Nowa compared against OpenMP (log-scale speedup on 1–256 hardware threads; Nowa, TBB, libgomp, libomp (untied), libomp (tied); one panel per benchmark).

VI. CONCLUSION

We presented Nowa, a novel wait-free approach to coordinate concurrent strands in concurrency platforms. Nowa was implemented in a runtime system, which allows for dynamic task parallelism and can employ the practical solution to the cactus-stack problem, all strictly wait-free. Contrary to a previous publication, we find that applying the practical solution to the cactus-stack problem incurs a significant performance penalty. Therefore our evaluation raises the question whether this solution should be applied at all. Our evaluation shows that Nowa improves the performance by 1.17 × on average compared to Fibril, by 1.62 × compared to Cilk Plus, by 3.84 × compared to TBB, by 486.93 × compared to OpenMP tasks using GCC's libgomp, and by 8.68 × compared to OpenMP using Clang's libomp, on 256 hardware threads. We further showed that this performance gain is reinforced by Nowa's adept combination of its wait-free approach with a lock-free work-stealing queue implementation based on relaxed memory orderings. Our analysis shows that existing concurrency platforms could also benefit from Nowa.

Table III: Execution times using 256 hardware threads

    Benchmark   Nowa            OpenMP (libomp)
                                untied tasks     tied tasks
    cholesky    0.37 ±0.01 s    4.28 ±0.06 s     1.66 ±0.01 s
    fft         0.32 ±0.07 s    0.71 ±0.03 s     0.45 ±0.02 s
    fib         0.05 ±0.01 s    6.30 ±0.13 s     2.86 ±0.04 s
    heat        0.14 ±0.01 s    0.20 ±0.01 s     0.25 ±0.01 s
    integrate   0.05 ±0.01 s    2.37 ±0.21 s     2.33 ±0.36 s
    knapsack    0.04 ±0.01 s    0.09 ±0.02 s     0.09 ±0.03 s
    lu          0.26 ±0.02 s    2.09 ±0.10 s     1.64 ±0.06 s
    matmul      0.15 ±0.00 s    15.41 ±0.27 s    2.85 ±0.11 s
    nqueens     0.06 ±0.01 s    1.07 ±0.03 s     1.32 ±0.11 s
    quicksort   1.98 ±0.07 s    1.71 ±0.04 s     1.81 ±0.05 s
    rectmul     0.30 ±0.03 s    0.98 ±0.54 s     0.44 ±0.04 s
    strassen    0.24 ±0.01 s    0.62 ±0.03 s     0.29 ±0.02 s

ARTIFACT APPENDIX

The source code of the Nowa runtime system and the used benchmark suite is available online: https://gitlab.cs.fau.de/i4/manycore/nowa. It is also available as doi:10.5281/zenodo.4550680. We explicitly welcome feedback and bug reports. Our experiments' resulting dataset, which is the basis for the presented evaluation, is available as doi:10.5281/zenodo.4550676.

ACKNOWLEDGMENT

We thank Andreas Fried (KIT) for his feedback. This work is partly funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Project Number 146371743, TRR 89 Invasive Computing, and under grants SCHR 603/10-2 and NO 625/7-2 (COKE).

REFERENCES

[1] H.-J. Boehm, "Threads cannot be implemented as a library", SIGPLAN Not., vol. 40, no. 6, Jun. 2005. DOI: 10.1145/1064978.1065042.
[2] M. Herlihy, "Wait-free synchronization", ACM Trans. Program. Lang. Syst., vol. 13, no. 1, pp. 124–149, Jan. 1991. DOI: 10.1145/114005.102808.
[3] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, "Cilk: An efficient multithreaded runtime system", SIGPLAN Not., vol. 30, no. 8, Aug. 1995. DOI: 10.1145/209937.209958.
[4] R. D. Blumofe and C. E. Leiserson, "Scheduling multithreaded computations by work stealing", J. ACM, vol. 46, no. 5, Sep. 1999. DOI: 10.1145/324133.324234.
[5] M. Frigo, C. E. Leiserson, and K. H. Randall, "The implementation of the Cilk-5 multithreaded language", in Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, ser. PLDI '98, Montreal, Quebec, Canada: Association for Computing Machinery, 1998. DOI: 10.1145/277650.277725.
[6] T. B. Schardl, I.-T. A. Lee, and C. E. Leiserson, "Brief announcement: Open Cilk", in Proceedings of the 30th ACM Symposium on Parallelism in Algorithms and Architectures, ser. SPAA '18, Vienna, Austria: Association for Computing Machinery, 2018. DOI: 10.1145/3210377.3210658.
[7] Intel. "Intel Cilk Plus". (2020), [Online]. Available: http://www.cilkplus.org.
[8] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar, "X10: An object-oriented approach to non-uniform cluster computing", SIGPLAN Not., vol. 40, no. 10, Oct. 2005. DOI: 10.1145/1103845.1094852.
[9] C. Yang and J. Mellor-Crummey, "A practical solution to the cactus stack problem", in Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures, ser. SPAA '16, Pacific Grove, California, USA: Association for Computing Machinery, 2016. DOI: 10.1145/2935764.2935787.
[10] F. Schmaus, S. Maier, T. Langer, J. Rabenstein, T. Hönig, W. Schröder-Preikschat, L. Bauer, and J. Henkel, "System software for resource arbitration on future many-* architectures", in 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2020. DOI: 10.1109/IPDPSW50202.2020.00160.
[11] A. Robison, "A primer on scheduling fork-join parallelism with work stealing", The C++ Standards Committee, Tech. Rep., WG21 paper N3872, 2014.
[12] E. A. Hauck and B. A. Dent, "Burroughs' B6500/B7500 stack mechanism", in Proceedings of the April 30–May 2, 1968, Spring Joint Computer Conference, ser. AFIPS '68 (Spring), Atlantic City, New Jersey: Association for Computing Machinery, 1968. DOI: 10.1145/1468075.1468111.
[13] I.-T. A. Lee, S. Boyd-Wickizer, Z. Huang, and C. E. Leiserson, "Using memory mapping to support cactus stacks in work-stealing runtime systems", in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '10, Vienna, Austria: Association for Computing Machinery, 2010. DOI: 10.1145/1854273.1854324.
[14] T. B. Schardl, W. S. Moses, and C. E. Leiserson, "Tapir: Embedding fork-join parallelism into LLVM's intermediate representation", in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '17, Austin, Texas, USA: Association for Computing Machinery, 2017. DOI: 10.1145/3018743.3018758.
[15] U. A. Acar, N. Ben-David, and M. Rainey, "Contention in structured concurrency: Provably efficient dynamic non-zero indicators for nested parallelism", SIGPLAN Not., vol. 52, no. 8, Jan. 2017. DOI: 10.1145/3155284.3018762.
[16] N. S. Arora, R. D. Blumofe, and C. G. Plaxton, "Thread scheduling for multiprogrammed multiprocessors", in Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, ser. SPAA '98, Puerto Vallarta, Mexico: Association for Computing Machinery, 1998. DOI: 10.1145/277651.277678.
[17] D. Chase and Y. Lev, "Dynamic circular work-stealing deque", in Proceedings of the Seventeenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, ser. SPAA '05, Las Vegas, Nevada, USA: Association for Computing Machinery, 2005. DOI: 10.1145/1073970.1073974.
[18] N. M. Lê, A. Pop, A. Cohen, and F. Zappa Nardelli, "Correct and efficient work-stealing for weak memory models", SIGPLAN Not., vol. 48, no. 8, Feb. 2013. DOI: 10.1145/2517327.2442524.
[19] B. Norris and B. Demsky, "CDSchecker: Checking concurrent data structures written with C/C++ atomics", SIGPLAN Not., vol. 48, no. 10, Oct. 2013. DOI: 10.1145/2544173.2509514.
[20] P. Ou and B. Demsky, "Checking concurrent data structures under the C/C++11 memory model", SIGPLAN Not., vol. 52, no. 8, Jan. 2017. DOI: 10.1145/3155284.3018749.
[21] C. Curtsinger and E. D. Berger, "Stabilizer: Statistically sound performance evaluation", in Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '13, Houston, Texas, USA: Association for Computing Machinery, 2013, pp. 219–228. DOI: 10.1145/2451116.2451141.