Efficient Workstealing for Multicore Event-Driven Systems

Fabien Gaud, Sylvain Genevès, Renaud Lachaize, Baptiste Lepers, Fabien Mottet, Gilles Muller, Vivien Quéma


Fabien Gaud, Sylvain Genevès, Renaud Lachaize (University of Grenoble, France)
Baptiste Lepers, Fabien Mottet, Gilles Muller (INRIA, France)
Vivien Quéma (CNRS, France)

Abstract—Many high-performance communicating systems are designed using the event-driven paradigm. As multicore platforms are now pervasive, it becomes crucial for such systems to take advantage of the available hardware parallelism. Event-coloring is a promising approach in this regard. First, it allows programmers to simply and progressively inject support for the safe, parallel execution of multiple event handlers through the use of annotations. Second, it relies on a workstealing algorithm to dynamically balance the execution of event handlers on the available cores. This paper studies the impact of the workstealing algorithm on the overall system performance. We first show that the only existing workstealing algorithm designed for event-coloring runtimes is not always efficient: for instance, it causes a 33% performance degradation on a Web server. We then introduce several enhancements to improve the workstealing behavior. An evaluation using both microbenchmarks and real applications, a Web server and the Secure File Server (SFS), shows that our system consistently outperforms a state-of-the-art runtime (Libasync-smp), with or without workstealing. In particular, our new workstealing improves performance by up to +25% compared to Libasync-smp without workstealing and by up to +73% compared to the Libasync-smp workstealing algorithm, in the Web server case.

Keywords: workstealing; multicore; system services; performance; event-driven

I. INTRODUCTION

Event-driven programming is a popular approach for the development of robust applications such as networked systems [1], [2], [3], [4], [5], [6], [7], [8]. This programming and execution model is based on passing events between short-lived and cooperatively-scheduled tasks. Its strength mainly lies in its expressiveness for fine-grain management of overlapping tasks, including asynchronous network and disk I/O. Moreover, some applications developed using the event-driven model exhibit lower memory consumption and better performance than their equivalents based on threaded models [9], [10].

However, a traditional event-driven runtime cannot take advantage of current multicore platforms, since it relies on a single thread executing the main processing loop. To overcome this restriction, a promising approach, event coloring, has been proposed and implemented within the Libasync-smp library [11]. Event coloring tries to preserve the serial event execution model and allows programmers to incrementally inject support for safe parallel execution through annotations (colors) specifying events that can be handled in parallel. The main benefits of the event coloring approach are that it preserves the expressiveness of pure event-driven programming, offers a relatively simple model with respect to concurrency, and is easily applicable to existing event-driven applications.

A side-effect of event coloring is that it sometimes causes imbalances in the processing load handled by the different cores of a machine. To improve performance, the Libasync-smp designers have thus proposed a workstealing (WS) mechanism in charge of balancing event executions on the multiple cores. We actually show in this paper that enabling workstealing can hurt the throughput of real system services by as much as 33%. Using microbenchmarks, we have identified two reasons for this performance decrease. First, the workstealing mechanism makes naïve decisions. Second, the data structures used in the runtime are not optimized for workstealing.

The contributions of this paper are twofold. First, we introduce enhanced heuristics to guide workstealing decisions. These heuristics try to preserve cache locality and avoid unfavorable stealing attempts, with little involvement required from the application programmers. Second, we present Mely (Multi-core Event LibrarY), a novel event-driven runtime for multicore platforms. Mely is backward-compatible with Libasync-smp and its internal architecture has been designed with workstealing in mind. Consequently, Mely exhibits a very low workstealing overhead, which makes it more efficient for short-running events.

We evaluate Mely with a set of microbenchmarks and two applications: a Web server and the Secure File Server (SFS) [12]. Our evaluations show that Mely consistently outperforms (or, at worst, equals) Libasync-smp. For instance, we show that the Web server running on top of Mely achieves a +25% higher throughput than when running on top of Libasync-smp without workstealing, and a +73% higher throughput than when running on top of Libasync-smp with workstealing enabled.

The paper is structured as follows. We start with an analysis of Libasync-smp in Section II. We then propose new heuristics to improve event workstealing in Section III. The implementation of the Mely runtime is presented in Section IV. Section V is dedicated to the performance evaluation of Mely. Finally, we discuss related work in Section VI, before concluding the paper in Section VII.

II. THE LIBASYNC-SMP RUNTIME

This section describes the Libasync-smp runtime [11]. We start with a description of its design. Then, we detail the workstealing algorithm used to dynamically balance events on cores. Finally, we evaluate and analyze Libasync-smp performance on two real-sized system services.

A. Design

Libasync-smp is a multiprocessor-compliant event-driven runtime. Its implementation relies, for each core, on an event queue and a thread. Events are data structures containing a pointer to a handler function and a continuation (i.e., a set of parameters carrying state information). Event handlers are executed by the core thread associated with the event queue. Handlers are assumed to be non-blocking, which explains why only one thread per core is required. The architecture of the Libasync-smp runtime is illustrated in Figure 1.

[Figure 1. Libasync-smp architecture: one thread and one event queue per core; queued events carry colors.]

Since several threads (one per core) simultaneously manipulate events, it is necessary to properly handle the concurrent execution of different handlers. An event execution updating a data item must execute in mutual exclusion with other events accessing the same data item. To ensure this property, Libasync-smp does not rely on the use of locking primitives in the code of the handlers. Rather, mutual exclusion issues are solved at the runtime level using programmer specifications. More precisely, programmers can restrain the potential parallel execution of events using annotations (named colors and represented as short integers). Two events with different colors can be handled concurrently, whereas events of the same color must be handled serially. This is achieved by dispatching those events on the same core. Note that events without annotations are all mapped to a default, unique color in order to guarantee safe execution. The Libasync-smp implementation assigns new events to cores using a simple hashing function on colors. Load balancing is adjusted with a workstealing algorithm described in Section II-B.

Interestingly, the coloring algorithm allows implementing various forms of parallelism. For instance, it is possible to let multiple events associated with the same handler run concurrently on disjoint data sets (e.g., to ensure that different client connections are concurrently processed in a Web server). It is also possible to enforce that all events associated with the same handler be executed in mutual exclusion (e.g., when a handler manages global state).

Event coloring is less expressive than locking (e.g., it does not support reader-writer semantics), but it is less error-prone and sufficient for the needs of most server applications. Besides, events can still be combined with locks in the rare cases where mutual exclusion must span several handlers.

Because event queues can be concurrently updated by different cores, their access must be synchronized. This is implemented using spinlocks; indeed, there is no interest in yielding cores (there is only one thread per core) if energy is not a concern.
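To make the two forms of parallelism described above concrete, the following sketch shows how a handler could annotate the events it registers. It is only an illustration of the coloring model under assumed names: register_event(), color_t and the handler names are hypothetical stand-ins, not the actual Libasync-smp API.

    /* Illustrative sketch of event coloring; register_event() and
     * color_t are hypothetical names, not the Libasync-smp interface. */
    typedef unsigned short color_t;        /* colors are short integers   */
    #define DEFAULT_COLOR ((color_t)0)     /* unannotated events share it */

    struct continuation { int client_fd; void *state; };

    void parse_request(struct continuation *c);  /* per-client work       */
    void update_stats(struct continuation *c);   /* touches global state  */

    void on_new_request(struct continuation *c) {
        /* Same handler, different colors: requests from distinct clients
         * may be processed in parallel because their colors differ.      */
        register_event(parse_request, c, (color_t)c->client_fd);

        /* One fixed color for every instance: all update_stats events are
         * serialized, so the color acts as a mutual-exclusion annotation. */
        register_event(update_stats, c, (color_t)1);
    }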
B. Workstealing algorithm

As mentioned in the previous section, colored events are dispatched on the cores using a hashing function. This simple load balancing strategy ignores the fact that some colors might require more time to be processed than others (e.g., when there are many events with the same color or when different events have different processing costs). The Libasync-smp library thus provides a dynamic load balancing algorithm based on the workstealing principle: when a core has no more events to process, it attempts to fetch events from other core queues.

The workstealing algorithm is presented as pseudo-code in Figure 2. First, the stealing core builds a core_set containing an ordered set of cores. This is achieved by calling the construct_core_set function (the functions used in the pseudo-code are detailed in the next paragraph). For each core in the set, the stealing core checks whether events can be stolen using the can_be_stolen function. If events can be stolen from this core, the stealing core chooses one color to be stolen using the choose_color_to_steal function. The stealing core then builds a set containing all the events with the chosen color using the construct_event_set function. If this set is not empty, the stealing core migrates the set of events into its own queue using the migrate function.

We now describe the implementation of the above-mentioned functions. construct_core_set builds a set that contains as first element the core that has the highest number of events in its queue. The set then contains the successive cores (based on core numbers): for instance, on an 8-core computer, if core 6 currently contains the highest number of events, then core_set is equal to {6, 7, 0, 1, 2, 3, 4, 5}. The call to can_be_stolen returns true if the core given as parameter has events with at least two different colors in its queue. Indeed, two colors are required because, in order to enforce the mutual exclusion properties of the runtime, the color of the event currently being processed on a core cannot be stolen; a steal can thus only be performed if there are events with another color. choose_color_to_steal scans the event queue of the core given as parameter and selects the first color (i) that is not associated with the event currently being processed, and (ii) that is associated with less than half of the events in the queue. Note that such a color might not exist. The construct_event_set function builds a set comprising all the events stored in the queue of the stolen core that are associated with the color given as parameter; it also removes these events from the victim queue. Note that this function might require scanning the entire event queue: this is the case when the last event stored in the queue has the color given as parameter (however, this is not always necessary, since the runtime maintains a counter of pending events for each color). Finally, the migrate function appends a set of events to the queue of the stealing core.

    core_set = construct_core_set();                   (1)
    foreach (core c in core_set) {
        LOCK(c);
        if (can_be_stolen(c)) {                        (2)
            color = choose_color_to_steal(c);          (3)
            event_set = construct_event_set(c, color); (4)
        }
        UNLOCK(c);
        if (!is_empty(event_set)) {
            LOCK(myself);
            migrate(event_set);                        (5)
            UNLOCK(myself);
            exit;
        }
    }

Figure 2. Pseudo-code of the Libasync-smp workstealing algorithm.

C. Performance evaluation

Zeldovich et al. have evaluated the performance of the Libasync-smp library on two system services: the SFS file server [12] and a Web server, which is not publicly available. While this study shows that the bare Libasync-smp achieves speedups on multicore platforms, workstealing has not been fully evaluated (the initial publication on Libasync-smp only studied the impact of workstealing on a microbenchmark).

Therefore, we have developed a realistic Web server based on the design described in [11], and we have run both SFS and our Web server with workstealing enabled and disabled. Details on the Web server and the benchmark configuration (hardware and software settings) can be found in Section V. For all experiments, standard deviations are very low (less than 1%).

Figure 3 shows the throughput achieved by SFS when 16 clients are issuing read requests on a 200MB file. It highlights that the workstealing algorithm significantly improves the server throughput (+35%). The reason is that SFS mostly executes expensive, coarse-grain cryptographic operations.

[Figure 3. Performance of the SFS file server with and without workstealing algorithm (throughput in MB/sec).]

In contrast, Figure 4 shows the throughput of the Web server with a varying number of clients requesting 1KB files. It clearly shows that the performance is negatively impacted by the workstealing algorithm (up to -33%). The reason is that the Web server relies on shorter event handlers than the ones used in SFS. Consequently, workstealing costs are proportionally higher.

To better understand the previous results, we measured the average time spent to steal a set of events (for both SFS and the Web server) and the average time spent to execute this set of stolen events. Results are summarized in Table I. We observe that the time spent to perform a steal (impacting one or several events) in SFS is on average 4.8 Kcycles and allows stealing sets of events whose average processing time is 1200 Kcycles. In contrast, a steal in the Web server requires a drastically longer average time (197 Kcycles) and allows stealing sets of events whose average processing time is much shorter (20 Kcycles).

Table I. Time spent stealing a set of events vs. time spent executing these events.
    System       Stealing time (cycles)   Stolen time (cycles)
    SFS          4.8K                     1200K
    Web server   197K                     20K

We attribute the poor performance achieved by the Web server when workstealing is enabled to two main causes. First, the Libasync-smp workstealing algorithm is naïve: a stealing core never checks the relevance of a steal before performing it. More precisely, the construct_core_set, can_be_stolen and choose_color_to_steal functions take into account neither the cost of the steal nor the processing time of the stolen events. Moreover, the construct_core_set function does not consider cache proximity between cores. We monitored the number of L2 cache misses on the Web server and observed a large increase, of up to +146%, when enabling workstealing. This result suggests that an efficient workstealing algorithm should try to favor dispatching events on cores sharing an L2 cache.

Second, the implementation of Libasync-smp has not been designed with workstealing in mind.
As described in Section II-B, the construct_event_set function might need to scan the entire event queue of the stolen core to build the set of events to be stolen. On our test platform (see Section V for details), the time required to scan a single event in the list (i.e., to follow a link in the list and to check the color of the next event) is about 190 cycles. This explains why the stealing cost can become very significant when the number of events stored in queues is high. For instance, in the case of the Web server, the most highly loaded cores had on average more than 1000 pending events. These results show that it is crucial to reduce the stealing costs.

[Figure 4. Performance of the SWS Web server with and without workstealing algorithm (throughput in KRequests/s for 200 to 2000 clients).]

III. IMPROVED WORKSTEALING ALGORITHM

In this section, we present three complementary heuristics aimed at improving the efficiency of the workstealing algorithm, by making better decisions in the construct_core_set, can_be_stolen and choose_color_to_steal functions introduced in Section II-B. These heuristics have two main goals. First, they aim at improving cache usage by leveraging cache locality between cores on a same die (locality-aware stealing) and by taking into consideration the size of the data sets accessed by events (penalty-aware stealing). Second, they aim at ensuring that it takes less time to steal a set of events than to execute it (time-left stealing).

A. Locality-aware stealing

The heuristic presented in this section aims at improving the quality of the victim choice implemented in the construct_core_set function. It is based on the observation that the hierarchy of caches has a huge impact on the performance of multicore processors. Some of these caches are dedicated to one core, while others are shared by a subset of the cores. For instance, in 4-core Intel Xeon processors, cores are divided in 2 groups of 2 cores: each core has a private L1 cache and shares an L2 cache with the other core in its group. The AMD 16-core architecture features 4 groups of 4 cores: each core has private L1 and L2 caches, and shares an L3 cache with the 3 other cores in its group. In addition, memory accesses between groups are not uniform [13].

It is thus becoming crucial to design algorithms that take the memory hierarchy into account. Stealing costs highly depend on the distance between the stealing and the victim cores. Table II shows the latency of the various levels in the memory hierarchy of the machine described in Section V-A. We notice that accessing the event queue of a distant core can be up to 7.3 times slower than for a neighbor core (i.e., a core sharing an L2 cache). A similar observation can be made on the time required to access the data set associated with an event (i.e., the data items encapsulated in or referenced by a continuation) stored on a distant queue.

Table II. Memory access times on an Intel Xeon E5410 machine.
    Memory hierarchy level   Access time (cycles)
    L1 cache                 4
    L2 cache                 15
    Main memory              110

The locality-aware stealing heuristic aims at improving cache usage by minimizing the costs of cache misses. To this end, the construct_core_set function returns a set of cores ordered by their distance from the stealing core (this knowledge can be obtained from the operating system and/or from measurements performed at the start of the runtime).

B. Time-left stealing

As we highlighted in Section II, migrating an event from one core to another is costly, notably because stealing requires locking the victim core's queue. The time-left heuristic aims at making more relevant decisions on whether cores should be chosen as victims or not. For this purpose, the processing time of events is taken into account.

More precisely, the time-left heuristic consists in dynamically classifying colors into two sets: a set of colors that are worth stealing and a set of colors that should not be stolen. We define a worthy color as a color such that the processing time of the set of events associated with that color exceeds the time it would take to steal the set. The can_be_stolen function is modified to return true only if such a color exists for a given core. This heuristic requires knowing the average time it takes to steal one single event, which can be obtained by profiling the runtime. The time-left heuristic also requires knowing the average processing time of the various handlers. This can be achieved by first profiling the application and then annotating the code of handlers.
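As an illustration, the worthy-color test can be expressed as in the sketch below. The per-color bookkeeping fields and function names are assumptions made for this example, not the paper's actual implementation; the per-event steal cost comes from runtime profiling and the handler times from programmer annotations, as described above.

    /* Sketch of the time-left test: a color is "worthy" only if executing
     * its pending events takes longer than stealing them. All names are
     * illustrative.                                                      */
    struct color_stats {
        unsigned long pending_events;    /* events queued for this color  */
        unsigned long cumulated_time;    /* sum of their handler times    */
    };

    static unsigned long steal_cost_per_event;   /* measured at runtime   */

    int is_worthy(const struct color_stats *cs) {
        unsigned long steal_cost = cs->pending_events * steal_cost_per_event;
        return cs->cumulated_time > steal_cost;
    }

    /* can_be_stolen(core) then returns true only if the victim core holds
     * at least one worthy color that is not currently being processed.   */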
C. Penalty-aware stealing

This heuristic aims at improving the choice of the color to be stolen. The time-left heuristic described in the previous section relies on the temporal properties of event handlers to classify colors as worthy or not. The penalty-aware heuristic aims at choosing the best color from a set of worthy colors, based on the memory usage of the events associated with each color.

The underlying idea can be explained as follows. Events whose handlers access a small data set are good candidates for being stolen, since their execution will not introduce substantial cache misses and cache pollution on the stealing core. In contrast, the case of event handlers accessing large data sets requires a more detailed inspection. If the data set is short-lived (e.g., when a handler allocates a buffer and frees it before its completion), then stealing the corresponding events can improve parallelism and does not increase the overall number of cache misses. However, events associated with large, long-lived data sets (e.g., passed, by value or reference, from one handler to another) are not good candidates for being stolen. Indeed, migrating such events to distant cores might cause high cache miss rates.

The penalty-aware heuristic allows the application developer to set stealing penalties on event handlers. Events processed by handlers with a high stealing penalty are less likely to be stolen than events with a low stealing penalty. This penalty mechanism allows artificially reducing the "attractiveness" of events accessing large, long-lived data sets. In the current state of our work, these annotations are set by the developer based on feedback from application profiling. An underlying assumption is that a given event handler has a relatively stable execution time. This hypothesis is reasonable in our context for two complementary reasons: (i) the small granularity of the considered tasks, and (ii) the effects of the locality- and penalty-aware strategies, which limit fluctuations caused by cache misses.

IV. THE MELY RUNTIME

In this section, we present Mely, an event-based multicore runtime that relies on the event-coloring paradigm. Mely has been designed so as to minimize event stealing costs and implements the three heuristics presented in the previous section. While Mely is backward compatible with Libasync-smp, it differs from it in the workstealing algorithm and in the implementation strategies for storing and managing events. We start with a description of the design of the Mely runtime. Then we discuss the implementation of the workstealing algorithm. Finally, we provide some additional implementation details.

A. Design

Similarly to Libasync-smp, each core runs a single thread in charge of executing event handlers. However, Mely rethinks the way events are manipulated by cores. To drastically reduce the processing time of workstealing functions like construct_event_set, Mely groups events with the same color in distinct queues, called color-queues. Each core maintains a list of color-queues, which are chained together using a doubly-linked list called a core-queue. Figure 5 depicts the architecture of the Mely runtime that is running on each core (the notion of stealing-queue is described in Section IV-B).

[Figure 5. Mely runtime architecture: a per-core thread, color-queues chained in a core-queue, and a stealing-queue.]

Using this organization, a core chooses the next event to be processed by simply taking the first event stored in the first color-queue. To prevent starvation, a core is not allowed to indefinitely process events with the same color: a threshold defines the maximum number of events with the same color that can be batch processed (when the threshold is reached, the runtime carries on with the next color-queue in the core-queue). In all experiments presented in this paper, the threshold is set to 10. When a color-queue is empty, it is removed from the core-queue.

When registering a new event, the producing core must first retrieve the adequate color-queue. To that end, like Libasync-smp, Mely uses a small (64KB), statically allocated array that keeps track of mappings between colors and core-queues. Moreover, if not already present, the producing core also inserts the color-queue into the core-queue of the core it belongs to.

Accesses to color-queues and core-queues must be done in mutual exclusion. To that end, as in Libasync-smp, each core owns a spinlock that is used by the different cores when accessing its color-queues and core-queue. Note that we cannot use a spinlock per color: this would not guarantee mutual exclusion when accessing the core-queues. Moreover, it is important to outline that a runtime relying on event-coloring for managing multiprocessor concurrency cannot store events using DEqueue structures [14] (as often advised in other workstealing-enabled runtimes). The reason is that these structures assume that only one thread registers events in a given queue, whereas in the event-coloring approach several cores can simultaneously try to register events in the queue of any given core.
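The struct sketch below illustrates one possible layout of the queues just described (color-queue, core-queue, stealing-queue). The field names are ours and only mirror the textual description; this is not Mely's actual source code.

    /* Illustrative layout only; not Mely's actual definitions.           */
    #include <pthread.h>

    struct event {
        void          (*handler)(void *);
        void           *continuation;
        unsigned short  color;
        struct event   *next;
    };

    struct color_queue {                   /* all pending events of one color */
        unsigned short      color;
        struct event       *head, *tail;
        unsigned long       cumulated_time; /* used by the time-left heuristic */
        struct color_queue *prev, *next;    /* links in the core-queue         */
        struct color_queue *steal_next;     /* link in the stealing-queue      */
    };

    struct core_queue {                     /* one instance per core           */
        struct color_queue *first;          /* next color to be served         */
        struct color_queue *stealing_queue; /* worthy colors, partially sorted */
        pthread_spinlock_t  lock;           /* one spinlock per core, as above */
    };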
B. Workstealing implementation

Mely's workstealing implementation is based on that of Libasync-smp, extended with the locality-aware, time-left, and penalty-aware heuristics. In this section, we detail the implementation of these heuristics.

a) Locality-aware stealing: The implementation of this heuristic is straightforward; the construct_core_set function builds the core set with respect to the cache hierarchy. We use the reification of the cache hierarchy provided by the Linux kernel and made accessible in the /sys file system. More precisely, Mely builds a cache map at startup time that allows each core to discover its neighbors.

b) Time-left stealing: The implementation of this strategy relies on the use of one stealing-queue per core (see Figure 5). These lists store the set of color-queues representing worthy colors. Within a stealing-queue, color-queues are ordered according to the cumulative processing time of all the events they store. Note that, in order to reduce insertion costs, the stealing-queue is only partially ordered: the queue is split in three time-left intervals, and within an interval color-queues are not ordered. This allows balancing insertion and lookup costs in a stealing-queue.

When a new event is inserted in a color-queue, the cumulative processing time of the queue is incremented accordingly. Symmetrically, when an event is removed from a color-queue, the cumulative processing time is decremented accordingly. When a color becomes worthy, the corresponding color-queue is inserted in the stealing-queue. The opposite operation is executed when a color is no longer worthy. As explained in Section III, in the current state of our work, the average processing time of each event handler is provided by the programmer after a profiling phase. The time required to steal an event is obtained from the runtime's built-in monitoring facilities.

c) Penalty-aware stealing: The implementation of the penalty-aware heuristic required defining an annotation allowing the user to set the workstealing penalty of each event handler. This penalty is used when computing the cumulative processing time of each color-queue. When an event is inserted in a color-queue, rather than increasing the cumulative processing time by the processing time of the event, it is increased by event_time / ws_penalty. Consequently, an event with a high workstealing penalty will be perceived as requiring less processing time than it actually does.

C. Additional implementation details

Mely is currently based on GCC 4.3 and Glibc 2.7. Threads are pinned on cores using the pthread_setaffinity_np function. We have carefully optimized data placement, using padding (i.e., dedicating one or more cache lines) of private data structures to prevent false sharing. TCMalloc [15] is also used for efficient and scalable memory allocation, reducing contention and increasing spatial locality with per-core memory pools. Lastly, to improve its scalability and robustness, Mely's main event loop for managing network and file I/O replaces the select()-based implementation of Libasync-smp with the epoll Linux system call (the performance gain brought by epoll has been previously observed in the context of highly loaded servers [16]), while preserving a compatible API with legacy applications developed for Libasync-smp. Note that, to provide a fair comparison in the evaluation performed in Section V, we also backported these optimizations into the legacy Libasync-smp runtime.

V. EVALUATION

In this section, we evaluate the Mely runtime. We first describe our experimental testbed. Then, we present microbenchmarks that analyze the individual effects of the heuristics presented in Section III. Finally, we study the performance of Mely using two real-sized system services: a Web server and the SFS file system.

A. Experimental settings

The experiments are performed on an 8-core machine with two quad-core Intel Xeon E5410 Harpertown processors. Each processor is composed of 4 cores running at 2.33GHz and grouped in pairs. A pair of cores from a same processor shares a 6 MB L2 cache; consequently, each processor contains 12 MB of L2 caches. Memory access times are uniform for all cores. The machine is also equipped with 8 GB of memory and eight 1Gb/s Ethernet network interfaces.

For the server experiments, we use between 8 and 16 dual-core Intel T2300 machines acting as load injection clients. All machines are interconnected using a Gigabit Ethernet non-blocking switch.

All machines run a Linux 2.6.24 kernel with hardware counter monitoring support. Runtime and applications are compiled using GCC 4.3.2 with the -O2 optimization flag and run under Glibc 2.7. For all benchmarks, we observe standard deviations below 1%.

B. Microbenchmarks

We use a set of microbenchmarks to study the performance of Mely. We first evaluate the impact of the runtime design on the behavior of the base workstealing (i.e., the workstealing algorithm defined in Libasync-smp). Then, we study the impact of the three workstealing heuristics.

d) Base workstealing: To evaluate the benefits provided by the careful data placement and the new queue structure, we compare Mely's performance to that achieved by Libasync-smp when enabling and disabling the base workstealing. We use a microbenchmark, called unbalanced, that works as follows. It implements a fork/join pattern: at each round, 50000 events are registered on the first core. 98% of these events are very short (100 cycles), whereas the other events are much longer (between 10 and 50 Kcycles). Events are independent (i.e., they are registered with different colors and can thus be processed concurrently). When all events have been processed, a new round begins. We repeat this operation during 5 seconds and measure the number of events processed per second.

Table III. Impact of the base workstealing.
    Configuration        KEvents/s   Locking time   WS cost (cycles)
    Libasync-smp         1310        0.93%          -
    Libasync-smp - WS    122         39.73%         28329
    Mely                 1265        0.89%          -
    Mely - base WS       1195        1.42%          2261

Results are presented in Table III. The unbalanced microbenchmark highlights the very bad results of the Libasync-smp workstealing implementation when the input load is not balanced. In particular, we notice that a core, on average, locks its victim for 28 Kcycles and only steals a set of events requiring 484 cycles to be processed. Moreover, we observe that almost 40% of the time is spent in runtime locks. As a consequence, the base workstealing algorithm strongly hurts the performance of Libasync-smp (-90%). This microbenchmark also shows that Mely drastically mitigates the performance hit of the base workstealing algorithm: it reduces the stealing time by a factor of 12.5. However, we can notice that the base workstealing still decreases performance (-5.5%). This highlights the need for smarter stealing heuristics.
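For reference, a minimal sketch of the fork/join pattern used by the unbalanced microbenchmark is given below. register_event_on_core() and the two handler names are hypothetical stand-ins for the runtime API, not the benchmark's actual code.

    /* Sketch of one round of the "unbalanced" microbenchmark: 50000
     * independent events registered on a single core, 98% short and
     * 2% long, each with a distinct color.                             */
    enum { EVENTS_PER_ROUND = 50000 };

    void short_handler(void *arg);   /* about 100 cycles of work        */
    void long_handler(void *arg);    /* 10 to 50 Kcycles of work        */

    void register_round(void) {
        for (int i = 0; i < EVENTS_PER_ROUND; i++) {
            void (*h)(void *) = (i % 50 == 0) ? long_handler : short_handler;
            /* distinct colors: every event may be stolen and run in parallel */
            register_event_on_core(/*core=*/0, h, /*arg=*/0, /*color=*/i);
        }
        /* when all events of the round have completed, a new round begins */
    }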
This Libasync-smp 1310 - is explained by the fact that this heuristic allows balancing Libasync-smp - WS 122 484 Mely - base WS 1195 445 the load on cores on which no event of type A are initially Mely - time-aware WS 2042 49987 registered, while ensuring that handlers accessing the same array are executed on neighbor cores. This results in a decrease f) Penalty-aware Stealing.: We evaluate the penalty- of L2 cache misses per event of about 83% with respect to aware heuristic using a microbenchmark called penalty. This the version running the base workstealing. microbenchmark works as follows: a single core starts with many events of type A (i.e. events which trigger handler A) Table VI IMPACT OF THE LOCALITY-AWARE STEALING. associated to different colors, while the other cores start with an empty event queue. When an event of type A is processed, Configuration KEvents/s L2 misses/Event an event of type B with the same color is created. Moreover, Libasync-smp 1156 0 the event of type A creates an array fitting in the core cache. Libasync-smp - WS 1497 13 Mely - base WS 1426 12 Each event of type B accesses an offset of its parent array Mely - locality-aware WS 1869 2 and registers a new event of type B with the same color. This operation is repeated until the array has been completely accessed. This way, each core executes a set of events with C. System services the same color that access the same array. In this benchmark, idle cores have more opportunities to steal events of type B In this section, we evaluate our propositions on two real- but should preferably steal events of type A to preserve cache sized system services. The first one is a Web server, SWS, locality. which mostly runs short duration handlers for processing We measure the total number of tasks treated by second. requests. The second use case is SFS [12]. Unlike the Web Results are presented in Table V. The penalty of events of server, SFS mainly executes coarse grain handlers (i.e. cryp- type B was set to 1000. We first observe that Libasync- tographic operations). In both cases, we compare the Mely smp achieves very low performance when workstealing is runtime (with workstealing enabled) and Libasync-smp with enabled. In constrast, the penalty-aware workstealing allows and without workstealing. improving performance by 53% with respect to the Mely 1) SWS Web server: SWS handles static content, supports runtime executing the base workstealing. These results can be a subset of HTTP/1.1, builds responses during start-up (an explained by the following fact: the load is initially unbalanced optimization already used in Flash [6]), and handles errors (all events of type A are registered on the same core) and the cases. penalty-aware workstealing allows balancing the load, while The architecture of SWS is similar to the one described by keeping a low number of L2 cache misses. Indeed, the number Zeldovich et al. in their initial work on Libasync-smp [11]. of L2 cache misses per processed event is 95% lower than However, we optimized cache-management since our work- when executing the base workstealing algorithm in Mely. loads always fit in main memory. g) Locality-aware stealing.: We evaluate the locality- The architecture of SWS (illustrated in Figure 6) is similar aware heuristic using a microbenchmark called cache efficient. to the one described by Zeldovich et al. in their initial 200 work on Libasync-smp [11]. However, we optimized cache- Mely - WS Userver management since our workload fits in main memory. 
Libasync-smp Libasync-smp - WS 150 Apache Dec Close Accepted Clients RegisterFd InEpoll 100



[Figure 6. SWS architecture (handlers: Epoll, Accept, ReadRequest, RegisterFdInEpoll, ParseRequest, GetFromCache, WriteResponse, Close, DecClientAccepted).]

SWS is structured in 9 event handlers. The Epoll handler is responsible for monitoring active file descriptors. When a file descriptor has pending operations, it registers an event for either the Accept or the ReadRequest handler. Epoll is always associated with color 0 (thus initially executing on the first core). The Accept handler is in charge of accepting new connections; as in other Web servers, it is possible to specify the maximum number of simultaneous clients. Events associated with this handler are colored with color 1 (thus initially set on the second core). The ReadRequest handler is in charge of reading requests. The RegisterFdInEpoll handler makes it possible to monitor a new file descriptor; in order to manage concurrency, this handler is colored like Epoll. The ParseRequest handler is used to analyze the client request. The CheckInCache handler gets the response from a map indexed by filename and containing pre-built responses. The WriteResponse handler sends responses to the client and the Close handler shuts down connections. Finally, the DecClientAccepted handler decrements the current number of accepted clients after closing a connection; this handler is colored like Accept to manage concurrency.

ReadRequest, ParseRequest, WriteResponse and Close events are colored in such a way that requests issued by distinct clients can be served concurrently. For this purpose, we use the file descriptor number of the socket as the color.

For load injection, we developed an event-based closed-loop load injector [17] similar to the one described in [18]. It uses a master/slave scheme, i.e., a master node synchronizes a set of load injection nodes (each simulating multiple HTTP clients) and collects their results.

We evaluate the Mely runtime on SWS when serving small static files of 1KB size. We use 8 physical clients which emulate between 200 and 2000 virtual clients. Each virtual client repeatedly connects to the Web server and requests 150 files. One run lasts 30s and is repeated 3 times.

Figure 7 presents the throughput observed with three runtime configurations: Libasync-smp with workstealing, Libasync-smp without workstealing, and Mely with workstealing enabled (with all heuristics activated). To assess the performance of SWS, we also include results for two other efficient and well-established Web servers: the worker (multithreaded) version of Apache [19] and a multiprocess configuration of the event-based µserver [1]. We observe that SWS running on Mely outperforms all the other configurations. µserver shows that the N-copy approach achieves good performance; however, as explained in Section VI, this approach is not always applicable.

[Figure 7. Performance of SWS (throughput in KRequests/s for 200 to 2000 clients; SWS on Mely - WS, Libasync-smp, Libasync-smp - WS, Apache, and µserver).]

In Libasync-smp, enabling the workstealing algorithm decreases performance under this workload by up to 33%. As explained in Section II, this degradation is due to two main factors: (i) very high stealing costs (197 Kcycles) that exceed the stolen processing time (20 Kcycles), and (ii) a drastic increase in L2 cache misses (+146%) over Libasync-smp without workstealing.

Mely outperforms Libasync-smp with workstealing by up to 73%. It steals 14% more processing time (23 Kcycles) and is 32 times faster to steal (6 Kcycles), thus making workstealing profitable. Moreover, profiling indicates that the locality- and penalty-aware optimizations decrease the number of L2 cache misses by 24%. Mely also improves performance by nearly 25% compared to the Libasync-smp runtime without workstealing. Profiling reveals that the workstealing mechanism relieves the core in charge of the Epoll handler from request processing and thus helps improve responsiveness to the incoming network activity.

Additionally, we measure the performance of Mely with workstealing disabled. This configuration has slightly lower performance than Libasync-smp without workstealing (between -7% and -20%), mainly due to the use of many short-lived colors for each connection. Indeed, these short-lived colors introduce costly insertion and removal operations of color-queues in core-queues. This result emphasizes the efficiency of the Mely workstealing algorithm.

2) Secured File Server (SFS): SFS is an NFS-like secured file system. SFS clients communicate with the server using persistent TCP connections. As all communications are encrypted and authenticated, SFS is CPU-intensive. Our experiments showed that the SFS server spends more than 60% of its time performing cryptographic operations, confirming previous results [11]. We used the coloring scheme described in Libasync-smp [11], where only the CPU-intensive handlers are colored.

We performed load injection using 16 client nodes connected to the server through a Gigabit Ethernet switch. Since SFS only supports a single network interface, we use interface bonding [20] in order to exploit all the available Ethernet ports on the server. Each machine runs a single client that sends requests using the SFS protocol. We use the multio benchmark [21] configured as follows: each client reads a 200MB file. Note that, similarly to the benchmark described by Zeldovich et al. [11], the content of the requested file remains in the server's disk buffer cache. Moreover, each client flushes its cache before sending a file request in order to ensure that the request will be sent to the SFS server. Each client computes the throughput at which it reads the file. A master is in charge of collecting the values computed by all the clients.

[Figure 8. Performance of SFS (throughput in MB/sec for Libasync-smp, Libasync-smp - WS, and Mely - WS).]

The average throughput is depicted in Figure 8. We evaluate Libasync-smp without workstealing, Libasync-smp with workstealing enabled, and Mely with our improved workstealing algorithm (with all heuristics enabled). As mentioned in Section II, we notice that the legacy Libasync-smp workstealing significantly improves the performance of the SFS server (around 35%). Finally, we observe that Mely's improved workstealing performs similarly to the Libasync-smp workstealing. As expected (see Section II), Mely's workstealing algorithm does not degrade the performance of applications for which the Libasync-smp workstealing is efficient.

VI. RELATED WORK

Similarly to the initial publication about Libasync-smp [11], this paper is not aimed at reviving the debate on the relative merits of the thread-based and event-driven models [22], [9], [23], [6], [24], [25], nor at proposing new ways to deal with concurrency and state management issues [22], [26], [10], [27]; it focuses instead on improving the performance of existing event-driven software on multicore platforms.

In addition to event-coloring, two other techniques have been used for running event-driven code on parallel hardware. The first one, named N-copy, consists in running several independent instances of the same application. While straightforward, such a configuration may reduce efficiency and does not work if the different instances must share mutable state [11]. The second option is based on a hybrid, stage-based architecture combining threads and events: an application is structured as a set of stages interacting via events, and inside a stage, events are executed by a pool of threads [28], [29]. This solution does not suffer from the issues of the N-copy approach but exposes the complexity of preemptive thread-based concurrency to the programmer.

The multiprocessor performance of runtime systems based on structured event queues has been studied, yet with different assumptions regarding the exposed programming model [29] or the application domain and the granularity of tasks [30]. In SEDA, task dispatching decisions are offloaded to the OS thread scheduler and, as far as we know, this aspect has not been studied in detail. Due to specific design constraints mentioned by its authors, SMP Click cannot rely on workstealing for adaptive load balancing and uses another custom technique. The applicability of the latter approach to Libasync-smp is limited by the fact that they do not implement the same form of parallelism.

Jannotti et al. [31] have improved and partially automated the specification of mutual exclusion constraints with the event-coloration technique, in order to allow more parallelism. This work is complementary to ours, since it is an enhancement of the programming model for which we present an efficient execution runtime. However, to the best of our knowledge, their proposal has not been fully implemented nor evaluated.

Previous research on uniprocessor event-driven Web servers has demonstrated the benefits of careful event scheduling policies. First, Brecht et al. [32] have shown that tuning the batch scheduling factor of connection-accepting handlers could yield important throughput improvements. Second, Bhatia et al. [33] have highlighted the improved cache behavior provided by interactions between the event scheduler and the memory allocator. We are currently considering how such optimizations can be fruitfully combined with the mechanisms introduced in this paper.

Our context (event-coloring runtimes) brings constraints that are usually not taken into account by previous studies on workstealing [34], [35] in runtimes like Cilk [36]. These constraints apply to both the runtime data structures and the selection of stolen tasks. In particular, we cannot benefit from the efficient DEqueues employed in many workstealing-enabled systems [14], [37]. Besides, due to the very small granularity of most tasks in our context, the workstealing costs have a much stronger impact.

McRT [38], the Intel manycore runtime, can also use workstealing for load balancing cooperatively scheduled tasks. However, to the best of our knowledge, it differs from our contribution in several ways. First, it relies on other concurrency control mechanisms, such as software transactional memories, which frees the scheduler from the kind of constraint induced by event-coloring. Second, it targets future, very large scale architectures (up to 128 cores, each with multiple hardware threads) using a simulator and thus adopts different tradeoffs (for instance, stealing attempts are restricted to neighbor cores); in contrast, we run our experiments on currently available medium-scale hardware. Finally, its evaluation was focused on desktop rather than server applications.
VII. CONCLUSION

Event-driven programming is a popular paradigm that has proven well-adapted to the design of networked applications. The event-coloring approach allows such systems to leverage the pervasive hardware parallelism provided by multicore architectures. We study the workstealing mechanism used by Libasync-smp for balancing event processing on cores and show that it can degrade the performance of certain applications such as Web servers.

To overcome these performance issues, we introduce a novel runtime, Mely, which is backward-compatible with Libasync-smp. Mely features an internal architecture aimed at minimizing the cost of workstealing and relies on heuristics to improve the efficiency of stealing decisions. These optimizations can be mostly transparent for application programmers and yield significant performance improvements (up to +73% compared to Libasync-smp with workstealing and +25% compared to Libasync-smp without workstealing). In the worst case, Mely's workstealing does not degrade performance. While our experimental work has focused on the context of Libasync-smp, we believe that our contributions are more general and could be easily applicable to other event-driven runtimes, should they be made multiprocessor-compliant.

As future work, we plan to study techniques to dynamically set time-left annotations and workstealing penalties based on automated monitoring of the running time and memory usage of each handler.

ACKNOWLEDGMENTS

This work was partially funded by the OMP European project (FP7-ICT-214009) and the Aravis (Minalogic competitive cluster) project.

REFERENCES

[1] "The µserver project," 2007, http://userver.uwaterloo.ca.
[2] Acme Labs, "thttpd: Tiny/turbo/throttling HTTP server," http://www.acme.com/software/thttpd/.
[3] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica, "Wide-area cooperative storage with CFS," in SOSP, 2001.
[4] M. J. Freedman, E. Freudenthal, and D. Mazières, "Democratizing Content Publication with Coral," in NSDI, 2004.
[5] M. Krohn, "Building Secure High-Performance Web Services with OKWS," in USENIX ATC, 2004.
[6] V. S. Pai, P. Druschel, and W. Zwaenepoel, "Flash: An efficient and portable Web server," in USENIX ATC, 1999.
[7] J. Stribling, J. Li, I. G. Councill, M. F. Kaashoek, and R. Morris, "OverCite: A Distributed, Cooperative CiteSeer," in NSDI, May 2006.
[8] Zeus Technology, "Zeus Web Server," http://www.zeus.com/products/zws/.
[9] F. Dabek, N. Zeldovich, F. Kaashoek, D. Mazières, and R. Morris, "Event-Driven Programming for Robust Software," in ACM SIGOPS European Workshop, 2002.
[10] M. Krohn, E. Kohler, and M. F. Kaashoek, "Events Can Make Sense," in USENIX ATC, 2007.
[11] N. Zeldovich, A. Yip, F. Dabek, R. Morris, D. Mazières, and M. F. Kaashoek, "Multiprocessor Support for Event-Driven Programs," in USENIX ATC, 2003.
[12] D. Mazières, M. Kaminsky, M. F. Kaashoek, and E. Witchel, "Separating Key Management From File System Security," in SOSP, 1999.
[13] S. B. Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang, "Corey: An Operating System for Many Cores," in OSDI, 2008.
[14] D. Chase and Y. Lev, "Dynamic circular work-stealing deque," in SPAA, 2005.
[15] S. Ghemawat and P. Menage, "TCMalloc: Thread-Caching Malloc," http://goog-perftools.sourceforge.net/doc/tcmalloc.html.
[16] D. Kegel, "The C10K problem," 2006, http://www.kegel.com/c10k.html.
[17] B. Schroeder, A. Wierman, and M. Harchol-Balter, "Open Versus Closed: a Cautionary Tale," in NSDI, 2006.
[18] G. Banga and P. Druschel, "Measuring the Capacity of a Web Server," in USITS, 1997.
[19] "The Apache HTTP server project," http://httpd.apache.org.
[20] The Linux Foundation, "Bonding multiple devices," http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding.
[21] "The multio benchmark," 2004, http://www.cisl.ucar.edu/css/software/multio/.
[22] A. Adya, J. Howell, M. Theimer, W. J. Bolosky, and J. R. Douceur, "Cooperative Task Management Without Manual Stack Management," in USENIX ATC, 2002.
[23] J. K. Ousterhout, "Why threads are a bad idea (for most purposes)," presentation given at the USENIX ATC, 1996.
[24] R. von Behren, J. Condit, and E. A. Brewer, "Why events are a bad idea (for high-concurrency servers)," in HotOS, 2003.
[25] R. von Behren, J. Condit, F. Zhou, G. C. Necula, and E. Brewer, "Capriccio: Scalable threads for internet services," in SOSP, 2003.
[26] B. Burns, K. Grimaldi, A. Kostadinov, E. D. Berger, and M. D. Corner, "Flux: A Language for Programming High-Performance Servers," in USENIX ATC, 2006.
[27] G. Upadhyaya, V. S. Pai, and S. P. Midkiff, "Expressing and Exploiting Concurrency in Networked Applications with Aspen," in PPoPP, 2007.
[28] J. R. Larus and M. Parkes, "Using Cohort Scheduling to Enhance Server Performance," in USENIX ATC, 2002.
[29] M. Welsh, D. Culler, and E. Brewer, "SEDA: An architecture for well-conditioned scalable internet services," in SOSP, 2001.
[30] B. Chen and R. Morris, "Flexible Control of Parallelism in a Multiprocessor PC Router," in USENIX ATC, 2001.
[31] J. Jannotti and K. Pamnany, "Safe at Any Speed: Fast, Safe Parallelism in Servers," in HotDep, 2006.
[32] T. Brecht, D. Pariag, and L. Gammo, "Acceptable Strategies for Improving Web Server Performance," in USENIX ATC, 2004.
[33] S. Bhatia, C. Consel, and J. L. Lawall, "Memory-Manager/Scheduler Co-Design: Optimizing Event-Driven Servers to Improve Cache Behavior," in ISMM, 2006.
[34] R. D. Blumofe and C. E. Leiserson, "Scheduling multithreaded computations by work stealing," J. ACM, vol. 46, no. 5, pp. 720-748, 1999.
[35] F. W. Burton and M. R. Sleep, "Executing functional programs on a virtual tree of processors," in FPCA, 1981.
[36] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, "Cilk: An Efficient Multithreaded Runtime System," J. Parallel Distrib. Comput., vol. 37, no. 1, pp. 55-69, 1996.
[37] M. Herlihy and N. Shavit, "Chapter 16: Futures, Scheduling and Work Distribution," in The Art of Multiprocessor Programming. Morgan Kaufmann, 2008, pp. 369-396.
[38] B. Saha, A.-R. Adl-Tabatabai, A. Ghuloum, M. Rajagopalan, R. L. Hudson, L. Petersen, V. Menon, B. Murphy, T. Shpeisman, E. Sprangle, A. Rohillah, D. Carmean, and J. Fang, "Enabling Scalability and Performance in a Large Scale CMP Environment," in EuroSys, 2007.