Work Stealing Scheduler for Automatic Parallelization in Faust


Stéphane Letz, Yann Orlarey, Dominique Fober
GRAME, 9 rue du Garet, BP 1185, 69202 Lyon Cedex 01, France
{letz, orlarey, fober}@grame.fr

Abstract

Faust 0.9.10¹ introduces an alternative to the OpenMP based parallel code generation, using a Work Stealing Scheduler and explicit management of worker threads. This paper explains the new option and presents some benchmarks.

Keywords

FAUST, Work Stealing Scheduler, Parallelism

1 Introduction

Multi/many core machines are becoming common, and there is a challenge for the software community to develop adequate tools to take advantage of the new hardware platforms [3] [8]. Various libraries like Intel Threading Building Blocks² or "extended C like" languages [5] can help, but they still require the programmer to precisely define the sequential and parallel sub-parts of the computation and to use the appropriate library calls or building blocks to implement the solution.

In the audio domain, work has been done to parallelize well-known algorithms [4], but very few systems aim to help with automatic parallelization. The problem is usually tricky with stateful systems and imperative approaches, where data has to be shared between several threads and concurrent accesses have to be accurately controlled.

On the contrary, the functional approach used in high-level specification languages generally helps in this area. It basically allows the programmer to define the problem in an abstract way, completely disconnected from implementation details, and lets the compiler and the various backends do the hard job. By exactly controlling when and how state is managed during the computation, the compiler can decide which multi-threading or parallel code generation techniques can be used. More sophisticated methods that reorganize the code to fit specific architectures can then be tried and analyzed.

1.1 The Faust approach

Faust [2] is a programming language for real-time signal processing and synthesis designed from scratch to be a compiled language. Being efficiently compiled allows Faust to provide a viable high-level alternative to C/C++ to develop high-performance signal processing applications, libraries or audio plug-ins.

The Faust compiler is built as a stack of code generators. The scalar code generator produces a single computation loop. The vector generator rearranges the C++ code in a way that facilitates the autovectorization job of the C++ compiler: instead of generating a single sample computation loop, it splits the computation into several simpler loops that communicate by vectors. The result is a Directed Acyclic Graph (DAG) in which each node is a computation loop.

Starting from this DAG of computation loops (called "tasks"), Faust was already able to generate parallel code using OpenMP directives [1]. This model has now been completed with a new dynamic scheduler based on a Work Stealing model.

¹ This work was partially supported by the ANR project ASTREE (ANR-08-CORD-003).
² http://www.threadingbuildingblocks.org/
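To give an idea of what the vector code generator does, here is a minimal hand-written sketch (not actual Faust compiler output; the filter, function and buffer names are invented) of the same computation written once as a single sample loop, and once as two simpler loops communicating through an intermediate vector, each of which would become one node of the DAG:

#include <cstddef>

// Hypothetical one-pole filter followed by a gain, written two ways.

// Scalar style: one loop computes everything sample by sample.
void process_scalar(const float* in, float* out, std::size_t count,
                    float coeff, float gain, float& state)
{
    for (std::size_t i = 0; i < count; i++) {
        state = in[i] + coeff * state;  // recursive part
        out[i] = gain * state;          // gain part
    }
}

// Vector style: the computation is split into two simpler loops that
// communicate through an intermediate vector; each loop corresponds to
// one computation loop ("task") of the DAG. Assumes count <= 4096.
void process_vector(const float* in, float* out, std::size_t count,
                    float coeff, float gain, float& state)
{
    float temp[4096];  // intermediate vector between the two loops
    for (std::size_t i = 0; i < count; i++) {
        state = in[i] + coeff * state;
        temp[i] = state;
    }
    for (std::size_t i = 0; i < count; i++) {
        out[i] = gain * temp[i];
    }
}

The simpler, state-free second loop is exactly the kind of code a C++ compiler can autovectorize easily.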
2 Strategies for parallel code

The role of scheduling is to decide how to organize the computation, in particular how to assign tasks to processors.

2.1 Static versus dynamic scheduling

The scheduling problem exists in two forms: static and dynamic. In static scheduling, usually done at compile time, the characteristics of a parallel program (such as task processing times, communication, data dependencies and synchronization requirements) are known before program execution. Temporal and spatial assignment of tasks to processors is done by the scheduler to optimize the execution time. In dynamic scheduling, on the contrary, only a few assumptions about the parallel program can be made before execution; scheduling decisions have to be taken on-the-fly, and the assignment of tasks to processors is made at run time.

2.2 DSP specific context

We can list several specific elements of the problem:

- the graph describes a DSP processor which is going to read n input buffers containing p frames to produce m output buffers containing the same number p of frames;

- the graph is completely known in advance, but the precise cost of each task and the communication times (memory accesses, cache effects...) are not known;

- the graph can be computed in a single step, each task processing the p frames at once, or the buffers can be cut into slices and the graph executed several times on sub-slices to fill the output buffers (see the "Pipelining" section).

2.3 OpenMP mode

The OpenMP based parallel code generation is activated by passing the --openMP (or --omp) option to the Faust compiler. It implies the --vec option, as the parallel code generation is built on top of the vector code generation.

In OpenMP mode, a topological sort of the graph is done. Starting from the inputs, tasks are organized as a sequence of groups of parallel tasks, and appropriate OpenMP directives are added to describe a fork/join model. Each group of tasks is executed in parallel, and synchronization points are placed between the groups (Figure 1). This model gives good results with the Intel icc compiler. Unfortunately, until recently the OpenMP implementation in g++ was quite weak and inefficient, and even unusable in a real-time context on OSX.

[Figure 1: Tasks graph with forward activations and explicit synchronization points]

Moreover, the overall approach of organizing tasks as a sequence of groups of parallel tasks is not optimal, in the sense that synchronization points are required for all threads even when part of the graph could continue to run. The synchronization strategy is too strict, and a data-flow model is more appropriate.
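To make those per-group synchronization points concrete, here is a minimal hand-written OpenMP sketch (not actual Faust compiler output; the loop bodies and buffer names are invented) of two groups of parallel loops; the implicit barrier at the end of each sections block is the synchronization point criticized above:

#include <omp.h>

// Two groups of tasks: loops over bufA and bufB can run in parallel,
// then loops over bufC and bufD, which depend on the first group.
void compute_graph(float* bufA, float* bufB, float* bufC, float* bufD, int count)
{
    #pragma omp parallel
    {
        // Group 1
        #pragma omp sections
        {
            #pragma omp section
            for (int i = 0; i < count; i++) bufA[i] = bufA[i] * 0.5f;
            #pragma omp section
            for (int i = 0; i < count; i++) bufB[i] = bufB[i] + 1.0f;
        } // implicit barrier: all threads wait here

        // Group 2, which consumes the results of group 1
        #pragma omp sections
        {
            #pragma omp section
            for (int i = 0; i < count; i++) bufC[i] = bufA[i] + bufB[i];
            #pragma omp section
            for (int i = 0; i < count; i++) bufD[i] = bufA[i] - bufB[i];
        }
    }
}

Every thread has to reach the barrier before the next group starts, even if only part of the graph is actually blocked on missing data.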

2.3.1 Activating Work Stealing Scheduler mode

The scheduler based parallel code generation is activated by passing the --scheduler (or --sch) option to the Faust compiler. It implies the --vec option, as the parallel code generation is built on top of the vector code generation.

With the --scheduler option, the Faust compiler uses a very different approach: a data-flow model of graph execution, run by a dynamic scheduler. Parallel C++ code embedding a Work Stealing Scheduler and using a pool of worker threads is generated. The threads are created when the application starts and all participate in the computation of the graph of tasks. Since the graph topology is known at compilation time, the generated C++ code can precisely describe the execution flow (which tasks have to be executed when a given task is finished...). Ready tasks are activated at the beginning of the compute method and are executed by the available threads of the pool. The control flow then circulates in the graph from input tasks to output tasks in the form of activations (the index of the ready task, called tasknum) until all output tasks have been executed.

2.4 Work Stealing Scheduler

In a Work Stealing Scheduler [7], idle threads take the initiative: they attempt to steal tasks from other threads. This is possible by having each thread own a Work Stealing Queue (WSQ), a special double-ended queue with a Push operation, a private LIFO Pop operation³ and a public FIFO Pop operation⁴. The basic idea of work stealing is for each processor to place work in its local WSQ as it is discovered, greedily perform work from its local WSQ, and steal work from the WSQ of other threads when its local WSQ is empty.

Starting from a ready task, each thread executes it and follows the data dependencies, possibly pushing ready output tasks into its own local WSQ. When no more tasks can be executed on a given computation path, the thread pops a task from its local WSQ using the private LIFO Pop operation. If the WSQ is empty, the thread is allowed to steal tasks from the WSQ of other threads using their public FIFO Pop operation.

The local LIFO Pop operation allows better cache locality, and the FIFO steal Pop takes larger chunks of work. The reason for this is that many work stealing workloads are divide-and-conquer in nature: stealing one of the oldest tasks implicitly also steals a (potentially) large subtree of computations that will unfold once that piece of work is stolen and run.

For a given cycle, the whole number of frames is used in one graph execution cycle. So when it has finished, a given task will (possibly) activate its output tasks only, and activation goes forward.

³ Which does not need to be multi-thread aware.
⁴ Which has to be multi-thread aware, using lock-free techniques, and is thus more costly.
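The per-thread queue can be pictured with the following simplified sketch (an illustration only: it uses a single mutex for clarity, whereas the class and methods generated by Faust rely on more efficient, partly lock-free techniques, and the names below are not those of the actual runtime):

#include <deque>
#include <mutex>

// Simplified Work Stealing Queue: the owner pushes and pops at the head
// (LIFO, good cache locality), thieves steal from the tail (FIFO, older
// and therefore potentially larger pieces of work).
class SimpleWSQ {
    std::deque<int> fTasks;   // task indexes ("tasknum")
    std::mutex fMutex;        // a real implementation avoids this lock
public:
    void PushHead(int tasknum) {
        std::lock_guard<std::mutex> lock(fMutex);
        fTasks.push_front(tasknum);
    }
    // Private LIFO Pop, used by the owner thread. Returns -1 when empty.
    int PopHead() {
        std::lock_guard<std::mutex> lock(fMutex);
        if (fTasks.empty()) return -1;
        int tasknum = fTasks.front();
        fTasks.pop_front();
        return tasknum;
    }
    // Public FIFO Pop, used by other threads to steal. Returns -1 when empty.
    int PopTail() {
        std::lock_guard<std::mutex> lock(fMutex);
        if (fTasks.empty()) return -1;
        int tasknum = fTasks.back();
        fTasks.pop_back();
        return tasknum;
    }
};

In the generated code, the owner's Pop does not need to be multi-thread aware, while the steal operation does, which is why the latter is more costly.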
2.4.1 Code generation

The compiler produces a computeThread method called by all threads:

- tasks are numbered and compiled as a big switch/case block;

- a work stealing task, whose aim is to find the next ready task, is created;

- an additional last task is created.

2.5 Compilation of different types of nodes

For a given task in the graph, the compiled code depends on the topology of the graph at this stage.

2.5.1 One output and direct link

If the task has one output only and this output has one input only (so there is a single link between the two tasks), then a direct activation is compiled: the tasknum of the next task is the tasknum of the output task, and no additional step is required to find the next task to run.

2.5.2 Several outputs

If the task has several outputs, the code has to:

- init tasknum with the WORK_STEALING value;

- if there is a direct link between the given task and one of the output tasks, this output task will be the next to execute; all other tasks with a direct link are pushed on the current thread's WSQ⁵;

- otherwise, for output tasks with more than one input, the activation counter is atomically decremented (possibly returning the tasknum of the next task to execute);

- after execution of the activation code, tasknum will either contain the actual value of the next task to run or WORK_STEALING, so that the next ready task is found by running the work stealing task.

2.5.3 Work Stealing task

The special work stealing task is executed when the current thread has no next task to run in its computation path and its WSQ is empty. The GetNextTask function aims to find a ready task, possibly stealing a task to run from any of the other threads except the current one. If no task is ready, GetNextTask returns the WORK_STEALING value and the thread loops until it finally finds a task or the whole computation ends.

2.5.4 Last task

The output tasks of the DAG are connected to, and activate, the special last task, which in turn quits the thread.

void computeThread(int thread)
{
    TaskQueue taskqueue;
    int tasknum = -1;
    int count = fFullCount;
    // Init input and output
    FAUSTFLOAT* input0 = &input[0][fIndex];
    FAUSTFLOAT* input1 = &input[1][fIndex];
    FAUSTFLOAT* output0 = &output[0][fIndex];
    // Init graph
    int task_list_size = 2;
    int task_list[2] = {2,3};
    taskqueue.InitTaskList(task_list_size, task_list,
                           fDynamicNumThreads, cur_thread,
                           tasknum);
    // Graph execution code
    while (!fIsFinished) {
        switch (tasknum) {
            case WORK_STEALING: {
                tasknum = GetNextTask(thread);
                break;
            }
            case LAST_TASK: {
                fIsFinished = true;
                break;
            }
            case 2: {
                // DSP code
                PushHead(4);
                tasknum = 3;
                break;
            }
            case 3: {
                // DSP code
                PushHead(5);
                tasknum = 4;
                break;
            }
            case 4: {
                // DSP code
                tasknum = ActivateOutput(LAST_TASK);
                break;
            }
            case 5: {
                // DSP code
                tasknum = ActivateOutput(LAST_TASK);
                break;
            }
        }
    }
}

Listing 1: example of computeThread method

⁵ The chosen task here is the first in the task output list; more sophisticated choice heuristics could be tested at this stage.
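The ActivateOutput call in Listing 1 presumably wraps the per-task activation counters described in section 2.5.2. The following is a rough hand-written sketch of that mechanism (the structure, member names and the WORK_STEALING value are invented for the illustration and are not those of the generated runtime):

#include <atomic>

// Each task with several inputs carries an activation counter,
// reinitialized at each cycle to its number of inputs. The task becomes
// ready when the last of its inputs has been computed.
struct TaskActivation {
    std::atomic<int> fCounter;   // remaining inputs still to be computed
    int fTasknum;                // index of the task in the switch/case

    // Called by a producer task when it has finished its own work.
    // Returns the tasknum if this call made the task ready, or the
    // WORK_STEALING marker (an assumed convention here) otherwise.
    int Activate() {
        static const int WORK_STEALING = -2;
        if (fCounter.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            return fTasknum;   // last input arrived: task is ready
        }
        return WORK_STEALING;  // another producer will (or did) return it
    }
};

The atomic decrement is what makes it safe for several producer threads to finish their work and activate the same output task concurrently.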
2.6 Activation at each cycle

The following steps are done at each cycle:

- n-1 worker threads are resumed and start to work;

- the calling thread also participates in the computation;

- ready tasks (tasks that depend on the input audio buffers, or tasks with no inputs) are dispatched between all threads, that is, each thread pushes sub-tasks in its private WSQ and takes one of them to be executed directly;

- after having done its part of the work, the main thread waits for all worker threads to finish.

This is done in the following compute method, called by the master thread:

void compute(int fullcount,
             FAUSTFLOAT** input,
             FAUSTFLOAT** output)
{
    this->input = input;
    this->output = output;
    for (fIndex = 0; fIndex < fullcount; fIndex += 1024) {
        fFullCount = min(1024, fullcount - fIndex);
        TaskQueue::Init();
        // Initialize end task
        fGraph.InitTask(1,1);
        // Only initialize tasks with inputs
        fGraph.InitTask(3,1);
        // Activate worker threads
        fIsFinished = false;
        fThreadPool.SignalAll(fDynamicNumThreads - 1);
        // Master thread participates
        computeThread(0);
        // Wait for end
        while (!fThreadPool.IsFinished()) {}
    }
}

Listing 2: master thread compute method

2.7 Start time

At start time, n worker threads are created and put in a sleep state; thread scheduling properties and priorities are set according to the calling thread parameters.
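A rough sketch of this start-time machinery, assuming POSIX threads, is shown below (SketchThreadPool, Start and the cycle counter are invented names; the real SignalAll in Listing 2 takes the number of threads to wake, and the setting of scheduling class and priority from the calling thread is omitted here):

#include <pthread.h>
#include <vector>

// Minimal worker pool: threads are created once, sleep on a condition
// variable, and each SignalAll() wakes them up for one compute cycle.
class SketchThreadPool {
    pthread_mutex_t fMutex;
    pthread_cond_t fCond;
    long fCycle;                  // incremented at each compute cycle
    std::vector<pthread_t> fThreads;

    static void* Worker(void* arg) {
        SketchThreadPool* pool = static_cast<SketchThreadPool*>(arg);
        long seen = 0;
        while (true) {
            pthread_mutex_lock(&pool->fMutex);
            while (pool->fCycle == seen) {
                pthread_cond_wait(&pool->fCond, &pool->fMutex);  // sleep
            }
            seen = pool->fCycle;
            pthread_mutex_unlock(&pool->fMutex);
            // ... call the generated computeThread(i) for this cycle ...
        }
        return nullptr;
    }

public:
    SketchThreadPool() : fCycle(0) {
        pthread_mutex_init(&fMutex, nullptr);
        pthread_cond_init(&fCond, nullptr);
    }

    // Create n worker threads; a pthread_attr_t could be used here to
    // inherit the scheduling class and priority of the calling thread.
    void Start(int n) {
        for (int i = 0; i < n; i++) {
            pthread_t t;
            pthread_create(&t, nullptr, Worker, this);
            fThreads.push_back(t);
        }
    }

    // Resume all sleeping workers for one computation cycle.
    void SignalAll() {
        pthread_mutex_lock(&fMutex);
        fCycle++;
        pthread_cond_broadcast(&fCond);
        pthread_mutex_unlock(&fMutex);
    }
};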


3 Pipelining

Some graphs are sequential by nature. Pipelining techniques aim to create and exploit parallelism in this kind of graph in order to make better use of multi-core machines. The general idea is that, for a sequence of two tasks A and B, instead of running A on the entire buffer and then running B, we want to split the computation into n sub-buffers and run A on the first sub-buffer; B can then run on A's first sub-buffer output while A runs on the second sub-buffer, and so on.

[Figure 2: Very simple graph rewritten to express pipelining; A is a recursive task, B is a non recursive one]

Different code generation techniques can be used to obtain the desired result. We choose to rewrite the initial graph of tasks by duplicating each task n times, each copy running on a sub-part of the buffer. This way the previously described code generator can be used directly, without major changes.

3.1 Code generation

The pipelining parallel code generation is activated by passing the --sch option as well as the --pipelining (or -pipe) option with a value: the factor by which the initial buffer will be divided.

The previously described code generation strategy has to be completed. The initial task graph is rewritten and each task is split into several sub-tasks, as sketched in the example after this list:

- recursive tasks⁶ are rewritten as a list of n connected sub-tasks (that is, activation has to go from the first sub-task to the second one and so on). Each sub-task is then going to run on buffer-size/n frames. There is no real gain for the recursive task itself, since it is still computed in a sequential manner, but its output sub-tasks can be activated more rapidly and possibly executed immediately if they are part of a non recursive task (Figure 2);

- non recursive tasks are rewritten as a list of n non connected sub-tasks, so that all sub-tasks can possibly be activated in parallel and thus run on several cores at the same time. Each sub-task is then going to run on buffer-size/n frames.

This strategy is used for all tasks in the DAG; the rewritten graph enters the previously described code generator and the complete code is generated.

⁶ Those where the computation of frame n depends on the computation of previous frames.
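To make the rewrite concrete, here is a small hand-written sketch with n = 2, a recursive task A and a non recursive task B fed by A (all names are hypothetical; in the generated code the sub-tasks are nodes of the task graph executed by the scheduler, not plain sequential calls):

// Recursive task A (depends on previous frames): its two sub-tasks are
// chained, A_1 must run before A_2 because they share the state.
void A_sub(const float* in, float* out, int frames, float& state)
{
    for (int i = 0; i < frames; i++) {
        state = in[i] + 0.5f * state;
        out[i] = state;
    }
}

// Non recursive task B: its two sub-tasks are independent and may run
// on two cores at the same time.
void B_sub(const float* in, float* out, int frames)
{
    for (int i = 0; i < frames; i++) {
        out[i] = 2.0f * in[i];
    }
}

// One cycle over a buffer of 'size' frames split into n = 2 slices.
// The sequential calls below only make the data dependencies explicit:
// B on slice 0 needs only A's slice 0 output, so the scheduler can
// overlap it with A working on slice 1.
void pipeline_cycle(const float* in, float* tmp, float* out, int size)
{
    int half = size / 2;
    float state = 0.0f;
    A_sub(in, tmp, half, state);                  // A_1
    A_sub(in + half, tmp + half, half, state);    // A_2 (after A_1)
    B_sub(tmp, out, half);                        // B_1 (after A_1)
    B_sub(tmp + half, out + half, half);          // B_2 (after A_2)
}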

4 Benchmarks

To compare the performances of these various compilation schemes in a realistic situation, we have modified an existing architecture file (Alsa-GTK on Linux and CoreAudio-GTK on OSX) to continuously measure the duration of the compute method (600 measures per run). We give here the results for three real-life applications: Karplus32, a 32-string simulator based on the Karplus-Strong algorithm (Figure 3); Sonik Cube (Figure 4), the audio part of an audio-visual installation; and Mixer (Figure 5), a multi-voices mixer with pan and gain.

Instead of time, the results of the tests are expressed in MB/s of processed samples, because memory bandwidth is a strong limiting factor for today's processors (an audio application can never go faster than the memory bandwidth).
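The reported MB/s figures can be pictured with a measurement of this kind (an illustration only: it assumes 32-bit float samples and counts frames times channels, which is not necessarily the exact accounting used for the published numbers):

#include <chrono>

// Returns the throughput in MB/s for one call to the compute method
// processing 'frames' frames over 'channels' channels of float samples.
double throughput_mbs(int frames, int channels, void (*compute_once)())
{
    using clock = std::chrono::steady_clock;
    clock::time_point start = clock::now();
    compute_once();  // the measured compute method
    std::chrono::duration<double> elapsed = clock::now() - start;
    double bytes = static_cast<double>(frames) * channels * sizeof(float);
    return (bytes / (1024.0 * 1024.0)) / elapsed.count();
}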
[Figure 3: Compared performances of the 4 compilation schemes on karplus32.dsp]

[Figure 4: Compared performances of the different generation modes from 1 to 8 threads]

As we can see, in both cases the parallelization introduces a real gain in performance. The speedup for Karplus32 was x2.1 for OpenMP and x2.08 for the WS scheduler. For Sonik Cube, the speedup with 8 threads was x4.17 for OpenMP and x5.29 for the WS scheduler. This is obviously not always the case: simple applications, with limited demands in terms of computing power, usually tend to perform better in scalar mode, while more demanding applications usually benefit from parallelization.

The efficiency of OpenMP and of the WS scheduler are quite comparable, with an advantage to the WS scheduler for complex applications and more CPUs. Please note that not all implementations of OpenMP are equivalent: unfortunately the GCC 4.4.1 version is still unusable for real-time audio applications, and in this case the WS scheduler is the only choice. The efficiency also depends on the vector size used; vector sizes of 512 or 1024 samples usually give the best results.

The pipelining mode does not show a clear speedup in most cases, but Mixer is an example where this new mode helps.

[Figure 5: Compared performances of the different generation modes from 1 to 4 threads on OSX with a buffer of 4096 frames]

5 Open questions and conclusion

In general, the result can greatly depend on the DSP code to be parallelized, the chosen compiler (icc or g++) and the number of threads activated when running. Simple DSP effects are still faster in scalar or vectorized mode, and parallelization is of no use for them. But with some more complex code like Sonik Cube, the speedup is quite impressive as more threads are added. There are still a lot of open questions that will need more investigation:

- improving dynamic scheduling: right now there is no special strategy to choose the task to wake up when a given task has several outputs to activate; some ideas from static scheduling algorithms could be used in this context;

- the current compilation technique for pipelining duplicates each task's code n times; this simple strategy will probably show its limits for more complex effects with a lot of tasks;

- there is no special strategy to deal with thread affinity, thus performance can degrade if tasks are switched between cores;

- the effect of memory bandwidth limits and cache effects has to be better understood;

- composing several pieces of parallel code without sacrificing performance can be quite tricky: performance can severely degrade as soon as too many threads are competing for the limited number of physical cores. Using abstraction layers like Lithe [6] or OSX libdispatch⁷ could be of interest.

⁷ http://developer.apple.com/

References

[1] Y. Orlarey, D. Fober, and S. Letz. Adding Automatic Parallelization to Faust. Linux Audio Conference, 2009.

[2] Yann Orlarey, Dominique Fober, and Stéphane Letz. Syntactical and semantical aspects of Faust. Soft Computing, 8(9):623-632, 2004.

[3] J. Ffitch, R. Dobson, and R. Bradford. The Imperative for High-Performance Audio Computing. Linux Audio Conference, 2009.

[4] Fons Adriaensen. Design of a Convolution Engine optimised for Reverb. Linux Audio Conference, 2006.

[5] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. In PPoPP '95: Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207-216, New York, NY, USA, 1995. ACM.

[6] Heidi Pan, Benjamin Hindman, and Krste Asanovic. Lithe: Enabling Efficient Composition of Parallel Libraries. USENIX Workshop on Hot Topics in Parallelism (HotPar'09), Berkeley, CA, March 2009.

[7] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5), 1999.

[8] David Wessel et al. Reinventing Audio and Music Computation for Many-Core Processors. International Computer Music Conference, 2008.