KAAPI: a Thread Scheduling Runtime System for Data Flow Computations
KAAPI: A thread scheduling runtime system for data flow computations on cluster of multi-processors

Thierry Gautier, Xavier Besseron, Laurent Pigeon
INRIA, projet MOAIS
LIG, Batiment ENSIMAG, 51 avenue Jean-Kuntzmann, F-38330 Montbonnot St Martin, France
[email protected], [email protected], [email protected]

ABSTRACT
The high availability of multiprocessor clusters for computer science seems very attractive to the engineer because, at a first level, such computers aggregate high performance. Nevertheless, obtaining peak performance on irregular applications such as computer algebra problems remains a challenging problem. The delay to access memory is non-uniform, and the irregularity of computations requires the use of scheduling algorithms to automatically balance the workload among the processors.

This paper focuses on the runtime support implementation needed to exploit the computation resources of a multiprocessor cluster with great efficiency. The originality of our approach relies on the implementation of an efficient work-stealing algorithm for macro data flow computations, based on a minor extension of the POSIX thread interface.

Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Concurrent Programming—distributed programming, parallel programming

General Terms
Performance, Algorithms, Experimentation

Keywords
Work-stealing, dataflow, multi-core, multi-processor, cluster

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PASCO'07, July 27–28, 2007, London, Ontario, Canada. Copyright 2007 ACM 978-1-59593-741-4/07/0007 ...$5.00.

1. INTRODUCTION
Multithreaded languages have been proposed as a general approach to model dynamic, unstructured parallelism. They include data parallel languages – e.g. NESL [4] –, data flow languages – ID [9] –, macro dataflow languages – Athapascan [15] –, languages with fork-join based constructs – Cilk [7] – and languages with additional synchronization primitives – Jade [34], EARTH [19].

Efficient execution of a multithreaded computation on a parallel computer relies on the scheduling of the threads among the processors. In work-stealing scheduling [7], each processor gets its own stack, implemented as a deque of frames. When a thread is created, a new frame is initialized and pushed at the bottom of the deque. When a processor becomes idle, it tries to steal work from a victim processor with a non-empty deque: it takes the topmost frame (the least recently pushed frame) and places it in its own deque, which is linked to the victim's deque. Such scheduling has been proven efficient for fully-strict multithreaded computations [7, 12] while requiring bounded memory space with respect to a depth-first sequential execution [8, 29].

In Cilk [7], the implementation is highly optimized and follows the work-first principle, which aims at moving most of the scheduling cost from the common case of execution to the rare work-stealing operations. A lock-free algorithm manages access to the deque, allowing good efficiency to be reached even for fine-grain applications. Cilk is one of the best known systems that allow developers to focus on the algorithm rather than on the hardware. Nevertheless, Cilk targets shared memory machines. Although a past version of Cilk, called Cilk-NOW [6], was designed for networks of workstations, the current highly optimized Cilk cannot run on a multiprocessor cluster. The underlying DAG-consistency memory model remains challenging to implement for large-scale clusters.

Data flow based languages do not suffer from penalties due to memory coherence protocols, mainly because data movements between instructions are explicit in the model. Since the 70s, data flow models have moved to multi-threaded architectures in which the key points of the original model have been conserved in order to hide the latency of (remote) memory accesses. The EARTH [19, 28] Threaded-C language exposes two levels of threads (standard threads and fiber threads) that allow fine control over the effective parallelism. Fibers are executed using a data flow approach: a synchronization unit detects and schedules the fibers that have all their inputs produced. In [37], the authors present a thread partitioning and scheduling algorithm to statically gather instructions into threads.

The paper focuses on the implementation of KAAPI [25], a runtime support for scheduling irregular macro data flow programs on a cluster of multi-processors using a work-stealing algorithm [10, 17]. Section 2 presents the high level and low level programming interfaces. We show how the internal data flow representation is built at runtime with low overhead. Moreover, we explain our lazy evaluation of the readiness properties of the data flow graph, following the work-first principle of Cilk.
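The per-processor deque discipline described in the introduction can be sketched as follows. This is a minimal illustration, not KAAPI's or Cilk's implementation: the owner pushes and pops frames at the bottom (depth-first order), while an idle thief steals the topmost, least recently pushed frame. A real runtime such as Cilk uses a lock-free protocol; a single mutex is used here only to keep the sketch short.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>

// Hypothetical frame type: a frame describes a suspended task.
struct Frame { int task_id; };

class WorkDeque {
public:
    void push_bottom(Frame f) {          // owner: new frames go at the bottom
        std::lock_guard<std::mutex> lk(m_);
        d_.push_back(f);
    }
    std::optional<Frame> pop_bottom() {  // owner: depth-first execution
        std::lock_guard<std::mutex> lk(m_);
        if (d_.empty()) return std::nullopt;
        Frame f = d_.back();
        d_.pop_back();
        return f;
    }
    std::optional<Frame> steal_top() {   // thief: least recently pushed frame
        std::lock_guard<std::mutex> lk(m_);
        if (d_.empty()) return std::nullopt;
        Frame f = d_.front();
        d_.pop_front();
        return f;
    }
private:
    std::deque<Frame> d_;
    std::mutex m_;
};
```

The point of stealing from the opposite end is that thieves grab the oldest (usually largest) piece of work and rarely contend with the owner, which is what makes the common, steal-free case cheap.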
This lazy evaluation is similar to hybrid data flow architectures, which allow tasks to be grouped together for sequential execution in order to reduce the scheduling overhead. Nevertheless, KAAPI permits groups of tasks to be reconsidered at runtime in order to extract more parallelism. Section 3 deals with the runtime system and the scheduling of KAAPI threads on the processors. The work-stealing algorithm relies on an extension of the POSIX thread interface that allows work stealing. Next, we present the execution of a data flow graph as an application of the general work-stealing algorithm when threads are structured as lists of tasks. This section ends with the presentation of the thread partitioning used to distribute work among the processors prior to the execution. The interaction with the communication sub-system that sends messages over the network concludes this section. Section 4 reports experiments on multi-core / multi-processor architectures, clusters and grids with up to 1400 processors.

#include <athapascan-1>

void sum( shared r int res1, shared r int res2, shared w int res )
{
    res = res1 + res2;
}

void fibonacci( int n, shared w int res )
{
    if (n < 2)
        res = n;
    else {
        shared int res1;
        shared int res2;
        fork fibonacci( n-1, res1 );
        fork fibonacci( n-2, res2 );
        fork sum( res1, res2, res );
    }
}

Figure 1: Fibonacci algorithm with the Athapascan interface. shared w (resp. shared r) means that the parameter is written (resp. read).

2. THE KAAPI PROGRAMMING MODEL
This section describes the high level instructions used to express parallel execution as a dynamic data flow graph. Our instructions extend the ones proposed in Cilk [14] in order to take data dependencies into account. Following Jade [34], a parallel computation is modelled from three concepts: tasks, shared objects and access specifications. However, while Jade is restricted to iterative computations, nested recursive parallelism is considered here in order to benefit from the performance of work-stealing scheduling [8, 15, 17].

2.1 High level interface
The KAAPI interface is based on the Athapascan interface described in [15, 16]. The programming model is based on a global address space called the global memory and allows data dependencies to be described between tasks that access objects in the global memory. The language extends C++ with two keywords. The shared keyword is a type qualifier to declare objects in the global memory. The fork keyword creates a new task that may be executed concurrently with other tasks. Figure 1 sketches the algorithm for computing the n-th Fibonacci number in a recursive way.

A task is a function call: a function that should return no value except through the global memory and its effective parameters. The function signature should specify the mode of access (read, write or access) of the formal parameters in the global memory. For instance, the function sum of figure 1 is specified to have read accesses (res1 and res2) and to have a write access (res). The effective parameters in the global memory are passed by reference and the other parameters are passed by value.

2.2 Low level interface and internal data flow graph representation
The KAAPI low level interface allows the macro data flow graph between tasks to be built during the execution of an Athapascan program. This interface defines several objects. A closure object represents a function call with a state. An access object represents the binding of a formal parameter in the global memory. The state of a closure can be created, running, stolen or terminated, and represents the evolution of the closure during the execution. The state of an access object is one of the following: created, ready, destroyed. An access is created when a task is forked.

Once it is created, a closure should be bound to each effective parameter of the call. Binding a parameter by value makes a copy. Binding a shared object in the global memory links together the last access to the object and the new access. The link between accesses is called a shared link and represents the sequence of accesses on a same shared variable. Figure 2 illustrates the internal data flow graph after one execution of the task body fibonacci of figure 1.

Let us note that closure and access objects are pushed in a stack at creation in order to improve allocation performance (O(1) allocation time). Moreover, a stack is associated to each control flow, so that there is no contention on the stacks. When a closure is being executed, an activation frame is first pushed in the stack in order to restore the activation of the caller's stack environment after the execution. There is no a priori limit on the stack size: it is managed as a linked list of memory blocks of fixed size, except for some requests. The system guarantees that an allocation request in a stack
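The shared-link construction described above can be sketched as follows. All names here (Graph, fork, Access) are illustrative, not KAAPI's actual low-level API: forking a task creates one access per shared effective parameter and chains it after the previous access on the same shared variable, materializing the sequence of accesses the paper calls a shared link.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

enum class ClosureState { Created, Running, Stolen, Terminated };
enum class AccessMode   { Read, Write };
enum class AccessState  { Created, Ready, Destroyed };

struct Access {
    AccessMode  mode;
    AccessState state = AccessState::Created;
    Access*     next_on_shared = nullptr;  // the "shared link"
};

struct Closure {
    std::string          name;
    ClosureState         state = ClosureState::Created;
    std::vector<Access*> params;           // one access per shared parameter
};

class Graph {
public:
    Closure* fork(std::string name,
                  std::vector<std::pair<std::string, AccessMode>> shared_params) {
        closures_.push_back(Closure{std::move(name)});
        Closure* c = &closures_.back();
        for (auto& p : shared_params) {
            accesses_.push_back(Access{p.second});
            Access* a = &accesses_.back();
            Access*& last = last_access_[p.first];
            if (last) last->next_on_shared = a;  // extend the chain of accesses
            last = a;                            // remember the last access
            c->params.push_back(a);
        }
        return c;
    }
private:
    // std::deque gives stable addresses and cheap growth, loosely mimicking
    // the per-thread stack allocator described above (not KAAPI's allocator).
    std::deque<Closure> closures_;
    std::deque<Access>  accesses_;
    std::unordered_map<std::string, Access*> last_access_;
};
```

Replaying the forks of the fibonacci body with this sketch links the write access of each child task to the corresponding read access of sum, which is exactly the dependency information the scheduler needs to decide readiness.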