The Implementation of the Cilk-5 Multithreaded Language

The Implementation of the Cilk Multithreaded Language Matteo Frigo Charles E Leiserson Keith H Randall MIT Lab oratory for Computer Science Technology Square Cambridge Massachusetts fathenacelrandallglcsmitedu our latest Cilk release still uses a theoretically ecient Abstract scheduler but the language has b een simplied considerably The fth release of the multithreaded language Cilk uses a It employs callreturn semantics for parallelism and features provably go o d workstealing scheduling algorithm similar a linguistically simple inlet mechanism for nondeterminis to the rst system but the language has b een completely re designed and the runtime system completely reengineered tic control Cilk is designed to run eciently on contem The eciency of the new implementation was aided by a p orary symmetric multipro cessors SMPs which feature clear strategy that arose from a theoretical analysis of the hardware supp ort for shared memory We have co ded many scheduling algorithm concentrate on minimizing overheads applications in Cilk including the So crates and Cilkchess that contribute to the work even at the exp ense of overheads chessplaying programs which have won prizes in interna that contribute to the critical path Although it may seem counterintuitive to move overheads onto the critical path tional comp etitions this workrst principle has led to a p ortable Cilk im The philosophy b ehind Cilk development has b een to plementation in which the typical cost of spawning a parallel make the Cilk language a true parallel extension of C b oth thread is only b etween and times the cost of a C function semantically and with resp ect to p erformance On a paral call on a variety of contemp orary machines Many Cilk pro lel computer Cilk control constructs allow the program to grams run on one pro cessor with virtually no degradation compared to equivalent C programs This pap er describ es execute in parallel If the Cilk keywords for parallel control how the workrst principle was exploited in the design of are elided from a Cilk program however a syntactically and Cilk s compiler and its runtime system In particular we semantically correct C program results which we call the C present Cilk s novel twoclone compilation strategy and elision or more generally the serial elision of the Cilk its Dijkstralike mutualexclusion proto col for implementing program Cilk is a faithful extension of C b ecause the C the ready deque in the workstealing scheduler elision of a Cilk program is a correct implementation of the semantics of the program Moreover on one pro cessor a Keywords parallel Cilk program scales down to run nearly as fast as Critical path multithreading parallel computing program its C elision ming language runtime system work Unlike in Cilk where the Cilk scheduler was an identi able piece of co de in Cilk b oth the compiler and runtime Intro duction system b ear the resp onsibility for scheduling To obtain ef Cilk is a multithreaded language for parallel programming ciency we have of course attempted to reduce scheduling that generalizes the semantics of C by intro ducing linguistic overheads Some overheads have a larger impact on execu constructs for parallel control The original Cilk release tion time than others however A theoretical understanding featured a provably ecient randomized work of Cilks scheduling algorithm has allowed us to iden stealing scheduler but the language was clumsy tify and optimize the common cases According to this ab b ecause parallelism was exp osed by hand using explicit stract theory the p erformance of a Cilk computation can b e continuation passing The Cilk language implemented by characterized by two quantities its work which is the to tal time needed to execute the computation serially and its This research was supp orted in part by the Defense Advanced criticalpath length which is its execution time on an in Research Pro jects Agency DARPA under Grant N Computing facilities were provided by the MIT Xolas Pro ject thanks nite numb er of pro cessors Cilk provides instrumentation to a generous equipment donation from Sun Microsystems that allows a user to measure these two quantities Within Cilks scheduler we can identify a given cost as contribut To app ear in Proceedings of the ACM SIGPLAN ing to either work overhead or criticalpath overhead Much Conference on Programming Language Design and Im plementation PLDI Montreal Canada June of the eciency of Cilk derives from the following principle which we shall justify in Section include stdlibh The workrst principle Minimize the schedul include stdioh ing overhead borne by the work of a computation include cilkh Specical ly move overheads out of the work and onto the critical path cilk int fib int n The workrst principle played an imp ortant role during the if n return n design of earlier Cilk systems but Cilk exploits the prin else int x y ciple more extensively x spawn fib n The workrst principle inspired a twoclone strategy y spawn fib n for compiling Cilk programs Our cilkc compiler is a sync return xy typ echecking sourcetosource translator that transforms a Cilk source into a C p ostsource which makes calls to Cilks runtime library The C p ostsource is then run through the cilk int main int argc char argv gcc compiler to pro duce ob ject co de The cilkc compiler pro duces two clones of every Cilk pro cedurea fast clone int n result and a slow clone The fast clone which is identical in n atoiargv result spawn fibn most resp ects to the C elision of the Cilk program executes sync in the common case where serial semantics suce The slow printf Result dn result clone is executed in the infrequent case that parallel seman return tics and its concomitant b o okkeeping are required All com munication due to scheduling o ccurs in the slow clone and contributes to criticalpath overhead but not to work over Figure A simple Cilk program to compute the nth Fib onacci numb er in parallel using a very bad algorithm head The workrst principle also inspired a Dijkstralike sharedmemory mutualexclusion proto col as part of the The basic Cilk language can b e understo o d from an exam runtime loadbalancing scheduler Cilks scheduler uses a ple Figure shows a Cilk program that computes the nth workstealing algorithm in which idle pro cessors called Fib onacci numb er Observe that the program would b e an thieves steal threads from busy pro cessors called vic ordinary C program if the three keywords cilk spawn and tims Cilks scheduler guarantees that the cost of steal sync are elided ing contributes only to criticalpath overhead and not to The keyword cilk identies fib as a Cilk procedure work overhead Nevertheless it is hard to avoid the mutual which is the parallel analog to a C function Parallelism exclusion costs incurred by a p otential victim which con is created when the keyword spawn precedes the invo cation tribute to work To minimize work overhead instead of using of a pro cedure The semantics of a spawn diers from a lo cking Cilks runtime system uses a Dijkstralike proto col C function call only in that the parent can continue to ex which we call the THE proto col to manage the runtime ecute in parallel with the child instead of waiting for the deque of ready threads in the workstealing algorithm An child to complete as is done in C Cilks scheduler takes the added advantage of the THE proto col is that it allows an resp onsibility of scheduling the spawned pro cedures on the exception to b e signaled to a working pro cessor with no ad pro cessors of the parallel computer ditional work overhead a feature used in Cilks ab ort mech A Cilk pro cedure cannot safely use the values returned by anism its children until it executes a sync statement The sync The remainder of this pap er is organized as follows Sec statement is a lo cal barrier not a global one as for ex tion overviews the basic features of the Cilk language Sec ample is used in messagepassing programming In the Fi tion justies the workrst principle Section describ es b onacci example a sync statement is required b efore the how the twoclone strategy is implemented and Section statement return xy to avoid the anomaly that would presents the THE proto col Section gives empirical evi o ccur if x and y are summed b efore they are computed In dence that the Cilk scheduler is ecient Finally Section addition to explicit synchronization provided by the sync presents related work and oers some conclusions statement every Cilk pro cedure syncs implicitly b efore it returns thus ensuring that all of its children terminate b e The Cilk language fore it do es Ordinarily when a spawned pro cedure returns the re This section presents a brief overview of the Cilk extensions turned value is simply stored into a variable in its parents to C as supp orted by Cilk For a complete description frame consult the Cilk manual The key features of the lan 1 This program uses an inecient algorithm which runs in exp onen guage are the sp ecication of parallelism and synchroniza tial time Although logarithmictime metho ds are known p tion through the spawn and sync keywords and the sp eci this program nevertheless provides a go o d didactic example cation of nondeterminism using inlet and abort cilk int fib int n ab out concurrency and nondeterminism without requiring lo cking declaration of critical regions and the like int x inlet void summer int result Cilk provides syntactic sugar to pro duce certain commonly used inlets implicitly For example the statement

The Implementation of the Cilk-5 Multithreaded Language

Other Apis What’S Wrong with Openmp?

8A. Going Parallel

Correct and Efficient Work-Stealing for Weak Memory Models

Operating Systems and Applications for Embedded Systems >>> Toolchains

Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++

Map, Reduce and Mapreduce, the Skeleton Way$

Parallelism in Cilk Plus

Author Guidelines for 8

A CPU/GPU Task-Parallel Runtime with Explicit Epoch Synchronization

From Sequential to Parallel Programming with Patterns

Opencl: a Hands-On Introduction

An Efficient Work-Stealing Scheduler for Task Dependency Graph