
The Implementation of the Cilk-5 Multithreaded Language

Matteo Frigo    Charles E. Leiserson    Keith H. Randall

MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, Massachusetts 02139
{athena,cel,randall}@lcs.mit.edu

Abstract

The fifth release of the multithreaded language Cilk uses a provably good "work-stealing" scheduling algorithm similar to the first system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this "work-first" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design of Cilk-5's compiler and its runtime system. In particular, we present Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.

Keywords

Critical path, multithreading, parallel computing, programming language, runtime system, work.

This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant N00014-94-1-0985. Computing facilities were provided by the MIT Xolas Project, thanks to a generous equipment donation from Sun Microsystems.

SIGPLAN '98, Montreal, Canada. © 1998 ACM 0-89791-987-4/98/0006...$5.00

1 Introduction

Cilk is a multithreaded language for parallel programming that generalizes the semantics of C by introducing linguistic constructs for parallel control. The original Cilk-1 release [3, 4, 18] featured a provably efficient, randomized, "work-stealing" scheduler [3, 5], but the language was clumsy, because parallelism was exposed "by hand" using explicit continuation passing. The Cilk language implemented by our latest Cilk-5 release [8] still uses a theoretically efficient scheduler, but the language has been simplified considerably. It employs call/return semantics for parallelism and features a linguistically simple "inlet" mechanism for nondeterministic control. Cilk-5 is designed to run efficiently on contemporary symmetric multiprocessors (SMP's), which feature hardware support for shared memory. We have coded many applications in Cilk, including the *Socrates and Cilkchess chess-playing programs, which have won prizes in international competitions.

The philosophy behind Cilk development has been to make the Cilk language a true parallel extension of C, both semantically and with respect to performance. On a parallel computer, Cilk control constructs allow the program to execute in parallel. If the Cilk keywords for parallel control are elided from a Cilk program, however, a syntactically and semantically correct C program results, which we call the C elision (or more generally, the serial elision) of the Cilk program. Cilk is a faithful extension of C, because the C elision of a Cilk program is a correct implementation of the semantics of the program. Moreover, on one processor, a parallel Cilk program "scales down" to run nearly as fast as its C elision.

Unlike in Cilk-1, where the Cilk scheduler was an identifiable piece of code, in Cilk-5 both the compiler and the runtime system bear the responsibility for scheduling. To obtain efficiency, we have, of course, attempted to reduce scheduling overheads. Some overheads have a larger impact on execution time than others, however. A theoretical understanding of Cilk's scheduling algorithm [3, 5] has allowed us to identify and optimize the common cases. According to this abstract theory, the performance of a Cilk computation can be characterized by two quantities: its work, which is the total time needed to execute the computation serially, and its critical-path length, which is its execution time on an infinite number of processors. (Cilk provides instrumentation that allows a user to measure these two quantities.) Within Cilk's scheduler, we can identify a given cost as contributing to either work overhead or critical-path overhead. Much of the efficiency of Cilk derives from the following principle, which we shall justify in Section 3.

The work-first principle: Minimize the scheduling overhead borne by the work of a computation. Specifically, move overheads out of the work and onto the critical path.

The work-first principle played an important role during the design of earlier Cilk systems, but Cilk-5 exploits the principle more extensively.

The work-first principle inspired a "two-clone" strategy for compiling Cilk programs. Our cilk2c compiler [23] is a type-checking, source-to-source translator that transforms a Cilk source into a C postsource which makes calls to Cilk's runtime library. The C postsource is then run through the gcc compiler to produce object code. The cilk2c compiler produces two clones of every Cilk procedure: a "fast" clone and a "slow" clone. The fast clone, which is identical in most respects to the C elision of the Cilk program, executes in the common case where serial semantics suffice. The slow clone is executed in the infrequent case that parallel semantics and its concomitant bookkeeping are required. All communication due to scheduling occurs in the slow clone and contributes to critical-path overhead, but not to work overhead.

The work-first principle also inspired a Dijkstra-like [11], shared-memory, mutual-exclusion protocol as part of the runtime load-balancing scheduler. Cilk's scheduler uses a "work-stealing" algorithm in which idle processors, called thieves, "steal" threads from busy processors, called victims. Cilk's scheduler guarantees that the cost of stealing contributes only to critical-path overhead, and not to work overhead. Nevertheless, it is hard to avoid the mutual-exclusion costs incurred by a potential victim, which contribute to work. To minimize work overhead, instead of using locking, Cilk's runtime system uses a Dijkstra-like protocol, which we call the THE protocol, to manage the runtime deque of ready threads in the work-stealing algorithm. An added advantage of the THE protocol is that it allows an exception to be signaled to a working processor with no additional work overhead, a feature used in Cilk's abort mechanism.

The remainder of this paper is organized as follows. Section 2 overviews the basic features of the Cilk language. Section 3 justifies the work-first principle. Section 4 describes how the two-clone strategy is implemented, and Section 5 presents the THE protocol. Section 6 gives empirical evidence that the Cilk-5 scheduler is efficient. Finally, Section 7 presents related work and offers some conclusions.

2 The Cilk language

This section presents a brief overview of the Cilk extensions to C as supported by Cilk-5. (For a complete description, consult the Cilk-5 manual [8].) The key features of the language are the specification of parallelism and synchronization, through the spawn and sync keywords, and the specification of nondeterminism, using inlet and abort.

The basic Cilk language can be understood from an example. Figure 1 shows a Cilk program that computes the nth Fibonacci number. (Footnote 1: This program uses an inefficient algorithm which runs in exponential time. Although logarithmic-time methods are known [9, p. 850], this program nevertheless provides a good didactic example.) Observe that the program would be an ordinary C program if the three keywords cilk, spawn, and sync are elided.

    #include <stdlib.h>
    #include <stdio.h>
    #include <cilk.h>

    cilk int fib (int n)
    {
        if (n<2) return n;
        else {
            int x, y;
            x = spawn fib (n-1);
            y = spawn fib (n-2);
            sync;
            return (x+y);
        }
    }

    cilk int main (int argc, char *argv[])
    {
        int n, result;
        n = atoi(argv[1]);
        result = spawn fib(n);
        sync;
        printf ("Result: %d\n", result);
        return 0;
    }

Figure 1: A simple Cilk program to compute the nth Fibonacci number in parallel (using a very bad algorithm).

The keyword cilk identifies fib as a Cilk procedure, which is the parallel analog to a C function. Parallelism is created when the keyword spawn precedes the invocation of a procedure. The semantics of a spawn differs from a C function call only in that the parent can continue to execute in parallel with the child, instead of waiting for the child to complete as is done in C. Cilk's scheduler takes the responsibility of scheduling the spawned procedures on the processors of the parallel computer.

A Cilk procedure cannot safely use the values returned by its children until it executes a sync statement. The sync statement is a local "barrier," not a global one as, for example, is used in message-passing programming. In the Fibonacci example, a sync statement is required before the statement return (x+y) to avoid the anomaly that would occur if x and y are summed before they are computed. In addition to explicit synchronization provided by the sync statement, every Cilk procedure syncs implicitly before it returns, thus ensuring that all of its children terminate before it does.

Ordinarily, when a spawned procedure returns, the returned value is simply stored into a variable in its parent's frame:

    x = spawn foo (y);
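As the text notes, eliding the three keywords from Figure 1 yields an ordinary C program. As a sketch, the C elision of fib looks as follows (the figure's main elides the same way, so it is omitted here):

```c
/* The C elision of the fib procedure from Figure 1: the keywords
   cilk, spawn, and sync have been deleted, leaving ordinary C. */
int fib (int n)
{
    if (n<2) return n;
    else {
        int x, y;
        x = fib (n-1);   /* was: x = spawn fib (n-1); */
        y = fib (n-2);   /* was: y = spawn fib (n-2); */
                         /* was: sync;                */
        return (x+y);
    }
}
```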

Occasionally, one would like to incorporate the returned value into the parent's frame in a more complex way. Cilk provides an inlet feature for this purpose, which was inspired in part by the inlet feature of TAM [10].

An inlet is essentially a C function internal to a Cilk procedure. In the normal syntax of Cilk, the spawning of a procedure must occur as a separate statement and not in an expression. An exception is made to this rule if the spawn is performed as an argument to an inlet call. In this case, the procedure is spawned, and when it returns, the inlet is invoked. In the meantime, control of the parent procedure proceeds to the statement following the inlet call. In principle, inlets can take multiple spawned arguments, but Cilk-5 has the restriction that exactly one argument to an inlet may be spawned and that this argument must be the first argument. If necessary, this restriction is easy to program around.

    cilk int fib (int n)
    {
        int x = 0;
        inlet void summer (int result)
        {
            x += result;
            return;
        }

        if (n<2) return n;
        else {
            summer(spawn fib (n-1));
            summer(spawn fib (n-2));
            sync;
            return (x);
        }
    }

Figure 2: Using an inlet to compute the nth Fibonacci number.

Figure 2 illustrates how the fib() function might be coded using inlets. The inlet summer() is defined to take a returned value result and add it to the variable x in the frame of the procedure that does the spawning. All the variables of fib() are available within summer(), since it is an internal function of fib(). (Footnote 2: The C elision of a Cilk program with inlets is not ANSI C, because ANSI C does not support internal C functions. Cilk is based on Gnu C technology, however, which does provide this support.) No lock is required around the accesses to x by summer, because Cilk provides atomicity implicitly. The concern is that the two updates might occur in parallel, and if atomicity is not imposed, an update might be lost. Cilk provides implicit atomicity among the "threads" of a procedure instance, where a thread is a maximal sequence of instructions ending with a spawn, sync, or return (either explicit or implicit) statement. An inlet is precluded from containing spawn and sync statements, and thus it operates atomically as a single thread. Implicit atomicity simplifies reasoning about concurrency and nondeterminism without requiring locking, declaration of critical regions, and the like.

Cilk provides syntactic sugar to produce certain commonly used inlets implicitly. For example, the statement x += spawn fib(n-1) conceptually generates an inlet similar to the one in Figure 2.

Sometimes, a procedure spawns off parallel work which it later discovers is unnecessary. This "speculative" work can be aborted in Cilk using the abort primitive inside an inlet. A common use of abort occurs during a parallel search, where many possibilities are searched in parallel. As soon as a solution is found by one of the searches, one wishes to abort any currently executing searches as soon as possible so as not to waste processor resources. The abort statement, when executed inside an inlet, causes all of the already-spawned children of the procedure to terminate.

We considered using "futures" [19] with implicit synchronization, as well as synchronizing on specific variables, instead of using the simple spawn and sync statements. We realized from the work-first principle, however, that different synchronization mechanisms could have an impact only on the critical path of a computation, and so this issue was of secondary concern. Consequently, we opted for implementation simplicity. Also, in systems that support relaxed memory-consistency models, the explicit sync statement can be used to ensure that all side-effects from previously spawned subprocedures have occurred.

In addition to the control synchronization provided by sync, Cilk programmers can use explicit locking to synchronize accesses to data, providing mutual exclusion and atomicity. Data synchronization is an overhead borne on the work, however, and although we have striven to minimize these overheads, fine-grain locking on contemporary processors is expensive. We are currently investigating how to incorporate atomicity into the Cilk language so that protocol issues involved in locking can be avoided at the user level. To aid in the debugging of Cilk programs that use locks, we have been developing a tool called the "Nondeterminator" [7, 13], which detects common synchronization bugs called data races.
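A serial sketch of the inlet idea in plain C may help. This version is hypothetical: since ANSI C lacks internal functions (the point of footnote 2), it passes the accumulator to summer explicitly; in actual Cilk, summer would be an inlet and the two calls would be spawns.

```c
/* Hypothetical serial analogue of Figure 2.  The accumulator x is
   passed by pointer instead of being captured from fib's frame. */
static void summer (int *x, int result)
{
    *x += result;              /* in Cilk, implicit atomicity protects x */
}

static int fib (int n)
{
    int x = 0;
    if (n<2) return n;
    summer(&x, fib(n-1));      /* in Cilk: summer(spawn fib (n-1)); */
    summer(&x, fib(n-2));      /* in Cilk: summer(spawn fib (n-2)); */
    /* in Cilk, sync would wait here for both children before reading x */
    return x;
}
```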
3 The work-first principle

This section justifies the work-first principle stated in Section 1 by showing that it follows from three assumptions. First, we assume that Cilk's scheduler operates in practice according to the theoretical analysis presented in [3, 5]. Second, we assume that in the common case, ample "parallel slackness" [28] exists, that is, the average parallelism of a Cilk program exceeds the number of processors on which we run it by a sufficient margin. Third, we assume (as is indeed the case) that every Cilk program has a C elision against which its one-processor performance can be measured.

The theoretical analysis presented in [3, 5] cites two fundamental lower bounds as to how fast a Cilk program can run. Let us denote by T_P the execution time of a given computation on P processors. Then, the work of the computation is T_1 and its critical-path length is T_∞. For a computation with T_1 work, the lower bound T_P ≥ T_1/P must hold, because at most P units of work can be executed in a single step. In addition, the lower bound T_P ≥ T_∞ must hold, since a finite number of processors cannot execute faster than an infinite number.

Cilk's randomized work-stealing scheduler [3, 5] executes a Cilk computation on P processors in expected time

    T_P = T_1/P + O(T_∞),                                   (1)

assuming an ideal parallel computer. This equation resembles "Brent's theorem" [6, 15] and is optimal to within a constant factor, since T_1/P and T_∞ are both lower bounds. We call the first term on the right-hand side of Equation (1) the work term and the second term the critical-path term. Importantly, all communication costs due to Cilk's scheduler are borne by the critical-path term, as are most of the other scheduling costs. To make these overheads explicit, we define the critical-path overhead to be the smallest constant c_∞ such that

    T_P ≤ T_1/P + c_∞ T_∞ .                                 (2)

[...] considerable parallel slackness. The parallelism of these applications increases with problem size, thereby ensuring that they will run well on large machines.

The third assumption behind the work-first principle is that every Cilk program has a C elision against which its one-processor performance can be measured. Let us denote by T_S the running time of the C elision. Then, we define the work overhead by c_1 = T_1/T_S. Incorporating critical-path and work overheads into Inequality (2) yields

    T_P ≤ c_1 T_S/P + c_∞ T_∞                               (3)
        ≈ c_1 T_S/P ,

since we assume parallel slackness.

We can now restate the work-first principle precisely. Minimize c_1, even at the expense of a larger c_∞, because c_1 has a more direct impact on performance. Adopting the work-first principle may adversely affect the ability of an application to scale up, however, if the critical-path overhead c_∞ is too large. But, as we shall see in Section 6, critical-path overhead is reasonably small in Cilk-5, and many applications can be coded with large amounts of parallelism.

The work-first principle pervades the Cilk-5 implementation. The work-stealing scheduler guarantees that with high
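To make Inequality (3) concrete, the bound can be evaluated for a hypothetical workload. All numbers below are invented for illustration, not measurements from the paper.

```c
/* A numeric instance of Inequality (3): T_P <= c1*T_S/P + c_inf*T_inf.
   All parameter values used in the comment below are hypothetical. */
static double tp_bound(double c1, double Ts, double P,
                       double c_inf, double T_inf)
{
    double work_term = c1 * Ts / P;    /* overhead-inflated work, split P ways */
    double cp_term   = c_inf * T_inf;  /* critical-path term */
    return work_term + cp_term;
}
/* For example, with c1 = 1.5, T_S = 100 s, P = 8, c_inf = 8, and
   T_inf = 0.1 s, the bound is 18.75 + 0.8 = 19.55 s.  With ample
   slackness the critical-path term is a small correction, so T_P is
   governed by the work term c1*T_S/P, which is why minimizing c1
   matters most. */
```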

... as C treats its stack, pushing and popping spawned activation frames. When a worker runs out of work, it becomes a thief and attempts to steal a procedure from another worker, called its victim. The thief steals the procedure from the head of the victim's deque, the opposite end from which the victim is working.

When a procedure is spawned, the fast clone runs. Whenever a thief steals a procedure, however, the procedure is converted to a slow clone. The Cilk scheduler guarantees that the number of steals is small when sufficient slackness exists, and so we expect the fast clones to be executed most of the time. Thus, the work-first principle reduces to minimizing costs in the fast clone, which contribute more heavily to work overhead. Minimizing costs in the slow clone, although a desirable goal, is less important, since these costs contribute less heavily to work overhead and more to critical-path overhead.

We minimize the costs of the fast clone by exploiting the structure of the Cilk scheduler. Because we convert a procedure to its slow clone when it is stolen, we maintain the invariant that a fast clone has never been stolen. Furthermore, none of the descendants of a fast clone have been stolen either, since the strategy of stealing from the heads of ready deques guarantees that parents are stolen before their children. As we shall see, this simple fact allows many optimizations to be performed in the fast clone.

We now describe how our cilk2c compiler generates postsource C code for the fib procedure from Figure 1. An example of the postsource for the fast clone of fib is given in Figure 3.

     1  int fib (int n)
     2  {
     3      fib_frame *f;               /* frame pointer */
     4      f = alloc(sizeof(*f));      /* allocate frame */
     5      f->sig = fib_sig;           /* initialize frame */
     6      if (n<2) {
     7          free(f, sizeof(*f));    /* free frame */
     8          return n;
     9      }
    10      else {
    11          int x, y;
    12          f->entry = 1;           /* save PC */
    13          f->n = n;               /* save live vars */
    14          *T = f;                 /* store frame pointer */
    15          push();                 /* push frame */
    16          x = fib (n-1);          /* do C call */
    17          if (pop(x) == FAILURE)  /* pop frame */
    18              return 0;           /* frame stolen */
    19          ...                     /* second spawn */
    20          ;                       /* sync is free! */
    21          free(f, sizeof(*f));    /* free frame */
    22          return (x+y);
    23      }
    24  }

Figure 3: The fast clone generated by cilk2c for the fib procedure from Figure 1. The code for the second spawn is omitted. The functions alloc and free are inlined calls to the runtime system's fast memory allocator. The signature fib_sig contains a description of the fib procedure, including a pointer to the slow clone. The push and pop calls are operations on the scheduling deque and are described in detail in Section 5.

The generated C code has the same general structure as the C elision, with a few additional statements. In lines 4-5, an activation frame is allocated for fib and initialized. The Cilk runtime system uses activation frames to represent procedure instances. Using techniques similar to [16, 17], our inlined allocator typically takes only a few cycles. The frame is initialized in line 5 by storing a pointer to a static structure, called a signature, describing fib.

The first spawn in fib is translated into lines 12-18. In lines 12-13, the state of the fib procedure is saved into the activation frame. The saved state includes the program counter, encoded as an entry number, and all live, dirty variables. Then, the frame is pushed on the runtime deque in lines 14-15. (Footnote 4: If the shared memory is not sequentially consistent, a memory fence must be inserted between lines 14 and 15 to ensure that the surrounding writes are executed in the proper order.) Next, we call the fib routine as we would in C. Because the spawn statement itself compiles directly to its C elision, the postsource can exploit the optimization capabilities of the C compiler, including its ability to pass arguments and receive return values in registers rather than in memory.

After fib returns, lines 17-18 check to see whether the parent procedure has been stolen. If it has, we return immediately with a dummy value. Since all of the ancestors have been stolen as well, the C stack quickly unwinds and control is returned to the runtime system. (Footnote 5: The setjmp/longjmp facility of C could have been used as well, but our unwinding strategy is simpler.) The protocol to check whether the parent procedure has been stolen is quite subtle; we postpone discussion of its implementation to Section 5. If the parent procedure has not been stolen, it continues to execute at line 19, performing the second spawn, which is not shown.

In the fast clone, all sync statements compile to no-ops. Because a fast clone never has any children when it is executing, we know at compile time that all previously spawned procedures have completed. Thus, no operations are required for a sync statement, as it always succeeds. For example, line 20 in Figure 3, the translation of the sync statement, is just the empty statement. Finally, in lines 21-22, fib deallocates the activation frame and returns the computed result to its parent procedure.

The slow clone is similar to the fast clone except that it provides support for parallel execution. When a procedure is stolen, control has been suspended between two of the procedure's threads, that is, at a spawn or sync point. When the slow clone is resumed, it uses a goto statement to restore the program counter, and then it restores the local variable state from the activation frame. A spawn statement is translated in the slow clone just as in the fast clone. For a sync statement, cilk2c inserts a call to the runtime system, which checks to see whether the procedure has any spawned children that have not returned. Although the parallel bookkeeping in a slow clone is substantial, it contributes little to work overhead, since slow clones are rarely executed.

The separation between fast clones and slow clones also allows us to compile inlets and abort statements efficiently in the fast clone. An inlet call compiles as efficiently as an ordinary spawn. For example, the code for the inlet call from Figure 2 compiles similarly to the following Cilk code:

    tmp = spawn fib(n-1);
    summer(tmp);

Implicit inlet calls, such as x += spawn fib(n-1), compile directly to their C elisions. An abort statement compiles to a no-op just as a sync statement does, because while it is executing, a fast clone has no children to abort.

The runtime system provides the glue between the fast and slow clones that makes the whole system work. It includes protocols for stealing procedures, returning values between processors, executing inlets, aborting computation subtrees, and the like. All of the costs of these protocols can be amortized against the critical path, so their overhead does not significantly affect the running time when sufficient parallel slackness exists. The portion of the stealing protocol executed by the worker contributes to work overhead, however, thereby warranting a careful implementation. We discuss this protocol in detail in Section 5.

The work overhead of a spawn in Cilk-5 is only a few reads and writes in the fast clone: 3 reads and 5 writes for the fib example. We will experimentally quantify the work overhead in Section 6. Some work overheads still remain in our implementation, however, including the allocation and freeing of activation frames, saving state before a spawn, pushing and popping of the frame on the deque, and checking if a procedure has been stolen. A portion of this work overhead is due to the fact that Cilk-5 is duplicating the work the C compiler performs, but as Section 6 shows, this overhead is small. Although a production Cilk compiler might be able to eliminate this unnecessary work, it would likely compromise portability.

In Cilk-4, the precursor to Cilk-5, we took the work-first principle to the extreme. Cilk-4 performed stack-based allocation of activation frames, since the work overhead of stack allocation is smaller than the overhead of heap allocation. Because of the "cactus stack" [25] semantics of the Cilk stack, however, Cilk-4 had to manage the virtual-memory map on each processor explicitly, as was done in [27]. (Footnote 6: Suppose a procedure A spawns two children B and C. The two children can reference objects in A's activation frame, but B and C do not see each other's frames.) The work overhead in Cilk-4 for frame allocation was little more than that of incrementing the stack pointer, but whenever the stack pointer overflowed a page, an expensive user-level interrupt ensued, during which Cilk-4 would modify the memory map. Unfortunately, the operating-system mechanisms supporting these operations were too slow and unpredictable, and the possibility of a page fault in critical sections led to complicated protocols. Even though these overheads could be charged to the critical-path term, in practice, they became so large that the critical-path term contributed significantly to the running time, thereby violating the assumption of parallel slackness. A one-processor execution of a program was indeed fast, but insufficient slackness sometimes resulted in poor parallel performance.

In Cilk-5, we simplified the allocation of activation frames by simply using a heap. In the common case, a frame is allocated by removing it from a free list. Deallocation is performed by inserting the frame into the free list. No user-level management of virtual memory is required, except for the initial setup of shared memory. Heap allocation contributes only slightly more than stack allocation to the work overhead, but it saves substantially on the critical-path term. On the downside, heap allocation can potentially waste more memory than stack allocation due to fragmentation. For a careful analysis of the relative merits of stack and heap based allocation that supports heap allocation, see the paper by Appel and Shao [1]. For an equally careful analysis that supports stack allocation, see [22].

Thus, although the work-first principle gives a general understanding of where overheads should be borne, our experience with Cilk-4 showed that large enough critical-path overheads can tip the scales to the point where the assumptions underlying the principle no longer hold. We believe that Cilk-5's work overhead is nearly as low as possible, given our goal of generating portable C output from our compiler. (Footnote 7: Although the runtime system requires some effort to port between architectures, the compiler requires no changes whatsoever for different platforms.) Other researchers have been able to reduce overheads even more, however, at the expense of portability. For example, lazy threads [14] obtains efficiency at the expense of implementing its own calling conventions, stack layouts, etc. Although we could in principle incorporate such machine-dependent techniques into our compiler, we feel that Cilk-5 strikes a good balance between performance and portability. We also feel that the current overheads are sufficiently low that other problems, notably minimizing overheads for data synchronization, deserve more attention.

5 Implementation of work-stealing

In this section, we describe Cilk-5's work-stealing mechanism, which is based on a Dijkstra-like [11], shared-memory, mutual-exclusion protocol called the "THE" protocol. In accordance with the work-first principle, this protocol has been designed to minimize work overhead. For example, on a 167-megahertz UltraSPARC I, the fib program with the THE protocol runs about 25% faster than with hardware locking primitives. We first present a simplified version of the protocol. Then, we discuss the actual implementation, which allows exceptions to be signaled with no additional overhead.

217 Several straightforward mechanisms might be considered Worker/Victim Thief 1 steal0 C to implement a work-stealing protocol. For example, a thief 1 push0 I 2 T++; 2 lock(L) ; might interrupt a worker and demand attention from this 3 1 3 li++; victim. This strategy presents problems for two reasons. 4 if (H > T) { H-m. First, the mechanisms for signaling interrupts are slow, and 4 pop0 I 5 5 T--; 6 u&k(L) ; although an interrupt would be borne on the critical path, 6 if (ii > T) I 7 return FAILURE; its large cost could threaten the assumption of parallel slack- 7 T++ ; 8 > ness. Second, the worker would necessarily incur some over- 8 lock(L) ; 9 unlock(L) ; 9 T--; 10 return SUCCESS; head on the work term to ensure that it could be safely 10 if (ii > T) i 11 1 interrupted in a . As an alternative to send- 11 T++; ing interrupts, thieves could post steal requests, and workers 12 unlock(L) ; 13 return FAILURE; could periodically poll for them. Once again, however, a cost 14 > accrues to the work overhead, this time for polling. Tech- 15 unlock(L) ; niques are known that can limit the overhead of polling [12], 16 It 17 return SUCCESS; but they require the support of a sophisticated compiler. 18 1 The work-first principle suggests that it is reasonable to put substantial effort into minimizing work overhead in the Figure 4: Pseudocode of a simplified version of the THE protocol. The left part of the figure shows the actions performed by the work-stealing protocol. Since Cilk-5 is designed for shared- victim, and the right part shows the actions of the thief. None memory machines, we chose to implement work-stealing of the actions besides reads and writes are assumed to be atomic. through shared-memory, rather than with message-passing, For example, T-- ; can be implemented as tmp = T; tmp = trap - I; T = trap;. as might otherwise be appropriate for a distributed-memory implementation. 
In our implementation, both victim and thief operate directly through shared memory on the victim's ready deque. The crucial issue is how to resolve the race condition that arises when a thief tries to steal the same frame that its victim is attempting to pop. One simple solution is to add a lock to the deque using relatively heavyweight hardware primitives like Compare-And-Swap or Test-And-Set. Whenever a thief or worker wishes to remove a frame from the deque, it first grabs the lock. This solution has the same fundamental problem as the interrupt and polling mechanisms just described, however. Whenever a worker pops a frame, it pays the heavy price to grab a lock, which contributes to work overhead.
Consequently, we adopted a solution that employs Dijkstra's protocol for mutual exclusion [11], which assumes only that reads and writes are atomic. Because our protocol uses three atomic shared variables T, H, and E, we call it the THE protocol. The key idea is that actions by the worker on the tail of the queue contribute to work overhead, while actions by thieves on the head of the queue contribute only to critical-path overhead. Therefore, in accordance with the work-first principle, we attempt to move costs from the worker to the thief. To arbitrate among different thieves attempting to steal from the same victim, we use a hardware lock, since this overhead can be amortized against the critical path. To resolve conflicts between a worker and the sole thief holding the lock, however, we use a lightweight Dijkstra-like protocol which contributes minimally to work overhead. A worker resorts to a heavyweight hardware lock only when it encounters an actual conflict with a thief, in which case we can charge the overhead that the victim incurs to the critical path.
In the rest of this section, we describe the THE protocol in detail. We first present a simplified protocol that uses only two shared variables T and H designating the tail and the head of the deque, respectively. Later, we extend the protocol with a third variable E that allows exceptions to be signaled to a worker. The exception mechanism is used to implement Cilk's abort statement. Interestingly, this extension does not introduce any additional work overhead.
The pseudocode of the simplified THE protocol is shown in Figure 4. Assume that shared memory is sequentially consistent [20].⁵ The code assumes that the ready deque is implemented as an array of frames. The head and tail of the deque are determined by two indices T and H, which are stored in shared memory and are visible to all processors. The index T points to the first unused element in the array, and H points to the first frame on the deque. Indices grow from the head towards the tail so that under normal conditions, we have T ≥ H. Moreover, each deque has a lock L implemented with atomic hardware primitives or with OS calls.
The worker uses the deque as a stack. (See Section 4.) Before a spawn, it pushes a frame onto the tail of the deque. After a spawn, it pops the frame, unless the frame has been stolen. A thief attempts to steal the frame at the head of the deque. Only one thief at a time may steal from the deque, since a thief grabs L as its first action. As can be seen from the code, the worker alters T but not H, whereas the thief only increments H and does not alter T.
The only possible interaction between a thief and its victim occurs when the thief is incrementing H while the victim is decrementing T. Consequently, it is always safe for a worker to append a new frame at the end of the deque (push) without worrying about the actions of the thief. For a pop operation, there are three cases, which are shown in Figure 5. In case (a), the thief and the victim can both get a frame from the deque. In case (b), the deque contains only one frame. If the victim decrements T without interference from thieves, it gets the frame. Similarly, a thief can steal the frame as long as its victim is not trying to obtain it. If both the thief and the victim try to grab the frame, however, the protocol guarantees that at least one of them discovers that H > T. If the thief discovers that H > T, it restores H to its original value and retreats. If the victim discovers that H > T, it restores T to its original value and restarts the protocol after having acquired L. With L acquired, no thief can steal from this deque, so the victim can pop the frame without interference (if the frame is still there). Finally, in case (c) the deque is empty. If a thief tries to steal, it will always fail. If the victim tries to pop, the attempt fails and control returns to the Cilk runtime system. The protocol cannot deadlock, because each process holds at most one lock at a time.

[Figure 5 here: the three cases of the ready deque.]

Figure 5: The three cases of the ready deque in the simplified THE protocol: (a) the deque contains at least two frames, (b) the deque contains only one frame, and (c) the deque is empty. A shaded entry indicates the presence of a frame at a certain position in the deque. The head and the tail are marked by H and T.

We now argue that the THE protocol contributes little to the work overhead. Pushing a frame involves no overhead beyond updating T. In the common case where a worker can successfully pop a frame, the pop protocol performs only 6 operations: 2 memory loads, 1 memory store, 1 decrement, 1 comparison, and 1 (predictable) conditional branch. Moreover, in the common case where no thief operates on the deque, both H and T can be cached exclusively by the worker. The expensive operation of a worker grabbing the lock L occurs only when a thief is simultaneously trying to steal the frame being popped. Since the number of steal attempts depends on T∞, not on T1, the relatively heavy cost of a victim grabbing L can be considered as part of the critical-path overhead c∞ and does not influence the work overhead c1.
We ran some experiments to determine the relative performance of the THE protocol versus the straightforward protocol in which pop just locks the deque before accessing it. On a 167-megahertz UltraSPARC I, the THE protocol is about 25% faster than the simple locking protocol. This machine's memory model requires that a memory fence instruction (membar) be inserted between lines 5 and 6 of the pop pseudocode. We tried to quantify the performance impact of the membar instruction, but in all our experiments the execution times of the code with and without membar are about the same. On a 200-megahertz Pentium Pro running Linux and gcc 2.7.1, the THE protocol is only about 5% faster than the locking protocol. On this processor, the THE protocol spends about half of its time in the memory fence.
Because it replaces locks with memory synchronization, the THE protocol is more "nonblocking" than a straightforward locking protocol. Consequently, the THE protocol is less prone to problems that arise when spin locks are used extensively. For example, even if a worker is suspended by the operating system during the execution of pop, the infrequency of locking in the THE protocol means that a thief can usually complete a steal operation on the worker's deque. Recent work by Arora et al. [2] has shown that a completely nonblocking work-stealing scheduler can be implemented. Using these ideas, Lisiecki and Medina [21] have modified the Cilk-5 scheduler to make it completely nonblocking. Their experience is that the THE protocol greatly simplifies a nonblocking implementation.
The simplified THE protocol can be extended to support the signaling of exceptions to a worker. In Figure 4, the index H plays two roles: it marks the head of the deque, and it marks the point that the worker cannot cross when it pops. These places in the deque need not be the same. In the full THE protocol, we separate the two functions of H into two variables: H, which now only marks the head of the deque, and E, which marks the point that the victim cannot cross. Whenever E > T, some exceptional condition has occurred, which includes the frame being stolen, but it can also be used for other exceptions. For example, setting E = ∞ causes the worker to discover the exception at its next pop. In the new protocol, E replaces H in line 6 of the worker/victim code. Moreover, lines 7-15 of the worker/victim code are replaced by a call to an exception handler to determine the type of exception (stolen frame or otherwise) and the proper action to perform. The thief code is also modified. Before trying to steal, the thief increments E. If there is nothing to steal, the thief restores E to the original value. Otherwise, the thief steals frame H and increments H. From the point of view of a worker, the common case is the same as in the simplified protocol: it compares two pointers (E and T rather than H and T).
The exception mechanism is used to implement abort. When a Cilk procedure executes an abort instruction, the runtime system serially walks the tree of outstanding descendants of that procedure. It marks the descendants as aborted and signals an abort exception on any processor working on a descendant. At its next pop, an aborted procedure will discover the exception, notice that it has been aborted, and return immediately. It is conceivable that a procedure could run for a long time without executing a pop and discovering that it has been aborted. We made the design decision to accept the possibility of this unlikely scenario, figuring that more cycles were likely to be lost in work overhead if we abandoned the THE protocol for a mechanism that solves this minor problem.

⁵ If the shared memory is not sequentially consistent, a memory fence must be inserted between lines 5 and 6 of the worker/victim code and between lines 3 and 4 of the thief code to ensure that these instructions are executed in the proper order.

6 Benchmarks

In this section, we evaluate the performance of Cilk-5. We show that on 12 applications, the work overhead c1 is close to 1, which indicates that the Cilk-5 implementation exploits the work-first principle effectively. We then present a breakdown of Cilk's work overhead c1 on four machines. Finally, we present experiments showing that the critical-path overhead c∞ is reasonably small as well.
Figure 6 shows a table of performance measurements taken for 12 Cilk programs on a Sun Enterprise 5000 SMP with 8 167-megahertz UltraSPARC processors, each with 512 kilobytes of L2 cache and 16 kilobytes each of L1 data and instruction caches, running Solaris 2.5. We compiled our programs with gcc 2.7.2 at optimization level -O3. For a full description of these programs, see the Cilk 5.1 manual [8]. The table shows the work of each Cilk program T1, the critical path T∞, and the two derived quantities P̄ and c1. The table also lists the running time T8 on 8 processors, the speedup T1/T8 relative to the one-processor execution time, and the speedup TS/T8 relative to the serial execution time.

    Program      Size         T1      T∞       P̄       c1     T8     T1/T8   TS/T8
    fib          35           12.77   0.0005   25540    3.63   1.60   8.0     2.2
    blockedmul   1024         29.9    0.0044   6730     1.05   4.3    7.0     6.6
    notempmul    1024         29.7    0.015    1970     1.05   3.9    7.6     7.2
    strassen     1024         20.2    0.58     35       1.01   3.54   5.7     5.6
    *cilksort    4,100,000    5.4     0.0049   1108     1.21   0.90   6.0     5.0
    †queens      22           150.    0.0015   96898    0.99   18.8   8.0     8.0
    †knapsack    30           75.8    0.0014   54143    1.03   9.5    8.0     7.7
    lu           2048         155.8   0.42     370      1.02   20.3   7.7     7.5
    *cholesky    BCSSTK32     1427.   3.4      420      1.25   208.   6.9     5.5
    heat         4096 x 512   62.3    0.16     384      1.08   9.4    6.6     6.1
    fft          2^20         4.3     0.0020   2145     0.93   0.77   5.6     6.0
    Barnes-Hut   2^16         124.    0.15     853      1.02   16.5   7.5     7.4

Figure 6: The performance of example Cilk programs. Times are in seconds and are accurate to within about 10%. The serial programs are C elisions of the Cilk programs, except for those programs that are starred (*), where the parallel program implements a different algorithm than the serial program. Programs labeled by a dagger (†) are nondeterministic, and thus, the running time on one processor is not the same as the work performed by the computation. For these programs, the value for T1 indicates the actual work of the computation on 8 processors, and not the running time on one processor.

For the 12 programs, the average parallelism P̄ is in most cases quite large relative to the number of processors on a typical SMP. These measurements validate our assumption of parallel slackness, which implies that the work term dominates in Inequality (4). For instance, on 1024 x 1024 matrices, notempmul runs with an average parallelism of 1970, yielding adequate parallel slackness for up to several hundred processors. For even larger machines, one normally would not run such a small problem. For notempmul, as well as the other 11 applications, the average parallelism grows with problem size, and thus sufficient parallel slackness is likely to exist even for much larger machines, as long as the problem sizes are scaled appropriately.
The work overhead c1 is only a few percent larger than 1 for most programs, which shows that our design of Cilk-5 faithfully implements the work-first principle. The two cases where the work overhead is larger (cilksort and cholesky) are due to the fact that we had to change the serial algorithm to obtain a parallel program, and thus the comparison is not against the C elision. For example, the serial C algorithm for sorting is an in-place quicksort, but the parallel algorithm cilksort requires an additional temporary array which adds overhead beyond the overhead of Cilk itself. Similarly, our parallel Cholesky factorization uses a quadtree representation of the sparse matrix, which induces more work than the linked-list representation used in the serial C algorithm. Finally, the work overhead for fib is large, because fib does essentially no work besides spawning procedures. Thus, the overhead c1 = 3.63 for fib gives a good estimate of the cost of a Cilk spawn versus a traditional C function call. With such a small overhead for spawning, one can understand why, for most of the other applications, which perform significant work for each spawn, the overhead of Cilk-5's scheduling is barely noticeable compared to the 10% "noise" in our measurements.
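The derived quantities in Figure 6 follow directly from the measured times. The sketch below uses our own helper names; the formulas are the paper's definitions.

```c
/* Derived quantities from Figure 6.  TS is the serial (C elision)
   running time, T1 the one-processor Cilk time, Tinf the critical
   path, and T8 the 8-processor running time. */

/* Work overhead c1 = T1 / TS: the cost of Cilk semantics on one processor. */
double work_overhead(double t1, double ts) { return t1 / ts; }

/* Average parallelism = T1 / Tinf. */
double avg_parallelism(double t1, double tinf) { return t1 / tinf; }

/* The speedups reported in the last two columns of the table. */
double speedup_vs_one_proc(double t1, double t8) { return t1 / t8; }
double speedup_vs_serial(double ts, double t8) { return ts / t8; }
```

For fib, T1 = 12.77 s and T8 = 1.60 s give a speedup of about 8.0; with c1 = 3.63, the serial time is TS = 12.77/3.63, roughly 3.5 s, so the speedup over the serial elision is only about 2.2, matching the table.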

[Figure 7 here: bar chart of per-spawn overhead breakdown for fib on a 466 MHz Alpha 21164, a 200 MHz Pentium Pro, a 167 MHz UltraSPARC I, and a 195 MHz MIPS R10000.]

Figure 7: Breakdown of overheads for fib running on one processor on various architectures. The overheads are normalized to the running time of the serial C elision. The three overheads are for saving the state of a procedure before a spawn, the allocation of activation frames for procedures, and the THE protocol. Absolute times are given for the per-spawn running time of the C elision.

[Figure 8 here: log-log plot of normalized speedup versus normalized machine size.]

Figure 8: Normalized speedup curve for Cilk-5. The horizontal axis is the number P of processors and the vertical axis is the speedup T1/TP, but each data point has been normalized by dividing by T1/T∞. The graph also shows the speedup predicted by the formula TP = T1/P + T∞.
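The model used in the Figure 8 caption is easy to state in code. The helper names below are ours; the formulas are the paper's: predicted time TP = T1/P + T∞, normalized machine size P/(T1/T∞), and normalized speedup (T1/TP)/(T1/T∞) = T∞/TP.

```c
/* Greedy-scheduling model from Figure 8: predicted P-processor time. */
double predicted_tp(double t1, double tinf, double p) {
    return t1 / p + tinf;
}

/* Normalized machine size: the inverse slackness P / (T1/Tinf). */
double norm_machine_size(double t1, double tinf, double p) {
    return p / (t1 / tinf);
}

/* Normalized speedup: (T1/Tp) / (T1/Tinf), i.e. the fraction of the
   maximum obtainable speedup. */
double norm_speedup(double t1, double tinf, double tp) {
    return (t1 / tp) / (t1 / tinf);
}
```

When P equals the average parallelism T1/T∞, the model gives TP = 2T∞, so the normalized speedup is 0.5 at normalized machine size 1.0, which is roughly where the knee of the measured curve sits.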

We now present a breakdown of Cilk's serial overhead c1 into its components. Because scheduling overheads are small for most programs, we perform our analysis with the fib program from Figure 1. This program is unusually sensitive to scheduling overheads, because it contains little actual computation. We give a breakdown of the serial overhead into three components: the overhead of saving state before spawning, the overhead of allocating activation frames, and the overhead of the THE protocol.
Figure 7 shows the breakdown of Cilk's serial overhead for fib on four machines. Our methodology for obtaining these numbers is as follows. First, we take the serial C fib program and time its execution. Then, we individually add in the code that generates each of the overheads and time the execution of the resulting program. We attribute the additional time required by the modified program to the scheduling code we added. In order to verify our numbers, we timed the fib code with all of the Cilk overheads added (the code shown in Figure 3), and compared the resulting time to the sum of the individual overheads. In all cases, the two times differed by less than 10%.
Overheads vary across architectures, but the overhead of Cilk is typically only a few times the C running time on this spawn-intensive program. Overheads on the Alpha machine are particularly large, because its native C function calls are fast compared to the other architectures. The state-saving costs are small for fib, because all four architectures have write buffers that can hide the latency of the writes required.
We also attempted to measure the critical-path overhead c∞. We used the synthetic knary benchmark [4] to synthesize computations artificially with a wide range of work and critical-path lengths. Figure 8 shows the outcome from many such experiments. The figure plots the measured speedup T1/TP for each run against the machine size P for that run. In order to plot different computations on the same graph, we normalized the machine size and the speedup by dividing these values by the average parallelism P̄ = T1/T∞, as was done in [4]. For each run, the horizontal position of the plotted datum is the inverse of the slackness P/P̄, and thus, the normalized machine size is 1.0 when the number of processors is equal to the average parallelism. The vertical position of the plotted datum is (T1/TP)/P̄ = T∞/TP, which measures the fraction of maximum obtainable speedup. As can be seen in the chart, for almost all runs of this benchmark, we observed TP ≤ T1/P + 1.0 T∞. (One exceptional data point satisfies TP ≈ T1/P + 1.05 T∞.) Thus, although the work-first principle caused us to move overheads to the critical path, the ability of Cilk applications to scale up was not significantly compromised.

7 Conclusion

We conclude this paper by examining some related work.
Mohr et al. [24] introduced lazy task creation in their implementation of the Mul-T language. Lazy task creation is similar in many ways to our lazy scheduling techniques. Mohr et al. report a work overhead of around 2 when comparing with serial T, the Scheme dialect on which Mul-T is based. Our research confirms the intuition behind their methods and shows that work overheads of close to 1 are achievable.
The Cid language [26] is like Cilk in that it uses C as a base language and has a simple preprocessing compiler to convert parallel Cid constructs to C. Cid is designed to work in a distributed-memory environment, and so it employs latency-hiding mechanisms which Cilk-5 could avoid. (We are working on a distributed version of Cilk, however.) Both Cilk and Cid recognize the attractiveness of basing a parallel language on C so as to leverage C compiler technology for high-performance codes. Cilk is a faithful extension of C, however, supporting the simplifying notion of a C elision and allowing Cilk to exploit the C compiler technology more readily.
TAM [10] and Lazy Threads [14] also analyze many of the same overhead issues in a more general, "nonstrict" language setting, where the individual performances of a whole host of mechanisms are required for applications to obtain good overall performance. In contrast, Cilk's multithreaded language provides an execution model based on work and critical-path length that allows us to focus our implementation efforts by using the work-first principle. Using this principle as a guide, we have concentrated our optimizing effort on the common-case protocol code to develop an efficient and portable implementation of the Cilk language.

Acknowledgments

We gratefully thank all those who have contributed to Cilk development, including Bobby Blumofe, Ien Cheng, Don Dailey, Mingdong Feng, Chris Joerg, Bradley Kuszmaul, Phil Lisiecki, Alberto Medina, Rob Miller, Aske Plaat, Bin Song, Andy Stark, Volker Strumpen, and Yuli Zhou. Many thanks to all our users who have provided us with feedback and suggestions for improvements. Martin Rinard suggested the term "work-first."

References

[1] Andrew W. Appel and Zhong Shao. Empirical and analytic study of stack versus heap cost for languages with closures. Journal of Functional Programming, 6(1):47-74, 1996.
[2] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998. To appear.
[3] Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
[4] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55-69, August 1996.
[5] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356-368, Santa Fe, New Mexico, November 1994.
[6] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, 21(2):201-206, April 1974.
[7] Guang-Ien Cheng, Mingdong Feng, Charles E. Leiserson, Keith H. Randall, and Andrew F. Stark. Detecting data races in Cilk programs that use locks. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998. To appear.
[8] Cilk-5.1 (Beta 1) Reference Manual. Available on the Internet from http://theory.lcs.mit.edu/~cilk.
[9] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. The MIT Press, Cambridge, Massachusetts, 1990.
[10] David E. Culler, Anurag Sah, Klaus Erik Schauser, Thorsten von Eicken, and John Wawrzynek. Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 164-175, Santa Clara, California, April 1991.
[11] E. W. Dijkstra. Solution of a problem in concurrent programming control. Communications of the ACM, 8(9):569, September 1965.
[12] Marc Feeley. Polling efficiently on stock hardware. In Proceedings of the 1993 ACM SIGPLAN Conference on Functional Programming and Computer Architecture, pages 179-187, Copenhagen, Denmark, June 1993.
[13] Mingdong Feng and Charles E. Leiserson. Efficient detection of determinacy races in Cilk programs. In Proceedings of the Ninth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 1-11, Newport, Rhode Island, June 1997.
[14] S. C. Goldstein, K. E. Schauser, and D. E. Culler. Lazy threads: Implementing a fast parallel call. Journal of Parallel and Distributed Computing, 37(1):5-20, August 1996.
[15] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416-429, March 1969.
[16] Dirk Grunwald. Heaps o' stacks: Time and space efficient threads without operating system support. Technical Report CU-CS-750-94, University of Colorado, November 1994.
[17] Dirk Grunwald and Richard Neves. Whole-program optimization for time and space efficient threads. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 50-59, Cambridge, Massachusetts, October 1996.
[18] Christopher F. Joerg. The Cilk System for Parallel Multithreaded Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, January 1996.
[19] Robert H. Halstead, Jr. Multilisp: A language for concurrent symbolic computation. ACM Transactions on Programming Languages and Systems, 7(4):501-538, October 1985.
[20] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690-691, September 1979.
[21] Phillip Lisiecki and Alberto Medina. Personal communication.
[22] James S. Miller and Guillermo J. Rozas. Garbage collection is fast, but a stack is faster. Technical Report Memo 1462, MIT Artificial Intelligence Laboratory, Cambridge, MA, 1994.
[23] Robert C. Miller. A type-checking preprocessor for Cilk 2, a multithreaded C language. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1995.
[24] Eric Mohr, David A. Kranz, and Robert H. Halstead, Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, 2(3):264-280, July 1991.
[25] Joel Moses. The function of FUNCTION in LISP or why the FUNARG problem should be called the environment problem. Technical Report memo AI-199, MIT Artificial Intelligence Laboratory, June 1970.
[26] Rishiyur Sivaswami Nikhil. Parallel symbolic computing in Cid. In Proceedings of the Workshop on Parallel Symbolic Computing, Beaune, France, Springer-Verlag LNCS 1068, pages 217-242, October 1995.
[27] Per Stenström. VLSI support for a cactus stack oriented memory organization. In Proceedings of the Twenty-First Annual Hawaii International Conference on System Sciences, volume 1, pages 211-220, January 1988.
[28] Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, August 1990.
