<<

The Implementation of the Multithreaded Language

Matteo Frigo Charles E Leiserson Keith H Randall

MIT Lab oratory for Computer Science

Technology Square

Cambridge Massachusetts

fathenacelrandallglcsmitedu

our latest Cilk release still uses a theoretically ecient Abstract

scheduler but the language has b een simplied considerably

The fth release of the multithreaded language Cilk uses a

It employs callreturn semantics for parallelism and features

provably go o workstealing algorithm similar

a linguistically simple inlet mechanism for nondeterminis

to the rst system but the language has b een completely re

designed and the completely reengineered

tic control Cilk is designed to run eciently on contem

The eciency of the new implementation was aided by a

p orary symmetric multipro cessors SMPs which feature

clear strategy that arose from a theoretical analysis of the

hardware supp ort for We have co ded many

scheduling algorithm concentrate on minimizing overheads

applications in Cilk including the So crates and Cilkchess

that contribute to the work even at the exp ense of overheads

chessplaying programs which have won prizes in interna

that contribute to the critical path Although it may seem

counterintuitive to move overheads onto the critical path

tional comp etitions

this workrst principle has led to a p ortable Cilk im

The philosophy b ehind Cilk development has b een to

plementation in which the typical cost of spawning a parallel

make the Cilk language a true parallel extension of b oth

is only b etween and times the cost of a C function

semantically and with resp ect to p erformance On a paral

call on a variety of contemp orary machines Many Cilk pro

lel computer Cilk control constructs allow the program to

grams run on one pro cessor with virtually no degradation

compared to equivalent C programs This pap er describ es

execute in parallel If the Cilk keywords for parallel control

how the workrst principle was exploited in the design of

are elided from a Cilk program however a syntactically and

Cilk s and its runtime system In particular we

semantically correct C program results which we call the C

present Cilk s novel twoclone compilation strategy and

elision or more generally the serial elision of the Cilk

its Dijkstralike mutualexclusion proto col for implementing

program Cilk is a faithful extension of C b ecause the C

the ready deque in the workstealing scheduler

elision of a Cilk program is a correct implementation of the

semantics of the program Moreover on one pro cessor a

Keywords

parallel Cilk program scales down to run nearly as fast as

Critical path multithreading program

its C elision

ming language runtime system work

Unlike in Cilk where the Cilk scheduler was an identi

able piece of co de in Cilk b oth the compiler and runtime

Intro duction

system b ear the resp onsibility for scheduling To obtain ef

Cilk is a multithreaded language for parallel programming

ciency we have of course attempted to reduce scheduling

that generalizes the semantics of C by intro ducing linguistic

overheads Some overheads have a larger impact on execu

constructs for parallel control The original Cilk release

tion time than others however A theoretical understanding

featured a provably ecient randomized work

of Cilks scheduling algorithm has allowed us to iden

stealing scheduler but the language was clumsy

tify and optimize the common cases According to this ab

b ecause parallelism was exp osed by hand using explicit

stract theory the p erformance of a Cilk computation can b e

passing The Cilk language implemented by

characterized by two quantities its work which is the to

tal time needed to execute the computation serially and its

This research was supp orted in part by the Defense Advanced

criticalpath length which is its execution time on an in

Research Pro jects Agency DARPA under Grant N

Computing facilities were provided by the MIT Xolas Pro ject thanks

nite numb er of pro cessors Cilk provides instrumentation

to a generous equipment donation from

that allows a user to measure these two quantities Within

Cilks scheduler we can identify a given cost as contribut

To app ear in Proceedings of the ACM SIGPLAN

ing to either work overhead or criticalpath overhead Much

Conference on Design and Im

plementation PLDI Montreal Canada June of the eciency of Cilk derives from the following principle

which we shall justify in Section

include stdlibh

The workrst principle Minimize the schedul

include stdioh

ing overhead borne by the work of a computation

include cilkh

Specical ly move overheads out of the work and

onto the critical path

cilk int fib int n

The workrst principle played an imp ortant role during the

if n return n

design of earlier Cilk systems but Cilk exploits the prin

else

int x y

ciple more extensively

x spawn fib n

The workrst principle inspired a twoclone strategy

y spawn fib n

for compiling Cilk programs Our cilkc compiler is a

sync

return xy

typ echecking sourcetosource translator that transforms a

Cilk source into a C p ostsource which makes calls to Cilks

runtime library The C p ostsource is then run through the

cilk int main int argc char argv

gcc compiler to pro duce ob ject co de The cilkc compiler

pro duces two clones of every Cilk pro cedurea fast clone

int n result

and a slow clone The fast clone which is identical in

n atoiargv

result spawn fibn

most resp ects to the C elision of the Cilk program executes

sync

in the common case where serial semantics suce The slow

printf Result dn result

clone is executed in the infrequent case that parallel seman

return

tics and its concomitant b o okkeeping are required All com

munication due to scheduling o ccurs in the slow clone and

contributes to criticalpath overhead but not to work over

Figure A simple Cilk program to compute the nth Fib onacci

numb er in parallel using a very bad algorithm

head

The workrst principle also inspired a Dijkstralike

sharedmemory mutualexclusion proto col as part of the

The basic Cilk language can b e understo o d from an exam

runtime loadbalancing scheduler Cilks scheduler uses a

ple Figure shows a Cilk program that computes the nth

workstealing algorithm in which idle pro cessors called

Fib onacci numb er Observe that the program would b e an

thieves steal threads from busy pro cessors called vic

ordinary C program if the three keywords cilk spawn and

tims Cilks scheduler guarantees that the cost of steal

sync are elided

ing contributes only to criticalpath overhead and not to

The keyword cilk identies fib as a Cilk procedure

work overhead Nevertheless it is hard to avoid the mutual

which is the parallel analog to a C function Parallelism

exclusion costs incurred by a p otential victim which con

is created when the keyword spawn precedes the invo cation

tribute to work To minimize work overhead instead of using

of a pro cedure The semantics of a spawn diers from a

lo cking Cilks runtime system uses a Dijkstralike proto col

C function call only in that the parent can continue to ex

which we call the THE proto col to manage the runtime

ecute in parallel with the child instead of waiting for the

deque of ready threads in the workstealing algorithm An

child to complete as is done in C Cilks scheduler takes the

added advantage of the THE proto col is that it allows an

resp onsibility of scheduling the spawned pro cedures on the

exception to b e signaled to a working pro cessor with no ad

pro cessors of the parallel computer

ditional work overhead a feature used in Cilks ab ort mech

A Cilk pro cedure cannot safely use the values returned by

anism

its children until it executes a sync statement The sync

The remainder of this pap er is organized as follows Sec

statement is a lo cal barrier not a global one as for ex

tion overviews the basic features of the Cilk language Sec

ample is used in messagepassing programming In the Fi

tion justies the workrst principle Section describ es

b onacci example a sync statement is required b efore the

how the twoclone strategy is implemented and Section

statement return xy to avoid the anomaly that would

presents the THE proto col Section gives empirical evi

o ccur if x and y are summed b efore they are computed In

dence that the Cilk scheduler is ecient Finally Section

addition to explicit synchronization provided by the sync

presents related work and oers some conclusions

statement every Cilk pro cedure syncs implicitly b efore it

returns thus ensuring that all of its children terminate b e

The Cilk language

fore it do es

Ordinarily when a spawned pro cedure returns the re

This section presents a brief overview of the Cilk extensions

turned value is simply stored into a variable in its parents

to C as supp orted by Cilk For a complete description

frame

consult the Cilk manual The key features of the lan

1

This program uses an inecient algorithm which runs in exp onen

guage are the sp ecication of parallelism and synchroniza

tial time Although logarithmictime metho ds are known p

tion through the spawn and sync keywords and the sp eci

this program nevertheless provides a go o d didactic example

cation of nondeterminism using inlet and abort

cilk int fib int n

ab out concurrency and nondeterminism without requiring

lo cking declaration of critical regions and the like

int x

inlet void summer int result

Cilk provides syntactic sugar to pro duce certain commonly

used inlets implicitly For example the statement x

x result

spawn fibn conceptually generates an inlet similar to

return

the one in Figure

Sometimes a pro cedure spawns o parallel work which it

if n return n

later discovers is unnecessary This sp eculative work can

else

summerspawn fib n

b e ab orted in Cilk using the abort primitive inside an in

summerspawn fib n

let A common use of abort o ccurs during a parallel search

sync

where many p ossibilities are searched in parallel As so on as

return x

a solution is found by one of the searches one wishes to ab ort

any currently executing searches as so on as p ossible so as not

to waste pro cessor resources The abort statement when

Figure Using an inlet to compute the nth Fib onnaci numb er

executed inside an inlet causes all of the alreadyspawned

children of the pro cedure to terminate

We considered using futures with implicit synchro

x spawn fooy

nization as well as synchronizing on sp ecic variables in

Occasionally one would like to incorp orate the returned

stead of using the simple spawn and sync statements We

value into the parents frame in a more complex way Cilk

realized from the workrst principle however that dier

provides an inlet feature for this purp ose which was in

ent synchronization mechanisms could have an impact only

spired in part by the inlet feature of TAM

on the criticalpath of a computation and so this issue was

An inlet is essentially a C function internal to a Cilk pro

of secondary concern Consequently we opted for imple

cedure In the normal syntax of Cilk the spawning of a

mentation simplicity Also in systems that supp ort re

pro cedure must o ccur as a separate statement and not in an

laxed memoryconsistency mo dels the explicit sync state

expression An exception is made to this rule if the spawn

ment can b e used to ensure that all sideeects from previ

is p erformed as an argument to an inlet call In this case

ously spawned subpro cedures have o ccurred

the pro cedure is spawned and when it returns the inlet is

In addition to the control synchronization provided by

invoked In the meantime control of the parent pro cedure

sync Cilk can use explicit lo cking to syn

pro ceeds to the statement following the inlet call In princi

chronize accesses to data providing mutual exclusion and

ple inlets can take multiple spawned arguments but Cilk

atomicity Data synchronization is an overhead b orne on

has the restriction that exactly one argument to an inlet

the work however and although we have striven to min

may b e spawned and that this argument must b e the rst

imize these overheads negrain lo cking on contemp orary

argument If necessary this restriction is easy to program

pro cessors is exp ensive We are currently investigating how

around

to incorp orate atomicity into the Cilk language so that pro

Figure illustrates how the fib function might b e co ded

to col issues involved in lo cking can b e avoided at the user

using inlets The inlet summer is dened to take a returned

level To aid in the debugging of Cilk programs that use

value result and add it to the variable x in the frame of the

lo cks we have b een developing a to ol called the Nonde

pro cedure that do es the spawning All the variables of fib

terminator which detects common synchronization

are available within summer since it is an internal function

bugs called data races

of fib

No lo ck is required around the accesses to x by summer

The workrst principle

b ecause Cilk provides atomicity implicitly The concern is

This section justies the workrst principle stated in Sec

that the two up dates might o ccur in parallel and if atomic

tion by showing that it follows from three assumptions

ity is not imp osed an up date might b e lost Cilk provides

First we assume that Cilks scheduler op erates in practice

implicit atomicity among the threads of a pro cedure in

according to the theoretical analysis presented in Sec

stance where a thread is a maximal sequence of instruc

ond we assume that in the common case ample parallel

tions ending with a spawn sync or return either explicit

slackness exists that is the average parallelism of a

or implicit statement An inlet is precluded from containing

Cilk program exceeds the numb er of pro cessors on which we

spawn and sync statements and thus it op erates atomically

run it by a sucient margin Third we assume as is indeed

as a single thread Implicit atomicity simplies reasoning

the case that every Cilk program has a C elision against

2

The C elision of a Cilk program with inlets is not ANSI C b ecause

which its onepro cessor p erformance can b e measured

ANSI C do es not supp ort internal C functions Cilk is based on Gnu

C technology however which do es provide this supp ort

The theoretical analysis presented in cites two funda

considerable parallel slackness The parallelisim of these ap mental lower b ounds as to how fast a Cilk program can run

plications increases with problem size thereby ensuring they Let us denote by T the execution time of a given computa

P

will run well on large machines tion on P pro cessors Then the work of the computation is

The third assumption b ehind the workrst principle is T and its criticalpath length is T For a computation with

1

that every Cilk program has a C elision against which its T work the lower b ound T  T P must hold b ecause at

P

onepro cessor p erformance can b e measured Let us denote most P units of work can b e executed in a single step In

by T the running time of the C elision Then we dene the addition the lower b ound T  T must hold since a nite

S P 1

work overhead by c T T Incorp orating criticalpath numb er of pro cessors cannot execute faster than an innite

S

and work overheads into Inequality yields numb er

Cilks randomized workstealing scheduler executes

T  c T P c T

P S 1 1

a Cilk computation on P pro cessors in exp ected time

 c T P

S

T T P O T

P 1

since we assume parallel slackness

assuming an ideal parallel computer This equation resem

We can now restate the workrst principle precisely Min

bles Brents theorem and is optimal to within a

imize c even at the expense of a larger c b ecause c has a

1

constant factor since T P and T are b oth lower b ounds

1

more direct impact on p erformance Adopting the workrst

We call the rst term on the righthand side of Equation

principle may adversely aect the ability of an application

the work term and the second term the criticalpath term

to scale up however if the criticalpath overhead c is to o

1

Imp ortantly all communication costs due to Cilks scheduler

large But as we shall see in Section criticalpath over

are b orne by the criticalpath term as are most of the other

head is reasonably small in Cilk and many applications

scheduling costs To make these overheads explicit we de

can b e co ded with large amounts of parallelism

ne the criticalpath overhead to b e the smallest constant

The workrst principle p ervades the Cilk implementa

c such that

1

tion The workstealing scheduler guarantees that with high

T  T P c T

P 1 1

probability only O PT steal migration attempts o ccur

1

that is O T on average p er pro cessor all costs for which

1

The second assumption needed to justify the workrst

are b orne on the critical path Consequently the scheduler

principle fo cuses on the commoncase regime in which a

for Cilk p ostp ones as much of the scheduling cost as p os

parallel program op erates Dene the average paral lelism

sible to when work is b eing stolen thereby removing it as

P T T which corresp onds to the maximum p os as

1

a contributor to work overhead This strategy of amortiz

sible sp eedup that the application can obtain Dene also

ing costs against steal attempts p ermeates virtually every

the paral lel slackness to b e the ratio P P The as

decision made in the design of the scheduler

P P  c which sumption of paral lel slackness is that

1

means that the numb er P of pro cessors is much smaller than

Cilks compilation strategy

the average parallelism P Under this assumption it follows

that T P  c T and hence from Inequality that

1 1

This section describ es how our cilkc compiler generates C

T  T P and we obtain linear sp eedup The critical

P

p ostsource from a Cilk program As dictated by the work

path overhead c has little eect on p erformance when su

1

rst principle our compiler and scheduler are designed to

cient slackness exists although it do es determines how much

reduce the work overhead as much as p ossible Our strategy

slackness must exist to ensure linear sp eedup

is to generate two clones of each pro cedurea fast clone and

Whether substantial slackness exists in common applica

a slow clone The fast clone op erates much as do es the C

tions is a matter of opinion and empiricism but we suggest

elision and has little supp ort for parallelism The slow clone

that slackness is the common case The expressiveness of

has full supp ort for parallelism along with its concomitant

Cilk makes it easy to co de applications with large amounts

overhead We rst describ e the Cilk scheduling algorithm

of parallelism For mo destsized problems many applica

Then we describ e how the compiler translates the Cilk lan

tions exhibit an average parallelism of over yielding sub

guage constructs into co de for the fast and slow clones of

stantial slackness on contemp orary SMPs Even on Sandia

each pro cedure Lastly we describ e how the runtime sys

National Lab oratorys Paragon which contains

tem links together the actions of the fast and slow clones to

no des the So crates chess program co ded in Cilk ran

pro duce a complete Cilk implementation

in its linearsp eedup regime during the ICCA World

As in lazy task creation in Cilk each pro ces

Computer Chess Championship where it placed second in

sor called a worker maintains a ready deque doubly

a eld of Section describ es a dozen other diverse

ended queue of ready pro cedures technically pro cedure

applications which were run on an pro cessor SMP with

instances Each deque has two ends a head and a tail

3

This abstract mo del of execution time ignores reallife details

from which pro cedures can b e added or removed A worker

such as memoryhierarchy eects but is nonetheless quite accurate

op erates lo cally on the tail of its own deque treating it much

int fib int n

ample of the p ostsource for the fast clone of fib is given

in Figure The generated C co de has the same general

fibframe f frame pointer

structure as the C elision with a few additional statements

f allocsizeoff al locate frame

fsig fibsig initialize frame

In lines an activation frame is allo cated for fib and

if n

initialized The Cilk runtime system uses activation frames

freef sizeoff free frame

to represent pro cedure instances Using techniques similar

return n

to our inlined allo cator typically takes only a few

else

cycles The frame is initialized in line by storing a p ointer

int x y

to a static structure called a signature describing fib

fentry save PC

fn n save live vars

The rst spawn in fib is translated into lines In

T f store frame pointer

lines the state of the fib pro cedure is saved into

push push frame

the activation frame The saved state includes the program

x fib n do C cal l

if popx FAILURE pop frame

counter enco ded as an entry numb er and all live dirty vari

return frame stolen

ables Then the frame is pushed on the runtime deque in

second spawn

lines Next we call the fib routine as we would

sync is free

freef sizeoff free frame

in C Because the spawn statement itself compiles directly

return xy

to its C elision the p ostsource can exploit the optimization

capabilities of the C compiler including its ability to pass

arguments and receive return values in registers rather than

in memory

Figure The fast clone generated by cilkc for the fib pro ce

After fib returns lines check to see whether the

dure from Figure The co de for the second spawn is omitted

parent pro cedure has b een stolen If it has we return im

The functions alloc and free are inlined calls to the runtime

sig contains systems fast memory allo cator The signature fib

mediately with a dummy value Since all of the ancestors

a description of the fib pro cedure including a p ointer to the slow

have b een stolen as well the C stack quickly unwinds and

clone The push and pop calls are op erations on the scheduling

control is returned to the runtime system The proto col

deque and are describ ed in detail in Section

to check whether the parent pro cedure has b een stolen is

quite subtlewe p ostp one discussion of its implementation

as C treats its pushing and p opping spawned acti

to Section If the parent pro cedure has not b een stolen

vation frames When a worker runs out of work it b ecomes

it continues to execute at line p erforming the second

a thief and attempts to steal a pro cedure another worker

spawn which is not shown

called its victim The thief steals the pro cedure from the

In the fast clone all sync statements compile to noops

head of the victims deque the opp osite end from which the

Because a fast clone never has any children when it is exe

victim is working

cuting we know at compile time that all previously spawned

When a pro cedure is spawned the fast clone runs When

pro cedures have completed Thus no op erations are re

ever a thief steals a pro cedure however the pro cedure is

quired for a sync statement as it always succeeds For exam

converted to a slow clone The Cilk scheduler guarantees

ple line in Figure the translation of the sync statement

that the numb er of steals is small when sucient slackness

is just the empty statement Finally in lines fib deal

exists and so we exp ect the fast clones to b e executed most

lo cates the activation frame and returns the computed result

of the time Thus the workrst principle reduces to mini

to its parent pro cedure

mizing costs in the fast clone which contribute more heavily

The slow clone is similar to the fast clone except that

to work overhead Minimizing costs in the slow clone al

it provides supp ort for parallel execution When a pro ce

though a desirable goal is less imp ortant since these costs

dure is stolen control has b een susp ended b etween two of

contribute less heavily to work overhead and more to critical

the pro cedures threads that is at a spawn or sync p oint

path overhead

When the slow clone is resumed it uses a goto statement

We minimize the costs of the fast clone by exploiting the

to restore the program counter and then it restores lo cal

structure of the Cilk scheduler Because we convert a pro

variable state from the activation frame A spawn statement

cedure to its slow clone when it is stolen we maintain the

is translated in the slow clone just as in the fast clone For a

invariant that a fast clone has never b een stolen Further

sync statement cilkc inserts a call to the runtime system

more none of the descendants of a fast clone have b een

which checks to see whether the pro cedure has any spawned

stolen either since the strategy of stealing from the heads

children that have not returned Although the parallel b o ok

of ready deques guarantees that parents are stolen b efore

4

If the shared memory is not sequentially consistent a memory

their children As we shall see this simple fact allows many

fence must b e inserted b etween lines and to ensure that the

optimizations to b e p erformed in the fast clone

surrounding writes are executed in the prop er order

5

The setjmplongjmp facility of C could have b een used as well but

We now describ e how our cilkc compiler generates p ost

our unwinding strategy is simpler

source C co de for the fib pro cedure from Figure An ex

tions led to complicated proto cols Even though these over keeping in a slow clone is substantial it contributes little to

heads could b e charged to the criticalpath term in practice work overhead since slow clones are rarely executed

they b ecame so large that the criticalpath term contributed The separation b etween fast clones and slow clones also

signicantly to the running time thereby violating the as allows us to compile inlets and ab ort statements eciently

sumption of parallel slackness A onepro cessor execution of in the fast clone An inlet call compiles as eciently as an

a program was indeed fast but insucient slackness some ordinary spawn For example the co de for the inlet call from

times resulted in p o or parallel p erformance Figure compiles similarly to the following Cilk co de

In Cilk we simplied the allo cation of activation frames

tmp spawn fibn

by simply using a heap In the common case a frame is

summertmp

allo cated by removing it from a free list Deallo cation is

Implicit inlet calls such as x spawn fibn compile

p erformed by inserting the frame into the free list No user

directly to their C elisions An abort statement compiles to

level management of virtual memory is required except for

a noop just as a sync statement do es b ecause while it is

the initial setup of shared memory Heap allo cation con

executing a fast clone has no children to ab ort

tributes only slightly more than stack allo cation to the work

The runtime system provides the glue b etween the fast and

overhead but it saves substantially on the critical path term

slow clones that makes the whole system work It includes

On the downside heap allo cation can p otentially waste more

proto cols for stealing pro cedures returning values b etween

memory than stack allo cation due to fragmentation For a

pro cessors executing inlets ab orting computation subtrees

careful analysis of the relative merits of stack and heap based

and the like All of the costs of these proto cols can b e amor

allo cation that supp orts heap allo cation see the pap er by

tized against the critical path so their overhead do es not

App el and Shao For an equally careful analysis that

signicantly aect the running time when sucient parallel

supp orts stack allo cation see

slackness exists The p ortion of the stealing proto col exe

Thus although the workrst principle gives a general un

cuted by the worker contributes to work overhead however

derstanding of where overheads should b e b orne our exp e

thereby warranting a careful implementation We discuss

rience with Cilk showed that large enough criticalpath

this proto col in detail in Section

overheads can tip the scales to the p oint where the assump

The work overhead of a spawn in Cilk is only a few reads

tions underlying the principle no longer hold We b elieve

and writes in the fast clone reads and writes for the fib

that Cilk work overhead is nearly as low as p ossible given

example We will exp erimentally quantify the work overhead

our goal of generating p ortable C output from our compiler

in Section Some work overheads still remain in our im

Other researchers have b een able to reduce overheads even

plementation however including the allo cation and freeing

more however at the exp ense of p ortability For example

of activation frames saving state b efore a spawn pushing

lazy threads obtains eciency at the exp ense of im

and p opping of the frame on the deque and checking if a

plementing its own calling conventions stack layouts etc

pro cedure has b een stolen A p ortion of this work overhead

Although we could in principle incorp orate such machine

is due to the fact that Cilk is duplicating the work the C

dep endent techniques into our compiler we feel that Cilk

compiler p erforms but as Section shows this overhead is

strikes a go o d balance b etween p erformance and p ortability

small Although a pro duction Cilk compiler might b e able

We also feel that the current overheads are suciently low

eliminate this unnecessary work it would likely compromise

that other problems notably minimizing overheads for data

p ortability

synchronization deserve more attention

In Cilk the precursor to Cilk we to ok the workrst

principle to the extreme Cilk p erformed stackbased al

Implemention of workstealing

lo cation of activation frames since the work overhead of

stack allo cation is smaller than the overhead of heap allo ca

In this section we describ e Cilk s workstealing mecha

tion Because of the cactus stack semantics of the Cilk

nism which is based on a Dijkstralike sharedmemory

stack however Cilk had to manage the virtualmemory

mutualexclusion proto col called the THE proto col In

on each pro cessor explicitly as was done in The

accordance with the workrst principle this proto col has

work overhead in Cilk for frame allo cation was little more

b een designed to minimize work overhead For example on

than that of incrementing the stack p ointer but whenever

a megahertz UltraSPARC I the fib program with the

the stack p ointer overowed a page an exp ensive userlevel

THE proto col runs ab out faster than with hardware

interrupt ensued during which Cilk would mo dify the

lo cking primitives We rst present a simplied version of

memory map Unfortunately the op eratingsystem mech

the proto col Then we discuss the actual implementation

anisms supp orting these op erations were to o slow and un

which allows exceptions to b e signaled with no additional

predictable and the p ossibility of a page fault in critical sec

overhead

6

7

Supp ose a pro cedure A spawns two children B and C The two

Although the runtime system requires some eort to p ort b etween

children can reference ob jects in As activation frame but B and C

architectures the compiler requires no changes whatso ever for dier

do not see each others frame

ent platforms

WorkerVictim Thief

Several straightforward mechanisms might b e considered

steal

push

to implement a workstealing proto col For example a thief

L

T

might interrupt a worker and demand attention from this

H

if H T

victim This strategy presents problems for two reasons

H

pop

First the mechanisms for signaling interrupts are slow and

unlockL

T

although an interrupt would b e b orne on the critical path

return FAILURE

if H T

T

its large cost could threaten the assumption of parallel slack

unlockL

lockL

ness Second the worker would necessarily incur some over

return SUCCESS

T

head on the work term to ensure that it could b e safely

if H T

T

interrupted in a critical section As an alternative to send

unlockL

ing interrupts thieves could p ost steal requests and workers

return FAILURE

could p erio dically p oll for them Once again however a cost

unlockL

accrues to the work overhead this time for p olling Tech

niques are known that can limit the overhead of p olling

return SUCCESS

but they require the supp ort of a sophisticated compiler

The workrst principle suggests that it is reasonable to

Figure Pseudo co de of a simplied version of the THE proto col

put substantial eort into minimizing work overhead in the

The left part of the gure shows the actions p erformed by the

workstealing proto col Since Cilk is designed for shared

victim and the right part shows the actions of the thief None

of the actions b esides reads and writes are assumed to b e atomic

memory machines we chose to implement workstealing

For example T can b e implemented as tmp T tmp tmp

through sharedmemory rather than with messagepassing

T tmp

as might otherwise b e appropriate for a distributedmemory

implementation In our implementation b oth victim and

in detail We rst present a simplied proto col that uses

thief op erate directly through shared memory on the victims

only two shared variables T and H designating the tail and

ready deque The crucial issue is how to resolve the race con

the head of the deque resp ectively Later we extend the

dition that arises when a thief tries to steal the same frame

proto col with a third variable E that allows exceptions to b e

that its victim is attempting to p op One simple solution

signaled to a worker The exception mechanism is used to

is to add a lo ck to the deque using relatively heavyweight

implement Cilks abort statement Interestingly this exten

hardware primitives like CompareAndSwap or TestAnd

sion do es not intro duce any additional work overhead

Set Whenever a thief or worker wishes to remove a frame

The pseudo co de of the simplied THE proto col is shown

from the deque it rst grabs the lo ck This solution has

in Figure Assume that shared memory is sequentially

the same fundamental problem as the interrupt and p olling

consistent The co de assumes that the ready deque is

mechanisms just describ ed however Whenever a worker

implemented as an array of frames The head and tail of

p ops a frame it pays the heavy price to grab a lo ck which

the deque are determined by two indices T and H which are

contributes to work overhead

stored in shared memory and are visible to all pro cessors

Consequently we adopted a solution that employs Di

The index T p oints to the rst unused element in the array

jkstras proto col for mutual exclusion which assumes

and H p oints to the rst frame on the deque Indices grow

only that reads and writes are atomic Because our proto

from the head towards the tail so that under normal con

col uses three atomic shared variables T H and E we call

ditions we have T  H Moreover each deque has a lo ck L

it the THE proto col The key idea is that actions by the

implemented with atomic hardware primitives or with OS

worker on the tail of the queue contribute to work overhead

calls

while actions by thieves on the head of the queue contribute

The worker uses the deque as a stack See Section

only to criticalpath overhead Therefore in accordance with

Before a spawn it pushes a frame onto the tail of the deque

the workrst principle we attempt to move costs from the

After a spawn it p ops the frame unless the frame has b een

worker to the thief To arbitrate among dierent thieves

stolen A thief attempts to steal the frame at the head of

attempting to steal from the same victim we use a hard

the deque Only one thief at the time may steal from the

ware lo ck since this overhead can b e amortized against the

deque since a thief grabs L as its rst action As can b e

critical path To resolve conicts b etween a worker and the

seen from the co de the worker alters T but not H whereas

sole thief holding the lo ck however we use a lightweight

the thief only increments H and do es not alter T

Dijkstralike proto col which contributes minimally to work

The only p ossible interaction b etween a thief and its vic

overhead A worker resorts to a heavyweight hardware lo ck

8

only when it encounters an actual conict with a thief in

If the shared memory is not sequentially consistent a memory

fence must b e inserted b etween lines and of the workervictim which case we can charge the overhead that the victim incurs

co de and b etween lines and of the thief co de to ensure that these

to the critical path

instructions are executed in the prop er order

In the rest of this section we describ e the THE proto col

deque b oth H and T can b e cached exclusively by the

Thief the

orker The exp ensive op eration of a worker grabbing the

1 w

k L o ccurs only when a thief is simultaneously trying to  lo c 



the frame b eing p opp ed Since the numb er of steal  steal 



dep ends on T not on T the relatively heavy

2 H attempts  1



of a victim grabbing L can b e considered as part of the  cost 



criticalpath overhead c and do es not inuence the work  1

3

verhead c  o 

 

e ran some exp eriments to determine the relative p er   W

 

of the THE proto col versus the straightforward   formance 4 H H = T

 

in which pop just lo cks the deque b efore accessing   proto col  



On a megahertz UltraSPARC I the THE proto col  it



ab out faster than the simple lo cking proto col This 5 T is 



hines memory mo del requires that a memory fence in

 mac

struction membar b e inserted b etween lines and of the

6 T Victim

pop pseudo co de We tried to quantify the p erformance im

pact of the membar instruction but in all our exp eriments

execution times of the co de with and without membar

(a) (b) (c) the

are ab out the same On a megahertz Pentium Pro run

Figure The three cases of the ready deque in the simplied THE

ning and gcc the THE proto col is only ab out

proto col A shaded entry indicates the presence of a frame at a

faster than the lo cking proto col On this pro cessor the

certain p osition in the deque The head and the tail are marked

THE proto col sp ends ab out half of its time in the memory

by T and H

fence

Because it replaces lo cks with memory synchronization

tim o ccurs when the thief is incrementing H while the vic

the THE proto col is more nonblo cking than a straightfor

tim is decrementing T Consequently it is always safe for

ward lo cking proto col Consequently the THE proto col is

a worker to app end a new frame at the end of the deque

less prone to problems that arise when spin lo cks are used

push without worrying ab out the actions of the thief For

extensively For example even if a worker is susp ended

a pop op erations there are three cases which are shown in

by the op erating system during the execution of pop the

Figure In case a the thief and the victim can b oth get

infrequency of lo cking in the THE proto col means that a

a frame from the deque In case b the deque contains only

thief can usually complete a steal op eration on the workers

one frame If the victim decrements T without interference

deque Recent work by Arora et al has shown that a

from thieves it gets the frame Similarly a thief can steal

completely nonblo cking workstealing scheduler can b e im

the frame as long as its victim is not trying to obtain it If

plemented Using these ideas Lisiecki and Medina have

b oth the thief and the victim try to grab the frame however

mo died the Cilk scheduler to make it completely non

the proto col guarantees that at least one of them discovers

blo cking Their exp erience is that the THE proto col greatly

that H T If the thief discovers that H T it restores

simplies a nonblo cking implementation

H to its original value and retreats If the victim discovers

The simplied THE proto col can b e extended to supp ort

that H T it restores T to its original value and restarts the

the signaling of exceptions to a worker In Figure the

proto col after having acquired L With L acquired no thief

index H plays two roles it marks the head of the deque and

can steal from this deque so the victim can p op the frame

it marks the p oint that the worker cannot cross when it p ops

without interference if the frame is still there Finally in

These places in the deque need not b e the same In the full

case c the deque is empty If a thief tries to steal it will

THE proto col we separate the two functions of H into two

always fail If the victim tries to p op the attempt fails and

variables H which now only marks the head of the deque

control returns to the Cilk runtime system The proto col

and E which marks the p oint that the victim cannot cross

cannot deadlo ck b ecause each pro cess holds only one lo ck

Whenever E T some exceptional condition has o ccurred

at a time

which includes the frame b eing stolen but it can also b e used

We now argue that the THE proto col contributes little to

for other exceptions For example setting E 1 causes the

the work overhead Pushing a frame involves no overhead

worker to discover the exception at its next p op In the

b eyond up dating T In the common case where a worker

new proto col E replaces H in line of the workervictim

can succesfully p op a frame the p op proto col p erforms only

Moreover lines of the workervictim are replaced by

op erations memory loads memory store decre

a call to an exception hand ler to determine the typ e of

ment comparison and predictable conditional branch

exception stolen frame or otherwise and the prop er action

Moreover in the common case where no thief op erates on

to p erform The thief co de is also mo died Before trying to

Program Size T T P c T T T T T

1 1 1 8 1 8 S 8

fib

blockedmul

notempmul

strassen

cilksort

y queens

y knapsack

lu

cholesky BCSSTK

heat

20

fft

16

BarnesHut

Figure The p erformance of example Cilk programs Times are in seconds and are accurate to within ab out The serial programs

are C elisions of the Cilk programs except for those programs that are starred where the parallel program implements a dierent

algorithm than the serial program Programs lab eled by a dagger y are nondeterministic and thus the running time on one pro cessor

is not the same as the work p erformed by the computation For these programs the value for T indicates the actual work of the

computation on pro cessors and not the running time on one pro cessor

ble also lists the running time T on pro cessors and the steal the thief increments E If there is nothing to steal the

sp eedup T T relative to the onepro cessor execution time thief restores E to the original value Otherwise the thief

and sp eedup T T relative to the serial execution time steals frame H and increments H From the p oint of view of

S

For the programs the average parallelism P is in most a worker the common case is the same as in the simplied

cases quite large relative to the numb er of pro cessors on a proto col it compares two p ointers E and T rather than H

typical SMP These measurements validate our assumption and T

of parallel slackness which implies that the work term dom The exception mechanism is used to implement abort

inates in Inequality For instance on  matri When a Cilk pro cedure executes an abort instruction the

ces notempmul runs with an average parallelism of runtime system serially walks the tree of outstanding descen

yielding adequate parallel slackness for up to several hun dants of that pro cedure It marks the descendants as ab orted

dred pro cessors For even larger machines one normally and signals an ab ort exception on any pro cessor working on

would not run such a small problem For notempmul as well a descendant At its next pop an ab orted pro cedure will

as the other applications the average parallelism grows discover the exception notice that it has b een ab orted and

with problem size and thus sucient parallel slackness is return immediately It is conceivable that a pro cedure could

likely to exist even for much larger machines as long as the run for a long time without executing a pop and discovering

problem sizes are scaled appropriately that it has b een ab orted We made the design decision to

The work overhead c is only a few p ercent larger than accept the p ossibility of this unlikely scenario guring that

for most programs which shows that our design of Cilk more cycles were likely to b e lost in work overhead if we

faithfully implements the workrst principle The two cases abandoned the THE proto col for a mechanism that solves

where the work overhead is larger cilksort and cholesky this minor problem

are due to the fact that we had to change the serial algo

rithm to obtain a and thus the compar

Benchmarks

ison is not against the C elision For example the serial C

In this section we evaluate the p erformance of Cilk We

algorithm for sorting is an inplace quicksort but the par

show that on applications the work overhead c is close

allel algorithm cilksort requires an additional temp orary

to which indicates that the Cilk implementation exploits

array which adds overhead b eyond the overhead of Cilk it

the workrst principle eectively We then present a break

self Similarly our parallel Cholesky factorization uses a

down of Cilks work overhead c on four machines Finally

quadtree representation of the sparse matrix which induces

we present exp eriments showing that the criticalpath over

more work than the linkedlist representation used in the

head c is reasonably small as well

1

serial C algorithm Finally the work overhead for fib is

Figure shows a table of p erformance measurements taken

large b ecause fib do es essentially no work b esides spawn

for Cilk programs on a Sun Enterprise SMP with

ing pro cedures Thus the overhead c for fib gives a

megahertz UltraSPARC pro cessors each with kilo

go o d estimate of the cost of a Cilk spawn versus a traditional

bytes of L cache kilobytes each of L data and instruc

C function call With such a small overhead for spawning

tion caches running Solaris We compiled our programs

one can understand why for most of the other applications

with gcc at optimization level O For a full descrip

which p erform signicant work for each spawn the overhead

tion of these programs see the Cilk manual The

of Cilk s scheduling is barely noticeable compared to the

table shows the work of each Cilk program T the critical

noise in our measurements

P and c The ta path T and the two derived quantities

1 466 MHz 1 Alpha 21164 27ns

200 MHz Pentium Pro 78ns

167 MHz THE protocol 113ns 0.1 Ultra SPARC I frame allocation state saving Experimental data C 195 MHz Normalized Model T_1/P + T_inf MIPS R10000 115ns Work bound Critical path bound

0 1 2 3 4 5 6 7 overheads 0.01 0.01 0.1 1 10

Normalized Machine Size

Figure Breakdown of overheads for fib running on one pro

cessor on various architectures The overheads are normalized to

Figure Normalized sp eedup curve for Cilk The horizontal

the running time of the serial C elision The three overheads are

axis is the numb er P of pro cessors and the vertical axis is the

for saving the state of a pro cedure b efore a spawn the allo cation

sp eedup T T but each data p oint has b een normalized by di

P

of activation frames for pro cedures and the THE proto col Ab

viding by T T The graph also shows the sp eedup predicted

1

solute times are given for the p erspawn running time of the C

by the formula T T P T

1

P

elision

sp eedup T T for each run against the machine size P for

P

We now present a breakdown of Cilks serial overhead c

that run In order to plot dierent computations on the same

into its comp onents Because scheduling overheads are small

graph we normalized the machine size and the sp eedup by

for most programs we p erform our analysis with the fib

dividing these values by the average parallelism P T T

1

program from Figure This program is unusually sensi

as was done in For each run the horizontal p osition of

tive to scheduling overheads b ecause it contains little actual

the plotted datum is the inverse of the slackness P P and

computation We give a breakdown of the serial overhead

thus the normalized machine size is when the numb er of

into three comp onents the overhead of saving state b efore

pro cessors is equal to the average parallelism The vertical

spawning the overhead of allo cating activation frames and

P T T which p osition of the plotted datum is T T

1 P P

the overhead of the THE proto col

measures the fraction of maximum obtainable sp eedup As

Figure shows the breakdown of Cilks serial overhead

can b e seen in the chart for almost all runs of this b ench

for fib on four machines Our metho dology for obtaining

mark we observed T  T P T One exceptional

P 1

these numb ers is as follows First we take the serial C fib

data p oint satises T  T P T Thus although

P 1

program and time its execution Then we individually add

the workrst principle caused us to move overheads to the

in the co de that generates each of the overheads and time

critical path the ability of Cilk applications to scale up was

the execution of the resulting program We attribute the

not signicantly compromised

additional time required by the mo died program to the

scheduling co de we added In order to verify our numb ers

we timed the fib co de with all of the Cilk overheads added

Conclusion

the co de shown in Figure and compared the resulting

We conclude this pap er by examining some related work

time to the sum of the individual overheads In all cases

Mohr et al intro duced lazy task creation in their im

the two times diered by less than

plementation of the MulT language Lazy task creation

Overheads vary across architectures but the overhead of

is similar in many ways to our lazy scheduling techniques

Cilk is typically only a few times the C running time on this

Mohr et al rep ort a work overhead of around when com

spawnintensive program Overheads on the Alpha machine

paring with serial T the Scheme dialect on which MulT

are particularly large b ecause its native C function calls are

is based Our research conrms the intuition b ehind their

fast compared to the other architectures The statesaving

metho ds and shows that work overheads of close to are

costs are small for fib b ecause all four architectures have

achievable

write buers that can hide the latency of the writes required

The Cid language is like Cilk in that it uses C as

We also attempted to measure the criticalpath over

a base language and has a simple prepro cessing compiler to

head c We used the synthetic knary b enchmark to

1

convert parallel Cid constructs to C Cid is designed to work

synthesize computations articially with a wide range of

in a environment and so it employs

work and criticalpath lengths Figure shows the outcome

latencyhiding mechanisms which Cilk could avoid We

from many such exp eriments The gure plots the measured

Architectures SPAA Puerto Vallarta Mexico June are working on a distributed version of Cilk however Both

To app ear

Cilk and Cid recognize the attractiveness of basing a parallel

Cilk Beta Reference Manual Available on the Inter

language on C so as to leverage C compiler technology for

net from httptheorylcsmitedu ci lk

highp erformance co des Cilk is a faithful extension of C

Thomas H Cormen Charles E Leiserson and Ronald L

however supp orting the simplifying notion of a C elision

Rivest Introduction to Algorithms The MIT Press Cam

bridge Massachusetts

and allowing Cilk to exploit the C compiler technology more

David E Culler Anurag Sah Klaus Erik Schauser Thorsten

readily

von Eicken and John Wawrzynek Finegrain parallelism

TAM and Lazy Threads also analyze many of

with minimal hardware supp ort A compilercontrolled

threaded abstract machine In Proceedings of the Fourth the same overhead issues in a more general nonstrict lan

International Conference on Architectural Support for Pro

guage setting where the individual p erformances of a whole

gramming Languages and Operating Systems ASPLOS

host of mechanisms are required for applications to obtain

pages Santa Clara California April

go o d overall p erformance In contrast Cilks multithreaded

E W Dijkstra Solution of a problem in concurrent pro

gramming control Communications of the ACM

language provides an execution mo del based on work and

Septemb er

criticalpath length that allows us to fo cus our implemen

Marc Feeley Polling eciently on sto ck hardware In Pro

tation eorts by using the workrst principle Using this

ceedings of the ACM SIGPLAN Conference on Func

principle as a guide we have concentrated our optimizing

tional Programming and Computer Architecture pages

Cop enhagen Denmark June

eort on the commoncase proto col co de to develop an e

Mingdong Feng and Charles E Leiserson Ecient detection

cient and p ortable implementation of the Cilk language

of determinacy races in Cilk programs In Proceedings of the

Ninth Annual ACM Symposium on Paral lel Algorithms and

Architectures SPAA pages Newp ort Rho de Island

Acknowledgments

June

S C Goldstein K E Schauser and D E Culler Lazy

We gratefully thank all those who have contributed to Cilk

threads Implementing a fast parallel call Journal of Paral

development including Bobby Blumofe Ien Cheng Don

lel and August

Dailey Mingdong Feng Chris Jo erg Bradley Kuszmaul

R L Graham Bounds on multipro cessing timing anoma

Phil Lisiecki Alb erto Medina Rob Miller Aske Plaat Bin

lies SIAM Journal on Applied Mathematics

March

Song Andy Stark Volker Strump en and Yuli Zhou Many

Dirk Grunwald Heaps o stacks Time and space ecient

thanks to all our users who have provided us with feedback

threads without op erating system supp ort Technical Rep ort

and suggestions for improvements Martin Rinard suggested

CUCS University of Colorado Novemb er

the term workrst

Dirk Grunwald and Richard Neves Wholeprogram opti

mization for time and space ecient threads In Proceedings

of the Seventh International Conference on Architectural

References

Support for Programming Languages and Operating Systems

ASPLOS pages Cambridge Massachusetts Octo

Andrew W App el and Zhong Shao Empirical and analytic

b er

study of stack versus heap cost for languages with closures

Journal of

Christopher F Jo erg The Cilk System for Paral lel Multi

threaded Computing PhD thesis Department of Electrical

Nimar S Arora Rob ert D Blumofe and C Greg Plaxton

Engineering and Computer Science Massachusetts Institute

Thread scheduling for multiprogrammed multipro cessors In

of Technology January

Proceedings of the Tenth Annual ACM Symposium on Par

al lel Algorithms and Architectures SPAA Puerto Vallarta

Rob ert H Halstead Jr Multilisp A language for concurrent

Mexico June To app ear

symb olic computation ACM Transactions on Programming

Languages and Systems Octob er

Rob ert D Blumofe Executing Multithreaded Programs Ef

ciently PhD thesis Department of Electrical Engineering

Leslie Lamp ort How to make a multipro cessor computer

and Computer Science Massachusetts Institute of Techno

that correctly executes multipro cess programs IEEE Trans

logy Septemb er

actions on Computers C Septemb er

Rob ert D Blumofe Christopher F Jo erg Bradley C Kusz

Phillip Lisiecki and Alb erto Medina Personal communica

maul Charles E Leiserson Keith H Randall and Yuli Zhou

tion

Cilk An ecient multithreaded runtime system Journal

James S Miller and Guillermo J Rozas Garbage collection is

of Paral lel and Distributed Computing August

fast but a stack is faster Technical Rep ort Memo MIT

Articial Intelligence Lab oratory Cambridge MA

Rob ert D Blumofe and Charles E Leiserson Scheduling

Rob ert C Miller A typ echecking prepro cessor for Cilk

multithreaded computations by In Proceed

a multithreaded C language Masters thesis Department of

ings of the th Annual Symposium on Foundations of Com

Electrical Engineering and Computer Science Massachusetts

puter Science FOCS pages Santa Fe New Mex

Institute of Technology May

ico Novemb er

Eric Mohr David A Kranz and Rob ert H Halstead Jr

Richard P Brent The parallel evaluation of general arith

Lazy task creation A technique for increasing the granular

metic expressions Journal of the ACM April

ity of parallel programs IEEE Transactions on Paral lel and

Distributed Systems July

GuangIen Cheng Mingdong Feng Charles E Leiserson

Jo el Moses The function of FUNCTION in LISP or why the

Keith H Randall and Andrew F Stark Detecting data

FUNARG problem should b e called the environment prob

races in Cilk programs that use lo cks In Proceedings of the

lem Technical Rep ort memo AI MIT Articial Intelli

Tenth Annual ACM Symposium on Paral lel Algorithms and

gence Lab oratory June

Rishiyur Sivaswami Nikhil Parallel Symb olic Computing

in Cid In Proc Wkshp on Paral lel Symbolic Computing

Beaune France SpringerVerlag LNCS pages

Octob er

Per Stenstrom VLSI supp ort for a cactus stack oriented

memory organization Proceedings of the TwentyFirst An

nual Hawaii International Conference on System Sciences

volume pages January

Leslie G Valiant A bridging mo del for parallel computation

Communications of the ACM August