Dag-Consistent Distributed

Robert D. Blumofe Matteo Frigo Christopher F. Joerg Charles E. Leiserson Keith H. Randall

MIT Laboratory for 545 Technology Square Cambridge, MA 02139

Abstract relaxed models of shared-memory consistency have been developed [10, 12, 13] that compromise on semantics for We introduce dag consistency, a relaxed consistency model a faster implementation. By and large, all of these consis- for distributed shared memory which is suitable for multi- tency models have had one thingin common: they are ªpro- threaded programming. We have implemented dag consis- cessor centricº in the sense that they de®ne consistency in tency in software for the multithreaded runtime system terms of actions by physical processors. In this paper, we running on a Connection Machine CM5. Our implementa- introduce ªdagº consistency, a relaxed consistency model tion includes a dag-consistent distributed cactus stack for based on user-level threads which we have implemented for storage allocation. We provide empirical evidence of the Cilk[4], a C-based multithreadedlanguage and runtimesys- ¯exibility and ef®ciency of dag consistency for applications tem. that include blocked matrix multiplication, Strassen's ma- trix multiplication algorithm, and a Barnes-Hut code. Al- Dag consistency is de®ned on the dag of threads that make though Cilk schedules the executions of these programs dy- up a parallel computation. Intuitively, a read can ªseeº a namically, their performances are competitive with stati- write in the dag-consistency model only if there is some se- cally scheduled implementations in the literature. We also rial execution order consistentwiththedag in which theread

sees the write. Unlike sequential consistency, but similar to F

prove that the number P of page faults incurred by a user certain processor-centric models [12, 14], dag consistency program runningon P processors can be related to the num-

allows different reads to return values that are based on dif-

F F  P ber 1 of page faults running serially by the formula

ferent serial orders, but the values returned must respect the

F C s C s

1 , where is the cache size and is the number of migrations executed by Cilk's scheduler. dependencies in the dag. The current Cilk mechanisms to support dag-consistent 1 Introduction distributed shared memory on the Connection Machine CM5 are implemented in software. Nevertheless, codes Architects of shared memory for parallel computers have at- such as matrix multiplication run ef®ciently, as can be seen tempted to support Lamport's model of sequential consis- in Figure 1. The dag-consistent shared memory performs at tency [22]: The result of any execution is the same as if the 5 mega¯ops per processor as long as the work per proces- operations of all the processors were executed in some se- sor is suf®ciently large. This performance compares fairly quentialorder, andtheoperationsof each individualproces- well withothermatrix multiplicationcodeson theCM5(that sor appear in this sequence in the order speci®ed by its pro- do not use the CM5's vector units). For example, an imple- gram. Unfortunately, they have generally found that Lam- mentation coded in Split-C[9] attains just over 6 mega¯ops port's model is dif®cult to implement ef®ciently, and hence per processor on 64 processors using a static data layout, a This paper appears in the Proceedingsof the 10th International Paral- static thread schedule, and an optimized assembly-language lel Processing Symposium, Honolulu, Hawaii, April 1996. inner loop. In contrast, Cilk's dag-consistent shared mem- This researchwassupportedin partbytheAdvancedResearch Projects ory is mapped across the processors dynamically, and the Agency under Grants N00014-94-1-0985 and N00014-92-J-1310. Robert Cilk threads performing the computation are scheduled dy- Blumofe was supported in part by an ARPA High-Performance Computing Graduate Fellowship, and he is now Assistant Professor at The University namically at runtime. We believe that the overhead in our of Texas at Austin. Matteo Frigo was a visiting scholar from University of current implementation can be reduced, but that in any case, Padova, Italy and is now a graduate student at MIT. Charles Leiserson is this overhead is a reasonable price to pay for ease of pro- currently Shaw Visiting Professor at the National University of Singapore. Keith Randall was supported in part by a Department of Defense NDSEG gramming and dynamic load balancing. Fellowship. We have implemented irregular applications that employ 6 ABx = R

5 C D GH CG CH DI DJ 4 x = + E F IJ EG EH FI FJ

3

Figure 2: Recursive decomposition of matrix multiplication. The

 n 2 multiplication of n matrices requires eight multiplications of

Mflops / processor

 n n  n 4096x4096 optimized n matrices, followed by one addition of matrices. 1024x1024 optimized 1 1024x1024

and-conquer algorithm shown in Figure 2. To multiply one

 n 0 n matrix by another, we divide each matrix into four

4 8 16 32 64

 n n submatrices, recursively compute some prod- processors ucts of these submatrices, and then add the results together. Figure 1: Mega¯ops per processor versus the number of proces- This algorithm lends itself to a parallel implementation, be- sors for several matrix multiplication runs on the Connection Ma- cause each of the eight recursive multiplicationsis indepen- chine CM5. The shared-memory cache on each processoris set to dent and can be executed in parallel.

2MB.The lowercurveis for the matrixmul codein Figure 3and Figure 3 shows Cilk code1 for a ªblockedº implementa- the uppertwo curvesare for an optimized version thatusesno tem- tion of recursive matrix multiplicationin which the (square)

porary storage. input matrices A and B and the output matrix R are stored

 as a collection of submatrices, called blocks. All Cilk's dag-consistent shared memory, including a port of three matrices are abstractly stored in Cilk's shared mem- a Barnes-Hut N -body simulation [1] and an implementa- ory, but the CM5 implementation distributes their elements tion of Strassen's algorithm [33] for matrix multiplication. among the individual processor memories. The Cilk proce- These irregular applications provide a good test of Cilk's dure matrixmul takes as arguments pointers to the ®rst ability to schedule computations dynamically. We achieve block in each matrix as well as a variable nb denoting the a of 9 on an 8192-particle N -body simulation us- number of blocks in any row or column of the matrices. ing 32 processors, which is competitive with other soft- From the pointerto the ®rst block of a matrix and the value ware implementations of distributed shared memory [18]. of nb, the location of any other block in the matrix can be Strassen's algorithm runs as fast as regular matrix multipli-

computed quickly. As matrixmul executes, values are

 cation for matrices, and we coded it in Cilk in stored into R, as well as into a temporary matrix tmp. a few hours. matrixmul The remainder of this paper is organized as follows. Sec- The procedure operates as follows. Lines 3±4 check to see if the matrices to be multiplied consist of tion 2 gives an example of how matrix multiplication can a single block, in which case a call is made to a serial rou- be coded in Cilk using dag-consistent memory. Section 3 multiply block gives a formal de®nition of dag consistency and describes tine (not shown) to perform the mul- tiplication. Otherwise, line 8 allocates some page-aligned the abstract BACKER coherence algorithm for maintaining dag consistency. Section 4 describes an implementation of temporary storage in shared memory for the results, lines 9± 10 compute pointers to the 8 submatrices of A and B, and the BACKER algorithmfor Cilk on the Connection Machine lines 11±12 compute pointers to the 8 submatrices of R and CM5. Section 5 describes the distributed ªcactus stackº tmp memory allocator which the system uses to allocate shared- the temporary matrix . At this point, the divide step of memory objects. Section 6 analyzes the number of faults the divide-and-conquerparadigm is complete, and we begin taken bymultithreadedprograms, boththeoreticallyand em- on the conquer step. Lines 13-20 recursively compute the 8 required submatrix multiplications, storing the results in pirically. Section 7 investigates the running time of dag- the 8 disjoint submatrices of R and tmp. The recursion is consistent shared memory programs and presents a model made to execute in parallel by using the spawn directive, for their performance. Section 8 compares dag-consistency with some related consistency models and offers some ideas which is similar to a C function call except that the caller for the future. can continue to execute even if the callee has not yet re- turned. The sync statement in line 21 causes the procedure to suspend until all the procedures it spawned have ®nished. 2 Example: matrix multiplication To illustrate the concepts behind dag consistency, consider 1 Shown is Cilk-3 code, which provides explicit linguistic support for shared-memory operations and call/return semantics for coordinating the problem of parallel matrix multiplication. One way to threads. The original Cilk-1 system [4] used explicit continuation passing program matrix multiplicationis to use the recursive divide- to coordinate threads. For a history of the evolution of Cilk, see [17].

2 1 cilk void matrixmul(long nb, shared block *A, X Y Z shared block *B, .. . shared block *R) . .. 2 { 3 if (nb == 1) 4 multiply_block(A, B, R); M1M2M3M4M5M6M7M8 S 5 else { 6 shared block *C,*D,*E,*F,*G,*H,*I,*J; 7 shared block *CG,*CH,*EG,*EH, Figure4: Dag of blockedmatrix multiplication. Some edgeshave *DI,*DJ,*FI,*FJ; been omitted for clarity. 8 shared page_aligned block tmp[nb*nb];

/* get pointers to input submatrices */ the edges de®ne the partial execution order. The syncs in 9 partition(nb, A, &C, &D, &E, &F); lines 21 and 23 break the procedure matrixmul into three

10 partition(nb, B, &G, &H, &I, &J);

Y Z

threads X , , and , which correspond respectively to the

M M M

2 8 partitioningand spawningof subproblems 1 /* get pointers to result submatrices */

11 partition(nb, R, &CG, &CH, &EG, &EH); in lines 2±20, the spawning of the addition S in line 22, and 12 partition(nb, tmp, &DI, &DJ, &FI, &FJ); the return in line 25. The Cilk runtime system automatically schedules the ex- /* solve subproblems recursively */ 13 spawn matrixmul(nb/2, C, G, CG); ecution of the computation on the processors of a paral- 14 spawn matrixmul(nb/2, C, H, CH); lel computer using the provably ef®cient technique of work 15 spawn matrixmul(nb/2, E, H, EH); stealing [5], in which idle processors steal spawned proce- 16 spawn matrixmul(nb/2, E, G, EG); dures from victim processors chosen at random. When a 17 spawn matrixmul(nb/2, D, I, DI); procedure is stolen, we refer to it and all of its descendant 18 spawn matrixmul(nb/2, D, J, DJ); procedures in the spawn tree as a subcomputation. 19 spawn matrixmul(nb/2, F, J, FJ); Dag-consistent shared memory is a natural consisten- 20 spawn matrixmul(nb/2, F, I, FI); cy model to support a shared-memory program such as 21 sync; matrixmul. Certainly, sequential consistency can guar- /* add results together into R */ antee the correctness of the program, but a closer look at 22 spawn matrixadd(nb, tmp, R); the precedence relationgiven by the dag reveals that a much 23 sync;

24 } weaker consistency model suf®ces. Speci®cally, the 8 recur-

M M M

2 8 25 return; sively spawned children 1 need not have the 26 } same view of shared memory, because the portion of shared memory that each writes is neither read nor written by the others. On the other hand, the parallel addition of tmp into

Figure 3: Cilk code for recursive blocked matrix multiplication. S

R by the computation S requires to have a view in which

M M M

2 8 allof thewritestoshared memory by 1 have (The sync statement is not a global barrier.) Then, line 22 completed. spawns a parallel additionin which the matrix tmp is added The intuition behind dag consistency is that each thread R matrixadd into . (The procedure is itself implemented sees values that are consistent with some serial execution in a recursive, parallel, divide-and-conquer fashion, and the order of the dag, but two different threads may see differ- sync code is not shown.) The in line 23 ensures that the ent serial orders. Thus, the writes performed by a thread matrixmul addition completes before returns. are seen by its successors, but threads that are incompa- Like any Cilk multithreaded computation [4], the paral- rable in the dag may or may not see each other's writes. matrixmul lel instruction stream of can be viewed as a In matrixmul, the computation S sees the writes of

ªspawn treeº of procedures broken into a directed acyclic

M M M S

2 8 1 , because all the threads of are succes-

graph, or dag, of ªthreads.º The spawn tree is exactly anal-

M M M M

2 8 i sors of 1 , but since the are incompara- ogous to a traditional call tree. When a procedure, such as ble, they cannot depend on seeing each others writes. We matrixmul performs a spawn, the spawned procedure be- shall de®ne dag consistency precisely in Section 3. comes a child of the procedure that performed the spawn. Each procedure is broken by sync statements into non- 3 The BACKER coherence algorithm blocking sequences of instructions, called threads, and the threads of the computation are organized into a dag repre- This section describes our coherence algorithm, which we senting the partial execution order de®ned by the program. call BACKER, for maintaining dag consistency. We ®rst Figure4 illustratesthestructureof thedag for matrixmul. give a formal de®nition of dag-consistent shared memory Each vertex corresponds to a thread of the computation, and and explain how it relates to the intuition of dag consis-

3 tency that we have gained thus far. We then describe the must also lieona singlepath. Consequently,by De®nition1, cache and ªbacking storeº used by BACKER to store shared- every read of an object sees exactly one write to that object, memory objects, and we give three fundamental opera- and the execution is deterministic. tions for moving shared-memory objects between cache and We now describe the BACKER coherence algorithm for backing store. Finally, we give the BACKER algorithm and maintaining dag-consistent shared memory.2 In this algo-

describe how it ensures dag consistency. rithm, versions of shared-memory objects can reside simul-

V E

We ®rst introduce some terminology. Let G be taneouslyinany ofthe processors' local caches orthe global

2 V

the dag of a multithreaded computation. For i j , ifa backing store. Each processor's cache contains objects re-

j G

path of nonzero length from thread i to thread exists in , cently used by the threads that have executed on that pro-

j i  j

we say that i (strictly) precedes , which we write . We cessor, and the backing store provides default global storage

2 V i 6 j

say that two threads i j with are incomparable foreach object. ForourCilksystem ontheCM5, portionsof

6 j j 6 i if we have i and . each processor's main memory are reserved for the proces- Shared memory consists of a set of objects that threads sor'scache and for a portionof the distributedbacking store, can read and write. To track which thread is responsible for although on some systems, it might be reasonable to imple- an object's value, we imagine that each shared-memory ob- ment the backing store on disk. In order for a thread exe- ject has a tag which the write operation sets to the name of cuting on the processor to read or writean object, the object the thread performing the write. We assume without loss must be in the processor's cache. Each object in the cache of generality that each thread performs at most one read or has a dirtybit to record whether the object has been modi®ed write. since it was brought into the cache. We de®ne dag consistency as follows. (See also [17].) Three basic operations are used by the BACKER to ma- nipulate shared-memory objects: fetch, reconcile, and ¯ush.

De®nition 1 The shared memory M of a multithreaded

A fetch copies an object from the backing store to a proces-

V E computation G is dag-consistent if the following sor cache and marks the cached object as clean. A reconcile two conditions hold: copies a dirty object from a processor cache to the backing

store and marks the cached object as clean. Finally, a ¯ush

2 V m 2 M 1. Whenever any thread i reads any object ,

removes a clean object from a processor cache. Unlike im-

j 2 V it receives a value v tagged with some thread

plementations of other models of consistency, all three op-

v m i 6 j such that j writes to and .

erations are bilateral between a processor's cache and the

2 V i  j  k

2. Forany three threads i j k , satisfying , backing store, and other processors' caches are never in-

m 2 M k m

if j writes some object and reads , then the volved. i value received by k is not tagged with . The BACKER coherence algorithm operates as follows. When the user code performs a read or write operationonan For deterministic programs, this de®nition implies the intu- object, the operation is performed directlyon a cached copy itivenotionthataread can ªseeº a writeonlyifthereis some of the object. If the object is not in the cache, it is fetched serial execution order of the dag in which the read sees the from the backing store before the operation is performed. If write. Asitturnsout,however, thisintuitionisillde®ned for the operation is a write, the dirty bit of the object is set. To certain nondeterministic programs. For example, there ex- make space in the cache for a new object, a clean object can ist nondeterministic programs whose parallel execution can be removed by ¯ushing it fromthe cache. To remove a dirty contain reads that do not occur in any serial execution. De®- object, it is reconciled and then ¯ushed. nition1 impliestheintuitivesemantics fordeterministic pro- Besides performing these basic operations in response grams and is well de®ned for all programs. to user reads and writes, the BACKER performs additional Programs can easily be written that are guaranteed to be reconciles and ¯ushes to enforce dag consistency. For

deterministic. Nondeterminism arises when a write to an ob-

! j i each edge i in the computation dag, if threads and

ject occurs that is incomparable with another read or write

p q j are scheduled on different processors, say and , then

(of a different value) to the same object. For example, if a p BACKER reconciles all of p's cached objects after executes

read and a write to the same object are incomparable, then

p j q i but before enables , and it reconciles and ¯ushes 's en-

the read might or might not receive the value of the write. j tire cache before q executes . Similarly, if two writes are incomparable and a read exists The key reason BACKER works is that it is always safe, thatsucceeds thembothwithno other interveningwrites, the

at any point during the execution, for a processor p to rec- read might receive the value of either write. To avoid non- oncile an object or to ¯ush a clean object. Suppose we ar- determinism, it suf®ces that no write to an object occurs that bitrarily insert a reconcile of an object into the computation is incomparable with another read or write to the same ob- ject, inwhich case allwritestotheobjectmust lieonasingle 2 See [17] for details of a ªlazierº coherence algorithm than BACKER path in the dag. Moreover, all writes and any one given read based on climbing the spawn tree.

4

performed by p. Assuming that there is no other communi- fetches and reconciles do not adversely affect the BACKER p cation involving p, if later fetches the object that it previ- coherence algorithm, we implemented the familiar strategy ously reconciled, it will receive either the value that it wrote of grouping objects into pages [16, Section 8.2], each of

earlier or a value written by a thread i that is incomparable which is fetched or reconciled as a unit. Assuming that spa- 0 withthethread i performingtheread. Inthe®rst case, part 2 tial locality exists when objects are accessed, grouping ob- of the De®nition 1 is satis®ed by the semantics of ordinary jects helps amortize the runtime system overhead.

serial execution. In the second case, the thread i that per- Unfortunately, the CM5 operating system does not sup- 0 formed the write is incomparable with i , and thuspart 2 of port handling of page faults by user-level code, and so we the de®nition holds as well. (Part 1 of the de®nition holds were forced to implement shared memory in a relatively ex- trivially.) pensive fashion. Speci®cally, in our CM5 implementation, The BACKER algorithm uses this safety property to guar- shared memory is kept separate from other user memory,

antee dag consistency even when there is communication. and special operations are required to operate on it. Most p

Suppose that a thread i resides on processor with an edge painfully, testing for page faults occurs explicitly in soft- q

to a thread j on processor . In this case, BACKER causes ware, rather than implicitlyin hardware. Our cilk2c type- i

p to reconcile all its cached objects after executing but be- checking preprocessor [27] alleviates some of the discom- q

fore enabling j , and it causes to reconcile and ¯ush its en- fort,butatransparent solutionthatuses hardware supportfor q

tire cache before executing j . At this point, the state of 's paging would be preferable. A minor advantage to the soft-

j i cache (empty) is the same as p's if had executed with on ware approach we use, however, is that we can support full processor p, but a reconcile and ¯ush had occurred between 64-bit addressing of shared memory on the 32-bit SPARC them. Consequently, BACKER ensures dag consistency. processors of the CM5 system. Withall thereconciles and ¯ushes being performed by the Cilk's language support makes it easy to express oper- BACKER algorithm, why should we expect it to be an ef- ations on shared memory. The user can declare shared ®cient coherence algorithm? The main reason is that once pointersand can operate on these pointerswith normal C op- a processor has fetched an object into its cache, the object erations, such as pointer arithmetic and dereferencing. The never needs to be updated with external values or invali- type-checking preprocessor automatically generates code dated, unless communication involving that processor oc- to perform these operations. The user can also declare curs to enforce a dependency in the dag. Consequently, the shared arrays which are allocated and deallocated auto- processor can run with the speed of a serial algorithm with matically by the system. As an optimization, we also pro- no overheads. Moreover, in Cilk,communication to enforce vide register shared pointers, which are a version of dependencies does not occur often [4]. shared pointers that are optimizedfor multipleaccesses to It is worth mentioning that BACKER actually supports the same page. In our CM5 system, a register shared stronger semantics than De®nition 1 requires. In fact, Def- pointer dereference takes 4 instructions when it performs inition 1 allows certain semantic anomalies, but BACKER multiple accesses to within a single page, as compared to 1 handles these situations in the intuitively correct way. We instruction for an ordinary C-pointer dereference. Finally, are currently attempting to characterize the semantics of Cilk provides a loophole mechanism to convert shared BACKER fully. pointers to C pointers, allowing direct, fast operations on pages. This loophole mechanism puts the onus on the user, 4 Implementation however, for ensuring that the pointer stays within a single This section describes our implementation of dag-consistent page. In the near future, we hope to port Cilk to an archi- shared memory for the Cilk runtime system running on tecture and operating system that allow user-level handling the Connection Machine Model CM5 parallel supercom- of page faults. On such a platform, no difference will exist puter [24]. We also describe the Cilk language extensions between shared objects and their C equivalents, and de- for supporting shared-memory objects and the ªdiffº mech- tecting page faults will incur no software overhead. anism [20] for managing dirty bits. Finally, we discuss mi- An important issue we faced with the implementation of nor anomalies in atomicity that can occur when the size of dag-consistent shared memory on the CM5 was how to keep the concrete objects supported by the shared-memory sys- track of which objects on a page have been written. The tem is different from the abstract objects that the program- CM5 provides no direct hardware support to maintain dirty mer manipulates. bits explicitlyat the granularityof words. Rather than using The Cilk system on the CM5 supports concrete shared- dirty bits explicitly,Cilk uses a diff mechanism as is used in memory objects of 32-bitwords. All consistency operations the Treadmarks system [20]. The diff mechanism computes are logically performed on a per-word basis. If the runtime the dirty bit for an object by comparing that object's value system had to operate on every word independently, how- with its value in a copy made at fetch time. Our implemen- ever, the system would be terribly inef®cient. Since extra tation makes this copy only for pages loaded in read/write

5 mode, thereby avoiding the overhead of copying for read- only pages. The diff mechanism imposes extra overhead on A AAA each reconcile, but it imposes no extra overhead on each ac- P1 cess [35]. Dag consistency can suffer from atomicity anomalies C CC when abstract objects that the programmer is reading and P2 writingare larger than the concrete objects supported by the B B shared-memory system. For example, suppose the program- DE DE mer is treating two4-byte concrete objects as one 8-byteab- stract object. If two incomparable threads each write the en- S S S S S S tire 8-byte object, the programmer might expect an 8-byte 1 2 3 1 2 3

read of the structure by a common successor to receive one P

of the two 8-byte values written. The 8-byte read may non- Figure 5: A cactus stack. Procedure 1 is stolen from subcom-

S S P

2 2

deterministically receive 4 bytes of one value and 4 bytes putation 1 to start subcomputation , and then procedure is

S S 3 of the other value, however, since the 8-byte read is really stolen from 2 to start subcomputation . Each subcomputation sees its own stack allocations and the stack allocated by its ances- two 4-byte reads, and the consistency of the two halves is tors. The stack grows downwards. The left side of the picture maintained separately. Fortunately, this problem can only shows how the stack grows like a tree, resembling a cactus. The occur if the abstract program is nondeterministic, that is, if right side shows the stack as seen by the three subcomputations.

the program is nondeterministic even when the abstract and In this example, the stack segment A is shared by all subcomputa-

C S S 3

concrete objects are the same size. When writing determin- tions, stack segment is shared by subcomputations 2 and ,

D E istic programs, the programmer need not worry about this and the other segments, B , , and , are private. atomicity problem. As with other consistency models, including sequential branch of the stack is started so that subsequent allocations

consistency, atomicity anomalies can also occur when the S

performed by 2 do not interfere with the stack being used

S S

programmer packs several abstract objects into a single sys- S

1 2 by 1 . The stacks as seen by and are independent

tem object. Fortunately, this problem can easily be avoided below the steal point, but they are identical above the steal S

in the standard way by not packing together abstract objects P 2 point. Similarly, when procedure 2 is stolen from to

that might be updated in parallel. S start subcomputation 3 , the cactus stack branches again. Cactus-stack allocationmirrors the advantages of an ordi- 5 Memory allocation nary procedure stack. Any object on the stack that is view- Some means of allocating memory must be provided in able by a procedure has a simple address: its offset from any useful implementation of shared memory. We consid- the base of the stack. Procedure local variables and arrays ered implementing general heap storage in the style of C's can be allocated and deallocated automatically by the run- malloc and free, but most of our immediate applica- time system in a natural fashion, as can be seen in the ma- tions only require stack-like allocation for temporary vari- trix multiplication example from Figure 3. Allocation can ables and the like. Since Cilk procedures operate in a par- be performed completely locally withoutcommunication by allel tree-like fashion, however, we needed some kind of simply incrementing a local pointer, although communica- parallel stack. We settled on implementing a cactus-stack tion may be required when an out-of-cache stack page is ac- [15, 28, 32] allocator. tually referenced. Separate branches of the cactus stack are Fromthepointofviewofa singleCilkprocedure, a cactus insulated from each other, allowing two subcomputations to stack behaves much like an ordinary stack. The procedure allocate and free objects independently,even though objects can allocate and free memory by incrementing and decre- may be allocated with the same address. Procedures can menting a stack pointer. The procedure views the stack as a reference common data through the shared portion of their linearly addressed space extending back from its own stack stack address space. frame to the frame of its parent and continuing to more dis- Cactus stacks have many of the same limitations as ordi- tant ancestors. nary procedure stacks [28]. For instance, a child thread can- The stack becomes a cactus stack when multiple proce- not return to its parent a pointer to an object that it has al- dures execute inparallel, each withits ownview ofthe stack located. Similarly, sibling procedures cannot share storage

that corresponds to its call history, as shown in Figure 5. In that they create on the stack. Just as with a procedure stack,

S A

the ®gure, subcomputation 1 allocates some memory be- pointers to objects allocated on the cactus-stack can only be

P S 1

fore procedure 1 is spawned. Subcomputation then con- safely passed to procedures below the allocation point in the

B P

tinues to allocate more memory . When procedure 1 is call tree. Heap storage offers a way of alleviating some of S stolen and becomes the root of subcomputation 2 , a new these limitations (and we intend to provide a heap alloca-

6

tor in a futureversion of Cilk),but the cactus stack provides To relate the number of page faults during execution E to 0 simpleand ef®cient supportfor allocationof procedure local the number during execution E , we examine cache behav- variables and arrays. ior under LRU replacement. Consider two processors that The size of the backing store determines how large a execute simultaneously and in lock step a block of code us- shared-memory application one can run. On the CM5, the ing two different starting cache states, where each proces-

backing store is implemented in a distributedfashion by al- sor's cache has C pages. The main property of LRU that we locating a large fraction of each processor's memory to this exploit is that the number of page faults in the two execu-

function. To determine which processor holds the backing tions can differ by at most C page faults. This property fol- store for a page, a hash function is applied to the page iden- lows from the observation that no matter what the starting ti®er (a pair of the cactus-stack address and the allocating cache states might be, the states of the two caches must be

subcomputation). A fetch or reconcile request for a page identical after one of the two executions takes C page faults.

is made to the backing store of the processor to which the Indeed, at the point when one execution has just taken its C page hashes. Thispolicyensures that backing storeisspread C th page fault, each cache contains exactly the last dis- evenly across the processors' memory. In other systems, it tinct pages referenced [19].

might be reasonable to place the backing store on disk Áala We can now count the number of page faults during the E

traditional virtual memory. execution E . Thefault behaviorof isthesame as thefault

0 T

behavior of E except for the subcomputation and thesub- T 6 An analysis of page faults computation, call it U , from which it stole. Since is exe-

cuted in depth-®rst fashion, the only difference between the F

In this section, we examine the number P of page faults

two executions is that the starting cache state of T and the

that a Cilk computationincurswhen run on P processors us- T starting cache state of U after are different. Therefore, ex-

ing Cilk's randomized work-stealing scheduler [4] and the

C ecution E makes at most morepagefaults thanexecution ACKER

implementation of the B coherence algorithm de- 0

E E F C s C

, and thusexecution has at most 1 F

scribed in Section 4. We prove that P can be related to

F C s

1 page faults.

F

the number 1 of page faults taken by a -processor execu-

F  F C s C 1 tion by the formula P , where is the size

Theorem 1 says that the total number of faults on P pro- of each processor's cache in pages and s is the total number

cessors is at most the total number of faults on 1 proces- C s of steals executed by Cilk's scheduler. The term repre- sor plus an overhead term. The overhead arises whenever sents faults due to ªwarming upº the processors' caches, and a steal occurs, because in the worst case, the caches of both we present empirical evidence that this overhead is actually the thieving processor and the victim processor contain no much smaller in practice than the theoretical bound. pages in common compared to the situation when the steal We begin with a theorem that bounds the number of page did not occur. Thus, they must be ªwarmed upº until the faults of a Cilk application. The proof takes advantage of caches ªsynchronizeº with the cache of a serial execution. properties of the least-recently used (LRU) page replace- ment scheme used by Cilk, as well as the fact that Cilk's To measure the warm-up overhead, we counted the num- scheduler,like C, executes serial code in a depth-®rst fash- ber of page faults taken by several applicationsÐincluding ion. matrixmul, an optimized matrix multiplication routine, and a parallel version of Strassen's algorithm[33]Ðfor var-

ious choices of cache, processor, and problem size. For F

Theorem 1 Let P be the number of page faults of a Cilk

F

each run we measured the cache warm-up fraction P C

computation when run on P processors with a cache of

F C s

1 , which represents the fraction of the cache that

F  F C s 1 pages on each processor. Then, we have P , needs to be warmed up on each steal. We know from The- where s is the totalnumber of steals that occur during Cilk's execution of the computation. orem 1 that the cache warm-up fraction is at most . Our

experiments indicate that the cache warm-up fraction is, in

Proof: The proofisby inductiononthe number s of steals. fact, typically less than , as can be seen from the his- For the base case, observe that if no steals occur, then the togram in Figure 6 showing the cache warm-up fraction for

application runs entirely on one processor, and thus it faults 153 experimental runs of the above applications, with pro- F cessor counts ranging from 2 to 64 and cache sizes from

1 times by de®nition. For the inductive case, consider an

C s s execution E of the computation that has steals. Choose 256KB to 2MB. Thus, we see less than of the extra

any subcomputation T from which no processor steals dur- faults.

0 E

ing the execution E . Construct a new execution of the To understand why cache warm-up costs are so low, we T

computation which is identical to E , except that is never performed an experiment that recorded the size of each sub-

0

s

stolen. Since E has only steals, we know it has at most problem stolen. We observed that most of the subproblems

F C s

1 page faults by the inductive hypothesis. stolen during an execution were small. In fact, only 5±10

7 70

60 1

50 0.1 40

30 0.01 20 normalized speedup Linear speedup number of experiments Parallelism bound 10 0.001 Curve fit

0 <0.5 0.5-1 1-1.5 1.5-2 2-2.5 2.5-3 >3 0.001 0.01 0.1 1 10

cache warm-up fraction (%) normalized machine size

F

Figure 6: Histogram of the cache warm-up fraction P Figure 7: Normalized speedup curve for matrix multiplication.

F C s

1 for a variety of applications, cache sizes, processor The horizontal axis is normalized machine size and the vertical



counts, and problem sizes. The vertical axis shows the number of axis is normalized speedup. Experiments consisted of ,

  experiments with a cache warm-up fraction in the shown range. , and problemsizes on to processors, for matrix multiplication algorithms with various critical paths.

of the stolen subproblems were ªlarge,º where a large sub- C

problem is de®ned to be one that takes or more pages to recording the number P of processors and the actual run- execute. The other 90±95 of the subproblems are small T

time P . and are stolen when littlework is left to do and many of the Figure 7 shows a normalized speedup curve [4] for the processors are idle. Therefore, most of the stolen subprob- synthetic benchmarks. This curve is obtained by plotting

lems never perform C page faults before terminating. The

T T P P

speedup 1 versus machine size , but normalizing

F  F C s 1 bound P derived in Theorem 1 thus appears each of these values by dividing them by the average par-

to be rather loose, and our experiments indicate that much

T T 1 allelism 1 . We use a normalized speedup curve, be- better performance can be expected. cause it allowsus toplotruns ofdifferent benchmarks on the

same graph. Also plottedin the ®gure are the perfect linear-



T P

7 Performance T 1

speedup curve P (the line) and the limit on

T  T 1 performance given by the parallelism bound P (the In this section, we model the performance of Cilk on syn- horizontal line).

thetic benchmark applications similar to matrixmul. We T

quantify performance in terms of ªworkº and ªcritical-path The quantity 1 is not necessarily a tight lower bound

T

P T on , because it ignores page faults. Indeed, the structure

length.º The work 1 of a computation is the running time,

 n lg n including page faults, of the computation on one processor, of matrixmul on n matrices causes faults

when the backing store is running on other processors. The to be taken along any path through the dag. A better mea-

T C

sure, which we shall denote 1 , is the maximum, over T

critical-path length 1 is the (theoretical) running time on

an in®nite number of processors assuming that page faults all pathsinthedag, ofthetime(includingpagefaults)toexe-

T T cute all threads alongthe pathon one processor witha cache 1

take zero time. Their ratio 1 is the average paral-

C T  T C 1 lelism of the computation. We found that the running time size of pages. Althoughthebound P is tighter

(and makes our numbers look better), it appears dif®cult to

T P

P of the benchmarks on processors can be estimated as

compute. We can estimate using analytical techniques, how-

T  T P T

1 1

P . Speedup was always at least

T C

a third of perfect linear speedup for benchmarks with large ever, that for our matrix multiplication algorithms, 1

T T 1 average parallelism and running time was always within a is about twice as large as 1 . Hadweusedthisvaluefor factor of 10 of optimal for those without much parallelism. in the normalized speedup curve in Figure7, each data point To analyze Cilk's implementation of the BACKER co- wouldshift upand right bythisfactor of2, givingsomewhat herence algorithm, we measured the work and critical-path tighter results. length for synthetic benchmarks obtained by adding sync The normalized speedup curve in Figure 7 shows that

statements to the matrix multiplication program shown in dag-consistent shared-memory applications can obtain good

T

Figure 3. By judiciously placing sync statements in the speedups. The data was ®t to a curve of the form P

c T P c T c

1 1 1 1

code, we were able to obtain synthetic benchmarks that ex- 1 . We obtained a ®t with and

2

c R

hibited a wide range of average parallelism. We ran the 1 , with an correlation coef®cient of and

benchmarks on various numbers of processors, each time a mean relative error of . Thus, the shared memory

8 imposes about a performance penalty on the work of ing operating system hooks to make the use of shared mem-

an algorithm, and a factor of performance penalty on the ory be transparent to the user. We expect that the well-

critical path. The factor of onthe critical path term is quite structured nature of Cilk computations will allow the run- good considering all of the scheduling, protocol, and com- time system to maintain dag consistency ef®ciently, even in munication that could potentially contribute to this term. the presence of processor faults. There are two possible explanations for the additional

on the work term. The extra work could represent con- Acknowledgments gestionat the backing store, which causes page faultsto cost more than in the one-processor run. Alternatively, it could We gratefully acknowledge the work of Bradley Kuszmaul

T of Yale, Yuli Zhou of AT&T Bell Laboratories, and Rob be because our 1 measure is too conservative. To compute

T Miller of Carnegie Mellon, all formerly of MIT, for their ef-

1 , we runthebackingstoreonprocessorsotherthantheone

running the benchmark, while when we run on P proces- forts as part of the Cilk team. Rob implemented the type-

sors, we use the same P processors to implement the back- checking preprocessor for Cilk and led and implemented the ing store. We have not yet run experiments to see which of designof thelinguisticsfor explicitshared memory. Yuliun- these two explanations is correct. dertook the ®rst port of the N -body code to Cilk. Thanks to BurtonSmith of Tera Computer Corporationfor acquainting 8 Conclusion us with related work. Thanks to Mingdong Feng of the Na- tional University of Singapore for porting Cilk to the IBM Many other researchers have investigated distributed shared SP-2 and providingfeedback on our research. Thanks to the memory. To conclude, we brie¯y outline work in this area National University of Singapore for resources used to pre- and offer some ideas for the future. pare the ®nal versionof thispaper. Thanks toArvindand his The notion that independent tasks may have incoherent data¯ow group at MIT for helpful discussions and inspira- views of each others' memory is not new to Cilk. The tion. BLAZE [26] language incorporated a memory semantics similar to that of dag consistency into a PASCAL-like lan- References

guage. The Myrias [2] computer was designed to sup-

N log N N port a relaxed memory semantics similar to dag consistency, [1] Joshua E. Barnes. A hierarchical O -body code. Available on the Internet from ftp://hubble. with many of the mechanisms implemented in hardware. ifa.hawaii.edu/pub/barnes/treecode/. Loosely-Coherent Memory [23] allows for a range of con- [2] Monica Beltrametti, Kenneth Bobey, and John R. Zorbas. sistency protocols and uses compiler support to direct their The controlmechanismfor the Myrias parallel computersys- use. Compared with these systems, Cilk provides a mul- tem. Computer Architecture News, 16(4):21±30, September tithreaded programming model based on directed acyclic 1988. [3] Robert D. Blumofe. Executing Multithreaded Programs Ef- graphs, which leads to a more ¯exible linguistic expression ®ciently. PhD thesis, Department of Electrical Engineering of operations on shared memory. and Computer Science, Massachusetts Institute of Techno- Cilk's implementation of dag consistency borrows heav- logy, September 1995. ily on the experiences from previous implementations of [4] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kusz- maul, Charles E.Leiserson,Keith H.Randall, and YuliZhou. distributed shared memory. Like Ivy [25] and others [6, Cilk: An ef®cient multithreaded runtime system. In Pro- 11, 20], Cilk's implementation uses ®xed-sized pages to cut ceedings of the Fifth ACM SIGPLAN Symposium on Princi- down on the overhead of managing shared objects. In con- ples and Practice of Parallel Programming (PPoPP), pages 207±216, Santa Barbara, California, July 1995. trast, systems that use cache lines [7, 21, 29] require some [5] Robert D. Blumofe and Charles E. Leiserson. Schedul- degree of hardware support [31] to manage shared mem- ing multithreaded computations by work stealing. In Pro- ory ef®ciently. As another alternative, systems that use ceedings of the 35th Annual Symposium on Foundations of arbitrary-sized objects or regions [8, 18, 30, 34] require ei- Computer Science, pages 356±368, Santa Fe, New Mexico, November 1994. ther an object-oriented programming model or explicit user [6] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Im- management of objects. plementation and performance of Munin. In Proceedings of The idea of dag-consistent shared memory can be ex- the Thirteenth ACM Symposiumon Operating Systems Prin- tended to the domain of ®le I/O to allow multiplethreads to ciples, pages 152±164, Paci®c Grove, California, October 1991. read and write the same ®le in parallel. We anticipate that [7] David Chaiken and Anant Agarwal. Software-extended co- it shouldbe possible to memory-map ®les and use our exist- herent shared memory: Performance and cost. In Proceed- ing dag-consistency mechanisms to provide a parallel, asyn- ings of the 21st Annual International Symposium on Com- chronous, I/O capability for Cilk. puter Architecture, pages 314±324, Chicago, Illinois, April 1994. We are also currently working on porting dag-consistent [8] Jeffrey S. Chase, Franz G. Amador, Edward D. Lazowska, shared memory to our Cilk-NOW [3] adaptively parallel, Henry M. Levy, and Richard J. Little®eld. The Amber sys- fault-tolerant, network-of-workstations system. We are us- tem: Parallel programming on a network of multiprocessors.

9 In Proceedings of the Twelfth ACM Symposium on Operat- and Operating Systems, pages 208±218, San Jose, Califor- ing Systems Principles, pages 147±158,Litch®eld Park, Ari- nia, October 1994. zona, December 1989. [24] Charles E. Leiserson, Zahi S. Abuhamdeh, David C. Doug- [9] Daved E. Culler, Andrea Dusseau, Seth Copen Goldst- las, Carl R. Feynman, Mahesh N. Ganmukhi, Jeffrey V. Hill, ein, Arvind Krishnamurthy, Steven Lumetta, Thorsten von W. Daniel Hillis, Bradley C. Kuszmaul, Margaret A. St. Eicken, and Katherine Yelick. Parallel programming in Pierre, David S. Wells, Monica C. Wong, Shaw-Wen Yang, Split-C. In Supercomputing '93, pages 262±273, Portland, and Robert Zak. The network architecture of the Connection Oregon, November 1993. Machine CM-5. In Proceedingsof the Fourth Annual ACM [10] Michel Dubois, Christoph Scheurich, and Faye Briggs. Symposium on Parallel Algorithms and Architectures,pages Memory access buffering in multiprocessors. In Proceed- 272±285, San Diego, California, June 1992. ings of the 13th Annual International Symposium on Com- [25] Kai Li and PaulHudak. Memory coherencein sharedvirtual puter Architecture, pages 434±442, June 1986. memory systems. ACM Transactions on Computer Systems, [11] Vincent W. Freeh, David K. Lowenthal, and Gregory R. An- 7(4):321±359, November 1989. drews. Distributed Filaments: Ef®cient ®ne-grain paral- [26] Piyush Mehrotra and Jon Van Rosendale. The BLAZE lan- lelism on a cluster of workstations. In Proceedings of the guage: A parallel language for scienti®c programming. Par- First Symposium on Operating Systems Design and Imple- allel Computing, 5:339±361, 1987. mentation, pages201±213, Monterey, California, November [27] Robert C. Miller. A type-checking preprocessor for Cilk 2, 1994. a multithreaded C language. Master's thesis, Department [12] Guang R. Gao and Vivek Sarkar. Location consistency: of Electrical Engineering and Computer Science, Massachu- Stepping beyondthe barriers of memory coherenceand seri- setts Institute of Technology, May 1995. alizability. Technical Report 78, McGill University, School [28] JoelMoses. Thefunction of FUNCTIONin LISP or why the of Computer Science, Advanced Compilers, Architectures, FUNARG problem should be called the envronment prob- and Parallel Systems (ACAPS) Laboratory, December1993. lem. Technical Report memo AI-199, MIT Arti®cial Intel- [13] KouroshGharachorloo, Daniel Lenoski, James Laudon,Phil ligence Laboratory, June 1970. lip Gibbons, Anoop Gupta, and John Hennessy. Memory [29] Steven K. Reinhardt, James R. Larus, and David A. Wood. consistency and event ordering in scalable shared-memory Tempest and Typhoon: User-level shared memory. In Pro- multiprocessors. In Proceedingsof the 17th Annual Interna- ceedings of the 21st Annual International Symposium on tional Symposium on Computer Architecture, pages 15±26, Computer Architecture, pages 325±336, Chicago, Illinois, Seattle, Washington, June 1990. April 1994. [14] James R. Goodman. Cache consistency and sequential con- [30] Daniel J. Scales and Monica S. Lam. The design and evalu- sistency. Technical Report 61, IEEE Scalable Coherent In- ation of a shared object system for ma- terface (SCI) Working Group, March 1989. chines. In Proceedingsof the First Symposiumon Operating [15] E.A.HauckandB.A. Dent. Burroughs'B6500/B7500stack Systems Design and Implementation, pages 101±114, Mon- mechanism. Proceedings of the AFIPS Spring Joint Com- terey, California, November 1994. puter Conference, pages 245±251, 1968. [31] Ioannis Schoinas, Babak Falsa®, Alvin R. Lebeck, Stev [16] John L. Hennessyand David A. Patterson. Computer Archi- en K. Reinhardt, James R. Larus, and David A. Wood. Fine- tecture: a Quantitative Approach. Morgan Kaufmann, San grain access control for distributed shared memory. In Pro- Mateo, CA, 1990. ceedings of the Sixth International Conference on Architec- [17] Christopher F. Joerg. The Cilk System for Parallel Multi- tural Support for Programming Languages and Operating threaded Computing. PhD thesis, Department of Electrical Systems,pages297±306,SanJose,California, October1994. Engineering and Computer Science, Massachusetts Institute [32] Per StenstrÈom. VLSI support for a cactus stack oriented of Technology, January 1996. memory organization. Proceedings of the Twenty-First An- [18] Kirk L. Johnson, M. Frans Kaashoek, and Deborah A. nual Hawaii International Conference on System Sciences, Wallach. CRL: High-performance all-software distributed volume 1, pages 211±220, January 1988. shared memory. In Proceedingsof the Fifteenth ACM Sym- [33] Volker Strassen. Gaussian elimination is not optimal. Nu- posium on Operating Systems Principles, pages 213±228, merische Mathematik, 14(3):354±356, 1969. Copper Mountain Resort, Colorado, December 1995. [34] Andrew S. Tanenbaum, Henri E. Bal, and M. Frans Kaash- [19] Edward G. Coffman Jr. and Peter J. Denning. Operating oek. Programming a distributed system using shared ob- Systems Theory. Prentice-Hall, Inc., Englewood Cliffs, NJ, jects. In Proceedingsof the SecondInternationalSymposium 1973. on High Performance , pages 5±12, [20] Pete Keleher, Alan L. Cox, Sandhya Dwarkadas, and Willy Spokane, Washington, July 1993. Zwaenepoel. TreadMarks: Distributed shared memory on [35] Matthew J.Zekauskas,WayneA.Sawdon,andBrian N.Ber- standard workstations and operating systems. In USENIX shad. Software write detection for a distributed shared mem- Winter 1994 Conference Proceedings, pages 115±132, San ory. In Proceedings of the First Symposium on Operating Francisco, California, January 1994. Systems Design and Implementation, pages 87±100, Mon- [21] Jeffrey Kuskin, David Ofelt, Mark Heinrich, John Hein- terey, California, November 1994. lein, Richard Simoni, Kourosh Gharachorloo, John Chapin, David Nakahira, Joel Baxter, Mark Horowitz, Anoop Gupta, Mendel Rosenblum,and John Hennessy. The Stanford Flash multiprocessor. In Proceedings of the 21st Annual Inter- national Symposium on Computer Architecture, pages 302± 313, Chicago, Illinois, April 1994. [22] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans- actions on Computers, C-28(9):690±691, September 1979. [23] James R. Larus, Brad Richards, and Guhan Viswanathan. LCM: Memory system support for parallel language imple- mentation. In Proceedingsof the Sixth International Confer- ence on Architectural Support for Programming Languages

10