Dag-Consistent Distributed Shared Memory
Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, Keith H. Randall
MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139
Abstract

We introduce dag consistency, a relaxed consistency model for distributed shared memory which is suitable for multithreaded programming. We have implemented dag consistency in software for the Cilk multithreaded runtime system running on a Connection Machine CM5. Our implementation includes a dag-consistent distributed cactus stack for storage allocation. We provide empirical evidence of the flexibility and efficiency of dag consistency for applications that include blocked matrix multiplication, Strassen's matrix multiplication algorithm, and a Barnes-Hut code. Although Cilk schedules the executions of these programs dynamically, their performances are competitive with statically scheduled implementations in the literature.
We also prove that the number F_P of page faults incurred by a user program running on P processors can be related to the number F_1 of page faults running serially by the formula F_P ≤ F_1 + 2Cs, where C is the cache size and s is the number of thread migrations executed by Cilk's scheduler.
This paper appears in the Proceedings of the 10th International Parallel Processing Symposium, Honolulu, Hawaii, April 1996. This research was supported in part by the Advanced Research Projects Agency under Grants N00014-94-1-0985 and N00014-92-J-1310. Robert Blumofe was supported in part by an ARPA High-Performance Computing Graduate Fellowship, and he is now Assistant Professor at The University of Texas at Austin. Matteo Frigo was a visiting scholar from the University of Padova, Italy and is now a graduate student at MIT. Charles Leiserson is currently Shaw Visiting Professor at the National University of Singapore. Keith Randall was supported in part by a Department of Defense NDSEG Fellowship.

1 Introduction

Architects of shared memory for parallel computers have attempted to support Lamport's model of sequential consistency [22]: The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. Unfortunately, they have generally found that Lamport's model is difficult to implement efficiently, and hence relaxed models of shared-memory consistency have been developed [10, 12, 13] that compromise on semantics for a faster implementation. By and large, all of these consistency models have had one thing in common: they are "processor centric" in the sense that they define consistency in terms of actions by physical processors. In this paper, we introduce "dag" consistency, a relaxed consistency model based on user-level threads which we have implemented for Cilk [4], a C-based multithreaded language and runtime system.

Dag consistency is defined on the dag of threads that make up a parallel computation. Intuitively, a read can "see" a write in the dag-consistency model only if there is some serial execution order consistent with the dag in which the read sees the write. Unlike sequential consistency, but similar to certain processor-centric models [12, 14], dag consistency allows different reads to return values that are based on different serial orders, but the values returned must respect the dependencies in the dag.
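To make this concrete, consider the following sketch, which is our illustration in Cilk-style spawn/sync notation and not code from the paper. Two spawned threads write the same location of dag-consistent shared memory:

    int x = 0;   /* a location in dag-consistent shared memory */

    cilk void writer(int v)
    {
        x = v;   /* the two writer() instances are incomparable in the dag */
    }

    cilk int parent(void)
    {
        spawn writer(1);
        spawn writer(2);
        /* A thread running here, in parallel with both writers, could
         * read 0, 1, or 2: each value arises from some serial order
         * that is consistent with the dag. */
        sync;
        /* This read follows both writes in the dag, so every serial
         * order consistent with the dag has it seeing one of them:
         * it returns 1 or 2, never the stale 0. */
        return x;
    }

Sequential consistency would force all threads to agree on a single interleaving of these operations; dag consistency drops that global agreement but still requires every read to respect the writes that precede it in the dag.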
The current Cilk mechanisms to support dag-consistent distributed shared memory on the Connection Machine CM5 are implemented in software. Nevertheless, codes such as matrix multiplication run efficiently, as can be seen in Figure 1. The dag-consistent shared memory performs at 5 megaflops per processor as long as the work per processor is sufficiently large. This performance compares fairly well with other matrix multiplication codes on the CM5 (that do not use the CM5's vector units). For example, an implementation coded in Split-C [9] attains just over 6 megaflops per processor on 64 processors using a static data layout, a static thread schedule, and an optimized assembly-language inner loop. In contrast, Cilk's dag-consistent shared memory is mapped across the processors dynamically, and the Cilk threads performing the computation are scheduled dynamically at runtime. We believe that the overhead in our current implementation can be reduced, but that in any case, this overhead is a reasonable price to pay for ease of programming and dynamic load balancing.

[Figure 1: Megaflops per processor of matrix multiplication using dag-consistent shared memory on the CM5.]

We have implemented irregular applications that employ dag-consistent shared memory.
    [ C D ]   [ G H ]   [ CG CH ]   [ DI DJ ]
    [ E F ] × [ I J ] = [ EG EH ] + [ FI FJ ]

Figure 2: Recursive decomposition of matrix multiplication A × B = R. The multiplication of n × n matrices requires eight multiplications of n/2 × n/2 matrices, followed by one addition of n × n matrices.
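The decomposition in Figure 2 maps directly onto a divide-and-conquer procedure. The following serial C sketch is our illustration of that shape, not the paper's Cilk code; the function name matmul, the leading-dimension parameters, and the recursion down to 1 × 1 blocks are our own choices. A real implementation would stop at a larger base-case block, and a Cilk version would spawn the eight subproducts in parallel.

    #include <stdlib.h>

    /* R <- A * B for n x n row-major matrices (n a power of two) with
     * leading dimensions lda, ldb, ldr.  Following Figure 2, the eight
     * quadrant subproducts are computed into R and a temporary T, which
     * are then combined by one n x n addition. */
    static void matmul(int n, const double *A, int lda,
                       const double *B, int ldb, double *R, int ldr)
    {
        if (n == 1) {            /* illustrative base case */
            R[0] = A[0] * B[0];
            return;
        }
        int h = n / 2;
        double *T = malloc((size_t)n * n * sizeof *T);

        /* Quadrants, labeled as in Figure 2: A = [C D; E F], B = [G H; I J]. */
        const double *C = A,           *D = A + h,
                     *E = A + h * lda, *F = A + h * lda + h;
        const double *G = B,           *H = B + h,
                     *I = B + h * ldb, *J = B + h * ldb + h;

        /* First four products: R = [CG CH; EG EH]. */
        matmul(h, C, lda, G, ldb, R,               ldr);
        matmul(h, C, lda, H, ldb, R + h,           ldr);
        matmul(h, E, lda, G, ldb, R + h * ldr,     ldr);
        matmul(h, E, lda, H, ldb, R + h * ldr + h, ldr);

        /* Second four products: T = [DI DJ; FI FJ]. */
        matmul(h, D, lda, I, ldb, T,             n);
        matmul(h, D, lda, J, ldb, T + h,         n);
        matmul(h, F, lda, I, ldb, T + h * n,     n);
        matmul(h, F, lda, J, ldb, T + h * n + h, n);

        /* One n x n addition: R = R + T. */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                R[i * ldr + j] += T[i * n + j];

        free(T);
    }

Strassen's algorithm, mentioned in the abstract, alters only this combining scheme, computing seven subproducts instead of eight at the cost of additional additions.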