Design of the Munin Distributed Shared Memory System

John B. Carter

Department of Computer Science
University of Utah
Salt Lake City, UT

This research was supported in part by the National Science Foundation, by the IBM Corporation, by the Texas Advanced Technology Program, and by a NASA Graduate Fellowship.


Proposed running head: Design of the Munin Distributed Shared Memory System

Contact author: John B. Carter
Department of Computer Science
University of Utah
Salt Lake City, UT

Abstract

Software distributed shared memory (DSM) is a software abstraction of shared memory on a distributed memory machine. The key problem in building an efficient DSM system is to reduce the amount of communication needed to keep the distributed memories consistent. The Munin DSM system incorporates a number of novel techniques for doing so, including the use of multiple consistency protocols and support for multiple concurrent writer protocols. Due to these and other features, Munin is able to achieve high performance on a variety of numerical applications. This paper contains a detailed description of the design and implementation of the Munin prototype, with special emphasis given to its novel write-shared protocol. Furthermore, it describes a number of lessons that we learned from our experience with the prototype implementation that are relevant to the implementation of future DSMs.

List of Symbols

Boldface, italics, and typewriter font are used often.

Special symbols used: a number of mathematical symbols are used, including slash (/), left arrow (←), left brace ({), right brace (}), and ellipsis (...).

1 Introduction

A software distributed shared memory (DSM) system provides the abstraction of a shared address space spanning the processors of a distributed memory multiprocessor. This abstraction simplifies the programming of distributed memory multiprocessors and allows parallel programs written for shared memory machines to be ported easily. The basic idea behind software DSM, which we will refer to as simply DSM throughout the rest of this paper, is to treat the local memory of a processor as if it were a coherent cache in a shared memory multiprocessor. DSM systems combine the best features of shared memory and distributed memory multiprocessors. They support the relatively simple and portable programming model of shared memory on physically distributed memory hardware, which is more scalable and less expensive to build than shared memory hardware. DSM would thus seem to be an ideal vehicle for making parallel processing widely available. However, although many DSM systems have been proposed and implemented, DSM is not widely used. The reason for this is that it has proven difficult to achieve acceptable performance using DSM without requiring programmers to carefully restructure their shared-memory parallel programs to reflect the way that the DSM operates. For example, it is often necessary to decompose the shared data into small page-aligned pieces or to introduce new variables to reduce the amount of sharing. This restructuring can be as tedious and difficult as using message passing directly, or more so.

The challenge in building a DSM system is to achieve good performance over a wide range of programs without requiring programmers to restructure their shared memory parallel programs. The overhead of maintaining consistency in software and the high latency of sending messages make this difficult. The primary source of DSM overhead is the large amount of communication that is required to maintain consistency. Since DSMs use general purpose networks (e.g., Ethernet) and operating systems to communicate, the latency of each message between nodes is high. Given this high cost of interprocessor communication, the performance challenge for DSM is to reduce the amount of communication performed during the execution of a DSM program, ideally to the same level as the amount of communication performed during the execution of an equivalent message passing program. DSM has not gained wide acceptance because previous DSM systems did not achieve this level of communication.

The Munin DSM system incorporates several techniques that make DSM a more viable solution for distributed processing by substantially reducing the amount of communication required to maintain consistency, compared to previous DSM systems. The innovative communication-reducing features supported by Munin include the use of multiple consistency protocols, which keep each shared variable consistent with a protocol well suited to its expected or observed access pattern, and support for multiple concurrent writers, which addresses the problem of false sharing by reducing the amount of unnecessary communication performed keeping falsely shared data consistent.

To determine the value of Munin's novel features, we evaluated the performance of a suite of seven scientific applications implemented using message passing, Munin DSM, and a conventional Ivy-like DSM system. Munin's performance is close to that of message passing for four out of the seven applications studied, and within a moderate margin for the other three. Furthermore, for the five applications in which there was a moderate to high degree of sharing, the programs run under Munin achieved substantially higher speedups than their conventional DSM counterparts.

Munin's performance on a variety of application programs is evaluated elsewhere; this paper concentrates on the design of Munin, especially its major data structures and the protocols used to maintain memory consistency. Also included are a number of lessons that we garnered from our experience with the prototype implementation that are relevant to the implementation of future DSMs.

The remainder of this paper is organized as follows. Section 2 presents an overview of Munin's runtime system. Section 3 describes Munin's multiple consistency protocols and the algorithms used to implement them. We summarize Munin's performance in Section 4 and draw conclusions in Section 5.

2 Overview of the Munin Runtime System

The core of the Munin system is the runtime library that contains the fault handling support, synchronization, and other runtime mechanisms. It consists of C source code that is compiled into a library file linked into each Munin program. Each node of an executing Munin program consists of a collection of Munin runtime threads that handle consistency and synchronization operations, and one or more user threads performing the parallel computation. Munin programmers write parallel programs using threads, as they would on a uniprocessor or shared memory multiprocessor.

Figure 1 illustrates the organization of a Munin program during runtime. The Munin prototype was implemented on a collection of sixteen SUN 3/60 workstations running the V operating system. On each participating node, the Munin runtime is linked into the same address space as the user program, and thus can access user data directly. The two major data structures used by the Munin runtime are an object directory, which maintains the state of the shared data being used by local user threads, and the delayed-update queue (DUQ), which manages Munin's software implementation of release consistency. Munin installs itself as the default page fault handler for the Munin program, so that the underlying V kernel will forward all memory exceptions to it for handling.

Each Munin node interacts with the V kernel to communicate with the other Munin nodes over the Ethernet and to manipulate the virtual memory system as part of maintaining the consistency of shared memory. Although the Munin prototype was implemented on the V system, it could be made to run on any operating system that allows user programs to manipulate their own virtual memory mappings, handle page faults at user level, create and destroy processes on remote nodes, exchange messages with remote processes, and access uniprocessor synchronization support (P and V). All of these requirements are met by current operating systems such as Mach, Chorus, and Unix.

Figure 1: Munin runtime organization. On each SUN 3/60 node, the user code and data are linked with the Munin runtime (containing the object directory and the DUQ) on top of the V kernel; the nodes communicate over a 10 Mbps Ethernet.

Munin's object directory is structured as a hash table that maps a virtual address to an entry that describes the data located at that address. In Munin, all variables on the same page are treated as a single object with the same consistency protocol. Programmers can guide the placement of variables on to pages, but by default variables are placed on pages by themselves. The memory overhead of this solution is negligible for the applications studied, as there were few distinct variables. When a Munin runtime thread cannot find an object directory entry in the local hash table, it requests a copy from the node where the computation started, the so-called root node, which contains all of the shared data at the start of the program. The fields of an entry in the directory include: a lock that provides exclusive access to the entry for a handler thread; a start address and size that act as keys for looking up a shared data item's directory entry; a protocol that specifies how the fault handler should service access faults on the data item; various state bits that characterize the dynamic state of the data, such as whether it is present locally and whether it is writable; a copyset that represents a best guess of the set of processors with copies of the data; and a probable owner that represents a best guess of the owner of the data.
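To make this structure concrete, the following C sketch shows how such a directory entry might be declared. The field names, the bitmask copyset, and the 32-node limit are our assumptions, not Munin's actual declarations.

    #include <stddef.h>
    #include <stdint.h>

    typedef volatile int lock_t;      /* placeholder spinlock word        */

    typedef enum {
        PROT_CONVENTIONAL, PROT_READ_ONLY, PROT_MIGRATORY, PROT_WRITE_SHARED
    } protocol_t;

    typedef struct dir_entry {
        lock_t      lock;             /* exclusive access for a handler   */
        char       *start_addr;       /* lookup key: start of data item   */
        size_t      size;             /* lookup key: size of data item    */
        protocol_t  protocol;         /* how access faults are serviced   */
        unsigned    present  : 1;     /* dynamic state bits               */
        unsigned    writable : 1;
        uint32_t    copyset;          /* best guess: nodes with copies    */
        int         probable_owner;   /* best guess: current owner node   */
    } dir_entry_t;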

The variable's copyset is the set of all nodes that the local node believes have a copy of the data. Nodes are added to a variable's copyset in several ways. If a remote node has a copy of the variable when the local node first accesses it, the reply message containing the data includes the remote node in the original copyset. Similarly, a node is added to the copyset when the local node handles a request from that node, or when the local node has been informed that the other node is caching the data as a side effect of some operation. It is possible for the copyset to include nodes no longer caching the data, because they may have flushed their copy without informing the local node, and it is also possible that nodes that are caching copies of the data are not in the copyset, because another node satisfied their load request. If a copy of the data resides locally, the transitive closure of the copysets (the local copyset, plus the copysets of the nodes in the local copyset, and so on) is guaranteed to be a superset of the set of nodes that have a copy of the data, because an entry in a copyset is only removed when that node informs the other nodes that it is no longer caching the data.
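These bookkeeping rules can be summarized in a few lines of C. This is a minimal sketch reusing the dir_entry_t declaration above; the helper names and bitmask representation are assumptions.

    /* Merge the sender's copyset (carried on a data reply) and the
     * sender itself into the local copyset. */
    void copyset_on_data_reply(dir_entry_t *e, uint32_t reply_copyset, int sender)
    {
        e->copyset |= reply_copyset | (1u << sender);
    }

    /* Serving a remote request means the requester now caches the data. */
    void copyset_on_remote_request(dir_entry_t *e, int requester)
    {
        e->copyset |= 1u << requester;
    }

    /* Entries are removed only when a node announces that it has stopped
     * caching the data; silent flushes leave stale (but safe) bits. */
    void copyset_on_flush_notice(dir_entry_t *e, int node)
    {
        e->copyset &= ~(1u << node);
    }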

The probable owner is used to determine the identity of the Munin node that currently owns the data. The owner node is used by the conventional and migratory protocols to arbitrate the decision of which node has write access to the data. For the write-shared protocol, the owner node represents the copy of last resort that cannot be unilaterally purged from memory (e.g., as part of the update timeout mechanism) without finding another node willing to become the owner of last resort. This approach is analogous to the copy of last resort used in cache-only multiprocessors.

3 Multiple Consistency Protocols

Previous DSM systems have employed a single protocol to maintain the consistency of all shared data. The specific protocol varied from system to system (e.g., Ivy supported a page-based emulation of a conventional hardware protocol, while Emerald used object-oriented language support to handle shared object invocations), but each system treated all shared data identically. This led to a situation where some programs could be handled effectively by a given DSM system while others could not, depending on the way in which shared data was accessed by the program. To understand how shared memory programs characteristically access shared data, we studied the access behavior of a suite of shared memory parallel programs. The results of this study and others support the notion that using the flexibility of a software implementation to support multiple consistency protocols can improve the performance of DSM. They also suggest the types of access patterns that should be supported. The results of those studies most relevant to the design of efficient DSM are summarized as follows:

- A single mechanism cannot optimally support all data access patterns, and the most important distinction between the observed data access patterns is between those best handled via some form of invalidate protocol and those best handled via some form of update protocol.

- The number of characteristic sharing patterns is small, and most shared data can be characterized as being accessed in one of these ways.

- Synchronization variables are accessed in an inherently different way than data variables, and are more sensitive to increased access latency.

- The characteristic access pattern of individual variables does not change frequently during execution, so a static protocol selection policy suffices in most cases.

The first three results strongly suggest that a DSM system that supports a small number of consistency protocols will outperform conventional DSM systems that support a single static protocol. A small number of data access patterns characterize most accesses to shared data. Thus, it is feasible to support sufficient consistency protocols so that most individual shared variables can be kept consistent with a protocol that is well suited to the way that they are characteristically accessed, without the DSM system becoming overly complicated. Furthermore, the results indicate that, at the very least, three protocols should be supported: one invalidation-based, one update-based, and one for synchronization.

Munin supports four consistency protocols (conventional, read-only, migratory, and write-shared), plus a suite of synchronization protocols for locks, barriers, and condition variables. In addition to the consistency protocols provided, we provide a capability for users to install their own fault handlers and use Munin's facilities to create consistency protocols of their own. When writing a Munin program, the programmer annotates the declaration of shared variables with a sharing pattern to specify what protocol to use to keep it consistent, e.g., shared{write-shared} C-type variable-name. If a variable is not annotated, the conventional protocol is used. Incorrect annotations may result in inefficient performance, or in runtime errors that are detected by the Munin runtime system, but not in incorrect behavior, if there is sufficient synchronization to satisfy the requirements of release consistency.
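For illustration, a few hypothetical annotated declarations in the style described above; the exact keyword spellings are assumptions based on the example in the text:

    shared{read-only}    double coefficients[1024];  /* written only at init  */
    shared{migratory}    struct queue work_queue;    /* guarded by a lock     */
    shared{write-shared} double grid[256][256];      /* disjoint writers      */
    shared               int    progress_count;      /* unannotated: defaults
                                                        to conventional       */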

The consistency protocols all have roughly the same structure. Consistency operations generally are initiated when a user thread attempts to access data that is either not present or that has been protected by the runtime system so that accesses to it generate exceptions. The fault handler examines the exception message to determine the location and nature of the exception, which it uses to look up the data item in the object directory. It then performs the consistency operations appropriate to the data's protocol. After performing the consistency operations required to satisfy the fault, the fault handler resumes the user thread and waits to receive its next exception or remote request message. The remainder of this section describes the implementation of Munin's consistency mechanisms. More detailed descriptions, including pseudocode of the algorithms, can be found elsewhere.
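This common structure can be sketched as a lookup-then-dispatch loop body. The sketch reuses the dir_entry_t declaration above, and every helper named here is hypothetical; it illustrates only the flow described in the text.

    /* Hypothetical helpers, assumed provided by the runtime: */
    dir_entry_t *directory_lookup(char *addr);
    dir_entry_t *fetch_entry_from_root(char *addr);
    void acquire(lock_t *l);
    void release(lock_t *l);
    void handle_conventional(dir_entry_t *e, char *addr, int is_write);
    void handle_read_only(dir_entry_t *e, int is_write);
    void handle_migratory(dir_entry_t *e);
    void handle_write_shared(dir_entry_t *e, int is_write);
    void resume_faulted_thread(void);

    void munin_fault_handler(char *fault_addr, int is_write)
    {
        dir_entry_t *e = directory_lookup(fault_addr);  /* local hash table  */
        if (e == NULL)
            e = fetch_entry_from_root(fault_addr);      /* root has them all */

        acquire(&e->lock);                /* exclusive access to the entry   */
        switch (e->protocol) {            /* protocol-specific servicing     */
        case PROT_CONVENTIONAL: handle_conventional(e, fault_addr, is_write); break;
        case PROT_READ_ONLY:    handle_read_only(e, is_write);                break;
        case PROT_MIGRATORY:    handle_migratory(e);                          break;
        case PROT_WRITE_SHARED: handle_write_shared(e, is_write);             break;
        }
        release(&e->lock);

        resume_faulted_thread();          /* then await the next event       */
    }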

3.1 Conventional

Conventional shared variables are replicated on demand and are kept consistent using an invalidation-based protocol that requires a writer to be the sole owner before it can modify the data. When a thread attempts to write to replicated data, a message is transmitted to invalidate all other copies of the data. The thread that generated the miss blocks until all invalidation messages are acknowledged. We based Munin's conventional protocol on Ivy's distributed dynamic manager protocol. This protocol is typical of what existing DSM systems provide, and is the conventional DSM protocol evaluated in our performance study.

We incorporated a simplified version of the freezing mechanism from Mirage (sketched below), so that after a node acquires ownership of a conventional data item, it does not reply to requests from other nodes for a short, fixed period of time. This mechanism guarantees that the node performing the write makes progress even in the face of heavy sharing. The performance of the conventional DSM was largely unaffected by the choice of the freeze time, as long as it was above a small threshold. Without the timeout mechanism, we observed several phenomena that severely hurt the performance of conventional data when there was a high degree of sharing. When two or more threads concurrently modify a single page, a frequent occurrence, the data ping-pongs between the writers with little progress made between faults. The freezing mechanism partially alleviates this problem by ensuring that progress is made no matter how much sharing is occurring. There were even problems when there was only a single writer but multiple readers of a single data item. When there were a large number of readers, it was often the case that, by the time the writer finished invalidating all the replicas, one of the first nodes to be invalidated would have requested a new copy of the data to satisfy a read miss. Munin runtime threads have a higher scheduling priority than user threads, to ensure that remote data requests do not starve, so the reload request was satisfied before the writer was resumed. This choice of priorities resulted in the data being read protected on the original node before the writer was able to complete the write that caused the original write fault, which caused the writer to fault anew. This result convinced us that a freezing mechanism akin to that found in Mirage should be incorporated into future software DSMs that support ownership-based invalidate protocols.
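A minimal sketch of the freeze window follows, assuming a millisecond clock and that the runtime records when ownership was acquired. The constant is illustrative only; the prototype's actual freeze interval is not reproduced here.

    #define FREEZE_MSECS 50   /* illustrative value, not the prototype's */

    /* After acquiring ownership at owned_since, a node defers remote
     * transfer requests until the freeze window has elapsed, ensuring
     * the writer makes progress before the data migrates away again. */
    int may_surrender_ownership(long owned_since, long now)
    {
        return (now - owned_since) >= FREEZE_MSECS;
    }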

3.2 Read-Only

Read-only data is writable only during initialization, which allows it to be initialized along with the rest of the program. The consistency protocol simply consists of replication on demand. A runtime error is generated and the system debugger is invoked if a thread attempts to write to read-only data. As noted earlier, if the programmer does not specify that a particular variable is shared, it is replicated when the worker nodes are created. Thus, read-only objects could simply be not marked as shared. There are two reasons that programmers might wish to distinguish read-only shared data from non-shared data: (i) for debugging purposes, to detect unexpected writes to input data, and (ii) to conserve memory, by loading on demand only the portion of the read-only data that a given node requires.

3.3 Migratory

For migratory data, a single thread performs multiple accesses to the data, including one or more writes, before another thread accesses the data. This access pattern is typical of shared data that is accessed only inside a critical section or via a work queue. For this type of data, it is generally best to migrate the data to a processor as soon as it accesses it the first time, regardless of whether the first access is a read or a write. The consistency protocol for migratory data propagates the data to the next thread that accesses the data, provides the thread with read and write access (even if the first access is a read), and invalidates the original copy. This protocol avoids a write miss and a message to invalidate the old copy when the new thread first modifies the data. In addition, the Munin programmer can specify the logical connections between shared variables and the synchronization variables that protect them. This information is conveyed to the runtime system using the AssociateDataAndSynch call, which suggests that Munin include a copy of the specified shared variable in the message that passes ownership of the specified synchronization variable. This pragma is particularly useful for associating migratory data accessed within a critical section with the lock controlling the critical section. It reduces the number of faults and messages needed to migrate the data, as in Clouds.
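As an illustration, the following sketch associates a migratory record with the lock that protects it. The exact signature of AssociateDataAndSynch is an assumption; only the call's intent is taken from the text.

    struct account { long balance; };
    typedef volatile int lock_t;

    /* Assumed signature: associate a data region with a synchronization
     * variable so the data rides along with lock ownership transfers. */
    void AssociateDataAndSynch(void *data, unsigned size, void *synch);

    shared{migratory} struct account shared_account;
    lock_t account_lock;

    void setup(void)
    {
        /* Hint to Munin: piggyback shared_account on the message that
         * passes ownership of account_lock, so acquiring the lock also
         * delivers the data accessed inside the critical section.     */
        AssociateDataAndSynch(&shared_account, sizeof(shared_account),
                              &account_lock);
    }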

3.4 Write-Shared

Write-shared variables are frequently written by multiple threads concurrently, without intervening synchronization to order the accesses, because the programmer knows that each thread reads from and writes to independent portions of the data. Because of the way that the data is laid out in memory, write-shared data often exhibits a high degree of sharing at a coarse granularity (e.g., a cache line or page), but no sharing at a word granularity, a phenomenon known as false sharing. One example of false sharing is when two independent shared variables reside on the same page of memory, each being modified by a different processor. Another example is when a shared array is laid out contiguously in a single page of memory and different processors are modifying disjoint parts of the array. An intelligent compiler or careful user can alleviate the false sharing in the first case by allocating unrelated variables on distinct pages, at the expense of using extra memory. However, the false sharing in the second case is unavoidable, because the falsely shared data is part of a single contiguous array. We have observed that write-shared data is very common, and that its presence results in very poor DSM performance if it is handled by a conventional consistency protocol that communicates whenever a shared page is modified. False sharing is a particularly serious problem for DSM systems for two reasons: (i) the consistency units are large, so false sharing is very common, and (ii) the latencies associated with detecting modifications and communicating are large, so unnecessary faults and messages are particularly expensive. Any DSM system that expects to achieve acceptable performance must address the problem of false sharing.

The write-shared protocol is designed specifically to mitigate the effect of false sharing. It does so by supporting concurrent writers, by buffering writes until synchronization requires their propagation, by using an update-based consistency protocol, and by timing out and invalidating data that is not being used frequently. Unlike existing update protocols, the write-shared protocol buffers and combines update messages, as shown in Figure 2. The reason that Munin can buffer updates, rather than send them as soon as they are generated, is that it supports the release consistency memory model. A detailed description of the release consistency model is beyond the scope of this paper, but, roughly, a shared memory implementation based upon release consistency can delay the point at which memory must be consistent until a processor performs a release operation (e.g., releases a lock or arrives at a barrier). In the case of Munin, this means that updates to shared data can be buffered and combined between release points. A single processor often performs a series of writes to a shared block within a critical section. When this occurs, the write-shared protocol transmits a single update message, containing all of the changes performed within the critical section, to each node caching a copy of the data, rather than sending a stream of updates as each write occurs. This can lead to order of magnitude reductions in the number of messages required to maintain consistency, compared to a conventional pipelined invalidate protocol. For DSM, where messages are expensive but where large messages are not much more expensive than small messages, combining updates is important.

Figure 2: Buffering updates in Munin. P1 performs writes w(x), w(y), and w(z), stalls at the release, and transmits a single update message for (x, y, z), which P2 acknowledges.

The mechanism used to buffer and combine update messages, the delayed-update queue (DUQ), is illustrated in Figure 3. The initial copyset of a write-shared data item is empty on all nodes, including on the root node. Before a write-shared variable X is modified, the page on which it resides is mapped so that the first write will cause a page fault (symbolized in Figure 3 by the dashed box).

Read faults are handled as follows. If the data is present but has been read protected, such as the first time it is accessed after the node has received an update and the timeout mechanism has read protected it (see below), the handler simply remaps the page to be readable and marks the data as accessed as part of the update timeout mechanism. When a write-shared variable is first loaded on a node, it is made read-only, so any attempts to modify it are detected. The one exception to this is if it is the only copy of the data in the system, in which case it can be mapped read-write until another node requests a copy, at which time the original copy is made read-only.

Write faults are handled similarly, except when the data is being actively shared. In this case, the write-shared protocol invokes the DUQ mechanism, as illustrated in Figure 3(a). After determining that the accessed variable, call it X, is write-shared, the write fault handler makes a copy of X (X_twin) and puts the data item's directory entry on the DUQ. It then maps the original copy of X to be read-write and resumes the faulted thread. Since the original copy of X is no longer write protected, all subsequent writes to X proceed with no consistency overhead. The key feature of the write fault handler is that it only communicates with other nodes if the data is not present. If the data is already present, but mapped read-only or supervisor-only, the handler only performs local operations. The handler does not need to get exclusive write access, nor does it need to immediately propagate the modification to the other cached copies. This feature allows multiple nodes to concurrently modify a single data item without communicating for each write.
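This write-fault path can be sketched in a few lines of C. The struct is a simplified stand-in for the directory entry, extended with a twin pointer, and the helpers are hypothetical.

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char  *start_addr;   /* start of the data item            */
        size_t size;         /* its size in bytes                 */
        char  *twin;         /* copy-on-write twin; NULL if none  */
        int    present;      /* is the data cached locally?       */
    } ws_entry_t;

    void fetch_copy(ws_entry_t *e);              /* hypothetical helpers */
    void duq_enqueue(ws_entry_t *e);
    void map_read_write(char *addr, size_t size);

    /* First write to actively shared data: twin it, enqueue it on the
     * DUQ, and unprotect the page; later writes proceed fault-free.   */
    void handle_write_shared_fault(ws_entry_t *e)
    {
        if (!e->present) {           /* the only case that communicates  */
            fetch_copy(e);
            return;
        }
        if (e->twin == NULL) {       /* first write since the last flush */
            e->twin = malloc(e->size);
            memcpy(e->twin, e->start_addr, e->size);
            duq_enqueue(e);          /* remember to flush at a release   */
        }
        map_read_write(e->start_addr, e->size);
    }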

The server routine that handles remote requests for write-shared data is straightforward. It is irrelevant whether the requesting node faulted on a read or a write. Any node can respond to a request, not just the owner. If the node that satisfies a data request has a writable copy, but not a twin, of the data, the dirty copy of the data is sent to the requester and then remapped to be read-only, so subsequent changes to the data are detected and eventually purged.

The trickiest part of the write-shared protocol occurs when a local thread performs a release operation (releases a lock, arrives at a barrier, signals a condition variable, or terminates). At this time, all delayed modifications to write-shared data must be propagated to their remote copies before the local thread may proceed. These changes are propagated in phases, where each phase is responsible for propagating changes to a particular set of nodes. First, the DSM runtime determines if any updates have been buffered on the DUQ. If there are, the server walks down the DUQ and creates a differential encoding of every enqueued data item. In the example illustrated in Figure 3, it determines that X has been modified. It encodes the modifications by performing a word-by-word comparison of the data item X and its original contents X_twin. As it finds differences, it copies them to a buffer, prepending each sequence of updated words with their starting address and the run length. Because the twin is not needed after the modifications have been propagated, Munin uses the buffer that contained the twin to contain the encoding. By using eight bytes of scratch space in the message header and the portions of the twin array that have already been scanned, the encoding algorithm has the property that the diff is never larger than the twin, and it never overwrites parts of the twin that have not yet been examined as part of the encoding process. Thus, reuse of the twin buffer is safe.

(By detecting changes at the word level and not the byte level, we risk the possibility that two threads will modify different bytes of the same word, and the update mechanism will either signal a data race or overwrite one of the modifications. The applications that we examined had no byte-grained shared data, but if the user anticipates a problem, a runtime switch can change the granularity of comparison to the byte level, at the expense of increased encoding and decoding time.)

Figure 3: Delayed update queue operation. (a) On a write fault to X, the handler makes a copy-on-write twin, enqueues the item on the DUQ, and makes the original writable. (b) At a release, the modified copy is compared with the twin to encode a "diff," the update is sent to the replicas, and X is write protected if it is still replicated.
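To make the encoding concrete, here is a minimal C sketch of the word-by-word diff. It assumes word-aligned, word-sized data and emits (start offset, run length, modified words) records into a separate output buffer; the in-place reuse of the twin and the eight bytes of header scratch space are omitted for clarity.

    #include <stddef.h>
    #include <stdint.h>

    /* Encode the difference between a modified item and its twin as a
     * sequence of (start offset, run length, modified words) records.
     * Returns the number of words written to out. */
    size_t encode_diff(const uint32_t *data, const uint32_t *twin,
                       size_t nwords, uint32_t *out)
    {
        size_t o = 0;
        for (size_t i = 0; i < nwords; ) {
            if (data[i] == twin[i]) { i++; continue; }
            size_t start = i, run = 0;
            while (i < nwords && data[i] != twin[i]) { i++; run++; }
            out[o++] = (uint32_t)start;      /* starting word offset  */
            out[o++] = (uint32_t)run;        /* run length in words   */
            for (size_t j = start; j < start + run; j++)
                out[o++] = data[j];          /* the modified words    */
        }
        return o;
    }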

After encoding all of the data items enqueued on the DUQ, the server transmits a descriptor message, containing a list of the encoded data items, to the remote nodes sharing any of the encoded data. Each of these recipients determines if it needs to receive updates to any of the data described in the descriptors, requests the encoded data that it is still caching, and replies with an indication of (i) whether it is still caching each data item and (ii) its copyset for each data item. The updating node continues to transmit update messages to any node thought to be caching a copy of any of the encoded data, until it has sent updates to all such nodes. This process is guaranteed to terminate when either no new nodes are added to the copyset during a phase or all nodes have been updated. When a NACK is received, the node that sent the NACK is removed from the local copyset for the data items specified. After all the updates for a data item have been performed, X is write-protected, if it is still replicated, to ensure that subsequent writes are detected.

To incorporate updates, the receiving Munin worker thread examines the update message and determines if it is still caching any of the data specified. It is possible that the node has invalidated data that it once cached, as a result of the update timeout mechanism (see below). If it still has a copy of any of the data described in the update message, it requests the corresponding encodings and sends a reply message to the updating node. The reply message includes an indication of which of the encoded data items the node is still caching, and its copyset for each encoded item, whether it is still being cached or not. The receiving node then decodes the updates to extract the individual words that the sending node has modified. If the node is not caching any of the data described in the update message, it sends a reply message with NACKs and copysets for all of the data.

Incorporating an update normally entails simply traversing the encoding and copying the modified sequences into the local copy of the data. However, the user can specify that the runtime system detect dynamic data races, which is useful for debugging complex parallel programs with synchronization bugs. If a node receives an update for a variable that it has modified, a clean copy of the variable will be present. The system can detect data races by performing a three-way comparison of the received update, the dirty version, and the clean version as it decodes the encoded updates. If, while performing the three-way comparison, it finds that all three copies of a particular word differ, it has detected a data race on that word, and it generates an error message detailing what it has found.
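A sketch of update incorporation with the optional three-way comparison follows. The diff layout matches the encoding sketch above; report_data_race is a hypothetical helper.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    void report_data_race(uint32_t word_index);   /* hypothetical */

    /* Apply an encoded diff to the local copy.  A word is flagged as a
     * race only if the incoming value, the local (dirty) value, and the
     * clean twin all differ from one another. */
    void apply_diff(uint32_t *data, const uint32_t *twin,
                    const uint32_t *diff, size_t diff_words, bool check_races)
    {
        size_t o = 0;
        while (o < diff_words) {
            uint32_t start = diff[o++];
            uint32_t run   = diff[o++];
            for (uint32_t j = 0; j < run; j++, o++) {
                uint32_t w = start + j;
                if (check_races && twin != NULL &&
                    diff[o] != twin[w] &&      /* remote changed the word  */
                    data[w] != twin[w] &&      /* local changed it too     */
                    diff[o] != data[w])        /* and to a different value */
                    report_data_race(w);
                data[w] = diff[o];             /* install the update       */
            }
        }
    }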

This process is complicated somewhat by an update timeout mechanism. The goal of the timeout mechanism is to invalidate data that has grown stale, so as to avoid unnecessary future updates. The timeout mechanism ensures that updates to an object are only accepted for a limited period of time after it was last accessed. When an update is incorporated, the data is temporarily mapped to be supervisor-only, so that any access to it (read or write) by a local user thread will be detected. A timestamp is set in the data item's directory entry at this time. If the data is accessed before another update arrives, the subsequent fault simply remaps the data and resets a flag. However, if an update to the data is received and the node has not used the data since the last update, sufficient time has elapsed since the last update, and the data is not dirty, the update server invalidates the local copy. A copy of last resort for each write-shared variable is used to prevent all copies of the variable from being invalidated via the update timeout mechanism when several nodes send updates simultaneously. This copy cannot be invalidated unless the node is first able to find another node to take over this responsibility. The probable owner chain is used to find this copy of last resort; in the write-shared protocol, the owner holds this copy.
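The timeout decision can be summarized as a predicate, assuming the runtime tracks an accessed flag, a dirty flag, a last-update timestamp, and a copy-of-last-resort flag; the constant is illustrative only.

    #define UPDATE_TIMEOUT_MSECS 100   /* illustrative value only */

    /* Invalidate the local copy on an incoming update only if the data
     * has not been touched since the previous update, enough time has
     * passed, there are no local modifications, and this node does not
     * hold the copy of last resort. */
    int should_invalidate_on_update(int accessed_since_last_update,
                                    long last_update, long now,
                                    int dirty, int copy_of_last_resort)
    {
        return !accessed_since_last_update
            && (now - last_update) >= UPDATE_TIMEOUT_MSECS
            && !dirty
            && !copy_of_last_resort;
    }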

A technique similar to the delayed update queue was used by the Myrias SPS multiprocessor. It performed the copy-on-write and diff in hardware, but required a restricted form of parallelism to ensure correctness. Specifically, only one processor could modify a cache line at a time, and the only form of parallelism that could exploit this mechanism was a form of Fortran doall statement.

We considered implementing write detection by having the compiler add code to log writes to replicated data as part of the write, as is done in Emerald and Midway. However, although recent results indicate that compiler-based write detection can outperform VM-based detection, we chose not to explore this approach in the prototype, because we did not want to modify the compiler, and we are concerned with the portability constraints imposed by the requirement that DSM programmers use a special compiler. It is an attractive alternative for systems that do not support fast page fault handling, such as the iPSC/i860 hypercube. However, if the number of writes to a particular data item between DUQ flushes is high, as is often the case, this approach will perform relatively poorly, because each write to a shared variable is slower.

4 Performance Summary

We present a summary of Munin's performance here to illustrate the value of Munin's design. More detailed evaluations appear elsewhere.

Seven application programs were used in the evaluation of Munin: Matrix Multiply (MULT), Finite Differencing (DIFF), Traveling Salesman Problem (both fine-grained, TSP-F, and coarse-grained, TSP-C), Quicksort (QSORT), Fast Fourier Transform (FFT), and Gaussian Elimination with Partial Pivoting (GAUSS). Three different versions of each application were written: a Munin DSM version; a conventional DSM version that used Munin's conventional protocol to implement a sequentially consistent memory, and thus measured the performance of a conventional DSM system such as Ivy; and a message passing version. The message passing programs represent best case implementations of the parallel programs. The DSM runtime system used the same general purpose message passing facilities, provided by the operating system, that the message passing programs used directly. Great care was taken to ensure that the inner loops of each computation, the problem decomposition, and the major data structures for each version were identical.

To evaluate the impact of Munin's design, two basic comparisons were performed. In the first comparison, the performance of the Munin versions of the programs was compared to the performance of equivalent message passing programs. This comparison measures the extent to which programs written using the shared memory model and run under Munin can achieve performance comparable to programs written using explicit message passing. Table 1 presents the speedup achieved by each version of the program, and the percentage of the message passing speedup achieved by the Munin DSM version of the program, for sixteen processors. For four of the applications (MULT, DIFF, TSP-C, and FFT), the Munin versions achieve nearly all of the speedup of their hand-coded message-passing equivalents, despite the fact that no significant restructuring of the original shared-memory programs was performed while porting them to run under Munin. For the other three applications (TSP-F, QSORT, and GAUSS), the performance of the Munin variants falls noticeably below that of their message-passing equivalents. Support for explicit RPC brings the performance of these applications close to that of their message passing equivalents.

Table 1: Munin vs. Message Passing. For each application (MULT, DIFF, TSP-C, TSP-F, QSORT, FFT, and GAUSS), the table gives the message passing speedup, the Munin speedup, and how close the Munin version comes to the message passing version, on sixteen processors.

Table 2: Conventional DSM vs. Munin DSM. For each application, the table gives the Munin speedup, the conventional DSM speedup, and how close the conventional version comes to the Munin version, on sixteen processors.

In the second comparison, the performance of the Munin versions of the programs was compared to the performance of the conventional DSM versions. This comparison identifies the types of programs that stand to benefit the most from Munin's novel features, and the types of programs that are adequately supported with a conventional page-based, invalidation-style DSM. Table 2 presents the speedup achieved by each version of the program, and the percentage of the Munin DSM speedup achieved by the conventional DSM version of the program, for sixteen processors. For the programs with large grained sharing (MULT and TSP-C), the conventional versions achieved performance close to the speedup of their Munin counterparts. For DIFF, TSP-F, and GAUSS, the performance of the conventional versions was substantially lower than that of Munin. For QSORT, conventional performance was far below Munin performance, while for FFT, conventional performance was orders of magnitude worse.

These results are very promising, and argue that Munin represents a significant step towards making DSM useful on a much wider spectrum of programs and programming styles. Specifically, conventional DSM performs well on programs with relatively little sharing, or when the sharing is at a large granularity. However, its performance drops off quickly when there is much fine-grained sharing, and becomes unacceptable when there is much concurrent write sharing or false sharing.

5 Conclusions

This paper contains a detailed description of the design and implementation of the Munin prototype, with special emphasis given to its write-shared protocol. Munin contains a number of mechanisms and implementation strategies designed to improve the performance of DSM, including support for concurrent writers, the use of distributed ownership protocols, and the provision of a specialized library of synchronization operations. Many of these features are also relevant to the design of future scalable shared memory hardware.

We learned a number of lessons from our experience with the prototype implementation that designers of future DSM systems should consider, including the following:

- DSM can be made efficient without the use of unconventional programming languages, compilers, or operating system support.

- Organizing the DSM runtime as a library package works well. In particular, this organization improves performance by reducing the number of context switches and the amount of cross-domain memory copying required to maintain consistency.

- The implementation of delayed updates is subtle. The key to handling write-shared data efficiently is to minimize the amount of time spent purging the DUQ, so it is important to perform updates in parallel, using multiple threads or multicast.

- While it is easy to add consistency protocols if the server software is well designed, a small number of protocols is enough to handle the way that most shared data is accessed.

Munin's performance could be improved significantly by a number of factors: (a) lower latency OS operations (message passing, exception handling, and VM remapping), (b) a high bandwidth multicast network, and (c) fine-grained write detection hardware. In the prototype, OS overhead was a major source of overhead. In addition, while we carefully avoided the use of Ethernet's multicast capability, to better model how Munin would work on a scalable network technology, the presence of multicast would greatly reduce the time needed to perform updates in the infrequent but very expensive situations where a large number of nodes are write sharing data, as in FFT. Finally, fine-grained write detection hardware could eliminate the largest component of software overhead: diff creation. These issues are subjects of ongoing research at the University of Utah.

References

A. Agarwal and A. Gupta. Memory-reference characteristics of multiprocessor applications under MACH. In Proceedings of the Annual International Symposium on Computer Architecture, June.

M. Ahamad, P. W. Hutto, and R. John. Implementing and programming causal distributed shared memory. In Proceedings of the International Conference on Distributed Computing Systems, May.

H. E. Bal and A. S. Tanenbaum. Distributed programming with shared data. In Proceedings of the International Conference on Computer Languages, October.

J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Adaptive software cache management for distributed shared memory architectures. In Proceedings of the Annual International Symposium on Computer Architecture, May.

B. N. Bershad, M. J. Zekauskas, and W. A. Sawdon. The Midway distributed shared memory system. In COMPCON, February.

R. Bisiani and M. Ravishankar. PLUS: A distributed shared-memory system. In Proceedings of the Annual International Symposium on Computer Architecture, May.

J. B. Carter. Efficient Distributed Shared Memory Based On Multi-Protocol Release Consistency. PhD thesis, Rice University, August.

J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Techniques for reducing consistency-related communication in distributed shared memory systems. ACM Transactions on Computer Systems. To appear.

J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and performance of Munin. In Proceedings of the ACM Symposium on Operating Systems Principles, October.

J. S. Chase, F. G. Amador, E. D. Lazowska, H. M. Levy, and R. J. Littlefield. The Amber system: Parallel programming on a network of multiprocessors. In Proceedings of the ACM Symposium on Operating Systems Principles, December.

D. R. Cheriton and W. Zwaenepoel. The distributed V kernel and its performance for diskless workstations. In Proceedings of the ACM Symposium on Operating Systems Principles, October.

S. J. Eggers and R. H. Katz. A characterization of sharing in parallel programs and its application to coherency protocol evaluation. In Proceedings of the Annual International Symposium on Computer Architecture, May.

B. Fleisch and G. Popek. Mirage: A coherent distributed shared memory design. In Proceedings of the Symposium on Operating Systems Principles, December.

K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the Annual International Symposium on Computer Architecture, Seattle, Washington, May.

E. Jul, H. Levy, N. Hutchinson, and A. Black. Fine-grained mobility in the Emerald system. ACM Transactions on Computer Systems, February.

K. Li and P. Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems, November.

R. G. Minnich and D. J. Farber. The Mether system: A distributed shared memory for SunOS. In Proceedings of the Summer USENIX Conference, June.

Myrias Corporation. System overview. Edmonton, Alberta.

B. Nitzberg and V. Lo. Distributed shared memory: A survey of issues and algorithms. IEEE Computer, August.

U. Ramachandran, M. Ahamad, and Y. A. Khalidi. Unifying synchronization and data transfer in maintaining coherence of distributed shared memory. Technical Report GIT-CS, Georgia Institute of Technology, June.

S. K. Reinhardt, J. R. Larus, and D. A. Wood. Tempest and Typhoon: User-level shared memory. In Proceedings of the Annual International Symposium on Computer Architecture, April.

R. L. Sites and A. Agarwal. Multiprocessor cache analysis using ATUM. In Proceedings of the Annual International Symposium on Computer Architecture, June.

M. Stumm and S. Zhou. Algorithms implementing distributed shared memory. IEEE Computer, May.

J. E. Veenstra and R. J. Fowler. A performance evaluation of optimal hybrid cache coherency protocols. In Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems, September.

D. H. D. Warren and S. Haridi. The Data Diffusion Machine: A shared virtual memory architecture for parallel execution of logic programs. In Proceedings of the International Conference on Fifth Generation Computer Systems, December.

W.-D. Weber and A. Gupta. Analysis of cache invalidation patterns in multiprocessors. In Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems, April.

M. J. Zekauskas, W. A. Sawdon, and B. N. Bershad. Software write detection for distributed shared memory. In Proceedings of the First Symposium on Operating System Design and Implementation, November.