SMG: SHARED MEMORY FOR GRIDS

John P. Ryan, Brian A. Coghlan
Computer Architecture Group
Dept. of Computer Science
Trinity College Dublin
Dublin 2, Ireland
email: {john.p.ryan, [email protected]}

ABSTRACT
This paper discusses some of the salient issues involved in implementing the illusion of a shared-memory programming model across a group of distributed-memory processors, from a cluster through to an entire Grid. This illusion can be provided by a distributed shared memory (DSM) runtime system. Mechanisms that have the potential to increase performance by minimizing high-latency inter-site messages and data transfers are highlighted. Relaxed consistency models are investigated, as is the use of a grid information system to ascertain topology information; the latter allows for hierarchy-aware management of shared data and synchronization variables. The process of incremental hybridization is also explored, whereby more efficient message-passing mechanisms can incrementally replace DSM actions when circumstances dictate that performance improvements can be obtained. In this paper we describe the overall design and architecture of a prototype system, SMG, which integrates the DSM and message-passing paradigms and may be the target of an OpenMP compiler. Our initial findings, based on some trivial examples, indicate some of the potential benefits that can be obtained for grid applications.

KEY WORDS
DSM, MPI, OpenMP, Grid, Incremental Hybridization

1 Introduction

The message-passing programming paradigm provides simple mechanisms to transfer data structures between distributed-memory machines, enabling the construction of high-performance and highly scalable parallel applications. However, it places a considerable burden on the programmer, who must explicitly declare and use send/receive message pairs; this is often a source of errors. Implementations of the message-passing paradigm exist for grids [1].

Shared memory is a simpler paradigm for constructing parallel applications, as it offers uniform access methods to memory for all user threads of execution. It therefore offers an easier way to construct applications than a corresponding message-passing implementation. The disadvantage is limited scalability; nonetheless, vast quantities of parallel code have been written in this manner. OpenMP, promoted by multiple vendors in the high-performance computing sector, has emerged as a standard for developing these shared-memory applications [2]. Through the use of compiler directives, serial code can be easily parallelized by explicitly identifying the areas of code that can be executed concurrently. This parallelization of an existing serial application can be done incrementally, which has been an important feature in promoting the adoption of the standard among parallel application developers.

These are the two predominant models for parallel computing, and there are many implementations of both for different architectures and platforms. Shared-memory programming has the easier programming semantics, while message passing has greater scalability and efficiency, since communication is explicitly defined and overheads such as control messages can be reduced dramatically or even eliminated. Previous work [3] examined approaches to combining the message-passing and shared-memory paradigms in order to leverage the benefits of both, especially when applications are executed in an environment such as a cluster of SMPs, but not on grids.

Distributed Shared Memory (DSM) implementations aim to provide an abstraction of shared memory to parallel applications executing on ensembles of physically distributed machines. The application developer thereby obtains the benefits of developing in a style similar to shared memory while harnessing the price/performance benefits associated with distributed-memory platforms. Throughout the 1990s research into software-only Distributed Shared Memory (S-DSM) became popular, resulting in numerous implementations, e.g. Munin [4], Midway [5], TreadMarks [6], and Brazos [7]. However, S-DSM has met with little success, due to poor scalability and the lack of a common Application Programming Interface (API) [8].

As Grids are composed of geographically distributed memory machines, traditional shared memory may not execute across multiple grid sites. DSM offers a potential solution, but its use in a Grid environment would be no small accomplishment, as numerous barriers stand in the way of an efficient implementation that could run in such a heterogeneous environment. However, if the paradigm could be made available, then according to [9] grid programming would be reduced to optimizing the assignment and use of threads and the communication system.

Our aim has been to explore a composition of the two environments: an OpenMP-compatible DSM system layered on top of a version of the now ubiquitous Message Passing Interface (MPI) that can execute in a Grid environment. This choice reflects a desire for tight integration with the message-passing library, and for exploiting the extensive optimization of MPI communications by manufacturers. Such a system would need to be multi-threaded so that computation can overlap communication to mask the high latency. It has also been shown that integrating an information and monitoring system can yield substantial gains in the implementation of MPI collective routines [10]. We believe that following a similar approach and integrating an information system into the DSM will result in performance improvements, by allowing optimizations such as per-site caching and write collection of shared data, and by enabling communication-efficient synchronization routines such as barriers. Hence a further aim has been to use the information system to create topology/hierarchy awareness in the runtime support of user applications, and to insulate the developer from this, except where they wish topology information to be exposed to the user application code to allow for optimizations at the application level.

A primary concern has been to minimize the use of the high-latency communication channels between participating sites where possible, to favour a lower message count with higher message payload where not, and to avoid requiring hardware support for remote memory access.

For a DSM to gain acceptance, compliance with open standards is necessary, so it is also preferable that the DSM form the basis for a target of a parallelizing compiler such as OpenMP. Other projects have adopted this approach [11]. By examining some OpenMP directives, one can identify some of the design requirements of the DSM. The following OpenMP code snippet implements the multiplication of two matrices.

/*** Begin parallel section ***/
#pragma omp for
for (i = 0; i < ROWSA; i++) {
    for (j = 0; j < COLSB; j++) {
        c[i][j] = 0;
        for (k = 0; k < COLSA; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}
/*** End parallel section ***/

Barriers are used implicitly in OpenMP: any thread will wait at the end of the structured code block until all threads have arrived, except, for example, where a nowait clause has been declared. In order for concurrency to be allowed inside the parallel section, the shared memory regions must be writable by multiple writers.

Mutual exclusion is also required in OpenMP. The main mutual exclusion directives are as follows.

/*** Only the master thread (rank 0)
     will execute the code ***/
#pragma omp master
{...}

/*** Only one thread will execute
     the code ***/
#pragma omp single
{...}

/*** All threads will execute the code,
     but only one at a time ***/
#pragma omp critical
{...}

These imply that a distributed mutual exclusion device (or lock) is required to implement the last of these directives. The first two are functionally equivalent and can be implemented using a simple if-then-else structure, as the sketch below illustrates.
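To make this mapping concrete, the following minimal sketch shows how a DSM runtime might realize these directives with a rank check and a distributed lock. The dsm_* names are illustrative assumptions rather than the SMG API, and the single-process stubs merely stand in for the real distributed implementations so that the sketch compiles and runs.

/* Single-process stubs standing in for a distributed
   runtime (illustrative only, not the SMG API). */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t stub_lock = PTHREAD_MUTEX_INITIALIZER;

static int  dsm_rank(void)             { return 0; }
static void dsm_lock_acquire(int lock) { (void)lock; pthread_mutex_lock(&stub_lock); }
static void dsm_lock_release(int lock) { (void)lock; pthread_mutex_unlock(&stub_lock); }

int main(void)
{
    /* omp master / omp single: an if-then-else on the rank
       suffices, since exactly one thread takes the branch. */
    if (dsm_rank() == 0)
        printf("executed by exactly one thread\n");

    /* omp critical: every thread executes the block, but only one
       at a time, so a distributed mutual-exclusion lock is needed. */
    dsm_lock_acquire(0);
    printf("inside the critical section\n");
    dsm_lock_release(0);

    return 0;
}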
2 Shared Memory for Grids

The premise of grid computing is that distributed sites make resources available for use by remote users. A grid job may run on one or more sites, and each site may consist of a heterogeneous group of machines. To reduce the potential number of variations necessary, the DSM implementation should use only standard libraries such as the POSIX threads, MPI, and standard C libraries; no special compiler or kernel modifications should be required.

The system must present the programmer with an easy-to-use and intuitive API, so that the burden associated with constructing an application is minimal. This requires that the semantics be as close to those of shared-memory programming as possible, and that the system borrow from the successes and learn from the mistakes of previous DSM implementations. All function calls therefore belong to one of the three following groups (a usage sketch follows the list).

• Initialization/finalization
• Memory allocation
• Synchronization
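The snippet below illustrates how these three groups of calls might fit together in an application. The smg_* names and signatures are assumptions made for illustration; the paper does not specify the actual API here, so only the grouping, not the naming, should be read as definitive.

/* Assumed prototypes, for illustration only. */
#include <stddef.h>

void  smg_init(int *argc, char ***argv);   /* initialization    */
void  smg_finalize(void);                  /* finalization      */
void *smg_malloc(size_t nbytes);           /* shared allocation */
void  smg_free(void *ptr);
void  smg_barrier(void);                   /* synchronization   */

int main(int argc, char **argv)
{
    smg_init(&argc, &argv);

    /* Allocate a region of DSM-managed shared memory. */
    double *v = smg_malloc(1024 * sizeof *v);

    smg_barrier();
    /* ... parallel work on the shared region v ... */
    smg_barrier();

    smg_free(v);
    smg_finalize();
    return 0;
}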
2.1 Memory Consistency

The primary goal of minimizing communication between nodes is only achievable if a relaxed consistency model is employed, where data and synchronization operations are clearly distinguished and data is made consistent only at synchronization points. The most common are the Lazy-Release (LRC), Scope (SC), and Entry Consistency (EC) models. The choice between them involves a trade-off between the complexity of the programming semantics and the volume of overall control and consistency messages generated by the DSM.

• Release consistency requires that the shared memory areas are only consistent after a synchronization operation occurs [12].

... message-passing library, and the leveraging of useful MPI resources such as profiling tools and debuggers. Its use also insulates the system from platform dependencies and will ease porting to other architectures and platforms in the future. Unfortunately, the current Grid-enabled version of MPICH, MPICH-G2 [18], is based on the MPICH distribution (currently version 1.2.5.2), which has no support for multi-threaded applications. This makes hybridizing hard: the DSM system thread requires the MPI communication channel, yet MPI can only be used in MPI_THREAD_FUNNELED mode.
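This constraint can be made concrete with the standard MPI-2 initialization call. The sketch below is an illustration rather than SMG code: it requests MPI_THREAD_FUNNELED and checks what the library actually grants; with a non-threaded MPICH such as 1.2.5.2, a multi-threaded DSM must funnel all MPI traffic through the single thread that initialized MPI.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Request FUNNELED support: the process may be multi-threaded,
       but only the thread that initialized MPI makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED)
        fprintf(stderr, "MPI library grants no thread support; "
                        "all MPI calls must stay on one thread\n");

    /* ... a multi-threaded DSM runtime would have to route all of
       its communication through this one thread ... */

    MPI_Finalize();
    return 0;
}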