Ryan Saptarshi Ray Int. Journal of Engineering Research and Applications www.ijera.com ISSN: 2248-9622, Vol. 6, Issue 2, (Part - 1) February 2016, pp.49-52

RESEARCH ARTICLE OPEN ACCESS

Distributed Shared Memory – A Survey and Implementation Using OpenSHMEM

Ryan Saptarshi Ray (Junior Research Fellow), Utpal Kumar Ray (Assistant Professor), Ashish Anand (M.E. Software Engineering Student), Dr. Parama Bhaumik (Assistant Professor)
Department of Information Technology, Jadavpur University, Kolkata, India

Abstract
Parallel programs nowadays are written for either multiprocessor or multicomputer environments, and both of these approaches suffer from certain problems. Distributed Shared Memory (DSM) systems are a new and attractive area of research which combines the advantages of shared-memory parallel processors (multiprocessors) and distributed systems (multi-computers). An overview of DSM is given in the first part of the paper. The latter part shows how parallel programs can be implemented in a DSM environment using OpenSHMEM.

I. Introduction

Parallel Processing
The past few years have marked the start of a historic transition from sequential to parallel computation. The necessity to write parallel programs is increasing as systems are getting more complex while processor speed increases are slowing down. Generally one has the idea that a program will run faster if one buys a next-generation processor. But currently that is not the case. While the next-generation chip will have more CPUs, each individual CPU will be no faster than the previous year's model. If one wants programs to run faster, one must learn to write parallel programs, as multi-core processors are currently becoming more and more popular. Parallel programming means using multiple computing resources, such as processors, so that the time required to perform computations is reduced. Parallel processing systems are designed to speed up the execution of programs by dividing a program into multiple fragments and processing these fragments simultaneously. Parallel systems deal with the simultaneous use of multiple computing resources, and can be a single computer with multiple processors, a number of computers connected by a network to form a parallel processing cluster, or a combination of both. Cluster computing has become very common for applications that exhibit a large amount of control parallelism. Concurrent execution of batch jobs and parallel servicing of web and other requests [1], as in Condor [2], achieve very high throughput rates and have become very popular. Some workloads can benefit from concurrently running processes on separate machines and can achieve high throughput on networks of workstations using cluster technologies such as the MPI programming interface [3]. Under MPI, machines may explicitly pass messages, but do not share variables or memory regions directly. Parallel processing systems usually fall into two large classifications, according to their memory system organization: shared-memory and distributed-memory systems.

Multiprocessor Environment
A shared-memory system [4] (often called a tightly coupled multiprocessor) makes a global physical memory equally accessible to all processors. These systems enable simple data sharing through a uniform mechanism of reading and writing shared structures in the common memory. This system has the advantages of ease of programming and portability. However, shared-memory multiprocessors typically suffer from increased contention and longer latencies in accessing the shared memory, which degrades peak performance and limits scalability compared to distributed systems. Memory system design also tends to be complex.

Multicomputer Environment
In contrast, a distributed-memory system (often called a multicomputer) consists of multiple independent processing nodes with local memory modules, connected by a general interconnection network. The scalable nature of distributed-memory systems makes systems with very high computing power possible. However, communication between processes residing on different nodes involves a message-passing model that requires explicit use of send/receive primitives. Also, process migration imposes problems because of the different address spaces. Therefore, compared to shared-memory systems, hardware problems are easier and software problems more complex in distributed-memory systems. [5]
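To make the explicit message-passing model concrete, here is a minimal sketch using the MPI interface [3] mentioned above. It is illustrative only: the ranks, tag, and value sent are arbitrary choices, while the routines shown (MPI_Init, MPI_Send, MPI_Recv, MPI_Finalize) are standard MPI calls.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42; /* the data exists only in rank 0's private address space */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", value); /* an explicit receive was required */
    }
    MPI_Finalize();
    return 0;
}

Note that nothing is shared: rank 1 sees the value only because both sides explicitly cooperate, which is exactly the burden DSM aims to remove.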

Distributed shared memory (DSM) is an alternative to the above-mentioned approaches that operates over networks of workstations. DSM combines the advantages of shared-memory parallel computers and distributed systems. [5],[6]

II. DSM – An Overview
In the early days of distributed computing, it was implicitly assumed that programs on machines with no physically shared memory obviously ran in different address spaces. In 1986, Kai Li proposed a different scheme in his PhD dissertation entitled "Shared Virtual Memory on Loosely Coupled Microprocessors"; it opened up a new area of research that is known as Distributed Shared Memory (DSM) systems. [7]

A DSM system logically implements the shared-memory model on a physically distributed-memory system. DSM is a model of inter-process communication in distributed systems. In DSM, processes running on separate hosts can access a shared address space. The underlying DSM system provides its clients with a shared, coherent memory address space: each client can access any memory location in the shared address space at any time and see the value last written by any client. The primary advantage of DSM is the simpler abstraction it provides to the application programmer. The communication mechanism is entirely hidden from the application writer, so the programmer does not have to be conscious of data movements between processes, and complex data structures can be passed by reference. [8]

DSM can be implemented in hardware (Hardware DSM) as well as in software (Software DSM). A hardware implementation requires the addition of special network interfaces and circuits to the system to make remote memory access look like local memory access, so Hardware DSM is very expensive. A software implementation is advantageous in that only software has to be installed: a software layer is added between the OS and the application layer, and the OS kernel may or may not be modified. Software DSM is more widely used as it is cheaper and easier to implement than Hardware DSM.

III. DSM – Pros and Cons

Pros
Because of the combined advantages of shared-memory and distributed systems, the DSM approach is a viable solution for large-scale, high-performance computing. In multiprocessor systems there is an upper limit to the number of processors which can be added to a single system, but in a DSM system any number of nodes can be added according to requirement. DSM systems are also cheaper and more scalable than both multiprocessor and multi-computer systems. In DSM, message-passing overhead is much lower than in multi-computer systems.

Cons
Consistency can be an important issue in DSM, as different processors access, cache and update a single shared memory space. Partial failures and/or the lack of a global state view can also lead to inconsistency.
IV. Implementation of DSM using OpenSHMEM

An Overview – OpenSHMEM
OpenSHMEM is a standard for SHMEM library implementations which can be used to write parallel programs in a DSM environment. SHMEM is a communications library that is used for Partitioned Global Address Space (PGAS) [9] style programming. The key features of SHMEM include one-sided point-to-point and collective communication, a shared memory view, and atomic memory operations that operate on globally visible or "symmetric" variables in the program. [10]

Code Example
The code below shows the implementation of a parallel program in a DSM environment using OpenSHMEM. The loop bodies, the reduction and the output step were truncated in the printed listing; they are reconstructed here (marked in comments) to match the behaviour and sample output described after the listing.

#include <stdio.h>
#include <shmem.h> /* SHMEM library is included */
#define LIMIT 7

/* Reconstructed: REDUCE_SYNC_SIZE >= BARRIER_SYNC_SIZE, so one pSync
   array can serve both the barrier and the reduction. */
long pSync[SHMEM_REDUCE_SYNC_SIZE];
int pWrk[SHMEM_REDUCE_MIN_WRKDATA_SIZE];
int global_data[LIMIT] = {1,2,3,4,5,6,7};
int result[LIMIT];

int main(int argc, char **argv)
{
    int rank, size, i;
    static int local_data[LIMIT]; /* static so it is symmetric, as collectives require */

    for (i = 0; i < SHMEM_REDUCE_SYNC_SIZE; i++)
        pSync[i] = SHMEM_SYNC_VALUE; /* pSync must be initialised before use */

    start_pes(0);
    size = num_pes();
    rank = my_pe();

    shmem_barrier(0, 0, 3, pSync); /* active set of 3 PEs, matching the run shown below */

    if (rank == 0)
    {
        for (i = 0; i < LIMIT; i++)
            local_data[i] = 0; /* reconstructed: PE 0 contributes zeros to the sum */
    }
    else
    {
        if (rank % 2 == 1)
        {
            for (i = 0; i < LIMIT; i++)
                local_data[i] = global_data[i] + 1; /* odd ranks increment */
        }
        else
        {
            for (i = 0; i < LIMIT; i++)
                local_data[i] = global_data[i] - 1; /* even ranks decrement */
        }
    }

    /* Reconstructed: element-wise sum of local_data across all PEs */
    shmem_int_sum_to_all(result, local_data, LIMIT, 0, 0, size, pWrk, pSync);

    if (rank == 0)
    {
        for (i = 0; i < LIMIT; i++)
            printf("%d ", result[i]);
        printf("\n");
    }
    return 0;
}


In the above program, an array of integers is taken as input. Increment and decrement operations are performed on the array by multiple Processing Elements (PEs) in the network: PEs with odd rank increment the array and those with even rank decrement it. Finally, the element-wise sum of these values is shown as output. With three PEs, each input element x contributes (x + 1) from the odd-ranked PE and (x - 1) from the even-ranked PE, so each output element is 2x.

Various functions of the SHMEM library are used here. Below we give a brief overview of these functions.

start_pes() – This routine should be the first statement in a SHMEM parallel program. It allocates a block of memory from the symmetric heap.

num_pes() – This routine returns the total number of PEs running in an application.

my_pe() – This routine returns the processing element (PE) number of the calling PE. It accepts no arguments. The result is an integer between 0 and npes - 1, where npes is the total number of PEs executing the current program.

shmem_barrier(PE_start, logPE_stride, PE_size, pSync) – This routine does not return until the subset of PEs specified by PE_start, logPE_stride and PE_size has entered this routine at the same point of the execution path. The arguments are as follows:
PE_start – The lowest virtual PE number of the active set of PEs. PE_start must be of type integer.
logPE_stride – The log (base 2) of the stride between consecutive virtual PE numbers in the active set. logPE_stride must be of type integer.
PE_size – The number of PEs in the active set. For example, PE_start = 0, logPE_stride = 1 and PE_size = 2 select PEs 0 and 2.

The code is compiled as follows:
$ oshcc <file_name.c> -o <output_file>

The code is executed as follows:
$ oshrun -np <number_of_PEs> --hostfile <hostfile> <output_file>

Here hostfile is a file containing the IP addresses of all PEs in the network. [13]

Output of the above code for PE_size = 3 was as shown below:
2 4 6 8 10 12 14
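The program above uses collective operations (a barrier and a reduction). The one-sided communication listed among SHMEM's key features can be illustrated with an equally small sketch; the variable name and value below are illustrative, while shmem_int_p and shmem_barrier_all are standard OpenSHMEM routines.

#include <stdio.h>
#include <shmem.h>

int shared_value = 0; /* global, hence symmetric: it exists on every PE */

int main(void)
{
    start_pes(0);
    int me = my_pe();
    if (me == 0)
        shmem_int_p(&shared_value, 99, 1); /* PE 0 writes directly into PE 1's copy */
    shmem_barrier_all();                   /* ensure the write has completed and is visible */
    if (me == 1)
        printf("PE 1 sees %d\n", shared_value);
    return 0;
}

No receive call appears anywhere: PE 1 simply reads its own memory, which is the shared-memory view that makes OpenSHMEM a natural fit for DSM-style programming.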


V. STM in DSM Environment
Software Transactional Memory (STM) [12] is a promising new approach to programming shared-memory parallel processors. It is an alternative to locks for solving the problem of synchronization in parallel programs. It allows portions of a program to execute in isolation, without regard to other, concurrently executing tasks. A programmer can reason about the correctness of code within a transaction and need not worry about complex interactions with other, concurrently executing parts of the program. Up till now, STM codes have been executed in multiprocessor environments only. Much work is going on to implement STM in DSM environments (such as Atomic RMI), and it is expected that this will lead to improved performance of STM. [11] Atomic RMI is a distributed transactional memory framework that supports the control flow model of execution. Atomic RMI extends Java RMI with distributed transactions that can run on many Java virtual machines located on different network nodes, which can host a number of shared remote objects.
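Atomic RMI itself is a Java framework, so as an illustration of the transactional style in this paper's language of choice (C), the sketch below uses GCC's experimental transactional memory extension, compiled with -fgnu-tm. The account variables and the transfer amount are illustrative; this is an analogue of STM usage, not Atomic RMI's API.

/* Compile with: gcc -fgnu-tm stm_sketch.c */
#include <stdio.h>

int account_a = 100, account_b = 0; /* shared state touched by transactions */

void transfer(int amount)
{
    /* The block executes atomically with respect to all other transactions;
       no explicit locks are taken and no lock ordering must be reasoned about. */
    __transaction_atomic {
        account_a -= amount;
        account_b += amount;
    }
}

int main(void)
{
    transfer(40);
    printf("a = %d, b = %d\n", account_a, account_b);
    return 0;
}

A concurrent reader inside its own transaction can never observe the state where the money has left account_a but not yet arrived in account_b, which is precisely the isolation property described above.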

VI. Conclusion
The main objective of this paper was to provide a description of Distributed Shared Memory systems. A special attempt was made to provide an example of the implementation of parallel programs in a DSM environment using OpenSHMEM. From our point of view, further work on exploring and implementing DSM systems to achieve improved performance seems quite promising.

References
[1]. Luiz Andre Barroso, Jeffrey Dean, Urs Holzle, "Web Search For a Planet: The Google Cluster Architecture," IEEE Micro, 23(2):22-28, March-April 2003.
[2]. M. Litzkow, M. Livny, and M. Mutka, "Condor - A Hunter of Idle Workstations," Proceedings of the 8th International Conference on Distributed Computing Systems, June 1988.
[3]. Message Passing Interface (MPI) standard. http://www-unix.mcs.anl.gov/mpi/
[4]. M. J. Flynn, Computer Architecture: Pipelined and Parallel Processor Design, Jones and Bartlett, Boston, 1995.
[5]. Jelica Protic, Milo Tomasevic, Veljko Milutinovic, "A Survey of Distributed Shared Memory Systems," Proceedings of the 28th Annual Hawaii International Conference on System Sciences, 1995.
[6]. V. Lo, "Operating Systems Enhancements for Distributed Shared Memory," Advances in Computers, Vol. 39, 1994.
[7]. Kai Li, "Shared Virtual Memory on Loosely Coupled Microprocessors," PhD Thesis, Yale University, September 1986.
[8]. S. Zhou, M. Stumm, Kai Li, D. Wortman, "Heterogeneous Distributed Shared Memory," IEEE Trans. on Parallel and Distributed Systems, 3(5), 1991.
[9]. PGAS Forum. http://www.pgas.org/
[10]. B. Chapman, T. Curtis, S. Pophale, S. Poole, J. Kuehn, C. Koelbel, L. Smith, "Introducing OpenSHMEM: SHMEM for the PGAS Community," Partitioned Global Address Space Conference, 2010.
[11]. Konrad Siek, Paweł T. Wojciechowski, "Atomic RMI: A Distributed Transactional Memory Framework," Poznan University of Technology, Poland, March 2015.
[12]. Ryan Saptarshi Ray, "Writing Lock-Free Code using Software Transactional Memory," Department of IT, Jadavpur University, 2012.
[13]. http://openshmem.org/site/Documentation/Manpages/Browse
