An investigation of distributed programming

Prashobh Balasundaram

August 22, 2007

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2007

Abstract

This report presents the results and observations of a study on using the Global Arrays toolkit and Unified Parallel C (Berkeley UPC and IBM XL UPC) for distributed shared memory programming. A goal of this project was to improve the memory usage and performance of a molecular dynamics application on massively parallel processing machines. The main target system for this effort was an IBM Bluegene/L service operated by the Edinburgh Parallel Computing Centre. Since the original code implemented a data replication strategy, the problem size that could be solved on the Bluegene/L was limited, and the scaling of the original application was restricted to 128 processors. Through the use of distributed shared memory programming with the Global Arrays toolkit, the maximum system size that could be solved on the Bluegene/L improved by 77.7%. The original application also suffered from poor scaling due to the presence of serial components. By parallelizing the serial routines, the scaling issue was resolved: the distributed shared memory based code is capable of scaling up to 512 processors, while the original application was restricted to 128 processors. In order to compare and contrast the usability and performance of the Global Arrays toolkit against Unified Parallel C, Global Arrays versions of well known benchmarks and Global Arrays and Unified Parallel C versions of an image processing application were developed. The experience gained from this effort was used in a detailed comparison of the features, performance and productivity of Global Arrays and Unified Parallel C.

Contents

1 Introduction

2 Background
  2.1 Distributed Shared Memory programming model
  2.2 Hardware environment
  2.3 Global Arrays toolkit
    2.3.1 Global Arrays Architecture
    2.3.2 Structure of a GA program
    2.3.3 Productivity of GA programming compared to MPI
    2.3.4 Global Arrays on Bluegene/L
  2.4 Unified Parallel C
    2.4.1 UPC Architecture
    2.4.2 Structure of a UPC program
    2.4.3 Productivity of UPC programming compared to MPI
  2.5 Molecular dynamics

3 Results of preliminary investigation on Global Arrays
  3.1 Installation of the Global Arrays toolkit
    3.1.1 Installation of the Global Arrays toolkit on Lomond
    3.1.2 Installation of the Global Arrays toolkit on Bluegene/L
    3.1.3 Installation of the Global Arrays toolkit on HPCx
  3.2 Global Arrays benchmarks
  3.3 Image processing benchmarks using Global Arrays on Bluegene/L

4 Optimizing memory usage of molecular dynamics code using Global Arrays
  4.1 Molecular Dynamics
  4.2 The physical system and its effect on memory
  4.3 Structure of the molecular dynamics application
  4.4 Performance characteristics of original code
  4.5 Design of the GA based molecular dynamics application
  4.6 Performance of application using non blocking communication
  4.7 Result of memory optimization using Global Arrays
    4.7.1 Simulation of an Aqueous system
  4.8 Test results

5 Performance optimization of Molecular dynamics code
  5.1 Analysis of load imbalance and serial components
  5.2 Parallelizing RDF routines
    5.2.1 Result of performance optimization

6 UPC
  6.1 Installation of UPC
    6.1.1 Installation of UPC on Lomond
    6.1.2 Installation of IBM XL UPC on HPCx
  6.2 UPC syntax and common usage patterns
  6.3 UPC benchmarks
    6.3.1 Image reconstruction using shared memory and upc_forall
    6.3.2 Image processing using local memory and halo swaps
    6.3.3 Summary

7 Comparison of Global Arrays and UPC
  7.1 Portability and availability
  7.2 Comparison of GA and UPC syntax
  7.3 Effect of compiler optimisation
  7.4 Interoperability with MPI
  7.5 Comparison of communication latency of GA, UPC and MPI on HPCx

8 Conclusions
  8.1 Future Work

List of Figures

2.1 Shared memory model
2.2 Message passing model
2.3 Distributed shared memory model
2.4 Architecture of the GA toolkit
2.5 Structure of a generic program using Global Arrays
2.6 Comparison of MPI two sided communication to one sided communication
2.7 Complexity and communication pattern
2.8 Architecture of UPC compiler
2.9 Structure of a generic program using UPC syntax
3.1 Ping-pong benchmark using GA & MPI on Lomond
3.2 Ping-pong benchmark using GA & MPI on Bluegene/L
3.3 Ping-pong benchmark using GA & MPI on HPCx
3.4 Program structure of the image reconstruction application
3.5 Application performance - Image processing using GA with different input sizes on Bluegene/L
3.6 Application performance - Image processing using GA vs MPI
3.7 Application performance - Effect of Virtual node mode on Global Arrays based image processing benchmark
4.1 The physical system under simulation
4.2 Memory usage for storing position, velocity related data
4.3 Scaling of original MD code
4.4 Main Loop of the molecular dynamics code
4.5 A detailed view of the conjgradwall routine
4.6 Usage of the Global Array in the molecular dynamics application
4.7 Effect of using large buffers
4.8 Comparison of GA and MPI application performance
4.9 Blocking Vs Non Blocking GA program
4.10 Result of memory optimisation using GA
5.1 Analysis of load imbalance
5.2 The scaling of the MPI application before and after rdf routine was parallelised
5.3 Performance of GA and MPI version of MD code on HPCx
6.1 Ping-pong benchmark of UPC vs MPI communication on Lomond
6.2 Ping-pong benchmark of IBM XL UPC on HPCx compared to MPI performance
6.3 Ping-pong benchmark of Berkeley UPC on Lomond, Intel core duo compared to IBM XL UPC on HPCx
6.4 Image processing using UPC compared to MPI
6.5 Elapsed time for UPC program using shared memory vs serial program
6.6 XL UPC vs XLC compiler optimisation
6.7 Halo swap implemented using upc_memput and upc_memget
6.8 UPC - shared memory vs UPC - Message passing
6.9 Speed up of image processing benchmark using IBM XL UPC on HPCx
7.1 Comparison of communication latency of UPC, GA and MPI on HPCx

Acknowledgements

I thank Dr. Alan Gray for his support and guidance throughout all phases of this project. I also thank Dr. Mark Bull for the guidance and reviews provided.

My sincere thanks go to Mr. Aristides Papadopoulos. His MSc dissertation was used as the documentation of the original molecular dynamics code on which this project is based.

I thank the authors of the Global Arrays toolkit for the excellent documentation and tools support provided throughout the project.

Chapter 1

Introduction

Recent developments in massively parallel computing systems have contributed to the growing popularity of the distributed shared memory (DSM) programming model [1]. Modern massively parallel processing (MPP) systems like the IBM Bluegene/L are comprised of hundreds of thousands of lightweight nodes networked through specialized interconnects, and the interconnects of these machines are equipped with remote direct memory access hardware. Programming models like DSM can be used to exploit the aggregate compute power of these machines effectively. The DSM programming model offers many advantages over the message passing programming model. Its ease of use leads to high developer productivity while providing good performance and application scalability. The DSM toolkits and compilers available today are optimised for a wide range of machines and are therefore portable. Using the DSM programming model, an application developer can extend the capabilities of existing message passing applications with minimal modifications to the code base. This project focuses on two popular DSM programming systems: the Global Arrays (GA) toolkit [2] and Unified Parallel C (UPC) [3]. The main objectives of this project are to investigate DSM programming and to use the GA toolkit to make a molecular dynamics code, currently being used in research, more suitable for modern MPP machines. The project also investigates the ease of use and performance of UPC and compares it with that of the GA toolkit.

A data replication strategy is often used in parallel programs to avoid complex parallel decomposition techniques when implementing message passing programs. Data replication limits the scalability of the parallel code as it leads to higher memory usage per node, and many modern MPP supercomputers have limited memory per node; each chip on the IBM Bluegene/L addresses 512 MB of memory. This project investigates the use of the DSM model, through the GA toolkit, to reduce the memory footprint of a molecular dynamics application. This code was developed at the School of Chemistry, University of Edinburgh by Dr. Paul Madden and Dr. Stewart Reed. By using the DSM programming model the parallel decomposition strategy remains simple and unaltered, leading to higher developer productivity. The main target machine for this code was the IBM Bluegene/L operated by the Edinburgh Parallel Computing Centre (EPCC).

The second chapter of this report details the background and project goals, and provides an overview of the DSM programming model, the hardware used in this project, the GA toolkit, UPC and the molecular dynamics application.

The third chapter presents the observations from the installation and benchmarking of the GA toolkit on Lomond and Bluegene/L. A well understood image processing application was modified to use the GA toolkit. The performance of this application is also presented in the third chapter of this report.

The fourth chapter introduces the physical system represented by the molecular dynamics application, its structure, performance characteristics and memory utilization. The design of the GA based molecular dynamics application and the results of the memory optimization effort are presented in the fourth chapter.

One of the main issues observed in the original molecular dynamics code was its poor scalability. The fifth chapter presents the results of an investigation into the root cause of the scalability issue, the modifications implemented and the outcome of the performance optimization effort.

The sixth chapter of this report presents the results of an investigation of UPC programs on Lomond and HPCx. The image processing code was implemented in UPC and its performance was compared to that of the GA toolkit version. This chapter presents the outcome of this effort.

A comparison between the features of the GA toolkit and UPC is detailed in the seventh chapter of this report. This chapter compares GA and UPC in terms of developer productivity and ease of use of the syntax, and compares the performance achieved.

Finally, the conclusions from the observations of this project are summarized in the eighth chapter. Future work identified from this project is also detailed as part of this chapter.

Chapter 2

Background

The most common parallel programming models are the SMP (shared memory programming) model and the message passing programming model. While the SMP programming model is used widely on symmetric multiprocessor machines, the message passing model is used on symmetric multiprocessor machines, MPP (massively parallel processing) machines and SMP clusters [4].

The symmetric multiprocessor architecture (Figure 2.1) allows all processors to access a central shared memory, and all processors incur almost the same communication time to access it. The shared memory programming model relies on this common shared memory for inter-processor communication. OpenMP is the most commonly used programming language for SMP.

Figure 2.1: Shared memory model

MPP systems (Figure 2.2) are designed to use many nodes connected together using a low latency interconnect. Each node is an independent processing unit complete with a dedicated processor and memory. These distributed nodes communicate with each other by sending messages through the low latency interconnect. The individual nodes of the MPP system may be SMP servers, and such an MPP system is referred to as an SMP cluster.

MPI (Message Passing Interface) is the de facto standard used for programming distributed memory machines. The complexity of programming applications in MPI is considerably higher than with OpenMP, because message passing requires regular communication patterns in which a send operation on one processor must be matched to a receive operation on another processor. However, MPI is the most widely used parallel programming model, since it provides portability across machine architectures.

Figure 2.2: Message passing model

2.1 Distributed Shared Memory programming model

Figure 2.3: Distributed shared memory model

The DSM programming model (also referred to as the partitioned global address space programming model) attempts to provide a shared memory style of programming on a distributed memory machine (Figure 2.3). The memory of each node of the distributed memory machine is partitioned into a global (shared) area and a local area. While the local memory is private to its owner, the shared memory can be accessed by all processes. This is accomplished by using remote direct memory access (RDMA) facilities provided by the underlying hardware; many modern supercomputers, like HPCx, provide RDMA facilities at the hardware level [5]. If the hardware does not support RDMA, message passing is used to implement the distributed shared memory. The complexity of accessing remote memory is hidden by the DSM model. However, the cost of accessing remote memory is significantly higher than accessing local memory, so the DSM model exposes the non-uniform memory characteristics of the underlying hardware.

2.2 Hardware environment

The investigations described in this report were conducted mainly on three HPC facilities.

Lomond: Sun Fire e15k

Lomond [6] is a Sun Fire e15k SMP server, owned and operated by Edinburgh Parallel Computing Center. It is partitioned into a 48 processor back end and a 4 processor front end. It was used as the primary development server for GA programs as well as Berkeley UPC based code.

BlueSky service: Bluegene/L

The Bluesky HPC service [7] was used primarily for investigations of the molecular dynamics application using both GA and MPI. This service is comprised of a single rack Bluegene/L system. The Bluegene/L system uses 1024 IBM PowerPC 440 dual core chips interconnected using five specialized networks. Each node of the Bluegene/L can access 512 MB of memory. Bluegene/L supports two modes of operation - communication coprocessor mode and virtual node mode. In communication coprocessor (CO) mode, one core of each chip can access the full 512 MB of memory, while the other core handles communication operations. In virtual node (VN) mode, each core accesses 256 MB of memory and both cores actively participate in computation; this mode requires two MPI processes to be loaded onto a single node.

The architecture of Bluegene/L [8] is unique. Each of the 1024 chips in the Bluegene/L rack is a system-on-a-chip designed as an Application Specific Integrated Circuit (ASIC). The ASIC holds two 32 bit IBM PowerPC 440 cores and provides all the functionality of a compute node. The main memory is external to the chip and is mounted on the compute card. Each chip is equipped with a 64 bit double floating point unit and is capable of operating in SIMD (single instruction multiple data) mode. Each PowerPC 440 core has its own instruction and data caches and a small L2 pre-fetch buffer, while the two cores share a 4 MB L3 cache built from embedded dynamic random access memory (DRAM) and a double data rate memory controller. Each core of the ASIC operates at 700 MHz and is designed to minimize heat dissipation.

Two ASIC chips are mounted on each compute card, which also holds 512 MB of DDR memory per chip. Sixteen such compute cards are packed onto a single node card, and each node card supports up to 2 I/O cards. Each rack is comprised of 32 node cards.

Bluegene/L internal networking is comprised of five different networks integrated into the ASIC chip: a three dimensional torus network, a global tree network, a global barrier and interrupt network, gigabit ethernet and a JTAG control network. The torus network supports point to point communication, while the global tree network provides fast global operations such as reductions and broadcasts. A separate network offloads barrier and interrupt traffic from the tree network used for other global operations. I/O operations are performed over the gigabit ethernet through the I/O nodes. The JTAG network is also an ethernet network and is used for system administration and monitoring purposes.

HPCx: IBM P575 SMP Cluster

The HPCx service [9] is the primary national supercomputing service of the United Kingdom. It is a cluster of IBM SMP servers.

Each compute node of the HPCx system is a 16 processor IBM eServer 575 server. The IBM POWER5 processors on the HPCx nodes operate at 1.5 GHz. The processors have on-chip L1 and L2 caches; the L2 cache is 1.9 MB and is shared between 2 processors. Four chips are packaged into a multi-chip module, and two multi-chip modules form a frame. Each multi-chip module can address 128 MB of L3 cache and 16 GB of main memory. The total memory of each frame is shared between the 16 processors of the frame. The frames are interconnected using the IBM HPS (High Performance Switch).

The main HPCx service is comprised of 160 compute nodes, and provides access to a total of 2560 processors.

2.3 Global Arrays toolkit

The GA toolkit [2] provides a shared memory interface for distributed memory machines. Using the GA toolkit, distributed arrays which span multiple nodes can be created. All nodes can access the shared array using a common array indexing scheme, irrespective of the locality of the shared array segment. This feature simplifies parallel software development. The GA toolkit implements the communication routines needed to enable the shared memory abstraction on distributed memory computers, and is able to use one sided communication on machines which support remote direct memory access. The GA toolkit is designed to coexist with MPI, and hybrid programming using GA with MPI is recommended [10]. The GA toolkit can be installed to use MPI, TCGMSG or TCGMSG-MPI as its communication layer; TCGMSG [11] is a portable message passing toolkit, while TCGMSG-MPI is an implementation of the TCGMSG interface over MPI. The GA toolkit is available on a wide range of machines and offers excellent portability.

The GA toolkit mainly supports four distributed shared memory operations: get, put, scatter and gather. It also implements atomic read-and-increment and accumulate operations. Synchronization is made possible through the use of fences, locks and global barriers. These operations are one-sided, made possible by the global-array-index to address translation implemented in the GA toolkit. Global Arrays thus provides a higher level interface to distributed shared memory.

The GA toolkit is recommended by its designers for use in algorithms and applications with dynamic or irregular communication patterns, and for applications which need dynamic load balancing. Applications with dynamic communication patterns are hard to parallelise using pure MPI; MPI is more suitable for applications where the communication patterns are known in advance. MPI communication calls are two sided and need matching send and receive operations. GA communication routines, on the other hand, use one sided communication: the target processor is not declared by the programmer, who instead specifies the location of the array segment in the global address space. Data locality must be known in advance to use GA based communication. Global Arrays is not well suited to applications with regular point to point communication patterns, as it is optimised for bulk data transfer.

When a global array is created, the programmer can choose to distribute the global array across the available nodes by specifying chunk sizes, or can allow the library to select the data distribution. The data distribution block size can be specified explicitly for all dimensions of the multidimensional array, or one dimension can be distributed explicitly while the library distributes the rest of the dimensions implicitly. Irregular distribution of the array dimensions can be implemented by providing a map of the desired distribution pattern. The locality information of a global array can be queried through user friendly interfaces, which allows programmers to control data locality explicitly and minimize communication costs.
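As an illustration of the locality query interface, the short C sketch below asks the library which patch of a previously created global array is owned by the calling process. The handle g_a and the two dimensional shape are assumptions made for the example; NGA_Distribution and GA_Nodeid are part of the GA C interface.

    /* Query the bounds of the patch of a 2-D global array owned by this
       process; g_a is assumed to have been created earlier with NGA_Create(). */
    int me = GA_Nodeid();
    int lo[2], hi[2];

    NGA_Distribution(g_a, me, lo, hi);  /* patch owned by process 'me'          */
    /* GA reports an empty range (e.g. a negative lower bound) when 'me' owns
       no elements; otherwise the patch can be touched through NGA_Access().    */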

The GA toolkit implements a rich set of functions that operate on whole arrays as well as array subsections. These operations are collective, data parallel operations. Copying a global array to another global array can be accomplished with a single function call; copying a patch of a global array into another global array is likewise a data parallel collective operation, and the patch can be transposed while it is transferred. Many elementary matrix operations, such as transpose, scaling, elemental addition, matrix multiplication, dot products and basic linear algebra operations, are also available as part of the GA toolkit.

To support halo swaps, the GA toolkit allows a shared array to be defined with ghost cells at its boundaries, whose width is controlled by the programmer. A halo swap between adjacent patches can then be performed with a single GA function call. On SMP clusters the GA toolkit can leverage shared memory to cache data from neighbouring nodes and improve application performance; this technique is referred to as mirroring.
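A minimal C sketch of this ghost cell facility is shown below. The array size, ghost width and array name are illustrative assumptions; NGA_Create_ghosts and GA_Update_ghosts are the relevant GA calls.

    /* Create a 2-D global array with a one-cell ghost region on every
       boundary and refresh the halos with a single collective call. */
    int dims[2]  = {1024, 1024};
    int width[2] = {1, 1};      /* ghost cell width in each dimension      */
    int chunk[2] = {-1, -1};    /* let the library choose the distribution */

    int g_field = NGA_Create_ghosts(C_DBL, 2, dims, width, "field", chunk);

    /* ... each process updates its local patch ... */

    GA_Update_ghosts(g_field);  /* halo swap between neighbouring patches  */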

2.3.1 Global Arrays Architecture

The layered architecture of the GA toolkit [12] is depicted in Figure 2.4. The GA toolkit supports many system level interfaces and networking protocols, including IBM LAPI, GM/Myrinet, Infiniband/VIA, threads, Quadrics Elan4/Elan3, and BGML (IBM Bluegene/L). This gives the GA toolkit excellent portability across machine architectures.

The GA toolkit can be installed to use either MPI or ARMCI as its communication layer, and provides complete interoperability between the ARMCI and MPI layers. On many systems it is recommended to use the vendor supplied MPI instead of ARMCI. For example, on the IBM Bluegene/L the MPI layer is customized to exploit the novel architecture of the Bluegene/L hardware, so it is beneficial to use the MPI layer for communication.

The GA toolkit uses the Aggregate Remote Memory Copy Interface (ARMCI) [13] to implement one-sided communication. ARMCI is a general purpose, portable library that allows remote memory access to contiguous and non contiguous data. It provides high performance one-sided communication by exploiting native communication interfaces; on clustered systems it uses high performance network protocols, and supports Myrinet (GM), Quadrics, Giganet (VIA) and Infiniband. It is compatible with MPI, and is supported on many systems such as the IBM SP, Cray (X1, T3E, SV1, J90), Fujitsu (VX/VPP, PRIMEPOWER), NEC SX-5, Hitachi SR8000, and IBM Bluegene/L. ARMCI implements much of the critical functionality exposed by the GA toolkit, such as data transfer operations, atomic operations, locks, memory management and synchronization operations. The very low overhead implementation of ARMCI is capable of achieving near peak bandwidth on cluster interconnects and provides significant performance benefits to Global Arrays. The GA toolkit routines can issue multiple non blocking ARMCI data transfer operations to overlap computation and one sided communication.

Figure 2.4: Architecture of the GA toolkit (application programming interfaces in F77, F90, C, C++, Python and Java; distributed arrays layer for memory management and index translation; MPI and ARMCI; system specific interfaces such as LAPI, GM/Myrinet and threads)

The distributed arrays layer of the GA toolkit implements distributed memory management and index translation. Distributed memory management is performed by the memory allocator module, while the index translation layer implements a user friendly interface for addressing the distributed global array.

Application programming interfaces are available in many leading HPC languages including Fortran 77, Fortran 90, C, C++, Python and Java.

The GA routines depend on the execution environment provided by MPI, and are designed to enhance the usability of MPI. This allows the GA routines to interoperate with many MPI libraries and to utilise popular HPC numerical packages; for example, the GA toolkit relies on ScaLAPACK [14] for linear algebra functionality.

2.3.2 Structure of a GA program

The structure of a generic GA program is shown in Figure 2.5. GA based programs need to include header files from the GA toolkit: for C programs the files global.h, ga.h and macdecls.h are included, while Fortran programs include mafdecls.fh and global.fh. If the GA toolkit was installed to use MPI as the communication layer, an MPI environment is required for the GA program: MPI is initialised first, then the GA environment, and finally the memory allocator. The global array is then created and updated by the program. After the global array has been updated, each process can either copy a part of the global array into a local buffer and work on it, or access the locally resident segment of the global array using pointers. After the work is completed, the GA environment is terminated, following which the MPI environment is terminated.

If the GA toolkit was compiled with TCGMSG, or with TCGMSG built over MPI, the program structure changes only slightly: the TCGMSG environment (PBEGIN_/PEND_) is used in place of the MPI environment.
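The C sketch below illustrates the MPI-based structure. The array shape, the memory allocator limits and the array name are assumptions chosen for the example, and the calls shown (GA_Initialize, MA_init, NGA_Create, GA_Sync, GA_Terminate) follow the GA C interface.

    /* Minimal sketch of a GA program built on an MPI-based installation. */
    #include <mpi.h>
    #include "ga.h"
    #include "macdecls.h"

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);              /* start the MPI environment       */
        GA_Initialize();                     /* start the GA environment        */
        MA_init(MT_DBL, 1000000, 1000000);   /* initialise the memory allocator */

        int dims[2]  = {1024, 1024};
        int chunk[2] = {-1, -1};             /* let the library pick the layout */
        int g_a = NGA_Create(C_DBL, 2, dims, "work_array", chunk);

        /* Each process updates its part of the array with NGA_Put(), copies
           remote patches into a local buffer with NGA_Get(), or touches the
           locally resident segment directly with NGA_Access(), then computes. */

        GA_Sync();                           /* global barrier before shutdown  */
        GA_Destroy(g_a);
        GA_Terminate();                      /* terminate GA, then MPI          */
        MPI_Finalize();
        return 0;
    }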

Figure 2.5: Structure of a generic program using Global Arrays

9 2.3.3 Productivity of GA programming compared to MPI

Developer productivity in high performance computing can be measured in terms of the ease of use of the parallel programming model and the parallel efficiency of the resulting program. GA programs differ from MPI based programs in terms of developer productivity, and the difference stems primarily from the differences in the nature of MPI and GA communication routines.

Figure 2.6 represents the two sided communication model followed by the MPI communication routines, in which both the sending and the receiving processors are involved in communication. To transfer a block of data from a data structure on processor N to another data structure on processor 0, the following operations need to be performed when using MPI. First, processor N identifies the location and size of the data blocks to be transferred, and processor 0 identifies the location and size of the data it needs to receive from processor N. The data that processor N needs to send is then packaged into a buffer and sent by calling an MPI send routine. Processor 0 issues a receive operation after allocating a receive buffer. Once the matching sends and receives are paired, the communication takes place. After the communication completes, the received data is copied out of the buffer for use in computation.

When using the GA toolkit, the operation described above is reduced to a single call of the nga_get routine [15]. To transfer data from a segment of the global array residing on processor N, the nga_get(g_a,lo,hi,buffer,ld) call is used, where the lo and hi arrays mark the data to be transferred from the global array to the calling processor. This syntax can therefore also handle data transfers from multidimensional array segments.
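A short C sketch of such a one-sided fetch is given below; the handle g_a, the patch bounds and the buffer size are illustrative assumptions, and no matching call is needed on the process that owns the patch.

    /* Fetch a 100x64 patch of a 2-D global array into a local buffer. */
    int    lo[2] = {100, 0};    /* first element of the patch (inclusive) */
    int    hi[2] = {199, 63};   /* last element of the patch (inclusive)  */
    int    ld[1] = {64};        /* leading dimension of the local buffer  */
    double buffer[100 * 64];

    NGA_Get(g_a, lo, hi, buffer, ld);  /* one-sided: the owner is not involved */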

Figure 2.7 depicts how the complexity of an MPI program increases with the shape and location of the data to be transferred within the global array. As the pattern of the data to be accessed from the distributed array becomes more complex, the complexity of the MPI program increases. The GA program does not become more complex in such a situation, as the only change needed is in the values of the lo and hi arrays.

The GA toolkit is completely interoperable with MPI. The GA toolkit can use MPI as its communication layer, and distributed arrays created with GA can be used in MPI applications. Using such hybrid programming, applications can achieve a trade off between performance and the memory footprint of the application.

Program complexity can arise from syntactic complexity, the length of the program, or conceptual/semantic complexity. Syntactic complexity is a measure of the complexity of translating the algorithm into code; in the context of parallel computing, it depends heavily on the complexity of converting the serial code into parallel code. The global array indexing scheme provided by the GA toolkit reduces the syntactic complexity of parallel code.

As discussed earlier in this section, GA syntax is more concise than MPI code when complex data transfers need to be performed. The number of lines of parallel code is therefore significantly reduced.

Conceptual or semantic complexity refers to the complexity arising from the difference between the structure of the parallel program and the structure of the original program. The structure of a GA based parallel program is simpler than that of the MPI program due to the nature of the GA toolkit.

Figure 2.6: Comparison of MPI two sided communication to one sided communication

Figure 2.7: Complexity and communication pattern

However, the difference between GA based code and an equivalent serial code remains high, because GA programs still need to perform parallel decomposition in order to achieve good performance. In contrast, parallel programming languages like OpenMP score highly in reducing conceptual/semantic complexity, as they rely on task parallel programming by distributing serial loops.

2.3.4 Global Arrays on Bluegene/L

The IBM Bluegene/L supercomputer supports the ARMCI operations through a low level communication interface that exposes the communication hardware features of Bluegene [16]. The GA toolkit became available on Bluegene/L only recently; previously the Bluegene/L software stack supported only MPICH2. The low level communication interface allows ARMCI, MPI-2 one-sided communication and UPC to take advantage of the communication hardware features of Bluegene/L. This message layer is a protocol engine that processes incoming and outgoing packets in cooperation with the hardware layer. The protocols defined on the message layer enable one-sided communication on the Bluegene/L.

Three protocols enable one-sided communication on Bluegene/L. The PUT protocol copies data from the origin node to the target node, while the GET protocol is the reverse operation. In the PUT protocol the target sends an acknowledgement on completion of the data transfer; after a GET operation completes, the target does not send an acknowledgement. The third protocol is the RMW (Read-Modify-Write) protocol: on receipt of the task initialization packet, the target node performs the operation requested by the source node, and on completion it sends a response packet back to the origin node to signal message completion. The RMW protocol also supports a swap operation, in which the data at the target node is exchanged with the origin node's data. The fetch operation implemented as part of the RMW protocol is used by ARMCI to implement sum and product operations, which are performed on the target node with the result stored there.

Special optimisations are implemented on the Bluegene/L to improve the bandwidth and latency of one-sided operations. Bandwidth is optimised by sending data on all available networks exiting a node; this achieves the maximum bandwidth available for communication and proves beneficial when large data transfers are needed. To reduce communication latency, the communication operation is divided into two overlapping phases. In the first phase the message context is established; since the cost of establishing the message context is high, this phase simultaneously transfers data in self-describing packets. Once the message context is established, full payload data packets can be issued. This latency hiding helps implement a low latency, high bandwidth interconnect system. ARMCI can sustain greater bandwidth for large message sizes than MPI messaging. GA inherits the above performance optimisations, and this promotes application scalability.

2.4 Unified Parallel C

Unified Parallel C is a distributed shared memory programming language that provides a parallel extension to the C programming language [3]. Many distributions of UPC are available today. This project investigated the use of IBM XL UPC [17] and Berkeley UPC: Berkeley UPC was installed on Lomond, and IBM XL UPC was investigated on HPCx.

UPC supports both task parallel and data parallel programming. The primary goal of UPC is to provide an easy to use programming interface to distributed memory machines without sacrificing performance and scalability. Using UPC, a collection of threads can be spawned to operate on a global address space partitioned across the threads. The threads are data locality aware and have an affinity towards their private space and a portion of the global address space.

The UPC memory model [18] divides the distributed memory of a cluster of machines into private and shared spaces. Each thread has an affinity to a partition of the shared memory. UPC allows the creation of shared and private variables, as well as shared and private pointers. A shared pointer is used to point to addresses in the shared address space, while a private pointer can reference addresses in the thread's private space or in the local portion of the shared space.

Using UPC, the programmer can easily distribute data across processors using simple declarations. The UPC syntax allows easy exploitation of data locality: it is easy to identify the locality of the elements in the shared space, and this information can be used to implement parallel programs that leverage data affinity. UPC also provides synchronization facilities in the form of barriers and locks. Since data is shared between threads, UPC implements two memory consistency models - strict and relaxed - where the strict mode enforces ordering of independent shared access operations. UPC also supports dynamic memory allocation in both the shared address space and the private space.
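The following C sketch illustrates these declarations; the block size, array name and assignments are assumptions made for the example.

    /* Shared and private data in UPC. Each thread owns one contiguous
       block of BLOCK elements of the shared array. */
    #include <upc_relaxed.h>

    #define BLOCK 256

    shared [BLOCK] double grid[BLOCK * THREADS];

    shared double *sp;  /* pointer-to-shared: may reference any thread's partition */
    double *lp;         /* private pointer: local data or the locally owned part   */

    int main(void) {
        sp = &grid[0];                           /* address in the shared space     */
        lp = (double *)&grid[MYTHREAD * BLOCK];  /* cast is valid only for elements
                                                    with affinity to this thread    */
        lp[0] = (double)MYTHREAD;                /* fast access to local elements   */
        upc_barrier;                             /* synchronise all threads         */
        return 0;
    }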

Figure 2.8: Architecture of UPC compiler (UPC code translator; translator generated C code, a platform independent layer; Berkeley UPC runtime, a network independent layer; GASNet communication layer; network hardware)

2.4.1 UPC Architecture

The Berkeley UPC compiler is comprised of multiple layers of abstraction (Figure 2.8). Above the network layer sits GASNet [19] (Global-Address Space Networking), a language independent layer that provides network independent, low level, high performance communication primitives used to implement distributed shared memory programming languages like UPC, Titanium [20] and Co-Array Fortran [21]. A network independent layer called the Berkeley UPC runtime provides the communication functionality needed by UPC programs. UPC programs are translated by the UPC translator into C code embedded with calls to the Berkeley UPC runtime, so the translator generated C code is platform independent. The layered architecture of UPC makes it easier to port the code across different platforms.

2.4.2 Structure of a UPC program

The UPC syntax is very different from that of MPI. UPC attempts to provide a shared memory programming syntax over a distributed memory hardware architecture. Unlike the GA toolkit, which uses a library based approach in which calls need a large number of parameters to identify data locality, UPC implements data locality aware threads: each UPC thread possesses affinity to the processor which owns the data that it operates upon.

Figure 2.9 represents the structure of a generic UPC program. UPC programs can be implemented in three different ways. The first approach uses loop parallelism and the memory affinity of the threads to distribute work across many threads; this is the simplest way of implementing a UPC program when a serial program is provided as the starting point. The second approach is to parallelise the application using domain decomposition techniques and to implement the halo swaps using the upc_memput and upc_memget routines (see the sketch below). This technique increases the syntactic complexity of the parallel program, since the parallel program uses buffers updated by one sided communication calls to the shared memory; however, the presence of the globally accessible shared memory still makes the code structure simpler than MPI, and in this mode the UPC program structure is comparable to the GA program structure. The third approach is to access the shared data structures using shared and private pointers, which is more complex than the prior two modes.
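A hedged C sketch of the second approach follows. The decomposition (one block of rows per thread), the array names and the sizes are assumptions for illustration; upc_memput and upc_memget are the standard UPC bulk copy routines.

    /* Each thread keeps a private block with a halo row above its interior
       rows; halos are exchanged through a shared boundary buffer. */
    #include <upc_relaxed.h>

    #define NX 256                 /* interior rows per thread */
    #define NY 256                 /* columns                  */

    /* One boundary row per thread, with affinity to the thread publishing it. */
    shared [NY] double boundary_row[NY * THREADS];

    double local[NX + 2][NY];      /* rows 1..NX plus halo rows 0 and NX+1 */

    void swap_halo_from_above(void) {
        /* Publish my last interior row into my slot of the shared buffer. */
        upc_memput(&boundary_row[MYTHREAD * NY], &local[NX][0],
                   NY * sizeof(double));
        upc_barrier;
        /* Read the row published by the thread above me into my top halo. */
        if (MYTHREAD > 0)
            upc_memget(&local[0][0], &boundary_row[(MYTHREAD - 1) * NY],
                       NY * sizeof(double));
        upc_barrier;
    }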

Figure 2.9: Structure of a generic program using UPC syntax

2.4.3 Productivity of UPC programming compared to MPI

The UPC syntax is tightly integrated with the C programming language. This eliminates one level of complexity faced when using Global Arrays or MPI - the complexity of installing the library and linking it with the parallel code.

The UPC syntax is very simple [22], as it attempts to implement an OpenMP style programming model. Loop parallelism is implemented in UPC through the upc_forall statement, which allows a loop to be parallelised easily among the threads. Data structures declared as shared are distributed across the nodes, and each thread's affinity to a part of the shared array is used automatically to parallelise the loop. This eliminates the need to implement halo swaps, which MPI programs as well as GA programs need to implement explicitly. The syntactic complexity of UPC programs is therefore very low.
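A small C sketch of this style is given below; the array names, their sizes and the scaling operation are illustrative assumptions. Each iteration of the upc_forall loop executes on the thread that has affinity to the element named in the affinity expression.

    /* Data parallel loop over a shared array using upc_forall. */
    #include <upc_relaxed.h>

    #define N 1024

    shared double a[N * THREADS];
    shared double b[N * THREADS];

    void scale(double factor) {
        upc_forall (int i = 0; i < N * THREADS; i++; &a[i]) {
            a[i] = factor * b[i];      /* run by the thread that owns a[i] */
        }
        upc_barrier;                   /* wait for all threads to finish   */
    }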

UPC programs implemented using the distributed shared memory and the upc_forall statement do not use explicit halos, and this greatly reduces the length of the program.

The UPC syntax scores highly in the reduction of semantic complexity as well. Since the serial program structure can be retained in most practical scenarios, the conceptual complexity of the parallel code is almost the same as that of the serial code.

A major part of this project investigated the productivity of GA as well as UPC. In a high performance computing environment, productivity also depends on the ease of achieving good parallel performance. This investigation was performed using a well known image processing example, implemented using both the GA toolkit and UPC. The results of this investigation are presented in the seventh chapter of this report.

2.5 Molecular dynamics

Molecular dynamics is a powerful computer simulation technique which solves the classical many body problem at the atomic level by allowing atoms and molecules to interact over time, constrained by the known laws of physics [23]. Molecular dynamics simulation uses numerical methods to determine the properties of complex systems and is considered a simulation experiment. It is a statistical mechanics method in which the evolution of many interacting atoms is determined by integrating their equations of motion. Molecular dynamics simulations are widely used to study many physical systems, such as the behavior of liquids [24], defects in crystals, fracture of solids, surfaces, friction, the behavior of clusters of molecules, biomolecules, and the electronic properties and dynamics of materials.

The laws of classical mechanics, like Newton's laws of motion, are applied to each atom in a complex system of atoms. Long range and short range forces act on each atom due to its interaction with every other atom in the system. Since the atoms are in motion, their relative positions and the forces acting on them change continuously. A potential energy function, which depends on the positions of the particles in 3D space, is used to calculate the forces acting on the atoms. The potential function is chosen to reflect the nature of the material as well as the physical conditions of the simulation, and is a key component of the model of the physical system. The Lennard-Jones (L-J) pair potential [23] is widely used to represent the interaction model in molecular dynamics simulations. The L-J potential is a mathematical model that approximates the behavior of atoms under attractive long range forces (van der Waals forces) and repulsive short range forces (Pauli repulsion). In systems used to study electronic properties, Coulomb (electrostatic) forces also play a role in determining the state of the system.
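For reference, the Lennard-Jones pair potential has the standard form below, where r is the interatomic separation, epsilon is the depth of the potential well and sigma is the separation at which the potential crosses zero (the specific parameterisation used by the application is not given here):

    V_{\mathrm{LJ}}(r) = 4\varepsilon \left[ \left( \frac{\sigma}{r} \right)^{12} - \left( \frac{\sigma}{r} \right)^{6} \right]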

Molecular dynamics simulation is therefore a many body problem. These simulations are compute intensive and scale as O(N²), where the size of the system, N, is the number of atoms interacting with each other: the computational cost increases as the square of the number of interacting atoms. The accuracy of the simulation improves with the number of compute iterations performed.

A molecular dynamics simulation proceeds in discrete steps: the evolution of the system is discretized into small time steps whose duration is chosen to avoid discretization effects. A trade off is made between the execution time of the simulation and the desired accuracy, so that accurate solutions can be obtained within acceptable time frames. As the computation progresses, statistical data is collected and written out periodically. Upon completion of the simulation, the configuration of the final state of the system is written to restart files, which can be used to start the next run.

Molecular dynamics simulations are very compute intensive, and many molecular dynamics codes use parallel computing techniques to decrease the execution time; this enables more accurate simulations to be performed in the same time frame. The maximum size of the system that can be simulated depends on the memory available on the computer: larger systems can be simulated on machines with more memory. However, SMP systems with large memories are often more expensive than clusters of commodity servers or MPP systems of the same size. This motivates the use of DSM programming in molecular dynamics codes, since with DSM a trade off can be reached between aggregate memory usage and the performance of the MD code. This project investigates the use of DSM programming on a molecular dynamics code currently used by researchers at the School of Chemistry, University of Edinburgh to study electron transfer from electrodes to an electrolyte. The code has many applications in industry; one such application is the study of the electrolytic smelting of aluminium, a process that is very energy inefficient and for which research is aimed at reducing energy use.

Molecular dynamics simulations can be parallelised because the computation of the forces acting on each particle, as well as the updates to the velocities and positions of each particle, can be computed independently [25]. The most significant loop that is parallelised is the calculation of the forces. Many parallel decomposition techniques are available for use in molecular dynamics codes; the most common are atom decomposition, pair decomposition, checkerboard pair decomposition, force decomposition, systolic loops and domain decomposition. These decomposition techniques vary significantly in complexity and ease of implementation, as well as in the memory requirements and performance of the parallel code. The work by Aristides Papadopoulos reviewed several of these decomposition techniques while parallelising the original serial molecular dynamics code. A few of the observations made by Aristides [26] are described in the following paragraph.

Atom decomposition uses a replicated data strategy. This technique is easy to implement, assuming that the developer starts from the serial molecular dynamics code: work is distributed over N processors by parallelising the iterative loops used in the molecular dynamics code. The technique wastes memory, since the data is replicated on all nodes of the parallel computer. The cyclic-pair decomposition technique can reduce the memory requirements by using only the elements of the force matrix below the main diagonal; however, the memory requirements of the code are still high enough to prevent the simulation of large systems. The force decomposition technique uses a block decomposition of the force matrix and aims to reduce the global communication needed to parallelise the code, but it increases the complexity of the parallelisation process while the performance gains on modern MPP machines are not significant. The domain decomposition technique splits the simulation volume into many cells and assigns each cell to a processor. This helps reduce the memory requirements of the code, as each node does not need to store the full configuration data, but it adds complexity since particles need to be reassigned to new processors as they cross sub-domain boundaries. The data replication strategy was used by Aristides to parallelise the serial molecular dynamics application, considering the simplicity of the parallel decomposition strategy and the time constraints under which that project was executed.

This project investigates the use of DSM programming to reduce the memory usage of the molecular dynamics application per node. By storing the large arrays in the distributed shared memory using the GA toolkit, the memory consumption of the code can be reduced significantly, without any drastic alteration to the code structure. This allows larger system sizes to be simulated on modern MPP hardware. This is especially important for machine architectures like the IBM Bluegene/L.

Chapter 3

Results of preliminary investigation on Global Arrays

The GA toolkit [2] is open source software available for free download. It is developed and maintained by the William R. Wiley Environmental Molecular Sciences Laboratory at the Pacific Northwest National Laboratory, and has been available in the public domain since 1994. The GA toolkit can be used on a wide range of hardware architectures including distributed memory machines, shared memory machines and networks of workstations. To use the GA toolkit, the source code is downloaded and compiled on the target machine. The compilation process generates a set of include files and libraries, which can be included in and linked against the application code to utilise the distributed shared memory functionality provided by the GA toolkit.

3.1 Installation of the Global Arrays toolkit

The GA toolkit was installed on Lomond, HPCx and Bluegene/L. On Lomond and HPCx, versions of the GA toolkit from 4-0-2 onwards were installed successfully. However, on the Bluegene/L the versions of the GA toolkit prior to 4-0-6 failed to install successfully, even though Bluegene/L is available as a target from GA version 4-0-2 onwards. The issues faced in the installation of the GA toolkit on Bluegene/L were attributed to bugs in the GA toolkit releases prior to GA 4-0-6.

The GA toolkit can be installed in three configurations - GA using MPI, GA using TCGMSG and GA using TCGMSG-MPI (refer to section 2.3 for details of the GA architecture). GA using MPI is recommended as the primary mode of installation, since this allows GA to benefit from system specific MPI optimizations. All installations of the GA toolkit were performed using MPI as the communication layer.

3.1.1 Installation of the Global Arrays toolkit on Lomond

The Lomond service was used as the primary development machine for investigating GA performance. The GA toolkit installation was found to be straightforward on Lomond.

A makefile is provided with the GA toolkit; on Lomond, it can be invoked by issuing the make command from the root folder of the GA toolkit after setting certain environment variables. These environment variables differ from system to system. On Lomond, the following environment variables were used to compile GA 4-0-6.

• TARGET=SOLARIS
• USE_MPI=yes
• MPI_INCLUDE=/opt/SUNWhpc/include
• MPI_LIB=/opt/SUNWhpc/lib
• FC=mpf77
• CC=mpcc

The submission script for a GA based program is similar to that of an MPI program on Lomond.

3.1.2 Installation of the Global Arrays toolkit on Bluegene/L

Installation of the GA toolkit on Bluegene/L was not straightforward. Though Bluegene/L was supported by the GA toolkit from version 4-0-2 onwards, the installation was stable only with version 4-0-6.

On the EPCC Bluegene/L, the following options were used to compile the GA toolkit.

• BGLSYS_DRIVER=/bgl/BlueLight/ppcfloor
• BGLSYS_ROOT=$BGLSYS_DRIVER/bglsys
• BLRTS_GNU_ROOT=$BGLSYS_DRIVER/blrts-gnu
• BGDRIVER=$BGLSYS_DRIVER
• BGCOMPILERS=$BLRTS_GNU_ROOT/bin
• TARGET=BGL
• USE_MPI=yes
• ARMCI_NETWORK=BGMLMPI
• MSG_COMMS=BGMLMPI
• MPI_LIB=$BGLSYS_ROOT/lib
• MPI_INCLUDE=$BGLSYS_ROOT/include
• LIBMPI="-L$MPI_LIB -lfmpich_.rts -lmpich.rts -lmsglayer.rts -lrts.rts -ldevices.rts"
• BGMLMPI_INCLUDE=$MPI_INCLUDE
• BGMLLIBS=$MPI_LIB
• make FC="blrts_xlf90" CC="gcc"

These installation options allow the GA toolkit to utilise the MPI network for GA's internal communication. Since the MPI on Bluegene/L is designed to utilise the internal networks of Bluegene/L efficiently, GA benefits from the use of the vendor supplied MPI libraries.

The choice of compilers used to compile the GA toolkit was made based on feedback from other users and through trial and error experimentation. The GNU GCC compiler for Bluegene was used to avoid memory alignment problems reported with the blrts_xlc compiler [27].

The submission script of a GA program is similar to that of an MPI program on Bluegene/L. It was observed that for successful compilation and operation of GA programs on Bluegene/L, the -qEXTNAME compiler option needed to be supplied with the names of the GA routines used in the program.

3.1.3 Installation of the Global Arrays toolkit on HPCx

Global Arrays is designed to use LAPI (Low-level Application Programming Interface) on IBM P575 SMP clusters. The following options were used to install the Global Arrays toolkit on HPCx; these settings allow end users to exploit the performance benefits offered by the LAPI interface.

• TARGET=LAPI
• USE_MPI=yes
• FC=mpxlf
• CC=mpcc

The performance as well as the stability of GA programs depends on the submission script on HPCx. The environment variable RT_GRQ=ON needs to be specified in the submission script; without this option, GA based programs will fail on HPCx. In addition, the following settings were used when executing hybrid programs using both MPI and GA on HPCx.

• RT_GRQ=ON
• LDR_CNTRL=MAXDATA=0x80000000@DSA
• MP_SHARED_MEMORY=yes
• MP_CSS_INTERRUPT=yes
• AIXTHREAD_SCOPE=S
• MP_POLLING_INTERVAL=25000

When using hybrid programming with both MPI and GA on HPCx, both messaging layers need to be enabled. This is accomplished by adding the network.LAPI=csss,not_shared,us option to the submission script, in addition to network.MPI=csss,not_shared,us.

3.2 Global Arrays benchmarks

The ping-pong benchmark is widely used to quantify and compare the performance of inter-process communication. The elapsed time to transfer a message from the memory addressed by one processor to the memory addressed by another yields a measure of the latency of the network connection between two processes; the ping-pong benchmark performs this test for increasing message sizes. Communication routines of different message passing libraries have differing capabilities and characteristics in terms of latency, and parallel application performance is heavily dependent on the latency of data transfer.

To study the communication performance of the GA routines and to compare it to the performance of the corresponding MPI routines, an existing MPI ping-pong benchmark was modified to include tests for the GA communication routines. The original benchmarks were included with the Berkeley UPC software and were available for MPI and UPC.

The ping-pong benchmark used in this test performs the following steps. First, two arrays of size equal to the maximum message size requested by the user are created. Then, random lists of processor communication patterns are generated; this ensures that under test conditions involving many processors, all processors are accounted for when deriving the test results. After the sending and receiving processors are identified, a few small messages are transferred to warm up the communication, which reduces the effect of start-up costs on the final benchmark result. Data is then transferred from the array on the source processor to the target processor, with source and target processors identified and mapped in random order. The elapsed time to complete a fixed number of inter-processor messages is measured by the program, and the results are written to files in CSV format. By graphing the elapsed time as a function of the message size, the inter-processor communication characteristics of different machines can be compared. In this section, the results of inter-processor communication on Lomond, Bluegene and HPCx when using the GA toolkit are presented.
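A hedged C sketch of the timing kernel for the GA variant of the test is shown below. The one dimensional array layout, the patch arithmetic and the helper name are assumptions for illustration and not the exact benchmark code; the calls used (GA_Nodeid, GA_Sync, NGA_Put, NGA_Get, MPI_Wtime) are standard GA and MPI routines.

    /* Time repeated one-sided put/get operations against the patch of a
       1-D global array owned by the target process. */
    #include <stdlib.h>
    #include <mpi.h>
    #include "ga.h"

    double ga_pingpong(int g_a, int src, int dst, int nelems, int repeats) {
        int me = GA_Nodeid();
        int lo = dst * nelems;              /* patch assumed to be owned by dst */
        int hi = lo + nelems - 1;
        int ld = 1;
        double *buf = malloc((size_t)nelems * sizeof(double));
        double t0, t1;

        for (int i = 0; i < nelems; i++) buf[i] = (double)i;

        GA_Sync();                          /* start all processes together     */
        t0 = MPI_Wtime();
        if (me == src) {
            for (int r = 0; r < repeats; r++) {
                NGA_Put(g_a, &lo, &hi, buf, &ld);   /* "ping"                   */
                NGA_Get(g_a, &lo, &hi, buf, &ld);   /* "pong"                   */
            }
        }
        GA_Sync();                          /* ensure remote completion         */
        t1 = MPI_Wtime();
        free(buf);
        return (t1 - t0) / (2.0 * repeats); /* approximate time per transfer    */
    }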

The results of the benchmarking tests on Lomond are presented in Figure 3.1. Similar tests were repeated on Bluegene/L and their outcome is presented in Figure 3.2.

Figure 3.1: Ping-pong benchmark using GA & MPI on Lomond

On Lomond the GA blocking routines ga_get and ga_put displayed good performance for small message sizes. However, for messages above 0.1 MB the performance of the blocking routines deteriorated considerably. The GA non-blocking routines ga_nbget and ga_nbput outperformed their blocking counterparts for large messages. The MPI routines outperformed the GA routines for message sizes exceeding 0.1 MB.

Figure 3.2: Ping-pong benchmark using GA & MPI on Bluegene/L

On Bluegene/L a large difference in communication performance was observed between the GA routines and the MPI routines. There was no significant difference in performance between blocking and non-blocking routines in MPI, and the same was true of the GA routines. GA non-blocking performance was observed to deteriorate for a few message sizes; this behavior was consistent when the tests were repeated multiple times. Both GA and MPI non-blocking communications incurred high costs at very small message sizes.

The characteristics of the ping-pong benchmark on HPCx were found to be different from those on Bluegene/L and Lomond. HPCx supports RDMA through LAPI, and an environment variable in the submission script can enable RDMA support for MPI as well. The GA toolkit is designed to work optimally using the LAPI communication layer directly; this provides it with significantly better performance than MPI with RDMA enabled, because the one sided model offered by the GA toolkit is able to utilise RDMA effectively. For message sizes below 0.1 MB, the GA get and put routines outperformed the equivalent MPI send-receive routines. Above this limit the GA non-blocking communication outperforms MPI communication, whereas the GA blocking routines failed to provide significant improvements. As the message sizes increased to over 1 MB, MPI offered better bandwidth than the GA routines. On Lomond the GA non-blocking communication routines suffered a performance degradation compared to the GA blocking routines for messages smaller than 0.1 MB; this characteristic was not observed on HPCx. Figure 3.3 presents the communication characteristics of GA and compares them against those of MPI on HPCx. Both blocking and non-blocking routines provided equivalent performance on HPCx for messages smaller than 0.1 MB.

Figure 3.3: Ping-pong benchmark using GA & MPI on HPCx

3.3 Image processing benchmarks using Global Arrays on Bluegene/L

A simple image reconstruction program was parallelised using both GA and MPI. This allows the performance of the GA based code to be compared to that of the MPI based code. The input image for the image processing is the output of an edge detection algorithm applied to a grey scale image of size MxN. In order to reconstruct the original image from the edge image, the value of each cell is updated using the difference between the sum of the values of the four neighboring cells and the corresponding edge value. This process is performed iteratively on the resulting image to reconstruct the original. The image processing code uses nearest neighbor communication and has fixed communication patterns. The number of iterations required to reconstruct the image with satisfactory quality depends on the size of the image. The program can therefore be used to study the effect of the communication/computation ratio on the overall performance and scaling of the parallel code.
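For clarity, one iteration of this update can be written as the following serial kernel (a sketch only; the 0.25 factor assumes the average of the four neighbours is taken, as in the usual formulation of this reconstruction, and the array names and sizes are illustrative):

    #define M 600                         /* image height (illustrative) */
    #define N 840                         /* image width  (illustrative) */
    double oldimg[M][N], newimg[M][N], edge[M][N];

    void reconstruct_step(void)
    {
        for (int i = 1; i < M - 1; i++)
            for (int j = 1; j < N - 1; j++)
                newimg[i][j] = 0.25 * (oldimg[i-1][j] + oldimg[i+1][j]
                                     + oldimg[i][j-1] + oldimg[i][j+1]
                                     - edge[i][j]);

        for (int i = 1; i < M - 1; i++)   /* the result feeds the next iteration */
            for (int j = 1; j < N - 1; j++)
                oldimg[i][j] = newimg[i][j];
    }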

Figure 3.4 depicts the program structure of the image processing application using the Global Arrays toolkit. Both the MPI and GA image processing codes used a one dimensional decomposition. In order to parallelise the application, the edge image was partitioned into as many strips as the number of processors. Each processor works on its chunk of edge data. After updating all cells holding the image data, a halo swap is performed using ga_get and ga_put calls. The image reconstruction algorithm is then applied; this step is the compute intensive loop. The reconstruction step iterates until the maximum number of iterations is reached. Then the reconstructed image is written out in portable grey map format by the application. To verify the accuracy of the output, the reconstructed image can be compared against the output from the serial application. There were no differences between the output of the serial code and the GA based parallel code.

Figure 3.4: Program structure of the image reconstruction application

Figure 3.5: Application performance - Image processing using GA with different input sizes on Bluegene/L

The scalability of the GA based image processing application improves with increasing image size. The performance for three image sizes - 192x360, 600x840 and 2568x1841 - is presented in Figure 3.5. Figure 3.6 presents the results of a comparison between the MPI based and GA based image processing applications. It was observed that the MPI application provides better performance than GA, especially at higher numbers of processors. From the GA communication benchmarks it can be observed that the performance difference between GA and MPI is greater for smaller messages. The performance difference between the GA and MPI versions of the image processing application is small when the code is run on four processors, and it increases with increasing numbers of processors. This confirms that the performance difference between the two applications arises mainly from the more costly communication routines of the GA program.

Bluegene/L supports two modes of operation - co-processor mode (CO) and virtual node mode (VN) [7]. In virtual node mode both cores of every chip are used for computation, while in co-processor mode one core is used for computation and the other core handles the communication tasks for the compute core. A significant improvement (up to 50%) in application performance was observed when the application was executed in virtual node mode. This improvement was largest for smaller numbers of processors; as the communication cost increased with increasing numbers of processors, the savings due to the use of virtual node mode decreased. The application performance was found to be more stable under co-processor mode (Figure 3.7).

Figure 3.6: Application performance - Image processing using GA vs MPI

Figure 3.7: Application performance - Effect of Virtual node mode on Global Arrays based image processing benchmark

The GA program was easier to develop than the MPI program. When using MPI programming it was necessary to determine the neighboring process before send and receive routines were called. The presence of the global index space in GA programs allows each processor to issue a get and put call using the global address of the elements. The GA get and put syntax makes it easy to specify the chunk of the global array to be copied, as illustrated below.
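For example, fetching the boundary row of the neighbouring strip needs only the global coordinates of that row (a sketch using the GA C API; the handle g_img, the row variables and the buffers are illustrative, and the benchmark code may equally use the Fortran interface):

    #include "ga.h"

    /* Read global row (first_row - 1), i.e. the last row of the strip above,
       into a private halo buffer of ncols doubles.                          */
    void fetch_halo_above(int g_img, int first_row, int ncols, double *halo)
    {
        int lo[2] = { first_row - 1, 0 };          /* global coordinates of the chunk */
        int hi[2] = { first_row - 1, ncols - 1 };
        int ld[1] = { ncols };                     /* leading dimension of the buffer */
        NGA_Get(g_img, lo, hi, halo, ld);          /* one-sided read by global index  */
    }

    /* Publish this processor's own last row so the strip below can read it. */
    void publish_last_row(int g_img, int last_row, int ncols, double *row)
    {
        int lo[2] = { last_row, 0 };
        int hi[2] = { last_row, ncols - 1 };
        int ld[1] = { ncols };
        NGA_Put(g_img, lo, hi, row, ld);           /* one-sided write by global index */
    }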

Chapter 4

Optimizing memory usage of molecular dynamics code using Global Arrays

4.1 Molecular Dynamics

A molecular dynamics application used in current research at the University of Edinburgh was used to study the characteristics of an application which uses GAs to store frequently accessed configuration data. Using the DSM to store configuration data reduces data replication and saves memory on each node. The molecular dynamics code used in this study [26] is used to simulate electron transfer from an electrode to an electrolyte. The parallel version of the code was developed at Edinburgh Parallel Computing Center (EPCC) by Mr. Aristides Papadopoulos and Dr. Alan Gray.

The replicated configuration data restricted the problem size that could be solved on MPP machines. Using the GA toolkit these common data structures can be maintained across multiple nodes. For a given problem size, as the number of nodes increases, the memory required on each node decreases. This is expected to improve the memory scalability of the application.

The original MD application displayed poor scaling, limited to 128 processors. An attempt was made to improve the scalability of the code. The scaling issue was attributed to poor load balance in the study conducted by Aristides Papadopoulos. As part of this project the poor scaling of the application was studied in detail. The result of this study is presented in the next chapter.

4.2 The physical system and its effect on memory

The molecular dynamics code attempts to simulate a simple system comprising two walls of atoms, forming the cathode and the anode [26]. The melt ions are placed in between these electrodes. The walls of the electrodes are assumed to be three layers thick for the simulation. The dimensions of the cathode and anode can be altered by the user. The size of the system under simulation depends on the dimensions of the electrodes and the number of melt ions.

Figure 4.1: The physical system under simulation

The system depicted in Figure 4.1 is a 5x5 system with N melt ions of each species between the anode and the cathode. The main factors that influence the memory usage of the program are the dimensions of the walls and the number of melt ions between them.

To define the physical system a three-stage approach is used. First the walls forming the anode and cathode are generated. Then the melt ion configuration files are generated. In the third step these files are combined to form the configuration of the entire system. Three programs - makewall, makemelt, and wallchange3 - were made available by the research team to generate the input files used to define the system. Using these programs it was possible to generate systems of varying input sizes, and to change the number of wall ions, the number of melt ions, or both.

The molecular dynamics code uses the number of wall ions and the number of melt ions to define many of its data structures. These arrays hold information about the ions, including position, velocity, potential, forces and the electric field. To hold the information related to position and velocity in three dimensions, three separate one-dimensional arrays were used. These were the largest arrays of the program and accounted for a major share of its total memory usage. Older versions of the molecular dynamics program used arrays of size equal to the square of the total number of ions; these were optimised in the latest code to use less memory, by using arrays of size totalnum*(totalnum-1)/2. Even so, these arrays still dominated the high memory requirements of the program.
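For reference, such a packed triangular array stores the pair (i, j) with i > j at a single linear index; a sketch of the mapping in C (0-based indices; the ordering used by the Fortran code may differ):

    #include <stddef.h>

    /* Linear index of the unordered pair (i, j), i > j, in a packed array
       of length N*(N-1)/2. Row i = 1..N-1 holds columns j = 0..i-1.        */
    size_t pair_index(size_t i, size_t j)
    {
        return i * (i - 1) / 2 + j;
    }
    /* For N = 4 the pairs (1,0), (2,0), (2,1), (3,0), (3,1), (3,2) map to
       indices 0..5, i.e. N*(N-1)/2 = 6 entries in total.                   */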

As a result, the maximum system size solvable on the Bluegene/L was limited to 4500 ions (Figure 4.2). At 4500 ions the arrays holding velocity and position information used more than 200 MB of memory, while each Bluegene/L node has only 512 MB of memory. The largest system size solvable on the Bluegene/L was therefore restricted to a great extent by these large arrays.

Figure 4.2: Memory usage for storing position and velocity related data

4.3 Structure of the molecular dynamics application

This section briefly describes the structure of the original MPI based molecular dynamics application [26]. The steps performed by the main routine (main.F90) are listed below.

• Initialisation

The molecular dynamics application initialises the system by reading the configuration data. Initialisation is performed by calling the routines readin, setup, rstrun and velset. The readin routine reads runtime parameters from the input files and allocates some of the dynamic arrays of the program according to the sizes described in the input files. The setup routine opens the output files, and then allocates and initialises most of the dynamic arrays of the program. The rstrun routine is called only if a restart is required; a restart is used to continue a simulation from a previous run, with the restart file storing the positions and velocities of the ions for reuse. The velset routine is called only if the parameter velinit is set to true in the input file. It sets up the velocities on a Gaussian distribution and then calls the rescale routine.

• Re-scaling

The rescale routine changes the velocities of the ions to the desired temperature.

• Calculation of the induced wall charges

The routine named conjgradwall determines the induced wall charges subject to the constant potential constraint. It uses two routines to perform this calculation - separations and wallCG.

Calculation of the ion separation

The separations routine calculates the separations between ions. It considers the minimum image criterion as well as the two-dimensional periodic boundaries.

Calculation of wall ion potential

The wallCG subroutine calculates the induced charges on the wall ions using the conjugate gradient method. This method determines a set of charges for the wall ions such that the potential of these ions equals the desired constant potential. This step is an iterative process.

• Calculation of the total energy of the system

The ener subroutine calculates the total energy of the system, the total forces on the walls, the total forces applied on the liquid ions, the potential of these ions, and the electric field. The routines rgdrecipE, rdrealE and sr_energy are called by the ener subroutine. The subroutine rgdrecipE calculates the reciprocal space contributions to the Ewald sum, while rdrealE calculates the real space contributions to the Ewald sum. The subroutine sr_energy calculates the short-range energy and the short-range forces.

• Transchains

The trans_chains subroutine is invoked for each timestep. It updates the positions and velocities of the ions, and calls conjgradwall and ener to calculate the new wall charges, the energy, the potential, the forces and the electric field. The output obtained is periodically written to the output files.

4.4 Performance characteristics of original code

The original code was analyzed in detail in the MSc dissertation by Aristides Papadopoulos [26]. The original code achieved a maximum scaling of 128 processors with a parallel efficiency of 70% on the IBM Bluegene/L. The code that formed the basis of the work done by Aristides Papadopoulos was modified by the research community at the University of Edinburgh. In order to baseline the performance of the latest version of the MPI application, a performance evaluation was performed.

Figure 4.3 shows the scaling characteristics of the original code for two system sizes. The performance scaling of the code was limited to 128 processes, especially for smaller system sizes. The maximum system size that could be solved on the Bluegene/L using the original MD code was experimentally found to be 4272 ions, counting both melt ions and wall ions. The scaling of the code improves with the system size: the 10x10 system with a total of 4272 ions exhibited better scaling than the 5x5 system with a total of 1380 ions.

Figure 4.3: Scaling of original MD code

4.5 Design of the GA based molecular dynamics application

Molecular dynamics applications perform the following sequence of steps. Initially the configuration data is read from the restart files. The configuration data is comprised of the positions and velocities of the individual particles of the system. For the molecular dynamics system represented by the original code, the configuration data includes the state of the wall ions and melt ions as well as the potential applied. The second step calculates the forces applied on every atom. These forces arise from both bonded and non-bonded atomic interactions. After computation of the forces, Newton’s laws of motion are used to calculate the positions and velocities of all particles. This whole process is repeated for as many time steps as defined by the runtime parameters. The simulation results are periodically written to files by the molecular dynamics application. A final report is also written by the application at the end of the simulation.
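The integration step referred to above is typically a Verlet-type update of the form shown below (given only for illustration; the exact scheme used by the application's trans_chains and verlet_shake routines is not reproduced here):

\[
\mathbf{r}_i(t+\Delta t) = \mathbf{r}_i(t) + \mathbf{v}_i(t)\,\Delta t + \frac{\mathbf{F}_i(t)}{2 m_i}\,\Delta t^2 ,
\qquad
\mathbf{v}_i(t+\Delta t) = \mathbf{v}_i(t) + \frac{\mathbf{F}_i(t) + \mathbf{F}_i(t+\Delta t)}{2 m_i}\,\Delta t .
\]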

Figure 4.4 depicts the main loop of the molecular dynamics application. The trans_chains routine is repeated over many time steps; the total number of time steps is defined as a runtime parameter. The ions are moved to their latest positions, and the conjgradwall routine is invoked. The conjgradwall routine, depicted in Figure 4.5, is an iterative routine. It computes the separation of each particle from every other particle and stores the separations information in data arrays. A routine named wallCG is then invoked by the conjgradwall routine. This routine sets the initial wall charges and computes the wall energies. If the conjugate gradient method used for this computation converges, the wall charges are updated. Then the ener routine is invoked to update the energy information. These steps are repeated until the last time step.

Figure 4.4: Main Loop of the molecular dynamics code

Figure 4.5: A detailed view of the conjgradwall routine

Figure 4.6: Usage of the Global Array in the molecular dynamics application

After an analysis of the data structures created by the molecular dynamics application, it was observed that the largest data structures were the ones holding the separations information. This data is updated only once per time step by the separations routine, and was replicated on all processors. A major share of the total memory usage of the code could be attributed to three one dimensional arrays (dxsav, dysav and dzsav) of size N*(N-1)/2, where N is the total number of ions. These arrays are updated by the separations routine and are used in many routines called by conjgradwall as well as by ener. In order to reduce the memory usage of the application, the three large arrays were distributed using Global Arrays (Figure 4.6). This reduces the memory footprint of the application: the memory used per process for this data is approximately 3 * N*(N-1)/2 * sizeof(double) divided by the number of processes, plus the memory used internally by the GA toolkit.

To implement the GA version many routines were modified. The separations routine, which updates the dxsav, dysav and dzsav arrays, was modified to operate on temporary buffer segments, each of size equal to the size of dxsav divided by the number of processors. After computing the values to be stored in its segment, each processor updates its segment by issuing a ga_put call. Before the start of the separations routine a ga_sync call ensures that all processors are ready to load the positions and velocities of particles in case the particles have been moved. Although the original code used three separate arrays to store these data, in the GA version a single Global Array was used. Using GA syntax a multi dimensional chunk of data can be updated and retrieved with a single call, which reduces the number of communication calls. Since GA routines are optimised for bulk data transfer, combining the three one dimensional arrays into a single array improved the performance of the application. The separations routine is invoked once every timestep.
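A sketch of this update pattern using the GA C API is shown below (the application itself is written in Fortran; the handle g_dsav, the helper compute_separations and the assumption that the number of pairs divides evenly are all illustrative):

    int me     = GA_Nodeid();                     /* rank of this process             */
    int nproc  = GA_Nnodes();
    int seglen = npairs / nproc;                  /* npairs = N*(N-1)/2               */
    int lo[1]  = { me * seglen };                 /* my segment, in global indices    */
    int hi[1]  = { me * seglen + seglen - 1 };
    int ld[1]  = { 1 };

    GA_Sync();                                    /* new particle positions in place  */
    compute_separations(local_buf, lo[0], hi[0]); /* hypothetical helper: fill buffer */
    NGA_Put(g_dsav, lo, hi, local_buf, ld);       /* one-sided write of my segment    */
    GA_Sync();                                    /* segments visible to all          */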

The cgwallrealE, rgdrealE and sr_energy routines used the data stored in the GA to perform their part of the computation. The part of the Global Array that each processor operates upon is known prior to the start of the nested computation loops. Each processor copies its part of the data to a temporary buffer by issuing a ga_get call, and then operates on this data. Initially it was decided to use a large temporary buffer to store the entire data required by the nested computational loops. The prototype built using this approach showed very poor scaling and performance, as it resulted in significant load imbalance (Figure 4.7).

Figure 4.7: Effect of using large buffers

When the difference in the size of the data that needs to be transferred to each processor increases, the GA routines add significant communication overhead to the processors that transfer more data. This prototype was not developed further. In order to eliminate the load imbalance caused by the GA communication, it was decided to use smaller buffers, fetched inside the outer loop of the nested computation routines. Although there were many other smaller arrays that contributed to the memory usage of the code, it was decided that GA should not be used to store them, because they were involved in more active communication.

The loops parallelised to use the Global Arrays share a common loop structure. The structure of the computation loops using GA blocking communication can be generalized as follows:

• Program structure of a computational loop using blocking communication

allocate a local buffer

do Jiter=Jbegin,Jend

Read a segment of global array to local buffer

do Iiter=Ibegin,Iend

Perform computation using local buffer

end do

end do

deallocate buffer

As described by the pseudo code given above, the GA communication is placed inside the outer (J) loop. Though this increases the communication cost, it was chosen to avoid the significant load imbalance arising from GA communication of large buffers of varying sizes.
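Rendered with the GA C API, the blocking pattern looks roughly as follows (a sketch; the chunk size, the offset() and work() helpers and the 1-D layout are illustrative assumptions):

    double *buf = malloc(chunk * sizeof(double)); /* local buffer for one chunk   */

    for (int j = jbegin; j <= jend; j++) {
        int lo[1] = { offset(j) };                /* global start of chunk j      */
        int hi[1] = { offset(j) + chunk - 1 };
        int ld[1] = { 1 };
        NGA_Get(g_dsav, lo, hi, buf, ld);         /* blocking read of chunk j     */

        for (int i = ibegin; i <= iend; i++)
            work(i, j, buf);                      /* computation using the buffer */
    }
    free(buf);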

Figure 4.8 depicts the high performance cost incurred by the GA version of the molecular dynamics application compared against the performance of the MPI version. Non blocking communication can be used to overlap computation with communication, leading to latency hiding. An attempt was made to use non blocking communication to improve the performance of the GA application, and the results of this study are presented in the next section.

Figure 4.8: Comparison of GA and MPI application performance

4.6 Performance of application using non blocking communication

In order to reduce the high communication cost, a study was performed using non blocking (NB) communication. The structure of a generic computational loop using non blocking communication is given below:

• Program structure of a computational loop using non blocking communication

allocate two local buffers

Read a segment of global array to local buffer1

do Jiter=Jbegin,Jend

read data from GA to local buffer2 using ga_nbget

do Iiter=Ibegin,Iend

Perform computation using local buffer1

end do

wait for nonblocking communication to complete

copy data from local buffer2 to local buffer 1

end do

deallocate buffers
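A concrete rendering of this double-buffered scheme with the GA non-blocking C interface might look as follows (a sketch; offset(), work() and the chunk size are illustrative assumptions):

    double    *cur  = malloc(chunk * sizeof(double));
    double    *next = malloc(chunk * sizeof(double));
    ga_nbhdl_t handle;
    int        ld[1] = { 1 };

    int lo[1] = { offset(jbegin) }, hi[1] = { offset(jbegin) + chunk - 1 };
    NGA_Get(g_dsav, lo, hi, cur, ld);                 /* blocking read of the first chunk */

    for (int j = jbegin; j <= jend; j++) {
        if (j < jend) {                               /* prefetch the next chunk          */
            int nlo[1] = { offset(j + 1) };
            int nhi[1] = { offset(j + 1) + chunk - 1 };
            NGA_NbGet(g_dsav, nlo, nhi, next, ld, &handle);
        }
        for (int i = ibegin; i <= iend; i++)
            work(i, j, cur);                          /* compute while data is in flight  */

        if (j < jend) {
            NGA_NbWait(&handle);                      /* complete the prefetch            */
            double *tmp = cur; cur = next; next = tmp;/* swap buffers instead of copying  */
        }
    }
    free(cur); free(next);

Swapping the buffer pointers avoids the explicit copy shown in the pseudocode, although the branching and buffer-management overhead that hurt performance in these routines remains.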

The most costly routines in the molecular dynamics simulation are the cgwallrecipE and rgdrecipE routines. These two routines contribute more than 61% of the total compute time. Though they are parallelised using MPI, Global Arrays were not used in them and no modification was made to this code. The rgdrealE and cgwallrealE routines were found to be less expensive than the reciprocal space routines, contributing less than 2% each to the total time. Using non blocking communication in cgwallrealE was found to be counterproductive: its computational work load is very low, and the additional overhead of the data copy and branching when using non blocking communication led to a degradation of the application performance. Though the computational work load in rgdrealE was slightly higher than in cgwallrealE, the performance of the application did not improve as a result of implementing non blocking communication. This is mainly because each routine of the application is composed of many computational loops, each with a low computational payload per loop. The additional cost incurred by these loops when implementing non blocking communication leads to poor performance, especially at lower processor numbers. Figure 4.9 compares the performance of the GA code with non blocking communication and with blocking communication.

4.7 Result of memory optimization using Global Arrays

In order to investigate the savings in memory achieved through the use of the Global Arrays toolkit, a trial and error approach was used. Though counting the size of variables and arrays used was an alternative method, the counting approach could not be used as the internal memory consumption of the GA toolkit could not be predicted in advance. The largest system size that could be executed on the GA version of the program was also dependent on the total memory requirements of other replicated data structures used in the program.

In practice it was observed that system sizes of around 8000 ions could be executed successfully on the Bluegene/L at 128 processors with the GA based code (Figure 4.10). This is a significant improvement over the MPI based code with data replication. The position and velocity arrays for 8000 ions consume more than 800 MB of memory, while the Bluegene/L has only 512 MB of memory per node; the original code therefore fails to run such systems even on larger numbers of processors.
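As a rough consistency check (illustrative arithmetic only), the three packed separation arrays alone occupy

\[
3 \times \frac{N(N-1)}{2} \times 8\ \text{bytes} \;\approx\; 0.77\ \text{GB} \quad \text{for } N = 8000,
\]

which cannot be replicated within the 512 MB available on a node, but amounts to only about 6 MB per node when distributed over 128 processors, which is consistent with the figures quoted in this chapter.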

The savings in memory were evident at higher numbers of processors. The introduction of the GA increased the memory usage of the code compared to the MPI version when run on small numbers of processors. As a result, the GA code could not be used on smaller numbers of processors.

Figure 4.9: Blocking vs. non-blocking GA program

Figure 4.10: Result of memory optimisation using GA

The system with 7200 ions requires the code to be executed on at least 64 processors. As the system size increases, the smallest number of processors on which the code can run also increases. A hard coded upper limit of 8,500 ions was placed on the number of ions. With this setting, the code should be able to execute systems of up to about 8500 ions on 128 processors in communications co-processor mode. The user of the code can change this setting to experiment with systems of different sizes.

When the code runs on 128 processors, the GA program consumes only 6 MB per processor for the storage of the position and velocity arrays. Systems much larger than 8000 ions could still not be executed on the Bluegene/L, as the remaining replicated arrays consume more memory than is available on a node.

The GA based program incurred a heavy performance loss. This is attributed to the cost of adding communication calls within the outer loop of two nested loops. Since this was implemented in routines which are repeatedly invoked (mainly cgwallrealE), the performance impact was very high.

4.7.1 Simulation of an Aqueous system

In order to verify the capability of the GA based program, a system with water as the melt was simulated on the Bluegene/L. The system had 5340 ions in total, including 4752 melt and 588 wall ions. The MPI program was not able to execute this system on the Bluegene/L: it crashed with the error code 1525-108 on the root node, which indicates that an error was encountered while attempting to allocate a data object [28].

The GA code on the other hand executed this system successfully. As a result a new type of simulation could be performed on the Bluegene/L using the GA based molecular dynamics code. The output of this system was verified by the research team [29] by comparison with the results from the MPI code executed on a Linux cluster, with more memory per node.

4.8 Test results

The testing of the output from the GA based code was performed mainly by comparing its output files with the output from the original code. A small system executable with the MPI version was also used to test the output from the GA code. Since the modification to the code only changed the storage of the arrays, the result of the computation was expected to match the result of the original code exactly. All the output files were compared to the output from the original code and were verified to be accurate. The compiler optimization flags used to compile the code were -O3 -qstrict -qtune=440 -qarch=440. Compiling the GA toolkit as well as the molecular dynamics program at higher optimisation levels without the -qstrict option gave better performance, but resulted in differences in the output.

The test systems used to verify the GA code were also included in the delivery of the code. The 5x5 system uses 1080 melt ions and 300 wall ions, whereas the larger 10x10 system uses 6000 melt ions and 1200 wall ions. The 5x5 system was used to verify the output of the GA version. Both of these systems use the trans_chains routine. While the trans_chains routine was used for the investigation of the performance of the code, the water based system was used to verify that systems using the verlet_shake routine instead of the trans_chains routine also produce accurate results.

Chapter 5

Performance optimization of Molecular dynamics code

The maximum speed-up a parallel program can achieve is limited by the serial component of the program and by its load imbalance. The maximum speed-up S that a parallel program with a serial fraction F can achieve on N processors is given by equation 5.1.

\[
S = \frac{1}{F + \frac{1 - F}{N}} \qquad (5.1)
\]

Equation 5.1 is the simplified form of Amdahl's law [4].
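For example, with an illustrative serial fraction of F = 0.02 (not a measured value for this code),

\[
S(128) = \frac{1}{0.02 + 0.98/128} \approx 36, \qquad \lim_{N \to \infty} S = \frac{1}{0.02} = 50,
\]

so even a small serial component caps the achievable speed-up well below the processor count.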

5.1 Analysis of load imbalance and serial components

The original MPI based code suffered poor scaling above 128 processors. The root cause of this behavior was identified as the load imbalance between the processors when computing the cgwallrealE and cgwallrecipE routines [26]. The cause of the load imbalance in cgwallrealE was attributed to the short-range cut off applied in the reciprocal space loops.

In order to determine the root cause of this behavior, the most expensive and frequently invoked routines of the code were timed. The rgdrealE and rgdrecipE routines did not display any significant load imbalance between the processors. The cgwallrealE and cgwallrecipE routines individually displayed very high load imbalance between the processors. However, these routines form a complementing pair, and the load imbalance of the total time spent in the two routines was found to be insignificant. This is shown in Figure 5.1. Load imbalance was therefore eliminated as the root cause of the poor scaling demonstrated by the code.

The original MPI code had three routines - rdf, potdump and dump - which were invoked only on the root node. These routines represent the serial component on the root node, and were identified as the main reason for the poor scaling behavior of the original code. Of these three routines, the rdf routine contributed 80% of the compute time. The rdf routine collects the data required to calculate the radial distribution functions; another routine named rdfout writes the calculated data to multiple output files. Since the rdf routine is compute intensive, it was decided to parallelise it. The serial rdf routine affects the GA based program even more adversely: because it acts on the velocity and position data, it would have to access the Global Array, which would have introduced additional performance and scalability issues. This made parallelising the routine all the more important.

Figure 5.1: Analysis of load imbalance

5.2 Parallelizing RDF routines

In order to parallelise the rdf routine, two versions of it were developed: one for the GA based program and one for the MPI program. In the MPI version, each processor identifies its chunk of the work by reusing the distribution of ion pairs across nodes. In the GA version, the distribution of the Global Array was used to implement the parallelisation. By using the data locality information of the Global Array it is ensured that each processor acts only on the data located in its own memory, which yields good performance by avoiding inter-node data transfer. After each processor has acted on its chunk of data, a global reduction is performed and the root node writes the data to files using the rdfout routine.
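The pattern used is the usual local-accumulation-plus-reduction one; a sketch in C with MPI is given below (the application itself is Fortran, and NBINS, the pair bounds, separation() and the bin width dr are illustrative assumptions):

    #include <mpi.h>

    #define NBINS 512                              /* number of histogram bins (illustrative) */

    void rdf_parallel(int first_pair, int last_pair, double dr)
    {
        double local_rdf[NBINS] = { 0.0 };         /* partial histogram on this process */
        double global_rdf[NBINS];

        /* accumulate contributions only for the ion pairs owned by this process */
        for (int p = first_pair; p <= last_pair; p++) {
            int bin = (int)(separation(p) / dr);   /* separation() is an assumed helper */
            if (bin >= 0 && bin < NBINS)
                local_rdf[bin] += 1.0;
        }

        /* combine the partial histograms on the root, which then calls rdfout */
        MPI_Reduce(local_rdf, global_rdf, NBINS, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    }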

43 5.2.1 Result of performance optimization

The performance of the MPI application with the parallel rdf routine is compared to the performance of the original MPI application in Figure 5.2. The scaling of the MPI application improved significantly at higher numbers of processors, showing performance improvements up to 512 processors. The GA program incurred a 2x performance hit compared to the MPI program with the parallelised rdf routine. However, the GA program was able to solve much larger systems that could not be solved on the Bluegene/L with the MPI code.

Figure 5.2: The scaling of the MPI application before and after rdf routine was parallelised

On HPCx, the performance characteristics of the GA code were quite different from those on Bluegene/L (Figure 5.3). The scaling of the GA as well as the MPI application was affected significantly by inter-node communication. On HPCx the intra-node performance is significantly better than the inter-node performance, as intra-node communication avoids the high-latency switch. For the 5x5 system a significant performance improvement was obtained by using 64 processors compared to 32 processors; this improvement deteriorated and eventually levelled off as the application was run on more processors, whereas the MPI application performance did not deteriorate when using larger numbers of processors. At runtime, before the initialisation of the Global Array, the ARMCI layer is configured to adapt to the communication characteristics of the machine; using more nodes can therefore adversely affect the performance of GA based code on HPCx. The same code exhibits better scaling when large systems are solved, which indicates that ARMCI is configured to use larger messages for larger systems. At lower processor counts (32 processors) the GA code suffered very poor performance compared to the MPI version. This contrasts with the behavior of the GA based program on the Bluegene/L, where the difference in performance between the GA code and the MPI based code remained roughly constant as the number of processors increased from 8 to 128. This can be attributed to the better network capabilities of the Bluegene/L.

Figure 5.3: Performance of GA and MPI version of MD code on HPCx

By packaging the GA program along with the MPI program, the end user is provided with two options. The user can opt for the MPI program to compute small systems which fit in the Bluegene/L memory and benefit from better performance and scaling, or use the GA application to simulate systems with larger numbers of ions, which could not be executed on the Bluegene/L with the original MPI version.

Chapter 6

UPC

Investigations of the performance characteristics and ease of use of UPC for distributed shared memory programming were performed on Lomond and HPCx. This chapter presents the details and results of this investigation.

6.1 Installation of UPC

In order to study the performance characteristics of applications using the UPC language, two implementations of UPC were experimented with. Berkeley UPC [3] runtime version 2.4.0 was installed on Lomond by compiling it from the open source distribution. This runtime provides the user access to the UPC environment through the commands upcc and upcrun. IBM XL UPC was installed on HPCx by the HPCx technical support team. Apart from IBM XL UPC and Berkeley UPC, several other UPC implementations for various hardware architectures are available today, including CRAY UPC [30], HP UPC [31] and GCC UPC [32].

6.1.1 Installation of UPC on Lomond

The Berkeley UPC compiler is made up of two main components - the UPC to C translator and the Berkeley UPC runtime. These components are available for download as separately archived files, and the UPC to C translator is also available as an online service. The core component that needs to be installed is therefore the Berkeley UPC runtime; by default this runtime uses the UPC to C translator available online. It is also possible to use the Berkeley UPC runtime with the GCC UPC binary compiler to build UPC programs. It was decided to use the online UPC to C translator, since neither the UPC to C translator installer nor the GCC UPC compiler was available for Solaris. The GCC UPC compiler is available mainly for Intel/Linux based uniprocessor and SMP clusters, as well as Cray T3E Alpha based and Cray XT3 AMD based systems.

As part of this project Berkeley UPC was investigated using the default installation. Only the Berkeley UPC runtime was installed on Lomond. In order to install the Berkeley UPC runtime on Lomond, the following steps were performed.

First, the installation files were downloaded from the Berkeley UPC download site [33] and extracted using the tar utility. Prior to installation of Berkeley UPC, it is necessary to configure the installation. The configuration step identifies the conduits to be used by the UPC communication layer. Since Lomond is a shared memory machine, it was decided to use the SMP conduit. The configuration was performed by invoking the configure script from the base folder.

/berkeley_upc-2.4.0/configure CC=gcc --without-cxx --without-mpi-cc \
    --disable-mpi --enable-smp

The above command configures the Berkeley UPC runtime to use shared memory for inter-processor communication, while disabling the MPI as well as the UDP conduits. To build the Berkeley UPC runtime, the gmake command is invoked after the configuration process has completed. It is possible to control the shared memory settings when the UPC runtime is installed; these settings play a crucial part in determining the performance characteristics of applications compiled with the UPC compiler. The default configuration governing the operation of the UPC runtime is maintained in the upcc.conf file. A setting named shared_heap governs the default amount of each UPC process's memory that is dedicated to shared memory. If this setting is too low, UPC programs can fail due to memory depletion. To prevent this, the installation notes recommend setting this value to half the available physical memory divided by the number of threads.

Even though MPI is available as a conduit for the installation of UPC, it is not recommended on ethernet based clusters. Placing the MPI layer between the network hardware and the UPC runtime degrades the performance of UPC communication, and the Berkeley UPC installation notes recommend the use of UDP on ethernet based clusters rather than an installation over MPI. Specialized interconnects such as LAPI, Quadrics Elan, Myrinet/GM and Infiniband/vapi are recognized as conduits by the GASNet layer of UPC.

To compile UPC source code with the Berkeley UPC runtime, the upcc script is invoked with suitable runtime options. The options used for the investigation presented in this chapter are given below.

upcc -T=N -pthreads=N -opt -o output source.upc

The -T flag specifies the number of static threads to be used and -pthreads the number of pthreads spawned per process, while -opt turns on the experimental optimizations of Berkeley UPC.

6.1.2 Installation of IBM XL UPC on HPCx

While the Berkeley UPC runtime is a script file which can be invoked to compile the UPC source code, IBM XL UPC is a binary compiler based on the IBM XLC compiler. It can be installed on AIX as well as Linux machines. The installation procedure and requirements are documented at the IBM XL UPC website [34]. IBM XL UPC was installed by the HPCx support team on HPCx.

6.2 UPC syntax and common usage patterns

The UPC language is designed to simplify parallel programming on distributed memory computers. It aims to provide the convenience of the shared memory programming model on distributed memory machines. UPC supports a powerful and versatile syntax, and the performance of a UPC based application depends strongly on the programming style used. The main features of UPC that reduce syntactical complexity and promote usability are described below.

• Thread and process layout

Berkeley UPC provides users the flexibility to lay out UPC processes and UPC threads in many different ways. In a cluster of SMPs, or a cluster of servers with multicore chips, the end user can lay out one process per node and spawn multiple UPC threads within each node. These threads use pthreads for intra-node communication, while inter-node communication is handled through the conduit of choice; UDP is typically the conduit of choice on a cluster with an ethernet interconnect, and a wide range of other network conduits is available.

IBM XL UPC provides a simpler means of thread layout, as the user specifies only the number of threads; the spawning of UPC processes is performed internally by the IBM XL UPC compiler. IBM XL UPC is available mainly for IBM AIX and Linux machines. In many applications, specifying a static number of threads at compile time enables additional compiler optimizations.

Within a UPC program, the THREADS keyword provides the total number of UPC threads and the MYTHREAD keyword gives the index of the local thread.

• Shared Variables

UPC provides the programmer with two distinct memory spaces - private memory and shared memory. The standard C local variable uses the private space while the shared space is used by variables qualified at declaration using the shared qualifier.

shared int array[N];

While every UPC thread creates a local copy of each private variable, the shared variables are accessible by all threads. Pointers need to be qualified with the shared keyword to be able to point to shared memory.

• Data distribution using UPC

It is very easy to set the data distribution of the shared array within the distributed shared memory. This is accomplished by setting the distribution block size while declaring the shared array.

shared [blocksize] int array[N];

By varying the block size it is possible to redistribute the data across N threads in different patterns. The default value of block size is 1 which leads to cyclic decomposition of N shared array elements across all the threads available. For example, by using the syntax

shared [N/THREADS] int array[N];

blocks of N/THREADS contiguous elements of the array are spread across the available threads, giving a block distribution. This syntax also works for multi dimensional arrays.

shared [(M*N)/THREADS] int array[N][M];

In order to place all the array elements on thread 0, the shared array can be declared with a blank block size, or with N as the block size.

shared [] int array[N];

shared [N] int array[N];

Pointers can also be declared as private or shared.

shared int *myptr;

shared int *shared ourptr;

The power of the UPC programming language is derived from the use of the shared variables and shared arrays. Shared array elements can be assigned values from local variables and vice versa. This coupled with work sharing and synchronization syntax makes parallel programming using UPC easier than the message passing model.

• Worksharing using UPC

The UPC runtime spawns as many threads as specified by the runtime environment settings, and all threads execute in parallel by default. The primary form of work distribution across threads is the upc_forall statement. Its syntax is similar to that of the C for statement, the only difference being an additional fourth field, the affinity expression, which indicates the thread that should perform the work for that iteration. The upc_forall statement executes the loop iterations concurrently and independently of each other; it is therefore critical to ensure that the operations performed in the loop body do not have loop dependencies.

upc_forall (iter = 0; iter < N; iter++; iter) {

    Work ....

}

The syntax above performs each unit of work concurrently on the thread selected by the affinity expression. UPC programs can be developed to use either the shared memory model or the message passing model; as part of this study the performance of applications using both models is investigated. A complete example program is sketched at the end of this list.

• Data locality awareness

UPC provides powerful syntax capable of exploiting data locality automatically. This is implemented as a variant of the upc_forall syntax. Using the syntax below, each iteration is executed by the thread that owns the shared array element named in the affinity field.

shared [N/THREADS] int array[N];

upc_forall (iter = 0; iter < N; iter++; &array[iter])
    array[iter] = sqrt(iter);

Data locality awareness provides significant ease of use to the programmer. The array syntax can be best utilised on multidimensional arrays distributed using block distribution.

• Synchronization

UPC provides powerful synchronization functionality in the form of barriers and locks, allowing protection of shared data through mutual exclusion and inter-process synchronization. UPC also provides two memory consistency modes - strict mode and relaxed mode. Strict mode enforces data synchronization before each access to a shared variable, while relaxed mode allows threads to access data asynchronously. Strict mode adds additional serialisation and therefore leads to significant overhead; it is best avoided in favour of locks.

• UPC one sided communication

UPC also allows one sided inter-process communication through the upc_memput and upc_memget routines. These routines allow a message passing style of programming using UPC.
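Tying these features together, a minimal complete UPC program might look as follows (an illustrative sketch, not taken from the codes used in this project; it assumes a static thread count, for example compilation with upcc -T=4):

    #include <upc_relaxed.h>
    #include <stdio.h>
    #include <math.h>

    #define N 1024

    shared [N/THREADS] double a[N];         /* block-distributed shared array            */

    int main(void)
    {
        int i;

        upc_forall (i = 0; i < N; i++; &a[i])   /* each thread fills the elements it owns */
            a[i] = sqrt((double)i);

        upc_barrier;                        /* wait until every thread has finished      */

        if (MYTHREAD == 0)                  /* thread 0 reads a remote element directly  */
            printf("a[%d] = %f, computed by %d threads\n", N - 1, a[N - 1], THREADS);
        return 0;
    }

The program would then be launched with upcrun, specifying the number of threads chosen at compile time.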

6.3 UPC benchmarks

The ping-pong benchmark was performed on both HPCx and Lomond to measure the performance of UPC inter-processor communication. The code used to perform this benchmark was available within the Berkeley UPC distribution. The benchmark measures the time elapsed to copy data from a shared variable resident on a remote processor to a local variable, and vice versa. The remote read is represented in the graphs in this section as upc get, and the remote write as upc put. On Lomond, UPC was installed to use the SMP conduit, which enables UPC to use shared memory for communication. The largest shared memory allocatable per process was limited to 2 GB on Lomond; this is shared between all the threads of the UPC program. The shared_heap setting is available as a compiler flag and can be set per program.

Figure 6.1 compares the communication characteristics of the UPC routines with those of the corresponding MPI routines. The communication patterns reveal major differences between MPI and UPC. At very small message sizes Berkeley UPC outperformed MPI; Berkeley UPC communication routines are optimised for small messages, since the language is intended for applications dominated by random point to point communication. MPI outperforms Berkeley UPC for larger message sizes, where UPC suffered nearly an order of magnitude worse communication performance on Lomond than MPI. The benchmark program uses direct assignment between shared memory elements and local memory to transfer data; this method of using shared memory was found to be very expensive on Lomond.

The performance of IBM XL UPC communication routines is compared to MPI performance, in Figure 6.2. The performance of IBM XL UPC communication was found to be very poor compared to MPI performance. The difference in performance between MPI and IBM XL UPC was nearly two orders of magnitude. The reasons for this poor performance are unknown. This result could be due to the choice of communication conduit used for the installation of UPC on HPCx. The Berkeley UPC on Lomond demonstrated better communication characteristics than IBM XL UPC on HPCx.

The behavior of IBM XL UPC and Berkeley UPC differed greatly when compared using the ping-pong benchmark. Figure 6.3 contrasts the performance of Berkeley UPC on Lomond and on an Intel Core Duo chip with the performance of IBM XL UPC on HPCx. Almost an order of magnitude difference was observed between the communication latencies of Berkeley UPC on Lomond and IBM XL UPC on HPCx. On Lomond, the UPC communication routines appear able to use the shared memory more effectively than IBM XL UPC does on HPCx. Berkeley UPC, like MPI, suffered high costs for very small messages, while IBM XL UPC did not. Berkeley UPC achieved very low latencies when used on a chip multiprocessor. This suggests a high variance in the communication latency of UPC implementations as well as of their installations on different systems.

Figure 6.1: Ping-pong benchmark of UPC vs MPI communication on Lomond

Figure 6.2: Ping-pong benchmark of IBM XL UPC on HPCx compared to MPI performance


Figure 6.3: Ping-pong benchmark of Berkeley UPC on Lomond and an Intel Core Duo compared to IBM XL UPC on HPCx

6.3.1 Image reconstruction using shared memory and upc_forall

The image processing program described in section 3.3 was also implemented using UPC. The UPC syntax simplified the implementation of the parallel program. Figure 6.4 depicts the structure of the UPC program, which uses shared memory and the upc_forall statement, and compares it against a similar program implemented using MPI. Please refer to section 3.3 for the structure of the equivalent GA program.

The serial image reconstruction program declares three main arrays on which the reconstruction algorithm relies, named new, old and edge. To implement the UPC program, these arrays were redistributed by declaring them with the shared qualifier. In order to implement a block distribution, the shared clause used a block size of MxN/THREADS, where M and N represent the width and height of the image.

The for loops used for the computation were replaced with their parallel equivalent, the upc_forall statement. To ensure that the thread that owns the data performs the computation on it, the array element itself was used as the affinity expression.
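A sketch of the resulting reconstruction loop is shown below (array names follow the description above; the 0.25 factor, the fixed boundary and the compile-time image dimensions are assumptions):

    #define M 600                         /* image height (illustrative)    */
    #define N 840                         /* image width  (illustrative)    */
    #define MAXITER 1000                  /* iteration count (illustrative) */

    /* block-distributed shared images, as described above */
    shared [M*N/THREADS] double old[M][N], new[M][N], edge[M][N];

    void reconstruct(void)
    {
        int i, j, iter;
        for (iter = 0; iter < MAXITER; iter++) {
            for (i = 1; i < M - 1; i++)
                upc_forall (j = 1; j < N - 1; j++; &old[i][j])   /* owner computes */
                    new[i][j] = 0.25 * (old[i-1][j] + old[i+1][j]
                                      + old[i][j-1] + old[i][j+1]
                                      - edge[i][j]);
            upc_barrier;                                         /* updates complete */
            for (i = 1; i < M - 1; i++)
                upc_forall (j = 1; j < N - 1; j++; &old[i][j])
                    old[i][j] = new[i][j];
            upc_barrier;
        }
    }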

The overall layout of the program did not differ from the serial code; the UPC program therefore had very low syntactical complexity compared to its GA and MPI versions. Figure 6.5 presents the performance of the UPC program compared to the performance of the original serial program with the same compiler flags enabled (-O3 -qarch=pwr4 -qtune=pwr4).

Figure 6.4: Image processing using UPC compared to MPI

Figure 6.5: Elapsed time for UPC program using shared memory vs serial program

It was observed that the UPC program suffered very high performance degradation compared to the serial program compiled using the XL UPC compiler with equivalent compiler flags. To eliminate compiler optimisation issues as the root cause of this degradation, the serial image processing program was compiled using both the IBM XL UPC compiler and the IBM XLC compiler. The results of this test are presented in Figure 6.6. The use of compiler optimisation options in the XL UPC compiler displayed considerable improvement in the performance of the program. While the XLC compiler provided significant improvements at each compiler optimisation level from -O1 to -O5, the XL UPC compiler already provided good optimisation of the serial code at -O2, and this performance was observed to be marginally better than that of the XLC compiler. This is not surprising, since the XL UPC compiler is a newer release than the XLC compiler and may be better optimised for the POWER5 architecture.

The UPC program was not able to scale above 16 processors, indicating severe performance loss when crossing the switch. The root cause of this issue is not known. This may be the result of installation issues.

The root cause of the performance degradation relative to the serial program, discussed above, was identified as the high overhead of accessing shared memory compared to the cost of accessing local memory. While local memory access is optimised by the compiler to use prefetching, there is a high overhead associated with address translation for the shared memory segment, and an additional overhead due to the data locality identification. These two factors led to the poor performance of the UPC code and add a serial overhead to the program. Nevertheless, the program did exhibit scaling as the number of processors increased.

Figure 6.6: XL UPC vs XLC compiler optimisation (elapsed compute time in seconds against compiler optimisation level, -O1 to -O5).

This indicates that the inter-processor communication cost associated with the UPC shared memory is minimal. To validate this, another version of the image processing example, which uses local arrays for storage, was implemented. The results of this study are presented in the next section.

6.3.2 Image processing using local memory and halo swaps

The UPC image processing program was reimplemented in a message passing style. This version is very similar to the MPI and GA versions of the image processing benchmark. The communication routines used were upc_memput and upc_memget, and shared memory was used only to store the image halos. Figure 6.7 illustrates the halo swap implemented using UPC shared memory.
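
A hedged sketch of this halo swap pattern is shown below. The layout of the shared halo buffer (two rows per thread) and all names and sizes are assumptions made for illustration; they are not taken from the benchmark code.

#include <upc.h>

#define NCOLS 360          /* illustrative image width     */
#define NROWS_LOCAL 64     /* illustrative rows per thread */

/* Shared halo buffer: two rows per thread (block size 2*NCOLS keeps both
   rows on the owning thread), in the spirit of the 2*THREADS shared array
   of Figure 6.7. */
shared [2*NCOLS] double halo[2*THREADS][NCOLS];

/* Private image block with two extra halo rows, 0 and NROWS_LOCAL+1. */
double local_img[NROWS_LOCAL + 2][NCOLS];

void halo_swap(void)
{
    /* Publish our boundary rows into our own slots of the shared buffer. */
    upc_memput(&halo[2*MYTHREAD][0],     &local_img[1][0],
               NCOLS * sizeof(double));
    upc_memput(&halo[2*MYTHREAD + 1][0], &local_img[NROWS_LOCAL][0],
               NCOLS * sizeof(double));
    upc_barrier;

    /* Fetch the neighbours' boundary rows into our private halo rows. */
    if (MYTHREAD > 0)
        upc_memget(&local_img[0][0], &halo[2*(MYTHREAD-1) + 1][0],
                   NCOLS * sizeof(double));
    if (MYTHREAD < THREADS - 1)
        upc_memget(&local_img[NROWS_LOCAL + 1][0], &halo[2*(MYTHREAD+1)][0],
                   NCOLS * sizeof(double));
    upc_barrier;
}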

The program exhibited good scaling, especially as the computation cost increased relative to the communication cost. The overall performance improved by more than an order of magnitude when local memory, rather than shared memory, was used to store the arrays accessed in the compute intensive loops. Figure 6.8 depicts the difference in performance between the UPC code which uses local arrays and the UPC code which uses shared arrays. Figure 6.9 presents the scaling of the UPC image processing benchmark implemented to use local memory. As the image size increased, the scaling also improved, owing to the higher ratio of parallel to serial computation.

[Figure 6.7 diagram: the image data held locally on each processor is put into, and the neighbouring boundary data got from, a shared array of size 2*THREADS that is used for the halo swap.]

Figure 6.7: Halo swap implemented using upc_memput and upc_memget.

6.3.3 Summary

The communication characteristics of the UPC implementations varied widely between hardware architectures and installations. Berkeley UPC delivered very low latency when using a chip multiprocessor, while the latency was much poorer when using the SMP server (Lomond). The IBM XL UPC installation on HPCx demonstrated very high latency; the root cause of this behavior is unknown. While IBM XL UPC was able to provide excellent compiler optimisation, the Berkeley UPC installation on Lomond was not able to use the native compiler optimisation effectively, mainly because neither the source to source translator nor a GCC UPC compiler optimised for Solaris was available. As a result of the poor latency of the IBM XL UPC installation on HPCx, it was not possible to obtain performance improvements when using distributed shared memory to parallelise the code. The code did, however, scale better when using a message passing style of program structure.

Figure 6.8: UPC shared memory vs UPC message passing (elapsed time in seconds against number of processors, 2 to 16).

Figure 6.9: Speedup of the image processing benchmark using IBM XL UPC on HPCx, for image sizes 192x360, 384x576 and 600x840 (speedup against number of processors, 2 to 16).

Chapter 7

Comparison of Global Arrays and UPC

Although the Global Arrays toolkit and UPC both follow the DSM programming model, significant differences exist in their usage characteristics. This chapter compares the GA toolkit and the UPC language on several fronts: availability and portability across machine architectures, ease of installation, availability and clarity of documentation, syntax and its ease of use, ability to utilise compiler optimisation, interoperability with MPI, communication latency, parallel performance gained, and ease of implementing different parallel decomposition strategies. The analysis of the UPC language is based on the characteristics of the Berkeley UPC and IBM XL UPC implementations on Lomond and HPCx; these results can vary from one implementation to another.

7.1 Portability and availability

The Global Arrays toolkit can be installed on many hardware platforms, including SMP servers, clusters of SMP servers, MPP servers and networks of workstations. It is supported on Solaris, AIX, HPUX, Linux, Windows NT, IRIX and Digital/TRU64 Unix. As part of this investigation, the GA toolkit was installed on a stand alone Solaris server (Lomond), an IBM Bluegene/L, and an IBM eSeries 575 SMP cluster (HPCx). Many networks are supported by the Global Arrays toolkit, including ethernet, Quadrics/QsNet Elan3 and Elan4, Shmem, VAPI (Infiniband), OpenIB (Infiniband), GM (Myrinet) and Giganet cLAN (VIA). The GA toolkit is open source and is supported by the HPC tools support team at Pacific Northwest National Laboratory. Programming interfaces to the Global Arrays toolkit functionality are available in Fortran, C, C++ and Python.

The GA toolkit is implemented as a library and is designed to co-exist with MPI. In this investigation MPI was used as the communication medium for the installations of the GA toolkit on Lomond and the Bluegene/L; this mode of installation offers maximum stability. On HPCx, the GA toolkit was installed to use LAPI. The test programs packaged with the GA toolkit were run to verify that each installation worked correctly. The Global Arrays toolkit is also available on the CRAY XT3/XT4, and can potentially be installed on HECTOR [35].

Berkeley UPC is also an open source solution and comprises a source to source translator, which converts UPC code to C code, and a UPC runtime, which provides the UPC environment to the user. The UPC runtime can be installed on many operating systems, including Linux, FreeBSD, NetBSD, Tru64, AIX, IRIX, HPUX, Microsoft Windows, Mac OS X, Cray Unicos and NEC SuperUX. It also supports many processor architectures, such as x86, Itanium, Opteron, Athlon, Alpha, PowerPC, MIPS, PA-RISC, SPARC, Cray T3E, Cray X1/X1E, Cray XD1, Cray XT3, SX-6 and SGI Altix. This is mainly due to the excellent support of the ANSI C standard across a wide range of machines. Berkeley UPC claims to be installable with a wide range of compilers, such as GNU GCC, Intel C, Portland Group C, SunPro C, Compaq C, HP C, MIPSPro C, IBM VisualAge C, Cray C, NEC C and Pathscale C.

As part of this investigation the Berkeley UPC runtime was installed on Lomond. An attempt to build the Berkeley UPC runtime with the native Sun MPI compilers failed; the installation succeeded when performed with the GCC compiler chain. On HPCx the same issue prevented the installation of the Berkeley UPC runtime, so IBM XL UPC was used for the study on HPCx.

The Berkeley UPC to C (source to source) translator is not supported on as many platforms as the runtime. The translator was not available on Solaris, which prevented a complete installation of the Berkeley UPC platform on Lomond. An online translator provided by Berkeley could be used to compile code for Lomond, but this setup has the drawback that the native compiler could not be utilised effectively. Because of this, most of the tests and studies were performed using the IBM XL UPC installation on HPCx. Unlike the Berkeley UPC compiler, the IBM XL UPC compiler is closed source and is a binary compiler. The IBM XL UPC compiler was found to be capable of providing excellent compiler optimisation, sometimes rivaling the performance of the XLC compiler on serial C code.

It was necessary to compile the GA toolkit on each of the three machines separately; however, the source codes were found to be completely portable between the three machines. This was not the case between Berkeley UPC and IBM XL UPC. The Berkeley UPC compiler was observed to be less forgiving than the IBM XL UPC compiler: while the IBM XL UPC compiler allowed direct assignment between shared and private variables and allowed their references to be passed as function arguments, Berkeley UPC issued warning messages and, in more complex scenarios, failed to compile the code.

The Global Arrays toolkit documentation was very clear and easy to understand. Berkeley UPC also provided excellent documentation with the product. The IBM documentation, however, was not available for download, which made the investigation of XL UPC more difficult.

The portability, availability and installability of the GA code was therefore observed to be higher than that of UPC on the three machines on which these experiments were conducted. Many leading molecular dynamics applications are implemented using Global Arrays, and the maturity and performance of the GA toolkit is therefore higher than that of UPC.

7.2 Comparison of GA and UPC syntax

The GA and UPC routines used for the investigations detailed in this report are presented below, and their syntax and programmability are compared in this section.

The ga_initialize subroutine is used to initialise the GA environment and is called after the MPI environment is initialised. The ga_terminate subroutine is called prior to the finalisation of the MPI environment and is used to terminate the GA environment. UPC does not require an explicit runtime initialisation in code; an environment variable is used to set the number of threads that the UPC program runs on.
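
A minimal sketch of this lifecycle is shown below, written with the GA C bindings rather than the Fortran interface names quoted in this section; it is illustrative only.

#include <mpi.h>
#include "ga.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);    /* the MPI environment comes up first      */
    GA_Initialize();           /* then the GA environment on top of it    */

    /* ... work with global arrays ... */

    GA_Terminate();            /* GA is shut down before MPI is finalised */
    MPI_Finalize();
    return 0;
}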

In order to create a global array, the ga_create routine is used. This routine accepts as parameters the datatype of the array, the number of elements in dimension 1, the number of elements in dimension 2, the name of the array, the block size (chunk) for dimension 1, the chunk for dimension 2, and the integer handle to the global array. The chunks can be specified for each dimension to allow a custom decomposition of the array across all available nodes.

ga_create(type, dim1, dim2, arrayname, chunk1, chunk2, ga)

The GA toolkit also allows creation of multi-dimensional arrays and provides the ability to partition them on every dimension. Up to seven dimensions are supported by the GA toolkit. Using the GA toolkit it is possible to partition each dimension irregularly across nodes.
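
As an illustration of this chunk-based decomposition, the sketch below creates a two-dimensional global array through the GA C bindings; the dimensions and chunk values are placeholders, and the availability of the C_DBL type constant from the GA headers is an assumption.

#include "ga.h"
#include "macdecls.h"   /* assumed source of the C_DBL type constant */

#define M 600
#define N 840

/* chunk[i] is the minimum block size on dimension i; -1 lets GA choose.
   Requesting a chunk of M on the first dimension keeps that dimension
   whole, so the array is split along the second dimension only. */
int create_edge_array(void)
{
    int dims[2]  = { M, N };
    int chunk[2] = { M, -1 };
    int g_a = NGA_Create(C_DBL, 2, dims, "edge", chunk);  /* returns a handle */
    return g_a;
}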

shared [N/THREADS] type array[N];

shared [(M*N)/THREADS] type array[M][N];

UPC, on the other hand, offers only restricted partitioning of two dimensional arrays, owing to the nature of its partitioning syntax. The dependency on a single block size parameter severely limits the partitioning capability of the UPC syntax, and irregular partitioning of arrays across nodes is not supported. The GA syntax therefore provides more control over the partitioning of the distributed shared memory than UPC.

The Global Arrays toolkit provides routines to explicitly check the distribution of a global array. The ga_distribution routine returns the layout of the global array across the distributed memory of the parallel computer. This collective routine is used to obtain the lower and upper index bounds of the portion of a multi-dimensional array stored on processor iproc.

ga_distribution(ga, iproc, ilo, ihi, jlo, jhi)

UPC lacks a mechanism to explicitly check the extent of an array on a processor. Instead, it provides a data locality check to test whether an array element resides on a processor.

upc_forall (iter = 0; iter < N; iter++; &array[iter])

Evaluating this affinity test for every iteration is compute intensive, and its use is not recommended. The Global Arrays routines, on the other hand, can be used effectively to parallelise compute intensive loops after performing the data affinity check in the outer loop.
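
A hedged sketch of the GA pattern just described is given below, using the C bindings (NGA_Distribution and NGA_Access) rather than the Fortran names quoted above; the update inside the loop is purely illustrative.

#include "ga.h"

void compute_on_local_patch(int g_a)
{
    int me = GA_Nodeid();
    int lo[2], hi[2], ld[1];
    double *buf;
    int i, j;

    NGA_Distribution(g_a, me, lo, hi);    /* bounds of the locally held patch */
    NGA_Access(g_a, lo, hi, &buf, ld);    /* direct pointer to that patch     */

    for (i = 0; i <= hi[0] - lo[0]; i++)
        for (j = 0; j <= hi[1] - lo[1]; j++)
            buf[i * ld[0] + j] *= 2.0;    /* illustrative local update        */

    NGA_Release_update(g_a, lo, hi);      /* mark the patch as modified       */
}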

Though UPC provides an easy to use syntax to implement loop level parallelism, its performance was found to be poor due to the high cost of access to distributed shared memory. Therefore, for problems with regular communication patterns and a high amount of communication, UPC programs need to be redesigned to use an explicit parallel decomposition, storing data in local memory and using shared memory only for halo swaps.

The GA communication routines provide more control over access to segments of the global array space than the UPC syntax. It is possible to access multi-dimensional chunks of a global array using single communication calls; the GA toolkit internally converts them into multiple messages, which simplifies parallel programming. The upc_memget and upc_memput routines can only accomplish one dimensional data transfers from the shared array, which requires careful index calculation and more complex programming compared to the GA syntax.
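
The difference can be seen in the sketch below: a single NGA_Get (the C binding of the get operation) fetches a rectangular 16x16 patch, whereas the equivalent UPC transfer would need one upc_memget per row of the patch. The handle, indices and buffer sizes are placeholders.

#include "ga.h"

void fetch_patch(int g_a, double patch[16][16])
{
    int lo[2] = { 0, 0 };      /* global corner indices of the patch    */
    int hi[2] = { 15, 15 };
    int ld[1] = { 16 };        /* leading dimension of the local buffer */
    NGA_Get(g_a, lo, hi, patch, ld);   /* whole 2-D block in one call   */
}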

UPC syntax allows users to directly assign a shared array element to a local array element, which simplifies parallel programming. However, care needs to be taken to avoid frequent accesses to shared memory, especially on large shared arrays, as this leads to performance degradation.
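
For example, in a sketch such as the one below (the shared array declaration is an assumption), every element assignment may turn into a remote read, which is the source of the degradation referred to above.

#include <upc.h>

#define N 360
shared [N] double row_img[THREADS][N];   /* one row per thread (sketch) */

void copy_neighbour_row(void)
{
    static double private_copy[N];
    int j;
    /* Each assignment below reads a shared element that lives on another
       thread, so the loop generates per-element remote traffic. */
    for (j = 0; j < N; j++)
        private_copy[j] = row_img[(MYTHREAD + 1) % THREADS][j];
}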

Both GA and UPC provide collective communication and features to gather and scatter data across processors. The GA toolkit additionally provides explicit routines to perform simple arithmetic, data transposes, the solution of linear algebraic equations and similar operations on the shared memory in a data parallel manner, and thus offers more user friendly functionality. When using the UPC programming language, such routines need to be implemented by the programmer.
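
The sketch below gives a flavour of these whole-array operations through the GA C bindings; g_a, g_b and g_c are assumed to be handles to existing square global arrays of doubles with identical shapes, so that the transpose and add are conformable.

#include "ga.h"

void whole_array_operations(int g_a, int g_b, int g_c)
{
    double one = 1.0, half = 0.5;

    GA_Transpose(g_a, g_b);              /* g_b = transpose of g_a      */
    GA_Add(&one, g_a, &half, g_b, g_c);  /* g_c = 1.0*g_a + 0.5*g_b     */
    GA_Sync();                           /* collective completion point */
}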

7.3 Effect of compiler optimisation

GA programs benefit from the compiler optimisation offered by the native compiler. They were able to do so on all three hardware architectures on which they were tested, without sacrificing the accuracy of the output.

Berkeley UPC, on the other hand, was unable to utilise compiler optimisation on Solaris, as the source to source translator was not available for native compilation on this platform. The IBM XL UPC compiler was able to benefit from compiler optimisation.

7.4 Interoperability with MPI

The Global Arrays toolkit is designed from the ground up to interoperate with MPI. The ranks of the processes assigned by MPI are guaranteed to match the ranks of the processes reported by the GA routines. The GA synchronisation calls can co-exist with MPI synchronisation without leading to deadlocks.
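
A small sketch of this interoperability is shown below, assuming the program has already called MPI_Init and GA_Initialize as in the skeleton given in section 7.2.

#include <mpi.h>
#include "ga.h"

void mixed_mpi_ga_step(void)
{
    int mpi_rank, ga_rank, total;

    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    ga_rank = GA_Nodeid();                     /* expected to equal mpi_rank */

    GA_Sync();                                 /* GA synchronisation ...     */
    MPI_Allreduce(&ga_rank, &total, 1, MPI_INT,
                  MPI_SUM, MPI_COMM_WORLD);    /* ... mixed freely with MPI  */
}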

Though UPC can be installed to use MPI as its communication conduit, parallel programming using both MPI and UPC syntax is not fully supported. While Berkeley UPC supports this mode of programming experimentally, no information was available on the support of MPI in IBM XL UPC. The Berkeley UPC team also recommends using a separate installation of the UPC runtime for compiling UPC-MPI hybrid programs, as this mode can degrade the performance of the UPC communication.

7.5 Comparison of communication latency of GA, UPC and MPI on HPCx

HPCx was used as the platform to investigate the intra-node communication latency of UPC, MPI and GA (Figure 7.1). MPI offered the best overall communication performance for large messages. The GA communication performance was better than MPI for small messages, but for messages over 1MB in size MPI offered better bandwidth. In comparison to both GA and MPI, the UPC performance was poor. However, the communication characteristics of the UPC routines were found to be very stable compared to those of GA or MPI.

[Figure 7.1 plot: elapsed time in seconds against message size in megabytes for GA, UPC get, UPC put and MPI.]

Figure 7.1: Comparison of communication latency of UPC, GA and MPI on HPCx.

Chapter 8

Conclusions

This chapter gives the conclusions derived from the experience of using distributed shared memory programming, represented by the Global Arrays toolkit and IBM XL UPC, on varying hardware architectures.

Distributed shared memory programming using the GA toolkit was found to be easy and yielded high developer productivity. The GA communication routines were, however, found to be more costly than the equivalent communication implemented using MPI. As a result, GA based programs suffer performance degradation when compared to MPI programs.

Distributed shared memory programming can offer a high productivity development environment that enables the straightforward implementation of memory intensive applications on novel MPP architectures such as the IBM Bluegene/L. This was successfully demonstrated by extending the system size of the molecular dynamics application by 77.7% through the use of the Global Arrays toolkit. This allows the IBM Bluegene/L to be used as a target for simulations that could not be executed using the previous version of the molecular dynamics application.

The root cause of the poor scalability faced by the molecular dynamics code on IBM Bluegene/L was attributed to the serial component. By parallelising the routines which contributed to this issue, the scalability of the molecular dynamics application was extended from 160 processors to more than 512 processors.

A comparative study of the use of Unified Parallel C and the Global Arrays toolkit to parallelise an image processing application with nearest neighbor communication was undertaken on HPCx. The results from this study indicate that the use of UPC shared memory can lead to poor application performance. This result is based on the installations of the IBM XL UPC and Berkeley UPC implementations on Lomond and HPCx. Programming UPC in a message passing mode can lead to better performance for applications with nearest neighbor communication.

The developer productivity for high performance computing was found to be higher when using Global Arrays than when using UPC, for applications with nearest neighbor communication. While both UPC and GA programs needed the message passing programming model to yield good performance, the GA syntax was found to be simpler and better suited than UPC for message passing style programming. The Global Arrays toolkit was also observed to be more portable than its open source counterpart, Berkeley UPC.

The communication performance of the MPI and Global Arrays routines was found to be better, by an order of magnitude, than that of UPC shared memory to local memory assignment on Lomond and HPCx. UPC programs that use local memory and explicit halo swaps did not suffer from this issue and yielded good performance. This was demonstrated using an image processing application implemented in UPC with both the shared memory model and the message passing model.

8.1 Future Work

The investigations performed as part of this project revealed several areas which need further investigation. These areas are highlighted in this section.

As part of this project, three large arrays were distributed across the MPP machine using the Global Arrays toolkit. It is possible to further improve the memory scalability of the code on the Bluegene by migrating additional arrays for which the communication cost would not harm application performance. A more detailed investigation may uncover more such opportunities.

The investigation of UPC performance was performed using applications which do not fully utilise the potential of UPC. UPC is better suited to applications that have irregular communication patterns and require dynamic load balancing. A more detailed investigation of UPC on large systems, using applications with irregular communication patterns, is therefore a potential area for future work.

An investigation of UPC on the CRAY XT4 could help identify whether the poor performance of shared memory to local memory assignment reported in this investigation also exists in other versions of UPC on other platforms.

Experimental versions of Berkeley UPC and IBM XL UPC for the IBM Bluegene/L are currently available. An investigation into their performance is also identified as a potential area for future work.

Bibliography

[1] M. Rasit Eskicioglu, University of New Orleans, 1995, A Comprehensive Bibliography of Distributed Shared Memory, ACM SIGOPS Operating Systems Review http://portal.acm.org/citation.cfm?doid=218646.218651

[2] PNNL, 2003, The Global Arrays Toolkit., Online documentation http://www.emsl.pnl.gov/docs/global/

[3] LBNL,UC Berkeley, 2006, Berkeley UPC - Unified Parallel C., Online documentation http://upc.lbl.gov/

[4] Ian Foster, Designing and Building Parallel Programs (Version 1.3), Online tutorial from Addison-Wesley Inc., Argonne National Laboratory, and the NSF Center for Research on Parallel Computation. http://www-unix.mcs.anl.gov/dbpp/

[5] Alan Gray, Joachim Hein and Stephen Booth, 2005, Improved MPI with RDMA, HPCx Technical report http://www.hpcx.ac.uk/research/hpc/technical_reports/HPCxTR0505.pdf

[6] EPCC, 2005, Introduction to the University of Edinburgh HPC Service (Version 3.0), User guide http://www2.epcc.ed.ac.uk/computing/services/sun/documents/hpc-intro/html/index.html

[7] EPCC, 2007, User guide to EPCC’s Bluegene/L Service. (Version 1.0), User guide http://www2.epcc.ed.ac.uk/ bgapps/UserGuide/BGuser/

[8] IBM, 2005, Unfolding the IBM eServer Blue Gene Solution., IBM Redbook http://www.redbooks.ibm.com/abstracts/sg246686.html

[9] HPCx, 2007, User’s Guide to the HPCx Service. (Version 2.02), User guide http://www.hpcx.ac.uk/

[10] Jarek Nieplocha, Manojkumar Krishnan, Bruce Palmer, Vinod Tipparaju, 2007, The Global Arrays User’s Manual, User manual: http://www.emsl.pnl.gov/docs/global/user.html

[11] PNNL, TCGMSG Message Passing Library, Online documentation http://www.emsl.pnl.gov/docs/parsoft/tcgmsg/tcgmsg.html

[12] Jarek Nieplocha, Bruce Palmer, Vinod Tipparaju, Manojkumar Krishnan, Harold Trease, Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit, International Journal of High Performance Computing Applications http://hpc.sagepub.com/cgi/reprint/20/2/203?ck=nck

[13] Jarek Nieplocha, M. Krishnan, Vinod Tipparaju, D. K. Panda, 2007, High Performance Remote Memory Access Communication: The ARMCI Approach, International Journal of High Performance Computing Applications http://hpc.sagepub.com/cgi/reprint/20/2/233

[14] 2007, The ScaLAPACK Project, Online documentation http://www.netlib.org/scalapack/

[15] PNNL, Overview of the Global Arrays Parallel Software Development Toolkit., PNNL Tutorial http://www.emsl.pnl.gov/docs/global/tutorial/ga-sc06.ppt

[16] SC2006, 2007, Design and Implementation of a One-Sided Communication Interface for the IBM eServer Blue Gene Supercomputer, Supercomputing, 2006. SC ’06. Proceedings of the ACM/IEEE SC 2006 Conference http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4090228

[17] IBM, 2005, IBM XL UPC Compilers, Online documentation http://www.alphaworks.ibm.com/tech/upccompiler

[18] UPC consortium, 2005, UPC Language Specifications (Version 1.2), Language specification http://www.gwu.edu/ upc/docs/upc_specs_1.2.pdf

[19] U.C. Berkeley, 2003, GASNET, Online documentation http://gasnet.cs.berkeley.edu/

[20] U.C. Berkeley, 2006, Titanium, Online documentation http://titanium.cs.berkeley.edu/

[21] 2006, Co-Array Fortran, Online documentation http://www.co-array.org/

[22] Francois Cantonnet, Yiyi Yao, Mohamed Zahran and Tarek El-Ghazawi, 2004, Productivity Analysis of the UPC Language, Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1303318

[23] D. C. Rapaport, 2003, The Art of Molecular Dynamics Simulation, Cambridge University Press. p 4.

[24] M. P. Allen and D. J. Tildesley, 1987, Computer Simulation of Liquids Clarendon Press, Oxford.

[25] Plimpton, S. and Hendrickson, 1995, Parallel Molecular Dynamics Algorithms for Simulations of Molecular Systems, Parallel Computing in Computational Chemistry American Chemical Society, Symposium Series 592

[26] Aristedes Papadopoulos, 2006, Simulation of Fundamental Electrochemical Events, MSc Dissertation EPCC http://www2.epcc.ed.ac.uk/msc/dissertations/dissertations-0506/8550532-9e-dissertation1.1.pdf

[27] Dr. Huub Van Dam, CCLRC Daresbury Laboratory, Private communication

[28] IBM, 2006, Port Fortran applications, Online documentation http://www.ibm.com/developerworks/aix/library/au-portfortan.html

[29] Dr. Stewart Reed, School of chemistry, University of Edinburgh, Private communication

[30] CRAY, CRAY Unified Parallel C, User documentation http://docs.cray.com/books/S-2179-50/html-S-2179-50/z1035483822pvl.html

[31] Hewlett-Packard, 2005, HPUPC, User documentation http://h30097.www3.hp.com/upc/

[32] GCC, 2005, GCC UPC, GCC website http://www.intrepid.com/upc.html

[33] UC Berkeley, 2006, Berkeley UPC downloads, Software download http://upc.lbl.gov/download/

[34] IBM, 2006, IBM XL UPC installation requirements, Online documentation http://www.alphaworks.ibm.com/tech/upccompiler/requirements

[35] HECTOR, High-End Computing Terascale Resource, HPCx website http://www.hector.ac.uk
