Enabling Efficient Use of UPC and OpenSHMEM PGAS Models on GPU Clusters

Presented at GTC ’15 by Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Accelerator Era
• Accelerators are becoming common in high-end system architectures
  - Top 100 – Nov 2014: 28% use accelerators; 57% of these use NVIDIA GPUs


• An increasing number of workloads are being ported to take advantage of NVIDIA GPUs
• As workloads scale to large GPU clusters with high compute density, the higher the synchronization and communication overheads, the higher the penalty
• Critical to minimize these overheads to achieve maximum performance

Parallel Programming Models Overview

[Figure: three parallel programming models, each shown with processes P1-P3 and their memories]
- Shared Memory Model (e.g., DSM): processes share one logical shared memory
- Distributed Memory Model (e.g., MPI - Message Passing Interface): each process has its own memory
- Partitioned Global Address Space (PGAS) (e.g., UPC, Chapel, X10, CAF, …): private memories plus a logically shared, partitioned space

• Programming models provide abstract machine models
• Models can be mapped onto different types of systems - e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• Each model has strengths and drawbacks - they suit different problems or applications

Outline
• Overview of PGAS models (UPC and OpenSHMEM)
• Limitations in PGAS models for GPU computing
• Proposed Designs and Alternatives
• Performance Evaluation
• Exploiting GPUDirect RDMA

Partitioned Global Address Space (PGAS) Models

• PGAS models: an attractive alternative to traditional message passing
  - Simple shared memory abstractions
  - Lightweight one-sided communication
  - Flexible synchronization
• Different approaches to PGAS
  - Libraries: OpenSHMEM, Global Arrays
  - Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), Chapel, X10

OpenSHMEM

• SHMEM implementations – Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM

• Subtle differences in API across versions – example:
                     SGI SHMEM      Quadrics SHMEM   Cray SHMEM
    Initialization   start_pes(0)   shmem_init       start_pes
    Process ID       _my_pe         my_pe            shmem_my_pe
• This made application codes non-portable

• OpenSHMEM is an effort to address this: “A new, open specification to consolidate the various extant SHMEM versions into a widely accepted standard.” – OpenSHMEM Specification v1.0 by University of Houston and Oak Ridge National Lab

• SGI SHMEM is the baseline
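To make the contrast with the vendor variants concrete, here is a minimal sketch (not from the slides) of a program written against the standardized, SGI-style names kept by OpenSHMEM 1.0: start_pes, _my_pe and _num_pes.

    #include <stdio.h>
    #include <shmem.h>

    int main(void)
    {
        start_pes(0);                  /* initialize the OpenSHMEM runtime (SGI-style name kept by OpenSHMEM 1.0) */
        int me   = _my_pe();           /* this PE's rank */
        int npes = _num_pes();         /* total number of PEs */

        printf("Hello from PE %d of %d\n", me, npes);

        shmem_barrier_all();           /* synchronize all PEs before exiting */
        return 0;
    }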

OpenSHMEM Memory Model
• Defines symmetric data objects that are globally addressable
  - Allocated using a collective shmalloc routine
  - Same type, size and offset address at all processes/processing elements (PEs)
  - The address of a remote object can be calculated based on the info of the local object
• Example (the same code runs on PE 0 and PE 1; b is a symmetric object):

    #include <shmem.h>

    int main (int c, char *v[]) {
        int *b;
        start_pes(0);
        b = (int *) shmalloc (sizeof(int));
        shmem_int_get (b, b, 1, 1);   /* arguments: (dst, src, count, pe) */
        return 0;
    }

Compiler-based: Unified Parallel C
• UPC: a parallel extension to the C standard
• UPC Specifications and Standards:
  - Introduction to UPC and Language Specification, 1999
  - UPC Language Specifications, v1.0, Feb 2001
  - UPC Language Specifications, v1.1.1, Sep 2004
  - UPC Language Specifications, v1.2, 2005
  - UPC Language Specifications, v1.3, In Progress - Draft Available
• UPC Consortium
  - Academic Institutions: GWU, MTU, UCB, U. Florida, U. Houston, U. Maryland, …
  - Government Institutions: ARSC, IDA, LBNL, SNL, US DOE, …
  - Commercial Institutions: HP, Cray, Intrepid Technology, IBM, …
• Supported by several UPC compilers
  - Vendor-based commercial UPC compilers: HP UPC, Cray UPC, SGI UPC
  - Open-source UPC compilers: Berkeley UPC, GCC UPC, Michigan Tech MuPC
• Aims for: high performance, coding efficiency, irregular applications, …

UPC Memory Model

[Figure: Threads 0-3 each own one element of the shared array A1 (A1[0] … A1[3]) in the Global Shared Space; each thread keeps its own private variable y in its Private Space]

• Global Shared Space: can be accessed by all the threads
• Private Space: holds all the normal variables; can only be accessed by the local thread
• Example:

    shared int A1[THREADS];   //shared variable
    int main() {
        int y;                //private variable
        A1[0] = 0;            //local access
        A1[1] = 1;            //remote access
    }

MPI+PGAS for Exascale Architectures and Applications

• Gaining attention in efforts towards Exascale computing

• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) Model
  - MPI across address spaces
  - PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared memory programming models

• Re-writing complete applications can be a huge effort
• Port critical kernels to the desired model instead

Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics (see the hybrid sketch below)

[Figure: an HPC application decomposed into sub-kernels, e.g. Kernel 1 in MPI, Kernel 2 in PGAS, Kernel 3 in MPI, …, Kernel N in PGAS]
• Benefits:
  - Best of the Distributed Computing Model (MPI)
  - Best of the Shared Memory Computing Model (PGAS)
• Exascale Roadmap*: “Hybrid Programming is a practical way to program exascale systems”
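As a concrete illustration of the hybrid style above, here is a minimal sketch (not taken from the slides) of a program that expresses one kernel in MPI and another in OpenSHMEM under a unified runtime such as MVAPICH2-X. Whether MPI and OpenSHMEM can be initialized in the same program is implementation-dependent, so treat the combined initialization as an assumption.

    #include <mpi.h>
    #include <shmem.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);       /* MPI portion of the hybrid program */
        start_pes(0);                 /* OpenSHMEM portion; coexistence with MPI is implementation-dependent */

        int me = _my_pe();

        /* Kernel 1: collective communication expressed in MPI */
        int local = me, sum = 0;
        MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* Kernel 2: one-sided communication expressed in OpenSHMEM */
        static int flag = 0;                       /* static variables are symmetric objects */
        int right = (me + 1) % _num_pes();
        shmem_int_put(&flag, &sum, 1, right);      /* deposit the result into the right neighbor */
        shmem_barrier_all();

        MPI_Finalize();
        return 0;
    }

The point is that each kernel keeps the model that best matches its communication pattern, while a unified runtime avoids duplicating communication resources.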

MVAPICH2-X for Hybrid MPI + PGAS Applications

[Figure: MPI, OpenSHMEM, UPC, CAF and Hybrid (MPI + PGAS) applications issue OpenSHMEM / UPC / CAF / MPI calls into the Unified MVAPICH2-X Runtime, which runs over InfiniBand, RoCE and iWARP]
• Unified communication runtime for MPI, UPC, OpenSHMEM and CAF, available since MVAPICH2-X 1.9 (09/07/2012): http://mvapich.cse.ohio-state.edu
• Feature Highlights
  - Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC, MPI(+OpenMP) + CAF
  - MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant, CAF 2015 standard compliant
  - Scalable inter-node and intra-node communication – point-to-point and collectives
• Effort underway for support on NVIDIA GPU clusters

Limitations of PGAS Models for GPU Computing

• PGAS memory models do not support disjoint memory address spaces - the case with GPU clusters
• OpenSHMEM case: existing OpenSHMEM model with CUDA for GPU-to-GPU data movement

    PE 0:
        host_buf = shmalloc (…);
        cudaMemcpy (host_buf, dev_buf, size, …);
        shmem_putmem (host_buf, host_buf, size, pe);
        shmem_barrier (…);

    PE 1:
        host_buf = shmalloc (…);
        shmem_barrier (…);
        cudaMemcpy (dev_buf, host_buf, size, …);

• The copies severely limit performance
• The synchronization negates the benefits of one-sided communication
• Similar limitations in UPC

Global Address Space with Host and Device Memory

    heap_on_device();
    /* allocated on device */
    dev_buf = shmalloc (sizeof(int));

    heap_on_host();
    /* allocated on host */
    host_buf = shmalloc (sizeof(int));

[Figure: each process keeps private host and device memory plus a shared space of size N on host memory and a shared space of size N on device memory]
• Extended APIs: heap_on_device / heap_on_host give a way to indicate the location of the heap
• A similar approach can be used for dynamically allocated memory in UPC
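A minimal sketch of how an application might use the proposed extension, assuming heap_on_device()/heap_on_host() switch the heap used by subsequent shmalloc calls as described above. The exact signatures and the direct put on a device buffer reflect the proposed design sketched on these slides, not standard OpenSHMEM.

    #include <shmem.h>

    int main(void)
    {
        start_pes(0);
        int me = _my_pe();

        heap_on_device();                              /* proposed: next shmalloc returns device memory */
        int *dev_buf  = (int *) shmalloc (sizeof(int) * 1024);

        heap_on_host();                                /* proposed: back to the host-resident symmetric heap */
        int *host_buf = (int *) shmalloc (sizeof(int) * 1024);   /* host buffer shown for contrast */

        /* With a CUDA-aware runtime, the put names the device buffer directly;
           no explicit cudaMemcpy staging is needed in application code. */
        if (me == 0)
            shmem_putmem (dev_buf, dev_buf, sizeof(int) * 1024, 1);

        shmem_barrier_all();
        return 0;
    }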

CUDA-aware OpenSHMEM and UPC Runtimes

• After device memory becomes part of the global shared space:
  - Accessible through standard UPC/OpenSHMEM communication APIs
  - Data movement transparently handled by the runtime
  - Preserves one-sided semantics at the application level
• Efficient designs to handle communication
  - Inter-node transfers use host-staged transfers with pipelining (see the sketch after this slide)
  - Intra-node transfers use CUDA IPC
  - Possibility to take advantage of GPUDirect RDMA (GDR)

• Goal: enabling high-performance one-sided communication semantics with GPU devices
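The host-staged pipelining mentioned above can be sketched as follows; this is an illustrative sketch, not the MVAPICH2-X implementation. The device buffer is sent in chunks, and the device-to-host copy of chunk i+1 is overlapped with the network send of chunk i. net_send() is a hypothetical placeholder for the runtime's transport call, and the two staging buffers are assumed to be pinned host memory (cudaMallocHost).

    #include <cuda_runtime.h>
    #include <stddef.h>

    #define CHUNK (512 * 1024)                 /* staging chunk size (assumed tuning parameter) */

    void net_send(const void *buf, size_t len, int peer);   /* hypothetical transport send */

    /* Push a device-resident buffer to a remote peer through two pinned host staging buffers. */
    void pipelined_put(const void *dev_buf, size_t len, int peer,
                       void *stage[2], cudaStream_t stream[2])
    {
        size_t nchunks = (len + CHUNK - 1) / CHUNK;

        /* Prime the pipeline: start copying chunk 0 from device to host. */
        size_t first = (len < CHUNK) ? len : CHUNK;
        cudaMemcpyAsync(stage[0], dev_buf, first, cudaMemcpyDeviceToHost, stream[0]);

        for (size_t i = 0; i < nchunks; i++) {
            int cur = (int)(i & 1), nxt = (int)((i + 1) & 1);
            size_t off = i * CHUNK;
            size_t n   = (len - off < CHUNK) ? (len - off) : CHUNK;

            /* Kick off the copy of the next chunk so it overlaps the send of this one. */
            if (i + 1 < nchunks) {
                size_t noff = (i + 1) * CHUNK;
                size_t nn   = (len - noff < CHUNK) ? (len - noff) : CHUNK;
                cudaMemcpyAsync(stage[nxt], (const char *)dev_buf + noff, nn,
                                cudaMemcpyDeviceToHost, stream[nxt]);
            }

            cudaStreamSynchronize(stream[cur]);    /* wait for the current chunk to reach host memory */
            net_send(stage[cur], n, peer);         /* send while the next copy proceeds */
        }
    }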

shmem_putmem Inter-node Communication

[Figure: inter-node shmem_putmem latency (usec) vs. message size, for small messages (1 Byte to 4 KBytes) and large messages (16 KBytes to 4 MBytes); 22% and 28% improvements highlighted]

• Small messages benefit from selective CUDA registration – 22% improvement for 4-Byte messages

• Large messages benefit from pipelined overlap – 28% improvement for 4-MByte messages

S. Potluri, D. Bureddy, H. Wang, H. Subramoni and D. K. Panda, Extending OpenSHMEM for GPU Computing, Int'l Parallel and Distributed Processing Symposium (IPDPS '13)

shmem_putmem Intra-node Communication

[Figure: intra-node shmem_putmem latency (usec) vs. message size, for small messages (1 Byte to 4 KBytes) and large messages (16 KBytes to 4 MBytes); 3.4X and 5X improvements highlighted]

• Using CUDA IPC for intra-node communication (a sketch of the mechanism follows below)
• Small messages – 3.4X improvement for 4-Byte messages
• Large messages – 5X improvement for 4-MByte messages
• Setup: based on MVAPICH2-X 2.0b + extensions; Intel Westmere-EP node with 8 cores, 2 NVIDIA Tesla K20c GPUs, Mellanox QDR HCA, CUDA 6.0RC1
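The CUDA IPC path used for intra-node transfers can be sketched as follows; this is an illustrative sketch, not the MVAPICH2-X code. The owning process exports an IPC handle for its device buffer, the handle travels to the peer process on the same node over some out-of-band channel (exchange_handle() below is a hypothetical placeholder), and the peer maps the buffer and copies GPU-to-GPU without host staging.

    #include <cuda_runtime.h>

    /* Hypothetical out-of-band exchange (e.g., over shared memory or a socket). */
    void exchange_handle(cudaIpcMemHandle_t *handle, int peer);

    /* Producer: export a handle for a device buffer so a peer on the same node can map it. */
    void export_device_buffer(void *dev_buf, int peer)
    {
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, dev_buf);       /* create an IPC handle for dev_buf */
        exchange_handle(&handle, peer);              /* ship the handle to the peer process */
    }

    /* Consumer: map the peer's device buffer and copy from it GPU-to-GPU. */
    void read_peer_buffer(void *local_dev_buf, size_t len, int peer)
    {
        cudaIpcMemHandle_t handle;
        exchange_handle(&handle, peer);              /* receive the handle from the peer */

        void *peer_dev_buf = NULL;
        cudaIpcOpenMemHandle(&peer_dev_buf, handle, cudaIpcMemLazyEnablePeerAccess);

        cudaMemcpy(local_dev_buf, peer_dev_buf, len, cudaMemcpyDeviceToDevice);

        cudaIpcCloseMemHandle(peer_dev_buf);         /* unmap when done */
    }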

Application Kernel Evaluation: Stencil2D
[Figure: total execution time (msec) vs. problem size per GPU (512x512 to 8Kx8K, on 192 GPUs) and vs. number of GPUs (48, 96, 192, with a 4Kx4K problem per GPU); 65% improvement highlighted]

• Modified the SHOC Stencil2D kernel to use OpenSHMEM for cluster-level parallelism

• The enhanced version shows a 65% improvement on 192 GPUs with a 4Kx4K problem size per GPU

• Using OpenSHMEM for GPU-GPU communication allows the runtime to optimize non-contiguous transfers

Application Kernel Evaluation: BFS

[Figure: total execution time (msec) on 24, 48 and 96 GPUs, with 1 million vertices per GPU and degree 32; 12% improvement highlighted]
• Extended the SHOC BFS kernel to run on a GPU cluster using a level-synchronized algorithm and OpenSHMEM

• The enhanced version shows up to 12% improvement on 96 GPUs, with a consistent improvement in performance as we scale from 24 to 96 GPUs

OpenSHMEM: Inter-node Evaluation

[Figure: shmem_put D-D latency (us) for small messages (1 Byte to 4 KBytes) and large messages (8 KBytes to 2 MBytes), and shmem_get D-D latency for small messages; Host-Pipeline vs. GDR, with 6.6X and 9X improvements highlighted]
• GDR for small/medium message sizes
• Host-staging for large messages to avoid PCIe bottlenecks
• The hybrid design brings the best of both (a sketch of the selection logic follows below)
• 3.13 us put latency for 4 Bytes (6.6X improvement) and 4.7 us latency for 4 KBytes
• 9X improvement in get latency for 4 Bytes
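The hybrid design above amounts to a message-size-based protocol switch. The sketch below is illustrative; the threshold value and the two helper routines (put_via_gdr, put_via_host_pipeline) are assumptions, not names from the runtime.

    #include <stddef.h>

    #define GDR_THRESHOLD (16 * 1024)   /* assumed crossover point between GDR and host staging */

    void put_via_gdr(void *dst, const void *src, size_t len, int pe);           /* hypothetical: RDMA directly to/from GPU memory */
    void put_via_host_pipeline(void *dst, const void *src, size_t len, int pe); /* hypothetical: pipelined host-staged path */

    /* Route a GPU-resident put along the faster path for its size. */
    void shmem_putmem_gpu(void *dst, const void *src, size_t len, int pe)
    {
        if (len <= GDR_THRESHOLD)
            put_via_gdr(dst, src, len, pe);            /* small/medium: low-latency GPUDirect RDMA */
        else
            put_via_host_pipeline(dst, src, len, pe);  /* large: stage through host to avoid PCIe bottlenecks */
    }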

OpenSHMEM: Intra-node Evaluation
[Figure: shmem_put D-H and shmem_get D-H latency (us) for small messages (1 Byte to 4 KBytes); IPC vs. GDR, with 2.6X and 3.6X improvements highlighted]
• GDR for small and medium message sizes
• IPC for large messages to avoid PCIe bottlenecks
• The hybrid design brings the best of both
• 2.42 us put D-H latency for 4 Bytes (2.6X improvement) and 3.92 us latency for 4 KBytes
• 3.6X improvement for the get operation
• Similar results with the other configurations (D-D, H-D and D-H)

OpenSHMEM: Stencil3D Application Kernel Evaluation
[Figure: execution time (sec) for Host-Pipeline vs. GDR on 8, 16, 32 and 64 GPU nodes; 19% improvement highlighted]
- Input size: 2K x 2K x 2K
- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- The new designs achieve 20% and 19% improvements on 32 and 64 GPU nodes

K. Hamidouche, A. Venkatesh, A. Awan, H. Subramoni, C. Ching and D. K. Panda, Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for GPU Clusters (under review)

Conclusions

• PGAS models offer lightweight synchronization and one-sided communication semantics
• Low-overhead synchronization is suited for GPU architectures
• Extensions to the PGAS memory model efficiently support CUDA-aware PGAS models
• Highly efficient GDR-based designs for OpenSHMEM
• Plan to exploit the GDR-based designs for UPC
• The enhanced designs are planned to be incorporated into MVAPICH2-X

Two More Talks

• Learn more on how to combine and take advantage of Multicast and GPUDirect RDMA simultaneously
  - S5507 – High Performance Broadcast with GPUDirect RDMA and InfiniBand Hardware Multicast for Streaming Applications
  - Thursday, 03/19 (Today), 14:00 - 14:25, Room 210 D
• Learn about recent advances and upcoming features in the CUDA-aware MVAPICH2-GPU library
  - S5461 – Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand
  - Thursday, 03/19 (Today), 17:00 - 17:50, Room 212 B

Thanks! Questions?
Contact: [email protected]

http://mvapich.cse.ohio-state.edu http://nowlab.cse.ohio-state.edu
