Enabling Efficient Use of UPC and OpenSHMEM PGAS Models on GPU Clusters

Presented at GTC ’15 by Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Accelerator Era
• Accelerators are becoming common in high-end system architectures
  - Top 100 – Nov 2014: 28% use accelerators; 57% of these use NVIDIA GPUs


• An increasing number of workloads are being ported to take advantage of NVIDIA GPUs
• As workloads scale to large GPU clusters with high compute density, the higher the synchronization and communication overheads, the higher the penalty
• Critical to minimize these overheads to achieve maximum performance

Parallel Programming Models Overview

[Figure: three parallel programming models, each shown with processes P1-P3 and their memories]
- Shared Memory Model (e.g., DSM): processes share one logical shared memory
- Distributed Memory Model (e.g., MPI - Message Passing Interface): each process has its own memory
- Partitioned Global Address Space (PGAS) (e.g., UPC, Chapel, X10, CAF, …): private memories plus a logically shared, partitioned space

• Programming models provide abstract machine models
• Models can be mapped onto different types of systems - e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• Each model has strengths and drawbacks - they suit different problems or applications

Outline
• Overview of PGAS models (UPC and OpenSHMEM)
• Limitations in PGAS models for GPU computing
• Proposed Designs and Alternatives
• Performance Evaluation
• Exploiting GPUDirect RDMA

Partitioned Global Address Space (PGAS) Models

• PGAS models: an attractive alternative to traditional message passing
  - Simple shared memory abstractions
  - Lightweight one-sided communication
  - Flexible synchronization
• Different approaches to PGAS
  - Libraries: OpenSHMEM, Global Arrays
  - Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), Chapel, X10

OpenSHMEM

• SHMEM implementations – Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM

• Subtle differences in API across versions – example:
                     SGI SHMEM      Quadrics SHMEM   Cray SHMEM
    Initialization   start_pes(0)   shmem_init       start_pes
    Process ID       _my_pe         my_pe            shmem_my_pe
• This made application codes non-portable

• OpenSHMEM is an effort to address this: “A new, open specification to consolidate the various extant SHMEM versions into a widely accepted standard.” – OpenSHMEM Specification v1.0 by University of Houston and Oak Ridge National Lab

• SGI SHMEM is the baseline
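To make the contrast with the vendor variants concrete, here is a minimal sketch (not from the slides) of a program written against the standardized, SGI-style names kept by OpenSHMEM 1.0: start_pes, _my_pe and _num_pes.

    #include <stdio.h>
    #include <shmem.h>

    int main(void)
    {
        start_pes(0);                  /* initialize the OpenSHMEM runtime (SGI-style name kept by OpenSHMEM 1.0) */
        int me   = _my_pe();           /* this PE's rank */
        int npes = _num_pes();         /* total number of PEs */

        printf("Hello from PE %d of %d\n", me, npes);

        shmem_barrier_all();           /* synchronize all PEs before exiting */
        return 0;
    }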

OpenSHMEM Memory Model
• Defines symmetric data objects that are globally addressable
  - Allocated using a collective shmalloc routine
  - Same type, size and offset address at all processes/processing elements (PEs)
  - The address of a remote object can be calculated based on the info of the local object
• Example (the same code runs on PE 0 and PE 1; b is a symmetric object):

    #include <shmem.h>

    int main (int c, char *v[]) {
        int *b;
        start_pes(0);
        b = (int *) shmalloc (sizeof(int));
        shmem_int_get (b, b, 1, 1);   /* arguments: (dst, src, count, pe) */
        return 0;
    }

Compiler-based: Unified Parallel C
• UPC: a parallel extension to the C standard
• UPC Specifications and Standards:
  - Introduction to UPC and Language Specification, 1999
  - UPC Language Specifications, v1.0, Feb 2001
  - UPC Language Specifications, v1.1.1, Sep 2004
  - UPC Language Specifications, v1.2, 2005
  - UPC Language Specifications, v1.3, In Progress - Draft Available
• UPC Consortium
  - Academic Institutions: GWU, MTU, UCB, U. Florida, U. Houston, U. Maryland, …
  - Government Institutions: ARSC, IDA, LBNL, SNL, US DOE, …
  - Commercial Institutions: HP, Cray, Intrepid Technology, IBM, …
• Supported by several UPC compilers
  - Vendor-based commercial UPC compilers: HP UPC, Cray UPC, SGI UPC
  - Open-source UPC compilers: Berkeley UPC, GCC UPC, Michigan Tech MuPC
• Aims for: high performance, coding efficiency, irregular applications, …

UPC Memory Model

[Figure: Threads 0-3 each own one element of the shared array A1 (A1[0] … A1[3]) in the Global Shared Space; each thread keeps its own private variable y in its Private Space]

• Global Shared Space: can be accessed by all the threads
• Private Space: holds all the normal variables; can only be accessed by the local thread
• Example:

    shared int A1[THREADS];   //shared variable
    int main() {
        int y;                //private variable
        A1[0] = 0;            //local access
        A1[1] = 1;            //remote access
    }

MPI+PGAS for Exascale Architectures and Applications

• Gaining attention in efforts towards Exascale computing

• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) Model
  - MPI across address spaces
  - PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared memory programming models

• Re-writing complete applications can be a huge effort
• Port critical kernels to the desired model instead

Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics (see the hybrid sketch below)

[Figure: an HPC application decomposed into sub-kernels, e.g. Kernel 1 in MPI, Kernel 2 in PGAS, Kernel 3 in MPI, …, Kernel N in PGAS]
• Benefits:
  - Best of the Distributed Computing Model (MPI)
  - Best of the Shared Memory Computing Model (PGAS)
• Exascale Roadmap*: “Hybrid Programming is a practical way to program exascale systems”
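As a concrete illustration of the hybrid style above, here is a minimal sketch (not taken from the slides) of a program that expresses one kernel in MPI and another in OpenSHMEM under a unified runtime such as MVAPICH2-X. Whether MPI and OpenSHMEM can be initialized in the same program is implementation-dependent, so treat the combined initialization as an assumption.

    #include <mpi.h>
    #include <shmem.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);       /* MPI portion of the hybrid program */
        start_pes(0);                 /* OpenSHMEM portion; coexistence with MPI is implementation-dependent */

        int me = _my_pe();

        /* Kernel 1: collective communication expressed in MPI */
        int local = me, sum = 0;
        MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* Kernel 2: one-sided communication expressed in OpenSHMEM */
        static int flag = 0;                       /* static variables are symmetric objects */
        int right = (me + 1) % _num_pes();
        shmem_int_put(&flag, &sum, 1, right);      /* deposit the result into the right neighbor */
        shmem_barrier_all();

        MPI_Finalize();
        return 0;
    }

The point is that each kernel keeps the model that best matches its communication pattern, while a unified runtime avoids duplicating communication resources.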

MVAPICH2-X for Hybrid MPI + PGAS Applications

[Figure: MPI, OpenSHMEM, UPC, CAF and Hybrid (MPI + PGAS) applications issue OpenSHMEM / UPC / CAF / MPI calls into the Unified MVAPICH2-X Runtime, which runs over InfiniBand, RoCE and iWARP]
• Unified communication runtime for MPI, UPC, OpenSHMEM and CAF, available since MVAPICH2-X 1.9 (09/07/2012): http://mvapich.cse.ohio-state.edu
• Feature Highlights
  - Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC, MPI(+OpenMP) + CAF
  - MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant, CAF 2015 standard compliant
  - Scalable inter-node and intra-node communication – point-to-point and collectives
• Effort underway for support on NVIDIA GPU clusters

Limitations of PGAS Models for GPU Computing

• PGAS memory models do not support disjoint memory address spaces - the case with GPU clusters
• OpenSHMEM case: existing OpenSHMEM model with CUDA for GPU-to-GPU data movement

    PE 0:
        host_buf = shmalloc (…);
        cudaMemcpy (host_buf, dev_buf, size, …);
        shmem_putmem (host_buf, host_buf, size, pe);
        shmem_barrier (…);

    PE 1:
        host_buf = shmalloc (…);
        shmem_barrier (…);
        cudaMemcpy (dev_buf, host_buf, size, …);

• The copies severely limit performance
• The synchronization negates the benefits of one-sided communication
• Similar limitations in UPC

Global Address Space with Host and Device Memory

    heap_on_device();
    /* allocated on device */
    dev_buf = shmalloc (sizeof(int));

    heap_on_host();
    /* allocated on host */
    host_buf = shmalloc (sizeof(int));

[Figure: each process keeps private host and device memory plus a shared space of size N on host memory and a shared space of size N on device memory]
• Extended APIs: heap_on_device / heap_on_host give a way to indicate the location of the heap
• A similar approach can be used for dynamically allocated memory in UPC
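A minimal sketch of how an application might use the proposed extension, assuming heap_on_device()/heap_on_host() switch the heap used by subsequent shmalloc calls as described above. The exact signatures and the direct put on a device buffer reflect the proposed design sketched on these slides, not standard OpenSHMEM.

    #include <shmem.h>

    int main(void)
    {
        start_pes(0);
        int me = _my_pe();

        heap_on_device();                              /* proposed: next shmalloc returns device memory */
        int *dev_buf  = (int *) shmalloc (sizeof(int) * 1024);

        heap_on_host();                                /* proposed: back to the host-resident symmetric heap */
        int *host_buf = (int *) shmalloc (sizeof(int) * 1024);   /* host buffer shown for contrast */

        /* With a CUDA-aware runtime, the put names the device buffer directly;
           no explicit cudaMemcpy staging is needed in application code. */
        if (me == 0)
            shmem_putmem (dev_buf, dev_buf, sizeof(int) * 1024, 1);

        shmem_barrier_all();
        return 0;
    }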

CUDA-aware OpenSHMEM and UPC Runtimes

• After device memory becomes part of the global shared space:
  - Accessible through standard UPC/OpenSHMEM communication APIs
  - Data movement transparently handled by the runtime
  - Preserves one-sided semantics at the application level
• Efficient designs to handle communication
  - Inter-node transfers use host-staged transfers with pipelining (see the sketch after this slide)
  - Intra-node transfers use CUDA IPC
  - Possibility to take advantage of GPUDirect RDMA (GDR)

• Goal: enabling high-performance one-sided communication semantics with GPU devices
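The host-staged pipelining mentioned above can be sketched as follows; this is an illustrative sketch, not the MVAPICH2-X implementation. The device buffer is sent in chunks, and the device-to-host copy of chunk i+1 is overlapped with the network send of chunk i. net_send() is a hypothetical placeholder for the runtime's transport call, and the two staging buffers are assumed to be pinned host memory (cudaMallocHost).

    #include <cuda_runtime.h>
    #include <stddef.h>

    #define CHUNK (512 * 1024)                 /* staging chunk size (assumed tuning parameter) */

    void net_send(const void *buf, size_t len, int peer);   /* hypothetical transport send */

    /* Push a device-resident buffer to a remote peer through two pinned host staging buffers. */
    void pipelined_put(const void *dev_buf, size_t len, int peer,
                       void *stage[2], cudaStream_t stream[2])
    {
        size_t nchunks = (len + CHUNK - 1) / CHUNK;

        /* Prime the pipeline: start copying chunk 0 from device to host. */
        size_t first = (len < CHUNK) ? len : CHUNK;
        cudaMemcpyAsync(stage[0], dev_buf, first, cudaMemcpyDeviceToHost, stream[0]);

        for (size_t i = 0; i < nchunks; i++) {
            int cur = (int)(i & 1), nxt = (int)((i + 1) & 1);
            size_t off = i * CHUNK;
            size_t n   = (len - off < CHUNK) ? (len - off) : CHUNK;

            /* Kick off the copy of the next chunk so it overlaps the send of this one. */
            if (i + 1 < nchunks) {
                size_t noff = (i + 1) * CHUNK;
                size_t nn   = (len - noff < CHUNK) ? (len - noff) : CHUNK;
                cudaMemcpyAsync(stage[nxt], (const char *)dev_buf + noff, nn,
                                cudaMemcpyDeviceToHost, stream[nxt]);
            }

            cudaStreamSynchronize(stream[cur]);    /* wait for the current chunk to reach host memory */
            net_send(stage[cur], n, peer);         /* send while the next copy proceeds */
        }
    }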

shmem_putmem Inter-node Communication

[Figure: inter-node shmem_putmem latency (usec) vs. message size, for small messages (1 Byte to 4 KBytes) and large messages (16 KBytes to 4 MBytes); 22% and 28% improvements highlighted]

• Small messages benefit from selective CUDA registration – 22% improvement for 4-Byte messages

• Large messages benefit from pipelined overlap – 28% improvement for 4-MByte messages

S. Potluri, D. Bureddy, H. Wang, H. Subramoni and D. K. Panda, Extending OpenSHMEM for GPU Computing, Int'l Parallel and Distributed Processing Symposium (IPDPS '13)

shmem_putmem Intra-node Communication

[Figure: intra-node shmem_putmem latency (usec) vs. message size, for small messages (1 Byte to 4 KBytes) and large messages (16 KBytes to 4 MBytes); 3.4X and 5X improvements highlighted]

• Using CUDA IPC for intra-node communication (a sketch of the mechanism follows below)
• Small messages – 3.4X improvement for 4-Byte messages
• Large messages – 5X improvement for 4-MByte messages
• Setup: based on MVAPICH2-X 2.0b + extensions; Intel Westmere-EP node with 8 cores, 2 NVIDIA Tesla K20c GPUs, Mellanox QDR HCA, CUDA 6.0RC1
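The CUDA IPC path used for intra-node transfers can be sketched as follows; this is an illustrative sketch, not the MVAPICH2-X code. The owning process exports an IPC handle for its device buffer, the handle travels to the peer process on the same node over some out-of-band channel (exchange_handle() below is a hypothetical placeholder), and the peer maps the buffer and copies GPU-to-GPU without host staging.

    #include <cuda_runtime.h>

    /* Hypothetical out-of-band exchange (e.g., over shared memory or a socket). */
    void exchange_handle(cudaIpcMemHandle_t *handle, int peer);

    /* Producer: export a handle for a device buffer so a peer on the same node can map it. */
    void export_device_buffer(void *dev_buf, int peer)
    {
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, dev_buf);       /* create an IPC handle for dev_buf */
        exchange_handle(&handle, peer);              /* ship the handle to the peer process */
    }

    /* Consumer: map the peer's device buffer and copy from it GPU-to-GPU. */
    void read_peer_buffer(void *local_dev_buf, size_t len, int peer)
    {
        cudaIpcMemHandle_t handle;
        exchange_handle(&handle, peer);              /* receive the handle from the peer */

        void *peer_dev_buf = NULL;
        cudaIpcOpenMemHandle(&peer_dev_buf, handle, cudaIpcMemLazyEnablePeerAccess);

        cudaMemcpy(local_dev_buf, peer_dev_buf, len, cudaMemcpyDeviceToDevice);

        cudaIpcCloseMemHandle(peer_dev_buf);         /* unmap when done */
    }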

Application Kernel Evaluation: Stencil2D
[Figure: total execution time (msec) vs. problem size per GPU (512x512 to 8Kx8K, on 192 GPUs) and vs. number of GPUs (48, 96, 192, with a 4Kx4K problem per GPU); 65% improvement highlighted]

• Modified the SHOC Stencil2D kernel to use OpenSHMEM for cluster-level parallelism

• The enhanced version shows a 65% improvement on 192 GPUs with a 4Kx4K problem size per GPU

• Using OpenSHMEM for GPU-GPU communication allows the runtime to optimize non-contiguous transfers

Application Kernel Evaluation: BFS

[Figure: total execution time (msec) on 24, 48 and 96 GPUs, with 1 million vertices per GPU and degree 32; 12% improvement highlighted]
• Extended the SHOC BFS kernel to run on a GPU cluster using a level-synchronized algorithm and OpenSHMEM

• The enhanced version shows up to 12% improvement on 96 GPUs, with a consistent improvement in performance as we scale from 24 to 96 GPUs

OpenSHMEM: Inter-node Evaluation

[Figure: shmem_put D-D latency (us) for small messages (1 Byte to 4 KBytes) and large messages (8 KBytes to 2 MBytes), and shmem_get D-D latency for small messages; Host-Pipeline vs. GDR, with 6.6X and 9X improvements highlighted]
• GDR for small/medium message sizes
• Host-staging for large messages to avoid PCIe bottlenecks
• The hybrid design brings the best of both (a sketch of the selection logic follows below)
• 3.13 us put latency for 4 Bytes (6.6X improvement) and 4.7 us latency for 4 KBytes
• 9X improvement in get latency for 4 Bytes
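The hybrid design above amounts to a message-size-based protocol switch. The sketch below is illustrative; the threshold value and the two helper routines (put_via_gdr, put_via_host_pipeline) are assumptions, not names from the runtime.

    #include <stddef.h>

    #define GDR_THRESHOLD (16 * 1024)   /* assumed crossover point between GDR and host staging */

    void put_via_gdr(void *dst, const void *src, size_t len, int pe);           /* hypothetical: RDMA directly to/from GPU memory */
    void put_via_host_pipeline(void *dst, const void *src, size_t len, int pe); /* hypothetical: pipelined host-staged path */

    /* Route a GPU-resident put along the faster path for its size. */
    void shmem_putmem_gpu(void *dst, const void *src, size_t len, int pe)
    {
        if (len <= GDR_THRESHOLD)
            put_via_gdr(dst, src, len, pe);            /* small/medium: low-latency GPUDirect RDMA */
        else
            put_via_host_pipeline(dst, src, len, pe);  /* large: stage through host to avoid PCIe bottlenecks */
    }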

OpenSHMEM: Intra-node Evaluation
[Figure: shmem_put D-H and shmem_get D-H latency (us) for small messages (1 Byte to 4 KBytes); IPC vs. GDR, with 2.6X and 3.6X improvements highlighted]
• GDR for small and medium message sizes
• IPC for large messages to avoid PCIe bottlenecks
• The hybrid design brings the best of both
• 2.42 us put D-H latency for 4 Bytes (2.6X improvement) and 3.92 us latency for 4 KBytes
• 3.6X improvement for the get operation
• Similar results with the other configurations (D-D, H-D and D-H)

OpenSHMEM: Stencil3D Application Kernel Evaluation
[Figure: execution time (sec) for Host-Pipeline vs. GDR on 8, 16, 32 and 64 GPU nodes; 19% improvement highlighted]
- Input size: 2K x 2K x 2K
- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- The new designs achieve 20% and 19% improvements on 32 and 64 GPU nodes

K. Hamidouche, A. Venkatesh, A. Awan, H. Subramoni, C. Ching and D. K. Panda, Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for GPU Clusters (under review)

Conclusions

• PGAS models offer lightweight synchronization and one-sided communication semantics
• Low-overhead synchronization is suited for GPU architectures
• Extensions to the PGAS memory model efficiently support CUDA-aware PGAS models
• Highly efficient GDR-based designs for OpenSHMEM
• Plan to exploit the GDR-based designs for UPC
• The enhanced designs are planned to be incorporated into MVAPICH2-X

Two More Talks

• Learn more on how to combine and take advantage of Multicast and GPUDirect RDMA simultaneously
  - S5507 – High Performance Broadcast with GPUDirect RDMA and InfiniBand Hardware Multicast for Streaming Applications
  - Thursday, 03/19 (Today), 14:00 - 14:25, Room 210 D
• Learn about recent advances and upcoming features in the CUDA-aware MVAPICH2-GPU library
  - S5461 – Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand
  - Thursday, 03/19 (Today), 17:00 - 17:50, Room 212 B

Thanks! Questions?
Contact: [email protected]

http://mvapich.cse.ohio-state.edu http://nowlab.cse.ohio-state.edu
