Enabling Efficient Use of UPC and OpenSHMEM PGAS Models on GPU Clusters
Presented at GTC ’15 by Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Accelerator Era

• Accelerators are becoming common in high-end system architectures
  - Top 100 systems, Nov 2014: 28% use accelerators, and 57% of these use NVIDIA GPUs
• An increasing number of workloads are being ported to take advantage of NVIDIA GPUs
• As workloads scale to large GPU clusters with high compute density, synchronization and communication overheads grow, and so does their performance penalty
• Minimizing these overheads is critical to achieving maximum performance
Parallel Programming Models Overview
[Figure: three abstract machine models, each with processes P1-P3 - Shared Memory (one logical shared memory), Distributed Memory (a private memory per process), and Partitioned Global Address Space (PGAS: private memories plus a logical shared space). Examples: DSM; MPI (Message Passing Interface); Global Arrays, UPC, Chapel, X10, CAF, …]
• Programming models provide abstract machine models
• Models can be mapped onto different types of systems - e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• Each model has strengths and drawbacks - different models suit different problems or applications
Outline

• Overview of PGAS models (UPC and OpenSHMEM)
• Limitations in PGAS models for GPU computing
• Proposed Designs and Alternatives
• Performance Evaluation
• Exploiting GPUDirect RDMA
Partitioned Global Address Space (PGAS) Models
• PGAS models are an attractive alternative to traditional message passing
  - Simple shared memory abstractions
  - Lightweight one-sided communication
  - Flexible synchronization
• Different approaches to PGAS
  - Libraries: OpenSHMEM, Global Arrays
  - Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), Chapel, X10
OpenSHMEM
• SHMEM implementations – Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM
• Subtle differences in API across versions, for example:

                    SGI SHMEM      Quadrics SHMEM   Cray SHMEM
    Initialization  start_pes(0)   shmem_init       start_pes
    Process ID      _my_pe         my_pe            shmem_my_pe

• These differences made application codes non-portable (see the shim sketch below)
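As a hedged illustration of that portability burden, an application would have had to carry something like the following shim; the macro names here are hypothetical, while the wrapped calls come from the table above (the Cray argument convention is assumed):

    /* Hypothetical pre-OpenSHMEM portability shim. Only the
     * underlying vendor calls are real; the macros are illustrative. */
    #if defined(USE_QUADRICS_SHMEM)
    #  define SHMEM_START()  shmem_init()
    #  define SHMEM_MY_PE()  my_pe()
    #elif defined(USE_CRAY_SHMEM)
    #  define SHMEM_START()  start_pes(0)
    #  define SHMEM_MY_PE()  shmem_my_pe()
    #else  /* SGI SHMEM */
    #  define SHMEM_START()  start_pes(0)
    #  define SHMEM_MY_PE()  _my_pe()
    #endif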
• OpenSHMEM is an effort to address this: "A new, open specification to consolidate the various extant SHMEM versions into a widely accepted standard." – OpenSHMEM Specification v1.0, by University of Houston and Oak Ridge National Lab
• SGI SHMEM is the baseline

OpenSHMEM Memory Model
• Defines symmetric data objects that are globally addressable
  - Allocated using a collective shmalloc routine
  - Same type, size and offset address at all processes/processing elements (PEs)
  - The address of a remote object can be calculated from the info of the local object

    int main (int c, char *v[])
    {
        int *b;
        start_pes (0);
        b = (int *) shmalloc (sizeof(int));
        shmem_int_get (b, b, 1, 1);   /* (dst, src, count, pe) */
    }

[Figure: the symmetric object b resides at the same offset in the symmetric heap on PE 0 and PE 1.]

Compiler-based: Unified Parallel C
• UPC: a parallel extension to the C standard
• UPC Specifications and Standards:
  - Introduction to UPC and Language Specification, 1999
  - UPC Language Specifications, v1.0, Feb 2001
  - UPC Language Specifications, v1.1.1, Sep 2004
  - UPC Language Specifications, v1.2, 2005
  - UPC Language Specifications, v1.3, In Progress - Draft Available
• UPC Consortium
  - Academic Institutions: GWU, MTU, UCB, U. Florida, U. Houston, U. Maryland…
  - Government Institutions: ARSC, IDA, LBNL, SNL, US DOE…
  - Commercial Institutions: HP, Cray, Intrepid Technology, IBM, …
• Supported by several UPC compilers
  - Vendor-based commercial UPC compilers: HP UPC, Cray UPC, SGI UPC
  - Open-source UPC compilers: Berkeley UPC, GCC UPC, Michigan Tech MuPC
• Aims for: high performance, coding efficiency, irregular applications, …
UPC Memory Model
[Figure: four UPC threads (Thread 0-3); the shared array elements A1[0]-A1[3] live in the Global Shared Space, while each thread's variable y lives in its Private Space.]

• Global Shared Space: can be accessed by all the threads
• Private Space: holds all the normal variables; can only be accessed by the local thread
• Example:

    shared int A1[THREADS];   /* shared variable */
    int main() {
        int y;                /* private variable */
        A1[0] = 0;            /* local access (from thread 0) */
        A1[1] = 1;            /* remote access (from thread 0) */
    }

MPI+PGAS for Exascale Architectures and Applications
• Gaining attention in efforts towards Exascale computing
• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) Model
  - MPI across address spaces
  - PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared memory programming models
• Re-writing complete applications can be a huge effort
• Porting critical kernels to the desired model is a practical alternative

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI or PGAS based on their communication characteristics (see the sketch below)
[Figure: an HPC application composed of Kernel 1 … Kernel N, with each kernel implemented in MPI or PGAS as appropriate.]

• Benefits:
  - Best of the Distributed Computing Model (MPI)
  - Best of the Shared Memory Computing Model (PGAS)
• Exascale Roadmap*:
  - "Hybrid Programming is a practical way to program exascale systems"
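A minimal, hedged sketch of such a hybrid program, assuming a unified runtime such as MVAPICH2-X that allows MPI and OpenSHMEM in one binary (initialization ordering is implementation-specific):

    #include <mpi.h>
    #include <shmem.h>

    int main (int argc, char *argv[]) {
        MPI_Init (&argc, &argv);
        start_pes (0);                      /* OpenSHMEM initialization */

        int rank;
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);

        /* symmetric heap allocation, usable by one-sided calls */
        int *buf = (int *) shmalloc (sizeof(int));
        *buf = rank;

        /* "Kernel 1" style: MPI collective across address spaces */
        int sum = 0;
        MPI_Allreduce (buf, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* "Kernel 2" style: one-sided PGAS put of rank to PE 0 */
        shmem_int_put (buf, &rank, 1, 0);
        shmem_barrier_all ();

        shfree (buf);
        MPI_Finalize ();
        return 0;
    }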
MVAPICH2-X for Hybrid MPI + PGAS Applications
[Figure: software stack - MPI, OpenSHMEM, UPC, CAF and hybrid (MPI + PGAS) applications issue OpenSHMEM, UPC, CAF and MPI calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE and iWARP.]

• Unified communication runtime for MPI, UPC, OpenSHMEM and CAF, available since MVAPICH2-X 1.9 (09/07/2012): http://mvapich.cse.ohio-state.edu
• Feature Highlights
  - Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC, MPI(+OpenMP) + CAF
  - MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant, CAF 2015 standard compliant
  - Scalable inter-node and intra-node communication – point-to-point and collectives
• Effort underway for support on NVIDIA GPU clusters

Outline

• Overview of PGAS models (UPC and OpenSHMEM)
• Limitations in PGAS models for GPU computing
• Proposed Designs and Alternatives
• Performance Evaluation
• Exploiting GPUDirect RDMA
Limitations of PGAS Models for GPU Computing
• PGAS memory models do not support disjoint memory address spaces - the case with GPU clusters
• OpenSHMEM case: GPU-to-GPU data movement must be staged through host memory

Existing OpenSHMEM model with CUDA:

    /* PE 0 */
    host_buf = shmalloc (…);
    cudaMemcpy (host_buf, dev_buf, size, …);
    shmem_putmem (host_buf, host_buf, size, pe);
    shmem_barrier (…);

    /* PE 1 */
    host_buf = shmalloc (…);
    shmem_barrier (…);
    cudaMemcpy (dev_buf, host_buf, size, …);

• The copies severely limit performance (a self-contained version of this staging pattern follows)
• The synchronization negates the benefits of one-sided communication
• Similar limitations exist in UPC
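For concreteness, here is a hedged, self-contained version of the staging pattern sketched above; the buffer names and the size N are illustrative, not taken from any actual application:

    #include <shmem.h>
    #include <cuda_runtime.h>

    #define N 1024   /* illustrative element count */

    int main (void) {
        start_pes (0);
        int me = _my_pe ();

        float *dev_buf;
        cudaMalloc ((void **) &dev_buf, N * sizeof(float));
        float *host_buf = (float *) shmalloc (N * sizeof(float));

        if (me == 0) {
            /* stage GPU data through the host before the put */
            cudaMemcpy (host_buf, dev_buf, N * sizeof(float),
                        cudaMemcpyDeviceToHost);
            shmem_putmem (host_buf, host_buf, N * sizeof(float), 1);
        }
        shmem_barrier_all ();
        if (me == 1) {
            /* stage received data back onto the GPU */
            cudaMemcpy (dev_buf, host_buf, N * sizeof(float),
                        cudaMemcpyHostToDevice);
        }

        shfree (host_buf);
        cudaFree (dev_buf);
        return 0;
    }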
Outline

• Overview of PGAS models (UPC and OpenSHMEM)
• Limitations in PGAS models for GPU computing
• Proposed Designs and Alternatives
• Performance Evaluation
• Exploiting GPUDirect RDMA

Global Address Space with Host and Device Memory
    heap_on_device ();
    /* allocated on device */
    dev_buf = shmalloc (sizeof(int));

    heap_on_host ();
    /* allocated on host */
    host_buf = shmalloc (sizeof(int));

[Figure: on each of N PEs, the global address space spans a shared region in host memory and a shared region in device memory, alongside each PE's private host and device memory.]

• Extended APIs: heap_on_device / heap_on_host indicate where heap allocations are placed
• A similar approach works for dynamically allocated memory in UPC
CUDA-aware OpenSHMEM and UPC Runtimes
• After device memory becomes part of the global shared space:
  - Accessible through standard UPC/OpenSHMEM communication APIs
  - Data movement transparently handled by the runtime
  - Preserves one-sided semantics at the application level
• Efficient designs to handle communication
  - Inter-node transfers use host-staged transfers with pipelining
  - Intra-node transfers use CUDA IPC
  - Possibility to take advantage of GPUDirect RDMA (GDR)
• Goal: enabling high-performance one-sided communication semantics with GPU devices (a usage sketch follows)
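A hedged illustration of how the application code from the limitations slide simplifies once device memory is in the symmetric heap; heap_on_device is the proposed extension from the previous slide, not a standard API, and peer and size are illustrative:

    /* Proposed-API sketch: GPU-to-GPU put without host staging. */
    heap_on_device ();                        /* proposed extension */
    int *dev_buf = (int *) shmalloc (size);

    /* one call moves data GPU-to-GPU; the runtime chooses the path
     * (host-staged pipeline, CUDA IPC, or GPUDirect RDMA) */
    shmem_putmem (dev_buf, dev_buf, size, peer);
    shmem_barrier_all ();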
Outline

• Overview of PGAS models (UPC and OpenSHMEM)
• Limitations in PGAS models for GPU computing
• Proposed Designs and Alternatives
• Performance Evaluation
• Exploiting GPUDirect RDMA
shmem_putmem Inter-node Communication
[Figure: shmem_putmem inter-node latency (usec) vs. message size; the small-message panel (1 B - 4 KB) shows a 22% improvement, the large-message panel (16 KB - 4 MB) shows 28%.]
• Small messages benefit from selective CUDA registration – 22% improvement for 4-byte messages
• Large messages benefit from pipelined overlap – 28% improvement for 4-MByte messages (a sketch of the pipelining follows)
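A hedged sketch of what host-staged pipelining looks like: the device-to-host copy of chunk i+1 overlaps the network put of chunk i. The chunk size, function name and double-buffered event scheme are illustrative, not MVAPICH2-X internals:

    #include <shmem.h>
    #include <cuda_runtime.h>

    #define CHUNK (256 * 1024)   /* illustrative pipeline unit */

    /* stage must be pinned host memory in the symmetric heap (e.g.,
     * registered with cudaHostRegister) for truly asynchronous copies -
     * this is where selective registration pays off. */
    void pipelined_put (char *stage, const char *dev_src, size_t len,
                        int pe, cudaStream_t stream) {
        cudaEvent_t done[2];
        cudaEventCreate (&done[0]);
        cudaEventCreate (&done[1]);

        size_t nchunks = (len + CHUNK - 1) / CHUNK;
        for (size_t i = 0; i <= nchunks; i++) {
            if (i < nchunks) {          /* stage chunk i to the host */
                size_t off = i * CHUNK;
                size_t n = (len - off < CHUNK) ? len - off : CHUNK;
                cudaMemcpyAsync (stage + off, dev_src + off, n,
                                 cudaMemcpyDeviceToHost, stream);
                cudaEventRecord (done[i % 2], stream);
            }
            if (i > 0) {                /* put chunk i-1 while i copies */
                size_t off = (i - 1) * CHUNK;
                size_t n = (len - off < CHUNK) ? len - off : CHUNK;
                cudaEventSynchronize (done[(i - 1) % 2]);
                shmem_putmem (stage + off, stage + off, n, pe);
            }
        }
        cudaEventDestroy (done[0]);
        cudaEventDestroy (done[1]);
    }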
S. Potluri, D. Bureddy, H. Wang, H. Subramoni and D. K. Panda, Extending OpenSHMEM for GPU Computing, Int'l Parallel and Distributed Processing Symposium (IPDPS '13)

shmem_putmem Intra-node Communication
[Figure: shmem_putmem intra-node latency (usec) vs. message size; the small-message panel shows a 3.4X improvement, the large-message panel shows 5X.]
• Using CUDA IPC for intra-node communication
• Small messages – 3.4X improvement for 4-byte messages
• Large messages – 5X improvement for 4-MByte messages
• Testbed: MVAPICH2-X 2.0b + extensions; Intel Westmere-EP node with 8 cores, 2 NVIDIA Tesla K20c GPUs, Mellanox QDR HCA, CUDA 6.0RC1

Application Kernel Evaluation: Stencil2D
[Figure: Stencil2D total execution time (msec) vs. problem size per GPU (512x512 - 8Kx8K, on 192 GPUs) and vs. number of GPUs (48, 96, 192, with a 4Kx4K problem per GPU); 65% improvement.]
• Modified the SHOC Stencil2D kernel to use OpenSHMEM for cluster-level parallelism
• The enhanced version shows a 65% improvement on 192 GPUs with a 4Kx4K problem size per GPU
• Using OpenSHMEM for GPU-GPU communication allows the runtime to optimize non-contiguous transfers (see the halo-exchange sketch below)
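A hedged sketch of what the GPU-resident halo exchange might look like; the function and variable names are illustrative, not the actual SHOC code. Rows are contiguous, while a boundary column is strided, which is exactly the non-contiguous case the runtime can optimize:

    #include <shmem.h>

    /* One halo-exchange step for a 2D stencil whose nx-by-ny grid lives
     * in device memory allocated from the (proposed) device heap. */
    void halo_exchange (double *grid, double *halo_n, double *halo_s,
                        double *halo_e, int nx, int ny,
                        int north, int south, int east) {
        /* contiguous boundary rows: plain puts to the neighbor PEs */
        shmem_putmem (halo_s, &grid[(ny - 2) * nx],
                      nx * sizeof(double), south);
        shmem_putmem (halo_n, &grid[1 * nx],
                      nx * sizeof(double), north);
        /* strided boundary column: one strided put the runtime can
         * optimize instead of ny tiny transfers */
        shmem_double_iput (halo_e, &grid[nx - 2], 1, nx, ny, east);
        shmem_barrier_all ();   /* halos visible before the next sweep */
    }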
Application Kernel Evaluation: BFS
[Figure: BFS total execution time (msec) on 24, 48 and 96 GPUs, with 1 million vertices per GPU and degree 32; 12% improvement at 96 GPUs.]

• Extended the SHOC BFS kernel to run on a GPU cluster using a level-synchronized algorithm and OpenSHMEM
• The enhanced version shows up to 12% improvement on 96 GPUs, with a consistent improvement in performance as we scale from 24 to 96 GPUs (a sketch of one BFS level follows)
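A heavily hedged sketch of one level of such a level-synchronized BFS; expand_frontier, the single-neighbor exchange and the buffer layout are hypothetical stand-ins for the actual SHOC extension, and only the OpenSHMEM calls are real:

    #include <shmem.h>

    /* hypothetical GPU step: expand the frontier, fill send_buf with
     * newly discovered vertices owned by a remote PE */
    extern int expand_frontier (int *send_buf, int *neighbor);

    long pSync[_SHMEM_REDUCE_SYNC_SIZE];   /* symmetric work arrays */
    int  pWrk[_SHMEM_REDUCE_MIN_WRKDATA_SIZE];

    void bfs_levels (int *send_buf, int *recv_frontier, int npes) {
        static int local_count, global_count;  /* symmetric: reduction args */
        for (int i = 0; i < _SHMEM_REDUCE_SYNC_SIZE; i++)
            pSync[i] = _SHMEM_SYNC_VALUE;
        shmem_barrier_all ();
        global_count = 1;
        while (global_count > 0) {
            int neighbor;
            local_count = expand_frontier (send_buf, &neighbor);
            /* one-sided push of the new frontier to its owner PE */
            shmem_putmem (recv_frontier, send_buf,
                          local_count * sizeof(int), neighbor);
            shmem_barrier_all ();   /* level synchronization */
            /* stop when no PE discovered new vertices this level */
            shmem_int_sum_to_all (&global_count, &local_count, 1,
                                  0, 0, npes, pWrk, pSync);
        }
    }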
Outline

• Overview of PGAS models (UPC and OpenSHMEM)
• Limitations in PGAS models for GPU computing
• Proposed Designs and Alternatives
• Performance Evaluation
• Exploiting GPUDirect RDMA

OpenSHMEM: Inter-node Evaluation
[Figure: shmem_put D-D latency (us) for small messages (1 B - 4 KB) and large messages (8 KB - 2 MB), and shmem_get D-D latency for small messages, comparing Host-Pipeline and GDR; 6.6X put and 9X get improvements.]

• GDR for small/medium message sizes
• Host-staging for large messages to avoid PCIe bottlenecks
• The hybrid design brings the best of both (a path-selection sketch follows)
• 3.13 us put latency for 4 B (6.6X improvement) and 4.7 us latency for 4 KB
• 9X improvement in get latency for 4 B
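A hedged sketch of the size-based path selection such a hybrid design implies; the threshold name and value, and the two helper functions, are illustrative, not actual MVAPICH2-X internals:

    /* Inside a CUDA-aware runtime: GDR for small messages,
     * host-staged pipelining for large ones. */
    #define GDR_THRESHOLD (16 * 1024)   /* illustrative tunable */

    void runtime_put (void *dst, const void *src, size_t len, int pe) {
        if (len <= GDR_THRESHOLD)
            put_via_gdr (dst, src, len, pe);            /* NIC reads GPU memory directly */
        else
            put_via_host_pipeline (dst, src, len, pe);  /* staged in chunks via host */
    }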
OpenSHMEM: Intra-node Evaluation

[Figure: shmem_put D-H and shmem_get D-H latency (us) for small messages (1 B - 4 KB), comparing CUDA IPC and GDR; 2.6X put and 3.6X get improvements.]

• GDR for small and medium message sizes
• IPC for large messages to avoid PCIe bottlenecks
• The hybrid design brings the best of both
• 2.42 us put D-H latency for 4 bytes (2.6X improvement) and 3.92 us latency for 4 KBytes
• 3.6X improvement for the get operation
• Similar results with other configurations (D-D, H-D and D-H)

OpenSHMEM: Stencil3D Application Kernel Evaluation
[Figure: Stencil3D execution time (sec) on 8, 16, 32 and 64 GPU nodes, Host-Pipeline vs. GDR; 19% improvement at 64 nodes.]

- Input size 2K x 2K x 2K
- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- New designs achieve 20% and 19% improvements on 32 and 64 GPU nodes

K. Hamidouche, A. Venkatesh, A. Awan, H. Subramoni, C. Ching and D. K. Panda, Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for GPU Clusters. (under review)

Conclusions
• PGAS models offer lightweight synchronization and one-sided communication semantics
• Their low-overhead synchronization is well suited to GPU architectures
• Proposed extensions to the PGAS memory model efficiently support CUDA-aware PGAS models
• Highly efficient GDR-based designs for OpenSHMEM
• Plan to exploit the GDR-based designs for UPC
• The enhanced designs are planned for incorporation into MVAPICH2-X
Two More Talks
- Learn how to combine and take advantage of Multicast and GPUDirect RDMA simultaneously
  • S5507 – High Performance Broadcast with GPUDirect RDMA and InfiniBand Hardware Multicast for Streaming Application
  • Thursday, 03/19 (Today), 14:00 – 14:25, Room 210 D
- Learn about recent advances and upcoming features in the CUDA-aware MVAPICH2-GPU library
  • S5461 – Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand
  • Thursday, 03/19 (Today), 17:00 – 17:50, Room 212 B
Thanks! Questions?

Contact: [email protected]
http://mvapich.cse.ohio-state.edu http://nowlab.cse.ohio-state.edu