MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters
H. Wang, S. Potluri, M. Luo, A. K. Singh, S. Sur, D. K. Panda
Network-Based Computing Laboratory, The Ohio State University
ISC 2011, Hamburg

Outline
• Introduction
• Problem Statement
• Our Solution: MVAPICH2-GPU
• Design Considerations
• Performance Evaluation
• Conclusion & Future Work
InfiniBand Clusters in Top500
• Percentage share of InfiniBand is steadily increasing
• 41% of systems in the TOP500 use InfiniBand (June '11)
• 61% of systems in the TOP100 use InfiniBand (June '11)

Growth in GPGPUs
• GPGPUs are gaining significance on clusters for data-centric applications:
  – Word Occurrence, Sparse Integer Occurrence
  – K-means clustering, Linear regression
• GPGPUs + InfiniBand are gaining momentum for large clusters
  – #2 (Tianhe-1A), #4 (Nebulae) and #5 (Tsubame) Petascale systems
• GPGPU programming
  – CUDA or OpenCL + MPI
  – Dr. Sumit Gupta briefed industry users at the NVIDIA meeting yesterday on programmability advances on GPUs
• Big issue: performance of data movement
  – Latency
  – Bandwidth
  – Overlap
Data Movement in GPU Clusters
[Diagram: on each node, GPU and main memory connect over PCI-E through a PCI-E hub to an InfiniBand adapter; the two adapters communicate over the IB network]
• Data movement in InfiniBand clusters with GPUs
  – CUDA: Device memory → Main memory [at source process]
  – MPI: Source process → Destination process
  – CUDA: Main memory → Device memory [at destination process]
• GPU and InfiniBand require separate memory registration
GPU Direct
[Diagram: data paths without GPU Direct vs. with GPU Direct]
• Collaboration between Mellanox and NVIDIA to converge on one memory registration technique
• Both devices can register the same host memory:
  – GPU and network adapter can access the buffer
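What one-registration means in code can be sketched as follows, assuming a CUDA 4.x runtime and an already-created verbs protection domain `pd` (the function name and flag choices are illustrative; error handling is omitted):

```c
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* One host buffer, two registrations against the same pages:
 * the CUDA driver (for fast DMA copies to/from the GPU) and
 * the HCA (for RDMA). With GPU Direct these no longer conflict. */
void register_shared(void *host_buf, size_t len, struct ibv_pd *pd)
{
    /* Pin and register the buffer with the CUDA driver */
    cudaHostRegister(host_buf, len, cudaHostRegisterDefault);

    /* Register the same pages with the InfiniBand adapter */
    ibv_reg_mr(pd, host_buf, len,
               IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}
```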
Problem Statement
• Data movement from/to GPGPUs
  – Performance bottleneck
  – Reduced programmer productivity
• Hard to optimize at the application level
  – CUDA and MPI expertise required for efficient implementation
  – Hardware-dependent latency characteristics
  – Hard to support and optimize collectives
  – Hard to support advanced features like one-sided communication
MVAPICH2-GPU: Design Goals
• Support GPU-to-GPU communication through standard MPI interfaces
  – e.g. enable MPI_Send, MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
  – Pipelined data transfer, automatically optimized inside the MPI library without user tuning
• Works both with and without GPU Direct
Sample Code – without MPI Integration
• Naïve implementation with MPI and CUDA
At Sender:
    cudaMemcpy(s_buf, s_device, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
At Receiver:
    MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);
    cudaMemcpy(r_device, r_buf, size, cudaMemcpyHostToDevice);
• High productivity but poor performance
Sample Code – User Optimized Code
• Pipelining at user level with non-blocking MPI and CUDA interfaces
• Code repeated at receiver side
• Good performance but poor productivity

At Sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(s_buf + j * block_sz, s_device + j * block_sz, …);
    for (j = 0; j < pipeline_len; j++) {
        while (result != cudaSuccess) {
            result = cudaStreamQuery(…);
            if (j > 0) MPI_Test(…);
        }
        MPI_Isend(s_buf + j * block_sz, block_sz, MPI_CHAR, 1, 1, …);
    }
    MPI_Waitall(…);

Sample Code – MVAPICH2-GPU
• MVAPICH2-GPU provides standard MPI interfaces for GPU memory
At Sender: MPI_Send(s_device, size, …); // s_device is data buffer in GPU
At Receiver: MPI_Recv(r_device, size, …); // r_device is data buffer in GPU
• High productivity and high performance!
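The two one-line snippets above can be assembled into a complete program; the following is a hedged, self-contained sketch (it assumes an MPI stack with GPU support such as MVAPICH2-GPU and CUDA 4.0+ with UVA; error checking is omitted for brevity):

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    const int size = 4 << 20;            /* 4 MB message */
    char *dev_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&dev_buf, size); /* buffer lives in GPU memory */

    if (rank == 0)                       /* device pointer passed straight to MPI */
        MPI_Send(dev_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(dev_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(dev_buf);
    MPI_Finalize();
    return 0;
}
```

All staging through host memory, pipelining, and registration happen inside the library; the application code is identical to an ordinary host-memory MPI program.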
Design Considerations
• Memory detection
  – CUDA 4.0 introduces Unified Virtual Addressing (UVA)
  – The MPI library can differentiate between device memory and host memory without any hints from the user
• Overlap of CUDA copy and RDMA transfer
  – Pipeline the DMA of data from the GPU with InfiniBand RDMA
  – Allow individual data chunks to progress independently
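The memory detection above can be expressed with the CUDA 4.0 runtime; a minimal sketch (the helper name and the fallback behaviour are illustrative, not MVAPICH2 internals):

```c
#include <cuda_runtime.h>

/* Returns 1 if ptr refers to GPU device memory, 0 otherwise.
 * Under UVA (CUDA 4.0+), cudaPointerGetAttributes reports where a
 * pointer resides, so the MPI library needs no hint from the user. */
static int is_device_pointer(const void *ptr)
{
    struct cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess)
        return 0;                /* e.g. ordinary, unregistered host memory */
    return attr.memoryType == cudaMemoryTypeDevice;
}
```

Inside MPI_Send/MPI_Recv the library can branch on such a check: device pointers take the pipelined GPU path, host pointers take the ordinary path.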
Pipelined Design
[Diagram: MPI_Send and MPI_Recv paths. Chunks are copied GPU device memory → host main memory with cudaMemcpyAsync()/cudaStreamQuery() at the sender, cross the network via an RTS/CTS handshake, RDMA Writes and a Finish message, then are copied host main memory → GPU device memory at the receiver]
• With GPU-Direct
  – Data is divided into chunks
  – CUDA copies are pipelined with RDMA transfers
• If the system does not have GPU-Direct, an extra copy is required

Pipeline Design (Cont.)
• Chunk size depends on the CUDA copy cost and the RDMA latency over the network
• Automatic tuning of chunk size
  – CUDA copy and RDMA latencies are detected during installation
  – Chunk size can be stored in a configuration file (mvapich.conf)
• User-transparent delivery of the best performance
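The chunking arithmetic behind this tuning can be sketched in plain C (the formulas and names are illustrative, not the library's actual cost model):

```c
#include <assert.h>
#include <stddef.h>

/* Number of pipeline stages needed to move a message of msg_size
 * bytes in chunks of chunk_size bytes (last chunk may be partial). */
size_t num_chunks(size_t msg_size, size_t chunk_size)
{
    return (msg_size + chunk_size - 1) / chunk_size;
}

/* Idealized two-stage pipeline cost: the first chunk pays the full
 * copy + RDMA latency; each later chunk overlaps its copy with the
 * previous chunk's RDMA, so only the slower stage is paid again. */
double pipelined_time(size_t n_chunks, double copy_us, double rdma_us)
{
    double slower = copy_us > rdma_us ? copy_us : rdma_us;
    return copy_us + rdma_us + (n_chunks - 1) * slower;
}
```

Smaller chunks increase overlap but add per-chunk overhead, which is why the installed-system latencies, not a fixed constant, drive the chunk-size choice.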
Performance Evaluation
• Experimental environment
  – NVIDIA Tesla C2050
  – Mellanox QDR InfiniBand HCA MT26428
  – Intel Westmere processor with 12 GB main memory
  – MVAPICH2 1.6, CUDA Toolkit 4.0, GPUDirect
• OSU Micro-Benchmarks
  – The source and destination addresses are in GPU device memory
Ping-Pong Latency

[Two plots: latency (us) vs. message size (32K–4M bytes), with and without GPU Direct, comparing Memcpy+Send, MemcpyAsync+Isend, and MVAPICH2-GPU]

• With GPU-Direct
  – 45% improvement over Memcpy+Send (4MB messages)
  – 24% improvement over MemcpyAsync+Isend (4MB messages)
• Without GPU-Direct
  – 38% improvement over Memcpy+Send (4MB messages)
  – 33% improvement over MemcpyAsync+Isend (4MB messages)

One-sided Communication

[Two plots: latency (us) vs. message size (32K–4M bytes) for Put and Get, with and without GPU Direct, comparing Memcpy+Put/Get and MVAPICH2-GPU]

• 45% improvement over Memcpy+Put (with GPU Direct)
• 39% improvement over Memcpy+Put (without GPU Direct)
• Similar improvement for the Get operation
• Major improvement in programmability
  – One-sided communication is not feasible without MPI+CUDA support

Collective Communication

[Two plots: time (ms) vs. message size (32K–4M bytes) for Gather and Scatter, with and without GPU Direct, comparing Memcpy+Gather/Scatter and MVAPICH2-GPU]
• With GPU Direct
  – 37% improvement for MPI_Gather (1MB) and 32% improvement for MPI_Scatter (4MB)
• Without GPU Direct
  – 33% improvement for MPI_Gather (1MB) and 23% improvement for MPI_Scatter (4MB)
Collective Communication: Alltoall
[Plot: time (ms) vs. message size (32K–4M bytes) for MPI_Alltoall, comparing Memcpy+Alltoall and MVAPICH2-GPU, with and without GPU Direct]
• With GPU Direct
  – 30% improvement for 4MB messages
• Without GPU Direct
  – 26% improvement for 4MB messages
Comparison with Other MPI Stack: Latency

[Two plots: latency (us) vs. message size, small messages (1B–8K) and large messages (32K–2M), comparing MVAPICH2-GPU and OpenMPI-CUDA]

• Open MPI implementation by Rolf vandeVaart (NVIDIA)
  – https://bitbucket.org/rolfv/ompi-trunk-cuda-3
• Small-message latency around 10% better (1B–512B)
• Large-message latency around 40% better (4MB)

Comparison with Other MPI Stack: Bandwidth
[Plot: bandwidth (MBytes/s) vs. message size (1B–4MB), comparing MVAPICH2-GPU and OpenMPI-CUDA]
• Bandwidth for large messages (512KB) improved by almost a factor of three
Conclusion & Future Work
• GPUs are key for data-centric applications
• However, data movement is the key bottleneck in current-generation clusters with GPUs
• Asynchronous CUDA copy and RDMA provide an opportunity for latency hiding
• Optimizations at the application level reduce productivity and achieve sub-optimal performance
• The MPI library can handle this for users efficiently
• MVAPICH2-GPU shows substantial improvements for MPI operations
  – Will be available in a future MVAPICH2 release
• Future work includes collective-specific optimizations and application-level studies with MVAPICH2-GPU
• Upcoming paper at Cluster 2011: "Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2", H. Wang, S. Potluri, M. Luo, A. K. Singh, X. Ouyang, S. Sur, D. K. Panda
Thank You!
{wangh, potluri, luom, singhas, surs, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/