MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters

H. Wang, S. Potluri, M. Luo, A. K. Singh, S. Sur, D. K. Panda

Network-Based Computing Laboratory
The Ohio State University

ISC 2011, Hamburg

Outline

• Introduction
• Problem Statement
• Our Solution: MVAPICH2-GPU
• Design Considerations
• Performance Evaluation
• Conclusion & Future Work

InfiniBand Clusters in Top500

• Percentage share of InfiniBand is steadily increasing
• 41% of systems in the TOP 500 using InfiniBand (June '11)
• 61% of systems in the TOP 100 using InfiniBand (June '11)

Growth in GPGPUs

• GPGPUs are gaining significance on clusters for data-centric applications:
  – Word Occurrence, Sparse Integer Occurrence
  – K-means clustering, Linear regression
• GPGPUs + InfiniBand are gaining momentum for large clusters
  – #2 (Tianhe-1A), #4 (Nebulae) and #5 (Tsubame) Petascale systems
• GPGPU programming
  – CUDA or OpenCL + MPI
  – Dr. Sumit Gupta briefed industry users at the NVIDIA meeting yesterday on programmability advances on GPUs
• Big issues: performance of data movement
  – Latency
  – Bandwidth
  – Overlap

Data Movement in GPU Clusters

[Figure: node architecture - on each node the GPU, main memory and IB adapter attach to a PCI-E hub; the two nodes are connected through the IB network]

• Data movement in InfiniBand clusters with GPUs
  – CUDA: Device memory → Main memory [at source]
  – MPI: Source rank → Destination process
  – CUDA: Main memory → Device memory [at destination process]
• GPU and InfiniBand require separate memory registration

GPU Direct

[Figure: host-side data path without GPU Direct vs. with GPU Direct]

• Collaboration between Mellanox and NVIDIA to converge on one memory registration technique
• Both devices can register the same host memory:
  – GPU and network adapters can access the buffer
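To make the idea concrete, here is a minimal, hypothetical sketch (not MVAPICH2 source code) of what shared registration enables: a single host staging buffer is page-locked for CUDA with cudaHostRegister() and registered with the InfiniBand HCA with ibv_reg_mr(), so both the GPU DMA engine and the network adapter can use it. The helper name register_shared_buffer() and the access flags are assumptions for illustration.

/*
 * Illustrative sketch only (not MVAPICH2 source): with GPU Direct, the same
 * host staging buffer can be pinned for CUDA and registered with the IB HCA,
 * so the GPU DMA engine and the network adapter both access one buffer.
 */
#include <stdlib.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

static void *host_buf;        /* staging buffer shared by CUDA and IB verbs */
static struct ibv_mr *mr;     /* IB memory region handle for host_buf */

int register_shared_buffer(struct ibv_pd *pd, size_t size)
{
    /* Page-aligned allocation keeps both registrations happy. */
    if (posix_memalign(&host_buf, 4096, size) != 0)
        return -1;

    /* Pin the buffer so CUDA can DMA GPU data into it (CUDA 4.0 API). */
    if (cudaHostRegister(host_buf, size, cudaHostRegisterDefault) != cudaSuccess)
        return -1;

    /* Register the very same buffer with the InfiniBand HCA for RDMA. */
    mr = ibv_reg_mr(pd, host_buf, size,
                    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    return (mr != NULL) ? 0 : -1;
}

Without GPU Direct, the CUDA-pinned buffer and the IB-registered buffer have to be distinct, which forces an additional host-to-host copy on every transfer.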


Problem Statement

• Data movement from/to GPGPUs
  – Performance bottleneck
  – Reduced programmer productivity
• Hard to optimize at the application level
  – CUDA and MPI expertise required for efficient implementation
  – Hardware dependent latency characteristics
  – Hard to support and optimize collectives
  – Hard to support advanced features like one-sided communication


MVAPICH2-GPU: Design Goals

• Support GPU to GPU communication through standard MPI interfaces
  – e.g. enable MPI_Send, MPI_Recv from/to GPU memory
• Provide high performance without exposing low level details to the programmer
  – Pipelined data transfer which automatically provides optimizations inside MPI without user tuning
• Available to work both with GPU Direct and without GPU Direct

Sample Code - without MPI Integration

• Naïve implementation with MPI and CUDA

At Sender:
    cudaMemcpy(s_buf, s_device, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);

At Receiver:
    MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);
    cudaMemcpy(r_device, r_buf, size, cudaMemcpyHostToDevice);

• High productivity but poor performance
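For reference, a self-contained version of this naive pattern might look as follows. This is a sketch, not code from the slides; the 4 MB message size, buffer names and rank assignments are assumptions for illustration.

/* Naive staged transfer: copy device data to a host buffer, then use plain MPI. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const size_t size = 4 << 20;          /* 4 MB message (assumed) */
    int rank;
    char *host_buf, *dev_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    host_buf = (char *)malloc(size);
    cudaMalloc((void **)&dev_buf, size);

    if (rank == 0) {                      /* sender */
        cudaMemcpy(host_buf, dev_buf, size, cudaMemcpyDeviceToHost);
        MPI_Send(host_buf, (int)size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {               /* receiver */
        MPI_Recv(host_buf, (int)size, MPI_CHAR, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        cudaMemcpy(dev_buf, host_buf, size, cudaMemcpyHostToDevice);
    }

    cudaFree(dev_buf);
    free(host_buf);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and linked against the CUDA runtime, ranks 0 and 1 stage the whole message through host memory with no overlap, which is what makes this version slow.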

Sample Code – User Optimized Code

• Pipelining at user level with non-blocking MPI and CUDA interfaces
• Code repeated at the receiver side
• Good performance but poor productivity

At Sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(s_buf + j * block_sz, s_device + j * block_sz, …);

    for (j = 0; j < pipeline_len; j++) {
        while (result != cudaSuccess) {
            result = cudaStreamQuery(…);
            if (j > 0) MPI_Test(…);
        }
        MPI_Isend(s_buf + j * block_sz, block_sz, MPI_CHAR, 1, 1, …);
    }
    MPI_Waitall(…);
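The slide elides the exact stream handling and MPI arguments. A compilable sender-side sketch of the same pattern could look like the following, under these assumptions (mine, not the slide's): one CUDA stream per chunk so each copy can be queried independently, destination rank 1 and tag 1 as in the snippet above, and block_sz dividing the message evenly.

/* Sender-side pipelining sketch: overlap device-to-host copies with sends.
 * Error handling is omitted for brevity. */
#include <mpi.h>
#include <cuda_runtime.h>

void pipelined_send(char *s_device, char *s_buf, size_t block_sz, int pipeline_len)
{
    MPI_Request reqs[pipeline_len];       /* C99 VLAs for simplicity */
    cudaStream_t streams[pipeline_len];
    int j, flag;

    /* Enqueue all device-to-host copies asynchronously, one stream per chunk. */
    for (j = 0; j < pipeline_len; j++) {
        cudaStreamCreate(&streams[j]);
        cudaMemcpyAsync(s_buf + j * block_sz, s_device + j * block_sz,
                        block_sz, cudaMemcpyDeviceToHost, streams[j]);
    }

    /* As each chunk reaches host memory, send it; while waiting, keep the
     * previous send progressing with MPI_Test. */
    for (j = 0; j < pipeline_len; j++) {
        while (cudaStreamQuery(streams[j]) != cudaSuccess) {
            if (j > 0)
                MPI_Test(&reqs[j - 1], &flag, MPI_STATUS_IGNORE);
        }
        MPI_Isend(s_buf + j * block_sz, (int)block_sz, MPI_CHAR,
                  1, 1, MPI_COMM_WORLD, &reqs[j]);
    }
    MPI_Waitall(pipeline_len, reqs, MPI_STATUSES_IGNORE);

    for (j = 0; j < pipeline_len; j++)
        cudaStreamDestroy(streams[j]);
}

The receiver has to mirror this loop with MPI_Irecv followed by a cudaMemcpyAsync of each chunk back to device memory, which is exactly the duplicated effort the slide calls out.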

Sample Code – MVAPICH2-GPU

• MVAPICH2-GPU: provides standard MPI interfaces for GPU

At Sender:
    MPI_Send(s_device, size, …);   // s_device is a data buffer in GPU memory

At Receiver:
    MPI_Recv(r_device, size, …);   // r_device is a data buffer in GPU memory

• High productivity and high performance!


Design Considerations

• Memory detection
  – CUDA 4.0 introduces Unified Virtual Addressing (UVA)
  – The MPI library can differentiate between device memory and host memory without any hints from the user
• Overlap CUDA copy and RDMA transfer
  – DMA of data from the GPU and InfiniBand RDMA
  – Allows progressing DMAs of individual data chunks
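As an illustration of how UVA makes this detection possible, here is a sketch against the CUDA 4.x runtime API; it is not MVAPICH2's actual implementation, and is_device_buffer() is a hypothetical helper name. cudaPointerGetAttributes() reports whether an arbitrary pointer refers to device or host memory.

/* Hypothetical helper showing UVA-based memory detection (CUDA 4.x API). */
#include <cuda_runtime.h>

static int is_device_buffer(void *buf)
{
    struct cudaPointerAttributes attr;

    if (cudaPointerGetAttributes(&attr, buf) != cudaSuccess) {
        /* Plain host memory never seen by CUDA can make the query fail;
         * clear the sticky error and treat the buffer as host memory. */
        cudaGetLastError();
        return 0;
    }
    return attr.memoryType == cudaMemoryTypeDevice;
}

Inside the library, MPI_Send and MPI_Recv can branch on a check like this to decide whether the pipelined device-to-host staging path is needed, without any hints from the user.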

Pipelined Design

[Figure: pipelined MPI_Send/MPI_Recv with GPU Direct - after an RTS/CTS handshake, chunks move GPU device memory → host main memory via cudaMemcpyAsync()/cudaStreamQuery() at the sender, host-to-host via RDMA writes plus a finish message, and host main memory → GPU device memory via cudaMemcpyAsync()/cudaStreamQuery() at the receiver]

• Data is divided into chunks
• Pipeline CUDA copies with RDMA transfers
• If the system does not have GPU-Direct, an extra copy is required

Pipeline Design (Cont.)

• Chunk size depends on CUDA copy cost and RDMA latency over the network
• Automatic tuning of chunk size
  – Detects CUDA copy and RDMA latencies during installation
  – Chunk size can be stored in the configuration file (mvapich.conf)
• User-transparent, to deliver the best performance
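To see why a tuned chunk size matters, a simple first-order pipeline model (an illustration, not taken from the slides) helps: splitting an S-byte message into chunks of size c, with per-chunk CUDA copy time t_copy(c) and per-chunk RDMA time t_rdma(c), the end-to-end time is roughly

$$ T(c) \approx t_{\mathrm{copy}}(c) + \left(\frac{S}{c} - 1\right)\max\big(t_{\mathrm{copy}}(c),\, t_{\mathrm{rdma}}(c)\big) + t_{\mathrm{rdma}}(c). $$

Smaller chunks expose more overlap between the copy and the network transfer but pay per-chunk overheads more often, so the measured CUDA copy and RDMA latencies determine where the minimum lies.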


Performance Evaluation

• Experimental environment
  – NVIDIA Tesla C2050
  – Mellanox QDR InfiniBand HCA MT26428
  – Intel Westmere processor with 12 GB main memory
  – MVAPICH2 1.6, CUDA Toolkit 4.0, GPUDirect
• OSU Micro Benchmarks
  – The source and destination addresses are in GPU device memory

Ping Pong Latency

[Figure: ping-pong latency vs. message size (32K-4M bytes), with and without GPU Direct; curves for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU; Time (us)]

• With GPU-Direct
  – 45% improvement compared to Memcpy+Send (4MB messages)
  – 24% improvement compared to MemcpyAsync+Isend (4MB messages)
• Without GPU-Direct
  – 38% improvement compared to Memcpy+Send (4MB messages)
  – 33% improvement compared to MemcpyAsync+Isend (4MB messages)

One-sided Communication

[Figure: MPI_Put and MPI_Get latency vs. message size (32K-4M bytes); curves for Memcpy+Put/Get and MVAPICH2-GPU, each with and without GPU Direct; Time (us)]

• 45% improvement compared to Memcpy+Put (with GPU Direct)
• 39% improvement compared with Memcpy+Put (without GPU Direct)
• Similar improvement for the Get operation
• Major improvement in programming
  – One-sided communication not feasible without MPI+CUDA support

Collective Communication

[Figure: MPI_Gather and MPI_Scatter time vs. message size (32K-4M bytes); curves for Memcpy+Gather/Scatter and MVAPICH2-GPU, each with and without GPU Direct; Time (ms)]

• With GPU Direct
  – 37% improvement for MPI_Gather (1MB) and 32% improvement for MPI_Scatter (4MB)
• Without GPU Direct
  – 33% improvement for MPI_Gather (1MB) and 23% improvement for MPI_Scatter (4MB)

Collective Communication: Alltoall

[Figure: MPI_Alltoall time vs. message size (32K-4M bytes); curves for Memcpy+Alltoall and MVAPICH2-GPU, with and without GPU Direct; Time (ms)]

• With GPU Direct
  – 30% improvement for 4MB messages
• Without GPU Direct
  – 26% improvement for 4MB messages

Comparison with Other MPI Stack: Latency

[Figure: latency vs. message size for MVAPICH2-GPU and OpenMPI-CUDA; small messages (1B-8K) and large messages (32K-2M); Latency (us)]

• Open MPI implementation by Rolf vandeVaart (NVIDIA)
  – https://bitbucket.org/rolfv/ompi-trunk-cuda-3
• Small message latency around 10% better (1B - 512B)
• Large message latency around 40% better (4M)

Comparison with Other MPI Stack: Bandwidth

[Figure: bandwidth vs. message size (1B-4M bytes) for MVAPICH2-GPU and OpenMPI-CUDA; Bandwidth (MBytes/s)]

• Bandwidth for large messages (512K) improved by almost a factor of three

Conclusion & Future Work

• GPUs are key for data-centric applications
• However, data movement is the key bottleneck in current generation clusters with GPUs
• Asynchronous CUDA copy and RDMA provide an opportunity for latency hiding
• Optimizations at the application level reduce productivity and achieve sub-optimal performance
• The MPI library can handle this for users efficiently
• MVAPICH2-GPU shows substantial improvements for MPI operations
  – Will be available in a future MVAPICH2 release
• Future work includes collective-specific optimizations and application-level studies with MVAPICH2-GPU
• Upcoming paper at Cluster 2011: "Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2", H. Wang, S. Potluri, M. Luo, A. K. Singh, X. Ouyang, S. Sur, D. K. Panda

Thank You!

{wangh, potluri, luom, singhas, surs, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

MVAPICH Web Page
http://mvapich.cse.ohio-state.edu/
