MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters

H. Wang, S. Potluri, M. Luo, A. K. Singh, S. Sur, D. K. Panda

Network-Based Computing Laboratory
The Ohio State University

ISC 2011, Hamburg

Outline

• Introduction
• Problem Statement
• Our Solution: MVAPICH2-GPU
• Design Considerations
• Performance Evaluation
• Conclusion & Future Work

InfiniBand Clusters in Top500

• Percentage share of InfiniBand is steadily increasing
• 41% of systems in the TOP 500 using InfiniBand (June '11)
• 61% of systems in the TOP 100 using InfiniBand (June '11)

Growth in GPGPUs

• GPGPUs are gaining significance on clusters for data-centric applications:
  – Word Occurrence, Sparse Integer Occurrence
  – K-means clustering, Linear regression
• GPGPUs + InfiniBand are gaining momentum for large clusters
  – #2 (Tianhe-1A), #4 (Nebulae) and #5 (Tsubame) Petascale systems
• GPGPU programming
  – CUDA or OpenCL + MPI
  – Dr. Sumit Gupta briefed industry users at the NVIDIA meeting yesterday on programmability advances on GPUs
• Big issues: performance of data movement
  – Latency
  – Bandwidth
  – Overlap

Data Movement in GPU Clusters

[Figure: node architecture - on each node the GPU, main memory and IB adapter attach to a PCI-E hub; the two nodes are connected through the IB network]

• Data movement in InfiniBand clusters with GPUs
  – CUDA: Device memory → Main memory [at source]
  – MPI: Source rank → Destination process
  – CUDA: Main memory → Device memory [at destination process]
• GPU and InfiniBand require separate memory registration

GPU Direct

[Figure: host-side data path without GPU Direct vs. with GPU Direct]

• Collaboration between Mellanox and NVIDIA to converge on one memory registration technique
• Both devices can register the same host memory:
  – GPU and network adapters can access the buffer
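To make the idea concrete, here is a minimal, hypothetical sketch (not MVAPICH2 source code) of what shared registration enables: a single host staging buffer is page-locked for CUDA with cudaHostRegister() and registered with the InfiniBand HCA with ibv_reg_mr(), so both the GPU DMA engine and the network adapter can use it. The helper name register_shared_buffer() and the access flags are assumptions for illustration.

/*
 * Illustrative sketch only (not MVAPICH2 source): with GPU Direct, the same
 * host staging buffer can be pinned for CUDA and registered with the IB HCA,
 * so the GPU DMA engine and the network adapter both access one buffer.
 */
#include <stdlib.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

static void *host_buf;        /* staging buffer shared by CUDA and IB verbs */
static struct ibv_mr *mr;     /* IB memory region handle for host_buf */

int register_shared_buffer(struct ibv_pd *pd, size_t size)
{
    /* Page-aligned allocation keeps both registrations happy. */
    if (posix_memalign(&host_buf, 4096, size) != 0)
        return -1;

    /* Pin the buffer so CUDA can DMA GPU data into it (CUDA 4.0 API). */
    if (cudaHostRegister(host_buf, size, cudaHostRegisterDefault) != cudaSuccess)
        return -1;

    /* Register the very same buffer with the InfiniBand HCA for RDMA. */
    mr = ibv_reg_mr(pd, host_buf, size,
                    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    return (mr != NULL) ? 0 : -1;
}

Without GPU Direct, the CUDA-pinned buffer and the IB-registered buffer have to be distinct, which forces an additional host-to-host copy on every transfer.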


Problem Statement

• Data movement from/to GPGPUs
  – Performance bottleneck
  – Reduced programmer productivity
• Hard to optimize at the application level
  – CUDA and MPI expertise required for efficient implementation
  – Hardware dependent latency characteristics
  – Hard to support and optimize collectives
  – Hard to support advanced features like one-sided communication


MVAPICH2-GPU: Design Goals

• Support GPU to GPU communication through standard MPI interfaces
  – e.g. enable MPI_Send, MPI_Recv from/to GPU memory
• Provide high performance without exposing low level details to the programmer
  – Pipelined data transfer which automatically provides optimizations inside MPI without user tuning
• Available to work both with GPU Direct and without GPU Direct

Sample Code - without MPI Integration

• Naïve implementation with MPI and CUDA

At Sender:
    cudaMemcpy(s_buf, s_device, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);

At Receiver:
    MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);
    cudaMemcpy(r_device, r_buf, size, cudaMemcpyHostToDevice);

• High productivity but poor performance
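For reference, a self-contained version of this naive pattern might look as follows. This is a sketch, not code from the slides; the 4 MB message size, buffer names and rank assignments are assumptions for illustration.

/* Naive staged transfer: copy device data to a host buffer, then use plain MPI. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const size_t size = 4 << 20;          /* 4 MB message (assumed) */
    int rank;
    char *host_buf, *dev_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    host_buf = (char *)malloc(size);
    cudaMalloc((void **)&dev_buf, size);

    if (rank == 0) {                      /* sender */
        cudaMemcpy(host_buf, dev_buf, size, cudaMemcpyDeviceToHost);
        MPI_Send(host_buf, (int)size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {               /* receiver */
        MPI_Recv(host_buf, (int)size, MPI_CHAR, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        cudaMemcpy(dev_buf, host_buf, size, cudaMemcpyHostToDevice);
    }

    cudaFree(dev_buf);
    free(host_buf);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and linked against the CUDA runtime, ranks 0 and 1 stage the whole message through host memory with no overlap, which is what makes this version slow.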

Sample Code – User Optimized Code

• Pipelining at user level with non-blocking MPI and CUDA interfaces
• Code repeated at the receiver side
• Good performance but poor productivity

At Sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(s_buf + j * block_sz, s_device + j * block_sz, …);

    for (j = 0; j < pipeline_len; j++) {
        while (result != cudaSuccess) {
            result = cudaStreamQuery(…);
            if (j > 0) MPI_Test(…);
        }
        MPI_Isend(s_buf + j * block_sz, block_sz, MPI_CHAR, 1, 1, …);
    }
    MPI_Waitall(…);
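The slide elides the exact stream handling and MPI arguments. A compilable sender-side sketch of the same pattern could look like the following, under these assumptions (mine, not the slide's): one CUDA stream per chunk so each copy can be queried independently, destination rank 1 and tag 1 as in the snippet above, and block_sz dividing the message evenly.

/* Sender-side pipelining sketch: overlap device-to-host copies with sends.
 * Error handling is omitted for brevity. */
#include <mpi.h>
#include <cuda_runtime.h>

void pipelined_send(char *s_device, char *s_buf, size_t block_sz, int pipeline_len)
{
    MPI_Request reqs[pipeline_len];       /* C99 VLAs for simplicity */
    cudaStream_t streams[pipeline_len];
    int j, flag;

    /* Enqueue all device-to-host copies asynchronously, one stream per chunk. */
    for (j = 0; j < pipeline_len; j++) {
        cudaStreamCreate(&streams[j]);
        cudaMemcpyAsync(s_buf + j * block_sz, s_device + j * block_sz,
                        block_sz, cudaMemcpyDeviceToHost, streams[j]);
    }

    /* As each chunk reaches host memory, send it; while waiting, keep the
     * previous send progressing with MPI_Test. */
    for (j = 0; j < pipeline_len; j++) {
        while (cudaStreamQuery(streams[j]) != cudaSuccess) {
            if (j > 0)
                MPI_Test(&reqs[j - 1], &flag, MPI_STATUS_IGNORE);
        }
        MPI_Isend(s_buf + j * block_sz, (int)block_sz, MPI_CHAR,
                  1, 1, MPI_COMM_WORLD, &reqs[j]);
    }
    MPI_Waitall(pipeline_len, reqs, MPI_STATUSES_IGNORE);

    for (j = 0; j < pipeline_len; j++)
        cudaStreamDestroy(streams[j]);
}

The receiver has to mirror this loop with MPI_Irecv followed by a cudaMemcpyAsync of each chunk back to device memory, which is exactly the duplicated effort the slide calls out.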

Sample Code – MVAPICH2-GPU

• MVAPICH2-GPU: provides standard MPI interfaces for GPU

At Sender:
    MPI_Send(s_device, size, …);   // s_device is a data buffer in GPU memory

At Receiver:
    MPI_Recv(r_device, size, …);   // r_device is a data buffer in GPU memory

• High productivity and high performance!


Design Considerations

• Memory detection
  – CUDA 4.0 introduces Unified Virtual Addressing (UVA)
  – The MPI library can differentiate between device memory and host memory without any hints from the user
• Overlap CUDA copy and RDMA transfer
  – DMA of data from the GPU and InfiniBand RDMA
  – Allows progressing DMAs of individual data chunks
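As an illustration of how UVA makes this detection possible, here is a sketch against the CUDA 4.x runtime API; it is not MVAPICH2's actual implementation, and is_device_buffer() is a hypothetical helper name. cudaPointerGetAttributes() reports whether an arbitrary pointer refers to device or host memory.

/* Hypothetical helper showing UVA-based memory detection (CUDA 4.x API). */
#include <cuda_runtime.h>

static int is_device_buffer(void *buf)
{
    struct cudaPointerAttributes attr;

    if (cudaPointerGetAttributes(&attr, buf) != cudaSuccess) {
        /* Plain host memory never seen by CUDA can make the query fail;
         * clear the sticky error and treat the buffer as host memory. */
        cudaGetLastError();
        return 0;
    }
    return attr.memoryType == cudaMemoryTypeDevice;
}

Inside the library, MPI_Send and MPI_Recv can branch on a check like this to decide whether the pipelined device-to-host staging path is needed, without any hints from the user.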

Pipelined Design

[Figure: pipelined MPI_Send/MPI_Recv with GPU Direct - after an RTS/CTS handshake, chunks move GPU device memory → host main memory via cudaMemcpyAsync()/cudaStreamQuery() at the sender, host-to-host via RDMA writes plus a finish message, and host main memory → GPU device memory via cudaMemcpyAsync()/cudaStreamQuery() at the receiver]

• Data is divided into chunks
• Pipeline CUDA copies with RDMA transfers
• If the system does not have GPU-Direct, an extra copy is required

Pipeline Design (Cont.)

• Chunk size depends on CUDA copy cost and RDMA latency over the network
• Automatic tuning of chunk size
  – Detects CUDA copy and RDMA latencies during installation
  – Chunk size can be stored in the configuration file (mvapich.conf)
• User-transparent, to deliver the best performance
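To see why a tuned chunk size matters, a simple first-order pipeline model (an illustration, not taken from the slides) helps: splitting an S-byte message into chunks of size c, with per-chunk CUDA copy time t_copy(c) and per-chunk RDMA time t_rdma(c), the end-to-end time is roughly

$$ T(c) \approx t_{\mathrm{copy}}(c) + \left(\frac{S}{c} - 1\right)\max\big(t_{\mathrm{copy}}(c),\, t_{\mathrm{rdma}}(c)\big) + t_{\mathrm{rdma}}(c). $$

Smaller chunks expose more overlap between the copy and the network transfer but pay per-chunk overheads more often, so the measured CUDA copy and RDMA latencies determine where the minimum lies.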


Performance Evaluation

• Experimental environment
  – NVIDIA Tesla C2050
  – Mellanox QDR InfiniBand HCA MT26428
  – Intel Westmere processor with 12 GB main memory
  – MVAPICH2 1.6, CUDA Toolkit 4.0, GPUDirect
• OSU Micro Benchmarks
  – The source and destination addresses are in GPU device memory

Ping Pong Latency

[Figure: ping-pong latency vs. message size (32K-4M bytes), with and without GPU Direct; curves for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU; Time (us)]

• With GPU-Direct
  – 45% improvement compared to Memcpy+Send (4MB messages)
  – 24% improvement compared to MemcpyAsync+Isend (4MB messages)
• Without GPU-Direct
  – 38% improvement compared to Memcpy+Send (4MB messages)
  – 33% improvement compared to MemcpyAsync+Isend (4MB messages)

One-sided Communication

[Figure: MPI_Put and MPI_Get latency vs. message size (32K-4M bytes); curves for Memcpy+Put/Get and MVAPICH2-GPU, each with and without GPU Direct; Time (us)]

• 45% improvement compared to Memcpy+Put (with GPU Direct)
• 39% improvement compared with Memcpy+Put (without GPU Direct)
• Similar improvement for the Get operation
• Major improvement in programming
  – One-sided communication not feasible without MPI+CUDA support

Collective Communication

[Figure: MPI_Gather and MPI_Scatter time vs. message size (32K-4M bytes); curves for Memcpy+Gather/Scatter and MVAPICH2-GPU, each with and without GPU Direct; Time (ms)]

• With GPU Direct
  – 37% improvement for MPI_Gather (1MB) and 32% improvement for MPI_Scatter (4MB)
• Without GPU Direct
  – 33% improvement for MPI_Gather (1MB) and 23% improvement for MPI_Scatter (4MB)

Collective Communication: Alltoall

[Figure: MPI_Alltoall time vs. message size (32K-4M bytes); curves for Memcpy+Alltoall and MVAPICH2-GPU, with and without GPU Direct; Time (ms)]

• With GPU Direct
  – 30% improvement for 4MB messages
• Without GPU Direct
  – 26% improvement for 4MB messages

Comparison with Other MPI Stack: Latency

[Figure: latency vs. message size for MVAPICH2-GPU and OpenMPI-CUDA; small messages (1B-8K) and large messages (32K-2M); Latency (us)]

• Open MPI implementation by Rolf vandeVaart (NVIDIA)
  – https://bitbucket.org/rolfv/ompi-trunk-cuda-3
• Small message latency around 10% better (1B - 512B)
• Large message latency around 40% better (4M)

Comparison with Other MPI Stack: Bandwidth

[Figure: bandwidth vs. message size (1B-4M bytes) for MVAPICH2-GPU and OpenMPI-CUDA; Bandwidth (MBytes/s)]

• Bandwidth for large messages (512K) improved by almost a factor of three

Conclusion & Future Work

• GPUs are key for data-centric applications
• However, data movement is the key bottleneck in current generation clusters with GPUs
• Asynchronous CUDA copy and RDMA provide an opportunity for latency hiding
• Optimizations at the application level reduce productivity and achieve sub-optimal performance
• The MPI library can handle this for users efficiently
• MVAPICH2-GPU shows substantial improvements for MPI operations
  – Will be available in a future MVAPICH2 release
• Future work includes collective-specific optimizations and application-level studies with MVAPICH2-GPU
• Upcoming paper at Cluster 2011: "Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2", H. Wang, S. Potluri, M. Luo, A. K. Singh, X. Ouyang, S. Sur, D. K. Panda

Thank You!

{wangh, potluri, luom, singhas, surs, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

MVAPICH Web Page
http://mvapich.cse.ohio-state.edu/
