
MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters
H. Wang, S. Potluri, M. Luo, A. K. Singh, S. Sur, D. K. Panda
Network-Based Computing Laboratory, The Ohio State University
ISC 2011, Hamburg

Outline
• Introduction
• Problem Statement
• Our Solution: MVAPICH2-GPU
• Design Considerations
• Performance Evaluation
• Conclusion & Future Work

InfiniBand Clusters in Top500
• Percentage share of InfiniBand is steadily increasing
• 41% of systems in the TOP 500 use InfiniBand (June '11)
• 61% of systems in the TOP 100 use InfiniBand (June '11)

Growth in GPGPUs
• GPGPUs are gaining significance on clusters for data-centric applications:
  – Word occurrence, sparse integer occurrence
  – K-means clustering, linear regression
• GPGPUs + InfiniBand are gaining momentum for large clusters
  – #2 (Tianhe-1A), #4 (Nebulae) and #5 (Tsubame) petascale systems
• GPGPU programming
  – CUDA or OpenCL + MPI
  – Dr. Sumit Gupta briefed industry users at the NVIDIA meeting yesterday on programmability advances on GPUs
• Big issues: performance of data movement
  – Latency
  – Bandwidth
  – Overlap

Data movement in GPU clusters
[Figure: data path GPU – Main Memory – PCI-E Hub – IB Adapter – IB Network – IB Adapter – PCI-E Hub – Main Memory – GPU]
• Data movement in InfiniBand clusters with GPUs
  – CUDA: device memory → main memory [at source process]
  – MPI: source rank → destination process
  – CUDA: main memory → device memory [at destination process]
• GPU and InfiniBand require separate memory registration

GPU Direct
[Figure: host memory registration without GPU Direct vs. with GPU Direct]
• Collaboration between Mellanox and NVIDIA to converge on one memory registration technique
• Both devices can register the same host memory (see the sketch below):
  – GPU and network adapter can access the buffer
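To make the shared-registration idea concrete, below is a minimal sketch of pinning one host staging buffer for both the CUDA driver and the InfiniBand HCA, which is what GPU Direct enables. It is an illustration rather than the authors' implementation; the protection domain pd, the specific calls chosen (posix_memalign, cudaHostRegister, ibv_reg_mr), and the omitted error handling are assumptions made only for this example.

    /* Sketch: with GPU Direct, the same page-aligned host buffer can be pinned
     * for CUDA (cudaHostRegister) and registered with the IB HCA (ibv_reg_mr),
     * so staged GPU copies and RDMA transfers can share one buffer.
     * Assumes an existing verbs protection domain `pd`; error handling omitted. */
    #include <stdlib.h>
    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    void *register_staging_buffer(struct ibv_pd *pd, size_t size,
                                  struct ibv_mr **mr_out)
    {
        void *buf = NULL;
        if (posix_memalign(&buf, 4096, size) != 0)   /* page-aligned host buffer */
            return NULL;

        /* Pin the buffer for the CUDA driver so device<->host copies to/from it
         * can be asynchronous. */
        cudaHostRegister(buf, size, cudaHostRegisterDefault);

        /* Register the same buffer with the HCA so it can be the source or
         * target of RDMA operations. */
        *mr_out = ibv_reg_mr(pd, buf, size,
                             IBV_ACCESS_LOCAL_WRITE |
                             IBV_ACCESS_REMOTE_READ |
                             IBV_ACCESS_REMOTE_WRITE);
        return buf;
    }

Without GPU Direct, the two drivers cannot pin the same pages, so separate CUDA-pinned and IB-registered buffers are needed and data must be copied between them; this is the extra copy referred to later in the pipelined design.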
Problem Statement
• Data movement from/to GPGPUs
  – Performance bottleneck
  – Reduced programmer productivity
• Hard to optimize at the application level
  – CUDA and MPI expertise required for an efficient implementation
  – Hardware-dependent latency characteristics
  – Hard to support and optimize collectives
  – Hard to support advanced features like one-sided communication

MVAPICH2-GPU: Design Goals
• Support GPU-to-GPU communication through standard MPI interfaces
  – e.g. enable MPI_Send, MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
  – Pipelined data transfer that automatically provides optimizations inside the MPI library, without user tuning
• Works both with and without GPU Direct

Sample Code – without MPI integration
• Naïve implementation with MPI and CUDA

At Sender:
    cudaMemcpy(s_buf, s_device, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);

At Receiver:
    MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);
    cudaMemcpy(r_device, r_buf, size, cudaMemcpyHostToDevice);

• High productivity but poor performance

Sample Code – User optimized code
• Pipelining at user level with non-blocking MPI and CUDA interfaces
• Code repeated at the receiver side
• Good performance but poor productivity

At Sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(s_buf + j * block_sz, s_device + j * block_sz, …);

    for (j = 0; j < pipeline_len; j++) {
        result = cudaErrorNotReady;          /* wait for this chunk's copy */
        while (result != cudaSuccess) {
            result = cudaStreamQuery(…);
            if (j > 0) MPI_Test(…);
        }
        MPI_Isend(s_buf + j * block_sz, block_sz, MPI_CHAR, 1, 1, …);
    }
    MPI_Waitall(…);

Sample Code – MVAPICH2-GPU
• MVAPICH2-GPU provides standard MPI interfaces for GPU memory

At Sender:
    MPI_Send(s_device, size, …);    // s_device is a data buffer in GPU memory

At Receiver:
    MPI_Recv(r_device, size, …);    // r_device is a data buffer in GPU memory

• High productivity and high performance!

Design Considerations
• Memory detection
  – CUDA 4.0 introduces Unified Virtual Addressing (UVA)
  – The MPI library can differentiate between device memory and host memory without any hints from the user (see the sketch below)
• Overlap CUDA copy and RDMA transfer
  – Pipeline DMA of data from the GPU with InfiniBand RDMA
  – Allow for progressing the DMAs of individual data chunks
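As an illustration of the UVA-based memory detection described above, the following is a minimal sketch using cudaPointerGetAttributes, assuming the CUDA 4.x attribute layout (the memoryType field is named differently in recent CUDA releases); it is not MVAPICH2's internal code.

    /* Sketch: decide whether a buffer passed to MPI_Send/MPI_Recv lives in GPU
     * device memory, with no hints from the user (CUDA 4.0 UVA). For ordinary
     * malloc'd host memory the query fails, which is treated as "host". */
    #include <cuda_runtime.h>

    static int is_device_pointer(const void *buf)
    {
        struct cudaPointerAttributes attr;
        if (cudaPointerGetAttributes(&attr, buf) != cudaSuccess) {
            cudaGetLastError();  /* clear the error left by an unregistered host pointer */
            return 0;            /* treat as host memory */
        }
        return attr.memoryType == cudaMemoryTypeDevice;
    }

With a check like this at the top of the send/receive path, the library can route device buffers through the pipelined design shown next and pass host buffers through the usual code path.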
Pipelined Design
[Figure: pipelined MPI_Send/MPI_Recv with GPU Direct — after the RTS/CTS handshake, chunks are moved from GPU device memory to host memory with cudaMemcpyAsync()/cudaStreamQuery(), written to the receiver's host memory with RDMA writes, and copied into the destination GPU with cudaMemcpyAsync()/cudaStreamQuery(); a finish message completes the transfer]
• Data is divided into chunks
• CUDA copies are pipelined with RDMA transfers
• If the system does not have GPU Direct, an extra copy is required

Pipeline Design (Cont.)
• Chunk size depends on the CUDA copy cost and the RDMA latency over the network
• Automatic tuning of chunk size (a simple cost model is sketched below)
  – CUDA copy and RDMA latencies are detected during installation
  – The chunk size can be stored in a configuration file (mvapich.conf)
• Transparent to the user while delivering the best performance
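The chunk-size trade-off can be illustrated with a simple two-stage pipeline model: for n chunks, the transfer takes roughly one chunk copy, plus (n - 1) steps each bounded by the slower of the per-chunk copy and RDMA times, plus one final RDMA. The sketch below picks the candidate chunk size that minimizes this estimate; it is only a back-of-the-envelope model, not MVAPICH2's tuning code, and it assumes the per-chunk latencies copy_us[] and rdma_us[] were benchmarked beforehand (e.g. at installation time).

    /* Sketch: choose a pipeline chunk size from benchmarked per-chunk costs.
     * Two-stage pipeline estimate: T = copy + (n - 1) * max(copy, rdma) + rdma. */
    #include <stddef.h>

    size_t pick_chunk_size(size_t msg_bytes,
                           const size_t *chunk_sizes,  /* candidate chunk sizes     */
                           const double *copy_us,      /* D2H copy time per chunk   */
                           const double *rdma_us,      /* RDMA write time per chunk */
                           size_t ncandidates)
    {
        size_t best = chunk_sizes[0];
        double best_t = -1.0;

        for (size_t i = 0; i < ncandidates; i++) {
            size_t n = (msg_bytes + chunk_sizes[i] - 1) / chunk_sizes[i];
            double stage = copy_us[i] > rdma_us[i] ? copy_us[i] : rdma_us[i];
            double t = copy_us[i] + (double)(n - 1) * stage + rdma_us[i];
            if (best_t < 0.0 || t < best_t) {
                best_t = t;
                best = chunk_sizes[i];
            }
        }
        return best;  /* a value like this could be stored in mvapich.conf */
    }

Smaller chunks give more copy/RDMA overlap but incur more per-chunk startup cost; because the measured per-chunk latencies include that startup, the model captures the balance between the two.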
Performance Evaluation
• Experimental environment
  – NVIDIA Tesla C2050
  – Mellanox QDR InfiniBand HCA MT26428
  – Intel Westmere processor with 12 GB main memory
  – MVAPICH2 1.6, CUDA Toolkit 4.0, GPUDirect
• OSU Micro-Benchmarks
  – The source and destination addresses are in GPU device memory

Ping Pong Latency
[Figure: ping-pong latency (us) vs. message size (32 KB – 4 MB), with and without GPU Direct; curves for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU]
• With GPU Direct
  – 45% improvement compared to Memcpy+Send (4 MB messages)
  – 24% improvement compared to MemcpyAsync+Isend (4 MB messages)
• Without GPU Direct
  – 38% improvement compared to Memcpy+Send (4 MB messages)
  – 33% improvement compared to MemcpyAsync+Isend (4 MB messages)

One-sided Communication
[Figure: Put and Get latency (us) vs. message size (32 KB – 4 MB); curves for Memcpy+Put/Get and MVAPICH2-GPU, with and without GPU Direct]
• 45% improvement compared to Memcpy+Put (with GPU Direct)
• 39% improvement compared to Memcpy+Put (without GPU Direct)
• Similar improvement for the Get operation
• Major improvement in programmability
  – One-sided communication from GPU memory is not feasible without MPI+CUDA support

Collective Communication
[Figure: MPI_Gather and MPI_Scatter latency (ms) vs. message size (32 KB – 4 MB); curves for Memcpy+Gather/Scatter and MVAPICH2-GPU, with and without GPU Direct]
• With GPU Direct
  – 37% improvement for MPI_Gather (1 MB) and 32% improvement for MPI_Scatter (4 MB)
• Without GPU Direct
  – 33% improvement for MPI_Gather (1 MB) and 23% improvement for MPI_Scatter (4 MB)

Collective Communication: Alltoall
[Figure: MPI_Alltoall latency (ms) vs. message size (32 KB – 4 MB); curves for Memcpy+Alltoall and MVAPICH2-GPU, with and without GPU Direct]
• With GPU Direct
  – 30% improvement for 4 MB messages
• Without GPU Direct
  – 26% improvement for 4 MB messages

Comparison with Other MPI Stack: Latency
[Figure: small-message and large-message latency (us) vs. message size; curves for MVAPICH2-GPU and OpenMPI-CUDA]
• Open MPI implementation by Rolf vandeVaart (NVIDIA)
  – https://bitbucket.org/rolfv/ompi-trunk-cuda-3
• Small-message latency around 10% better (1 B – 512 B)
• Large-message latency around 40% better (4 MB)

Comparison with Other MPI Stack: Bandwidth
[Figure: bandwidth (MBytes/s) vs. message size (1 B – 4 MB); curves for MVAPICH2-GPU and OpenMPI-CUDA]
• Bandwidth for large messages (512 KB) improved by almost a factor of three

Conclusion & Future Work
• GPUs are key for data-centric applications
• However, data movement is the key bottleneck in current-generation clusters with GPUs
• Asynchronous CUDA copies and RDMA provide an opportunity for latency hiding
• Optimizations at the application level reduce productivity and achieve sub-optimal performance
• The MPI library can handle this for the user efficiently
• MVAPICH2-GPU shows substantial improvements for MPI operations
  – Will be available in a future MVAPICH2 release
• Future work includes collective-specific optimizations and application-level studies with MVAPICH2-GPU
• Upcoming paper at Cluster 2011: "Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2", H. Wang, S. Potluri, M. Luo, A. K. Singh, X. Ouyang, S. Sur, D. K. Panda

Thank You!
{wangh, potluri, luom, singhas, surs, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/