Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters

Ching-Hsiang Chu ([email protected])
Department of Computer Science and Engineering, The Ohio State University

Outline
• Introduction
• Advanced Designs in MVAPICH2-GDR
  – CUDA-Aware MPI_Bcast
  – CUDA-Aware MPI_Allreduce / MPI_Reduce
• Concluding Remarks

Drivers of Modern HPC Cluster Architectures - Hardware

[Figure: multi-/many-core processors; high-performance interconnects: InfiniBand (with SR-IOV), <1 μs latency, 200 Gbps bandwidth; accelerators / coprocessors: high compute density, high performance/watt, >1 TFlop DP on a chip; SSD, NVMe-SSD, NVRAM.]
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

Example systems: Sierra@LLNL, Stampede2@TACC, Comet@SDSC

Architectures for Deep Learning (DL)

• Multi-core CPUs within a node
• Multi-core CPUs across nodes (IB networks)
• Multi-core CPUs + Single GPU across nodes (IB networks)
• Multi-core CPUs + Multi-GPU within a node
• Multi-core CPUs + Multi-GPU across nodes (IB networks)

Streaming-like Applications
• Streaming-like applications on HPC systems: data is streamed in real time from a data source to a sender and on to HPC resources for real-time analytics
  1. Communication (MPI)
     – Broadcast: data streaming-like broadcast operations
     – Allreduce/Reduce
  2. Computation (CUDA)
     – Multiple GPU nodes as workers (each worker pairs a CPU with a GPU)
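To make the pattern concrete, here is a minimal sketch (not from the talk) of a streaming loop in which rank 0 broadcasts data blocks and GPU workers receive them directly into device memory through a CUDA-aware MPI_Bcast; the block size and the processing steps are hypothetical.

    /* Minimal sketch of the streaming pattern: the source (rank 0) repeatedly
     * broadcasts data blocks, and each GPU worker receives them directly into
     * device memory with a CUDA-aware MPI_Bcast, then processes them with CUDA.
     * Block size and processing steps are hypothetical. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define BLOCK_BYTES (4 << 20)   /* hypothetical 4 MB streaming block */

    void stream_blocks(int num_blocks, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        void *d_block;
        cudaMalloc(&d_block, BLOCK_BYTES);

        for (int b = 0; b < num_blocks; ++b) {
            if (rank == 0) {
                /* ... source fills d_block with the next data block ... */
            }
            /* Device pointer passed directly to the CUDA-aware broadcast. */
            MPI_Bcast(d_block, BLOCK_BYTES, MPI_BYTE, 0, comm);

            if (rank != 0) {
                /* ... worker launches CUDA kernels to process d_block ... */
            }
        }
        cudaFree(d_block);
    }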


High-performance Deep Learning
• Computation using GPUs
• Communication using MPI
  – Exchanging partial gradients after each minibatch
  – All-to-all (multi-source) communication among GPU nodes, e.g., MPI_Bcast, MPI_Allreduce
• Challenges
  – High computation-communication overlap
  – Good scalability for upcoming large-scale GPU clusters
  – No application-level modification
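As an illustration of the gradient-exchange step, the following minimal sketch assumes a CUDA-aware MPI library such as MVAPICH2-GDR and a hypothetical gradient buffer d_grad; the GPU device pointer is passed directly to MPI_Allreduce with no application-level staging.

    /* Minimal sketch: summing a gradient buffer that lives in GPU memory
     * with a CUDA-aware MPI (e.g., MVAPICH2-GDR). Buffer name and size are
     * illustrative, not from the talk. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const size_t count = 1 << 20;      /* hypothetical gradient length */
        float *d_grad;                     /* gradients produced on the GPU */
        cudaMalloc((void **)&d_grad, count * sizeof(float));

        /* ... one minibatch of back-propagation fills d_grad ... */

        /* CUDA-aware MPI: the device pointer is passed directly; no explicit
         * cudaMemcpy to a host staging buffer is needed in the application. */
        MPI_Allreduce(MPI_IN_PLACE, d_grad, (int)count, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

        /* ... divide by nprocs on the GPU to obtain the averaged gradient ... */

        cudaFree(d_grad);
        MPI_Finalize();
        return 0;
    }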

Outline
• Introduction
• Advanced Designs in MVAPICH2-GDR
  – CUDA-Aware MPI_Bcast
  – CUDA-Aware MPI_Allreduce / MPI_Reduce
• Concluding Remarks

Hardware Multicast-based Broadcast

• For GPU-resident data, using
  – GPUDirect RDMA (GDR)
  – InfiniBand Hardware Multicast (IB-MCAST)
• Overheads
  – IB UD limit
  – GDR limit
• Broadcast path from the source to destinations 1..N:
  1. IB Gather + GDR Read (at the source)
  2. IB Hardware Multicast (through the IB switch)
  3. IB Scatter + GDR Write (at each destination)

A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," HiPC 2014, Dec. 2014.

Hardware Multicast-based Broadcast (con't)
• Heterogeneous broadcast for streaming applications
  ➢ Frees up PCIe resources

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.

Optimized Broadcast Send: MPI_Bcast(d_out, …)
• Preparing an intermediate buffer (im_buf)
  – Page-locked (pinned) host buffer
    ➢ Fast device-to-host data movement
  – Allocated at the initialization phase
    ➢ Low overhead, one-time effort
• Streaming data through the host
  – Fine-tuned chunked data
  – Asynchronous copy operations
  ➢ Three-stage, fine-tuned pipeline:
    1. Data preparation (device-to-host copy into im_buf)
    2. IB Gather
    3. IB Hardware Multicast

C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug. 14-17, 2017.
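A conceptual sketch of the host-staging pipeline described above, assuming a hypothetical chunk size and a send_chunk() callback standing in for the IB gather plus hardware-multicast step performed inside the library:

    /* Conceptual sketch of staging GPU data through pinned host buffers in
     * chunks so that device-to-host copies overlap with network transfers.
     * CHUNK_BYTES and send_chunk() are illustrative; MVAPICH2-GDR performs
     * the network step with IB hardware multicast internally. */
    #include <cuda_runtime.h>
    #include <stddef.h>

    #define CHUNK_BYTES (512 * 1024)   /* hypothetical pipeline chunk size */

    typedef void (*send_fn)(const void *host_chunk, size_t bytes);

    void pipelined_bcast_send(const char *d_out, size_t total, send_fn send_chunk)
    {
        char *im_buf[2];
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaMallocHost((void **)&im_buf[0], CHUNK_BYTES);  /* pinned buffers */
        cudaMallocHost((void **)&im_buf[1], CHUNK_BYTES);

        size_t nchunks = (total + CHUNK_BYTES - 1) / CHUNK_BYTES;
        size_t first   = (total < CHUNK_BYTES) ? total : CHUNK_BYTES;
        /* Pre-issue the device-to-host copy of chunk 0. */
        cudaMemcpyAsync(im_buf[0], d_out, first, cudaMemcpyDeviceToHost, stream);

        for (size_t c = 0; c < nchunks; ++c) {
            size_t off   = c * CHUNK_BYTES;
            size_t bytes = (total - off < CHUNK_BYTES) ? total - off : CHUNK_BYTES;
            cudaStreamSynchronize(stream);       /* chunk c is now in im_buf[c%2] */

            /* Overlap: start copying chunk c+1 while chunk c is being sent. */
            if (c + 1 < nchunks) {
                size_t noff   = off + CHUNK_BYTES;
                size_t nbytes = (total - noff < CHUNK_BYTES) ? total - noff : CHUNK_BYTES;
                cudaMemcpyAsync(im_buf[(c + 1) % 2], d_out + noff, nbytes,
                                cudaMemcpyDeviceToHost, stream);
            }
            /* Network step: IB gather + IB hardware multicast in the real design. */
            send_chunk(im_buf[c % 2], bytes);
        }
        cudaFreeHost(im_buf[0]);
        cudaFreeHost(im_buf[1]);
        cudaStreamDestroy(stream);
    }

Double buffering lets the copy of the next chunk overlap with the transfer of the current one, which is the point of the fine-tuned chunking.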

Optimized Broadcast Receive: MPI_Bcast(d_in, …)
• Zero-copy broadcast receive
  – Pre-posted user buffer (d_in) on the GPU at each destination
  – Avoids additional data movement
  – Leverages the IB Scatter and GDR write features (IB Hardware Multicast followed by IB Scatter / GDR Write into d_in)
  ➢ Low latency
  ➢ Frees up PCIe resources for applications

C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, B. Elton, and D. K. Panda, "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," to appear in IEEE Transactions on Parallel and Distributed Systems (TPDS).

Efficient Reliability Support for IB-MCAST

• When a receiver experiences a timeout (a lost MCAST packet), it
  – performs an RMA Get operation on the sender's backup buffer to retrieve the lost MCAST packets
  – does not interrupt the sender
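The actual reliability scheme operates inside the MPI library at the InfiniBand verbs level; purely as a conceptual sketch, the same idea (the receiver repairs the loss itself without interrupting the sender) can be expressed with MPI one-sided operations. Window layout, packet size, and backlog depth below are hypothetical.

    /* Conceptual sketch only: the real design uses IB-level RMA inside the
     * library. Here the same idea is expressed with MPI one-sided operations.
     * Names and sizes are hypothetical. */
    #include <mpi.h>

    #define PKT_SIZE 4096   /* hypothetical multicast packet size            */
    #define BACKLOG  1024   /* packets kept in the sender's backup buffer    */

    /* Called collectively by every rank; only the broadcast sender (rank 0)
     * exposes a meaningful backup buffer, other ranks may pass a 0-byte region.
     * The passive-target epoch means the sender's CPU is never interrupted. */
    void create_backup_window(void *backup, MPI_Aint bytes, MPI_Win *win, MPI_Comm comm)
    {
        MPI_Win_create(backup, bytes, 1, MPI_INFO_NULL, comm, win);
        MPI_Win_lock_all(MPI_MODE_NOCHECK, *win);
    }

    /* Receiver: on timeout for packet `seq`, fetch it from the sender (rank 0)
     * with a one-sided Get instead of asking the sender to retransmit. */
    void recover_lost_packet(void *dst, int seq, MPI_Win win)
    {
        MPI_Aint offset = (MPI_Aint)(seq % BACKLOG) * PKT_SIZE;
        MPI_Get(dst, PKT_SIZE, MPI_BYTE, /*target rank=*/0,
                offset, PKT_SIZE, MPI_BYTE, win);
        MPI_Win_flush(0, win);          /* complete the Get locally */
    }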


C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC Workshop, 2016.

Broadcast on Multi-GPU Systems
• Proposed intra-node topology-aware broadcast
  – CUDA Inter-Process Communication (IPC)
  – cudaMemcpy (device ↔ device) among GPU 0 … GPU N within the node, after the multicast steps deliver the data to the node

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.
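For reference, a conceptual sketch of the CUDA IPC mechanism behind the intra-node step: one process exports a handle to its device buffer, and the other processes on the node open it and copy device-to-device. Function names and the intra-node communicator are illustrative, not MVAPICH2-GDR internals.

    /* Conceptual sketch of CUDA IPC for intra-node topology-aware broadcast;
     * names are hypothetical and error handling is omitted. */
    #include <cuda_runtime.h>
    #include <mpi.h>

    /* Root process on the node: create an IPC handle for its device buffer
     * and share it with the other intra-node ranks (here via MPI_Bcast). */
    void share_device_buffer(void *d_buf, MPI_Comm node_comm)
    {
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, d_buf);
        MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, node_comm);
    }

    /* Non-root process on the node: map the root's buffer and copy from it
     * directly, device to device, without staging through host memory. */
    void receive_device_buffer(void *d_dst, size_t bytes, MPI_Comm node_comm)
    {
        cudaIpcMemHandle_t handle;
        MPI_Bcast(&handle, sizeof(handle), MPI_BYTE, 0, node_comm);

        void *d_src = NULL;
        cudaIpcOpenMemHandle(&d_src, handle, cudaIpcMemLazyEnablePeerAccess);
        cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);
        cudaIpcCloseMemHandle(d_src);
    }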

Benchmark Evaluation
• @ RI2 cluster, 16 GPUs, 1 GPU/node (lower is better)
• Schemes compared: MV2-GDR-Knomial, MV2-GDR-Ring, MCAST-GDR, MCAST-GDR-Opt
[Figure: broadcast latency (μs) vs. message size (4K to 16M bytes), where the existing schemes hit the GDR read limit, and latency for a 2 MB message vs. number of GPU nodes (2 to 16), where MCAST-GDR-Opt stays near-constant.]
• Provides near-constant latency over the system sizes
• Reduces latency by up to 65% for large messages

C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug. 14-17, 2017.

Streaming Workload @ RI2 (16 GPUs) & CSCS (88 GPUs)
• Schemes compared: Knomial-GDR, Ring-GDR-Pipeline, TA-Zcpy-MCAST-GDR-Pipeline

[Figure: peak and streaming throughput (GB/s) for 4, 8, and 16 GPU nodes on RI2 and for 11, 22, 44, and 88 GPUs on CSCS.]
• IB-MCAST + GDR + IPC-based MPI_Bcast schemes
  – Stable, high throughput compared to existing schemes

C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, B. Elton, and D. K. Panda, "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," to appear in IEEE Transactions on Parallel and Distributed Systems (TPDS).

Deep Learning Frameworks @ RI2 cluster, 16 GPUs

• CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK) without application modification
• Schemes compared: Knomial-GDR, Ring-GDR-Pipeline, Zcpy-MCAST-GDR-Pipeline
[Figure: CA-CNTK image-classification results at 8 and 16 GPU nodes for AlexNet, VGG, and ResNet-50.]
• Reduces latency by up to 24%, 15%, and 18% for the AlexNet, VGG, and ResNet-50 models, respectively
• Higher improvement is expected at larger system sizes

C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, B. Elton, and D. K. Panda, "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," to appear in IEEE Transactions on Parallel and Distributed Systems (TPDS).

CUDA-Aware MPI_Allreduce

• Existing designs
  1. Explicitly copy the data from GPU memory to host memory
  2. Host-to-host communication to remote processes
  3. Perform the computation on the CPU
  4. Explicitly copy the data from host memory back to GPU memory
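For contrast with the proposed design that follows, a minimal sketch of this staged approach (buffer names are hypothetical):

    /* Sketch of the conventional approach: stage through host memory and let
     * the CPU do the reduction inside MPI_Allreduce. Buffer names are
     * hypothetical. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    void staged_allreduce(float *d_buf, int count, MPI_Comm comm)
    {
        float *h_buf = (float *)malloc((size_t)count * sizeof(float));

        /* 1. Explicit device-to-host copy */
        cudaMemcpy(h_buf, d_buf, (size_t)count * sizeof(float), cudaMemcpyDeviceToHost);
        /* 2-3. Host-to-host communication and CPU computation */
        MPI_Allreduce(MPI_IN_PLACE, h_buf, count, MPI_FLOAT, MPI_SUM, comm);
        /* 4. Explicit host-to-device copy of the result */
        cudaMemcpy(d_buf, h_buf, (size_t)count * sizeof(float), cudaMemcpyHostToDevice);

        free(h_buf);
    }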

• Proposed designs
  1. GPU-to-GPU communication
     – NVIDIA GPUDirect RDMA (GDR)
     – Pipelining through the host for large messages
  2. Perform the computation on the GPU
     – Efficient CUDA kernels
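The kernel below is a generic illustration of the "computation on the GPU" step, not the kernel shipped in MVAPICH2-GDR; it sums a received partial-result buffer into the local buffer using a grid-stride loop.

    /* Illustrative CUDA kernel for the GPU-side reduction step: element-wise
     * sum of a received partial-result buffer into the local buffer. */
    #include <cuda_runtime.h>

    __global__ void elementwise_sum(float *local, const float *remote, size_t n)
    {
        size_t i      = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        size_t stride = (size_t)gridDim.x * blockDim.x;
        for (; i < n; i += stride)        /* grid-stride loop */
            local[i] += remote[i];
    }

    /* Host-side launch: reduce `d_remote` into `d_local` (both device pointers),
     * e.g., once per peer during the reduce phase of an Allreduce. */
    void gpu_reduce_sum(float *d_local, const float *d_remote, size_t n,
                        cudaStream_t stream)
    {
        int threads = 256;
        int blocks  = (int)((n + threads - 1) / threads);
        if (blocks > 1024) blocks = 1024;   /* cap; grid-stride covers the rest */
        elementwise_sum<<<blocks, threads, 0, stream>>>(d_local, d_remote, n);
    }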

Benchmark Evaluation @ RI2 cluster, 16 GPUs
[Figure: latency vs. message size for the Default, RD-DD, and BRB-DD designs (µs) and for MPI, NCCL2, and MPI-Opt (ms).]

[1] C. Chu, K. Hamidouche, A. Venkatesh, A. A. Awan, and D. K. Panda, "CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters," CCGrid'16, Cartagena, 2016, pp. 726-735.
[2] A. A. Awan, J. Bedorf, C.-H. Chu, H. Subramoni, and D. K. Panda, "Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation," arXiv preprint arXiv:1810.11112, Oct. 2018.

Outline
• Introduction
• Advanced Designs in MVAPICH2-GDR
  – CUDA-Aware MPI_Bcast
  – CUDA-Aware MPI_Allreduce / MPI_Reduce
• Concluding Remarks

Concluding Remarks

• High-performance broadcast schemes that leverage GDR and IB-MCAST features for streaming and deep learning applications
  – Optimized streaming design for large-message transfers
  – High-performance reliability support for IB-MCAST
• High-performance CUDA-Aware Allreduce for deep learning
  – Efficient reduction kernels on GPUs
➢ These features are included in MVAPICH2-GDR 2.3
  ➢ http://mvapich.cse.ohio-state.edu/
  ➢ http://mvapich.cse.ohio-state.edu/userguide/gdr/2.3/

Thank You!
Ching-Hsiang Chu ([email protected])

The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/

[1] C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD'16, Oct. 26-28, 2016.
[2] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning," ICPP 2017, Aug. 14-17, 2017.
[3] C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications," COMHPC Workshop, 2016.
[4] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, B. Elton, and D. K. Panda, "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," to appear in IEEE TPDS.

Thank You!

• Join us for more tech talks from the MVAPICH2 team
  – http://mvapich.cse.ohio-state.edu/talks/

The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
