Efficient and Scalable Communication Middleware for Emerging Dense-GPU Clusters

Ching-Hsiang Chu (Advisor: Dhabaleswar K. Panda)

Network Based Computing Lab, Department of Computer Science and Engineering, The Ohio State University, Columbus, OH

Outline

• Introduction
• Problem Statement
• Detailed Description and Results
• Broader Impact on the HPC Community
• Expected Contributions

Trends in Modern HPC Architecture: Heterogeneous

• Multi-core/many-core processors
• High-performance interconnects (InfiniBand, Omni-Path, EFA): <1 usec latency, 100 Gbps+ bandwidth
• Accelerators/coprocessors: high compute density, high performance per watt
• High-performance storage and compute devices: SSD, NVMe-SSD, NVRAM node-local storage
• Variety of programming models (MPI, PGAS, MPI+X)

Examples from the TOP500 list: #1 Summit (27,648 GPUs), #2 Sierra (17,280 GPUs), #8 ABCI (4,352 GPUs), #10 Lassen (2,664 GPUs), #22 DGX SuperPOD (1,536 GPUs)

Trends in Modern Large-scale Dense-GPU Systems

• Scale-up (up to 150 GB/s): PCIe, NVLink/NVSwitch, Infinity Fabric, Gen-Z, CXL
• Scale-out (up to 25 GB/s): InfiniBand, Omni-Path, Cray Slingshot

GPU-enabled HPC Applications

Examples: lattice quantum chromodynamics, weather simulation, wave propagation simulation

• Various scientific applications are ported to GPUs
– Reported speedups of 2.8x, 23x, and 25x over their CPU versions for the applications shown
– High-resolution/high-precision results

Fuhrer O., Osuna C., Lapillonne X., Gysi T., Bianco M., Schulthess T., "Towards GPU-accelerated operational weather forecasting," GTC 2013; Derek Leinweber, CSSM, University of Adelaide; https://geodynamics.org/cig/software/specfem3d_globe/

GPU-enabled Emerging Deep Learning Applications

• Easy-to-use and high-performance frameworks

999 PetaFlop/s sustained and 1.13 ExaFlop/s peak FP16 performance over 4,560 nodes (27,360 GPUs)

• Wide range of applications
– Image classification
– Speech recognition
– Self-driving cars
– Healthcare
– Climate analytics

Kurth T., Treichler S., Romero J., Mudigonda M., Luehr N., Phillips E., Mahesh A., Matheson M., Deslippe J., Fatica M., Houston M., "Exascale Deep Learning for Climate Analytics," SC 2018. (Gordon Bell Prize)

GPU-Aware (CUDA-Aware) Communication Middleware

MPI-based generic communication middleware (e.g., inside MVAPICH2):
• Supports and optimizes various communication patterns
• Overlaps data movement from GPU with RDMA transfers

DL-specific communication middleware:
• Ring-based collective operations
• Optimized for DL workloads on GPU systems

The basic CUDA-aware usage model that these libraries expose is sketched below.
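As background, the following is a minimal sketch of the CUDA-aware MPI usage model: the application hands GPU device pointers directly to MPI calls and the library chooses the transfer path (host staging or GPUDirect RDMA). It assumes a CUDA-aware MPI library such as MVAPICH2-GDR; the buffer size and ranks are illustrative only.

    /* Minimal sketch: CUDA-aware MPI point-to-point with device buffers.
     * Assumes a CUDA-aware MPI library; sizes/ranks are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const size_t n = 1 << 20;                 /* 1M floats (illustrative) */
        float *d_buf;
        cudaMalloc((void **)&d_buf, n * sizeof(float));

        if (rank == 0) {
            /* The device pointer is passed straight to MPI; the middleware
             * decides whether to stage through the host or use GPUDirect RDMA. */
            MPI_Send(d_buf, (int)n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, (int)n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }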

Broad Challenge

Can we design a generic GPU-enabled communication middleware to fully exploit GPU resources and interconnects for traditional HPC and emerging ML/DL applications?

The accompanying radar chart compares a naive user, an advanced user, and the ideal case along the axes of performance, overlap, productivity, and resource utilization (farther from the center is better).

Outline

• Introduction
• Problem Statement
• Detailed Description and Results
• Broader Impact on the HPC Community
• Expected Contributions

Problem Statements

• What hardware capabilities can be leveraged to fully exploit the modern interconnects deployed in GPU clusters?
• What communication patterns can benefit from the proposed GPU-enabled communication middleware?
• How can GPU resources such as high-bandwidth memory (HBM) and the massive number of streaming multiprocessors (SMs) be leveraged to accelerate communication?
• What are the design considerations for a GPU-enabled communication middleware to efficiently utilize these hardware features?
• What performance benefits can be expected from the proposed GPU-enabled communication middleware?
• How can traditional HPC and DL applications take advantage of the proposed communication middleware without application-level modifications?

Research Framework

• Benchmarks and applications: OMB and streaming benchmarks; traditional HPC applications (MILC, COSMO, AWP-ODC, HOOMD-blue); machine/deep learning applications (TensorFlow, CNTK)
• Programming models: Message Passing Interface (MPI); compute models: CUDA, OpenACC
• Proposed GPU-enabled communication middleware: scalable streaming broadcast; efficient non-contiguous data processing (scheduling and zero-copy); GPU-enabled cooperative and link-efficient reduction operations
• Hardware capabilities: GPUDirect RDMA, GDRCOPY, hardware multicast, inter-GPU direct load/store
• Modern HPC hardware: multi-core processors (Xeon, OpenPOWER), accelerators (GPU), interconnects (InfiniBand, PCIe, NVLink/NVSwitch)

Outline

• Introduction
• Problem Statement
• Detailed Description and Results
– Scalable Streaming Broadcast for InfiniBand Networks
– Efficient Scheduling of Non-contiguous Data Transfer
– GPU-enabled Zero-copy Transfer for Non-contiguous Data
– GPU-enabled Reduction Operations
• Broader Impact on the HPC Community
• Expected Contributions

Motivating Example #1 – The Need for Scalable Broadcast

• Streaming & Deep Learning applications Data Source – Large-scale broadcast operations – High computation-communication overlap Online/Offline streaming – No application-level modification Data Distributor / Parameter Server

Figure: data streaming-like broadcast operations from the data distributor / parameter server to workers (CPU + GPU) on GPU Nodes 1-4.

Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, March 2019.

Proposed Efficient and Scalable Solution

• Streaming data through host memory
– Fine-tuned chunked data
– Three-stage pipeline: (1) data preparation into an intermediate host buffer (im_buf), (2) IB gather at the source, (3) IB hardware multicast with IB scatter (GDR write) directly into destination GPU memory (d_out)
• Leverages IB scatter-gather and GPUDirect RDMA (GDR) features
• Frees up PCIe resources for applications
• The source calls MPI_Bcast(d_in, ...) and every destination calls MPI_Bcast(d_out, ...); the pipeline is handled inside the MPI library (a usage sketch follows)

Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, March 2019.
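From the application's point of view, the streaming broadcast is simply a repeated MPI_Bcast on a device buffer; the MCAST/GDR pipeline described above runs inside the library. The sketch below assumes a CUDA-aware MPI library such as MVAPICH2-GDR; the chunking loop and the produce/consume steps are illustrative placeholders.

    /* Sketch: streaming-style broadcast of GPU-resident data chunks.
     * The multicast/GDR pipeline is applied inside the MPI library;
     * the application only calls MPI_Bcast on device buffers. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void stream_chunks(void *d_chunk, size_t bytes, int nchunks, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        for (int i = 0; i < nchunks; i++) {
            if (rank == 0) {
                /* root: produce or receive the next chunk into d_chunk here */
            }
            /* every rank passes the same device buffer; rank 0 is the root */
            MPI_Bcast(d_chunk, (int)bytes, MPI_BYTE, 0, comm);
            /* workers: launch GPU processing on the received chunk here */
        }
    }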

Performance Evaluation

• Evaluated on RI2 GPU nodes
• Streaming workload (throughput in GB/s on 4, 8, and 16 GPU nodes): the proposed TA-Zcpy-MCAST-GDR-Pipeline design delivers up to 6.9x higher streaming throughput than the Knomial-GDR and Ring-GDR-Pipeline schemes, approaching the peak throughput
• CA-CNTK* image classification (AlexNet, VGG, ResNet-50) on 8 and 16 GPU nodes: up to 1.2x speedup

*D. S. Banerjee, K. Hamidouche, and D. K. Panda, "Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters," CloudCom 2016.

Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, March 2019.

Performance Model Validation and Prediction

• Analytical performance model instantiated with parameters from the RI2 cluster architecture (message size, chunk size, link latencies, and bandwidths)
• Model validation (2-16 broadcast sources): model-based latency estimates for the K-nomial-based, Ring-based, and MCAST-GDR-Opt broadcast designs match the experiments within 10% error
• Model-based prediction (up to 2,048 broadcast sources): the MCAST-GDR-Opt design shows the best projected scalability as the number of broadcast sources grows

Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, March 2019.

Outline

• Introduction
• Problem Statement
• Detailed Description and Results
– Scalable Streaming Broadcast for InfiniBand Networks
– Efficient Scheduling of Non-contiguous Data Transfer
– GPU-enabled Zero-copy Transfer for Non-contiguous Data
– GPU-enabled Reduction Operations
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions

Motivating Example #2 – Non-contiguous Data Transfer

• MPI derived datatypes are widely used for non-contiguous data transfers (illustrated in the sketch below)
– Require low-latency and high-overlap processing
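For illustration, the sketch below describes a strided column of a 2-D field resident in GPU memory with an MPI derived datatype and posts it with MPI_Isend; a CUDA-aware MPI library is assumed and the field dimensions are illustrative only.

    /* Sketch: sending a strided (non-contiguous) halo column from GPU memory
     * with an MPI derived datatype. Dimensions are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void send_halo_column(float *d_field, int nx, int ny, int peer, MPI_Comm comm)
    {
        /* one column of an nx-by-ny row-major field:
         * ny blocks of 1 float, separated by a stride of nx floats */
        MPI_Datatype column;
        MPI_Type_vector(ny, 1, nx, MPI_FLOAT, &column);
        MPI_Type_commit(&column);

        /* the GPU-aware library packs (or zero-copies) the strided elements */
        MPI_Request req;
        MPI_Isend(d_field, 1, column, peer, 0, comm, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
    }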

Example applications: quantum chromodynamics (MILC with QUDA) and weather simulation (the COSMO model).

M. Martinasso, G. Kwasniewski, S. R. Alam, T. C. Schulthess, and T. Hoefler, "A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers," SC 2016.

Mike Clark, "GPU Computing with QUDA," Developer Technology Group, https://www.olcf.ornl.gov/wp-content/uploads/2013/02/Clark_M_LQCD.pdf

Existing GPU-enabled MPI Datatype Processing

• Common scenario (A, B, C, D contain non-contiguous MPI datatypes):
  MPI_Isend(A, ..., datatype, ...);
  MPI_Isend(B, ..., datatype, ...);
  MPI_Isend(C, ..., datatype, ...);
  MPI_Isend(D, ..., datatype, ...);
  ...
  MPI_Waitall(...);
• Existing design: each Isend serially initiates a packing kernel on the GPU, waits for the kernel (WFK) to finish, and only then starts the send; this wastes computing resources on both the CPU and the GPU.
• Proposed design: the Isends initiate their packing kernels on separate streams and return immediately; kernel execution, progress, and sends overlap, so communication finishes earlier than with the existing design.

Ching-Hsiang Chu et al., "Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems," IEEE IPDPS 2016.

Proposed Event-based Design – Low Latency

• Each MPI_Isend launches its packing kernel on a separate CUDA stream and records a CUDA event (cudaEventRecord) behind it; the default stream stays free for application kernels.
• Inside MPI_Waitall, the progress engine queries the recorded events; as soon as a packing kernel completes, the corresponding send is posted to the HCA and the request is completed.
• Latency is low because no callback or helper-thread overhead is involved. A minimal sketch of this pattern follows.
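The following is a minimal sketch of the general event-based pattern (record a CUDA event behind the packing kernel, then poll it from the progress loop). It is not the actual MVAPICH2 implementation; pack_kernel and post_send are hypothetical placeholders.

    /* Sketch of the event-based pattern: launch a packing kernel on its own
     * stream, record a CUDA event, and let the progress loop poll the event
     * before posting the network send. */
    #include <cuda_runtime.h>

    typedef struct {
        cudaEvent_t done;
        void       *packed_buf;
        size_t      bytes;
    } pending_send_t;

    void start_pack(pending_send_t *p, cudaStream_t stream /*, datatype args */)
    {
        cudaEventCreateWithFlags(&p->done, cudaEventDisableTiming);
        /* pack_kernel<<<grid, block, 0, stream>>>(...);  // hypothetical */
        cudaEventRecord(p->done, stream);    /* marks the end of packing */
    }

    int try_progress(pending_send_t *p)
    {
        if (cudaEventQuery(p->done) == cudaSuccess) {
            /* post_send(p->packed_buf, p->bytes);  // hypothetical RDMA send */
            return 1;   /* request can be completed */
        }
        return 0;       /* packing still in flight; poll again later */
    }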

Proposed Callback-based Design – High Overlap

• Each MPI_Isend launches its packing kernel on a separate CUDA stream and enqueues a callback (addCallback) behind it.
• When a packing kernel finishes, the CUDA runtime invokes the callback, which hands the send over to a helper thread; the helper posts the send to the HCA and completes the request.
• The main thread is free to run CPU computations in the meantime, yielding high computation-communication overlap; MPI_Waitall simply waits for the outstanding requests to complete. A sketch of this pattern follows.
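Below is a minimal sketch of the callback idea using cudaLaunchHostFunc (the modern form of stream callbacks); it is not MVAPICH2 internals, and pack_kernel and enqueue_send_to_helper are hypothetical placeholders.

    /* Sketch of the callback-based pattern: after enqueuing the packing kernel,
     * a host function is enqueued on the same stream; the CUDA runtime invokes
     * it once packing finishes, and it hands the send to a helper thread. */
    #include <cuda_runtime.h>

    static void CUDART_CB on_pack_complete(void *user_data)
    {
        /* runs on a CUDA-managed host thread; keep it short and non-blocking */
        /* enqueue_send_to_helper(user_data);  // helper thread posts the send */
        (void)user_data;
    }

    void start_pack_with_callback(void *send_req, cudaStream_t stream)
    {
        /* pack_kernel<<<grid, block, 0, stream>>>(...);  // hypothetical */
        cudaLaunchHostFunc(stream, on_pack_complete, send_req);
        /* the main thread returns immediately and can overlap CPU computation */
    }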

Application-level (COSMO HaloExchange) Evaluation

• Halo-exchange pattern used in the evaluation:
  MPI_Isend(Buf1, ..., request1);
  MPI_Isend(Buf2, ..., request2);
  MPI_Wait(request1, status1);
  MPI_Wait(request2, status2);
• Wilkes GPU cluster (4-32 GPUs): the callback-based and event-based designs reduce normalized execution time by up to 2x over the default design
• CSCS GPU cluster (16-96 GPUs): up to 1.6x improvement over the default design

Ching-Hsiang Chu et al., "Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems," IEEE IPDPS 2016.

Outline

• Introduction
• Problem Statement
• Detailed Description and Results
– Scalable Streaming Broadcast for InfiniBand Networks
– Efficient Scheduling of Non-contiguous Data Transfer
– GPU-enabled Zero-copy Transfer for Non-contiguous Data
– GPU-enabled Reduction Operations
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions

Proposed Zero-copy (packing-free) Datatype Transfer

• Exploits the load-store capability of modern interconnects
– Eliminates extra data copies and expensive packing/unpacking processing

• Existing packing scheme: non-contiguous data are first copied (packed) into a contiguous buffer in source GPU or system/HCA memory, moved over PCIe/NVLink, and unpacked into destination GPU memory.
• Proposed packing-free scheme: a GPU kernel uses direct load-store accesses over PCIe/NVLink to move the non-contiguous elements straight from source GPU memory to destination GPU memory, with no intermediate copies. A minimal sketch of the load-store idea is shown below.
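To illustrate the load-store mechanism (not the actual MVAPICH2-GDR implementation), the sketch below assumes peer access has been enabled between the two GPUs (e.g., via cudaDeviceEnablePeerAccess) and gathers a strided column directly into a buffer that resides on the destination GPU.

    /* Sketch of packing-free, load-store datatype transfer between peer GPUs:
     * a copy kernel gathers strided elements from the source GPU and stores
     * them directly into a buffer resident on the destination GPU. */
    #include <cuda_runtime.h>

    __global__ void gather_to_peer(const float *src, float *dst_on_peer,
                                   int count, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count) {
            /* load from local GPU memory, store over NVLink/PCIe into the
             * peer GPU's memory: no packing buffer, no host copy */
            dst_on_peer[i] = src[(size_t)i * stride];
        }
    }

    void copy_column_to_peer(const float *d_src, float *d_dst_peer,
                             int count, int stride, cudaStream_t stream)
    {
        int threads = 256;
        int blocks  = (count + threads - 1) / threads;
        gather_to_peer<<<blocks, threads, 0, stream>>>(d_src, d_dst_peer,
                                                       count, stride);
    }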

Ching-Hsiang Chu et al., "High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems," HiPC 2019.

Performance Evaluation

• Zero-copy (packing-free) for GPUs with peer-to-peer direct access over PCIe/NVLink

• Communication kernel of the COSMO model (https://github.com/cosunae/HaloExchangeBenchmarks) on a DGX-2 system: up to 3.4x lower execution time than OpenMPI 4.0.0 and MVAPICH2-GDR 2.3.1 across several problem sizes
• GPU-based DDTBench mimicking the MILC communication kernel on a Cray CS-Storm system (16-64 GPUs): up to 15x speedup over MVAPICH2-GDR 2.3.1

Ching-Hsiang Chu et al., "High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems," HiPC 2019.

Outline

• Introduction
• Problem Statement
• Detailed Description and Results
– Scalable Streaming Broadcast for InfiniBand Networks
– Efficient Scheduling of Non-contiguous Data Transfer
– GPU-enabled Zero-copy Transfer for Non-contiguous Data
– GPU-enabled Reduction Operations
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions

Motivating Example #3 – Reduction Operations for DL Training

• Can GPU resources help improve compute-intensive communication operations such as MPI_Reduce, MPI_Allreduce, and MPI_Scan?
• Emerging distributed deep learning training exchanges and updates model weights (see the sketch below)
– Requires fast and high-bandwidth reduction solutions
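As an illustration of this communication pattern, the hedged sketch below shows gradient averaging with a GPU-aware MPI_Allreduce on a device-resident buffer; scale_kernel is a hypothetical placeholder for the scaling step, and a CUDA-aware MPI library is assumed.

    /* Sketch: gradient averaging in data-parallel training with a GPU-aware
     * MPI_Allreduce on a device-resident gradient buffer. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void average_gradients(float *d_grad, int count, MPI_Comm comm)
    {
        int nranks;
        MPI_Comm_size(comm, &nranks);

        /* sum the gradients across all workers directly on GPU buffers */
        MPI_Allreduce(MPI_IN_PLACE, d_grad, count, MPI_FLOAT, MPI_SUM, comm);

        /* divide by the number of workers to get the average; a real framework
         * would fold this into the optimizer step (hypothetical scale_kernel) */
        /* scale_kernel<<<blocks, threads>>>(d_grad, count, 1.0f / nranks); */
        (void)nranks;
    }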

Ben-Nun T., Hoefler T., "Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis," arXiv:1802.09941, 2018. https://www.oreilly.com/ideas/distributed-tensorflow

How to Leverage GPUs for MPI Reduction Operations?

• Existing designs
1. Explicitly copy the data from GPU to host memory (expensive)
2. Host-to-host communication to the remote processes (fast, but relatively slow for large data)
3. Perform the computation on the CPU (good for small data, expensive otherwise)
4. Explicitly copy the result from host back to GPU memory (expensive)
• Proposed designs
1. GPU-to-GPU communication using NVIDIA GPUDirect RDMA (GDR), pipelined through the host for large messages
2. Perform the computation on the GPU with efficient CUDA kernels

Ching-Hsiang Chu et al., "CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters," IEEE/ACM CCGrid 2016.

Alternative and Extended Designs

Design space (computation location, communication path, design names, algorithm, and target regime):

• CPU computation, Host<->Host communication: BR-H-HH (default, Binomial-Reduce), RD-H-HH (default, Recursive Doubling), GR-H-HH (Gather-Reduce) – large scale, small messages
• GPU computation, Host<->Device communication (GDR): GR-HH, GR-HD / GR-DH, GR-DD (Gather-Reduce) – small scale, small messages
• GPU computation, Device<->Device communication (GDR): BR-DD (Binomial-Reduce), BRB-DD (Binomial-Reduce-Bcast), RD-DD (Recursive Doubling); Host<->Device (GDR): RD-HD / RD-DH (Recursive Doubling) – large messages at any scale

Proposed NVGroup Allreduce

• Groups GPUs that are fully connected by NVLink
– Contention-free communication within each group
• Cooperative reduction kernels exploit load-compute-store primitives over NVLink (a minimal sketch of the idea follows)
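The sketch below illustrates only the load-compute-store idea, under the assumption that all peer buffers are directly addressable (e.g., after cudaDeviceEnablePeerAccess); the actual NV-Group design additionally handles grouping, chunking, and inter-node stages.

    /* Sketch of a load-compute-store reduction across NVLink-connected GPUs:
     * one kernel reads the same element from every peer GPU's buffer, sums in
     * registers, and writes the result, fusing communication and computation. */
    #include <cuda_runtime.h>

    #define MAX_GPUS 8   /* illustrative upper bound on group size */

    typedef struct { float *bufs[MAX_GPUS]; } peer_ptrs_t;

    __global__ void nvlink_reduce(peer_ptrs_t peers, int ngpus,
                                  float *out, int count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count) {
            float sum = 0.0f;
            for (int g = 0; g < ngpus; g++) {
                /* load directly from each peer GPU's memory over NVLink */
                sum += peers.bufs[g][i];
            }
            out[i] = sum;   /* store the reduced value */
        }
    }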

Ching-Hsiang Chu et al., "NV-Group: Cooperative and Link-Efficient Reductions for Deep Learning on NVLink-enabled Dense GPU Systems" (to be submitted).

Preliminary Results – Allreduce Benchmark

• Three benchmark views: latency on 1,536 GPUs (4 B to 16 KB messages), bandwidth on 1,536 GPUs (32 MB to 256 MB messages), and bandwidth for a 128 MB message on 24 to 1,536 GPUs
• The proposed NVGroup Allreduce performs up to 1.7x better than SpectrumMPI 10.2.0.11, OpenMPI 4.0.1, and NCCL 2.4 in the latency and bandwidth tests, and up to 1.6x better than NCCL 2.4 for the 128 MB message across GPU counts

Platform: #1 Summit (per node: dual-socket IBM POWER9 CPUs, 6 NVIDIA Volta V100 GPUs, and a 2-port InfiniBand EDR interconnect)

Ching-Hsiang Chu et al., "NV-Group: Cooperative and Link-Efficient Reductions for Deep Learning on NVLink-enabled Dense GPU Systems" (to be submitted).

Preliminary Results – Distributed Deep Learning Training

• ResNet-50 training using the TensorFlow benchmark on a DGX-2 machine (16 Volta GPUs)
• Scaling efficiency = (actual throughput / ideal throughput at scale) x 100%
• Images per second and scaling efficiency measured on 1-16 GPUs: the proposed design achieves up to 8% higher scaling efficiency than NCCL 2.4 (compared against NCCL 2.4, MVAPICH2-GDR 2.3.1, and the ideal scaling). A hypothetical worked example of the scaling-efficiency formula follows.
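As a hypothetical worked example of the formula (the numbers are illustrative, not the measured results): if a single GPU sustains 400 images/s and 16 GPUs together sustain 5,760 images/s, then scaling efficiency = 5,760 / (16 x 400) x 100% = 90%.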

Ching-Hsiang Chu et al., "NV-Group: Cooperative and Link-Efficient Reductions for Deep Learning on NVLink-enabled Dense GPU Systems" (to be submitted).

Outline

• Introduction
• Problem Statement
• Detailed Description and Results
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions

MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1): started in 2001, first version available in 2002
– MVAPICH2-X (MPI + PGAS): available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC): available since 2014
– Support for virtualization (MVAPICH2-Virt): available since 2015
– Support for energy-awareness (MVAPICH2-EA): available since 2015
– Support for InfiniBand network analysis and monitoring (OSU INAM): available since 2015
• Used by more than 3,000 organizations in 89 countries
• More than 553,000 (> 0.5 million) downloads from the OSU site directly
• Empowering many TOP500 systems for over a decade (June 2019 ranking):
– 3rd: 10,649,640-core Sunway TaihuLight at NSC, Wuxi, China
– 16th: 556,104-core Oakforest-PACS in Japan
– 19th: 367,024-core Stampede2 at TACC
– 31st: 241,108-core Pleiades at NASA, and many others
– Partner in the 5th-ranked TACC Frontera system
• Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, and OpenHPC)
• http://mvapich.cse.ohio-state.edu

Impact to the Community

• Accelerating GPU-enabled HPC applications worldwide
– MVAPICH2-GDR is widely used on many top-ranked GPU clusters, including Summit (#1), Sierra (#2), ABCI (#8), Lassen (#10), and more
• Enabling fast weather forecasting
– MeteoSwiss uses the COSMO numerical weather forecasting model for the production of regional and local forecast products
– MVAPICH2-GDR is used to accelerate its GPU communication
• Supporting scalable and reliable data dissemination
– DoD streaming applications use MVAPICH2-GDR to accelerate GPU-based broadcast operations
– Tutorial sessions: PETTT '17, PETTT '18

Outline

• Introduction
• Problem Statement
• Detailed Description and Results
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions

Expected Contributions

• Improving scale-out and scale-up performance by exploiting features of modern interconnects
– Scalable broadcast using InfiniBand hardware multicast and GPUDirect RDMA
– Link-efficient schemes exploiting load-store primitives over GPU interconnects
• Efficient GPU-enabled communication middleware
– GPU-enabled MPI derived datatype processing
– GPU-enabled reduction operations
• Significant impact on the community
– An abstraction for accelerator-enabled communication middleware
– Benefits for HPC and ML/DL workloads
– Broader outreach through MVAPICH2-GDR public releases

Thank You!

Questions?

[email protected] http://web.cse.ohio-state.edu/~chu.368
