MVAPICH2-GDR: High-Performance and Scalable CUDA-Aware MPI for HPC and AI

GPU Technology Conference (GTC 2019) by

Dhabaleswar K. (DK) Panda, The Ohio State University, E-mail: [email protected], http://www.cse.ohio-state.edu/~panda
Hari Subramoni, The Ohio State University, E-mail: [email protected], http://www.cse.ohio-state.edu/~subramon

Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• Current Features
  – Multi-stream Communication for IPC
  – CMA-based Intra-node Host-to-Host Communication Support
  – Maximal Overlap in MPI Datatype Processing
  – Efficient Support for Managed Memory
  – Streaming Support with InfiniBand Multicast and GDR
  – Support for Deep Learning
  – Support for OpenPOWER with NVLink
  – Support for Container
• Upcoming Features
  – CMA-based Intra-node Collective Communication Support
  – XPMEM-based Collective Communication Support
  – Optimized Datatype Processing
  – Out-of-core Processing for Deep Learning
• Conclusions

Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 2,975 organizations in 88 countries
  – More than 529,000 (> 0.5 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '18 ranking):
    • 3rd-ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
    • 14th, 556,104 cores (Oakforest-PACS) in Japan
    • 17th, 367,024 cores (Stampede2) at TACC
    • 27th, 241,108 cores (Pleiades) at NASA, and many others
  – Available with the software stacks of many vendors and distros (RedHat, SuSE, and OpenHPC)
  – http://mvapich.cse.ohio-state.edu
• Partner in the upcoming TACC Frontera system
• Empowering Top500 systems for over a decade

MVAPICH2 Release Timeline and Downloads

[Figure: cumulative number of downloads of MVAPICH/MVAPICH2 releases from 2004 through 2018 (MV 0.9.0 through MV2 2.3.1, MV2-GDR 2.3, MV2-X 2.3rc1, MV2-Virt 2.2, MV2-MIC 2.0, and OSU INAM 0.9.4), exceeding 500,000 downloads]

Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection & analysis

Support for modern networking technologies (InfiniBand, iWARP, RoCE, Omni-Path) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport protocols and features: RC, XRC, UD, DC, UMR, ODP, SR-IOV, multi-rail
• Transport mechanisms and modern features: shared memory, CMA, IVSHMEM, XPMEM, MCDRAM*, NVLink, CAPI*

* Upcoming

MVAPICH2 Software Family
High-Performance Parallel Programming Libraries
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC clouds
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC
Microbenchmarks
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
Tools
• OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications

MVAPICH2-GDR: Optimizing MPI Data Movement on GPU Clusters
• Connected as PCIe devices – flexibility but complexity
• Communication paths between memory buffers:
  1. Intra-GPU
  2. Intra-socket GPU-GPU
  3. Inter-socket GPU-GPU
  4. Inter-node GPU-GPU
  5. Intra-socket GPU-Host
  6. Inter-socket GPU-Host
  7. Inter-node GPU-Host
  8. Inter-node GPU-GPU with IB adapter on remote socket, and more ...
• NVLink is leading to more paths
• For each path, different schemes apply: shared memory, IPC, GPUDirect RDMA, ...
• Critical for runtimes to optimize data movement while hiding the complexity

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers

At Sender:    MPI_Send(s_devbuf, size, ...);
At Receiver:  MPI_Recv(r_devbuf, size, ...);
(data movement is handled inside MVAPICH2)

High Performance and High Productivity
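To make the calling pattern concrete, here is a minimal sketch (not taken from the slides; the buffer name, element count, and ranks are illustrative) of two ranks exchanging a GPU-resident buffer through plain MPI calls:

/* Minimal sketch: rank 0 sends a GPU buffer directly to rank 1 using
 * standard MPI calls; names and sizes are illustrative only. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                    /* 1M floats (illustrative) */
    float *devbuf;
    cudaMalloc((void **)&devbuf, n * sizeof(float));

    /* The device pointer is passed directly to MPI; the CUDA-aware library
     * detects it via UVA and chooses GDR, IPC, or pipelined copies. */
    if (rank == 0)
        MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}

With MVAPICH2-GDR, runs of this kind typically also set MV2_USE_CUDA=1, as in the HOOMD-blue configuration shown later in this talk.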

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3.1 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host, and Host-GPU)
• Takes advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters per node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified Memory support

MVAPICH2-GDR: Pre-requisites for OpenPOWER & x86 Systems

• MVAPICH2-GDR 2.3.1 requires the following software to be installed on your system:
  1. Mellanox OFED 3.2 and later
  2. NVIDIA Driver 367.48 or later
  3. NVIDIA CUDA Toolkit 7.5 and later
  4. NVIDIA Peer Memory (nv_peer_mem) module to enable GPUDirect RDMA (GDR) support
• Strongly recommended for best performance:
  5. GDRCOPY library by NVIDIA: https://github.com/NVIDIA/gdrcopy
• Comprehensive instructions are available in the MVAPICH2-GDR User Guide:
  – http://mvapich.cse.ohio-state.edu/userguide/gdr/

MVAPICH2-GDR: Download and Setup on OpenPOWER & x86 Systems
• Simple installation steps for both systems
• Pick the right MVAPICH2-GDR RPM from the Downloads page:
  – http://mvapich.cse.ohio-state.edu/downloads/
  – e.g., http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3/mofed4.5/mvapich2-gdr-mcast.cuda10.0.mofed4.5.gnu4.8.5-2.3-1.el7.x86_64.rpm (referred to below simply as .rpm)
    $ wget http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3/.rpm
  – Root users: $ rpm -Uvh --nodeps .rpm
  – Non-root users: $ rpm2cpio .rpm | cpio -id
• Contact the MVAPICH help list with any questions related to the package: [email protected]

MVAPICH2-GDR 2.3.1

• Released on 03/16/2019
• Major features and enhancements
  – Based on MVAPICH2 2.3.1
  – Enhanced intra-node and inter-node point-to-point performance for DGX-2 and IBM POWER8 and POWER9 systems
  – Enhanced Allreduce performance for DGX-2 and IBM POWER8/POWER9 systems
  – Enhanced small-message performance for CUDA-Aware MPI_Put and MPI_Get
  – Support for PGI 18.10
  – Flexible support for running TensorFlow (Horovod) jobs
  – Added support for Volta (V100) GPUs
  – Support for OpenPOWER with NVLink
  – Efficient multiple-CUDA-stream-based IPC communication for multi-GPU systems with and without NVLink
  – Leverages the Linux Cross Memory Attach (CMA) feature for enhanced host-based communication
  – InfiniBand Multicast (IB-MCAST) based designs for GPU-based broadcast and streaming applications
  – Efficient broadcast designs for Deep Learning applications

Optimized MVAPICH2-GDR Design
[Figures: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth, MV2 (no GDR) vs. MV2-GDR 2.3.1]
• Inter-node GPU-GPU latency as low as 1.85 us, about 11X better than the non-GDR path for small messages
• Up to 9X higher inter-node GPU-GPU bandwidth and up to 10X higher bi-directional bandwidth
• Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 10.0, Mellanox OFED 4.2 with GPUDirect RDMA

ROCE and Optimized Collectives Support
• RoCE V1 and V2 support
• RDMA_CM connection support
• CUDA-Aware collective tuning
  – Point-to-point tuning (available since MVAPICH2-GDR 2.0)
    • Tuned thresholds for the different communication patterns and features
    • Depending on the system configuration (CPU, HCA, and GPU models)
  – Tuning framework for GPU-based collectives
    • Selects the best algorithm depending on message size, system size, and system configuration
    • Support for Bcast and Gather operations for different GDR-enabled systems
    • Available since the MVAPICH2-GDR 2.2RC1 release

Application-Level Evaluation (HOOMD-blue)
[Figures: average time steps per second (TPS) for 64K- and 256K-particle runs on 4-32 processes, MV2 vs. MV2+GDR, showing roughly 2X improvement with GDR in both cases]
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

[Figures: normalized execution time of the COSMO application with Default, Callback-based, and Event-based designs on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs)]

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
• Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
• On-going collaboration with CSCS and MeteoSwiss (Switzerland) on co-designing MV2-GDR and the COSMO application
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16

Multi-stream Communication using CUDA IPC on OpenPOWER and DGX-1

• Up to 16% higher device-to-device (D2D) bandwidth on OpenPOWER with the NVLink interconnect
• Up to 30% higher D2D bandwidth on DGX-1 with NVLink
[Figures: point-to-point (D-D) bandwidth benefits of the multi-stream CUDA IPC design, 1 stream vs. 4 streams, on both systems]
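As an illustration of the idea behind the design (a simplified standalone sketch, not the library's internal code; the stream count, chunk split, and function name are illustrative, and the real implementation operates on CUDA IPC-mapped remote buffers inside the MPI library), a large device-to-device transfer can be split into chunks issued on several CUDA streams so the copies proceed concurrently:

/* Simplified sketch of multi-stream chunked D2D copies (illustrative only). */
#include <cuda_runtime.h>

#define NSTREAMS 4

void multi_stream_copy(void *dst, const void *src, size_t nbytes)
{
    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; i++)
        cudaStreamCreate(&streams[i]);

    size_t chunk = (nbytes + NSTREAMS - 1) / NSTREAMS;
    for (int i = 0; i < NSTREAMS; i++) {
        size_t off = (size_t)i * chunk;
        if (off >= nbytes) break;
        size_t len = (off + chunk <= nbytes) ? chunk : nbytes - off;
        /* Each chunk is copied on its own stream so copies can overlap. */
        cudaMemcpyAsync((char *)dst + off, (const char *)src + off, len,
                        cudaMemcpyDeviceToDevice, streams[i]);
    }
    for (int i = 0; i < NSTREAMS; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}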

• Available since MVAPICH2-GDR 2.3a

CMA-based Intra-node Host-to-Host Communication Support

• Up to 30% lower Host-to-Host (H2H) latency and 30% higher H2H Bandwidth

[Figures: intra-node point-to-point H2H latency and bandwidth, MV2-GDR without CMA vs. with CMA]

• Platform: MVAPICH2-GDR 2.3a on an Intel Broadwell (E5-2680 v4 @ 2.40 GHz) node with 28 cores, an NVIDIA Tesla K80 GPU, and a Mellanox ConnectX-4 EDR HCA; CUDA 8.0, Mellanox OFED 4.0 with GPUDirect RDMA

Non-contiguous Data Exchange

Halo data exchange

• Multi-dimensional data
  – Row-based organization
  – Contiguous in one dimension
  – Non-contiguous in the other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration

MPI Datatype Support in MVAPICH2
• Datatype support in MPI
  – Operate on user-defined datatypes to improve productivity
  – Enables the MPI library to optimize non-contiguous data transfers
At Sender:
  MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
  MPI_Type_commit(&new_type);
  ...
  MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);
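For a fuller picture of the sender-side pattern above, here is a hedged, self-contained sketch (the array layout, dimensions, and names are illustrative) that sends a strided column of a GPU-resident matrix with a vector datatype:

/* Hedged sketch: sending every row's first element (a strided "column")
 * of a GPU-resident N x M float array using a vector datatype. */
#include <mpi.h>

void send_column(float *d_array, int N, int M, int dest, MPI_Comm comm)
{
    MPI_Datatype column;
    /* N blocks of 1 float each, separated by a stride of M floats */
    MPI_Type_vector(N, 1, M, MPI_FLOAT, &column);
    MPI_Type_commit(&column);

    /* The library packs the non-contiguous GPU data with CUDA kernels and
     * moves it with RDMA; the application just calls MPI_Send. */
    MPI_Send(d_array, 1, column, dest, 0, comm);

    MPI_Type_free(&column);
}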

• Inside MVAPICH2
  – Uses datatype-specific CUDA kernels to pack data in chunks
  – Efficiently moves data between nodes using RDMA
  – In progress: currently optimizes vector and hindexed datatypes
  – Transparent to the user
H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.

MPI Datatype Processing (Computation Optimization)

• Comprehensive support
  – Targeted kernels for regular datatypes: vector, subarray, indexed_block
  – Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  – Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune the kernels
  – Vector: MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE, MV2_CUDA_KERNEL_VECTOR_YSIZE
  – Subarray: MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE, MV2_CUDA_KERNEL_SUBARR_XDIM, MV2_CUDA_KERNEL_SUBARR_YDIM, MV2_CUDA_KERNEL_SUBARR_ZDIM
  – Indexed_block: MV2_CUDA_KERNEL_IDXBLK_XDIM

MPI Datatype Processing (Communication Optimization)

• Common scenario: the application posts several non-blocking sends of non-contiguous MPI datatypes and then waits
  MPI_Isend(A, ..., datatype, ...);
  MPI_Isend(B, ..., datatype, ...);
  MPI_Isend(C, ..., datatype, ...);
  MPI_Isend(D, ..., datatype, ...);
  ...
  MPI_Waitall(...);
  (A, B, ... contain non-contiguous MPI datatypes)
• If datatype packing is serialized with the application, computing resources on both the CPU and GPU are wasted
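A hedged sketch of this scenario, with an application-side compute call overlapped with the non-blocking sends (the buffers, the halo datatype, the neighbor list, and launch_interior_compute are illustrative placeholders, not part of the slides):

/* Non-blocking sends of derived datatypes overlapped with independent
 * computation; names and counts are illustrative. */
#include <mpi.h>

extern void launch_interior_compute(void);   /* application kernel (assumed) */

void exchange_halos(float *A, float *B, float *C, float *D,
                    MPI_Datatype halo_type, const int *neighbors, MPI_Comm comm)
{
    MPI_Request reqs[4];
    float *bufs[4] = { A, B, C, D };

    for (int i = 0; i < 4; i++)
        MPI_Isend(bufs[i], 1, halo_type, neighbors[i], 0, comm, &reqs[i]);

    /* Work that does not depend on the halos can proceed while the MPI
     * library packs and transfers the non-contiguous data. */
    launch_interior_compute();

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
}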

Enhanced Support for Intra-node Unified Memory
• CUDA Unified Memory (UM) => no memory pin-down, but:
  – No IPC support for intra-node communication
  – No GDR support for inter-node communication
• Initial and basic support in MVAPICH2-GDR: both intra- and inter-node transfers "pipeline through" host memory
• Enhanced intra-node UM design uses IPC
  – Double-buffering, pair-wise IPC-based scheme
  – Brings IPC performance to UM: high performance and high productivity
• Available since MVAPICH2-GDR 2.2RC1
[Figure: latency on K80 with MV2-GDR]

K. Hamidouche, A. Awan, A. Venkatesh, and D. K Panda, CUDA M3: Designing Efficient CUDA Managed Memory-aware MPI by Exploiting GDR and IPC, HiPC ‘16

Characterizing Unified Memory-aware MPI on Modern GPUs
[Figure: latency on V100 with MV2-GDR]

• Improved UM support in Pascal and Volta GPUs through:
  – Advanced GPU page-fault engines
  – The cudaMemPrefetchAsync and cudaMemAdvise APIs, which provide more control over UM data placement
• Are the UM designs developed during the Kepler era still valid?
  – Carried out an in-depth characterization (on V100, with MV2-GDR and Open MPI)
• Our characterization studies show:
  – The UM designs from the Kepler era are still valid
  – They are 4.2X and 2.8X better in latency compared to MVAPICH2-GDR and Open MPI, respectively
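From the application's perspective, UM-aware MPI keeps the interface unchanged: a managed buffer is passed to MPI like any device buffer. A minimal hedged sketch (the buffer size, peer logic, and the optional prefetch are illustrative):

/* Hedged sketch: a managed (UM) buffer passed to MPI exactly like a device
 * buffer; cudaMemPrefetchAsync (CUDA 8+) optionally places the pages on the
 * GPU first. */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_um(int rank, int peer, int n, int device_id)
{
    float *um_buf;
    cudaMallocManaged((void **)&um_buf, n * sizeof(float), cudaMemAttachGlobal);

    /* Optional: migrate the managed pages to the GPU before communicating. */
    cudaMemPrefetchAsync(um_buf, n * sizeof(float), device_id, 0);
    cudaDeviceSynchronize();

    if (rank < peer)
        MPI_Send(um_buf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);
    else
        MPI_Recv(um_buf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(um_buf);
}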

K. V. Manian, A. Awan, A. Ruhela, C. Chu, H. Subramoni and D. K Panda, Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures, GPGPU ‘19 Workshop, in conjunction with ASPLOS ’19, April ‘19 Network Based Computing Laboratory GTC 2019 26 Outline • Overview of the MVAPICH2 Project • MVAPICH2-GPU with GPUDirect-RDMA (GDR) • Current Features • Multi-stream Communication for IPC • CMA-based Intra-node Host-to-Host Communication Support • Maximal Overlap in MPI Datatype Processing • Efficient Support for Managed Memory • Streaming Support with InfiniBand Multicast and GDR • Support for Deep Learning • Support for OpenPOWER with NVLink • Support for Container • Upcoming Features • CMA-based Intra-node Collective Communication Support • XPMEM-based Collective Communication Support • Optimized Datatype Processing • Out-of-core processing for Deep Learning • Conclusions Network Based Computing Laboratory GTC 2019 27 Streaming Applications Data Source • Streaming applications on HPC systems Real-time streaming 1. Communication (MPI) • Broadcast-type operations HPC resources for Sender real-time analytics 2. Computation (CUDA) • Multiple GPU nodes as workers Data streaming-like broadcast operations


Hardware Multicast-based Broadcast

• For GPU-resident data, the design uses:
  – GPUDirect RDMA (GDR)
  – InfiniBand hardware multicast (IB-MCAST)
• Overheads to manage:
  – IB UD limit
  – GDR limit
• Broadcast steps: 1. IB gather + GDR read at the source; 2. IB hardware multicast through the switch; 3. IB scatter + GDR write at each destination
• Available since MVAPICH2-GDR 2.3a
[Diagram: source and destination nodes (CPU, GPU, IB HCA) connected through an IB switch; headers travel via multicast while data moves via GDR]
A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, "A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters," in HiPC 2014, Dec 2014.
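From the application side, such a streaming broadcast is still an ordinary MPI_Bcast on a GPU buffer; a minimal hedged sketch (the frame size, count, and function name are illustrative):

/* Hedged sketch: the root streams GPU-resident frames to all workers with
 * MPI_Bcast; MVAPICH2-GDR can map this to IB-MCAST + GDR internally. */
#include <mpi.h>
#include <cuda_runtime.h>

void stream_frames(int n_frames, size_t frame_bytes, MPI_Comm comm)
{
    void *d_frame;
    cudaMalloc(&d_frame, frame_bytes);

    for (int i = 0; i < n_frames; i++) {
        /* Rank 0 (the sender) fills d_frame with the next frame here. */
        MPI_Bcast(d_frame, (int)frame_bytes, MPI_BYTE, 0, comm);
        /* Worker ranks launch their CUDA processing on d_frame here. */
    }
    cudaFree(d_frame);
}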

Streaming Benchmark @ CSCS (88 GPUs)

[Figures: broadcast latency of MCAST-GDR vs. MCAST-GDR-OPT for small (1 B-8 KB) and large (16 KB-4 MB) messages on 88 GPUs]

• IB-MCAST + GDR + Topology-aware IPC-based schemes

  – Up to 58% and 79% latency reduction for small and large messages, respectively
C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, "Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters," SBAC-PAD '16, Oct. 26-28, 2016.

Application-based Evaluation: CUDA-Aware CNTK (Research Poster P9242)
• On the RI2 cluster: 16 GPUs, 1 GPU/node, with the CUDA-Aware Microsoft Cognitive Toolkit (CA-CNTK) [2]
[Figure: speedup (higher is better) of MV2-MCAST-GDR-Opt over MV2-GDR-Knomial and MV2-GDR-Ring on 8 and 16 GPUs for AlexNet, VGG, and ResNet-50]

• Reduces latency by up to 24%, 16%, and 18% for the AlexNet, VGG, and ResNet-50 models, respectively
• Higher improvement can be observed for larger system sizes
[1] C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, ICPP '17.
[2] D. S. Banerjee, K. Hamidouche, and D. K. Panda, Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters, IEEE CloudCom '16.

Deep Learning: New Challenges for Runtimes
• Scale-up: intra-node communication
  – Many improvements, such as NVIDIA cuDNN, cuBLAS, NCCL, etc., and CUDA 9 cooperative groups
• Scale-out: inter-node communication
  – DL frameworks – most are optimized for single-node use only
  – Distributed (parallel) training is an emerging trend:
    • OSU-Caffe – MPI-based
    • Microsoft CNTK – MPI/NCCL2
    • Google TensorFlow – gRPC-based/MPI/NCCL2
    • Facebook Caffe2 – Hybrid (NCCL2/Gloo/MPI)
[Diagram: communication libraries (NCCL2, cuDNN, MPI, MKL-DNN, gRPC, Hadoop) positioned by scale-up vs. scale-out performance; higher in both directions is desired]

Data Parallel Deep Learning and MPI Collectives

• Major MPI collectives involved in designing distributed DL frameworks:
  – MPI_Bcast – required for DNN parameter exchange (data propagation)
  – MPI_Reduce – needed for gradient accumulation from multiple solvers
  – MPI_Allreduce – use just one Allreduce instead of MPI_Reduce followed by a broadcast
[Diagram: in each training-loop iteration, parameters are broadcast from GPU 0 to all GPUs (packed_comm_buff), each GPU runs the forward/backward pass over layers L1..Ln, and the packed gradient buffers (packed_reduce_buff) are reduced to GPU 0 for gradient aggregation and ApplyUpdates]
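A hedged sketch of the resulting communication pattern with the Allreduce variant (the gradient buffers, their length, and the compute helpers forward_backward/apply_update are illustrative placeholders; the gradients are GPU-resident, so the CUDA-aware Allreduce operates on device buffers directly):

/* Data-parallel training step using a CUDA-aware Allreduce; names are
 * illustrative stand-ins for the framework's own routines. */
#include <mpi.h>

extern void forward_backward(float *d_grad, int n);                  /* assumed */
extern void apply_update(const float *d_grad_sum, int n, int world); /* assumed */

void train_step(float *d_grad, float *d_grad_sum, int n, MPI_Comm comm)
{
    int world;
    MPI_Comm_size(comm, &world);

    forward_backward(d_grad, n);                 /* local gradients on the GPU */

    /* Sum gradients across all ranks; both buffers are GPU-resident. */
    MPI_Allreduce(d_grad, d_grad_sum, n, MPI_FLOAT, MPI_SUM, comm);

    apply_update(d_grad_sum, n, world);          /* e.g., average and update */
}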

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).

MVAPICH2-GDR: Allreduce Comparison with Baidu and OpenMPI
• 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-allreduce and OpenMPI 3.0

[Figures: Allreduce latency across small, medium, and large message sizes (up to 512 MB), MVAPICH2 vs. Baidu-allreduce vs. OpenMPI]
• MVAPICH2-GDR is ~10X better than OpenMPI for small messages, while OpenMPI is ~5X slower than Baidu-allreduce
• MVAPICH2-GDR is ~2X better than Baidu-allreduce and ~30X better than OpenMPI for medium messages
• MVAPICH2-GDR is ~4X better for very large messages

• Available since MVAPICH2-GDR 2.3a

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation

• Optimized designs since MVAPICH2-GDR 2.3 offer better or comparable performance in most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs

[Figures: Allreduce latency for small and large messages, MVAPICH2-GDR 2.3.1 vs. NCCL2, showing up to ~1.2X and ~3X improvements]

• Platform: Intel Xeon (Broadwell) nodes with a dual-socket CPU, one K80 GPU, and an EDR InfiniBand interconnect

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation (DGX-2)

• Optimized designs in MVAPICH2-GDR 2.3.1 offer better or comparable performance in most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 1 DGX-2 node (16 Volta GPUs)

[Figures: Allreduce latency for small and large messages on the DGX-2, MVAPICH2-GDR 2.3.1 vs. NCCL 2.3, showing up to ~2.5X and ~1.7X improvements]

• Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2

Scalable TensorFlow using Horovod, MPI, and NCCL

• Efficient Allreduce is crucial for Horovod's overall training performance
  – Both MPI and NCCL designs are available
• We have evaluated Horovod extensively and compared it across a wide range of designs using gRPC and gRPC extensions
• MVAPICH2-GDR achieved up to 90% scaling efficiency for ResNet-50 training on 64 Pascal GPUs

A. A. Awan, J. Bedorf, C.-H. Chu, H. Subramoni and D. K. Panda, “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, (To be presented) CCGrid ‘19. https://arxiv.org/abs/1810.11112

Distributed Training with TensorFlow and MVAPICH2-GDR

• ResNet-50 training using the TensorFlow benchmark on 1 DGX-2 node (8 Volta GPUs)
• Scaling Efficiency = (Actual throughput / Ideal throughput at scale) × 100%
[Figures: images per second and scaling efficiency (%) for 1-8 GPUs, NCCL 2.3 vs. MVAPICH2-GDR-Next; MVAPICH2-GDR delivers up to 7.5% higher throughput]
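As an illustrative calculation (the numbers are hypothetical, not taken from the measurements above): if a single GPU sustains 300 images/second, the ideal throughput on 8 GPUs is 8 × 300 = 2,400 images/second; a measured 2,160 images/second then corresponds to a scaling efficiency of 2,160 / 2,400 × 100% = 90%.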

• Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2

OSU-Caffe: Scalable Deep Learning
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
[Figure: GoogLeNet (ImageNet) training time on 8-128 GPUs, Caffe vs. OSU-Caffe with batch sizes 1024 and 2048; the multi-node configurations are invalid use cases for default Caffe]
• OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/ (support on OpenPOWER will be available soon)

Point-to-Point Host-level Performance on OpenPOWER
[Figures: intra-node latency, bandwidth, and bi-directional bandwidth with MVAPICH2-GDR 2.3.1: latency ~0.5 μs, bandwidth ~30 GB/s, bi-directional bandwidth ~60 GB/s]


[Figures: inter-node latency, bandwidth, and bi-directional bandwidth with MVAPICH2-GDR 2.3.1: latency ~2.3 μs, bandwidth ~12 GB/s, bi-directional bandwidth ~24 GB/s]
• Platform: OpenPOWER (POWER8, ppc64le) CPUs with a Mellanox EDR (MT4115) HCA

Device-to-Device Performance on OpenPOWER (NVLink2 + Volta)
[Figures: intra-node D2D latency (small and large messages) and bandwidth, intra-socket vs. inter-socket]
• Intra-node latency: 5.36 us (without GDRCopy); intra-node bandwidth: 70.4 GB/sec for 128 MB (via NVLink2)

[Figures: inter-node D2D latency (small and large messages) and bandwidth]
• Inter-node latency: 5.66 us (without GDRCopy); inter-node bandwidth: 23.7 GB/sec (2-port EDR)
• Available since MVAPICH2-GDR 2.3a
• Platform: OpenPOWER (POWER9, ppc64le) nodes with a dual-socket CPU, 4 Volta V100 GPUs, and a 2-port EDR InfiniBand interconnect

Container Support

• Increasing trend to provide container support for MPI libraries
  – Ease of build, portability, and reproducibility
• MVAPICH2-GDR 2.3.1 provides container (Docker) support
• More details are available in the MVAPICH2-GDR User Guide: http://mvapich.cse.ohio-state.edu/userguide/gdr/
• Synergistic with the HPC-Container-Maker and hpccm efforts by NVIDIA (https://github.com/NVIDIA/hpc-container-maker)

MVAPICH2-GDR on Containers with Negligible Overhead
[Figures: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth with MVAPICH2-GDR 2.3.1, Docker vs. native, showing negligible difference]
• Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA

Scalable Host-based Collectives on OpenPOWER with CMA (Intra-node Reduce & Alltoall)
[Figures: intra-node Reduce and Alltoall latency (Nodes=1, PPN=20) for MVAPICH2-GDR-Next vs. SpectrumMPI 10.1.0.2 vs. OpenMPI 3.0.0 across 4 B-1 MB messages, with gains ranging from 1.2X-1.4X up to 3.2X-5.2X]
• Up to 5X and 3X performance improvement by MVAPICH2 for small and large messages, respectively

Scalable Host-based Collectives on OpenPOWER with CMA (Multi-node Reduce & Alltoall)
[Figures: Reduce and Alltoall latency (Nodes=4, PPN=20) for MVAPICH2-GDR-Next vs. OpenMPI 3.0.0]

• Up to 12.4X and 8.5X performance improvement by MVAPICH2 for small and large messages, respectively

Shared Address Space (XPMEM-based) Collectives

• Offload reduction computation and communication to peer MPI ranks
  – Every peer has direct "load/store" access to the other peers' buffers
  – Multiple pseudo-roots independently carry out reductions for intra- and inter-node
  – Reduced data is put directly into the root's receive buffer
• True "zero-copy" design for Allreduce and Reduce
  – No copies required during the entire duration of the reduction operation
  – Scalable to multiple nodes
• Zero contention overheads, as memory copies happen in "user space"
• Available since MVAPICH2-X 2.3rc1

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018.

Benefits of XPMEM-based MPI_Bcast

[Figure: Bcast latency for 4 KB-4 MB messages with Intel MPI 2018, OpenMPI 3.0.1, MV2X-2.3rc1 (CMA collectives), and MV2X-2.3rc2 (XPMEM collectives); up to 5X better than OpenMPI]
• 28 MPI processes on a single dual-socket Broadwell E5-2680 v4 (2 x 14-core) processor

Benefits of XPMEM-based MPI_Scatter

[Figure: Scatter latency for 512 B-4 MB messages with Intel MPI 2018, OpenMPI 3.0.1, MV2X-2.3rc1 (CMA collectives), and MV2X-2.3rc2 (XPMEM collectives); up to ~28X better than the state of the art]

• High cache-locality and contention-free access compared to CMA

Optimized All-Reduce with XPMEM
[Figures: Allreduce latency for 16 KB-2 MB messages at Nodes=1 and Nodes=2 (PPN=20), MVAPICH2-GDR-Next vs. SpectrumMPI 10.1.0 vs. OpenMPI 3.0.0, with gains from 34%-48% up to 2X-4X]
• Optimized MPI All-Reduce design in MVAPICH2
  – Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node runs
• Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Application-Level Benefits of XPMEM-Based Collectives

[Figures: execution time of CNTK AlexNet training (Broadwell, default batch size, 50 iterations, ppn=28, 28-224 processes) and of MiniAMR (Broadwell, ppn=16, 16-256 processes) with Intel MPI, MVAPICH2, and MVAPICH2-XPMEM]

• Up to 20% benefits over IMPI for CNTK DNN training using AllReduce
• Up to 27% benefits over IMPI and up to 15% improvement over MVAPICH2 for the MiniAMR application kernel

MVAPICH2-GDR: Enhanced Derived Datatype Processing
• Kernel-based and GDRCOPY-based one-shot packing for inter-socket and inter-node communication
• Zero-copy (packing-free) transfers for GPUs with peer-to-peer direct access over PCIe/NVLink

[Figures: execution time of a GPU-based DDTBench test that mimics the MILC communication kernel (problem sizes [6,8,8,8,8] through [6,16,16,16,16]) and of the COSMO model communication kernel on 8-64 GPUs, with up to ~10X improvement]

• Compared: OpenMPI 4.0.0, MVAPICH2-GDR 2.3, and MVAPICH2-GDR-Next
• Platforms: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2; Cray CS-Storm (8 NVIDIA Tesla K80 GPUs per node), CUDA 8.0

Large (Out-of-core) Models? (Research Poster P9243)
• Large DNNs cannot be trained on GPUs due to memory limitations!
  – For ResNet-50 image recognition, current frameworks can only go up to a small batch size of 45
  – Next-generation models like Neural Machine Translation (NMT) are ridiculously large, consisting of billions of parameters, and require even more memory
  – Can we design out-of-core DNN training support using new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
• The general intuition is that managed allocations "will be" slow!
  – The proposed framework, OC-Caffe (Out-of-Core Caffe), shows that managed-memory designs can provide this capability with negligible or no overhead
• OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on a Volta V100 GPU with CUDA 9 and cuDNN 7
A. A. Awan, C.-H. Chu, H. Subramoni, X. Lu, and D. K. Panda, OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, HiPC '18

Conclusions
• The MVAPICH2-GDR MPI library optimizes MPI communication on InfiniBand and RoCE (V1 and V2) clusters with GPUs, on both x86 and OpenPOWER platforms (including NVLink)
• Provides optimized designs for point-to-point two-sided and one-sided communication, datatype processing, and collective operations
• Takes advantage of CUDA features such as IPC and the GPUDirect RDMA family
• Allows flexible solutions for streaming applications with GPUs
• Provides optimized solutions for both HPC and High-Performance Deep Learning (HiDL) frameworks and applications
• Upcoming releases will support advanced designs

Please join us for more events:
• Monday, March 18 – Research Posters, SJCC Upper Concourse (Concourse Level), 06:00 PM - 08:00 PM
  1. P9243 - Exploiting CUDA Unified Memory for Efficient Out-of-Core DNN Training
  2. P9242 - Exploiting GPUDirect Technology and Hardware Multicast for Streaming and Deep Learning Applications
• Tuesday, March 19 – Talk, SJCC Room 211A, 03:00 PM - 03:50 PM
  – S9476 - MVAPICH2-GDR: High-Performance and Scalable CUDA-Aware MPI Library for HPC and AI
• Wednesday, March 20 – Instructor-Led Training, SJCC Room LL21D (Lower Level), 08:00 AM - 10:00 AM
  – L9121 - How to Boost the Performance of HPC/AI Applications Using MVAPICH2 Library

Personnel Acknowledgments

Current Students (Graduate): A. Awan (Ph.D.), M. Bayatpour (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), A. Jain (Ph.D.), K. S. Khorassani (Ph.D.), P. Kousha (Ph.D.), D. Shankar (Ph.D.)
Current Students (Undergraduate): V. Gangal (B.S.), M. Haupt (B.S.), N. Sarkauskas (B.S.), A. Yeretzian (B.S.)
Current Research Asst. Professor: X. Lu
Current Post-docs: A. Ruhela, K. Manian
Current Research Scientist: H. Subramoni
Current Research Specialist: J. Smith

Past Students: A. Augustine (M.S.), P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), R. Biswas (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), K. Kulkarni (M.S.), R. Kumar (M.S.), P. Lai (M.S.), M. Li (Ph.D.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.), J. Zhang (Ph.D.)
Past Research Scientists: K. Hamidouche, S. Sur
Past Programmers: D. Bureddy, J. Perkins
Past Research Specialist: M. Arnold
Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang

Multiple Positions Available in My Group

• Looking for bright and enthusiastic personnel to join as:
  – PhD Students
  – Post-Doctoral Researchers
  – MPI Programmer/Software Engineer
  – Hadoop/Big Data Programmer/Software Engineer
  – Deep Learning and Cloud Programmer/Software Engineer
• If interested, please send an e-mail to [email protected]

Thank You! [email protected], [email protected]

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
