Designing High Performance Scalable Middleware for HPC/AI Exascale Systems and Clouds

Keynote Talk at HPC-AI Advisory Council Stanford Conference (May ‘21) by

Dhabaleswar K. (DK) Panda, The Ohio State University, E-mail: [email protected]

Follow us on http://www.cse.ohio-state.edu/~panda and https://twitter.com/mvapich

High-End Computing (HEC): PetaFlop to ExaFlop

• 100 PetaFlops in 2017
• 442 PetaFlops in 2020 (Fugaku in Japan, with 7.63M cores)
• 1 ExaFlop: expected to have an ExaFlop system in 2021!

Increasing Usage of HPC, Big Data and Deep/Machine Learning

• HPC (MPI, PGAS, etc.)
• Big Data (Hadoop, Spark, HBase, Memcached, Dask, etc.)
• Deep/Machine Learning (TensorFlow, PyTorch, BigDL, cuML, etc.)
• Convergence of HPC, Big Data, and Deep/Machine Learning!
• Increasing need to run these applications on the Cloud!!

Presentation Overview

• MVAPICH Project
  – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project
  – High-Performance Deep Learning
  – High-Performance Machine Learning
• HiBD Project
  – Accelerating Data Science Applications with Dask
• Optimizations and Deployments in Public Cloud
  – AWS and Azure
• Commercial Support and Value Added Products
• Conclusions

Overview of the MVAPICH2 Project
• High Performance open-source MPI Library
• Support for multiple interconnects
  – InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), and AWS EFA
• Support for multiple platforms
  – x86, OpenPOWER, ARM, Xeon-Phi, GPGPUs (NVIDIA and AMD)
• Started in 2001, first open-source version demonstrated at SC '02
• Supports the latest MPI-3.1 standard
• Used by more than 3,150 organizations in 89 countries
• More than 1.35 million downloads from the OSU site directly
  – http://mvapich.cse.ohio-state.edu
• Additional optimized versions for different systems/environments:
  – MVAPICH2-X (Advanced MPI + PGAS), since 2011
  – MVAPICH2-GDR with support for NVIDIA GPGPUs, since 2014
  – MVAPICH2-MIC with support for Intel Xeon-Phi, since 2014
  – MVAPICH2-Virt with virtualization support, since 2015
  – MVAPICH2-EA with support for Energy-Awareness, since 2015
  – MVAPICH2-Azure for Azure HPC IB instances, since 2019
  – MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
• Tools:
  – OSU MPI Micro-Benchmarks (OMB), since 2003
  – OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
• Empowering many TOP500 clusters (Nov '20 ranking)
  – 4th: 10,649,600-core Sunway TaihuLight at NSC, Wuxi, China
  – 9th: 448,448-core Frontera at TACC
  – 14th: 391,680-core ABCI in Japan
  – 21st: 570,020-core Nurion in South Korea, and many others
• Available with software stacks of many vendors and distros (RedHat, SuSE, OpenHPC, and Spack)
• Partner in the 9th-ranked TACC Frontera system
• Empowering Top500 systems for more than 16 years

Architecture of MVAPICH2 Software Family for HPC and DL/ML

High Performance Parallel Programming Models
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms
• Point-to-point primitives, collective algorithms, job startup, remote memory access, energy-awareness, I/O and file systems, fault tolerance, virtualization, active messages, and introspection & analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter)
• Transport protocols: RC, SRD, UD, DC
• Modern features: SR-IOV, multi-rail, UMR, ODP

Support for Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU)
• Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM
• Modern features: Optane*, NVLink, CAPI*

* Upcoming

MVAPICH2 Software Family

Requirements and the corresponding libraries:
• MVAPICH2 – MPI with IB, iWARP, Omni-Path, and RoCE
• MVAPICH2-X – Advanced MPI features/support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE
• MVAPICH2-GDR – MPI with IB, RoCE & GPU, and support for Deep/Machine Learning
• MVAPICH2-Virt – HPC Cloud with MPI & IB
• MVAPICH2-EA – Energy-aware MPI with IB, iWARP, and RoCE
• OEMT – MPI energy monitoring tool
• OSU INAM – InfiniBand network analysis and monitoring
• OMB – Microbenchmarks for measuring MPI and PGAS performance

Startup Performance on TACC Frontera

(Charts: MPI_Init time for MVAPICH2 2.3.4 vs. Intel MPI 2020 from 56 to 229,376 processes; MVAPICH2 is up to 14X and 46X faster)
• MPI_Init takes 31 seconds on 229,376 processes on 4,096 nodes
• All numbers reported with 56 processes per node
• New designs available since MVAPICH2 2.3.4
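A minimal sketch of how such startup times can be measured from a user program, assuming mpi4py is built against the MVAPICH2 installation under test (script name and launcher line are illustrative):

# Time MPI_Init explicitly by disabling mpi4py's automatic initialization.
# Launch with, e.g.: mpirun_rsh -np <N> -hostfile hosts python init_time.py
import time
import mpi4py
mpi4py.rc.initialize = False
from mpi4py import MPI

t0 = time.perf_counter()
MPI.Init()
init_time = time.perf_counter() - t0

comm = MPI.COMM_WORLD
worst = comm.reduce(init_time, op=MPI.MAX, root=0)  # the slowest rank dominates at scale
if comm.Get_rank() == 0:
    print(f"MPI_Init took {worst:.2f} s on {comm.Get_size()} processes")
MPI.Finalize()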

Hardware Multicast-aware MPI_Bcast on TACC Frontera
(Charts: MPI_Bcast latency, Default vs. Multicast, across message sizes at 2K nodes (PPN=28) and across node counts for 16 B and 32 KB messages; multicast is 1.8X-2X faster)
• MCAST-based designs improve latency of MPI_Bcast by up to 2X at 2,048 nodes
• Use MV2_USE_MCAST=1 to enable MCAST-based designs
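A minimal mpi4py broadcast sketch to show where the knob applies; MV2_USE_MCAST=1 is set in the job environment at launch (the launcher line and host file below are placeholders):

# Launch with, e.g.: MV2_USE_MCAST=1 mpirun_rsh -np <N> -hostfile hosts python bcast_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = np.arange(4096, dtype=np.float64) if rank == 0 else np.empty(4096, dtype=np.float64)
comm.Bcast(buf, root=0)   # the MPI_Bcast that hardware multicast accelerates

assert buf[10] == 10.0
if rank == 0:
    print("broadcast", buf.nbytes, "bytes to", comm.Get_size(), "ranks")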

Performance of Collectives with SHARP on TACC Frontera
(Charts: latency of MPI_Allreduce, MPI_Reduce, and MPI_Barrier for MVAPICH2-X vs. MVAPICH2-X-SHARP at PPN=1 on up to 7,861 nodes)
• Optimized SHARP designs in MVAPICH2-X
• Up to 9X performance improvement with SHARP over the MVAPICH2-X default for 1 PPN MPI_Barrier, 6X for 1 PPN MPI_Reduce, and 5X for 1 PPN MPI_Allreduce
• Optimized runtime parameter: MV2_ENABLE_SHARP=1
B. Ramesh, K. Suresh, N. Sarkauskas, M. Bayatpour, J. Hashmi, H. Subramoni, and D. K. Panda, "Scalable MPI Collectives using SHARP: Large Scale Performance Evaluation on the TACC Frontera System", ExaMPI 2020 - Workshop on Exascale MPI, Nov 2020.
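The three collectives measured above map directly onto mpi4py calls; a small sketch, with MV2_ENABLE_SHARP=1 exported in the job environment so MVAPICH2-X can offload the reductions:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
local = np.array([float(comm.Get_rank())])
result = np.empty(1)

comm.Barrier()                                   # MPI_Barrier
comm.Reduce(local, result, op=MPI.SUM, root=0)   # MPI_Reduce
comm.Allreduce(local, result, op=MPI.SUM)        # MPI_Allreduce

if comm.Get_rank() == 0:
    n = comm.Get_size()
    assert result[0] == n * (n - 1) / 2
    print("sum of ranks:", result[0])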

Support for ARM A64FX with InfiniBand (Ookami)
(Charts: MVAPICH2-X vs. OpenMPI on Ookami - inter-node latency, inter-node bidirectional bandwidth, intra-node latency, and MPI_Bcast latency on 8 nodes with 48 PPN)

Performance of Neuroscience Mini-Application with MVAPICH2-X

Comparison of Execution Time (s) on 9,920 cores:

Neighbor Selection   HPE-MPI   MVAPICH2-X   % Improvement
Random               188.07    120.22       36.07
Consecutive          124.46    116.34       6.52

• The EPFL mini-application is a benchmark where each process sends data to a selected set of neighbors
• Up to 36% benefit over HPE-MPI on the Neuroscience mini-application at 496 nodes and 20 PPN
• Available in an upcoming version of MVAPICH2-X
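The mini-application's communication pattern (each rank exchanging buffers with a set of neighbors) can be sketched with nonblocking point-to-point MPI; this consecutive-neighbor variant uses an arbitrary payload size and neighbor count:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
degree = min(4, size - 1)                              # neighbors on each side (arbitrary)
neighbors = [(rank + d) % size for d in range(1, degree + 1)] + \
            [(rank - d) % size for d in range(1, degree + 1)]

send_buf = np.full(1 << 16, rank, dtype=np.float64)    # 512 KB payload (arbitrary)
recv_bufs = [np.empty_like(send_buf) for _ in neighbors]

reqs = [comm.Irecv(buf, source=n) for n, buf in zip(neighbors, recv_bufs)]
reqs += [comm.Isend(send_buf, dest=n) for n in neighbors]
MPI.Request.Waitall(reqs)

if rank == 0:
    print("exchanged with", len(neighbors), "neighbors")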

Optimized MVAPICH2-GDR with CUDA-Aware MPI Support
(Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth, MV2-GDR 2.3 vs. MV2 without GDR)
• GPU-GPU inter-node latency as low as 1.85 us (up to 11X better than the non-GDR path)
• Up to 9x higher inter-node bandwidth and 10x higher bi-directional bandwidth
• Platform: Intel Haswell (E5-2687W @ 3.10 GHz) nodes with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct RDMA
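With a CUDA-aware build, GPU buffers go straight into MPI calls; a sketch of a two-rank GPU-to-GPU ping-pong using mpi4py with CuPy arrays (message size and iteration count are arbitrary):

import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank                                    # run with exactly two ranks, one GPU each

buf = cp.zeros(1 << 20, dtype=cp.float32)          # 4 MB message resident on the GPU
cp.cuda.runtime.deviceSynchronize()

t0 = MPI.Wtime()
for _ in range(100):
    if rank == 0:
        comm.Send(buf, dest=peer, tag=0)           # device pointer passed directly, no host staging
        comm.Recv(buf, source=peer, tag=1)
    else:
        comm.Recv(buf, source=peer, tag=0)
        comm.Send(buf, dest=peer, tag=1)
avg = (MPI.Wtime() - t0) / 100

if rank == 0:
    print(f"avg round trip for {buf.nbytes >> 20} MB: {avg * 1e6:.1f} us")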

On-the-fly GPU Compression (Benefits with AWP-ODC)
• Weak scaling of the HPC application AWP-ODC on the Lassen cluster (V100 nodes)
• MPC-OPT achieves up to +18% GPU computing FLOPS and -15% runtime per timestep
• ZFP-OPT achieves up to +35% GPU computing FLOPS and -26% runtime per timestep (rate=8, compression ratio=4)
(Charts: GPU computing FLOPS and runtime per timestep for Baseline (no compression), MPC-OPT, ZFP-OPT (rate 16), and ZFP-OPT (rate 8) on 8-512 GPUs)
Q. Zhou, C. Chu, N. Senthil Kumar, P. Kousha, M. Ghazimirsaeed, H. Subramoni, and D. K. Panda, "Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters", 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS), May 2021. Best Paper Finalist.

On-the-fly GPU Compression (Benefits with Dask)
• Data science framework Dask on the RI2 cluster (V100 nodes)
• The Dask benchmark creates a cuPy array and distributes its chunks across Dask workers
• ZFP-OPT achieves up to 1.56x higher throughput and -37% runtime (rate=8, compression ratio=4)
(Charts: aggregate throughput and execution time for Baseline (no compression), ZFP-OPT (rate 16), and ZFP-OPT (rate 8) with 2-8 Dask workers; cuPy dims: 10Kx10K, chunk size: 1K)
Q. Zhou, C. Chu, N. Senthil Kumar, P. Kousha, M. Ghazimirsaeed, H. Subramoni, and D. K. Panda, "Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters", 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS), May 2021. Best Paper Finalist.

MVAPICH2-GDR ROCm Support for AMD GPUs

(Charts: intra-node and inter-node point-to-point latency; MPI_Allreduce and MPI_Bcast on 128 GPUs (16 nodes, 8 GPUs per node))
• Corona cluster, ROCm 4.1.0, MI50 AMD GPUs
• Available with MVAPICH2-GDR 2.3.5

Exploiting BlueField-2 DPU for Offloading MPI Functions: Accelerating P3DFFT
(Charts: P3DFFT latency for MV2 with the BF-2 HCA vs. MV2-BFO with 1, 2, and 4 workers per node, for grid sizes 1024-4096 on 32 nodes at 16 and 32 PPN; offloading improves latency by 15-30%)
M. Bayatpour, N. Sarkauskas, H. Subramoni, J. M. Hashmi, and D. K. Panda, "BluesMPI: Efficient MPI Non-blocking Alltoall Offloading Designs on Modern BlueField Smart NICs", Int'l Supercomputing Conference (ISC '21), accepted to be presented.

Presentation Overview

• MVAPICH Project
  – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project
  – High-Performance Deep Learning
  – High-Performance Machine Learning
• HiBD Project
  – Accelerating Data Science Applications with Dask
• Optimizations and Deployments in Public Cloud
  – AWS and Azure
• Commercial Support and Value Added Products
• Conclusions

MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training

• ML/DL applications: TensorFlow, PyTorch, and MXNet over Horovod; PyTorch over Torch.distributed and DeepSpeed
• Communication backends: MVAPICH2 or MVAPICH2-X for CPU training, MVAPICH2-GDR for GPU training
• More details available from: http://hidl.cse.ohio-state.edu

Multiple Approaches taken up by OSU
• MPI-driven Deep Learning with Data Parallelism
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology
• Accelerating cuML Applications

Distributed TensorFlow on TACC Frontera (2,048 CPU nodes with 114,688 cores)
• Scaled TensorFlow to 2,048 nodes on Frontera using MVAPICH2
• MVAPICH2 and Intel MPI give similar performance for DNN training
• Report a peak of 260,000 images/sec on 2,048 nodes
• On 2,048 nodes, ResNet-50 can be trained in 7 minutes!
(Chart: images per second for MVAPICH2-X vs. ideal scaling on 1-2,048 nodes)
A. Jain, A. A. Awan, H. Subramoni, DK Panda, "Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera", DLS '19 (SC '19 Workshop).
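A compact sketch of the data-parallel recipe used in such runs: Horovod on top of the MPI library; the model and random data below are placeholders, not the actual benchmark:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                          # picks up the MPI job it was launched under
gpus = tf.config.list_physical_devices('GPU')
if gpus:                                            # CPU-only runs (as on Frontera) simply skip this
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None)
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

x = tf.random.uniform((32, 224, 224, 3))
y = tf.random.uniform((32,), maxval=1000, dtype=tf.int32)
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]   # sync initial weights from rank 0
model.fit(x, y, epochs=1, callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)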

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation (DGX-2)

• Optimized designs in MVAPICH2-GDR offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR 2.3.4) vs. ncclAllreduce (NCCL 2.6) on 1 DGX-2 node (16 Volta GPUs)
• Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 10.1
(Charts: Allreduce latency across message sizes; MVAPICH2-GDR is up to ~2.5X and ~4.7X better)
C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems", ICS 2020, June-July 2020.
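The measurement above corresponds to an Allreduce over a large GPU-resident buffer; a sketch with mpi4py and CuPy (buffer size chosen to match a 128 MB data point, CUDA-aware path through MVAPICH2-GDR assumed):

import cupy as cp
from mpi4py import MPI

comm = MPI.COMM_WORLD
grads = cp.random.rand(32 * 1024 * 1024, dtype=cp.float32)   # 128 MB of "gradients" on the GPU
summed = cp.empty_like(grads)

cp.cuda.runtime.deviceSynchronize()
t0 = MPI.Wtime()
comm.Allreduce(grads, summed, op=MPI.SUM)                    # MPI_Allreduce on device memory
elapsed = MPI.Wtime() - t0

if comm.Get_rank() == 0:
    print(f"{grads.nbytes >> 20} MB allreduce on {comm.Get_size()} ranks: "
          f"{elapsed * 1e3:.2f} ms (~{grads.nbytes / elapsed / 1e9:.1f} GB/s)")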

MVAPICH2-GDR: MPI_Allreduce at Scale (ORNL Summit)
• Optimized designs in MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR 2.3.4) vs. ncclAllreduce (NCCL 2.6) up to 1,536 GPUs
• Platform: dual-socket IBM POWER9 CPU, 6 NVIDIA Volta V100 GPUs, and 2-port InfiniBand EDR interconnect
(Charts: latency and bandwidth on 1,536 GPUs across message sizes, and bandwidth for 128 MB messages on 24-1,536 GPUs, comparing SpectrumMPI 10.3, OpenMPI 4.0.1, NCCL 2.6, and MVAPICH2-GDR 2.3.4; MVAPICH2-GDR is up to 1.6X-1.7X better)
C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems", ICS 2020, June-July 2020.

Distributed TensorFlow on ORNL Summit (1,536 GPUs)

• ResNet-50 training using the TensorFlow benchmark on SUMMIT -- 1,536 Volta GPUs!
• ImageNet-1k has 1,281,167 (1.2 million) images
• MVAPICH2-GDR reaches ~0.42 million images per second for ImageNet-1k!
• Time/epoch = 3 seconds
• Total time (90 epochs) = 3 x 90 = 270 seconds = 4.5 minutes!
(Chart: images per second for NCCL 2.6 vs. MVAPICH2-GDR 2.3.4 on 1-1,536 GPUs; we observed issues with NCCL2 beyond 384 GPUs)
• Platform: the Summit supercomputer (#2 on Top500.org), 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1

PyTorch at Scale: Training ResNet-50 on 256 V100 GPUs

• Training performance for 256 V100 GPUs on LLNL Lassen
  – ~10,000 images/sec faster than NCCL training!

Images/sec on 256 GPUs, by distributed framework and communication backend:
  Torch.distributed: NCCL 2.7 = 61,794; MVAPICH2-GDR = 72,120
  Horovod: NCCL 2.7 = 74,063; MVAPICH2-GDR = 84,659
  DeepSpeed: NCCL 2.7 = 80,217; MVAPICH2-GDR = 88,873
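A sketch of the Torch.distributed path from the table, using PyTorch's MPI backend (PyTorch must be built with MPI support); the model and data are placeholders, not the ResNet-50 benchmark itself:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="mpi")            # ranks come from the MPI launcher
rank = dist.get_rank()
assert torch.cuda.is_available()                  # one GPU per rank assumed
device = torch.device("cuda", rank % torch.cuda.device_count())

model = DDP(torch.nn.Linear(1024, 1024).to(device))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for _ in range(10):                               # dummy training loop
    x = torch.randn(64, 1024, device=device)
    loss = model(x).square().mean()
    opt.zero_grad()
    loss.backward()                               # gradients averaged with allreduce under the hood
    opt.step()

if rank == 0:
    print("trained on", dist.get_world_size(), "ranks")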

Multiple Approaches taken up by OSU
• MPI-driven Deep Learning with Data Parallelism
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology
• Accelerating cuML Applications

Out-of-Core Training and Hybrid Parallelism (HyPar-Flow)
• Why hybrid parallelism? Data-parallel training has limits!
• We propose HyPar-Flow
  – An easy-to-use hybrid parallel training framework
  – Hybrid = Data + Model
  – Supports Keras models and exploits TF 2.0 eager execution
  – Exploits MPI for point-to-point and collectives
• Benchmarking large models leads to better insights and the ability to develop new approaches!
A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, "HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models", ISC '20, https://arxiv.org/pdf/1911.05146.pdf

HyPar-Flow at Scale (512 nodes on TACC Frontera)
• ResNet-1001 with variable batch size
• Approach:
  – 48 model partitions for 56 cores
  – 512 model replicas for 512 nodes
  – Total cores: 56 x 512 = 28,672
• Speedup: 253X on 256 nodes, 481X on 512 Intel Xeon Cascade Lake nodes (TACC Frontera)
• Scaling efficiency: 98% up to 256 nodes, 93.9% for 512 nodes
A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, "HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models", ISC '20, https://arxiv.org/pdf/1911.05146.pdf

GEMS: GPU Enabled Memory Aware Model Parallelism Systems

• Why do we need memory-aware designs?
  – Data- and model-parallel training have limitations!
  – Maximum batch size depends on the available memory
  – Basic model parallelism suffers from under-utilization of memory and compute
  – Memory requirement increases with image size!
A. Jain, A. Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, D. Panda, R. Machiraju, A. Parwani, "GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN", SC '20
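The core idea behind memory-aware model parallelism is to place different groups of layers on different ranks and stream activations and gradients between them over MPI; a conceptual two-rank sketch with NumPy stand-ins for the layers (not the GEMS implementation):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()                 # run with 2 ranks: rank 0 holds the first half, rank 1 the second
batch, hidden = 8, 1024
rng = np.random.default_rng(0)
w = (rng.standard_normal((hidden, hidden)) * 0.01).astype(np.float32)

if rank == 0:
    x = rng.standard_normal((batch, hidden)).astype(np.float32)
    act = np.maximum(x @ w, 0.0)                    # forward through the first half
    comm.Send(act, dest=1, tag=0)                   # ship activations to the next partition
    grad = np.empty_like(act)
    comm.Recv(grad, source=1, tag=1)                # receive gradient w.r.t. the activations
else:
    act = np.empty((batch, hidden), dtype=np.float32)
    comm.Recv(act, source=0, tag=0)
    out = act @ w                                   # forward through the second half
    grad_out = 2 * out / out.size                   # gradient of a mean-squared "loss"
    comm.Send(grad_out @ w.T, dest=0, tag=1)        # backward to the previous partition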

GEMS at Scale (1,024 V100 GPUs on LLNL Lassen)
• Two approaches:
  – Memory-Aware Synchronized Training (MAST)
  – Memory-Aware Synchronized Training with Enhanced Replications (MASTER)
• Setup
  – ResNet-1k on 512 x 512 images
  – 128 replications on 1,024 GPUs
• Scaling efficiency
  – 97.32% on 1,024 V100 GPUs
(Chart: images per second for GEMS-HY Basic, GEMS-HY MAST, and GEMS-HY MASTER on 8-1,024 GPUs; higher is better)
A. Jain, A. Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, D. Panda, R. Machiraju, A. Parwani, "GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN", SC '20

SUPER: SUb-Graph Parallelism for TransformERs

• Sub-graph parallelism
  – Exploits inherent parallelism in modern DNN architectures
  – Improves the performance of multi-branch DNN architectures
  – Can be used to accelerate the training of state-of-the-art Transformer models
  – Provides better performance than data parallelism for in-core models
(Figures: a simple example of a multi-branch DNN architecture; 4-way sub-graph parallelism combined with data parallelism (D&SP))
A. Jain, T. Moon, T. Benson, H. Subramoni, S. Jacobs, D. Panda, B. Essen, "SUPER: SUb-Graph Parallelism for TransformERs", IPDPS '21, to be presented

Accelerating Transformers using SUPER

• We propose sub-graph parallelism integrated with data parallelism to accelerate the training of Transformers
• Approach
  – Data and Sub-Graph Parallelism (D&SP)
  – #-way D&SP (#: number of sub-graphs)
• Setup
  – T5-Large-Mod on the WMT dataset
  – 1,024 NVIDIA V100 GPUs
• Speedup
  – Up to 3.05X over Data Parallelism (DP) on LLNL Lassen
(Chart: time per mini-batch for DP, 2-way, 4-way, and 8-way D&SP on 4-1,024 GPUs)
A. Jain, T. Moon, T. Benson, H. Subramoni, S. Jacobs, D. Panda, B. Essen, "SUPER: SUb-Graph Parallelism for TransformERs", IPDPS '21 (accepted for presentation)

Multiple Approaches taken up by OSU
• MPI-driven Deep Learning with Data Parallelism
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology
• Accelerating cuML Applications

Digital Pathology
• Whole Slide Images (WSI)
  – Replacing the glass slide for diagnostic purposes
  – Typically 100,000 x 100,000 pixels in size
(Figures: a Hematoxylin and Eosin stained whole slide image labeled as Tall Cell Variant (TCV) of papillary thyroid cancer (PTC); a 1024x1024 tile at 10x magnification showing the histologic feature of elongated follicles arranged in parallel cords or tram tracks; a 1024x1024 tile at 20x magnification showing cellular features of tall cells)

AI-Driven Digital Pathology

• Traditionally, a pathologist reviews a slide and carries out a diagnosis based on prior knowledge and training
• Experience matters
• Can Deep Learning be used to train on thousands of slides and
  – Let the computer carry out the diagnosis
  – Narrow down the diagnosis and help a pathologist make a final decision?
• Significant benefits in
  – Reducing diagnosis time
  – Making pathologists more productive
  – Reducing health care cost

Exploiting Model Parallelism in AI-Driven Digital Pathology
• Pathology whole slide image (WSI)
  – Each WSI = 100,000 x 100,000 pixels
  – Cannot fit in a single GPU's memory
  – Tiles are extracted to make training possible
  (Courtesy: https://blog.kitware.com/digital-slide-archive-large-image-and-histomicstk-open-source-informatics-tools-for-management-visualization-and-analysis-of-digital-histopathology-data/)
• Two main problems with tiles
  – Restricted tile size because of GPU memory limitation
  – Smaller tiles lose structural information
• Reduced training time significantly
  – GEMS-Basic: 7.25 hours (1 node, 4 GPUs)
  – GEMS-MAST: 6.28 hours (1 node, 4 GPUs)
  – GEMS-MASTER: 4.21 hours (1 node, 4 GPUs)
  – GEMS-Hybrid: 0.46 hours (32 nodes, 128 GPUs)
  – Overall 15x reduction in training time!
(Chart: throughput speedup from 1x to 22x when scaling ResNet110-v2 on 1024x1024 histopathology image tiles from 4 to 128 GPUs)
A. Jain, A. Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, D. K. Panda, R. Machiraju, and A. Parwani, "GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN Training", Supercomputing (SC '20).

Multiple Approaches taken up by OSU
• MPI-driven Deep Learning with Data Parallelism
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology
• Accelerating cuML Applications

Accelerating cuML with MVAPICH2-GDR
• Utilize MVAPICH2-GDR (with mpi4py) as the communication backend during the training phase (the fit() function) for Multi-Node Multi-GPU (MNMG) settings over a cluster of GPUs
• Communication primitives:
  – Allreduce
  – Reduce
  – Broadcast
• Exploit optimized collectives
(Diagram: the cuML software stack – Python/Cython layers (Dask, UCX-Py, mpi4py) over cuML algorithms and primitives in CUDA/C/C++, with UCX, NCCL, and MVAPICH2-GDR as communication backends over the CUDA libraries)

MPI4cuML 0.1 Release

• MPI4cuML 0.1 was released in Feb '21, adding support for high-performance MPI communication to cuML
  – Can be downloaded from: http://hidl.cse.ohio-state.edu
• Features:
  – Built with Python 3.7 and CUDA 10.1, 10.2, or 11.0
  – Optimized support at the MPI level for machine learning workloads
    • Efficient large-message and small-message collectives (e.g., Allreduce and Bcast) on GPUs
    • GPU-Direct algorithms for all collective operations (e.g., Allgather and Alltoall)
    • Support for fork safety
  – Exploits efficient large-message and small-message collectives in MVAPICH2-GDR
  – Tested with
    • Mellanox InfiniBand adapters (FDR and HDR)
    • NVIDIA P100 and V100 GPUs
    • Various x86-based multi-core platforms (AMD and Intel)
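The communication pattern MPI4cuML accelerates inside fit() is essentially a broadcast of the model state plus an Allreduce of partial results; a plain-NumPy sketch of one K-Means update written that way (illustrative only, not the cuML/MPI4cuML API):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
k, dim = 8, 16
rng = np.random.default_rng(comm.Get_rank())
local_data = rng.standard_normal((10_000, dim))          # this rank's shard of the dataset

centroids = np.empty((k, dim))
if comm.Get_rank() == 0:
    centroids[:] = local_data[:k]
comm.Bcast(centroids, root=0)                            # everyone starts from the same centroids

labels = np.argmin(((local_data[:, None, :] - centroids) ** 2).sum(-1), axis=1)
sums, counts = np.zeros((k, dim)), np.zeros(k)
np.add.at(sums, labels, local_data)
np.add.at(counts, labels, 1)

comm.Allreduce(MPI.IN_PLACE, sums, op=MPI.SUM)           # combine partial sums across ranks
comm.Allreduce(MPI.IN_PLACE, counts, op=MPI.SUM)
centroids = sums / np.maximum(counts, 1)[:, None]
if comm.Get_rank() == 0:
    print("updated", k, "centroids across", comm.Get_size(), "ranks")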

Network Based Computing Laboratory HPC-AI-Switzerland (May ‘21) 39 K-Means 2000 2.0 2.7 2000 Linear Regression 1.6 NCCL 2.7 NCCL 2.7 1.8 1.4 MVAPICH2-GDR 1.6 MVAPICH2-GDR 1500 1500 1.2 Speedup 1.4 Speedup 1.2 1.0 1000 1.0 1000 0.8 0.8 Speedup Speedup 0.6 0.6 Training Time (s) Training Time (s) 500 500 0.4 0.4 0.2 0.2 0 0.0 0 0.0 1 2 4 8 16 32 1 2 4 8 16 32 Number of GPUs MPI4cuML 0.1 release Number of GPUs Nearest Neighbors (http://hidl.cse.ohio-state.edu) Truncated SVD 2500 2.0 NCCL 2.7 1500 1.6 1.8 NCCL 2.7 MVAPICH2-GDR 2000 1.6 MVAPICH2-GDR 1.4 Speedup 1.4 Speedup 1.2 1000 1500 1.2 1.0 1.0 0.8 1000 0.8 Speedup 0.6 Speedup 0.6 500 Training Time (s) Time Training 500 0.4 0.4 Training Time (s) Time Training 0.2 0.2 0 0.0 0 0.0 1 2 4 8 16 32 Number of GPUs 1 2 Number4 of GPUs8 16 32 M. Ghazimirsaeed , Q. Anthony , A. Shafi , H. Subramoni , and D. K. Panda, Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR, MLHPC Workshop, Nov 2020 Network Based Computing Laboratory HPC-AI-Switzerland (May ‘21) 40 Presentation Overview

• MVAPICH Project
  – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project
  – High-Performance Deep Learning
  – High-Performance Machine Learning
• HiBD Project
  – Accelerating Data Science Applications with Dask
• Optimizations and Deployments in Public Cloud
  – AWS and Azure
• Commercial Support and Value Added Products
• Conclusions

Dask Architecture
(Diagram: the Dask stack – high-level collections (Dask Bag, Dask Array, Dask DataFrame, Delayed, Future) are compiled into a task graph and deployed via Dask-MPI, Dask-CUDA, or Dask-Jobqueue onto the distributed scheduler, workers, and client; the communication layer offers tcp.py, ucx.py, and the MPI4Dask comm layer, built on UCX-Py and mpi4py (Cython wrappers) over TCP, UCX, and MVAPICH2-GDR, running on laptops/desktops or high performance computing hardware)

MPI4Dask Release

• MPI4Dask 0.2 was released in Mar '21, adding support for high-performance MPI communication to Dask
  – Can be downloaded from: http://hibd.cse.ohio-state.edu
• Features:
  – Based on Dask Distributed 2021.01.0
  – Compliant with user-level Dask APIs and packages
  – Support for MPI-based communication in Dask for clusters of GPUs
  – Implements point-to-point communication co-routines
  – Efficient chunking mechanism implemented for large messages
  – (NEW) Built on top of mpi4py over the MVAPICH2, MVAPICH2-X, and MVAPICH2-GDR libraries
  – (NEW) Support for MPI-based communication for CPU-based Dask applications
  – Supports starting execution of Dask programs using Dask-MPI
  – Tested with
    • (NEW) CPU-based Dask applications using numPy and Pandas data frames
    • (NEW) GPU-based Dask applications using cuPy and cuDF
    • Mellanox InfiniBand adapters (FDR and EDR)
    • Various multi-core platforms
    • NVIDIA V100 and Quadro RTX 5000 GPUs
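A sketch of how such a Dask program is started under MPI with Dask-MPI (the launch mode MPI4Dask supports); the computation mirrors Benchmark #1 below, shown here with CPU-backed chunks (the GPU runs back the array with cuPy instead):

# Launch with, e.g.: mpirun -np <workers + 2> python dask_sum_transpose.py
import dask.array as da
from dask.distributed import Client
from dask_mpi import initialize

initialize()          # rank 0 becomes the scheduler, rank 1 runs this client, the rest are workers
client = Client()

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
result = (x + x.T).sum().compute()       # the transpose forces chunk exchange through the comm layer
print("sum of array plus its transpose:", result)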

Benchmark #1: Sum of cuPy Array and its Transpose (RI2)
(Charts: total execution time and communication time for IPoIB, UCX, and MPI4Dask with 2-6 Dask workers; MPI4Dask is 3.47x better on average in total execution time and 6.92x better in communication time)
A. Shafi, J. Hashmi, H. Subramoni, and D. K. Panda, "Efficient MPI-based Communication for GPU-Accelerated Dask Applications", https://arxiv.org/abs/2101.08878 (MPI4Dask 0.2 release, http://hibd.cse.ohio-state.edu)

Benchmark #2: cuDF Merge (TACC Frontera GPU Subsystem)

(Charts: MPI4Dask is 2.91x and 2.90x better on average on the cuDF merge benchmark)
A. Shafi, J. Hashmi, H. Subramoni, and D. K. Panda, "Efficient MPI-based Communication for GPU-Accelerated Dask Applications", https://arxiv.org/abs/2101.08878 (MPI4Dask 0.2 release, http://hibd.cse.ohio-state.edu)

Presentation Overview

• MVAPICH Project
  – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project
  – High-Performance Deep Learning
  – High-Performance Machine Learning
• HiBD Project
  – Accelerating Data Science Applications with Dask
• Optimizations and Deployments in Public Cloud
  – AWS and Azure
• Commercial Support and Value Added Products
• Conclusions

MVAPICH2-Azure Deployment

• Released on 05/20/2020
• Integrated with the Azure CentOS HPC images
  – https://github.com/Azure/azhpc-images/releases/tag/centos-7.6-hpc-20200417
• MVAPICH2 2.3.3
  – CentOS images (7.6, 7.7 and 8.1)
  – Tested with multiple VM instances
• MVAPICH2-X 2.3.RC3
  – CentOS images (7.6, 7.7 and 8.1)
  – Tested with multiple VM instances
• More details from the Azure blog post
  – https://techcommunity.microsoft.com/t5/azure-compute/mvapich2-on-azure-hpc-clusters/ba-p/1404305

WRF Application Results on HBv2 (AMD Rome)
• Performance of WRF with MVAPICH2 and MVAPICH2-X-XPMEM
• WRF 3.6
  – https://github.com/hanschen/WRFV3
• Benchmark: 12 km resolution case over the Continental U.S. (CONUS) domain
  – https://www2.mmm.ucar.edu/wrf/WG2/benchv3/#_Toc212961288
• Update io_form_history in namelist.input to 102
  – https://www2.mmm.ucar.edu/wrf/users/namelist_best_prac_wrf.html#io_form_history
(Chart: WRF execution time for MVAPICH2 vs. MVAPICH2-X+XPMEM on 120-960 processes)
• MVAPICH2-X-XPMEM is able to deliver better performance and scalability

MVAPICH2-X-AWS 2.3

• Released on 09/24/2020
• Major features and enhancements
  – Based on MVAPICH2-X 2.3
  – Improved inter-node latency and bandwidth performance for large messages
  – Optimized collectives
  – Support for dynamic run-time XPMEM module detection
  – Support for currently available basic OS types on AWS EC2, including Amazon Linux 1/2, CentOS 6/7, and Ubuntu 16.04/18.04

WRF Application Results
• Performance of WRF with Open MPI 4.0.3 vs. Intel MPI 2019.7.217 vs. MVAPICH2-X-AWS v2.3
• Run on c5n.18xlarge instances with libfabric 1.10
(Chart: WRF execution time for OpenMPI, Intel MPI, and MVAPICH2-X-AWS on 36-1,152 processes; annotations show 17%-58% improvements)

Presentation Overview

• MVAPICH Project
  – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project
  – High-Performance Deep Learning
  – High-Performance Machine Learning
• HiBD Project
  – Accelerating Data Science Applications with Dask
• Optimizations and Deployments in Public Cloud
  – AWS and Azure
• Commercial Support and Value Added Products
• Conclusions

Commercial Support for MVAPICH2, HiBD, and HiDL Libraries
• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
  – Help and guidance with installation of the library
  – Platform-specific optimizations and tuning
  – Timely support for operational issues encountered with the library
  – Web portal interface to submit issues and track their progress
  – Advanced debugging techniques
  – Application-specific optimizations and tuning
  – Obtaining guidelines on best practices
  – Periodic information on major fixes and updates
  – Information on major releases
  – Help with upgrading to the latest release
  – Flexible Service Level Agreements
• Support being provided to national laboratories and international supercomputing centers

X-ScaleAI Product and Features
• Aim: a high-performance solution for distributed training of your complex AI problems on modern HPC platforms
• Features:
  – Powered by the MVAPICH2 libraries
  – Great performance and scalability as delivered by the MVAPICH2 libraries
  – Integrated packaging to run various Deep Learning frameworks (TensorFlow, PyTorch, MXNet, and others)
  – Targeted for both CPU-based and GPU-based Deep Learning training
  – Integrated profiling and introspection support for Deep Learning applications across the stacks (DeepIntrospect)
    • Provides cross-stack performance analysis in a visual manner and helps users optimize their DL applications and harness higher performance and scalability
  – Out-of-the-box optimal performance
    • Tuned for various CPU- and GPU-based HPC systems
  – One-click deployment and execution
    • No need to struggle for many hours
  – Support for x86 and OpenPOWER platforms
  – Support for InfiniBand, RoCE, and NVLink interconnects

Concluding Remarks
• Upcoming Exascale systems and Clouds need to be designed with a holistic view of HPC, Big Data, and Deep/Machine Learning
• Presented an overview of designing convergent software stacks for HPC and Deep/Machine Learning
• Presented solutions that enable the HPC and Deep/Machine Learning communities to take advantage of current and next-generation systems
• Next-generation Exascale and Zettascale systems will need continuous innovation in designing converged software architectures

Funding Acknowledgments
Funding Support by

Equipment Support by

Acknowledgments to all the Heroes (Past/Current Students and Staff)

Current Students (Graduate): Q. Anthony (Ph.D.), M. Bayatpour (Ph.D.), C.-C. Chun (Ph.D.), A. Jain (Ph.D.), K. S. Khorassani (Ph.D.), P. Kousha (Ph.D.), N. S. Kumar (M.S.), K. Luo (M.S.), B. Ramesh (Ph.D.), N. Sarkauskas (Ph.D.), S. Srivastava (M.S.), K. K. Suresh (Ph.D.), A. H. Tu (Ph.D.), S. Xu (Ph.D.), Q. Zhou (Ph.D.)
Current Research Scientists: A. Shafi, H. Subramoni
Current Software Engineers: N. Shineman
Current Research Specialist: J. Smith

Past Students: A. Augustine (M.S.), A. Awan (Ph.D.), P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), R. Biswas (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), S. Chakraborthy (Ph.D.), B. Chandrasekharan (M.S.), C.-H. Chu (Ph.D.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), J. Hashmi (Ph.D.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), M. Kedia (M.S.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), K. Kulkarni (M.S.), R. Kumar (M.S.), P. Lai (M.S.), M. Li (Ph.D.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), K. Raj (M.S.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), N. Sarkauskas (B.S.), D. Shankar (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.), J. Zhang (Ph.D.)
Past Research Scientists: K. Hamidouche, S. Sur, X. Lu
Past Programmers: D. Bureddy, J. Perkins
Past Research Specialist: M. Arnold
Past Post-Docs: D. Banerjee, X. Besseron, M. S. Ghazimeersaeed, H.-W. Jin, J. Lin, M. Luo, E. Mancini, K. Manian, S. Marcarelli, A. Ruhela, J. Vienne, H. Wang

Thank You!
[email protected]

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
Follow us on https://twitter.com/mvapich

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
