HPC and AI Middleware for Exascale Systems and Clouds

Talk at HPC-AI Advisory Council UK Conference (October ‘20) by

Dhabaleswar K. (DK) Panda, The Ohio State University, E-mail: [email protected]

Follow us on http://www.cse.ohio-state.edu/~panda and https://twitter.com/mvapich

High-End Computing (HEC): PetaFlop to ExaFlop

100 PetaFlops in 2017, 415 PetaFlops in 2020 (Fugaku in Japan with 7.3M cores), and 1 ExaFlop on the horizon. Expected to have an ExaFlop system in 2021!

Increasing Usage of HPC, Big Data and Deep/Machine Learning

HPC (MPI, PGAS, etc.), Big Data (Hadoop, Spark, HBase, Memcached, etc.), and Deep/Machine Learning (TensorFlow, PyTorch, BigDL, cuML, etc.)

Convergence of HPC, Big Data, and Deep/Machine Learning! Increasing need to run these applications on the Cloud!!

Converged Middleware for HPC, Big Data and Deep/Machine Learning?

Physical Compute


Hadoop Job | Spark Job | Deep/Machine Learning Job

Presentation Overview

• MVAPICH Project
  – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiBD Project
  – High-Performance Big Data Analytics Library
• HiDL Project
  – High-Performance Deep Learning
  – High-Performance Machine Learning
• Optimizations and Deployments in Public Cloud
  – AWS and Azure
• Conclusions

Designing (MPI+X) for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
• Scalable collective communication
  – Offloaded
  – Non-blocking (see the sketch after this list)
  – Topology-aware
• Balancing intra-node and inter-node communication for next-generation multi-/many-core nodes (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Integrated support for GPGPUs and accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for hybrid MPI+PGAS programming
  – MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI + CAF, MPI + UPC++, ...
• Virtualization
• Energy-awareness
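As one concrete illustration of the non-blocking collectives listed above, the sketch below overlaps an MPI_Iallreduce with independent computation. It is a minimal mpi4py example written for this write-up (array sizes and the overlapped work are illustrative), assuming an MPI library with non-blocking collective support such as MVAPICH2.

```python
# Minimal sketch: overlap a non-blocking Allreduce with independent computation.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.full(1_000_000, rank, dtype=np.float64)    # this rank's contribution
global_sum = np.empty_like(local)

req = comm.Iallreduce(local, global_sum, op=MPI.SUM)  # start MPI_Iallreduce, do not block

# Independent work proceeds while the runtime (possibly offloaded to the network)
# progresses the collective in the background.
independent = np.sin(local[:100_000]).sum()

req.Wait()                                            # complete before using global_sum
if rank == 0:
    print(global_sum[0], independent)
```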

Overview of the MVAPICH2 Project

• High-performance open-source MPI library
• Support for multiple interconnects
  – InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), and AWS EFA
• Support for multiple platforms
  – x86, OpenPOWER, ARM, Xeon-Phi, GPGPUs (NVIDIA and AMD (upcoming))
• Started in 2001; first open-source version demonstrated at SC ‘02
• Supports the latest MPI-3.1 standard
• Used by more than 3,100 organizations in 89 countries
• More than 900,000 (> 0.9 million) downloads directly from the OSU site (http://mvapich.cse.ohio-state.edu)
• Additional optimized versions for different systems/environments:
  – MVAPICH2-X (Advanced MPI + PGAS), since 2011
  – MVAPICH2-GDR with support for NVIDIA GPGPUs, since 2014
  – MVAPICH2-MIC with support for Intel Xeon-Phi, since 2014
  – MVAPICH2-Virt with virtualization support, since 2015
  – MVAPICH2-EA with support for Energy-Awareness, since 2015
  – MVAPICH2-Azure for Azure HPC IB instances, since 2019
  – MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
• Tools:
  – OSU MPI Micro-Benchmarks (OMB), since 2003
  – OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
• Empowering many TOP500 clusters (June ‘20 ranking):
  – 4th: Sunway TaihuLight (10,649,600 cores) at NSC, Wuxi, China
  – 8th: Frontera (448,448 cores) at TACC
  – 12th: ABCI (391,680 cores) in Japan
  – 18th: Nurion (570,020 cores) in South Korea, and many others
• Available with the software stacks of many vendors and distros (RedHat, SuSE, OpenHPC, and Spack)
• Partner in the 8th-ranked TACC Frontera system
• Empowering Top500 systems for more than 15 years

Architecture of MVAPICH2 Software Family for HPC and DL/ML

High Performance Parallel Programming Models

Message Passing Interface (MPI) | PGAS (UPC, OpenSHMEM, CAF, UPC++) | Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms:

Point-to-point Primitives | Collectives Algorithms | Job Startup | Energy-Awareness | Remote Memory Access | I/O and File Systems | Fault Tolerance | Virtualization | Active Messages | Introspection & Analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU):

Transport Protocols: RC, SRD, UD, DC, UMR, ODP | Modern Features: SR-IOV, Multi-Rail | Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM | Modern Features: Optane*, NVLink, CAPI*

* Upcoming

MVAPICH2 Software Family

Requirements and the corresponding library:

• MPI with IB, iWARP, Omni-Path, and RoCE – MVAPICH2
• Advanced MPI features/support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE – MVAPICH2-X
• MPI with IB & RoCE, GPU support, and support for Deep/Machine Learning – MVAPICH2-GDR
• HPC Cloud with MPI & IB – MVAPICH2-Virt
• Energy-aware MPI with IB, iWARP, and RoCE – MVAPICH2-EA
• MPI Energy Monitoring Tool – OEMT
• InfiniBand Network Analysis and Monitoring – OSU INAM
• Microbenchmarks for Measuring MPI and PGAS Performance – OMB

Converged Middleware for HPC, Big Data and Deep/Machine Learning

HPC (MPI, PGAS, etc.) | Big Data (Hadoop, Spark, HBase, Memcached, etc.)

Deep/Machine Learning (TensorFlow, PyTorch, BigDL, cuML, etc.)

Startup Performance on TACC Frontera

Charts: startup time (s) vs. number of processes (56 to 229,376), MVAPICH2 2.3.4 vs. Intel MPI 2020; MVAPICH2 is up to 14X and 46X faster.

• MPI_Init takes 31 seconds on 229,376 processes on 4,096 nodes
• All numbers reported with 56 processes per node

New designs available in MVAPICH2-2.3.4

Hardware Multicast-aware MPI_Bcast on TACC Frontera

Charts: MPI_Bcast latency (us) for Default vs. Multicast designs, across message sizes (2 B to 512 KB at 2K nodes, PPN=28) and across node counts (2 to 1K/2K nodes for 16 B and 32 KB messages, PPN=28); improvements of 1.8X to 2X.

• MCAST-based designs improve the latency of MPI_Bcast by up to 2X at 2,048 nodes
• Use MV2_USE_MCAST=1 to enable MCAST-based designs (a usage sketch follows)
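The broadcast itself is an ordinary MPI_Bcast; the multicast path is chosen by the runtime. A minimal mpi4py sketch of the pattern (written for this write-up; the payload size and launch line are illustrative) is:

```python
# Minimal sketch: MPI_Bcast that can take the hardware-multicast path in MVAPICH2.
# Illustrative launch: mpirun_rsh -np 2048 -hostfile hosts MV2_USE_MCAST=1 python bcast.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
buf = np.zeros(4096, dtype=np.uint8)   # 4 KB payload (illustrative size)
if comm.Get_rank() == 0:
    buf[:] = 42                        # root fills the data to broadcast
comm.Bcast(buf, root=0)                # with MV2_USE_MCAST=1 the library may use IB multicast
assert buf[0] == 42
```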

Performance of Collectives with SHARP on TACC Frontera

Charts: latency (us) of MPI_Allreduce, MPI_Reduce, and MPI_Barrier, MVAPICH2-X vs. MVAPICH2-X-SHARP, PPN=1, on up to 7,861 nodes.

• Optimized SHARP designs in MVAPICH2-X
• Up to 9X performance improvement with SHARP over the MVAPICH2-X default for 1-ppn MPI_Barrier, 6X for 1-ppn MPI_Reduce, and 5X for 1-ppn MPI_Allreduce
• Optimized runtime parameter: MV2_ENABLE_SHARP=1

B. Ramesh, K. Suresh, N. Sarkauskas, M. Bayatpour, J. Hashmi, H. Subramoni, and D. K. Panda, "Scalable MPI Collectives using SHARP: Large Scale Performance Evaluation on the TACC Frontera System," ExaMPI 2020 (Workshop on Exascale MPI), Nov 2020, accepted to be presented.

Performance of MPI_Ialltoall using HW Tag Matching

Charts: MPI_Ialltoall latency (us) for message sizes 16 KB to 1 MB on 8, 16, 32, and 64 nodes, MVAPICH2 vs. MVAPICH2+HW-TM; improvements of 1.5X to 1.8X.

• Up to 1.8X performance improvement, with sustained benefits as the system size increases

M. Bayatpour, M. Ghazimirsaeed, S. Xu, H. Subramoni, and D. K. Panda, "Design and Characterization of InfiniBand Hardware Tag Matching in MPI," CCGrid ‘20. Will be available in upcoming MVAPICH2-X.
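For reference, the communication pattern being measured is a non-blocking all-to-all that leaves room for overlap; hardware tag matching lets the HCA match incoming messages while the host computes. A minimal mpi4py sketch (sizes are illustrative, not the benchmark's configuration):

```python
# Minimal sketch: non-blocking MPI_Ialltoall overlapped with independent computation.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
count = 64 * 1024                                  # elements per destination (illustrative)

sendbuf = np.full(size * count, comm.Get_rank(), dtype=np.float32)
recvbuf = np.empty(size * count, dtype=np.float32)

req = comm.Ialltoall(sendbuf, recvbuf)             # start the exchange
local = np.square(sendbuf[:count]).sum()           # overlap useful work
req.Wait()                                         # complete before touching recvbuf
```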

Neighborhood Collectives – Performance Benefits

• SpMM: up to 34x speedup
• NAS DT: up to 15% improvement

M. Ghazimirsaeed, Q. Zhou, A. Ruhela, M. Bayatpour, H. Subramoni, and D. K. Panda, "A Hierarchical and Load-Aware Design for Large Message Neighborhood Collectives," SC ’20, accepted to be presented. Will be available in upcoming MVAPICH2-X.
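For readers unfamiliar with the MPI_Neighbor_* calls these designs target, the sketch below shows a halo exchange on a 2D Cartesian process grid using mpi4py (the grid shape and halo size are illustrative assumptions made here).

```python
# Minimal sketch: a neighborhood collective (MPI_Neighbor_alltoall) on a 2D Cartesian grid.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), 2)        # factor the ranks into a 2D grid
cart = comm.Create_cart(dims, periods=[True, True])

nneigh = 4                                         # left/right/up/down neighbors in 2D
count = 1024                                       # halo elements per neighbor (illustrative)
sendbuf = np.full(nneigh * count, cart.Get_rank(), dtype=np.float64)
recvbuf = np.empty(nneigh * count, dtype=np.float64)

cart.Neighbor_alltoall(sendbuf, recvbuf)           # exchange halos with all neighbors in one call
```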

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers

At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(The GPU-to-GPU data movement happens inside MVAPICH2.)

High Performance and High Productivity
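A concrete version of the code fragment above, using mpi4py and CuPy so the GPU buffers go straight into MPI calls. This is a sketch written for this write-up; it assumes a CUDA-aware MPI build (e.g., MVAPICH2-GDR) and an mpi4py recent enough (>= 3.1) to accept CUDA array interfaces.

```python
# Minimal sketch: CUDA-aware point-to-point transfer of GPU-resident buffers.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n = 1 << 20                                       # 1M floats (illustrative)

if rank == 0:
    s_devbuf = cp.arange(n, dtype=cp.float32)     # data lives in GPU memory
    comm.Send(s_devbuf, dest=1, tag=77)           # no staging copy to the host in user code
elif rank == 1:
    r_devbuf = cp.empty(n, dtype=cp.float32)
    comm.Recv(r_devbuf, source=0, tag=77)         # received directly into GPU memory
    cp.cuda.runtime.deviceSynchronize()
```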

Optimized MVAPICH2-GDR Design

Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth across message sizes, MV2 (no GDR) vs. MV2-GDR 2.3; 1.85 us small-message latency and roughly 9x to 11x improvements.

Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct RDMA.

Impact on HOOMD-blue, a Molecular Dynamics Application

Charts: average time steps per second (TPS) for 64K and 256K particles on 4 to 32 processes, MV2 vs. MV2+GDR; about 2X improvement.

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

Charts: normalized execution time of the Default, Callback-based, and Event-based designs on the Wilkes GPU cluster (4 to 32 GPUs) and the CSCS GPU cluster (16 to 96 GPUs).

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
• Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
• On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, "Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems," IPDPS ’16.

Presentation Overview

• MVAPICH Project
  – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiBD Project
  – High-Performance Big Data Analytics Library
• HiDL Project
  – High-Performance Deep Learning
  – High-Performance Machine Learning
• Optimizations and Deployments in Public Cloud
  – AWS and Azure
• Conclusions

Converged Middleware for HPC, Big Data and Deep/Machine Learning

HPC (MPI, PGAS, etc.) | Big Data (Hadoop, Spark, HBase, Memcached, etc.)

Deep/Machine Learning (TensorFlow, PyTorch, BigDL, cuML, etc.)

The High-Performance Big Data (HiBD) Project

• Since 2013
• RDMA for Apache Spark
• RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for …
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
  – HDFS, Memcached, HBase, and Spark micro-benchmarks
• Available for InfiniBand and RoCE; also runs on Ethernet
• Available for x86 and OpenPOWER
• Support for Singularity and Docker
• http://hibd.cse.ohio-state.edu
• User base: 335 organizations from 36 countries
• More than 37,500 downloads from the project site

Different Modes of RDMA for Apache Hadoop 2.x

• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation for better fault-tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in-memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst-buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure

RDMA-Spark on SDSC Comet – HiBench PageRank

Charts: PageRank total time (sec) for Huge, BigData, and Gigantic data sizes, IPoIB vs. RDMA, on 32 worker nodes (768 cores) and 64 worker nodes (1,536 cores); 37% and 43% improvements.

• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1,536 cores (768/1536M, 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1,536 concurrent tasks, single SSD per node
  – 32 nodes/768 cores: total time reduced by 37% over IPoIB (56 Gbps)
  – 64 nodes/1,536 cores: total time reduced by 43% over IPoIB (56 Gbps)

Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure

Presentation Overview

• MVAPICH Project
  – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiBD Project
  – High-Performance Big Data Analytics Library
• HiDL Project
  – High-Performance Deep Learning
  – High-Performance Machine Learning
• Optimizations and Deployments in Public Cloud
  – AWS and Azure
• Conclusions

Converged Middleware for HPC, Big Data and Deep Learning

HPC (MPI, PGAS, etc.) | Big Data (Hadoop, Spark, HBase, Memcached, etc.)

Deep/Machine Learning (TensorFlow, PyTorch, BigDL, cuML, etc.)

MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training

ML/DL applications (TensorFlow, PyTorch, MXNet) → Horovod → MVAPICH2 or MVAPICH2-X for CPU training, MVAPICH2-GDR for GPU training

ML/DL applications (PyTorch) → Torch.distributed or DeepSpeed → MVAPICH2 or MVAPICH2-X for CPU training, MVAPICH2-GDR for GPU training
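As a concrete illustration of the Horovod path in this stack, the sketch below shows data-parallel Keras training where Horovod is built against an MPI library such as MVAPICH2 or MVAPICH2-GDR. This is an illustrative sketch written for this write-up (the model and the commented-out fit call are placeholders), not code from the talk.

```python
# Illustrative sketch: Horovod data-parallel training on top of an MPI backend.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                            # one Horovod/MPI rank per process
gpus = tf.config.list_physical_devices('GPU')
if gpus:                                              # pin one GPU per rank for GPU training
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.applications.ResNet50(weights=None)  # placeholder model
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy')

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(train_dataset, epochs=..., callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)      # train_dataset not defined here
```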

More details available from: http://hidl.cse.ohio-state.edu

Multiple Approaches taken up by OSU

• MPI-driven Deep Learning
  – CPU-based Deep Learning
  – GPU-based Deep Learning
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology

Distributed TensorFlow on TACC Frontera (2,048 CPU nodes with 114,688 cores)

• Scaled TensorFlow to 2,048 nodes on Frontera using MVAPICH2
• MVAPICH2 and Intel MPI give similar performance for DNN training
• Reported a peak of 260,000 images/sec on 2,048 nodes
• On 2,048 nodes, ResNet-50 can be trained in 7 minutes!

A. Jain, A. A. Awan, H. Subramoni, DK Panda, “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (SC ’19 Workshop).

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation (DGX-2)

• Optimized designs in MVAPICH2-GDR offer better or comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 1 DGX-2 node (16 Volta GPUs)
• Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 10.1

Charts: Allreduce latency (us) for message sizes 8 B to 128 KB, MVAPICH2-GDR-2.3.4 vs. NCCL-2.6; up to ~2.5X and ~4.7X better.

C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni, and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems," ICS 2020, June-July 2020.

MVAPICH2-GDR: MPI_Allreduce at Scale (ORNL Summit)

• Optimized designs in MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs
• Platform: dual-socket IBM POWER9 CPU, 6 NVIDIA Volta V100 GPUs per node, 2-port InfiniBand EDR interconnect

Charts: latency on 1,536 GPUs (4 B to 16 KB messages), bandwidth on 1,536 GPUs (32 MB to 256 MB messages), and 128 MB-message bandwidth vs. number of GPUs (24 to 1,536), comparing SpectrumMPI 10.3, OpenMPI 4.0.1, NCCL 2.6, and MVAPICH2-GDR-2.3.4; MVAPICH2-GDR is 1.6X to 1.7X better.

C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni, and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems," ICS 2020, June-July 2020.

Distributed TensorFlow on ORNL Summit (1,536 GPUs)

• ResNet-50 training using the TensorFlow benchmark on SUMMIT – 1,536 Volta GPUs!
• ImageNet-1k has 1,281,167 (1.2 million) images
• MVAPICH2-GDR reaches ~0.42 million images per second for ImageNet-1k!
• Time/epoch = 3 seconds
• Total time (90 epochs) = 3 x 90 = 270 seconds = 4.5 minutes!

Chart: images per second (in thousands) on 1 to 1,536 GPUs, NCCL-2.6 vs. MVAPICH2-GDR 2.3.4. *We observed issues for NCCL2 beyond 384 GPUs.

Platform: the Summit supercomputer (#2 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1

PyTorch at Scale: Training ResNet-50 on 256 V100 GPUs

• Training performance for 256 V100 GPUs on LLNL Lassen – ~10,000 Images/sec faster than NCCL training!

Distributed framework (communication backend) – images/sec on 256 GPUs:

• Torch.distributed – NCCL: 61,794; MVAPICH2-GDR: 72,120
• Horovod – NCCL: 74,063; MVAPICH2-GDR: 84,659
• DeepSpeed – NCCL: 80,217; MVAPICH2-GDR: 88,873
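For context on the Torch.distributed column, the sketch below shows DDP training over the MPI backend, so that an MPI library such as MVAPICH2-GDR carries the gradient allreduce. It is an illustrative sketch (tiny placeholder model and data), assuming a PyTorch build with MPI support.

```python
# Illustrative sketch: PyTorch DistributedDataParallel over the MPI backend.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='mpi')        # rank/world size come from the MPI launcher
rank = dist.get_rank()
use_gpu = torch.cuda.is_available()
device = torch.device('cuda', rank % torch.cuda.device_count()) if use_gpu else torch.device('cpu')

model = DDP(torch.nn.Linear(1024, 1024).to(device))   # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(32, 1024, device=device)      # placeholder batch
loss = model(x).sum()
loss.backward()                               # DDP allreduces gradients here via MPI
opt.step()
dist.destroy_process_group()
```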

Multiple Approaches taken up by OSU

• MPI-driven Deep Learning
  – CPU-based Deep Learning
  – GPU-based Deep Learning
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology

HyPar-Flow: Model and Hybrid Parallelism for Out-of-Core Training

• Data-parallelism works only for models that fit in memory
• Out-of-core models
  – Deeper models give better accuracy but need more memory!
• Model parallelism can work for out-of-core models!
• Key challenges
  – Model partitioning is difficult for application programmers
  – Finding the right partition (grain) size is hard: cut at which layer, and why?
  – Developing a practical system for model parallelism
    • Redesign the DL framework or create additional layers?
    • Is existing communication middleware sufficient, or are extensions needed?

A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, "HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow Models," ISC ‘20, https://arxiv.org/pdf/1911.05146.pdf
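To make the partitioning idea concrete, here is a toy two-rank sketch of layer-wise model parallelism written for this write-up: each rank holds one partition of the layers and activations travel over MPI. It illustrates the concept only; it is not HyPar-Flow's API.

```python
# Toy sketch: two-way layer partitioning with activations exchanged over MPI.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
batch, hidden = 32, 256
rng = np.random.default_rng(rank)
W = rng.standard_normal((hidden, hidden)).astype(np.float32)  # this rank's layer weights

if rank == 0:
    x = rng.standard_normal((batch, hidden)).astype(np.float32)
    act = np.maximum(x @ W, 0.0)            # forward pass through the first partition
    comm.Send(act, dest=1, tag=11)          # hand activations to the next partition
elif rank == 1:
    act = np.empty((batch, hidden), dtype=np.float32)
    comm.Recv(act, source=0, tag=11)
    out = act @ W                           # forward pass through the second partition
    print("partition 2 output shape:", out.shape)
```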

Exploiting Model Parallelism in AI-Driven Digital Pathology

• Pathology whole slide image (WSI)
  – Each WSI = 100,000 x 100,000 pixels
  – Cannot fit in a single GPU's memory
  – Tiles are extracted to make training possible
• Two main problems with tiles
  – Restricted tile size because of GPU memory limitation
  – Smaller tiles lose structural information
• Can we use model parallelism to train on larger tiles to get better accuracy and diagnosis?
• Reduced training time significantly on OpenPOWER + NVIDIA V100 GPUs
  – 32 hours (1 node, 1 GPU) -> 7.25 hours (1 node, 4 GPUs) -> 27 minutes (32 nodes, 128 GPUs)

Courtesy: https://blog.kitware.com/digital-slide-archive-large-image-and-histomicstk-open-source-informatics-tools-for-management-visualization-and-analysis-of-digital-histopathology-data/

A. Jain, A. Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, D. K. Panda, R. Machiraju, and A. Parwani, "GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN Training," Supercomputing (SC ‘20), accepted to be presented.

Dask Architecture

Dask components: Dask Bag, Dask Array, Dask DataFrame, Delayed, Future

Task Graph

Dask-MPI | Dask-CUDA | Dask-Jobqueue

Distributed (Scheduler, Worker, Client)

Comm Layer: tcp.py | ucx.py | MPI4Dask

UCX-Py | mpi4py (Cython wrappers)

TCP | UCX | MVAPICH2-GDR

Laptops/Desktops | High-Performance Computing Hardware
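As an illustration of how an application sits on this stack, the sketch below runs a Dask computation over MPI ranks via Dask-MPI (launched with an MPI launcher, e.g. "mpirun -np 8 python job.py"); with the MPI4Dask communication layer described above, the inter-worker traffic can be carried by MVAPICH2-GDR. The array sizes are illustrative, and CPU arrays are used here for simplicity.

```python
# Illustrative sketch: a Dask job whose scheduler/workers are spawned over MPI ranks.
from dask_mpi import initialize
from dask.distributed import Client
import dask.array as da

initialize()                         # rank 0: scheduler, rank 1: this client, others: workers
client = Client()                    # connect to the scheduler created above

x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
result = (x + x.T).sum().compute()   # same pattern as Benchmark #1 below, on CPU arrays
print(result)
```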

Benchmark #1: Sum of cuPy Array and its Transpose

Charts: total execution time and communication time (s) for 2 to 6 Dask workers, IPoIB vs. UCX vs. MPI4Dask; MPI4Dask is 3.47x better on average in total execution time and 6.92x better in communication time.

A. Shafi, J. Hashmi, H. Subramoni, and D. K. Panda, "Blink: Towards Efficient RDMA-based Communication Coroutines for Parallel Python Applications," HiPC ’20, accepted to be presented.

Accelerating cuML with MVAPICH2-GDR

• Utilize MVAPICH2-GDR (with mpi4py) as the communication backend during the training phase (the fit() function) for multi-node multi-GPU (MNMG) settings over a cluster of GPUs

• Communication primitives:
  – Allreduce
  – Reduce
  – Broadcast
• Exploit optimized collectives

Software stack: Python (Dask, mpi4py, Cython, UCX-Py), cuML algorithms, cuML primitives, CUDA/C/C++, UCX, NCCL, MVAPICH2-GDR, CUDA libraries, CUDA
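The sketch below illustrates the kind of collective the fit() phase relies on: an Allreduce of partial K-Means statistics on GPU buffers through mpi4py. This is a hand-written illustration of the communication pattern, not cuML's internal code; it assumes a CUDA-aware MPI (e.g., MVAPICH2-GDR) and CuPy.

```python
# Illustrative sketch: Allreduce of per-rank partial K-Means statistics on GPU buffers.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
k, dim = 8, 16                                      # illustrative cluster count / feature size
partial_sums = cp.random.random((k, dim))           # per-rank partial centroid sums
partial_counts = cp.random.randint(1, 100, size=k).astype(cp.float64)

global_sums = cp.empty_like(partial_sums)
global_counts = cp.empty_like(partial_counts)
comm.Allreduce(partial_sums, global_sums, op=MPI.SUM)      # GPU buffers go straight to MPI
comm.Allreduce(partial_counts, global_counts, op=MPI.SUM)

centroids = global_sums / global_counts[:, None]    # every rank gets the updated centroids
```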

Charts: training time (s) and speedup for K-Means, Linear Regression, Nearest Neighbors, and Truncated SVD on 1 to 32 GPUs, NCCL vs. MVAPICH2-GDR.

M. Ghazimirsaeed, Q. Anthony, A. Shafi, H. Subramoni, and D. K. Panda, "Accelerating GPU-based Machine Learning in Python using MPI Library: A Case Study with MVAPICH2-GDR," MLHPC Workshop, Nov 2020, accepted to be presented.

Using HiDL Packages for Deep Learning on Existing HPC Infrastructure

Hadoop Job

Presentation Overview

• MVAPICH Project
  – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiBD Project
  – High-Performance Big Data Analytics Library
• HiDL Project
  – High-Performance Deep Learning
• Optimizations and Deployments in Public Cloud
  – AWS and Azure
• Conclusions

MVAPICH2-Azure Deployment

• Released on 05/20/2020
• Integrated Azure CentOS HPC Images
  – https://github.com/Azure/azhpc-images/releases/tag/centos-7.6-hpc-20200417
• MVAPICH2 2.3.3
  – CentOS images (7.6, 7.7, and 8.1)
  – Tested with multiple VM instances
• MVAPICH2-X 2.3.RC3
  – CentOS images (7.6, 7.7, and 8.1)
  – Tested with multiple VM instances
• More details from the Azure blog post
  – https://techcommunity.microsoft.com/t5/azure-compute/mvapich2-on-azure-hpc-clusters/ba-p/1404305

WRF Application Results on HBv2 (AMD Rome)

• Performance of WRF with MVAPICH2 and MVAPICH2-X-XPMEM

Chart: WRF execution time (s) on 120 to 960 processes, MVAPICH2 vs. MVAPICH2-X+XPMEM.

• WRF 3.6 (https://github.com/hanschen/WRFV3)
• Benchmark: 12-km resolution case over the Continental U.S. (CONUS) domain (https://www2.mmm.ucar.edu/wrf/WG2/benchv3/#_Toc212961288)
• Update io_form_history in namelist.input to 102 (https://www2.mmm.ucar.edu/wrf/users/namelist_best_prac_wrf.html#io_form_history)

MVAPICH2-X-XPMEM is able to deliver better performance and scalability

MVAPICH2-X-AWS 2.3

• Released on 09/24/2020
• Major features and enhancements
  – Based on MVAPICH2-X 2.3
  – Improved inter-node latency and bandwidth performance for large messages
  – Optimized collectives
  – Support for dynamic run-time XPMEM module detection
  – Support for currently available basic OS types on AWS EC2, including Amazon Linux 1/2, CentOS 6/7, and Ubuntu 16.04/18.04

WRF Application Results

• Performance of WRF with Open MPI 4.0.3 vs. Intel MPI 2019.7.217 vs. MVAPICH2-X-AWS v2.3

WRF Execution Time

Chart: WRF execution time (s) on 36 to 1,152 processes, OpenMPI vs. Intel MPI vs. MVAPICH2-X-AWS, with improvements of 17% to 58% at the larger process counts.

Commercial Support for MVAPICH2, HiBD, and HiDL Libraries

• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
  – Help and guidance with installation of the library
  – Platform-specific optimizations and tuning
  – Timely support for operational issues encountered with the library
  – Web portal interface to submit issues and track their progress
  – Advanced debugging techniques
  – Application-specific optimizations and tuning
  – Guidelines on best practices
  – Periodic information on major fixes and updates
  – Information on major releases
  – Help with upgrading to the latest release
  – Flexible Service Level Agreements
• Support being provided to national laboratories and international supercomputing centers

X-ScaleHPC Package

• Scalable communication-middleware solutions based on the OSU MVAPICH2 libraries
• "Out-of-the-box" fine-tuned, optimal performance on various HPC systems, including CPUs and GPUs
• MPI communication offloading capabilities for ARM-based smart NICs (such as Mellanox BlueField NICs)

Please refer to the presentation made at the 8th Annual MVAPICH User Group Meeting (MUG) last week: http://mug.mvapich.cse.ohio-state.edu/program/

X-ScaleAI Product and Features

• Aim: a high-performance solution for distributed training of complex AI problems on modern HPC platforms
• Features:
  – Powered by MVAPICH2 libraries
  – Great performance and scalability as delivered by the MVAPICH2 libraries
  – Integrated packaging to run various Deep Learning frameworks (TensorFlow, PyTorch, MXNet, and others)
  – Targeted at both CPU-based and GPU-based Deep Learning training
  – Integrated profiling and introspection support for Deep Learning applications across the stacks (DeepIntrospect)
    • Provides cross-stack performance analysis in a visual manner and helps users optimize their DL applications for higher performance and scalability
  – Out-of-the-box optimal performance, tuned for various CPU- and GPU-based HPC systems
  – One-click deployment and execution (no need to struggle for many hours)
  – Support for x86 and OpenPOWER platforms
  – Support for InfiniBand, RoCE, and NVLink interconnects

Concluding Remarks

• Upcoming Exascale systems and clouds need to be designed with a holistic view of HPC, Big Data, and Deep/Machine Learning
• Presented an overview of designing convergent software stacks for HPC, Big Data, and Deep/Machine Learning
• Presented solutions that enable the HPC, Big Data, and Deep/Machine Learning communities to take advantage of current and next-generation systems
• Next-generation Exascale and Zettascale systems will need continuous innovation in designing converged software architectures

Funding Acknowledgments

Funding Support by

Equipment Support by

Acknowledgments to all the Heroes (Past/Current Students and Staffs)

Current Research Scientists Current Post-docs Current Students (Graduate) – N. Sarkauskas (Ph.D.) – A. Shafi – M. S. Ghazimeersaeed – Q. Anthony (Ph.D.) – K. S. Khorassani (Ph.D.) – S. Srivastava (M.S.) – H. Subramoni – M. Bayatpour (Ph.D.) – P. Kousha (Ph.D.) – S. Xu (Ph.D.) Current Senior Research Associate Current Research Specialist – C.-C. Chun (Ph.D.) – N. S. Kumar (M.S.) – Q. Zhou (Ph.D.) – J. Hashmi – J. Smith – A. Jain (Ph.D.) – B. Ramesh (Ph.D.) – M. Kedia (M.S.) – K. K. Suresh (Ph.D.) Current Software Engineer – A. Reifsteck Past Students – A. Awan (Ph.D.) – T. Gangadharappa (M.S.) – P. Lai (M.S.) – R. Rajachandrasekar (Ph.D.) Past Research Scientists – A. Augustine (M.S.) – K. Gopalakrishnan (M.S.) – J. Liu (Ph.D.) – D. Shankar (Ph.D.) – P. Balaji (Ph.D.) – J. Hashmi (Ph.D.) – M. Luo (Ph.D.) – G. Santhanaraman (Ph.D.) – K. Hamidouche – N. Sarkauskas (B.S.) – R. Biswas (M.S.) – W. Huang (Ph.D.) – A. Mamidala (Ph.D.) – S. Sur – A. Singh (Ph.D.) – S. Bhagvat (M.S.) – W. Jiang (M.S.) – G. Marsh (M.S.) – X. Lu – J. Sridhar (M.S.) – A. Bhat (M.S.) – J. Jose (Ph.D.) – V. Meshram (M.S.) Past Programmers – S. Kini (M.S.) – A. Moody (M.S.) – S. Sur (Ph.D.) – D. Buntinas (Ph.D.) – D. Bureddy – H. Subramoni (Ph.D.) – L. Chai (Ph.D.) – M. Koop (Ph.D.) – S. Naravula (Ph.D.) – K. Vaidyanathan (Ph.D.) – J. Perkins – B. Chandrasekharan (M.S.) – K. Kulkarni (M.S.) – R. Noronha (Ph.D.) – A. Vishnu (Ph.D.) – S. Chakraborthy (Ph.D.) – R. Kumar (M.S.) – X. Ouyang (Ph.D.) Past Research Specialist – J. Wu (Ph.D.) – N. Dandapanthula (M.S.) – S. Krishnamoorthy (M.S.) – S. Pai (M.S.) – M. Arnold – W. Yu (Ph.D.) – V. Dhanraj (M.S.) – K. Kandalla (Ph.D.) – S. Potluri (Ph.D.) – J. Zhang (Ph.D.) – C.-H. Chu (Ph.D.) – M. Li (Ph.D.) – K. Raj (M.S.) Past Post-Docs – D. Banerjee – J. Lin – K. Manian – J. Vienne – X. Besseron – M. Luo – S. Marcarelli – H. Wang – H.-W. Jin – E. Mancini – A. Ruhela

Thank You! [email protected]

Follow us on https://twitter.com/mvapich

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
