Middleware for Message Passing Interface (MPI) on OpenPOWER Platforms

OpenPOWER ADG Workshop (Nov ’20) by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Follow us on https://twitter.com/mvapich

High-End Computing (HEC): PetaFlop to ExaFlop

100 PetaFlops in 2017
415 PetaFlops in 2020 (Fugaku in Japan with 7.3M cores)
1 ExaFlops: expected to have an ExaFlop system in 2021!

Increasing Usage of HPC, Big Data and Deep/Machine Learning

Convergence of HPC, Big Data, and Deep/Machine Learning!
– HPC (MPI, PGAS, etc.)
– Big Data (Hadoop, Spark, HBase, Memcached, etc.)
– Deep/Machine Learning (TensorFlow, PyTorch, BigDL, cuML, etc.)
Increasing need to run these applications on the Cloud!!

Presentation Overview

• MVAPICH Project – MPI and PGAS (MVAPICH) with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• Conclusions

Overview of the MVAPICH2 Project

• High Performance open-source MPI Library
• Support for multiple interconnects – InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), and AWS EFA
• Support for multiple platforms – x86, OpenPOWER, ARM, Xeon-Phi, and GPGPUs (NVIDIA and AMD (upcoming))
• Started in 2001; first open-source version demonstrated at SC ‘02
• Supports the latest MPI-3.1 standard
• Used by more than 3,100 organizations in 89 countries
• More than 1 million downloads directly from the OSU site: http://mvapich.cse.ohio-state.edu
• Additional optimized versions for different systems/environments:
– MVAPICH2-X (Advanced MPI + PGAS), since 2011
– MVAPICH2-GDR with support for NVIDIA GPGPUs, since 2014
– MVAPICH2-MIC with support for Intel Xeon-Phi, since 2014
– MVAPICH2-Virt with virtualization support, since 2015
– MVAPICH2-EA with support for Energy-Awareness, since 2015
– MVAPICH2-Azure for Azure HPC IB instances, since 2019
– MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
• Tools:
– OSU MPI Micro-Benchmarks (OMB), since 2003
– OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
• Empowering many TOP500 clusters (June ‘20 ranking):
– 4th: 10,649,600-core Sunway TaihuLight at NSC, Wuxi, China
– 8th: 448,448-core Frontera at TACC
– 12th: 391,680-core ABCI in Japan
– 18th: 570,020-core Nurion in South Korea, and many others
• Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
• Partner in the 8th-ranked TACC Frontera system
• Empowering Top500 systems for more than 15 years

Architecture of MVAPICH2 Software Family for HPC and DL/ML

High Performance Parallel Programming Models:
– Message Passing Interface (MPI)
– PGAS (UPC, OpenSHMEM, CAF, UPC++)
– Hybrid MPI + X (MPI + PGAS + OpenMP)

High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms:
– Point-to-point Primitives, Collectives Algorithms, Remote Memory Access, Job Startup, Energy-Awareness, Fault Tolerance, Active Messages, I/O and File Systems, Virtualization, Introspection & Analysis

Support for Modern Networking Technologies (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter):
– Transport Protocols: RC, SRD, UD, DC, UMR, ODP
– Modern Features: SR-IOV, Multi-Rail

Support for Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU):
– Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM
– Modern Features: Optane*, NVLink, CAPI*

* Upcoming

MVAPICH2 Software Family

Requirements → Library
– MPI with IB, iWARP, Omni-Path, and RoCE → MVAPICH2
– Advanced MPI features/support, OSU INAM, PGAS, and MPI+PGAS with IB, Omni-Path, and RoCE → MVAPICH2-X
– MPI with IB, RoCE & GPU, and support for Deep/Machine Learning → MVAPICH2-GDR
– HPC Cloud with MPI & IB → MVAPICH2-Virt
– Energy-aware MPI with IB, iWARP, and RoCE → MVAPICH2-EA
– MPI energy monitoring tool → OEMT
– InfiniBand network analysis and monitoring → OSU INAM
– Microbenchmarks for measuring MPI and PGAS performance → OMB

Converged Middleware for HPC, Big Data and Deep/Machine Learning

– HPC (MPI, PGAS, etc.)
– Big Data (Hadoop, Spark, HBase, Memcached, etc.)
– Deep/Machine Learning (TensorFlow, PyTorch, BigDL, cuML, etc.)

Intra-node Point-to-Point Performance on OpenPOWER

[Charts: Intra-Socket Small Message Latency, Large Message Latency, Bandwidth, and Bi-directional Bandwidth; MVAPICH2 2.3.5 vs. Spectrum MPI 10.3.1.00]
– Intra-socket small-message latency: as low as 0.2 us with MVAPICH2 2.3.5
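For context, these numbers come from OSU Micro-Benchmark (OMB) style ping-pong tests between two MPI ranks. Below is a minimal, hypothetical sketch of such a measurement (the actual osu_latency benchmark adds warm-up iterations, a message-size sweep, and more careful timing):

    /* Minimal MPI ping-pong latency sketch in the spirit of OMB's osu_latency.
     * Simplified: fixed 8-byte messages, no warm-up phase. Run with exactly 2 ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        char buf[8];
        int rank, iters = 10000;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)   /* one-way latency = half the average round-trip time */
            printf("latency: %.2f us\n", (MPI_Wtime() - start) / iters / 2.0 * 1e6);
        MPI_Finalize();
        return 0;
    }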

Platform: two OpenPOWER (POWER9, ppc64le) nodes with Mellanox EDR (MT4121) HCAs; MVAPICH2 2.3.5 (upcoming)

Inter-node Point-to-Point Performance on OpenPOWER

[Charts: Inter-node Small Message Latency, Large Message Latency, Bandwidth, and Bi-directional Bandwidth; MVAPICH2 2.3.5 vs. Spectrum MPI 10.3.1.00]
– Peak inter-node bandwidth: 24,678 MB/s; bi-directional bandwidth: 49,249 MB/s (MVAPICH2 2.3.5)
Platform: two OpenPOWER (POWER9, ppc64le) nodes with Mellanox EDR (MT4121) HCAs

Optimized MVAPICH2 All-Reduce and Reduce on OpenPOWER

[Charts: MPI_Allreduce and MPI_Reduce latency for small and large messages, Nodes=1, PPN=40; MVAPICH2 2.3.5 vs. Spectrum MPI 10.3.1.03]
– MPI_Allreduce: up to 3X lower latency than Spectrum MPI
– MPI_Reduce: up to 2X lower latency than Spectrum MPI
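The operations being timed here are the standard MPI collectives; a minimal sketch of the calls follows (buffer size and datatype are illustrative, not taken from the benchmark):

    /* Sketch of the two collectives compared above. The 1,024-element double
     * buffer is illustrative; the plots sweep message sizes from 4 bytes to 1 MB. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        double sendbuf[1024], recvbuf[1024];
        MPI_Init(&argc, &argv);
        for (int i = 0; i < 1024; i++) sendbuf[i] = 1.0;
        /* MPI_Allreduce: every rank contributes and every rank receives the element-wise sum */
        MPI_Allreduce(sendbuf, recvbuf, 1024, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        /* MPI_Reduce: only the root (rank 0) receives the reduced result */
        MPI_Reduce(sendbuf, recvbuf, 1024, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }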

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers

At the sender:
    MPI_Send(s_devbuf, size, …);
At the receiver:
    MPI_Recv(r_devbuf, size, …);
(the GPU data movement is handled inside MVAPICH2)
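A slightly fuller, self-contained sketch of this usage model is shown below. It assumes a CUDA-aware MPI build such as MVAPICH2-GDR and two ranks, each driving a GPU; the buffer size is illustrative and error handling is omitted:

    /* CUDA-aware MPI: device pointers are passed directly to MPI calls, and the
     * library (e.g., MVAPICH2-GDR) moves the GPU data internally. Error checks omitted. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        int rank;
        const int size = 4 * 1024 * 1024;        /* 4 MB payload (illustrative) */
        void *devbuf;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc(&devbuf, size);               /* GPU memory; no host staging buffer needed */
        if (rank == 0)
            MPI_Send(devbuf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);                    /* s_devbuf */
        else if (rank == 1)
            MPI_Recv(devbuf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* r_devbuf */
        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }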

High Performance and High Productivity

Optimized MVAPICH2-GDR Design

[Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth; MV2 (no GDR) vs. MV2-GDR 2.3]
– Inter-node GPU-GPU latency: 1.85 us (up to 11X better than without GPUDirect RDMA)
– Inter-node GPU-GPU bandwidth: up to 9x better; bi-directional bandwidth: up to 10x better
Platform: Intel Haswell (E5-2687W @ 3.10 GHz, 20 cores) node, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA

D-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)

[Charts: intra-node D-to-D latency (small and large messages) and bandwidth]
– Intra-node latency: 0.76 us (with GDRCopy)
– Intra-node bandwidth: 65.48 GB/sec for 4 MB messages (via NVLink2)
[Charts: inter-node D-to-D latency (small and large messages) and bandwidth]
– Inter-node latency: 2.18 us (with GDRCopy 2.0)
– Inter-node bandwidth: 23 GB/sec for 4 MB messages (via 2-port EDR)
Platform: OpenPOWER (POWER9, ppc64le) nodes, each with a dual-socket CPU, 4 Volta V100 GPUs, and 2-port EDR InfiniBand interconnect

Impact on HOOMD-blue, a Molecular Dynamics Application

[Charts: average time steps per second (TPS) for 64K and 256K particles on 4–32 processes; MV2 vs. MV2+GDR]
– Up to 2X higher TPS with MV2+GDR for both problem sizes

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

[Charts: normalized execution time with Default, Callback-based, and Event-based designs on the Wilkes GPU cluster (4–32 GPUs) and the CSCS GPU cluster (16–96 GPUs)]
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the Cosmo application.
C.-H. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, "Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems," IPDPS ’16.

Presentation Overview

• MVAPICH Project – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• Conclusions

Scale-up and Scale-out

• Scale-up: Intra-node Communication
– Many improvements, e.g.:
• NVIDIA cuDNN, cuBLAS, NCCL, etc.
• CUDA Co-operative Groups
• Scale-out: Inter-node Communication
– DL frameworks – most are optimized for single-node only
– Distributed (Parallel) Training is an emerging trend
• PyTorch – MPI/NCCL2
• TensorFlow – gRPC-based/MPI/NCCL2
• OSU-Caffe – MPI-based
[Diagram: desired scale-up vs. scale-out performance, positioning NCCL2, cuDNN, MKL-DNN, MPI, gRPC, and Hadoop]

Broad Challenge: Exploiting HPC for DL and ML

How to efficiently scale up and scale out Deep Learning (DL) frameworks and take advantage of heterogeneous High Performance Computing (HPC) resources?

MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training

Stack 1: ML/DL Applications → TensorFlow, PyTorch, MXNet → Horovod → MVAPICH2 or MVAPICH2-X for CPU training / MVAPICH2-GDR for GPU training
Stack 2: ML/DL Applications → PyTorch → torch.distributed or DeepSpeed → MVAPICH2 or MVAPICH2-X for CPU training / MVAPICH2-GDR for GPU training (the allreduce underneath is sketched below)
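Underneath Horovod, torch.distributed, and DeepSpeed, the core data-parallel operation is an allreduce over each gradient buffer, which these stacks can hand to MVAPICH2 or MVAPICH2-GDR. A minimal, hypothetical sketch of that operation (the function name and buffer layout are illustrative, not framework code):

    /* Core data-parallel step: sum each gradient buffer across all ranks.
     * With a CUDA-aware MPI such as MVAPICH2-GDR, d_grad may be a device (GPU)
     * pointer; with plain MVAPICH2/MVAPICH2-X it would be a host buffer. */
    #include <mpi.h>

    void allreduce_gradients(float *d_grad, int count, MPI_Comm comm) {
        /* element-wise sum across ranks, written back in place; frameworks then
         * typically scale by 1/num_ranks (on the GPU) to obtain the average */
        MPI_Allreduce(MPI_IN_PLACE, d_grad, count, MPI_FLOAT, MPI_SUM, comm);
    }

In practice a framework issues one such call, or a fused/bucketed variant, per gradient tensor in every training iteration, which is why the MPI_Allreduce results shown later matter directly for training throughput.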

More details available from: http://hidl.cse.ohio-state.edu

Accelerating DL and ML Applications with MVAPICH2 MPI Libraries

• MPI-driven Deep Learning
– CPU-based Deep Learning
– GPU-based Deep Learning
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology
• Commercial Support and Products

Distributed TensorFlow on TACC Frontera (2,048 CPU nodes)

• Scaled TensorFlow to 2,048 nodes on Frontera using MVAPICH2 and IntelMPI

• MVAPICH2 and IntelMPI give similar performance for DNN training

• Report a peak of 260,000 images/sec on 2,048 nodes

• On 2048 nodes, ResNet-50 can be trained in 7 minutes!

A. Jain, A. A. Awan, H. Subramoni, DK Panda, “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (SC ’19 Workshop).

MVAPICH2-GDR: MPI_Allreduce at Scale (ORNL Summit)

• Optimized designs in MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on up to 1,536 GPUs
[Charts: latency and bandwidth on 1,536 GPUs, and 128 MB message bandwidth vs. GPU count; MVAPICH2-GDR 2.3.4 vs. NCCL 2.6, Spectrum MPI 10.3, and Open MPI 4.0.1]
– 1.6X to 1.7X better than NCCL 2.6 across the three panels
Platform: dual-socket IBM POWER9 CPUs, 6 NVIDIA Volta V100 GPUs per node, and 2-port InfiniBand EDR interconnect
C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni, and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems," ICS 2020.

Distributed TensorFlow on ORNL Summit (1,536 GPUs)

• ResNet-50 training using the TensorFlow benchmark on Summit – 1,536 Volta GPUs!
• ImageNet-1k has 1,281,167 (1.2 million) images
• MVAPICH2-GDR reaches ~0.42 million images per second for ImageNet-1k
• Time/epoch = 3 seconds
• Total time (90 epochs) = 3 x 90 = 270 seconds = 4.5 minutes!
[Chart: images per second (thousands) vs. number of GPUs (1–1,536); NCCL 2.6 vs. MVAPICH2-GDR 2.3.4]
*We observed issues for NCCL2 beyond 384 GPUs
Platform: The Summit supercomputer (#2 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1

Scaling PyTorch on ORNL Summit using MVAPICH2-GDR

• ResNet-50 training using PyTorch + Horovod on Summit
– Synthetic ImageNet dataset
– Up to 256 nodes, 1,536 GPUs
• MVAPICH2-GDR can outperform NCCL2
– Up to 30% higher throughput
• CUDA 10.1, cuDNN 7.6.5, PyTorch v1.5.0, Horovod v0.19.1
[Chart: images/sec (thousands, higher is better) vs. number of GPUs; NCCL 2.6 vs. MVAPICH2-GDR 2.3.4]
C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni, and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems," ICS 2020.
Platform: The Summit supercomputer (#2 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1

Scaling PyTorch with Horovod

• ResNet-50 training using PyTorch + Horovod on Lassen
– Synthetic ImageNet dataset
– Up to 64 nodes, 256 GPUs
• MVAPICH2-GDR can outperform NCCL2
– Up to 15% higher throughput
• CUDA 10.2, cuDNN 7.6.5, PyTorch v1.5.0, Horovod v0.19.4
[Chart: images/sec (thousands) vs. number of GPUs (1–256); NCCL 2.7.8-1 vs. MVAPICH2-GDR 2.3.4]
Platform: The Lassen supercomputer (#14 on Top500.org) – 4 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.2

Scaling PyTorch with DeepSpeed

• ResNet-50 training using PyTorch + DeepSpeed on Lassen
– Synthetic ImageNet dataset
– Up to 64 nodes, 256 GPUs
• MVAPICH2-GDR can outperform NCCL2
– Up to 10% higher throughput
• CUDA 10.2, cuDNN 7.6.5, PyTorch v1.5.0, DeepSpeed v0.2.0
[Chart: images/sec (thousands) vs. number of GPUs (1–256); NCCL 2.7.8-1 vs. MVAPICH2-GDR 2.3.4]
Platform: The Lassen supercomputer (#14 on Top500.org) – 4 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.2

Scaling PyTorch with torch.distributed

• ResNet-50 training using PyTorch + torch.distributed on Lassen
– Synthetic ImageNet dataset
– Up to 64 nodes, 256 GPUs
• MVAPICH2-GDR can outperform NCCL2
– Up to 16% higher throughput
• CUDA 10.2, cuDNN 7.6.5, PyTorch v1.5.0
[Chart: images/sec (thousands) vs. number of GPUs (1–256); NCCL 2.7.8-1 vs. MVAPICH2-GDR 2.3.4]
Platform: The Lassen supercomputer (#14 on Top500.org) – 4 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.2

PyTorch at Scale: Training ResNet-50 on 256 V100 GPUs

• Training performance for 256 V100 GPUs on LLNL Lassen
– ~10,000 images/sec faster than NCCL-based training!

Images/sec on 256 GPUs by distributed framework (NCCL vs. MVAPICH2-GDR communication backend):
– torch.distributed: 61,794 (NCCL) vs. 72,120 (MVAPICH2-GDR)
– Horovod: 74,063 (NCCL) vs. 84,659 (MVAPICH2-GDR)
– DeepSpeed: 80,217 (NCCL) vs. 88,873 (MVAPICH2-GDR)

Accelerating DL and ML Applications with MVAPICH2 MPI Libraries

• MPI-driven Deep Learning
– CPU-based Deep Learning
– GPU-based Deep Learning
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology
• Commercial Support and Products

HyPar-Flow: Hybrid and Model Parallelism for CPUs

• Data-Parallelism – only for models that fit in memory
• Out-of-core models
– Deeper model → better accuracy, but more memory required!
• Model parallelism can work for out-of-core models!
• Key Challenges (see the simplified exchange sketch below)
– Model partitioning is difficult for the application
– Finding the right partition (grain) size is hard – cut at which layer, and why?
– Developing a practical system for model-parallelism
• Redesign the DL framework or create additional layers?
• Existing communication middleware, or extensions needed?
A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, "HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models", ISC ‘20, https://arxiv.org/pdf/1911.05146.pdf
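At the MPI level, model parallelism of this kind reduces to exchanging activations and gradients between ranks that hold adjacent model partitions. The following is a simplified, hypothetical sketch of that exchange (not HyPar-Flow's actual implementation; buffer names and the blocking send/receive pattern are illustrative):

    /* Simplified model-parallel exchange between adjacent partitions (illustrative
     * only). Rank r holds partition r of the model: it receives input activations
     * from partition r-1, sends its output activations to partition r+1, and
     * gradients flow in the opposite direction. */
    #include <mpi.h>

    void partition_exchange(const float *act_out, float *act_in,
                            const float *grad_out, float *grad_in,
                            int count, int rank, int nranks, MPI_Comm comm) {
        /* forward direction: activations move to the next partition */
        if (rank + 1 < nranks)
            MPI_Send(act_out, count, MPI_FLOAT, rank + 1, 0, comm);
        if (rank > 0)
            MPI_Recv(act_in, count, MPI_FLOAT, rank - 1, 0, comm, MPI_STATUS_IGNORE);
        /* backward direction: gradients move to the previous partition */
        if (rank > 0)
            MPI_Send(grad_out, count, MPI_FLOAT, rank - 1, 1, comm);
        if (rank + 1 < nranks)
            MPI_Recv(grad_in, count, MPI_FLOAT, rank + 1, 1, comm, MPI_STATUS_IGNORE);
    }

In practice, non-blocking sends and receives would be used so this exchange can overlap with computation on each partition.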

Out-of-core Training with HyPar-Flow (512 nodes on TACC Frontera)

• ResNet-1001 with variable batch size
[Chart: speedup up to 481x on 512 Intel Xeon Skylake nodes (TACC Frontera)]
• Approach:
– 48 model-partitions for 56 cores
– 512 model-replicas for 512 nodes
– Total cores: 48 x 512 = 24,576
• Speedup
– 253X on 256 nodes
– 481X on 512 nodes
• Scaling Efficiency
– 98% up to 256 nodes
– 93.9% for 512 nodes

A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ‘20, https://arxiv.org/pdf/1911.05146.pdf

Exploiting Model Parallelism in AI-Driven Digital Pathology

• Pathology whole slide image (WSI)
– Each WSI = 100,000 x 100,000 pixels
– Cannot fit in a single GPU's memory
– Tiles are extracted to make training possible
• Two main problems with tiles
– Restricted tile size because of GPU memory limitations
– Smaller tiles lose structural information
• Can we use model parallelism to train on larger tiles to get better accuracy and diagnosis?
• Reduced training time significantly on OpenPOWER + NVIDIA V100 GPUs
– 32 hours (1 node, 1 GPU) -> 7.25 hours (1 node, 4 GPUs) -> 27 mins (32 nodes, 128 GPUs)
Image courtesy: https://blog.kitware.com/digital-slide-archive-large-image-and-histomicstk-open-source-informatics-tools-for-management-visualization-and-analysis-of-digital-histopathology-data/
A. Jain, A. Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, D. K. Panda, R. Machiraju, and A. Parwani, "GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN Training", Supercomputing (SC ‘20), accepted to be presented.

Accelerating DL and ML Applications with MVAPICH2 MPI Libraries

• MPI-driven Deep Learning
– CPU-based Deep Learning
– GPU-based Deep Learning
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology
• Commercial Support and Products

Commercial Support for MVAPICH2, HiBD, and HiDL Libraries

• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
– Help and guidance with installation of the library
– Platform-specific optimizations and tuning
– Timely support for operational issues encountered with the library
– Web portal interface to submit issues and track their progress
– Advanced debugging techniques
– Application-specific optimizations and tuning
– Obtaining guidelines on best practices
– Periodic information on major fixes and updates
– Information on major releases
– Help with upgrading to the latest release
– Flexible Service Level Agreements
• Support being provided to National Laboratories and International Supercomputing Centers

Silver ISV Member for the OpenPOWER Consortium

• Silver ISV member of the OpenPOWER Consortium
• Provides flexibility:
– To have the MVAPICH2, HiDL, and HiBD libraries integrated into the OpenPOWER software stack
– To be a part of the OpenPOWER ecosystem
– To participate with different vendors in bidding, installation, and deployment
• Introduced two new integrated products with support for OpenPOWER systems (presented at the 2019 OpenPOWER North America Summit)
– X-ScaleHPC
– X-ScaleAI

X-ScaleHPC Package

• Scalable communication middleware solutions based on the OSU MVAPICH2 libraries
• "Out-of-the-box" fine-tuned, optimal performance on various HPC systems, including CPUs and GPUs
• MPI communication offloading capabilities for ARM-based smart NICs (such as Mellanox BlueField NICs)

Please refer to the presentation made at the 8th annual MVAPICH User Group Meeting (MUG) in Aug ‘20: http://mug.mvapich.cse.ohio-state.edu/program/

X-ScaleAI Product and Features

• Aim: a high-performance solution for distributed training of your complex AI problems on modern HPC platforms
• Features:
– Powered by the MVAPICH2 libraries
– Great performance, as delivered by the MVAPICH2 libraries
– Integrated packaging to run various Deep Learning frameworks (TensorFlow, PyTorch, MXNet, and others)
– Targeted for both CPU-based and GPU-based Deep Learning training
– Integrated profiling and introspection support for Deep Learning applications across the stacks (DeepIntrospect)
• Provides cross-stack performance analysis in a visual manner and helps users optimize their DL applications to harness higher performance and scalability
– Out-of-the-box optimal performance
• Tuned for various CPU- and GPU-based HPC systems
– One-click deployment and execution
• No need to struggle for many hours
– Support for x86 and OpenPOWER platforms
– Support for InfiniBand, RoCE, and NVLink interconnects

X-ScaleAI Product with DeepIntrospect (DI) Capability

More details in the MUG ‘20 talk. Contact us for a demo and free trial! ([email protected])

Conclusions

• HPC and DL applications require high performance and scalability
• This requires high-performance middleware designs that exploit modern interconnects
• Presented a set of solutions to achieve this:
– MPI (MVAPICH2)-driven solution for HPC applications
– MPI (MVAPICH2)-driven solution for DL applications with the TensorFlow and PyTorch frameworks
– Out-of-core training and hybrid parallelism
– Commercial support and products with Deep Introspection capability
• Will continue to enable the HPC and DL communities to achieve scalability and high performance on OpenPOWER platforms

Funding Acknowledgments

Funding Support by

Equipment Support by

Acknowledgments to all the Heroes (Past/Current Students and Staff)

Current Research Scientists: A. Shafi, H. Subramoni
Current Senior Research Associate: J. Hashmi
Current Post-doc: M. S. Ghazimeersaeed
Current Research Specialist: J. Smith
Current Software Engineer: A. Reifsteck
Current Students (Graduate): Q. Anthony (Ph.D.), M. Bayatpour (Ph.D.), C.-C. Chun (Ph.D.), A. Jain (Ph.D.), M. Kedia (M.S.), K. S. Khorassani (Ph.D.), P. Kousha (Ph.D.), N. S. Kumar (M.S.), B. Ramesh (Ph.D.), N. Sarkauskas (Ph.D.), S. Srivastava (M.S.), K. K. Suresh (Ph.D.), S. Xu (Ph.D.), Q. Zhou (Ph.D.)
Past Students: A. Augustine (M.S.), A. Awan (Ph.D.), P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), R. Biswas (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), S. Chakraborthy (Ph.D.), B. Chandrasekharan (M.S.), C.-H. Chu (Ph.D.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), J. Hashmi (Ph.D.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), K. Kulkarni (M.S.), R. Kumar (M.S.), P. Lai (M.S.), M. Li (Ph.D.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), K. Raj (M.S.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), N. Sarkauskas (B.S.), D. Shankar (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.), J. Zhang (Ph.D.)
Past Research Scientists: K. Hamidouche, S. Sur, X. Lu
Past Programmers: D. Bureddy, J. Perkins
Past Research Specialist: M. Arnold
Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, K. Manian, S. Marcarelli, A. Ruhela, J. Vienne, H. Wang

Thank You!
[email protected]

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
