Building the Software Foundation for HPC and AI

Talk at HPC-AI Online Webinar (March ’20) by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

High-End Computing (HEC): PetaFlop to ExaFlop

100 PFlops in 2017; 149 PFlops in 2018

1 EFlops in 2021-2022?

Expected to have an ExaFlop system in 2021-2022!

Drivers of Modern HPC Cluster Architectures

(Figure: high-performance interconnects - InfiniBand with <1 usec latency and 200 Gbps bandwidth; multi-core processors; accelerators/coprocessors with high compute density, high performance/watt, and >1 TFlop DP on a chip; SSD, NVMe-SSD, NVRAM.)
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Phi)
• Available on HPC clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.

(Pictures: Summit, Sierra, Sunway TaihuLight, K Computer)

Increasing Usage of HPC, Big Data and Deep Learning

HPC (MPI, RDMA, Lustre, etc.), Big Data (Hadoop, Spark, HBase, Memcached, etc.), and Deep Learning (Caffe, TensorFlow, BigDL, etc.)

Convergence of HPC, Big Data, and Deep Learning!
Increasing need to run these applications on the Cloud!!

Parallel Programming Models Overview

(Figure: three abstract machine models, each with processes P1-P3 and their memories - Shared Memory Model (SHMEM, DSM); Distributed Memory Model (MPI - Message Passing Interface); Partitioned Global Address Space (PGAS) with logical shared memory (OpenSHMEM, UPC, Chapel, X10, CAF, ...).)
• Programming models provide abstract machine models
• Models can be mapped on different types of systems – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance

Deep Learning & Machine Learning

• Deep Learning (DL)
  – A subset of Machine Learning that uses Deep Neural Networks (DNNs)
  – Perhaps the most revolutionary subset!
  – Based on learning data representations
  – Examples: Convolutional Neural Networks, Recurrent Neural Networks, Hybrid Networks
• Data Scientist or Developer Perspective
  1. Identify DL as the solution to a problem
  2. Determine the data set
  3. Select the Deep Learning algorithm to use
  4. Use a large data set to train the algorithm

Courtesy: https://hackernoon.com/difference-between-artificial-intelligence-machine-learning-and-deep-learning-1pcv3zeg, https://blog.dataiku.com/ai-vs.-machine-learning-vs.-deep-learning

Building Software Foundation for HPC and AI: Challenges

(Figure: co-design opportunities and challenges across the various layers below)
• Application kernels/applications (HPC and DL)
• Middleware and programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Spark (RDD, DAG), TensorFlow, PyTorch, etc.
• Communication library or runtime for programming models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance
• Cross-cutting concerns: performance, scalability, resilience
• Networking technologies (InfiniBand, 40/100/200 GigE, Aries, Omni-Path, Slingshot), multi-/many-core architectures, and accelerators (GPU and FPGA)

Presentation Overview

• MVAPICH Project for HPC – MPI and PGAS Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• Public Cloud Deployment – Microsoft Azure and Amazon AWS
• Conclusions

Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
  – Used by more than 3,075 organizations in 89 countries
  – More than 714,000 (> 0.7 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (November ’19 ranking):
    • 3rd-ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
    • 5th, 448,448 cores (Frontera) at TACC
    • 8th, 391,680 cores (ABCI) in Japan
    • 14th, 570,020 cores (Nurion) in South Korea, and many others
  – Available with software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade; partner in the 5th-ranked TACC Frontera system

MVAPICH2 Release Timeline and Downloads

(Figure: cumulative number of downloads from Sep 2004 through Sep 2019, rising past 700,000, annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2 2.3.3, including MV2-X, MV2-GDR 2.0b/2.3.2, MV2-MIC 2.0, MV2-Virt 2.2, MV2-Azure 2.3.2, MV2-AWS, and OSU INAM 0.9.5.)

MVAPICH2 Software Family

Requirements and the associated library:
• MPI with support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE (v1/v2): MVAPICH2
• Advanced MPI features/support (UMR, ODP, DC, Core-Direct, SHARP, XPMEM) and OSU INAM (InfiniBand Network Monitoring and Analysis): MVAPICH2-X
• Optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications: MVAPICH2-GDR
• Optimized support for the Microsoft Azure platform with InfiniBand: MVAPICH2-Azure
• Advanced MPI features (SRD and XPMEM) with support for the Amazon Elastic Fabric Adapter (EFA): MVAPICH2-X-AWS
• Energy-aware MPI with support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE (v1/v2): MVAPICH2-EA
• MPI energy monitoring tool: OEMT
• InfiniBand network analysis and monitoring: OSU INAM
• Microbenchmarks for measuring MPI and PGAS performance: OMB

One-way Latency: MPI over IB with MVAPICH2

(Figure: small-message and large-message one-way latency vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-4-EDR, Omni-Path, and ConnectX-6 HDR; small-message latencies fall between roughly 1.01 us and 1.19 us across the adapters.)

Test beds:
• TrueScale-QDR: 3.1 GHz deca-core (Haswell) Intel, PCI Gen3, with IB switch
• ConnectX-3-FDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch
• ConnectIB-DualFDR: 3.1 GHz deca-core (Haswell) Intel, PCI Gen3, with IB switch
• ConnectX-4-EDR: 3.1 GHz deca-core (Haswell) Intel, PCI Gen3, with IB switch
• Omni-Path: 3.1 GHz deca-core (Haswell) Intel, PCI Gen3, with Omni-Path switch
• ConnectX-6-HDR: 3.1 GHz deca-core (Haswell) Intel, PCI Gen3, with IB switch
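Latencies like these are typically gathered with a ping-pong microbenchmark such as osu_latency from the OSU Micro-Benchmarks (OMB). The following is a minimal sketch of that measurement pattern (an illustrative assumption, not the actual OMB source); one-way latency is reported as half of the averaged round-trip time.

    /* Minimal ping-pong sketch in the spirit of osu_latency (illustrative only).
       Run with two MPI processes; one-way latency = round-trip time / 2. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int size = 8;                   /* message size in bytes */
        const int iters = 10000, skip = 100;  /* warm-up iterations are skipped */
        char *buf = malloc(size);
        double start = 0.0;

        for (int i = 0; i < iters + skip; i++) {
            if (i == skip) start = MPI_Wtime();
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        if (rank == 0)
            printf("%d bytes: %.2f us one-way\n", size,
                   (MPI_Wtime() - start) * 1e6 / (2.0 * iters));

        free(buf);
        MPI_Finalize();
        return 0;
    }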

Bandwidth: MPI over IB with MVAPICH2

(Figure: unidirectional and bidirectional bandwidth vs. message size for the same set of adapters; ConnectX-6 HDR peaks at 24,532 MB/s unidirectional and 48,027 MB/s bidirectional, with the other adapters ranging from about 3,373 to 12,590 MB/s unidirectional and 6,228 to 24,136 MB/s bidirectional.)

Test beds: same configurations as the latency measurements above (TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-4-EDR, Omni-Path, and ConnectX-6-HDR on 3.1 GHz deca-core Haswell / 2.8 GHz deca-core IvyBridge Intel nodes, PCI Gen3, with IB or Omni-Path switches).

Shared Address Space (XPMEM)-based Collectives Design

(Figure: OSU_Allreduce and OSU_Reduce latency on Broadwell with 256 processes for 16 KB - 4 MB messages, comparing MVAPICH2-2.3b, IMPI-2017v1.132, and MVAPICH2-X-2.3rc1; up to 1.8X improvement for Allreduce and 4X for Reduce.)

• "Shared Address Space"-based true zero-copy reduction collective designs in MVAPICH2
• Offloaded computation/communication to peer ranks in the reduction collective operation
• Up to 4X improvement for 4 MB Reduce and up to 1.8X improvement for 4 MB Allreduce
J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS ’18), May 2018. Available since MVAPICH2-X 2.3rc1.

Benefits of the New Asynchronous Progress Design: Broadwell + InfiniBand

(Figure: P3DFFT time per loop (lower is better) on 112-448 processes and High Performance Linpack (HPL) GFLOPS (higher is better) on 224-896 processes, PPN=28, comparing MVAPICH2 Async, MVAPICH2 Default, and IMPI 2019 Async/Default; up to 33% improvement for P3DFFT and up to 29% for HPL, with memory consumption at 69%.)


• Up to 33% performance improvement in the P3DFFT application with 448 processes
• Up to 29% performance improvement in the HPL application with 896 processes

A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and D.K. Panda, Efficient Asynchronous Communication Progress for MPI without Dedicated Resources, EuroMPI 2018. Enhanced version accepted for PARCO Journal. Available since MVAPICH2-X 2.3rc1
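To illustrate the communication/computation overlap that this asynchronous progress design targets, here is a minimal, hypothetical sketch built around a non-blocking MPI_Iallreduce; it is not code from the paper, just the usage pattern that benefits when the MPI library progresses communication in the background.

    /* Hypothetical overlap sketch: start a non-blocking reduction, compute on
       independent data, then wait. With background progress, the reduction
       completes during the compute phase. */
    #include <mpi.h>

    void compute_with_overlap(double *sum_buf, int count, double *work, int n) {
        MPI_Request req;
        MPI_Iallreduce(MPI_IN_PLACE, sum_buf, count, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* Independent computation that does not touch sum_buf. */
        for (int i = 0; i < n; i++)
            work[i] = work[i] * 2.0 + 1.0;

        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* reduction result now in sum_buf */
    }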

Point-to-point: Latency & Bandwidth (Intra-socket) on ARM

(Figure: intra-socket point-to-point latency (small, medium, and large messages) and bandwidth (small, medium, and large messages) on ARM, comparing MVAPICH2-X with OpenMPI+UCX; MVAPICH2-X shows up to 8.2x better latency.)

MVAPICH2 Software Family

Requirements and the associated library:
• MPI with support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE (v1/v2): MVAPICH2
• Advanced MPI features/support (UMR, ODP, DC, Core-Direct, SHARP, XPMEM) and OSU INAM (InfiniBand Network Monitoring and Analysis): MVAPICH2-X
• Optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications: MVAPICH2-GDR
• Optimized support for the Microsoft Azure platform with InfiniBand: MVAPICH2-Azure
• Advanced MPI features (SRD and XPMEM) with support for the Amazon Elastic Fabric Adapter (EFA): MVAPICH2-X-AWS
• Energy-aware MPI with support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE (v1/v2): MVAPICH2-EA
• MPI energy monitoring tool: OEMT
• InfiniBand network analysis and monitoring: OSU INAM
• Microbenchmarks for measuring MPI and PGAS performance: OMB

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers

At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(both handled inside MVAPICH2)

High Performance and High Productivity
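A minimal end-to-end sketch of this usage, assuming a CUDA-aware MPI library such as MVAPICH2-GDR is in place (illustrative only; the buffer size and variable names are hypothetical, and error checks are omitted):

    /* Two ranks exchange a GPU-resident buffer by passing the device pointer
       straight to MPI; the CUDA-aware runtime performs staging or GPUDirect
       RDMA internally. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1 << 20;               /* 1M floats */
        float *devbuf;
        cudaMalloc((void **)&devbuf, count * sizeof(float));

        if (rank == 0)
            MPI_Send(devbuf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(devbuf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }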

Optimized MVAPICH2-GDR Design

(Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size, comparing MV2 (no GDR) with MV2-GDR 2.3; small-message latency drops to 1.85 us (about 10x better), bandwidth improves by about 9x, and bi-directional bandwidth by about 11x.)

Platform (MVAPICH2-GDR-2.3): Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct RDMA.

Application-Level Evaluation (HOOMD-blue)

(Figure: average time steps per second (TPS) for 64K and 256K particles on 4-32 processes, comparing MV2 with MV2+GDR; about 2X improvement.)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

(Figure: normalized execution time of the COSMO application on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs), comparing Default, Callback-based, and Event-based designs.)

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the COSMO application.
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS ’16.

Presentation Overview

• MVAPICH Project for HPC – MPI and PGAS Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• Public Cloud Deployment – Microsoft Azure and Amazon AWS
• Conclusions

Scale-up and Scale-out Desired
• Scale-up: intra-node communication
  – Many improvements, e.g., NVIDIA cuDNN, cuBLAS, NCCL, etc.
  – CUDA 9 Co-operative Groups
• Scale-out: inter-node communication
  – DL frameworks – most are optimized for single-node only
  – Distributed (parallel) training is an emerging trend
    • OSU-Caffe – MPI-based
    • Microsoft CNTK – MPI/NCCL2
    • Google TensorFlow – gRPC-based/MPI/NCCL2
    • Facebook Caffe2 – hybrid (NCCL2/Gloo/MPI)
(Figure: communication stacks such as NCCL2, cuDNN, MPI, MKL-DNN, gRPC, and Hadoop positioned along scale-up vs. scale-out performance axes.)

Data Parallel Deep Learning and MPI Collectives
• Major MPI collectives involved in designing distributed frameworks:
  – MPI_Bcast – required for DNN parameter exchange
  – MPI_Reduce – needed for gradient accumulation from multiple solvers
  – MPI_Allreduce – use just one Allreduce instead of Reduce and Broadcast
(Figure: training loop across GPUs 0-3 – 1. data propagation: MPI_Bcast of the packed parameter buffer from GPU 0; 2. forward/backward pass over layers L1-Ln on each GPU; 3. gradient aggregation: MPI_Reduce of the packed gradient buffers to GPU 0, followed by ApplyUpdates. A sketch of this pattern appears below.)
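As an illustration of the collectives listed above, the following hypothetical sketch (not OSU-Caffe/S-Caffe code) broadcasts the initial parameters once and then combines per-rank gradients with a single MPI_Allreduce each iteration:

    /* Data-parallel training helpers, sketched for illustration only. */
    #include <mpi.h>

    /* Step 1 on the slide: propagate packed parameters from rank 0. */
    void broadcast_params(float *params, int count, MPI_Comm comm) {
        MPI_Bcast(params, count, MPI_FLOAT, 0, comm);
    }

    /* Step 3 on the slide, collapsed into one collective: sum gradients across
       all ranks in place, then scale to the average before applying updates. */
    void average_gradients(float *grad, int count, MPI_Comm comm) {
        int nranks;
        MPI_Comm_size(comm, &nranks);
        MPI_Allreduce(MPI_IN_PLACE, grad, count, MPI_FLOAT, MPI_SUM, comm);
        for (int i = 0; i < count; i++)
            grad[i] /= (float)nranks;
    }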

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’17).

Multiple Approaches taken up by OSU
• MPI-driven Deep Learning
  – CPU-based Deep Learning
  – GPU-based Deep Learning
• Co-designing Deep Learning stacks with high-performance MPI
• Out-of-core DNN training
• Hybrid (data and model) parallelism

MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training

ML/DL Applications

TensorFlow PyTorch MXNet

Horovod

MVAPICH2-X for CPU-Based Training, MVAPICH2-GDR for GPU-Based Training

Performance of CNTK with MVAPICH2-X on CPU-based Deep Learning

(Figure: CNTK AlexNet training execution time (B.S=default, iterations=50, ppn=28) on 28-224 processes, comparing Intel MPI, MVAPICH2, and MVAPICH2-XPMEM; 9-20% improvement.)
• CPU-based training of the AlexNet neural network using the ImageNet ILSVRC2012 dataset
• Advanced XPMEM-based designs show up to 20% benefit over Intel MPI (IMPI) for CNTK DNN training using Allreduce
• The proposed designs show good scalability with increasing system size
Available since the MVAPICH2-X 2.3rc1 release.

J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS ’18), May 2018.

Distributed TensorFlow on TACC Frontera (2048 CPU nodes)
• Scaled TensorFlow to 2048 nodes on Frontera using MVAPICH2 and Intel MPI

• MVAPICH2 and IntelMPI give similar performance for DNN training

• Report a peak of 260,000 images/sec on 2048 nodes

• On 2048 nodes, ResNet-50 can be trained in 7 minutes!

A. Jain, A. A. Awan, H. Subramoni, DK Panda, “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (SC ’19 Workshop).

Scaling DL Frameworks using MVAPICH2-X on TACC Frontera

• On a single node, TensorFlow (TF) is 8% better than MXNet
• TF (tf_cnn_benchmark) is 2.13x better than PyTorch
• TensorFlow is 1.7x better than MXNet
• TF (Keras) gives better performance compared to PyTorch and MXNet

A. Jain, A. A. Awan, H. Subramoni, DK Panda, “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (SC ’19 Workshop).

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation (DGX-2)

• Optimized designs in the upcoming MVAPICH2-GDR offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 1 DGX-2 node (16 Volta GPUs)

(Figure: Allreduce latency vs. message size (8 B - 128 KB), comparing MVAPICH2-GDR-2.3.3 with NCCL-2.5; improvements of ~2.5X and ~4.7X are annotated.)

Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2.
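A minimal sketch of how such an Allreduce latency can be measured on GPU-resident buffers with a CUDA-aware MPI library (an illustrative osu_allreduce-style loop, not the actual OSU Micro-Benchmarks source):

    /* Time MPI_Allreduce on device buffers; assumes a CUDA-aware MPI library. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const int count = 16 * 1024;             /* 64 KB of floats per rank */
        float *d_in, *d_out;
        cudaMalloc((void **)&d_in,  count * sizeof(float));
        cudaMalloc((void **)&d_out, count * sizeof(float));

        const int iters = 1000;
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Allreduce(d_in, d_out, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("avg MPI_Allreduce latency: %.2f us on %d ranks\n",
                   (t1 - t0) * 1e6 / iters, nranks);

        cudaFree(d_in);
        cudaFree(d_out);
        MPI_Finalize();
        return 0;
    }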

MVAPICH2-GDR: Enhanced MPI_Allreduce at Scale
• Optimized designs in the upcoming MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs

(Figure: Allreduce latency on 1,536 GPUs for 4 B - 16 KB messages, bandwidth on 1,536 GPUs for 32 MB - 256 MB messages, and 128 MB-message bandwidth on 24-1,536 GPUs, comparing MVAPICH2-GDR-2.3.3 with NCCL 2.5 (plus SpectrumMPI 10.3 and OpenMPI 4.0.1 for latency); MVAPICH2-GDR is up to 1.7X better in latency and 1.6X-1.7X better in bandwidth.)

Platform: dual-socket IBM POWER9 CPU, 6 NVIDIA Volta V100 GPUs, and 2-port InfiniBand EDR interconnect.

Scalable TensorFlow using Horovod and MVAPICH2-GDR

• ResNet-50 training using the TensorFlow benchmark on 1 DGX-2 node (16 Volta GPUs)
• Scaling Efficiency = (Actual throughput / Ideal throughput at scale) x 100%

(Figure: images per second and scaling efficiency (%) on 1-16 GPUs, comparing NCCL-2.5 with MVAPICH2-GDR-2.3.3; MVAPICH2-GDR delivers 9% higher throughput.)

Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2.

Distributed TensorFlow on ORNL Summit (1536 GPUs)

• ResNet-50 training using the TensorFlow benchmark on SUMMIT – 1,536 Volta GPUs!
  – ImageNet-1k has 1,281,167 (1.2 million) images
  – MVAPICH2-GDR reaches ~0.35 million images per second for ImageNet-1k!
• Time/epoch = 3.6 seconds
• Total time (90 epochs) = 3.6 x 90 ≈ 332 seconds = 5.5 minutes!

(Figure: images per second (thousands) on 1-1,536 GPUs for NCCL-2.5 and MVAPICH2-GDR 2.3.3. *Errors were observed for NCCL2 beyond 96 GPUs.)

Platform: the Summit supercomputer (#1 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2.

Multiple Approaches taken up by OSU
• MPI-driven Deep Learning
  – CPU-based Deep Learning
  – GPU-based Deep Learning
• Co-designing Deep Learning stacks with high-performance MPI
• Out-of-core DNN training
• Hybrid (data and model) parallelism

OSU-Caffe: Scalable Deep Learning
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
(Figure: GoogLeNet (ImageNet) training time in seconds on 8-128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); Caffe beyond a single node is an invalid use case.)
OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/

Training Large (Out-of-core) Models
• Large DNNs cannot be trained on GPUs due to memory limitations!
  – ResNet-50 for image recognition, but current frameworks can only go up to a small batch size of 45
  – Next-generation models like Neural Machine Translation (NMT) are ridiculously large, consist of billions of parameters, and require even more memory
  – Can we design out-of-core DNN training support using new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
• The general intuition is that managed allocations "will be" slow!
  – The proposed framework, OC-Caffe (Out-of-Core Caffe), shows the potential of managed-memory designs that can provide performance with negligible/no overhead
• OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA 9 and cuDNN 7
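To make the managed-memory idea concrete, here is a hypothetical sketch (not OC-Caffe code) of allocating a buffer larger than the GPU's physical memory with CUDA Unified Memory:

    /* On Pascal/Volta GPUs, cudaMallocManaged allocations may exceed device
       memory; pages are migrated between host and device on demand as kernels
       touch them, which is the overhead OC-Caffe's designs aim to hide. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        size_t bytes = 32ULL * 1024 * 1024 * 1024;   /* 32 GB, larger than a 16 GB V100 */
        float *buf = NULL;

        cudaError_t err = cudaMallocManaged((void **)&buf, bytes, cudaMemAttachGlobal);
        if (err != cudaSuccess) {
            printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        /* ... launch training kernels that read/write buf ... */

        cudaFree(buf);
        return 0;
    }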

A. Awan et al., OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, HiPC ’18.

HyPar-Flow at Scale (512 nodes on TACC Frontera)
• ResNet-1001 with variable batch size; 481x speedup on 512 Intel Xeon Skylake nodes (TACC Frontera)
• Approach:
  – 48 model-partitions for 56 cores
  – 512 model-replicas for 512 nodes
  – Total cores: 48 x 512 = 24,576
• Speedup
  – 253X on 256 nodes
  – 481X on 512 nodes
• Scaling efficiency
  – 98% up to 256 nodes
  – 93.9% for 512 nodes

A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ’20 (accepted to be presented), https://arxiv.org/pdf/1911.05146.pdf

Presentation Overview

• MVAPICH Project for HPC – MPI and PGAS Library with CUDA-Awareness
• HiDL Project – High-Performance Deep Learning
• Public Cloud Deployment – Microsoft Azure and Amazon AWS
• Conclusions

MVAPICH2 Software Family

Requirements and the associated library:
• MPI with support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE (v1/v2): MVAPICH2
• Advanced MPI features/support (UMR, ODP, DC, Core-Direct, SHARP, XPMEM) and OSU INAM (InfiniBand Network Monitoring and Analysis): MVAPICH2-X
• Optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications: MVAPICH2-GDR
• Optimized support for the Microsoft Azure platform with InfiniBand: MVAPICH2-Azure
• Advanced MPI features (SRD and XPMEM) with support for the Amazon Elastic Fabric Adapter (EFA): MVAPICH2-X-AWS
• Energy-aware MPI with support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE (v1/v2): MVAPICH2-EA
• MPI energy monitoring tool: OEMT
• InfiniBand network analysis and monitoring: OSU INAM
• Microbenchmarks for measuring MPI and PGAS performance: OMB

MVAPICH2-Azure

• Released on 08/16/2019
• Major features and enhancements
  – Based on MVAPICH2-2.3.2
  – Enhanced tuning for point-to-point and collective operations
  – Targeted for Azure HB & HC virtual machine instances
  – Flexibility for 'one-click' deployment
  – Tested with Azure HB & HC VM instances
• Being upgraded to the latest MVAPICH2-X 2.3rc3 release

Performance of Radix

(Figure: Radix total execution time in seconds (lower is better) vs. number of processes (nodes x PPN) on Azure HB instances (16-352 processes) and HC instances (60-240 processes), comparing MVAPICH2-X with HPCx; speedups of 38% and 3x over HPCx are annotated.)

MVAPICH2-X is able to deliver better performance compared to HPCx on both HB and HC series systems

Performance of FDS (HC)

(Figure: FDS total execution time in seconds (lower is better) on a single node (16-44 processes) and across multiple nodes (88 and 176 processes, nodes x PPN), comparing MVAPICH2-X with HPCx; MVAPICH2-X is up to 1.11x better.)

Part of input parameter: MESH IJK=5,5,5, XB=-1.0,0.0,-1.0,0.0,0.0,1.0, MULT_ID='mesh array'

MVAPICH2-X is able to deliver better performance compared to HPCx on HC series

MVAPICH2-X-AWS 2.3

• Released on 08/12/2019
• Major features and enhancements
  – Based on MVAPICH2-X 2.3
  – New design based on the Amazon EFA adapter's Scalable Reliable Datagram (SRD) transport protocol
  – Support for XPMEM-based intra-node communication for point-to-point and collectives
  – Enhanced tuning for point-to-point and collective operations
  – Targeted for AWS instances with the Amazon Linux 2 AMI and EFA support
  – Tested with the c5n.18xlarge instance
• Being upgraded to the latest MVAPICH2-X 2.3rc3 release

MPI-level Performance with SRD

(Figure: point-to-point latency, Allreduce latency, and Bcast latency (16 nodes, 36 ppn) vs. message size, comparing MV2X with OpenMPI, plus miniGhost and CloverLeaf execution time on 72-288 processes (nodes x PPN); MV2X is annotated as 10% and 27.5% better on the applications.)

Setup: instance type c5n.18xlarge; CPU: Intel Xeon Platinum 8124M @ 3.00 GHz; MVAPICH2 version: latest MVAPICH2-X with SRD support; OpenMPI version: Open MPI v4.0.2 with libfabric 1.8.

S. Chakraborty, S. Xu, H. Subramoni, and D. K. Panda, Designing Scalable and High-performance MPI Libraries on Amazon Elastic Fabric Adapter, 26th Symposium on High Performance Interconnects (HOTI ’19).

Conclusions

• Upcoming Exascale systems need to be designed with a holistic view of HPC, Big Data, Deep Learning, and Cloud
• Presented an overview of designing convergent software stacks
• The presented solutions enable the HPC, Deep Learning, and Cloud Computing communities to take advantage of current and next-generation systems

Commercial Support for MVAPICH2, HiBD, and HiDL Libraries
• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
  – Help and guidance with installation of the library
  – Platform-specific optimizations and tuning
  – Timely support for operational issues encountered with the library
  – Web portal interface to submit issues and track their progress
  – Advanced debugging techniques
  – Application-specific optimizations and tuning
  – Guidelines on best practices
  – Periodic information on major fixes and updates
  – Information on major releases
  – Help with upgrading to the latest release
  – Flexible Service Level Agreements
• Support being provided to Lawrence Livermore National Laboratory (LLNL) and KISTI, Korea

Silver ISV Member for the OpenPOWER Consortium + Products
• Member of the OpenPOWER Consortium as a silver ISV member
• Provides flexibility:
  – To have the MVAPICH2, HiDL, and HiBD libraries integrated into the OpenPOWER software stack
  – To be a part of the OpenPOWER ecosystem
  – To participate with different vendors in the bidding, installation, and deployment process
• Introduced two new integrated products with support for OpenPOWER systems (presented at the OpenPOWER North America Summit)
  – X-ScaleHPC
  – X-ScaleAI
  – Send an e-mail to [email protected] for a free trial!!

Funding Acknowledgments
Funding Support by

Equipment Support by

Network Based Computing Laboratory HPC-AI Webinar (March ‘20) 48 Personnel Acknowledgments Current Students (Graduate) Current Research Scientist Current Post-doc – A. Awan (Ph.D.) – Kamal Raj (M.S.) – Q. Zhou (Ph.D.) – A. Shafi – M. S. Ghazimeersaeed – M. Bayatpour (Ph.D.) – K. S. Khorassani (Ph.D.) – H. Subramoni – K. Manian – C.-H. Chu (Ph.D.) – P. Kousha (Ph.D.) Current Students (Undergraduate) Current Research Specialist – J. Hashmi (Ph.D.) – A. Quentin (Ph.D.) – V. Gangal (B.S.) – J. Smith – A. Jain (Ph.D.) – B. Ramesh (Ph.D.) – N. Sarkauskas (B.S.) – K. S. Kandadi (Ph.D.) – S. Xu (Ph.D.)

Past Students Past Research Scientist – A. Augustine (M.S.) – T. Gangadharappa (M.S.) – P. Lai (M.S.) – R. Rajachandrasekar (Ph.D.) – K. Hamidouche – P. Balaji (Ph.D.) – K. Gopalakrishnan (M.S.) – J. Liu (Ph.D.) – D. Shankar (Ph.D.) – S. Sur – R. Biswas (M.S.) – W. Huang (Ph.D.) – M. Luo (Ph.D.) – G. Santhanaraman (Ph.D.) – A. Singh (Ph.D.) – S. Bhagvat (M.S.) – W. Jiang (M.S.) – A. Mamidala (Ph.D.) – X. Lu – J. Sridhar (M.S.) – A. Bhat (M.S.) – J. Jose (Ph.D.) – G. Marsh (M.S.) Past Programmers – S. Sur (Ph.D.) – D. Buntinas (Ph.D.) – S. Kini (M.S.) – V. Meshram (M.S.) – D. Bureddy – M. Koop (Ph.D.) – A. Moody (M.S.) – H. Subramoni (Ph.D.) – L. Chai (Ph.D.) – J. Perkins – K. Vaidyanathan (Ph.D.) – B. Chandrasekharan (M.S.) – K. Kulkarni (M.S.) – S. Naravula (Ph.D.) – A. Vishnu (Ph.D.) – S. Chakraborthy (Ph.D.) – R. Kumar (M.S.) – R. Noronha (Ph.D.) Past Research Specialist – J. Wu (Ph.D.) – N. Dandapanthula (M.S.) – S. Krishnamoorthy (M.S.) – X. Ouyang (Ph.D.) – M. Arnold – W. Yu (Ph.D.) – V. Dhanraj (M.S.) – K. Kandalla (Ph.D.) – S. Pai (M.S.) – M. Li (Ph.D.) – S. Potluri (Ph.D.) – J. Zhang (Ph.D.) Past Post-Docs – D. Banerjee – J. Lin – S. Marcarelli – H. Wang – X. Besseron – M. Luo – A. Ruhela – H.-W. Jin – E. Mancini – J. Vienne Network Based Computing Laboratory HPC-AI Webinar (March ‘20) 49 Thank You! [email protected]

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
