
Scalable Distributed Deep Learning on Modern HPC Systems


HPC-AI Advisory Council Australia Conference (Sept ’20)

by Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Follow us on https://twitter.com/mvapich

High-End Computing (HEC): PetaFlop to ExaFlop

• 100 PetaFlops in 2017
• 415 PetaFlops in 2020 (Fugaku in Japan, with 7.3M cores)
• 1 ExaFlops – expected to have an ExaFlop system in 2021!

AI, Deep Learning & Machine Learning

• Machine Learning (ML) with many traditional applications
  – K-means
  – Random Forest
  – Linear Regression
  – Nearest Neighbor
• Deep Learning (DL)
  – A subset of Machine Learning that uses Deep Neural Networks (DNNs)
  – Based on learning data representations
  – Examples: Convolutional Neural Networks, Recurrent Neural Networks, Hybrid Networks

Courtesy: https://hackernoon.com/difference-between-artificial-intelligence-machine-learning-and-deep-learning-1pcv3zeg, https://blog.dataiku.com/ai-vs.-machine-learning-vs.-deep-learning

Key Phases of Deep Learning

• Deep Learning has two major tasks
  1. Training of the Deep Neural Network
  2. Inference (or deployment) that uses a trained DNN
• DNN Training
  – Training is compute/communication intensive – can take days to weeks
  – Faster training is necessary!
• Faster training can be achieved by
  – Using newer and faster hardware – but there is a limit!
  – Can we use more GPUs or nodes?
• The need for Parallel and Distributed Training

Introduction to Dask and GPU-based Data Science

• Dask is a popular task-based framework:
  – Scales Python applications from laptops to high-end systems
  – Builds a task graph that is executed lazily on parallel hardware
  – Natively extends popular data processing libraries like NumPy and Pandas
• NVIDIA RAPIDS is a GPU-based data science ecosystem built around Dask:
  – Aims to hide low-level complexities of the CUDA framework
  – cuPy/cuDF/cuML are GPU counterparts of NumPy/Pandas/Scikit-learn
• The Dask Distributed library supports parallel and distributed execution:
  – Built using the asyncio package, which allows execution of asynchronous/non-blocking/concurrent operations called coroutines:
    • These are defined using async and invoked using await (see the sketch below)
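For illustration, here is a minimal plain-asyncio sketch of the coroutine pattern that Dask Distributed builds on; it is not Dask's internal code, and the coroutine name fetch_block is purely illustrative.

    import asyncio

    async def fetch_block(block_id):
        # A coroutine: control is yielded at each await, so many transfers
        # can be in flight concurrently on a single event loop.
        await asyncio.sleep(0.1)              # stand-in for a non-blocking transfer
        return f"data-{block_id}"

    async def main():
        # Launch several coroutines concurrently and gather their results.
        results = await asyncio.gather(*(fetch_block(i) for i in range(4)))
        print(results)

    asyncio.run(main())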

cuML: GPU-accelerated Machine Learning Library

• Collection of ML and mathematical primitives for GPUs:
  – Maintains a similar interface to Scikit-learn
  – Hides the complexities of CUDA programming from data scientists
• Scales out using the Dask ecosystem
• cuML supports execution on a variety of platforms (see the sketch below):
  – A single GPU
  – Single-Node Multi-GPU (SNMG)
  – Multi-Node Multi-GPU (MNMG)

Image Courtesy: https://rapids.ai/about.html
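A minimal single-GPU sketch of cuML's Scikit-learn-like interface, assuming a working RAPIDS/cuML installation; the synthetic data, cluster count, and iteration limit are illustrative choices, not values from the talk.

    import cupy as cp
    from cuml.cluster import KMeans   # GPU counterpart of sklearn.cluster.KMeans

    # Synthetic data generated directly in GPU memory.
    X = cp.random.random((100_000, 16), dtype=cp.float32)

    # Same fit()/predict() workflow as Scikit-learn, executed on the GPU.
    kmeans = KMeans(n_clusters=8, max_iter=100)
    kmeans.fit(X)
    labels = kmeans.predict(X)
    print(labels[:10])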

Scale-up and Scale-out

• Scale-up: Intra-node Communication
  – Many improvements, e.g., NVIDIA cuDNN, cuBLAS, NCCL2, and CUDA Co-operative Groups
• Scale-out: Inter-node Communication
  – DL and ML frameworks – most are optimized for single-node execution only
  – Distributed (parallel) execution is an emerging trend

[Figure: communication libraries and frameworks (cuDNN, NCCL2, MKL-DNN, MPI, gRPC, Hadoop) positioned along Scale-up Performance vs. Scale-out Performance axes; the desired region combines both]

Broad Challenge: Exploiting HPC for DL and ML

How to efficiently scale up and scale out Deep Learning (DL) and Machine Learning (ML) frameworks and take advantage of heterogeneous High Performance Computing (HPC) resources?

Programming Models and Runtimes for Multi-Petaflop and Exaflop Systems: Challenges

[Layered stack, with Co-Design Opportunities and Challenges across Various Layers as a cross-cutting concern]
• Application Kernels/Applications (HPC, DL, and ML)
• Programming Models: MPI, PGAS (UPC & OpenSHMEM), CUDA, OpenMP, OpenACC, Hadoop, Spark (RDD, DAG), TensorFlow, PyTorch, Dask, cuML, etc.
• Communication Library or Runtime for Programming Models: Point-to-point Communication, Collective Communication, Energy-Awareness, Synchronization and Locks, I/O and File Systems, Fault Tolerance, Resilience, Performance
• Networking Technologies (InfiniBand, 40/100/200GigE, RoCE, Omni-Path, EFA, and Slingshot); Multi-/Many-core Architectures; Accelerators (GPU and FPGA)

Overview of the MVAPICH2 Project

• High Performance open-source MPI Library
• Support for multiple interconnects
  – InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), and AWS EFA
• Support for multiple platforms
  – x86, OpenPOWER, ARM, Xeon-Phi, GPGPUs (NVIDIA and AMD (upcoming))
• Started in 2001, first open-source version demonstrated at SC ’02
• Supports the latest MPI-3.1 standard
• http://mvapich.cse.ohio-state.edu
• Additional optimized versions for different systems/environments:
  – MVAPICH2-X (Advanced MPI + PGAS), since 2011
  – MVAPICH2-GDR with support for NVIDIA GPGPUs, since 2014
  – MVAPICH2-MIC with support for Intel Xeon-Phi, since 2014
  – MVAPICH2-Virt with virtualization support, since 2015
  – MVAPICH2-EA with support for Energy-Awareness, since 2015
  – MVAPICH2-Azure for Azure HPC IB instances, since 2019
  – MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
• Tools:
  – OSU MPI Micro-Benchmarks (OMB), since 2003
  – OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
• Used by more than 3,100 organizations in 89 countries
• More than 827,000 (> 0.8 million) downloads from the OSU site directly
• Empowering many TOP500 clusters (June ’20 ranking)
  – 4th, 10,649,600-core (Sunway TaihuLight) at NSC, Wuxi, China
  – 8th, 448,448 cores (Frontera) at TACC
  – 12th, 391,680 cores (ABCI) in Japan
  – 18th, 570,020 cores (Nurion) in South Korea, and many others
• Available with software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
• Partner in the 8th ranked TACC Frontera system
• Empowering Top500 systems for more than 15 years

Architecture of MVAPICH2 Software Family (for HPC, DL, and ML)

High Performance Parallel Programming Models
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid – MPI + X (MPI + PGAS + OpenMP/)

High Performance and Scalable Communication Runtime – Diverse APIs and Mechanisms
• Point-to-point Primitives, Job Startup, Remote Memory Access, Collectives Algorithms, Energy-Awareness, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Introspection & Analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, EFA, Omni-Path)
• Transport Protocols: RC, XRC, UD, DC
• Modern Features: SR-IOV, Multi-Rail, SHARP2*, ODP

Support for Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA & AMD GPUs)
• Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM
• Modern Features: MCDRAM*, NVLink, CAPI*

* Upcoming

MVAPICH2 Release Timeline and Downloads

[Figure: cumulative number of downloads of the MVAPICH2 family from Sep-04 to Mar-20, approaching 900,000, annotated with release milestones (MV2 0.9.0 through MV2 2.3.4, MV2-X, MV2-GDR, MV2-Virt, MV2-MIC, MV2-Azure, MV2-AWS, and OSU INAM)]

Accelerating DL and ML Applications with MVAPICH2 MPI Libraries

• MPI-driven Deep Learning
  – CPU-based Deep Learning
  – GPU-based Deep Learning
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology
• High-Performance MPI runtime for Dask
• MPI-driven Acceleration of cuML Algorithms
• Commercial Support and Products

High-Performance Distributed Data Parallel Training with TensorFlow

• gRPC
  – Officially available and supported
  – Open-source – can be enhanced by others
  – Accelerated gRPC (adds RDMA to gRPC)
• gRPC+X
  – Use gRPC for bootstrap and rendezvous
  – Actual communication is in “X”
  – X = MPI, Verbs, GPUDirect RDMA (GDR), etc.
• No-gRPC
  – Baidu – the first to use MPI collectives for TensorFlow
  – Horovod – uses NCCL, MPI, or any other future library (e.g., IBM DDL support recently added); see the sketch after the citation below

A. A. Awan, J. Bedorf, C-H Chu, H. Subramoni, and DK Panda, “Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation”, CCGrid ’19
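A minimal data-parallel sketch in the Horovod style described above, assuming Horovod's TensorFlow/Keras API and an MPI launcher (e.g., MVAPICH2's mpirun) underneath; the ResNet-50 model and the synthetic batch are placeholders for a real training pipeline.

    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()                                            # ranks come from the MPI launcher
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    model = tf.keras.applications.ResNet50(weights=None, classes=1000)
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

    # Synthetic batch standing in for ImageNet data.
    x = tf.random.uniform((32, 224, 224, 3))
    y = tf.random.uniform((32,), maxval=1000, dtype=tf.int64)

    model.fit(x, y, epochs=1,
              callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
              verbose=1 if hvd.rank() == 0 else 0)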

MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training

[Two software stacks]
• Stack 1: ML/DL Applications → TensorFlow / PyTorch / MXNet → Horovod → MVAPICH2 or MVAPICH2-X for CPU Training, MVAPICH2-GDR for GPU Training
• Stack 2: ML/DL Applications → PyTorch → torch.distributed / DeepSpeed → MVAPICH2 or MVAPICH2-X for CPU Training, MVAPICH2-GDR for GPU Training

More details available from: http://hidl.cse.ohio-state.edu

MVAPICH2 Software Family (CPU-Based Deep Learning)

High-Performance Parallel Programming Libraries
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC cloud
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC

Microbenchmarks
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools
• OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications

One-way Latency: MPI over IB with MVAPICH2 (x86 Systems)

[Figure: small-message and large-message one-way latency for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, ConnectX-4-EDR, Omni-Path, and ConnectX-6 HDR; small-message latencies fall between 1.01 us and 1.19 us across the six adapters]

Platforms: TrueScale-QDR – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-3-FDR – 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-4-EDR – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; Omni-Path – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch; ConnectX-6-HDR – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch

Bandwidth: MPI over IB with MVAPICH2 (x86 Systems)

[Figure: unidirectional and bidirectional bandwidth for the same six adapters; peak unidirectional bandwidth of 24,532 MB/s and peak bidirectional bandwidth of 48,027 MB/s]

Platforms: TrueScale-QDR – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-3-FDR – 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; ConnectX-4-EDR – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch; Omni-Path – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch; ConnectX-6-HDR – 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch

Intra-node Point-to-Point Performance on OpenPOWER

[Figure: intra-socket small/large-message latency, bandwidth, and bi-directional bandwidth for MVAPICH2-X 2.3rc3 vs. SpectrumMPI-10.3.0.01; MVAPICH2-X reaches 0.23 us small-message latency]

Platform: Two nodes of OpenPOWER (POWER9-ppc64le) CPU using Mellanox EDR (MT4121) HCA

Inter-node Point-to-Point Performance on OpenPOWER

[Figure: inter-node small/large-message latency, bandwidth, and bi-directional bandwidth for MVAPICH2-X 2.3rc3 vs. SpectrumMPI-10.3.0.01; MVAPICH2-X reaches 24,743 MB/s bandwidth and 49,249 MB/s bi-directional bandwidth]

Platform: Two nodes of OpenPOWER (POWER9-ppc64le) CPU using Mellanox EDR (MT4121) HCA

Optimized MVAPICH2 All-Reduce with XPMEM

[Figure: MPI_Allreduce latency on 1 and 2 nodes (PPN=20) for MVAPICH2-2.3, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM across 16K–2M message sizes; annotations show 2X–4X improvements with MVAPICH2-XPMEM]

• Optimized MPI All-Reduce design in MVAPICH2
  – Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node
• Optimized runtime parameters: MV2_CPU_BINDING_POLICY=hybrid, MV2_HYBRID_BINDING_POLICY=bunch

Performance of CNTK with MVAPICH2-X on CPU-based Deep Learning

CNTK AlexNet Training (B.S=default, iteration=50, ppn=28)

• CPU-based training of the AlexNet neural network using the ImageNet ILSVRC2012 dataset
• Advanced XPMEM-based designs show up to 20% benefits over Intel MPI (IMPI) for CNTK DNN training using All_Reduce
• The proposed designs show good scalability with increasing system size
• Available since the MVAPICH2-X 2.3rc1 release

[Figure: execution time (s) for Intel MPI, MVAPICH2, and MVAPICH2-XPMEM at 28, 56, 112, and 224 processes; MVAPICH2-XPMEM is 9–20% faster]

Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and DK Panda, 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018

Distributed TensorFlow on TACC Frontera (2,048 CPU nodes)

• Scaled TensorFlow to 2,048 nodes on Frontera using MVAPICH2 and IntelMPI

• MVAPICH2 and IntelMPI give similar performance for DNN training

• Report a peak of 260,000 images/sec on 2,048 nodes

• On 2048 nodes, ResNet-50 can be trained in 7 minutes!

A. Jain, A. A. Awan, H. Subramoni, DK Panda, “Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera”, DLS ’19 (SC ’19 Workshop).

MVAPICH2 Software Family (GPU-Based Deep Learning)

High-Performance Parallel Programming Libraries
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs and for GPU-enabled Deep Learning applications
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC cloud
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC

Microbenchmarks
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools
• OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers

At Sender:
    MPI_Send(s_devbuf, size, …);    /* GPU-to-GPU movement handled inside MVAPICH2 */
At Receiver:
    MPI_Recv(r_devbuf, size, …);

High Performance and High Productivity
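The same idea can be sketched from Python with mpi4py, assuming a CUDA-aware MPI library (such as MVAPICH2-GDR) underneath so that CuPy device buffers can be passed directly to MPI calls; buffer sizes and tags are illustrative.

    # Run with two ranks, e.g.: mpirun -np 2 python cuda_aware_send.py
    from mpi4py import MPI
    import cupy as cp

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Device (GPU) buffers; a CUDA-aware MPI can send/receive them directly.
    s_devbuf = cp.arange(1024, dtype=cp.float32)
    r_devbuf = cp.empty_like(s_devbuf)

    if rank == 0:
        comm.Send([s_devbuf, MPI.FLOAT], dest=1, tag=77)
    elif rank == 1:
        comm.Recv([r_devbuf, MPI.FLOAT], source=0, tag=77)
        print("received sum:", float(r_devbuf.sum()))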

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3.4 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory

Optimized MVAPICH2-GDR Design

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth for MV2 (no GDR) vs. MV2-GDR 2.3; GDR delivers 1.85 us latency (up to 11X lower) and roughly 9x–10x higher bandwidth]

Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA

D-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)

[Figure: intra-node and inter-node device-to-device latency and bandwidth across message sizes]
• Intra-node latency: 0.76 us (with GDRCopy)
• Intra-node bandwidth: 65.48 GB/sec for 4MB (via NVLink2)
• Inter-node latency: 2.18 us (with GDRCopy 2.0)
• Inter-node bandwidth: 23 GB/sec for 4MB (via 2-port EDR)

Platform: OpenPOWER (POWER9-ppc64le) nodes equipped with a dual-socket CPU, 4 Volta V100 GPUs, and 2-port EDR InfiniBand interconnect

MVAPICH2-GDR vs. NCCL2 – Allreduce Operation (DGX-2)

• Optimized designs in MVAPICH2-GDR offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 1 DGX-2 node (16 Volta GPUs)

Platform: NVIDIA DGX-2 system (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 10.1

[Figure: Allreduce latency for MVAPICH2-GDR-2.3.4 vs. NCCL-2.6 across message sizes; MVAPICH2-GDR is up to ~2.5X and ~4.7X better]

C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems," ICS-2020, June-July 2020.

MVAPICH2-GDR: MPI_Allreduce at Scale (ORNL Summit)

• Optimized designs in MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs

Platform: Dual-socket IBM POWER9 CPU, 6 NVIDIA Volta V100 GPUs, and 2-port InfiniBand EDR interconnect per node

[Figure: Allreduce latency and bandwidth on up to 1,536 GPUs for SpectrumMPI 10.3, OpenMPI 4.0.1, NCCL 2.6, and MVAPICH2-GDR-2.3.4; MVAPICH2-GDR is up to 1.7X better in latency and 1.6X–1.7X better in bandwidth (128MB messages)]

C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems," ICS-2020, June-July 2020.

Scalable TensorFlow using Horovod and MVAPICH2-GDR

• ResNet-50 training using the TensorFlow benchmark on 1 DGX-2 node (16 Volta GPUs)
• Scaling Efficiency = (Actual throughput / Ideal throughput at scale) x 100%

Platform: NVIDIA DGX-2 system, CUDA 10.1

[Figure: images per second and scaling efficiency (%) on 1–16 GPUs for NCCL-2.6 vs. MVAPICH2-GDR-2.3.3; MVAPICH2-GDR delivers up to 9% higher throughput]

C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems," ICS-2020, June-July 2020.

Distributed TensorFlow on ORNL Summit (1,536 GPUs)

• ResNet-50 training using the TensorFlow benchmark on SUMMIT – 1,536 Volta GPUs!
• ImageNet-1k has 1,281,167 (1.2 million) images
• MVAPICH2-GDR reaches ~0.42 million images per second for ImageNet-1k!
  – Time/epoch = 3 seconds
  – Total time (90 epochs) = 3 x 90 = 270 seconds = 4.5 minutes!

[Figure: images per second (thousands) on 1–1,536 GPUs for NCCL-2.6 vs. MVAPICH2-GDR 2.3.4]
*We observed issues for NCCL2 beyond 384 GPUs

Platform: The Summit supercomputer (#2 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1

Scaling PyTorch on ORNL Summit using MVAPICH2-GDR

• ResNet-50 training using PyTorch + Horovod on Summit
  – Synthetic ImageNet dataset
  – Up to 256 nodes, 1,536 GPUs
• MVAPICH2-GDR can outperform NCCL2
  – Up to 30% higher throughput
• CUDA 10.1, cuDNN 7.6.5, PyTorch v1.5.0, Horovod v0.19.1

[Figure: images per second (thousands, higher is better) vs. number of GPUs for NCCL-2.6 vs. MVAPICH2-GDR-2.3.4]

C.-H. Chu, P. Kousha, A. Awan, K. S. Khorassani, H. Subramoni and D. K. Panda, "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems," ICS-2020, June-July 2020.

Platform: The Summit supercomputer (#2 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.1

Scaling PyTorch with Horovod

• ResNet-50 training using PyTorch + Horovod on Lassen
  – Synthetic ImageNet dataset
  – Up to 64 nodes, 256 GPUs
• MVAPICH2-GDR can outperform NCCL2
  – Up to 15% higher throughput
• CUDA 10.2, cuDNN 7.6.5, PyTorch v1.5.0, Horovod v0.19.4

[Figure: images per second (thousands) on 1–256 GPUs for NCCL 2.7.8-1 vs. MVAPICH2-GDR-2.3.4]

Platform: The Lassen supercomputer (#14 on Top500.org) – 4 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.2

Scaling PyTorch with DeepSpeed

• ResNet-50 training using PyTorch + DeepSpeed on Lassen (see the sketch below)
  – Synthetic ImageNet dataset
  – Up to 64 nodes, 256 GPUs
• MVAPICH2-GDR can outperform NCCL2
  – Up to 10% higher throughput
• CUDA 10.2, cuDNN 7.6.5, PyTorch v1.5.0, DeepSpeed v0.2.0

[Figure: images per second (thousands) on 1–256 GPUs for NCCL 2.7.8-1 vs. MVAPICH2-GDR-2.3.4]
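A hedged sketch of the DeepSpeed training-loop pattern used above, assuming a recent DeepSpeed release that accepts the configuration as a Python dict; the model, optimizer, config values, and synthetic batch are illustrative, and the communication backend (NCCL or an MPI library) is chosen at launch time, not in this code.

    import torch
    import torchvision.models as models
    import deepspeed

    model = models.resnet50()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Illustrative configuration; batch size would normally match GPUs x micro-batch.
    ds_config = {"train_batch_size": 32, "fp16": {"enabled": False}}

    engine, optimizer, _, _ = deepspeed.initialize(
        model=model, optimizer=optimizer, config=ds_config)

    # Synthetic ImageNet-like batch.
    x = torch.randn(32, 3, 224, 224, device=engine.device)
    y = torch.randint(0, 1000, (32,), device=engine.device)

    loss = torch.nn.functional.cross_entropy(engine(x), y)
    engine.backward(loss)   # DeepSpeed performs the gradient allreduce here
    engine.step()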

Platform: The Lassen supercomputer (#14 on Top500.org) – 4 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.2

Scaling PyTorch with torch.distributed

• ResNet-50 training using PyTorch + torch.distributed on Lassen (see the sketch below)
  – Synthetic ImageNet dataset
  – Up to 64 nodes, 256 GPUs
• MVAPICH2-GDR can outperform NCCL2
  – Up to 16% higher throughput
• CUDA 10.2, cuDNN 7.6.5, PyTorch v1.5.0

[Figure: images per second (thousands) on 1–256 GPUs for NCCL 2.7.8-1 vs. MVAPICH2-GDR-2.3.4]
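A minimal sketch of data-parallel training with torch.distributed and DistributedDataParallel; the "mpi" backend assumes a PyTorch build with MPI support (as would be needed to sit on top of MVAPICH2-GDR), and the rank-to-GPU mapping, model, and data are illustrative.

    import torch
    import torch.distributed as dist
    import torchvision.models as models
    from torch.nn.parallel import DistributedDataParallel as DDP

    # "mpi" assumes a PyTorch build with MPI support; "nccl" is the common alternative.
    dist.init_process_group(backend="mpi")
    local_rank = dist.get_rank() % torch.cuda.device_count()   # simple 1-GPU-per-rank mapping
    torch.cuda.set_device(local_rank)

    model = DDP(models.resnet50().cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # Synthetic ImageNet-like batch; DDP allreduces gradients during backward().
    x = torch.randn(32, 3, 224, 224, device=f"cuda:{local_rank}")
    y = torch.randint(0, 1000, (32,), device=f"cuda:{local_rank}")

    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()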

Platform: The Lassen supercomputer (#14 on Top500.org) – 4 NVIDIA Volta GPUs per node connected with NVLink, CUDA 10.2

PyTorch at Scale: Training ResNet-50 on 256 V100 GPUs

• Training performance for 256 V100 GPUs on LLNL Lassen
  – ~10,000 images/sec faster than NCCL training!

Distributed Framework    Communication Backend    Images/sec on 256 GPUs
torch.distributed        NCCL                     61,794
torch.distributed        MVAPICH2-GDR             72,120
Horovod                  NCCL                     74,063
Horovod                  MVAPICH2-GDR             84,659
DeepSpeed                NCCL                     80,217
DeepSpeed                MVAPICH2-GDR             88,873

Accelerating DL and ML Applications with MVAPICH2 MPI Libraries

• MPI-driven Deep Learning
  – CPU-based Deep Learning
  – GPU-based Deep Learning
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology
• High-Performance MPI runtime for Dask
• MPI-driven Acceleration of cuML Algorithms
• Commercial Support and Products

Out-of-core Training via CUDA Unified Memory

• Large DNNs cannot be trained on GPUs due to memory limitation!
  – ResNet-50 – current frameworks only allow small batch sizes
  – Next-generation models like Neural Machine Translation (NMT), AmoebaNet, etc. are ridiculously large and consist of billions of parameters
  – Can we design out-of-core DNN training support using new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
• General intuition is that managed allocations “will be” slow!
  – The proposed OC-Caffe (Out-of-Core Caffe) exploits the potential of CUDA Unified Memory
• OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA 9 and cuDNN 7

A. Awan et al., “OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training”, HiPC ’18

Parallelization Strategies

• What are the Parallelization Strategies?
  – Data Parallelism
  – Model Parallelism
  – Spatial / Channel Parallelism

[Figure: illustrations of Data Parallelism, Model Parallelism, and Spatial / Channel Parallelism]

HyPar-Flow: Hybrid and Model Parallelism for CPUs

• Data-Parallelism – only for models that fit in memory
• Out-of-core models
  – Deeper model => better accuracy, but more memory required!
• Model parallelism can work for out-of-core models!
• Key Challenges
  – Model partitioning is difficult for application programmers
  – Finding the right partition (grain) size is hard – cut at which layer and why?
  – Developing a practical system for model-parallelism
    • Redesign the DL framework or create additional layers?
    • Is existing communication middleware sufficient, or are extensions needed?
• A generic model-parallel sketch is shown after the citation below

A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ‘20, https://arxiv.org/pdf/1911.05146.pdf
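As a generic illustration of model parallelism (not HyPar-Flow's actual API, which is described in the paper above), a Keras model can be split across devices with tf.device so that each partition holds only part of the layers; device names and layer sizes are illustrative.

    import tensorflow as tf

    # Two partitions of one model placed on different devices; with HyPar-Flow-style
    # hybrid parallelism, each partition would instead live in a different MPI rank.
    device2 = '/GPU:0' if tf.config.list_physical_devices('GPU') else '/CPU:0'

    inputs = tf.keras.Input(shape=(1024,))
    with tf.device('/CPU:0'):                                  # first model partition
        h = tf.keras.layers.Dense(4096, activation='relu')(inputs)
        h = tf.keras.layers.Dense(4096, activation='relu')(h)
    with tf.device(device2):                                   # second model partition
        outputs = tf.keras.layers.Dense(10, activation='softmax')(h)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy')
    model.summary()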

Out-of-core Training with HyPar-Flow (512 nodes on TACC Frontera)

• ResNet-1001 with variable batch size on 512 Intel Xeon Skylake nodes (TACC Frontera)
• Approach:
  – 48 model-partitions for 56 cores
  – 512 model-replicas for 512 nodes
  – Total cores: 48 x 512 = 24,576
• Speedup
  – 253X on 256 nodes
  – 481X on 512 nodes
• Scaling Efficiency
  – 98% up to 256 nodes
  – 93.9% for 512 nodes

A. A. Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, “HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models”, ISC ‘20, https://arxiv.org/pdf/1911.05146.pdf

AI-Driven Digital Pathology

• The field of Pathology (like many other medical disciplines) is moving digital
• Traditionally, a pathologist reviews a slide and carries out diagnosis based on prior knowledge and training
  – Experience matters
• Can Deep Learning be used to train on thousands of slides and
  – Let the computer carry out the diagnosis
  – Narrow down the diagnosis and help a pathologist to make a final decision
• Significant benefits in
  – Reducing diagnosis time
  – Making pathologists more productive
  – Reducing health care cost

Exploiting Model Parallelism in AI-Driven Digital Pathology

• Pathology whole slide image (WSI)
  – Each WSI = 100,000 x 100,000 pixels
  – Cannot fit in a single GPU's memory
  – Tiles are extracted to make training possible
• Two main problems with tiles
  – Restricted tile size because of GPU memory limitation
  – Smaller tiles lose structural information
• Can we use Model Parallelism to train on larger tiles to get better accuracy and diagnosis?
• Reduced training time significantly on OpenPOWER + NVIDIA V100 GPUs
  – 32 hours (1 node, 1 GPU) -> 7.25 hours (1 node, 4 GPUs) -> 27 mins (32 nodes, 128 GPUs)

Courtesy: https://blog.kitware.com/digital-slide-archive-large-image-and-histomicstk-open-source-tools-for-management--and-analysis-of-digital-histopathology-data/

A. Jain, A. Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, D. K. Panda, R. Machiraju, and A. Parwani, “GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN Training”, Supercomputing (SC ‘20), Accepted to be Presented.

Accelerating DL and ML Applications with MVAPICH2 MPI Libraries

• MPI-driven Deep Learning
  – CPU-based Deep Learning
  – GPU-based Deep Learning
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology
• High-Performance MPI runtime for Dask
• MPI-driven Acceleration of cuML Algorithms
• Commercial Support and Products

Dask Architecture

• High-level collections: Dask Bag, Dask Array, Dask DataFrame, Delayed, Future

• Task Graph
• Deployment: Dask-MPI, Dask-CUDA, Dask-Jobqueue
• Dask Distributed: Scheduler, Worker, Client
• Communication layer: tcp.py, ucx.py, MPI4Dask
• Underlying wrappers: UCX-Py, mpi4py (Cython wrappers)
• Transports: TCP, UCX, MVAPICH2-GDR
• Hardware: laptops/desktops and High Performance Computing systems (see the launch sketch below)
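A minimal sketch of bringing up a Dask Distributed cluster over MPI ranks in the Dask-MPI style shown in the architecture above; MPI4Dask itself is the authors' MVAPICH2-GDR-backed communication device, whereas this sketch only shows the generic dask-mpi launch pattern with an illustrative workload.

    # Launch under MPI, e.g.: mpirun -np 4 python dask_mpi_example.py
    from dask_mpi import initialize
    from dask.distributed import Client
    import dask.array as da

    initialize()              # rank 0: scheduler, rank 1: this client, others: workers
    client = Client()         # connects to the scheduler started above

    x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
    print((x + x.T).sum().compute())   # task graph executed across the MPI ranks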

Latency/Throughput Comparison (UCX-Py vs. MPI4Dask)

[Figure: point-to-point latency and throughput for MPI4Dask (MVAPICH2-GDR) vs. UCX-Py (polling mode); MPI4Dask is up to 4x better in latency and up to 6x better in throughput]

Benchmark #1: Sum of cuPy Array and its Transpose

[Figure: total execution time and communication time for 2–6 Dask workers with IPoIB, UCX, and MPI4Dask backends; MPI4Dask is 3.47x better on average in total execution time and 6.92x better on average in communication time]
(A sketch of this benchmark follows below.)
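A hedged sketch of the Benchmark #1 workload (sum of a CuPy-backed Dask array and its transpose); the array size and chunking are illustrative, and the transport used for inter-worker transfers (IPoIB, UCX, or MPI4Dask) is decided by how the cluster is launched, not by this code.

    import cupy as cp
    import dask.array as da
    from dask.distributed import Client

    client = Client()                                  # assumes a running Dask cluster

    # GPU-backed Dask array: each chunk is a CuPy array on a worker.
    rs = da.random.RandomState(RandomState=cp.random.RandomState)
    x = rs.random((50_000, 50_000), chunks=(5_000, 5_000))

    # Adding the array to its transpose forces chunk exchange between workers.
    result = (x + x.T).sum().compute()
    print(float(result))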

Benchmark #2: cuDF Merge Operation

[Figure: total execution time and communication time for 2–6 Dask workers with IPoIB, UCX, and MPI4Dask backends; MPI4Dask is 3.11x better on average in total execution time and 3.22x better on average in communication time]

Throughput Comparison for Benchmark Applications

• Sum of cuPy Array and its Transpose, and cuDF Merge Operation

[Figure: aggregate throughput (GB/s) for 2–6 Dask workers with IPoIB, UCX, and MPI4Dask backends; MPI4Dask is 6.57x better for 6 workers on the cuPy benchmark and 3.74x better for 6 workers on the cuDF merge benchmark]

Accelerating cuML with MVAPICH2-GDR

• Utilize MVAPICH2-GDR (with mpi4py) as the communication backend during the training phase (the fit() function) for the Multi-Node Multi-GPU (MNMG) setting over a cluster of GPUs
• Communication primitives (see the sketch below):
  – Allreduce
  – Reduce
  – Broadcast
• Exploit optimized collectives

[Software stack: Dask and cuML Algorithms (Python) → cuML Primitives → UCX-Py / mpi4py (Cython) → UCX / NCCL / MVAPICH2-GDR → CUDA libraries (CUDA/C/C++)]
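A minimal sketch of the collective pattern listed above, using mpi4py over CuPy buffers as a CUDA-aware MPI (such as MVAPICH2-GDR) allows; the buffer stands in for the partial results that a cuML fit() would reduce, and its size is illustrative.

    # Run with one rank per GPU, e.g.: mpirun -np 4 python cuml_collectives.py
    from mpi4py import MPI
    import cupy as cp

    comm = MPI.COMM_WORLD

    # Per-rank partial result living in GPU memory (e.g., local centroid sums).
    local = cp.random.random(1_000_000, dtype=cp.float32)
    summed = cp.empty_like(local)

    # Allreduce across all GPUs; a CUDA-aware MPI moves the device buffers directly.
    comm.Allreduce([local, MPI.FLOAT], [summed, MPI.FLOAT], op=MPI.SUM)

    # Broadcast the combined result (e.g., updated model state) from rank 0.
    comm.Bcast([summed, MPI.FLOAT], root=0)

    if comm.Get_rank() == 0:
        print("first element of global sum:", float(summed[0]))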

[Figure: training time (s) and speedup on 1–32 GPUs for K-Means, Linear Regression, Nearest Neighbors, and Truncated SVD with NCCL vs. MVAPICH2-GDR backends]

Accelerating DL and ML Applications with MVAPICH2 MPI Libraries

• MPI-driven Deep Learning
  – CPU-based Deep Learning
  – GPU-based Deep Learning
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology
• High-Performance MPI runtime for Dask
• MPI-driven Acceleration of cuML Algorithms
• Commercial Support and Products

Commercial Support for MVAPICH2, HiBD, and HiDL Libraries

• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
  – Help and guidance with installation of the library
  – Platform-specific optimizations and tuning
  – Timely support for operational issues encountered with the library
  – Interface to submit issues and track their progress
  – Advanced debugging techniques
  – Application-specific optimizations and tuning
  – Obtaining guidelines on best practices
  – Periodic information on major fixes and updates
  – Information on major releases
  – Help with upgrading to the latest release
  – Flexible Service Level Agreements
• Support being provided to National Laboratories and International Supercomputing Centers

Silver ISV Member for the OpenPOWER Consortium

• Silver ISV member of the OpenPOWER Consortium
• Provides flexibility:
  – To have MVAPICH2, HiDL and HiBD libraries integrated into the OpenPOWER software stack
  – A part of the OpenPOWER ecosystem
  – Can participate with different vendors in the bidding, installation and deployment process
• Introduced two new integrated products with support for OpenPOWER systems (presented at the 2019 OpenPOWER North America Summit)
  – X-ScaleHPC
  – X-ScaleAI

X-ScaleHPC Package

• Scalable communication middleware solutions based on the OSU MVAPICH2 libraries
• “Out-of-the-box” fine-tuned and optimal performance on various HPC systems, including CPUs and GPUs
• MPI communication offloading capabilities to ARM-based smart NICs (such as Mellanox BlueField NICs)

Please refer to the presentation made at the 8th Annual MVAPICH User Group Meeting (MUG) last week: http://mug.mvapich.cse.ohio-state.edu/program/

X-ScaleAI Product and Features

• Aim: High-performance solution for distributed training of your complex AI problems on modern HPC platforms
• Features:
  – Powered by MVAPICH2 libraries
  – Great performance and scalability as delivered by MVAPICH2 libraries
  – Integrated packaging to run various Deep Learning frameworks (TensorFlow, PyTorch, MXNet, and others)
  – Targeted for both CPU-based and GPU-based Deep Learning training
  – Integrated profiling and introspection support for Deep Learning applications across the stacks (DeepIntrospect)
    • Provides cross-stack performance analysis in a visual manner and helps users optimize their DL applications to harness higher performance and scalability
  – Out-of-the-box optimal performance
    • Tuned for various CPU- and GPU-based HPC systems
  – One-click deployment and execution
    • No need to struggle for many hours
  – Support for x86 and OpenPOWER platforms
  – Support for InfiniBand, RoCE and NVLink interconnects

Enhancing ML/DL Workflow with DeepIntrospect (DI)

[Workflow]
1. Import a model (e.g., from a Keras Model Zoo) into the training application
2. Use a DL framework to train
3. Use DeepIntrospect (DI) to profile, monitor, and analyze

DeepIntrospect (DI): a performance analysis and optimization tool for distributed DL applications

Overview of DeepIntrospect SW Architecture

[Architecture diagram]
• Deep Learning Applications, Frameworks, and Middleware: Vision, Speech, Text, etc.; TensorFlow, PyTorch; Horovod; gRPC; Others
• DeepIntrospect: Performance Profiler, Monitor, Visualizer, Advanced Log Parser, Plugins (Text to XML, Text to Database), Connectors to Data Store
• Extreme-Scale HPC Platforms: GPUs (NVIDIA), CPUs (POWER, x86), Interconnects (NVLink, PCIe, InfiniBand)

X-ScaleAI Product with DeepIntrospect (DI) Capability

[Screenshots: profiling information (command-line output) and the corresponding JSON-format output]

X-ScaleAI Product with DeepIntrospect (DI) Capability

More details in the MUG ‘20 talk. Contact us for a demo and free trial! ([email protected])

Conclusions

• DL and ML applications require high performance and scalability
• This requires high-performance middleware designs that exploit modern interconnects
• Provided a set of solutions to achieve these goals:
  – MPI (MVAPICH2)-driven solution with Horovod for TensorFlow, PyTorch and MXNet
  – Out-of-core training and Hybrid Parallelism
  – MPI (MVAPICH2)-driven solutions for Dask and cuML applications
  – Commercial support and product with Deep Introspection capability
• Will continue to enable the DL community to achieve scalability and high performance for their distributed training

Funding Acknowledgments

Funding Support by

Equipment Support by

Acknowledgments to all the Heroes (Past/Current Students and Staff)

Current Students (Graduate), Current Research Scientists, Current Post-docs, Current Senior Research Associate, Current Research Specialist, Current Software:
– Q. Anthony (Ph.D.) – K. S. Khorassani (Ph.D.) – N. Sarkauskas (Ph.D.) – A. Shafi – M. S. Ghazimeersaeed – M. Bayatpour (Ph.D.) – P. Kousha (Ph.D.) – S. Srivastava (M.S.) – H. Subramoni – K. Manian – C.-H. Chu (Ph.D.) – N. S. Kumar (M.S.) – S. Xu (Ph.D.) – A. Jain (Ph.D.) – B. Ramesh (Ph.D.) – Q. Zhou (Ph.D.) – J. Hashmi – J. Smith – M. Kedia (M.S.) – K. K. Suresh (Ph.D.) – A. Reifsteck

Past Students, Past Research Scientists, Past Programmers, Past Research Specialist, Past Post-Docs:
– A. Awan (Ph.D.) – T. Gangadharappa (M.S.) – P. Lai (M.S.) – R. Rajachandrasekar (Ph.D.) – K. Hamidouche – A. Augustine (M.S.) – K. Gopalakrishnan (M.S.) – J. Liu (Ph.D.) – D. Shankar (Ph.D.) – S. Sur – P. Balaji (Ph.D.) – J. Hashmi (Ph.D.) – M. Luo (Ph.D.) – G. Santhanaraman (Ph.D.) – N. Sarkauskas (B.S.) – R. Biswas (M.S.) – W. Huang (Ph.D.) – A. Mamidala (Ph.D.) – X. Lu – A. Singh (Ph.D.) – S. Bhagvat (M.S.) – W. Jiang (M.S.) – G. Marsh (M.S.) – J. Jose (Ph.D.) – J. Sridhar (M.S.) – A. Bhat (M.S.) – . Meshram (M.S.) – D. Bureddy – S. Kini (M.S.) – A. Moody (M.S.) – S. Sur (Ph.D.) – D. Buntinas (Ph.D.) – J. Perkins – H. Subramoni (Ph.D.) – L. Chai (Ph.D.) – M. Koop (Ph.D.) – S. Naravula (Ph.D.) – K. Vaidyanathan (Ph.D.) – B. Chandrasekharan (M.S.) – K. Kulkarni (M.S.) – R. Noronha (Ph.D.) – A. Vishnu (Ph.D.) – S. Chakraborthy (Ph.D.) – R. Kumar (M.S.) – X. Ouyang (Ph.D.) – M. Arnold – J. Wu (Ph.D.) – N. Dandapanthula (M.S.) – S. Krishnamoorthy (M.S.) – S. Pai (M.S.) – W. Yu (Ph.D.) – V. Dhanraj (M.S.) – K. Kandalla (Ph.D.) – S. Potluri (Ph.D.) – M. Li (Ph.D.) – J. Zhang (Ph.D.) – K. Raj (M.S.) – D. Banerjee – J. Lin – S. Marcarelli – H. Wang – X. Besseron – M. Luo – A. Ruhela – H.-W. Jin – E. Mancini – J. Vienne

Multiple Positions Available in MVAPICH2, DL/ML, and BigData Projects in my Group

• Looking for bright and enthusiastic personnel to join as
  – PhD Students
  – Post-Doctoral Researchers
  – MPI Programmer/Software Engineer
  – Hadoop/Spark/Big Data Programmer/Software Engineer
  – Deep Learning and Cloud Programmer/Software Engineer
• If interested, please send an e-mail to [email protected]

Thank You!
[email protected]

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
