ECP Community BOF Days, March 30 – April 1, 2021
Hosts: Ashley Barker (ORNL) and Osni Marques (LBNL)

The MVAPICH2 Project: Latest Status and Future Plans

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

History of MVAPICH

• A long time ago, in a galaxy far, far away… (actually, 21 years ago), there existed:
• MPICH
  – High-performance and widely portable implementation of the MPI standard
  – From ANL
• MVICH
  – Implementation of MPICH ADI-2 for VIA
  – VIA – Virtual Interface Architecture (precursor to InfiniBand)
  – From LBL
• VAPI
  – Verbs-level API
  – Initial InfiniBand API from IB vendors (older version of OFED/IB verbs)

MPICH + MVICH + VAPI = MVAPICH

Overview of the MVAPICH2 Project

• High-performance open-source MPI
• Support for multiple interconnects
  – InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), and AWS EFA
• Support for multiple platforms
  – x86, OpenPOWER, ARM, Xeon Phi, GPGPUs (NVIDIA and AMD)
• Started in 2001, first open-source version demonstrated at SC '02
• Supports the latest MPI-3.1 standard
• Used by more than 3,150 organizations in 89 countries
• More than 1.28 million downloads from the OSU site directly
• http://mvapich.cse.ohio-state.edu
• Additional optimized versions for different systems/environments:
  – MVAPICH2-X (Advanced MPI + PGAS), since 2011
  – MVAPICH2-GDR with support for NVIDIA GPGPUs, since 2014
  – MVAPICH2-MIC with support for Intel Xeon Phi, since 2014
  – MVAPICH2-Virt with virtualization support, since 2015
  – MVAPICH2-EA with support for Energy-Awareness, since 2015
  – MVAPICH2-Azure for Azure HPC IB instances, since 2019
  – MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
• Tools:
  – OSU MPI Micro-Benchmarks (OMB), since 2003
  – OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
• Empowering many TOP500 clusters (Nov '20 ranking)
  – 4th: 10,649,600 cores (Sunway TaihuLight) at NSC, Wuxi, China
  – 9th: 448,448 cores (Frontera) at TACC
  – 14th: 391,680 cores (ABCI) in Japan
  – 21st: 570,020 cores (Nurion) in South Korea, and many others
• Available with the software stacks of many vendors and distros (RedHat, SuSE, OpenHPC, and Spack)
• Partner in the 9th-ranked TACC Frontera system
• Empowering Top500 systems for more than 16 years

MVAPICH2 Release Timeline and Downloads

[Chart: cumulative downloads of the MVAPICH/MVAPICH2 family from the OSU site, 2004 through December 2020 (approaching 1.4 million), annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2 2.3.5, MV2-X 2.3, MV2-GDR 2.3.5, MV2-MIC 2.0, MV2-Virt 2.2, MV2-Azure 2.3.2, MV2-AWS, and OSU INAM 0.9.6.]

Architecture of MVAPICH2 Software Family (HPC and DL)

High Performance Parallel Programming Models

• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid – MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime: Diverse APIs and Mechanisms

• Point-to-point Primitives
• Collectives Algorithms
• Energy-Awareness
• Remote Memory Access
• I/O and File Systems
• Fault Tolerance
• Active Messages
• Introspection & Analysis
• Job Startup
• Virtualization

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter):
• Transport Protocols: RC, SRD, UD, DC
• Modern Features: UMR, ODP, SR-IOV, Multi-Rail

Support for Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU):
• Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM
• Modern Features: Optane*, NVLink, CAPI*

* Upcoming

MVAPICH2 Software Family

Requirements → Library
• MPI with IB, iWARP, Omni-Path, and RoCE → MVAPICH2
• Advanced MPI features/support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE → MVAPICH2-X
• MPI with IB, RoCE & GPU, and support for Deep/Machine Learning → MVAPICH2-GDR
• HPC Cloud with MPI & IB → MVAPICH2-Virt
• Energy-aware MPI with IB, iWARP, and RoCE → MVAPICH2-EA
• MPI Energy Monitoring Tool → OEMT
• InfiniBand Network Analysis and Monitoring → OSU INAM
• Microbenchmarks for measuring MPI and PGAS performance → OMB
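MVAPICH2-X provides a unified runtime for MPI and PGAS models, so MPI and OpenSHMEM calls can be mixed in one application. The sketch below is a minimal, hedged illustration of that hybrid style; it assumes an OpenSHMEM 1.4-style API (shmem_long_atomic_inc) and a runtime, such as MVAPICH2-X, that permits initializing both models in the same program. It is not taken from the MVAPICH2 distribution.

```c
/* Hypothetical hybrid MPI + OpenSHMEM sketch (assumes a unified runtime
 * such as MVAPICH2-X where both models can be initialized together). */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);   /* MPI side of the hybrid program */
    shmem_init();             /* PGAS (OpenSHMEM) side          */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Symmetric heap allocation: every PE gets a matching buffer. */
    long *counter = (long *)shmem_malloc(sizeof(long));
    *counter = 0;
    shmem_barrier_all();

    /* One-sided PGAS update: every PE atomically increments PE 0's counter. */
    shmem_long_atomic_inc(counter, 0);
    shmem_barrier_all();

    /* Two-sided MPI collective in the same program. */
    long local = *counter, global = 0;
    MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("PE 0 counter = %ld, MPI_Reduce sum = %ld (size = %d)\n",
               *counter, global, size);

    shmem_free(counter);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```

The exact initialization and finalization ordering requirements depend on the runtime; the point of the unified-runtime design is that the two models share connections and resources instead of each bringing up its own network stack.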

MVAPICH2 – Basic MPI

[Charts: MPI_Init startup time on Frontera at small scale (56 to 1,792 processes) and large scale (3,584 to 229,376 processes), comparing MVAPICH2 2.3.4/2.3.5 against Intel MPI 2020, with speedups of up to 14X and 46X annotated; intra-node point-to-point latency on OpenPOWER for 4-byte to 2K-byte messages, MVAPICH2 2.3.4 vs. Spectrum MPI 10.3.1.00, about 0.2 us for small messages.]
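The latency figures quoted on these slides are of the kind reported by the OSU Micro-Benchmarks (e.g., osu_latency). As a rough illustration of what such a measurement does, here is a minimal hedged ping-pong sketch; the message size and iteration counts are arbitrary assumptions, and this is not the OMB source code.

```c
/* Minimal ping-pong latency sketch between ranks 0 and 1 (illustrative only;
 * the measurements in these slides come from the OSU Micro-Benchmarks). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MSG_SIZE 8        /* hypothetical small-message size in bytes */
#define SKIP     100      /* warm-up iterations, not timed            */
#define ITERS    10000    /* timed iterations                         */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[MSG_SIZE];
    memset(buf, 'a', MSG_SIZE);
    double start = 0.0;

    for (int i = 0; i < SKIP + ITERS; i++) {
        if (i == SKIP)
            start = MPI_Wtime();          /* start timing after warm-up */
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        double total = MPI_Wtime() - start;
        /* One-way latency = round-trip time / 2, reported in microseconds. */
        printf("%d-byte one-way latency: %.2f us\n",
               MSG_SIZE, total * 1e6 / (2.0 * ITERS));
    }

    MPI_Finalize();
    return 0;
}
```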

Allreduce Latency on OpenPOWER and MPI_Bcast using Multicast on Frontera

[Charts: MPI_Allreduce latency on OpenPOWER (1 node, 40 PPN) for 8K to 1M byte messages, MVAPICH2 2.3.5 vs. Spectrum MPI 10.3.1.00; MPI_Bcast on Frontera with hardware multicast vs. the default, at 2,048 nodes across 1 to 1K byte messages and at 32 bytes across 2 to 1K nodes; annotated improvements of 2X and 3X.]

MVAPICH2-X – Advanced MPI + PGAS + Tools

[Charts: MPI_Allreduce and MPI_Barrier on Frontera (1 PPN, 7,861 nodes) with MVAPICH2-X vs. MVAPICH2-X-SHARP, where SHARP in-network offload cuts latency by up to 9X and 5X; intra-node point-to-point latency on the ARM A64FX (Ookami), MVAPICH2-X up to 46% lower than OpenMPI; P3DFFT execution time with BlueField-2 DPU offload (MV2-BFO with 1, 2, and 4 worker processes per node) vs. the host-only MV2 BF2 HCA run on the HPC-AC cluster at grid sizes 1024 to 4096; and the overhead of the RC protocol for connection establishment and communication (MVAPICH2 vs. MVAPICH2-X) at 512 to 4,096 processes.]

MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA and AMD GPUs

[Charts: GPU-based MPI_Allreduce on Summit, MVAPICH2-GDR 2.3.5 up to 1.7X better than NCCL 2.6 at a 128 MB message size; best performance for GPU-based transfers, about 1.85 us latency and up to 10x over the non-GDR path (MV2-NO-GDR vs. MV2-GDR 2.3.5); TensorFlow training with MVAPICH2-GDR on Summit from 12 to 1,536 GPUs (vs. Spectrum MPI 10.3, OpenMPI 4.0.1, and NCCL 2.6), with a time per epoch of 3 seconds, so 90 epochs take 3 x 90 = 270 s, i.e., 4.5 minutes.]
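MVAPICH2-GDR lets applications pass GPU device pointers directly to MPI calls, which is how results like the GPU-based Allreduce above are obtained. The following is a minimal hedged sketch of a GPU-aware MPI_Allreduce on CUDA device buffers; the buffer size and device-selection logic are illustrative assumptions, not MVAPICH2-GDR-specific code.

```c
/* Illustrative GPU-aware MPI sketch: device buffers passed directly to
 * MPI_Allreduce. Assumes a GPU-aware MPI build (e.g., MVAPICH2-GDR). */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Simple device selection for multi-GPU nodes (illustrative). */
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 0)
        cudaSetDevice(rank % ndev);

    const size_t count = 1 << 20;            /* 1M floats (~4 MB), illustrative */
    float *d_send, *d_recv;
    cudaMalloc((void **)&d_send, count * sizeof(float));
    cudaMalloc((void **)&d_recv, count * sizeof(float));
    cudaMemset(d_send, 0, count * sizeof(float));   /* dummy payload */

    /* GPU-aware MPI: device pointers go straight into the collective,
     * avoiding explicit cudaMemcpy staging through host memory.       */
    MPI_Allreduce(d_send, d_recv, (int)count, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("Allreduce on %zu device-resident floats completed\n", count);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```

When run under MVAPICH2-GDR, CUDA support is enabled at launch time via a runtime flag documented in the MVAPICH2-GDR user guide (e.g., MV2_USE_CUDA=1); ROCm builds take the analogous HIP path with device buffers allocated through the ROCm runtime.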

ROCm Support for AMD GPUs (Available with MVAPICH2-GDR 2.3.5)

[Charts: LLNL Corona cluster, ROCm 4.0.0 with MI50 AMD GPUs – intra-node and inter-node point-to-point latency, plus Allreduce and Broadcast latency on 128 GPUs (16 nodes, 8 GPUs per node); OpenMPI/UCX reports an error at message sizes above 256 KB.]

PyTorch at Scale: Training ResNet-50 on 256 GPUs

LLNL Lassen (about 10,000 images/sec faster than NCCL-based training)

Images/sec on 256 GPUs by distributed framework and communication backend:
• Torch.distributed – NCCL 2.7: 61,794; MVAPICH2-GDR: 72,120
• Horovod – NCCL 2.7: 74,063; MVAPICH2-GDR: 84,659
• DeepSpeed – NCCL 2.7: 80,217; MVAPICH2-GDR: 88,873

MVAPICH2-X Advanced Support for HPC Clouds

[Charts: WRF 3.6 execution time. On Amazon EFA (36 to 1,152 processes): OpenMPI vs. Intel MPI vs. MVAPICH2-X-AWS; instance type c5n.18xlarge, Intel Xeon Platinum 8124M @ 3.00 GHz, MVAPICH2-X-AWS v2.3, Open MPI v4.0.3 with libfabric 1.9, Intel MPI 2019.7.217. On Microsoft Azure (120 to 960 processes): MVAPICH2 vs. MVAPICH2-X+XPMEM; VM type HBv2, AMD EPYC 7V12 @ 2.45 GHz, MVAPICH2-Azure 2.3.3, MVAPICH2-X 2.3rc3. Annotated improvements of 27%, 41%, 47%, and 58%.]

• MVAPICH2-X-AWS 2.3
  – Released on 09/24/2020
  – Major Features and Enhancements
    – Based on MVAPICH2-X 2.3
    – Support for the Amazon EFA adapter's Scalable Reliable Datagram (SRD)
    – Support for XPMEM-based intra-node communication for point-to-point and collectives
    – Enhanced tuning for point-to-point and collective operations
    – Targeted for AWS instances with the Amazon Linux 2 AMI and EFA support
    – Tested with c5n and p3d instance types and different operating systems

• MVAPICH2-Azure 2.3.3
  – Released on 05/20/2020
  – Integrated into the Azure CentOS HPC images: https://github.com/Azure/azhpc-images/releases/tag/centos-7.6-hpc-20200417
  – Major Features and Enhancements
    – Based on MVAPICH2 2.3.3 and MVAPICH2-X 2.3rc3
    – Enhanced tuning for point-to-point and collective operations
    – Targeted for and tested with Azure HB and HC virtual machine instances
  – Detailed User Guide: http://mvapich.cse.ohio-state.edu/userguide/mv2-azure/

MVAPICH2 – Future Roadmap and Plans for Exascale
• Update to the MPICH 3.3 CH3 channel (2021)
• Initial support for the CH4 channel (2021/2022)
• Making the CH4 channel the default (2022/2023)
• Performance and memory scalability toward 1M–10M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
  – MPI + Task*
• Enhanced optimization for GPUs and FPGAs*
• Taking advantage of advanced features of Mellanox InfiniBand
  – Tag Matching*
  – Adapter Memory*
• Enhanced communication schemes for upcoming architectures
  – GenZ*
  – CAPI*
• Extended topology-aware collectives
• Extended energy-aware designs and virtualization support
• Extended support for the MPI Tools Interface (as in MPI 3.0)
• Extended FT support
• Support for features marked with * will be available in future MVAPICH2 releases
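One roadmap item above is extended support for the MPI Tools Interface (MPI_T), which MPI 3.0 added so that tools can query an implementation's control and performance variables. Below is a minimal hedged sketch using the standard MPI_T calls; how many variables appear, and what they are named, depend entirely on the MPI library being used.

```c
/* List the control variables an MPI library exposes through the MPI 3.0
 * MPI_T tools interface (variable names and counts are implementation-specific). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided = 0, num_cvars = 0;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&num_cvars);
    printf("This MPI library exposes %d control variables via MPI_T\n",
           num_cvars);

    for (int i = 0; i < num_cvars && i < 10; i++) {   /* show the first few */
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, bind, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &bind, &scope);
        printf("  cvar %d: %s\n", i, name);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```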

Thank You!

[email protected]
http://web.cse.ohio-state.edu/~panda

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/