
HPC and AI Middleware for Exascale Systems and Clouds
Talk at HPC-AI Advisory Council Japan Conference (January '21)
by Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Follow us on https://twitter.com/mvapich

Increasing Usage of HPC, Big Data and Deep/Machine Learning
• HPC (MPI, PGAS, etc.)
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep/Machine Learning (TensorFlow, PyTorch, BigDL, cuML, etc.)
• Convergence of HPC, Big Data, and Deep/Machine Learning!
• Increasing need to run these applications on the cloud!!

Presentation Overview
• MVAPICH Project
  – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project
  – High-Performance Deep Learning
  – High-Performance Machine Learning
• Optimizations and Deployments in Public Cloud
  – AWS and Azure
• Conclusions

Overview of the MVAPICH2 Project
• High-performance open-source MPI library
• Support for multiple interconnects
  – InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), and AWS EFA
• Support for multiple platforms
  – x86, OpenPOWER, ARM, Xeon Phi, GPGPUs (NVIDIA and AMD)
• Started in 2001; first open-source version demonstrated at SC '02
• Supports the latest MPI-3.1 standard
• http://mvapich.cse.ohio-state.edu
• Used by more than 3,125 organizations in 89 countries
• More than 1.2 million downloads directly from the OSU site
• Empowering many TOP500 clusters (Nov '20 ranking), including:
  – 4th: 10,649,600-core Sunway TaihuLight at NSC Wuxi, China
  – 9th: 448,448-core Frontera at TACC
  – 14th: 391,680-core ABCI in Japan
  – 21st: 570,020-core Nurion in South Korea, and many others
• Additional optimized versions for different systems/environments:
  – MVAPICH2-X (Advanced MPI + PGAS), since 2011
  – MVAPICH2-GDR with support for NVIDIA GPGPUs, since 2014
  – MVAPICH2-MIC with support for Intel Xeon Phi, since 2014
  – MVAPICH2-Virt with virtualization support, since 2015
  – MVAPICH2-EA with support for energy awareness, since 2015
  – MVAPICH2-Azure for Azure HPC IB instances, since 2019
  – MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
• Tools:
  – OSU MPI Micro-Benchmarks (OMB), since 2003
  – OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
• Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
• Partner in the 9th-ranked TACC Frontera system
• Empowering Top500 systems for more than 16 years

Architecture of MVAPICH2 Software Family for HPC and DL/ML
• High-performance parallel programming models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High-performance, scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, job startup, energy awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, and introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
  – Transport protocols: RC, SRD, UD, DC; modern features: SR-IOV, multi-rail, UMR, ODP
  – Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM; modern features: Optane*, NVLink, CAPI* (* upcoming)

MVAPICH2 Software Family (requirement and the matching library):
  – MPI with IB, iWARP, Omni-Path, and RoCE: MVAPICH2
  – Advanced MPI features/support, OSU INAM, PGAS, and MPI+PGAS with IB, Omni-Path, and RoCE: MVAPICH2-X
  – MPI with IB, RoCE & GPU, and support for deep/machine learning: MVAPICH2-GDR
  – HPC cloud with MPI & IB: MVAPICH2-Virt
  – Energy-aware MPI with IB, iWARP, and RoCE: MVAPICH2-EA
  – MPI energy monitoring tool: OEMT
  – InfiniBand network analysis and monitoring: OSU INAM
  – Microbenchmarks for measuring MPI and PGAS performance: OMB

Startup Performance on TACC Frontera
[Charts: MPI_Init time for MVAPICH2 2.3.4 vs. Intel MPI 2020, from 56 to 229,376 processes; MVAPICH2 is up to 14X and 46X faster at the two scales shown]
• MPI_Init takes 31 seconds on 229,376 processes on 4,096 nodes
• All numbers reported with 56 processes per node
• New designs available since MVAPICH2 2.3.4
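Measurements like the ones above are what OMB's startup benchmarks (e.g., osu_init) are built for. As a minimal sketch of the same idea, the toy program below times MPI_Init with a monotonic clock and reports the slowest rank, since job startup is gated by the last process to finish initialization; the file name is illustrative, not part of MVAPICH2 or OMB.

    /* init_time.c - rough sketch: time MPI_Init and report the maximum
     * across ranks. Compile with: mpicc -o init_time init_time.c */
    #include <mpi.h>
    #include <stdio.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);   /* before any MPI activity */
        MPI_Init(&argc, &argv);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Startup is limited by the slowest process, so reduce with MAX. */
        double max_sec;
        MPI_Reduce(&sec, &max_sec, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("MPI_Init time (max across ranks): %.3f s\n", max_sec);

        MPI_Finalize();
        return 0;
    }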
Hardware Multicast-aware MPI_Bcast on TACC Frontera
[Charts: MPI_Bcast latency, Default vs. Multicast — vs. message size (2B to 512KB) at 2K nodes with PPN=28 (up to 1.9X and 2X better), and vs. node count (2 to 2K) for 16B and 32KB messages with PPN=28 (up to 1.8X and 2X better)]
• MCAST-based designs improve the latency of MPI_Bcast by up to 2X at 2,048 nodes
• Use MV2_USE_MCAST=1 to enable the MCAST-based designs (see the combined sketch after the SHARP results below)

Performance of Collectives with SHARP on TACC Frontera
[Charts: latency of MPI_Reduce and MPI_Allreduce (PPN=1, nodes=7,861) vs. message size, and MPI_Barrier latency vs. node count (4 to 7,861), MVAPICH2-X vs. MVAPICH2-X-SHARP]
• Optimized SHARP designs in MVAPICH2-X
• Up to 9X performance improvement with SHARP over the MVAPICH2-X default for 1-PPN MPI_Barrier, 6X for 1-PPN MPI_Reduce, and 5X for 1-PPN MPI_Allreduce
• Optimized runtime parameter: MV2_ENABLE_SHARP=1
• B. Ramesh, K. Suresh, N. Sarkauskas, M. Bayatpour, J. Hashmi, H. Subramoni, and D. K. Panda, "Scalable MPI Collectives using SHARP: Large Scale Performance Evaluation on the TACC Frontera System," ExaMPI2020 - Workshop on Exascale MPI 2020, Nov 2020.
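Both collective optimizations above are selected purely at run time; the application stays plain MPI. As a minimal sketch (buffer sizes are illustrative; OMB's osu_bcast and osu_allreduce exercise the same code paths):

    /* collectives.c - plain MPI_Bcast and MPI_Allreduce; MVAPICH2 picks
     * the hardware-multicast and SHARP designs at run time via
     * environment variables, with no source changes. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int n = 4096;                         /* illustrative message size */
        double *buf = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) buf[i] = rank;
        double local = rank, global = 0.0;

        /* Served by IB hardware multicast when MV2_USE_MCAST=1 is set. */
        MPI_Bcast(buf, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Offloaded to switch-based SHARP when MV2_ENABLE_SHARP=1 is set. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        free(buf);
        MPI_Finalize();
        return 0;
    }

With MVAPICH2's mpirun_rsh launcher, runtime parameters are passed on the command line, e.g. mpirun_rsh -np <N> -hostfile hosts MV2_USE_MCAST=1 MV2_ENABLE_SHARP=1 ./collectives (node and process counts here are placeholders); the designs take effect only where the corresponding fabric hardware is present.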
Support for ARM A64FX with InfiniBand (Ookami)
[Charts: MVAPICH2-X vs. OpenMPI on A64FX — inter-node latency, inter-node bidirectional bandwidth, intra-node latency (1B to 4MB), and MPI_Bcast latency (8 nodes, 48 PPN)]

Optimized MVAPICH2-GDR with CUDA-Aware MPI Support
[Charts: GPU-GPU inter-node latency (1.85 us, about 10X better than without GDR), bidirectional bandwidth (up to 11X), and bandwidth (up to 9X), MVAPICH2-GDR 2.3 vs. MVAPICH2 without GDR]
• Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, and Mellanox OFED 4.0 with GPU-Direct RDMA
• A CUDA-aware usage sketch appears at the end of this section

MVAPICH2-GDR ROCm Support for AMD GPUs
[Charts: intra-node and inter-node point-to-point latency; Allreduce and Bcast on 64 GPUs (8 nodes, 8 GPUs per node)]
• Corona cluster, ROCm 3.9.0 (MI50 AMD GPUs)
• Available with MVAPICH2-GDR 2.3.5

Presentation Overview
• MVAPICH Project
  – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project
  – High-Performance Deep Learning
  – High-Performance Machine Learning
• Optimizations and Deployments in Public Cloud
  – AWS and Azure
• Conclusions

MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training
• ML/DL applications: TensorFlow, PyTorch, and MXNet over Horovod; PyTorch over torch.distributed and DeepSpeed
• MVAPICH2 or MVAPICH2-X for CPU training; MVAPICH2-GDR for GPU training
• More details available from: http://hidl.cse.ohio-state.edu

Multiple Approaches taken up by OSU
• MPI-driven Deep Learning
  – CPU-based Deep Learning
  – GPU-based Deep Learning
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology
• Accelerating Dask and cuML Applications

Distributed TensorFlow on TACC Frontera (2,048 CPU nodes with 114,688 cores)
• Scaled TensorFlow to 2,048 nodes on Frontera using MVAPICH2
• MVAPICH2 and Intel MPI give similar performance for DNN training
• Reported a peak of 260,000 images/sec on 2,048 nodes
• On 2,048 nodes, ResNet-50 can be trained in 7 minutes!
• A. Jain, A. A. Awan, H. Subramoni, and D. K. Panda, "Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera," DLS '19 (SC '19 workshop).
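The CUDA-aware support shown earlier is what Horovod-style data-parallel training leans on: device pointers are passed straight into MPI calls, and MVAPICH2-GDR moves the data with GPUDirect RDMA where available. Below is a minimal hedged sketch of that gradient-averaging pattern, not the HiDL stack itself; the buffer name and size are illustrative, and it assumes MVAPICH2-GDR with CUDA support enabled at run time (MV2_USE_CUDA=1).

    /* grad_allreduce.c - sketch: sum a GPU-resident buffer across ranks,
     * the core step of data-parallel gradient averaging. Build with
     * MVAPICH2-GDR's mpicc plus the CUDA runtime, e.g.:
     *   mpicc grad_allreduce.c -o grad_allreduce -lcudart */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int n = 1 << 20;            /* illustrative "gradient" length */
        float *d_grad = NULL;
        cudaMalloc((void **)&d_grad, n * sizeof(float));
        cudaMemset(d_grad, 0, n * sizeof(float));

        /* CUDA-aware MPI: the device pointer goes directly into the call,
         * with no explicit staging through host memory. Dividing the
         * result by the communicator size would complete the average. */
        MPI_Allreduce(MPI_IN_PLACE, d_grad, n, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

        cudaFree(d_grad);
        MPI_Finalize();
        return 0;
    }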