
HPC and AI Middleware for Exascale Systems and Clouds
Talk at HPC-AI Advisory Council Japan Conference (January '21)
by Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Follow us on https://twitter.com/mvapich

Increasing Usage of HPC, Big Data and Deep/Machine Learning
• HPC (MPI, PGAS, etc.)
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep/Machine Learning (TensorFlow, PyTorch, BigDL, cuML, etc.)
• Convergence of HPC, Big Data, and Deep/Machine Learning!
• Increasing need to run these applications on the cloud!!

Presentation Overview
• MVAPICH Project
  – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project
  – High-Performance Deep Learning
  – High-Performance Machine Learning
• Optimizations and Deployments in Public Cloud
  – AWS and Azure
• Conclusions

Overview of the MVAPICH2 Project
• High-performance open-source MPI library
• Support for multiple interconnects
  – InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), and AWS EFA
• Support for multiple platforms
  – x86, OpenPOWER, ARM, Xeon Phi, GPGPUs (NVIDIA and AMD)
• Started in 2001; first open-source version demonstrated at SC '02
• Supports the latest MPI-3.1 standard
• http://mvapich.cse.ohio-state.edu
• Used by more than 3,125 organizations in 89 countries
• More than 1.2 million downloads directly from the OSU site
• Empowering many TOP500 clusters (Nov '20 ranking), including:
  – 4th: 10,649,600-core Sunway TaihuLight at NSC Wuxi, China
  – 9th: 448,448-core Frontera at TACC
  – 14th: 391,680-core ABCI in Japan
  – 21st: 570,020-core Nurion in South Korea, and many others
• Additional optimized versions for different systems/environments:
  – MVAPICH2-X (Advanced MPI + PGAS), since 2011
  – MVAPICH2-GDR with support for NVIDIA GPGPUs, since 2014
  – MVAPICH2-MIC with support for Intel Xeon Phi, since 2014
  – MVAPICH2-Virt with virtualization support, since 2015
  – MVAPICH2-EA with support for energy awareness, since 2015
  – MVAPICH2-Azure for Azure HPC IB instances, since 2019
  – MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
• Tools:
  – OSU MPI Micro-Benchmarks (OMB), since 2003
  – OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
• Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
• Partner in the 9th-ranked TACC Frontera system
• Empowering Top500 systems for more than 16 years

Architecture of MVAPICH2 Software Family for HPC and DL/ML
• High-performance parallel programming models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High-performance, scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collectives algorithms, job startup, energy awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, and introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
  – Transport protocols: RC, SRD, UD, DC; modern features: SR-IOV, multi-rail, UMR, ODP
  – Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM; modern features: Optane*, NVLink, CAPI* (* upcoming)

MVAPICH2 Software Family (requirement and the matching library):
  – MPI with IB, iWARP, Omni-Path, and RoCE: MVAPICH2
  – Advanced MPI features/support, OSU INAM, PGAS, and MPI+PGAS with IB, Omni-Path, and RoCE: MVAPICH2-X
  – MPI with IB, RoCE & GPU, and support for deep/machine learning: MVAPICH2-GDR
  – HPC cloud with MPI & IB: MVAPICH2-Virt
  – Energy-aware MPI with IB, iWARP, and RoCE: MVAPICH2-EA
  – MPI energy monitoring tool: OEMT
  – InfiniBand network analysis and monitoring: OSU INAM
  – Microbenchmarks for measuring MPI and PGAS performance: OMB

Startup Performance on TACC Frontera
[Charts: MPI_Init time for MVAPICH2 2.3.4 vs. Intel MPI 2020, from 56 to 229,376 processes; MVAPICH2 is up to 14X and 46X faster at the two scales shown]
• MPI_Init takes 31 seconds on 229,376 processes on 4,096 nodes
• All numbers reported with 56 processes per node
• New designs available since MVAPICH2 2.3.4
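Measurements like the ones above are what OMB's startup benchmarks (e.g., osu_init) are built for. As a minimal sketch of the same idea, the toy program below times MPI_Init with a monotonic clock and reports the slowest rank, since job startup is gated by the last process to finish initialization; the file name is illustrative, not part of MVAPICH2 or OMB.

    /* init_time.c - rough sketch: time MPI_Init and report the maximum
     * across ranks. Compile with: mpicc -o init_time init_time.c */
    #include <mpi.h>
    #include <stdio.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);   /* before any MPI activity */
        MPI_Init(&argc, &argv);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Startup is limited by the slowest process, so reduce with MAX. */
        double max_sec;
        MPI_Reduce(&sec, &max_sec, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("MPI_Init time (max across ranks): %.3f s\n", max_sec);

        MPI_Finalize();
        return 0;
    }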
Hardware Multicast-aware MPI_Bcast on TACC Frontera
[Charts: MPI_Bcast latency, Default vs. Multicast — vs. message size (2B to 512KB) at 2K nodes with PPN=28 (up to 1.9X and 2X better), and vs. node count (2 to 2K) for 16B and 32KB messages with PPN=28 (up to 1.8X and 2X better)]
• MCAST-based designs improve the latency of MPI_Bcast by up to 2X at 2,048 nodes
• Use MV2_USE_MCAST=1 to enable the MCAST-based designs (see the combined sketch after the SHARP results below)

Performance of Collectives with SHARP on TACC Frontera
[Charts: latency of MPI_Reduce and MPI_Allreduce (PPN=1, nodes=7,861) vs. message size, and MPI_Barrier latency vs. node count (4 to 7,861), MVAPICH2-X vs. MVAPICH2-X-SHARP]
• Optimized SHARP designs in MVAPICH2-X
• Up to 9X performance improvement with SHARP over the MVAPICH2-X default for 1-PPN MPI_Barrier, 6X for 1-PPN MPI_Reduce, and 5X for 1-PPN MPI_Allreduce
• Optimized runtime parameter: MV2_ENABLE_SHARP=1
• B. Ramesh, K. Suresh, N. Sarkauskas, M. Bayatpour, J. Hashmi, H. Subramoni, and D. K. Panda, "Scalable MPI Collectives using SHARP: Large Scale Performance Evaluation on the TACC Frontera System," ExaMPI2020 - Workshop on Exascale MPI 2020, Nov 2020.
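Both collective optimizations above are selected purely at run time; the application stays plain MPI. As a minimal sketch (buffer sizes are illustrative; OMB's osu_bcast and osu_allreduce exercise the same code paths):

    /* collectives.c - plain MPI_Bcast and MPI_Allreduce; MVAPICH2 picks
     * the hardware-multicast and SHARP designs at run time via
     * environment variables, with no source changes. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int n = 4096;                         /* illustrative message size */
        double *buf = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) buf[i] = rank;
        double local = rank, global = 0.0;

        /* Served by IB hardware multicast when MV2_USE_MCAST=1 is set. */
        MPI_Bcast(buf, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Offloaded to switch-based SHARP when MV2_ENABLE_SHARP=1 is set. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        free(buf);
        MPI_Finalize();
        return 0;
    }

With MVAPICH2's mpirun_rsh launcher, runtime parameters are passed on the command line, e.g. mpirun_rsh -np <N> -hostfile hosts MV2_USE_MCAST=1 MV2_ENABLE_SHARP=1 ./collectives (node and process counts here are placeholders); the designs take effect only where the corresponding fabric hardware is present.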
Support for ARM A64FX with InfiniBand (Ookami)
[Charts: MVAPICH2-X vs. OpenMPI on A64FX — inter-node latency, inter-node bidirectional bandwidth, intra-node latency (1B to 4MB), and MPI_Bcast latency (8 nodes, 48 PPN)]

Optimized MVAPICH2-GDR with CUDA-Aware MPI Support
[Charts: GPU-GPU inter-node latency (1.85 us, about 10X better than without GDR), bidirectional bandwidth (up to 11X), and bandwidth (up to 9X), MVAPICH2-GDR 2.3 vs. MVAPICH2 without GDR]
• Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, and Mellanox OFED 4.0 with GPU-Direct RDMA
• A CUDA-aware usage sketch appears at the end of this section

MVAPICH2-GDR ROCm Support for AMD GPUs
[Charts: intra-node and inter-node point-to-point latency; Allreduce and Bcast on 64 GPUs (8 nodes, 8 GPUs per node)]
• Corona cluster, ROCm 3.9.0 (MI50 AMD GPUs)
• Available with MVAPICH2-GDR 2.3.5

Presentation Overview
• MVAPICH Project
  – MPI and PGAS (MVAPICH) Library with CUDA-Awareness
• HiDL Project
  – High-Performance Deep Learning
  – High-Performance Machine Learning
• Optimizations and Deployments in Public Cloud
  – AWS and Azure
• Conclusions

MVAPICH2 (MPI)-driven Infrastructure for ML/DL Training
• ML/DL applications: TensorFlow, PyTorch, and MXNet over Horovod; PyTorch over torch.distributed and DeepSpeed
• MVAPICH2 or MVAPICH2-X for CPU training; MVAPICH2-GDR for GPU training
• More details available from: http://hidl.cse.ohio-state.edu

Multiple Approaches taken up by OSU
• MPI-driven Deep Learning
  – CPU-based Deep Learning
  – GPU-based Deep Learning
• Out-of-core DNN training
• Exploiting Hybrid (Data and Model) Parallelism
• Use-Case: AI-Driven Digital Pathology
• Accelerating Dask and cuML Applications

Distributed TensorFlow on TACC Frontera (2,048 CPU nodes with 114,688 cores)
• Scaled TensorFlow to 2,048 nodes on Frontera using MVAPICH2
• MVAPICH2 and Intel MPI give similar performance for DNN training
• Reported a peak of 260,000 images/sec on 2,048 nodes
• On 2,048 nodes, ResNet-50 can be trained in 7 minutes!
• A. Jain, A. A. Awan, H. Subramoni, and D. K. Panda, "Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera," DLS '19 (SC '19 workshop).
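The CUDA-aware support shown earlier is what Horovod-style data-parallel training leans on: device pointers are passed straight into MPI calls, and MVAPICH2-GDR moves the data with GPUDirect RDMA where available. Below is a minimal hedged sketch of that gradient-averaging pattern, not the HiDL stack itself; the buffer name and size are illustrative, and it assumes MVAPICH2-GDR with CUDA support enabled at run time (MV2_USE_CUDA=1).

    /* grad_allreduce.c - sketch: sum a GPU-resident buffer across ranks,
     * the core step of data-parallel gradient averaging. Build with
     * MVAPICH2-GDR's mpicc plus the CUDA runtime, e.g.:
     *   mpicc grad_allreduce.c -o grad_allreduce -lcudart */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int n = 1 << 20;            /* illustrative "gradient" length */
        float *d_grad = NULL;
        cudaMalloc((void **)&d_grad, n * sizeof(float));
        cudaMemset(d_grad, 0, n * sizeof(float));

        /* CUDA-aware MPI: the device pointer goes directly into the call,
         * with no explicit staging through host memory. Dividing the
         * result by the communicator size would complete the average. */
        MPI_Allreduce(MPI_IN_PLACE, d_grad, n, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

        cudaFree(d_grad);
        MPI_Finalize();
        return 0;
    }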