High-Performance MPI and Deep Learning on OpenPOWER Platform

Talk at OpenPOWER SUMMIT (Mar ‘18) by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Hari Subramoni
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~subramon

Increasing Usage of HPC, Big Data and Deep Learning

HPC (MPI, RDMA, Lustre, etc.), Big Data (Hadoop, Spark, HBase, Memcached, etc.), and Deep Learning (Caffe, TensorFlow, BigDL, etc.)

Convergence of HPC, Big Data, and Deep Learning!
Increasing need to run these applications on the cloud!

Parallel Programming Models Overview

[Figure: three abstract machine models, each with processes P1-P3 and per-process memory — Shared Memory Model (SHMEM, DSM), Distributed Memory Model (MPI), and Partitioned Global Address Space (PGAS) with a logical shared memory (Global Arrays, UPC, Chapel, X10, CAF, ...)]

• Programming models provide abstract machine models
• Models can be mapped on different types of systems
  – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance

Partitioned Global Address Space (PGAS) Models

• Key features
  – Simple shared-memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel
  – Libraries: OpenSHMEM, UPC++, Global Arrays

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI or PGAS based on their communication characteristics (a minimal sketch follows below)
• Benefits:
  – Best of the distributed memory computing model (MPI)
  – Best of the shared memory computing model (PGAS)

[Figure: an HPC application whose sub-kernels (Kernel 1 ... Kernel N) each use MPI or PGAS as appropriate]
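To make the hybrid model concrete, the sketch below mixes MPI collectives with OpenSHMEM one-sided communication in a single program. It is a minimal illustration only, assuming a unified-runtime library such as MVAPICH2-X; the buffer name and neighbor exchange are made up for the example, and the exact initialization/finalization ordering rules are defined by the library's user guide.

    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);      /* MPI portion of the hybrid program */
        shmem_init();                /* OpenSHMEM portion (shared, unified runtime) */

        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric heap allocation, remotely accessible by all PEs */
        long *win = (long *) shmem_malloc(sizeof(long));
        *win = me;
        shmem_barrier_all();

        /* One-sided OpenSHMEM get from the left neighbor */
        long left = 0;
        shmem_long_get(&left, win, 1, (me + npes - 1) % npes);

        /* MPI collective over the same set of processes */
        long sum = 0;
        MPI_Allreduce(&left, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
        if (me == 0) printf("sum of neighbor ranks = %ld\n", sum);

        shmem_free(win);
        shmem_finalize();
        MPI_Finalize();
        return 0;
    }

With MVAPICH2-X such a program would typically be compiled with the library's OpenSHMEM wrapper (e.g., oshcc) and launched with mpirun_rsh, so that both models share a single communication runtime.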

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: middleware stack with co-design opportunities and challenges across the various layers]
• Application kernels/applications
• Programming models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication library or runtime for programming models: point-to-point communication, collective communication, energy-awareness, synchronization and locks, I/O and file systems, fault tolerance
• Hardware: networking technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path), multi-/many-core architectures, accelerators (GPU and FPGA)
• Cross-cutting goals: performance, scalability, resilience

MPI, PGAS and Deep Learning Support for OpenPOWER

• Message Passing Interface (MPI) Support
• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
• Exploiting Accelerators (NVIDIA GPGPUs)
• Support for Deep Learning

Support for Big Data (Hadoop and Spark) on OpenPOWER (Talk at 2:15 pm)

Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 2,875 organizations in 85 countries
  – More than 455,000 (> 0.45 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '17 ranking)
    • 1st, 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
    • 9th, 556,104-core Oakforest-PACS in Japan
    • 12th, 368,928-core Stampede2 at TACC
    • 17th, 241,108-core Pleiades at NASA
    • 48th, 76,032-core Tsubame 2.5 at Tokyo Institute of Technology
  – Available with the software stacks of many vendors, Linux distros (RedHat and SuSE), and OpenHPC
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade

MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC clouds
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC (being updated with KNL-based designs)

Microbenchmarks
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools
• OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications

MVAPICH2 Release Timeline and Downloads

[Figure: cumulative number of downloads from the OSU site (2004-2018), annotated with major releases from MV 0.9.0 through MV2 1.x, MV2 2.3b/2.3rc1, MV2-X 2.3b, MV2-GDR 2.3a, MV2-Virt 2.2, MV2-MIC 2.0, and OSU INAM 0.9.3]

Architecture of MVAPICH2 Software Family

High-Performance Parallel Programming Models
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High-Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms
• Point-to-point primitives, collectives algorithms, job startup, remote memory access, energy-awareness, I/O and file systems, fault tolerance, virtualization, active messages, introspection & analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) and Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU)
• Transport protocols: RC, XRC, UD, DC
• Modern networking features: UMR, ODP, SR-IOV, multi-rail
• Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM*
• Modern architecture features: MCDRAM*, NVLink, CAPI*

* Upcoming

OpenPOWER Platform Support in MVAPICH2 Libraries

• MVAPICH2
  – Basic MPI support
    • since MVAPICH2 2.2rc1 (March 2016)
• MVAPICH2-X
  – PGAS (OpenSHMEM and UPC) and hybrid MPI+PGAS support
    • since MVAPICH2-X 2.2b (March 2016)
  – Advanced collective support with CMA
    • since MVAPICH2-X 2.3b (Oct 2017)
• MVAPICH2-GDR
  – NVIDIA GPGPU support with NVLink (CORAL-like systems)
    • since MVAPICH2-GDR 2.3a (Nov 2017)

MPI, PGAS and Deep Learning Support for OpenPOWER

• Message Passing Interface (MPI) Support
  – Point-to-point Inter-node and Intra-node
  – CMA-based Collectives
  – Upcoming XPMEM Support
• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
• Exploiting Accelerators (NVIDIA GPGPUs)
• Support for Deep Learning

Intra-node Point-to-Point Performance on OpenPower

[Figure: intra-socket small/large-message latency, bandwidth, and bi-directional bandwidth comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0; MVAPICH2 delivers about 0.30 us small-message latency]

Platform: Two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA

Inter-node Point-to-Point Performance on OpenPower

[Figure: inter-node small/large-message latency, bandwidth, and bi-directional bandwidth comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0]

Platform: Two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA

Multi-Pair Latency Comparison

[Figure: intra-node (PPN=20) and inter-node (Nodes=2, PPN=20) multi-pair latency comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM]

• Basic point-to-point multi-pair latency comparison
  – Up to 48% and 45% performance improvement over Spectrum MPI and OpenMPI for intra-/inter-node

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Multi-pair Bandwidth Comparison

[Figure: intra-node (PPN=20) and inter-node (Nodes=2, PPN=20) multi-pair bandwidth comparing MV2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MV2-XPMEM]

• Basic point-to-point multi-pair bandwidth comparison
  – Up to 83% and 62% performance improvement over Spectrum MPI and OpenMPI for intra-/inter-node

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Process Mapping Support in MVAPICH2 for Multi-Core Platforms

Process-mapping support in MVAPICH2 (available since v1.4)

• Preset binding policies
  – Policy: bunch (default), scatter, hybrid
  – Granularity: core (default), socket, numanode
• User-defined binding
  – MPI rank-to-core binding
• MVAPICH2 detects the processor architecture at job launch

Process and Thread Binding Policies in Hybrid MPI+Threads

• A new process binding policy – "hybrid"
  – MV2_CPU_BINDING_POLICY = hybrid
• A new environment variable for co-locating threads with MPI processes
  – MV2_THREADS_PER_PROCESS = k
  – Automatically set to OMP_NUM_THREADS if OpenMP is being used
  – Provides a hint to the MPI runtime to spare resources for application threads
• New variable for thread binding with respect to the parent process and architecture
  – MV2_THREADS_BINDING_POLICY = {linear | compact | spread}
    • linear – binds MPI ranks and OpenMP threads sequentially (one after the other)
      – Recommended on non-hyperthreaded systems with MPI+OpenMP
    • compact – binds each MPI rank to a physical core and places its OpenMP threads on the hardware threads
      – Recommended on multi-/many-core systems, e.g., KNL, POWER8, and hyper-threaded Xeon
  (A sample launch line using these variables appears after the User-Defined Process Mapping slide below.)

User-Defined Process Mapping

• User has complete control over process mapping
• To run 4 processes on cores 0, 1, 4, 5:
  – $ mpirun_rsh -np 4 -hostfile hosts MV2_CPU_MAPPING=0:1:4:5 ./a.out
• Use ',' or '-' to bind to a set of cores:
  – $ mpirun_rsh -np 64 -hostfile hosts MV2_CPU_MAPPING=0,2-4:1:5:6 ./a.out
• Is process binding working as expected?
  – MV2_SHOW_CPU_BINDING=1
    • Displays CPU binding information
    • Launcher independent
    • Example: MV2_SHOW_CPU_BINDING=1 MV2_CPU_BINDING_POLICY=scatter
        -------------CPU AFFINITY-------------
        RANK:0  CPU_SET:  0
        RANK:1  CPU_SET:  8

• Refer to the "Running with Efficient CPU (Core) Mapping" section of the MVAPICH2 user guide for more information
  – http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3rc1-userguide.html#x1-600006.5
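As a concrete illustration of the hybrid binding knobs above, the launch line below starts a two-node, four-rank MPI+OpenMP job with 10 threads per rank (20 cores per node). The host file, process counts, and binary name are placeholders, and the exact set of variables to pass should be checked against the MVAPICH2 user guide for the release in use.

    $ mpirun_rsh -np 4 -hostfile hosts \
          OMP_NUM_THREADS=10 \
          MV2_CPU_BINDING_POLICY=hybrid \
          MV2_THREADS_PER_PROCESS=10 \
          MV2_THREADS_BINDING_POLICY=compact \
          MV2_SHOW_CPU_BINDING=1 ./a.out

Setting MV2_SHOW_CPU_BINDING=1 prints the resulting affinity, so the compact placement can be verified before running at scale.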

MPI, PGAS and Deep Learning Support for OpenPOWER

• Message Passing Interface (MPI) Support
  – Point-to-point Inter-node and Intra-node
  – CMA-based Collectives
  – Upcoming XPMEM Support
• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
• Exploiting Accelerators (NVIDIA GPGPUs)
• Support for Deep Learning

Scalable Collectives with CMA (Intra-node, Reduce & Alltoall)

[Figure: intra-node (Nodes=1, PPN=20) Reduce and Alltoall latency comparing MVAPICH2-X-2.3b, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0]

Up to 5X and 3X performance improvement by MVAPICH2-X-2.3b for small and large messages, respectively

Scalable Collectives with CMA (Intra-node, Gather & Scatter)

[Figure: intra-node (Nodes=1, PPN=20) Gather and Scatter latency comparing MVAPICH2-X-2.3b, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0]

Up to 24X and 15X performance improvement by MVAPICH2-X-2.3b for medium to large messages, respectively

Scalable Collectives with CMA (Multi-node, Reduce & Alltoall)

[Figure: multi-node (Nodes=4, PPN=20) Reduce and Alltoall latency comparing MVAPICH2-X-2.3b and OpenMPI-3.0.0]

Up to 12.4X and 8.5X performance improvement by MVAPICH2-X-2.3b for small and large messages, respectively

Scalable Collectives with CMA (Multi-node, Gather & Scatter)

[Figure: multi-node (Nodes=4, PPN=20) Gather and Scatter latency comparing MVAPICH2-X-2.3b and OpenMPI-3.0.0]

Up to 17.8X and 3.9X performance improvement by MVAPICH2-X-2.3b for medium to large messages, respectively

MPI, PGAS and Deep Learning Support for OpenPOWER

• Message Passing Interface (MPI) Support
  – Point-to-point Inter-node and Intra-node
  – CMA-based Collectives
  – Upcoming XPMEM Support
• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
• Exploiting Accelerators (NVIDIA GPGPUs)
• Support for Deep Learning

Optimized MVAPICH2 All-Reduce with XPMEM

[Figure: All-Reduce latency at Nodes=1 and Nodes=2 (PPN=20) comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM]

• Optimized MPI All-Reduce design in MVAPICH2
  – Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Optimized MVAPICH2 All-Reduce with XPMEM

[Figure: All-Reduce latency at Nodes=3 and Nodes=4 (PPN=20) comparing MVAPICH2-2.3rc1, OpenMPI-3.0.0, and MVAPICH2-XPMEM]

• Optimized MPI All-Reduce design in MVAPICH2
  – Up to 2X performance improvement over OpenMPI for inter-node (Spectrum MPI did not run for > 2 processes)

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Optimized MVAPICH2 Reduce with XPMEM

[Figure: Reduce latency at Nodes=1 and Nodes=2 (PPN=20) comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM]

• Optimized MPI Reduce design in MVAPICH2
  – Up to 3.1X performance improvement over OpenMPI at scale and up to 37% over Spectrum MPI for intra-node

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

Optimized MVAPICH2 Reduce with XPMEM

[Figure: Reduce latency at Nodes=3 and Nodes=4 (PPN=20) comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM]

• Optimized MPI Reduce design in MVAPICH2
  – Up to 7X performance improvement over OpenMPI at scale and up to 25% over MVAPICH2-2.3rc1

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

MiniAMR Performance using Optimized XPMEM-based Collectives

[Figure: MiniAMR execution time at 10, 20, 40, and 60 MPI processes (proc-per-sock=10, proc-per-node=20), MVAPICH2-2.3rc1 vs. MVAPICH2-XPMEM; 36-45% improvement]

• MiniAMR application execution time comparing MVAPICH2-2.3rc1 and the optimized All-Reduce design
  – Up to 45% improvement over MVAPICH2-2.3rc1 in the mesh-refinement time of MiniAMR for a weak-scaling workload on up to four POWER8 nodes

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=scatter

MPI, PGAS and Deep Learning Support for OpenPOWER

• Message Passing Interface (MPI) Support
• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
• Exploiting Accelerators (NVIDIA GPGPUs)
• Support for Deep Learning

MVAPICH2-X for Hybrid MPI + PGAS Applications

• Current model – separate runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI
  – Possible deadlock if both runtimes are not progressed
  – Consumes more network resources
• Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF
  – Available since 2012 (starting with MVAPICH2-X 1.9)
  – http://mvapich.cse.ohio-state.edu

OpenSHMEM PUT/GET using MVAPICH2-X on OpenPOWER

[Figure: intra-node (PPN=2) and inter-node (PPN=1) shmem_put/shmem_get latency for small and large messages]

• OpenSHMEM one-sided PUT/GET benchmark using MVAPICH2-X 2.3b on an OpenPOWER cluster (see the sketch below)
  – Near-native performance benefits are observed on OpenPOWER
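For readers unfamiliar with the one-sided operations being benchmarked, a minimal sketch appears below. The symmetric buffer name and message size are illustrative only, and it assumes an OpenSHMEM 1.2+ library such as the one shipped with MVAPICH2-X.

    #include <shmem.h>
    #include <string.h>
    #include <stdio.h>

    #define MSG_SIZE 512   /* illustrative size, within the small-message range above */

    int main(void)
    {
        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        /* Symmetric (remotely accessible) buffer on every PE */
        char *remote = (char *) shmem_malloc(MSG_SIZE);
        char  local[MSG_SIZE];
        memset(local, me, MSG_SIZE);
        shmem_barrier_all();

        int peer = (me + 1) % npes;
        shmem_putmem(remote, local, MSG_SIZE, peer);  /* one-sided put into peer's buffer */
        shmem_quiet();                                /* wait for completion of the put */
        shmem_barrier_all();

        shmem_getmem(local, remote, MSG_SIZE, peer);  /* one-sided get from peer's buffer */
        if (me == 0) printf("put/get of %d bytes completed\n", MSG_SIZE);

        shmem_free(remote);
        shmem_finalize();
        return 0;
    }

The OSU OpenSHMEM micro-benchmarks used for the numbers above follow the same pattern, timing repeated put/get operations between a pair of PEs.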

UPC PUT/GET benchmarks using MVAPICH2-X on OpenPOWER

[Figure: intra-node (PPN=2) and inter-node (PPN=1) upc_putmem/upc_getmem latency for small and large messages]

• UPC one-sided memput/memget benchmarks using MVAPICH2-X 2.3b on an OpenPOWER cluster
  – Near-native performance benefits are observed on OpenPOWER

UPC++ PUT/GET benchmarks using MVAPICH2-X on OpenPOWER

[Figure: intra-node (PPN=2) and inter-node (PPN=1) upcxx_async_put/upcxx_async_get latency for small and large messages]

• UPC++ one-sided async_copy_put/get benchmarks using MVAPICH2-X 2.3b on an OpenPOWER cluster
  – Near-native performance benefits are observed on OpenPOWER

MPI, PGAS and Deep Learning Support for OpenPOWER

• Message Passing Interface (MPI) Support
• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
• Exploiting Accelerators (NVIDIA GPGPUs)
  – CUDA-aware MPI
  – Optimized GPUDirect RDMA (GDR) Support
  – Optimized MVAPICH2 for OpenPower
• Support for Deep Learning

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers

At Sender:
    MPI_Send(s_devbuf, size, …);
At Receiver:
    MPI_Recv(r_devbuf, size, …);
(data movement handled inside MVAPICH2; a complete sketch follows below)

High Performance and High Productivity
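To show what the two calls above look like in a complete program, here is a minimal sketch of CUDA-aware point-to-point communication between two ranks. The buffer name and message size are illustrative, and it assumes a CUDA-aware MPI library such as MVAPICH2-GDR so that device pointers can be passed directly to MPI.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define N (1 << 20)   /* illustrative 1M-element message */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Allocate the communication buffer directly in GPU memory */
        float *devbuf;
        cudaMalloc((void **)&devbuf, N * sizeof(float));

        if (rank == 0) {
            cudaMemset(devbuf, 0, N * sizeof(float));
            /* Device pointer passed straight to MPI_Send; no cudaMemcpy staging needed */
            MPI_Send(devbuf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(devbuf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d floats into GPU memory\n", N);
        }

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }

With MVAPICH2-GDR this would typically be built with the library's mpicc against the CUDA runtime and launched with MV2_USE_CUDA=1, which enables the CUDA-aware path.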

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host, and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory

Optimized MVAPICH2-GDR Design

[Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth, MV2-(NO-GDR) vs. MV2-GDR-2.3a; about 1.88 us latency (11X better), 9X higher bandwidth, and 10X higher bi-bandwidth]

Platform: MVAPICH2-GDR-2.3a on an Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox Connect-X4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPU-Direct-RDMA

Application-Level Evaluation (HOOMD-blue)

[Figure: average time steps per second (TPS) for 64K and 256K particles at 4-32 processes, MV2 vs. MV2+GDR; about 2X improvement]

• Platform: Wilkes (Intel Ivy Bridge + K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

[Figure: normalized execution time on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs) for default, callback-based, and event-based designs]

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)

Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
On-going collaboration with CSCS and MeteoSwiss (Switzerland) on co-designing MV2-GDR and the Cosmo application

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, "Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems," IPDPS '16

MVAPICH2-GDR: Performance on OpenPOWER (NVLink + Pascal)

[Figure: intra-node and inter-node GPU-GPU latency and bandwidth on OpenPOWER, intra-socket (NVLink) vs. inter-socket]

Intra-node latency: 14.6 us (without GPUDirect RDMA); intra-node bandwidth: 33.9 GB/sec (NVLink)
Inter-node latency: 23.8 us (without GPUDirect RDMA); inter-node bandwidth: 11.9 GB/sec (EDR)

Available in MVAPICH2-GDR 2.3a
Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and EDR InfiniBand interconnect

Compilation of Best Practices

• The MPI runtime has many parameters
• Tuning a set of parameters can help you extract higher performance
• Compiled a list of such contributions through the MVAPICH website
  – http://mvapich.cse.ohio-state.edu/best_practices/
• List of applications (16 contributions so far)
  – Amber, Cloverleaf, GAPgeofem, HOOMD-blue, HPCG, LAMMPS, LU, Lulesh, MILC, Neuro, POP2, SMG2000, SPEC MPI, TERA_TF, UMR, and WRF2
• Soliciting additional contributions; send your results to mvapich-help at cse.ohio-state.edu
• We will link these results with credits to you

MPI, PGAS and Deep Learning Support for OpenPOWER

• Message Passing Interface (MPI) Support
• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)
• Exploiting Accelerators (NVIDIA GPGPUs)
• Support for Deep Learning
  – New challenges for MPI Run-time
  – Optimizing Large-message Collectives
  – Co-designing DL Frameworks

Deep Learning: New Challenges for MPI Runtimes

• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (on the order of megabytes)
  – Most communication based on GPU buffers
• Existing state of the art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – CUDA-Aware MPI --> scale-out performance
    • For small and medium message sizes only!
• Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions)
  – What application co-designs are needed to exploit communication-runtime co-designs?

[Figure: scale-up performance (cuDNN, cuBLAS, NCCL) vs. scale-out performance (MPI, gRPC, Hadoop), with the proposed co-designs targeting both]

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters," in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)

Large Message Optimized Collectives for Deep Learning

• MV2-GDR provides optimized collectives for large message sizes
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with a large number of GPUs
• Available since MVAPICH2-GDR 2.2GA

[Figure: Reduce, Allreduce, and Bcast latency as a function of message size (2-128 MB) and GPU count (up to 192 GPUs)]

A sketch of what these large-message reductions look like at the application level follows below.
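DL frameworks reduce gradients with exactly this kind of large-message Allreduce. The sketch below averages a gradient buffer that lives in GPU memory; the buffer name and size are illustrative, and it assumes a CUDA-aware MPI library (e.g., MVAPICH2-GDR) so that the device pointer can be passed directly to MPI_Allreduce.

    #include <mpi.h>
    #include <cuda_runtime.h>

    #define GRAD_ELEMS (16 * 1024 * 1024)   /* illustrative 64 MB of float gradients */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int nranks;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Gradients produced by the local GPU (zero-filled here for brevity) */
        float *d_grad;
        cudaMalloc((void **)&d_grad, GRAD_ELEMS * sizeof(float));
        cudaMemset(d_grad, 0, GRAD_ELEMS * sizeof(float));

        /* Sum the gradients across all ranks directly on device buffers */
        MPI_Allreduce(MPI_IN_PLACE, d_grad, GRAD_ELEMS, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        /* A real framework would now scale by 1/nranks on the GPU and apply the update */
        (void)nranks;

        cudaFree(d_grad);
        MPI_Finalize();
        return 0;
    }

This is the pattern whose latency the Reduce/Allreduce/Bcast curves above measure at 64-128 MB message sizes.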

MVAPICH2: Allreduce Comparison with Baidu and OpenMPI

• Initial evaluation shows promising performance gains for MVAPICH2-GDR 2.3a*
• 8 GPUs (2 nodes): MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI 3.0

[Figure: Allreduce latency vs. message size (4 bytes to 512 MB) for MVAPICH2, Baidu, and OpenMPI; MV2 is ~20% better than Baidu for small messages and ~3.5X-10X better at large messages, while OpenMPI is ~4X slower than Baidu]

*Available with MVAPICH2-GDR 2.3a and higher
Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and FDR InfiniBand interconnect

MVAPICH2: Allreduce Comparison with Baidu and OpenMPI

• 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI 3.0

[Figure: Allreduce latency vs. message size (4 bytes to 512 MB) for MVAPICH2, Baidu, and OpenMPI on 16 GPUs; MV2 is ~2X better than Baidu for small messages and ~4X-30X better at large messages, while OpenMPI is ~5X slower than Baidu]

*Available with MVAPICH2-GDR 2.3a and higher
Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and FDR InfiniBand interconnect

MVAPICH2-GDR vs. NCCL2

• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance in most cases
• MPI_Bcast (MVAPICH2-GDR) vs. ncclBcast (NCCL2) on 16 K-80 GPUs

[Figure: Bcast latency vs. message size (4 bytes to 64 MB) with 1 GPU/node and 2 GPUs/node, NCCL2 vs. MVAPICH2-GDR; MVAPICH2-GDR is ~4X-10X better]

*Will be available with the upcoming MVAPICH2-GDR 2.3b
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 2 K-80 GPUs, and EDR InfiniBand interconnect

OSU-Caffe: Scalable Deep Learning

• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset

[Figure: GoogLeNet (ImageNet) training time on 8-128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); Caffe on 128 GPUs is an invalid use case]

OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/
Support on OpenPOWER will be available soon

Concluding Remarks

• Next-generation HPC systems need to be designed with a holistic view of Big Data and Deep Learning
• The OpenPOWER platform is emerging with many novel features
• Presented some of the approaches and results along these directions from the MVAPICH2 and HiDL projects
• Solutions enable high-performance and scalable HPC and Deep Learning on OpenPOWER platforms

High-Performance Hadoop and Spark on OpenPOWER Platform

Talk at 2:15 pm

Funding Acknowledgments

Funding Support by

Equipment Support by

Personnel Acknowledgments

Current Students (Graduate): A. Awan (Ph.D.), M. Bayatpour (Ph.D.), R. Biswas (M.S.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), S. Guganani (Ph.D.), J. Hashmi (Ph.D.), H. Javed (Ph.D.), P. Kousha (Ph.D.), D. Shankar (Ph.D.), H. Shi (Ph.D.), J. Zhang (Ph.D.)
Current Students (Undergraduate): N. Sarkauskas (B.S.)
Current Research Scientists: X. Lu, H. Subramoni
Current Research Specialists: J. Smith, M. Arnold
Current Post-doc: A. Ruhela

Past Students: A. Augustine (M.S.), P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), K. Kulkarni (M.S.), R. Kumar (M.S.), P. Lai (M.S.), M. Li (Ph.D.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Past Research Scientists: K. Hamidouche, S. Sur
Past Programmers: D. Bureddy, J. Perkins
Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang

Thank You!
[email protected]

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project – http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project – http://hibd.cse.ohio-state.edu/
The High-Performance Deep Learning Project – http://hidl.cse.ohio-state.edu/
