Overview of the MVAPICH Project: Latest Status and Future Roadmap

MVAPICH2 User Group (MUG) Meeting

by

Dhabaleswar K. (DK) Panda The Ohio State University E-mail: [email protected] http://www.cse.ohio-state.edu/~panda Parallel Programming Models Overview

P1 P2 P3 P1 P2 P3 P1 P2 P3

Logical Memory Memory Memory Shared Memory Memory Memory Memory

Shared Memory Model Model Partitioned Global Address Space (PGAS) SHMEM, DSM MPI (Message Passing Interface) , UPC, UPC++, Chapel, , CAF, … • Programming models provide abstract machine models • Models can be mapped on different types of systems – e.g. Distributed Shared Memory (DSM), MPI within a node, etc. • PGAS models and Hybrid MPI+PGAS models are gradually receiving importance

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 2 Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

Application Kernels/Applications

Middleware Co-Design Opportunities Programming Models and Challenges MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, across Various OpenACC, , Hadoop (MapReduce), Spark (RDD, DAG), etc. Layers

Communication or Runtime for Programming Models Performance

Point-to-point Collective Energy- Synchronization I/O and Fault Communication Communication Awareness and Locks File Systems Tolerance Resilience

Networking Technologies Multi-/Many-core Accelerators (InfiniBand, 40/100GigE, Architectures (GPU and MIC) Aries, and Omni-Path)

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 3 Designing (MPI+X) at Exascale • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided) – Scalable job start-up – Low memory footprint • Scalable Collective communication – Offload – Non-blocking – Topology-aware • Balancing intra-node and inter-node communication for next generation nodes (128-1024 cores) – Multiple end-points per node • Support for efficient multi-threading • Integrated Support for Accelerators (GPGPUs and FPGAs) • Fault-tolerance/resiliency • QoS support for communication and I/O • Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI+UPC++, CAF, …) • Virtualization • Energy-Awareness

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 4 Overview of the MVAPICH2 Project

• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE) – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 – MVAPICH2-X (MPI + PGAS), Available since 2011 – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014 – Support for Virtualization (MVAPICH2-Virt), Available since 2015 – Support for Energy-Awareness (MVAPICH2-EA), Available since 2015 – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015 – Used by more than 2,925 organizations in 86 countries – More than 485,000 (> 0.48 million) downloads from the OSU site directly – Empowering many TOP500 clusters (Jul ‘18 ranking) • 2nd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China • 12th, 556,104 cores (Oakforest-PACS) in Japan • 15th, 367,024 cores (Stampede2) at TACC • 24th, 241,108-core (Pleiades) at NASA and many others – Available with software stacks of many vendors and Distros (RedHat and SuSE) – http://mvapich.cse.ohio-state.edu • Empowering Top500 systems for over a decade

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 5 MVAPICH Project Timeline

OSU-INAM

MVAPICH2-EA

MVAPICH2-Virt

MVAPICH2-MIC

MVAPICH2-GDR

MVAPICH2-X

MVAPICH2

OMB

MVAPICH EOL 04 02 04 10 12 15 14 15 15 15 ------Jan - Oct Jan - Nov Nov Sep Jul - Apr Aug Timeline Aug Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 6 MVAPICH2 Release Timeline and Downloads 600000

500000 MIC 2.0 - GDR 2.0b - MV2 MV2 1.9 400000 2.3b 1.8 MV2 X - MV2 2.3 1.7 Virt 2.2 MV2 GDR 2.3a - MV2 MV2 1.6 300000 MV2 OSU INAM 0.9.3 MV2 1.5 MV2 1.4 1.0.3 1.1 MV2 Number of Downloads Number 1.0 1.0 MV2 200000 MV MV2 MV MV2 MV2 0.9.8 MV 0.9.4 MV2 0.9.0

100000

0 Jan-05 Jan-06 Jan-07 Jan-08 Jan-09 Jan-10 Jan-11 Jan-12 Jan-13 Jan-14 Jan-15 Jan-16 Jan-17 Jan-18 Sep-04 Sep-05 Sep-06 Sep-07 Sep-08 Sep-09 Sep-10 Sep-11 Sep-12 Sep-13 Sep-14 Sep-15 Sep-16 Sep-17 May-05 May-06 May-07 May-08 May-09 May-10 May-11 May-12 May-13 May-14 May-15 May-16 May-17 May-18 Timeline

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 7 Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models

Message Passing Interface PGAS Hybrid --- MPI + X (MPI) (UPC, OpenSHMEM, CAF, UPC++) (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime Diverse and Mechanisms

Point-to- Remote Collectives Energy- I/O and Fault Active Introspectio point Job Startup Algorithms Awareness Tolerance Messages n & Analysis Primitives Access File Systems

Support for Modern Networking Technology Support for Modern Multi-/Many-core Architectures (InfiniBand, iWARP, RoCE, Omni-Path) (Intel-Xeon, OpenPOWER, Xeon-Phi (MIC, KNL), NVIDIA GPGPU) Transport Protocols Modern Features Transport Mechanisms Modern Features SR- Multi Shared RC XRC UD DC UMR ODP CMA IVSHMEM XPMEM* NVLink* CAPI* IOV Rail Memory

* Upcoming

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 8 Strong Procedure for Design, Development and Release

• Research is done for exploring new designs • Designs are first presented to conference/journal publications • Best performing designs are incorporated into the codebase • Rigorous Q&A procedure before making a release – Exhaustive unit testing – Various test procedures on diverse range of platforms and interconnects – Test 19 different benchmarks and applications including, but not limited to • OMB, IMB, MPICH Test Suite, Intel Test Suite, NAS, ScalaPak, and SPEC – Spend about 18,000 core hours per commit – Performance regression and tuning – Applications-based evaluation – Evaluation on large-scale systems • All versions (alpha, beta, RC1 and RC2) go through the above testing

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 9 MVAPICH2 Software Family

Requirements Library

MPI with IB, iWARP, Omni-Path, and RoCE MVAPICH2

Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS MVAPICH2-X with IB, Omni-Path, and RoCE MPI with IB, RoCE & GPU and Support for Deep Learning MVAPICH2-GDR

HPC Cloud with MPI & IB MVAPICH2-Virt

Energy-aware MPI with IB, iWARP and RoCE MVAPICH2-EA

MPI Energy Monitoring Tool OEMT

InfiniBand Network Analysis and Monitoring OSU INAM

Microbenchmarks for Measuring MPI and PGAS Performance OMB

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 10 MVAPICH2 2.3-GA • Released on 07/23/2018 • Major Features and Enhancements – Based on MPICH v3.2.1 – Enhancements to mapping strategies and automatic architecture/network – Introduce basic support for executing MPI jobs in Singularity detection – Improve performance for MPI-3 RMA operations • Improve performance of architecture detection on high core-count systems – Enhancements for Job Startup • Enhanced architecture detection for OpenPOWER, Intel Skylake and Cavium ARM (ThunderX) systems • Improved job startup time for OFA-IB-CH3, PSM-CH3, and PSM2-CH3 • New environment variable MV2_THREADS_BINDING_POLICY for multi-threaded MPI and • On-demand connection management for PSM-CH3 and PSM2-CH3 channels MPI+OpenMP applications • Enhance PSM-CH3 and PSM2-CH3 job startup to use non-blocking PMI calls • Support 'spread', 'bunch', 'scatter', 'linear' and 'compact' placement of threads • Introduce capability to run MPI jobs across multiple InfiniBand subnets – Warn user if oversubscription of core is detected – Enhancements to point-to-point operations • Enhance MV2_SHOW_CPU_BINDING to enable display of CPU bindings on all nodes • Enhance performance of point-to-point operations for CH3-Gen2 (InfiniBand), CH3-PSM, and • Added support for MV2_SHOW_CPU_BINDING to display number of OMP threads CH3-PSM2 (Omni-Path) channels • Added logic to detect heterogeneous CPU/HFI configurations in PSM-CH3 and PSM2-CH3 • Improve performance for Intra- and Inter-node communication for OpenPOWER architecture channels • Enhanced tuning for OpenPOWER, Intel Skylake and Cavium ARM (ThunderX) systems – Thanks to Matias Cabral@Intel for the report • Improve performance for host-based transfers when CUDA is enabled • Enhanced HFI selection logic for systems with multiple Omni-Path HFIs • Improve support for large processes per node and hugepages on SMP systems • Introduce run time parameter MV2_SHOW_HCA_BINDING to show process to HCA – Enhancements to collective operations bindings • Enhanced performance for Allreduce, Reduce_scatter_block, Allgather, Allgatherv – Miscellaneous enhancements and improved debugging and tools support – Thanks to Danielle Sikich and Adam Moody @ LLNL for the patch • Enhance support for MPI_T PVARs and CVARs • Add support for non-blocking Allreduce using Mellanox SHARP • Enhance debugging support for PSM-CH3 and PSM2-CH3 channels – Enhance tuning framework for Allreduce using SHArP • Update to hwloc version 1.11.9 • Enhanced collective tuning for IBM POWER8, IBM POWER9, Intel Skylake, Intel KNL, Intel • Tested with CLANG v5.0.0 Broadwell

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 11 Highlights of MVAPICH2 2.3-GA Release

• Support for highly-efficient inter-node and intra-node communication • Scalable Start-up • Optimized Collectives using SHArP and Multi-Leaders • Support for OpenPOWER and ARM architectures • Performance Engineering with MPI-T • Application Scalability and Best Practices

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 12 One-way Latency: MPI over IB with MVAPICH2

Small Message Latency Large Message Latency 2 120 TrueScale-QDR 1.8 1.19 100 ConnectX-3-FDR 1.6 1.15 ConnectIB-DualFDR 1.4 80 ConnectX-5-EDR 1.2 Omni-Path 1 60 0.8 1.11 40 Latency (us)

Latency (us) 1.04 0.6 0.98 0.4 20 0.2 0 0

Message Size (bytes) Message Size (bytes)

TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch ConnectX-5-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB Switch Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 13 Bandwidth: MPI over IB with MVAPICH2 Unidirectional Bandwidth 14000 Bidirectional Bandwidth 25000 12,590 24,136 TrueScale-QDR 12000 ConnectX-3-FDR 12,366 20000 10000 12,358 ConnectIB-DualFDR 22,564 ConnectX-5-EDR 21,983 8000 15000 12,161 6,356 Omni-Path 6000 10000 3,373 4000 6,228

Bandwidth (MBytes/sec) 5000 2000 Bandwidth (MBytes/sec)

0 0 4 16 64 256 1024 4K 16K 64K 256K 1M 4 16 64 256 1024 4K 16K 64K 256K 1M

Message Size (bytes) Message Size (bytes)

TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch ConnectX-5-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 IB switch Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 14 Towards High Performance and Scalable Startup at Exascale • Near-constant MPI and OpenSHMEM a b c On-demand initialization time at any process count P M d a Connection • 10x and 30x improvement in startup time b PMIX_Ring of MPI and OpenSHMEM respectively at

P PGAS – State of the art e c PMIX_Ibarrier 16,384 processes • Memory consumption reduced for remote M MPI – State of the art d PMIX_Iallgather

Endpoint Information endpoint information by O(processes per O e Shmem based PMI

Memory Required to Store Store to Memory Required PGAS/MPI – Optimized O node)

Job Startup Performance • 1GB Memory saved per node with 1M processes and 16 processes per node

a On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D K Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS ’15) b PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D K Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia ’14) c d Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D K Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and (CCGrid ’15) e SHMEMPMI – Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D K Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’16) Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 15 Startup Performance on KNL + Omni-Path

MPI_Init & Hello World - Oakforest-PACS 25 Hello World (MVAPICH2-2.3a) 21s 20 MPI_Init (MVAPICH2-2.3a) 15

10

5 5.8s 22s (Seconds)Time Taken 0 64 1K 2K 4K 8K 128 256 512 16K 32K 64K Number of Processes

• MPI_Init takes 22 seconds on 229,376 processes on 3,584 KNL nodes (Stampede2 – Full scale) • 8.8 times faster than Intel MPI at 128K processes (Courtesy: TACC) • At 64K processes, MPI_Init and Hello World takes 5.8s and 21s respectively (Oakforest-PACS) • All numbers reported with 64 processes per node New designs available since MVAPICH2-2.3a and as patch for SLURM-15.08.8 and SLURM-16.05.1

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 16 Advanced Allreduce Collective Designs Using SHArP 0.35 Avg DDOT Allreduce time of HPCG 0.3 MVAPICH2 12% SHArP Support (blocking and 0.25 non-blocking) is available 0.2 since MVAPICH2 2.3a 0.15 0.1

Latency (seconds) Latency 0.05 0 (4,28) (8,28) (16,28) M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, (Number of Nodes, PPN) Scalable Reduction Collectives with Data 0.2 Mesh Refinement Time of MiniAMR Partitioning-based Multi-Leader Design, MVAPICH2 Supercomputing '17. 0.15 13%

0.1

0.05 Latency (seconds) Latency

0 (4,28) (8,28) (16,28) (Number of Nodes, PPN) Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 17 Intra-node Point-to-Point Performance on OpenPOWER

Intra-Socket Small Message Latency Intra-Socket Large Message Latency 1.2 80 MVAPICH2-2.3 MVAPICH2-2.3 70 1 SpectrumMPI-10.1.0.2 60 SpectrumMPI-10.1.0.2 0.8 OpenMPI-3.0.0 50 OpenMPI-3.0.0 0.6 40 Latency (us) Latency Latency (us) Latency 30 0.4 20 0.2 0.30us 10 0 0 0 1 2 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M Intra-Socket Bandwidth Intra-Socket Bi-directional Bandwidth 35000 70000 MVAPICH2-2.3 MVAPICH2-2.3 30000 60000 SpectrumMPI-10.1.0.2 SpectrumMPI-10.1.0.2 25000 50000 OpenMPI-3.0.0 OpenMPI-3.0.0 20000 40000 15000 30000 10000 20000 Bandwidth (MB/s) Bandwidth Bandwidth (MB/s) Bandwidth 5000 10000 0 0 1 8 64 512 4K 32K 256K 2M 1 8 64 512 4K 32K 256K 2M Platform: Two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 18 Inter-node Point-to-Point Performance on OpenPOWER

Small Message Latency Large Message Latency 3.5 200 MVAPICH2-2.3 MVAPICH2-2.3 3 SpectrumMPI-10.1.0.2 150 SpectrumMPI-10.1.0.2 2.5 OpenMPI-3.0.0 OpenMPI-3.0.0 2 100 1.5 Latency (us) Latency Latency (us) Latency

1 50 0.5 0 0 0 1 2 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M Bandwidth Bi-directional Bandwidth 14000 30000 MVAPICH2-2.3 MVAPICH2-2.3 12000 25000 SpectrumMPI-10.1.0.2 SpectrumMPI-10.1.0.2 10000 OpenMPI-3.0.0 20000 OpenMPI-3.0.0 8000 15000 6000 10000 4000 Bandwidth (MB/s) Bandwidth Bandwidth (MB/s) Bandwidth 2000 5000 0 0 1 8 64 512 4K 32K 256K 2M 1 8 64 512 4K 32K 256K 2M Platform: Two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 19 Intra-node Point-to-point Performance on ARM Cortex-A72

Small Message Latency Large Message Latency 1.2 700 MVAPICH2-2.3 MVAPICH2-2.3 1 600 500 0.8 400 0.6 0.27 micro-second 300 Latency (us) Latency Latency (us) Latency 0.4 (1 bytes) 200 0.2 100

0 0 0 1 2 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M 4M Bandwidth Bi-directional Bandwidth 10000 16000 MVAPICH2-.3 14000 MVAPICH2-2.3 8000 12000 6000 10000 8000 4000 6000 Bandwidth (MB/s) Bandwidth 4000 2000 2000 Bidirectional Bandwidth Bidirectional 0 0 1 2 4 8 16 32 64 1K 2K 4K 8K 1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M 1M 2M 4M 128 256 512 32K 64K 128K 256K 512K Platform: ARM Cortex A72 (aarch64) MIPS processor with 64 cores dual-socket CPU. Each socket contains 32 cores.

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 20 Performance Engineering Applications using MVAPICH2 and TAU ● Enhance existing support for MPI_T in MVAPICH2 to expose a richer set of performance and control variables ● Get and display MPI Performance Variables (PVARs) made available by the runtime in TAU ● Control the runtime’s behavior via MPI Control Variables (CVARs) ● Introduced support for new MPI_T based CVARs to MVAPICH2 ○ MPIR_CVAR_MAX_INLINE_MSG_SZ, MPIR_CVAR_VBUF_POOL_SIZE, MPIR_CVAR_VBUF_SECONDARY_POOL_SIZE ● TAU enhanced with support for setting MPI_T CVARs in a non-interactive mode for uninstrumented applications ● S. Ramesh, A. Maheo, S. Shende, A. Malony, H. Subramoni, and D. K. Panda, MPI Performance Engineering with the MPI Tool Interface: the Integration of MVAPICH and TAU, EuroMPI/USA ‘17, Best Paper Finalist ● More details in Sameer Shende’s talk today and poster presentations VBUF usage without CVAR based tuning as displayed by ParaProf VBUF usage with CVAR based tuning as displayed by ParaProf

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 21 Performance of SPEC MPI 2007 Benchmarks (KNL + Omni-Path)

400 6% Intel MPI 18.0.0 350 1% MVAPICH2 2.3 300

250 448 processes 200 10% on 7 KNL nodes of TACC Stampede2 6% 2% 150 (64 ppn) Execution Time in (s) 100 4% 2% 50

0 milc leslie3d pop2 lammps wrf2 tera_tf lu

MVAPICH2 outperforms Intel MPI by up to 10%

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 22 Performance of SPEC MPI 2007 Benchmarks (Skylake + Omni-Path)

120 0% 1% Intel MPI 18.0.0 100 MVAPICH2 2.3 4% 80 480 processes on 10 Skylake nodes 60 -4% of TACC Stampede2 38% (48 ppn) 40 Execution Time in (s) 2% -3% 20 0%

0 milc leslie3d pop2 lammps wrf2 GaP tera_tf lu

MVAPICH2 outperforms Intel MPI by up to 38%

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 23 Application Scalability on Skylake and KNL Cloverleaf (bm64) MPI+OpenMP, MiniFE (1300x1300x1300 ~ 910 GB) NEURON (YuEtAl2012) NUM_OMP_THREADS = 2 60 1200 2000 50 MVAPICH2 1000 MVAPICH2 MVAPICH2 1500 40 800 30 600 1000 20 400 500 10 200

Execution Time (s) 0 0 0 2048 4096 8192 48 96 192 384 768 48 96 192 384 768 1536 3072 No. of Processes (Skylake: 48ppn) No. of Processes (Skylake: 48ppn) No. of Processes (Skylake: 48ppn) 140 3500 1400 120 1200 MVAPICH2 MVAPICH2 3000 MVAPICH2 100 2500 1000 80 2000 800 60 1500 600 40 1000 400 Execution Time (s) 20 500 200 0 0 0 2048 4096 8192 64 128 256 512 1024 2048 4096 68 136 272 544 1088 2176 4352

No. of Processes (KNL: 64ppn) No. of Processes (KNL: 64ppn) No. of Processes (KNL: 68ppn) Courtesy: Mahidhar Tatineni @SDSC, Dong Ju (DJ) Choi@SDSC, and Samuel Khuvis@OSC ---- Testbed: TACC Stampede2 using MVAPICH2-2.3b Runtime parameters: MV2_SMPI_LENGTH_QUEUE=524288 PSM2_MQ_RNDV_SHM_THRESH=128K PSM2_MQ_RNDV_HFI_THRESH=128K Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 24 MVAPICH2 Upcoming Features

• Dynamic and Adaptive Tag Matching • Dynamic and Adaptive Communication Protocols • Support for FPGA-based Accelerators

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 25 Dynamic and Adaptive Tag Matching

Tag matching is a significant A new tag matching design Better performance than other state- overhead for receivers - Dynamically adapt to of-the art tag-matching schemes Existing Solutions are communication patterns Minimum memory consumption Results

- Static and do not adapt dynamically Solution - Use different strategies for different

Challenge to communication pattern ranks - Do not consider memory overhead - Decisions are based on the number Will be available in future MVAPICH2 of request object that must be releases traversed before hitting on the required one

Normalized Total Tag Matching Time at 512 Processes Normalized Memory Overhead per Process at 512 Processes Normalized to Default (Lower is Better) Compared to Default (Lower is Better) Adaptive and Dynamic Design for MPI Tag Matching; M. Bayatpour, H. Subramoni, S. Chakraborty, and D. K. Panda; IEEE Cluster 2016. [Best Paper Nominee] Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 26 Dynamic and Adaptive MPI Point-to-point Communication Protocols

Desired Eager Threshold Eager Threshold for Example Communication Pattern with Different Designs

Default Manually Tuned Dynamic + Adaptive Process Pair Eager Threshold (KB) 4 5 6 7 4 5 6 7 4 5 6 7 0 – 4 32 1 – 5 64 16 KB 16 KB 16 KB 16 KB 128 KB 128 KB 128 KB 128 KB 32 KB 64 KB 128 KB 32 KB 2 – 6 128 0 1 2 3 0 1 2 3 0 1 2 3 3 – 7 32 Process on Node 1 Process on Node 2

Default Poor overlap; Low memory requirement Low Performance; High Productivity Manually Tuned Good overlap; High memory requirement High Performance; Low Productivity Dynamic + Adaptive Good overlap; Optimal memory requirement High Performance; High Productivity

Execution Time of Amber Relative Memory Consumption of Amber 600 9 500 8 7 400 6 5 300 4 200 3 2 100 1 Wall Clock (seconds) Time Clock Wall 0 0

128 256 512 1K Consumption Memory Relative 128 256 512 1K Number of Processes Number of Processes

Default Threshold=17K Threshold=64K Threshold=128K Dynamic Threshold Default Threshold=17K Threshold=64K Threshold=128K Dynamic Threshold

H. Subramoni, S. Chakraborty, D. K. Panda, Designing Dynamic & Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation & Communication, ISC'17 - Best Paper Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 27 Support for FPGA-Based Accelerators

• FPGAs are emerging in the market for HPC and Deep Learning • How to exploit FPGA functionalities to accelerate MPI library? • Funded Collaboration with Pattern Computers

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 28 MVAPICH2 Software Family

Requirements Library

MPI with IB, iWARP, Omni-Path, and RoCE MVAPICH2

Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS MVAPICH2-X with IB, Omni-Path, and RoCE MPI with IB, RoCE & GPU and Support for Deep Learning MVAPICH2-GDR

HPC Cloud with MPI & IB MVAPICH2-Virt

Energy-aware MPI with IB, iWARP and RoCE MVAPICH2-EA

MPI Energy Monitoring Tool OEMT

InfiniBand Network Analysis and Monitoring OSU INAM

Microbenchmarks for Measuring MPI and PGAS Performance OMB

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 29 MVAPICH2-X for Hybrid MPI + PGAS Applications

High Performance Parallel Programming Models MPI PGAS Hybrid --- MPI + X Message Passing Interface (UPC, OpenSHMEM, CAF, UPC++) (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Unified Communication Runtime Diverse APIs and Mechanisms Collectives Algorithms Introspection & Optimized Point-to- Remote Memory Scalable Job Fault Active Messages (Blocking and Non- Analysis with OSU point Primitives Access Startup Tolerance Blocking) INAM

Support for Efficient Intra-node Communication Support for Modern Networking Technologies Support for Modern Multi-/Many-core Architectures (POSIX SHMEM, CMA, LiMIC, XPMEM…) (InfiniBand, iWARP, RoCE, Omni-Path…) (Intel-Xeon, Intel-, OpenPOWER, ARM…)

• Current Model – Separate Runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI – Possible if both runtimes are not progressed – Consumes more network resource • Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF – Available with since 2012 (starting with MVAPICH2-X 1.9) – http://mvapich.cse.ohio-state.edu

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 30 Major Features of MVAPICH2-X

• Advanced MPI Features/Support – InfiniBand Features (DC, UMR, ODP, SHArP, and Core-Direct) – Kernel-assisted collectives • Hybrid MPI+PGAS (OpenSHMEM, UPC, UPC++, and CAF) • Allows to combine programming models (Pure MPI, Pure MPI+OpenMP, MPI+OpenSHMEM, MPI+UPC, MPI+UPC++, ..) in the same program to get best performance and scalability

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 31 Optimized CMA-based Collectives for Large Messages

1000000 1000000 1000000 KNL (2 Nodes, 128 Procs) KNL (4 Nodes, 256 Procs) KNL (8 Nodes, 512 Procs) 100000 100000 100000 ~ 3.2x 10000 ~ 2.5x 10000 10000 ~ 4x Better Better Better 1000 1000 1000 MVAPICH2-2.3a MVAPICH2-2.3a MVAPICH2-2.3a

Latency(us) Intel MPI 2017 100 100 ~ 17x 100 Intel MPI 2017 Better Intel MPI 2017 OpenMPI 2.1.0 10 10 OpenMPI 2.1.0 10 OpenMPI 2.1.0 MVAPICH2-X MVAPICH2-X MVAPICH2-X 1 1 1 1K 4K 16K 64K 256K 1M 4M 1K 4K 16K 64K 256K 1M 1K 4K 16K 64K 256K 1M Message Size Message Size Message Size Performance of MPI_Gather on KNL nodes (64PPN) • Significant improvement over existing implementation for Scatter/Gather with 1MB messages (up to 4x on KNL, 2x on Broadwell, 14x on OpenPOWER) • New two-level algorithms for better scalability • Improved performance for other collectives (Bcast, Allgather, and Alltoall) S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster ’17, BEST Paper Finalist Available since MVAPICH2-X 2.3b

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 32 Advanced Allreduce Collective Designs Using SHArP and Multi-Leaders

60 0.6 Lower is better is Lower 23% 50 0.5

40 0.4

30 0.3

Latency (us) Latency 20 40% 0.2 10 0.1

0 0 Communication Latency (Seconds) Latency Communication 4 8 16 32 64 128 256 512 1K 2K 4K 56 224 448 Message Size (Byte) Number of Processes MVAPICH2 Proposed-Socket-Based MVAPICH2+SHArP MVAPICH2 Proposed-Socket-Based MVAPICH2+SHArP OSU Micro Benchmark (16 Nodes, 28 PPN) HPCG (28 PPN)

• Socket-based design can reduce the communication latency by 23% and 40% on Broadwell + IB-EDR nodes • Support is available since MVAPICH2-X 2.3b M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi- Leader Design, Supercomputing '17. Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 33 Performance of MPI_Allreduce On Stampede2 (10,240 Processes)

300 2000 1800 250 1600 1400 200 1200 150 1000 2.4X 800 Latency (us) 100 600

50 400 200 0 0 4 8 16 32 64 128 256 512 1024 2048 4096 8K 16K 32K 64K 128K 256K Message Size Message Size MVAPICH2 MVAPICH2-OPT IMPI MVAPICH2 MVAPICH2-OPT IMPI OSU Micro Benchmark 64 PPN

• MPI_Allreduce latency with 32K bytes reduced by 2.4X

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 34 Application Level Performance with Graph500 and Sort Graph500 Execution Time Sort Execution Time 3000 35 MPI-Simple MPI Hybrid 30 2500 MPI-CSC 25 MPI-CSR 2000 20 51% Hybrid (MPI+OpenSHMEM) 13X 1500 15 Time (s) 10 1000 Time (seconds) 5 7.6X 500 0 4K 8K 16K 0 500GB-512 1TB-1K 2TB-2K 4TB-4K No. of Processes Input Data - No. of Processes • Performance of Hybrid (MPI+ OpenSHMEM) Graph500 Design • Performance of Hybrid (MPI+OpenSHMEM) Sort • 8,192 processes Application - 2.4X improvement over MPI-CSR • 4,096 processes, 4 TB Input Size - 7.6X improvement over MPI-Simple - MPI – 2408 sec; 0.16 TB/min • 16,384 processes - Hybrid – 1172 sec; 0.36 TB/min - 1.5X improvement over MPI-CSR - 51% improvement over MPI-design - 13X improvement over MPI-Simple

J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013. J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC’13), June 2013 J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 35 MVAPICH2-X Upcoming Features

• MVAPICH2-X 2.3RC1 will be released soon with the following features – Support for XPMEM (Point-to-point, Reduce, and All-reduce) – Efficient Asynchronous Progress with Full Subscription – Contention-aware CMA-based collective support for OpenSHMEM, UPC, and UPC++ • Implicit On-Demand Paging (ODP) Support • Other collectives with XPMEM-based Support

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 36 Shared Address Space (XPMEM)-based Collectives Design OSU_Allreduce (Broadwell 256 procs) OSU_Reduce (Broadwell 256 procs) MVAPICH2-2.3b 100000 MVAPICH2-2.3b 100000 1.8X IMPI-2017v1.132 IMPI-2017v1.132 4X 10000 10000 MVAPICH2-X MVAPICH2-X

1000 1000 73.2 37.9 100 100 Latency (us) Latency

10 10 36.1 16.8 1 1 16K 32K 64K 128K 256K 512K 1M 2M 4M 16K 32K 64K 128K 256K 512K 1M 2M 4M Message Size Message Size • “Shared Address Space”-based true zero-copy Reduction collective designs in MVAPICH2 • Offloaded computation/communication to peers ranks in reduction collective operation • Up to 4X improvement for 4MB Reduce and up to 1.8X improvement for 4M AllReduce J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018. Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 37 Optimized All-Reduce with XPMEM on OpenPOWER 2000 2000 48% MVAPICH2-X 4X MVAPICH2-X SpectrumMPI-10.1.0 SpectrumMPI-10.1.0 1000 OpenMPI-3.0.0 1000 OpenMPI-3.0.0 Latency (us) 2X Latency (us) 34%

(Nodes=1, PPN=20) (Nodes=1, 0 0 16K 32K 64K 128K 256K 512K 1M 2M 16K 32K 64K 128K 256K 512K 1M 2M 4000 4000 MVAPICH2-X 3.3X MVAPICH2-X 3000 2X SpectrumMPI-10.1.0 SpectrumMPI-10.1.0

2000 OpenMPI-3.0.0 2000 OpenMPI-3.0.0 Latency (us) Latency (us) 1000 2X 3X 0 (Nodes=2, PPN=20) (Nodes=2, 0 16K 32K 64K 128K 256K 512K 1M 2M 16K 32K 64K 128K 256K 512K 1M 2M Message Size Message Size • Optimized MPI All-Reduce Design in MVAPICH2 Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch – Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node

More Details in Poster: Designing Shared Address Space MPI libraries in the Many-core Era - Jahanzeb Hashmi, The Ohio State University Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 38 Enhanced Asynchronous Progress Communication

• Achieve overlaps without using specialized hardware or software resources • Can work with MPI tasks in full-subscribed mode

18% 33% 44% 38% 12% 25% 6%

12% 29% 15% 10% 13%

SPEC MPI : KNL + Omni-Path SPEC MPI : Skylake + Omni-Path P3DFFT : Broadwell + InfiniBand

38% performance improvement for SPECMPI applications on 384 processes 44% performance improvement with the P3DFFT application on 448 processes. A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and D. K. Panda, Efficient Asynchronous Communication Progress for MPI without Dedicated Resources, EuroMPI 2018 Efficient Asynchronous Communication Progress for MPI without Dedicated Resources - Amit Ruhela, The Ohio State University Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 39 SPEC MPI 2007 Benchmarks: Broadwell + InfiniBand

160 Intel MPI 18.1.163 140 MVAPICH2-X 120 29% 5% 100 80 -12% 60 1%

Execution Time in (s) 40 31% 11% 20

0 MILC Leslie3D POP2 LAMMPS WRF2 LU MVAPICH2-X outperforms Intel MPI by up to 31%

Configuration: 448 processes on 16 Intel E5-2680v4 (Broadwell) nodes having 28 PPN and interconnected with 100Gbps Mellanox MT4115 EDR ConnectX-4 HCA Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 40 SPEC MPI 2007 Benchmarks: KNL + Omni-Path

700 10% Intel 18.0.2 600 MVAPICH2-X 10% 500

400

300 -2% 200 22%

Execution Time in (s) 4%

100 -7% 15%

0 MILC Leslie3D POP2 LAMMPS WRF2 GAP Lu MVAPICH2-X outperforms Intel MPI by up to 22%

Configuration : 384 processes on 8 nodes of Intel Xeon Phi 7250(KNL) with 48 processes per node. KNL contains 68 cores on a single socket and interconnects with 100Gb/sec Intel Omni-Path network. Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 41 Implicit On-Demand Paging (ODP)

• Introduced by Mellanox to avoid pinning the pages of registered memory regions • ODP-aware runtime could reduce the size of pin-down buffers while maintaining performance Applications (256 Processes) 100 (s) Pin-down Explicit-ODP Implicit-ODP 10 Time

1

Execution 0.1 CG EP FT IS MG LU SP AWP-ODC Graph500

M. Li, X. Lu, H. Subramoni, and D. K. Panda, “Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand”, HiPC ‘17

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 42 XPMEM Collectives: osu_bcast and osu_scatter

osu_bcast osu_scatter Intel MPI 2018 10000 100000 Intel MPI 2018 OpenMPI 3.0.1 OpenMPI 3.0.1 MV2-2.3Xrc1 (CMA Coll) 10000 1000 MV2-2.3Xrc1 (CMA Coll) MV2-XPMEM 5X over 1000 MV2-XPMEM 100 Open MPI 100 Latency (us) 10 2.4X over MV2 Latency (us) 10 CMA-coll 1 1 1K 2K 4K 8K 4K 8K 1M 2M 4M 1M 2M 4M 512 16K 32K 64K 16K 32K 64K 128K 256K 512K 128K 256K 512K Message Size (bytes) Message Size (bytes)

• 28 MPI Processes on single dual-socket Broadwell E5-2680v4, 2x14 core processor

More Details in Poster: Designing Shared Address Space MPI libraries in the Many-core Era - Jahanzeb Hashmi, The Ohio State University Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 43 XPMEM Collectives: osu_gather and osu_alltoall

osu_gather osu_alltoall 100000 Intel MPI 2018 100000 Intel MPI 2018 OpenMPI 3.0.1 OpenMPI 3.0.1 10000 10000 MV2X-2.3rc1 (CMA Coll) MV2X-2.3rc1 (CMA Coll) MV2-XPMEM 1000 MV2-XPMEM 1000 5X over Data does not fit in 100 Open MPI 100 shared L3 for 28ppn

Latency (us) Alltoall, leads to same

Latency (us) performance as 10 10 existing designs 1 Up to 6X over IMPI and OMPI 1 4K 8K 1M 2M 4M 16K 32K 64K 1K 2K 4K 8K 1M 2M 4M 128 256 512 128K 256K 512K 16K 32K 64K 128K 256K 512K Message Size (bytes) Message Size (bytes)

• 28 MPI Processes on single dual-socket Broadwell E5-2680v4, 2x14 core processor

More Details in Poster: Designing Shared Address Space MPI libraries in the Many-core Era - Jahanzeb Hashmi, The Ohio State University Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 44 MVAPICH2 Software Family

Requirements Library

MPI with IB, iWARP, Omni-Path, and RoCE MVAPICH2

Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS MVAPICH2-X with IB, Omni-Path, and RoCE MPI with IB, RoCE & GPU and Support for Deep Learning MVAPICH2-GDR

HPC Cloud with MPI & IB MVAPICH2-Virt

Energy-aware MPI with IB, iWARP and RoCE MVAPICH2-EA

MPI Energy Monitoring Tool OEMT

InfiniBand Network Analysis and Monitoring OSU INAM

Microbenchmarks for Measuring MPI and PGAS Performance OMB

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 45 CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases • Support for MPI communication from NVIDIA GPU device memory • High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU) • High performance intra-node point-to-point communication for multi- GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU) • Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node • Optimized and tuned collectives for GPU device buffers • MPI datatype support for point-to-point and collective communication from GPU device buffers • Unified memory

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 46 MVAPICH2-GDR 2.3a • Released on 11/09/2017 • Major Features and Enhancements – Based on MVAPICH2 2.2 – Support for CUDA 9.0 – Add support for Volta (V100) GPU – Support for OpenPOWER with NVLink – Efficient Multiple CUDA stream-based IPC communication for multi-GPU systems with and without NVLink – Enhanced performance of GPU-based point-to-point communication – Leverage Linux Cross Memory Attach (CMA) feature for enhanced host-based communication – Enhanced performance of MPI_Allreduce for GPU-resident data – InfiniBand Multicast (IB-MCAST) based designs for GPU-based broadcast and streaming applications • Basic support for IB-MCAST designs with GPUDirect RDMA • Advanced support for zero-copy IB-MCAST designs with GPUDirect RDMA • Advanced reliability support for IB-MCAST designs – Efficient broadcast designs for Deep Learning applications – Enhanced collective tuning on Xeon, OpenPOWER, and NVIDIA DGX-1 systems

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 47 Optimized MVAPICH2-GDR Design GPU-GPU Inter-node Latency GPU-GPU Inter-node Bi-Bandwidth 30 6000 25 5000 20 4000 15 10 3000 11X 1.88us 10x 2000 Latency (us) 5 0 1000 Bandwidth (MB/s) 0 1 2 4 8 16 32 64 128 256 512 1K 2K 4K 8K 0 1 2 4 8 16 32 64 1K 2K 4K

Message Size (Bytes) 128 256 512 Message Size (Bytes) MV2-(NO-GDR) MV2-GDR-2.3a MV2-(NO-GDR) MV2-GDR-2.3a GPU-GPU Inter-node Bandwidth 3500 3000 2500 2000 MVAPICH2-GDR-2.3a 1500 9x 1000 Intel Haswell (E5-2687W @ 3.10 GHz) node - 20 cores Bandwidth (MB/s) 500 NVIDIA Volta V100 GPU 0 Mellanox Connect-X4 EDR HCA 1 2 4 8 16 32 64 128 256 512 1K 2K 4K CUDA 9.0 Mellanox OFED 4.0 with GPU-Direct-RDMA Message Size (Bytes)

MV2-(NO-GDR) MV2-GDR-2.3a

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 48 Application-Level Evaluation (HOOMD-blue) 64K Particles 256K Particles 3500 2500 3000 MV2 MV2+GDR 2500 2000 2X 2000 1500 2X

(TPS) 1500 1000 1000 500 500

Average per second TimeSteps Average 0 0

4 8 16 32 Time per Steps second (TPS)Average 4 8 16 32 Number of Processes Number of Processes

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB) • HoomdBlue Version 1.0.5 • GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 49 Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

Wilkes GPU Cluster CSCS GPU cluster

Default Callback-based Event-based Default Callback-based Event-based

1.2 1.2 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 Normalized Time Execution

Normalized Time Execution 0 4 8 16 32 16 32 64 96

Number of GPUs Number of GPUs

• 2X improvement on 32 GPUs nodes Cosmo model: http://www2.cosmo-model.org/content • 30% improvement on 96 GPU nodes (8 GPUs/node) /tasks/operational/meteoSwiss/ On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee , H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS’16

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 50 Multi-stream Communication using CUDA IPC on OpenPOWER and DGX-1

• Up to 16% higher Device to Device (D2D) bandwidth on OpenPOWER + NVLink inter-connect • Up to 30% higher D2D bandwidth on DGX-1 with NVLink • Pt-to-pt (D-D) Bandwidth: Pt-to-pt (D-D) Bandwidth: Benefits of Multi-stream CUDA IPC Design Benefits of Multi-stream CUDA IPC Design

20000 40000 18000 30% better 35000 16000 16% better 30000 14000 12000 25000 10000 20000 8000 15000 6000 10000 Million Bytes (MB)/second Bytes Million 4000 (MB)/second Bytes Million 1-stream 4-streams 2000 1-stream 4-streams 5000 0 0 16K 32K 64K 128K 256K 512K 1M 2M 4M 128K 256K 512K 1M 2M 4M Message Size (Bytes) Message Size (Bytes)

Available since MVAPICH2-GDR-2.3a

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 51 CMA-based Intra-node Communication Support

• Up to 30% lower Host-to-Host (H2H) latency and 30% higher H2H Bandwidth

INTRA-NODE Pt-to-Pt (H2H) LATENCY INTRA-NODE Pt-to-Pt (H2H) BANDWIDTH 600 16000 14000 500 30% better 12000 30% better 400 10000 300 8000 6000

Latency (us) 200 4000 100 Bandwidth (MBps) 2000 0 0 0 2 8 32 128 512 2K 8K 32K 128K 512K 2M 1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M Message Size (Bytes) Message Size (Bytes)

MV2-GDR (w/out CMA) MV2-GDR (w/ CMA) MV2-GDR (w/out CMA) MV2-GDR (w/ CMA)

MVAPICH2-GDR-2.3a Intel Broadwell (E5-2680 v4 @ 3240 GHz) node – 28 cores NVIDIA Tesla K-80 GPU, and Mellanox Connect-X4 EDR HCA CUDA 8.0, Mellanox OFED 4.0 with GPU-Direct-RDMA

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 52 MVAPICH2-GDR: Performance on OpenPOWER (NVLink + Pascal) INTRA-NODE LATENCY (SMALL) INTRA-NODE LATENCY (LARGE) INTRA-NODE BANDWIDTH 20 500 40

15 400 30 300 10 20 200 Latency (us) Latency (us) 5 100 10

0 0 Bandwidth (GB/sec) 0 1 2 4 8 16 32 64 128256512 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M 4M 1 8 64 512 4K 32K 256K 2M Message Size (Bytes) Message Size (Bytes) Message Size (Bytes)

INTRA-SOCKET(NVLINK) INTER-SOCKET INTRA-SOCKET(NVLINK) INTER-SOCKET INTRA-SOCKET(NVLINK) INTER-SOCKET Intra-node Latency: 16.8 us (without GPUDirectRDMA) Intra-node Bandwidth: 32 GB/sec (NVLINK)

INTER-NODE LATENCY (SMALL) INTER-NODE LATENCY (LARGE) INTER-NODE BANDWIDTH 35 900 7 30 800 6 700 25 600 5 20 500 4 15 400 3 300 Latency (us) 10 Latency (us) 200 2

5 100 Bandwidth (GB/sec) 1 0 0 0 1 2 4 8 16 32 64 128256512 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M 4M 1 8 64 512 4K 32K 256K 2M Message Size (Bytes) Message Size (Bytes) Message Size (Bytes) Inter-node Latency: 22 us (without GPUDirectRDMA) Inter-node Bandwidth: 6 GB/sec (FDR) Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and 4X-FDR InfiniBand Inter-connect

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 53 Deep Learning: New Challenges for MPI Runtimes

• Deep Learning frameworks are a different game Proposed altogether Co- NCCL NCCL2 Designs – Unusually large message sizes (order of megabytes) – Most communication based on GPU buffers cuDNN • Existing State-of-the-art cuBLAS – cuDNN, cuBLAS, NCCL --> scale-up performance MPI – NCCL2, CUDA-Aware MPI --> scale-out performance • For small and medium message sizes only! up Performance - • Proposed: Can we co-design the MPI runtime (MVAPICH2- gRPC GDR) and the DL framework (Caffe) to achieve both? Scale – Efficient Overlap of Computation and Communication Hadoop – Efficient Large-Message Communication (Reductions) – What application co-designs are needed to exploit communication-runtime co-designs? Scale-out Performance A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 54 Large Message Optimized Collectives for Deep Learning Reduce – 192 GPUs Reduce – 64 MB 250 200 200 150 150 100 100 50 50 Latency (ms) Latency (ms) • MV2-GDR provides 0 0 2 4 8 16 32 64 128 128 160 192 optimized collectives for Message Size (MB) No. of GPUs Allreduce - 128 MB large message sizes Allreduce – 64 GPUs 300 • Optimized Reduce, 300 200 200 Allreduce, and Bcast 100 100 Latency (ms) Latency (ms) • Good scaling with large 0 0 2 4 8 16 32 64 128 16 32 64 number of GPUs Message Size (MB) No. of GPUs Bcast 128 MB Bcast – 64 GPUs • Available since MVAPICH2- 80 80 GDR 2.2GA 60 60 40 40 20

20 Latency (ms) Latency (ms) 0 0 2 4 8 16 32 64 128 16 32 64 Message Size (MB) No. of GPUs Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 55 Exploiting CUDA-Aware MPI for TensorFlow (Horovod)

3500

3000 • MVAPICH2-GDR offers excellent MVAPICH2-GDR is up to 2500 22% faster than MVAPICH2 performance via advanced designs for 2000 MPI_Allreduce. 1500 • Up to 22% better performance on

Images/sec (higher (higher better) is Images/sec 1000 Wilkes2 cluster (16 GPUs) 500

0 1 2 4 8 16 No. of GPUs (4GPUs/node)

MVAPICH2 MVAPICH2-GDR

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 56 MVAPICH2-GDR on Container with Negligible Overhead GPU-GPU Inter-node Latency GPU-GPU Inter-node Bi-Bandwidth 1000 20000

100 15000

10 10000 Latency (us) 1 5000 1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M Bandwidth (MB/s) 0 Message Size (Bytes) 1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M Docker Native Message Size (Bytes) GPU-GPU Inter-node Bandwidth Docker Native 12000 10000 8000 6000 MVAPICH2-GDR-2.3a 4000 Intel Haswell (E5-2687W @ 3.10 GHz) node - 20 cores

Bandwidth (MB/s) 2000 NVIDIA Volta V100 GPU 0 Mellanox Connect-X4 EDR HCA 1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M CUDA 9.0 Mellanox OFED 4.0 with GPU-Direct-RDMA Message Size (Bytes)

Docker Native

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 57 MVAPICH2-GDR Upcoming Features

• MVAPICH2-GDR 2.3b will be released soon – Scalable Host-based Collectives – Optimized Support for Deep Learning (Caffe and CNTK) – Support for Streaming Applications • Optimized Collectives for Multi-Rails (Sierra) • Integrated Collective Support with SHArP

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 58 Scalable Host-based Collectives on OpenPOWER (Intra-node Reduce & AlltoAll) Reduce 40 2000 3.2X MVAPICH2-GDR MVAPICH2-GDR 1600 30 SpectrumMPI-10.1.0.2 SpectrumMPI-10.1.0.2 OpenMPI-3.0.0 OpenMPI-3.0.0 5.2X 1200 20 1.4X 3.6X 800 Latency (us) Latency (us) 10 400 (Nodes=1, PPN=20) (Nodes=1, (Nodes=1, PPN=20) (Nodes=1, 0 0 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M Alltoall 3.2X 1.3X 200 3.3X 10000 MVAPICH2-GDR MVAPICH2-GDR 1.2X 150 SpectrumMPI-10.1.0.2 7500 SpectrumMPI-10.1.0.2 OpenMPI-3.0.0 OpenMPI-3.0.0 100 5000

Latency (us) 50 Latency (us) 2500 (Nodes=1, PPN=20) (Nodes=1, 0 0 PPN=20) (Nodes=1, 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M Up to 5X and 3x performance improvement by MVAPICH2 for small and large messages respectively

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 59 MVAPICH2-GDR: Allreduce Comparison with Baidu and OpenMPI

• 16 GPUs (4 nodes) MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI 3.0

100000 ~10X better 6000000 50000 OpenMPI is ~5X slower 45000 5000000 than Baidu 10000 ~30X better 40000 35000 4000000 MV2 is ~2X better 1000 30000 than Baidu 3000000 25000 100 Latency (us) Latency (us) 20000 2000000 Latency (us) 15000 10 10000 ~4X better 1000000 5000 1 0 0 4 32 256 2K 16K 128K 8M 32M 128M 512M 512K 1M 2M 4M Message Size (Bytes) Message Size (Bytes) Message Size (Bytes) MVAPICH2 BAIDU OPENMPI MVAPICH2 BAIDU OPENMPI MVAPICH2 BAIDU OPENMPI

*Available since MVAPICH2-GDR 2.3a

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 60 MVAPICH2-GDR vs. NCCL2 – Broadcast Operation

• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases • MPI_Bcast (MVAPICH2-GDR) vs. ncclBcast (NCCL2) on 16 K-80 GPUs

100000 100000 1 GPU/node 2 GPUs/node 10000 10000

1000 1000 ~4X better 100 100 Latency (us) ~10X better Latency (us)

10 10

1 1 4 32 256 2K 16K 128K 1M 8M 64M 4 32 256 2K 16K 128K 1M 8M 64M Message Size (Bytes) Message Size (Bytes)

NCCL2 MVAPICH2-GDR NCCL2 MVAPICH2-GDR *Will be available with upcoming MVAPICH2-GDR 2.3b Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 2 K-80 GPUs, and EDR InfiniBand Inter-connect

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 61 MVAPICH2-GDR vs. NCCL2 – Reduce Operation

• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases • MPI_Reduce (MVAPICH2-GDR) vs. ncclReduce (NCCL2) on 16 GPUs

1000 100000 ~2.5X better 10000

100 1000

100 Latency (us) ~5X better Latency (us) 10 10

1 1 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K Message Size (Bytes) Message Size (Bytes)

MVAPICH2-GDR NCCL2 MVAPICH2-GDR NCCL2 *Will be available with upcoming MVAPICH2-GDR 2.3b Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPUs, and EDR InfiniBand Inter-connect

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 62 MVAPICH2-GDR vs. NCCL2 – Allreduce Operation

• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases • MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs

1000 100000

10000 ~1.2X better 100 ~3X better 1000

Latency (us) 100 10 Latency (us)

10

1 1 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K 128K 512K 2M 8M 32M 128M Message Size (Bytes) Message Size (Bytes)

MVAPICH2-GDR NCCL2 MVAPICH2-GDR NCCL2 *Will be available with upcoming MVAPICH2-GDR 2.3b Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPUs, and EDR InfiniBand Inter-connect

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 63 Streaming Support (Combining GDR and IB-Mcast)

• Combining MCAST+GDR hardware features for heterogeneous Node 1 configurations: C CPU – Source on the Host and destination on Device IB HCA Source – SL design: Scatter at destination GPU • Source: Data and Control on Host C Data Data • Destinations: Data on Device and Control on Host CPU IB Switch – Combines IB MCAST and GDR features at receivers IB HCA Node N – CUDA IPC-based topology-aware intra-node broadcast GPU C – Minimize use of PCIe resources (Maximizing availability of PCIe Host- CPU Device Resources) IB HCA Multicast steps GPU IB SL step Data C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda. “Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters, “ SBAC-PAD’16, Oct 2016.

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 64 Exploiting GDR+IB-Mcast Design for Deep Learning Applications

• Optimizing MCAST+GDR Broadcast for deep learning: Node 1 – Source and destination buffers are on GPU Device Header Source CPU • Typically very large messages (>1MB) IB – Pipelining data from Device to Host Header Data HCA • Avoid GDR read limit CPU GPU Data • Leverage high-performance SL design IB IB – Combines IB MCAST and GDR features HCA Switch GPU Node N – Minimize use of PCIe resources on the receiver side Data Header • Maximizing availability of PCIe Host-Device Resources CPU IB 1. Data movement 2. IB Gather HCA Ching-Hsiang Chu, Xiaoyi Lu, Ammar A. Awan, Hari Subramoni, Jahanzeb Hashmi, Bracy Elton, and Dhabaleswar 3. IB Hardware Multicast GPU K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning , ” 4. IB Scatter + GDR Write ICPP’17. Data Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 65 Application Evaluation: Deep Learning Frameworks

• @ RI2 cluster, 16 GPUs, 1 GPU/node – Microsoft Cognitive Toolkit (CNTK) [https://github.com/Microsoft/CNTK] Lower is better AlexNet model VGG model MV2-GDR-Knomial MV2-GDR-Ring MCAST-GDR-Opt MV2-GDR-Knomial MV2-GDR-Ring MCAST-GDR-Opt 300 2500 15% 6% 250 24% 2000 15% 200 1500 150 1000 100

Training Time (s) Training 50 Time (s) Training 500 0 0 8 16 8 16 Number of GPU nodes Number of GPU nodes • Reduces up to 24% and 15% of latency for AlexNet and VGG models • Higher improvement can be observed for larger system sizes Ching-Hsiang Chu, Xiaoyi Lu, Ammar A. Awan, Hari Subramoni, Jahanzeb Hashmi, Bracy Elton, and Dhabaleswar K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning , " ICPP’17. Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 66 MVAPICH2 Software Family

Requirements Library

MPI with IB, iWARP, Omni-Path, and RoCE MVAPICH2

Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS MVAPICH2-X with IB, Omni-Path, and RoCE MPI with IB, RoCE & GPU and Support for Deep Learning MVAPICH2-GDR

HPC Cloud with MPI & IB MVAPICH2-Virt

Energy-aware MPI with IB, iWARP and RoCE MVAPICH2-EA

MPI Energy Monitoring Tool OEMT

InfiniBand Network Analysis and Monitoring OSU INAM

Microbenchmarks for Measuring MPI and PGAS Performance OMB

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 67 Can HPC and Virtualization be Combined? • Virtualization has many benefits – Fault-tolerance – Job migration – Compaction • Have not been very popular in HPC due to overhead associated with Virtualization • New SR-IOV (Single Root – IO Virtualization) support available with Mellanox InfiniBand adapters changes the field • Enhanced MVAPICH2 support for SR-IOV • MVAPICH2-Virt 2.2 supports: – Efficient MPI communication over SR-IOV enabled InfiniBand networks – High-performance and locality-aware MPI communication for VMs and containers – Automatic communication channel selection for VMs (SR-IOV, IVSHMEM, and CMA/LiMIC2) and containers (IPC-SHM, CMA, and HCA) – OpenStack, Docker, and Singularity

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar'14

J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D.K. Panda, High Performance MPI Libray over SR-IOV enabled InfiniBand Clusters, HiPC’14 J. Zhang, X .Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid’15 Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 68 Application-Level Performance on Chameleon (SR-IOV Support)

6000 400 MV2-SR-IOV-Def MV2-SR-IOV-Def 350 5000 MV2-SR-IOV-Opt MV2-SR-IOV-Opt 300 4000 2% MV2-Native MV2-Native 250 3000 200 9.5% 150 2000 1%

5% Time (s) Execution

Execution Time (ms) Execution 100 1000 50 0 0 22,20 24,10 24,16 24,20 26,10 26,16 milc leslie3d pop2 GAPgeofem zeusmp2 lu Problem Size (Scale, Edgefactor) Graph500 SPEC MPI2007 • 32 VMs, 6 Core/VM • Compared to Native, 2-5% overhead for Graph500 with 128 Procs • Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 Procs

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 69 Application-Level Performance on Chameleon (Containers Support)

100 Container-Def 4000 90 Container-Def Container-Opt 3500 80 11% Container-Opt Native 3000 70 Native 60 2500 50 2000 40 1500

Execution Time (s) Execution 30 1000 20 Time (ms) Execution 500 16% 10 0 0 22, 16 22, 20 24, 16 24, 20 26, 16 26, 20 MG.D FT.D EP.D LU.D CG.D Problem Size (Scale, Edgefactor) NAS Graph 500 • 64 Containers across 16 nodes, pining 4 Cores per Container • Compared to Container-Def, up to 11% and 16% of execution time reduction for NAS and Graph 500 • Compared to Native, less than 9 % and 4% overhead for NAS and Graph 500

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 70 Application-Level Performance on Singularity with MVAPICH2

NPB Class D Graph500 300 3000 Singularity Singularity 6% 250 2500 Native Native 200 2000

150 1500

100 BFS Execution Time (ms) Execution Execution Time (s) 7% 1000

50 500

0 0 CG EP FT IS LU MG 22,16 22,20 24,16 24,20 26,16 26,20 Problem Size (Scale, Edgefactor) • 512 Processes across 32 nodes More Details were presented • Less than 7% and 6% overhead for NPB and Graph500, respectively in yesterday’s Tutorial Session

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 71 MVAPICH2-Virt Upcoming Features

• Support for SR-IOV Enabled VM Migration • SR-IOV and IVSHMEM Support in SLURM • Support for Microsoft Azure Platform

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 72 High Performance SR-IOV enabled VM Migration Support in MVAPICH2 • Migration with SR-IOV device has to handle the challenges of detachment/re-attachment of virtualized IB device and IB connection • Consist of SR-IOV enabled IB Cluster and External Migration Controller • Multiple parallel libraries to notify MPI applications during migration (detach/reattach SR- IOV/IVShmem, migrate VMs, migration status) • Handle the IB connection suspending and reactivating • Propose Progress engine (PE) and migration based (MT) design to optimize VM migration and MPI application performance

J. Zhang, X. Lu, D. K. Panda. High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters. IPDPS, 2017

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 73 SR-IOV and IVSHMEM in SLURM

• Requirement of managing and isolating the virtualized SR-IOV and IVSHMEM resources
• Such management and isolation is hard to achieve with the MPI library alone, but much easier with SLURM
• Running MPI applications efficiently on HPC clouds needs SLURM support for managing SR-IOV and IVSHMEM
  – Can critical HPC resources be efficiently shared among users by extending SLURM with support for SR-IOV and IVSHMEM based virtualization?
  – Can SR-IOV and IVSHMEM enabled SLURM and MPI library provide bare-metal performance for end applications on HPC Clouds?

J. Zhang, X. Lu, S. Chakraborty, and D. K. Panda. SLURM-V: Extending SLURM for Building Efficient HPC Cloud with SR-IOV and IVShmem. The 22nd International European Conference on Parallel Processing (Euro-Par), 2016.

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 74 Application-Level Performance on Azure

[Charts: NPB Class B (bt, cg, ep, ft, is, lu, mg, sp) and NPB Class C (cg, ep, ft, is, lu, mg) execution time (s); legend: MVAPICH2, IntelMPI]
• NPB Class B with 16 Processes on 2 Azure A8 instances
• NPB Class C with 32 Processes on 4 Azure A8 instances
• Comparable performance between MVAPICH2 and Intel MPI
• Will be available publicly for Azure soon

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 75 MVAPICH2 Software Family

Requirements Library

MPI with IB, iWARP, Omni-Path, and RoCE MVAPICH2

Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE MVAPICH2-X
MPI with IB, RoCE & GPU and Support for Deep Learning MVAPICH2-GDR

HPC Cloud with MPI & IB MVAPICH2-Virt

Energy-aware MPI with IB, iWARP and RoCE MVAPICH2-EA

MPI Energy Monitoring Tool OEMT

InfiniBand Network Analysis and Monitoring OSU INAM

Microbenchmarks for Measuring MPI and PGAS Performance OMB

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 76 Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)

• MVAPICH2-EA 2.1 (Energy-Aware) • A white-box approach • New Energy-Efficient communication protocols for pt-pt and collective operations • Intelligently apply the appropriate Energy saving techniques • Application oblivious energy saving

• OEMT • A library utility to measure energy consumption for MPI applications • Works with all MPI runtimes • PRELOAD option for precompiled applications • Does not require ROOT permission: • A safe kernel module to read only a subset of MSRs

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 77 MVAPICH2-EA: Application Oblivious Energy-Aware-MPI (EAM)

• An energy-efficient runtime that provides energy savings without application knowledge
• Automatically and transparently uses the best energy lever
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies an energy reduction lever to each MPI call (a sketch of this per-call approach via the PMPI profiling interface follows below)
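For illustration only, the per-call lever idea can be sketched with the standard PMPI profiling interface. Here apply_low_power_lever() and restore_performance() are placeholders for platform mechanisms such as DVFS or core idling; they are assumptions, not MVAPICH2-EA APIs, and EAM itself decides selectively inside the runtime rather than wrapping every call.

/* Sketch of applying an "energy lever" around a blocking MPI call through
 * the standard PMPI profiling interface. */
#include <mpi.h>

static void apply_low_power_lever(void)  { /* e.g. lower CPU frequency */ }
static void restore_performance(void)    { /* restore nominal frequency */ }

/* A "pessimistic" runtime would wrap every blocking call like this. */
int MPI_Recv(void *buf, int count, MPI_Datatype type, int src, int tag,
             MPI_Comm comm, MPI_Status *status)
{
    apply_low_power_lever();                       /* enter low-power wait */
    int rc = PMPI_Recv(buf, count, type, src, tag, comm, status);
    restore_performance();                         /* back to full speed   */
    return rc;
}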

A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, A Case for Application-Oblivious Energy-Efficient MPI Runtime, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 78 MPI-3 RMA Energy Savings with Proxy-Applications

[Charts: Graph500 execution time (s) and energy usage (Joules) at 128, 256, and 512 processes; legend: optimistic, pessimistic, EAM-RMA]
• MPI_Win_fence dominates application execution time in Graph500
• Between 128 and 512 processes, EAM-RMA yields between 31% and 46% savings with no degradation in execution time compared with the default optimistic MPI runtime
• MPI-3 RMA energy-efficient support will be available in an upcoming MVAPICH2-EA release (a minimal fence-synchronized RMA sketch follows below)
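For readers unfamiliar with the fence synchronization that dominates here, this is a minimal MPI-3 RMA sketch (not Graph500 itself): each rank exposes one integer in a window and puts its rank into its right neighbor between two MPI_Win_fence calls, the synchronization points the EAM-RMA design targets.

/* Minimal fence-synchronized MPI-3 RMA sketch; illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    int local = 0;                       /* exposed through the window */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int value = rank;
    int right = (rank + 1) % size;

    MPI_Win_fence(0, win);                     /* open access epoch  */
    MPI_Put(&value, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                     /* close access epoch */

    printf("rank %d received %d from its left neighbor\n", rank, local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}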

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 79 MVAPICH2 Software Family

Requirements Library

MPI with IB, iWARP, Omni-Path, and RoCE MVAPICH2

Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE MVAPICH2-X
MPI with IB, RoCE & GPU and Support for Deep Learning MVAPICH2-GDR

HPC Cloud with MPI & IB MVAPICH2-Virt

Energy-aware MPI with IB, iWARP and RoCE MVAPICH2-EA

MPI Energy Monitoring Tool OEMT

InfiniBand Network Analysis and Monitoring OSU INAM

Microbenchmarks for Measuring MPI and PGAS Performance OMB

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 80 Overview of OSU INAM

• A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime
  – http://mvapich.cse.ohio-state.edu/tools/osu-inam/
• Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
• Capability to analyze and profile node-level, job-level and process-level activities for MPI communication
  – Point-to-Point, Collectives and RMA
• Ability to filter data based on type of counters using a "drop down" list
• Remotely monitor various metrics of MPI processes at user-specified granularity
• "Job Page" to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X
• Visualize the data transfer happening in a "live" or "historical" fashion for the entire network, a job, or a set of nodes
• OSU INAM v0.9.3 released on 03/16/2018
  – Enhance INAMD to query end nodes based on a command line option
  – Add a web page to display the size of the database in real time
  – Enhance interaction between the web application and the SLURM job launcher for increased portability
  – Improve packaging of the web application and daemon to ease installation

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 81 OSU INAM Features

[Screenshots: Comet@SDSC clustered view (1,879 nodes, 212 switches, 4,377 network links); finding routes between nodes]
• Show network topology of large clusters
• Visualize traffic pattern on different links
• Quickly identify congested links/links in error state
• See the history unfold – play back historical state of the network

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 82 OSU INAM Features (Cont.)

[Screenshots: Visualizing a Job (5 Nodes); Estimated Process Level Link Utilization]
• Job level view
  – Show different network metrics (load, error, etc.) for any live job
  – Play back historical data for completed jobs to identify bottlenecks
• Node level view – details per process or per node
  – CPU utilization for each rank/node
  – Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
  – Network metrics (e.g. XmitDiscard, RcvError) per rank/node
• Estimated Link Utilization view
  – Classify data flowing over a network link at different granularity in conjunction with MVAPICH2-X 2.2rc1
    • Job level
    • Process level
More Details in Poster: High-Performance and Scalable Fabric Analysis, Monitoring and Introspection Infrastructure for HPC - Hari Subramoni
Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 83 OSU Microbenchmarks
• Available since 2004
• Suite of microbenchmarks to study communication performance of various programming models (a minimal latency sketch in this spirit follows below)
• Benchmarks available for the following programming models
  – Message Passing Interface (MPI)
  – Partitioned Global Address Space (PGAS)
    • Unified Parallel C (UPC)
    • Unified Parallel C++ (UPC++)
    • OpenSHMEM
• Benchmarks available for multiple accelerator based architectures
  – Compute Unified Device Architecture (CUDA)
  – OpenACC Application Program Interface
• Part of various national resource procurement suites like NERSC-8 / Trinity Benchmarks
• Continuing to add support for newer primitives and features
• Please visit the following link for more information
  – http://mvapich.cse.ohio-state.edu/benchmarks/
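In the spirit of the OSU microbenchmarks (but not the actual OMB source), a stripped-down point-to-point latency test can be sketched as follows; the real osu_latency additionally handles warm-up iterations, a range of message sizes, and accelerator buffers. Run it with exactly two MPI processes.

/* Minimal point-to-point latency sketch; illustrative only, not OMB code. */
#include <mpi.h>
#include <stdio.h>

#define ITER 1000
#define SIZE 8            /* message size in bytes */

int main(int argc, char **argv)
{
    int rank;
    char buf[SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITER; i++) {          /* ping-pong loop */
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = half the round-trip time */
        printf("%d-byte latency: %.2f us\n", SIZE,
               (t1 - t0) * 1e6 / (2.0 * ITER));

    MPI_Finalize();
    return 0;
}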

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 84 Applications-Level Tuning: Compilation of Best Practices

• The MPI runtime has many parameters
• Tuning a set of parameters can help extract higher performance (see the MPI_T sketch below for one way to inspect them)
• Compiled a list of such contributions through the MVAPICH Website
  – http://mvapich.cse.ohio-state.edu/best_practices/
• Initial list of applications
  – Amber
  – HOOMD-blue
  – HPCG
  – Lulesh
  – MILC
  – Neuron
  – SMG2000
• Soliciting additional contributions; send your results to mvapich-help at cse.ohio-state.edu
• We will link these results with credits to you
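One portable way to see which control variables a given MPI library exposes, before consulting the best-practices pages, is the standard MPI_T tools interface. The sketch below simply enumerates whatever control variables the installed library reports; it assumes nothing about their names or meanings.

/* Sketch: list the control variables an MPI library exposes via MPI_T. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num_cvars, rank;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&num_cvars);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        printf("library exposes %d control variables\n", num_cvars);
        for (int i = 0; i < num_cvars; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;

            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &bind, &scope);
            printf("  %s\n", name);   /* names are library-specific */
        }
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}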

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 85 MVAPICH Team Part of New NSF Tier-1 System

• TACC and MVAPICH partner to win new NSF Tier-1 System – https://www.hpcwire.com/2018/07/30/tacc-wins-next-nsf-funded-major-/ • The MVAPICH team will be an integral part of the effort • An overview of the recently awarded NSF Tier 1 System at TACC will be presented by Dan Stanzione – The presentation will also include discussion on MVAPICH collaboration on past systems and this upcoming system at TACC

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 86 Commercial Support for MVAPICH2 Libraries

• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
  – Help and guidance with installation of the library
  – Platform-specific optimizations and tuning
  – Timely support for operational issues encountered with the library
  – Web portal interface to submit issues and track their progress
  – Advanced debugging techniques
  – Application-specific optimizations and tuning
  – Obtaining guidelines on best practices
  – Periodic information on major fixes and updates
  – Information on major releases
  – Help with upgrading to the latest release
  – Flexible Service Level Agreements
• Support provided to Lawrence Livermore National Laboratory (LLNL) this year

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 87 MVAPICH2 – Plans for Exascale

• Performance and Memory scalability toward 1M-10M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …) (see the sketch below)
• MPI + Task*
• Enhanced Optimization for GPUs and FPGAs*
• Taking advantage of advanced features of Mellanox InfiniBand
  – Tag Matching*
  – Adapter Memory*
• Enhanced communication schemes for upcoming architectures
  – NVLINK*
  – CAPI*
• Extended topology-aware collectives
• Extended Energy-aware designs and Virtualization Support
• Extended Support for MPI Tools Interface (as in MPI 3.0)
• Extended FT support
• Support for * features will be available in future MVAPICH2 Releases
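As a hedged illustration of the hybrid MPI+PGAS model that MVAPICH2-X targets with a unified runtime, the sketch below mixes OpenSHMEM one-sided puts with an MPI collective in a single program. It assumes a runtime (such as MVAPICH2-X) that allows MPI and OpenSHMEM to be initialized and used together; the specific initialization order shown is an assumption, not a documented requirement.

/* Hybrid MPI + OpenSHMEM sketch; illustrative only. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();

    int rank = shmem_my_pe();
    int npes = shmem_n_pes();

    /* One-sided PGAS-style exchange: put our rank into the right neighbor's slot. */
    int *slot = (int *) shmem_malloc(sizeof(int));
    *slot = -1;
    shmem_barrier_all();
    shmem_int_p(slot, rank, (rank + 1) % npes);
    shmem_barrier_all();

    /* Two-sided MPI collective over the same set of processes. */
    int sum = 0;
    MPI_Allreduce(slot, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of received ranks = %d\n", sum);   /* npes*(npes-1)/2 */

    shmem_free(slot);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}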

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 88 Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project http://mvapich.cse.ohio-state.edu/

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 89