Overview of the MVAPICH Project: Latest Status and Future Roadmap

MVAPICH2 User Group (MUG) Meeting

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Drivers of Modern HPC Cluster Architectures

Multi-core Processors | High Performance Interconnects - InfiniBand (<1 usec latency, >100 Gbps bandwidth) | Accelerators / Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip)
• Multi-core processors are ubiquitous
• InfiniBand very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing

Example systems (Top500 rank): Tianhe-2 (1), Titan (2), Stampede (8), Tianhe-1A (24)

Parallel Programming Models Overview

[Diagram: three programming model classes, each shown with processes P1, P2, P3 and their memories - Shared Memory Model (SHMEM, DSM) with logical shared memory, Distributed Memory Model (MPI - Message Passing Interface), and Partitioned Global Address Space (PGAS: UPC, Chapel, X10, CAF, ...)]

• Programming models provide abstract machine models
• Models can be mapped onto different types of systems
  – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

Application Kernels/Applications

Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Hadoop, MapReduce, etc.
Co-Design Opportunities and Challenges across Various Layers (Middleware)

Communication Library or Runtime for Programming Models

Point-to-point Communication (two-sided & one-sided) | Collective Communication | Synchronization & Locks | I/O & File Systems | Fault Tolerance

Networking Technologies (InfiniBand, 10/40GigE, Aries, OmniPath) | Multi/Many-core Architectures | Accelerators (NVIDIA and MIC)

Designing (MPI+X) at Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely minimal memory footprint
• Hybrid programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, ...)
• Balancing intra-node and inter-node communication for next-generation multi-core systems (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
  – Power-aware
• Support for the MPI-3 RMA model (see the sketch after this list)
• Support for GPGPUs and Accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Virtualization support
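As a concrete reference for the MPI-3 RMA item above, here is a minimal sketch of one-sided communication with active-target (fence) synchronization. It is generic MPI-3 code, not MVAPICH2-specific, and the ring-style exchange is purely illustrative.

```c
/* Minimal sketch of MPI-3 one-sided (RMA) communication with active-target
 * synchronization. Illustrative only; not MVAPICH2-specific code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Expose one integer per process through an RMA window. */
    int *win_buf;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);
    *win_buf = -1;

    MPI_Win_fence(0, win);               /* open access epoch              */
    int value = rank;
    MPI_Put(&value, 1, MPI_INT,          /* write my rank ...              */
            (rank + 1) % size,           /* ... into my right neighbor     */
            0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);               /* close epoch; puts are complete */

    printf("Rank %d received %d\n", rank, *win_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```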

The MVAPICH2 Software Family

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, RDMA over Converged Enhanced Ethernet (RoCE), and virtualized clusters
  – MVAPICH (MPI-1), available since 2002
  – MVAPICH2 (MPI-2.2, MPI-3.0 and MPI-3.1), available since 2004
  – MVAPICH2-X (Advanced MPI + PGAS), available since 2012
  – Support for GPGPUs (MVAPICH2-GDR), available since 2014
  – Support for MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
• http://mvapich.cse.ohio-state.edu

MVAPICH Project Timeline

[Timeline chart: start dates of the MVAPICH project components - MVAPICH (Oct-02, EOL Jul-15), OMB, MVAPICH2, MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-MIC, and MVAPICH2-Virt]

MVAPICH/MVAPICH2 Release Timeline and Downloads

[Chart: cumulative number of downloads (over 280,000) from Sep-04 through Mar-15, annotated with releases from MV 0.9.4 and MV2 0.9.0 through MV2 2.2a, MV2-X 2.2a, MV2-GDR 2.1rc2, MV2-Virt 2.1rc2, and MV2-MIC 2.0]

The MVAPICH2 Software Family (Cont.)

• Empowering many TOP500 clusters (July '15 ranking)
  – 8th ranked 519,640-core cluster (Stampede) at TACC
  – 11th ranked 185,344-core cluster (Pleiades) at NASA
  – 22nd ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
  – and many others
• Used by more than 2,450 organizations in 76 countries
• More than 282,000 downloads from the OSU site directly
• Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
• Empowering Top500 systems for over a decade
  – From System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops)
  – to Stampede at TACC (8th in Jun '15, 462,462 cores, 5.168 PFlops)

Usage Guidelines for the MVAPICH2 Software Family

Requirements -> MVAPICH2 library to use
MPI with IB, iWARP and RoCE -> MVAPICH2
Advanced MPI, PGAS and MPI+PGAS with IB and RoCE -> MVAPICH2-X
MPI with IB & GPU -> MVAPICH2-GDR
MPI with IB & MIC -> MVAPICH2-MIC
HPC Cloud with MPI & IB -> MVAPICH2-Virt

Strong Procedure for Design, Development and Release

• Research is done to explore new designs
• Designs are first presented in conference/journal publications
• Best-performing designs are incorporated into the codebase
• Rigorous QA procedure before making a release
  – Exhaustive unit testing
  – Various test procedures on a diverse range of platforms and interconnects
  – Performance tuning
  – Application-based evaluation
  – Evaluation on large-scale systems
• Even alpha and beta versions go through the above testing

MVAPICH2 2.2a

• Released on 08/18/2015
• Major Features and Enhancements
  – Based on MPICH-3.1.4
  – Support for backing on-demand UD CM information with shared memory to minimize memory footprint
  – Dynamic identification of the maximum read/atomic operations supported by the HCA
  – Support for intra-node communication in RoCE mode without shared memory
  – Updated to hwloc 1.11.0
  – Updated sm_20 kernel optimizations for MPI Datatypes
  – Automatic detection and tuning for the 24-core Haswell architecture
  – Enhanced startup performance
    • Support for PMI-2 based startup with SLURM
  – Checkpoint-Restart support with DMTCP (Distributed MultiThreaded CheckPointing)
  – Enhanced communication performance for small/medium message sizes
  – Support for linking with the Intel Trace Analyzer and Collector

One-way Latency: MPI over IB with MVAPICH2

[Charts: Small Message Latency and Large Message Latency (us) vs. Message Size (bytes) for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, and ConnectX-4-EDR; small-message latencies in the 1.05-1.26 us range]

TrueScale-QDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-DualFDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-4-EDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 back-to-back

Bandwidth: MPI over IB with MVAPICH2

[Charts: Unidirectional Bandwidth (peak ~12,465 MB/s) and Bidirectional Bandwidth (peak ~24,353 MB/s) vs. Message Size (bytes, 4 bytes to 1M) for the same four HCAs]

TrueScale-QDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-DualFDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-4-EDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 back-to-back

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))

[Chart: intra-node latency (us) vs. message size (0 bytes to 1K) on Intel Ivy Bridge with the latest MVAPICH2 2.2a: ~0.18 us intra-socket, ~0.45 us inter-socket]
[Charts: intra-socket and inter-socket bandwidth (MB/s) vs. message size for CMA, shared-memory, and LiMIC channels; peaks of ~14,250 MB/s (intra-socket) and ~13,749 MB/s (inter-socket)]

MVAPICH2-X for Advanced MPI, Hybrid MPI + PGAS Applications

MPI, OpenSHMEM, UPC, CAF or Hybrid (MPI + PGAS) Applications

OpenSHMEM Calls CAF Calls UPC Calls MPI Calls

Unified MVAPICH2-X Runtime

InfiniBand, RoCE, iWARP

• Current model: separate runtimes for OpenSHMEM/UPC/CAF and MPI
  – Possible deadlock if both runtimes are not progressed
  – Consumes more network resources
• Unified communication runtime for MPI, UPC, OpenSHMEM, and CAF available with MVAPICH2-X 1.9 onwards!
  – http://mvapich.cse.ohio-state.edu

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on their communication characteristics (a minimal hybrid sketch follows the footnote below)
• Benefits:
  – Best of the Distributed Memory Computing Model (MPI)
  – Best of the Shared Memory Computing Model (PGAS)
• Exascale Roadmap*:
  – "Hybrid Programming is a practical way to program exascale systems"
[Diagram: an HPC Application composed of Kernel 1 (MPI), Kernel 2 (PGAS/MPI), Kernel 3 (MPI), ..., Kernel N (PGAS/MPI)]

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420
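To make the hybrid model concrete, the following is a minimal MPI + OpenSHMEM sketch. It assumes a unified runtime such as MVAPICH2-X, where MPI ranks and OpenSHMEM PEs refer to the same processes; the exact initialization requirements (whether both MPI_Init and start_pes are needed, and in which order) depend on the installation, so treat this as a sketch rather than a recipe.

```c
/* Hedged sketch of hybrid MPI + OpenSHMEM under an assumed unified runtime
 * (e.g., MVAPICH2-X) where MPI ranks and OpenSHMEM PEs coincide. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    start_pes(0);                         /* OpenSHMEM 1.0-style start-up */

    int me = _my_pe();
    int npes = _num_pes();

    /* One-sided phase: put my PE id into my neighbor's symmetric heap. */
    int *sym = (int *) shmalloc(sizeof(int));
    *sym = -1;
    shmem_int_put(sym, &me, 1, (me + 1) % npes);
    shmem_barrier_all();

    /* Collective phase handled with MPI. */
    int sum = 0;
    MPI_Allreduce(sym, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (me == 0)
        printf("Sum of received PE ids: %d\n", sum);

    shfree(sym);
    MPI_Finalize();
    return 0;
}
```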

MVAPICH2-X 2.2a

• Released on 08/18/2015
• MVAPICH2-X 2.2a Feature Highlights
  – Based on MVAPICH2 2.2a, including MPI-3 features
    • Compliant with UPC 2.20.2, OpenSHMEM v1.0h and CAF 3.0.39
  – MPI (Advanced) Features
    • Support for the Dynamically Connected (DC) transport protocol
      - Available for pt-to-pt, RMA and collectives
      - Support for hybrid mode with RC/DC/UD/XRC
    • Support for Core-Direct based non-blocking collectives (see the sketch after this list)
      - Available for Ibcast, Ibarrier, Iscatter, Igather, Ialltoall and Iallgather
  – OpenSHMEM Features
    • Support for RoCE
    • Support for the Dynamically Connected (DC) transport protocol
  – UPC Features
    • Based on Berkeley UPC 2.20.2 (contains changes/additions in preparation for the upcoming UPC 1.3 specification)
    • Support for RoCE
    • Support for the Dynamically Connected (DC) transport protocol
  – CAF Features
    • Support for RoCE
    • Support for the Dynamically Connected (DC) transport protocol
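As an illustration of the non-blocking collective interface listed above, the following minimal MPI-3 sketch starts an MPI_Iallgather, overlaps it with independent work, and completes it with MPI_Wait. It is generic MPI code (the loop sizes are arbitrary), not MVAPICH2-X internals.

```c
/* Minimal MPI-3 non-blocking collective sketch: overlap MPI_Iallgather
 * with independent computation. Generic MPI code, not library-specific. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendval = rank;
    int *recvbuf = (int *) malloc(size * sizeof(int));

    MPI_Request req;
    MPI_Iallgather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT,
                   MPI_COMM_WORLD, &req);

    /* Independent computation that can overlap with the collective. */
    double local = 0.0;
    for (int i = 0; i < 1000000; i++)
        local += (double) i * 0.5;

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* collective now complete */

    if (rank == 0)
        printf("Gathered %d values; local work = %f\n", size, local);

    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```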

Minimizing Memory Footprint Further by DC Transport

• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
  – DC Target identified by a "DCT Number"
  – Messages routed with (DCT Number, LID)
  – Requires the same "DC Key" to enable communication
• Available with MVAPICH2-X 2.2a
[Diagram: DC connections among processes P0-P7 across Nodes 0-3 over the IB network]
[Charts: connection memory (KB) for Alltoall and normalized execution time for NAMD (Apoa1, large data set) vs. number of processes, comparing RC, DC-Pool, UD, and XRC]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14).

Application-Level Performance with Graph500 and Sort

[Charts: Graph500 execution time (s) vs. number of processes (4K, 8K, 16K) for MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM); Sort execution time (s) for MPI vs. Hybrid at 500GB-512 through 4TB-4K (input data - no. of processes)]
• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  – 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
• Performance of the Hybrid (MPI+OpenSHMEM) Sort application
  – 4,096 processes, 4 TB input size: MPI 2408 sec (0.16 TB/min); Hybrid 1172 sec (0.36 TB/min); 51% improvement over the MPI design
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013.
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012.

GPU-Direct RDMA (GDR) with CUDA

• OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
• OSU has a design of MVAPICH2 using GPUDirect RDMA
  – Hybrid design using GPU-Direct RDMA
    • GPUDirect RDMA and host-based pipelining
    • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
  – Support for communication using multi-rail
  – Support for Mellanox Connect-IB and ConnectX VPI adapters
  – Support for RoCE with Mellanox ConnectX VPI adapters
[Diagram: node layout with CPU, system memory, chipset, IB adapter, GPU and GPU memory (SNB E5-2670 / IVB E5-2680V2)]
P2P bandwidth: SNB E5-2670 - write 5.2 GB/s, read < 1.0 GB/s; IVB E5-2680V2 - write 6.4 GB/s, read 3.5 GB/s

MVAPICH2-GDR 2.1rc2

• Released on 06/24/2015
• Major Features and Enhancements
  – Based on MVAPICH2 2.1rc2
  – CUDA 7.0 compatibility
  – CUDA-Aware support for the MPI_Rsend and MPI_Irsend primitives
  – Parallel intra-node communication channels (shared memory for H-H and GDR for D-D)
  – Optimized H-H, H-D and D-H communication
  – Optimized intra-node D-D communication
  – Optimization and tuning for point-to-point and collective operations
  – Updated sm_20 kernel optimization for Datatype processing
  – Optimized design for GPU-based small message transfers
  – Added R3 support for GPU-based packetized transfers
  – Enhanced performance for small message host-to-device transfers
  – Support for MPI_Scan and MPI_Exscan collective operations from GPU buffers
  – Optimization of collectives with new copy designs
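The features above assume CUDA-aware MPI usage, i.e., passing GPU device pointers directly to MPI calls. The following minimal sketch shows that pattern; it is generic CUDA-aware MPI code (buffer size and message tag are illustrative), not MVAPICH2-GDR internals. With MVAPICH2-GDR such a program is typically run with CUDA support enabled at runtime, e.g., MV2_USE_CUDA=1 as listed among the runtime parameters later in this talk.

```c
/* Minimal CUDA-aware MPI sketch: pass device pointers directly to MPI.
 * Generic usage pattern; with MVAPICH2-GDR this is typically run with
 * MV2_USE_CUDA=1 (see the runtime parameters listed later). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                 /* 1M floats (illustrative size) */
    float *d_buf;
    cudaMalloc((void **) &d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    /* Device buffers go straight into MPI calls; the library moves the
     * data (GDR, pipelining, etc.) underneath. */
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```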

Performance of MVAPICH2-GDR with GPU-Direct-RDMA

[Charts: GPU-GPU internode small-message latency (2.15 usec with MV2-GDR 2.1RC2), uni-directional bandwidth, and bi-directional bandwidth vs. message size, comparing MV2-GDR 2.1RC2, MV2-GDR 2.0b, and MV2 without GDR]
Test setup: MVAPICH2-GDR 2.1RC2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct-RDMA

Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

MPI3-RMA Performance of MVAPICH2-GDR with GPU-Direct-RDMA
GPU-GPU internode MPI_Put latency (RMA put operation, device to device)

MPI-3 RMA provides flexible synchronization and completion primitives
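As an example of that flexibility, the following minimal sketch uses passive-target synchronization (MPI_Win_lock_all, MPI_Put, MPI_Win_flush). It is generic MPI-3 code operating on host buffers for simplicity; the chart below measures the same kind of put operation between GPU device buffers with MVAPICH2-GDR.

```c
/* Minimal MPI-3 passive-target RMA sketch: lock_all / put / flush.
 * Generic MPI code with host buffers; not MVAPICH2-GDR internals. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *win_buf;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);
    *win_buf = 0;
    MPI_Barrier(MPI_COMM_WORLD);         /* windows are ready everywhere    */

    MPI_Win_lock_all(0, win);            /* passive-target epoch, all ranks */
    int value = rank + 1;
    int target = (rank + 1) % size;
    MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    MPI_Win_flush(target, win);          /* put is now complete at target   */
    MPI_Win_unlock_all(win);
    MPI_Barrier(MPI_COMM_WORLD);

    /* Read the locally exposed value inside a local lock epoch so the
     * window memory is synchronized before the load. */
    MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);
    printf("Rank %d window value: %d\n", rank, *win_buf);
    MPI_Win_unlock(rank, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```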

[Chart: small-message MPI_Put latency vs. message size (0 bytes to 8K) for MV2-GDR 2.1RC2 vs. MV2-GDR 2.0b; 2.88 us with MV2-GDR 2.1RC2, about a 7X improvement]

Test setup: MVAPICH2-GDR 2.1RC2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct-RDMA

MPI Applications on MIC Clusters

• MPI (+X) continues to be the predominant programming model in HPC
• Flexibility in launching MPI jobs on clusters with Xeon Phi
[Diagram: execution modes spanning multi-core centric (Xeon) to many-core centric (Xeon Phi):
  – Host-only: MPI program on the host
  – Offload (/reverse offload): MPI program on the host with offloaded computation on the coprocessor
  – Symmetric: MPI programs on both the host and the coprocessor
  – Coprocessor-only: MPI program on the coprocessor]

MVAPICH2-MIC Design for Clusters with IB and MIC

• Offload Mode
• Intranode Communication
  – Coprocessor-only Mode
  – Symmetric Mode
• Internode Communication
  – Coprocessor-only
  – Symmetric Mode
• Multi-MIC Node Configurations

MVAPICH2-MIC 2.0

• Released on 12/02/2014
• Major Features and Enhancements
  – Based on MVAPICH2 2.0.1
  – Support for native, symmetric and offload modes of MIC usage
  – Optimized intra-MIC communication using SCIF and shared-memory channels
  – Optimized intra-node Host-to-MIC communication using SCIF and IB channels
  – Enhanced mpirun_rsh to launch jobs in symmetric mode from the host
  – Support for proxy-based communication for inter-node transfers
    • Active-proxy, 1-hop and 2-hop designs (actively using the host CPU)
    • Passive-proxy (passively using the host CPU)
  – Support for MIC-aware MPI_Bcast()
  – Improved SCIF performance for pipelined communication
  – Optimized shared-memory communication performance for single-MIC jobs
  – Supports an explicit CPU-binding mechanism for MIC processes
  – Tuned pt-to-pt intra-MIC, intra-node, and inter-node transfers
  – Supports hwloc v1.9
• Running on three major systems
  – Stampede
  – Blueridge (Virginia Tech)
  – Beacon (UTK)

MIC-Remote-MIC P2P Communication with Proxy-based Communication

[Charts: intra-socket and inter-socket P2P latency (large messages, 8K-2M bytes) and bandwidth (1 byte to 1M bytes); peak bandwidth of 5236 MB/sec (intra-socket) and 5594 MB/sec (inter-socket)]

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

[Charts: 32-node Allgather latency (16H + 16M small messages; 8H + 8M large messages) and 32-node Alltoall large-message latency for MV2-MIC vs. MV2-MIC-Opt, with improvements of 58%, 76% and 55%; P3DFFT communication/computation time on 32 nodes (8H + 8M), size 2K*2K*1K, for MV2-MIC-Opt vs. MV2-MIC]
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014.

MVAPICH2-Virt 2.1rc2

• Enables high-performance communication in virtualized environments
  – SR-IOV based communication for inter-node MPI communication
  – Inter-VM Shared Memory (IVSHMEM) based communication for intra-node, inter-VM MPI communication
  – http://mvapich.cse.ohio-state.edu

MVAPICH2-Virt 2.1rc2

• Released on 06/26/2015
• Major Features and Enhancements
  – Based on MVAPICH2 2.1rc2
  – Support for efficient MPI communication over SR-IOV enabled InfiniBand networks
  – High-performance and locality-aware MPI communication with IVSHMEM
  – Support for IVSHMEM device auto-detection in virtual machines
  – Automatic communication channel selection among SR-IOV, IVSHMEM, and CMA/LiMIC2
  – Support for easy configuration through runtime parameters
  – Tested with Mellanox InfiniBand adapters (ConnectX-3, 56 Gbps)

Application-Level Performance (8 VM * 8 Core/VM)

[Charts: execution time of NAS (Class C: FT-64-C, EP-64-C, LU-64-C, BT-64-C) and Graph500 (problem sizes (scale, edgefactor) 20,10 through 22,20) for MV2-SR-IOV-Def, MV2-SR-IOV-Opt and MV2-Native]

• Compared to Native, 1-9% overhead for NAS
• Compared to Native, 4-9% overhead for Graph500

OSU Micro-Benchmarks (OMB)

• Started in 2004 and continuing steadily
• Allows MPI developers and users to test and evaluate MPI libraries
• Has a wide range of benchmarks
  – Two-sided (MPI-1, MPI-2 and MPI-3)
  – One-sided (MPI-2 and MPI-3)
  – RMA (MPI-3)
  – Collectives (MPI-1, MPI-2 and MPI-3)
  – Extensions for GPU-aware communication (CUDA and OpenACC)
  – UPC (pt-to-pt)
  – OpenSHMEM (pt-to-pt and collectives)
  – Startup
• Widely used in the MPI community
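For readers unfamiliar with what these benchmarks measure, here is a minimal ping-pong latency sketch in the spirit of osu_latency. It is an illustrative reimplementation, not the actual OMB source; the message size and iteration count are arbitrary.

```c
/* Illustrative ping-pong latency measurement (in the spirit of osu_latency,
 * not the actual OMB source). Run with exactly 2 processes. */
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000
#define MSG_SIZE 8      /* bytes; illustrative */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[MSG_SIZE] = {0};
    MPI_Barrier(MPI_COMM_WORLD);

    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    /* One-way latency = round-trip time / 2, averaged over iterations. */
    if (rank == 0)
        printf("%d-byte one-way latency: %.2f us\n",
               MSG_SIZE, elapsed * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```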

OSU Microbenchmarks v5.0

• OSU Micro-Benchmarks 5.0 (08/18/15)
• Non-blocking collective benchmarks
  – osu_iallgather, osu_ialltoall, osu_ibarrier, osu_ibcast, osu_igather, and osu_iscatter
  – Benchmarks can display the amount of achievable overlap
    • Overlap is defined as the amount of computation that can be performed while the communication progresses in the background (see the sketch below)
    • Additional option "-t" sets the number of MPI_Test() calls made during the dummy computation (set it to 100, 1000, or any number > 0)
• Startup benchmarks
  – osu_init: measures the time taken for an MPI library to complete MPI_Init
  – osu_hello: measures the time taken for an MPI library to complete a simple hello world MPI program
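The following minimal sketch shows the overlap-measurement idea behind these benchmarks: start a non-blocking collective, poll it with MPI_Test a fixed number of times during dummy computation, and finish with MPI_Wait. It is an illustrative pattern, not the actual osu_ibcast implementation; buffer size and the number of test calls are arbitrary.

```c
/* Illustrative overlap pattern behind the non-blocking collective benchmarks:
 * progress an MPI_Ibcast with MPI_Test calls during dummy computation.
 * Not the actual OMB implementation. */
#include <mpi.h>
#include <stdio.h>

#define CALLS 100            /* number of MPI_Test() calls (like "-t") */
#define COUNT 4096           /* illustrative message size in ints */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf[COUNT];
    for (int i = 0; i < COUNT; i++)
        buf[i] = rank;

    MPI_Request req;
    MPI_Ibcast(buf, COUNT, MPI_INT, 0, MPI_COMM_WORLD, &req);

    /* Dummy computation, interleaved with MPI_Test so the library gets a
     * chance to progress the collective in the "background". */
    double work = 0.0;
    int done = 0;
    for (int c = 0; c < CALLS; c++) {
        for (int i = 0; i < 10000; i++)
            work += i * 1e-6;
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("bcast complete, dummy work = %f\n", work);

    MPI_Finalize();
    return 0;
}
```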

Overview of OSU INAM

• OSU INAM monitors IB clusters in real time by querying various subnet management entities in the network
• Major features of the OSU INAM tool include:
  – Analyze and profile network-level activities with many parameters (data and errors) at user-specified granularity
  – Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (pt-to-pt, collectives and RMA)
  – Remotely monitor CPU utilization of MPI processes at user-specified granularity
  – Visualize data transfers as they happen ("live" view) for:
    • the entire network (live network-level view)
    • a particular job (live job-level view)
    • one or multiple nodes (live node-level view)
  – Capability to visualize data transfers that happened in the network during a time duration in the past (historical view) for:
    • the entire network (historical network-level view)
    • a particular job (historical job-level view)
    • one or multiple nodes (historical node-level view)

OSU INAM – Network Level View

[Screenshots: Full Network (152 nodes); Zoomed-in View of the Network]

• Show the network topology of large clusters
• Visualize traffic patterns on different links
• Quickly identify congested links and links in an error state
• See the history unfold: play back the historical state of the network

OSU INAM – Job and Node Level Views

[Screenshots: Visualizing a Job (5 Nodes); Finding Routes Between Nodes]

• Job-level view
  – Show different network metrics (load, error, etc.) for any live job
  – Play back historical data for completed jobs to identify bottlenecks
• Node-level view provides details per process or per node
  – CPU utilization for each rank/node
  – Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
  – Network metrics (e.g. XmitDiscard, RcvError) per rank/node

MVAPICH2-EA & OEMT

• MVAPICH2-EA (Energy-Aware)
  – A white-box approach
  – New energy-efficient communication protocols for pt-pt and collective operations
  – Intelligently applies the appropriate energy-saving techniques
  – Application-oblivious energy saving
• OEMT
  – A library utility to measure energy consumption for MPI applications
  – Works with all MPI runtimes
  – PRELOAD option for precompiled applications
  – Does not require ROOT permission:
    • A safe kernel module to read only a subset of MSRs

MV2-EA: Application-Oblivious Energy-Aware MPI (EAM)

• An energy-efficient runtime that provides energy savings without application knowledge
• Automatically and transparently uses the best energy lever
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies an energy-reduction lever to each MPI call

A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]

MVAPICH2 – Plans for Exascale

• Performance and memory scalability toward 500K-1M cores
  – Dynamically Connected Transport (DCT) service with Connect-IB
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
• Enhanced optimization for GPU support and accelerators
• Taking advantage of advanced features
  – User Mode Memory Registration (UMR)
  – On-demand paging
• Enhanced inter-node and intra-node communication schemes for upcoming OmniPath-enabled Knights Landing architectures
• Extended RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware point-to-point (one-sided and two-sided) and collectives
• Extended support for the MPI Tools Interface (as in MPI 3.0)
• Extended checkpoint-restart and migration support with SCR

Web Pointers

NOWLAB Web Page: http://nowlab.cse.ohio-state.edu

MVAPICH Web Page: http://mvapich.cse.ohio-state.edu
