MVAPICH2 Project: Latest Status and Future Plans

Total Pages: 16

File Type: PDF, Size: 1020 KB

MVAPICH2 Project: Latest Status and Future Plans
ECP Community BOF Days, March 30 – April 1, 2021
Hosts: Ashley Barker (ORNL) and Osni Marques (LBNL)
Presenter: Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

History of MVAPICH
• A long time ago, in a galaxy far, far away… (actually, 21 years ago), there existed:
  – MPICH – a high-performance and widely portable implementation of the MPI standard, from ANL
  – MVICH – an implementation of the MPICH ADI-2 for VIA (the Virtual Interface Architecture, a precursor to InfiniBand), from LBL
  – VAPI – a verbs-level API, the initial InfiniBand API from the IB vendors (an older version of the OFED/IB verbs)
• MPICH + MVICH + VAPI = MVAPICH

Overview of the MVAPICH2 Project
• High-performance open-source MPI library
• Support for multiple interconnects – InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), and AWS EFA
• Support for multiple platforms – x86, OpenPOWER, ARM, Xeon Phi, and GPGPUs (NVIDIA and AMD)
• Started in 2001; first open-source version demonstrated at SC ’02
• Supports the latest MPI-3.1 standard
• Used by more than 3,150 organizations in 89 countries
• More than 1.28 million downloads directly from the OSU site (http://mvapich.cse.ohio-state.edu)
• Additional optimized versions for different systems/environments:
  – MVAPICH2-X (advanced MPI + PGAS), since 2011
  – MVAPICH2-GDR with support for NVIDIA GPGPUs, since 2014
  – MVAPICH2-MIC with support for Intel Xeon Phi, since 2014
  – MVAPICH2-Virt with virtualization support, since 2015
  – MVAPICH2-EA with support for energy awareness, since 2015
  – MVAPICH2-Azure for Azure HPC IB instances, since 2019
  – MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
• Tools:
  – OSU MPI Micro-Benchmarks (OMB), since 2003
  – OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
• Empowering many TOP500 clusters (Nov ’20 ranking):
  – 4th: 10,649,600-core Sunway TaihuLight at NSC, Wuxi, China
  – 9th: 448,448-core Frontera at TACC
  – 14th: 391,680-core ABCI in Japan
  – 21st: 570,020-core Nurion in South Korea, and many others
• Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
• Partner in the 9th-ranked TACC Frontera system
• Empowering Top500 systems for more than 16 years

MVAPICH2 Release Timeline and Downloads
[Figure: cumulative downloads of MVAPICH/MVAPICH2 from September 2004 to December 2020, annotated with the release timeline from MV 0.9.4 and MV2 0.9.0 through MV2 2.3.5, MV2-GDR 2.3.5, MV2-X 2.3, MV2-Azure 2.3.2, MV2-AWS 2.3, and OSU INAM 0.9.6.]
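As a concrete illustration of how the base MVAPICH2 library is used (a minimal sketch, not taken from the slides; host names and process counts are placeholders), an MPI program is compiled with the mpicc wrapper and launched with mpirun_rsh or mpiexec. The OSU Micro-Benchmarks (OMB) listed above are built and run the same way, e.g. mpirun_rsh -np 2 node01 node02 ./osu_latency.

    /* hello_mvapich2.c - minimal MPI program; an illustrative sketch.
     * Build with the MVAPICH2 compiler wrapper:  mpicc hello_mvapich2.c -o hello
     * Launch with the bundled launcher, e.g.:    mpirun_rsh -np 4 node01 node01 node02 node02 ./hello
     * (or via the hydra launcher:                mpiexec -np 4 ./hello)
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);

        printf("Hello from rank %d of %d on %s\n", rank, size, name);

        MPI_Finalize();
        return 0;
    }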
Architecture of the MVAPICH2 Software Family (HPC and DL)
• High-performance parallel programming models:
  – Message Passing Interface (MPI)
  – PGAS (UPC, OpenSHMEM, CAF, UPC++)
  – Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy awareness, remote memory access, I/O and file systems, fault tolerance, active messages, introspection and analysis, job startup, and virtualization
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport protocols: RC, SRD, UD, DC; modern network features: UMR, ODP, SR-IOV, multi-rail
• Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM; modern node-level features: Optane*, NVLink, CAPI* (* upcoming)

MVAPICH2 Software Family (requirements and matching library)
• MPI with IB, iWARP, Omni-Path, and RoCE: MVAPICH2
• Advanced MPI features/support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE: MVAPICH2-X
• MPI with IB, RoCE & GPU, and support for deep/machine learning: MVAPICH2-GDR
• HPC cloud with MPI & IB: MVAPICH2-Virt
• Energy-aware MPI with IB, iWARP, and RoCE: MVAPICH2-EA
• MPI energy monitoring tool: OEMT
• InfiniBand network analysis and monitoring: OSU INAM
• Microbenchmarks for measuring MPI and PGAS performance: OMB

MVAPICH2 – Basic MPI
[Figures: MPI_Init time on Frontera at small and large scale (56 to 229,376 processes), where MVAPICH2 2.3.4 starts up to 14X and 46X faster than Intel MPI 2020; intra-node point-to-point latency on OpenPOWER, where MVAPICH2 2.3.5 reaches 0.2 us versus SpectrumMPI 10.3.1.00; MPI_Allreduce latency on OpenPOWER (1 node, 40 PPN), up to 3X lower than SpectrumMPI 10.3.1.00; and MPI_Bcast on Frontera using hardware multicast, up to 2X–3X faster than the default design (2,048 nodes across message sizes, and 32-byte messages across 2 to 2,048 nodes).]

MVAPICH2-X – Advanced MPI + PGAS + Tools
[Figures: MPI_Allreduce and MPI_Barrier using SHARP on Frontera (1 PPN, 7,861 nodes), where the SHARP-based MVAPICH2-X designs are up to 5X–9X faster than the host-based designs; intra-node latency on the ARM A64FX (Ookami) system, where MVAPICH2-X is up to 46% lower than OpenMPI; and P3DFFT on the HPC-AC cluster using the BlueField-2 DPU, comparing MV2 over the BF-2 HCA with MV2-BFO offloading using 1, 2, and 4 workers per node, together with the overhead of the RC protocol for connection establishment and communication (MVAPICH2 vs. MVAPICH2-X).]
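The headline feature of MVAPICH2-X is a unified runtime that supports MPI and PGAS models in the same program. The following is a minimal, hedged sketch of what a hybrid MPI + OpenSHMEM program looks like; it uses only standard MPI and OpenSHMEM 1.4 calls, and the exact initialization/finalization ordering expected by the MVAPICH2-X runtime should be checked against its user guide before relying on this pattern.

    /* hybrid_mpi_shmem.c - illustrative MPI + OpenSHMEM hybrid sketch (not from the slides).
     * MVAPICH2-X advertises running both models over one runtime; verify the
     * init/finalize ordering against the MVAPICH2-X user guide. */
    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        shmem_init();

        int rank, pe = shmem_my_pe(), npes = shmem_n_pes();
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Symmetric-heap counter, reachable by one-sided OpenSHMEM operations. */
        long *counter = (long *) shmem_malloc(sizeof(long));
        *counter = 0;
        shmem_barrier_all();

        /* Every PE atomically increments the counter located on PE 0. */
        shmem_long_atomic_inc(counter, 0);
        shmem_barrier_all();

        /* Mix in an MPI collective over the same set of processes. */
        long local = (long) rank, sum = 0;
        MPI_Allreduce(&local, &sum, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        if (pe == 0)
            printf("PE %d of %d: counter=%ld, MPI rank sum=%ld\n", pe, npes, *counter, sum);

        shmem_free(counter);
        shmem_finalize();
        MPI_Finalize();
        return 0;
    }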
MVAPICH2-GDR – Optimized MPI for Clusters with NVIDIA and AMD GPUs
[Figures: best performance for GPU-based transfers – device-to-device latency of 1.85 us with MVAPICH2-GDR 2.3.5, roughly 10x better than the non-GDR path; GPU-based MPI_Allreduce bandwidth on Summit for 128 MB messages, up to 1.7X better than NCCL 2.6 and ahead of SpectrumMPI 10.3 and OpenMPI 4.0.1 out to 1,536 GPUs; and TensorFlow training with MVAPICH2-GDR on Summit, sustaining more images per second than NCCL 2.6, with a time per epoch of 3 seconds, i.e. 90 epochs in 3 x 90 = 270 s (about 4.5 minutes).]

ROCm Support for AMD GPUs (available with MVAPICH2-GDR 2.3.5)
[Figures: results on the LLNL Corona cluster with ROCm 4.0.0 and MI50 AMD GPUs – intra-node and inter-node point-to-point latency, plus Allreduce and Broadcast on 128 GPUs (16 nodes, 8 GPUs per node); OpenMPI encountered a UCX error at message sizes above 256 KB.]

PyTorch at Scale: Training ResNet-50 on 256 GPUs (LLNL Lassen)
MVAPICH2-GDR delivers roughly 10,000 images/sec more than NCCL-based training across distributed frameworks:
• Torch.distributed – 61,794 images/sec with NCCL 2.7 vs. 72,120 with MVAPICH2-GDR
• Horovod – 74,063 images/sec with NCCL 2.7 vs. 84,659 with MVAPICH2-GDR
• DeepSpeed – 80,217 images/sec with NCCL 2.7 vs. 88,873 with MVAPICH2-GDR
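The GPU results above come from MVAPICH2-GDR's CUDA-aware path, which lets applications pass GPU device pointers directly to MPI calls. Below is a minimal sketch of that usage pattern (not code from the slides); MV2_USE_CUDA=1 is the environment switch documented for enabling GPU buffers in MVAPICH2-GDR, and the device-selection logic here is simplified for illustration.

    /* gpu_allreduce.c - CUDA-aware MPI sketch for MVAPICH2-GDR (illustrative).
     * Run (example): MV2_USE_CUDA=1 mpirun_rsh -np 2 node01 node02 ./gpu_allreduce
     * Device pointers are handed straight to MPI; no explicit host staging. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define N (1 << 20)   /* 1M floats per rank */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, ndev = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Simple round-robin GPU selection; real codes map ranks to GPUs more carefully. */
        cudaGetDeviceCount(&ndev);
        if (ndev > 0) cudaSetDevice(rank % ndev);

        float *sendbuf, *recvbuf;
        cudaMalloc((void **)&sendbuf, N * sizeof(float));
        cudaMalloc((void **)&recvbuf, N * sizeof(float));
        /* ... fill sendbuf with a kernel or cudaMemcpy from the host ... */

        /* GPU buffers passed directly to MPI_Allreduce; the CUDA-aware library
           moves the data (e.g., via GPUDirect RDMA or pipelining) under the hood. */
        MPI_Allreduce(sendbuf, recvbuf, N, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("GPU-resident allreduce of %d floats completed\n", N);

        cudaFree(sendbuf);
        cudaFree(recvbuf);
        MPI_Finalize();
        return 0;
    }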
MVAPICH2-X – Advanced Support for HPC Clouds
[Figures: WRF 3.6 execution time on Amazon EFA (36 to 1,152 processes), comparing OpenMPI, Intel MPI, and MVAPICH2-X-AWS; and WRF 3.6 execution time on Microsoft Azure (120 to 960 processes), comparing MVAPICH2 with MVAPICH2-X+XPMEM; labeled improvements range from 27% to 58%.]
• AWS configuration – instance type: c5n.18xlarge; CPU: Intel Xeon Platinum 8124M @ 3.00 GHz; MVAPICH2 version: MVAPICH2-X-AWS v2.3; OpenMPI version: Open MPI v4.0.3 with libfabric 1.9; Intel MPI version: Intel MPI 2019.7.217
• Azure configuration – VM type: HBv2; CPU: AMD EPYC 7V12 @ 2.45 GHz; MVAPICH2 version: MVAPICH2-Azure 2.3.3; MVAPICH2-X version: MVAPICH2-X 2.3rc3

• MVAPICH2-X-AWS 2.3
  – Released on 09/24/2020
  – Major features and enhancements:
    – Based on MVAPICH2-X 2.3
    – Support for the Amazon EFA adapter's Scalable Reliable Datagram (SRD) transport
    – Support for XPMEM-based intra-node communication for point-to-point and collectives
    – Enhanced tuning for point-to-point and collective operations
    – Targeted for AWS instances with the Amazon Linux 2 AMI and EFA support
    – Tested with c5n & p3d instance types and different operating systems
• MVAPICH2-Azure 2.3.3
  – Released on 05/20/2020
  – Integrated into the Azure CentOS HPC images: https://github.com/Azure/azhpc-images/releases/tag/centos-7.6-hpc-20200417
  – Major features and enhancements:
    – Based on MVAPICH2 2.3.3
    – Enhanced tuning for point-to-point and collective operations
    – Targeted for Azure HB & HC virtual machine instances
    – Tested with Azure HB & HC VM instances
  – Detailed user guide: http://mvapich.cse.ohio-state.edu/userguide/mv2-azure/

MVAPICH2 – Future Roadmap and Plans for Exascale
• Update to the MPICH 3.3 CH3 channel (2021)
• Initial support for the CH4 channel (2021/2022)
• Making the CH4 channel the default (2022/2023)
• Performance and memory scalability toward 1M–10M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
• MPI + Task*
• Enhanced optimization for GPUs and FPGAs*
• Taking advantage of advanced features of Mellanox InfiniBand: Tag Matching*, Adapter Memory*
• Enhanced communication schemes for upcoming architectures: GenZ*, CAPI*
• Extended topology-aware collectives
• Extended energy-aware designs and virtualization support
• Extended support for the MPI Tools interface (as in MPI 3.0)
• Extended FT support
• Features marked with * will be available in future MVAPICH2 releases

Thank You!
[email protected]
http://web.cse.ohio-state.edu/~panda
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
Recommended publications
  • Bench - Benchmarking the State-of-the-Art Task Execution Frameworks of Many-Task Computing
    MATRIX: Bench - Benchmarking the state-of-the-art Task Execution Frameworks of Many-Task Computing
    Thomas Dubucq, Tony Forlini, Virgile Landeiro Dos Reis, and Isabelle Santos
    Illinois Institute of Technology, Chicago, IL, USA – {tdubucq, tforlini, vlandeir, isantos1}@hawk.iit.edu

    Abstract: Technology trends indicate that exascale systems will have billion-way parallelism, and each node will have about three orders of magnitude more intra-node parallelism than today's peta-scale systems. The majority of current runtime systems focus a great deal of effort on optimizing the inter-node parallelism by maximizing the bandwidth and minimizing the latency of the use of interconnection networks and storage, but suffer from the lack of scalable solutions to expose the intra-node parallelism. Many-task computing (MTC) is a distributed fine-grained paradigm that aims to address the challenges of managing parallelism and locality of exascale systems. MTC applications are typically structured as directed acyclic graphs of loosely coupled short tasks with explicit input/output data dependencies.

    … Stanford University. Finally, HPX is a general-purpose C++ runtime system for parallel and distributed applications of any scale developed by Louisiana State University, and STAPL is a framework for developing parallel programs from Texas A&M. MATRIX is a many-task computing job scheduling system [3]. There are many resource-managing systems aimed towards data-intensive applications. Furthermore, distributed task scheduling in many-task computing is a problem that has been considered by many research teams. In particular, Charm++ [4], Legion [5], Swift [6], [10], Spark [1][2], HPX [12], STAPL [13] and MATRIX [11] offer solutions to this problem and have
  • Adaptive Data Migration in Load-Imbalanced HPC Applications
    Louisiana State University, LSU Digital Commons, LSU Doctoral Dissertations, Graduate School, 10-16-2020
    Adaptive Data Migration in Load-Imbalanced HPC Applications
    Parsa Amini, Louisiana State University and Agricultural and Mechanical College
    Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_dissertations (Part of the Computer Sciences Commons)
    Recommended Citation: Amini, Parsa, "Adaptive Data Migration in Load-Imbalanced HPC Applications" (2020). LSU Doctoral Dissertations. 5370. https://digitalcommons.lsu.edu/gradschool_dissertations/5370
    This dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in LSU Doctoral Dissertations by an authorized graduate school editor of LSU Digital Commons. For more information, please contact [email protected].

    ADAPTIVE DATA MIGRATION IN LOAD-IMBALANCED HPC APPLICATIONS. A Dissertation Submitted to the Graduate Faculty of the Louisiana State University and Agricultural and Mechanical College in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Department of Computer Science, by Parsa Amini; B.S., Shahed University, 2013; M.S., New Mexico State University, 2015. December 2020.

    Acknowledgments: This effort has been possible thanks to the involvement and assistance of numerous people. First and foremost, I thank my advisor, Dr. Hartmut Kaiser, who made this journey possible with their invaluable support, precise guidance, and generous sharing of expertise. It has been a great privilege and opportunity for me to be your student, a part of the STE||AR group, and the HPX development effort. I would also like to thank my mentor and former advisor at New Mexico State University, Dr.
  • MVAPICH2 2.2 User Guide
    MVAPICH2 2.2 User Guide
    MVAPICH Team, Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University
    http://mvapich.cse.ohio-state.edu
    Copyright (c) 2001-2016 Network-Based Computing Laboratory, headed by Dr. D. K. Panda. All rights reserved.
    Last revised: October 19, 2017

    Contents
    1 Overview of the MVAPICH Project
    2 How to use this User Guide?
    3 MVAPICH2 2.2 Features
    4 Installation Instructions
    4.1 Building from a tarball
    4.2 Obtaining and Building the Source from SVN repository
    4.3 Selecting a Process Manager
    4.3.1 Customizing Commands Used by mpirun_rsh
    4.3.2 Using SLURM
    4.3.3 Using SLURM with support for PMI Extensions
    4.4 Configuring a build for OFA-IB-CH3/OFA-iWARP-CH3/OFA-RoCE-CH3
    4.5 Configuring a build for NVIDIA GPU with OFA-IB-CH3
    4.6 Configuring a build for Shared-Memory-CH3
    4.7 Configuring a build for OFA-IB-Nemesis
    4.8 Configuring a build for Intel TrueScale (PSM-CH3)
    4.9 Configuring a build for Intel Omni-Path (PSM2-CH3)
    4.10 Configuring a build for TCP/IP-Nemesis
    4.11 Configuring a build for TCP/IP-CH3
    4.12 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)
    4.13 Configuring a build for Shared-Memory-Nemesis
    5 Basic Usage Instructions
    5.1 Compile Applications
    5.2 Run Applications
    5.2.1 Run using mpirun_rsh
  • Scalable and High Performance MPI Design for Very Large InfiniBand Clusters
    SCALABLE AND HIGH-PERFORMANCE MPI DESIGN FOR VERY LARGE INFINIBAND CLUSTERS
    Dissertation presented in partial fulfillment of the requirements for the degree Doctor of Philosophy in the Graduate School of The Ohio State University, by Sayantan Sur, B. Tech. The Ohio State University, 2007.
    Dissertation committee: Prof. D. K. Panda (Adviser), Prof. P. Sadayappan, Prof. S. Parthasarathy. Graduate Program in Computer Science and Engineering. Copyright (c) Sayantan Sur, 2007.

    ABSTRACT: In the past decade, rapid advances have taken place in the field of computer and network design, enabling us to connect thousands of computers together to form high-performance clusters. These clusters are used to solve computationally challenging scientific problems. The Message Passing Interface (MPI) is a popular model to write applications for these clusters. There is a vast array of scientific applications which use MPI on clusters. As the applications operate on larger and more complex data, the size of the compute clusters is scaling higher and higher. Thus, in order to enable the best performance for these scientific applications, it is very critical that the design of the MPI libraries be extremely scalable and high-performance. InfiniBand is a cluster interconnect which is based on open standards and is gaining rapid acceptance. This dissertation presents novel designs based on the new features offered by InfiniBand, in order to design scalable and high-performance MPI libraries for large-scale clusters with tens-of-thousands of nodes. Methods developed in this dissertation have been applied towards reduction in overall resource consumption, increased overlap of computation and communication, improved performance of collective operations, and finally designing application-level benchmarks to make efficient use of modern networking technology.
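One recurring theme in the abstract above is increasing the overlap of computation and communication. As a generic illustration (not code from the dissertation; buffer sizes and the compute kernel are made up), the standard non-blocking MPI pattern below shows the shape of that overlap: communication is initiated, independent work proceeds, and completion is enforced before the received data is used.

    /* overlap.c - generic non-blocking MPI pattern illustrating
     * computation/communication overlap (illustrative sketch only). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    static void do_independent_compute(double *x, int n) {
        for (int i = 0; i < n; i++)            /* work that does not touch the */
            x[i] = x[i] * 1.000001 + 0.5;      /* buffers being communicated   */
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *sendbuf = malloc(N * sizeof(double));
        double *recvbuf = malloc(N * sizeof(double));
        double *work    = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) { sendbuf[i] = rank; work[i] = i; }

        int right = (rank + 1) % size, left = (rank - 1 + size) % size;
        MPI_Request reqs[2];

        /* 1. Start the communication (ring exchange). */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

        /* 2. Overlap: compute on data not involved in the transfer. */
        do_independent_compute(work, N);

        /* 3. Complete the communication before using recvbuf. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        if (rank == 0) printf("ring exchange of %d doubles overlapped with compute\n", N);

        free(sendbuf); free(recvbuf); free(work);
        MPI_Finalize();
        return 0;
    }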
  • Beowulf Clusters — an Overview
    WinterSchool 2001, Å. Ødegård
    Beowulf clusters - an overview
    Åsmund Ødegård, April 4, 2001

    Contents: Introduction; What is a Beowulf; The history of Beowulf; Who can build a Beowulf; How to design a Beowulf; Beowulfs in more detail; Rules of thumb; What Beowulfs are Good For; Experiments; 3D nonlinear acoustic fields; Incompressible Navier–Stokes; 3D nonlinear water wave

    Introduction: why clusters?
    • "Work harder" – more CPU power, more memory, more everything
    • "Work smarter" – better algorithms
    • "Get help" – let more boxes work together to solve the problem (parallel processing)
    (classification by Greg Pfister)
    [Diagram: Beowulfs in the parallel computing picture, relating Parallel Computing, MetaComputing, Clusters, Tightly Coupled and Vector systems, WS farms, Piles of PCs, NOW, NT/Win2k clusters, Beowulf, and CC-NUMA.]

    What is a Beowulf
    • Mass-market commodity off the shelf (COTS) hardware
    • Low-cost local area network (LAN)
    • Open-source UNIX-like operating system (OS)
    • Executes parallel applications programmed with a message-passing model (MPI)
    • Anything from small systems to large, fast systems; the fastest ranks as no. 84 on today's Top500
    • The best price/performance system available for many applications
    • Philosophy: the cheapest system available which solves your problem in reasonable time

    The history of Beowulf
    • 1993: perfect conditions for the first Beowulf
      – Major CPU performance advance: 80286 to 80386
      – DRAM of reasonable cost and density (8 MB)
      – Disk drives of several 100 MBs available for PCs
      – Ethernet (10 Mbps) controllers and hubs cheap enough
      – Linux improved rapidly, and was in a usable state
      – PVM widely accepted as a cross-platform message-passing model
    • Clustering was done with commercial UNIX, but the cost was high.
  • MVAPICH2 2.3 User Guide
    MVAPICH2 2.3 User Guide
    MVAPICH Team, Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University
    http://mvapich.cse.ohio-state.edu
    Copyright (c) 2001-2018 Network-Based Computing Laboratory, headed by Dr. D. K. Panda. All rights reserved.
    Last revised: February 19, 2018

    Contents
    1 Overview of the MVAPICH Project
    2 How to use this User Guide?
    3 MVAPICH2 2.3 Features
    4 Installation Instructions
    4.1 Building from a tarball
    4.2 Obtaining and Building the Source from SVN repository
    4.3 Selecting a Process Manager
    4.3.1 Customizing Commands Used by mpirun_rsh
    4.3.2 Using SLURM
    4.3.3 Using SLURM with support for PMI Extensions
    4.4 Configuring a build for OFA-IB-CH3/OFA-iWARP-CH3/OFA-RoCE-CH3
    4.5 Configuring a build for NVIDIA GPU with OFA-IB-CH3
    4.6 Configuring a build to support running jobs across multiple InfiniBand subnets
    4.7 Configuring a build for Shared-Memory-CH3
    4.8 Configuring a build for OFA-IB-Nemesis
    4.9 Configuring a build for Intel TrueScale (PSM-CH3)
    4.10 Configuring a build for Intel Omni-Path (PSM2-CH3)
    4.11 Configuring a build for TCP/IP-Nemesis
    4.12 Configuring a build for TCP/IP-CH3
    4.13 Configuring a build for OFA-IB-Nemesis and TCP/IP Nemesis (unified binary)
    4.14 Configuring a build for Shared-Memory-Nemesis
    4.15 Configuration and Installation with Singularity
  • The Component Architecture of Open MPI: Enabling Third-Party Collective Algorithms∗
    THE COMPONENT ARCHITECTURE OF OPEN MPI: ENABLING THIRD-PARTY COLLECTIVE ALGORITHMS∗
    Jeffrey M. Squyres and Andrew Lumsdaine, Open Systems Laboratory, Indiana University, Bloomington, Indiana, USA
    [email protected] [email protected]

    Abstract: As large-scale clusters become more distributed and heterogeneous, significant research interest has emerged in optimizing MPI collective operations because of the performance gains that can be realized. However, researchers wishing to develop new algorithms for MPI collective operations are typically faced with significant design, implementation, and logistical challenges. To address a number of needs in the MPI research community, Open MPI has been developed, a new MPI-2 implementation centered around a lightweight component architecture that provides a set of component frameworks for realizing collective algorithms, point-to-point communication, and other aspects of MPI implementations. In this paper, we focus on the collective algorithm component framework. The "coll" framework provides tools for researchers to easily design, implement, and experiment with new collective algorithms in the context of a production-quality MPI. Performance results with basic collective operations demonstrate that the component architecture of Open MPI does not introduce any performance penalty.
    Keywords: MPI implementation, Parallel computing, Component architecture, Collective algorithms, High performance

    1. Introduction
    Although the performance of the MPI collective operations [6, 17] can be a large factor in the overall run-time of a parallel application, their optimization has not necessarily been a focus in some MPI implementations until recently.
    ∗This work was supported by a grant from the Lilly Endowment and by National Science Foundation grant 0116050.
  • Improving MPI Threading Support for Current Hardware Architectures
    University of Tennessee, Knoxville, TRACE: Tennessee Research and Creative Exchange, Doctoral Dissertations, Graduate School, 12-2019
    Improving MPI Threading Support for Current Hardware Architectures
    Thananon Patinyasakdikul, University of Tennessee, [email protected]
    Follow this and additional works at: https://trace.tennessee.edu/utk_graddiss
    Recommended Citation: Patinyasakdikul, Thananon, "Improving MPI Threading Support for Current Hardware Architectures." PhD diss., University of Tennessee, 2019. https://trace.tennessee.edu/utk_graddiss/5631
    This dissertation is brought to you for free and open access by the Graduate School at TRACE: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of TRACE: Tennessee Research and Creative Exchange. For more information, please contact [email protected].
    To the Graduate Council: I am submitting herewith a dissertation written by Thananon Patinyasakdikul entitled "Improving MPI Threading Support for Current Hardware Architectures." I have examined the final electronic copy of this dissertation for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, with a major in Computer Science. Jack Dongarra, Major Professor. We have read this dissertation and recommend its acceptance: Michael Berry, Michela Taufer, Yingkui Li. Accepted for the Council: Dixie L. Thompson, Vice Provost and Dean of the Graduate School. (Original signatures are on file with official student records.)
    Improving MPI Threading Support for Current Hardware Architectures. A Dissertation Presented for the Doctor of Philosophy Degree, The University of Tennessee, Knoxville. Thananon Patinyasakdikul, December 2019. Copyright (c) by Thananon Patinyasakdikul, 2019. All Rights Reserved.
    To my parents Thanawij and Issaree Patinyasakdikul, and my little brother Thanarat Patinyasakdikul, for their love, trust and support.
  • Automatic Handling of Global Variables for Multi-Threaded MPI Programs
    Automatic Handling of Global Variables for Multi-threaded MPI Programs
    Gengbin Zheng, Stas Negara, Celso L. Mendes, Laxmikant V. Kalé – Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA ({gzheng,snegara2,cmendes,[email protected])
    Eduardo R. Rodrigues – Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil ([email protected])

    Abstract: Hybrid programming models, such as MPI combined with threads, are one of the most efficient ways to write parallel applications for current machines comprising multi-socket/multi-core nodes and an interconnection network. Global variables in legacy MPI applications, however, present a challenge because they may be accessed by multiple MPI threads simultaneously. Thus, transforming legacy MPI applications to be thread-safe in order to exploit multi-core architectures requires proper handling of global variables. In this paper, we present three approaches to eliminate these global variables to ensure thread-safety for an MPI program. These approaches include: (a) a compiler-based refactoring technique, using a Photran-based tool as an example, which automates the source-to-source transformation for programs

    … necessary to privatize global and static variables to ensure the thread-safety of an application. In high-performance computing, one way to exploit multi-core platforms is to adopt hybrid programming models that combine MPI with threads, where different programming models are used to address issues such as multiple threads per core, decreasing amount of memory per core, load imbalance, etc. When porting legacy MPI applications to hybrid models that involve multiple threads, thread-safety of the applications needs to be addressed again. Global variables cause no problem with traditional MPI implementations, since each process image contains a separate instance of the variable.
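As a hedged illustration of the underlying idea (this is not the paper's Photran-based source-to-source transformation, and the variable and function names are made up), the simplest manual form of privatization in C marks a formerly shared global as thread-local, so that each MPI rank running as a thread inside one process gets its own copy instead of racing on a single shared instance.

    /* privatize.c - illustrative sketch of privatizing a global variable with
     * thread-local storage; OpenMP threads stand in for thread-based MPI ranks.
     * Build (example): mpicc -fopenmp privatize.c -o privatize */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    /* Legacy code would have:   int iteration_count = 0;   shared by all threads.
     * Privatized version: one copy per thread (C11 thread-local storage). */
    static _Thread_local int iteration_count = 0;

    int main(int argc, char **argv) {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        #pragma omp parallel
        {
            /* Each thread updates its own private copy; no data race. */
            for (int i = 0; i < 1000; i++)
                iteration_count++;
            printf("thread %d: iteration_count = %d\n",
                   omp_get_thread_num(), iteration_count);
        }

        MPI_Finalize();
        return 0;
    }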
  • Exascale Computing Project -- Software
    Exascale Computing Project -- Software
    Paul Messina, ECP Director; Stephen Lee, ECP Deputy Director
    ASCAC Meeting, Arlington, VA (Crystal City Marriott), April 19, 2017
    www.ExascaleProject.org

    ECP scope and goals
    • Develop applications to tackle a broad spectrum of mission-critical problems of unprecedented complexity
    • Partner with vendors to develop computer architectures that support exascale applications
    • Support national security
    • Develop a software stack that is both exascale-capable and usable on industrial- and academic-scale systems, in collaboration with vendors
    • Train a next-generation workforce of computational scientists, engineers, and computer scientists
    • Contribute to the economic competitiveness of the nation

    ECP has formulated a holistic approach that uses co-design and integration to achieve capable exascale, spanning Application Development (science and mission applications), Software Technology (scalable and productive software stack), Hardware Technology (hardware technology elements), and Exascale Systems (integrated exascale supercomputers).
    [Diagram: the ECP software stack – correctness, visualization, and data analysis; applications and co-design; programming models, development environment, and runtimes; math libraries and frameworks; tools; system software (resource management, threading, scheduling, monitoring, and control); workflows; resilience; data and memory management; I/O, file system, and burst buffer; node OS and runtimes; hardware interface.]
    ECP's work encompasses applications, system software, hardware technologies and architectures, and workforce
  • MVAPICH USER GROUP MEETING August 26, 2020
    APPLICATION AND MICROBENCHMARK STUDY USING MVAPICH2 AND MVAPICH2-GDR ON SDSC COMET AND EXPANSE
    MVAPICH User Group Meeting, August 26, 2020
    Mahidhar Tatineni, SDSC (San Diego Supercomputer Center), NSF Award 1928224

    Overview
    • History of InfiniBand-based clusters at SDSC
    • Hardware summaries for Comet and Expanse
    • Application and software library stack on SDSC systems
    • Containerized approach: Singularity on SDSC systems
    • Benchmark results on Comet and the Expanse development system are presented: OSU Benchmarks, NEURON, RAxML, TensorFlow, QE, and BEAST.

    InfiniBand and MVAPICH2 on SDSC Systems
    • Trestles (NSF), 2011-2014: 324 nodes, 10,368 cores; 4-socket AMD Magny-Cours; QDR InfiniBand; fat-tree topology; MVAPICH2
    • Gordon (NSF), 2012-2017, continued as GordonS (Simons Foundation), 2017-2020: 1,024 nodes, 16,384 cores; 2-socket Intel Sandy Bridge; dual-rail QDR InfiniBand; 3-D torus topology; 300 TB of SSD storage via iSCSI over RDMA (iSER); MVAPICH2 (1.9, 2.1) with 3-D torus support
    • Comet (NSF), 2015-current: 1,944 compute, 72 GPU, and 4 large-memory nodes; 2-socket Intel Haswell; FDR InfiniBand; fat-tree topology; MVAPICH2, MVAPICH2-X, MVAPICH2-GDR; leverages SR-IOV for virtual clusters
    • Expanse (NSF), production Fall 2020: 728 compute, 52 GPU, and 4 large-memory nodes; 2-socket AMD EPYC 7742 with HDR100 InfiniBand; GPU nodes with 4 V100 GPUs + NVLink; HDR200 switches, fat-tree topology with 3:1 oversubscription; MVAPICH2, MVAPICH2-X, MVAPICH2-GDR

    Comet: System Characteristics
    • Total peak flops ~2.76 PF
    • Hybrid fat-tree topology
    • Dell primary
  • InfiniBand and 10-Gigabit Ethernet for Dummies
    The MVAPICH2 Project: Latest Developments and Plans Towards Exascale Computing
    Presentation at the Mellanox Theatre (SC '18) by Dhabaleswar K. (DK) Panda, The Ohio State University
    E-mail: [email protected]
    http://www.cse.ohio-state.edu/~panda

    Drivers of Modern HPC Cluster Architectures
    [Slide graphic: high-performance interconnects such as InfiniBand (<1 usec latency, 100 Gbps bandwidth), multi-core processors, accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip), and SSD, NVMe-SSD, NVRAM; example systems pictured: Summit, Sunway TaihuLight, Sierra, K Computer.]
    • Multi-core/many-core technologies
    • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
    • Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
    • Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
    • Available on HPC clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.

    Parallel Programming Models Overview
    [Diagram: three models across processes P1-P3 – the Shared Memory model (SHMEM, DSM), the Distributed Memory model (MPI, Message Passing Interface), and the Partitioned Global Address Space (PGAS) model with a logical shared memory (Global Arrays, UPC, Chapel, X10, CAF, …).]
    • Programming models provide abstract machine models
    • Models can be mapped on different types of systems, e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
    • PGAS models and hybrid MPI+PGAS models are gradually receiving importance

    Designing Communication Libraries for Multi-Petaflop and Exaflop Systems: Challenges
    [Diagram: the software stack from application kernels/applications down through middleware and programming models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.), with co-design opportunities and challenges across the various layers.]