InfiniBand + GPU Systems (Past)


MVAPICH2 and GPUDirect RDMA
Presentation at the HPC Advisory Council, June 2013
by Dhabaleswar K. (DK) Panda, Hari Subramoni and Sreeram Potluri, The Ohio State University
E-mail: {panda, subramon, potluri}@cse.ohio-state.edu
https://mvapich.cse.ohio-state.edu/

Trends for Commodity Computing Clusters in the Top500 List (http://www.top500.org)
[Figure: number of clusters and percentage of clusters in the Top500 list over time.]

Large-scale InfiniBand Installations
• 224 IB clusters (44.8%) in the November 2012 Top500 list (http://www.top500.org)
• Installations in the Top 40 (16 systems):
  – 147,456 cores (SuperMUC) in Germany (6th)
  – 204,900 cores (Stampede) at TACC (7th)
  – 77,184 cores (Curie thin nodes) at France/CEA (11th)
  – 120,640 cores (Nebulae) at China/NSCS (12th)
  – 72,288 cores (Yellowstone) at NCAR (13th)
  – 125,980 cores (Pleiades) at NASA/Ames (14th)
  – 70,560 cores (Helios) at Japan/IFERC (15th)
  – 73,278 cores (Tsubame 2.0) at Japan/GSIC (17th)
  – 138,368 cores (Tera-100) at France/CEA (20th)
  – 122,400 cores (Roadrunner) at LANL (22nd)
  – 53,504 cores (PRIMERGY) at Australia/NCI (24th)
  – 78,660 cores (Lomonosov) in Russia (26th)
  – 137,200 cores (Sunway Blue Light) in China (28th)
  – 46,208 cores (Zin) at LLNL (29th)
  – 33,664 cores (MareNostrum) at Spain/BSC (36th)
  – 32,256 cores (SGI Altix X) at Japan/CRIEPI (39th)
• More are getting installed!

MVAPICH2/MVAPICH2-X Software
• High-performance, open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Used by more than 2,000 organizations (HPC centers, industry and universities) in 70 countries
  – More than 173,000 downloads directly from the OSU site
  – Empowering many Top500 clusters:
    • 7th-ranked 204,900-core cluster (Stampede) at TACC
    • 14th-ranked 125,980-core cluster (Pleiades) at NASA
    • 17th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • and many others
  – Available with the software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede system

MVAPICH2 1.9 and MVAPICH2-X 1.9
• Released on 05/06/13
• Major features and enhancements:
  – Based on MPICH-3.0.3
    • Support for all MPI-3 features
  – Support for single-copy intra-node communication using Linux-supported CMA (Cross Memory Attach)
    • Provides flexibility for intra-node communication: shared memory, LiMIC2, and CMA
  – Checkpoint/Restart using LLNL's Scalable Checkpoint/Restart library (SCR)
    • Support for application-level checkpointing
    • Support for hierarchical system-level checkpointing
  – Scalable UD-multicast-based designs and tuned algorithm selection for collectives
  – Improved and tuned MPI communication from GPU device memory (see the sketch after this list)
  – Improved job startup time
    • New runtime variable MV2_HOMOGENEOUS_CLUSTER for optimized startup on homogeneous clusters
  – Revamped build system with support for parallel builds
• MVAPICH2-X 1.9 supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models
  – Based on MVAPICH2 1.9 including MPI-3 features; compliant with UPC 2.16.2 and OpenSHMEM v1.0d
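The improved MPI communication from GPU device memory noted above means a CUDA device pointer can be handed directly to MPI calls. Below is a minimal sketch of that usage, assuming an MVAPICH2 build with CUDA support enabled; the buffer size and rank layout are illustrative and not taken from the slides.

```c
/* Minimal sketch of CUDA-aware point-to-point messaging (assumes an
 * MVAPICH2 build with CUDA support, so device pointers can be passed
 * directly to MPI; otherwise buffers must be staged through the host). */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int n = 1 << 20;          /* 1M floats, illustrative size */
    float *d_buf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    cudaMalloc((void **)&d_buf, (size_t)n * sizeof(float));

    if (rank == 0 && size > 1) {
        /* Send straight from GPU memory; the MPI library handles the
         * staging (or GPUDirect RDMA, where available). */
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d floats into device memory\n", n);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

In MVAPICH2 releases of this era the GPU path is typically enabled at run time through a parameter such as MV2_USE_CUDA=1, and the code is compiled with the MPI compiler wrapper together with the CUDA toolkit.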
Outline
• MVAPICH2/MVAPICH2-X Overview
  – Efficient Intra-node and Inter-node Communication and Scalable Protocols
  – Scalable and Non-blocking Collective Communication
  – High Performance Fault Tolerance Mechanisms
  – Support for Hybrid MPI+PGAS Programming Models
• MVAPICH2 for GPU Clusters
  – Point-to-point Communication
  – Collective Communication
  – MPI Datatype Processing
  – Multi-GPU Configurations
  – MPI + OpenACC
• MVAPICH2 with GPUDirect RDMA
• Conclusion
(This outline slide recurs later in the deck as a section divider.)

One-way Latency: MPI over IB
[Figure: small- and large-message one-way latency for MVAPICH/MVAPICH2 over Qlogic DDR, Qlogic QDR, ConnectX DDR, ConnectX2 PCIe2 QDR, ConnectX3 PCIe3 FDR, and Mellanox ConnectIB dual FDR; small-message latencies range from about 0.99 us to 1.82 us across these configurations.]
Platforms: DDR and QDR on 2.4 GHz quad-core (Westmere) Intel with PCIe Gen2 and an IB switch; FDR and ConnectIB dual FDR on 2.6 GHz octa-core (SandyBridge) Intel with PCIe Gen3 and an IB switch.

Bandwidth: MPI over IB
[Figure: unidirectional and bidirectional bandwidth for the same set of adapters; bandwidth peaks at about 12,485 MB/s unidirectional and 21,025 MB/s bidirectional.]
Platforms: same as on the latency slide.

eXtended Reliable Connection (XRC) and Hybrid Mode
[Figure: memory usage per process vs. number of connections for MVAPICH2-RC and MVAPICH2-XRC, and normalized NAMD time on 1,024 cores for the apoa1, er-gre, f1atpase and jac datasets.]
• Memory usage for 32K processes with 8 cores per node can be 54 MB/process (for connections)
• NAMD performance improves when there is frequent communication to many peers
[Figure: time with UD, hybrid and RC transports at 128–1,024 processes, with annotated gains of 26%, 40%, 38% and 30%.]
• Both UD and RC/XRC have benefits; the hybrid mode gives the best of both
• Available since MVAPICH2 1.7 as an integrated interface
• Runtime parameters: RC is the default; UD only – MV2_USE_ONLY_UD=1; hybrid – MV2_HYBRID_ENABLE_THRESHOLD=1
M. Koop, J. Sridhar and D. K. Panda, "Scalable MPI Design over InfiniBand using eXtended Reliable Connection," Cluster '08
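The one-way latency results above come from ping-pong style microbenchmarks (the OSU benchmark suite). A minimal sketch of the pattern follows, assuming two ranks; the message size and iteration count are illustrative. The same binary can be launched under the RC, UD-only or hybrid transports by setting the MV2_* runtime variables listed on the XRC/hybrid slide.

```c
/* Ping-pong sketch in the spirit of the osu_latency benchmark.
 * One-way latency is estimated as half the round-trip time.
 * Requires at least two ranks. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MSG_SIZE 8        /* small message, illustrative */
#define ITERS    10000

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_SIZE];
    memset(buf, 0, sizeof(buf));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```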
MVAPICH2 Two-Sided Intra-Node Performance
(Shared memory and kernel-based zero-copy support: LiMIC and CMA)
[Figure: intra-node latency and bandwidth with the latest MVAPICH2 1.9 on Intel SandyBridge; small-message latency is about 0.19 us intra-socket and 0.45 us inter-socket, and bandwidth reaches about 12,000 MB/s both intra-socket and inter-socket (curves shown for CMA, shared memory and LiMIC).]

(Outline slide repeated as a section divider.)

Hardware Multicast-aware MPI_Bcast on Stampede
[Figure: MPI_Bcast latency with the default design and the hardware-multicast-based design on 102,400 cores, for small (2 B–512 B) and large (2 KB–128 KB) messages, and scaling of 16-byte and 32-KByte broadcasts with the number of nodes.]
Platform: ConnectX-3 FDR (54 Gbps), 2.7 GHz dual octa-core (SandyBridge) Intel with PCIe Gen3 and a Mellanox IB FDR switch.

Application Benefits with Non-Blocking Collectives Based on CX-2 Collective Offload
[Figure: P3DFFT application run time with Offload-Alltoall vs. the default version for data sizes 512–800; normalized HPL performance for HPL-Offload, HPL-1ring and HPL-Host at problem sizes of 10–70% of total memory; and PCG run time (PCG-Default vs. Modified-PCG-Offload) at 64–512 processes.]
• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
• Modified preconditioned conjugate gradient (PCG) solver with Offload-Allreduce does up to 21.8% better than the default version
References:
• K. Kandalla et al., "High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT," ISC 2011
• K. Kandalla et al., "Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL," HotI 2011
• K. Kandalla et al., "Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers," IPDPS '12
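The offload-based designs above target the standard MPI-3 non-blocking collectives, whose benefit comes from overlapping the collective with independent computation. Below is a minimal sketch of that overlap pattern using MPI_Iallreduce; the array sizes and the "independent work" loop are illustrative.

```c
/* Sketch of overlapping computation with a non-blocking all-reduce,
 * the usage pattern that collective-offload designs aim to accelerate. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int n = 1024;
    double local[1024], global[1024], other[1024];
    MPI_Request req;

    MPI_Init(&argc, &argv);
    for (int i = 0; i < n; i++) { local[i] = 1.0; other[i] = 0.0; }

    /* Start the reduction, then do work that does not depend on it. */
    MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    for (int i = 0; i < n; i++)
        other[i] += 2.0 * i;            /* independent computation */

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* reduction result now usable */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("global[0] = %f\n", global[0]);

    MPI_Finalize();
    return 0;
}
```

With the CX-2 collective-offload designs described above, the intent is that the HCA progresses the collective while the host executes the intervening loop, so the final MPI_Wait adds little host overhead.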
(Outline slide repeated as a section divider.)

Multi-Level Checkpointing with ScalableCR (SCR)
• LLNL's Scalable Checkpoint/Restart library
• Can be used for application-guided and application-transparent checkpointing
• Effective utilization of the storage hierarchy
  – Local: store checkpoint data on the node's local storage, e.g.
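For the application-guided mode mentioned above, SCR wraps the application's own checkpoint writes so that the library can place them in the storage hierarchy and protect them as configured. Below is a minimal sketch, assuming SCR's classic checkpoint API (SCR_Need_checkpoint, SCR_Start_checkpoint, SCR_Route_file, SCR_Complete_checkpoint); the filename scheme and payload are illustrative.

```c
/* Sketch of application-guided checkpointing with SCR, layered on MPI.
 * Assumes the classic SCR checkpoint API; filename and payload are
 * illustrative placeholders. */
#include <mpi.h>
#include <scr.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    SCR_Init();                           /* after MPI_Init */

    for (int step = 0; step < 100; step++) {
        /* ... computation for this timestep ... */

        int need = 0;
        SCR_Need_checkpoint(&need);       /* let SCR decide the frequency */
        if (need) {
            SCR_Start_checkpoint();

            char name[256], path[SCR_MAX_FILENAME];
            snprintf(name, sizeof(name), "ckpt.%d.step%d", rank, step);
            SCR_Route_file(name, path);   /* SCR picks the storage level */

            FILE *f = fopen(path, "w");
            int valid = (f != NULL);
            if (f) {
                fwrite(&step, sizeof(step), 1, f);  /* illustrative payload */
                fclose(f);
            }
            SCR_Complete_checkpoint(valid);
        }
    }

    SCR_Finalize();
    MPI_Finalize();
    return 0;
}
```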