MPI Usage at NERSC: Present and Future


Alice Koniges, Brandon Cook, Jack Deslippe, Thorsten Kurth, Hongzhang Shan
Lawrence Berkeley National Laboratory, Berkeley, CA USA

Abstract: We describe how MPI is used at the National Energy Research Scientific Computing Center (NERSC). NERSC is the production high-performance computing center for the US Department of Energy, with more than 5,000 users and 800 distinct projects. We compare the usage of MPI to exascale developmental programming models such as UPC++ and HPX, with an eye on what features and extensions to MPI are plausible and useful for NERSC users. We also discuss perceived shortcomings of MPI, and why certain groups use other parallel programming models. We believe our results tend to represent the advanced MPI NERSC user base, rather than an extensive survey of the entire NERSC code group population.

Special thanks to Dr. Rolf Rabenseifner and colleagues at the Stuttgart High Performance Computing Center for input in making the NERSC MPI User Survey.

NERSC Hardware

In 2015, the NERSC Program supported 6,000 active users from universities, national laboratories, and industry who used approximately 3 billion supercomputing hours. Our users came from across the United States, with 48 states represented, as well as from around the globe, with 47 countries represented.

Cori is NERSC's newest supercomputer system (NERSC-8). It is named after American biochemist Gerty Cori, the first woman to win a Nobel Prize in Physiology or Medicine. The Cori Phase 1 system is a Cray XC based on the Intel Haswell multi-core processor. Cori Phase 2 is based on the second generation of the Intel Xeon Phi family of products, called the Knights Landing (KNL) Many Integrated Core (MIC) architecture. Cori Phase I and Phase II integration begins ~September 19, 2016 (for ~6 weeks).
• Cori is a Cray XC40 supercomputer.
• Theoretical peak performance: Phase I Haswell: 1.92 PFlops/sec; Phase II KNL: 27.9 PFlops/sec.
• Total compute nodes: Phase I Haswell: 1,630 compute nodes, 52,160 cores in total (32 cores per node); Phase II KNL: 9,304 compute nodes, 632,672 cores in total (68 cores per node).
• Cray Aries high-speed interconnect with Dragonfly topology (0.25 µs to 3.7 µs MPI latency, ~8 GB/sec MPI bandwidth).
• Aggregate memory: Phase I Haswell partition: 203 TB; Phase II KNL partition: 1 PB.
• Scratch storage capacity: 30 PB.

Edison has been NERSC's primary supercomputer system for a number of years. It is named after U.S. inventor and businessman Thomas Alva Edison. Edison, a Cray XC30, has a peak performance of 2.57 petaflops/sec, 133,824 compute cores for running scientific applications, 357 terabytes of memory, and 7.56 petabytes of online disk storage with a peak I/O bandwidth of 168 gigabytes (GB) per second.

How MPI is used at NERSC

Which types of decomposition and communication does your application use? (check all that apply; 32 responses)
• Nearest-neighbor communication: 12 (37.5%)
• Row/column communication: 8 (25%)
• Communication with arbitrary other processes: 13 (40.6%)
• Cartesian grid: 11 (34.4%)
• Non-Cartesian but regular grid: 11 (34.4%)
• Cyclic boundary conditions/communication: 13 (40.6%)
• Non-cyclic boundary conditions: 7 (21.9%)
• Scatter/gather: 16 (50%)
• AllToAll: 13 (40.6%)
• Broadcast/reduce: 27 (84.4%)
• No communication (my code is embarrassingly parallel): 3 (9.4%)
• Other: 3 (9.4%)

Other responses included: primarily halo communication for integration and all-to-one for writing output, although many options exist for output depending on domain size; master/slave; one-sided; MPI structure communications and threaded communications; one-sided and point-to-point between rank 0 and other ranks; graph traversal / random access; and one master PE for I/O, generally a minor amount of time. Free-text notes on typical decomposition sizes included ~10,000, typically 16^2 x 45 (horizontal footprint x column depth), and typically 256-1024.

How do you implement point-to-point communication? (check any that apply; 29 responses)
• Blocking MPI_Send / MPI_Recv: 18 (62.1%)
• Blocking MPI_Bsend / MPI_Recv: 2 (6.9%)
• Blocking MPI_Ssend / MPI_Recv: 1 (3.4%)
• Blocking MPI_Sendrecv: 5 (17.2%)
• Nonblocking MPI_Isend and MPI_Recv: 14 (48.3%)
• Nonblocking MPI_Send and MPI_Irecv: 13 (44.8%)
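Blocking MPI_Send/MPI_Recv remains the most common point-to-point choice among responders, even though nonblocking calls make it easier to avoid unintended serialization and to overlap communication with computation, both of which come up in later survey questions. The following is a minimal, hedged sketch of a nearest-neighbor exchange written with nonblocking calls; it is illustrative only and not taken from any surveyed code.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal 1-D halo exchange sketch: each rank swaps one boundary value
 * with its left and right neighbors using nonblocking point-to-point
 * calls, so neither side blocks waiting for the other to post a send. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;   /* cyclic boundary conditions */
    int right = (rank + 1) % size;

    double send_left = rank, send_right = rank;
    double recv_left = -1.0, recv_right = -1.0;

    MPI_Request reqs[4];
    MPI_Irecv(&recv_left,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recv_right, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&send_right, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(&send_left,  1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[3]);

    /* Interior work that does not depend on the halo could be done here. */

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d received %g from left, %g from right\n",
           rank, recv_left, recv_right);

    MPI_Finalize();
    return 0;
}
```

Posting both receives before the sends means no rank waits on a matching call from its neighbor, which is one simple way to keep this pattern from serializing.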
Have you ever checked if your communication is serialized? (31 responses)
• No: 61.3%
• Yes: 38.7%

Do you overlap communication and computation? (32 responses)
• Yes: 62.5%
• No: 37.5%

Do you use any of these features? (31 responses)
• MPI_Put / MPI_Get: 11 (35.5%)
• MPI_Neighbor_alltoall: 2 (6.5%)
• MPI_Alltoall: 11 (35.5%)
• MPI_Allgather: 12 (38.7%)

Five code groups gave additional information on how these features are used, including:
• We often use alltoallv in fact, as we have variable data lengths to send to arbitrary processes. Probably could optimise by knowing neighbors.
• Using RMA for irregular communication; using derived datatypes to abstract strided data in NWChem; sub-communicators are needed to separate computing groups and special groups for asynchronous progress (using NWChem with ARMCI/MPI and Casper); blocking collectives are used for global synchronization in NWChem; non-blocking collectives are used in the asynchronous progress engine (i.e., Casper) to handle special background synchronization that can be overlapped with application execution.
• MPI_Barrier for synchronization within the shared memory domain (with MPI-3 shared memory windows); sub-communicators for row/column communications in dense matrix applications.
• I need to send non-contiguous parts of an array.

Portability issues with respect to different MPI implementations

Of the 32 responding code groups, 10 gave the following comments on portability:
• Have noticed that MPI implementations seem to differ in how they implement read/write_at_all() and read/write_at(). On NERSC's machines, I need to use read/write_at_all(), while on NCAR's machine, the same code prefers read/write_at().
• Open MPI: thread multiple and RMA usually broken; Blue Gene has no MPI-3.
• Compilation problem with the Cray compiler.
• Cray MPI sometimes crashes with alltoallv at large process counts.
• We write MPI code following the MPI standard, so generally no portability issues exist. But sometimes we get unexpected RMA bugs in Cray MPI releases (in both regular mode and DMAPP mode).
• Not directly. Co-workers experienced problems with one-sided MPI (Intel MPI/MPICH).
• Old versions of MPI did not have the MPI_INTEGER8 data type. Newer versions do.
• MPI_Comm_split deadlocking, MPI_Bcast deadlocking, MPI_Comm_dup using linear memory on each call, incorrect attribute caching destructor semantics.
• MPI_IO external32 vs. native encodings with OpenMPI & MPICH.
• Too many issues over 15+ years to try to list them here.
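Many of the feature-usage and portability comments above concern MPI one-sided (RMA) operations: MPI_Put/MPI_Get usage, RMA bugs in specific implementations, and the Casper progress engine. As a point of reference, the sketch below shows the simplest RMA pattern, a fence-synchronized MPI_Put into a neighbor's window; it is a generic illustration of the MPI-3 interface, not code from any of the surveyed applications.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI-3 RMA sketch: each rank exposes one double in a window
 * and writes its rank number into the window of the next rank with
 * MPI_Put, using MPI_Win_fence as the (simplest) synchronization mode. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = -1.0;                       /* memory exposed in the window */
    MPI_Win win;
    MPI_Win_create(&local, (MPI_Aint)sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    double value  = (double)rank;
    int    target = (rank + 1) % size;

    MPI_Win_fence(0, win);                     /* open the access epoch */
    MPI_Put(&value, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                     /* close epoch; data now visible */

    printf("rank %d: window now holds %g\n", rank, local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```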
Application codes used by MPI survey responders

• NIMROD
• NWChem
• NWChem, CTF
• MILC QCD simulation codes
• Lattice QCD: RHMC, Complex Langevin, Conjugate Gradient
• NCAR LES
• Parallel Research Kernels
• epiSNP
• PSI-Tet (finite element based MHD code)
• GINGER FEL code
• WARP/PICSAR + own applications
• ALPSCore
• Isam
• WRF
• WRF, UMWM, HYCOM, ESMF
• own code
• Our own code, adapted for MPI
• Ray, HipMer
• HipMER, Combinatorial BLAS, SuperLU
• im3shape, cosmosis (my own codes)
• Applications that use PETSc
• MUMPS, SuperLU_DIST, STRUMPACK, PETSc
• GTFock, Elemental
• ISx
• MPI-3.1 benchmarks
• idealized climate simulations using code named EULAG
• desispec and related DESI-specific custom code

Comments, concerns, suggestions, and requests regarding MPI

• Don't break things!!!!
• Ideally all meaningful patterns are benchmarked.
• Lack of good open source MPI for Cray makes lots of things hard. So many problems would have been solved much quicker with source.
• It would be good to document automatic potential protocol switches in specific MPI implementations (e.g. Cray MPI) for small messages. That could explain some observed communication anomalies.
• Neighborhood collectives and nonblocking collectives are the most relevant optimizations. Implementation of "endpoints" would be huge for interoperability between packages that use OpenMP or a similar threading system and those that use MPI only (perhaps with shared memory windows).
• For progress of RMA operations, I use a process-based progress engine for MPI RMA, namely Casper.
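One of the comments above singles out neighborhood collectives as among the most relevant optimizations for the halo-style patterns reported in the decomposition question. The sketch below is a minimal, hypothetical example of MPI_Neighbor_alltoall on a periodic one-dimensional Cartesian communicator; the 1-D topology and buffer sizes are illustrative choices, not a recommendation from the survey.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal neighborhood-collective sketch: build a periodic 1-D Cartesian
 * communicator and exchange one int with each of the two neighbors in a
 * single MPI_Neighbor_alltoall call instead of hand-written send/recv pairs. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[1]    = { size };
    int periods[1] = { 1 };                  /* cyclic boundary conditions */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &cart);

    int rank;
    MPI_Comm_rank(cart, &rank);

    /* One send element and one receive element per neighbor (left, right). */
    int sendbuf[2] = { rank, rank };
    int recvbuf[2] = { -1, -1 };
    MPI_Neighbor_alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, cart);

    printf("rank %d got %d from left and %d from right\n",
           rank, recvbuf[0], recvbuf[1]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```

Because the neighbor relationships are attached to the communicator, the MPI implementation is free to optimize the exchange for the underlying network topology.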
Recommended publications
  • Bench - Benchmarking the State-of-the-Art Task Execution Frameworks of Many-Task Computing
    MATRIX: Bench - Benchmarking the State-of-the-Art Task Execution Frameworks of Many-Task Computing. Thomas Dubucq, Tony Forlini, Virgile Landeiro Dos Reis, and Isabelle Santos. Illinois Institute of Technology, Chicago, IL, USA {tdubucq, tforlini, vlandeir, isantos1}@hawk.iit.edu

    Abstract — Technology trends indicate that exascale systems will have billion-way parallelism, and each node will have about three orders of magnitude more intra-node parallelism than today's peta-scale systems. The majority of current runtime systems focus a great deal of effort on optimizing the inter-node parallelism by maximizing the bandwidth and minimizing the latency of the use of interconnection networks and storage, but suffer from the lack of scalable solutions to expose the intra-node parallelism. Many-task computing (MTC) is a distributed fine-grained paradigm that aims to address the challenges of managing parallelism and locality of exascale systems. MTC applications are typically structured as direct acyclic graphs of loosely coupled short tasks with explicit input/output data dependencies.

    Finally, HPX is a general-purpose C++ runtime system for parallel and distributed applications of any scale developed by Louisiana State University, and STAPL is a framework for developing parallel programs from Texas A&M. MATRIX is a many-task computing job scheduling system [3]. There are many resource managing systems aimed towards data-intensive applications. Furthermore, distributed task scheduling in many-task computing is a problem that has been considered by many research teams. In particular, Charm++ [4], Legion [5], Swift [6], [10], Spark [1][2], HPX [12], STAPL [13] and MATRIX [11] offer solutions to this problem and have
  • Adaptive Data Migration in Load-Imbalanced HPC Applications
    Louisiana State University LSU Digital Commons LSU Doctoral Dissertations Graduate School 10-16-2020 Adaptive Data Migration in Load-Imbalanced HPC Applications Parsa Amini Louisiana State University and Agricultural and Mechanical College Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_dissertations Part of the Computer Sciences Commons Recommended Citation Amini, Parsa, "Adaptive Data Migration in Load-Imbalanced HPC Applications" (2020). LSU Doctoral Dissertations. 5370. https://digitalcommons.lsu.edu/gradschool_dissertations/5370 This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in LSU Doctoral Dissertations by an authorized graduate school editor of LSU Digital Commons. For more information, please [email protected]. ADAPTIVE DATA MIGRATION IN LOAD-IMBALANCED HPC APPLICATIONS A Dissertation Submitted to the Graduate Faculty of the Louisiana State University and Agricultural and Mechanical College in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Department of Computer Science by Parsa Amini B.S., Shahed University, 2013 M.S., New Mexico State University, 2015 December 2020 Acknowledgments This effort has been possible, thanks to the involvement and assistance of numerous people. First and foremost, I thank my advisor, Dr. Hartmut Kaiser, who made this journey possible with their invaluable support, precise guidance, and generous sharing of expertise. It has been a great privilege and opportunity for me be your student, a part of the STE||AR group, and the HPX development effort. I would also like to thank my mentor and former advisor at New Mexico State University, Dr.
  • Benchmarking the Intel FPGA SDK for OpenCL Memory Interface
    The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface. Hamid Reza Zohouri (Tokyo Institute of Technology, Edgecortix Inc. Japan) and Satoshi Matsuoka (Tokyo Institute of Technology, RIKEN Center for Computational Science (R-CCS)) {zohour.h.aa@m, matsu@is}.titech.ac.jp

    Abstract — Supported by their high power efficiency and recent advancements in High Level Synthesis (HLS), FPGAs are quickly finding their way into HPC and cloud systems. Large amounts of work have been done so far on loop and area optimizations for different applications on FPGAs using HLS. However, a comprehensive analysis of the behavior and efficiency of the memory controller of FPGAs is missing in literature, which becomes even more crucial when the limited memory bandwidth of modern FPGAs compared to their GPU counterparts is taken into account. In this work, we will analyze the memory interface generated by Intel FPGA SDK for OpenCL with different configurations for input/output arrays, vector size, interleaving, kernel programming model, on-chip channels, operating frequency, padding, and multiple types of overlapped blocking. Our results point to multiple shortcomings in the memory controller of Intel FPGAs, especially with respect to memory access alignment, that can hinder the programmer's

    efficiency on Intel FPGAs with different configurations for input/output arrays, vector size, interleaving, kernel programming model, on-chip channels, operating frequency, padding, and multiple types of blocking. • We outline one performance bug in Intel's compiler, and multiple deficiencies in the memory controller, leading to significant loss of memory performance for typical applications. In some of these cases, we provide work-arounds to improve the memory performance. II. Methodology. A. Memory Benchmark Suite. For our evaluation, we develop an open-source benchmark suite called FPGAMemBench, available at https://github.com/zohourih/FPGAMemBench.
  • BCL: A Cross-Platform Distributed Data Structures Library
    BCL: A Cross-Platform Distributed Data Structures Library. Benjamin Brock, Aydın Buluç, Katherine Yelick. University of California, Berkeley; Lawrence Berkeley National Laboratory. {brock,abuluc,yelick}@cs.berkeley.edu

    ABSTRACT: One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI's one-sided interface and PGAS programming languages, lack application-level libraries to support these applications. We present the Berkeley Container Library, a set of generic, cross-platform, high-performance data structures for irregular applications, including queues, hash tables, Bloom filters and more. BCL is written in C++ using an internal DSL called the BCL Core that provides one-sided communication primitives such as remote get and remote put operations. The BCL Core has backends for MPI, OpenSHMEM, GASNet-EX, and UPC++, allowing BCL data structures to be used natively in programs written using any of these programming environments.

    high-performance computing, including several using the Partitioned Global Address Space (PGAS) model: Titanium, UPC, Coarray Fortran, X10, and Chapel [9, 11, 12, 25, 29, 30]. These languages are especially well-suited to problems that require asynchronous one-sided communication, or communication that takes place without a matching receive operation or outside of a global collective. However, PGAS languages lack the kind of high level libraries that exist in other popular programming environments. For example, high-performance scientific simulations written in MPI can leverage a broad set of numerical libraries for dense or sparse matrices, or for structured, unstructured, or adaptive meshes. PGAS languages can sometimes use those numerical libraries, but are missing the data structures that are important in some of the most irregular
  • Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
    Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit. Jarek Nieplocha, Bruce Palmer, Vinod Tipparaju, Manojkumar Krishnan, Harold Trease, Edoardo Aprà. Abstract: This paper describes capabilities, evolution, performance, and applications of the Global Arrays (GA) toolkit. GA was created to provide application programmers with an interface that allows them to distribute data while maintaining the type of global index space and programming syntax similar to that available when programming on a single

    1 Introduction. The two predominant classes of programming models for MIMD concurrent computing are distributed memory and shared memory. Both shared memory and distributed memory models have advantages and shortcomings. Shared memory model is much easier to use but it ignores data locality/placement. Given the hierarchical nature of the memory subsystems in modern computers this characteristic can have a negative impact on performance and scalability. Careful code restructuring to increase data reuse and replacing fine grain load/stores with block access to shared data can address the problem and yield performance for shared memory that is competitive with message-passing (Shan and Singh 2000). However, this performance comes at the cost of compromising the ease of use that the shared memory model advertises. Distributed memory models, such as message-passing or one-sided communication, offer performance and scalability but they are difficult to program. The Global Arrays toolkit (Nieplocha, Harrison, and Littlefield 1994, 1996; Nieplocha et al. 2002a) attempts to offer the best features of both models. It implements a shared-memory programming model in which data locality is managed by the programmer.
  • Beowulf Clusters — an Overview
    WinterSchool 2001, Å. Ødegård. Beowulf clusters — an overview. Åsmund Ødegård, April 4, 2001. Contents: Introduction; What is a Beowulf; The history of Beowulf; Who can build a Beowulf; How to design a Beowulf; Beowulfs in more detail; Rules of thumb; What Beowulfs are Good For; Experiments; 3D nonlinear acoustic fields; Incompressible Navier–Stokes; 3D nonlinear water wave. Introduction. Why clusters? • "Work harder" - more CPU power, more memory, more everything • "Work smarter" - better algorithms • "Get help" - let more boxes work together to solve the problem - parallel processing (after Greg Pfister). Beowulfs in the Parallel Computing picture: Parallel Computing, MetaComputing, Clusters, Tightly Coupled, Vector, WS farms, Pile of PCs, NOW, NT/Win2k Clusters, Beowulf, CC-NUMA. What is a Beowulf? • Mass-market commodity off the shelf (COTS) • Low cost local area network (LAN) • Open Source UNIX-like operating system (OS) • Execute parallel applications programmed with a message passing model (MPI) • Anything from small systems to large, fast systems; the fastest ranks as no. 84 on today's Top500 • The best price/performance system available for many applications • Philosophy: the cheapest system available which solves your problem in reasonable time. The history of Beowulf. • 1993: perfect conditions for the first Beowulf - Major CPU performance advance: 80286 -> 80386 - DRAM of reasonable costs and densities (8 MB) - Disk drives of several 100 MBs available for PC - Ethernet (10 Mbps) controllers and hubs cheap enough - Linux improved rapidly, and was in a usable state - PVM widely accepted as a cross-platform message passing model • Clustering was done with commercial UNIX, but the cost was high.
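Since the defining trait above is that a Beowulf runs parallel applications written to a message-passing model (MPI), a minimal illustration is the classic MPI hello-world; the program and the mpicc/mpirun commands mentioned afterwards are a generic sketch, not part of the original overview.

```c
#include <mpi.h>
#include <stdio.h>

/* Classic MPI hello-world: each process on the cluster reports its rank. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char host[MPI_MAX_PROCESSOR_NAME];
    int  len;
    MPI_Get_processor_name(host, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```

Built and launched with the usual tools (for example `mpicc hello.c -o hello` followed by `mpirun -np 8 ./hello`), this is the kind of job a commodity cluster is expected to run out of the box.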
  • Improving MPI Threading Support for Current Hardware Architectures
    University of Tennessee, Knoxville. TRACE: Tennessee Research and Creative Exchange. Doctoral Dissertations, Graduate School, 12-2019. Improving MPI Threading Support for Current Hardware Architectures. Thananon Patinyasakdikul, University of Tennessee, [email protected]. Follow this and additional works at: https://trace.tennessee.edu/utk_graddiss Recommended Citation: Patinyasakdikul, Thananon, "Improving MPI Threading Support for Current Hardware Architectures." PhD diss., University of Tennessee, 2019. https://trace.tennessee.edu/utk_graddiss/5631 This Dissertation is brought to you for free and open access by the Graduate School at TRACE: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Doctoral Dissertations by an authorized administrator of TRACE: Tennessee Research and Creative Exchange. For more information, please contact [email protected]. To the Graduate Council: I am submitting herewith a dissertation written by Thananon Patinyasakdikul entitled "Improving MPI Threading Support for Current Hardware Architectures." I have examined the final electronic copy of this dissertation for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, with a major in Computer Science. Jack Dongarra, Major Professor. We have read this dissertation and recommend its acceptance: Michael Berry, Michela Taufer, Yingkui Li. Accepted for the Council: Dixie L. Thompson, Vice Provost and Dean of the Graduate School. (Original signatures are on file with official student records.) Improving MPI Threading Support for Current Hardware Architectures: A Dissertation Presented for the Doctor of Philosophy Degree, The University of Tennessee, Knoxville. Thananon Patinyasakdikul, December 2019. © by Thananon Patinyasakdikul, 2019. All Rights Reserved. To my parents Thanawij and Issaree Patinyasakdikul, my little brother Thanarat Patinyasakdikul, for their love, trust and support.
  • Enabling Efficient Use of UPC and OpenSHMEM PGAS Models on GPU Clusters
    Enabling Efficient Use of UPC and OpenSHMEM PGAS Models on GPU Clusters. Presented at GTC '15 by Dhabaleswar K. (DK) Panda, The Ohio State University. E-mail: [email protected] http://www.cse.ohio-state.edu/~panda

    Accelerator Era: • Accelerators are becoming common in high-end system architectures (Top 100, Nov 2014: 28% use accelerators, 57% use NVIDIA GPUs) • Increasing number of workloads are being ported to take advantage of NVIDIA GPUs • As they scale to large GPU clusters with high compute density - the higher the synchronization and communication overheads, the higher the penalty • Critical to minimize these overheads to achieve maximum performance. Parallel Programming Models Overview: Shared Memory Model (e.g. DSM), Distributed Memory Model (e.g. MPI, the Message Passing Interface), and Partitioned Global Address Space (PGAS) models with logical shared memory (e.g. Global Arrays, UPC, Chapel, X10, CAF, ...). • Programming models provide abstract machine models • Models can be mapped on different types of systems - e.g. Distributed Shared Memory (DSM), MPI within a node, etc. • Each model has strengths and drawbacks - suits different problems or applications. Outline: • Overview of PGAS models (UPC and OpenSHMEM) • Limitations in PGAS models for GPU computing • Proposed Designs and Alternatives • Performance Evaluation • Exploiting GPUDirect RDMA. Partitioned Global Address Space (PGAS) Models: • PGAS models, an attractive alternative to traditional message passing - Simple
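To make the PGAS idea in the overview concrete, here is a minimal OpenSHMEM sketch of a one-sided put into a symmetric variable on a neighboring PE. It is independent of the GPU-specific designs the talk goes on to propose, and the variable names and neighbor choice are illustrative.

```c
#include <shmem.h>
#include <stdio.h>

/* Minimal OpenSHMEM PGAS sketch: every PE allocates a symmetric integer
 * and writes its PE number into the copy on the next PE with a one-sided
 * put -- no matching receive is posted on the target side. */
int main(void)
{
    shmem_init();

    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    int *dest = (int *)shmem_malloc(sizeof(int));   /* symmetric allocation */
    *dest = -1;
    shmem_barrier_all();                            /* allocation visible everywhere */

    shmem_int_p(dest, me, (me + 1) % npes);         /* one-sided put to neighbor */
    shmem_barrier_all();                            /* complete and order the puts */

    printf("PE %d of %d received %d\n", me, npes, *dest);

    shmem_free(dest);
    shmem_finalize();
    return 0;
}
```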
  • Automatic Handling of Global Variables for Multi-Threaded MPI Programs
    Automatic Handling of Global Variables for Multi-threaded MPI Programs. Gengbin Zheng, Stas Negara, Celso L. Mendes, Laxmikant V. Kalé (Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA) and Eduardo R. Rodrigues (Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil). {gzheng,snegara2,cmendes,[email protected] [email protected]

    Abstract — Hybrid programming models, such as MPI combined with threads, are one of the most efficient ways to write parallel applications for current machines comprising multi-socket/multi-core nodes and an interconnection network. Global variables in legacy MPI applications, however, present a challenge because they may be accessed by multiple MPI threads simultaneously. Thus, transforming legacy MPI applications to be thread-safe in order to exploit multi-core architectures requires proper handling of global variables. In this paper, we present three approaches to eliminate these global variables to ensure thread-safety for an MPI program. These approaches include: (a) a compiler-based refactoring technique, using a Photran-based tool as an example, which automates the source-to-source transformation for programs

    necessary to privatize global and static variables to ensure the thread-safety of an application. In high-performance computing, one way to exploit multi-core platforms is to adopt hybrid programming models that combine MPI with threads, where different programming models are used to address issues such as multiple threads per core, decreasing amount of memory per core, load imbalance, etc. When porting legacy MPI applications to hybrid models that involve multiple threads, thread-safety of the applications needs to be addressed again. Global variables cause no problem with traditional MPI implementations, since each process image contains a separate instance of the variable.
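The privatization problem described above can be illustrated with a small sketch: a file-scope counter that would race when several threads act as independent MPI endpoints is given thread-local storage so each thread sees its own instance. This is only a generic illustration of the privatization idea using C11 _Thread_local and OpenMP; it is not the paper's compiler-based tool, and the names are made up for the example.

```c
#include <stdio.h>
#include <omp.h>

/* Illustrative privatization sketch (not the paper's approach itself): a
 * mutable file-scope variable is unsafe when several threads act as
 * independent "ranks", so it is declared _Thread_local to give each
 * thread its own private instance. */
static _Thread_local int iteration_count = 0;   /* privatized global */

static void do_work(int thread_id)
{
    for (int i = 0; i < 1000; ++i)
        iteration_count++;                      /* no data race: per-thread copy */
    printf("thread %d counted %d iterations\n", thread_id, iteration_count);
}

int main(void)
{
    #pragma omp parallel
    do_work(omp_get_thread_num());
    return 0;
}
```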
  • Overview of the Global Arrays Parallel Software Development Toolkit: Introduction to Global Address Space Programming Models
    Overview of the Global Arrays Parallel Software Development Toolkit: Introduction to Global Address Space Programming Models. P. Saddayappan (Ohio State University); Bruce Palmer, Manojkumar Krishnan, Sriram Krishnamoorthy, Abhinav Vishnu, Daniel Chavarría, Patrick Nichols, Jeff Daily (Pacific Northwest National Laboratory).

    Outline of the Tutorial: • Parallel programming models • Global Arrays (GA) programming model • GA Operations • Writing, compiling and running GA programs • Basic, intermediate, and advanced calls • With C and Fortran examples • GA Hands-on session. [Slide: Performance vs. Abstraction and Generality - chart placing domain-specific systems, GA, CAF, MPI, OpenMP, and autoparallelized C/Fortran90 along scalability and generality axes, with the "Holy Grail" combining both.] Parallel Programming Models: • Single Threaded - Data Parallel, e.g. HPF • Multiple Processes - Partitioned-Local Data Access (MPI) - Uniform-Global-Shared Data Access (OpenMP) - Partitioned-Global-Shared Data Access (Co-Array Fortran) - Uniform-Global-Shared + Partitioned Data Access (UPC, Global Arrays, X10). High Performance Fortran: • Single-threaded view of computation • Data parallelism and parallel loops • User-specified data distributions for arrays • Compiler transforms HPF program to SPMD program • Communication optimization critical to performance • Programmer may not be conscious of communication implications of parallel program. [Slide code fragments: HPF$ Independent directives on DO I = 1,N loops, s=s+1, A(1:100) = B(0:99)+B(2:101).]
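The outline above promises C examples of writing, compiling and running GA programs. As a hedged sketch of that flow (assuming the standard C API shipped with the Global Arrays distribution; the array shape, chunking, and MA sizes are illustrative), a minimal create/put/get/sync program looks roughly like this:

```c
#include <stdio.h>
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

/* Hedged Global Arrays sketch: create a distributed 2-D array, have
 * process 0 fill a small patch through the global index space with a
 * one-sided put, and read it back on every process after a sync. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);          /* MA sizes are illustrative */

    int dims[2]  = { 8, 8 };
    int chunk[2] = { -1, -1 };                 /* let GA choose the distribution */
    int g_a = NGA_Create(C_DBL, 2, dims, "demo", chunk);

    double patch[4] = { 1.0, 2.0, 3.0, 4.0 };
    int lo[2] = { 0, 0 }, hi[2] = { 1, 1 }, ld[1] = { 2 };

    if (GA_Nodeid() == 0)
        NGA_Put(g_a, lo, hi, patch, ld);       /* one-sided write via global indices */
    GA_Sync();                                  /* make the put visible everywhere */

    double check[4];
    NGA_Get(g_a, lo, hi, check, ld);
    printf("process %d sees corner element %g\n", GA_Nodeid(), check[0]);

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```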
  • Exascale Computing Project -- Software
    Exascale Computing Project -- Software. Paul Messina, ECP Director; Stephen Lee, ECP Deputy Director. ASCAC Meeting, Arlington, VA, Crystal City Marriott, April 19, 2017. www.ExascaleProject.org

    ECP scope and goals: • Develop applications to tackle a broad spectrum of mission critical problems of unprecedented complexity • Partner with vendors to develop computer architectures that support exascale applications • Support national security • Develop a software stack that is both exascale-capable and usable on industrial & academic scale systems, in collaboration with vendors • Train a next-generation workforce of computational scientists, engineers, and computer scientists • Contribute to the economic competitiveness of the nation. ECP has formulated a holistic approach that uses co-design and integration to achieve capable exascale. [Diagram: Application Development (science and mission applications), Software Technology (scalable and productive software stack: programming models, development environment, and runtimes; math libraries and frameworks; tools; correctness, visualization, and data analysis; system software, resource management, threading, scheduling, monitoring, and control; data management, I/O and file system; memory and burst buffer; node OS and runtimes; workflows; resilience; hardware interface), Hardware Technology (hardware technology elements), and Exascale Systems (integrated exascale supercomputers).] ECP's work encompasses applications, system software, hardware technologies and architectures, and workforce
  • The Global Arrays User Manual
    The Global Arrays User Manual. Manojkumar Krishnan, Bruce Palmer, Abhinav Vishnu, Sriram Krishnamoorthy, Jeff Daily, Daniel Chavarria. February 8, 2012. This document is intended to be used with version 5.1 of Global Arrays (Pacific Northwest National Laboratory Technical Report Number PNNL-13130).

    DISCLAIMER: This material was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the United States Department of Energy, nor Battelle, nor any of their employees, MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LEGAL LIABILITY OR RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR USEFULNESS OF ANY INFORMATION, APPARATUS, PRODUCT, SOFTWARE, OR PROCESS DISCLOSED, OR REPRESENTS THAT ITS USE WOULD NOT INFRINGE PRIVATELY OWNED RIGHTS. ACKNOWLEDGMENT: This software and its documentation were produced with United States Government support under Contract Number DE-AC05-76RLO-1830 awarded by the United States Department of Energy. The United States Government retains a paid-up non-exclusive, irrevocable worldwide license to reproduce, prepare derivative works, perform publicly and display publicly by or for the US Government, including the right to distribute to other US Government contractors. December, 2011.

    Contents: 1 Introduction (1.1 Overview; 1.2 Basic Functionality; 1.3 Programming Model; 1.4 Application Guidelines: 1.4.1 When to use GA, 1.4.2 When not to use GA); 2 Writing, Building and Running GA Programs (2.1 Platform and Library Dependencies: 2.1.1 Supported Platforms, 2.1.2 Selection of the communication network for ARMCI)