Introduction to Parallel Programming with MPI and OpenMP


INTRODUCTION TO PARALLEL PROGRAMMING WITH MPI AND OPENMP
1–5 March 2021 | Benedikt Steinbusch | Jülich Supercomputing Centre
Member of the Helmholtz Association

CONTENTS

I Fundamentals of Parallel Computing
  1 Motivation
  2 Hardware
  3 Software

II First Steps with MPI
  4 What is MPI?
  5 Terminology
  6 Infrastructure
  7 Basic Program Structure
  8 Exercises

III Blocking Point-to-Point Communication
  9 Introduction
  10 Sending
  11 Exercises
  12 Receiving
  13 Exercises
  14 Communication Modes
  15 Semantics
  16 Pitfalls
  17 Exercises

IV Nonblocking Point-to-Point Communication
  18 Introduction
  19 Start
  20 Completion
  21 Remarks
  22 Exercises

V Collective Communication
  23 Introduction
  24 Reductions
  25 Reduction Variants
  26 Exercises
  27 Data Movement
  28 Data Movement Variants
  29 Exercises
  30 In Place Mode
  31 Synchronization
  32 Nonblocking Collective Communication

VI Derived Datatypes
  33 Introduction
  34 Constructors
  35 Exercises
  36 Address Calculation
  37 Padding
  38 Exercises

VII Input/Output
  39 Introduction
  40 File Manipulation
  41 File Views
  42 Data Access
  43 Consistency
  44 Exercises

VIII Tools
  45 MUST
  46 Exercises

IX Communicators
  47 Introduction
  48 Constructors
  49 Accessors
  50 Destructors
  51 Exercises

X Thread Compliance
  52 Introduction
  53 Enabling Thread Support
  54 Matching Probe and Receive
  55 Remarks

XI First Steps with OpenMP
  56 What is OpenMP?
  57 Terminology
  58 Infrastructure
  59 Basic Program Structure
  60 Exercises

XII Low-Level OpenMP Concepts
  61 Introduction
  62 Exercises
  63 Data Environment
  64 Exercises
  65 Thread Synchronization
  66 Exercises

XIII Worksharing
  67 Introduction
  68 The single Construct
  69 single Clauses
  70 The loop Construct
  71 loop Clauses
  72 Exercises
  73 The workshare Construct
  74 Exercises
  75 Combined Constructs

XIV Task Worksharing
  76 Introduction
  77 The task Construct
  78 task Clauses
  79 Task Scheduling
  80 Task Synchronization
  81 Exercises

XV Wrap-up

XVI Tutorial

TIMETABLE

             Day 1               Day 2                Day 3              Day 4              (Day 5)
09:00–10:30  Fundamentals of     Blocking Collective  I/O                First Steps with   Tutorial
             Parallel Computing  Communication                           OpenMP
             COFFEE BREAK
11:00–12:30  First Steps with    Nonblocking          I/O                Low-Level          Tutorial
             MPI                 Collective Comm.                        Constructs
             LUNCH BREAK
13:30–14:30  Blocking P2P        Derived Datatypes    Tools &            Loop Worksharing   Tutorial
             Communication                            Communicators
             COFFEE BREAK
15:00–16:30  Nonblocking P2P     Derived Datatypes    Thread Compliance  Task Worksharing   Tutorial
             Communication

PART I
FUNDAMENTALS OF PARALLEL COMPUTING

1 MOTIVATION

PARALLEL COMPUTING
Parallel computing is a type of computation in which many calculations or the execution of processes are carried out simultaneously. (Wikipedia¹)

¹ Wikipedia. Parallel computing — Wikipedia, The Free Encyclopedia. 2017. URL: https://en.wikipedia.org/w/index.php?title=Parallel_computing&oldid=787466585 (visited on 06/28/2017).

WHY AM I HERE?

The Way Forward
• Frequency scaling has stopped
• Performance increase through more parallel hardware
• Treating scientific problems
  – of larger scale
  – in higher accuracy
  – of a completely new kind
2 HARDWARE

A MODERN SUPERCOMPUTER
[Figure: a modern supercomputer as a set of nodes, each containing a CPU with main memory and an accelerator with its own RAM, connected node-internally by a bus and to the other nodes by an interconnect.]

PARALLEL COMPUTATIONAL UNITS

Implicit Parallelism
• Parallel execution of different (parts of) processor instructions
• Happens automatically
• Can only be influenced indirectly by the programmer

Multi-core / Multi-CPU
• Found in commodity hardware today
• Computational units share the same memory

Cluster
• Found in computing centers
• Independent systems linked via a (fast) interconnect
• Each system has its own memory

Accelerators
• Strive to perform certain tasks faster than is possible on a general-purpose CPU
• Make different trade-offs
• Often have their own memory
• Often not autonomous

Vector Processors / Vector Units
• Perform the same operation on multiple pieces of data simultaneously
• Making a come-back as SIMD units in commodity CPUs (AVX-512) and GPGPUs

PARALLELISM IN THE TOP 500 LIST
[Plot: average number of cores of the top 10 systems, 1993–2020, logarithmic scale from 10⁴ to 10⁶ cores.]

MEMORY DOMAINS

Shared Memory
• All memory is directly accessible by the parallel computational units
• Single address space
• Programmer might have to synchronize access

Distributed Memory
• Memory is partitioned into parts which are private to the different computational units
• “Remote” parts of memory are accessed via an interconnect
• Access is usually nonuniform

3 SOFTWARE

PROCESSES & THREADS & TASKS
Abstractions for the independent execution of (part of) a program.

Process
• Usually, multiple processes, each with their own associated set of resources (memory, file descriptors, etc.), can coexist

Thread
• Typically “smaller” than processes
• Often, multiple threads per process
• Threads of the same process can share resources

Task
• Typically “smaller” than threads
• Often, multiple tasks per thread
• Here: a user-level construct

DISTRIBUTED STATE & MESSAGE PASSING

Distributed State
Program state is partitioned into parts which are private to the different processes.

Message Passing
• Parts of program state are transferred from one process to another for coordination
• Primitive operations are active send and active receive

MPI
• Implements a form of Distributed State and Message Passing
• (But also Shared State and Synchronization)

SHARED STATE & SYNCHRONIZATION

Shared State
The whole program state is directly accessible by the parallel threads.

Synchronization
• Threads can manipulate shared state using common loads and stores
• Establish agreement about the progress of execution using synchronization primitives, e.g. barriers, critical sections, …

OpenMP
• Implements Shared State and Synchronization
• (But also higher-level constructs)
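A minimal C sketch of this shared-state-and-synchronization model (an illustrative example added here, assuming a compiler with OpenMP support, e.g. cc -fopenmp): every thread updates a shared counter with ordinary loads and stores, a critical section makes the update race-free, and a barrier establishes agreement on the progress of execution before the result is read.

#include <stdio.h>

int main(void) {
    int counter = 0;                        /* shared state, visible to all threads */

    #pragma omp parallel
    {
        #pragma omp critical                /* synchronization primitive: critical section */
        counter += 1;                       /* ordinary load/store on the shared variable */

        #pragma omp barrier                 /* synchronization primitive: barrier */

        #pragma omp single                  /* one thread reports the agreed-upon result */
        printf("%d threads incremented the shared counter\n", counter);
    }
    return 0;
}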
PART II
FIRST STEPS WITH MPI

4 WHAT IS MPI?

MPI (Message-Passing Interface) is a message-passing library interface specification. [...] MPI addresses primarily the message-passing parallel programming model, in which data is moved from the address space of one process to that of another process through cooperative operations on each process. (MPI Forum²)

² Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. Version 3.1. June 4, 2015. URL: https://mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf.

• Industry standard for a message-passing programming model
• Provides specifications (no implementations)
• Implemented as a library with language bindings for Fortran and C
• Portable across different computer architectures

Current version of the standard: 3.1 (June 2015)

BRIEF HISTORY

<1992  Several message-passing libraries were developed (PVM, P4, ...)
1992   At SC92, several developers of message-passing libraries agreed to develop a standard for message passing
1994   MPI-1.0 standard published
1997   MPI-2.0 adds process creation and management, one-sided communication, extended collective communication, external interfaces and parallel I/O
2008   MPI-2.1 combines MPI-1.3 and MPI-2.0
2009   MPI-2.2 corrections and clarifications with minor extensions
2012   MPI-3.0 nonblocking collectives, new one-sided operations, Fortran 2008 bindings
2015   MPI-3.1 nonblocking collective I/O, current version of the standard

COVERAGE

1. Introduction to MPI ✓
2. MPI Terms and Conventions ✓
3. Point-to-Point Communication ✓
4. Datatypes ✓
5. Collective Communication ✓
6. Groups, Contexts, Communicators and Caching (✓)
7. Process Topologies (✓)
8. MPI Environmental Management (✓)
9. The Info Object
10. Process Creation and Management
11. One-Sided Communications
12. External Interfaces (✓)
13. I/O ✓
14. Tool Support
15. …

LITERATURE & ACKNOWLEDGEMENTS

Literature
• Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. Version 3.1. June 4, 2015. URL: https://mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
• William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI. Portable Parallel Programming with the Message-Passing Interface. 3rd ed. The MIT Press, Nov. 2014. 336 pp. ISBN: 9780262527392
• William Gropp et al. Using Advanced MPI. Modern Features of the Message-Passing Interface. 1st ed. The MIT Press, Nov. 2014. 392 pp. ISBN: 9780262527637
• https://www.mpi-forum.org

Acknowledgements
• Rolf Rabenseifner for his comprehensive course on MPI and OpenMP
• Marc-André Hermanns, Florian Janetzko and Alexander Trautmann for their course material on MPI and OpenMP

5 TERMINOLOGY

PROCESS ORGANIZATION [MPI-3.1, 6.2]

Terminology: Process
An MPI program consists of autonomous processes, executing their own code, in an MIMD style.

Terminology: Rank
A unique number assigned to each process within a group (starting at 0)

Terminology: Group
An ordered set of process identifiers

Terminology: Context
A property of a communicator that partitions the communication space; a message sent in one context cannot be received in another context

6 INFRASTRUCTURE

C Compiler Wrappers

$ # Generic compiler wrapper shipped with e.g. OpenMPI
$ mpicc foo.c -o foo
$ # Vendor-specific wrapper for IBM's XL C compiler on BG/Q
$ bgxlc foo.c -o foo

Fortran Compiler Wrappers

$ # Generic compiler wrapper shipped with e.g. OpenMPI
$ mpifort foo.f90 -o foo
$ # Vendor-specific wrapper for IBM's XL Fortran compiler on BG/Q
$ bgxlf90 foo.f90 -o foo

However, neither the existence nor the interface of these wrappers is mandated by the MPI standard.
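Putting the terminology and the compiler wrappers together, a minimal MPI program in C might look as follows (an illustrative sketch with an assumed file name hello.c, not an excerpt from the standard or the exercises): every process queries its rank and the size of the group behind MPI_COMM_WORLD, and one pair of processes performs a cooperative send/receive. Build and run with e.g. $ mpicc hello.c -o hello and $ mpiexec -n 2 ./hello.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                  /* start the MPI environment */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank: unique number within the group, starting at 0 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* size of the group of processes */

    if (rank == 1) {
        int payload = 42;
        MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* cooperative send to rank 0 */
    } else if (rank == 0 && size > 1) {
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                            /* matching receive from rank 1 */
        printf("rank 0 of %d received %d from rank 1\n", size, payload);
    }

    MPI_Finalize();                          /* shut down the MPI environment */
    return 0;
}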