Parallel Simulation of Brownian Dynamics on Shared Memory


Parallel simulation of Brownian dynamics on shared memory systems with OpenMP and Unified Parallel C

Carlos Teijeiro · Godehard Sutmann · Guillermo L. Taboada · Juan Touriño

C. Teijeiro, G. L. Taboada, J. Touriño
Computer Architecture Group, University of A Coruña, 15071 A Coruña, Spain
E-mail: {cteijeiro,taboada,juan}@udc.es

G. Sutmann
Jülich Supercomputing Centre (JSC), Institute for Advanced Simulation, Forschungszentrum Jülich GmbH, 52425 Jülich, Germany
ICAMS, Ruhr-University Bochum, D-44801 Bochum, Germany
E-mail: [email protected]

Abstract The simulation of particle dynamics is an essential method to analyze and predict the behavior of molecules in a given medium. This work presents the design and implementation of a parallel simulation of Brownian dynamics with hydrodynamic interactions for shared memory systems using two approaches: (1) OpenMP directives and (2) the Partitioned Global Address Space (PGAS) paradigm with the Unified Parallel C (UPC) language. The structure of the code is described, and different techniques for work distribution are analyzed in terms of efficiency, in order to select the most suitable strategy for each part of the simulation. Additionally, performance results have been collected from two representative NUMA systems, and they are studied and compared against the original sequential code.

Keywords Brownian dynamics · parallel simulation · OpenMP · PGAS · UPC · shared memory

1 Introduction

Particle-based simulation methods have been continuously used in physics and biology to model the behavior of different elements (e.g., molecules, cells) in a medium (e.g., fluid, gas) under thermodynamical conditions (e.g., temperature, density). These methods represent a simplification of the real-world scenario but often provide enough detail to model and predict the state and evolution of a system on a given time and length scale. Among them, Brownian dynamics describes the movement of particles in solution on a diffusive time scale, where the solvent particles are taken into account only through their statistical properties, i.e. the interactions between solute and solvent are modeled as a stochastic process.

The main goal of this work is to provide a clear and efficient parallelization of the Brownian dynamics simulation using two different approaches: OpenMP and the PGAS paradigm. On the one hand, OpenMP facilitates the parallelization of codes by simply introducing directives (pragmas) to create parallel regions in the code (with private and shared variables) and distribute their associated workload among threads. Here the access to shared variables is concurrent for all threads, so the programmer has to control possible data dependencies and deadlocks. On the other hand, PGAS is an emerging paradigm that treats memory as a global address space divided into private and shared areas. The shared area is logically partitioned into chunks with affinity to different threads. Using the UPC language [15] (which extends ANSI C with PGAS support), the programmer can deal directly with workload distribution and data storage by means of different constructs, such as assignments to shared variables, collective functions or parallel loop definitions.
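As an illustration of these two styles, the following minimal sketch (not taken from the simulation code itself) applies a precomputed displacement to a hypothetical array of coordinates, first with an OpenMP parallel loop and then with a UPC upc_forall loop whose affinity expression binds each iteration to the thread owning the corresponding shared element. The names pos, dr and the array size N are placeholders introduced for this example.

/* OpenMP version: the pragma distributes the loop iterations among the
 * threads of the parallel region; pos and dr are shared, the loop index
 * is private to each thread. */
#include <omp.h>

#define N 1024

void update_openmp(double *pos, const double *dr)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        pos[i] += dr[i];
}

/* UPC version: pos and dr live in the global shared space with the
 * default cyclic distribution; the affinity expression &pos[i] makes
 * each thread execute only the iterations whose elements it owns.
 * Assumes a static THREADS compilation environment. */
#include <upc_relaxed.h>

#define N 1024

shared double pos[N];
shared double dr[N];

void update_upc(void)
{
    int i;
    upc_forall (i = 0; i < N; i++; &pos[i])
        pos[i] += dr[i];
}

In both sketches the loop body is identical; the difference lies in whether the distribution of iterations is driven by the runtime schedule (OpenMP) or by the affinity of the data to each thread (UPC).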
The rest of the paper presents the design and implementation of the parallel code. First, some related work in this area is presented, and then a mathematical and computational description of the different parts of the sequential code is provided. Next, the most relevant design decisions taken for the parallel implementation of the code, according to the nature of the problem, are discussed. Finally, a performance analysis of the parallel code is accomplished, and the main conclusions are drawn.

2 Related work

There are multiple works on simulations based on Brownian dynamics, e.g. several simulation tools such as BrownDye [7] and the BROWNFLEX program included in the SIMUFLEX suite [5], and many other studies oriented to specific fields, such as DNA research [8] or copolymers [10]. Although most of these works focus on sequential codes, some of them have developed parallel implementations with OpenMP, applied to smoothed particle hydrodynamics (SPH) [6] and the transport of particles in a fluid [16]. The UPC language has also been used in a few works related to particle simulations, such as the implementation of the particle-in-cell algorithm [12]. However, comparative analyses between parallel programming approaches on shared memory with OpenMP and UPC are mainly restricted to some computational kernels for different purposes [11,17], without considering large simulation applications.

The most relevant recent work on parallel Brownian dynamics simulation is BD_BOX [2], which supports simulations on CPU using codes written with MPI and OpenMP, and also on GPU with CUDA. However, there is still little information published about the actual algorithmic implementation of these simulations, and their performance has not been thoroughly studied, especially under periodic boundary conditions. In order to address this and provide an efficient parallel solution on different NUMA shared memory systems, we have analyzed the Brownian dynamics simulation for different problem sizes, and described how to address the main performance issues of the code using OpenMP and a PGAS-based approach with UPC.

3 Theoretical description of the simulation

The present work focuses on the simulation of Brownian motion of particles including hydrodynamic interactions. The time evolution in configuration space has been stated by Ermak and McCammon [3] (based on the Fokker-Planck and Langevin descriptions), which includes the direct interactions between particles via systematic forces as well as solvent-mediated effects via a correlated random process:

    r_i(t + \Delta t) = r_i(t) + \sum_j \frac{\partial D_{ij}(t)}{\partial r_j} \Delta t + \sum_j \frac{1}{k_B T} D_{ij}(t) F_j(t) \Delta t + R_i(t + \Delta t)    (1)

Here the underlying stochastic process can be discretized in different ways, leading e.g. to the Milstein [9] or Runge-Kutta [14] schemes, but in Eq. 1 the time integration method follows a simple Euler scheme. In a system that contains N particles, the trajectory {r_i(t); t ∈ [0, t_max]} of particle i is calculated as a succession of small and fixed time step increments Δt, defined as the sum of the interactions on the particles during each time step (that is, the partial derivative of the initial diffusion tensor D^0_ij with respect to the position component and the product of the forces F and the diffusion tensor at the beginning of a time step, with k_B Boltzmann's constant and T the absolute temperature), and a random displacement R_i(Δt) associated with the possible correlations between displacements for each particle. The random displacement vector R is Gaussian distributed with average value ⟨R_i⟩ = 0 and covariance ⟨R_i(t + Δt) R_j^T(t + Δt)⟩ = 2 D_ij(t) Δt. The diffusion tensor D thereby contains information about hydrodynamic interactions, which, formally, can be neglected by considering only the diagonal components D_ii. Choosing appropriate models for the diffusion tensor, the partial derivative of D drops out of Eq. 1, which is true for the Rotne-Prager tensor [13] used in this work. Therefore, Eq. 1 reduces to

    \Delta r = \frac{1}{kT} D F \Delta t + \sqrt{2 \Delta t}\, S \xi    (2)

where the expression R = \sqrt{2 \Delta t}\, S \xi holds and D = S S^T, which relates the stochastic process to the diffusion matrix. Therefore, S may be calculated via a Cholesky decomposition of D or via the square root of D. Both approaches are very CPU time consuming and have a complexity of O(N^3). A faster approach that approximates the random displacement vector R without constructing S explicitly was introduced by Fixman [4], using an expansion of R in terms of Chebyshev polynomials, which has a complexity of O(N^2.25).
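To make the update of Eq. 2 concrete, the following is a minimal sketch (not the paper's actual implementation) of the dense part of one Euler step, assuming the diffusion tensor, the forces and the correlated random displacement sqrt(2Δt)·S·ξ have already been assembled for the current time step; the function and variable names are illustrative placeholders.

/* One Euler step of Eq. 2: dr = (1/kT) D F dt + sqrt(2 dt) S xi.
 * n = 3N is the number of coordinates, d is the n x n diffusion tensor
 * stored row-major, f the force vector and r the precomputed correlated
 * random displacement sqrt(2 dt) S xi. */
void euler_step(int n, const double *d, const double *f,
                const double *r, double *pos,
                double dt, double kbt)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double drift = 0.0;
        for (int j = 0; j < n; j++)          /* dense row of D times F */
            drift += d[i * n + j] * f[j];
        pos[i] += drift * dt / kbt + r[i];   /* deterministic + random part */
    }
}

Each row of the matrix-vector product is independent, so the outer loop parallelizes without synchronization; the costly parts in practice are building D and obtaining the correlated random vector, as discussed above.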
The interaction model for the computation of forces in the system (which here considers periodic boundary conditions) consists of (1) direct systematic interactions, which are computed as Lennard-Jones type interactions and evaluated within a fixed cutoff radius, and (2) hydrodynamic interactions, which have long-range character. The latter are reflected in the off-diagonal components of the diffusion tensor D and therefore contribute to both terms on the right-hand side of Eq. 2. Due to their long-range nature, the diffusion tensor is calculated via the Ewald summation method [1].

4 Design of the parallel simulation

The sequential code considers a unity 3-dimensional box with periodic boundary conditions that contains a set of N equal particles.
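As a sketch of what the systematic force evaluation in such a unit box could look like, the following illustrative O(N^2) Lennard-Jones loop applies the minimum-image convention and a fixed cutoff; it is not the paper's optimized code, and sigma, epsilon, rcut and the coordinate layout (x[3*i], x[3*i+1], x[3*i+2]) are assumptions made for this example.

#include <math.h>

/* Lennard-Jones forces in a unit box with periodic boundary conditions.
 * Each particle i accumulates only into its own entries of fx, so the
 * outer loop is free of write conflicts under OpenMP (at the price of
 * evaluating each pair twice instead of using Newton's third law). */
void lj_forces(int n, const double *x, double *fx,
               double sigma, double epsilon, double rcut)
{
    double rc2 = rcut * rcut;
    for (int i = 0; i < 3 * n; i++)
        fx[i] = 0.0;

    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dr[3], r2 = 0.0;
            for (int k = 0; k < 3; k++) {
                dr[k] = x[3*i+k] - x[3*j+k];
                dr[k] -= rint(dr[k]);        /* minimum image in a unit box */
                r2 += dr[k] * dr[k];
            }
            if (r2 >= rc2) continue;
            double sr2 = sigma * sigma / r2;
            double sr6 = sr2 * sr2 * sr2;
            /* -dU/dr divided by r for the 12-6 Lennard-Jones potential */
            double fmag = 24.0 * epsilon * sr6 * (2.0 * sr6 - 1.0) / r2;
            for (int k = 0; k < 3; k++)
                fx[3*i+k] += fmag * dr[k];
        }
    }
}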