CS 267 Unified Parallel C

Kathy Yelick
http://upc.lbl.gov
Slides adapted from some by Tarek El-Ghazawi (GWU)

UPC Outline
1. Background
2. UPC Execution Model
3. Basic Memory Model: Shared vs. Private Scalars
4. Synchronization
5. Collectives
6. Data and Pointers
7. Dynamic Memory Management
8. Programming Examples
9. Performance Tuning and Early Results
10. Concluding Remarks

Context
• Most parallel programs are written using either:
  • Message passing with a SPMD model
    • Usually for scientific applications with C++/Fortran
    • Scales easily
  • Shared memory with threads in OpenMP, Threads+C/C++/F or Java
    • Usually for non-scientific applications
    • Easier to program, but less scalable performance
• Global Address Space (GAS) languages take the best of both:
  • global address space like threads (programmability)
  • SPMD parallelism like MPI (performance)
  • local/global distinction, i.e., layout matters (performance)

Partitioned Global Address Space Languages
• Explicitly parallel programming model with SPMD parallelism
  • Fixed at program start-up, typically 1 thread per processor
• Global address space model of memory
  • Allows the programmer to directly represent distributed data structures
• Address space is logically partitioned
  • Local vs. remote memory (two-level hierarchy)
• Programmer control over performance-critical decisions
  • Data layout and communication
• Performance transparency and tunability are goals
  • An initial implementation can use fine-grained shared memory
• Multiple PGAS languages: UPC (C), CAF (Fortran), Titanium (Java)

Global Address Space Eases Programming
[Figure: threads 0 through n each own one partition of the shared space, holding array slices X[0], X[1], ..., X[P] and a pointer; each thread also has its own private space; together they form one global address space.]
• The languages share the global address space abstraction
  • Shared memory is logically partitioned by processors
  • Remote memory may stay remote: no automatic caching implied
  • One-sided communication: reads/writes of shared variables
  • Both individual and bulk memory copies
• Languages differ on details
  • Some models have a separate private memory area
  • Distributed array generality and how they are constructed

Current Implementations of PGAS Languages
• A successful language/library must run everywhere
• UPC
  • Commercial compilers available on Cray, SGI, HP machines
  • Open source compiler from LBNL/UCB (source-to-source)
  • Open source gcc-based compiler from Intrepid
• CAF
  • Commercial compiler available on Cray machines
  • Open source compiler available from Rice
• Titanium
  • Open source compiler from UCB runs on most machines
• Common tools
  • Open64 open source research compiler infrastructure
  • ARMCI, GASNet for distributed memory implementations
  • Pthreads, System V shared memory

UPC Overview and Design Philosophy
• Unified Parallel C (UPC) is:
  • An explicit parallel extension of ANSI C
  • A partitioned global address space language
  • Sometimes called a GAS language
• Similar to the C language philosophy
  • Programmers are clever and careful, and may need to get close to the hardware
    • to get performance, but
    • can get in trouble
  • Concise and efficient syntax
• Common and familiar syntax and semantics for parallel C, with simple extensions to ANSI C
• Based on ideas in Split-C, AC, and PCP

UPC Execution Model
• A number of threads working independently in a SPMD fashion
  • Number of threads specified at compile time or run time; available as the program variable THREADS
  • MYTHREAD specifies the thread index (0..THREADS-1)
  • upc_barrier is a global synchronization: all threads wait
  • There is a form of parallel loop that we will see later
• There are two compilation modes
  • Static threads mode:
    • THREADS is specified at compile time by the user
    • The program may use THREADS as a compile-time constant
  • Dynamic threads mode:
    • Compiled code may be run with varying numbers of threads
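To make the SPMD model concrete, the following is a minimal sketch (not from the original slides) of the usual idiom: every thread runs the same program and uses MYTHREAD and THREADS to claim its own subset of the work. The problem size N and the per-iteration "work" are purely illustrative.

    #include <upc.h>
    #include <stdio.h>

    #define N 1000                       /* illustrative problem size */

    int main(void) {
        /* All THREADS copies of the program execute this same code;
           MYTHREAD tells each copy who it is, so thread t handles
           iterations t, t+THREADS, t+2*THREADS, ... (round-robin). */
        int i, local_count = 0;
        for (i = MYTHREAD; i < N; i += THREADS)
            local_count++;               /* stand-in for real work */
        printf("Thread %d of %d handled %d iterations\n",
               MYTHREAD, THREADS, local_count);
        return 0;
    }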
Hello World in UPC
• Any legal C program is also a legal UPC program
• If you compile and run it as UPC with P threads, it will run P copies of the program
• Using this fact, plus the identifiers from the previous slides, we can write a parallel hello world:

    #include <upc.h>    /* needed for UPC extensions */
    #include <stdio.h>

    main() {
        printf("Thread %d of %d: hello UPC world\n",
               MYTHREAD, THREADS);
    }

Example: Monte Carlo Pi Calculation
• Estimate π by throwing darts at a unit square (r = 1)
• Calculate the percentage that fall in the unit circle
  • Area of square = r^2 = 1
  • Area of circle quadrant = 1/4 * π * r^2 = π/4
• Randomly throw darts at (x, y) positions
• If x^2 + y^2 < 1, then the point is inside the circle
• Compute the ratio:
  • # points inside / # points total
  • π = 4 * ratio

Pi in UPC
• Independent estimates of pi (each thread gets its own copy of the variables, can use the input arguments, and calls hit() separately):

    main(int argc, char **argv) {
        int i, hits = 0, trials = 0;
        double pi;
        if (argc != 2) trials = 1000000;
        else trials = atoi(argv[1]);
        srand(MYTHREAD*17);              /* per-thread random seed */
        for (i=0; i < trials; i++) hits += hit();
        pi = 4.0*hits/trials;
        printf("PI estimated to %f.", pi);
    }

Helper Code for Pi in UPC
• Required includes:

    #include <stdio.h>
    #include <math.h>
    #include <stdlib.h>   /* for rand, srand, atoi */
    #include <upc.h>

• Function to throw a dart and calculate where it hits:

    int hit(){
        int const rand_max = 0xFFFFFF;
        double x = ((double) rand()) / RAND_MAX;
        double y = ((double) rand()) / RAND_MAX;
        if ((x*x + y*y) <= 1.0) {
            return(1);
        } else {
            return(0);
        }
    }

Shared vs. Private Variables

Private vs. Shared Variables in UPC
• Normal C variables and objects are allocated in the private memory space of each thread
• Shared variables are allocated only once, with affinity to thread 0:

    shared int ours;   // use sparingly: performance
    int mine;

• Shared variables may not have dynamic lifetime: they may not occur in a function definition, except as static. Why?
[Figure: ours lives once in the shared space; each thread has its own mine in its private space; both are part of the global address space.]

Pi in UPC: Shared Memory Style
• Parallel computing of pi, but with a bug:

    shared int hits;                                  /* shared variable to record hits */
    main(int argc, char **argv) {
        int i, my_trials = 0;
        int trials = atoi(argv[1]);
        my_trials = (trials + THREADS - 1)/THREADS;   /* divide work up evenly */
        srand(MYTHREAD*17);
        for (i=0; i < my_trials; i++)
            hits += hit();                            /* accumulate hits */
        upc_barrier;
        if (MYTHREAD == 0) {
            printf("PI estimated to %f.", 4.0*hits/trials);
        }
    }

• What is the problem with this program?
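A hint before the fixes shown next: the update of the shared counter is not atomic. Roughly, each "hits += hit()" behaves like the read-modify-write sequence below (the temporary is illustrative), so an increment made by another thread in between can be silently overwritten and hits are lost.

    /* What each thread effectively does for hits += hit(): */
    int tmp = hits;        /* 1. read the shared counter (possibly a remote get) */
    tmp = tmp + hit();     /* 2. add this thread's result locally                */
    hits = tmp;            /* 3. write it back (possibly a remote put); any      */
                           /*    update made by another thread between steps     */
                           /*    1 and 3 is overwritten                          */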
Shared Arrays Are Cyclic By Default
• Shared scalars always live in thread 0
• Shared array elements are spread across the threads:

    shared int x[THREADS];      /* 1 element per thread */
    shared int y[3][THREADS];   /* 3 elements per thread */
    shared int z[3][3];         /* 2 or 3 elements per thread */

• In the slide's pictures, THREADS = 4 and elements with affinity to thread 0 are shown in red:
  • Think of the linearized C array, then map elements to threads in round-robin order
  • As a 2D array, y is logically blocked by columns
  • z is not

Pi in UPC: Shared Array Version
• Alternative fix to the race condition
• Have each thread update a separate counter:
  • But do it in a shared array
  • Have one thread compute the sum

    shared int all_hits[THREADS];        /* all_hits is shared by all processors, just as hits was */
    main(int argc, char **argv) {
        ... declarations and initialization code omitted
        for (i=0; i < my_trials; i++)
            all_hits[MYTHREAD] += hit(); /* update element with local affinity */
        upc_barrier;
        if (MYTHREAD == 0) {
            for (i=0; i < THREADS; i++) hits += all_hits[i];
            printf("PI estimated to %f.", 4.0*hits/trials);
        }
    }

UPC Synchronization

UPC Global Synchronization
• UPC has two basic forms of barriers:
  • Barrier: block until all other threads arrive

        upc_barrier

  • Split-phase barriers:

        upc_notify;    /* this thread is ready for the barrier    */
        /* do computation unrelated to the barrier */
        upc_wait;      /* wait for the other threads to be ready  */

• Optional labels allow for debugging:

    #define MERGE_BARRIER 12
    if (MYTHREAD%2 == 0) {
        ...
        upc_barrier MERGE_BARRIER;
    } else {
        ...
        upc_barrier MERGE_BARRIER;
    }

Synchronization - Locks
• Locks in UPC are represented by an opaque type: upc_lock_t
• Locks must be allocated before use:

    upc_lock_t *upc_all_lock_alloc(void);      /* allocates 1 lock, pointer returned to all threads */
    upc_lock_t *upc_global_lock_alloc(void);   /* allocates 1 lock, pointer returned to one thread  */

• To use a lock, at the start and end of a critical region:

    void upc_lock(upc_lock_t *l)
    void upc_unlock(upc_lock_t *l)

• Locks can be freed when not in use:

    void upc_lock_free(upc_lock_t *ptr);

Pi in UPC: Shared Memory Style
• Parallel computing of pi, without the bug:

    shared int hits;
    main(int argc, char **argv) {
        int i, my_hits = 0, my_trials = 0;
        upc_lock_t *hit_lock = upc_all_lock_alloc();   /* create a lock */
        int trials = atoi(argv[1]);
        my_trials = (trials + THREADS - 1)/THREADS;
        srand(MYTHREAD*17);
        for (i=0; i < my_trials; i++)
            my_hits += hit();          /* accumulate hits locally */
        upc_lock(hit_lock);
        hits += my_hits;               /* accumulate across threads */
        upc_unlock(hit_lock);
        upc_barrier;
        if (MYTHREAD == 0)
            printf("PI: %f", 4.0*hits/trials);
    }

UPC Collectives

UPC Collectives in General
• The UPC collectives interface is available from http://www.gwu.edu/~upc/docs/
• It contains the typical functions:
  • Data movement: broadcast, scatter, gather, ...
  • Computational: reduce, prefix, ...
• The interface has synchronization modes:
  • Avoid over-synchronizing (a barrier before and after is the simplest semantics, but may be unnecessary)
  • Data being collected may be read/written by any thread simultaneously

Pi in UPC: Data Parallel Style
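Only the title of the data-parallel version of pi survives in this excerpt. As a hedged reconstruction (my sketch, not the slide's own code), a version in that style typically combines upc_forall, UPC's affinity-driven parallel loop, with the shared-array accumulation shown above. The trial count here is a total across all threads, and hit() is the same dart test defined earlier.

    #include <upc.h>
    #include <stdio.h>
    #include <stdlib.h>

    shared int all_hits[THREADS];        /* one slot per thread, cyclic by default */

    int hit(void) {                      /* same dart test as the helper shown earlier */
        double x = ((double) rand()) / RAND_MAX;
        double y = ((double) rand()) / RAND_MAX;
        return (x*x + y*y) <= 1.0;
    }

    int main(int argc, char **argv) {
        int i, hits = 0;
        int trials = (argc == 2) ? atoi(argv[1]) : 1000000;

        srand(MYTHREAD*17);
        /* upc_forall's fourth clause is the affinity expression: iteration i is
           executed by thread i % THREADS, which owns all_hits[i % THREADS],
           so each element is updated only by its owning thread (no race). */
        upc_forall (i = 0; i < trials; i++; i)
            all_hits[i % THREADS] += hit();

        upc_barrier;
        if (MYTHREAD == 0) {
            for (i = 0; i < THREADS; i++) hits += all_hits[i];
            printf("PI estimated to %f\n", 4.0*hits/trials);
        }
        return 0;
    }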