Frontiers of HPC: Unified Parallel C


David McCaughan, HPC Analyst
SHARCNET, University of Guelph
[email protected]

Some of this material is derived from notes written by Behzad Salami (M.Sc., U. of Guelph)


Overview

• Parallel programming models are not in short supply
  – methodology dictated by the underlying hardware organization
    • shared memory systems (SMP)
    • distributed memory systems (cluster)
  – trade-offs in ease of use; complexity
• Unified Parallel C:
  – an open standard for a uniform programming model
  – distributed shared memory
  – fundamentals
  – a new set of trade-offs; is it worth it?


Thanks for the memory!

• Traditional parallel programming abstractly takes one of two forms, depending on how memory is able to be referenced by the program
• MIMD models: Multiple Instruction, Multiple Data
  – Shared memory
    • processors share a single physical memory
    • programs can share blocks of memory between them
    • issues: exclusive access, race conditions, synchronization, scalability
  – Distributed memory
    • unique memory associated with each processor
    • issues: communication is explicit, communication overhead


Shared Memory Model

[Figure: Shared Memory Programming: three CPUs execute different code (a = b+c; if (x > b) y = x; a = f()) against a single memory holding the shared variables b, c, x and the function f()]


Shared Memory Programming

• Shared memory programming benefits from the ability to handle communication implicitly
  – using the shared memory space
  – the fundamentals of programming in SMP environments are relatively straightforward
    • issues typically revolve around exclusive access and race conditions
• Common SMP programming paradigms:
  – POSIX threads (pthreads)
  – OpenMP


Hello, world! (pthreads)

  #include <stdio.h>
  #include <stdlib.h>
  #include <pthread.h>

  void *output(void *);

  int main(int argc, char *argv[])
  {
      int nthreads = atoi(argv[1]);
      pthread_t thread[nthreads];
      int id[nthreads];                  /* one slot per thread avoids a race on the argument */

      for (int i = 0; i < nthreads; i++) {
          id[i] = i;
          pthread_create(&thread[i], NULL, output, &id[i]);
      }
      for (int i = 0; i < nthreads; i++)
          pthread_join(thread[i], NULL); /* wait for the output before exiting */
      return(0);
  }

  void *output(void *thread_num)
  {
      printf("Hello, world! from thread %d\n", *(int *)thread_num);
      return(NULL);
  }


Distributed Memory Model

[Figure: Distributed Memory Programming: three CPUs, each with its own private memory holding its own copies of b, c, x and f()]


Distributed Memory Programming

• Communication is handled explicitly
  – processes send data to one another over an interconnection network
  – communication overhead limits the granularity of parallelism
  – conforms to the strengths of traditional computing hardware, so scalability can be excellent
• The past is littered with the corpses of distributed programming models
  – the modern standard: MPI


Hello, World! (MPI)

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      printf("Process %d of %d\n", rank, size);
      MPI_Finalize();
      return(0);
  }


SPMD

• Single Program Multiple Data (SPMD)
  – special case of the MIMD model
  – many processors executing the same program
    • conditional branches used where specific behaviour is required on particular processors
  – shared/distributed memory organization
• MPI and UPC explicitly use this model


SPMD Model

[Figure: SPMD illustrating conditional branching to control divergent behaviour: all three CPUs run the same program (if cpu_1 then b = 5 else b = 3); CPU 1 ends with b = 5, CPUs 2 and 3 with b = 3]
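• For example, a minimal sketch of the branching in the figure above, grafted onto the MPI example already shown (the rank-0 test standing in for "cpu_1" is illustrative only):

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
      int rank, b;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* every process runs the same program; the rank test creates the divergence */
      if (rank == 0)
          b = 5;
      else
          b = 3;

      printf("Process %d: b = %d\n", rank, b);
      MPI_Finalize();
      return(0);
  }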
What is UPC?

• C language extensions for HPC programming on large-scale parallel systems
  – attempts to provide a single, uniform programming model for SMP and cluster-based machines
  – superset of C (any C program is automatically a valid UPC program)
• Explicitly parallel SPMD model
  – the same program runs on all processors
  – Global Address Space (GAS) language; an attempt to balance
    • convenience (threads)
    • performance (MPI, data layout)


What is UPC? (cont.)

• Single shared address space
  – variables can be accessed by any process, however they are physically associated with one
    • hugely convenient from the programmer's perspective
    • what implications does this have?
  – most of UPC's complexity comes from the way it handles pointers and shared memory allocations
• Front-end/back-end organization allows for great flexibility in implementation
  – high-speed interconnect, SMP, MPI, etc.
• Well suited to parallelizing existing serial applications


Hello, World! (UPC)

  #include <stdio.h>
  #include "upc.h"

  int main()
  {
      printf("Hello, world! from UPC thread %d of %d\n", MYTHREAD, THREADS);
      return(0);
  }

• Note: even for the simplest of examples, the implicit availability of the THREADS and MYTHREAD variables reduces code volume dramatically (over pthreads or MPI)


Major Features: Basics

• Multi-processing abstractly modeled as threads
  – pre-defined variables THREADS and MYTHREAD available at run-time
• New keyword: shared
  – defines variables available across all processes
  – affinity (physical location of data) can also be specified
    • scalars (affinity to process 0)
    • cyclic (per element)
    • block-cyclic (user-defined block sizes)
    • blocked (run-time contiguous affinity for "even" distribution)
  – can make affinity inquiries at run-time


Major Features: Advanced

• Synchronization
  – locks (upc_lock_t)
  – barriers (upc_barrier, upc_notify, upc_wait)
  – memory fence (upc_fence)
• User-controlled consistency models
  – per-variable, per-statement
  – strict
    • all strict references within the same thread appear in program order to all processes
  – relaxed
    • all references within the same thread appear in program order to the issuing process
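• For example, a minimal sketch of these synchronization features (assuming the standard upc_all_lock_alloc/upc_lock/upc_unlock library calls; the shared counter is purely illustrative):

  #include <stdio.h>
  #include <upc.h>

  shared int counter;                       /* shared scalar, affinity to thread 0 */

  int main()
  {
      /* collective: every thread receives a pointer to the same lock */
      upc_lock_t *lock = upc_all_lock_alloc();

      upc_lock(lock);                       /* exclusive access to the shared scalar */
      counter++;
      upc_unlock(lock);

      upc_barrier;                          /* wait until every thread has contributed */

      if (MYTHREAD == 0) {
          printf("counter = %d (THREADS = %d)\n", counter, THREADS);
          upc_lock_free(lock);              /* freed once, after all threads are past the barrier */
      }
      return(0);
  }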
UPC and Memory

• Partitioned Global Address Space
  – private memory: local to a process
  – shared memory: partitioned over the address space of all processors
    • affinity refers to the physical location of data otherwise visible to all processes

[Figure: the partitioned global address space: a shared region partitioned among threads T0 to T4, plus a private region local to each thread]


Variables

• Private
  – C-style declarations (i.e. the default)
  – e.g. int foo;
  – one instance per thread
• Shared
  – e.g. shared int bar;
  – one instance for all threads
  – shared array elements can be distributed across threads
    • shared int foo[THREADS] – 1 element per thread
    • shared int bar[10][THREADS] – 10 elements per thread


Example: private scalar

  #include <stdio.h>
  #include <upc.h>

  int main()
  {
      int a;

      a = 0;
      a++;
      return(0);
  }

[Figure: threads T0, T1, T2 each increment their own private copy of a]

• NOTE: serial farming in a box; exactly like running multiple copies of the same program; no use of shared memory resources at all


Example: shared scalar

  #include <stdio.h>
  #include <upc.h>

  int main()
  {
      int a;
      static shared int b;

      a = b = 0;
      a++;
      b++;        /* caution: race conditions */
      return(0);
  }

[Figure: each thread increments its own private a, while all threads increment the single shared b]

• NOTE: scalar affinity is to thread 0; shared variables must be statically allocated (global data segment)


Example: shared arrays

  #include <upc_relaxed.h>

  static shared int a[4][THREADS];
  …

• assuming this is run with 3 threads, this memory will be organized as shown

[Figure: with THREADS = 3, column j of a (elements a[0][j] through a[3][j]) has affinity to thread j]


Parallel Loop: upc_forall

• UPC-defined loop (similar to the serial for loop)

  upc_forall(init; condition; post; affinity)

• the affinity expression dictates which THREAD executes a given iteration of the loop
  – pointer to shared type: upc_threadof(affinity) == MYTHREAD
  – integer: affinity % THREADS == MYTHREAD
  – continue: MYTHREAD (all threads execute the iteration)
  – upc_forall loops can be nested: careful


Exercise: To Affinity and Beyond

The purpose of this exercise is to allow you to explore your understanding of UPC shared declarations and memory affinity.

1) The sharedarray.c file in ~dbm/public/exercises/UPC implements a working demonstration of a shared allocation of a 2-D array of integers and the output of its contents by processor affinity (using the upc_forall loop)
2) Ensure that you understand the issue of processor affinity by changing the initialization of the array so that it too occurs in parallel, in the thread with affinity to that section of the array


Pointers

• Declaration:

  shared int *a;

  – a is a pointer to an integer that lives in the shared memory space
  – we refer to the type of a as a pointer to shared


A New Dimension for Pointers

  int *ptr1;               /* private pointer */
  shared int *ptr2;        /* private pointer to shared */
  int *shared ptr3;        /* shared pointer to private */
  shared int *shared ptr4; /* shared pointer to shared */

[Figure: ptr3 and ptr4 live in the shared space; each thread holds its own private ptr1 and ptr2]


Dynamic Memory Allocation

  void *upc_all_alloc(int num, int size);

  – allocates num * size bytes of data across all threads
  – returns a pointer to the first element of the shared block of memory for all threads
  – collective operation: called by all threads
  – e.g. allocate 25 elements per thread:

      shared [5] int *ptr1;    /* note use of block size */
      …
      ptr1 = (shared int *)upc_all_alloc(25*THREADS, sizeof(int));
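• For example, a minimal sketch (not taken from the exercise above) combining upc_all_alloc with the upc_forall loop, so that each thread initializes the elements it owns; the count of 100 elements per thread is arbitrary:

  #include <stdio.h>
  #include <upc.h>

  int main()
  {
      shared int *v;      /* private pointer to shared */
      int i;

      /* collective: every thread gets the same pointer to 100*THREADS ints,
         distributed cyclically (default block size of one element) */
      v = (shared int *)upc_all_alloc(100 * THREADS, sizeof(int));

      /* iteration i executes on the thread with affinity to v[i] */
      upc_forall (i = 0; i < 100 * THREADS; i++; &v[i])
          v[i] = i;

      upc_barrier;                          /* all writes complete before anyone reads */

      if (MYTHREAD == 0) {
          printf("v[0] = %d, v[%d] = %d\n", v[0], 100*THREADS - 1, v[100*THREADS - 1]);
          upc_free(v);                      /* freed by a single thread */
      }
      return(0);
  }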
Dynamic Memory Allocation (cont.)

  void *upc_global_alloc(int num, int size);

  – allocates num * size bytes of data across all threads
  – only called by one thread, but shared pointers will be initialized on all threads

  void *upc_alloc(int num, int size);

  – allocates num * size bytes of data in the shared memory space, only on the calling thread (i.e. affinity issue)
  – returns a pointer to the first element of that shared block of memory
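• For example, a minimal sketch (not from the slides) using upc_global_alloc together with a shared pointer to shared (ptr4 above), so that a block allocated by one thread can be used by all of them:

  #include <stdio.h>
  #include <upc.h>

  shared int *shared gptr;   /* shared pointer to shared: one instance, visible to every thread */

  int main()
  {
      shared int *p;         /* private pointer to shared (ptr2 above) */
      int i;

      if (MYTHREAD == 0)
          /* one thread allocates THREADS blocks of 10 ints, spread across all threads */
          gptr = (shared int *)upc_global_alloc(THREADS, 10 * sizeof(int));

      upc_barrier;           /* make sure gptr is set before anyone dereferences it */
      p = gptr;              /* take a private copy of the published pointer */

      /* iteration i executes on the thread with affinity to p[i] */
      upc_forall (i = 0; i < 10 * THREADS; i++; &p[i])
          p[i] = MYTHREAD;

      upc_barrier;
      if (MYTHREAD == 0) {
          printf("p[0] = %d, p[%d] = %d\n", p[0], 10*THREADS - 1, p[10*THREADS - 1]);
          upc_free(p);       /* freed once */
      }
      return(0);
  }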