CS 267: Unified Parallel C (UPC)
Kathy Yelick
http://upc.lbl.gov
Slides adapted from some by Tarek El-Ghazawi (GWU)

UPC Outline

1. Background
2. UPC Execution Model
3. Basic Memory Model: Shared vs. Private Scalars
4. Synchronization
5. Collectives
6. Data and Pointers
7. Dynamic Memory Management
8. Programming Examples
9. Performance Tuning and Early Results
10. Concluding Remarks


Context

• Most parallel programs are written using either:
  • Message passing with a SPMD model
    • Usually for scientific applications with C++/Fortran
    • Scales easily
  • Shared memory with threads in OpenMP, Threads+C/C++/F or Java
    • Usually for non-scientific applications
    • Easier to program, but less scalable performance
• Global Address Space (GAS) Languages take the best of both
  • global address space like threads (programmability)
  • SPMD parallelism like MPI (performance)
  • local/global distinction, i.e., layout matters (performance)

Partitioned Global Address Space Languages

• Explicitly parallel programming model with SPMD parallelism
  • Fixed at program start-up, typically 1 thread per processor
• Global address space model of memory
  • Allows programmers to directly represent distributed data structures
• Address space is logically partitioned
  • Local vs. remote memory (two-level hierarchy)
• Programmer control over performance-critical decisions
  • Data layout and communication
• Performance transparency and tunability are goals
  • Initial implementation can use fine-grained shared memory
• Multiple PGAS languages: UPC (C), CAF (Fortran), Titanium (Java)

Global Address Space Eases Programming

[Figure: threads 0..n in one global address space; each thread has a private section and a slice of the shared section holding X[0], X[1], …, X[P], plus a pointer ptr in each private section]

• The languages share the global address space abstraction
  • Shared memory is logically partitioned by processors
  • Remote memory may stay remote: no automatic caching implied
  • One-sided communication: reads/writes of shared variables
  • Both individual and bulk memory copies
• Languages differ on details
  • Some models have a separate private memory area
  • Distributed array generality and how they are constructed

Current Implementations of PGAS Languages

• A successful language/library must run everywhere
• UPC
  • Commercial compilers available on Cray, SGI, HP machines
  • Open source compiler from LBNL/UCB (source-to-source)
  • Open source gcc-based compiler from Intrepid
• CAF
  • Commercial compiler available on Cray machines
  • Open source compiler available from Rice
• Titanium
  • Open source compiler from UCB runs on most machines
• Common tools
  • Open source research compiler infrastructure
  • ARMCI and GASNet communication layers for implementations
  • Pthreads, System V shared memory

UPC Overview and Design Philosophy

• Unified Parallel C (UPC) is:
  • An explicit parallel extension of ANSI C
  • A partitioned global address space language
  • Sometimes called a GAS language
• Similar to the C language philosophy
  • Programmers are clever and careful, and may need to get close to the hardware
    • to get performance, but
    • can get in trouble
  • Concise and efficient syntax
• Common and familiar syntax and semantics for parallel C, with simple extensions to ANSI C
• Based on ideas in Split-C, AC, and PCP


UPC Execution Model

• A number of threads working independently in a SPMD fashion
  • Number of threads specified at compile-time or run-time; available as the program variable THREADS
  • MYTHREAD specifies the thread index (0..THREADS-1)
  • upc_barrier is a global synchronization: all wait
  • There is a form of parallel loop that we will see later
• There are two compilation modes
  • Static Threads mode:
    • THREADS is specified at compile time by the user
    • The program may use THREADS as a compile-time constant
  • Dynamic Threads mode:
    • Compiled code may be run with varying numbers of threads

Hello World in UPC

• Any legal C program is also a legal UPC program
• If you compile and run it as UPC with P threads, it will run P copies of the program.
• Using this fact, plus the identifiers from the previous slide, we can write a parallel hello world:

    #include <upc.h>    /* needed for UPC extensions */
    #include <stdio.h>

    main() {
      printf("Thread %d of %d: hello UPC world\n",
             MYTHREAD, THREADS);
    }


Example: Monte Carlo Pi Calculation

• Estimate Pi by throwing darts at a unit square
• Calculate the percentage that fall in the unit circle
  • Area of square = r² = 1
  • Area of circle quadrant = ¼ * π r² = π/4
• Randomly throw darts at x,y positions
• If x² + y² < 1, then the point is inside the circle
• Compute the ratio:
  • # points inside / # points total
  • π = 4 * ratio

[Figure: quarter circle of radius r = 1 inscribed in the unit square]

Pi in UPC

• Independent estimates of pi:

    main(int argc, char **argv) {
      int i, hits = 0, trials = 0;   /* each thread gets its own copy of these variables */
      double pi;

      if (argc != 2) trials = 1000000;   /* each thread can use input arguments */
      else trials = atoi(argv[1]);

      srand(MYTHREAD*17);                /* initialize random in math library */

      for (i=0; i < trials; i++) hits += hit();   /* each thread calls "hit" separately */
      pi = 4.0*hits/trials;
      printf("PI estimated to %f.", pi);
    }


Helper Code for Pi in UPC

• Required includes:

    #include <stdio.h>
    #include <stdlib.h>   /* rand, srand, atoi */
    #include <upc.h>

• Function to throw a dart and calculate where it hits:

    int hit(){
      int const rand_max = 0xFFFFFF;
      double x = ((double) rand()) / RAND_MAX;
      double y = ((double) rand()) / RAND_MAX;
      if ((x*x + y*y) <= 1.0) {
        return(1);
      } else {
        return(0);
      }
    }

Private vs. Shared Variables in UPC

• Normal C variables and objects are allocated in the private memory space of each thread.
• Shared variables are allocated only once, with thread 0

    shared int ours;   // use sparingly: performance
    int mine;

• Shared variables may not have dynamic lifetime: they may not occur in a function definition, except as static. Why?

[Figure: global address space with threads 0..n; "ours" lives once in the shared section (with thread 0), while each thread has its own private "mine"]

Pi in UPC: Shared Memory Style

• Parallel computing of pi, but with a bug

    shared int hits;                    /* shared variable to record hits */
    main(int argc, char **argv) {
      int i, my_trials = 0;
      int trials = atoi(argv[1]);
      my_trials = (trials + THREADS - 1)/THREADS;   /* divide work up evenly */
      srand(MYTHREAD*17);
      for (i=0; i < my_trials; i++)
        hits += hit();                  /* accumulate hits */
      upc_barrier;
      if (MYTHREAD == 0) {
        printf("PI estimated to %f.", 4.0*hits/trials);
      }
    }

• What is the problem with this program?


Shared Arrays Are Cyclic By Default

• Shared scalars always live in thread 0
• Shared arrays are spread over the threads
• Shared array elements are spread across the threads

    shared int x[THREADS]      /* 1 element per thread */
    shared int y[3][THREADS]   /* 3 elements per thread */
    shared int z[3][3]         /* 2 or 3 elements per thread */

• In the pictures below, assume THREADS = 4
  • Red elements have affinity to thread 0
  • Think of the linearized C array, then map elements in round-robin
  • As a 2D array, y is logically blocked by columns
  • z is not

[Figure: layout of x, y, and z across 4 threads]

Pi in UPC: Shared Array Version

• Alternative fix to the race condition
• Have each thread update a separate counter:
  • But do it in a shared array
  • Have one thread compute the sum

    shared int all_hits [THREADS];   /* all_hits is shared by all processors, just as hits was */
    main(int argc, char **argv) {
      … declarations and initialization code omitted
      for (i=0; i < my_trials; i++)
        all_hits[MYTHREAD] += hit();   /* update element with local affinity */
      upc_barrier;
      if (MYTHREAD == 0) {
        for (i=0; i < THREADS; i++) hits += all_hits[i];
        printf("PI estimated to %f.", 4.0*hits/trials);
      }
    }

UPC Global Synchronization

• UPC has two basic forms of barriers:
  • Barrier: block until all other threads arrive

        upc_barrier

  • Split-phase barriers (see the sketch below)

        upc_notify;   /* this thread is ready for the barrier */
        do computation unrelated to barrier
        upc_wait;     /* wait for others to be ready */

• Optional labels allow for debugging

    #define MERGE_BARRIER 12
    if (MYTHREAD%2 == 0) {
      ...
      upc_barrier MERGE_BARRIER;
    } else {
      ...
      upc_barrier MERGE_BARRIER;
    }
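As a concrete illustration of the split-phase form, here is a minimal sketch (not from the original slides; the array size and the local_work routine are invented for the example) that overlaps independent local work with the barrier:

    #include <upc.h>
    #include <stdio.h>

    #define N 4
    shared [N] double remote_data[N*THREADS];   /* one block of N doubles per thread */

    void local_work(void) { /* purely local computation, independent of the barrier */ }

    int main() {
      int i;
      /* phase 1: produce this thread's block of shared data */
      for (i = 0; i < N; i++)
        remote_data[MYTHREAD*N + i] = MYTHREAD + 0.5*i;

      upc_notify;        /* tell the others: my shared data is ready */
      local_work();      /* overlap useful work that does not touch shared data */
      upc_wait;          /* block until every thread has reached its upc_notify */

      /* phase 2: now it is safe to read the other threads' blocks */
      printf("Thread %d sees neighbor value %g\n", MYTHREAD,
             remote_data[((MYTHREAD+1)%THREADS)*N]);
      return 0;
    }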

Synchronization - Locks

• Locks in UPC are represented by an opaque type:

    upc_lock_t

• Locks must be allocated before use:

    upc_lock_t *upc_all_lock_alloc(void);     /* collective: allocates 1 lock, same pointer to all threads */
    upc_lock_t *upc_global_lock_alloc(void);  /* allocates 1 lock, pointer to one thread */

• To use a lock (at the start and end of a critical region):

    void upc_lock(upc_lock_t *l)
    void upc_unlock(upc_lock_t *l)

• Locks can be freed when not in use

    void upc_lock_free(upc_lock_t *ptr);

Pi in UPC: Shared Memory Style

• Parallel computing of pi, without the bug

    shared int hits;
    main(int argc, char **argv) {
      int i, my_hits = 0, my_trials = 0;
      upc_lock_t *hit_lock = upc_all_lock_alloc();   /* create a lock */
      int trials = atoi(argv[1]);
      my_trials = (trials + THREADS - 1)/THREADS;
      srand(MYTHREAD*17);
      for (i=0; i < my_trials; i++)
        my_hits += hit();          /* accumulate hits locally */
      upc_lock(hit_lock);
      hits += my_hits;             /* accumulate across threads */
      upc_unlock(hit_lock);
      upc_barrier;
      if (MYTHREAD == 0)
        printf("PI: %f", 4.0*hits/trials);
    }

UPC Collectives in General

• The UPC collectives interface is available from:
  • http://www.gwu.edu/~upc/docs/
• It contains typical functions:
  • Data movement: broadcast (sketched below), scatter, gather, …
  • Computational: reduce, prefix, …
• The interface has synchronization modes:
  • Avoid over-synchronizing (a barrier before/after is the simplest semantics, but may be unnecessary)
  • Data being collected may be read/written by any thread simultaneously
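As one concrete illustration (not from the original slides), here is a minimal sketch of the standard-library broadcast from <upc_collective.h>; the array names and sizes are invented, and the UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC flags request the simplest, fully synchronized mode mentioned above:

    #include <upc.h>
    #include <upc_collective.h>
    #include <stdio.h>

    #define NELEMS 4
    shared [NELEMS] int src[NELEMS*THREADS];   /* only thread 0's block is used as the source */
    shared [NELEMS] int dst[NELEMS*THREADS];   /* one destination block per thread */

    int main() {
      int i;
      if (MYTHREAD == 0)
        for (i = 0; i < NELEMS; i++) src[i] = 100 + i;
      upc_barrier;    /* make thread 0's initialization visible before the collective */

      /* copy thread 0's block into every thread's block of dst */
      upc_all_broadcast(dst, src, NELEMS*sizeof(int),
                        UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

      /* every thread now has its own local copy of the block */
      printf("Thread %d: dst[%d] = %d\n", MYTHREAD, MYTHREAD*NELEMS, dst[MYTHREAD*NELEMS]);
      return 0;
    }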


Pi in UPC: Data Parallel Style

• The previous version of Pi works, but is not scalable:
  • On a large # of threads, the locked region will be a bottleneck
• Use a reduction for better scalability

    #include <bupc_collectivev.h>   /* Berkeley collectives */
    // shared int hits;                no shared variables
    main(int argc, char **argv) {
      ...
      for (i=0; i < my_trials; i++)
        my_hits += hit();
      my_hits =                      /* type, input, thread, op */
        bupc_allv_reduce(int, my_hits, 0, UPC_ADD);
      // upc_barrier;                   barrier implied by collective
      if (MYTHREAD == 0)
        printf("PI: %f", 4.0*my_hits/trials);
    }

Recap: Private vs. Shared Variables in UPC

• We saw several kinds of variables in the pi example
  • Private scalars (my_hits)
  • Shared scalars (hits)
  • Shared arrays (all_hits)
  • Shared locks (hit_lock)

[Figure: global address space with threads 0..n, where n = THREADS-1; hits, hit_lock, and all_hits[0..n] live in the shared space, while each thread has its own private my_hits]

Example: Vector Addition

• Questions about parallel vector additions:
  • How to lay out the data (here it is cyclic)
  • Which processor does what (here it is “owner computes”)

    /* vadd.c */
    #include <upc_relaxed.h>
    #define N 100*THREADS

    shared int v1[N], v2[N], sum[N];   /* cyclic layout */

    void main() {
      int i;
      for(i=0; i<N; i++)               /* owner computes */
        if (MYTHREAD == i%THREADS)
          sum[i]=v1[i]+v2[i];
    }


Work Sharing with upc_forall()

• The idiom in the previous slide is very common
  • Loop over all elements; work on those owned by this thread
• UPC adds a special type of loop

    upc_forall(init; test; loop; affinity)
      statement;

• The programmer indicates that the iterations are independent
  • Undefined if there are dependencies across threads
• The affinity expression indicates which iterations run on each thread. It may have one of two types:
  • Integer: the iteration runs where affinity%THREADS is MYTHREAD
  • Pointer: the iteration runs where upc_threadof(affinity) is MYTHREAD

Vector Addition with upc_forall

• The vadd example can be rewritten as follows
  • Equivalent code could use “&sum[i]” for affinity
  • The code would be correct but slow if the affinity expression were i+1 rather than i.

    #define N 100*THREADS
    /* the cyclic data distribution may perform poorly on some machines */
    shared int v1[N], v2[N], sum[N];

    void main() {
      int i;
      upc_forall(i=0; i<N; i++; i)
        sum[i]=v1[i]+v2[i];
    }

Blocked Layouts in UPC

• The cyclic layout is typically stored in one of two ways
  • Distributed memory: each processor has a chunk of memory
    • Thread 0 would have: 0, THREADS, THREADS*2, … in a chunk
  • Shared memory machine: each thread has a logical chunk
    • Shared memory would have: 0, 1, 2, …, THREADS, THREADS+1, …
• What performance problem is there with the latter?
• What if this code were instead doing nearest-neighbor averaging?
• The vector addition example can be rewritten as follows with a blocked layout

    #define N 100*THREADS
    shared [*] int v1[N], v2[N], sum[N];   /* blocked layout */

    void main() {
      int i;
      upc_forall(i=0; i<N; i++; &sum[i])
        sum[i]=v1[i]+v2[i];
    }

Distributed Arrays in UPC

Layouts in General

• All non-array objects have affinity with thread zero.
• Array layouts are controlled by layout specifiers:
  • Empty (cyclic layout)
  • [*] (blocked layout)
  • [0] or [] (indefinite layout, all on 1 thread)
  • [b] or [b1][b2]…[bn] = [b1*b2*…bn] (fixed block size)
• The affinity of an array element is defined in terms of:
  • the block size, a compile-time constant
  • and THREADS.
• Element i has affinity with thread (i / block_size) % THREADS
• In 2D and higher, linearize the elements as in a C representation, and then use the above mapping (see the sketch below)

2D Array Layouts in UPC

• Array a1 has a row layout and array a2 has a block row layout.

    shared [m] int a1 [n][m];
    shared [k*m] int a2 [n][m];

• If (k + m) % THREADS == 0 then a3 has a row layout

    shared int a3 [n][m+k];

• To get more general HPF and ScaLAPACK style 2D blocked layouts, one needs to add dimensions.
• Assume r*c = THREADS;

    shared [b1][b2] int a5 [m][n][r][c][b1][b2];

  or equivalently

    shared [b1*b2] int a5 [m][n][r][c][b1][b2];
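To make the affinity rule concrete, here is a small sketch (not from the slides; the array names and sizes are invented, and static THREADS compilation is assumed so fixed dimensions are legal) that prints which thread owns each element under three different layout specifiers:

    #include <upc.h>
    #include <stdio.h>

    #define N 16                 /* assumes the static THREADS compilation mode */
    shared int      cyc[N];      /* cyclic: element i lives on thread i % THREADS         */
    shared [4] int  blk4[N];     /* block size 4: element i lives on thread (i/4) % THREADS */
    shared [] int   all0[N];     /* indefinite block size: everything lives on thread 0   */

    int main() {
      int i;
      if (MYTHREAD == 0)
        for (i = 0; i < N; i++)
          printf("i=%2d  cyclic -> T%d   [4] -> T%d   [] -> T%d\n", i,
                 (int)upc_threadof(&cyc[i]),
                 (int)upc_threadof(&blk4[i]),
                 (int)upc_threadof(&all0[i]));
      return 0;
    }

With THREADS = 4, for example, element 6 of blk4 lands on thread (6/4) % 4 = 1, while element 6 of cyc lands on thread 6 % 4 = 2.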

UPC Matrix Vector Multiplication Code

• Matrix-vector multiplication with the matrix stored by rows
• (Contrived example: problem size is P x P)

    shared [THREADS] int a[THREADS][THREADS];
    shared int b[THREADS], c[THREADS];

    void main (void) {
      int i, j, l;
      upc_forall(i = 0; i < THREADS; i++; i) {
        c[i] = 0;
        for (l = 0; l < THREADS; l++)
          c[i] += a[i][l]*b[l];
      }
    }

UPC Matrix Multiplication Code

    /* mat_mult_1.c */
    #include <upc_relaxed.h>
    #define N 4
    #define P 4
    #define M 4

    /* a and c are row-wise blocked shared matrices */
    shared [N*P/THREADS] int a[N][P], c[N][M];
    shared [M/THREADS] int b[P][M];   /* column-wise blocking */

    void main (void) {
      int i, j, l;                    /* private variables */
      upc_forall(i = 0; i < N; i++; &c[i][0]) {
        for (j = 0; j < M; j++) {
          c[i][j] = 0;
          for (l = 0; l < P; l++)
            c[i][j] += a[i][l]*b[l][j];
        }
      }
    }

Domain Decomposition for UPC

• Exploits locality in matrix multiplication
• A (N × P) is decomposed row-wise into blocks of size (N × P) / THREADS, as shown below:
  • Thread 0:          elements 0 .. (N*P/THREADS) - 1
  • Thread 1:          elements (N*P/THREADS) .. (2*N*P/THREADS) - 1
  • …
  • Thread THREADS-1:  elements ((THREADS-1)*N*P)/THREADS .. (THREADS*N*P/THREADS) - 1
• B (P × M) is decomposed column-wise into blocks of M/THREADS columns, as shown below:
  • Thread 0:          columns 0 .. (M/THREADS) - 1
  • …
  • Thread THREADS-1:  columns ((THREADS-1)*M)/THREADS .. M-1
• Note: N and M are assumed to be multiples of THREADS

Notes on the Matrix Multiplication Example

• The UPC code for the matrix multiplication is almost the same size as the sequential code
• Shared variable declarations include the keyword shared
• Making a private copy of matrix B in each thread might result in better performance, since many remote memory operations can be avoided
  • Can be done with the help of upc_memget (see the sketch below)
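A minimal sketch of that private-copy idea, meant to be appended to mat_mult_1.c above (the helper name and the block-by-block loop are mine, not from the slides): upc_memget copies shared data that has affinity to a single thread into private memory, so b is pulled over one M/THREADS-element block at a time.

    #define BBLK (M/THREADS)             /* block size of b, in elements */
    int b_priv[P][M];                    /* each thread's private copy of B */

    void make_private_copy_of_b(void) {
      shared [BBLK] int *src = &b[0][0]; /* walks b in linearized array order */
      int *dst = &b_priv[0][0];
      int blk, nblocks = (P*M)/BBLK;
      /* each block of BBLK consecutive elements lives on a single thread,
         which is exactly what upc_memget requires of its source */
      for (blk = 0; blk < nblocks; blk++)
        upc_memget(dst + blk*BBLK, src + blk*BBLK, BBLK*sizeof(int));
    }

After calling this once (e.g., right before the upc_forall loop), the inner product can read b_priv[l][j] instead of b[l][j], turning the remote accesses into local ones.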


Pointers to Shared vs. Arrays

• In the C tradition, arrays can be accessed through pointers
• Here is the vector addition example using pointers

    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];

    void main() {
      int i;
      shared int *p1, *p2;
      p1=v1; p2=v2;
      for (i=0; i<N; i++, p1++, p2++)
        if (i%THREADS == MYTHREAD)
          sum[i] = *p1 + *p2;
    }

[Figure: p1 and p2 advance through the cyclically distributed v1 and v2]

UPC Pointers

• Where does the pointer point, and where does it reside?

                                   Where does the pointer point?
                                     Local        Shared
    Where does the      Private     PP (p1)       PS (p3)
    pointer reside?     Shared      SP (p2)       SS (p4)

    int *p1;                 /* private pointer to local memory */
    shared int *p2;          /* private pointer to shared space */
    int *shared p3;          /* shared pointer to local memory  */
    shared int *shared p4;   /* shared pointer to shared space  */


UPC Pointers

[Figure: global address space with threads 0..n; p3 and p4 live in the shared space (with thread 0), while every thread has its own private p1 and p2]

    int *p1;                 /* private pointer to local memory */
    shared int *p2;          /* private pointer to shared space */
    int *shared p3;          /* shared pointer to local memory  */
    shared int *shared p4;   /* shared pointer to shared space  */

• Pointers to shared often require more storage and are more costly to dereference; they may refer to local or remote memory.

Common Uses for UPC Pointer Types

• int *p1;
  • These pointers are fast (just like C pointers)
  • Use to access local data in parts of the code performing local work
  • Often cast a pointer-to-shared to one of these to get faster access to shared data that is local (see the sketch below)
• shared int *p2;
  • Use to refer to remote data
  • Larger and slower due to the test-for-local plus possible communication
• int *shared p3;
  • Not recommended
• shared int *shared p4;
  • Use to build shared linked structures, e.g., a linked list
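A minimal sketch of that cast-to-local idiom (the function and array here are invented for illustration): the cast is only well defined for elements that have affinity to the calling thread, which a cyclic array guarantees for indices congruent to MYTHREAD.

    #include <upc.h>

    #define N 100*THREADS
    shared int v[N];         /* cyclic layout: v[i] lives on thread i % THREADS */

    void scale_my_elements(int factor) {
      int i;
      for (i = MYTHREAD; i < N; i += THREADS) {
        /* v[i] has affinity to MYTHREAD, so casting its address to a plain
           C pointer is legal and gives full local-access speed */
        int *p = (int *)&v[i];
        *p *= factor;
      }
    }

    int main() {
      int i;
      upc_forall (i = 0; i < N; i++; i) v[i] = i;   /* owner-computes init */
      upc_barrier;
      scale_my_elements(2);
      upc_barrier;
      return 0;
    }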

UPC Pointers

• In UPC, pointers to shared objects have three fields:
  • thread number
  • local address of the block
  • phase (specifies the position within the block)

    | Virtual Address | Thread | Phase |

• Example: Cray T3E implementation

    | Phase (bits 63-49) | Thread (bits 48-38) | Virtual Address (bits 37-0) |

UPC Pointers

• Pointer arithmetic supports blocked and non-blocked array distributions
• Casting of shared to private pointers is allowed, but not vice versa!
• When casting a pointer-to-shared to a pointer-to-local, the thread number of the pointer-to-shared may be lost
• Casting of shared to local is well defined only if the object pointed to by the pointer-to-shared has affinity with the thread performing the cast


Dynamic Memory Allocation in UPC

• Dynamic memory allocation of shared memory is available in UPC
• Functions can be collective or not
• A collective function has to be called by every thread and will return the same value to all of them

Special Functions

• size_t upc_threadof(shared void *ptr);
  returns the thread number that has affinity to the pointer-to-shared
• size_t upc_phaseof(shared void *ptr);
  returns the index (position within the block) field of the pointer-to-shared
• shared void *upc_resetphase(shared void *ptr);
  resets the phase to zero (see the sketch below)
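A small sketch of these queries on a blocked array (the array name and numbers are illustrative only, not from the slides):

    #include <upc.h>
    #include <stdio.h>

    shared [4] int a[4*THREADS];    /* block size 4 */

    int main() {
      if (MYTHREAD == 0) {
        shared [4] int *p = &a[6];  /* linear index 6: block 6/4 = 1, offset 6%4 = 2 */
        printf("a[6]: thread = %d (== (6/4) %% THREADS), phase = %d (== 6 %% 4)\n",
               (int)upc_threadof(p), (int)upc_phaseof(p));
        p = (shared [4] int *) upc_resetphase(p);   /* same thread/address, phase forced to 0 */
        printf("after upc_resetphase: phase = %d\n", (int)upc_phaseof(p));
      }
      return 0;
    }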


Global Memory Allocation

    shared void *upc_global_alloc(size_t nblocks, size_t nbytes);
        /* nblocks: number of blocks; nbytes: block size */

• Non-collective: called by one thread
• The calling thread allocates a contiguous memory space in the shared space
• If called by more than one thread, multiple regions are allocated, and each thread that makes the call gets a different pointer
• Space allocated per calling thread is equivalent to:

    shared [nbytes] char[nblocks * nbytes]

Collective Global Memory Allocation

    shared void *upc_all_alloc(size_t nblocks, size_t nbytes);
        /* nblocks: number of blocks; nbytes: block size */

• This function has the same result as upc_global_alloc, but it is a collective function, which is expected to be called by all threads
• All the threads will get the same pointer (see the usage sketch below)
• Equivalent to:

    shared [nbytes] char[nblocks * nbytes]
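As a minimal usage sketch (not from the slides): allocating one int per block makes the runtime layout match the default cyclic blocking of a shared int *, so the region can be indexed like a cyclic shared array.

    #include <upc.h>

    shared int *v;   /* private pointer to shared space; every thread holds a copy */

    int main() {
      int i, n = 100*THREADS;

      /* collective call: every thread gets the same pointer; the layout is
         equivalent to shared [sizeof(int)] char[n*sizeof(int)], i.e. a cyclic
         array of n ints, matching v's default block size of 1 element */
      v = (shared int *) upc_all_alloc(n, sizeof(int));

      upc_forall (i = 0; i < n; i++; i)   /* owner-computes initialization */
        v[i] = i;

      upc_barrier;
      if (MYTHREAD == 0)
        upc_free(v);    /* upc_free is not collective: free from one thread only */
      return 0;
    }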


Memory Freeing

    void upc_free(shared void *ptr);

• The upc_free function frees the dynamically allocated shared memory pointed to by ptr
• upc_free is not collective

Distributed Arrays, Directory Style

• Some high-performance UPC programmers avoid the UPC-style arrays
  • Instead, build directories of distributed objects (see the sketch below)
  • Also more general

    typedef shared [] double *sdblptr;
    shared sdblptr directory[THREADS];
    directory[i]=upc_alloc(local_size*sizeof(double));
    upc_barrier;
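A hedged sketch of how such a directory might be filled and then used (variable names beyond those in the snippet above are mine): each thread allocates its own piece with upc_alloc, publishes the pointer in its directory slot, and accesses its local piece through an ordinary C pointer.

    #include <upc.h>

    typedef shared [] double *sdblptr;
    shared sdblptr directory[THREADS];

    int main() {
      int local_size = 1000;

      /* each thread allocates its own piece in its part of the shared space */
      directory[MYTHREAD] = (sdblptr) upc_alloc(local_size*sizeof(double));
      upc_barrier;   /* make sure every slot of the directory is filled */

      /* fast local access: the piece has affinity to MYTHREAD, so the
         pointer-to-shared may be cast to a plain C pointer */
      double *mine = (double *) directory[MYTHREAD];
      mine[0] = 1.0 + MYTHREAD;

      upc_barrier;
      /* remote access goes through the pointer-to-shared in the directory */
      double first_of_neighbor = directory[(MYTHREAD+1)%THREADS][0];
      return first_of_neighbor > 0.0 ? 0 : 1;
    }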


Memory Consistency in UPC

• The consistency model defines the order in which one thread may see another thread's accesses to memory
  • If you write a program with unsynchronized accesses, what happens?
  • Does this work?

        /* thread 1 */            /* thread 2 */
        data = …;                 while (!flag) { };
        flag = 1;                 … = data;   // use the data

• UPC has two types of accesses:
  • Strict: will always appear in order
  • Relaxed: may appear out of order to other threads
• There are several ways of designating the type; commonly:
  • Use the include file:

        #include <upc_relaxed.h>

    which makes all accesses in the file relaxed by default
  • Use strict on variables that are used for synchronization (flag), as sketched below

Synchronization - Fence

• UPC provides a fence construct
  • Equivalent to a null strict reference, and has the syntax

        upc_fence;

  • UPC ensures that all shared references issued before the upc_fence are complete
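A minimal sketch of the flag idiom with a strict flag (not from the original slides; assumes at least two threads): because strict accesses are ordered with respect to all earlier accesses by the issuing thread, the consumer is guaranteed to see the data once it sees the flag.

    #include <upc_relaxed.h>   /* all accesses in this file are relaxed by default */
    #include <stdio.h>

    shared int data;           /* relaxed; statically zero-initialized */
    strict shared int flag;    /* strict: used only for synchronization */

    int main() {
      if (MYTHREAD == 0) {          /* producer */
        data = 42;                  /* relaxed write */
        flag = 1;                   /* strict write: issued only after data is complete */
      } else if (MYTHREAD == 1) {   /* consumer (assumes THREADS >= 2) */
        while (!flag) ;             /* strict reads: not reordered, will see the update */
        printf("consumer sees data = %d\n", data);   /* guaranteed to print 42 */
      }
      upc_barrier;
      return 0;
    }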


PGAS Languages Have Performance Advantages

• Strategy for acceptance of a new language:
  • Make it run faster than anything else
• Keys to high performance:
  • Parallelism:
    • Scaling the number of processors
  • Maximize single-node performance
    • Generate friendly code or use tuned libraries (BLAS, FFTW, etc.)
  • Avoid (unnecessary) communication cost
    • Latency, bandwidth, overhead
    • Berkeley UPC and Titanium use the GASNet communication layer
  • Avoid unnecessary delays due to dependencies
    • Load balance; algorithmic dependencies

One-Sided vs. Two-Sided Communication

[Figure: a one-sided put message carries a host address plus the data payload and is deposited directly into memory by the network interface; a two-sided message carries a message id plus the data payload and must be matched by the CPU]

• A one-sided put/get message can be handled directly by a network interface with RDMA support (see the UPC sketch below)
  • Avoids interrupting the CPU or storing data from the CPU (preposts)
• A two-sided message needs to be matched with a receive to identify the memory address where the data goes
  • Offloaded to the network interface in networks like Quadrics
  • Need to download match tables to the interface (from the host)
  • Ordering requirements on messages can also hinder bandwidth
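To connect this to UPC source code, here is a minimal sketch (buffer names and sizes are mine, not from the slides) of a one-sided bulk put: the writer deposits data directly into the target thread's portion of a shared array, with no matching receive on the target side.

    #include <upc.h>

    #define N 1024
    shared [N] int inbox[N*THREADS];   /* one block of N ints per thread */
    int outbuf[N];                     /* private staging buffer */

    int main() {
      int i, peer = (MYTHREAD + 1) % THREADS;
      for (i = 0; i < N; i++) outbuf[i] = MYTHREAD;

      /* one-sided bulk put: copy the private buffer into the neighbor's block;
         the destination block has affinity to a single thread, as upc_memput requires */
      upc_memput(&inbox[peer*N], outbuf, N*sizeof(int));

      upc_barrier;   /* barrier implies a fence: all puts are complete and visible */
      /* this thread's block now holds its left neighbor's id */
      return 0;
    }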

Performance Advantage of One-Sided Communication

[Figure: flood bandwidth (MB/s) vs. message size on Opteron/InfiniBand for a non-blocking GASNet put vs. MPI, with an inset showing the relative bandwidth (GASNet/MPI)]

• Opteron/InfiniBand (Jacquard at NERSC):
  • GASNet's vapi-conduit and OSU MPI 0.9.5 MVAPICH
• This is a very good MPI implementation; it is limited by the semantics of message matching, ordering, etc.
• The half-power point (N½) differs by one order of magnitude
Joint work with Paul Hargrove and Dan Bonachea

GASNet: Portability and High-Performance

[Figure: 8-byte roundtrip latency (usec, lower is better) of GASNet put+sync vs. MPI ping-pong on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed]

• GASNet is better for latency across machines
Joint work with UPC Group; GASNet design by Dan Bonachea

GASNet: Portability and High-Performance

[Figure: flood bandwidth for 2 MB messages as a percentage of hardware peak, GASNet vs. MPI, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed]

• GASNet is at least as high as (comparable to) MPI for large messages
Joint work with UPC Group; GASNet design by Dan Bonachea

GASNet: Portability and High-Performance

[Figure: flood bandwidth for 4 KB messages as a percentage of hardware peak, GASNet vs. MPI, on the same six platforms]

• GASNet excels at mid-range sizes: important for overlap
Joint work with UPC Group; GASNet design by Dan Bonachea

Case Study: NAS FT

• Performance of the Exchange (Alltoall) is critical
  • 1D FFTs in each dimension, 3 phases
  • Transpose after the first 2 for locality
  • Bisection bandwidth-limited
    • Problem as #procs grows
• Three approaches:
  • Exchange:
    • wait for 2nd-dimension FFTs to finish, send 1 message per processor pair
  • Slab:
    • wait for the chunk of rows destined for 1 proc, send when ready
  • Pencil:
    • send each row as it completes
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

Overlapping Communication

• Goal: make use of “all the wires all the time”
  • Schedule communication to avoid network backup
• Trade-off: overhead vs. overlap
  • Exchange has the fewest messages, less message overhead
  • Slabs and pencils have more overlap; pencils the most
• Example: message sizes for the Class D problem on 256 processors
  • Exchange (all data at once): 512 Kbytes
  • Slabs (contiguous rows that go to 1 processor): 64 Kbytes
  • Pencils (single row): 16 Kbytes
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

NAS FT Variants Performance Summary

[Figure: best MFlop rates per thread for the NAS FT benchmark versions (Best NAS Fortran/MPI, Best MPI (always Slabs), Best UPC (always Pencils)) on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512, with a .5 Tflops annotation at the top end]

• Slab is always best for MPI; the small-message cost is too high
• Pencil is always best for UPC; more overlap
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

Case Study: LU Factorization

• Direct methods have complicated dependencies
  • Especially with pivoting (unpredictable communication)
  • Especially for sparse matrices (dependence graph with holes)
• LU factorization in UPC
  • Use overlap ideas and multithreading to mask latency
• Multithreaded: UPC threads + user threads + threaded BLAS
  • Panel factorization: including pivoting
  • Update to a block of U
  • Trailing submatrix updates
• Status:
  • Dense LU done: HPL-compliant
  • Sparse version underway
Joint work with Parry Husbands

UPC HPL Performance

[Figure: Linpack GFlop/s for MPI/HPL vs. UPC on X1/64, X1/128, Opteron cluster/64, and Altix/32]

• MPI HPL numbers are from the HPCC database
• Large scaling:
  • 2.2 TFlops on 512p
  • 4.4 TFlops on 1024p (Thunder)
• Comparison to ScaLAPACK on an Altix, a 2 x 4 process grid
  • ScaLAPACK (block size 64): 25.25 GFlop/s (tried several block sizes)
  • UPC LU (block size 256): 33.60 GFlop/s; (block size 64): 26.47 GFlop/s
• n = 32000 on a 4 x 4 process grid
  • ScaLAPACK: 43.34 GFlop/s (block size = 64)
  • UPC: 70.26 GFlop/s (block size = 200)
Joint work with Parry Husbands

Summary

• UPC is designed to be consistent with C
  • Some low-level details, such as memory layout, are exposed
  • Ability to use pointers and arrays interchangeably
• Designed for high performance
  • Memory consistency is explicit
  • Small implementation
• Berkeley compiler (used for the next homework)
  • http://upc.lbl.gov
• Language specification and other documents
  • http://upc.gwu.edu
