CS 267 Unified Parallel C


Kathy Yelick
http://upc.lbl.gov
Slides adapted from some by Tarek El-Ghazawi (GWU)
5/30/2006, CS267 Lecture: UPC

UPC Outline
1. Background
2. UPC Execution Model
3. Basic Memory Model: Shared vs. Private Scalars
4. Synchronization
5. Collectives
6. Data and Pointers
7. Dynamic Memory Management
8. Programming Examples
9. Performance Tuning and Early Results
10. Concluding Remarks

Context
• Most parallel programs are written using either:
  • Message passing with a SPMD model
    • Usually for scientific applications with C++/Fortran
    • Scales easily
  • Shared memory with threads in OpenMP, Threads+C/C++/F or Java
    • Usually for non-scientific applications
    • Easier to program, but less scalable performance
• Global Address Space (GAS) languages take the best of both:
  • Global address space like threads (programmability)
  • SPMD parallelism like MPI (performance)
  • Local/global distinction, i.e., layout matters (performance)

Partitioned Global Address Space Languages
• Explicitly parallel programming model with SPMD parallelism
  • Fixed at program start-up, typically 1 thread per processor
• Global address space model of memory
  • Allows the programmer to directly represent distributed data structures
• Address space is logically partitioned
  • Local vs. remote memory (two-level hierarchy)
• Programmer control over performance-critical decisions
  • Data layout and communication
• Performance transparency and tunability are goals
  • Initial implementation can use fine-grained shared memory
• Multiple PGAS languages: UPC (C), CAF (Fortran), Titanium (Java)

Global Address Space Eases Programming
[Figure: threads 0..n; X[0], X[1], ..., X[P] and a per-thread ptr live in the shared portion of the global address space, with a private space below each thread]
• The languages share the global address space abstraction
  • Shared memory is logically partitioned by processors
  • Remote memory may stay remote: no automatic caching implied
  • One-sided communication: reads/writes of shared variables
  • Both individual and bulk memory copies
• Languages differ on details
  • Some models have a separate private memory area
  • Distributed array generality and how they are constructed

Current Implementations of PGAS Languages
• A successful language/library must run everywhere
• UPC
  • Commercial compilers available on Cray, SGI, HP machines
  • Open source compiler from LBNL/UCB (source-to-source)
  • Open source gcc-based compiler from Intrepid
• CAF
  • Commercial compiler available on Cray machines
  • Open source compiler available from Rice
• Titanium
  • Open source compiler from UCB runs on most machines
• Common tools
  • Open64 open source research compiler infrastructure
  • ARMCI, GASNet for distributed memory implementations
  • Pthreads, System V shared memory

UPC Overview and Design Philosophy
• Unified Parallel C (UPC) is:
  • An explicit parallel extension of ANSI C
  • A partitioned global address space language
  • Sometimes called a GAS language
• Similar to the C language philosophy
  • Programmers are clever and careful, and may need to get close to the hardware to get performance, but can get in trouble
  • Concise and efficient syntax
• Common and familiar syntax and semantics for parallel C, with simple extensions to ANSI C
• Based on ideas in Split-C, AC, and PCP

UPC Execution Model
• A number of threads working independently in a SPMD fashion
  • Number of threads specified at compile time or run time; available as the program variable THREADS
  • MYTHREAD specifies the thread index (0..THREADS-1)
  • upc_barrier is a global synchronization: all wait
  • There is a form of parallel loop that we will see later
• There are two compilation modes
  • Static threads mode: THREADS is specified at compile time by the user; the program may use THREADS as a compile-time constant
  • Dynamic threads mode: compiled code may be run with varying numbers of threads

Hello World in UPC
• Any legal C program is also a legal UPC program
• If you compile and run it as UPC with P threads, it will run P copies of the program
• Using this fact, plus the identifiers from the previous slides, we can write a parallel hello world:

    #include <upc.h>   /* needed for UPC extensions */
    #include <stdio.h>

    main() {
        printf("Thread %d of %d: hello UPC world\n",
               MYTHREAD, THREADS);
    }

Example: Monte Carlo Pi Calculation
• Estimate pi by throwing darts at a unit square
• Calculate the percentage that fall in the unit circle (r = 1)
  • Area of square = r^2 = 1
  • Area of circle quadrant = (1/4) * pi * r^2 = pi/4
• Randomly throw darts at (x, y) positions
• If x^2 + y^2 < 1, the point is inside the circle
• Compute the ratio: # points inside / # points total
• pi = 4 * ratio

Pi in UPC
• Independent estimates of pi:

    main(int argc, char **argv) {
        int i, hits = 0, trials = 0;   /* each thread gets its own copy */
        double pi;

        if (argc != 2) trials = 1000000;   /* each thread can use input arguments */
        else trials = atoi(argv[1]);

        srand(MYTHREAD*17);            /* initialize random in math library */

        for (i=0; i < trials; i++)
            hits += hit();             /* each thread calls "hit" separately */
        pi = 4.0*hits/trials;
        printf("PI estimated to %f.", pi);
    }

Helper Code for Pi in UPC
• Required includes:

    #include <stdio.h>
    #include <math.h>
    #include <upc.h>

• Function to throw a dart and calculate where it hits:

    int hit() {
        double x = ((double) rand()) / RAND_MAX;
        double y = ((double) rand()) / RAND_MAX;
        if ((x*x + y*y) <= 1.0) {
            return(1);
        } else {
            return(0);
        }
    }

Private vs. Shared Variables in UPC
• Normal C variables and objects are allocated in the private memory space of each thread
• Shared variables are allocated only once, with thread 0:

    shared int ours;   // use sparingly: performance
    int mine;

• Shared variables may not have dynamic lifetime: they may not occur in a function definition, except as static. Why?
[Figure: "ours" lives once in the shared space; each of threads 0..n has its own "mine" in its private space]

Pi in UPC: Shared Memory Style
• Parallel computation of pi, but with a bug:

    shared int hits;                   /* shared variable to record hits */
    main(int argc, char **argv) {
        int i, my_trials = 0;
        int trials = atoi(argv[1]);
        my_trials = (trials + THREADS - 1)/THREADS;   /* divide work up evenly */
        srand(MYTHREAD*17);
        for (i=0; i < my_trials; i++)
            hits += hit();             /* accumulate hits */
        upc_barrier;
        if (MYTHREAD == 0) {
            printf("PI estimated to %f.", 4.0*hits/trials);
        }
    }

• What is the problem with this program?

Shared Arrays Are Cyclic By Default
• Shared scalars always live in thread 0
• Shared array elements are spread across the threads:

    shared int x[THREADS];     /* 1 element per thread */
    shared int y[3][THREADS];  /* 3 elements per thread */
    shared int z[3][3];        /* 2 or 3 elements per thread */

• In the pictures below, assume THREADS = 4; red elements have affinity to thread 0
• Think of the linearized C array, then map elements round-robin
• As a 2D array, y is logically blocked by columns; z is not

Pi in UPC: Shared Array Version
• Alternative fix to the race condition
• Have each thread update a separate counter:
  • But do it in a shared array
  • Have one thread compute the sum

    shared int all_hits[THREADS];    /* all_hits is shared by all threads, just as hits was */
    main(int argc, char **argv) {
        ... /* declarations and initialization code omitted */
        for (i=0; i < my_trials; i++)
            all_hits[MYTHREAD] += hit();   /* update element with local affinity */
        upc_barrier;
        if (MYTHREAD == 0) {
            for (i=0; i < THREADS; i++) hits += all_hits[i];
            printf("PI estimated to %f.", 4.0*hits/trials);
        }
    }

UPC Global Synchronization
• UPC has two basic forms of barriers:
  • Barrier: block until all other threads arrive

        upc_barrier

  • Split-phase barriers:

        upc_notify;   /* this thread is ready for the barrier */
        /* do computation unrelated to the barrier */
        upc_wait;     /* wait for others to be ready */

• Optional labels allow for debugging:

    #define MERGE_BARRIER 12
    if (MYTHREAD%2 == 0) {
        ...
        upc_barrier MERGE_BARRIER;
    } else {
        ...
        upc_barrier MERGE_BARRIER;
    }

Synchronization - Locks
• Locks in UPC are represented by an opaque type: upc_lock_t
• Locks must be allocated before use:

    upc_lock_t *upc_all_lock_alloc(void);     /* allocates 1 lock, pointer to all threads */
    upc_lock_t *upc_global_lock_alloc(void);  /* allocates 1 lock, pointer to one thread */

• To use a lock (at the start and end of a critical region):

    void upc_lock(upc_lock_t *l)
    void upc_unlock(upc_lock_t *l)

• Locks can be freed when not in use:

    void upc_lock_free(upc_lock_t *ptr);

Pi in UPC: Shared Memory Style
• Parallel computation of pi, without the bug:

    shared int hits;
    main(int argc, char **argv) {
        int i, my_hits = 0, my_trials = 0;
        upc_lock_t *hit_lock = upc_all_lock_alloc();   /* create a lock */
        int trials = atoi(argv[1]);
        my_trials = (trials + THREADS - 1)/THREADS;
        srand(MYTHREAD*17);
        for (i=0; i < my_trials; i++)
            my_hits += hit();          /* accumulate hits locally */
        upc_lock(hit_lock);
        hits += my_hits;               /* accumulate across threads */
        upc_unlock(hit_lock);
        upc_barrier;
        if (MYTHREAD == 0)
            printf("PI: %f", 4.0*hits/trials);
    }

UPC Collectives in General
• The UPC collectives interface is available from http://www.gwu.edu/~upc/docs/
• It contains the typical functions:
  • Data movement: broadcast, scatter, gather, ...
  • Computational: reduce, prefix, ...
• The interface has synchronization modes:
  • Avoid over-synchronizing (a barrier before/after is the simplest semantics, but may be unnecessary)
  • Data being collected may be read/written by any thread simultaneously
