Introduction to UPC

Presenter: Rajesh Nishtala (UC Berkeley)
Advisor: Katherine Yelick
Joint work with the Berkeley UPC and Titanium Groups, Lawrence Berkeley Nat'l Lab & UC Berkeley
Some slides adapted from Katherine Yelick and Tarek El-Ghazawi

Berkeley UPC: http://upc.lbl.gov
Titanium: http://titanium.cs.berkeley.edu

Context

• Most parallel programs are written using either:
  – Message passing with a SPMD model
    • Usually for scientific applications with C++/Fortran
    • Scales easily
  – Shared memory with threads in OpenMP, Threads + C/C++/Fortran, or Java
    • Usually for non-scientific applications
    • Easier to program, but less scalable performance
• Global Address Space (GAS) languages take the best of both:
  – global address space like threads (programmability)
  – SPMD parallelism like MPI (performance)
  – local/global distinction, i.e., layout matters (performance)

Partitioned Global Address Space Languages

• Explicitly parallel programming model with SPMD parallelism
  – Fixed at program start-up, typically 1 thread per processor
• Global address space model of memory
  – Allows the programmer to directly represent distributed data structures
• Address space is logically partitioned
  – Local vs. remote memory (two-level hierarchy)
• Programmer control over performance-critical decisions
  – Data layout and communication
• Performance transparency and tunability are goals
  – An initial implementation can use fine-grained shared memory
• Multiple PGAS languages: UPC (C), CAF (Fortran), Titanium (Java)

Global Address Space Eases Programming

[Figure: threads 0..n share one global address space; shared array elements X[0], X[1], ..., X[P] live in the shared space, while each thread keeps its own ptr: in its private space]

• The languages share the global address space abstraction
  – Shared memory is logically partitioned by processors
  – Remote memory may stay remote: no automatic caching implied
  – One-sided communication: reads/writes of shared variables
  – Both individual and bulk memory copies
• Languages differ on details
  – Some models have a separate private memory area
  – Distributed array generality and how they are constructed

State of PGAS Languages

• A successful language/library must run everywhere
• UPC
  – Commercial compilers available on Cray, SGI, and HP machines
  – Open-source compiler from LBNL/UCB (source-to-source)
  – Open-source gcc-based compiler from Intrepid
• CAF
  – Commercial compiler available on Cray machines
  – Open-source compiler available from Rice
• Titanium
  – Open-source compiler from UCB runs on most machines
• Common tools
  – Open64 open-source research compiler infrastructure
  – ARMCI, GASNet for distributed-memory implementations
  – Pthreads, System V shared memory

UPC Overview and Design

• Unified Parallel C (UPC) is:
  – An explicit parallel extension of ANSI C
  – A partitioned global address space language
  – Sometimes called a GAS language
• Similar to the C language philosophy
  – Programmers are clever and careful, and may need to get close to the hardware
    • to get performance, but
    • can get in trouble
  – Concise and efficient syntax
• Common and familiar syntax and semantics for parallel C, with simple extensions to ANSI C
• Based on ideas in Split-C, AC, and PCP

One-Sided vs. Two-Sided Messaging

[Figure: a two-sided message (e.g., MPI) carries a message id plus the data payload and must be matched by the host CPU; a one-sided put (e.g., UPC) carries the destination address plus the data payload, so the network interface can deposit it directly into memory]

• Two-sided messaging
  – The message does not contain information about its final destination
  – Have to perform a look-up at the target or do a rendezvous
  – Point-to-point synchronization is implied with all transfers
• One-sided messaging
  – The message contains information about its final destination
  – Decouples synchronization from data movement
• What does the network hardware support?
• What about when we need point-to-point sync?
  – Hold that thought…
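To ground the distinction in code, here is a minimal UPC sketch (an illustration, not from the slides) of a one-sided put: thread 0 deposits a buffer directly into shared memory with affinity to thread 1, and no matching call runs at the target. In MPI, the same transfer would require an MPI_Send on one rank matched by an MPI_Recv on the other. The array name and sizes are made up, and the sketch assumes at least two UPC threads.

    #include <upc.h>
    #include <stdio.h>
    #include <string.h>

    #define N 1024

    /* Blocked layout: elements N..2N-1 have affinity to thread 1 */
    shared [N] char dst[N*THREADS];

    int main() {
        char src[N];                      /* private source buffer */
        if (MYTHREAD == 0) {
            memset(src, 42, N);
            /* One-sided put: the destination address travels with the
               data; thread 1 executes no matching receive. */
            upc_memput(&dst[N], src, N);
        }
        upc_barrier;   /* synchronization is decoupled from the data movement */
        if (MYTHREAD == 1)
            printf("dst[N] = %d on thread %d\n", dst[N], MYTHREAD);
        return 0;
    }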
GASNet Latency Performance

• GASNet is implemented on top of the Deep Computing Messaging Framework (DCMF)
  – Lower level than MPI
  – Provides Puts, Gets, AMSend, and Collectives
• Point-to-point ping-ack latency performance
  – N-byte transfer w/ 0-byte acknowledgement
• GASNet takes advantage of DCMF remote completion notification
  – The minimum semantics needed to implement the UPC memory model
  – Almost a factor-of-two difference until 32 bytes
  – An indication of a better semantic match to the underlying communication system

GASNet Multilink Bandwidth

• Each node has six 850 MB/s* bidirectional links
• Vary the number of links from 1 to 6
• Initiate a series of nonblocking puts on the links (round-robin)
  – Communication/communication overlap
• Both MPI and GASNet asymptote to the same bandwidth
• GASNet outperforms MPI at midrange message sizes
  – Lower software overhead implies more efficient message injection
  – GASNet avoids rendezvous to leverage RDMA

* Kumar et al. showed that the maximum achievable bandwidth for DCMF transfers is 748 MB/s per link, so we use this as our peak. See "The deep computing messaging framework: generalized scalable message passing on the Blue Gene/P supercomputer", Kumar et al., ICS'08.

UPC (PGAS) Execution Model

UPC Execution Model

• A number of threads working independently in a SPMD fashion
  – The number of threads is specified at compile time or run time, and is available as the program variable THREADS
  – MYTHREAD specifies the thread index (0..THREADS-1)
  – upc_barrier is a global synchronization: all wait
  – There is a form of parallel loop that we will see later
• There are two compilation modes
  – Static Threads mode:
    • THREADS is specified at compile time by the user
    • The program may use THREADS as a compile-time constant
  – Dynamic Threads mode:
    • Compiled code may be run with varying numbers of threads

Hello World in UPC

• Any legal C program is also a legal UPC program
• If you compile and run it as UPC with P threads, it will run P copies of the program
• Using this fact, plus the identifiers from the previous slides, we can write a parallel hello world:

    #include <upc.h>    /* needed for UPC extensions */
    #include <stdio.h>

    int main() {
        printf("Thread %d of %d: hello UPC world\n",
               MYTHREAD, THREADS);
        return 0;
    }
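A usage aside (not on the slide): with the Berkeley UPC implementation referenced above, the program would typically be compiled with the upcc translator and launched with upcrun; the file name here is hypothetical, and exact flags and job launchers vary by installation.

    upcc hello.c -o hello    # translate and compile as UPC
    upcrun -n 4 ./hello      # launch with 4 UPC threads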
Example: Monte Carlo Pi Calculation

• Estimate π by throwing darts at a unit square
• Calculate the fraction that falls in the unit circle
  – Area of square = r² = 1
  – Area of circle quadrant = ¼ π r² = π/4
• Randomly throw darts at (x, y) positions
• If x² + y² ≤ 1, then the point is inside the circle
• Compute the ratio:
  – # points inside / # points total
  – π ≈ 4 × ratio

[Figure: quarter circle of radius r = 1 inscribed in the unit square]

Pi in UPC

• Independent estimates of pi:

    int main(int argc, char **argv) {
        int i, hits = 0, trials = 0;   /* each thread gets its own copy of these variables */
        double pi;

        if (argc != 2) trials = 1000000;   /* each thread can use the input arguments */
        else trials = atoi(argv[1]);

        srand(MYTHREAD*17);   /* initialize the random number generator */

        for (i = 0; i < trials; i++)
            hits += hit();    /* each thread calls "hit" separately */
        pi = 4.0*hits/trials;
        printf("PI estimated to %f.", pi);
        return 0;
    }

Helper Code for Pi in UPC

• Required includes:

    #include <stdio.h>
    #include <math.h>
    #include <stdlib.h>   /* for rand(), srand(), and atoi() */
    #include <upc.h>

• Function to throw a dart and calculate where it hits:

    int hit() {
        double x = ((double) rand()) / RAND_MAX;
        double y = ((double) rand()) / RAND_MAX;
        if ((x*x + y*y) <= 1.0) {
            return 1;
        } else {
            return 0;
        }
    }

Shared vs. Private Variables

Private vs. Shared Variables in UPC

• Normal C variables and objects are allocated in the private memory space of each thread
• Shared variables are allocated only once, with thread 0:

    shared int ours;   // use sparingly: performance
    int mine;

• Shared variables may not have dynamic lifetime: they may not occur in a function definition, except as static

[Figure: ours lives in the shared space, with affinity to thread 0; each of threads 0..n has its own private mine]

Pi in UPC: Shared Memory Style

• Parallel computation of pi, but with a bug:

    shared int hits;   /* shared variable to record hits */

    int main(int argc, char **argv) {
        int i, my_trials = 0;
        int trials = atoi(argv[1]);
        my_trials = (trials + THREADS - 1)/THREADS;   /* divide the work up evenly */
        srand(MYTHREAD*17);
        for (i = 0; i < my_trials; i++)
            hits += hit();   /* accumulate hits */
        upc_barrier;
        if (MYTHREAD == 0) {
            printf("PI estimated to %f.", 4.0*hits/trials);
        }
        return 0;
    }

• What is the problem with this program? (One possible fix is sketched at the end of this section.)

Shared Arrays Are Cyclic By Default

• Shared scalars always live in thread 0
• Shared arrays are spread over the threads
• Shared array elements are spread cyclically across the threads:

    shared int x[THREADS];      /* 1 element per thread */
    shared int y[3][THREADS];   /* 3 elements per thread */
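To make the default layout concrete, here is a small sketch (not from the slides) that uses the standard upc_threadof query to report which thread each element of x has affinity to; with the declaration above, x[i] lands on thread i.

    #include <upc.h>
    #include <stdio.h>

    shared int x[THREADS];   /* cyclic by default: x[i] has affinity to thread i % THREADS */

    int main() {
        if (MYTHREAD == 0) {
            int i;
            for (i = 0; i < THREADS; i++)
                printf("x[%d] has affinity to thread %d\n",
                       i, (int) upc_threadof(&x[i]));
        }
        return 0;
    }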
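Closing the loop on the question posed two slides back (an aside, not from the original deck): the hits += hit() updates race, because every thread performs an unsynchronized read-modify-write on the same shared counter, so updates can be lost. Below is a minimal sketch of one common fix: accumulate into a private counter and guard the single shared update per thread with a standard UPC lock (upc_all_lock_alloc, upc_lock, upc_unlock). A collective reduction would be another option.

    #include <upc.h>
    #include <stdio.h>
    #include <stdlib.h>

    shared int hits;        /* shared counter; lives on thread 0 */
    int hit(void);          /* dart-throwing helper from the "Helper Code" slide */

    int main(int argc, char **argv) {
        int i, my_hits = 0;
        int trials = (argc == 2) ? atoi(argv[1]) : 1000000;
        int my_trials = (trials + THREADS - 1) / THREADS;
        upc_lock_t *hit_lock = upc_all_lock_alloc();   /* collective: every thread gets the same lock */

        srand(MYTHREAD*17);
        for (i = 0; i < my_trials; i++)
            my_hits += hit();          /* accumulate privately: no contention */

        upc_lock(hit_lock);
        hits += my_hits;               /* one guarded update of the shared counter per thread */
        upc_unlock(hit_lock);

        upc_barrier;                   /* wait until every thread has contributed */
        if (MYTHREAD == 0)             /* total trials is my_trials*THREADS after rounding up */
            printf("PI estimated to %f.\n", 4.0*hits/(my_trials*THREADS));
        return 0;
    }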