CS 267: Unified Parallel C (UPC)
Kathy Yelick
http://upc.lbl.gov
Slides adapted from some by Tarek El-Ghazawi (GWU)

UPC Outline

1. Background
2. UPC Execution Model
3. Basic Memory Model: Shared vs. Private Scalars
4. Synchronization
5. Collectives
6. Data and Pointers
7. Dynamic Memory Management
8. Programming Examples
9. Performance Tuning and Early Results
10. Concluding Remarks


Context

• Most parallel programs are written using either:
  • Message passing with a SPMD model
    • Usually for scientific applications with C++/Fortran
    • Scales easily
  • Shared memory with threads in OpenMP, Threads+C/C++/F or Java
    • Usually for non-scientific applications
    • Easier to program, but less scalable performance
• Global Address Space (GAS) Languages take the best of both
  • global address space like threads (programmability)
  • SPMD parallelism like MPI (performance)
  • local/global distinction, i.e., layout matters (performance)

Partitioned Global Address Space Languages

• Explicitly parallel programming model with SPMD parallelism
  • Fixed at program start-up, typically 1 thread per processor
• Global address space model of memory
  • Allows programmers to directly represent distributed data structures
• Address space is logically partitioned
  • Local vs. remote memory (two-level hierarchy)
• Programmer control over performance-critical decisions
  • Data layout and communication
• Performance transparency and tunability are goals
  • Initial implementation can use fine-grained shared memory
• Multiple PGAS languages: UPC (C), CAF (Fortran), Titanium (Java)

Global Address Space Eases Programming

[Figure: threads 0..n in one global address space; each thread has a private section and a slice of the shared section holding X[0], X[1], …, X[P], plus a pointer ptr in each private section]

• The languages share the global address space abstraction
  • Shared memory is logically partitioned by processors
  • Remote memory may stay remote: no automatic caching implied
  • One-sided communication: reads/writes of shared variables
  • Both individual and bulk memory copies
• Languages differ on details
  • Some models have a separate private memory area
  • Distributed array generality and how they are constructed

Current Implementations of PGAS Languages

• A successful language/library must run everywhere
• UPC
  • Commercial compilers available on Cray, SGI, HP machines
  • Open source compiler from LBNL/UCB (source-to-source)
  • Open source gcc-based compiler from Intrepid
• CAF
  • Commercial compiler available on Cray machines
  • Open source compiler available from Rice
• Titanium
  • Open source compiler from UCB runs on most machines
• Common tools
  • Open source research compiler infrastructure
  • ARMCI and GASNet communication layers for implementations
  • Pthreads, System V shared memory

UPC Overview and Design Philosophy

• Unified Parallel C (UPC) is:
  • An explicit parallel extension of ANSI C
  • A partitioned global address space language
  • Sometimes called a GAS language
• Similar to the C language philosophy
  • Programmers are clever and careful, and may need to get close to the hardware
    • to get performance, but
    • can get in trouble
  • Concise and efficient syntax
• Common and familiar syntax and semantics for parallel C, with simple extensions to ANSI C
• Based on ideas in Split-C, AC, and PCP


UPC Execution Model

• A number of threads working independently in a SPMD fashion
  • Number of threads specified at compile-time or run-time; available as the program variable THREADS
  • MYTHREAD specifies the thread index (0..THREADS-1)
  • upc_barrier is a global synchronization: all wait
  • There is a form of parallel loop that we will see later
• There are two compilation modes
  • Static Threads mode:
    • THREADS is specified at compile time by the user
    • The program may use THREADS as a compile-time constant
  • Dynamic Threads mode:
    • Compiled code may be run with varying numbers of threads

Hello World in UPC

• Any legal C program is also a legal UPC program
• If you compile and run it as UPC with P threads, it will run P copies of the program.
• Using this fact, plus the identifiers from the previous slide, we can write a parallel hello world:

    #include <upc.h>    /* needed for UPC extensions */
    #include <stdio.h>

    main() {
      printf("Thread %d of %d: hello UPC world\n",
             MYTHREAD, THREADS);
    }


Example: Monte Carlo Pi Calculation

• Estimate Pi by throwing darts at a unit square
• Calculate the percentage that fall in the unit circle
  • Area of square = r² = 1
  • Area of circle quadrant = ¼ * π r² = π/4
• Randomly throw darts at x,y positions
• If x² + y² < 1, then the point is inside the circle
• Compute the ratio:
  • # points inside / # points total
  • π = 4 * ratio

[Figure: quarter circle of radius r = 1 inscribed in the unit square]

Pi in UPC

• Independent estimates of pi:

    main(int argc, char **argv) {
      int i, hits = 0, trials = 0;   /* each thread gets its own copy of these variables */
      double pi;

      if (argc != 2) trials = 1000000;   /* each thread can use input arguments */
      else trials = atoi(argv[1]);

      srand(MYTHREAD*17);                /* initialize random in math library */

      for (i=0; i < trials; i++) hits += hit();   /* each thread calls "hit" separately */
      pi = 4.0*hits/trials;
      printf("PI estimated to %f.", pi);
    }


Helper Code for Pi in UPC

• Required includes:

    #include <stdio.h>
    #include <stdlib.h>   /* rand, srand, atoi */
    #include <upc.h>

• Function to throw a dart and calculate where it hits:

    int hit(){
      int const rand_max = 0xFFFFFF;
      double x = ((double) rand()) / RAND_MAX;
      double y = ((double) rand()) / RAND_MAX;
      if ((x*x + y*y) <= 1.0) {
        return(1);
      } else {
        return(0);
      }
    }

Private vs. Shared Variables in UPC

• Normal C variables and objects are allocated in the private memory space of each thread.
• Shared variables are allocated only once, with thread 0

    shared int ours;   // use sparingly: performance
    int mine;

• Shared variables may not have dynamic lifetime: they may not occur in a function definition, except as static. Why?

[Figure: global address space with threads 0..n; "ours" lives once in the shared section (with thread 0), while each thread has its own private "mine"]

Pi in UPC: Shared Memory Style

• Parallel computing of pi, but with a bug

    shared int hits;                    /* shared variable to record hits */
    main(int argc, char **argv) {
      int i, my_trials = 0;
      int trials = atoi(argv[1]);
      my_trials = (trials + THREADS - 1)/THREADS;   /* divide work up evenly */
      srand(MYTHREAD*17);
      for (i=0; i < my_trials; i++)
        hits += hit();                  /* accumulate hits */
      upc_barrier;
      if (MYTHREAD == 0) {
        printf("PI estimated to %f.", 4.0*hits/trials);
      }
    }

• What is the problem with this program?


Shared Arrays Are Cyclic By Default

• Shared scalars always live in thread 0
• Shared arrays are spread over the threads
• Shared array elements are spread across the threads

    shared int x[THREADS]      /* 1 element per thread */
    shared int y[3][THREADS]   /* 3 elements per thread */
    shared int z[3][3]         /* 2 or 3 elements per thread */

• In the pictures below, assume THREADS = 4
  • Red elements have affinity to thread 0
  • Think of the linearized C array, then map elements in round-robin
  • As a 2D array, y is logically blocked by columns
  • z is not

[Figure: layout of x, y, and z across 4 threads]

Pi in UPC: Shared Array Version

• Alternative fix to the race condition
• Have each thread update a separate counter:
  • But do it in a shared array
  • Have one thread compute the sum

    shared int all_hits [THREADS];   /* all_hits is shared by all processors, just as hits was */
    main(int argc, char **argv) {
      … declarations and initialization code omitted
      for (i=0; i < my_trials; i++)
        all_hits[MYTHREAD] += hit();   /* update element with local affinity */
      upc_barrier;
      if (MYTHREAD == 0) {
        for (i=0; i < THREADS; i++) hits += all_hits[i];
        printf("PI estimated to %f.", 4.0*hits/trials);
      }
    }

UPC Global Synchronization

• UPC has two basic forms of barriers:
  • Barrier: block until all other threads arrive

        upc_barrier

  • Split-phase barriers (see the sketch below)

        upc_notify;   /* this thread is ready for the barrier */
        do computation unrelated to barrier
        upc_wait;     /* wait for others to be ready */

• Optional labels allow for debugging

    #define MERGE_BARRIER 12
    if (MYTHREAD%2 == 0) {
      ...
      upc_barrier MERGE_BARRIER;
    } else {
      ...
      upc_barrier MERGE_BARRIER;
    }
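As a concrete illustration of the split-phase form, here is a minimal sketch (not from the original slides; the array size and the local_work routine are invented for the example) that overlaps independent local work with the barrier:

    #include <upc.h>
    #include <stdio.h>

    #define N 4
    shared [N] double remote_data[N*THREADS];   /* one block of N doubles per thread */

    void local_work(void) { /* purely local computation, independent of the barrier */ }

    int main() {
      int i;
      /* phase 1: produce this thread's block of shared data */
      for (i = 0; i < N; i++)
        remote_data[MYTHREAD*N + i] = MYTHREAD + 0.5*i;

      upc_notify;        /* tell the others: my shared data is ready */
      local_work();      /* overlap useful work that does not touch shared data */
      upc_wait;          /* block until every thread has reached its upc_notify */

      /* phase 2: now it is safe to read the other threads' blocks */
      printf("Thread %d sees neighbor value %g\n", MYTHREAD,
             remote_data[((MYTHREAD+1)%THREADS)*N]);
      return 0;
    }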

Synchronization - Locks

• Locks in UPC are represented by an opaque type:

    upc_lock_t

• Locks must be allocated before use:

    upc_lock_t *upc_all_lock_alloc(void);     /* collective: allocates 1 lock, same pointer to all threads */
    upc_lock_t *upc_global_lock_alloc(void);  /* allocates 1 lock, pointer to one thread */

• To use a lock (at the start and end of a critical region):

    void upc_lock(upc_lock_t *l)
    void upc_unlock(upc_lock_t *l)

• Locks can be freed when not in use

    void upc_lock_free(upc_lock_t *ptr);

Pi in UPC: Shared Memory Style

• Parallel computing of pi, without the bug

    shared int hits;
    main(int argc, char **argv) {
      int i, my_hits = 0, my_trials = 0;
      upc_lock_t *hit_lock = upc_all_lock_alloc();   /* create a lock */
      int trials = atoi(argv[1]);
      my_trials = (trials + THREADS - 1)/THREADS;
      srand(MYTHREAD*17);
      for (i=0; i < my_trials; i++)
        my_hits += hit();          /* accumulate hits locally */
      upc_lock(hit_lock);
      hits += my_hits;             /* accumulate across threads */
      upc_unlock(hit_lock);
      upc_barrier;
      if (MYTHREAD == 0)
        printf("PI: %f", 4.0*hits/trials);
    }

UPC Collectives in General

• The UPC collectives interface is available from:
  • http://www.gwu.edu/~upc/docs/
• It contains typical functions:
  • Data movement: broadcast (sketched below), scatter, gather, …
  • Computational: reduce, prefix, …
• The interface has synchronization modes:
  • Avoid over-synchronizing (a barrier before/after is the simplest semantics, but may be unnecessary)
  • Data being collected may be read/written by any thread simultaneously
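As one concrete illustration (not from the original slides), here is a minimal sketch of the standard-library broadcast from <upc_collective.h>; the array names and sizes are invented, and the UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC flags request the simplest, fully synchronized mode mentioned above:

    #include <upc.h>
    #include <upc_collective.h>
    #include <stdio.h>

    #define NELEMS 4
    shared [NELEMS] int src[NELEMS*THREADS];   /* only thread 0's block is used as the source */
    shared [NELEMS] int dst[NELEMS*THREADS];   /* one destination block per thread */

    int main() {
      int i;
      if (MYTHREAD == 0)
        for (i = 0; i < NELEMS; i++) src[i] = 100 + i;
      upc_barrier;    /* make thread 0's initialization visible before the collective */

      /* copy thread 0's block into every thread's block of dst */
      upc_all_broadcast(dst, src, NELEMS*sizeof(int),
                        UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

      /* every thread now has its own local copy of the block */
      printf("Thread %d: dst[%d] = %d\n", MYTHREAD, MYTHREAD*NELEMS, dst[MYTHREAD*NELEMS]);
      return 0;
    }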


Pi in UPC: Data Parallel Style

• The previous version of Pi works, but is not scalable:
  • On a large # of threads, the locked region will be a bottleneck
• Use a reduction for better scalability

    #include <bupc_collectivev.h>   /* Berkeley collectives */
    // shared int hits;                no shared variables
    main(int argc, char **argv) {
      ...
      for (i=0; i < my_trials; i++)
        my_hits += hit();
      my_hits =                      /* type, input, thread, op */
        bupc_allv_reduce(int, my_hits, 0, UPC_ADD);
      // upc_barrier;                   barrier implied by collective
      if (MYTHREAD == 0)
        printf("PI: %f", 4.0*my_hits/trials);
    }

Recap: Private vs. Shared Variables in UPC

• We saw several kinds of variables in the pi example
  • Private scalars (my_hits)
  • Shared scalars (hits)
  • Shared arrays (all_hits)
  • Shared locks (hit_lock)

[Figure: global address space with threads 0..n, where n = THREADS-1; hits, hit_lock, and all_hits[0..n] live in the shared space, while each thread has its own private my_hits]

Example: Vector Addition

• Questions about parallel vector additions:
  • How to lay out the data (here it is cyclic)
  • Which processor does what (here it is “owner computes”)

    /* vadd.c */
    #include <upc_relaxed.h>
    #define N 100*THREADS

    shared int v1[N], v2[N], sum[N];   /* cyclic layout */

    void main() {
      int i;
      for(i=0; i<N; i++)               /* owner computes */
        if (MYTHREAD == i%THREADS)
          sum[i]=v1[i]+v2[i];
    }


Work Sharing with upc_forall()

• The idiom in the previous slide is very common
  • Loop over all elements; work on those owned by this thread
• UPC adds a special type of loop

    upc_forall(init; test; loop; affinity)
      statement;

• The programmer indicates that the iterations are independent
  • Undefined if there are dependencies across threads
• The affinity expression indicates which iterations run on each thread. It may have one of two types:
  • Integer: the iteration runs where affinity%THREADS is MYTHREAD
  • Pointer: the iteration runs where upc_threadof(affinity) is MYTHREAD

Vector Addition with upc_forall

• The vadd example can be rewritten as follows
  • Equivalent code could use “&sum[i]” for affinity
  • The code would be correct but slow if the affinity expression were i+1 rather than i.

    #define N 100*THREADS
    /* the cyclic data distribution may perform poorly on some machines */
    shared int v1[N], v2[N], sum[N];

    void main() {
      int i;
      upc_forall(i=0; i<N; i++; i)
        sum[i]=v1[i]+v2[i];
    }

Blocked Layouts in UPC

• The cyclic layout is typically stored in one of two ways
  • Distributed memory: each processor has a chunk of memory
    • Thread 0 would have: 0, THREADS, THREADS*2, … in a chunk
  • Shared memory machine: each thread has a logical chunk
    • Shared memory would have: 0, 1, 2, …, THREADS, THREADS+1, …
• What performance problem is there with the latter?
• What if this code were instead doing nearest-neighbor averaging?
• The vector addition example can be rewritten as follows with a blocked layout

    #define N 100*THREADS
    shared [*] int v1[N], v2[N], sum[N];   /* blocked layout */

    void main() {
      int i;
      upc_forall(i=0; i<N; i++; &sum[i])
        sum[i]=v1[i]+v2[i];
    }

Distributed Arrays in UPC

Layouts in General

• All non-array objects have affinity with thread zero.
• Array layouts are controlled by layout specifiers:
  • Empty (cyclic layout)
  • [*] (blocked layout)
  • [0] or [] (indefinite layout, all on 1 thread)
  • [b] or [b1][b2]…[bn] = [b1*b2*…bn] (fixed block size)
• The affinity of an array element is defined in terms of:
  • the block size, a compile-time constant
  • and THREADS.
• Element i has affinity with thread (i / block_size) % THREADS
• In 2D and higher, linearize the elements as in a C representation, and then use the above mapping (see the sketch below)

2D Array Layouts in UPC

• Array a1 has a row layout and array a2 has a block row layout.

    shared [m] int a1 [n][m];
    shared [k*m] int a2 [n][m];

• If (k + m) % THREADS == 0 then a3 has a row layout

    shared int a3 [n][m+k];

• To get more general HPF and ScaLAPACK style 2D blocked layouts, one needs to add dimensions.
• Assume r*c = THREADS;

    shared [b1][b2] int a5 [m][n][r][c][b1][b2];

  or equivalently

    shared [b1*b2] int a5 [m][n][r][c][b1][b2];
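To make the affinity rule concrete, here is a small sketch (not from the slides; the array names and sizes are invented, and static THREADS compilation is assumed so fixed dimensions are legal) that prints which thread owns each element under three different layout specifiers:

    #include <upc.h>
    #include <stdio.h>

    #define N 16                 /* assumes the static THREADS compilation mode */
    shared int      cyc[N];      /* cyclic: element i lives on thread i % THREADS         */
    shared [4] int  blk4[N];     /* block size 4: element i lives on thread (i/4) % THREADS */
    shared [] int   all0[N];     /* indefinite block size: everything lives on thread 0   */

    int main() {
      int i;
      if (MYTHREAD == 0)
        for (i = 0; i < N; i++)
          printf("i=%2d  cyclic -> T%d   [4] -> T%d   [] -> T%d\n", i,
                 (int)upc_threadof(&cyc[i]),
                 (int)upc_threadof(&blk4[i]),
                 (int)upc_threadof(&all0[i]));
      return 0;
    }

With THREADS = 4, for example, element 6 of blk4 lands on thread (6/4) % 4 = 1, while element 6 of cyc lands on thread 6 % 4 = 2.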

UPC Matrix Vector Multiplication Code

• Matrix-vector multiplication with the matrix stored by rows
• (Contrived example: problem size is P x P)

    shared [THREADS] int a[THREADS][THREADS];
    shared int b[THREADS], c[THREADS];

    void main (void) {
      int i, j, l;
      upc_forall(i = 0; i < THREADS; i++; i) {
        c[i] = 0;
        for (l = 0; l < THREADS; l++)
          c[i] += a[i][l]*b[l];
      }
    }

UPC Matrix Multiplication Code

    /* mat_mult_1.c */
    #include <upc_relaxed.h>
    #define N 4
    #define P 4
    #define M 4

    /* a and c are row-wise blocked shared matrices */
    shared [N*P/THREADS] int a[N][P], c[N][M];
    shared [M/THREADS] int b[P][M];   /* column-wise blocking */

    void main (void) {
      int i, j, l;                    /* private variables */
      upc_forall(i = 0; i < N; i++; &c[i][0]) {
        for (j = 0; j < M; j++) {
          c[i][j] = 0;
          for (l = 0; l < P; l++)
            c[i][j] += a[i][l]*b[l][j];
        }
      }
    }

Domain Decomposition for UPC

• Exploits locality in matrix multiplication
• A (N × P) is decomposed row-wise into blocks of size (N × P) / THREADS, as shown below:
  • Thread 0:          elements 0 .. (N*P/THREADS) - 1
  • Thread 1:          elements (N*P/THREADS) .. (2*N*P/THREADS) - 1
  • …
  • Thread THREADS-1:  elements ((THREADS-1)*N*P)/THREADS .. (THREADS*N*P/THREADS) - 1
• B (P × M) is decomposed column-wise into blocks of M/THREADS columns, as shown below:
  • Thread 0:          columns 0 .. (M/THREADS) - 1
  • …
  • Thread THREADS-1:  columns ((THREADS-1)*M)/THREADS .. M-1
• Note: N and M are assumed to be multiples of THREADS

Notes on the Matrix Multiplication Example

• The UPC code for the matrix multiplication is almost the same size as the sequential code
• Shared variable declarations include the keyword shared
• Making a private copy of matrix B in each thread might result in better performance, since many remote memory operations can be avoided
  • Can be done with the help of upc_memget (see the sketch below)
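A minimal sketch of that private-copy idea, meant to be appended to mat_mult_1.c above (the helper name and the block-by-block loop are mine, not from the slides): upc_memget copies shared data that has affinity to a single thread into private memory, so b is pulled over one M/THREADS-element block at a time.

    #define BBLK (M/THREADS)             /* block size of b, in elements */
    int b_priv[P][M];                    /* each thread's private copy of B */

    void make_private_copy_of_b(void) {
      shared [BBLK] int *src = &b[0][0]; /* walks b in linearized array order */
      int *dst = &b_priv[0][0];
      int blk, nblocks = (P*M)/BBLK;
      /* each block of BBLK consecutive elements lives on a single thread,
         which is exactly what upc_memget requires of its source */
      for (blk = 0; blk < nblocks; blk++)
        upc_memget(dst + blk*BBLK, src + blk*BBLK, BBLK*sizeof(int));
    }

After calling this once (e.g., right before the upc_forall loop), the inner product can read b_priv[l][j] instead of b[l][j], turning the remote accesses into local ones.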


Pointers to Shared vs. Arrays

• In the C tradition, arrays can be accessed through pointers
• Here is the vector addition example using pointers

    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];

    void main() {
      int i;
      shared int *p1, *p2;
      p1=v1; p2=v2;
      for (i=0; i<N; i++, p1++, p2++)
        if (i%THREADS == MYTHREAD)
          sum[i] = *p1 + *p2;
    }

[Figure: p1 and p2 advance through the cyclically distributed v1 and v2]

UPC Pointers

• Where does the pointer point, and where does it reside?

                                   Where does the pointer point?
                                     Local        Shared
    Where does the      Private     PP (p1)       PS (p3)
    pointer reside?     Shared      SP (p2)       SS (p4)

    int *p1;                 /* private pointer to local memory */
    shared int *p2;          /* private pointer to shared space */
    int *shared p3;          /* shared pointer to local memory  */
    shared int *shared p4;   /* shared pointer to shared space  */


UPC Pointers

[Figure: global address space with threads 0..n; p3 and p4 live in the shared space (with thread 0), while every thread has its own private p1 and p2]

    int *p1;                 /* private pointer to local memory */
    shared int *p2;          /* private pointer to shared space */
    int *shared p3;          /* shared pointer to local memory  */
    shared int *shared p4;   /* shared pointer to shared space  */

• Pointers to shared often require more storage and are more costly to dereference; they may refer to local or remote memory.

Common Uses for UPC Pointer Types

• int *p1;
  • These pointers are fast (just like C pointers)
  • Use to access local data in parts of the code performing local work
  • Often cast a pointer-to-shared to one of these to get faster access to shared data that is local (see the sketch below)
• shared int *p2;
  • Use to refer to remote data
  • Larger and slower due to the test-for-local plus possible communication
• int *shared p3;
  • Not recommended
• shared int *shared p4;
  • Use to build shared linked structures, e.g., a linked list
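A minimal sketch of that cast-to-local idiom (the function and array here are invented for illustration): the cast is only well defined for elements that have affinity to the calling thread, which a cyclic array guarantees for indices congruent to MYTHREAD.

    #include <upc.h>

    #define N 100*THREADS
    shared int v[N];         /* cyclic layout: v[i] lives on thread i % THREADS */

    void scale_my_elements(int factor) {
      int i;
      for (i = MYTHREAD; i < N; i += THREADS) {
        /* v[i] has affinity to MYTHREAD, so casting its address to a plain
           C pointer is legal and gives full local-access speed */
        int *p = (int *)&v[i];
        *p *= factor;
      }
    }

    int main() {
      int i;
      upc_forall (i = 0; i < N; i++; i) v[i] = i;   /* owner-computes init */
      upc_barrier;
      scale_my_elements(2);
      upc_barrier;
      return 0;
    }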

UPC Pointers

• In UPC, pointers to shared objects have three fields:
  • thread number
  • local address of the block
  • phase (specifies the position within the block)

    | Virtual Address | Thread | Phase |

• Example: Cray T3E implementation

    | Phase (bits 63-49) | Thread (bits 48-38) | Virtual Address (bits 37-0) |

UPC Pointers

• Pointer arithmetic supports blocked and non-blocked array distributions
• Casting of shared to private pointers is allowed, but not vice versa!
• When casting a pointer-to-shared to a pointer-to-local, the thread number of the pointer-to-shared may be lost
• Casting of shared to local is well defined only if the object pointed to by the pointer-to-shared has affinity with the thread performing the cast


Dynamic Memory Allocation in UPC

• Dynamic memory allocation of shared memory is available in UPC
• Functions can be collective or not
• A collective function has to be called by every thread and will return the same value to all of them

Special Functions

• size_t upc_threadof(shared void *ptr);
  returns the thread number that has affinity to the pointer-to-shared
• size_t upc_phaseof(shared void *ptr);
  returns the index (position within the block) field of the pointer-to-shared
• shared void *upc_resetphase(shared void *ptr);
  resets the phase to zero (see the sketch below)
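A small sketch of these queries on a blocked array (the array name and numbers are illustrative only, not from the slides):

    #include <upc.h>
    #include <stdio.h>

    shared [4] int a[4*THREADS];    /* block size 4 */

    int main() {
      if (MYTHREAD == 0) {
        shared [4] int *p = &a[6];  /* linear index 6: block 6/4 = 1, offset 6%4 = 2 */
        printf("a[6]: thread = %d (== (6/4) %% THREADS), phase = %d (== 6 %% 4)\n",
               (int)upc_threadof(p), (int)upc_phaseof(p));
        p = (shared [4] int *) upc_resetphase(p);   /* same thread/address, phase forced to 0 */
        printf("after upc_resetphase: phase = %d\n", (int)upc_phaseof(p));
      }
      return 0;
    }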


Global Memory Allocation

    shared void *upc_global_alloc(size_t nblocks, size_t nbytes);
        /* nblocks: number of blocks; nbytes: block size */

• Non-collective: called by one thread
• The calling thread allocates a contiguous memory space in the shared space
• If called by more than one thread, multiple regions are allocated, and each thread that makes the call gets a different pointer
• Space allocated per calling thread is equivalent to:

    shared [nbytes] char[nblocks * nbytes]

Collective Global Memory Allocation

    shared void *upc_all_alloc(size_t nblocks, size_t nbytes);
        /* nblocks: number of blocks; nbytes: block size */

• This function has the same result as upc_global_alloc, but it is a collective function, which is expected to be called by all threads
• All the threads will get the same pointer (see the usage sketch below)
• Equivalent to:

    shared [nbytes] char[nblocks * nbytes]
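As a minimal usage sketch (not from the slides): allocating one int per block makes the runtime layout match the default cyclic blocking of a shared int *, so the region can be indexed like a cyclic shared array.

    #include <upc.h>

    shared int *v;   /* private pointer to shared space; every thread holds a copy */

    int main() {
      int i, n = 100*THREADS;

      /* collective call: every thread gets the same pointer; the layout is
         equivalent to shared [sizeof(int)] char[n*sizeof(int)], i.e. a cyclic
         array of n ints, matching v's default block size of 1 element */
      v = (shared int *) upc_all_alloc(n, sizeof(int));

      upc_forall (i = 0; i < n; i++; i)   /* owner-computes initialization */
        v[i] = i;

      upc_barrier;
      if (MYTHREAD == 0)
        upc_free(v);    /* upc_free is not collective: free from one thread only */
      return 0;
    }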


Memory Freeing

    void upc_free(shared void *ptr);

• The upc_free function frees the dynamically allocated shared memory pointed to by ptr
• upc_free is not collective

Distributed Arrays, Directory Style

• Some high-performance UPC programmers avoid the UPC-style arrays
  • Instead, build directories of distributed objects (see the sketch below)
  • Also more general

    typedef shared [] double *sdblptr;
    shared sdblptr directory[THREADS];
    directory[i]=upc_alloc(local_size*sizeof(double));
    upc_barrier;
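A hedged sketch of how such a directory might be filled and then used (variable names beyond those in the snippet above are mine): each thread allocates its own piece with upc_alloc, publishes the pointer in its directory slot, and accesses its local piece through an ordinary C pointer.

    #include <upc.h>

    typedef shared [] double *sdblptr;
    shared sdblptr directory[THREADS];

    int main() {
      int local_size = 1000;

      /* each thread allocates its own piece in its part of the shared space */
      directory[MYTHREAD] = (sdblptr) upc_alloc(local_size*sizeof(double));
      upc_barrier;   /* make sure every slot of the directory is filled */

      /* fast local access: the piece has affinity to MYTHREAD, so the
         pointer-to-shared may be cast to a plain C pointer */
      double *mine = (double *) directory[MYTHREAD];
      mine[0] = 1.0 + MYTHREAD;

      upc_barrier;
      /* remote access goes through the pointer-to-shared in the directory */
      double first_of_neighbor = directory[(MYTHREAD+1)%THREADS][0];
      return first_of_neighbor > 0.0 ? 0 : 1;
    }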


Memory Consistency in UPC

• The consistency model defines the order in which one thread may see another thread's accesses to memory
  • If you write a program with unsynchronized accesses, what happens?
  • Does this work?

        /* thread 1 */            /* thread 2 */
        data = …;                 while (!flag) { };
        flag = 1;                 … = data;   // use the data

• UPC has two types of accesses:
  • Strict: will always appear in order
  • Relaxed: may appear out of order to other threads
• There are several ways of designating the type; commonly:
  • Use the include file:

        #include <upc_relaxed.h>

    which makes all accesses in the file relaxed by default
  • Use strict on variables that are used for synchronization (flag), as sketched below

Synchronization - Fence

• UPC provides a fence construct
  • Equivalent to a null strict reference, and has the syntax

        upc_fence;

  • UPC ensures that all shared references issued before the upc_fence are complete
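A minimal sketch of the flag idiom with a strict flag (not from the original slides; assumes at least two threads): because strict accesses are ordered with respect to all earlier accesses by the issuing thread, the consumer is guaranteed to see the data once it sees the flag.

    #include <upc_relaxed.h>   /* all accesses in this file are relaxed by default */
    #include <stdio.h>

    shared int data;           /* relaxed; statically zero-initialized */
    strict shared int flag;    /* strict: used only for synchronization */

    int main() {
      if (MYTHREAD == 0) {          /* producer */
        data = 42;                  /* relaxed write */
        flag = 1;                   /* strict write: issued only after data is complete */
      } else if (MYTHREAD == 1) {   /* consumer (assumes THREADS >= 2) */
        while (!flag) ;             /* strict reads: not reordered, will see the update */
        printf("consumer sees data = %d\n", data);   /* guaranteed to print 42 */
      }
      upc_barrier;
      return 0;
    }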


PGAS Languages Have Performance Advantages

• Strategy for acceptance of a new language:
  • Make it run faster than anything else
• Keys to high performance:
  • Parallelism:
    • Scaling the number of processors
  • Maximize single-node performance
    • Generate friendly code or use tuned libraries (BLAS, FFTW, etc.)
  • Avoid (unnecessary) communication cost
    • Latency, bandwidth, overhead
    • Berkeley UPC and Titanium use the GASNet communication layer
  • Avoid unnecessary delays due to dependencies
    • Load balance; algorithmic dependencies

One-Sided vs. Two-Sided Communication

[Figure: a one-sided put message carries a host address plus the data payload and is deposited directly into memory by the network interface; a two-sided message carries a message id plus the data payload and must be matched by the CPU]

• A one-sided put/get message can be handled directly by a network interface with RDMA support (see the UPC sketch below)
  • Avoids interrupting the CPU or storing data from the CPU (preposts)
• A two-sided message needs to be matched with a receive to identify the memory address where the data goes
  • Offloaded to the network interface in networks like Quadrics
  • Need to download match tables to the interface (from the host)
  • Ordering requirements on messages can also hinder bandwidth
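To connect this to UPC source code, here is a minimal sketch (buffer names and sizes are mine, not from the slides) of a one-sided bulk put: the writer deposits data directly into the target thread's portion of a shared array, with no matching receive on the target side.

    #include <upc.h>

    #define N 1024
    shared [N] int inbox[N*THREADS];   /* one block of N ints per thread */
    int outbuf[N];                     /* private staging buffer */

    int main() {
      int i, peer = (MYTHREAD + 1) % THREADS;
      for (i = 0; i < N; i++) outbuf[i] = MYTHREAD;

      /* one-sided bulk put: copy the private buffer into the neighbor's block;
         the destination block has affinity to a single thread, as upc_memput requires */
      upc_memput(&inbox[peer*N], outbuf, N*sizeof(int));

      upc_barrier;   /* barrier implies a fence: all puts are complete and visible */
      /* this thread's block now holds its left neighbor's id */
      return 0;
    }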

Performance Advantage of One-Sided Communication

[Figure: flood bandwidth (MB/s) vs. message size on Opteron/InfiniBand for a non-blocking GASNet put vs. MPI, with an inset showing the relative bandwidth (GASNet/MPI)]

• Opteron/InfiniBand (Jacquard at NERSC):
  • GASNet's vapi-conduit and OSU MPI 0.9.5 MVAPICH
• This is a very good MPI implementation; it is limited by the semantics of message matching, ordering, etc.
• The half-power point (N½) differs by one order of magnitude
Joint work with Paul Hargrove and Dan Bonachea

GASNet: Portability and High-Performance

[Figure: 8-byte roundtrip latency (usec, lower is better) of GASNet put+sync vs. MPI ping-pong on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed]

• GASNet is better for latency across machines
Joint work with UPC Group; GASNet design by Dan Bonachea

GASNet: Portability and High-Performance

[Figure: flood bandwidth for 2 MB messages as a percentage of hardware peak, GASNet vs. MPI, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed]

• GASNet is at least as high as (comparable to) MPI for large messages
Joint work with UPC Group; GASNet design by Dan Bonachea

GASNet: Portability and High-Performance

[Figure: flood bandwidth for 4 KB messages as a percentage of hardware peak, GASNet vs. MPI, on the same six platforms]

• GASNet excels at mid-range sizes: important for overlap
Joint work with UPC Group; GASNet design by Dan Bonachea

Case Study: NAS FT

• Performance of the Exchange (Alltoall) is critical
  • 1D FFTs in each dimension, 3 phases
  • Transpose after the first 2 for locality
  • Bisection bandwidth-limited
    • Problem as #procs grows
• Three approaches:
  • Exchange:
    • wait for 2nd-dimension FFTs to finish, send 1 message per processor pair
  • Slab:
    • wait for the chunk of rows destined for 1 proc, send when ready
  • Pencil:
    • send each row as it completes
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

Overlapping Communication

• Goal: make use of “all the wires all the time”
  • Schedule communication to avoid network backup
• Trade-off: overhead vs. overlap
  • Exchange has the fewest messages, less message overhead
  • Slabs and pencils have more overlap; pencils the most
• Example: message sizes for the Class D problem on 256 processors
  • Exchange (all data at once): 512 Kbytes
  • Slabs (contiguous rows that go to 1 processor): 64 Kbytes
  • Pencils (single row): 16 Kbytes
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

NAS FT Variants Performance Summary

[Figure: best MFlop rates per thread for the NAS FT benchmark versions (Best NAS Fortran/MPI, Best MPI (always Slabs), Best UPC (always Pencils)) on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512, with a .5 Tflops annotation at the top end]

• Slab is always best for MPI; the small-message cost is too high
• Pencil is always best for UPC; more overlap
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

Case Study: LU Factorization

• Direct methods have complicated dependencies
  • Especially with pivoting (unpredictable communication)
  • Especially for sparse matrices (dependence graph with holes)
• LU factorization in UPC
  • Use overlap ideas and multithreading to mask latency
• Multithreaded: UPC threads + user threads + threaded BLAS
  • Panel factorization: including pivoting
  • Update to a block of U
  • Trailing submatrix updates
• Status:
  • Dense LU done: HPL-compliant
  • Sparse version underway
Joint work with Parry Husbands

UPC HPL Performance

[Figure: Linpack GFlop/s for MPI/HPL vs. UPC on X1/64, X1/128, Opteron cluster/64, and Altix/32]

• MPI HPL numbers are from the HPCC database
• Large scaling:
  • 2.2 TFlops on 512p
  • 4.4 TFlops on 1024p (Thunder)
• Comparison to ScaLAPACK on an Altix, a 2 x 4 process grid
  • ScaLAPACK (block size 64): 25.25 GFlop/s (tried several block sizes)
  • UPC LU (block size 256): 33.60 GFlop/s; (block size 64): 26.47 GFlop/s
• n = 32000 on a 4 x 4 process grid
  • ScaLAPACK: 43.34 GFlop/s (block size = 64)
  • UPC: 70.26 GFlop/s (block size = 200)
Joint work with Parry Husbands

Summary

• UPC is designed to be consistent with C
  • Some low-level details, such as memory layout, are exposed
  • Ability to use pointers and arrays interchangeably
• Designed for high performance
  • Memory consistency is explicit
  • Small implementation
• Berkeley compiler (used for the next homework)
  • http://upc.lbl.gov
• Language specification and other documents
  • http://upc.gwu.edu
