IBM Software Group | Rational ®

XLUPC: Developing an Optimizing Compiler for the UPC Programming Language

Ettore Tiotto IBM Toronto Laboratory etiotto@ca..com

IBM Disclaimer

© Copyright IBM Corporation 2012. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo, Rational, the Rational logo, Telelogic, the Telelogic logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others.

Please Note:

IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Outline

1. Overview of the PGAS programming model
2. The UPC Programming Language
3. Compiler optimizations for PGAS languages


1. Overview of the PGAS programming model

Ettore Tiotto IBM Toronto Laboratory [email protected]

Partitioned Global Address Space

. Explicitly parallel, shared-memory-like programming model
. Globally addressable space
  – Allows programmers to declare and "directly" access data distributed across machines
. Partitioned address space
  – Memory is logically partitioned between local and remote (a two-level hierarchy)
  – Forces the programmer to pay attention to data locality, by exposing the inherent NUMA-ness of current architectures
. Single Program, Multiple Data (SPMD) execution model
  – All threads of control execute the same program
  – Number of threads fixed at startup
  – Newer languages such as X10 escape this model, allowing fine-grain threading
. Different language implementations:
  – UPC (C-based), Coarray (Fortran-based), Titanium and X10 (Java-based)

Partitioned Global Address Space

[Figure: process/thread and address-space organization in the PGAS model (UPC, CAF, X10) compared with message passing (MPI) and shared memory (OpenMP)]

. Computation is performed in multiple places.
. A place contains data that can be operated on remotely.
. Data lives in the place it was created, for its lifetime.
. A datum in one place may point to a datum in another place.
. Data structures (e.g. arrays) may be distributed across many places.
. Places may have different computational properties (mapping to a hierarchy of compute engines).
. A place expresses locality.

UPC Execution Model

. A number of threads working independently in a SPMD fashion
  – Number of threads available as the program variable THREADS
  – MYTHREAD specifies the thread index (0..THREADS-1)
  – upc_barrier is a global synchronization: all threads wait
  – There is a form of worksharing loop that we will see later
. Two compilation modes
  – Static threads mode:
    • THREADS is specified at compile time by the user (compiler option)
    • The program may use THREADS as a compile-time constant
    • The compiler can generate more efficient code
  – Dynamic threads mode:
    • Compiled code may be run with varying numbers of threads
    • THREADS is specified at run time by the user (via an environment variable)
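For example (a sketch; the -qupc=threads option and the UPC_NTHREADS environment variable are the ones used later in these slides):

  xlupc -qupc=threads=4 helloWorld.upc    # static threads mode: THREADS fixed at 4 at compile time

  xlupc helloWorld.upc                    # dynamic threads mode:
  env UPC_NTHREADS=4 ./a.out              # THREADS chosen at launch time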

Hello World in UPC

. Any legal C program is also a legal UPC program
. If you compile and run it as UPC with N threads, it will run N copies of the program (Single Program executed by all threads)

  #include <upc.h>
  #include <stdio.h>

  int main() {
      printf("Thread %d of %d: Hello UPC world\n", MYTHREAD, THREADS);
      return 0;
  }

  hello > xlupc helloWorld.upc
  hello > env UPC_NTHREADS=4 ./a.out
  Thread 1 of 4: Hello UPC world
  Thread 0 of 4: Hello UPC world
  Thread 3 of 4: Hello UPC world
  Thread 2 of 4: Hello UPC world


2. The UPC Programming Language

Ettore Tiotto IBM Toronto Laboratory [email protected]

Some slides adapted with permission from Kathy Yelick

An Overview of the UPC Language

. Private vs. Shared Variables in UPC
. Shared Array Declaration and Layouts
. Parallel Work Sharing Loop
. Synchronization
. Pointers in UPC
. Shared Data Transfer and Collectives

Private vs. Shared Variables in UPC

. Normal C variables and objects are allocated in the private memory space of each thread (thread-local variables)
. Shared variables are allocated only once, on thread 0

  shared int ours;
  int mine;

. Shared variables may not be declared automatic, i.e., may not occur in a function definition, except as static.
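As a small illustration of the last point (a sketch, not taken from the slides): a shared variable may appear inside a function only with static storage duration, while ordinary locals are private to each thread.

  shared int ours;                /* one instance, with affinity to thread 0      */
  int mine;                       /* one private instance per thread              */

  void f(void) {
      static shared int hits;     /* allowed: shared with static storage          */
      int tmp = MYTHREAD;         /* automatic, therefore private to the thread   */
      /* shared int bad;             not allowed: shared cannot be automatic      */
  }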

[Figure: global address space — "ours" resides once in the shared space (with affinity to thread 0); each of Thread 0..Thread n holds its own private "mine"]

Shared Arrays

. Shared arrays are distributed over the threads, in a cyclic fashion

  shared int x[THREADS];       /* 1 element per thread      */
  shared int y[3][THREADS];    /* 3 elements per thread     */
  shared int z[3][3];          /* 2 or 3 elements per thread */

. Assuming THREADS = 4 (colors map to threads):

[Figure: cyclic layout of x, y, and z across 4 threads — think of a linearized C array, then map it round-robin on THREADS]

Distribution of a shared array in UPC

. Elements are distributed in a block-cyclic fashion
. Each thread "owns" blocks of adjacent elements

shared [2] int X[10];

Blocking factor 2, 2 threads:

  element:  0    1    2    3    4    5    6    7    8    9
  thread:   th0  th0  th1  th1  th0  th0  th1  th1  th0  th0

Layouts in General

. All non-array shared objects have affinity with thread zero.
. Array layouts are controlled by layout specifiers:
  – Empty (cyclic layout)
  – [*] (blocked layout)
  – [0] or [] (indefinite layout: all elements on one thread)
  – [b] or [b1][b2]…[bn] = [b1*b2*…*bn] (fixed block size)
. The affinity of an array element is defined in terms of:
  – the block size, a compile-time constant
  – and THREADS.
. Element i has affinity with thread (i / block_size) % THREADS
. In 2D and higher, linearize the elements as in a C representation, and then use the above mapping
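A minimal sketch of these layout specifiers (array names and sizes are chosen for illustration, not taken from the slides):

  shared      int a[16];    /* cyclic: element i lives on thread i % THREADS           */
  shared [*]  int b[16];    /* blocked: one contiguous chunk per thread                */
  shared []   int c[16];    /* indefinite: all 16 elements on one thread (thread 0)    */
  shared [4]  int d[16];    /* fixed block size 4: element i lives on (i/4) % THREADS  */

  /* e.g. for shared [2] int X[10] with 2 threads, element 5 has affinity with
     thread (5/2) % 2 = 0, matching the block-cyclic picture in these slides   */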

Physical layout of shared arrays

shared [2] int X[10];

2 threads

Logical distribution:
  element:  0    1    2    3    4    5    6    7    8    9
  thread:   th0  th0  th1  th1  th0  th0  th1  th1  th0  th0

Physical distribution:
  element:  0    1    4    5    8    9    2    3    6    7
  thread:   th0  th0  th0  th0  th0  th0  th1  th1  th1  th1
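The logical-to-physical mapping can be inspected at run time with upc_threadof and upc_phaseof (a small sketch using the array declared above):

  #include <upc.h>
  #include <stdio.h>

  shared [2] int X[10];

  int main(void) {
      if (MYTHREAD == 0)
          for (int i = 0; i < 10; i++)
              printf("X[%d]: thread %d, phase %d\n", i,
                     (int) upc_threadof(&X[i]), (int) upc_phaseof(&X[i]));
      return 0;
  }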

Work Sharing with upc_forall()

. UPC adds a special type of loop:

  upc_forall(init; test; loop; affinity)
      statement;

. The affinity expression indicates which iterations run on each thread. It may have one of two types:
  – Integer: the iteration runs on the thread for which affinity%THREADS == MYTHREAD

      upc_forall(i=0; i<N; i++; i)

  – Pointer-to-shared: the iteration runs on the thread for which upc_threadof(affinity) == MYTHREAD

      upc_forall(i=0; i<N; i++; &A[i])

Work Sharing with upc_forall()

. Similar to a C for loop; the 4th field indicates the affinity
. The thread that "owns" element A[i] executes iteration i

  shared int A[10], B[10], C[10];
  upc_forall(i=0; i < 10; i++; &A[i]) {
      A[i] = B[i] + C[i];     /* no communication */
  }

[Figure: with 2 threads, iterations i=0..9 are executed by the thread that owns A[i], so no communication is needed]

Vector Addition with upc_forall

  #define N 100*THREADS

  shared int v1[N], v2[N], sum[N];

  void main() {
      int i;
      upc_forall(i=0; i<N; i++; i)
          sum[i] = v1[i] + v2[i];
  }

• Equivalent code could use "&sum[i]" for the affinity expression
• The code would be correct but slow if the affinity expression were i+1 rather than i. Why?

UPC Global Synchronization

. Synchronization between threads is necessary to control the relative execution of threads
. The programmer is responsible for synchronization
. UPC has two basic forms of global barriers:
  – Barrier: block until all other threads arrive

      upc_barrier

  – Split-phase barrier:

      upc_notify;    /* this thread is ready for the barrier */
      ... perform local computation ...
      upc_wait;      /* wait for the others to be ready */

Recap: Shared Variables, Work Sharing and Synchronization

. With what you have seen so far, you can write a bare-bones data-parallel program
. Shared variables are distributed and visible to all threads
  – Shared scalars have affinity to thread 0
  – Shared arrays are distributed (cyclically by default) over all threads
. The execution model is SPMD, with upc_forall provided for work sharing
. Barriers and split-phase barriers provide global synchronization

Pointers to Shared vs. Arrays

• In the C tradition, arrays can be accessed through pointers
• Here is a vector addition example using pointers

  #define N 100*THREADS
  shared int v1[N], v2[N], sum[N];

  void main() {
      int i;
      shared int *p1, *p2;
      p1 = &v1[0];
      p2 = &v2[0];
      upc_forall(i=0; i<N; i++, p1++, p2++; i)
          sum[i] = *p1 + *p2;
  }

[Figure: p1 and p2 are private pointers-to-shared that walk through the shared arrays v1 and v2]

Common Uses for UPC Pointer Types

  int *p1;
  – These pointers are fast (just like C pointers)
  – Use to access local data in the parts of the code performing local work
  – Often cast a pointer-to-shared to one of these to get faster access to shared data that is local

  shared int *p2;
  – Use to refer to remote data
  – Larger and slower due to the test-for-local plus possible communication

  int * shared p3;
  – Not recommended … why not?

  shared int * shared p4;
  – Use to build shared linked structures, e.g., a linked list

UPC Pointers

                                       Where does the pointer point to?
                                       Local        Shared
  Where does the        Private        PP (p1)      PS (p2)
  pointer reside?       Shared         SP (p3)      SS (p4)

  int *p1;                 /* private pointer to local memory */
  shared int *p2;          /* private pointer to shared space */
  int *shared p3;          /* shared pointer to local memory  */
  shared int *shared p4;   /* shared pointer to shared space  */

Shared to private (p3) is not recommended. Why?

UPC Pointers

[Figure: global address space — p3 and p4 reside in the shared space and are visible to all threads; each of Thread 0..Thread n holds its own private p1 and p2]

  int *p1;                 /* private pointer to local memory */
  shared int *p2;          /* private pointer to shared space */
  int *shared p3;          /* shared pointer to local memory  */
  shared int *shared p4;   /* shared pointer to shared space  */

Pointers to shared often require more storage and are more costly to dereference than local pointers; they may refer to local or remote memory.

UPC Pointers

. Pointer arithmetic supports blocked and non-blocked array distributions
. Casting of shared to private pointers is allowed, but not vice versa!
. When casting a pointer-to-shared to a private pointer, the thread number of the pointer-to-shared may be lost
. Casting of shared to private is well defined only if the object pointed to by the pointer-to-shared has affinity with the thread performing the cast

  shared [BF] int A[N];
  …
  shared [BF] int *p = &A[BF*MYTHREAD];
  int *lp = (int *) p;      // cast to a local pointer (careful here!)
  for (i=0; i < BF; i++)
      lp[i] = …;            // access this thread's block through the private pointer

Data Movement

. Fine-grain accesses (element by element) are easy to program in an imperative way; however, especially on distributed-memory machines, block transfers are more efficient
. UPC provides library functions for shared data movement:
  – upc_memset
    • Set a block of values in shared memory
  – upc_memget, upc_memput
    • Transfer blocks of data from shared memory to/from private memory
  – upc_memcpy
    • Transfer blocks of data from shared memory to shared memory
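A small usage sketch of the bulk-transfer routines (buffer names and sizes are illustrative):

  #define B 1024
  shared [B] int data[B*THREADS];   /* one block of B ints per thread */
  int local[B];

  /* copy the block this thread owns into private memory, work on it, put it back */
  upc_memget(local, &data[MYTHREAD*B], B * sizeof(int));
  /* ... compute on local[] ... */
  upc_memput(&data[MYTHREAD*B], local, B * sizeof(int));

  /* zero thread 0's block of the shared array */
  if (MYTHREAD == 0)
      upc_memset(data, 0, B * sizeof(int));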

Data Movement Collectives

. Used to move shared data across threads:
  – upc_all_broadcast(shared void* dst, shared void* src, size_t nbytes, …)
    • One thread copies a block of memory it "owns" and sends it to all threads
  – upc_all_scatter(shared void* dst, shared void* src, size_t nbytes, …)
    • A single thread splits a block of memory into pieces and sends each piece to a different thread
  – upc_all_gather(shared void* dst, shared void* src, size_t nbytes, …)
    • Each thread copies a block of memory it "owns" and sends it to a single thread
  – upc_all_gather_all(shared void* dst, shared void* src, size_t nbytes, …)
    • Each thread copies a block of memory it "owns" and sends it to all threads
  – upc_all_exchange(shared void* dst, shared void* src, size_t nbytes, …)
    • Each thread splits a block of memory into pieces and sends a different piece to each thread
  – upc_all_permute(shared void* dst, shared void* src, shared int* perm, size_t nbytes, …)
    • Each thread i copies a block of memory and sends it to thread perm[i]

upc_all_broadcast

[Figure: the src block on THREAD 0 is copied into dst on THREAD 0, THREAD i, …, THREAD n]

Thread 0 copies a block of memory and sends it to all threads.
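A possible call sequence (a sketch; the block size NB is illustrative, and the synchronization flags are those defined by the UPC collectives library):

  #include <upc_collective.h>
  #define NB 4

  shared [NB] int src[NB];            /* a single block, with affinity to thread 0 */
  shared [NB] int dst[NB*THREADS];    /* one block of NB ints per thread           */

  upc_all_broadcast(dst, src, NB * sizeof(int),
                    UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);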

upc_all_scatter

[Figure: the src buffer on THREAD 0 is split into blocks; each block is copied into dst on a different thread (THREAD 0, THREAD i, …, THREAD n)]

Thread 0 sends a distinct block to each thread.

upc_all_gather

[Figure: each thread's src block (THREAD 0, THREAD i, …, THREAD n) is copied into a different block of dst on THREAD 0]

Each thread sends a block to thread 0.

Computational Collectives

. Used to perform data reductions
  – upc_all_reduceT(shared void* dst, shared void* src, upc_op_t op, …)
  – upc_all_prefix_reduceT(shared void* dst, shared void* src, upc_op_t op, …)
. One version for each type T (22 versions in total)
. Many operations are supported:
  • OP can be: +, *, &, |, xor, &&, ||, min, max
  • Perform OP on all elements of the src array and place the result in the dst array

upc_all_reduce

[Figure: THREAD 0 holds src elements 1, 2, 3 (partial sum 6); THREAD i holds 4, 5, 6 (partial sum 15); THREAD n holds 7, 8, 9 (partial sum 24); the result 45 is stored in dst on thread 0]

Threads compute partial sums; the partial sums are added and the result is stored on thread 0.
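A sketch of the integer variant (the trailing NULL is the optional user-supplied function, unused for the built-in UPC_ADD operation; NE is an illustrative block size):

  #include <upc_collective.h>
  #define NE 4

  shared [NE] int src[NE*THREADS];    /* NE elements per thread            */
  shared int result;                  /* result, with affinity to thread 0 */

  upc_all_reduceI(&result, src, UPC_ADD, NE*THREADS, NE, NULL,
                  UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);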

Simple Sum: Introduction

. A simple upc_forall loop to add values into a sum

  shared int values[N], sum;

  sum = 0;
  upc_forall (int i=0; i<N; i++; &values[i])
      sum += values[i];

. Is there a problem with this code ?

. Implementation is broken: “sum” is not guarded by a lock

. Write-after-write hazard; will get different answers every time

Synchronization Primitives

. We have seen upc_barrier, upc_notify and upc_wait
. UPC also supports locks:
  – Represented by an opaque type: upc_lock_t
  – Must be allocated before use:
      upc_lock_t *upc_all_lock_alloc(void);
        collective: allocates 1 lock, all threads receive a pointer to it
      upc_lock_t *upc_global_lock_alloc(void);
        non-collective: allocates 1 lock, only the calling thread receives the pointer
  – To use a lock:
      void upc_lock(upc_lock_t *l)
      void upc_unlock(upc_lock_t *l)
        use at the start and end of the critical region
  – Locks can be freed when no longer in use:
      void upc_lock_free(upc_lock_t *ptr);
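Applying the locks to the earlier simple-sum loop, a corrected version might look like this sketch (mirroring the arrays declared on that slide):

  shared int values[N], sum;
  upc_lock_t *sum_lock;

  sum_lock = upc_all_lock_alloc();            /* every thread gets a pointer to the same lock */
  if (MYTHREAD == 0) sum = 0;
  upc_barrier;

  upc_forall (int i=0; i < N; i++; &values[i]) {
      upc_lock(sum_lock);                     /* critical region guards the shared update     */
      sum += values[i];
      upc_unlock(sum_lock);
  }

  upc_barrier;
  if (MYTHREAD == 0) upc_lock_free(sum_lock);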

Dynamic Memory Allocation

. As in C, memory can be dynamically allocated
. UPC provides several memory allocation routines to obtain space in the shared heap:
  – shared void* upc_all_alloc(size_t nblocks, size_t nbytes)
    • A collective operation that allocates memory on all threads
    • Layout equivalent to: shared [nbytes] char[nblocks * nbytes]
  – shared void* upc_global_alloc(size_t nblocks, size_t nbytes)
    • A non-collective operation, invoked by one thread to allocate memory on all threads
    • Layout equivalent to: shared [nbytes] char[nblocks * nbytes]
  – shared void* upc_alloc(size_t nbytes)
    • A non-collective operation to obtain memory in the calling thread's shared section of memory
  – void upc_free(shared void *ptr)
    • A non-collective operation to free data allocated in shared memory
    • Why do we need just one version?
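A sketch of collective allocation and release (the block size NB and the stored values are illustrative):

  #define NB 256
  shared [NB] int *buf;

  /* all threads call upc_all_alloc and receive a pointer to the same allocation */
  buf = (shared [NB] int *) upc_all_alloc(THREADS, NB * sizeof(int));

  buf[MYTHREAD * NB] = MYTHREAD;      /* first element of this thread's block */

  upc_barrier;
  if (MYTHREAD == 0)
      upc_free((shared void *) buf);  /* a single thread releases the shared allocation */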


3. Compiler Optimizations for PGAS Languages

Ettore Tiotto IBM Toronto Laboratory [email protected]

XLUPC Compiler Architecture

. Built to leverage the IBM XLC technology
  – Common Front End for C and UPC (extends the C technology to UPC, generates WCode IL)
  – Common High-Level Optimizer (common infrastructure with C; UPC-specific optimizations developed)
  – Common Low-Level Optimizer (largely unmodified; UPC artifacts handled upstream)
  – PGAS Runtime (exploitation of the P7IH specialized network hardware)

XLUPC Optimizer Architecture

[Figure: high-level optimizer components]

Data and Control Flow Optimizer: constant propagation, copy propagation, dead store elimination, dead code elimination, expression simplification, backward and forward store motion, redundant condition elimination.

Loop Optimizer: traditional loop optimizations, loop unrolling, loop normalization, loop unswitching.

Shared Array Optimizations: shared array idioms optimization, shared objects prefetching, shared objects coalescing, remote shared data optimizations, shared objects privatization, remote update, parallel work-sharing optimization.

Optimization strategy:
1) Remove overhead: parallel work-sharing loop optimization, shared pointer arithmetic inlining
2) Exploit locality information: shared objects privatization
3) Reduce communication requirements: shared objects coalescing, remote update, shared array idioms recognition
4) P7IH exploitation (through PAMI): HFI immediate send instruction support, Collective Acceleration Unit support, HFI remote update support, PAMI network endpoints

Code generation: UPC code generation, PGAS Runtime, thread-local storage, PAMI (system).

XL UPC Compiler Optimizations

3.0 Translation of shared object accesses
  – Preview of compiler-optimized code
3.1 Optimizing a work sharing loop
  – Removing the affinity branch in a work sharing loop
3.2 Locality analysis for work sharing loops
3.3 Shared array optimizations
  – Optimizations for local accesses (privatization)
  – Optimizations for remote accesses (coalescing)
  – Optimizing commonly used code patterns
3.4 Optimizing shared accesses through UPC pointers
  – Optimizing shared accesses through pointers (versioning)

3.0 Translation of Shared Object Accesses

. Achieved by translating shared accesses into calls to our PGAS runtime system:

Source code:

  shared int a;
  int foo () {
      int i, res;
      if (MYTHREAD==0)
          for (i=0; i …
  }

Pseudocode generated by the compiler:

  3 | int foo()
  5 | {
        if (!(__mythread == 0)) goto lab_1;
  6 |   i = 0;
        …

3.0 Translation of Shared Object Accesses

. To translate the shared array access "a[i]" the compiler generates:
  – A PGAS RT call to position a pointer-to-shared at the correct array element (&a[i]):

      __xlupc_incr_shared_addr(&shared_addr.a, i);

  – A PGAS RT call to retrieve the value of a[i] through the pointer:

      __xlupc_ptrderef(shared_addr.a, &tmp);

  – The value of "a[i]" is now in a local buffer "tmp"
. Performance considerations:
  – Function call overhead (set up arguments, branch to the runtime routine)
  – Runtime implementation cost
    • Receiving arguments, performing local computation, …
    • Communication cost (if the target array element is not local)
  – Pointer increment cost (actually inlined by the compiler for common cases)

3.0 Preview of Compiler-Optimized Code

. Besides code generation, the compiler concerns itself with generating efficient code (removing overhead). For example, certain shared accesses can be proven local at compile time, so the compiler can:
  – Cast the shared array element address to a local C pointer (on each thread)
  – Use the local C pointer to dereference the array element directly

Source code:

  shared int A[THREADS];

  int foo(void) {
      A[MYTHREAD] = MYTHREAD;
  }

Optimized code (-O3):

  int foo() {
      __threadsPerNode = __xlupc_threadsPerNode();
      __shr_address = __xlupc_get_array_base_address(A);
      __shr_address[(((__mythread % 4) % __threadsPerNode) * 4 +
                     (__mythread/4)*4)] = __mythread;
      return;
  } /* function */

3.1 Optimizing a Work Sharing Loop

. A UPC work sharing loop (upc_forall) is equivalent to a for loop in which execution of the loop body is controlled by the affinity expression
. Example:

Original upc_forall loop:

  shared int a[N], b[N], c[N];
  void foo () {
      int i;
      upc_forall (i=0; i < N; i++; &a[i])
          a[i] = b[i] + c[i];
  }

Equivalent for loop:

  shared int a[N], b[N], c[N];
  void foo() {
      int i;
      for (i=0; i < N; i++)
          if (upc_threadof(&a[i]) == MYTHREAD)
              a[i] = b[i] + c[i];
  }

3.1 Optimizing a Work Sharing Loop

. To eliminate the affinity branch, the compiler transforms the upc_forall loop into a loop nest comprising two for loops:

Original upc_forall loop:

  shared int a[N], b[N], c[N];
  void foo () {
      int i;
      upc_forall (i=0; i < N; i++; &a[i])
          a[i] = b[i] + c[i];
  }

Equivalent (optimized) loop nest:

  shared int a[N], b[N], c[N];
  void foo() {
      int i, j;
      for (i = MYTHREAD*BF; i < N; i += THREADS*BF)   /* blocks owned by this thread          */
          for (j = i; j < i + BF; j++)                /* elements within the block            */
              a[j] = b[j] + c[j];                     /* BF: blocking factor of the arrays    */
  }

. For each thread, the outer loop cycles through each block of elements it owns, while the inner loop cycles through each element in a block

[Figure: with 2 threads, iterations i = 0..5 / j = 0..5 are distributed across Thread 0 and Thread 1 in alternating blocks]

3.1 Optimizing a Work Sharing Loop

Objective: remove the affinity-branch overhead in upc_forall loops.

Stream Triad (20 million elements), address affinity:

  shared [BF] int A[N], B[N], C[N];

  upc_forall (i=0; i < N; i++; &A[i])
      A[i] = B[i] + k*C[i];

Naïve translation — the affinity branch is evaluated on every iteration and limits scalability:

  for (i=0; i < N; i++)
      if (upc_threadof(&A[i]) == MYTHREAD)
          A[i] = B[i] + k*C[i];

Optimized (strip-mined) loop:

  for (i = MYTHREAD*BF; i < N; i += THREADS*BF)
      for (j = i; j < i + BF; j++)
          A[j] = B[j] + k*C[j];

[Chart: Stream Triad bandwidth (MB/s) — the optimized (strip-mined) loop achieves an 8.5X improvement over the unoptimized baseline]

3.2 Locality Analysis for Work Sharing Loops

[Figure: array A distributed over 4 threads / 4 nodes; for each block of 4 elements, the write A[i][j+1] produces 3 local accesses and 1 remote access]

  shared [4] int A[2][16];

  int main () {
      int i, j;
      for (i = 0; i < 2; i++) {                       // row traversal
          upc_forall (j = 0; j < 16; j++; &A[i][j]) {
              A[i][j+1] = 2;
          }
      }
  }

– Each block has 4 elements
– A upc_forall loop is used to traverse the inner array dimension
– Note the index [j+1], causing 1/4 of the accesses to fall in the next block
– Every 4th loop iteration there is one remote access; the first 3 accesses are shared local
– Can we somehow isolate the remote access from the local accesses?

3.2 Locality Analysis for Work Sharing Loops

[Figure: array A distributed over 4 threads / 4 nodes; the accesses are split into a local region (A1) and a remote region (A2)]

  shared [4] int A[2][16];

  int main () {
      for (int i = 0; i < 2; i++)
          upc_forall (int j = 0; j < 16; j++; &A[i][j]) {
              if (j % 4 < 3)
                  A[i][j+1] = 2;    // A1, Region 1
              else
                  A[i][j+1] = 2;    // A2, Region 2
          }
  }

– The compiler identifies the point in the loop iteration space where the locality of shared accesses changes
– The compiler then "splits" the loop body into 2 regions:
  + Region 1: the shared access is always local
  + Region 2: the shared access is always remote
– Now shared local accesses (Region 1) can be "privatized", and shared remote accesses can be "coalesced" (more on this later)

3.3 Optimizations for Local Accesses

. After the compiler has proven (via locality analysis) that a shared array access is local, shared object access privatization can be applied:

Original shared access:

  upc_forall (int j=0; j < 16; j++; &A[j]) {
      if (j % 4 < 3)
          A[j+1] = 2;    // shared local
      else
          A[j+1] = 2;    // shared remote
  }

Privatized shared access:

  __A_address = __xlupc_get_array_address(A);
  upc_forall (int j=0; j < 16; j++; &A[j]) {
      if (j % 4 < 3)
          __A_address[ new_index ] = 2;    // privatized
      else
          A[j+1] = 2;                      // shared remote
  }

. The compiler translates the shared local access so that no call to the UPC runtime system is necessary (no call overhead):
  – the base address of the shared array is retrieved outside the loop
  – the index into the local portion of the shared array is computed
  – the shared array access is then just a local memory dereference

3.3 Optimizations for Remote Accesses

. Locality analysis tells us whether a shared reference is local or remote
. Local shared references are privatized; how can remote shared references be optimized?

Example of remote accesses:

  shared [BF] int A[N], B[N+BF];

  void loop() {
      upc_forall(int i=0; i < N; i++; &A[i])
          A[i] = B[i+BF];
  }

Observations:
(1) A[i] is shared local, therefore the compiler privatizes the access
(2) B[i+BF] is always remote (for every "i"), so every iteration would otherwise pay for a separate remote access

. The compiler transforms the parallel loop to prefetch multiple elements of array B (reducing the number of messages required):

Original loop:

  #define BF (N/THREADS)
  shared [BF] int A[N], B[N+BF];

  void loop() {
      upc_forall(int i=0; i < N; i++; &A[i])
          A[i] = B[i+BF];
  }

Optimized parallel loop:

  #define BF (N/THREADS)
  shared [BF] int A[N], B[N+BF];

  void loop() {
      …   /* the remote elements of B are fetched in bulk, then the loop consumes the prefetched buffer */
  }

[Charts: Sobel, 4096 x 4096 image, execution time (sec) vs. number of nodes (lower is better), comparing No Opt., Privatization, Coalescing, and Full Opt. — one chart for the distributed configuration (1 thread per node) and one for the hybrid configuration (16 threads per node)]

Distributed: coalescing accounts for a 3X improvement.
Hybrid: coalescing accounts for a 74X improvement.

3.3 Optimizing Commonly Used Code Patterns

. The transformation detects common initialization code patterns and changes them into one of the 4 string-handling functions provided by the UPC standard (upc_memset, upc_memget, upc_memput, upc_memcpy).
. The compiler is also able to recognize non-shared initialization idioms and change them into one of the C standard string-handling functions.

Source code snippet:

  #define N 64000
  #define BF 512
  shared [BF] int A[N];
  shared [BF] int B[N];
  int a[N], b[N];

  void func(void) {
      for (int i = 0; i < N; i++) {
          A[i] = a[i];
          b[i] = B[i];
      }
  }

Code generated by the XLUPC compiler (-O2), after array-init idiom recognition:

  @BCIV2 = 0;
  do {
      @CIV1 = 512ll * @BCIV2;
      50 | upc_memput(&A + 4ll*@CIV1, &a + 4ll*@CIV1, 2048ll)
      51 | upc_memget(&b + 4ll*@CIV1, &B + 4ll*@CIV1, 2048ll)
      do {
          @CIV1 = @CIV1 + 1ll;
      } while (@CIV1 < min(@BCIV2 * 512ll + 512ll, 64000ll));
      lab_5:
      @BCIV2 = @BCIV2 + 1;
  } while (@BCIV2 < 125ll);

3.4 Optimizing Shared Accesses through Pointers

. To determine whether a shared access in a parallel loop is local or remote, locality analysis has to compare it with the loop affinity expression (to determine its distance)
. If the access is through a pointer to shared memory, the distance computation depends on where the pointer points

Shared accesses via pointers-to-shared:

  shared [BF] int A[N], B[N];

  void foo(shared [BF] int *p1,
           shared [BF] int *p2) {
      upc_forall (int i=0; i < N; i++; &p1[i])
          p1[i] = p2[i+1];
  }

Observations:
(1) p1[i] is trivially a shared local access for every value of "i", because the affinity expression is "&p1[i]"
(2) What is the distance of p2[i+1] from the affinity expression? It depends on what p2 points to at each call site.

3.4 Optimizing Shared Accesses through Pointers

. Key insight from the prior example: privatization needs to be sensitive to different calling contexts (call site 1 vs. call site 2)
. The approach we implemented is to create different "versions" of the parallel loop based on the value of a variable (ver_p2 below)

Shared accesses via pointers-to-shared:

  shared [BF] int A[N], B[N];

  void foo(shared [BF] int *p1,
           shared [BF] int *p2) {
      upc_forall (int i=0; i …
  }

Optimized (versioned) parallel loop:

  void foo(shared [BF] int *p1,
           shared [BF] int *p2) {
      _Bool ver_p2 = (upc_phaseof(p2)==0) ? 1 : 0;
      if (ver_p2) {
          // p2 points to the first element in a block
          upc_forall (int i=0; i …
      } else {
          …
      }
  }

Research Directions

. Advanced Prefetching Optimizations
  – Applicable to for loops as well as upc_forall loops
  – Hybrid in nature:
    • Collection phase: shared accesses in the loop are described to the runtime system
    • Scheduling phase: the runtime analyzes the access patterns, coalesces related accesses and prefetches them
    • Execution phase: the prefetched buffers are consumed in the loop
  – Applicable in a dynamic execution environment (number of threads not known at compile time)

[Chart: throughput (MB/sec) vs. number of threads (2, 4, 8, 16, 32) for A[i] = B[i+1] + C[i+1], 20 million iterations, 228 MB, POWER7 hardware (1 node), RHEL6 — the prefetched version achieves roughly a 15X speedup over the -O3 baseline]

Research Directions

. Advanced Runtime Diagnostics
  – The compiler instruments the UPC application
  – Errors are detected while the application runs

UPC program (coll_diag.upc):

   1 #include <upc.h>
   2 #include <upc_collective.h>
   3 #define BF 3
   4
   5 shared [BF] int A[BF * THREADS], *ptr;
   6 shared [BF*THREADS] int B[THREADS][BF*THREADS];
   7 shared int sh_num_bytes;
   8
   9 int main() {
  10     if (sh_num_bytes >= 0) ptr = &A[2*BF];
  11     else ptr = A;
  12     upc_all_gather_all(B, ptr, sizeof(int)*BF,
                            UPC_IN_ALLSYNC|UPC_OUT_ALLSYNC);
  13     return 0;
  14 }

[Diagram: the UPC compiler (code analysis, code optimizer, code generator, diagnostic API interface, UPC checker, enabled with -qcheck) produces an instrumented executable; the PGAS runtime diagnostician reports errors at run time]

  % xlupc -qcheck coll_diag.upc -qupc=threads=4
  % ./a.out -procs 4
  1588-226 Invalid target affinity detected at line 12 in file 'coll_diag.upc'.
  Parameter 2 of function 'upc_all_gather_all' should have affinity to thread 0.

Research Directions

. UPC and MPI interoperability:

  #include <upc.h>
  #include "mpi.h"

  extern int xlpgas_tsp_thrd_id (void);

  shared int BigArray[THREADS];

  int main(int argc, char ** argv) {
      // We are in a UPC program, but we can initialize MPI:
      int mpirank, mpisize, result;
      if (xlpgas_tsp_thrd_id()==0) MPI_Init (&argc, &argv);
      upc_barrier;

      MPI_Comm_rank (MPI_COMM_WORLD, &mpirank);
      MPI_Comm_size (MPI_COMM_WORLD, &mpisize);

      // let's dump our MPI and UPC identification:
      printf ("I'm UPC thread %d of %d and MPI task %d of %d\n",
              MYTHREAD, THREADS, mpirank, mpisize);

      // let's do something fun with MPI.
      BigArray[MYTHREAD] = MYTHREAD + 1;
      int * localBigArray = (int *) &BigArray[MYTHREAD];

      MPI_Allreduce(localBigArray, &result, 1,
                    MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD);

      if (MYTHREAD==0) printf ("The sum = %d\n", result);

      upc_barrier;
      if (xlpgas_tsp_thrd_id()==0) MPI_Finalize();
  }

  % ./a.out -msg_api pgas,mpi -procs 4
  I'm UPC thread 0 of 4 and MPI task 0 of 4
  I'm UPC thread 1 of 4 and MPI task 1 of 4
  I'm UPC thread 2 of 4 and MPI task 2 of 4
  I'm UPC thread 3 of 4 and MPI task 3 of 4
  The sum = 10


Thank You !!!

Ettore Tiotto IBM Toronto Laboratory [email protected]
