PGAS: Partitioned Global Address Space Languages
Coarray Fortran (CAF) / Unified Parallel C (UPC)
Dr. R. Bader, Dr. A. Block
May 2010, © 2010 LRZ

Applying PGAS to classical HPC languages

Design target for PGAS extensions: the smallest changes required to convert Fortran and C into robust and efficient parallel languages
- add only a few new rules to the languages
- provide mechanisms that allow explicitly parallel execution:
  - SPMD style programming model
  - data distribution: partitioned memory model
  - synchronization vs. race conditions
  - memory management for dynamic sharable entities

Standardization efforts:
- Fortran 2008 draft standard (now in DIS stage, publication targeted for August 2010)
- separately standardized C extension (work in progress; the existing document is somewhat informal)

Execution model: UPC threads / CAF images

Going from single to multiple execution contexts: the single program is replicated a fixed number of times.
- the number of replicates (CAF: images) is set at compile time or at execution time
- asynchronous execution: loose coupling unless program-controlled synchronization occurs
- each replicate has its own separate set of entities
- program-controlled exchange of data may necessitate synchronization
- terminology: UPC uses the term "thread" where CAF has "images"; CAF images are numbered starting at 1, UPC uses zero-based counting

Execution model: Resource mappings

- One-to-one: each image/thread is executed by a single physical processor core
- Many-to-one: some (or all) images/threads are executed by multiple cores each (e.g., a socket could support OpenMP multi-threading within an image)
- One-to-many: fewer cores are available to the program than images/threads
  - scheduling issues
  - typically useful only for algorithms which do not require the bulk of the CPU resources on one image
- Many-to-many

Note: the startup mechanism and the resource assignment method are implementation-dependent.

Simplest possible program

CAF provides intrinsic integer functions for orientation; this_image() returns a value between 1 and num_images():

  program hello
    implicit none
    write(*, '(''Hello from image '',i0,'' of '',i0)') &
      this_image(), num_images()
  end program

If multiple images/threads are used, the output is non-repeatably unsorted.

UPC uses integer expressions for the same purpose; MYTHREAD lies between 0 and THREADS - 1:

  #include <upc.h>
  #include <stdlib.h>
  #include <stdio.h>

  int main(void) {
    printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);
    return 0;
  }

PGAS memory model

All entities belong to one of two classes:
- global (shared) entities in partitioned global memory: objects declared on and physically assigned to one image/thread, but accessible from any other one via a global memory address (e.g., 128 bit). The term "shared" has different semantics than in OpenMP (especially for CAF); this allows implementations for distributed memory systems.
- local (private) entities: only accessible to the image/thread which "owns" them; this is what we get from conventional language semantics. Purely local accesses (not explicitly shown in the figure) are fastest.
(Figure: four images, each with private entities x in local physical memory and a shared segment s1 ... s4; a shared object may be accessed from any image, while accessing another image's private x is impossible.)
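As a minimal, self-contained illustration of the two classes of entities (not taken from the slides), the following CAF sketch places a private variable next to a scalar coarray and reads the right-hand neighbour's copy through the global address space; the variable names and the neighbour rule are assumptions made for this example.

  program memory_model_sketch
    implicit none
    integer :: private_val      ! local (private) entity: one independent copy per image
    integer :: shared_val[*]    ! coarray: this image's copy is remotely addressable
    integer :: right

    private_val = this_image()  ! purely local access (fastest)
    shared_val  = this_image()  ! local access to this image's part of the coarray

    sync all                    ! make the definitions visible before any remote read

    ! cyclic right-hand neighbour; the coindexed read crosses the partition boundary
    right = merge(1, this_image() + 1, this_image() == num_images())
    write(*,'(a,i0,a,i0)') 'Image ', this_image(), &
         ' sees neighbour value ', shared_val[right]
  end program memory_model_sketch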
Declaration of coarrays/shared entities (simplest case)

CAF: coarray notation with an explicit indication of location (the coindex); symmetry is enforced (asymmetric data must use derived types, see later):

  integer, codimension[*] :: a(3)
  integer :: a(3)[*]              ! equivalent form

UPC: uses the shared attribute; locality is implicit, with various blocking strategies; asymmetry is possible (threads may have an uneven share of data):

  shared [1] int A[10];
  shared int A[10];               /* the same: the default block size is 1 */

Resulting layout with four images/threads:

  CAF    image   1        2        3        4
                 A(1)[1]  A(1)[2]  A(1)[3]  A(1)[4]
                 A(2)[1]  A(2)[2]  A(2)[3]  A(2)[4]
                 A(3)[1]  A(3)[2]  A(3)[3]  A(3)[4]

  UPC    thread  0        1        2        3
                 A[0]     A[1]     A[2]     A[3]
                 A[4]     A[5]     A[6]     A[7]
                 A[8]     A[9]

With more images, additional coindex values appear; with more threads, elements move (e.g., A[4] moves to another physical memory).

UPC shared data: variations on blocking

General syntax for a one-dimensional array:

  shared [blocksize] type var_name[totalsize];

Scalars and multi-dimensional arrays are also possible. Values for blocksize:
- omitted: the default value is 1
- an integer constant (maximum value UPC_MAX_BLOCK_SIZE)
- [*]: one block on each thread, as large as possible; the block size depends on the number of threads
- [] or [0]: all elements on one thread

Some examples:

  shared [N] float A[N][N];       /* complete matrix rows on each thread (>= 1 per thread) */

  shared [*] int A[THREADS][3];   /* storage sequence matches the rank-1 coarray from the
                                     previous slide (symmetry restored); a static THREADS
                                     environment may be required (compile-time determination
                                     of the number of threads) */

CAF shared data: coindex to image mapping

Coarrays may be scalars and/or have corank > 1:

  integer :: s[-2:*]
  real    :: z(10,10)[0:3,3:*]

Here -2 is the lower cobound of codimension 1; the upper cobound of the last codimension is always determined by the number of images (with four images, s exists as S[-2], S[-1], S[0], S[1] on images 1 to 4).

The intrinsic functions

  lcobound(coarray [,dim,kind])
  ucobound(coarray [,dim,kind])

query the lower and upper cobounds; they return an integer array, or a scalar if the optional dim argument is present. Example: write(*,*) lcobound(z) will produce the output 0 3.

Mapping to the image index (10 images available here, coshape = (/4,3/)): images are assigned to coindex tuples in a ragged rectangular pattern:

  coindex    [.,3]  [.,4]  [.,5]
  [0,.]        1      5      9
  [1,.]        2      6     10
  [2,.]        3      7      -
  [3,.]        4      8      -

For example, z(:,:)[2,4] refers to image 7, while z(:,:)[3,5] is invalid. A valid coindex reference is the programmer's responsibility (see later for additional intrinsic support).

Accessing shared data

Simplest example, a collective scatter: each thread/image executes one statement implying data transfer; q has the same value on all images/threads, a is shared, b is private.

CAF syntax:

  integer :: b(3), a(3)[*]
  b = a(:)[q]

UPC syntax (remember the enforced symmetry):

  int b[3];
  shared [*] int a[THREADS][3];
  for (i = 0; i < 3; i++) {
    b[i] = a[q][i];
  }

The semantics are one-sided: b = a (from image/thread q), a "pull". Note: the initializations of a and q are omitted, and there is a catch here (see the sketch below).
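To make the one-sided "pull" and its catch concrete, here is a hedged, self-contained CAF sketch (not from the slides); the initial values and the choice q = 1 are illustrative assumptions. Without the sync all, an image might pull a before image q has defined it, which is exactly the race condition discussed below.

  program pull_sketch
    implicit none
    integer :: b(3), a(3)[*]
    integer :: q

    q = 1                         ! source image, same value on every image (assumed)
    a = this_image() * [1, 2, 3]  ! each image defines its own part of the coarray

    sync all                      ! the catch: a must be defined on image q before anyone pulls

    b = a(:)[q]                   ! one-sided "pull" of image q's data into the private array b
    write(*,'(a,i0,a,3(i0,1x))') 'Image ', this_image(), ' pulled ', b
  end program pull_sketch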
Locality control

CAF: local accesses are fastest and trivial to write:

  integer :: a(3)[*]
  a(:) = (/ … /)

This is the same as a(:)[this_image()], but probably faster. Coarray vs. coindexed object: an explicit coindex is usually a visual indication of communication. Supporting intrinsics:

  real    :: z(10,10)[0:3,3:*]
  integer :: cindx(2), m1, img

  cindx = this_image(z)            ! on image 7, returns (/2,4/)
  m1    = this_image(z,1)          ! on image 7, returns 2
  img   = image_index(z, (/2,4/))  ! on all images, returns 7
  img   = image_index(z, (/2,5/))  ! on any image, returns 0 (invalid coindex)

UPC: the implicit locality makes cross-thread accesses easier to write; ensuring local access may require an explicit mapping of the array index to the thread (see the next slide). Supporting intrinsics:

  shared [B] int A[N];
  size_t img, pos;

  img = upc_threadof(&A[4*B+2]);   /* on any thread, returns 4 */
  pos = upc_phaseof(&A[4*B+2]);    /* on any thread, returns 2 */

(The block A[4*B] ... A[5*B-1] resides on thread 4; the phase is the position 0 ... B-1 of an element within its block.)

Work sharing + avoiding non-local accesses

Typical case: loop or iterator execution. In CAF, index transformations between local and global indices are needed:

  integer :: a(nlocal)[*]
  do i = 1, nlocal
    j = …        ! global index
    a(i) = …
  end do

In UPC, one can loop over everything and work on a subset:

  shared int a[N];
  for (i = 0; i < N; i++) {
    if (i%THREADS == MYTHREAD) {
      a[i] = … ;
    }
  }

but the conditional may be inefficient, and the cyclic distribution may be slow. upc_forall integrates the affinity with the loop construct:

  shared int a[N];
  upc_forall (i = 0; i < N; i++; i) {
    a[i] = … ;
  }

The fourth clause is the affinity expression:
- an integer: the iteration executes if i%THREADS == MYTHREAD
- a global address: the iteration executes if upc_threadof(…) == MYTHREAD (in the example above, "i" could be replaced with "&a[i]")

Race conditions - need for synchronization

  integer :: b(3), a(3)[*]
  a = c
  b = a(:)[q]

Serial semantics: a fixed execution sequence. Parallel semantics (CAF terminology): relaxed consistency; focusing on an image pair q, p, the segments of p and q are unordered, so three scenarios are possible and explicit synchronization by the user is required to prevent races. If the ordering is imposed by the algorithm so that the reference happens after the update (RaU, "reference after update"), the result is correct. A global barrier enforces the ordering of segments on all images:

  CAF:                      UPC:
    a = c                     for (…) a[n][i] = …;
    sync all                  upc_barrier;
    b = a(:)[q]               for (…) b[i] = a[q][i];

UPC: Consistency modes

How are shared entities accessed?
- relaxed mode: the program assumes that there are no concurrent accesses from different threads
- strict mode: the program ensures that accesses from different threads are separated, and prevents code movement across these implicit barriers
- relaxed is the default; strict may carry a large performance penalty

Options for selecting the synchronization mode:
- variable level (at declaration), here for a spin lock (q has the same value on thread p as on thread q):

    strict  shared int flag = 0;
    relaxed shared [*] int c[THREADS][3];

    /* thread q */          /* thread p */
    c[q][i] = …;            while (!flag) {…};
    flag = 1;               … = c[q][j];

- code section level:

    {  // start of block
       #pragma upc strict
       …  // block statements
    }  // return to default mode

- program level:

    #include <upc_strict.h>   // or upc_relaxed.h

A consistency mode given on a variable declaration overrides the code section or program level specification.

CAF: partial synchronization

Synchronizing subsets:

  if (this_image() < 3) then
    sync images ( (/ 1, 2 /) )
  end if

All images against one:

  if (this_image() == 1) then
    : ! send data
    sync images ( * )
  else
    ! images 2 etc.

Critical regions: only one image at a time executes the region; the order is unspecified:

  critical
    : ! statements in region
  end critical
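The "all images against one" fragment above breaks off at the else branch. As a hedged completion of that pattern (an illustrative sketch with assumed variable names and values, not the slide's original continuation), the remaining images would typically synchronize against image 1 only and then read its data:

  program partial_sync_sketch
    implicit none
    integer :: setting[*]

    if (this_image() == 1) then
      setting = 42             ! image 1 produces a value (42 is an arbitrary choice)
      sync images ( * )        ! image 1 synchronizes against every other image
    else
      sync images ( 1 )        ! images 2, 3, ... wait for image 1 only
      setting = setting[1]     ! afterwards the pull from image 1 is race-free
    end if

    write(*,'(a,i0,a,i0)') 'Image ', this_image(), ' has setting ', setting
  end program partial_sync_sketch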