
UPC: A Portable High Performance Dialect of C

Kathy Yelick
Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Wei Tu, Mike Welcome

Unified Parallel C at LBNL/UCB

Parallelism on the Rise

[Figure: aggregate systems performance and single-CPU performance, 1980-2010, for machines such as the Earth Simulator, ASCI Q, ASCI White, ASCI Red, T3D/T3E, Paragon, CM-5, and the NEC SX and Fujitsu VPP series; CPU frequencies rise from 10 MHz toward 10 GHz while aggregate performance climbs toward 1 PFLOPS.]

• 1.8x annual performance increase
  - 1.4x from improved technology and on-chip parallelism
  - 1.3x in processor count for larger machines

Parallel Programming Models

• Parallel software is still an unsolved problem!
• Most parallel programs are written using either:
  - Message passing with a SPMD model
    - for scientific applications; scales easily
  - Shared memory with threads in OpenMP, Threads, or Java
    - non-scientific applications; easier to program
• Partitioned Global Address Space (PGAS) languages
  - global address space like threads (programmability)
  - SPMD parallelism like MPI (performance)
  - local/global distinction, i.e., layout matters (performance)

Partitioned Global Address Space Languages

• Explicitly-parallel programming model with SPMD parallelism
  - Fixed at program start-up, typically one thread per processor
• Global address space model of memory
  - Allows programmer to directly represent distributed data structures
• Address space is logically partitioned
  - Local vs. remote memory (two-level hierarchy)
• Programmer control over performance-critical decisions
  - Data layout and communication
• Performance transparency and tunability are goals
  - Initial implementation can use fine-grained shared memory
• Base languages differ: UPC (C), CAF (Fortran), Titanium (Java)

UPC Design Philosophy

• Unified Parallel C (UPC) is:
  - An explicit parallel extension of ISO C
  - A partitioned global address space language
  - Sometimes called a GAS language
• Similar to the C language philosophy
  - Concise and familiar syntax
  - Orthogonal extensions of semantics
  - Assume programmers are clever and careful
  - Give them control; possibly close to hardware
  - Even though they may get into trouble
• Based on ideas in Split-C, AC, and PCP

A Quick UPC Tutorial

Virtual Machine Model

[Figure: threads 0..n each own a partition of the shared global address space holding X[0], X[1], ..., X[P], plus a private address space; a global pointer in any thread may reference any shared partition.]

• Global address space abstraction
  - Shared memory is partitioned over threads
  - Shared vs. private memory partition within each thread
  - Remote memory may stay remote: no automatic caching implied
  - One-sided communication through reads/writes of shared variables
• Build data structures using
  - Distributed arrays
  - Two kinds of pointers: local vs. global pointers ("pointers to shared")

UPC Execution Model

• Threads work independently in a SPMD fashion
  - Number of threads given by THREADS, set as a compile-time or runtime flag
  - MYTHREAD specifies thread index (0..THREADS-1)
  - upc_barrier is a global synchronization: all wait
• Any legal C program is also a legal UPC program

    #include <upc.h>   /* needed for UPC extensions */
    #include <stdio.h>

    main() {
      printf("Thread %d of %d: hello UPC world\n", MYTHREAD, THREADS);
    }

Private vs. Shared Variables in UPC

• C variables and objects are allocated in the private memory space
• Shared variables are allocated only once, in thread 0's space

    shared int ours;
    int mine;

• Shared arrays are spread across the threads

    shared int x[2*THREADS];      /* cyclic, 1 element each, wrapped */
    shared int [2] y[2*THREADS];  /* blocked, with block size 2 */

• Shared variables may not occur in a function definition unless static

[Figure: ours, x, and y live in the shared space — x[0], x[n+1] and y[0,1] with thread 0; x[1], x[n+2] and y[2,3] with thread 1; ...; x[n], x[2n] and y[2n-1,2n] with thread n — while each thread holds its own private mine.]

Work Sharing with upc_forall()

• Owner computes with a cyclic layout, written by hand:

    shared int v1[N], v2[N], sum[N];
    void main() {
      int i;
      for (i = 0; i < N; i++)
        if (MYTHREAD == i % THREADS)
          sum[i] = v1[i] + v2[i];
    }

• This owner-computes idiom is common, so UPC has

    upc_forall(init; test; loop; affinity)
      statement;

• The same loop with upc_forall (the affinity expression &v1[i] means the owner of v1[i] does the work; the integer i would also work):

    shared int v1[N], v2[N], sum[N];
    void main() {
      int i;
      upc_forall(i = 0; i < N; i++; &v1[i])
        sum[i] = v1[i] + v2[i];
    }

• Programmer indicates the iterations are independent
  - Undefined if there are dependencies across threads
  - Affinity expression indicates which iterations to run
    - Integer: affinity%THREADS is MYTHREAD
    - Pointer: upc_threadof(affinity) is MYTHREAD

Memory Consistency in UPC

• Shared accesses are strict or relaxed, designated by:
  - A pragma affects all otherwise unqualified accesses
    - #pragma upc relaxed
    - #pragma upc strict
    - Usually done by including standard .h files with these
  - A type qualifier in a declaration affects all accesses
    - int strict shared flag;
  - A strict or relaxed cast can be used to override the current pragma or declared qualifier.
• Informal semantics
  - Relaxed accesses must obey dependencies, but non-dependent accesses may appear reordered to other threads
  - Strict accesses appear in order: sequentially consistent

Other Features of UPC

• Synchronization constructs
  - Global barriers
    - Variant with labels to document matching of barriers
    - Split-phase variant (upc_notify and upc_wait)
  - Locks
    - upc_lock, upc_lock_attempt, upc_unlock
• Collective communication library
  - Allows for asynchronous entry/exit

    shared [] int A[10];
    shared [10] int B[10*THREADS];
    // Initialize A.
    upc_all_broadcast(B, A, sizeof(int)*NELEMS,
                      UPC_IN_MYSYNC | UPC_OUT_ALLSYNC);

• Parallel I/O library

The Berkeley UPC Compiler

Goals of the Berkeley UPC Project

• Make UPC ubiquitous on
  - Parallel machines
  - Workstations and PCs for development
  - A portable compiler: for future machines too
• Components of research agenda:
  1. Runtime work for Partitioned Global Address Space (PGAS) languages in general
  2. Compiler optimizations for parallel languages
  3. Application demonstrations of UPC

Berkeley UPC Compiler

[Figure: compiler flow — UPC source enters Open64 (multiple front-ends, including gcc) as the Higher WHIRL intermediate form; optimizing transformations lower it to Lower WHIRL; the C back-end emits C plus UPC Runtime calls as "portable assembly" (an IA64 back-end is possible in the future); the runtime handles the pointer representation and shared/distributed memory, with communication through the language-independent GASNet layer on IA64, MIPS, ...]

• Compiler based on Open64; intermediate form called WHIRL
• Multiple front-ends, including gcc
• Current focus on C back-end; IA64 possible in future
• UPC Runtime: pointer representation, shared/distributed memory
• Communication in GASNet: portable and language-independent

Optimizations

• In Berkeley UPC compiler
  - Pointer representation
  - Generating optimizable single-processor code
  - Message coalescing (aka vectorization)
• Opportunities
  - forall loop optimizations (unnecessary iterations)
  - Irregular data set communication (Titanium)
  - Sharing inference
  - Automatic relaxation analysis and optimizations

Pointer-to-Shared Representation

• UPC has three different kinds of pointers:
  - Block-cyclic, cyclic, and indefinite (always local)
• A pointer needs a "phase" to keep track of where it is in a block
  - Source of overhead for updating and dereferencing
  - Consumes space in the pointer: Address | Thread | Phase
• Our runtime has special cases for:
  - Phaseless (cyclic and indefinite) – skip phase update
  - Indefinite – skip thread id update
  - Some machine-specific special cases for some memory layouts
• Pointer size/representation easily reconfigured
  - 64 bits on small machines, 128 on large; word or struct

Performance of Pointers to Shared

[Figure: two charts in cycles (1.5 ns/cycle) — cost of pointer-to-shared operations (ptr + int and ptr == ptr, HP vs. Berkeley, for generic, cyclic, and indefinite pointers) and cost of shared remote reads and writes (HP vs. Berkeley).]

• Phaseless pointers are an important optimization
  - Indefinite pointers almost as fast as regular C pointers
  - General block-cyclic pointer 7x slower for addition
• Competitive with the HP compiler, which generates native code
  - Both compilers have improved since these were measured

Generating Optimizable (Vectorizable) Code

[Figure: Livermore Loops kernels 1-24, ratio of C time to UPC time, clustered around 1.0 (roughly 0.9 to 1.15).]

• Translator-generated C code can be as efficient as original C code
• Source-to-source translation is a good strategy for portable PGAS language implementations

NAS CG: OpenMP Style vs. MPI Style

[Figure: NAS CG performance, MFLOPS per second per thread at 2, 4, 8, and 12 threads (SSP mode, two nodes), comparing UPC (OpenMP style), UPC (MPI style), and MPI Fortran.]

• GAS language outperforms MPI+Fortran (flat is good!)
• Fine-grained (OpenMP style) version still slower
  - Shared memory programming style leads to more overhead (redundant boundary computation)
• GAS languages can support both programming styles