Frontiers of HPC: Unified Parallel C

David McCaughan, HPC Analyst
SHARCNET, University of Guelph
[email protected]

Some of this material is derived from notes written by Behzad Salami (M.Sc. U. of Guelph)

Overview
• Parallel programming models are not in short supply
  – methodology dictated by underlying hardware organization
    • shared memory systems (SMP)
    • distributed memory systems (cluster)
  – trade-offs in ease of use; complexity
• Unified Parallel C:
  – an open standard for a uniform programming model
  – distributed shared memory
  – fundamentals
  – a new set of trade-offs; is it worth it?

Thanks for the memory!
• Traditional parallel programming abstractly takes one of two forms, depending on how the program is able to reference memory
• MIMD Models: Multiple Instruction Multiple Data
  – Shared memory
    • processors share a single physical memory
    • programs can share blocks of memory between them
    • issues: exclusive access, race conditions, synchronization, scalability
  – Distributed memory
    • unique memory associated with each processor
    • issues: communication is explicit, communication overhead

Shared Memory Model
[Figure: Shared Memory Programming. Three CPUs reference a single shared memory holding b: 5, c: 2, x: 5 and f(). CPU 1 runs foo() (a := b+c), CPU 2 runs bar() (if x > b), CPU 3 runs main() (call f).]

Shared Memory Programming
• Shared memory programming benefits from the ability to handle communication implicitly
  – using the shared memory space
  – fundamentals of programming in SMP environments are relatively straightforward
    • issues typically revolve around exclusive access and race conditions
• Common SMP programming paradigms:
  – POSIX threads (pthreads)
  – OpenMP

Hello, world! (pthreads)
    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>

    void *output(void *);

    int main(int argc, char *argv[])
    {
        int n = (argc > 1) ? atoi(argv[1]) : 1;  /* number of threads */
        if (n < 1)
            n = 1;
        pthread_t thread[n];
        int id[n];  /* one id per thread, so no thread reads the changing loop counter */

        for (int i = 0; i < n; i++) {
            id[i] = i;
            pthread_create(&thread[i], NULL, output, &id[i]);
        }
        for (int i = 0; i < n; i++)  /* wait, or main may exit before the threads print */
            pthread_join(thread[i], NULL);
        return(0);
    }

    void *output(void *thread_num)
    {
        printf("Hello, world! from thread %d\n", *(int *)thread_num);
        return NULL;
    }

Distributed Memory Programming
• Communication is handled explicitly
  – processes send data to one another over an interconnection network
  – communication overhead limits granularity of parallelism
  – conforms to the strengths of traditional computing hardware so scalability can be excellent
• The past is littered with the corpses of distributed programming models
  – the modern standard: MPI

Distributed Memory Model
[Figure: Distributed Memory Programming. Three CPUs, each with its own memory (b: 5, c: 2; x: 5; f()), run foo() (a := b+c), bar() (if x > 1) and main() (call f) respectively.]

Hello, World! (MPI)
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Process %d of %d\n", rank, size);
        MPI_Finalize();
        return(0);
    }

SPMD
• Single Program Multiple Data (SPMD)
  – special case of MIMD model
  – many processors executing the same program
    • conditional branches used where specific behaviour required on the processors
  – shared/distributed memory organization
• MPI and UPC explicitly use this model

SPMD Model
[Figure: SPMD illustrating conditional branching to control divergent behaviour. CPU 1, CPU 2 and CPU 3 each run the same main(), which sets b = 5 if running on cpu_1 and b = 3 otherwise; the result is b: 5 on CPU 1 and b: 3 on CPUs 2 and 3, with c: 2 on all three.]

What is UPC?
• C language extensions for HPC programming on large scale parallel systems
  – attempts to provide a single, uniform programming model for SMP and cluster-based machines
  – superset of C (any C program is automatically a valid UPC program)
• Explicitly parallel SPMD model
  – the same program runs on all processors
  – Global Address Space (GAS) language; an attempt to balance
    • convenience (threads)
    • performance (MPI, data layout)

What is UPC? (cont.)
• Single shared address space
  – variables can be accessed by any process, but each is physically associated with one
    • hugely convenient from the programmer's perspective
    • what implications does this have?
  – most of UPC's complexity comes from the way it handles pointers and shared memory allocations
• Front-end/back-end organization allows for great flexibility in implementation
  – high speed interconnect, SMP, MPI, etc.

Hello, World! (UPC)
    #include <stdio.h>
    #include <upc.h>

    int main()
    {
        printf("Hello, world! from UPC thread %d of %d\n",
               MYTHREAD, THREADS);
        return(0);
    }
• Note (Hello, World!): even for the simplest of examples, the implicit availability of the THREADS and MYTHREAD variables reduces code volume dramatically (compared with pthreads or MPI)
• Well suited to parallelizing existing serial applications

Major Features: Basics
• Multi-processing abstractly modeled as threads
  – pre-defined variables THREADS, MYTHREAD available at run-time
• new keyword: shared
  – defines variables available across all processes
  – affinity (physical location of data) can also be specified
    • scalars (affinity to process 0)
    • cyclic (per element)
    • block-cyclic (user-defined block sizes)
    • blocked (run-time contiguous affinity for "even" distribution)
  – can make affinity inquiries at run-time

Major Features: Advanced
• Synchronization
  – locks (upc_lock_t)
  – barrier (upc_barrier, upc_notify, upc_wait)
  – memory fence (upc_fence)
• User-controlled consistency models
  – per-variable, per-statement
  – strict
    • all strict references within the same thread appear in program order to all processes
  – relaxed
    • all references within the same thread appear in program order to the issuing process

UPC and Memory
• Partitioned Global Address Space
  – private memory: local to a process
  – shared memory: partitioned over the address space of all processors
    • affinity refers to the physical location of data otherwise visible to all processes
[Figure: the shared space is partitioned across threads T0 to T4, each of which also has its own private memory.]

Variables
• Private
  – C-style declarations (i.e. the default)
  – e.g. int foo;
  – one instance per thread
• Shared
  – e.g. shared int bar;
  – one instance for all threads
  – shared array elements can be distributed across threads
    • shared int foo[THREADS] – 1 element per thread
    • shared int bar[10][THREADS] – 10 elements per thread

Example: private scalar
    #include <stdio.h>
    #include <upc.h>

    int main()
    {
        int a;

        a = 0;
        a++;
        return(0);
    }
[Figure: a private a = 0 on each of T0, T1 and T2.]
• NOTE: serial farming in a box; exactly like running multiple copies of the same program; no use of shared memory resources at all

Example: shared scalar
    #include <stdio.h>
    #include <upc.h>

    int main()
    {
        int a;
        static shared int b;

        a = b = 0;
        a++;
        b++;    /* caution: race conditions */
        return(0);
    }
[Figure: one shared b, and a private a on each of T0, T1 and T2.]
• NOTE: scalar affinity is to Thread_0; shared variables must be statically allocated (global data segment)

Example: shared scalar (using <upc_relaxed.h>)
    #include <stdio.h>
    #include <upc_relaxed.h>

    int main()
    {
        int a;
        static shared int b;

        a = b = 0;
        a++;
        b++;    /* caution: race conditions */
        return(0);
    }
[Figure: one shared b, and a private a on each of T0, T1 and T2.]
• NOTE: scalar affinity is to Thread_0; shared variables must be statically allocated (global data segment)

Example: shared arrays
    …
    static shared int a[4][THREADS];
    …
• assuming this is run with 3 threads, this memory will be organized as shown

              T0         T1         T2
    shared    a[0][0]    a[0][1]    a[0][2]
              a[1][0]    a[1][1]    a[1][2]
              a[2][0]    a[2][1]    a[2][2]
              a[3][0]    a[3][1]    a[3][2]

Parallel Loop: upc_forall
• UPC defined loop (similar to the serial for loop)
    upc_forall(init; condition; post; affinity)
• affinity expression dictates which thread executes a given iteration of the loop
  – pointer to shared type: upc_threadof(affinity) == MYTHREAD
  – integer: affinity % THREADS == MYTHREAD
  – continue: every thread executes every iteration
  – upc_forall loops can be nested: careful

Exercise: To Affinity and Beyond
The purpose of this exercise is to allow you to explore your understanding of UPC shared declarations and memory affinity

Exercise
1) The sharedarray.c file in ~dbm/public/exercises/UPC implements a working demonstration of a shared allocation of a 2-D array of integers and the output of its contents by processor affinity (using the upc_forall loop)
2) Ensure that you understand the issue of processor affinity by changing the initialization of the array so that it too occurs in parallel, in the thread with affinity to that section of the array

Pointers
• Declaration:
    shared int *a;
  – a is a pointer to an integer that lives in the shared memory space
  – we refer to the type of a as a pointer to shared

A New Dimension for Pointers
    int *ptr1;                  /* private pointer */
    shared int *ptr2;           /* private pointer to shared */
    int *shared ptr3;           /* shared pointer to private */
    shared int *shared ptr4;    /* shared pointer to shared */
[Figure: ptr3 and ptr4 reside in shared memory; each thread holds its own private ptr1 and ptr2.]

Dynamic Memory Allocation
    shared void *upc_all_alloc(size_t nblocks, size_t nbytes);
  – allocates nblocks * nbytes bytes of data across all threads
  – returns a pointer to the first element of the shared block of memory for all threads
  – collective operation: called by all threads
  – e.g. allocate 25 elements per thread
        shared [5] int *ptr1;    /* note use of block size */
        …
        /* 5*THREADS blocks of 5 ints each: 5 blocks (25 elements) per thread */
        ptr1 = (shared [5] int *)upc_all_alloc(5*THREADS, 5*sizeof(int));

Dynamic Memory Allocation (cont.)
    shared void *upc_alloc(size_t nbytes);
  – allocates nbytes bytes of data in the shared memory space, only on the calling thread (i.e. affinity issue)
  – returns a pointer to the first element of that shared block of memory

Dynamic Memory Allocation (cont.)
    shared void *upc_global_alloc(size_t nblocks, size_t nbytes);
  – allocates nblocks * nbytes bytes of data across all threads
  – initializes shared pointers on all threads
  – only called by one thread, but pointer will be initialized on all threads (i.e.
