Bulk Synchronous and SPMD Programming

CS315B Lecture 2

The Bulk Synchronous Model

Bulk Synchronous Model

• A model
  – An idealized machine

• Originally proposed for analyzing parallel algorithms
  – Leslie Valiant, "A Bridging Model for Parallel Computation", 1990

The Machine

(figure: the idealized machine model)

Computations

• A sequence of supersteps:

• Repeat:
  – All processors do local computation
  – Barrier
  – All processors communicate
  – Barrier

What are some properties of this machine model?


Basic Properties

• Uniform
  – compute nodes
  – communication costs
• Separate communication and computation

• Synchronization is global

What are properties of this computational model?

The Idea

• Programs are
  – written for v virtual processors
  – run on p physical processors
• If v >= p log p then
  – Managing memory, communication and synchronization can be done automatically within a constant factor of optimal

How Does This Work?

• Roughly
  – Memory addresses are hashed to a random location in the machine
  – Guarantees that on average, memory accesses have the same cost
  – The extra log p factor of threads are multiplexed onto the p processors to hide the latency of memory requests
  – The processors are kept busy and do no more compute than necessary


Terminology

• SIMD – Single Instruction, Multiple Data
• SPMD – Single Program, Multiple Data

SPMD


SIMD = Vector Processing

if (factor == 0)
    factor = 1.0
A[1..N] = B[1..N] * factor;
j += factor;

Picture

if (factor == 0)
    factor = 1.0
A[1] = B[1] * factor    A[2] = B[2] * factor    A[3] = B[3] * factor    . . .
j += factor


Comments

• Single thread of control
  – Global synchronization at each program instruction
• Can exploit fine-grain parallelism
  – Assumption of hardware support

SPMD = Single Program, Multiple Data

SIMD:                             SPMD:
if (factor == 0)                  if (factor == 0)
    factor = 1.0                      factor = 1.0
A[1..N] = B[1..N] * factor;       A[myid] = B[myid] * factor;
j += factor;                      j += factor;
                                  …


Picture

if (factor == 0)          if (factor == 0)
    factor = 1.0              factor = 1.0
A[1] = B[1] * factor      A[2] = B[2] * factor      . . .
j += factor               j += factor

Comments

• Multiple threads of control
  – One (or more) per processor
• Asynchronous
  – All synchronization is programmer-specified
• Threads are distinguished by myid
• Choice: Are variables local or global?
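In practice the per-thread myid above comes from the runtime. A minimal sketch (not from the slides) of how an SPMD process obtains its id with MPI; the names myid and nprocs are just illustrative:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int myid, nprocs;
    MPI_Init(&argc, &argv);                  /* every process runs this same program */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);    /* unique id distinguishes the threads  */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of processes            */
    printf("hello from %d of %d\n", myid, nprocs);
    MPI_Finalize();
    return 0;
}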


Comparison

• SIMD
  – Designed for tightly-coupled, synchronous hardware
  – i.e., vector units
• SPMD
  – Designed for clusters
  – Too expensive to synchronize every statement
  – Need a model that allows asynchrony

MPI

• Interface
  – A widely used standard
  – Runs on everything
• A runtime system
• Most popular way to write SPMD programs


MPI Programs

• Standard sequential programs
  – All variables are local to a thread

• Augmented with calls to the MPI interface
  – SPMD model
  – Every thread has a unique identifier
  – Threads can send/receive messages
  – Synchronization primitives

MPI Point-to-Point Routines

• MPI_Send(buffer, count, type, dest, …)
• MPI_Recv(buffer, count, type, source, …)
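A minimal sketch (not from the slides) of these two routines in use: rank 0 sends one double to rank 1. It assumes the program is launched with at least two processes; the variable names are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int id;
    double x = 3.14, y = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    if (id == 0) {
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);          /* dest = 1, tag = 0 */
    } else if (id == 1) {
        MPI_Recv(&y, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status); /* source = 0        */
        printf("rank 1 received %f\n", y);
    }

    MPI_Finalize();
    return 0;
}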


Example

for (….) {
  // p = number of chunks of 1D grid, id = process id, h[] = local chunk of the grid
  // boundary elements of h[] are copies of neighbors' boundary elements

  .... Local computation ...

  // exchange with neighbors on a 1-D grid
  if ( 0 < id )   MPI_Send ( &h[1],   1, MPI_DOUBLE, id-1, 1, MPI_COMM_WORLD );
  if ( id < p-1 ) MPI_Recv ( &h[n+1], 1, MPI_DOUBLE, id+1, 1, MPI_COMM_WORLD, &status );
  if ( id < p-1 ) MPI_Send ( &h[n],   1, MPI_DOUBLE, id+1, 2, MPI_COMM_WORLD );
  if ( 0 < id )   MPI_Recv ( &h[0],   1, MPI_DOUBLE, id-1, 2, MPI_COMM_WORLD, &status );

  … More local computation ...
}

MPI Point-to-Point Routines, Non-Blocking

• MPI_Isend( … )
• MPI_Irecv( … )
• MPI_Wait( … )


Example

for (….) {
  // p = number of chunks of 1D grid, id = process id, h[] = local chunk of the grid
  // boundary elements of h[] are copies of neighbors' boundary elements

  .... Local computation ...

  // exchange with neighbors on a 1-D grid
  // (req1..req4 are MPI_Request handles; each non-blocking call takes a request,
  //  and a later MPI_Wait completes it)
  if ( 0 < id )   MPI_Isend ( &h[1],   1, MPI_DOUBLE, id-1, 1, MPI_COMM_WORLD, &req1 );
  if ( id < p-1 ) MPI_Irecv ( &h[n+1], 1, MPI_DOUBLE, id+1, 1, MPI_COMM_WORLD, &req2 );
  if ( id < p-1 ) MPI_Isend ( &h[n],   1, MPI_DOUBLE, id+1, 2, MPI_COMM_WORLD, &req3 );
  if ( 0 < id )   MPI_Irecv ( &h[0],   1, MPI_DOUBLE, id-1, 2, MPI_COMM_WORLD, &req4 );
  MPI_Wait( …1… )   // wait on the outstanding requests before touching the data
  MPI_Wait( …2… )

  … More local computation ...
}

MPI Collective Communication Routines

• MPI_Barrier(…)
• MPI_Bcast(…)
• MPI_Scatter(…)
• MPI_Gather(…)
• MPI_Reduce(…)
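A minimal sketch (not from the slides) of one of these collectives, MPI_Reduce, combining a per-thread value into a global sum delivered to rank 0; the local value used here is purely illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double local = (double) id;   /* each rank's local contribution (illustrative) */
    double total = 0.0;

    /* combine all local values with + and deliver the result to rank 0 */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (id == 0)
        printf("sum over %d ranks = %f\n", p, total);

    MPI_Finalize();
    return 0;
}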


Typical Structure

communicate_get_work_to_do();
barrier;   // not always needed
do_local_work();
barrier;
communicate_write_results();

What does this remind you of?
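One way this structure might be spelled out against MPI (a sketch; the phase functions are the slide's placeholders, declared here only so the fragment compiles, and nsteps is an assumed parameter):

#include <mpi.h>

/* The slide's placeholder phases, declared only so the sketch compiles. */
void communicate_get_work_to_do(void);
void do_local_work(void);
void communicate_write_results(void);

void superstep_loop(int nsteps) {
    for (int step = 0; step < nsteps; step++) {
        communicate_get_work_to_do();
        MPI_Barrier(MPI_COMM_WORLD);    /* "not always needed"                     */
        do_local_work();
        MPI_Barrier(MPI_COMM_WORLD);    /* everyone finishes before results move   */
        communicate_write_results();
    }
}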

PGAS Model

• PGAS = Partitioned Global Address Space
• There is one global address space
• But each thread owns a partition of the address space that is more efficient to access
  – i.e., the local memory of a processor
• Equivalent in functionality to MPI
  – But typically presented as a programming language
  – Examples: Split-C, UPC, Titanium


PGAS Languages

• No library calls for communication
• Instead, variables can name memory locations on other machines

    // Assume y points to a remote location
    // The following is equivalent to a send/receive
    x = *y

• Also provide collective communication
  – Barrier
  – Broadcast/Reduce
    • 1-many
  – Exchange
    • All-to-all


PGAS vs. MPI

• Programming model very similar
  – Both provide SPMD
• From a pragmatic point of view, MPI rules
  – Easy to add MPI to an existing sequential language
• For productivity, PGAS is better
  – MPI programs filled with low-level details of MPI calls
  – PGAS programs easier to modify
  – PGAS compilers can know more/do better job

Summary

• SPMD is well-matched to cluster programming
  – Also works well on shared-memory machines
• One thread per core
  – No need for compiler to discover parallelism
  – No danger of overwhelming # of threads
• Model exposes memory architecture
  – Local vs. Global variables
  – Local computation vs. sends/receives

Analysis

Control

SIMD:                             SPMD:
if (factor == 0)                  if (factor == 0)
    factor = 1.0                      factor = 1.0
forall (i = 1..N)                 A[myid] = B[myid] * factor;
    A[i] = B[i] * factor;         j += factor;
j += factor;                      …


Control, Cont.

• SPMD replicates the sequential part of the SIMD computation
  – Across all threads!
• Why?
  – Often cheaper to replicate computation in parallel than compute in one place and broadcast
  – A general principle . . .

Global Synchronization Revisited

• In the presence of non-blocking global memory operations, we also need memory fence operations
• Two choices
  – Have separate classes of memory and control synchronization operations
    • E.g., barrier and memory_fence
  – Have a single set of operations
    • E.g., barrier implies memory and control synchronization


Message Passing Implementations

• Idea: A memory fence is a special message sent on the network; when it arrives, all the memory operations are complete
• To work, underlying message system must deliver messages in order
• This is one of the key properties of MPI
  – And most message systems

Bulk Synchronous/SPMD Model

• Easy to understand
• Phase structure guarantees no data races
  – Barrier synchronization also easy to understand
• Fits many problems well


But …

• Assumes 2-level memory hierarchy
  – Local/global, a flat collection of homogeneous sequential processors
• No overlap of communication and computation
• Barriers scale poorly with machine size
  – (# of operations lost) * (# of processors)

Hierarchy

• Current & future machines are more hierarchical
  – 3-4 levels, not 2
• Leads to programs written in a mix of (see the sketch below)
  – MPI (network level)
  – OpenMP (node level) + vectorization
  – CUDA (GPU level)
• Each is a different programming model
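For illustration only (not from the slides), a small sketch of such a mix: MPI across nodes plus OpenMP within a node; the loop and its bound are arbitrary.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* ask for thread support so OpenMP threads can coexist with MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double sum = 0.0;
    /* OpenMP handles node-level parallelism ... */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += 1.0 / (i + 1.0);

    /* ... while MPI handles the network level */
    double global = 0.0;
    MPI_Reduce(&sum, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global = %f\n", global);

    MPI_Finalize();
    return 0;
}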

No Overlap of Computation/Communication

• Leaves major portion of the machine idle in each phase
• And this potential is lost at many scales
  – Hierarchical machines again
• Increasingly, communication is key
  – Data movement is what matters
  – Most of the execution time, most of the energy

Global Operations

• Global operations (such as barriers) are bad
  – Require synchronization across the machine
  – Especially bad when there is performance variation among participating threads
• Need a model that favors asynchrony
  – Couple as few things together as possible


Global Operations Continued

• MPI has evolved to include more asynchronous and point-to-point primitives
• But these do not always mix well with the collective/global operations

I/O

• How do programs get initial data? Produce output?
• In many models the answer is clear
  – Passed in and out from root function
    • Map-Reduce
  – Multithreaded shared memory applications just use the normal file system interface


I/O, Cont.

• Not clear in SPMD
• Program begins running and
  – Each thread is running in its own address space
  – No thread is special
    • Neither master nor slave

I/O, Cont.

• Option 1
  – Make thread 0 special
  – Thread 0 does all I/O on behalf of the program
  – Issue: Awkward to read/write large data sets
    • Limited by thread 0's memory size
• Option 2
  – Parallel I/O
  – Each thread has access to its own file system
    • Containing distributed files
    • Each file f_i is a portion of a collective file f
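A minimal sketch of Option 1 (not from the slides): thread 0 reads a hypothetical input.dat and scatters equal chunks to the other threads; the sizes are illustrative and error handling is omitted.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int chunk = 1024;                 /* elements per thread (illustrative) */
    double *all = NULL;
    double *mine = malloc(chunk * sizeof(double));

    if (id == 0) {                          /* thread 0 does the I/O for everyone */
        all = malloc(p * chunk * sizeof(double));
        FILE *f = fopen("input.dat", "rb"); /* hypothetical input file            */
        fread(all, sizeof(double), p * chunk, f);
        fclose(f);
    }

    /* distribute one chunk to each thread; thread 0 keeps the first chunk */
    MPI_Scatter(all, chunk, MPI_DOUBLE, mine, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* ... compute on mine[] ... */

    free(mine);
    if (id == 0) free(all);
    MPI_Finalize();
    return 0;
}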

I/O Summary

• Option 2 is clearly more SPMD-ish
• Creating/deallocating files requires a barrier
• In general, parallel programming languages have not paid much attention to I/O

Next Week

• Intro to Regent programming
• First assignment

