
Bulk Synchronous and SPMD Programming
CS 315B Lecture 2
Prof. Aiken

The Bulk Synchronous Model

Bulk Synchronous Model
• A model
  – An idealized machine
• Originally proposed for analyzing parallel algorithms
  – Leslie Valiant, "A Bridging Model for Parallel Computation", 1990

The Machine
(figure)

Computations
• A sequence of supersteps
• Repeat:
  – All processors do local computation
  – Barrier
  – All processors communicate
  – Barrier

What are some properties of this machine model?

Basic Properties
• Uniform
  – compute nodes
  – communication costs
• Separate communication and computation
• Synchronization is global

What are properties of this computational model?

The Idea
• Programs are
  – written for v virtual processors
  – run on p physical processors
• If v >= p log p then
  – Managing memory, communication, and synchronization can be done automatically within a constant factor of optimal

How Does This Work?
• Roughly
  – Memory addresses are hashed to a random location in the machine
  – Guarantees that, on average, memory accesses have the same cost
  – The extra log p factor of threads is multiplexed onto the p processors to hide the latency of memory requests
  – The processors are kept busy and do no more compute than necessary

Terminology
• SIMD – Single Instruction, Multiple Data
• SPMD – Single Program, Multiple Data

SPMD

SIMD = Vector Processing

  if (factor == 0)
    factor = 1.0
  A[1..N] = B[1..N] * factor;
  j += factor;

Picture

  if (factor == 0)
    factor = 1.0
  A[1] = B[1] * factor   A[2] = B[2] * factor   A[3] = B[3] * factor   ...
  j += factor

Comments
• Single thread of control
  – Global synchronization at each program instruction
• Can exploit fine-grain parallelism
  – Assumption of hardware support

SPMD = Single Program, Multiple Data

  SIMD                              SPMD
  if (factor == 0)                  if (factor == 0)
    factor = 1.0                      factor = 1.0
  A[1..N] = B[1..N] * factor;       A[myid] = B[myid] * factor;
  j += factor;                      j += factor;
  …                                 …

Picture

  Thread 1                          Thread 2                          ...
  if (factor == 0)                  if (factor == 0)
    factor = 1.0                      factor = 1.0
  A[1] = B[1] * factor              A[2] = B[2] * factor
  j += factor                       j += factor

Comments
• Multiple threads of control
  – One (or more) per processor
• Asynchronous
  – All synchronization is programmer-specified
• Threads are distinguished by myid
• Choice: Are variables local or global?

Comparison
• SIMD
  – Designed for tightly-coupled, synchronous hardware
  – i.e., vector units
• SPMD
  – Designed for clusters
  – Too expensive to synchronize every statement
  – Need a model that allows asynchrony

MPI
• Message Passing Interface
  – A widely used standard
  – Runs on everything
• A runtime system
• Most popular way to write SPMD programs
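To make the SPMD model concrete, here is a minimal MPI program (an illustrative sketch, not taken from the lecture). Every process runs the same code; behavior differs only through the rank returned by MPI_Comm_rank, which plays the role of myid in the slides.

  /* Minimal SPMD sketch: every process executes this same program. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, size;
      MPI_Init(&argc, &argv);                 /* start the MPI runtime */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's unique id */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

      printf("hello from process %d of %d\n", rank, size);

      MPI_Barrier(MPI_COMM_WORLD);            /* global synchronization point */
      MPI_Finalize();
      return 0;
  }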
MPI Programs
• Standard sequential programs
  – All variables are local to a thread
• Augmented with calls to the MPI interface
  – SPMD model
  – Every thread has a unique identifier
  – Threads can send/receive messages
  – Synchronization primitives

MPI Point-to-Point Routines
• MPI_Send(buffer, count, type, dest, …)
• MPI_Recv(buffer, count, type, source, …)

Example

  // p = number of chunks of the 1-D grid, id = process id,
  // h[] = local chunk of the grid; boundary elements of h[] are
  // copies of the neighbors' boundary elements
  for (…) {
    … local computation …

    // exchange with neighbors on the 1-D grid
    if ( 0 < id )
      MPI_Send( &h[1],   1, MPI_DOUBLE, id-1, 1, MPI_COMM_WORLD );
    if ( id < p-1 )
      MPI_Recv( &h[n+1], 1, MPI_DOUBLE, id+1, 1, MPI_COMM_WORLD, &status );
    if ( id < p-1 )
      MPI_Send( &h[n],   1, MPI_DOUBLE, id+1, 2, MPI_COMM_WORLD );
    if ( 0 < id )
      MPI_Recv( &h[0],   1, MPI_DOUBLE, id-1, 2, MPI_COMM_WORLD, &status );

    … more local computation …
  }

MPI Point-to-Point Routines, Non-Blocking
• MPI_Isend(…)
• MPI_Irecv(…)
• MPI_Wait(…)

Example

  // Same exchange as above, but with non-blocking sends and receives;
  // each call returns a request handle that is completed by MPI_Wait.
  for (…) {
    … local computation …

    // exchange with neighbors on the 1-D grid
    if ( 0 < id )
      MPI_Isend( &h[1],   1, MPI_DOUBLE, id-1, 1, MPI_COMM_WORLD, &req1 );
    if ( id < p-1 )
      MPI_Irecv( &h[n+1], 1, MPI_DOUBLE, id+1, 1, MPI_COMM_WORLD, &req2 );
    if ( id < p-1 )
      MPI_Isend( &h[n],   1, MPI_DOUBLE, id+1, 2, MPI_COMM_WORLD, &req3 );
    if ( 0 < id )
      MPI_Irecv( &h[0],   1, MPI_DOUBLE, id-1, 2, MPI_COMM_WORLD, &req4 );

    // wait for the boundary values before using them
    // (the send requests must also be completed before h[1]/h[n] are reused)
    if ( id < p-1 ) MPI_Wait( &req2, &status );
    if ( 0 < id )   MPI_Wait( &req4, &status );

    … more local computation …
  }

MPI Collective Communication Routines
• MPI_Barrier(…)
• MPI_Bcast(…)
• MPI_Scatter(…)
• MPI_Gather(…)
• MPI_Reduce(…)

Typical Structure

  communicate_get_work_to_do();
  barrier;    // not always needed
  do_local_work();
  barrier;
  communicate_write_results();

What does this remind you of?

PGAS Model
• PGAS = Partitioned Global Address Space
• There is one global address space
• But each thread owns a partition of the address space that is more efficient to access
  – i.e., the local memory of a processor
• Equivalent in functionality to MPI
  – But typically presented as a programming language
  – Examples: Split-C, UPC, Titanium

PGAS Languages
• No library calls for communication
• Instead, variables can name memory locations on other machines

  // Assume y points to a remote location.
  // The following is equivalent to a send/receive:
  x = *y

PGAS Languages
• Also provide collective communication
• Barrier
• Broadcast/Reduce
  – 1-many
• Exchange
  – All-to-all
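The slides show remote access only schematically (x = *y). As a concrete illustration, a small UPC program might look roughly like the sketch below; this is an assumed example, not from the lecture, and the array, its size, and the neighbor read are invented for illustration.

  /* A PGAS sketch in UPC, one of the languages named above.
     Illustrative only: A, its size, and the neighbor read are made up. */
  #include <upc.h>
  #include <stdio.h>

  shared double A[4*THREADS];   /* one global array, partitioned across threads */

  int main(void) {
      int i;

      /* each thread initializes the elements it has affinity to */
      upc_forall (i = 0; i < 4*THREADS; i++; &A[i])
          A[i] = MYTHREAD;

      upc_barrier;              /* phase boundary, as in the bulk synchronous model */

      /* reading an element owned by another thread is an ordinary array access;
         the compiler and runtime turn it into communication, like x = *y above */
      double x = A[(MYTHREAD + 1) % THREADS];
      printf("thread %d of %d read %g\n", MYTHREAD, THREADS, x);
      return 0;
  }

The collectives listed on the slide correspond in UPC to upc_barrier and library routines such as upc_all_broadcast.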
PGAS vs. MPI
• Programming model very similar
  – Both provide SPMD
• From a pragmatic point of view, MPI rules
  – Easy to add MPI to an existing sequential language
• For productivity, PGAS is better
  – MPI programs are filled with low-level details of MPI calls
  – PGAS programs are easier to modify
  – PGAS compilers can know more and do a better job

Summary
• SPMD is well-matched to cluster programming
  – Also works well on shared memory machines
• One thread per core
  – No need for the compiler to discover parallelism
  – No danger of an overwhelming number of threads
• Model exposes the memory architecture
  – Local vs. global variables
  – Local computation vs. sends/receives

Analysis

Control

  SIMD                              SPMD
  if (factor == 0)                  if (factor == 0)
    factor = 1.0                      factor = 1.0
  forall (i = 1..N)                 A[myid] = B[myid] * factor;
    A[i] = B[i] * factor;           j += factor;
  j += factor;                      …
  …

Control, Cont.
• SPMD replicates the sequential part of the SIMD computation
  – Across all threads!
• Why?
  – Often cheaper to replicate a computation in parallel than to compute it in one place and broadcast the result
  – A general principle

Global Synchronization Revisited
• In the presence of non-blocking global memory operations, we also need memory fence operations
• Two choices
  – Have separate classes of memory and control synchronization operations
    • E.g., barrier and memory_fence
  – Have a single set of operations
    • E.g., barrier implies both memory and control synchronization

Message Passing Implementations
• Idea: A memory fence is a special message sent on the network; when it arrives, all the memory operations are complete
• To work, the underlying message system must deliver messages in order
• This is one of the key properties of MPI
  – And of most message systems

Bulk Synchronous/SPMD Model
• Easy to understand
• Phase structure guarantees no data races
  – Barrier synchronization is also easy to understand
• Fits many problems well

But …
• Assumes a 2-level memory hierarchy
  – Local/global, a flat collection of homogeneous sequential processors
• No overlap of communication and computation
• Barriers scale poorly with machine size
  – (# of operations lost) * (# of processors)

Hierarchy
• Current and future machines are more hierarchical
  – 3-4 levels, not 2
• Leads to programs written in a mix of
  – MPI (network level)
  – OpenMP (node level) + vectorization
  – CUDA (GPU level)
• Each is a different programming model

No Overlap of Computation/Communication
• Leaves a major portion of the machine idle in each phase
• And this potential is lost at many scales
  – Hierarchical machines again
• Increasingly, communication is key
  – Data movement is what matters
  – Most of the execution time, most of the energy

Global Operations
• Global operations (such as barriers) are bad
  – Require synchronization across the machine
  – Especially bad when there is performance variation among participating threads
• Need a model that favors asynchrony
  – Couple as few things together as possible

Global Operations Continued
• MPI has evolved to include more asynchronous operations

I/O
• How do programs get initial data?
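The "Global Operations Continued" slide notes that MPI has evolved to include more asynchronous operations. One concrete illustration (an assumed example, not from the lecture) is the non-blocking collectives added in MPI-3, such as MPI_Iallreduce, which let a program start a global reduction and overlap it with independent local work:

  /* Sketch: overlap a global reduction with local work using an MPI-3
     non-blocking collective. The data and the "independent work" are placeholders. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank;
      double local, global;
      MPI_Request req;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      local = rank;                        /* each process contributes its rank */

      /* start the global sum, but do not wait for it yet */
      MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                     MPI_COMM_WORLD, &req);

      /* ... independent local computation can proceed here ... */

      MPI_Wait(&req, MPI_STATUS_IGNORE);   /* block only when the sum is needed */
      if (rank == 0) printf("sum of ranks = %g\n", global);

      MPI_Finalize();
      return 0;
  }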