Bulk Synchronous and SPMD Programming
CS 315B, Lecture 2
Prof. Aiken

The Bulk Synchronous Model
Bulk Synchronous Model
• A model
  – An idealized machine
• Originally proposed for analyzing parallel algorithms
  – Leslie Valiant, "A Bridging Model for Parallel Computation", 1990

The Machine
Computations
• A sequence of supersteps. Repeat:
  – All processors do local computation
  – Barrier
  – All processors communicate
  – Barrier

What are some properties of this machine model?
Basic Properties
• Uniform
  – Compute nodes
  – Communication costs
• Separate communication and computation
• Synchronization is global

What are properties of this computational model?
The Idea
• Programs are
  – Written for v virtual processors
  – Run on p physical processors
• If v >= p log p then
  – Managing memory, communication and synchronization can be done automatically within a constant factor of optimal

How Does This Work?
• Roughly
  – Memory addresses are hashed to a random location in the machine
  – Guarantees that, on average, memory accesses have the same cost
  – The extra log p factor of threads is multiplexed onto the p processors to hide the latency of memory requests
  – The processors are kept busy and do no more compute than necessary
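The superstep structure can be sketched with threads and a barrier. This is a toy Python simulation, not an actual BSP runtime; the names `bsp_run`, `compute`, and `communicate` are invented for the sketch:

```python
import threading

def bsp_run(p, supersteps, state):
    """Run a list of (compute, communicate) superstep pairs on p
    threads, BSP-style: a barrier ends each phase."""
    barrier = threading.Barrier(p)

    def worker(pid):
        for compute, communicate in supersteps:
            compute(pid, state)       # local computation only
            barrier.wait()            # end of compute phase
            communicate(pid, state)   # all processors communicate
            barrier.wait()            # end of communication phase

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

For example, one superstep in which each of 4 processors computes a value locally, and processor 0 then combines them in the communication phase, is a `supersteps` list with a single pair.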
Terminology
• SIMD – Single Instruction, Multiple Data
• SPMD – Single Program, Multiple Data

SPMD
SIMD = Vector Processing

if (factor == 0)
    factor = 1.0
A[1..N] = B[1..N] * factor;
j += factor;

Picture

The vector statement executes element-wise, in lockstep; the scalar statements execute once:

if (factor == 0)
    factor = 1.0
A[1] = B[1] * factor    A[2] = B[2] * factor    A[3] = B[3] * factor    . . .
j += factor
Comments
• Single thread of control
  – Global synchronization at each program instruction
• Can exploit fine-grain parallelism
  – Assumption of hardware support

SPMD = Single Program, Multiple Data

SIMD:                           SPMD:
if (factor == 0)                if (factor == 0)
    factor = 1.0                    factor = 1.0
A[1..N] = B[1..N] * factor;     A[myid] = B[myid] * factor;
j += factor;                    j += factor;
                                …
Picture

Each thread runs the whole program on its own data:

if (factor == 0)          if (factor == 0)
    factor = 1.0              factor = 1.0
A[1] = B[1] * factor      A[2] = B[2] * factor      . . .
j += factor               j += factor

Comments
• Multiple threads of control
  – One (or more) per processor
• Asynchronous
  – All synchronization is programmer-specified
• Threads are distinguished by myid
• Choice: Are variables local or global?
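The SPMD fragment above can be mimicked with one Python thread per `myid`. This is a hedged sketch (a real SPMD program would typically run separate processes, one per node):

```python
import threading

N = 4
B = [1.0, 2.0, 3.0, 4.0]
A = [0.0] * N
factor = 2.0          # shared/global in this sketch; every thread sees it

def spmd_body(myid):
    # Every thread executes the same program text;
    # myid selects the data this thread touches.
    f = factor if factor != 0 else 1.0
    A[myid] = B[myid] * f   # each thread owns exactly one element
    j = 0.0                 # j is local: each thread gets its own copy
    j += f

threads = [threading.Thread(target=spmd_body, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# A is now [2.0, 4.0, 6.0, 8.0]
```

Note the "local or global?" choice from the slide: `A` and `factor` are shared here, while `j` is a thread-local variable, so the `j += factor` is replicated per thread without races.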
Comparison
• SIMD
  – Designed for tightly-coupled, synchronous hardware
  – i.e., vector units
• SPMD
  – Designed for clusters
  – Too expensive to synchronize every statement
  – Need a model that allows asynchrony

MPI
• Message Passing Interface
  – A widely used standard
  – Runs on everything
• A runtime system
• Most popular way to write SPMD programs
MPI Programs
• Standard sequential programs
  – All variables are local to a thread
• Augmented with calls to the MPI interface
  – SPMD model
  – Every thread has a unique identifier
  – Threads can send/receive messages
  – Synchronization primitives

MPI Point-to-Point Routines
• MPI_Send(buffer, count, type, dest, …)
• MPI_Recv(buffer, count, type, source, …)
Example

// p = number of chunks of the 1-D grid, id = process id,
// h[] = local chunk of the grid; boundary elements of h[]
// are copies of the neighbors' boundary elements
for (…) {
    … local computation …

    // exchange with neighbors on the 1-D grid
    if ( 0 < id )
        MPI_Send( &h[1], 1, MPI_DOUBLE, id-1, 1, MPI_COMM_WORLD );
    if ( id < p-1 )
        MPI_Recv( &h[n+1], 1, MPI_DOUBLE, id+1, 1, MPI_COMM_WORLD, &status );
    if ( id < p-1 )
        MPI_Send( &h[n], 1, MPI_DOUBLE, id+1, 2, MPI_COMM_WORLD );
    if ( 0 < id )
        MPI_Recv( &h[0], 1, MPI_DOUBLE, id-1, 2, MPI_COMM_WORLD, &status );

    … more local computation …
}

MPI Point-to-Point Routines, Non-Blocking
• MPI_Isend(…)
• MPI_Irecv(…)
• MPI_Wait(…)
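The ghost-cell exchange in this loop can be simulated in Python with one thread per chunk and FIFO queues standing in for MPI_Send/MPI_Recv (`Queue.get` blocks like MPI_Recv). The names `halo_exchange` and `links` are invented for this sketch:

```python
import threading
import queue

def halo_exchange(chunks):
    """Simulate the 1-D ghost-cell exchange: each thread owns
    h[1..n] and fills the ghost cells h[0] / h[n+1] from its
    left and right neighbors."""
    p = len(chunks)
    # links[(src, dst)] carries messages from chunk src to chunk dst
    links = {(i, j): queue.Queue()
             for i in range(p) for j in (i - 1, i + 1) if 0 <= j < p}

    def worker(idx):
        h = chunks[idx]
        n = len(h) - 2
        if idx > 0:
            links[(idx, idx - 1)].put(h[1])       # send left boundary
        if idx < p - 1:
            h[n + 1] = links[(idx + 1, idx)].get()  # recv right ghost
        if idx < p - 1:
            links[(idx, idx + 1)].put(h[n])       # send right boundary
        if idx > 0:
            h[0] = links[(idx - 1, idx)].get()    # recv left ghost

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The queues are buffered, so the sends complete immediately; the blocking MPI version above relies on the same send-before-receive ordering to avoid deadlock.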
Example

The same exchange with non-blocking calls. (Note that MPI_Isend/MPI_Irecv take an MPI_Request, not a status; the status is delivered by MPI_Wait.)

// p = number of chunks of the 1-D grid, id = process id,
// h[] = local chunk of the grid; boundary elements of h[]
// are copies of the neighbors' boundary elements
for (…) {
    … local computation …

    // exchange with neighbors on the 1-D grid
    if ( 0 < id )
        MPI_Isend( &h[1], 1, MPI_DOUBLE, id-1, 1, MPI_COMM_WORLD, &req1 );
    if ( id < p-1 )
        MPI_Irecv( &h[n+1], 1, MPI_DOUBLE, id+1, 1, MPI_COMM_WORLD, &req2 );
    if ( id < p-1 )
        MPI_Isend( &h[n], 1, MPI_DOUBLE, id+1, 2, MPI_COMM_WORLD, &req3 );
    if ( 0 < id )
        MPI_Irecv( &h[0], 1, MPI_DOUBLE, id-1, 2, MPI_COMM_WORLD, &req4 );
    if ( id < p-1 )
        MPI_Wait( &req2, &status );   // tag-1 exchange complete
    if ( 0 < id )
        MPI_Wait( &req4, &status );   // tag-2 exchange complete

    … more local computation …
}

MPI Collective Communication Routines
• MPI_Barrier(…)
• MPI_Bcast(…)
• MPI_Scatter(…)
• MPI_Gather(…)
• MPI_Reduce(…)
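The semantics of a collective like MPI_Reduce can be sketched with a barrier plus a root that combines everyone's contribution. This is a toy Python model (`spmd_reduce` is an invented name; a real implementation would use a reduction tree, not a flat scan by the root):

```python
import threading

def spmd_reduce(p, values, op, root=0):
    """All p participants contribute a value; after a barrier
    guarantees every contribution has arrived, the root combines
    them with op, like MPI_Reduce to rank `root`."""
    barrier = threading.Barrier(p)
    contrib = [None] * p
    result = [None]

    def worker(myid):
        contrib[myid] = values[myid]   # everyone deposits its contribution
        barrier.wait()                 # collective: all must arrive
        if myid == root:
            acc = contrib[0]
            for v in contrib[1:]:
                acc = op(acc, v)
            result[0] = acc

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

For example, `spmd_reduce(4, [1, 2, 3, 4], lambda a, b: a + b)` sums the four contributions at rank 0.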
Typical Structure

communicate_get_work_to_do();
barrier;                        // not always needed
do_local_work();
barrier;
communicate_write_results();

What does this remind you of?

PGAS Model
• PGAS = Partitioned Global Address Space
• There is one global address space
• But each thread owns a partition of the address space that is more efficient to access
  – i.e., the local memory of a processor
• Equivalent in functionality to MPI
  – But typically presented as a programming language
  – Examples: Split-C, UPC, Titanium
PGAS Languages
• No library calls for communication
• Instead, variables can name memory locations on other machines

// Assume y points to a remote location.
// The following is equivalent to a send/receive:
x = *y;

PGAS Languages
• Also provide collective communication
  – Barrier
  – Broadcast/Reduce (1-to-many)
  – Exchange (all-to-all)
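The partitioned-global-address-space idea can be sketched as one flat index space whose backing storage is split among owners. Toy Python; `PGASArray` and its methods are invented names, and the "remote" case is only modeled by a flag where a PGAS compiler would emit communication:

```python
class PGASArray:
    """One global index space [0, n); each of p threads owns a
    contiguous partition. A local read is just a load; a remote
    read (owner != myid) would compile into a message."""
    def __init__(self, p, n):
        assert n % p == 0
        self.chunk = n // p
        self.parts = [[0] * self.chunk for _ in range(p)]

    def owner(self, i):
        return i // self.chunk          # which thread owns global index i

    def write(self, i, val):
        self.parts[self.owner(i)][i % self.chunk] = val

    def read(self, myid, i):
        # Returns (value, was_remote): the x = *y of the slide is
        # exactly the was_remote == True case.
        o = self.owner(i)
        return self.parts[o][i % self.chunk], o != myid
```

With `p = 4`, `n = 8`, global index 5 lives in thread 2's partition, so thread 0 reading it is a remote access while thread 2 reading it is local.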
PGAS vs. MPI
• Programming model very similar
  – Both provide SPMD
• One thread per core
  – No need for compiler to discover parallelism
  – No danger of an overwhelming # of threads
• Model exposes memory architecture
  – Local vs. global variables
  – Local computation vs. sends/receives

Summary
• SPMD is well-matched to cluster programming
  – Also works well on shared memory machines
• From a pragmatic point of view, MPI rules
  – Easy to add MPI to an existing sequential language
• For productivity, PGAS is better
  – MPI programs are filled with low-level details of MPI calls
  – PGAS programs are easier to modify
  – PGAS compilers can know more and do a better job
Analysis

Control

SIMD:                           SPMD:
if (factor == 0)                if (factor == 0)
    factor = 1.0                    factor = 1.0
forall (i = 1..N)               A[myid] = B[myid] * factor;
    A[i] = B[i] * factor;       j += factor;
j += factor;                    …
Control, Cont.
• SPMD replicates the sequential part of the SIMD computation
  – Across all threads!
• Why?
  – Often cheaper to replicate computation in parallel than compute in one place and broadcast
  – A general principle …

Global Synchronization Revisited
• In the presence of non-blocking global memory operations, we also need memory fence operations
• Two choices
  – Have separate classes of memory and control synchronization operations
    • E.g., barrier and memory_fence
  – Have a single set of operations
    • E.g., barrier implies memory and control synchronization
Message Passing Implementations
• Idea: A memory fence is a special message sent on the network; when it arrives, all the memory operations are complete
• To work, the underlying message system must deliver messages in order
• This is one of the key properties of MPI
  – And of most message systems

Bulk Synchronous/SPMD Model
• Easy to understand
• Phase structure guarantees no data races
  – Barrier synchronization also easy to understand
• Fits many problems well
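The fence-as-a-message idea can be sketched with a FIFO queue modeling in-order delivery. A toy, single-threaded Python illustration; the function and message-tag names are invented:

```python
import queue

def ordered_channel_fence():
    """If the network delivers messages in order, then a FENCE
    marker sent after all remote writes guarantees the writes
    have been applied by the time the fence is observed."""
    chan = queue.Queue()            # FIFO: models in-order delivery

    # Sender: issue remote writes, then the fence marker.
    for addr, val in [(0, 10), (1, 20)]:
        chan.put(("WRITE", addr, val))
    chan.put(("FENCE", None, None))

    # Receiver: when FENCE arrives, every earlier write has
    # already been drained from the channel and applied.
    mem = {}
    while True:
        kind, addr, val = chan.get()
        if kind == "FENCE":
            break
        mem[addr] = val
    return mem
```

If the channel could reorder messages, the FENCE might overtake a WRITE, and observing the fence would no longer imply the memory operations are complete; that is exactly why in-order delivery is required.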
But …
• Assumes a 2-level memory hierarchy
  – Local/global: a flat collection of homogeneous sequential processors
• No overlap of communication and computation
• Barriers scale poorly with machine size
  – (# of operations lost) * (# of processors)

Hierarchy
• Current & future machines are more hierarchical
  – 3-4 levels, not 2
• Leads to programs written in a mix of
  – MPI (network level)
  – OpenMP (node level) + vectorization
  – CUDA (GPU level)
• Each is a different programming model
No Overlap of Computation/Communication
• Leaves a major portion of the machine idle in each phase
• And this potential is lost at many scales
  – Hierarchical machines again
• Increasingly, communication is key
  – Data movement is what matters
  – Most of the execution time, most of the energy

Global Operations
• Global operations (such as barriers) are bad
  – Require synchronization across the machine
  – Especially bad when there is performance variation among participating threads
• Need a model that favors asynchrony
  – Couple as few things together as possible
Global Operations, Continued
• MPI has evolved to include more asynchronous point-to-point primitives
• But these do not always mix well with the collective/global operations

I/O
• How do programs get initial data? Produce output?
• In many models the answer is clear
  – Passed in and out from a root function
    • E.g., Map-Reduce
  – Multithreaded shared memory applications just use the normal file system interface
I/O, Cont.
• Not clear in SPMD
• Program begins running and
  – Each thread is running in its own address space
  – No thread is special
    • Neither master nor slave

I/O, Cont.
• Option 1 – Make thread 0 special
  – Thread 0 does all I/O on behalf of the program
  – Issue: Awkward to read/write large data sets
    • Limited by thread 0's memory size
• Option 2 – Parallel I/O
  – Each thread has access to its own file system
  – Containing distributed files
    • Each file f_i is a portion of a collective file f
I/O Summary
• Option 2 is clearly more SPMD-ish
• Creating/deallocating files requires a barrier
• In general, parallel programming languages have not paid much attention to I/O

Next Week
• Intro to Regent programming
• First assignment