Bulk Synchronous and SPMD Programming

CS315B Lecture 2

The Bulk Synchronous Model

Bulk Synchronous Model

• A model
  – An idealized machine

• Originally proposed for analyzing parallel algorithms
  – Leslie Valiant, "A Bridging Model for Parallel Computation", 1990

The Machine

(figure: the idealized machine model)

Computations

• A sequence of supersteps:

• Repeat:
  – All processors do local computation
  – Barrier
  – All processors communicate
  – Barrier

What are some properties of this machine model?


Basic Properties

• Uniform
  – compute nodes
  – communication costs
• Separate communication and computation

• Synchronization is global

What are properties of this computational model?

The Idea

• Programs are
  – written for v virtual processors
  – run on p physical processors
• If v >= p log p then
  – Managing memory, communication and synchronization can be done automatically within a constant factor of optimal

How Does This Work?

• Roughly
  – Memory addresses are hashed to a random location in the machine
  – Guarantees that on average, memory accesses have the same cost
  – The extra log p factor of threads are multiplexed onto the p processors to hide the latency of memory requests
  – The processors are kept busy and do no more compute than necessary


Terminology

• SIMD – Single Instruction, Multiple Data
• SPMD – Single Program, Multiple Data

SPMD


SIMD = Vector Processing

if (factor == 0)
    factor = 1.0
A[1..N] = B[1..N] * factor;
j += factor;

Picture

if (factor == 0)
    factor = 1.0
A[1] = B[1] * factor    A[2] = B[2] * factor    A[3] = B[3] * factor    . . .
j += factor


Comments

• Single thread of control
  – Global synchronization at each program instruction
• Can exploit fine-grain parallelism
  – Assumption of hardware support

SPMD = Single Program, Multiple Data

SIMD:                             SPMD:
if (factor == 0)                  if (factor == 0)
    factor = 1.0                      factor = 1.0
A[1..N] = B[1..N] * factor;       A[myid] = B[myid] * factor;
j += factor;                      j += factor;
                                  …


Picture

if (factor == 0)          if (factor == 0)
    factor = 1.0              factor = 1.0
A[1] = B[1] * factor      A[2] = B[2] * factor      . . .
j += factor               j += factor

Comments

• Multiple threads of control
  – One (or more) per processor
• Asynchronous
  – All synchronization is programmer-specified
• Threads are distinguished by myid
• Choice: Are variables local or global?
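In practice the per-thread myid above comes from the runtime. A minimal sketch (not from the slides) of how an SPMD process obtains its id with MPI; the names myid and nprocs are just illustrative:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int myid, nprocs;
    MPI_Init(&argc, &argv);                  /* every process runs this same program */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);    /* unique id distinguishes the threads  */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of processes            */
    printf("hello from %d of %d\n", myid, nprocs);
    MPI_Finalize();
    return 0;
}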


Comparison

• SIMD
  – Designed for tightly-coupled, synchronous hardware
  – i.e., vector units
• SPMD
  – Designed for clusters
  – Too expensive to synchronize every statement
  – Need a model that allows asynchrony

MPI

• Interface
  – A widely used standard
  – Runs on everything
• A runtime system
• Most popular way to write SPMD programs


MPI Programs

• Standard sequential programs
  – All variables are local to a thread

• Augmented with calls to the MPI interface
  – SPMD model
  – Every thread has a unique identifier
  – Threads can send/receive messages
  – Synchronization primitives

MPI Point-to-Point Routines

• MPI_Send(buffer, count, type, dest, …)
• MPI_Recv(buffer, count, type, source, …)
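A minimal sketch (not from the slides) of these two routines in use: rank 0 sends one double to rank 1. It assumes the program is launched with at least two processes; the variable names are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int id;
    double x = 3.14, y = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    if (id == 0) {
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);          /* dest = 1, tag = 0 */
    } else if (id == 1) {
        MPI_Recv(&y, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status); /* source = 0        */
        printf("rank 1 received %f\n", y);
    }

    MPI_Finalize();
    return 0;
}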


Example

for (….) {
  // p = number of chunks of 1D grid, id = process id, h[] = local chunk of the grid
  // boundary elements of h[] are copies of neighbors' boundary elements

  .... Local computation ...

  // exchange with neighbors on a 1-D grid
  if ( 0 < id )   MPI_Send ( &h[1],   1, MPI_DOUBLE, id-1, 1, MPI_COMM_WORLD );
  if ( id < p-1 ) MPI_Recv ( &h[n+1], 1, MPI_DOUBLE, id+1, 1, MPI_COMM_WORLD, &status );
  if ( id < p-1 ) MPI_Send ( &h[n],   1, MPI_DOUBLE, id+1, 2, MPI_COMM_WORLD );
  if ( 0 < id )   MPI_Recv ( &h[0],   1, MPI_DOUBLE, id-1, 2, MPI_COMM_WORLD, &status );

  … More local computation ...
}

MPI Point-to-Point Routines, Non-Blocking

• MPI_Isend( … )
• MPI_Irecv( … )
• MPI_Wait( … )


Example

for (….) {
  // p = number of chunks of 1D grid, id = process id, h[] = local chunk of the grid
  // boundary elements of h[] are copies of neighbors' boundary elements

  .... Local computation ...

  // exchange with neighbors on a 1-D grid
  // (req1..req4 are MPI_Request handles; each non-blocking call takes a request,
  //  and a later MPI_Wait completes it)
  if ( 0 < id )   MPI_Isend ( &h[1],   1, MPI_DOUBLE, id-1, 1, MPI_COMM_WORLD, &req1 );
  if ( id < p-1 ) MPI_Irecv ( &h[n+1], 1, MPI_DOUBLE, id+1, 1, MPI_COMM_WORLD, &req2 );
  if ( id < p-1 ) MPI_Isend ( &h[n],   1, MPI_DOUBLE, id+1, 2, MPI_COMM_WORLD, &req3 );
  if ( 0 < id )   MPI_Irecv ( &h[0],   1, MPI_DOUBLE, id-1, 2, MPI_COMM_WORLD, &req4 );
  MPI_Wait( …1… )   // wait on the outstanding requests before touching the data
  MPI_Wait( …2… )

  … More local computation ...
}

MPI Collective Communication Routines

• MPI_Barrier(…)
• MPI_Bcast(…)
• MPI_Scatter(…)
• MPI_Gather(…)
• MPI_Reduce(…)
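A minimal sketch (not from the slides) of one of these collectives, MPI_Reduce, combining a per-thread value into a global sum delivered to rank 0; the local value used here is purely illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double local = (double) id;   /* each rank's local contribution (illustrative) */
    double total = 0.0;

    /* combine all local values with + and deliver the result to rank 0 */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (id == 0)
        printf("sum over %d ranks = %f\n", p, total);

    MPI_Finalize();
    return 0;
}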


Typical Structure

communicate_get_work_to_do();
barrier;   // not always needed
do_local_work();
barrier;
communicate_write_results();

What does this remind you of?
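One way this structure might be spelled out against MPI (a sketch; the phase functions are the slide's placeholders, declared here only so the fragment compiles, and nsteps is an assumed parameter):

#include <mpi.h>

/* The slide's placeholder phases, declared only so the sketch compiles. */
void communicate_get_work_to_do(void);
void do_local_work(void);
void communicate_write_results(void);

void superstep_loop(int nsteps) {
    for (int step = 0; step < nsteps; step++) {
        communicate_get_work_to_do();
        MPI_Barrier(MPI_COMM_WORLD);    /* "not always needed"                     */
        do_local_work();
        MPI_Barrier(MPI_COMM_WORLD);    /* everyone finishes before results move   */
        communicate_write_results();
    }
}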

PGAS Model

• PGAS = Partitioned Global Address Space
• There is one global address space
• But each thread owns a partition of the address space that is more efficient to access
  – i.e., the local memory of a processor
• Equivalent in functionality to MPI
  – But typically presented as a programming language
  – Examples: Split-C, UPC, Titanium


PGAS Languages

• No library calls for communication
• Instead, variables can name memory locations on other machines

    // Assume y points to a remote location
    // The following is equivalent to a send/receive
    x = *y

• Also provide collective communication
  – Barrier
  – Broadcast/Reduce
    • 1-many
  – Exchange
    • All-to-all


PGAS vs. MPI

• Programming model very similar
  – Both provide SPMD
• From a pragmatic point of view, MPI rules
  – Easy to add MPI to an existing sequential language
• For productivity, PGAS is better
  – MPI programs filled with low-level details of MPI calls
  – PGAS programs easier to modify
  – PGAS compilers can know more/do better job

Summary

• SPMD is well-matched to cluster programming
  – Also works well on shared-memory machines
• One thread per core
  – No need for compiler to discover parallelism
  – No danger of overwhelming # of threads
• Model exposes memory architecture
  – Local vs. Global variables
  – Local computation vs. sends/receives

Analysis

Control

SIMD:                             SPMD:
if (factor == 0)                  if (factor == 0)
    factor = 1.0                      factor = 1.0
forall (i = 1..N)                 A[myid] = B[myid] * factor;
    A[i] = B[i] * factor;         j += factor;
j += factor;                      …


Control, Cont.

• SPMD replicates the sequential part of the SIMD computation
  – Across all threads!
• Why?
  – Often cheaper to replicate computation in parallel than compute in one place and broadcast
  – A general principle . . .

Global Synchronization Revisited

• In the presence of non-blocking global memory operations, we also need memory fence operations
• Two choices
  – Have separate classes of memory and control synchronization operations
    • E.g., barrier and memory_fence
  – Have a single set of operations
    • E.g., barrier implies memory and control synchronization


Message Passing Implementations

• Idea: A memory fence is a special message sent on the network; when it arrives, all the memory operations are complete
• To work, underlying message system must deliver messages in order
• This is one of the key properties of MPI
  – And most message systems

Bulk Synchronous/SPMD Model

• Easy to understand
• Phase structure guarantees no data races
  – Barrier synchronization also easy to understand
• Fits many problems well


But …

• Assumes 2-level memory hierarchy
  – Local/global, a flat collection of homogeneous sequential processors
• No overlap of communication and computation
• Barriers scale poorly with machine size
  – (# of operations lost) * (# of processors)

Hierarchy

• Current & future machines are more hierarchical
  – 3-4 levels, not 2
• Leads to programs written in a mix of (see the sketch below)
  – MPI (network level)
  – OpenMP (node level) + vectorization
  – CUDA (GPU level)
• Each is a different programming model
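For illustration only (not from the slides), a small sketch of such a mix: MPI across nodes plus OpenMP within a node; the loop and its bound are arbitrary.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* ask for thread support so OpenMP threads can coexist with MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double sum = 0.0;
    /* OpenMP handles node-level parallelism ... */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += 1.0 / (i + 1.0);

    /* ... while MPI handles the network level */
    double global = 0.0;
    MPI_Reduce(&sum, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global = %f\n", global);

    MPI_Finalize();
    return 0;
}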

No Overlap of Computation/Communication

• Leaves major portion of the machine idle in each phase
• And this potential is lost at many scales
  – Hierarchical machines again
• Increasingly, communication is key
  – Data movement is what matters
  – Most of the execution time, most of the energy

Global Operations

• Global operations (such as barriers) are bad
  – Require synchronization across the machine
  – Especially bad when there is performance variation among participating threads
• Need a model that favors asynchrony
  – Couple as few things together as possible


Global Operations Continued

• MPI has evolved to include more asynchronous and point-to-point primitives
• But these do not always mix well with the collective/global operations

I/O

• How do programs get initial data? Produce output?
• In many models the answer is clear
  – Passed in and out from root function
    • Map-Reduce
  – Multithreaded shared memory applications just use the normal file system interface


I/O, Cont.

• Not clear in SPMD
• Program begins running and
  – Each thread is running in its own address space
  – No thread is special
    • Neither master nor slave

I/O, Cont.

• Option 1
  – Make thread 0 special
  – Thread 0 does all I/O on behalf of the program
  – Issue: Awkward to read/write large data sets
    • Limited by thread 0's memory size
• Option 2
  – Parallel I/O
  – Each thread has access to its own file system
    • Containing distributed files
    • Each file f_i is a portion of a collective file f
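A minimal sketch of Option 1 (not from the slides): thread 0 reads a hypothetical input.dat and scatters equal chunks to the other threads; the sizes are illustrative and error handling is omitted.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int chunk = 1024;                 /* elements per thread (illustrative) */
    double *all = NULL;
    double *mine = malloc(chunk * sizeof(double));

    if (id == 0) {                          /* thread 0 does the I/O for everyone */
        all = malloc(p * chunk * sizeof(double));
        FILE *f = fopen("input.dat", "rb"); /* hypothetical input file            */
        fread(all, sizeof(double), p * chunk, f);
        fclose(f);
    }

    /* distribute one chunk to each thread; thread 0 keeps the first chunk */
    MPI_Scatter(all, chunk, MPI_DOUBLE, mine, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* ... compute on mine[] ... */

    free(mine);
    if (id == 0) free(all);
    MPI_Finalize();
    return 0;
}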

I/O Summary

• Option 2 is clearly more SPMD-ish
• Creating/deallocating files requires a barrier
• In general, parallel programming languages have not paid much attention to I/O

Next Week

• Intro to Regent programming
• First assignment

