Introduction to Parallel Computing and the Message Passing Interface (MPI)

Rolf Kuiper

July 1st, 2008

Chapter 1

Introduction to Parallel Computing

1.1 Architectures

1.1.1 von Neumann Architecture (single machine)

CPU: sequentially performs decoded instructions (on data)

RAM (Random Access Memory): stores program instructions and data

[Diagram: a single CPU connected to RAM]

1.1.2 Flynn’s taxonomy of parallelism

                     Instruction
                     Single    Multiple
Data    Single       SISD      MISD
        Multiple     SIMD      MIMD

SISD: uni-processor machine / entirely sequential program
SIMD: the same operation is repeated over different data
MISD: rarely used, only a few specialized applications
MIMD: the most common parallel cluster architecture and the focus of this introduction

1.1.3 Parallel Architectures

Shared Memory

[Diagram: four CPUs sharing one RAM]

• Default programming language: Open Multi-Processing (OpenMP)

Distributed Memory

[Diagram: four CPUs, each with its own RAM, connected by a network]

• Default programming language: Message Passing Interface (MPI)

Hybrid Distributed-Shared Memory

[Diagram: pairs of CPUs share one RAM each; the shared-memory nodes are connected by a network]

• Default programming language:

  – Cluster OpenMP
  – MPI 2.0

1.2 Partitioning

1.2.1 Functional Decomposition

• Each processor takes charge of its own subroutine, e.g. a climate model:

[Diagram: Atmosphere Model, Ocean Model, and Surface Model exchanging data with each other]

1.2.2 Domain Decomposition

• The data is decomposed into different data sets.

• Each processor executes the same tasks on different data sets.

• Single Program Multiple Data (SPMD), e.g. grid-based hydrodynamics:

[Diagram: a grid split into two halves, one handled by Processor 0 and one by Processor 1]

1.2.3 Performance

Speedup S
Definition: The speedup is the ratio of the runtime of the serial code to the runtime of the parallel code,

S = t_1 / t_n,

with the number of processors n and the wall-clock time t_i that the program needs on i processors.

Efficiency E
Definition: The efficiency is the gained speedup per processor (in percent),

E = t_1 / (n t_n) = S / n.
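For example (numbers purely for illustration): if the serial run needs t_1 = 100 s and the run on n = 8 processors needs t_8 = 20 s, then S = 100/20 = 5 and E = S/n = 5/8 ≈ 0.63, i.e. 63%.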

Amdahl’s law

Assumptions:

• fixed problem size

• serial portion α of the code independent of n

S ≤ 1 / ((1 − P) + P/n)  −→  1 / (1 − P) = 1/α  for n → ∞,

with the parallel portion of the code P ((1 − P) is the serial portion α).

[Plot: speedup S versus number of processors n (up to 1000) for parallel portions P = 0.95, 0.96, 0.97, 0.98, 0.99, 1.0]

Gustafson’s law

S ≤ n − α(n − 1) −→ n(1 − α) = nP for n → ∞
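As a quick numerical illustration of the two laws (the serial fraction alpha = 0.05 is an arbitrary choice, not a value from the lecture), a small C sketch could tabulate both bounds:

#include <stdio.h>

// Tabulate the speedup limits of Amdahl's and Gustafson's laws
// for an assumed serial fraction alpha (parallel portion P = 1 - alpha).
int main(void)
{
    const double alpha = 0.05;        // illustrative serial fraction
    const double P     = 1.0 - alpha; // parallel portion

    for (int n = 1; n <= 1024; n *= 4) {
        double amdahl    = 1.0 / ((1.0 - P) + P / n);
        double gustafson = n - alpha * (n - 1);
        printf("n = %4d   Amdahl: %7.2f   Gustafson: %8.2f\n",
               n, amdahl, gustafson);
    }
    return 0;
}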

Load Balancing

For grid-based hydrodynamics: each processor should contain the same number of grid cells.

Granularity

Granularity = Computation / Communication

• Local domains with small surface (communication) and large volume (computation).

• Cubic domain decomposition.

e.g. a 2D, 8×8 grid (see below):

[Figure: an 8×8 grid decomposed among four processors, with local volume V (number of cells) and communicating surface A (number of boundary cells)]

Boundary domains: V/A = 16/8 = 2.0    Inner domains: V/A = 16/16 = 1.0

Total domain: V/A = 2.0

1.2.4 Ghost cells

• Allow the same computation algorithm for border cells as for inner-domain cells.

• Ghosts outside the physical domain have to be set via boundary conditions.

• Ghosts inside the global physical domain, but outside the local domain, have to be communicated.

[Diagram: a local domain surrounded by ghost cells; ghosts at the physical boundary are set via boundary conditions, ghosts at inner boundaries are filled by communication with the neighbouring domain]

1.2.5 Stencils

• The communication stencil is equal to the computation stencil.

• Standard choices are star and box stencils.

Standard 5-point stencil (2D star stencil):

[Diagram: the center cell and its four edge neighbours]

Standard 9-point stencil (2D box stencil):

[Diagram: the center cell and its eight surrounding neighbours]

Computation Stencil:

Dimension   Star stencil   Box stencil
1           2-point        2-point
2           5-point        9-point
3           7-point        27-point

Communication Stencil:

[Diagram: the corresponding communication stencil: ghost cells that have to be received from the neighbouring domains for the local update]
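As a minimal sketch of how such a stencil is applied in code (the array sizes NX, NY, the ghost-layer width of one cell, and the simple averaging operation are illustrative assumptions, not taken from the lecture):

// Minimal 5-point star-stencil update on a local domain with one ghost layer.
#define NX 8   // illustrative local domain size in x
#define NY 8   // illustrative local domain size in y

void stencil_update(double u[NX + 2][NY + 2], double u_new[NX + 2][NY + 2])
{
    // Update inner cells only; indices 0 and NX+1 / NY+1 are ghost cells.
    for (int i = 1; i <= NX; i++)
        for (int j = 1; j <= NY; j++)
            u_new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1]);
}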

Chapter 2

The Message Passing Interface (MPI)

• MPI 1.1 & MPI 2.0 standards (approx. 600 pages): http://www.mpi-forum.org/docs/docs.html

• MPI 1.1 defines the basics (which we will focus on in this introduction)

• MPI 2.0 main add-ons:

– Dynamic process creation
– Shared memory operations
– Parallel I/O

• Interface to C/C++ and Fortran

• Different implementations available (e.g. MPICH, Open MPI)

2.1 Short Glossary

Communicator: group of communicating processors in an ordered topology (default communicator containing all running processors: MPI_COMM_WORLD)

Rank: processor identifier (routine to get the rank: MPI_Comm_rank(comm, &rank))

Topology: mapping / ordering of processors (e.g. Cartesian grid)

Buffer: space for message storage at the sender or receiver (also possible: an additional user-specified application buffer)

2.2 MPI Primitive Data Types

C data types:

MPI_CHAR             signed char
MPI_SHORT            signed short int
MPI_INT              signed int
MPI_LONG             signed long int
MPI_UNSIGNED_CHAR    unsigned char
MPI_UNSIGNED_SHORT   unsigned short int
MPI_UNSIGNED         unsigned int
MPI_UNSIGNED_LONG    unsigned long int
MPI_FLOAT            float
MPI_DOUBLE           double
MPI_LONG_DOUBLE      long double
MPI_BYTE             8 binary digits

Fortran data types:

MPI_CHARACTER        character(1)
MPI_INTEGER          integer
MPI_REAL             real
MPI_DOUBLE_PRECISION double precision
MPI_COMPLEX          complex
MPI_DOUBLE_COMPLEX   double complex
MPI_LOGICAL          logical
MPI_BYTE             8 binary digits

It is also possible to define new data types (called "derived data types"), like vector, struct, ...
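As a small sketch of a derived data type (the array sizes NX, NY are illustrative assumptions), MPI_Type_vector can describe one column of a row-major C array:

#include <mpi.h>

#define NX 8   // illustrative number of rows
#define NY 8   // illustrative number of columns

int main(int argc, char *argv[])
{
    double field[NX][NY];
    MPI_Datatype column_type;

    MPI_Init(&argc, &argv);

    // NX blocks of 1 double each, separated by a stride of NY doubles:
    MPI_Type_vector(NX, 1, NY, MPI_DOUBLE, &column_type);
    MPI_Type_commit(&column_type);

    // column_type can now be used like a primitive data type, e.g.
    // MPI_Send(&field[0][0], 1, column_type, dest, tag, MPI_COMM_WORLD);

    MPI_Type_free(&column_type);
    MPI_Finalize();
    return 0;
}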

2.3 Getting started, or the Environment Management Routines

// ******************
// ** HelloWorld.c **
// ******************
//
// Compile via "gcc -o HelloWorld HelloWorld.c"
//
// Run via "./HelloWorld"
//


//
// Include Libraries:
//
#include <stdio.h>   // for printf()


//
// Main routine:
//
int main()
{
    //
    // "Hello World!"
    //
    printf("Hello World!\n");

    return 0;
}


// **********************
// ** HelloWorld_MPI.c **
// **********************
//
// Compile via "mpicc -o HelloWorld_MPI HelloWorld_MPI.c"
//
// Run via "mpiexec -n '#procs' ./HelloWorld_MPI"
//


//
// Include Libraries:
//
#include <mpi.h>     // for Message Passing Interface
#include <stdio.h>   // for printf()


//
// Main routine:
//
int main(int argc, char *argv[])
{
    //
    // Initialize MPI:
    //
    MPI_Init(&argc, &argv);

    //
    // "Hello World!"
    //
    printf("Hello World!\n");

    //
    // Finalize MPI:
    //
    MPI_Finalize();

    return 0;
}


// ************************
// ** HelloWorld_MPI_2.c **
// ************************
//
// Compile via "mpicc -o HelloWorld_MPI_2 HelloWorld_MPI_2.c"
//
// Run via "mpiexec -n '#procs' ./HelloWorld_MPI_2"
//


//
// Include Libraries:
//
#include <mpi.h>     // for Message Passing Interface
#include <stdio.h>   // for printf()


//
// Main routine:
//
int main(int argc, char *argv[])
{
    //
    // Initialize MPI:
    //
    if(MPI_Init(&argc, &argv) != MPI_SUCCESS){
        // Every call to an MPI routine returns an error code!
        fprintf(stderr, "ERROR: MPI_Init failed.\n");
        return -1;
    }

    //
    // "Hello World!"
    //
    printf("Hello World!\n");

    //
    // Finalize MPI:
    //
    MPI_Finalize();

    return 0;
}


//
// Include Libraries:
//
#include <mpi.h>     // for Message Passing Interface
#include <stdio.h>   // for printf()

//
// Parallel environment:
//
int MPI_rank; // identifier of a single processor in the communicator
int MPI_size; // number of total processors in the communicator

//
// Main routine:
//
int main(int argc, char *argv[])
{
    //
    // Initialize MPI:
    //
    if(MPI_Init(&argc, &argv) != MPI_SUCCESS){
        fprintf(stderr, "ERROR: MPI_Init failed.\n");
        return -1;
    }

    //
    // Get number of processors and own local rank:
    //
    // MPI_COMM_WORLD = communicator, which includes all processors.
    MPI_Comm_size(MPI_COMM_WORLD, &MPI_size);
    // Every processor in the communicator has its own identifier.
    MPI_Comm_rank(MPI_COMM_WORLD, &MPI_rank);

    //
    // "Hello World!"
    //
    printf("[%d / %d] Hello World!\n", MPI_rank, MPI_size);

    //
    // Finalize MPI:
    //
    MPI_Finalize();

    return 0;
}

2.4 Communication

2.4.1 Point-to-point communication

• One message from one sender process to one receiver process.

• The (default) path of the message goes from the specified memory at the sender, to the send buffer, to the receive buffer, to the specified memory at the receiver.

Blocking communication

• The process waits until the sent data can be reused safely (blocking send) or until the data required for further computation has been received (blocking receive).

• Default way of communication (MPI_Send(), MPI_Recv()).

Non-blocking communication

• After posting the send or receive request, the processor simply continues with the next instructions.

• The status of the communication can be requested later on.

• Overlapping communication and computation (!?).

• Key-letter is the capital ’I’ (MPI_Isend(), MPI_Irecv()); see the sketch below.
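A minimal sketch of such a non-blocking exchange (the neighbour ranks left and right, the tag 0, and the single double value are illustrative assumptions):

#include <mpi.h>

// Post the receive and the send, overlap with computation, then wait.
void exchange(double *send_val, double *recv_val, int left, int right)
{
    MPI_Request requests[2];

    MPI_Irecv(recv_val, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &requests[0]);
    MPI_Isend(send_val, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &requests[1]);

    // ... computation that does not depend on recv_val ...

    MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
}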

Synchronous communication

• The sending process waits until the data is received.

• Key-letter is the ’S/s’.

Asynchronous communication

• Committed data is stored in a user-specified application buffer (MPI_Buffer_attach() routine) until it is requested by the receiving process (MPI_Request handle).

• Key-letter is the ’B/b’.

Ready Send

• Blocking, immediate send.

• The programmer has to ensure that the destination process has already posted the receive.

• Key-letter is the ’R/r’.

Sendrecv

• Sends the message.

• Posts the receive.

• Then blocks until both have completed.

• MPI_Sendrecv().

Combinations and Syntax

• It is possible to combine any send method with any receive method.

• Calls:

– MPI_Send(), MPI_Recv()
– MPI_Ssend()
– MPI_Bsend()
– MPI_Rsend()
– MPI_Isend(), MPI_Irecv()
– MPI_Issend()
– MPI_Ibsend()
– MPI_Irsend()

• Main syntax:

– MPI_Send(&buffer, count, datatype, dest, tag, comm)
– MPI_Recv(&buffer, count, datatype, source, tag, comm, &status)
– MPI_Isend(&buffer, count, datatype, dest, tag, comm, &request)
– MPI_Sendrecv(&sendbuffer, sendcount, sendtype, dest, sendtag, &recvbuffer, recvcount, recvtype, source, recvtag, comm, &status)

with

– buffer = reference to the first data address
– count = the number of elements to commit
– datatype = one of the MPI data types (see prior section)
– tag = non-negative integer to identify different send/recv processes
– comm = the communicator (see prior section)
– status / request = handle for non-blocking routines, which can be used in MPI_Wait(), MPI_Test() or MPI_Probe() later on (see the MPI standards for details)
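A minimal sketch using this syntax (the tag 42 and the transferred value are arbitrary choices): rank 0 sends one double to rank 1 with blocking calls.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 3.14;
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD, &status);
        printf("[%d] received value = %f\n", rank, value);
    }

    MPI_Finalize();
    return 0;
}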

2.4.2 Collective communication

• Involves data sharing between more than two processors, normally all processors (MPI_COMM_WORLD communicator).

• Always blocking communication.

• No derived Data Types allowed.

Three reasons for collective communication:

• Synchronization (e.g. wait until all processors have reached a specified point in the program)

• Data Movement (e.g. broadcasting information from one process to the whole group)

• Collective Computation (e.g. calculate global mass or kinetic energy in Hydrodynamics)

Synchronization

• MPI_Barrier(comm);

• Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier() call, blocks until all tasks in the group have reached the same MPI_Barrier() call.

Data Movement

• MPI_Bcast(&buffer, count, datatype, root, comm)
  Broadcasts a message from root to all other processes in comm.

• MPI_Scatter(&sendbuffer, sendcount, sendtype, &recvbuffer, recvcount, recvtype, root, comm)
  Distributes a message from root to each process in comm (chop and send).

• MPI_Gather(&sendbuffer, sendcount, sendtype, &recvbuffer, recvcount, recvtype, root, comm)
  Gathers distributed messages from comm on root (receive and assemble).

• MPI_Allgather(&sendbuffer, sendcount, sendtype, &recvbuffer, recvcount, recvtype, comm)
  MPI_Gather() to all processes in comm.

• MPI_Alltoall(&sendbuffer, sendcount, sendtype, &recvbuffer, recvcount, recvtype, comm)
  MPI_Scatter() from each process to all other processes in comm in ranking order.
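A minimal data-movement sketch (the broadcast payload 123 is an arbitrary choice): rank 0 broadcasts one integer to all processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        value = 123;   // illustrative payload, set on root only

    // After the call, every process in MPI_COMM_WORLD holds value == 123:
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("[%d] value = %d\n", rank, value);

    MPI_Finalize();
    return 0;
}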

Collective Computation

• MPI_Reduce(&sendbuffer, &recvbuffer, count, datatype, operation, root, comm) with operation

  – MPI_MAX = maximum
  – MPI_MIN = minimum
  – MPI_SUM = sum
  – MPI_PROD = product
  – MPI_LAND = logical AND
  – MPI_BAND = bit-wise AND
  – MPI_LOR = logical OR
  – MPI_BOR = bit-wise OR
  – MPI_LXOR = logical XOR
  – MPI_BXOR = bit-wise XOR
  – MPI_MAXLOC = maximum + location
  – MPI_MINLOC = minimum + location

• MPI_Allreduce(&sendbuffer, &recvbuffer, count, datatype, operation, comm)
  MPI_Reduce() to all processes in comm.

• MPI_Reduce_scatter(&sendbuffer, &recvbuffer, recvcount, datatype, operation, comm)
  MPI_Reduce() on a vector + MPI_Scatter().

• MPI_Scan(&sendbuffer, &recvbuffer, count, datatype, operation, comm)
  Partial MPI_Reduce() over all processes from rank 0 up to the own rank in comm.
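A minimal collective-computation sketch in the spirit of the global-mass example above (the local contribution 1.0 + rank is purely illustrative): each process contributes one value and root 0 receives the global sum.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double local_mass, total_mass = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local_mass = 1.0 + rank;   // illustrative local contribution

    // The result is only valid on the root (rank 0);
    // use MPI_Allreduce() if every process needs the total.
    MPI_Reduce(&local_mass, &total_mass, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Total mass = %f\n", total_mass);

    MPI_Finalize();
    return 0;
}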

2.5 Group and Communicator Management Routines

• Split MPI_COMM_WORLD into distinct communicating sub-groups.

• Define new user-specified virtual topologies.

• MPI provides over 40 routines related to groups, communicators, and virtual topologies.

• And we will not address them in this introduction.

2.6 Last message (for physicists)

• Normally you don’t have to implement MPI calls directly on your own.

• User-friendly libraries on top of MPI allow

– Safe communication via easy to use interface routines.
– Good performance.
– Different add-ons, like
  ∗ Parallel grid setup (topologies) and data management (e.g. CHOMBO, Cactus, DUNE).
  ∗ Parallel linear algebra algorithms (e.g. PETSc).
– Good parallel debugging and profiling tools.

• But: it is always easier to use such interface libraries if you understand the basics of parallel MPI communication and have implemented these basics yourself as an exercise (so that you know about the possible bugs).

Chapter 3

Appendix

3.1 Useful Links

• MPI-Forum: http://www.mpi-forum.org/
• MPI standards: http://www.mpi-forum.org/docs/docs.html
• Syntax overview (German only): http://www.tu-chemnitz.de/informatik/RA/projects/mpihelp/mpihelp.html
• MPICH: http://www.mcs.anl.gov/research/projects/mpich2/
• Open MPI: http://www.open-mpi.org/
• OpenMP: http://openmp.org/wp/
• Introductions / Tutorials from Blaise Barney, Livermore Computing: https://computing.llnl.gov/tutorials/
  – Parallel programming: https://computing.llnl.gov/tutorials/parallel_comp/
  – MPI: https://computing.llnl.gov/tutorials/mpi/

3.2 Exercise class

• No exercise class next Monday.

• On Monday afternoon in two weeks we’ll parallelize the donor-cell advection scheme (with point-to-point communication, collective reduction and collective data movement).

• No prior code / attendance is needed. The serial advection scheme is available on the website of the exercise class in C and Fortran.


Thanks for your attention!