Tutorial on MPI: The

Message-Passing Interface

William Gropp

Mathematics and Computer Science Division

Argonne National Laboratory

Argonne, IL 60439

[email protected]

Course Outline

* Background on parallel computing

* Getting Started

* MPI Basics

* Intermediate MPI

* Tools for writing libraries

* Final comments

Thanks to Rusty Lusk for some of the material in this tutorial.

This tutorial may be used in conjunction with the book "Using MPI", which contains detailed descriptions of the use of the MPI routines.

* Material that begins with this symbol is `advanced' and may be skipped on a first reading.

Background

* Parallel computing

* Communicating with other processes

* Cooperative operations

* One-sided operations

* The MPI process

Parallel Computing

* Separate workers or processes

* Interact by exchanging information

Types of parallel computing

All use different data for each worker.

Data-parallel  Same operations on different data.  Also called SIMD.

SPMD  Same program, different data.

MIMD  Different programs, different data.

SPMD and MIMD are essentially the same, because any MIMD program can be made SPMD.  SIMD is also equivalent, but in a less practical sense.

MPI is primarily for SPMD/MIMD.  HPF is an example of a SIMD interface.

Communicating with other processes

Data must be exchanged with other workers

* Cooperative: all parties agree to transfer data

* One sided: one worker performs transfer of data

Cooperative operations

Message-passing is an approach that makes the exchange of data cooperative.

Data must both be explicitly sent and received.

An advantage is that any change in the receiver's memory is made with the receiver's participation.

[Figure: Process 0 calls SEND( data ); Process 1 calls RECV( data ).]

One-sided operations

One-sided operations between parallel processes include remote memory reads and writes.

An advantage is that data can be accessed without waiting for another process.

[Figure: Process 0 calls PUT( data ) into Process 1's memory; Process 0 calls GET( data ) from Process 1's memory.]

Class Example

Take a pad of paper.  Algorithm: Initialize with the number of neighbors you have.

* Compute the average of your neighbors' values and subtract it from your value.  Make that your new value.

* Repeat until done.

Questions

1. How do you get values from your neighbors?

2. Which step or iteration do they correspond to?  Do you know?  Do you care?

3. How do you decide when you are done?

Hardware models

The previous example illustrates the hardware models by how data is exchanged among workers.

* Distributed memory (e.g., Paragon, IBM SPx, workstation network)

* Shared memory (e.g., SGI Power Challenge, T3D)

Either may be used with SIMD or MIMD software models.

* All memory is distributed.

What is MPI?

* A message-passing library specification
  - message-passing model
  - not a compiler specification
  - not a specific product

* For parallel computers, clusters, and heterogeneous networks

* Full-featured

* Designed to permit (unleash?) the development of parallel software libraries

* Designed to provide access to advanced parallel hardware for
  - end users
  - library writers
  - tool developers

Motivation for a New Design

* Message passing is now mature as a programming paradigm
  - well understood
  - efficient match to hardware
  - many applications

* Vendor systems are not portable

* Portable systems are mostly research projects
  - incomplete
  - lack vendor support
  - not at most efficient level

Motivation (cont.)

Few systems offer the full range of desired features.

* modularity (for libraries)

* access to peak performance

* portability

* heterogeneity

* subgroups

* topologies

* performance measurement tools

The MPI Process

* Began at Williamsburg Workshop in April, 1992

* Organized at Supercomputing '92 (November)

* Followed HPF format and process

* Met every six weeks for two days

* Extensive, open email discussions

* Drafts, readings, votes

* Pre-final draft distributed at Supercomputing '93

* Two-month public comment period

* Final version of draft in May, 1994

* Widely available now on the Web, ftp sites, netlib
  http://www.mcs.anl.gov/mpi/index.html

* Public implementations available

* Vendor implementations coming soon

Who Designed MPI?

* Broad participation

* Vendors
  - IBM, Intel, TMC, Meiko, Cray, Convex, nCUBE

* Library writers
  - PVM, p4, Zipcode, TCGMSG, Chameleon, Express, Linda

* Application specialists and consultants

    Companies    Laboratories    Universities
    ARCO         ANL             UC Santa Barbara
    Convex       GMD             Syracuse U
    Cray Res     LANL            Michigan State U
    IBM          LLNL            Oregon Grad Inst
    Intel        NOAA            U of New Mexico
    KAI          NSF             Miss. State U.
    Meiko        ORNL            U of Southampton
    NAG          PNL             U of Colorado
    nCUBE        Sandia          Yale U
    ParaSoft     SDSC            U of Tennessee
    Shell        SRC             U of Maryland
    TMC                          Western Mich U
                                 U of Edinburgh
                                 Cornell U.
                                 Rice U.
                                 U of San Francisco

Features of MPI

* General
  - Communicators combine context and group for message security
  - Thread safety

* Point-to-point communication
  - Structured buffers and derived datatypes, heterogeneity
  - Modes: normal (blocking and non-blocking), synchronous, ready (to allow access to fast protocols), buffered

* Collective
  - Both built-in and user-defined collective operations
  - Large number of data movement routines
  - Subgroups defined directly or by topology

Features of MPI (cont.)

* Application-oriented process topologies
  - Built-in support for grids and graphs (uses groups)

* Profiling
  - Hooks allow users to intercept MPI calls to install their own tools

* Environmental
  - inquiry
  - error control

Features not in MPI

* Non-message-passing concepts not included:
  - process management
  - remote memory transfers
  - active messages
  - threads
  - virtual shared memory

* MPI does not address these issues, but has tried to remain compatible with these ideas (e.g. thread safety as a goal, intercommunicators)

Is MPI Large or Small?

* MPI is large (125 functions)
  - MPI's extensive functionality requires many functions
  - Number of functions is not necessarily a measure of complexity

* MPI is small (6 functions)
  - Many parallel programs can be written with just 6 basic functions.

* MPI is just right
  - One can access flexibility when it is required.
  - One need not master all parts of MPI to use it.

Where to use MPI?

* You need a portable parallel program

* You are writing a parallel library

* You have irregular or dynamic data relationships that do not fit a data parallel model

Where not to use MPI:

* You can use HPF or a parallel Fortran 90

* You don't need parallelism at all

* You can use libraries (which may be written in MPI)

Why learn MPI?

* Portable

* Expressive

* Good way to learn about subtle issues in parallel computing

Getting started

* Writing MPI programs

* Compiling and linking

* Running MPI programs

* More information
  - Using MPI by William Gropp, Ewing Lusk, and Anthony Skjellum
  - The LAM companion to "Using MPI..." by Zdzislaw Meglicki
  - Designing and Building Parallel Programs by Ian Foster
  - A Tutorial/User's Guide for MPI by Peter Pacheco
    (ftp://math.usfca.edu/pub/MPI/mpi.guide.ps)
  - The MPI standard and other information is available at
    http://www.mcs.anl.gov/mpi.  Also the source for several
    implementations.

Writing MPI programs

#include "mpi.h"
#include <stdio.h>

int main( argc, argv )
int argc;
char **argv;
{
    MPI_Init( &argc, &argv );
    printf( "Hello world\n" );
    MPI_Finalize();
    return 0;
}

Commentary

* #include "mpi.h" provides basic MPI definitions and types

* MPI_Init starts MPI

* MPI_Finalize exits MPI

* Note that all non-MPI routines are local; thus the printf runs on each process

Compiling and linking

For simple programs, special compiler commands can be used.  For large projects, it is best to use a standard Makefile.

The MPICH implementation provides the commands mpicc and mpif77 as well as `Makefile' examples in `/usr/local/mpi/examples/Makefile.in'

Special compilation commands

The commands

    mpicc -o first first.c
    mpif77 -o firstf firstf.f

may be used to build simple programs when using MPICH.

These provide special options that exploit the profiling features of MPI:

-mpilog    Generate log files of MPI calls
-mpitrace  Trace execution of MPI calls
-mpianim   Real-time animation of MPI (not available on all systems)

These are specific to the MPICH implementation; other implementations may provide similar commands (e.g., mpcc and mpxlf on the IBM SP2).

Using Makefiles

The file `Makefile.in' is a template Makefile.  The program (script) `mpireconfig' translates this to a Makefile for a particular system.  This allows you to use the same Makefile for a network of workstations and a massively parallel computer, even when they use different compilers, libraries, and linker options.

    mpireconfig Makefile

Note that you must have `mpireconfig' in your PATH.

Sample Makefile.in

##### User configurable options #####

ARCH        = @ARCH@
COMM        = @COMM@
INSTALL_DIR = @INSTALL_DIR@
CC          = @CC@
F77         = @F77@
CLINKER     = @CLINKER@
FLINKER     = @FLINKER@
OPTFLAGS    = @OPTFLAGS@

LIB_PATH    = -L$(INSTALL_DIR)/lib/$(ARCH)/$(COMM)
FLIB_PATH   = @FLIB_PATH_LEADER@$(INSTALL_DIR)/lib/$(ARCH)/$(COMM)
LIB_LIST    = @LIB_LIST@

INCLUDE_DIR = @INCLUDE_PATH@ -I$(INSTALL_DIR)/include

##### End User configurable options #####

Sample Makefile.in (con't)

CFLAGS  = @CFLAGS@ $(OPTFLAGS) $(INCLUDE_DIR) -DMPI_$(ARCH)
FFLAGS  = @FFLAGS@ $(INCLUDE_DIR) $(OPTFLAGS)
LIBS    = $(LIB_PATH) $(LIB_LIST)
FLIBS   = $(FLIB_PATH) $(LIB_LIST)
EXECS   = hello

default: hello

all: $(EXECS)

hello: hello.o $(INSTALL_DIR)/include/mpi.h
	$(CLINKER) $(OPTFLAGS) -o hello hello.o \
	  $(LIB_PATH) $(LIB_LIST) -lm

clean:
	/bin/rm -f *.o *~ PI* $(EXECS)

.c.o:
	$(CC) $(CFLAGS) -c $*.c
.f.o:
	$(F77) $(FFLAGS) -c $*.f

Running MPI programs

    mpirun -np 2 hello

`mpirun' is not part of the standard, but some version of it is common with several MPI implementations.  The version shown here is for the MPICH implementation of MPI.

* Just as Fortran does not specify how Fortran programs are started, MPI does not specify how MPI programs are started.

* The option -t shows the commands that mpirun would execute; you can use this to find out how mpirun starts programs on your system.  The option -help shows all options to mpirun.

Finding out about the environment

Two of the first questions asked in a parallel program are: How many processes are there?  and Who am I?

How many is answered with MPI_Comm_size and who am I is answered with MPI_Comm_rank.

The rank is a number between zero and size-1.

A simple program

#include "mpi.h"
#include <stdio.h>

int main( argc, argv )
int argc;
char **argv;
{
    int rank, size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    printf( "Hello world! I'm %d of %d\n", rank, size );
    MPI_Finalize();
    return 0;
}

Caveats

* These sample programs have been kept as simple as possible by assuming that all processes can do output.  Not all parallel systems provide this feature, and MPI provides a way to handle this case.

Exercise - Getting Started

Objective: Learn how to login, write, compile, and run a simple MPI program.

Run the "Hello world" programs.  Try two different parallel computers.  What does the output look like?

Sending and Receiving messages

[Figure: Process 0 has a buffer A: and calls Send; Process 1 has a buffer B: and calls Recv.]

Questions:

* To whom is data sent?

* What is sent?

* How does the receiver identify it?

Current Message-Passing

* A typical blocking send looks like

    send( dest, type, address, length )

  where
  - dest is an integer identifier representing the process to receive the message.
  - type is a nonnegative integer that the destination can use to selectively screen messages.
  - (address, length) describes a contiguous area in memory containing the message to be sent.

* A typical global operation looks like:

    broadcast( type, address, length )

* All of these specifications are a good match to hardware, easy to understand, but too inflexible.

The Buffer

Sending and receiving only a contiguous array of bytes:

* hides the real data structure from hardware which might be able to handle it directly

* requires pre-packing dispersed data
  - rows of a matrix stored columnwise
  - general collections of structures

* prevents communications between machines with different representations (even lengths) for the same data type

Generalizing the Buffer Description

* Specified in MPI by starting address, datatype, and count, where datatype is:
  - elementary (all C and Fortran datatypes)
  - contiguous array of datatypes
  - strided blocks of datatypes
  - indexed array of blocks of datatypes
  - general structure

* Datatypes are constructed recursively.

* Specifying elementary datatypes allows heterogeneous communication.

* Elimination of length in favor of count is clearer.

* Specifying application-oriented layout of data allows maximal use of special hardware.

Generalizing the Type

* A single type field is too constraining.  Often overloaded to provide needed flexibility.

* Problems:
  - under user control
  - wild cards allowed (MPI_ANY_TAG)
  - library use conflicts with user and with other libraries

Sample Program using Library Calls

Sub1 and Sub2 are from different libraries.

    Sub1();
    Sub2();

Sub1a and Sub1b are from the same library.

    Sub1a();
    Sub2();
    Sub1b();

Thanks to Marc Snir for the following four examples.

Correct Execution of Library Calls

Process 0 Process 1 Process 2

recv(any) send(1) Sub1

recv(any) send(0)

recv(1) send(0) Sub2 recv(2) send(1)

send(2) recv(0) 41

Incorrect Execution of Library Calls

Process 0 Process 1 Process 2

recv(any) send(1) Sub1

recv(any) send(0)

recv(1) send(0)

Sub2 recv(2) send(1)

send(2) recv(0) 42

Correct Execution of Library Calls with Pending Communication

Process 0 Process 1 Process 2

recv(any) send(1) Sub1a

send(0)

recv(2) send(0)

Sub2 send(2) recv(1)

send(1) recv(0)

recv(any)

Sub1b 43

Incorrect Execution of Library Calls with Pending Communication

Process 0 Process 1 Process 2

recv(any) send(1) Sub1a

send(0)

recv(2) send(0)

Sub2 send(2) recv(1)

send(1) recv(0)

recv(any)

Sub1b 44

Solution to the type problem

* A separate communication context for each family of messages, used for queueing and matching.  (This has often been simulated in the past by overloading the tag field.)

* No wild cards allowed, for security

* Allocated by the system, for security

* Types (tags, in MPI) retained for normal use (wild cards OK)

Delimiting Scope of Communication

* Separate groups of processes working on subproblems
  - Merging of process name space interferes with modularity
  - "Local" process identifiers desirable

* Parallel invocation of parallel libraries
  - Messages from application must be kept separate from messages internal to library.
  - Knowledge of library message types interferes with modularity.
  - Synchronizing before and after library calls is undesirable.

Generalizing the Process Identifier

* Collective operations typically operated on all processes (although some systems provide subgroups).

* This is too restrictive (e.g., need minimum over a column or a sum across a row of processes).

* MPI provides groups of processes
  - initial "all" group
  - group management routines (build, delete groups)

* All communication (not just collective operations) takes place in groups.

* A group and a context are combined in a communicator.

* Source/destination in send/receive operations refer to rank in the group associated with a given communicator.  MPI_ANY_SOURCE is permitted in a receive.

MPI Basic Send/Receive

Thus the basic (blocking) send has become:

    MPI_Send( start, count, datatype, dest, tag, comm )

and the receive:

    MPI_Recv( start, count, datatype, source, tag, comm, status )

The source, tag, and count of the message actually received can be retrieved from status.

Two simple collective operations:

    MPI_Bcast( start, count, datatype, root, comm )
    MPI_Reduce( start, result, count, datatype, operation, root, comm )
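To make the calling sequence concrete, here is a minimal C sketch (not from the original slides) in which process 0 sends one integer to process 1; the payload value and the tag are arbitrary choices for illustration.

    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char **argv )
    {
        int rank, value;
        MPI_Status status;

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        if (rank == 0) {
            value = 42;                       /* arbitrary payload */
            MPI_Send( &value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD );
        }
        else if (rank == 1) {
            MPI_Recv( &value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status );
            printf( "received %d from %d\n", value, status.MPI_SOURCE );
        }
        MPI_Finalize();
        return 0;
    }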

Getting information about a message

    MPI_Status status;
    MPI_Recv( ..., &status );
    ... status.MPI_TAG;
    ... status.MPI_SOURCE;
    MPI_Get_count( &status, datatype, &count );

MPI_TAG and MPI_SOURCE are primarily of use when MPI_ANY_TAG and/or MPI_ANY_SOURCE is used in the receive.

MPI_Get_count may be used to determine how much data of a particular type was received.

Simple Fortran example

      program main
      include 'mpif.h'

      integer rank, size, to, from, tag, count, i, ierr
      integer src, dest
      integer st_source, st_tag, st_count
      integer status(MPI_STATUS_SIZE)
      double precision data(100)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
      print *, 'Process ', rank, ' of ', size, ' is alive'
      dest = size - 1
      src  = 0
C
      if (rank .eq. src) then
         to    = dest
         count = 10
         tag   = 2001
         do 10 i=1, 10
 10         data(i) = i
         call MPI_SEND( data, count, MPI_DOUBLE_PRECISION, to,
     +                  tag, MPI_COMM_WORLD, ierr )
      else if (rank .eq. dest) then
         tag   = MPI_ANY_TAG
         count = 10
         from  = MPI_ANY_SOURCE
         call MPI_RECV( data, count, MPI_DOUBLE_PRECISION, from,
     +                  tag, MPI_COMM_WORLD, status, ierr )

Simple Fortran example (cont.)

         call MPI_GET_COUNT( status, MPI_DOUBLE_PRECISION,
     +                       st_count, ierr )
         st_source = status(MPI_SOURCE)
         st_tag    = status(MPI_TAG)
C
         print *, 'Status info: source = ', st_source,
     +            ' tag = ', st_tag, ' count = ', st_count
         print *, rank, ' received', (data(i), i=1,10)
      endif

      call MPI_FINALIZE( ierr )
      end

Six Function MPI

MPI is very simple.  These six functions allow you to write many programs:

    MPI_Init
    MPI_Finalize
    MPI_Comm_size
    MPI_Comm_rank
    MPI_Send
    MPI_Recv

A taste of things to come

The following examples show a C and Fortran version of the same program.

This program computes PI (with a very simple method) but does not use MPI_Send and MPI_Recv.  Instead, it uses collective operations to send data to and from all of the running processes.  This gives a different six-function MPI set:

    MPI_Init
    MPI_Finalize
    MPI_Comm_size
    MPI_Comm_rank
    MPI_Bcast
    MPI_Reduce

Broadcast and Reduction

The routine MPI_Bcast sends data from one process to all others.

The routine MPI_Reduce combines data from all processes (by adding them in this case) and returns the result to a single process.

Fortran example: PI

      program main
      include "mpif.h"

      double precision  PI25DT
      parameter        (PI25DT = 3.141592653589793238462643d0)

      double precision  mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, rc
c                                 function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

 10   if ( myid .eq. 0 ) then
         write(6,98)
 98      format('Enter the number of intervals: (0 quits)')
         read(5,99) n
 99      format(i10)
      endif
      call MPI_BCAST( n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr )

Fortran example (cont.)

c                                 check for quit signal
      if ( n .le. 0 ) goto 30
c                                 calculate the interval size
      h   = 1.0d0/n
      sum = 0.0d0
      do 20 i = myid+1, n, numprocs
         x   = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
 20   continue
      mypi = h * sum
c                                 collect all the partial sums
      call MPI_REDUCE( mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     $                 MPI_COMM_WORLD, ierr )
c                                 node 0 prints the answer.
      if (myid .eq. 0) then
         write(6, 97) pi, abs(pi - PI25DT)
 97      format('  pi is approximately: ', F18.16,
     +          '  Error is: ', F18.16)
      endif
      goto 10
 30   call MPI_FINALIZE(rc)
      stop
      end

C example: PI

#include "mpi.h"
#include <math.h>

int main(argc,argv)
int argc;
char *argv[];
{
    int done = 0, n, myid, numprocs, i, rc;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x, a;

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);

C example (cont.)

    while (!done)
    {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d",&n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) break;

        h   = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;

        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n",
                   pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
}

Exercise - PI

Objective: Experiment with send/receive

Run either program for PI.  Write new versions that replace the calls to MPI_Bcast and MPI_Reduce with MPI_Send and MPI_Recv.

* The MPI broadcast and reduce operations use at most log p send and receive operations on each process, where p is the size of MPI_COMM_WORLD.  How many operations do your versions use?

Exercise - Ring

Objective: Experiment with send/receive

Write a program to send a message around a ring of processors.  That is, processor 0 sends to processor 1, who sends to processor 2, etc.  The last processor returns the message to processor 0.

* You can use the routine MPI_Wtime to time code in MPI.  The statement

    t = MPI_Wtime();

returns the time as a double (DOUBLE PRECISION in Fortran).

Topologies

MPI provides routines to give structure to collections of processes.

This helps to answer the question: Who are my neighbors?

Cartesian Topologies

A Cartesian topology is a mesh.

Example of a 3 x 4 Cartesian mesh with arrows pointing at the right neighbors:

    (0,2) (1,2) (2,2) (3,2)
    (0,1) (1,1) (2,1) (3,1)
    (0,0) (1,0) (2,0) (3,0)

Defining a Cartesian Topology

The routine MPI_Cart_create creates a Cartesian decomposition of the processes, with the number of dimensions given by the ndim argument.

      dims(1)    = 4
      dims(2)    = 3
      periods(1) = .false.
      periods(2) = .false.
      reorder    = .true.
      ndim       = 2
      call MPI_CART_CREATE( MPI_COMM_WORLD, ndim, dims,
     $     periods, reorder, comm2d, ierr )

Finding neighbors

MPI_Cart_create creates a new communicator with the same processes as the input communicator, but with the specified topology.

The question, Who are my neighbors, can now be answered with MPI_Cart_shift:

      call MPI_CART_SHIFT( comm2d, 0, 1,
     $     nbrleft, nbrright, ierr )
      call MPI_CART_SHIFT( comm2d, 1, 1,
     $     nbrbottom, nbrtop, ierr )

The values returned are the ranks, in the communicator comm2d, of the neighbors shifted by 1 in the two dimensions.

Who am I?

Can be answered with

      integer coords(2)
      call MPI_COMM_RANK( comm1d, myrank, ierr )
      call MPI_CART_COORDS( comm1d, myrank, 2,
     $     coords, ierr )

Returns the Cartesian coordinates of the calling process in coords.

Partitioning

When creating a Cartesian topology, one question is "What is a good choice for the decomposition of the processors?"

This question can be answered with MPI_Dims_create:

      integer dims(2)
      dims(1) = 0
      dims(2) = 0
      call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
      call MPI_DIMS_CREATE( size, 2, dims, ierr )
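The slides show these routines in Fortran; a hedged C sketch (not from the slides) that puts them together, picking a 2-D decomposition, building the Cartesian communicator, and asking for coordinates and neighbors, might look like this:

    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char **argv )
    {
        int dims[2] = {0, 0}, periods[2] = {0, 0};
        int size, myrank, coords[2], left, right, bottom, top;
        MPI_Comm comm2d;

        MPI_Init( &argc, &argv );
        MPI_Comm_size( MPI_COMM_WORLD, &size );

        MPI_Dims_create( size, 2, dims );   /* pick a balanced decomposition */
        MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, 1, &comm2d );

        MPI_Comm_rank( comm2d, &myrank );
        MPI_Cart_coords( comm2d, myrank, 2, coords );
        MPI_Cart_shift( comm2d, 0, 1, &left, &right );
        MPI_Cart_shift( comm2d, 1, 1, &bottom, &top );

        printf( "rank %d at (%d,%d): left %d right %d bottom %d top %d\n",
                myrank, coords[0], coords[1], left, right, bottom, top );

        MPI_Comm_free( &comm2d );
        MPI_Finalize();
        return 0;
    }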

Other Topology Routines

MPI contains routines to translate between Cartesian coordinates and ranks in a communicator, and to access the properties of a Cartesian topology.

The routine MPI_Graph_create allows the creation of a general graph topology.

Why are these routines in MPI?

In many parallel computer interconnects, some processors are closer together than others.  These routines allow the MPI implementation to provide an ordering of processes in a topology that makes logical neighbors close in the physical interconnect.

* Some parallel programmers may remember hypercubes and the effort that went into assigning nodes in a mesh to processors in a hypercube through the use of Gray codes.  Many new systems have different interconnects; ones with multiple paths may have notions of near neighbors that change with time.  These routines free the programmer from many of these considerations.  The reorder argument is used to request the best ordering.

The periods argument

Who are my neighbors if I am at the edge of a Cartesian mesh?

Periodic Grids

Specify this in MPI_Cart_create with

      dims(1)    = 4
      dims(2)    = 3
      periods(1) = .TRUE.
      periods(2) = .TRUE.
      reorder    = .true.
      ndim       = 2
      call MPI_CART_CREATE( MPI_COMM_WORLD, ndim, dims,
     $     periods, reorder, comm2d, ierr )

Nonperiodic Grids

In the nonperiodic case, a neighbor may not exist.  This is indicated by a rank of MPI_PROC_NULL.

This rank may be used in send and receive calls in MPI.  The action in both cases is as if the call was not made.
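Because a send or receive involving MPI_PROC_NULL does nothing, an edge exchange needs no special cases at the boundary.  A hedged C sketch (the names comm2d, outbuf, and inbuf are illustrative, not from the slides):

    double outbuf = 1.0, inbuf = 0.0;
    int    left, right;
    MPI_Status status;

    /* ranks of left/right neighbors; MPI_PROC_NULL at the edges */
    MPI_Cart_shift( comm2d, 0, 1, &left, &right );

    /* every process executes the same code; calls involving
       MPI_PROC_NULL simply do nothing */
    MPI_Sendrecv( &outbuf, 1, MPI_DOUBLE, right, 0,
                  &inbuf,  1, MPI_DOUBLE, left,  0,
                  comm2d, &status );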

Collective Communications in MPI

* Communication is coordinated among a group of processes.

* Groups can be constructed "by hand" with MPI group-manipulation routines or by using MPI topology-definition routines.

* Message tags are not used.  Different communicators are used instead.

* No non-blocking collective operations.

* Three classes of collective operations:
  - synchronization
  - data movement
  - collective computation

Synchronization

* MPI_Barrier( comm )

* Function blocks until all processes in comm call it.

Available Collective Patterns

[Figure: schematic representation of collective data movement in MPI among processes P0-P3: Broadcast, Scatter/Gather, Allgather, and Alltoall.]

Available Collective Computation Patterns

[Figure: schematic representation of collective computation in MPI among processes P0-P3: Reduce (A, B, C, D combined onto one process) and Scan (prefix combination: A, AB, ABC, ABCD).]

MPI Collective Routines

* Many routines:

    Allgather      Allgatherv     Allreduce
    Alltoall       Alltoallv      Bcast
    Gather         Gatherv        Reduce
    ReduceScatter  Scan           Scatter
    Scatterv

* All versions deliver results to all participating processes.

* V versions allow the chunks to have different sizes.

* Allreduce, Reduce, ReduceScatter, and Scan take both built-in and user-defined combination functions.
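As a small, hedged illustration of the collective calling style (the buffer size and the particular pairing of MPI_Scatter with MPI_Reduce are arbitrary choices, not from the slides), the root scatters one integer to each process and then sums the values back:

    #include "mpi.h"
    #include <stdio.h>

    int main( int argc, char **argv )
    {
        int rank, size, mine, total, i;
        int sendbuf[64];                 /* assumes size <= 64 */

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        MPI_Comm_size( MPI_COMM_WORLD, &size );

        if (rank == 0)
            for (i = 0; i < size; i++) sendbuf[i] = i + 1;

        /* each process receives one element of sendbuf */
        MPI_Scatter( sendbuf, 1, MPI_INT, &mine, 1, MPI_INT,
                     0, MPI_COMM_WORLD );

        /* combine everyone's value with MPI_SUM onto the root */
        MPI_Reduce( &mine, &total, 1, MPI_INT, MPI_SUM,
                    0, MPI_COMM_WORLD );

        if (rank == 0)
            printf( "sum of 1..%d is %d\n", size, total );
        MPI_Finalize();
        return 0;
    }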

Built-in Collective Computation Operations

    MPI Name     Operation
    MPI_MAX      Maximum
    MPI_MIN      Minimum
    MPI_PROD     Product
    MPI_SUM      Sum
    MPI_LAND     Logical and
    MPI_LOR      Logical or
    MPI_LXOR     Logical exclusive or (xor)
    MPI_BAND     Bitwise and
    MPI_BOR      Bitwise or
    MPI_BXOR     Bitwise xor
    MPI_MAXLOC   Maximum value and location
    MPI_MINLOC   Minimum value and location

Defining Your Own Collective Operations

    MPI_Op_create( user_function, commute, op )
    MPI_Op_free( op )

    user_function( invec, inoutvec, len, datatype )

The user function should perform

    inoutvec[i] = invec[i] op inoutvec[i];

for i from 0 to len-1.

user_function can be non-commutative (e.g., matrix multiply).

Sample user function

For example, to create an operation that has the same effect as MPI_SUM on Fortran double precision values, use

      subroutine myfunc( invec, inoutvec, len, datatype )
      integer len, datatype
      double precision invec(len), inoutvec(len)
      integer i
      do 10 i=1,len
 10      inoutvec(i) = invec(i) + inoutvec(i)
      return
      end

To use, just

      integer myop
      call MPI_Op_create( myfunc, .true., myop, ierr )
      call MPI_Reduce( a, b, 1, MPI_DOUBLE_PRECISION, myop, ... )

The routine MPI_Op_free destroys user functions when they are no longer needed.

Defining groups

All MPI communication is relative to a communicator, which contains a context and a group.  The group is just a set of processes.

Subdividing a communicator

The easiest way to create communicators with new groups is with MPI_COMM_SPLIT.

For example, to form groups of rows of processes

[Figure: a grid of processes with Rows 0-2 and Columns 0-4; each row becomes one group.]

use

    MPI_Comm_split( oldcomm, row, 0, &newcomm );

To maintain the order by rank, use

    MPI_Comm_rank( oldcomm, &rank );
    MPI_Comm_split( oldcomm, row, rank, &newcomm );

Subdividing (con't)

Similarly, to form groups of columns,

[Figure: the same grid of processes; each column becomes one group.]

use

    MPI_Comm_split( oldcomm, column, 0, &newcomm2 );

To maintain the order by rank, use

    MPI_Comm_rank( oldcomm, &rank );
    MPI_Comm_split( oldcomm, column, rank, &newcomm2 );
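Here row and column are the "color" arguments to MPI_Comm_split.  A hedged sketch of how they might be computed from the rank for an nrows x ncols logical grid (the variable names and grid shape are illustrative only):

    int nrows = 3, ncols = 4;        /* example grid shape */
    int rank, row, column;
    MPI_Comm rowcomm, colcomm;

    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    row    = rank / ncols;           /* same row    -> same rowcomm */
    column = rank % ncols;           /* same column -> same colcomm */

    MPI_Comm_split( MPI_COMM_WORLD, row,    rank, &rowcomm );
    MPI_Comm_split( MPI_COMM_WORLD, column, rank, &colcomm );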

Manipulating Groups

Another way to create a communicator with specific members is to use MPI_Comm_create.

    MPI_Comm_create( oldcomm, group, &newcomm );

The group can be created in many ways:

Creating Groups

All group creation routines create a group by specifying the members to take from an existing group.

* MPI_Group_incl specifies specific members

* MPI_Group_excl excludes specific members

* MPI_Group_range_incl and MPI_Group_range_excl use ranges of members

* MPI_Group_union and MPI_Group_intersection create a new group from two existing groups.

To get an existing group, use

    MPI_Comm_group( oldcomm, &group );

Free a group with

    MPI_Group_free( &group );
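A hedged sketch putting these calls together; the choice of the even-ranked processes is just an example:

    MPI_Group world_group, even_group;
    MPI_Comm  evencomm;
    int       size, i, nevens, ranks[128];   /* assumes size <= 256 */

    MPI_Comm_size( MPI_COMM_WORLD, &size );
    MPI_Comm_group( MPI_COMM_WORLD, &world_group );

    /* list the even ranks and include only them */
    nevens = 0;
    for (i = 0; i < size; i += 2) ranks[nevens++] = i;
    MPI_Group_incl( world_group, nevens, ranks, &even_group );

    /* collective over MPI_COMM_WORLD; processes not in the group
       get MPI_COMM_NULL back */
    MPI_Comm_create( MPI_COMM_WORLD, even_group, &evencomm );

    MPI_Group_free( &even_group );
    MPI_Group_free( &world_group );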

Buffering issues

Where does data go when you send it?  One possibility is:

[Figure: Process 1's buffer A: is copied to a local buffer, across the network into a local buffer on Process 2, and then into B:.]

Better buffering

This is not very efficient.  There are three copies in addition to the exchange of data between processes.  We prefer

[Figure: Process 1's buffer A: is transferred directly into Process 2's buffer B:.]

But this requires either that MPI_Send not return until the data has been delivered, or that we allow a send operation to return before completing the transfer.  In that case, we need to test for completion later.

Blocking and Non-Blocking communication

* So far we have used blocking communication:
  - MPI_Send does not complete until the buffer is empty (available for reuse).
  - MPI_Recv does not complete until the buffer is full (available for use).

* Simple, but can be "unsafe":

    Process 0          Process 1
    Send(1)            Send(0)
    Recv(1)            Recv(0)

  Completion depends in general on the size of the message and the amount of system buffering.

* Send works for small enough messages but fails when messages get too large.  Too large ranges from zero bytes to 100's of Megabytes.

Some Solutions to the "Unsafe" Problem

* Order the operations more carefully:

    Process 0          Process 1
    Send(1)            Recv(0)
    Recv(1)            Send(0)

* Supply a receive buffer at the same time as the send, with MPI_Sendrecv:

    Process 0          Process 1
    Sendrecv(1)        Sendrecv(0)

* Use non-blocking operations:

    Process 0          Process 1
    Isend(1)           Isend(0)
    Irecv(1)           Irecv(0)
    Waitall            Waitall

* Use MPI_Bsend

MPI's Non-Blocking Operations

Non-blocking operations return (immediately) "request handles" that can be waited on and queried:

* MPI_Isend( start, count, datatype, dest, tag, comm, request )

* MPI_Irecv( start, count, datatype, source, tag, comm, request )

* MPI_Wait( request, status )

One can also test without waiting:

    MPI_Test( request, flag, status )
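A hedged C sketch of the non-blocking style, exchanging one value with a neighbor (the neighbor rank and tag are illustrative):

    double      sendval = 3.14, recvval;
    int         neighbor = 1;                 /* example destination */
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    /* post the receive first, then the send, then wait for both */
    MPI_Irecv( &recvval, 1, MPI_DOUBLE, neighbor, 0,
               MPI_COMM_WORLD, &reqs[0] );
    MPI_Isend( &sendval, 1, MPI_DOUBLE, neighbor, 0,
               MPI_COMM_WORLD, &reqs[1] );

    /* ... unrelated work can be done here ... */

    MPI_Waitall( 2, reqs, stats );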

Multiple completions

It is often desirable to wait on multiple requests.  An example is a master/slave program, where the master waits for one or more slaves to send it a message.

* MPI_Waitall( count, array_of_requests, array_of_statuses )

* MPI_Waitany( count, array_of_requests, index, status )

* MPI_Waitsome( incount, array_of_requests, outcount, array_of_indices, array_of_statuses )

There are corresponding versions of test for each of these.

* MPI_WAITSOME and MPI_TESTSOME may be used to implement master/slave algorithms that provide fair access to the master by the slaves.

Fairness

What happens with this program:

#include "mpi.h"
#include <stdio.h>

int main( argc, argv )
int argc;
char **argv;
{
    int rank, size, i, buf[1];
    MPI_Status status;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    if (rank == 0) {
        for (i=0; i<100*(size-1); i++) {
            MPI_Recv( buf, 1, MPI_INT, MPI_ANY_SOURCE,
                      MPI_ANY_TAG, MPI_COMM_WORLD, &status );
            printf( "Msg from %d with tag %d\n",
                    status.MPI_SOURCE, status.MPI_TAG );
        }
    }
    else {
        for (i=0; i<100; i++)
            MPI_Send( buf, 1, MPI_INT, 0, i, MPI_COMM_WORLD );
    }
    MPI_Finalize();
    return 0;
}

Fairness in message-passing

A parallel algorithm is fair if no process is effectively ignored.  In the preceding program, processes with low rank (like process zero) may be the only ones whose messages are received.

MPI makes no guarantees about fairness.  However, MPI makes it possible to write efficient, fair programs.

Providing Fairness

One alternative is

#define large 128
MPI_Request requests[large];
MPI_Status  statuses[large];
int         indices[large];
int         buf[large];
for (i=1; i<size; i++)
    MPI_Irecv( buf+i, 1, MPI_INT, i,
               MPI_ANY_TAG, MPI_COMM_WORLD, &requests[i-1] );
while (not done) {
    MPI_Waitsome( size-1, requests, &ndone, indices, statuses );
    for (i=0; i<ndone; i++) {
        j = indices[i];
        printf( "Msg from %d with tag %d\n",
                statuses[i].MPI_SOURCE,
                statuses[i].MPI_TAG );
        MPI_Irecv( buf+j, 1, MPI_INT, j,
                   MPI_ANY_TAG, MPI_COMM_WORLD, &requests[j] );
    }
}

Providing Fairness (Fortran)

One alternative is

      parameter( large = 128 )
      integer requests(large)
      integer statuses(MPI_STATUS_SIZE,large)
      integer indices(large)
      integer buf(large)
      logical done
      do 10 i = 1,size-1
 10      call MPI_Irecv( buf(i), 1, MPI_INTEGER, i,
     *        MPI_ANY_TAG, MPI_COMM_WORLD, requests(i), ierr )
 20   if (.not. done) then
         call MPI_Waitsome( size-1, requests, ndone,
     *        indices, statuses, ierr )
         do 30 i=1, ndone
            j = indices(i)
            print *, 'Msg from ', statuses(MPI_SOURCE,i), ' with tag',
     *           statuses(MPI_TAG,i)
            call MPI_Irecv( buf(j), 1, MPI_INTEGER, j,
     *           MPI_ANY_TAG, MPI_COMM_WORLD, requests(j), ierr )
            done = ...
 30      continue
         goto 20
      endif

Exercise - Fairness

Objective: Use nonblocking communications

Complete the program fragment on "providing fairness".  Make sure that you leave no uncompleted requests.  How would you test your program?

More on nonblocking communication

In applications where the time to send data between processes is large, it is often helpful to cause communication and computation to overlap.  This can easily be done with MPI's non-blocking routines.

For example, in a 2-D finite difference mesh, moving data needed for the boundaries can be done at the same time as computation on the interior.

    MPI_Irecv( ... each ghost edge ... );
    MPI_Isend( ... data for each ghost edge ... );
    ... compute on interior
    while (still some uncompleted requests) {
        MPI_Waitany( ... requests ... )
        if (request is a receive)
            ... compute on that edge ...
    }

Note that we call MPI_Waitany several times.  This exploits the fact that after a request is satisfied, it is set to MPI_REQUEST_NULL, and that this is a valid request object to the wait and test routines.
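A hedged, concrete version of the loop above for a 1-D decomposition with two ghost cells (the array u, its length n, the neighbor ranks left and right, the communicator comm, and the helper routines compute_interior and compute_edge are all illustrative, not from the slides):

    MPI_Request reqs[4];
    MPI_Status  status;
    int         idx, k;

    /* post receives and sends for the two ghost cells */
    MPI_Irecv( &ghost_left,  1, MPI_DOUBLE, left,  0, comm, &reqs[0] );
    MPI_Irecv( &ghost_right, 1, MPI_DOUBLE, right, 0, comm, &reqs[1] );
    MPI_Isend( &u[1],        1, MPI_DOUBLE, left,  0, comm, &reqs[2] );
    MPI_Isend( &u[n],        1, MPI_DOUBLE, right, 0, comm, &reqs[3] );

    compute_interior( u, n );       /* overlap: no ghost data needed here */

    for (k = 0; k < 4; k++) {
        MPI_Waitany( 4, reqs, &idx, &status );
        if (idx == 0 || idx == 1)   /* a receive finished: that edge is ready */
            compute_edge( u, n, idx );
    }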

Communication Modes

MPI provides multiple modes for sending messages:

* Synchronous mode (MPI_Ssend): the send does not complete until a matching receive has begun.  (Unsafe programs become incorrect and usually deadlock within an MPI_Ssend.)

* Buffered mode (MPI_Bsend): the user supplies the buffer to the system for its use.  (User supplies enough memory to make an unsafe program safe.)

* Ready mode (MPI_Rsend): user guarantees that the matching receive has been posted.
  - allows access to fast protocols
  - undefined behavior if the matching receive is not posted

Non-blocking versions:

    MPI_Issend, MPI_Irsend, MPI_Ibsend

Note that an MPI_Recv may receive messages sent with any send mode.

Buffered Send

MPI provides a send routine that may be used when MPI_Isend is awkward to use (e.g., lots of small messages).

MPI_Bsend makes use of a user-provided buffer to save any messages that can not be immediately sent.

    int bufsize;
    char *buf = malloc(bufsize);
    MPI_Buffer_attach( buf, bufsize );
    ...
    MPI_Bsend( ... same as MPI_Send ... );
    ...
    MPI_Buffer_detach( &buf, &bufsize );

The MPI_Buffer_detach call does not complete until all messages are sent.

* The performance of MPI_Bsend depends on the implementation of MPI and may also depend on the size of the message.  For example, making a message one byte longer may cause a significant drop in performance.

Reusing the same buffer

Consider a loop

    MPI_Buffer_attach( buf, bufsize );
    while (!done) {
        ...
        MPI_Bsend( ... );
    }

where the buf is large enough to hold the message in the MPI_Bsend.  This code may fail because the buffer can fill with messages that have not yet been delivered; detaching the buffer (which waits until its messages are sent) and re-attaching it inside the loop makes the space available again:

    {
        void *buf; int bufsize;
        MPI_Buffer_detach( &buf, &bufsize );
        MPI_Buffer_attach( buf, bufsize );
    }

Other Point-to-Point Features

* MPI_SENDRECV, MPI_SENDRECV_REPLACE

* MPI_CANCEL

* Persistent communication requests
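Persistent requests are not covered further in this tutorial.  As a hedged sketch of the idea, a send repeated many times with the same arguments can be set up once and then started each iteration (buf, count, dest, tag, comm, and niter are illustrative names):

    MPI_Request req;
    MPI_Status  status;
    int         iter;

    /* bind the arguments once */
    MPI_Send_init( buf, count, MPI_DOUBLE, dest, tag, comm, &req );

    for (iter = 0; iter < niter; iter++) {
        /* ... fill buf ... */
        MPI_Start( &req );            /* begin this iteration's send */
        MPI_Wait( &req, &status );
    }
    MPI_Request_free( &req );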

Datatypes and Heterogeneity

MPI datatypes have two main purposes

* Heterogeneity: parallel programs between different processors

* Noncontiguous data: structures, vectors with non-unit stride, etc.

Basic datatypes, corresponding to the underlying language, are predefined.

The user can construct new datatypes at run time; these are called derived datatypes.

Datatypes in MPI

Elementary:  Language-defined types (e.g., MPI_INT or MPI_DOUBLE_PRECISION)

Vector:      Separated by constant "stride"

Contiguous:  Vector with stride of one

Hvector:     Vector, with stride in bytes

Indexed:     Array of indices (for scatter/gather)

Hindexed:    Indexed, with indices in bytes

Struct:      General mixed types (for C structs etc.)
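As a hedged illustration of the construction calls for two of these (the counts and the 10 x 10 row-major array are arbitrary): a contiguous block of 4 ints and one column of a double array:

    MPI_Datatype four_ints, column;

    /* 4 consecutive MPI_INTs treated as one unit */
    MPI_Type_contiguous( 4, MPI_INT, &four_ints );
    MPI_Type_commit( &four_ints );

    /* 10 doubles separated by a stride of 10 elements: one column */
    MPI_Type_vector( 10, 1, 10, MPI_DOUBLE, &column );
    MPI_Type_commit( &column );

    /* ... use in MPI_Send/MPI_Recv ... */
    MPI_Type_free( &four_ints );
    MPI_Type_free( &column );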

Basic Datatypes (Fortran)

    MPI datatype          Fortran datatype
    MPI_INTEGER           INTEGER
    MPI_REAL              REAL
    MPI_DOUBLE_PRECISION  DOUBLE PRECISION
    MPI_COMPLEX           COMPLEX
    MPI_LOGICAL           LOGICAL
    MPI_CHARACTER         CHARACTER(1)
    MPI_BYTE
    MPI_PACKED

Basic Datatypes (C)

    MPI datatype         C datatype
    MPI_CHAR             signed char
    MPI_SHORT            signed short int
    MPI_INT              signed int
    MPI_LONG             signed long int
    MPI_UNSIGNED_CHAR    unsigned char
    MPI_UNSIGNED_SHORT   unsigned short int
    MPI_UNSIGNED         unsigned int
    MPI_UNSIGNED_LONG    unsigned long int
    MPI_FLOAT            float
    MPI_DOUBLE           double
    MPI_LONG_DOUBLE      long double
    MPI_BYTE
    MPI_PACKED

Vectors

    29 30 31 32 33 34 35
    22 23 24 25 26 27 28
    15 16 17 18 19 20 21
     8  9 10 11 12 13 14
     1  2  3  4  5  6  7

To specify this row (in C order), we can use

    MPI_Type_vector( count, blocklen, stride, oldtype,
                     &newtype );
    MPI_Type_commit( &newtype );

The exact code for this is

    MPI_Type_vector( 5, 1, 7, MPI_DOUBLE, &newtype );
    MPI_Type_commit( &newtype );

Structures

Structures are described by arrays of

* number of elements (array_of_len)

* displacement or location (array_of_displs)

* datatype (array_of_types)

    MPI_Type_struct( count, array_of_len,
                     array_of_displs,
                     array_of_types, &newtype );

Example: Structures

struct {
    char   display[50];    /* Name of display */
    int    maxiter;        /* max # of iterations */
    double xmin, ymin;     /* lower left corner of rectangle */
    double xmax, ymax;     /* upper right corner */
    int    width;          /* of display in pixels */
    int    height;         /* of display in pixels */
} cmdline;

/* set up 4 blocks */
int          blockcounts[4] = {50,1,4,2};
MPI_Datatype types[4];
MPI_Aint     displs[4];
MPI_Datatype cmdtype;

/* initialize types and displs with addresses of items */
MPI_Address( &cmdline.display, &displs[0] );
MPI_Address( &cmdline.maxiter, &displs[1] );
MPI_Address( &cmdline.xmin,    &displs[2] );
MPI_Address( &cmdline.width,   &displs[3] );
types[0] = MPI_CHAR;
types[1] = MPI_INT;
types[2] = MPI_DOUBLE;
types[3] = MPI_INT;
for (i = 3; i >= 0; i--)
    displs[i] -= displs[0];
MPI_Type_struct( 4, blockcounts, displs, types, &cmdtype );
MPI_Type_commit( &cmdtype );

Strides

The extent of a datatype is normally the distance between the first and last member.

[Figure: the memory locations specified by a datatype, with LB and UB marking the ends of the EXTENT.]

You can set an artificial extent by using MPI_UB and MPI_LB in MPI_Type_struct.

Vectors revisited

This code creates a datatype for an arbitrary number of elements in a row of an array stored in Fortran order (column first).

    int blens[2], displs[2];
    MPI_Datatype types[2], rowtype;

    blens[0]  = 1;
    blens[1]  = 1;
    displs[0] = 0;
    displs[1] = number_in_column * sizeof(double);
    types[0]  = MPI_DOUBLE;
    types[1]  = MPI_UB;
    MPI_Type_struct( 2, blens, displs, types, &rowtype );
    MPI_Type_commit( &rowtype );

To send n elements, you can use

    MPI_Send( buf, n, rowtype, ... );

Structures revisited

When sending an array of a structure, it is important to ensure that MPI and the C compiler have the same value for the size of each structure.  The most portable way to do this is to add an MPI_UB to the structure definition for the end of the structure.  In the previous example, this is

/* initialize types and displs with addresses of items */
MPI_Address( &cmdline.display, &displs[0] );
MPI_Address( &cmdline.maxiter, &displs[1] );
MPI_Address( &cmdline.xmin,    &displs[2] );
MPI_Address( &cmdline.width,   &displs[3] );
MPI_Address( &cmdline+1,       &displs[4] );
types[0] = MPI_CHAR;
types[1] = MPI_INT;
types[2] = MPI_DOUBLE;
types[3] = MPI_INT;
types[4] = MPI_UB;
for (i = 4; i >= 0; i--)
    displs[i] -= displs[0];
MPI_Type_struct( 5, blockcounts, displs, types, &cmdtype );
MPI_Type_commit( &cmdtype );

Interleaving data

By moving the UB inside the data, you can interleave data.

Consider the matrix

     0  8 16 24 32
     1  9 17 25 33
     2 10 18 26 34
     3 11 19 27 35
     4 12 20 28 36
     5 13 21 29 37
     6 14 22 30 38
     7 15 23 31 39

We wish to send 0-3,8-11,16-19, and 24-27 to process 0, 4-7,12-15,20-23, and 28-31 to process 1, etc.  How can we do this with MPI_Scatterv?

An interleaved datatype

    MPI_Type_vector( 4, 4, 8, MPI_DOUBLE, &vec );

defines a block of this matrix.

    blens[0]  = 1;          blens[1]  = 1;
    types[0]  = vec;        types[1]  = MPI_UB;
    displs[0] = 0;          displs[1] = sizeof(double);
    MPI_Type_struct( 2, blens, displs, types, &block );

defines a block whose extent is just 1 entry.

Scattering a Matrix

We set the displacements for each block as the location of the first element in the block.  This works because MPI_Scatterv uses the extents to determine the start of each piece to send.

    scdispls[0] = 0;
    scdispls[1] = 4;
    scdispls[2] = 32;
    scdispls[3] = 36;
    MPI_Scatterv( sendbuf, sendcounts, scdispls, block,
                  recvbuf, nx * ny, MPI_DOUBLE, 0,
                  MPI_COMM_WORLD );

* How would you use the topology routines to make this more general?

Exercises - datatypes

Objective: Learn about datatypes

1. Write a program to send rows of a matrix stored in column-major form to the other processors.

   Let processor 0 have the entire matrix, which has as many rows as processors.

   Processor 0 sends row i to processor i.

   Processor i reads that row into a local array that holds only that row.  That is, processor 0 has a matrix A(N,M) while the other processors have a row B(M).

   (a) Write the program to handle the case where the matrix is square.

   (b) Write the program to handle a number of columns read from the terminal.

   C programmers may send columns of a matrix stored in row-major form if they prefer.

   If you have time, try one of the following.  If you don't have time, think about how you would program these.

2. Write a program to transpose a matrix, where each processor has a part of the matrix.  Use topologies to define a 2-Dimensional partitioning of the matrix across the processors, and assume that all processors have the same size submatrix.

   (a) Use MPI_Send and MPI_Recv to send the block, then transpose the block.

   (b) Use MPI_Sendrecv instead.

   (c) Create a datatype that allows you to receive the block already transposed.

3. Write a program to send the "ghost points" of a 2-Dimensional mesh to the neighboring processors.  Assume that each processor has the same size subblock.

   (a) Use topologies to find the neighbors

   (b) Define a datatype for the "rows"

   (c) Use MPI_Sendrecv or MPI_Irecv and MPI_Send with MPI_Waitall.

   (d) Use MPI_Isend and MPI_Irecv to start the communication, do some computation on the interior, and then use MPI_Waitany to process the boundaries as they arrive.

   The same approach works for general data structures, such as unstructured meshes.

4. Do 3, but for 3-Dimensional meshes.  You will need MPI_Type_hvector.

Tools for writing libraries

MPI is specifically designed to make it easier to write message-passing libraries

* Communicators solve the tag/source wild-card problem

* Attributes provide a way to attach information to a communicator

Private communicators

One of the first things that a library should normally do is create a private communicator.  This allows the library to send and receive messages that are known only to the library.

    MPI_Comm_dup( old_comm, &new_comm );

Attributes

Attributes are data that can be attached to one or more communicators.

Attributes are referenced by keyval.  Keyvals are created with MPI_KEYVAL_CREATE.

Attributes are attached to a communicator with MPI_Attr_put and their values accessed by MPI_Attr_get.

* Operations are defined for what happens to an attribute when it is copied (by creating one communicator from another) or deleted (by deleting a communicator); these are specified when the keyval is created.

What is an attribute?

In C, an attribute is a pointer of type void *.  You must allocate storage for the attribute to point to (make sure that you don't use the address of a local variable).

In Fortran, it is a single INTEGER.
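A hedged C sketch of the basic attribute calls (the keyval variable, the communicator comm, and the malloc'ed counter are illustrative; the full examples that follow show a realistic use):

    static int my_keyval = MPI_KEYVAL_INVALID;
    int *counter, *valp, flag;

    if (my_keyval == MPI_KEYVAL_INVALID)
        MPI_Keyval_create( MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN,
                           &my_keyval, NULL );

    /* attach heap storage, never the address of a local variable */
    counter  = (int *)malloc( sizeof(int) );
    *counter = 0;
    MPI_Attr_put( comm, my_keyval, (void *)counter );

    /* later: retrieve it; flag is 0 if no attribute was attached */
    MPI_Attr_get( comm, my_keyval, (void *)&valp, &flag );
    if (flag) (*valp)++;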

Examples of using attributes

* Forcing sequential operation

* Managing tags

Sequential Sections

#include "mpi.h"

static int MPE_Seq_keyval = MPI_KEYVAL_INVALID;

/*@
  MPE_Seq_begin - Begins a sequential section of code.

  Input Parameters:
. comm - Communicator to sequentialize.
. ng   - Number in group.  This many processes are allowed to execute
         at the same time.  Usually one.
@*/
void MPE_Seq_begin( comm, ng )
MPI_Comm comm;
int      ng;
{
    int        lidx, np;
    int        flag;
    MPI_Comm   local_comm;
    MPI_Status status;

    /* Get the private communicator for the sequential operations */
    if (MPE_Seq_keyval == MPI_KEYVAL_INVALID) {
        MPI_Keyval_create( MPI_NULL_COPY_FN,
                           MPI_NULL_DELETE_FN,
                           &MPE_Seq_keyval, NULL );
    }

Sequential Sections II

    MPI_Attr_get( comm, MPE_Seq_keyval, (void *)&local_comm, &flag );
    if (!flag) {
        /* This expects a communicator to be a pointer */
        MPI_Comm_dup( comm, &local_comm );
        MPI_Attr_put( comm, MPE_Seq_keyval, (void *)local_comm );
    }
    MPI_Comm_rank( comm, &lidx );
    MPI_Comm_size( comm, &np );
    if (lidx != 0) {
        MPI_Recv( NULL, 0, MPI_INT, lidx-1, 0, local_comm,
                  &status );
    }
    /* Send to the next process in the group unless we
       are the last process in the processor set */
    if ( (lidx % ng) < ng - 1 && lidx != np - 1 ) {
        MPI_Send( NULL, 0, MPI_INT, lidx + 1, 0, local_comm );
    }
}

Sequential Sections III

/*@
  MPE_Seq_end - Ends a sequential section of code.

  Input Parameters:
. comm - Communicator to sequentialize.
. ng   - Number in group.
@*/
void MPE_Seq_end( comm, ng )
MPI_Comm comm;
int      ng;
{
    int        lidx, np, flag;
    MPI_Status status;
    MPI_Comm   local_comm;

    MPI_Comm_rank( comm, &lidx );
    MPI_Comm_size( comm, &np );
    MPI_Attr_get( comm, MPE_Seq_keyval, (void *)&local_comm,
                  &flag );
    if (!flag)
        MPI_Abort( comm, MPI_ERR_UNKNOWN );
    /* Send to the first process in the next group OR to the
       first process in the processor set */
    if ( (lidx % ng) == ng - 1 || lidx == np - 1 ) {
        MPI_Send( NULL, 0, MPI_INT, (lidx + 1) % np, 0,
                  local_comm );
    }
    if (lidx == 0) {
        MPI_Recv( NULL, 0, MPI_INT, np-1, 0, local_comm,
                  &status );
    }
}

Comments on sequential sections

* Note use of MPI_KEYVAL_INVALID to determine whether to create a keyval

* Note use of flag on MPI_Attr_get to discover that a communicator has no attribute for the keyval

Example: Managing tags

Problem: A library contains many objects that need to communicate in ways that are not known until runtime.

Messages between objects are kept separate by using different message tags.  How are these tags chosen?

* Unsafe to use compile time values

* Must allocate tag values at runtime

Solution:

Use a private communicator and use an attribute to keep track of available tags in that communicator.

Caching tags on communicator

#include "mpi.h"

static int MPE_Tag_keyval = MPI_KEYVAL_INVALID;

/*
   Private routine to delete internal storage when a
   communicator is freed.
*/
int MPE_DelTag( comm, keyval, attr_val, extra_state )
MPI_Comm *comm;
int      *keyval;
void     *attr_val, *extra_state;
{
    free( attr_val );
    return MPI_SUCCESS;
}

Caching tags on communicator II

/*@
  MPE_GetTags - Returns tags that can be used in communication
                with a communicator

  Input Parameters:
. comm_in - Input communicator
. ntags   - Number of tags

  Output Parameters:
. comm_out  - Output communicator.  May be 'comm_in'.
. first_tag - First tag available
@*/
int MPE_GetTags( comm_in, ntags, comm_out, first_tag )
MPI_Comm comm_in, *comm_out;
int      ntags, *first_tag;
{
    int mpe_errno = MPI_SUCCESS;
    int tagval, *tagvalp, *maxval, flag;

    if (MPE_Tag_keyval == MPI_KEYVAL_INVALID) {
        MPI_Keyval_create( MPI_NULL_COPY_FN, MPE_DelTag,
                           &MPE_Tag_keyval, (void *)0 );
    }

Caching tags on communicator III

    if (mpe_errno = MPI_Attr_get( comm_in, MPE_Tag_keyval,
                                  &tagvalp, &flag ))
        return mpe_errno;

    if (!flag) {
        /* This communicator is not yet known to this system,
           so we dup it and setup the first value */
        MPI_Comm_dup( comm_in, comm_out );
        comm_in = *comm_out;
        MPI_Attr_get( MPI_COMM_WORLD, MPI_TAG_UB, &maxval,
                      &flag );
        tagvalp = (int *)malloc( 2 * sizeof(int) );
        printf( "Mallocing address %x\n", tagvalp );
        if (!tagvalp) return MPI_ERR_EXHAUSTED;
        *tagvalp = *maxval;
        MPI_Attr_put( comm_in, MPE_Tag_keyval, tagvalp );
        return MPI_SUCCESS;
    }

Caching tags on communicator IV

    *comm_out = comm_in;
    if (*tagvalp < ntags) {
        /* Error, out of tags.  Another solution would be to do
           an MPI_Comm_dup. */
        return MPI_ERR_INTERN;
    }
    *first_tag = *tagvalp - ntags;
    *tagvalp   = *first_tag;

    return MPI_SUCCESS;
}

Caching tags on communicator V

/*@
  MPE_ReturnTags - Returns tags allocated with MPE_GetTags.

  Input Parameters:
. comm      - Communicator to return tags to
. first_tag - First of the tags to return
. ntags     - Number of tags to return.
@*/
int MPE_ReturnTags( comm, first_tag, ntags )
MPI_Comm comm;
int      first_tag, ntags;
{
    int *tagvalp, flag, mpe_errno;

    if (mpe_errno = MPI_Attr_get( comm, MPE_Tag_keyval,
                                  &tagvalp, &flag ))
        return mpe_errno;

    if (!flag) {
        /* Error, attribute does not exist in this communicator */
        return MPI_ERR_OTHER;
    }
    if (*tagvalp == first_tag)
        *tagvalp = first_tag + ntags;

    return MPI_SUCCESS;
}

Caching tags on communicator VI

/*@
  MPE_TagsEnd - Returns the private keyval.
@*/
int MPE_TagsEnd()
{
    MPI_Keyval_free( &MPE_Tag_keyval );
    MPE_Tag_keyval = MPI_KEYVAL_INVALID;
}

Commentary

* Use MPI_KEYVAL_INVALID to detect when the keyval must be created

* Use the flag returned from MPI_ATTR_GET to detect when a communicator needs to be initialized

Exercise - Writing libraries

Objective: Use private communicators and attributes

Write a routine to circulate data to the next process, using a nonblocking send and receive operation.

    void Init_pipe( comm )
    void ISend_pipe( comm, bufin, len, datatype, bufout )
    void Wait_pipe( comm )

A typical use is

    Init_pipe( MPI_COMM_WORLD )
    for (i=0; i<n; i++) {
        ISend_pipe( comm, bufin, len, datatype, bufout );
        Do_Work( bufin, len );
        Wait_pipe( comm );
        t = bufin; bufin = bufout; bufout = t;
    }

What happens if Do_Work calls MPI routines?

* What do you need to do to clean up Init_pipe?

* How can you use a user-defined topology to determine the next process?  (Hint: see MPI_Topo_test and MPI_Cartdim_get.)

MPI Objects

* MPI has a variety of objects (communicators, groups, datatypes, etc.) that can be created and destroyed.  This section discusses the types of these data and how MPI manages them.

* This entire chapter may be skipped by beginners.

The MPI Objects

MPI_Request     Handle for nonblocking communication, normally freed by MPI in a test or wait

MPI_Datatype    MPI datatype.  Free with MPI_Type_free.

MPI_Op          User-defined operation.  Free with MPI_Op_free.

MPI_Comm        Communicator.  Free with MPI_Comm_free.

MPI_Group       Group of processes.  Free with MPI_Group_free.

MPI_Errhandler  MPI error handler.  Free with MPI_Errhandler_free.

When should objects be freed?

Consider this code

    MPI_Type_vector( ly, 1, nx, MPI_DOUBLE, &newx1 );
    MPI_Type_hvector( lz, 1, nx*ny*sizeof(double), newx1,
                      &newx );
    MPI_Type_commit( &newx );

(This creates a datatype for one face of a 3-D decomposition.)  When should newx1 be freed?

Reference counting

MPI keeps track of the use of an MPI object, and only truly destroys it when no one is using it.  newx1 is being used by the user (the MPI_Type_vector call that created it) and by the MPI_Datatype newx that uses it.

If newx1 is not needed after newx is defined, it should be freed:

    MPI_Type_vector( ly, 1, nx, MPI_DOUBLE, &newx1 );
    MPI_Type_hvector( lz, 1, nx*ny*sizeof(double), newx1,
                      &newx );
    MPI_Type_free( &newx1 );
    MPI_Type_commit( &newx );

Why reference counts

Why not just free the object?

Consider this library routine:

    void MakeDatatype( nx, ny, ly, lz, MPI_Datatype *new )
    {
        MPI_Datatype newx1;
        MPI_Type_vector( ly, 1, nx, MPI_DOUBLE, &newx1 );
        MPI_Type_hvector( lz, 1, nx*ny*sizeof(double), newx1,
                          new );
        MPI_Type_free( &newx1 );
        MPI_Type_commit( new );
    }

Without the MPI_Type_free( &newx1 ), it would be very awkward to later free newx1 when new was freed.

Tools for evaluating programs

MPI provides some tools for evaluating the performance of parallel programs.

These are

* Timer

* Profiling interface

The MPI Timer

The elapsed (wall-clock) time between two points in an MPI program can be computed using MPI_Wtime:

    double t1, t2;
    t1 = MPI_Wtime();
    ...
    t2 = MPI_Wtime();
    printf( "Elapsed time is %f\n", t2 - t1 );

The value returned by a single call to MPI_Wtime has little value.

* The times are local; the attribute MPI_WTIME_IS_GLOBAL may be used to determine if the times are also synchronized with each other for all processes in MPI_COMM_WORLD.

Profiling

* All routines have two entry points: MPI_... and PMPI_....

* This makes it easy to provide a single level of low-overhead routines to intercept MPI calls without any source code modifications.

* Used to provide "automatic" generation of trace files.

[Figure: the user program calls MPI_Send and MPI_Bcast; the profile library's MPI_Send wrapper calls PMPI_Send in the MPI library, while MPI_Bcast goes directly to the MPI library.]

    static int nsend = 0;
    int MPI_Send( start, count, datatype, dest, tag, comm )
    {
        nsend++;
        return PMPI_Send( start, count, datatype, dest, tag, comm );
    }

Writing profiling routines

The MPICH implementation contains a program for writing wrappers.

This description will write out each MPI routine that is called:

#ifdef MPI_BUILD_PROFILING
#undef MPI_BUILD_PROFILING
#endif
#include <stdio.h>
#include "mpi.h"

{{fnall fn_name}}
  {{vardecl int llrank}}
  PMPI_Comm_rank( MPI_COMM_WORLD, &llrank );
  printf( "[%d] Starting {{fn_name}}...\n", llrank );
  fflush( stdout );
  {{callfn}}
  printf( "[%d] Ending {{fn_name}}\n", llrank );
  fflush( stdout );
{{endfnall}}

The command

    wrappergen -w trace.w -o trace.c

converts this to a C program.  Then compile the file `trace.c' and insert the resulting object file into your link line:

    cc -o a.out a.o ... trace.o -lpmpi -lmpi

Another profiling example

This version counts all calls and the number of bytes sent with MPI_Send, MPI_Bsend, or MPI_Isend.

#include "mpi.h"

{{foreachfn fn_name MPI_Send MPI_Bsend MPI_Isend}}
static long {{fn_name}}_nbytes_{{fileno}};
{{endforeachfn}}

{{forallfn fn_name MPI_Init MPI_Finalize MPI_Wtime}}
int {{fn_name}}_ncalls_{{fileno}};
{{endforallfn}}

{{fnall this_fn_name MPI_Finalize}}
  printf( "{{this_fn_name}} is being called.\n" );
  {{callfn}}
  {{this_fn_name}}_ncalls_{{fileno}}++;
{{endfnall}}

{{fn fn_name MPI_Send MPI_Bsend MPI_Isend}}
  {{vardecl int typesize}}
  {{callfn}}
  MPI_Type_size( {{datatype}}, (MPI_Aint *)&{{typesize}} );
  {{fn_name}}_nbytes_{{fileno}} += {{typesize}}*{{count}};
  {{fn_name}}_ncalls_{{fileno}}++;
{{endfn}}

Another profiling example (con't)

{{fn fn_name MPI_Finalize}}
  {{forallfn dis_fn}}
  if ({{dis_fn}}_ncalls_{{fileno}}) {
      printf( "{{dis_fn}}: %d calls\n",
              {{dis_fn}}_ncalls_{{fileno}} );
  }
  {{endforallfn}}
  if (MPI_Send_ncalls_{{fileno}}) {
      printf( "%d bytes sent in %d calls with MPI_Send\n",
              MPI_Send_nbytes_{{fileno}},
              MPI_Send_ncalls_{{fileno}} );
  }
  {{callfn}}
{{endfn}}

Generating and viewing log files

Log files that contain a history of a parallel computation can be very valuable in understanding a parallel program.  The upshot and nupshot programs, provided in the MPICH and MPI-F implementations, may be used to view log files.

Generating a log file

This is very easy with the MPICH implementation of MPI.  Simply replace -lmpi with -llmpi -lpmpi -lm in the link line for your program, and relink your program.  You do not need to recompile.

On some systems, you can get a real-time animation by using the libraries -lampi -lmpe -lm -lX11 -lpmpi.

Alternately, you can use the -mpilog or -mpianim options to the mpicc or mpif77 commands.

Connecting several programs together

MPI provides support for connecting separate message-passing programs together through the use of intercommunicators.

Sending messages between different programs

Programs share MPI_COMM_WORLD.

Programs have separate and disjoint communicators.

[Figure: App1 and App2 each have their own communicator (Comm1, Comm2) inside a shared MPI_COMM_WORLD; an intercommunicator connects the two.]

Exchanging data between programs

* Form intercommunicator (MPI_INTERCOMM_CREATE)

* Send data

    MPI_Send( ..., 0, intercomm );

    MPI_Recv( buf, ..., 0, intercomm );
    MPI_Bcast( buf, ..., localcomm );

  More complex point-to-point operations can also be used

Collective operations

Use MPI_INTERCOMM_MERGE to create an intracommunicator (on which collective operations can then be used).

Final Comments

Additional features of MPI not covered in this tutorial

* Persistent Communication

* Error handling

Sharable MPI Resources

* The Standard itself:
  - As a Technical report: U. of Tennessee report
  - As postscript for ftp: at info.mcs.anl.gov in pub/mpi/mpi-report.ps.
  - As hypertext on the World Wide Web: http://www.mcs.anl.gov/mpi
  - As a journal article: in the Fall issue of the Journal of Supercomputing Applications

* MPI Forum discussions
  - The MPI Forum email discussions and both current and earlier versions of the Standard are available from netlib.

* Books:
  - Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, and Skjellum, MIT Press, 1994
  - MPI Annotated Reference Manual, by Otto, et al., in preparation.

Sharable MPI Resources, continued

* Newsgroup:
  - comp.parallel.mpi

* Mailing lists:
  - [email protected]: the MPI Forum discussion list.
  - [email protected]: the implementors' discussion list.

* Implementations available by ftp:
  - MPICH is available by anonymous ftp from info.mcs.anl.gov in the directory pub/mpi/mpich, file mpich.tar.Z.
  - LAM is available by anonymous ftp from tbag.osc.edu in the directory pub/lam.
  - The CHIMP version of MPI is available by anonymous ftp from ftp.epcc.ed.ac.uk in the directory pub/chimp/release.

* Test code repository:
  - ftp://info.mcs.anl.gov/pub/mpi/mpi-test

MPI-2

* The MPI Forum (with old and new participants) has begun a follow-on series of meetings.

* Goals
  - clarify existing draft
  - provide features users have requested
  - make extensions, not changes

* Major topics being considered
  - dynamic process management
  - client/server
  - real-time extensions
  - "one-sided" communication (put/get, active messages)
  - portable access to MPI system state (for debuggers)
  - language bindings for C++ and Fortran-90

* Schedule
  - Dynamic processes, client/server by SC '95
  - MPI-2 complete by SC '96

Summary

* The parallel computing community has cooperated to develop a full-featured standard message-passing library interface.

* Implementations abound

* Applications beginning to be developed or ported

* MPI-2 process beginning

* Lots of MPI material available